
SQL Server 2019 Big Data Clusters

August 29, 2019 (updated September 2nd, 2019)


by André Batista, Infrastructure Delivery Manager

SQL Server 2019 Big Data Cluster (Preview Feature)
Part 1

Over the years, Microsoft’s data platform has kept pace with industry developments by adding native support in the database engine for technologies such as XML, JSON, in-memory tables, and graph data. Customers and businesses can count on high performance, high availability, and security across all the technologies and features the data platform provides. However, the relational engine at the heart of the platform was never designed for petabyte- or exabyte-scale analytics.

Nor was it designed to scale out computation for data processing or machine learning, or to analyze data in unstructured formats such as media files.

Microsoft’s answer, which brings all of these technologies and architectures together within SQL Server, is called Big Data Clusters (BDC).

This feature is currently in preview, like the current release of SQL Server 2019 itself, Community Technology Preview 3.2 (CTP 3.2).

There are some interesting capabilities in the currently available version of BDC, although they may still change before the final release of SQL Server 2019 (expected in the second half of 2019):

– Runs on Kubernetes.
– Integrates “sharding” into the SQL Server engine.
– Integrates with HDFS.
– Uses Apache Spark.
– Exposes the Spark and HDFS services through the Apache Knox gateway.

On top of all this, using PolyBase, we can connect to multiple data sources such as Oracle, Teradata, MongoDB, and many others.

In short, with BDC Microsoft aims to offer all its users and customers a platform that is scalable, highly performant, secure, and robust, integrating SQL, data warehouse, data lake, and data science technologies. BDC is available both in the cloud and on-premises.

Here is a map of the ecosystem on which Microsoft relied to design and implement the BDC.
[Figure: SQL Server Big Data Cluster ecosystem]

Talking a little more about each of the implemented technologies and architectures:

– SQL Server instance:

When we deploy a clustered SQL Server 2019 instance in Kubernetes, the architecture includes a “master” instance and multiple SQL Server engines that perform computation and “sharding” operations. The “master” instance behaves just like a regular SQL Server 2019 instance (where we can use all the tools we are used to). When we need to ingest streaming data into the cluster, we can write directly to the instances on the “non-master” nodes (the “shards”). This boosts performance, since there is no single point of contention in the architecture, and it scales horizontally with the number of nodes we choose to have in our Kubernetes cluster.
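As a sketch of what sharded ingestion looks like from T-SQL, an external table can be created over the data pool so that rows are distributed across the “non-master” instances (table and column names here are hypothetical; `SqlDataPool` is the data source that a BDC deployment provides for the data pool):

```sql
-- Hedged sketch: table and column names are hypothetical examples.
-- SqlDataPool is the built-in data source for the BDC data pool.
CREATE EXTERNAL TABLE [web_clickstream_data_pool]
(
    user_id  BIGINT,
    click_ts DATETIME2,
    url      NVARCHAR(400)
)
WITH
(
    DATA_SOURCE  = SqlDataPool,   -- route rows to the data pool instances
    DISTRIBUTION = ROUND_ROBIN    -- spread rows evenly across the shards
);
```

Queries against this table from the “master” instance are then served by all the data pool shards in parallel.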

– Kubernetes:

SQL Server 2017 was already built on top of an abstraction layer called the Platform Abstraction Layer (PAL), which allows SQL Server to run on multiple platforms, such as Linux and containers.
If we want to run BDC locally, we can deploy it on top of a Kubernetes cluster such as Minikube; in the cloud, we can use Azure Kubernetes Service (AKS).

– HDFS:

During the installation and configuration of SQL Server 2019 BDC, a Hadoop Distributed File System (HDFS) is also installed in Kubernetes. Using PolyBase “scale-out” groups, we can easily and efficiently access this distributed data from within SQL Server through external tables.
After installing SQL Server 2019 with BDC, all service configuration is performed automatically. With this piece, our architecture gains a central data point both for relational data (since we can use the “master” instance as a normal SQL Server instance) and for large volumes of unstructured data.
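To make the idea concrete, here is a hedged sketch of exposing a directory of CSV files in the cluster’s HDFS as an external table (the HDFS path, file layout, and object names are hypothetical; `sqlhdfs://controller-svc/default` follows the location scheme BDC uses for its storage pool):

```sql
-- Hedged sketch: path, format, and names are illustrative assumptions.
CREATE EXTERNAL DATA SOURCE SqlStoragePool
WITH (LOCATION = 'sqlhdfs://controller-svc/default');

CREATE EXTERNAL FILE FORMAT csv_file
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2));

CREATE EXTERNAL TABLE [clickstream_hdfs]
(
    user_id BIGINT,
    url     NVARCHAR(400)
)
WITH (DATA_SOURCE = SqlStoragePool,
      LOCATION    = '/clickstream_data',  -- directory inside HDFS
      FILE_FORMAT = csv_file);
```

Once defined, the files in HDFS can be joined against ordinary relational tables with plain T-SQL.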

– PolyBase:

PolyBase is a feature introduced in the SQL Server 2016 database engine that allowed us to connect to HDFS data sources. With SQL Server 2019, we can also connect to relational data sources (Oracle, SAP HANA, PostgreSQL, …) and NoSQL data sources (MongoDB, Redis, Apache Cassandra, Azure Cosmos DB, …). Through external tables, PolyBase lets us query all of this data where it lives, making processing faster and us more productive.
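As an illustration, connecting to one of these relational sources follows the usual external data source pattern (the host, credential, and table names below are hypothetical placeholders, not a real environment):

```sql
-- Hedged sketch: server address, credential, and object names are
-- hypothetical; the secret is a placeholder.
CREATE DATABASE SCOPED CREDENTIAL OracleCredential
WITH IDENTITY = 'oracle_user', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE OracleServer
WITH (LOCATION   = 'oracle://oracle-host:1521',
      CREDENTIAL = OracleCredential);

-- The remote Oracle table can now be queried as if it were local;
-- no data is copied into SQL Server.
CREATE EXTERNAL TABLE [dbo].[oracle_inventory]
(
    item_id INT,
    qty     INT
)
WITH (DATA_SOURCE = OracleServer,
      LOCATION    = 'ORCL.SALES.INVENTORY');
```

The same pattern, with a different `LOCATION` prefix, applies to the other supported sources.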

– Apache Spark:

For those used to working on Spark projects and applications, all the familiar features (Spark SQL, DataFrames, MLlib, …) are available inside the cluster. For organizations with Data Science and Data Engineering teams, this makes SQL Server 2019 BDC a potential central point for “Big Data” work.

In the next article, we will install and configure a SQL Server 2019 Big Data Cluster.
