Is Azure Data Factory ready to take over SQL Server Integration Services?

For companies that have (or plan to have) Business Intelligence solutions based on Microsoft’s BI suite, the rule of thumb when choosing a tool for data management & integration used to be SQL Server Integration Services (SSIS).

 

SSIS is Microsoft’s main data integration and workflow tool. It is an industry standard for ETL (Extract, Transform and Load) in on-premise solutions – especially those that include SQL Server along the way – and it ships with a vast set of components for extraction, transformation and load.

However, on the Microsoft Cloud – that is, Azure – we also find Azure Data Factory (ADF). This is a service for data integration and for the orchestration of data management & engineering jobs within Azure, in which you can build ETL pipelines in a visual environment. While SSIS runs on SQL Server and uses it as its processing engine, ADF usually relies on other Azure services, such as HDInsight, Azure Databricks or Azure Data Lake Analytics (ADLA), for the transformation of data. The tight integration with these services, as well as with the Azure Data Lake Store (ADLS), makes it a good choice for Big Data Cloud architectures.

 

From an ETL perspective, it is often mentioned that ADF is not as flexible and rich as SSIS. ADF is also less mature than SSIS, which is normal considering ADF was released roughly 10 years later. However, recent updates in ADF suggest a convergence with SSIS functionality. It seems that Microsoft, like many other vendors, is preparing for the forthcoming predominance of hybrid scenarios, as mentioned in this year’s Gartner Data Integration Tools Report: fast-growing companies want to benefit from the flexibility that cloud services offer, while still depending on core, well-functioning on-premise solutions.

 

On the SSIS side, Microsoft has taken some steps towards integration with Azure. It has been possible for a while now to create connections with Azure SQL Database (the cloud equivalent of SQL Server). You can also configure connections to Azure Data Lake or Azure Storage resources, and ADLA and HDInsight Tasks are among other Azure-specific components added to the SSIS Toolbox. However, the growth of ADF, like that of any cloud service, is relentless compared to that of a tool that gets updates every two to four years, often with a great deal of compatibility pain.

 

In this article, we will analyse 5 ways in which ADF is getting closer to SSIS. For this analysis, we will consider the following key aspects:

 

  • Hybridization capacity: options to connect to different sources for extraction and load, as well as options to combine solutions. Here we will consider factors such as the availability of connections and how easily developers can transition between environments.

 

  • Transformation options: the tools available for ETL (or ELT). Often cited as one of ADF’s principal weaknesses compared to SSIS, and therefore an important aspect to compare.

 

1. Connectors

 

Let’s first look at the basics of data integration: the copy activity. You will typically want to extract the information from a source system and place it somewhere within a target system such as the staging area of an Enterprise Data Warehouse (EDW) or the landing area of a Data Lake:

 

  • Prior to creating a Data Flow Task to do the copy operation, in SSIS you must define Connection Managers. The connections available include text files, popular spreadsheet formats and common relational databases. However, the options are limited: for instance, you can extract information from an FTP server but not from an SFTP server, and SSIS doesn’t do that well with some popular data sources outside the Microsoft stack, at least out of the box. You can rely on third-party proprietary software (e.g. Cozyroc, SentryOne, etc.) for extra components & extensions that, among other things, widen the range of sources you can connect to (hint: very few of these extensions are free – here you can find a list of components). Once the connection managers are created for both source and target systems, you define the Source and Destination components of your Data Flow Task that will perform the copy operation. In addition to copy operations, the Data Flow offers many transformation possibilities, but we’ll get to that in the relevant section.

 

  • To perform a copy operation in ADF, you must first define a source as well as a sink. Sink options are more limited, and you normally use ADF to store your data in some storage service within Azure for later processing. Note that in ADF the copy activity does only that: move data. On the source side, you have a wide selection that obviously includes all of Azure’s services, plus relational databases, NoSQL databases, files (including SFTP sources), and generic protocols and services. Here you can find a comprehensive list of sources, which already covers over 80 different types, and new ones are added often – for example, in February 2019 a connector for Google Cloud Storage was added (among others). Furthermore, Azure VPN Gateway allows you to connect Azure with your on-premise systems. Something we love about ADF is that it ships out of the box with many handy connectors that we do not find in the native data integration services of other public Clouds. A minimal sketch of how a copy activity can be defined programmatically follows this list.
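As a quick illustration, here is a minimal sketch of a copy activity defined with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory and dataset names are hypothetical, the datasets (and the linked services behind them) are assumed to already exist in the factory, and exact model signatures may vary slightly between SDK versions.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink
)

# Authenticate and point the client at a (hypothetical) subscription
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The copy activity references datasets already defined in the factory;
# here both are Azure Blob Storage datasets (other source/sink types have
# their own classes in the SDK)
copy_activity = CopyActivity(
    name="CopySourceToLanding",
    inputs=[DatasetReference(reference_name="SourceDataset")],
    outputs=[DatasetReference(reference_name="LandingDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(
    "my-resource-group", "my-data-factory", "CopyLandingPipeline", pipeline
)

In the ADF portal the same definition is built visually; the SDK (or the equivalent JSON) is simply another way of expressing the same pipeline.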

 

2. SSIS Integration Runtimes

 

Thanks to the addition of the Azure-SSIS Integration Runtime in Data Factory, it is now possible to lift and shift your existing SSIS packages onto Azure using Azure’s resources (Azure Data Factory and Azure SQL Database). We tested this feature and collected our impressions in a previous blog article; in short, we found that Microsoft has really managed to ease the migration pains between its own on-premise and cloud solutions. Note that the execution of the SSIS package becomes just another activity within a pipeline, which you can connect with other SSIS packages or Azure activities – Databricks, Web activities, Machine Learning Studio, etc. – for further transformation, analysis and reporting.
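For illustration, this is roughly how such an activity might be declared with the Python SDK; the package path and integration runtime name are hypothetical, and the model names and signatures may vary between SDK versions.

from azure.mgmt.datafactory.models import (
    ExecuteSSISPackageActivity, SSISPackageLocation,
    IntegrationRuntimeReference, PipelineResource
)

# Run an existing package deployed to the SSIS catalogue (SSISDB) on the
# Azure-SSIS Integration Runtime; it becomes just another pipeline activity
ssis_activity = ExecuteSSISPackageActivity(
    name="RunLegacyPackage",
    package_location=SSISPackageLocation(
        package_path="SSISDB/MyFolder/MyProject/LoadStaging.dtsx"
    ),
    connect_via=IntegrationRuntimeReference(reference_name="AzureSsisIR"),
)

pipeline = PipelineResource(activities=[ssis_activity])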

 

3. Transformation options

 

In SSIS, using its drag-and-drop interface and your T-SQL skills (SQL Server’s SQL dialect), you can go as far as the SQL Server system resources let you. In addition, you can leverage the resources of the machine running SSIS if your databases are more limited, which can come in handy on occasion. The principal tools for transformation are:

 

  • Data Flow: merge, sort, derived columns, conditional split and union in a drag-and-drop interface. It is optimized to work in memory, using the resources of the server running SSIS.
  • Execute T-SQL: this activity allows you to execute your SQL queries directly on your SQL Server database (using its resources).
  • Execute Script (only available in C# and VB).

 

For Data Factory, on the other hand, it used to be the case that the only transformation options available were:

 

  1. On RDBMS: executing Stored Procedures (when the sink system is SQL Server or similar) – a minimal sketch of this option follows the list;
  2. On HDFS (Azure Data Lake Store):
    1. Executing U-SQL scripts using the ADLA engine: U-SQL is a fusion of SQL and C#, conceptually similar to Apache Pig but poorly documented and complicated to debug.
    2. Executing Apache Hive scripts (HiveQL is a SQL dialect quite similar to MySQL’s) or Pig scripts on an HDInsight cluster: however, this option adds the complication of having to design, start and manage the cluster, which takes time and money.
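A minimal sketch of the first option, using the Python SDK; the linked service and procedure names are hypothetical, and signatures may differ between SDK versions.

from azure.mgmt.datafactory.models import (
    SqlServerStoredProcedureActivity, LinkedServiceReference, PipelineResource
)

# Push the transformation down to the sink database by calling a stored
# procedure through a linked service pointing at Azure SQL Database
sp_activity = SqlServerStoredProcedureActivity(
    name="LoadDimCustomer",
    stored_procedure_name="etl.usp_load_dim_customer",
    linked_service_name=LinkedServiceReference(reference_name="AzureSqlDatabaseLS"),
)

pipeline = PipelineResource(activities=[sp_activity])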

 

Recently, with the addition of Databricks to Azure (Azure Databricks), the paradigm has changed. Azure Databricks allows you to use optimized Spark clusters in a very agile and cost-effective manner to run data transformation operations from ADF.

 

  • Databricks allows developers to write code interactively in notebooks, enabling them to create data transformation scripts using PySpark (Python), R or HiveQL (a minimal example follows this list). This gives developers much more freedom than the other available services.
  • Table definition metadata (which relies on a Hive metastore), scripts and clusters are three logically separate things: different Databricks clusters share the table definition metadata, and a script created on one cluster can be run on another. This logical separation makes working with Databricks easy and flexible.
  • Setting up Databricks clusters is an intuitive two-step process. It is often useful to develop on a simpler (and cheaper) cluster and then switch to a more powerful one, which is done by simply changing the cluster attached to the notebook. These clusters are efficient (the minds behind Databricks include the original creators of Spark), easy to program, and take less time to start and kill than HDInsight clusters. On top of that, you can also create ephemeral clusters that only live during the execution of a script.
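To give an idea of what such a notebook looks like, here is a minimal PySpark sketch of the kind of transformation you might call from an ADF pipeline. The mount paths and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks the SparkSession already exists as `spark`; getOrCreate()
# simply reuses it, so the script also runs outside a notebook
spark = SparkSession.builder.getOrCreate()

# Read raw files from the landing area of the Data Lake
orders = spark.read.parquet("/mnt/datalake/landing/orders")

# Derive a date column, keep completed orders and aggregate per customer and day
daily_sales = (
    orders
    .withColumn("order_date", F.to_date("order_timestamp"))
    .filter(F.col("status") == "COMPLETED")
    .groupBy("customer_id", "order_date")
    .agg(F.sum("amount").alias("daily_amount"))
)

# Write the result to the curated area for consumption by the EDW or reports
daily_sales.write.mode("overwrite").parquet("/mnt/datalake/curated/daily_sales")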

 

This blog article does not aim to be an introduction to Databricks – we have already published an article comparing Databricks with ADLA and HDInsight. Moreover, stay tuned, because in the coming weeks we will be publishing a set of articles in which we will go deeper into Databricks.

 

4. Azure Data Factory Mapping Data Flows

 

In May 2019 ADF released (still in preview) the Azure Data Factory Mapping Data Flows (MDF) functionality. MDF is the feature that most clearly shows ADF’s strategy of absorbing SSIS functionality in the cloud. Basically, it does the same things its SSIS namesake does: merge, sort, derived columns, conditional split and union in a drag-and-drop user interface. It simplifies the T in ETL for those who come from traditional BI environments and might not be familiar with Cloud & Big Data. Not only that, it adds the value of running on elastic engines, which you can modify even after they have been created – and these elastic engines are essentially Databricks clusters. We recommend this great article written by Matt How on getting started with MDF.
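For readers who prefer code, the following PySpark sketch is a rough equivalent of two typical MDF transformations (a derived column followed by a conditional split). This is purely illustrative – it is not the Spark code MDF actually generates – and the file path and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
customers = spark.read.csv("/mnt/datalake/landing/customers.csv", header=True)

# "Derived column": build a full_name column from two existing ones
customers = customers.withColumn(
    "full_name", F.concat_ws(" ", "first_name", "last_name")
)

# "Conditional split": route rows into two separate outputs
active_customers = customers.filter(F.col("status") == "active")
inactive_customers = customers.filter(F.col("status") != "active")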

 

5. Multiple flow controller types

 

In SSIS it has long been possible to control the flow of different activities: the execution of an activity can depend on the execution outcome (Success, Failure or Completion) of the previous activity in the flow. In addition, you can configure Expressions that affect the flow depending on whether they evaluate to True or False.

 

Not long ago, ADF announced the addition of three options for configuring flow connections between activities: On Success, On Error and On Skip.
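As an illustration, here is a hedged sketch of conditional flow expressed with the Python SDK: a Databricks notebook that runs only if the preceding copy activity fails. Activity, dataset and linked service names are hypothetical, and exact signatures may vary between SDK versions.

from azure.mgmt.datafactory.models import (
    CopyActivity, DatabricksNotebookActivity, ActivityDependency,
    DatasetReference, LinkedServiceReference, BlobSource, BlobSink, PipelineResource
)

copy = CopyActivity(
    name="CopyToStaging",
    inputs=[DatasetReference(reference_name="SourceDataset")],
    outputs=[DatasetReference(reference_name="StagingDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# This notebook activity only runs if the copy fails ("On Error" in the UI);
# "Succeeded", "Skipped" and "Completed" are the other dependency conditions
cleanup = DatabricksNotebookActivity(
    name="NotifyAndCleanUp",
    notebook_path="/Shared/cleanup_on_failure",
    linked_service_name=LinkedServiceReference(reference_name="AzureDatabricksLS"),
    depends_on=[ActivityDependency(activity="CopyToStaging",
                                   dependency_conditions=["Failed"])],
)

pipeline = PipelineResource(activities=[copy, cleanup])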

 

Conclusion

 

While SSIS is still the go-to reference within the Microsoft BI suite for on-premise data integration scenarios, ADF has become an attractive, feature-rich and flexible tool for data integration and transformation operations in hybrid on-premise and cloud scenarios. With all the recent ADF additions discussed above (SSIS Integration Runtimes on Azure, Azure Data Factory Mapping Data Flows, and multiple flow controller types in ADF pipelines), Microsoft has clearly tried to smooth the transition for developers used to the traditional BI solution.

 

Furthermore, a wider offering of connectors (new types of sources are added often) and transformation options (tight integration with Databricks as well as other Azure services) makes ADF increasingly attractive. Of course, this does not suggest a complete replacement of SSIS by ADF in all cases – what sense would it make to use a Cloud data integration tool to manage your data if you rely only on an on-premise system? But Azure’s bet is compelling enough to make us wonder whether in the future we will see ADF become the main go-to data integration solution within the Microsoft BI suite. In the meantime, SSIS is still a solid solution for on-premise systems: a powerful ETL tool, but with some compatibility issues outside of Microsoft’s BI suite.

 

If you are now wondering which set of technologies is more suitable for you, please do not hesitate to contact us. We will be happy to help you.

Ana M, Oscar M
ana.mrrodan@clearpeaks.com