How to control your ETL performance


ETL processes and the importance of control

The term ETL, meaning Extract-Transform-Load, comes from the Data Warehousing field and was traditionally used to refer to those processes that transform and move data from the source system to the data warehouse.

Nowadays, the data movements within a company go far beyond a simple source-to-data-warehouse flow. Data supports many different business processes and has to be shared across business units and departments. Indeed, in highly data-oriented companies, data and ETL processes are a vital part of key business activities such as customer analysis, fraud detection, mailing campaigns and other data-intensive operations.

This growing relevance of data has made ETL performance essential, and an important concern not only for the BI team (usually in charge of the ETL system) but for many other teams as well. Thus, having a system that properly controls and displays the performance of all the ETL processes can have a very positive impact in terms of data reliability, early error detection and correction, and continuous process improvement.

A standard ETL Control System

At ClearPeaks, we identified this as a general requirement in all ETL systems and developed a standard, easy-to-deploy system to control and show your ETL performance. We are not talking about a simple system that logs the result of the executions in files or tables, but a more complete one that gives you good visual feedback on your ETL performance. Questions such as “How did the executions go tonight?”, “Why can’t I see any data?” or “Is the execution time of this ETL increasing?” become easy to answer just by taking a look at your Tableau dashboards.


Figure 1. Tableau dashboards of the ETL performance control system

Specifically, we developed two dashboards that address the two principal concerns when controlling a process: the final result and the time spent. One dashboard reports on execution results, common errors and success percentages (top left of Figure 1), and the other shows execution times and time-related variables (bottom right of Figure 1).

These dashboards are designed for scalability and easy implementation, so they are useful for any ETL. Likewise, the technical actions required to implement the system are simple and compatible with any ETL tool. To support this high level of standardization, the basics of the Business Intelligence methodology have been adapted to the ETL Control System architecture: the system is based on a fixed star database model with a set of completely standard tables, plus four simple functions or boxes (the exact term varies depending on the ETL tool) that must be included in the ETL. The system was initially developed in the Kettle ETL tool, so implementing it in Kettle ETLs is very straightforward. If you are using a different ETL tool (such as Oracle DI or Informatica PowerCenter), a mapping from the logs to the star database model must be done.
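To make this idea more concrete, here is a minimal Python/SQLite sketch of the kind of execution logging such a star model relies on: one fact row per execution, opened when the process starts and closed with its final status. The table and column names (FACT_ETL_EXECUTION and so on) are hypothetical illustrations, not the actual ClearPeaks schema or the four Kettle boxes.

```python
import sqlite3
from datetime import datetime

# Hypothetical execution fact table; names are illustrative only, not the
# actual ClearPeaks ETL Control System schema.
DDL = """
CREATE TABLE IF NOT EXISTS FACT_ETL_EXECUTION (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    process_name  TEXT,
    start_time    TEXT,
    end_time      TEXT,
    status        TEXT,
    error_message TEXT
)
"""

def log_start(conn, process_name):
    """Insert a fact row when an ETL process starts and return its execution id."""
    cur = conn.execute(
        "INSERT INTO FACT_ETL_EXECUTION (process_name, start_time, status) "
        "VALUES (?, ?, 'RUNNING')",
        (process_name, datetime.now().isoformat()),
    )
    conn.commit()
    return cur.lastrowid

def log_end(conn, execution_id, status, error_message=None):
    """Close the fact row with its final status, end time and optional error."""
    conn.execute(
        "UPDATE FACT_ETL_EXECUTION SET end_time = ?, status = ?, error_message = ? "
        "WHERE id = ?",
        (datetime.now().isoformat(), status, error_message, execution_id),
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    run_id = log_start(conn, "load_sales")
    # ... the actual extract, transform and load work would run here ...
    log_end(conn, run_id, "SUCCESS")
    print(conn.execute("SELECT * FROM FACT_ETL_EXECUTION").fetchall())
```

With the executions stored this way, a dashboard only needs to query this one table to answer questions such as how many executions failed last night or whether run times are trending upwards.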

 

So do you think you also need to control your ETLs? Would you like to do it with good visualizations? Contact us at info@clearpeaks.com or use our web form if you would like more information about the ClearPeaks ETL Control System, or leave us your comments below!

Blog Article Authors: Pere Vegas & Daniel Albendín

 

Copy rows to result in Kettle | Tips and Tricks


The ‘Copy rows to result’ step

Managing variables is one of the most complex operations in Kettle. In this entry, a few tips are explained to make these operations easier.

The ‘Copy rows to result’ step is the most versatile when it comes to sharing information between objects. This step only exists in transformations, and it can convert row values into parameters, variables or arguments, or even pass them on to another transformation.

The Case

Assume the following job in Kettle (the root job):

Image 1

The transformation T1 is the following:

Image 2

The properties file content is:

Image 3

Now, these properties have to be used in a new transformation or job. The root job now looks like this:

Image 4

The Possibilities

Depending on how these parameters are going to be used in T2, different methods can be applied:

1.- To be used as rows:

If these rows are needed in the flow, a ‘Get rows from result’ step has to be placed in the T2 transformation.

Image 5

Now the main job can be executed, and the log looks like the following:

code 1
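To see what is going on in this first method outside of Kettle, here is a small, purely conceptual Python analogy (it is not Kettle code): T1 hands its rows to the job result, and T2 receives them as an ordinary row stream. The column names A, B and C are taken from this example; the values are placeholders.

```python
# Conceptual analogy in plain Python, not Kettle code: T1 copies its rows to
# the job result, and T2's 'Get rows from result' step reads them back.

def t1_copy_rows_to_result():
    """Stand-in for T1: produce rows and hand them over to the job result."""
    return [
        {"A": "value_a1", "B": "value_b1", "C": "value_c1"},
        {"A": "value_a2", "B": "value_b2", "C": "value_c2"},
    ]

def t2_get_rows_from_result(result_rows):
    """Stand-in for T2: the rows simply enter the transformation's flow."""
    for row in result_rows:
        print("T2 processing row:", row)

result_rows = t1_copy_rows_to_result()   # 'Copy rows to result' in T1
t2_get_rows_from_result(result_rows)     # 'Get rows from result' in T2
```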

2.- To be used as a parameter/variable:

We have to follow these steps:

  1. In the root job, open the T2 options (double left click on it).
  2. Open the Advanced tab.
  3. Check the ‘Execute for every input row?’ option (Image 6).
  4. Open the Parameters tab.
  5. Define every column you want to use as a parameter (A, B, C) (Image 7).
  6. Open the T2 transformation (double middle click on it).
  7. Define the variables in the T2 transformation (Image 8).
  8. Place a ‘Get Variables’ step and fill the fields with the proper values (Image 9).

Now the main job can be executed, and the log looks like the following:

code 2

The same approach can be used when calling a job instead of a transformation.
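Conceptually, this second method makes the job run T2 once per result row, with the chosen columns passed as named parameters that a ‘Get Variables’ step then exposes inside T2. The Python analogy below (again conceptual, not Kettle code) illustrates the pattern; the values are placeholders.

```python
# Conceptual analogy, not Kettle code: 'Execute for every input row?' runs T2
# once per result row, mapping columns A, B, C to named parameters.

def t2(params):
    """Stand-in for T2: a 'Get Variables' step turns the parameters into fields."""
    a, b, c = params["A"], params["B"], params["C"]
    print(f"T2 executed with A={a}, B={b}, C={c}")

result_rows = [
    {"A": "value_a1", "B": "value_b1", "C": "value_c1"},
    {"A": "value_a2", "B": "value_b2", "C": "value_c2"},
]

# One execution of T2 per input row, each row's columns passed by name.
for row in result_rows:
    t2({"A": row["A"], "B": row["B"], "C": row["C"]})
```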

3.- To be used as arguments:

  1. In the root job, open the T2 options (double left click on it).
  2. Open the Advanced tab.
  3. Check the ‘Execute for every input row?’ and ‘Copy previous results to args?’ options.
  4. Open the T2 transformation (double middle click on it).
  5. Place a ‘Get System Info’ step.
  6. Define the ‘Get System Info’ step as follows:

Image 10

Now the main job can be executed, and the log looks like the following:

code 3
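The third method is the positional counterpart of the previous one: each result row is handed to T2 as command-line-style arguments, which the ‘Get System Info’ step then retrieves by position. The Python analogy below (again conceptual, not Kettle code) shows the difference: arguments arrive by position rather than by name.

```python
# Conceptual analogy, not Kettle code: 'Copy previous results to args?' passes
# each result row to T2 as positional arguments, read back via 'Get System Info'.

def t2(*args):
    """Stand-in for T2: arguments arrive by position, not by name."""
    arg1, arg2, arg3 = args
    print(f"T2 executed with arguments: {arg1}, {arg2}, {arg3}")

result_rows = [
    ("value_a1", "value_b1", "value_c1"),
    ("value_a2", "value_b2", "value_c2"),
]

# One execution of T2 per input row, the row's fields passed positionally.
for row in result_rows:
    t2(*row)
```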

Conclusion:

When working with Kettle, the rows in the flow can be used in different ways. You can execute jobs or transformations with data created or calculated in previous steps, which makes Kettle a powerful and versatile ETL tool that offers great scalability and dynamic executions.

