Retrieving Active Directory Data into BI


 

Introduction

Active Directory is used by most organizations to manage the users, groups and devices registered to the network and to enforce network security. This information can be extracted and made available to analyse the authorizations granted on the different applications within your organization.

Below are a few key details that can be retrieved from Active Directory:

UserID, Login Name
Employee’s First Name, Last Name
Employee Status
Job Title
User Mail ID
Employee Organization
Line Manager Details
Account Status
Member Groups

 

1. Motivation

In most business cases, HR systems hold information only for users on the organization's payroll, and not for contractors or external vendors working with the organization. When we need the full list of people (direct employees, contractors, vendors, etc.) who access internal applications such as the service manager, BI services or general services applications, along with their details, we therefore have to extract Active Directory data into the enterprise data warehouse and use it for further analysis. If we can pull this information into OBIEE and automate the data load process, we can generate audit reports and analyse the access rights of users to the various applications.

ADManager Plus data extraction and challenges: 

ManageEngine ADManager Plus is a web-based solution designed to meet Active Directory management and reporting requirements. It helps to audit the security permissions defined for a specific AD object or for a specific user. Its report scheduler feature can generate reports automatically at specified times and deliver them to multiple users via email in the desired format.

Although ADManager Plus supports generating and scheduling reports, these reports can only be delivered to email accounts. Because of this limitation, it cannot by itself automate the extraction of data from Active Directory and its loading into a data warehouse.

2. Python Solution for Automating the Data Retrieval and Loading Process

Data extraction from LDAP can be achieved using an LDAP library for Python. Here we use ldap3, a pure-Python LDAP client library that supports Python versions 2.6 and above.

The ldap3 module needs to be installed in order to establish a connection to the server and retrieve data. It can be installed either with pip (pip install ldap3) or by downloading the latest version from the ldap3 source location and running python setup.py install manually. Refer to the ldap3 library documentation for further details.

Defining server details and establishing a connection to the LDAP server

In order to establish a connection, a Server object needs to be defined first. The Server object identifies the DSA (Directory Server Agent), i.e. the LDAP server that will be used by the connection. To create it, the user needs to define a host variable containing the host name, IP address or complete URL of the LDAP server, together with the port of the LDAP service.

# import classes and constants
from ldap3 import Server, Connection, ALL

# define the server
hostname = 'servername'  # hostname of the LDAP server
server = Server(hostname, get_info=ALL)

The Connection object sends operation requests to the LDAP server. It takes parameters such as the server, username and password used to perform operations on the server. The Connection object also accepts parameters like the authentication type and a read-only flag, which determine the type of operations that can be performed on the server.

# define the connection
username = 'xxxxx@clearpeaks.com'  # user account to access the server
password = '*********'  # password for authentication
connection = Connection(server, user=username, password=password, auto_bind=True, collect_usage=True)
connection.bind()

Once the Connection object is defined, the bind() method is executed to open the connection with the server (with auto_bind=True this is done automatically when the object is created). The bind operation exchanges credentials between client and server and establishes a new authorization state. Once the connection is established, all standard LDAP operations can be performed.

Accessing AD information and generating files

The Search method in the connection object enables search operations on the LDAP database. It takes the following parameters:

Search Base: takes the base of the search request
Search Filter: takes the filter to be applied on the search request
Search Scope: defines how broad the search context is; the possible values are BASE, LEVEL and SUBTREE
Attributes: a single attribute or a list of attributes which can be returned by the search
Get Operational Attributes: if True, operational attributes of the entries are returned as well

The search method takes many other parameters, such as time limit and size limit; refer to the ldap3 documentation for further details. By default, Active Directory sets a hard limit of 1000 entries returned per search, mainly due to security constraints. It is therefore better to use a paged search when retrieving AD data, as in many cases more than 1000 entries will need to be retrieved.

Instead of a simple search operation, here we are doing a paged search to retrieve the full list of entries from the AD server.

# define parameters and call the paged search method
from ldap3 import SUBTREE

base_dn = 'dc=clearpeaks,dc=corp,dc=ae'  # base for the search operation
group_dn = 'cn=All BI Users Group,ou=Groups,dc=clearpeaks,dc=corp,dc=ae'  # full DN of the group used as filter (adapt to your AD structure)
attrs = ['sAMAccountName', 'title', 'displayName', 'department']  # attributes required from AD

data = connection.extend.standard.paged_search(search_base=base_dn,
                                               search_filter='(memberOf=' + group_dn + ')',
                                               search_scope=SUBTREE,
                                               attributes=attrs,
                                               get_operational_attributes=True,
                                               paged_size=15,
                                               generator=True)

The entries returned by the search can be iterated over to retrieve the results (for a simple search they are stored as a list in the response attribute of the Connection object). A few data cleansing steps may be required, as Active Directory returns text encoded in UTF-8, which may need decoding before it is written out. Once retrieved, the response data is written to a CSV file, which can be saved in a shared location on the server.
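To illustrate this step, below is a minimal sketch (assuming Python 3) of how the paged search results could be iterated over and written to a CSV file; the output path, the referral-skipping logic and the handling of multi-valued or byte-encoded attributes are illustrative assumptions rather than part of the original script.

# write the retrieved entries to a CSV file (illustrative sketch)
import csv

output_file = 'ad_users.csv'  # path in the shared location (placeholder)

with open(output_file, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(attrs)  # header row with the requested attributes
    for entry in data:
        if entry.get('type') != 'searchResEntry':  # skip referrals returned by AD
            continue
        attributes = entry['attributes']
        row = []
        for attr in attrs:
            value = attributes.get(attr, '')
            if isinstance(value, list):  # multi-valued attributes
                value = ';'.join(str(v) for v in value)
            if isinstance(value, bytes):  # decode any raw byte values
                value = value.decode('utf-8', errors='ignore')
            row.append(value)
        writer.writerow(row)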


Figure 1: Sample of the response data updated to a CSV file

The files generated by the Python script can then be accessed by an ETL mapping, configured to use the CSV file as a source and to load it into warehouse tables, where the data is ready to be visualized in any BI tool. Below is a sample report using LDAP data.


Figure 2: Sample of Active Directory Data displayed in BI Report

The Unix crontab command can be used to schedule the script, based on the frequency with which data needs to be refreshed from Active Directory. The ETLs are in turn scheduled to load each new set of data from the generated source file.

Conclusion

In this blog post, we have described a solution to automate the retrieval of Active Directory data and its loading into BI environments using Python. Active Directory information can be used to analyse the security privileges granted within an organization, which in turn helps to audit user access to the various applications.

Although we have only covered reading Active Directory data, that is not the only benefit of this method: once an authenticated connection is established with the server, all standard LDAP operations can be performed using the ldap3 Python library.

Contact us if you would like more information about how to pull HR information into OBIEE.

 

Data Quality Series – Introduction


 

This article is the first in a series of blog posts about data quality. Over the next couple of weeks we will go through the following subjects:

Data quality with EDQ

Part 1: Data Profiling
Part 2: Data Standardization
Part 3: Data Deduplication

Data quality with Informatica

Part 1: Data Profiling
Part 2: Data Standardization
Part 3: Data Deduplication

The focus of this first article is to introduce "Data Quality".

1. Introduction to Data Quality

Data quality is the perception or assessment of the fitness of data to serve its purpose in a given context [1]. Good quality data is crucial for analysts to create reports that contain accurate and consistent information, and in some businesses, bad quality or out-of-date data may increase costs. For example, a mailing campaign that sends letters to the wrong customer addresses is a waste of money.

Moreover, reports containing poor quality data may be regarded by customers as erroneous and thus reduce their confidence in the delivered dashboards. In fact, according to Gartner, a loss of confidence by users in their reports/dashboards is the number 1 cause of Data Warehouse / Data Mart / Data Governance project failures.

Many data quality tasks are very close to the database level and can be performed by a DBA, for instance by adding checks on columns to allow only valid values, or by setting default date formats. But in some scenarios we may find that these validations are not in place, and certain data quality tools can help us to analyse and fix these issues.

The term “Data Quality” involves many different aspects:

Validity:

The data conforms to the syntax (format, type, range) of its definition: database, metadata or documentation rules define the allowable types (string, integer, floating point, etc.), the format (length, number of digits, etc.) and the range (minimum, maximum, or containment within a set of allowable values).

Accuracy:

The data correctly describes the "real world" object or event being described. Does it agree with an identified reference of correct information?

Reasonableness:

Does the data align with the operational context? For example, a birthdate of 01/01/01 is valid, but is it reasonable in context?

Completeness:

The proportion of non-blank values against blank values. Business rules define what "100% complete" represents: optional data might be missing, but if the data meets the expectations it can be considered complete.

Consistency:

Values across all systems in an organization reflect the same information and are in sync with each other.

Currency:

The data is current and "fresh"; data lifecycle is a key factor.

Precision:

The level of detail of the data element, e.g. the number of significant digits in a number. Rounding, for example, can introduce errors.

Privacy:

The need for access control and usage monitoring.

Referential Integrity:

Constraints against duplication and invalid references are in place (e.g. foreign keys in an RDBMS)

Timeliness:

The data is available for use when expected and needed.

Uniqueness:

No value occurs more than once in the data set.

Data quality is affected by the way data is entered, stored and managed, and data quality assurance (DQA) is the process of verifying the reliability and effectiveness of data. In the section below we will see the basic approach to a successful DQA project.

 

2. Data Quality Projects

Every data quality project needs to start with an understanding of the benefits we are going to obtain from it, that is, an assessment of the value of the data as a key to better decision-making and improved business performance, or, in other words, its utility.

It is possible to analyse how data is being used to achieve business objectives, and how the achievement of these goals is impeded when flawed data is introduced into the environment. To do this, we must consider the following points:

What the business expectations for data quality are
How the business can be impacted by poor data quality
How to correlate such impacts with specific data quality issues

Once it has been determined that poor quality data is having a negative impact on the way the business operates, the approach to assembling a data quality management programme and instituting the practices that will lead to improvement must be planned. This plan must consider:

The processes that need to be instituted
The participants that will execute those processes and their roles and responsibilities
The tools that will be used to support the processes

Normally, a data quality project lifecycle involves at least three different steps:

1. Data Profiling
2. Data Standardization or Cleansing
3. Data Matching and Deduplication

 

3. Data Profiling

Data profiling is the analysis of data in order to clarify its structure, content, relationships and derivation rules. It mainly involves gathering different aggregate statistics or informative summaries about the data, and ensuring that the values match up to expectations.

Profiling helps not only to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata; thus the purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not.

The typical outputs of the data profiling process are:

Column Profiling:

Record count, unique count, null count, blank count, pattern count
Minimum, maximum, mean, mode, median, standard deviation
Completeness & number of non-null records
Data types
Primary key candidates

Frequency Profiling:

Count/ratio of distinct values

Primary/Foreign Key Analysis:

Referential integrity checks (can be achieved by creating a join profile)
Candidate primary and foreign keys

Duplicate Analysis:

Identify potential duplicate records

Business Rules Conformance:

The data meets an initial set of rules

Outlier Analysis:

Identify possible bad records
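As an illustration of how some of the column and frequency statistics listed above can be gathered, the sketch below profiles a hypothetical customers file with pandas; the file name, the column used for frequency profiling and the primary key heuristic are assumptions made for the example, not part of any specific tool.

import pandas as pd

# minimal column-profiling sketch on a hypothetical customers dataset
df = pd.read_csv('customers.csv')  # hypothetical input file

profile = pd.DataFrame({
    'data_type': df.dtypes.astype(str),
    'record_count': len(df),
    'non_null_count': df.notnull().sum(),
    'null_count': df.isnull().sum(),
    'unique_count': df.nunique(),
    'completeness_pct': (df.notnull().sum() / len(df) * 100).round(2),
})

# columns whose values are all distinct are primary key candidates
profile['pk_candidate'] = profile['unique_count'] == profile['record_count']
print(profile)

# frequency profiling: count of distinct values for a given column
print(df['country'].value_counts(dropna=False))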

 

4. Data Standardization

Once the data profiling process is finished, the next step is to fix the issues that have been identified, by applying rules, replacing the wrong values, and fixing data inconsistencies. Data standardization, or data cleansing, is the process of developing and applying technical standards to the data, to maximize its compatibility, interoperability, safety and quality.
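As a simple illustration of this kind of rule-based cleansing, the sketch below standardizes a few fields with pandas; the column names, the country mapping and the phone format are hypothetical examples, not prescriptions.

import pandas as pd

# minimal standardization sketch on hypothetical customer data
df = pd.DataFrame({
    'name': ['  alice SMITH ', 'Bob  Jones', 'CAROL white'],
    'country': ['U.S.A.', 'usa', 'United States'],
    'phone': ['(555) 123-4567', '555.123.4567', '5551234567'],
})

# trim whitespace, collapse repeated spaces and normalize capitalization
df['name'] = df['name'].str.strip().str.replace(r'\s+', ' ', regex=True).str.title()

# map known variants of the same value to a single standard code
country_map = {'U.S.A.': 'US', 'usa': 'US', 'United States': 'US'}
df['country'] = df['country'].replace(country_map)

# keep digits only, then apply a single phone number format
digits = df['phone'].str.replace(r'\D', '', regex=True)
df['phone'] = digits.str.replace(r'^(\d{3})(\d{3})(\d{4})$', r'\1-\2-\3', regex=True)

print(df)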

 

5. Data Matching and Deduplication

Record matching is the process of finding duplicate records in the data, that is, records which may relate to a single real world entity. This may seem like a trivial task, but in reality it is far from it. The first challenge is to identify the criteria under which two records represent the same entity, considering that the data may come in different formats and using different conventions (free text fields), may be incomplete and/or incorrect (typos, etc.), and the context of the different fields (for example, four different fields may represent a single address). We want to identify duplicates regardless of these differences in the data, so we need to define measures of distance between the records and then apply rules to decide if they classify as duplicates.
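To make the idea of a distance measure and a matching rule concrete, here is a minimal sketch using Python's standard difflib module; the records, field weights and similarity threshold are purely illustrative assumptions.

from difflib import SequenceMatcher
from itertools import combinations

# hypothetical records that may refer to the same real-world entity
records = [
    {'id': 1, 'name': 'John A. Smith', 'address': '12 High Street, Leeds'},
    {'id': 2, 'name': 'Jon Smith', 'address': '12 High St, Leeds'},
    {'id': 3, 'name': 'Mary Brown', 'address': '45 Park Avenue, York'},
]

def similarity(a, b):
    # normalized string similarity between 0 and 1
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

THRESHOLD = 0.8  # illustrative cut-off for flagging a probable duplicate

for r1, r2 in combinations(records, 2):
    # weight the name more heavily than the address (illustrative weights)
    score = 0.6 * similarity(r1['name'], r2['name']) + 0.4 * similarity(r1['address'], r2['address'])
    if score >= THRESHOLD:
        print('Probable duplicate: records %s and %s (score %.2f)' % (r1['id'], r2['id'], score))

Real data quality tools use far more sophisticated comparison algorithms, but the principle of scoring candidate pairs against a threshold is the same.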

Moreover, efficient computation strategies need to be used to find duplicate pairs in large volumes of data. The number of comparisons needed using traditional methods on large datasets quickly scales up to extremely long execution times, and techniques such as a previous clustering of the data become necessary.

Data deduplication is the process of merging record pairs representing the same real world entity, also known as consolidation. It relies on record matching to find pairs of possible duplicate records.

 

6. Data Quality Tools

All the tasks involved in a data quality assurance project can be done manually on the database, or in the ETL processes if working on a data source being loaded from another source. However, there are many vendor applications on the market to make these tasks much easier. Most of these applications can be integrated with other ETL tools, and offer batch processing as well as real-time processing to ensure we always have the best quality data.

In the next article we will introduce Oracle Enterprise Data Quality (EDQ), a leading data quality tool.

References:
[1] Margaret Rouse, “data quality”. Web. TechTarget. November 2005. Accessed January 2017. http://searchdatamanagement.techtarget.com/definition/data-quality

 

Authors:
Javier Giaretta, Nicolas Ribas and Axel Bernaus

Click here if you would like to know more about Data Quality!

 

Informatica Performance – Optimization techniques


 

Informatica provides the market's leading data integration platform. ETL mappings are designed to load data into the data warehouse environment to achieve better reporting, which in turn helps you to understand business trends better. The major problem faced by anyone working with Informatica ETL is designing mappings that do not compromise performance: often we end up creating a mapping that achieves the required functionality but suffers in terms of performance.

This article explains how to identify performance bottlenecks and describes the advanced optimization techniques available in Informatica to overcome them.

The following topics are discussed in detail:

* Optimizing Sources, Targets and Buffer Blocks:
Performance bottlenecks can occur when the Integration Service reads from a source database or writes to a target database. Depending on the source and target data, you may need to increase or decrease the buffer block size. In the article we list the steps to take to ensure that sources and targets are utilized to the optimum level, and how to calculate the buffer block size.

* Pushdown Optimization (PDO):
The Pushdown Optimization option enables data transformation processing to be pushed down into any relational database, to make the best use of database processing power.

* Session Partitioning & Dynamic Session Partitioning:
The Session Partitioning option increases the performance of PowerCenter through parallel data processing. With dynamic partitioning, the Integration Service determines the number of partitions to create at run time, based on factors such as source database partitions or the number of nodes in a grid.

* Concurrent Workflow Execution:
A workflow configured for concurrent execution can run as multiple instances at the same time. In the article we illustrate this functionality with a workflow configured with multiple instances to process data belonging to multiple sites simultaneously.

* Load Balancing:
Load balancing is a mechanism that distributes workloads across the nodes in a grid. In the article we see how to assign service levels to workflows to achieve load balancing.

Click to read the full article: Informatica performance - Optimization techniques!

 

Customizing ODI Knowledge Modules


Lately we have seen a growing interest in ODI (Oracle Data Integrator). We have been working with this tool here in ClearPeaks for a while now, on several projects for different verticals, but it is also true that with the appearance of the newer versions of Oracle Business Intelligence Applications (OBI Apps) 11.1.1.X, it’s a pretty hot topic, and that’s why we’ve decided it was time to publish a quick “tips and tricks” guide for ODI, hoping to save our customers a lot of effort, time, and money.

The first thing to bear in mind when you start working with ODI is that you really must follow best practices in order to take full advantage of two of the most powerful weapons in the ODI arsenal: the declarative design and the knowledge modules. For those of you that don’t have too much experience with ODI, let me run through these two outstanding features.

The declarative design basically allows you to create an abstraction layer between the developer and the actual code that will be executed by the DBMS. This means that in ODI you define "what you want to get", and the way to do it is automatically implemented by the Knowledge Module (KM).

However, you might ask yourself “Is this possible? Can we really rely on the default ODI KMs?” Well, the answer is very simple: for standard needs, where performance is not a problem, yes! But in most of our BI projects, remember that we have had to tune the KMs to adapt them to our customers’ needs and to obtain the maximum benefit from the tool.

But don't think that this undermines what is a fantastic feature. ODI comes with a great set of KMs that give you the perfect starting point to create your own customized KM. Moreover, not every developer needs to go into the details of the KM implementation: in a typical ODI project, the architect is responsible for setting up the ODI environment and provides the whole team with the appropriate KMs to satisfy the particular project needs.

So in principle, the developers don’t need to know all the implementation details (it is up to each user/developer to go beyond and analyze the code ODI is generating, if required). This abstraction significantly speeds up the process of developing an ETL, since once the logic is established, there is no need to redo it over and over again.

A typical example to consider is the logic necessary to load a Slowly Changing Dimension (SCD Type II). With other tools, each developer would need to fully understand the logic of the SCDs and the way the integration process is performed, since it has to be replicated for each table to be loaded.

With the ODI declarative design, you just establish what you want, and the KM will take care of the logic. So you simply need to indicate:

➜ I want to treat "Table X" as a Slowly Changing Dimension (we simply mark the table as such in ODI).


 

➜ I want to use “Column X” and “Column Y” to store the starting and ending date of the row, respectively.


 

➜ I want to use “Column Z” as the current row flag.


 

➜ I want “Column A” to be the Primary Key (Surrogate Key) of “Table X”.


 

➜ And I want this column / set of columns (e.g. B and C) to be the Integration Key (Natural Key) of Table X (the column, or columns, that will be used to determine if the row is a new row, or if the row previously existed and has to be updated).


 

➜ Finally, we can indicate for each column whether we want the ETL to add a new row when the value changes in the source system, or whether we prefer to simply overwrite the existing value.


 

✓ And that’s it! By then selecting the appropriate KM, tuned with the necessary logic by our architect, we can develop as many mappings as we want for Slowly Changing Dimensions. Just indicate the source for each column and run it. Quick and easy!

We have also mentioned the concept of Knowledge Modules. So, some of you may wonder, what is a “Knowledge Module”? This is simply a generic set of steps that will perform the needed logic for your ETL process. Each step can be written in different languages (SQL, Oracle-SQL, Jython, and many more) depending on the technology underneath, with placeholders for each column, table, and in general, “entity” that will take part in our ETL. At execution time, those placeholders are filled with the details of the mappings that have been developed, and this gives the project team the flexibility to reuse logic and speed up the delivery of the ETL process.

Well, that’s enough of an introduction to these two handy features in ODI. Now let’s see some usage examples of the things that can be done, and which can help our customers to reduce implementation time.


1. Automating Loads

▼ Automating the Incremental / Full Load Strategy
There are several ways to automate an incremental / full load strategy using KMs. If we think back to the previous versions of OBI Apps, in Informatica we had to maintain two separate mappings for the incremental and the full versions of each load. With ODI this can be automated in several ways:

ⓐ Manually: adding an option to the mapping
The simplest way is to add an option to the mapping so that we manually specify whether we want to execute a full load or not. This option will drive the execution of a new step in our KM; for example, this step may consist in truncating the table and, if we are populating a dimension, resetting the associated sequence. Here we see the importance of following best practices and naming conventions: if we follow a good methodology when naming our DWH objects, we can distinguish the table type by its name, and the sequences can easily be related to the corresponding dimension.


 

ⓑ Automatically: adding an option to the mapping and using a control table
The second option is an extension of the previous one. By using a control table containing the table name and a FULL_LOAD_FLAG column, we can invoke this process only if the DWH administrator has set the flag in the table (FULL_LOAD_FLAG = 1). This way, there is no need to modify any ODI object to change the behaviour of the interface.

▼ Automating the population of the unspecified rows for dimensions
A typical need in any data warehousing project is to populate an “unspecified” row for each dimension. This will give us the capability of always making inner joins between the facts and dimensions, and thus improve performance in all our OLAP queries.

Once again, we could generate a separate interface for this, for each dimension, but will it be efficient? ODI gives us the possibility of defining a generic step in our KM to be executed for any dimension table to be loaded.

At execution time, ODI is aware of the data type of each column to be populated (remember that every object is stored in the model, so we have an internal representation stored in the ODI repository), and we can prepare a simple generic step that assigns a default value for each data type.



NOTE: See the ODI Substitution API Reference to learn more about the functions you can use to retrieve useful information from the ODI model, or contact us for further details!

▼ Automating the population of Fact Modules by using a control table
Another interesting functionality we usually include in our projects is controlling the periods to be loaded through a control table. As we mentioned before, with this table we can change the behaviour of our ETL process just by changing one row in one table, which greatly speeds up programming and administering the loads.

Using this table, the ETL administrator can specify the starting date to be populated and the number of periods to be loaded. Other options can also be useful in this table, such as a flag (PREV_EXEC_CORRECT_FLG) indicating whether the previous automated execution of the ETL was correct (if not, the ETL should not continue, in order to avoid inconsistencies, for example in Slowly Changing Dimensions). Another flag might indicate whether the population of this table should be included in the automated process (IS_PART_OF_DAILY_LOAD), alongside the FULL_LOAD_FLAG we already mentioned.

In one of our projects, the source system needed to be queried daily due to performance restrictions; by specifying a starting date and the number of periods to be loaded, the ETL automatically ran a loop over the days to be loaded by leveraging ODI variables.


In this case, it is also important to highlight that the interfaces have to be filtered by a variable containing the value of the day to be populated.


Another interesting modification in this case study was made to the corresponding CKM, adding an option to avoid deleting the previous errors, since the same session is going to be executed several times in a loop.


In some cases, it might even be interesting to remove the period from the target table before inserting, if we are performing a full period insertion (we don’t want to merge the data with the previously inserted data, but to directly reload a complete period).


 

By using the ODI substitution functions in the KM steps, we can write generic steps that will work for any object in our ODI model. For example, <%=odiRef.getTable("L","TARG_NAME","A")%> will retrieve the target table name.

See: http://docs.oracle.com/cd/E21764_01/integrate.1111/e12645/odiref_reference.htm#ODIKD1295


2. Boosting performance

▼ Boosting performance for populating staging tables
Since ODI 11g does not have multi-target capability, we sometimes need to reuse the same staging table to populate several final targets, and these staging tables have to be loaded as fast as possible to speed up the whole ETL process. We can use the default KMs to load them, but a staging table may not need integration processes, validation checks and other time-consuming logic; to get round this, a good practice is to create your own staging KM, including only the necessary steps.

Then, in the “Insert” step, you can populate the target table directly, instead of the integration table:

<%=odiRef.getTable("L", "TARG_NAME", "A")%>
vs
<%=odiRef.getTable("L", "INT_NAME", "W")%>


 

▼ Boosting performance by using partition exchange
In order to boost performance of load processes and reporting as well, it is sometimes useful to apply partitioning to your DWH tables. This will give you two main advantages:

➀ At ETL time, the population of a big table (GBs of data) is much faster if the table is partitioned, since you can populate a temporary empty table and then execute a partition exchange to swap the partitions between the two tables. This way, populating the temporary table, running the consistency checks and constraints, and any other processes you may need can all be done much faster.
It is important to emphasize that this is useful for snapshot tables, and in general for tables that do not need to cross data between partitions.

➁ At reporting time, if you just need to analyze one partition (one period) at a time, or maybe two to compare them, the response time in a partitioned table will be faster.

To accomplish the partition exchange in ODI, you can also use a knowledge module that can encapsulate the logic needed. The command below will execute the partition exchange between the temporary table and the target table, leveraging ODI functions.


ALTER TABLE
<%=odiRef.getTargetTable("SCHEMA")%>.<%=odiRef.getTargetTable("RES_NAME")%>
EXCHANGE PARTITION ' || V_PARTITION_NAME || ' WITH TABLE
<%=odiRef.getTable("L", "INT_NAME", "W")%>
<$="<%=odiRef.getUserExit("PARTITION EXCHANGE OPTIONS")%>"$>

 

Note that the variable V_PARTITION_NAME will need to contain the name of the partition to be exchanged. This can be retrieved by a simple query on the temporary table that your mapping has populated.

▼ Loading historical information into a Slowly Changing Dimension
The last interesting modification that we are going to show you is to the Oracle SCD KM. Imagine that you launch your system with a starting date (e.g. 30 September 2014) and, after some months of running your ETL in production, the source system is updated with historical information so that the business can analyse trends and customer behaviour over several years. Obviously, we cannot load this information into our DWH as if it were current data, because it would override the current information with stale (though still relevant) attributes. We need to find a way to include this information in our system, so that if a user queries old periods, they can analyse and explore the attributes the customer had at that point in time.

Well, we can do the following:

➀ Duplicate the mapping you are using to load your SCD.

➁ Duplicate the Slowly Changing Dimension IKM you are using to populate your SCDs.

➂ Modify the copy of your IKM to set the "ACTIVE_TO_DATE" to the date when you launched your system. This way you can run a separate load for older periods without impacting the daily load you run with current data. Any row loaded by this second thread (older periods) will always end on the date when you launched your system, so there won't be duplicates in your SCD, and you can explore both current and old information about the entities in your dimension (e.g. your customers' profile, segment, location, etc.).


 

Just by modifying small pieces of the given KM code, you can meet your particular needs; just be careful to modify the appropriate properties (in this case, SCD_END and SCD_FLAG have to have different values from those in your daily thread).

We hope that you've found these techniques for customizing ODI KMs to speed up the creation and maintenance of your ETL interesting, and if you have further questions, do not hesitate to contact us.
Stay tuned for more interesting features to apply to your ETL processes with #ODI!!

Get in touch with us and see what we can do for you!  ✉

Blog Article: Emilio Soriano

How to control your ETL performance


ETL processes and the importance of control

The term ETL, meaning Extract-Transform-Load, comes from the Data Warehousing field and was traditionally used to refer to those processes that transform and move data from the source system to the data warehouse.

Nowadays, the data movements within a company go far beyond a simple source-to-data-warehouse flow. Data supports many different business processes and has to be shared across different business units and departments. Indeed, in highly data-oriented companies, data and ETL processes are a vital part of key business processes such as customer analysis, fraud detection, mailing and other data-intensive activities.

This growing relevance of data has made ETL performance essential and an important concern, not only for the BI team (usually in charge of the ETL system) but for many other teams too. Thus, having a system that properly controls and shows the performance of all the ETL processes can have a very positive impact in terms of data reliability, early error detection and correction, and continuous process improvement.

A standard ETL Control System

At ClearPeaks we identified this as a general requirement in all ETL systems and developed a standard, easy-to-deploy system to control and show your ETL performance. We are not talking about a simple system that just logs the results of executions in files or tables, but a more complete one that gives you good visual feedback on your ETL performance. Questions such as "How did the executions go tonight?", "Why can't I see any data?" or "Is the execution time of this ETL increasing?" become easy to answer just by taking a look at your Tableau dashboards.


Figure 1. Tableau dashboards of the ETL performance control system

Specifically, we developed two dashboards that address the two principal issues when controlling a process: the final result and the time spent. One dashboard reports on execution results, common errors and success percentages (top left of Figure 1), and the other shows execution times and time-related variables (bottom right of Figure 1).

These dashboards are designed for scalability and easy implementation, so they are useful for any ETL. Likewise, the technical actions required to implement this system are simple and compatible with any ETL tool. To achieve this high level of standardization, the basics of the Business Intelligence methodology have been applied to the ETL Control System architecture: the system is based on an unchanging star database model with a set of completely standard tables, and four simple functions or boxes (the exact term varies depending on the ETL tool) that must be included in the ETL. Initially the system was developed with the Kettle ETL tool, so implementing it in Kettle ETLs is very straightforward; if you are using a different ETL tool (such as Oracle DI or Informatica PowerCenter), a mapping from the logs to the star database model must be done.

 

So do you think you also need to control your ETLs? Would you like to do it with good visualizations? Contact us at info@clearpeaks.com or use our web form if you would like more information about the ClearPeaks ETL Control System, or leave us your comments below!

Blog Article Authors: Pere Vegas & Daniel Albendín

 
