Data Quality Series – Introduction


This article is the first in a series of blog posts about the topic "Data Quality". In the next couple of weeks we will go through the following subject matters:

Data quality with EDQ

Part 1: Data Profiling
Part 2: Data Standardization
Part 3: Data Deduplication

Data quality with Informatica

Part 1: Data Profiling
Part 2: Data Standardization
Part 3: Data Deduplication

The focus of this first article is to introduce "Data Quality".

1. Introduction to Data Quality

Data quality is the perception or assessment of the fitness of data to serve its purpose in a given context [1]. Good quality data is crucial for analysts to create reports that contain accurate and consistent information, and in some businesses, bad quality or out-of-date data may increase costs. For example, a mailing campaign that sends letters to the wrong customer addresses is a waste of money.

Moreover, reports containing poor quality data may be regarded by customers as erroneous and thus reduce their confidence in the delivered dashboards. In fact, according to Gartner, a loss of confidence by users in their reports/dashboards is the number 1 cause of Data Warehouse / Data Mart / Data Governance project failures.

Many of the data quality tasks are very close to the database level and can be performed by a DBA, for instance by adding checks in the columns to allow just valid values, or by setting default date formats. But in some scenarios we may find that these validations are not performed, and certain data quality tools can help us to analyse and fix these issues.

The term “Data Quality” involves many different aspects:

Validity:

The data conforms to the syntax (format, type, range) of its definition: database, metadata or documentation rules specify the allowable types (string, integer, floating point, etc.), the format (length, number of digits, etc.) and the range (minimum, maximum, or containment within a set of allowable values).

Accuracy:

The data correctly describes the "real world" object or event being described. Does it agree with an identified reference of correct information?

Reasonableness:

Does the data align with the operational context? For example, a birthdate of 01/01/01 is valid, but is it reasonable in context?

Completeness:

The proportion of non-blank values against blank values. Business rules define what "100% complete" represents: optional data might be missing, but if the data meets the expectations it can be considered complete.

Consistency:

Values across all systems in an organization reflect the same information and are in sync with each other.

Currency:

The data is current and "fresh"; data lifecycle is a key factor.

Precision:

The level of detail of the data element, e.g. the number of significant digits in a number. Rounding, for example, can introduce errors.

Privacy:

The need for access control and usage monitoring.

Referential Integrity:

Constraints against duplication are in place (e.g. foreign keys in an RDBMS).

Timeliness:

The data is available for use when expected and needed.

Uniqueness:

No value occurs more than once in the data set.
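Several of these dimensions can be checked programmatically. As a hedged illustration (the column names, regular expression and sample records below are hypothetical, not from any specific tool), here is a minimal Python sketch of validity, completeness and uniqueness checks:

```python
import re

# Hypothetical customer records; column names are illustrative only.
rows = [
    {"id": 1, "email": "alice@example.com", "country": "ES"},
    {"id": 2, "email": "not-an-email",      "country": "ES"},
    {"id": 2, "email": None,                "country": "FR"},
]

def validity(rows, field, pattern):
    """Share of non-null values matching the expected syntax."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(bool(re.fullmatch(pattern, v)) for v in values) / len(values)

def completeness(rows, field):
    """Share of non-blank values over all records."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, field):
    """True if no value occurs more than once in the data set."""
    values = [r[field] for r in rows]
    return len(values) == len(set(values))

print(validity(rows, "email", r"[^@\s]+@[^@\s]+\.[^@\s]+"))  # 0.5
print(completeness(rows, "email"))                           # ~0.667
print(uniqueness(rows, "id"))                                # False
```

Real data quality tools package exactly these kinds of rules, plus reporting on the failing records, so you don't have to hand-code them per column.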

Data quality is affected by the way data is entered, stored and managed, and data quality assurance (DQA) is the process of verifying the reliability and effectiveness of data. In the section below we will see the basic approach to a successful DQA project.

 

2. Data Quality Projects

Every data quality project needs to start with an understanding of the benefits we are going to obtain from it, that is, an assessment of the value of the data as a key to better decision-making and improved business performance; in other words, its utility.

It is possible to analyse how data is being used to achieve business objectives, and how the achievement of these goals is impeded when flawed data is introduced into the environment. To do this, we must consider the following points:

What the business expectations for data quality are
How the business can be impacted by poor data quality
How to correlate such impacts with specific data quality issues

Once a negative impact on the ways the business operates due to poor quality data has been determined, the necessary approach to assemble a data quality management programme and institute the practices that will lead to improvement must be planned. This plan must consider:

The processes that need to be instituted
The participants that will execute those processes and their roles and responsibilities
The tools that will be used to support the processes

Normally, a data quality project lifecycle involves at least three different steps:

1. Data Profiling
2. Data Standardization or Cleansing
3. Data Matching and Deduplication

 

3. Data Profiling

Data profiling is the analysis of data in order to clarify its structure, content, relationships and derivation rules. It mainly involves gathering different aggregate statistics or informative summaries about the data, and ensuring that the values match up to expectations.

Profiling helps not only to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata; thus the purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not.

The typical outputs of the data profiling process are:

Column Profiling:

Record count, unique count, null count, blank count, pattern count
Minimum, maximum, mean, mode, median, standard deviation
Completeness & number of non-null records
Data types
Primary key candidates

Frequency Profiling:

Count/ratio of distinct values

Primary/Foreign Key Analysis:

Referential integrity checks (can be achieved by creating a join profile)
Candidate primary and foreign keys

Duplicate Analysis:

Identify potential duplicate records

Business Rules Conformance:

The data meets an initial set of rules

Outlier Analysis:

Identify possible bad records
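To make the column-profiling output concrete, here is a minimal Python sketch of the basic statistics listed above (illustrative only; dedicated tools such as EDQ or Informatica compute these, and much more, for you):

```python
import statistics
from collections import Counter

def profile_column(values):
    """Basic column profile: counts, nulls, frequencies and summary stats."""
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    profile = {
        "record_count": len(values),
        "null_count": values.count(None),
        "unique_count": len(set(non_null)),
        "value_frequencies": Counter(non_null),  # frequency profiling
    }
    if numeric:
        profile.update({
            "min": min(numeric),
            "max": max(numeric),
            "mean": statistics.mean(numeric),
            "median": statistics.median(numeric),
            "stdev": statistics.stdev(numeric) if len(numeric) > 1 else 0.0,
        })
    return profile

# Hypothetical "age" column
p = profile_column([34, 41, None, 34, 29])
print(p["record_count"], p["null_count"], p["unique_count"])  # 5 1 3
print(p["min"], p["max"], p["mean"])                          # 29 41 34.5
```

A column whose profile deviates from expectations (e.g. a high null count on a mandatory field, or an unexpected maximum) is exactly the kind of anomaly profiling is meant to surface.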

 

4. Data Standardization

Once the data profiling process is finished, the next step is to fix the issues that have been identified, by applying rules, replacing the wrong values, and fixing data inconsistencies. Data standardization, or data cleansing, is the process of developing and applying technical standards to the data, to maximize its compatibility, interoperability, safety and quality.
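As a hedged illustration of what such standardization rules look like in practice (the country mapping and the list of source date formats below are invented for the example):

```python
import datetime

# Hypothetical mapping from raw free-text values to a standard code.
COUNTRY_MAP = {"spain": "ES", "es": "ES", "españa": "ES", "france": "FR"}

def standardize_country(raw):
    """Trim, lowercase and map to an ISO-style code."""
    return COUNTRY_MAP.get(raw.strip().lower(), "UNKNOWN")

def standardize_date(raw):
    """Try a few known source formats and emit ISO 8601."""
    for fmt in ("%d/%m/%Y", "%Y-%m-%d", "%d-%b-%Y"):
        try:
            return datetime.datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable: flag for manual review

print(standardize_country("  Spain "))  # ES
print(standardize_date("31/01/2017"))   # 2017-01-31
print(standardize_date("2017-01-31"))   # 2017-01-31
```

The unmapped / unparseable cases are as important as the fixed ones: they feed back into profiling as records needing a new rule or manual correction.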

 

5. Data Matching and Deduplication

Record matching is the process of finding duplicate records in the data, that is, records which may relate to a single real-world entity. This may seem like a trivial task, but in reality it is far from it. The first challenge is to identify the criteria under which two records represent the same entity, considering that the data may come in different formats and conventions (free-text fields), may be incomplete and/or incorrect (typos, etc.), and that the context of the different fields matters (for example, four different fields may represent a single address). We want to identify duplicates regardless of these differences in the data, so we need to define measures of distance between the records and then apply rules to decide whether they classify as duplicates.

Moreover, efficient computation strategies need to be used to find duplicate pairs in large volumes of data. The number of comparisons needed using traditional methods on large datasets quickly scales up to extremely long execution times, and techniques such as a previous clustering of the data become necessary.
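The two ideas above, distance measures between records and a prior clustering (often called blocking) to limit the number of comparisons, can be sketched as follows. The sample records, the blocking key (first letter of the name) and the similarity threshold are illustrative assumptions, not a production rule set:

```python
from difflib import SequenceMatcher
from itertools import combinations
from collections import defaultdict

# Hypothetical customer records: (id, name, address)
records = [
    (1, "john smith",   "12 main st"),
    (2, "jon smith",    "12 main street"),
    (3, "maria garcia", "5 oak ave"),
]

def similarity(a, b):
    """String similarity in [0, 1] via difflib's ratio."""
    return SequenceMatcher(None, a, b).ratio()

def candidate_pairs(records, block_key=lambda r: r[1][0]):
    """Only compare records that share a blocking key, instead of all pairs."""
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    for block in blocks.values():
        yield from combinations(block, 2)

THRESHOLD = 0.8  # tuning this threshold is part of defining the matching rules
matches = [
    (a[0], b[0])
    for a, b in candidate_pairs(records)
    if similarity(a[1], b[1]) > THRESHOLD and similarity(a[2], b[2]) > THRESHOLD
]
print(matches)  # [(1, 2)]
```

With n records, the naive approach needs n·(n-1)/2 comparisons; blocking cuts this to the pairs within each block, which is what makes matching feasible on large volumes.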

Data deduplication is the process of merging record pairs representing the same real world entity, also known as consolidation. It relies on record matching to find pairs of possible duplicate records.

 

6. Data Quality Tools

All the tasks involved in a data quality assurance project can be done manually on the database, or in the ETL processes if working on a data source being loaded from another source. However, there are many vendor applications on the market to make these tasks much easier. Most of these applications can be integrated with other ETL tools, and offer batch processing as well as real-time processing to ensure we always have the best quality data.

In the next article we will introduce Oracle Enterprise Data Quality (EDQ), a leading data quality tool.

References:
[1] Margaret Rouse, “data quality”. Web. TechTarget. November 2005. Accessed January 2017. http://searchdatamanagement.techtarget.com/definition/data-quality

 

Authors:
Javier Giaretta, Nicolas Ribas and Axel Bernaus


 

Informatica Performance – Optimization techniques


 

Informatica provides the market's leading data integration platform. ETL mappings are designed to load data into the data warehouse environment to achieve better reporting, which in turn helps you understand business trends better. The major problem faced by anyone working with Informatica ETL is designing mappings that don't compromise performance; often we end up creating a mapping that achieves the required functionality but suffers in terms of performance.

The article explains the steps to identify the performance bottlenecks and the advanced optimization techniques that are available in Informatica to overcome them.

The following topics are discussed in detail:

* Optimizing Sources, Targets and Buffer Blocks:
Performance bottlenecks can occur when the Integration Service reads from a source database or writes to a target database. Depending on the source and target data, you may need to increase or decrease the buffer block size. In this article we list the steps to take to ensure that the sources and targets are utilized to an optimum level, and how to calculate the buffer block size.
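As a rough illustration of the kind of arithmetic involved (the rule of thumb below, on the order of 100 rows per buffer block and two blocks per source/target connection, is a commonly cited starting point, not a substitute for the session-log-based calculation the full article describes):

```python
# Back-of-the-envelope buffer sizing sketch; the numbers are illustrative
# assumptions, not Informatica defaults read from any API.
ROWS_PER_BLOCK = 100

def buffer_block_size(max_row_size_bytes, rows_per_block=ROWS_PER_BLOCK):
    """Size a block to hold ~100 rows of the widest source/target row."""
    return max_row_size_bytes * rows_per_block

def min_dtm_buffer_size(block_size, n_sources, n_targets):
    """The session needs at least two blocks per source/target connection."""
    return block_size * (n_sources + n_targets) * 2

block = buffer_block_size(2048)           # widest row ~2 KB
print(block)                              # 204800
print(min_dtm_buffer_size(block, 1, 1))   # 819200
```

In practice you would take the precision of the widest row from the session log and then round the resulting sizes to the values the tool accepts.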

* Pushdown Optimization (PDO):
The Pushdown Optimization Option enables data transformation processing to be pushed down into the relational database, making the best use of database processing power.

* Session Partitioning & Dynamic Session Partitioning:
The Session Partitioning Option increases the performance of PowerCenter through parallel data processing. In dynamic partitioning, the Integration Service determines the number of partitions to create at run time, based on factors such as source database partitions or the number of nodes in a grid.

* Concurrent Workflow Execution:
A workflow configured for Concurrent Execution can run as multiple instances concurrently. In the article we illustrate this functionality by using a workflow that is configured with multiple instances to process data belonging to multiple sites simultaneously.

* Load Balancing:
Load balancing is a mechanism that distributes workloads across the nodes in the grid. In the article we see how to assign Service Levels to workflows to achieve load balancing.

Click to read the full article: Informatica performance - Optimization techniques!

 

Customizing ODI Knowledge Modules


Lately we have seen a growing interest in ODI (Oracle Data Integrator). We have been working with this tool here at ClearPeaks for a while now, on several projects for different verticals, and with the appearance of the newer versions of Oracle Business Intelligence Applications (OBI Apps) 11.1.1.X it has become a pretty hot topic. That's why we've decided it was time to publish a quick "tips and tricks" guide for ODI, hoping to save our customers a lot of effort, time, and money.

The first thing to bear in mind when you start working with ODI is that you really must follow best practices in order to take full advantage of two of the most powerful weapons in the ODI arsenal: the declarative design and the knowledge modules. For those of you that don’t have too much experience with ODI, let me run through these two outstanding features.

The declarative design basically allows you to create an abstraction layer between the developer and the actual code that is going to be performed by the DBMS. This means that in ODI you define “what you want to get”, and the way to do so is automatically implemented by the Knowledge Module.

However, you might ask yourself: "Is this possible? Can we really rely on the default ODI KMs?" Well, the answer is very simple: for standard needs, where performance is not a problem, yes! But in most of our BI projects we have had to tune the KMs to adapt them to our customers' needs and to obtain the maximum benefit from the tool.

But don't think that this undermines what is a fantastic feature. ODI comes with a great set of KMs that give you the perfect starting point for creating your own customized KM. Moreover, not all developers need to go into the details of the KM implementation; in a typical ODI project, the architect is responsible for setting up the ODI environment and provides the whole team with the appropriate KMs to satisfy the particular project needs.

So in principle, the developers don’t need to know all the implementation details (it is up to each user/developer to go beyond and analyze the code ODI is generating, if required). This abstraction significantly speeds up the process of developing an ETL, since once the logic is established, there is no need to redo it over and over again.

A typical example to consider is the logic necessary to load a Slowly Changing Dimension (SCD Type II). With other tools, each developer would need to fully understand the logic of the SCDs and the way the integration process is performed, since it has to be replicated for each table to be loaded.
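The per-table logic that would otherwise be replicated can be sketched as follows, a minimal in-memory version of SCD Type II row versioning with illustrative column names (the real KM generates the equivalent SQL against the dimension table):

```python
import datetime

HIGH_DATE = datetime.date(2400, 1, 1)  # conventional "open" end date

# Hypothetical dimension rows with SCD Type II housekeeping columns.
dim = [
    {"cust_id": "C1", "segment": "Retail",
     "start_date": datetime.date(2016, 1, 1), "end_date": HIGH_DATE,
     "current_flag": "Y"},
]

def scd2_apply(dim, cust_id, segment, load_date):
    """Close the current row and insert a new version when the tracked
    attribute changes; do nothing when it is unchanged."""
    current = next((r for r in dim
                    if r["cust_id"] == cust_id and r["current_flag"] == "Y"),
                   None)
    if current and current["segment"] == segment:
        return  # no change: nothing to do
    if current:
        current["end_date"] = load_date      # close the old version
        current["current_flag"] = "N"
    dim.append({"cust_id": cust_id, "segment": segment,
                "start_date": load_date, "end_date": HIGH_DATE,
                "current_flag": "Y"})        # open the new version

scd2_apply(dim, "C1", "Corporate", datetime.date(2017, 3, 1))
print(len(dim))               # 2
print(dim[0]["current_flag"]) # N
```

This is precisely the logic a developer would otherwise have to reimplement for every dimension; in ODI it lives once, inside the KM.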

With the ODI declarative design, you just establish what you want, and the KM will take care of the logic. So you simply need to indicate:

➜ I want to treat “Table X” as a Slowly Changing Dimension (we will mark it like this).

Oracle Data Integrator

 

➜ I want to use “Column X” and “Column Y” to store the starting and ending date of the row, respectively.

Oracle Data Integrator

 

➜ I want to use “Column Z” as the current row flag.

Oracle Data Integrator

 

➜ I want “Column A” to be the Primary Key (Surrogate Key) of “Table X”.

Oracle Data Integrator

 

➜ And I want this column / set of columns (e.g. B and C) to be the Integration Key (Natural Key) of Table X (the column, or columns, that will be used to determine if the row is a new row, or if the row previously existed and has to be updated).

Oracle Data Integrator

 

➜ Finally, we can indicate for each column if we want the ETL to add a new row when the value changes in the source system, or if we prefer to update the whole table.

Oracle Data Integrator

 

✓ And that’s it! By then selecting the appropriate KM, tuned with the necessary logic by our architect, we can develop as many mappings as we want for Slowly Changing Dimensions. Just indicate the source for each column and run it. Quick and easy!

We have also mentioned the concept of Knowledge Modules. So, some of you may wonder, what is a “Knowledge Module”? This is simply a generic set of steps that will perform the needed logic for your ETL process. Each step can be written in different languages (SQL, Oracle-SQL, Jython, and many more) depending on the technology underneath, with placeholders for each column, table, and in general, “entity” that will take part in our ETL. At execution time, those placeholders are filled with the details of the mappings that have been developed, and this gives the project team the flexibility to reuse logic and speed up the delivery of the ETL process.

Well, that’s enough of an introduction to these two handy features in ODI. Now let’s see some usage examples of the things that can be done, and which can help our customers to reduce implementation time.


1. Automating Loads

▼ Automating the Incremental / Full Load Strategy
There are several ways to automate an incremental / full load strategy using KMs. Thinking back to the previous version of OBI Apps, in Informatica we had to maintain two separate mappings for the incremental and full versions of a mapping. With ODI this can be automated in several ways:

ⓐ Manually: Adding an option to the mapping
The simplest way is to add an option to the mapping so we manually specify whether we want to execute a full load or not. This option will drive the execution of a new step in our KM. For example, this step may consist in truncating the table and, if we are populating a dimension, resetting the associated sequence. Here we see the importance of following best practices and naming conventions: if we follow a good methodology when naming our DWH objects, we can distinguish the table type by its name, and the sequences can easily be related to the corresponding dimension.

Oracle Data Integrator
Oracle Data Integrator

 

ⓑ Automatically: Adding an option to the mapping and using a control table
The second option is an extension of the previous one. By using a control table containing the table name and one column containing the “FULL_LOAD_FLAG”, we can invoke this process only if the DWH administrator has set the property in the table (FULL_LOAD_FLAG = 1). This way, there is no need to modify any ODI object to change the behavior of the interface.
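A sketch of the control-table check (the table and column names follow the FULL_LOAD_FLAG convention described above, but are otherwise illustrative; SQLite stands in for the DWH):

```python
import sqlite3

# Hypothetical control table; in the real project this lives in the DWH
# and the KM step queries it before deciding whether to truncate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE etl_control (table_name TEXT, full_load_flag INTEGER)")
conn.execute("INSERT INTO etl_control VALUES ('W_CUSTOMER_D', 1)")

def full_load_requested(conn, table_name):
    """Return True when the administrator has set FULL_LOAD_FLAG = 1."""
    row = conn.execute(
        "SELECT full_load_flag FROM etl_control WHERE table_name = ?",
        (table_name,)).fetchone()
    return bool(row and row[0] == 1)

if full_load_requested(conn, "W_CUSTOMER_D"):
    # here the KM step would issue the TRUNCATE (and sequence reset)
    print("full load")
else:
    print("incremental load")
```

The point of the pattern is that switching between incremental and full load becomes a data change (one UPDATE on the control table) rather than an ODI object change.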

▼ Automating the population of the unspecified rows for dimensions
A typical need in any data warehousing project is to populate an “unspecified” row for each dimension. This will give us the capability of always making inner joins between the facts and dimensions, and thus improve performance in all our OLAP queries.

Once again, we could generate a separate interface for this, for each dimension, but will it be efficient? ODI gives us the possibility of defining a generic step in our KM to be executed for any dimension table to be loaded.


At execution time, ODI is aware of the data type of each column to be populated (remember that every object is stored in the model, and so we have an internal representation stored in the ODI repository). We can prepare a simple process that will basically assign a default value for each data type. For example:

Oracle Data Integrator

Oracle Data Integrator

NOTE: See the ODI Substitution API Reference to learn more about the functions you can use to retrieve useful information from the ODI model, or contact us for further details!
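Outside ODI, the idea can be sketched like this (the type-to-default mapping and the table/column names are illustrative assumptions; the real KM step reads the column list and data types from the ODI model at execution time):

```python
# Default "unspecified" value per data type, as an illustrative convention.
DEFAULTS = {
    "VARCHAR2": "'UNSPECIFIED'",
    "NUMBER":   "-1",
    "DATE":     "TO_DATE('01-01-1900','DD-MM-YYYY')",
}

def unspecified_insert(table, columns):
    """Build the INSERT for the 'unspecified' row from (name, type) pairs."""
    names = ", ".join(name for name, _ in columns)
    values = ", ".join(DEFAULTS[dtype] for _, dtype in columns)
    return f"INSERT INTO {table} ({names}) VALUES ({values})"

sql = unspecified_insert("W_PRODUCT_D",
                         [("PRODUCT_NAME", "VARCHAR2"),
                          ("PRODUCT_KEY", "NUMBER"),
                          ("EFFECTIVE_DATE", "DATE")])
print(sql)
```

Because the statement is generated from metadata, the same KM step works unchanged for every dimension table in the model.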

▼ Automating the population of fact tables by using a control table
Another interesting piece of functionality we usually include in our projects is controlling the periods to be loaded through a control table. As mentioned before, with this table we can change the behaviour of our ETL process just by changing one row in one table, which greatly speeds up programming and administering the loads.

By using this table, the ETL administrator is capable of specifying the starting date that we need to populate, and the number of periods to be loaded. More options can be useful in this table, like a flag (PREV_EXEC_CORRECT_FLG) indicating if the automated previous execution of the ETL was correct (if not the ETL should not continue in order to avoid inconsistencies, for example, in Slowly Changing Dimensions). Another flag might be used to indicate if the population of this table should be included in the automated process (IS_PART_OF_DAILY_LOAD), or the FULL_LOAD_FLAG that we already mentioned.

In the example below, the source system needed to be queried daily due to performance restrictions; by specifying a starting date and the number of periods to be loaded, the ETL automatically ran a loop for the number of days to be loaded by leveraging the ODI variables.
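The loop driven by the control table can be sketched as follows (hypothetical control-table values; in ODI the loop is implemented with package steps and the #LOAD_DATE-style variables mentioned below):

```python
import datetime

# Hypothetical values read from the control table.
start_date = datetime.date(2017, 1, 1)  # starting date to populate
n_periods = 3                           # number of periods (days) to load

def load_day(day):
    # in ODI this would set the load-date variable and run the interface,
    # filtered by that variable
    return f"loaded {day.isoformat()}"

log = [load_day(start_date + datetime.timedelta(days=i))
       for i in range(n_periods)]
print(log)
# ['loaded 2017-01-01', 'loaded 2017-01-02', 'loaded 2017-01-03']
```

Each iteration queries the source system for a single day, which is what made the approach viable under the performance restrictions described above.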

Oracle Data Integrator
Oracle Data Integrator

In this case, it is also important to highlight that the interfaces have to be filtered by a variable containing the value of the day to be populated:

Oracle Data Integrator

Another interesting modification in this case study was made to the corresponding CKM: adding an option to avoid deleting the previous errors, since the same session is going to be executed several times in a loop.

Oracle Data Integrator

In some cases, it might even be interesting to remove the period from the target table before inserting, if we are performing a full period insertion (we don’t want to merge the data with the previously inserted data, but to directly reload a complete period).

Oracle Data Integrator

 

As you can see above, by using the ODI functions we can generate generic steps that will work for any object in our ODI model. For example, <%=odiRef.getTable("L","TARG_NAME","A")%> will retrieve the target table name.

See: http://docs.oracle.com/cd/E21764_01/integrate.1111/e12645/odiref_reference.htm#ODIKD1295


2. Boosting performance

▼ Boosting performance for populating staging tables
Since ODI 11g does not have a multi-target capability, we sometimes need to reuse the same staging table to populate several final targets, and we have to load these staging tables as fast as possible to speed up the whole ETL process. We could use the default KMs to load them, but for a staging table you may not need to apply integration processes, validation checks and other time-consuming logic; to get round this, a good practice is to generate your own staging KM, including only the necessary steps:
Oracle Data Integrator

Then, in the “Insert” step, you can populate the target table directly, instead of the integration table:

<%=odiRef.getTable("L","TARG_NAME","A")%>
vs
<%=odiRef.getTable("L","INT_NAME","W")%>

Oracle Data Integrator

 

▼ Boosting performance by using partition exchange
In order to boost performance of load processes and reporting as well, it is sometimes useful to apply partitioning to your DWH tables. This will give you two main advantages:

➀ At ETL time, the population of a big table (GBs of data) is much faster if the table is partitioned, since you can populate a temporary empty table, and then execute a partition exchange to swap the partitions between both tables. This way, all the population in your temporary table, the consistency checks, constraints and all the processes you may need to do can be done much faster.
It is important to emphasize that this is useful for snapshot tables, and in general for tables that do not need to cross data between partitions.

➁ At reporting time, if you just need to analyze one partition (one period) at a time, or maybe two to compare them, the response time in a partitioned table will be faster.

To accomplish the partition exchange in ODI, you can also use a knowledge module that can encapsulate the logic needed. The command below will execute the partition exchange between the temporary table and the target table, leveraging ODI functions.


ALTER TABLE
<%=odiRef.getTargetTable("SCHEMA")%>.<%=odiRef.getTargetTable("RES_NAME")%>
EXCHANGE PARTITION ' || V_PARTITION_NAME || ' WITH TABLE
<%=odiRef.getTable("L","INT_NAME","W")%>
<$="<%=odiRef.getUserExit("PARTITION EXCHANGE OPTIONS")%>"$>

 

Note that the variable V_PARTITION_NAME will need to contain the name of the partition to be exchanged. This can be retrieved by a simple query on the temporary table that your mapping has populated.

▼ Loading historical information into a Slowly Changing Dimension
The last interesting modification that we are going to show you is to the Oracle SCD KM. Imagine that you launch your system with a starting date (e.g. 30 Sept. 2014) and, after some months running your ETL system in production, the source system is updated with historical information so that the business can analyze trends and customer behavior over several years. Obviously, we can't load this older information as if it were current, because it would overwrite the current information with stale (but still relevant) data. We need a way to include this information in our system, so that if a user queries old periods, they can analyze and explore the attributes the customer had at that point in time.

Well, we can do the following:

➀ Duplicate the mapping you are using to load your SCD.

➁ Duplicate the Slowly Changing Dimension IKM you are using to populate your SCDs.

➂ Modify the copy of your IKM to set the "ACTIVE_TO_DATE" to the date when you launched your system. This way you can run a separate load for older periods without impacting the daily load you run with current data. Any row loaded by this second thread (older periods) will always end by the date when you launched your system, so there won't be duplicates in your SCD, and you can explore both current and old information about the entities in your dimension (e.g. your customer profile, segment, location, etc.).

Oracle Data Integrator

 

Just by modifying small pieces of the given KM code, you can meet your particular needs. Be careful to modify the appropriate properties (in this case, SCD_END and SCD_FLAG have to have different values from those in your daily thread).

We hope you've found these techniques for ODI KMs, which speed up the process of generating and manipulating your ETL, interesting; if you have further questions, do not hesitate to contact us.
Stay tuned for more interesting features to apply to your ETL processes with #ODI!!

Get in touch with us and see what we can do for you!  ✉

Blog Article: Emilio Soriano

How to control your ETL performance


ETL processes and the importance of control

The term ETL, meaning Extract-Transform-Load, comes from the Data Warehousing field and was traditionally used to refer to those processes that transform and move data from the source system to the data warehouse.

Nowadays, the data movements inside a company go far beyond a simple source-to-data-warehouse flow. Data supports many different business processes and has to be shared across different business units and departments. Indeed, in highly data-oriented companies, data and ETL processes are a vital part of key business processes such as customer analysis, fraud detection, mailing and other data-intensive activities.

This increasing relevance of data has made ETL system performance essential, and an important concern not only for the BI team (usually in charge of the ETL system) but for many other teams too. Thus, having a system that properly controls and shows the performance of all the ETL processes can have a very positive impact in terms of data reliability, early error detection and correction, and continuous process improvement.

A standard ETL Control System

At ClearPeaks, we identified this as a general requirement in all ETL systems and developed a standard, easy-to-deploy system to control and show your ETL performance. We are not talking about a simple system that just logs execution results in files or tables, but a more complete one that gives you good visual feedback on your ETL performance. Questions such as "How did the executions go tonight?", "Why can't I see any data?" or "Is the execution time of this ETL increasing?" become easy to answer by just taking a look at your Tableau dashboards.

ETL

Figure 1. Tableau dashboards of the ETL performance control system

Specifically, we developed two dashboards that address the two principal issues when controlling a process: the final result and the time spent. One dashboard reports on execution results, common errors and success percentages (top left of Figure 1), and another shows execution times and time variables (bottom right of Figure 1).

These dashboards are designed for scalability and easy implementation, so they are useful for any ETL. Likewise, the technical actions required to implement the system are simple and compatible with any ETL tool. To support this level of standardization, the basics of Business Intelligence methodology have been adapted to the ETL Control System architecture: the system is based on an unchanging star database model with completely standard tables, and four simple functions or boxes (the exact term varies depending on the ETL tool) that must be included in the ETL. The system was initially developed in the Kettle ETL tool, so implementing it in Kettle ETLs is very straightforward; if you are using a different tool (such as Oracle DI or Informatica PowerCenter), a mapping from the logs to the star database model must be done.
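As a hedged sketch of the kind of log table that sits behind such dashboards (the schema and query below are illustrative, not the actual ClearPeaks star model; SQLite stands in for the repository database):

```python
import sqlite3
import datetime

# Minimal stand-in for the fact table of an ETL control star model.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE etl_run_log (
    etl_name   TEXT,
    start_time TEXT,
    end_time   TEXT,
    status     TEXT,
    rows_out   INTEGER)""")

def log_run(conn, etl_name, start, end, status, rows_out):
    """One row per execution: the four 'boxes' in the ETL write this."""
    conn.execute("INSERT INTO etl_run_log VALUES (?,?,?,?,?)",
                 (etl_name, start.isoformat(), end.isoformat(), status, rows_out))

t0 = datetime.datetime(2017, 1, 31, 2, 0)
log_run(conn, "load_sales", t0, t0 + datetime.timedelta(minutes=12), "OK", 54000)
log_run(conn, "load_sales", t0, t0 + datetime.timedelta(minutes=30), "ERROR", 0)

# success rate per ETL, the kind of measure the results dashboard plots
rate = conn.execute("""SELECT etl_name,
                              AVG(CASE WHEN status = 'OK' THEN 1.0 ELSE 0 END)
                       FROM etl_run_log GROUP BY etl_name""").fetchone()
print(rate)  # ('load_sales', 0.5)
```

With executions logged this way, both dashboards reduce to straightforward aggregations over one fact table, which is what makes the system portable across ETL tools.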

 

So do you think you also need to control your ETLs? Would you like to do it with good visualizations? Contact us at info@clearpeaks.com or use our web form if you would like more information about the ClearPeaks ETL Control System, or leave us your comments below!

Blog Article Authors: Pere Vegas & Daniel Albendín

 

Copy rows to result in Kettle | Tips and Tricks


Copy rows to results steps

Managing variables is one of the most complex operations in Kettle. In this entry we explain a few tips to make these operations easier.

The 'Copy rows to result' step is the most versatile when it comes to sharing information between objects. This step only exists in transformations, and it can convert row values into parameters, variables or arguments, or even pass them to another transformation.

The Case

Assuming the following job in Kettle (root job):

Image 1

The transformation T1 is the following:

Image 2

The content of the properties file is:

Image 3

Now, these properties have to be used in a new transformation or job. The root job will now look like:

Image 4

The Possibilities

Depending on how these parameters are going to be used in T2, different methods can be used:

1. To be used as rows:

If these rows are needed in the flow, a 'Get rows from result' step has to be placed in the T2 transformation.

Image 5

Now the main job can be executed; the log looks like the following:

code 1

2. To be used as a parameter/variable:

We have to follow these steps:

  1. In the root job, open the T2 options (double left click on it).
  2. Open the Advanced tab.
  3. Check the 'Execute for every input row?' option.
     Image 6
  4. Open the Parameters tab.
  5. Define every column you want to use as a parameter (A, B, C).
     Image 7
  6. Open the T2 transformation (double middle click on it).
  7. Define the variables in the T2 transformation.
     Image 8
  8. Place a 'Get Variables' step and fill in the fields with the proper values.
     Image 9

Now the main job can be executed; the log looks like the following:

code 2

The same approach can be used when calling a job instead of a transformation.

3. To be used as an argument:

  1. In the root job, open the T2 options (double left click on it).
  2. Open the Advanced tab.
  3. Check the 'Execute for every input row?' and 'Copy previous results to args?' options.
  4. Open the T2 transformation (double middle click on it).
  5. Place a 'Get System Info' step.
  6. Define the 'Get System Info' step as follows:

Image 10

Now the main job can be executed; the log looks like the following:

code 3

Conclusion:

When working with Kettle, the rows in the flow can be used in different ways. You can execute jobs or transformations with data created or calculated in previous steps, so Kettle becomes a powerful and versatile ETL tool offering great scalability and dynamic executions.
