Data Quality with Informatica – Part 1: Data Profiling


Welcome to the first article in the Informatica Data Quality series, where we are going to run through the basics of Informatica Analyst and the main features of Informatica Developer for data profiling.

Informatica is one of the most important data integration vendors in the market; they are behind PowerCenter, a very well-known ETL tool that can be integrated with other Informatica tools, such as Informatica Analyst, a web application used by data analysts to analyse data and create data profiles, among other tasks. In the sections below we are going to go through the necessary steps to create a data profile, a business rule for column profiling and finally a scorecard to view the results.

 

1. Create a Data Profile

To start profiling our data, first open the browser, log into the Analyst tool (the default URL is http://infaServerName:8085/AnalystTool) and create a new project, which we’ll call Data_Profiling_Example:

Figure 1: Creating a project in Informatica Analyst

Now we add a data source; in this example we are going to load a file with information from AdWords. For demonstration purposes, several errors have been introduced into the file, such as different date formats. To add a file, click on the Actions menu on the right-hand side of the window and select Add Flat File:

Figure 2: Adding data from a file in Informatica Analyst

Importing data from files is straightforward if we follow the wizard. In this example, we are going to set comma-separated values, header present, data starting in line 2, and all columns as strings. The tool will automatically detect the length of the fields.

Figure 3: Add flat file wizard in Informatica Analyst

Now we need to create a new profile for our data, and we can do this by clicking on new profile on the menu on the right. In the wizard, we select our data file and accept all the default values.
Once the profile has been created we can review the values of the data, the percentage of nulls, and term frequencies in the results panel, as well as being able to analyse the patterns of the data values for every column. We can also view a summary of basic statistics, such as the max value, the min value and the top and bottom values for each column.

Figure 4: Profiling results in Informatica Analyst

In our example we can see several issues in the data of the file. For example, in the image below we can see that the year is incorrect for some records (we are assuming that the file should contain just the numeric value for the year). In this example, the file should only contain data for 2nd January 2015, so it looks like the file has some invalid records: some have a different year, and others have a wrong value altogether. This could be due to a bad extraction from the source system, or a wrong delimiter in some rows. In order to measure the quality of the file, we are now going to create some business rules, add them to the data profile, and finally create a visualization.

The data analysts from our organization have given us the following business rules:

* the year must be 2015 for this file
* the day column must always be 1/2/2015
* the file should only contain Enabled campaigns

We will create two business rules to validate the year and the day columns, and for the Enabled campaigns we will set up the value Enabled in the campaign_status column as valid.

We can create the business rules in two ways: by using the expression builder in the Analyst tool, or by creating a mapping in Informatica Developer. To create the business rule directly in the profile, we simply click on edit, then on the column profiling rules, and then on the plus sign to add a rule.

Figure 5: Creating rules in Informatica Analyst

Then we select new rule for the year column and enter the expression you can see in the following image. We can save the rule as reusable; this way we will be able to apply exactly the same rule for a different column in the file if necessary.

Figure 6: New rule wizard in Informatica Analyst
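
The exact expression is the one shown in the screenshot; as an illustrative sketch (assuming the input column is called Year and its values are read as strings), a rule of this kind boils down to a simple IIF expression:

IIF(Year = '2015', 'Valid', 'Invalid')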

We will implement the second rule in the Developer tool. To do this, we open Informatica Developer and connect to our project, then create a mapplet with an input transformation, an expression and an output transformation, and save it as DayValidator. To validate the rule, we can right-click on the rule and select validate.

Figure 7: Creating a rule in Informatica Developer

We will define the expression with three possible output values: not a date, Valid date and Invalid date.

Figure 8: Defining rules in Informatica Developer
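
The screenshot shows the exact logic used; as a hedged sketch (assuming the input port is called day and the file uses the MM/DD/YYYY date format), the expression could look something like this:

IIF(NOT IS_DATE(day, 'MM/DD/YYYY'), 'not a date',
    IIF(TO_DATE(day, 'MM/DD/YYYY') = TO_DATE('01/02/2015', 'MM/DD/YYYY'), 'Valid date', 'Invalid date'))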

Once the rule has been created, we can go back to Informatica Analyst, edit the profile and now, instead of creating a new rule, we are going to apply the DayValidator rule we just created in Developer to the day column in the profile. We will call the output of the rule IsValidDay:

Figure 9: New rule wizard in Informatica Analyst

Now we are ready to run the profile again and review the outcome of the two different rules:

Figure 10: Data profiling project in Informatica Analyst

Reviewing the results, we can see that the data contains wrong values for Day and Year:

Figure 11: Reviewing profiling results in Informatica Analyst

 

2. Create a Scorecard for the Profile

Now that we have executed and checked the profile, we can create a scorecard to measure the quality of the file as the last step in this data quality assessment. In order to do this, we have to go to the profile and add it to a new scorecard, where we can define the valid values for each column in our data. In this example, we are going to create the scorecard with three metrics, called scores (the two rule outputs plus the campaign status), and then select the valid values for each score.

The scorecard allows us to drill down from the score to the data. We select the key of the file (first three columns), the campaign status, and the output from both rules as drilldown columns; this way we can easily export the invalid rows to a file and send the results to the owner of the data so they can fix the wrong values and re-run the proper processes to update the data.

Figure 12: Data profiling scorecard in Informatica Analyst

This concludes the first article in this series about Data Quality with Informatica.
In the next couple of blog posts we’re going to explain how to standardize and deduplicate data. Stay tuned!
If you would like to know more about the Data Quality Services we offer click here!

Unleash the power of R to forecast data from your OLAP Cubes with Tableau


R is a programming language and software environment for statistical computing, the leading tool amongst statisticians, data miners and data scientists for performing sophisticated statistical analysis, data discovery and predictive analytics on any set of data, such as linear/non-linear modelling, classification, clustering, time-series analysis and so on.

R offers a wealth of functions and, as an open source project, its capabilities are constantly enhanced and extended through user-created packages; this is why it is so popular and just keeps growing.

In this article, we will see how R can be connected to Tableau to perform highly-customizable forecasting of your data from OLAP Cubes. Your users will benefit without needing an advanced coding background, and can visualize the results in the Tableau charts they like the most, with the help of just a few neat tricks.

 

1. Forecasting with Tableau

Tableau already comes with a forecasting feature; it uses a technique known as exponential smoothing which tries to find a regular pattern in the measures that can be continued into the future.

To access this feature, simply drag a date and a measure to your sheet and right-click on the canvas. In this example, we are using Tableau’s Superstore sample dataset.

Figure 1: Tableau’s Superstore sample dataset.

Select Forecast ➜ Show Forecast, and the estimated line will appear. As you can see in the picture below, it’s very simple: Tableau essentially fits a straight trend line, and also shows the confidence interval. Right-click on the canvas again to access the Forecast Options, and an options pane will pop up.

Figure 2: Forecast Options

In the options pane (1 in the figure below), you can set a few parameters, such as the forecast length and the aggregation type; you can also change or remove the confidence intervals (called prediction intervals) and adjust the forecast model: it can be automatic, automatic without seasonality, or custom. If you choose custom, you will be able to select the type of Trend and Season: None, Additive or Multiplicative.

Figure 3: Forecast Options, Summary and Models

By right-clicking again and selecting Describe Forecast, you will be shown a pane that gives you additional forecast information. You can see this in panels 2 and 3 of the figure above.

 

2. Why use R instead?

What we have described so far is basically everything that Tableau can offer when it comes to forecasting. It works pretty well, but it’s not really customizable and, above all, it doesn’t work with OLAP Cubes!

OLAP Cubes have many advantages, but unfortunately Tableau doesn’t allow us to do everything we can do with a “traditional” data source, and forecasting is one of these limitations (you can find a list of other OLAP-related limits here).

Luckily for us, Tableau is such a great tool that, despite not being able to perform forecasting on cubes, it offers us an even more powerful alternative: a smooth integration with R.

 

3. Integrating R in Tableau

There are 4 functions that allow Tableau to communicate with R: SCRIPT_REAL, SCRIPT_STR, SCRIPT_INT, SCRIPT_BOOL. They only differ in the return type, as suggested by their names.

They work very simply and take at least 2 parameters. The first one is an actual R script: pure R code that is passed to the R engine to be executed, and whose last evaluated expression is returned. From the second argument onwards, you specify the values to pass to R. In the script, you refer to them as .arg1, .arg2, and so on, depending on the order in which they are declared.

R will take these values, interpret them as vectors (R basically only works with vectors and their derivatives), and return a vector of data that Tableau can use just like any other calculated field.
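
As a minimal sketch (the [Sales] field is just a placeholder measure), a calculated field that sends an aggregated measure to R and gets back its mean would look like this:

SCRIPT_REAL("mean(.arg1, na.rm = TRUE)", SUM([Sales]))

Here .arg1 is the vector of SUM([Sales]) values for the partition, and the single number returned by R is repeated for every row.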

 

4. Connecting R to Tableau

In order to communicate, R and Tableau must be connected, and this is really easy to do:

1. Download R from here or R Studio from here, and install it

2. From the R console, download the Rserve package and run it:

install.packages("Rserve");
library(Rserve);
Rserve();

3. In Tableau, select Help ➜ Settings and Performance ➜ Manage External Service Connections

4. Enter a server name or an IP (in the same machine, use localhost or 127.0.0.1) and select port 6311

5. Test the connection by clicking on the “Test Connection” button, and press OK if you get a positive message

You can also automate the start of Rserve by creating an R script with the code mentioned in point 2 (without repeating the installation), saving it as a .R file, and calling it from a .bat file with "start Rscript myRserveScript.R". You’ll find this pretty useful, especially in a server environment, because you can schedule its execution when the machine starts via the Task Scheduler, or by copying a shortcut to %AppData%\Microsoft\Windows\Start Menu\Programs\Startup (in Windows).
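
For reference, a minimal sketch of such a script (the file name myRserveScript.R is just the example used above) only needs to load and start Rserve:

# myRserveScript.R - load and start the Rserve daemon so that Tableau can connect
library(Rserve);
Rserve();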

After Rserve has started and Tableau is successfully connected, you can start using the 4 Script_* functions to exploit R’s power.

 

5. Forecasting in R

To perform forecasting in R, you have to download a user-made package called “forecast”.

To do so, simply write install.packages("forecast") in your R console, and load it by writing library(forecast) any time you need it.

It offers various types of fully-customizable forecast methods, like ARIMA, naïve, drift, Exponential Smoothing (ETS), Seasonal-Trend decomposition by Loess (STL) and so on, seasonal or not, with a number of different parameters.

For more details about the package, check the author’s slides, the package help PDF, or each function’s help page by typing ?func_name in the console (e.g. ?stlf).
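
As a quick, self-contained sketch (using R's built-in AirPassengers monthly series instead of our cube data), a seasonal forecast can be produced in a couple of lines:

library(forecast)
fc <- stlf(AirPassengers, h = 12)   # forecast the next 12 monthly periods
print(fc)                           # point forecasts plus 80% and 95% prediction intervals
plot(fc)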

 

6. Use case

In our use case, we will take a series of monthly measures from the last 3 years and forecast them with different frequency/period combinations. We will use the function called stlf, which, without getting bogged down in details, gives us more freedom when playing with periods and frequencies whilst still providing a solid result.

In the picture below, you can see our data, from January 2014 to October 2016.

Figure 4: Data, from January 2014 to October 2016.

We dragged our date dimension (Year Number Month Number) to the column field, and put our desired measure in the row field. The trend is not regular, but we can see how the line has a few peaks and generally tends to grow.

 

7. Writing the R scripts

Let’s see how to write the R scripts that perform the forecast.

Firstly, we need to know what values we will be passing: we have our measure (for R to accept it, it must be aggregated, but coming from the Cube it already is, and if not, the ATTR function can help) and two integer parameters (created in Tableau) for the number of periods to forecast and for the frequency of the season.

Let’s use them in this order, calling them ForecastMeasure, ForecastPeriods and Freq. The calculated field should initially look something like this:

SCRIPT_REAL("
data <- .arg1;
nulls <- length(data[is.na(data)]);
data <- data[!is.na(data)];
freq <- .arg3[1];
periods <- .arg2[1];
",
[Forecast Measure],[ForecastPeriods],[Freq]
)

In the first argument, we have already set up the R variables that collect the values to pass to the script (this is not mandatory, but it keeps the code cleaner). Remember that R works with vectors: notice how the first one (data) takes the entire .arg1, because we actually want the whole array of measures, while the parameters (simple integers) need to be taken from the first value of a 1-sized array. We then filter this vector to remove any null values, and we save the number of nulls in a variable called nulls (we will need it later).

We now perform the forecasting. Add the following lines to the script:

library(forecast);
time <- ts(data,frequency=freq);
fcast <- stlf(time, h=periods);

The first line simply calls the forecast library; we then create a time series object, by passing the measure and the frequency parameter (for more customizable details, check the resources already mentioned). Finally, we produce our forecasting result, by feeding the stlf function with the time series object and the periods parameter.

Our forecast is ready. If we use the same data and execute the script directly in R, the console shows a table of forecasted values and interval bounds.

The interesting columns are Forecast, Lo80, Hi80, Lo95 and Hi95. The first one contains the actual forecasted values; there are 12 of them because we set periods = 12 (and freq = 12, too). The others are the bounds of the prediction intervals (the confidence levels can be changed when calling the function).
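
In R, these columns correspond to components of the forecast object, which we will also use later for the confidence intervals:

fcast$mean         # point forecasts (the "Forecast" column)
fcast$lower[,1]    # Lo80 - lower bound of the 80% interval
fcast$lower[,2]    # Lo95
fcast$upper[,1]    # Hi80
fcast$upper[,2]    # Hi95 - upper bound of the 95% interval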

Now let’s complete our script. There is one thing we need to remember: Tableau expects to receive the same number of measures that it sent, so we append our forecasted array to the original one, properly shifted, along with rep(NaN, nulls), an array of nulls (we know how many there are because we saved them). Add these lines to the script:

n <- length(data);
result <- append(data[(periods+1):n],fcast$mean);
result <- append(result,rep(NaN,nulls));
result

The result that we want is fcast$mean, which corresponds to the column called “Forecast” in the R output that we saw. We append it, along with any null values, to the segment of original data that starts at position periods+1, in order to get an array of the original size (note: by doing this, we won’t visualize the first periods elements).

That’s it! Your new calculated field should look like the one below. Drag it to the row field, and you get your forecast.

SCRIPT_REAL("
data <- .arg1;
nulls <- length(data[is.na(data)]);
data <- data[!is.na(data)];
freq <- .arg3[1];
periods <- .arg2[1];

library(forecast);
time <- ts(data,frequency=freq);
fcast <- stlf(time, h=periods);
n <- length(data);
result <- append(data[(periods+1):n],fcast$mean);
result <- append(result,rep(NaN,nulls));
result",
[Forecast Measure],[ForecastPeriods],[Freq]
)

Figure 5: Complete R script and result

 

8. Shifting time measures

Now we start getting the first issues. Tableau received a vector from R the same size as the one that we passed it, so each date’s measure value is replaced by the new one coming from the calculated field. The problem is that the date hasn’t changed!

Normally, this could be fixed by creating a new calculated date with the help of the DATEADD function, but as we can’t add a dimension in cubes, we have to overcome this issue by getting the dimension value as a measure with an MDX formula, and then applying DATEADD to it.

Create a new calculated member, and set it as:

[Cube_name].[Dimension_name].MemberValue

Then create a calculated field like this:

DATE(DATEADD('month',[ForecastPeriods],DATE([YearNumberMonthNumber])))

Where [YearNumberMonthNumber] is (for example) the name of your newly created calculated member. This will give you a measure with the date shifted according to the number of forecast periods that you selected.

Figure 6: Tableau's Error

Drag this new field on top of the original date column and… your chart will crash!

Why? You’ve just created a new shifted date for every original date value, so your data is now “split” into “columns” of 1 value, one for each shifted month, and R will receive them one by one instead of getting an entire vector to process. This explains the error that says “series is not periodic or has less than two periods”: what R is actually seeing is a series of independent 1-sized vectors.

To fix this, there’s a simple but neat trick: add your original date field to “Detail”, so that it is available in the sheet, and edit your calculated forecast field’s table calculation as shown below.

Figure 7: Table Calculation - Forecast (Months)

This will make Tableau process the calculated field at the correct level, and help R see the received data in the correct way, thus producing the desired result.

That’s it! You’ve just forecasted your OLAP cube!

Figure 8: OLAP cube forecasted

 

9. Adding confidence intervals

You can also add confidence intervals to improve your visualization: just create two more metrics, copying the one you created for the forecast, and change fcast$mean to fcast$lower[,1], fcast$lower[,2], fcast$upper[,1], or fcast$upper[,2], depending on the limit that you want to show (remember that the defaults are 80 and 95, but these are editable in the R function).
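
For instance, a sketch of the upper 95% bound field (an exact copy of the forecast field in which only the appended component changes) could look like this:

SCRIPT_REAL("
data <- .arg1;
nulls <- length(data[is.na(data)]);
data <- data[!is.na(data)];
freq <- .arg3[1];
periods <- .arg2[1];

library(forecast);
time <- ts(data,frequency=freq);
fcast <- stlf(time, h=periods);
n <- length(data);
result <- append(data[(periods+1):n],fcast$upper[,2]);
result <- append(result,rep(NaN,nulls));
result",
[Forecast Measure],[ForecastPeriods],[Freq]
)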

In the picture below you can see the forecast with the 95% bounds (fcast$upper[,2] and fcast$lower[,2]), and the trend line (calculated by Tableau).

Figure 9: Forecast final visualization

 

10. Conclusions

In this guide we have shown how R can help you build great forecast charts for your OLAP Cube data, but this isn’t the only benefit: we got a much higher degree of customization, and overcame the limitations of Tableau when it comes to defining parameters, methods or entire algorithms.

Once Tableau can connect to R, its potential is virtually unlimited, and it can enhance your charts in countless ways: from any source (not necessarily OLAP Cubes), and with any goal, just send the data to R and let it do the job, whatever it is! Forecasting, clustering, classification, or anything else with an R package written for it – you could even write one yourself! Then visualize the result in Tableau, and adjust the look and feel of your dashboard with the powerful features that you love.

Click here if you would like to receive more information about the topic!

Informatica Performance – Optimization techniques


 

Informatica provides the market’s leading data integration platform. ETL mappings are designed to load data into the data warehouse environment to achieve better reporting, which in turn helps you understand business trends better. The major problem faced by anyone working with Informatica ETL is designing mappings that don’t compromise performance: often we end up creating a mapping that achieves the required functionality but suffers in terms of performance.

The article explains how to identify performance bottlenecks and which advanced optimization techniques are available in Informatica to overcome them.

The following topics are discussed in detail:

* Optimizing Sources, Targets and Buffer Blocks:
Performance bottlenecks can occur when the Integration Service reads from a source database or writes to a target database. Depending on the source and target data, you may need to increase or decrease the buffer block size. In the article we list the steps to take to ensure that sources and targets are utilized to the optimum level, and how to calculate the buffer block size.

* Pushdown Optimization (PDO):
The Pushdown Optimization option enables data transformation processing to be pushed down into the relational database to make the best use of database processing power.

* Session Partitioning & Dynamic Session Partitioning:
The Session Partitioning option increases the performance of PowerCenter through parallel data processing. With dynamic partitioning, the Integration Service determines the number of partitions to create at run time, based on factors such as source database partitions or the number of nodes in a grid.

* Concurrent Workflow Execution:
A workflow configured for Concurrent Execution can run as multiple instances concurrently. In the article we illustrate this functionality by using a workflow that is configured with multiple instances to process data belonging to multiple sites simultaneously.

* Load Balancing:
Load balancing is a mechanism that distributes workloads across the nodes in a grid. In the article we see how to assign service levels to workflows to achieve load balancing.

Click to read the full article: Informatica performance - Optimization techniques!

 

Data Mining & Business Intelligence


 

The term data mining refers to one of the processes involved in the task of extracting knowledge from a database, also known as KDD (Knowledge Discovery in Databases); by extension, however, the term is often used for the whole KDD process because of its commercial appeal. Understanding data mining as a KDD sub-process, we could define it as the process of extracting underlying knowledge from a large volume of data.

It is a relatively recent development, directly linked to the scientific fields of mathematics (mainly statistics), computer science and artificial intelligence, and it can be supported by different Business Intelligence systems, which offer several advantages!

In this article we’re highlighting how a Business Intelligence system is a great starting point for the data mining process, and how data mining can be used for process optimization.

The topics covered are:

  1. Definition
  2. Objectives and challenges
  3. KDD Process Phases
  4. Applications:

➀ Data mining & Business Intelligence
➁ Extracting knowledge from unstructured data
➂ Optimization of processes

Click this link to read the full article on data mining: Data Mining & Business Intelligence

OBIEE Data Lineage Solution


Tracking OBIEE reporting data

BI implementations may use hundreds of data warehouse tables and thousands of reports, and one of the most common requirements in a BI project is the ability to track back the data used in reports. The ability to quickly identify the set of columns and tables used in a given report or dashboard, or to check which reports may be affected by a change to a physical column in the data warehouse, is also crucial for development. This calls for a data lineage solution which is accessible to users and developers, and allows checking such data on an ad hoc basis.

Various vendors offer data lineage solutions, but these can be expensive and vendor-specific. With our simple solution, we combine Catalog Manager and Administration Tool export sources to create an easily accessible solution for tracking data lineage in OBIEE.

By implementing the OBIEE data lineage solution, we can check the following:

  1. Which physical tables and columns are used in a given report or dashboard
  2. Which reports use given physical columns or tables; this is especially important when modifying an existing table, as any change in the table’s structure must take existing reports into consideration
  3. Which are the most commonly used columns in reports in a given subject area; identifying the most commonly used columns can suggest where to create indexes to improve the overall performance of the OBIEE implementation even further.

Solution

The ClearPeaks OBIEE Data Lineage Solution gathers all the required data lineage information in one place and uses it as a source for OBIEE reports within Data Lineage Subject Area. Two sources are combined to achieve this:

  1. List of OBIEE reports and presentation columns used

Catalog Manager provides an option to export the list of reports and columns used. The export can be done either with the Catalog Manager UI tool or through the command line utility. We use the latter option, as it allows automation of the whole process later.

  2. Mappings of presentation columns back to the physical columns

Such mappings can be obtained manually by creating repository documentation with the Administration Tool’s Repository Documentation utility. The output is a comma-separated values (.csv) file containing the column mappings from the presentation layer through the logical layer to the physical layer, including column formula expressions. Another way to obtain the column mappings is by extracting the OBIEE repository in .xml format through the command line utility. Our solution uses the second option, as the repository file will be used in our automated script.

Once we have obtained both files, we need to populate the data lineage tables.

The data can be transformed and manually inserted into the tables, but in our solution we use a script (which can run on the OBIEE server) that parses the data and inserts it into the tables.

Data Lineage Subject Area

Once we have populated the data lineage tables and their data model has been mapped in Administration Tool, we can create and run reports in OBIEE using Data Lineage Subject Area and filter the results according to our requirements.

Let us look at a few of the use cases for the Data Lineage Subject Area:

Use case 1. Which data warehouse tables and columns are used in a given report?

We would like to know which data warehouse tables and columns are used in a particular report or dashboard, so we create a report with a list of the columns used by a given OBIEE report and their data lineage.


Use case 2. Which reports use given physical tables?

We want to know how many and which reports or dashboards are using given physical tables or particular columns; this can be very useful when assessing the potential impact of column formula or table structure changes on reporting. Using the Data Lineage Subject Area, we can fetch the list of OBIEE reports that use a given physical table.


Use case 3. Which reports use given subject areas?

We need to know which reports and dashboards are accessing data from given subject areas. This may be particularly useful when revising users’ access permissions.


Future Improvements

OBIEE Data Lineage Subject Area can also serve as a backbone for further developments, providing complex and complete solutions for managing existing OBIEE project implementations. Here are some examples and potential benefits of merging additional information into the solution:

Usage Tracking data – allows analysis of the physical tables used in the most accessed reports, identification of the tables and columns in reports not used by anyone, and removal of unused reports or underlying tables.

Data warehouse database metadata – such as table sizes and column indexes; this allows performance analysis of the tables and columns most heavily used by reports.

ETL Data Lineage – an additional layer of data lineage tracking, allowing data to be traced back to the very source, can be achieved by adding ETL transformation data obtained from the ETL systems. For example, it is possible to track all the ETL transformations applied to a given presentation column down to the source system.

Catalog metadata – it is possible to extract additional information regarding catalog objects such as user permissions, report owners, last modified date etc., to further enhance the solution usability.

Adding all the above components creates a strong core for the OBIEE management dashboard, allowing continuous improvement through:

* tracking data from the report to the source database column
* constant analysis of the most used tables in order to improve performance
* checking which users have permissions for given OBIEE reports and dashboards

All of the above is accessible from the OBIEE front-end, providing a convenient and quick way to facilitate many daily business-as-usual tasks in BI deployments.

Conclusion

The ClearPeaks OBIEE Data Lineage Solution can be easily deployed in any project using Oracle Business Intelligence Enterprise Edition. The solution can be run from the command line tools, which makes it possible to create automated jobs to extract and update data on a regular basis.

If you think your OBIEE project could benefit from our data lineage solution, please contact us for more information via our web contact form or by leaving your comments below!

Blog Article Author: Rafal Ostaszewski
