Retrieving Active Directory Data into BI




Active Directory is used by most organizations to both manage users, groups and devices that are registered to the network and to maintain better network security solutions. This information can be extracted and made available to analyze authorization on different applications within your organization.

Below are few key details that can be retrieved from Active Directory:

UserID, Login Name
Employee’s First Name, Last Name
Employee Status
Job Title
User Mail ID
Employee Organization
Line Manager Details
Account Status
Member Groups


1. Motivation

In most business cases, HR systems will hold only information of users under the organization’s payroll, and not that of any contractors or external vendors working with the organization. This calls for extracting Active Directory data into the enterprise’s data warehouse and using it for further analysis in scenarios where we need to get the full list of employees (including direct employees, contractors, vendors etc.) along with their details, i.e, who will be accessing internal applications like service manager, BI services, general services applications, etc. If we can pull this information into OBIEE and automate the data load process, it enables us to generate audit reports and to analyse the access rights of users to various applications.

ADManager Plus data extraction and challenges: 

Manage Engine ADManager Plus is a web-based interface solution designed to meet Active Directory management requirements and report generation. It helps to perform audits for the defined security permissions for a specific AD object or for a specific user. The Report scheduler feature within enables the auto-generating of reports at specified times and delivers the report to multiple users via email in the desired format.

Although AD manager plus supports generating and scheduling reports, it can only be delivered to email accounts. Because of this limitation, automating the data extraction process from Active Directory and loading the data into a data warehouse is not possible.

2. Python solution for Automating Data retrieval and the loading process

Data Extraction from LDAP could be achieved using the LDAP library interface module for Python. Here we use ldap3, which runs off of pure, vanilla Python. ldap3 is supported in Python versions 2.6 and above.

The Ldap3 module needs to be installed in order to establish a connection to the server and retrieve data. It can be installed either using a pip package or by downloading the latest version from LDAP3 source location and install it using command - Python manually.  You should refer to Ldap3 library documentation for further references.

Defining server details and establishing connection to the LDAP server

In order to establish a connection, the first server object needs to be defined. The Server object specifies the DSA (Directory Server Agent) LDAP server that will be used by the connection. The user then needs to define a host variable, which will be host name/IP/complete URL with hostname, host port of LDAP server. This is required to create the server object.

# import class and constants 
from ldap3 import Server, Connection, ALL

#define the server
hostname = 'servename'  # hostname for LDAP server 
server = Server (hostname, get_info=ALL)

The connection object will send operation requests to the LDAP server. It takes different parameters like server, username and password for performing operations in the server. The connection object also requires authentication type and read only parameters that define the type of operation to be performed on the server.

# define the conncetion 
uername = ''  #user account to access server 
password = '*********'  # password for authentication
connection = Connection (server, user=username, password=password, auto_bind=True, collect_usage=True)

Once the connection object is defined, the bind() method is executed to open the connection with the server. The bind operation allows credentials to be exchanged between the client - server and establishes a new authorization state. Connection once established enables to perform all the standard LDAP operations.

Accessing AD information and generating files

The Search method in the connection object enables search operations on the LDAP database. It takes the following parameters:

Search Base: takes the base of the search request
Search Filter: takes the filter to be applied on the search request
Search Scope: defines how broad the search context is. BASE, LEVEL, SUBTREE are the values
Attributes: a single attribute or a list of attributes which can be returned by the search
Get Operational_Attributes: if True, returns information attributes

Search method takes many other parameters like time limit, size limit etc. You may refer to the Ldap3 documentation for further reference. By default, Active Directory sets a hard limit of 1000 entries returned for any search, mainly due to security constraints. It is better to go for a paged search method when retrieving AD, as the entries to be retrieved will be more than 1000 records in many cases.

Instead of a simple search operation, here we are doing a paged search to retrieve the full list of entries from the AD server.

# define parameters and calling search method
base = 'dc=clearpeaks, dc=corp, dc=ae'  # defining case for search operation 
filter_value = 'All BI Users Group'  # defining filters for search operation 
attrs = ['SAMAccountName','Title','dislpayname','department']  #defining required attributes from AD

data = c.extend.standard.paged_search(search_base) = base_dn, search_filter = ' (memberOf = '+allBIusers+')',
search_scope = SUBTREE,
attributes = attrs, 
get_operational_attributes = True
paged_size = 15, generation = True)

Responses received from the Search operation will be in a list format and stored in the Response attribute of the connection object.  The response object can be iterated over to retrieve the results.  A few data cleansing steps are required to carry it out, as Active Directory data text will be in UTF-8 format, necessitating its decoding into ASCII format. Once retrieved, the response data is updated to a CSV file, which can be saved at a shared location in the server.

Figure 4

Figure 1: Sample of the response data updated to a CSV file

Files generated through Python scripts can be accessed by ETL Mapping, which is configured to extract data from excel as source and loads it into warehouse tables, where your data is ready to be visualized in any BI tool. Below is a sample report using LDAP data.


Figure 2: Sample of Active Directory Data displayed in BI Report

Unix Crontab command can be used to schedule the script, based on the frequency required to refresh data from the Active Directory. ETLs are further scheduled respectively to load new sets of data from the generated source file.


In this blog post, we describe a solution to automate the Active Directory Data retrieval and loading process into BI environments using Python. Active Directory information can be utilized for analyzing Security Privileges granted within an organization, which in turn helps to audit user access to various applications.

Although we explained only the scope of reading Active Directory data, it is not the only benefit of this method. Once an authenticated connection is established with the server, all standard LDAP operations can be performed using the LDAP3 Python library.

Contact us if you want to have more information about how pull HR information into OBIEE.


Data Quality with Informatica – Part 1: Data Profiling


Data Quality – Part 1: Data Profiling using INFA

Welcome to the first article in the Informatica Data Quality series, where we are going to run through the basics of Informatica Analyst and the main features of Informatica Developer for data profiling.

Informatica is one of the most important data integration vendors in the market; they are behind PowerCenter, a very well-known ETL that can be integrated with other Informatica tools, such as Informatica Analyst,  a web application used by data analysts to analyse data and create data profiles, among other tasks. In the sections below we are going to go through the necessary steps to create a data profile, a business rule for column profiling and finally a scorecard to view the results.


1. Create a Data Profile

To start profiling our data, first open the browser, log into the Analyst tool (the default URL is http://infaServerName:8085/AnalystTool) and create a new project, which we’ll call Data_Profiling_Example :

Data Quality Series - Profiling with Informatica

Figure 1: Creating a project in Informatica Analyst

Now we add a data source; in this example we are going to load a file with information from AdWords. For demonstration purposes, several errors have been introduced into the file, like different date formats. To add a file, click on the actions menu on the right-hand side of the window and click add flat file:

Data Quality Series - Profiling with Informatica

Figure 2: Adding data from a file in Informatica Analyst

Importing data from files is straightforward if we follow the wizard. In this example, we are going to set comma separated values, header present, data starting in line 2, and all the columns will be strings. The tool will automatically detect the length of the fields.

Data Quality Series - Profiling with Informatica

Figure 3: Add flat file wizard in Informatica Analyst

Now we need to create a new profile for our data, and we can do this by clicking on new profile on the menu on the right. In the wizard, we select our data file and accept all the default values.
Once the profile has been created we can review the values of the data, the percentage of nulls, and term frequencies in the results panel, as well as being able to analyse the patterns of the data values for every column. We can also view a summary of basic statistics, such as the max value, the min value and the top and bottom values for each column.

Data Quality Series - Profiling with Informatica

Figure 4: Profiling results in Informatica Analyst

In our example we can see several issues in the data of the file. For example, in the image below we can see  that the year is incorrect for some records (we are assuming that the file should contain just the numeric value for the year). In this example, the file should only contain data for 2nd January 2015, so it looks like the file has some invalid records, as there are some records with a different year, and others with a wrong value. This could be due to a bad extraction from the source system, or a wrong delimiter in some rows. In order to measure the quality of the file, we are now going to create some business rules, add them to the data profile, and finally create a visualization.

The data analysts from our organization have given us the following business rules:

the year must be 2015 for this file
the day column must always be 1/2/2015
the file should only contain Enabled campaigns

We will create two business rules to validate the year and the day columns, and for the Enabled campaigns we will set up the value Enabled in the campaign_status column as valid.

We can create the business rules in two ways: by using the expression builder in the Analyst tool, or by creating a mapping using the Informatica Developer. To create the business rule directly in the profile we simply click on edit, then on the column profiling rules, and the on the plus sign to add a rule.

Data Quality Series - Profiling with Informatica

Figure 5: Creating rules in Informatica Analyst

Then we select new rule for the year column and enter the expression you can see in the following image. We can save the rule as reusable; this way we will be able to apply exactly the same rule for a different column in the file if necessary.

Data Quality Series - Profiling with Informatica

Figure 6: New rule wizard in Informatica Analyst

We will implement the second rule in the Developer tool. To do this, we open Informatica Developer and connect to our project, then create a mapplet with an input transformation, an expression and an output transformation, and save it as DayValidator. To validate the rule, we can right-click on the rule and select validate.

Data Quality Series - Profiling with Informatica

Figure 7: Creating a rule in Informatica Developer

We will define the expression with three possible output values: not a date, Valid date and Invalid date.

Data Quality Series - Profiling with Informatica

Figure 8: Defining rules in Informatica Developer

Once the rule has been created, we can go back to Informatica Analyst, edit the profile and now, instead of creating a new rule, we are going to apply the DayValidator rule we just created in Developer to the day column in the profile. We will call the output of the rule IsValidDay:

Data Quality Series - Profiling with Informatica

Figure 9: New rule wizard in Informatica Analyst

Now we are ready to run the profile again and review the outcome of the two different rules:

Data Quality Series - Profiling with Informatica

Figure 10: Data profiling project in Informatica Analyst

Reviewing the results, we can see that the data contains wrong values for Day and Year:

Data Quality Series - Profiling with Informatica

Figure 11: Reviewing profiling results in Informatica Analyst


2. Create a Scorecard for the Profile

Now that we have executed and checked the profile, we can create a scorecard to measure the quality of the file as the last step in this data quality assessment. In order to do this, we have to go to the profile and add it to a new scorecard. We can define the valid values for each column in our data. In this example, we are going to create the scorecard with three metrics called scores (both outputs from the rules and the campaign status) and then select the valid values for each different score.

The scorecard allows us to drill down from the score to the data. We select the key of the file (first three columns), the campaign status, and the output from both rules as drilldown columns; this way we can easily export the invalid rows to a file and send the results to the owner of the data so they can fix the wrong values and re-run the proper processes to update the data.

Data Quality Series - Profiling with Informatica

Figure 12: Data profiling scorecard in Informatica Analyst

This concludes the first article in this series about Data Quality with Informatica.
In the next couple of blog posts we’re going to explain how to standardize and deduplicate data. Stay tuned!
If you would like to know more about the Data Quality Services we offer click here!

Unleash the power of R to forecast data from your OLAP Cubes with Tableau


Unleash the power of R to forecast data from your OLAP Cubes with Tableau

R is a programming language and software environment for statistical computing, the leading tool amongst statisticians, data miners and data scientists for performing sophisticated statistical analysis, data discovery and predictive analytics on any set of data, such as linear/non-linear modelling, classification, clustering, time-series analysis and so on.

R offers a wealth of functions and, as an open source project, its capabilities are constantly enhanced and extended through user-created packages; this is why it is so popular and just keeps growing.

In this article, we will see how R can be connected to Tableau to perform highly-customizable forecasting of your data from OLAP Cubes. Your users will benefit without the need for an advanced coding background, and visualize the results in the Tableau charts that they like the most, with the help of just a few neat tricks.


1. Forecasting with Tableau

Tableau already comes with a forecasting feature; it uses a technique known as exponential smoothing which tries to find a regular pattern in the measures that can be continued into the future.

To access this feature, simply drag a date and a measure to your sheet and right-click on the canvas. In this example, we are using Tableau’s Superstore sample dataset.

Figure 1: Tableau’s Superstore sample dataset.

Figure 1: Tableau’s Superstore sample dataset.

Select Forecast ➜ Show Forecast, and the estimated line will appear. As you can see in the picture below, it’s very simple. What Tableau did was basically a linear regression to a straight line; it also shows the confidence interval. Right-click on the canvas again to access the Forecast Options, and an options pane will pop up.

Figure 2: Forecast Options

Figure 2: Forecast Options

In the options pane (below, 1), you can set a few parameters, such as forecast length, aggregation type, and you can change or remove the confidence intervals (called prediction intervals) and adjust the forecast model: it can be automatic, automatic without seasonality, or custom. If you choose custom, you will be able to select the type of Trend and Season: None, Additive or Multiplicative.

Figure 3: Forecast Options, Summary and Models

Figure 3: Forecast Options, Summary and Models

By right-clicking again and selecting Describe Forecast, you will be shown a pane that gives you additional forecast information. You can see this in 2 and 3.


2. Why use R instead?

What we have described so far is basically everything that Tableau can offer when it comes to forecasting. It works pretty well, but it’s not really customizable and, above all, it doesn’t work with OLAP Cubes!

OLAP Cubes have many advantages, but unfortunately, Tableau doesn’t allow us to do all the things we can do with a “traditional” data source, and forecasting is one of these (you can find a list of other OLAP-related limits here.

Luckily for us, Tableau is such a great tool that, despite not being able to perform forecasting on cubes, it offers us an even more powerful alternative: a smooth integration with R.


3. Integrating R in Tableau

There are 4 functions that allow Tableau to communicate with R: SCRIPT_REAL, SCRIPT_STR, SCRIPT_INT, SCRIPT_BOOL. They only differ in the return type, as suggested by their names.

They work very simply, and take at least 2 parameters. The first one is an actual R script; it is pure R code that is passed on to the R engine to be executed, and that will return the last element that is called. From the second argument, the values that are passed on are specified. In the script, you will refer to them as .arg1, .arg2, and so on, depending on the order in which they are declared.

R will take these values, interpret them as vectors (R basically only works with vectors and their derivatives), and return a vector of data that Tableau can use just like any other calculated field.


4. Connecting R to Tableau

In order to communicate, R and Tableau must be connected, and this is really easy to do:

1. Download R from here or R Studio from here, and install it

2. From the R console, download the Rserve package and run it:


3. In Tableau, select Help ➜ Setting and Performance ➜ Manage External Service Connections

4. Enter a server name or an IP (in the same machine, use localhost or and select port 6311

5. Test the connection by clicking on the “Test Connection” button, and press OK if you get a positive message

You can also automatize the start of Rserve by creating an Rscript with the code mentioned in point 2 (without repeating the installation), save it as a .R file, and call it from a .bat file with “start Rscript myRserveScript.R”. You’ll find it pretty useful, especially in a server environment, because you can schedule its execution when the machine starts via the Task Scheduler, or by copying a shortcut to %AppData%\Microsoft\Windows\Start Menu\Programs\Startup (in Windows).

After Rserve has started and Tableau is successfully connected, you can start using the 4 Script_* functions to exploit R’s power.


5. Forecasting in R

To perform forecasting in R, you have to download a user-made package called “forecast”.

To do so, simply write install.packages(“forecast”) in your R console, and call it by writing library(forecast) any time you need it.

It offers various types of fully-customizable forecast methods, like Arima, naïve, drift, Exponential Smoothing (ETS), Seasonal-Trend decomposition by Loess (STL) and so on, seasonable or not, with a number of different parameters.

For more details about the package, check the author’s slides, the help package pdf , or each function’s help page by writing ?func_name in the console (ex: ?stlf).


6. Use case

In our use case, we will take a series of monthly measures from the last 3 years and forecast them with different frequency/period combinations. We will use the function called stlf, which, without getting bogged down in details, gives us more freedom when playing with periods and frequencies whilst still providing a solid result.

In the picture below, you can see our data, from January 2014 to October 2016.

Figure 4: Data, from January 2014 to October 2016.

Figure 4: Data, from January 2014 to October 2016.

We dragged our date dimension (Year Number Month Number) to the column field, and put our desired measure in the row field. The trend is not regular, but we can see how the line has a few peaks and generally tends to grow.


7. Writing the R scripts

Let’s see how to write the R scripts that perform the forecast.

Firstly, we need to know what values we will be passing: we have our measure (for R to accept it, it must be aggregated, but coming from the Cube it already is, and if not, the ATTR function can be called on to help) and the two integer parameters (create them in Tableau) for the number of periods to be forecasted and for the frequency of the season.

Let’s use them in this order, calling them ForecastMeasure, ForecastPeriods and Freq. The calculated field should initially look something like this:

data <- .arg1;
nulls <- length(data[]);
data <- data[!];
freq <- .arg3[1];
periods <- .arg2[1];
[Forecast Measure],[ForecastPeriods],[Freq]

In the first argument, we have already set up the R variables that collect the values to pass to the script (this is not mandatory, but just to keep the code cleaner). Remember that R works with vectors: notice how the first one (data) takes the entire .arg1, because we actually want the whole array of measures, while the parameters (simple integers) need to taken from the first value of a 1-sized array. We then filter this vector to remove eventual null values, and we save the number of those in a variable called nulls (we will need it later).

We now perform the forecasting. Add the following lines to the script:

time <- ts(data,frequency=freq);
fcast <- stlf(time, h=periods);

The first line simply calls the forecast library; we then create a time series object, by passing the measure and the frequency parameter (for more customizable details, check the resources already mentioned). Finally, we produce our forecasting result, by feeding the stlf function with the time series object and the periods parameter.

Out forecast is ready. If we use the same data and we execute the script in R, the console will look like this:

The interesting columns are Forecast, Lo80, Hi80, Lo95, Hi95. The first one contains the actual forecasted value. There are 12 because we used parameter = 12 (freq = 12, too). The others are the confidence interval (the limits can be changed when calling the function).

Now let’s complete our script. There is one thing we need to remember: Tableau expects to receive the same number of measures that it sent, so we append our forecasted array to the original one, properly shifted, along with rep(NaN, nulls), an array of nulls (we know how many there are because we saved them). Add these lines to the script:

n <- length(data);
result <- append(data[(periods+1):n],fcast$mean);
result <- append(result,rep(NaN,nulls));

The result that we want is fcast$mean, which corresponds to the column called “Forecast” in the R matrix that we saw. We append it along with eventual null values to the segment of original data that starts from the position number period+1, in order to get an array of the original size (note: by doing this, we won’t visualize the first period elements).

That’s it! Your new calculated field should look like the one below. Drag it to the row field, and you get your forecast.

data <- .arg1;
nulls <- length(data[]);
data <- data[!];
freq <- .arg3[1];
periods <- .arg2[1];

time <- ts(data,frequency=freq);
fcast <- stlf(time, h=periods);
n <- length(data);
result <- append(data[(periods+1):n],fcast$mean);
result <- append(result,rep(NaN,nulls));
[Forecast Measure],[ForecastPeriods],[Freq]

Figure 5: Complete R script and result

Figure 5: Complete R script and result


8. Shifting time measures

Now, we start getting the first issues. Tableau received a vector from R the same size as the one that we passed it, so each date-measure value is substitued for the new one, coming from the calculated field. The problem is that the date hasn’t changed!

Normally, this could be fixed by creating a new calculated date with the help of the DATEADD function, but as we can’t add a dimension in cubes, we have to overcome this issue by getting dimension value as a measure with an MDX formula, and then applying DATEADD to it.

Create a new calculated member, and set it as:


Then create a calculate field like this:


Where [YearNumberMonthNumber] is (for example) the name of your newly calculated member. This will get you a measure with the date shifted accordingly to the forecasted period that you selected.

Figure 6: Tableau's Error

Figure 6: Tableau's Error

Drag this new field on top of the original date column and… your chart will crash!

Why? You’ve just created a new shifted date for every original date value, and basically your data is now “split” into “columns” of 1 value, 1 for each shifted month, and R will receive them 1 by 1 instead of getting an entire vector to process. This explains the error that says “series is not periodic or has less than two periods”, because what R is actually seeing is a series of indipendent 1-sized vectors.

To fix this, there’s a simple but neat trick: add your original data field to “Detail”, so that it is available in the sheet, and edit your calculated forecast field like this.

Figure 7: Table Calculation - Forecast (Months)Figure 7: Table Calculation - Forecast (Months)

This will make Tableau process the calculated field at the correct level, and help R see the received data in the correct way, thus producing the desired result.

That’s it! You’ve just forecasted your OLAP cube!

Figure 8: OLAP cube forecasted

Figure 8: OLAP cube forecasted


9. Adding confidence intervals

You can also add confidence intervals to improve your visualization: just create two more metrics, copying the one you created for the forecast, and change fcast$mean to fcast$lower[,1], fcast$lower[,2], fcast$upper[,1], or fcast$upper[,2], depending on the limit that you want to show (remember that the defaults are 80 and 95, but these are editable in the R function).

In the picture below you can see the forecast with the 95% bounds (fcast$upper[,2] and fcast$lower[,2]) , and the trend line (calculated by Tableau).

Figure 9: Forecast final visualization

Figure 9: Forecast final visualization


10. Conclusions

In this guide we have shown how R can help you build great forecast charts for your OLAP Cube data, but this isn’t the only benefit: we got a much higher degree of customization, and overcame the limitations of Tableau when it comes to defining parameters, methods or entire algorithms.

Once Tableau can connect to R, its potential is virtually unlimited, and it can enhance your charts in an infinity of ways: from any source (not necessarily OLAP Cubes), or with any goal, just send the data to R and let it do the job, whatever it is! Forecasting, clustering, classification, or anything else with an R package written for it – you could even write one yourself! Then visualize the result in Tableau, and adjust the look and feel of your dashboard with the powerful features that you love.

Click here if you would like to receive more information about the topic!

Informatica Performance – Optimization techniques


Informatica performance - Optimization techniques


Informatica provides the market´s leading data integration platform. ETL Mappings are designed for data loading into the data warehouse environment to achieve better reporting which in turn helps you to understand business trends better. The major problem faced by anyone working with Informatica ETL is to design a mapping(s) that doesn’t compromise its performance. Often we end up creating a mapping that achieves only the functionality but suffers in terms of performance.

The article explains the steps to identify the performance bottlenecks and the advanced optimization techniques that are available in Informatica to overcome them.

Following topics are discussed in detail:

* Optimizing sources, Targets and Buffer Blocks:
Performance bottlenecks can occur when the Integration Service reads from a source database or writes to a target database. Depending on the source, target data, you may need to increase or decrease the buffer block size. In this article we list the steps you have to take to ensure that the sources and the Targets are fully utilized to optimum level and how to calculate Buffer block size.

* Push down Optimization (PDO):
Push Down Optimization Option enables data transformation processing to be pushed down into any relational database to make the best use of database processing power.

* Session Partitioning & Dynamic Session Partitioning:
The Session Partitioning Option increases the performance of Power Center through parallel data processing. In dynamic partitioning, Integration Service determines the number of partitions to create at run time based on factors such as source database partitions or the number of nodes in a grid.

* Concurrent Workflow Execution:
A workflow configured for Concurrent Execution can run as multiple instances concurrently. In the article we illustrate this functionality by using a workflow that is configured with multiple instances to process data belonging to multiple sites simultaneously.

* Load Balancing:
Load balancing is a mechanism which distributes the workloads across the nodes in the gird. In the article we see how to assigning Service Levels to Workflows to achieve Load Balancing.

Click to read the full article: Informatica performance - Optimization techniques!


Data Mining & Business Intelligence


Data mining & Business Intelligence


The term data mining refers to one of the processes involved in the task of extracting knowledge from a database, also known as KDD (Knowledge Discovery in Data bases). However, by extension, it is referred to as the KDD global process because of its commercial appeal. Understanding data mining as a KDD sub-process, we could define the term as the process of extracting underlying knowledge from a large volume of data.

It is a recent development directly linked to the scientific fields of mathematics (mainly statistics), computer science and artificial intelligence. It can be supported by different Business Intelligence systems, from which we can obtain several advantages!

In this article we´re highlighting how a Business Intelligence system is a great starting point for the data mining process and how data mining can be used for process optimization.

The topics covered are:

  1. Definition
  2. Objectives and challenges
  3. KDD Process Phases
  4. Applications:

➀ Data mining & Business Intelligence
➁ Extracting knowledge from unstructured data
➂ Optimization of processes

Click this link to read the full article on data mining: Data Mining & Business Intelligence

privacy policy - Copyright © 2000-2010 ClearPeaks