Data Quality with Informatica – Part 1: Data Profiling


Welcome to the first article in the Informatica Data Quality series, where we are going to run through the basics of Informatica Analyst and the main features of Informatica Developer for data profiling.

Informatica is one of the leading data integration vendors in the market; they are behind PowerCenter, a very well-known ETL tool that can be integrated with other Informatica tools, such as Informatica Analyst, a web application used by data analysts to analyse data and create data profiles, among other tasks. In the sections below we are going to go through the steps needed to create a data profile, a business rule for column profiling and, finally, a scorecard to view the results.

 

1. Create a Data Profile

To start profiling our data, first open the browser, log into the Analyst tool (the default URL is http://infaServerName:8085/AnalystTool) and create a new project, which we’ll call Data_Profiling_Example:


Figure 1: Creating a project in Informatica Analyst

Now we add a data source; in this example we are going to load a file with information from AdWords. For demonstration purposes, several errors have been introduced into the file, such as different date formats. To add a file, click on the actions menu on the right-hand side of the window and click on Add Flat File:


Figure 2: Adding data from a file in Informatica Analyst

Importing data from files is straightforward if we follow the wizard. In this example, we are going to set comma-separated values, with a header present, data starting in line 2, and all columns as strings. The tool will automatically detect the length of the fields.


Figure 3: Add flat file wizard in Informatica Analyst

Now we need to create a new profile for our data, which we can do by clicking on New Profile in the menu on the right. In the wizard, we select our data file and accept all the default values.
Once the profile has been created we can review the data values, the percentage of nulls and the term frequencies in the results panel, as well as analyse the patterns of the data values for every column. We can also view a summary of basic statistics, such as the maximum value, the minimum value and the top and bottom values for each column.


Figure 4: Profiling results in Informatica Analyst

In our example we can see several issues in the data. In the image below, for instance, we can see that the year is incorrect for some records (we are assuming that the file should contain just the numeric value for the year). The file should only contain data for 2nd January 2015, so it looks like it has some invalid records: some have a different year, and others a wrong value altogether. This could be due to a bad extraction from the source system, or a wrong delimiter in some rows. In order to measure the quality of the file, we are now going to create some business rules, add them to the data profile and, finally, create a visualization.

The data analysts from our organization have given us the following business rules:

the year must be 2015 for this file
the day column must always be 1/2/2015
the file should only contain Enabled campaigns

We will create two business rules to validate the year and the day columns; for the Enabled campaigns we will simply mark the value Enabled in the campaign_status column as valid.

We can create the business rules in two ways: by using the expression builder in the Analyst tool, or by creating a mapplet in Informatica Developer. To create a business rule directly in the profile, we simply click on edit, then on the column profiling rules, and then on the plus sign to add a rule.


Figure 5: Creating rules in Informatica Analyst

Then we select New Rule for the year column and enter the expression you can see in the following image (a sketch of a similar expression is given below the figure). We can save the rule as reusable; this way we will be able to apply exactly the same rule to a different column in the file if necessary.


Figure 6: New rule wizard in Informatica Analyst
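The screenshot above contains the actual expression; as a rough, purely illustrative sketch in Informatica's expression language (the column name Year and the output labels are assumptions, and the column was imported as a string), the rule could look something like this:

IIF(Year = '2015', 'Valid', 'Invalid')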

We will implement the second rule in the Developer tool. To do this, we open Informatica Developer and connect to our project, then create a mapplet with an Input transformation, an Expression transformation and an Output transformation, and save it as DayValidator. To validate the rule, we can right-click on it and select Validate.


Figure 7: Creating a rule in Informatica Developer

We will define the expression with three possible output values: not a date, Valid date and Invalid date.


Figure 8: Defining rules in Informatica Developer
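As a rough illustration of the logic shown in the figure, and assuming the day column arrives as a string in M/D/YYYY format (the column name and the date format are assumptions), the expression could look something like this:

IIF(NOT IS_DATE(day, 'MM/DD/YYYY'), 'not a date',
    IIF(TO_DATE(day, 'MM/DD/YYYY') = TO_DATE('01/02/2015', 'MM/DD/YYYY'), 'Valid date', 'Invalid date'))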

Once the rule has been created, we can go back to Informatica Analyst, edit the profile and now, instead of creating a new rule, we are going to apply the DayValidator rule we just created in Developer to the day column in the profile. We will call the output of the rule IsValidDay:


Figure 9: New rule wizard in Informatica Analyst

Now we are ready to run the profile again and review the outcome of the two different rules:


Figure 10: Data profiling project in Informatica Analyst

Reviewing the results, we can see that the data contains wrong values for Day and Year:


Figure 11: Reviewing profiling results in Informatica Analyst

 

2. Create a Scorecard for the Profile

Now that we have executed and checked the profile, we can create a scorecard to measure the quality of the file as the last step in this data quality assessment. In order to do this, we have to go to the profile and add it to a new scorecard, where we can define the valid values for each column in our data. In this example, we are going to create the scorecard with three metrics, called scores (the outputs of the two rules, plus the campaign status), and then select the valid values for each different score.

The scorecard allows us to drill down from the score to the data. We select the key of the file (first three columns), the campaign status, and the output from both rules as drilldown columns; this way we can easily export the invalid rows to a file and send the results to the owner of the data so they can fix the wrong values and re-run the proper processes to update the data.


Figure 12: Data profiling scorecard in Informatica Analyst

This concludes the first article in this series about Data Quality with Informatica.
In the next couple of blog posts we’re going to explain how to standardize and deduplicate data. Stay tuned!
If you would like to know more about the Data Quality Services we offer click here!

Extracting data from Taleo


Over the past few years we have seen companies focusing more and more on Human Resources / Human Capital activities. This is no surprise, considering that nowadays a large number of businesses depend more on people skills and creativity than on machinery or capital, so hiring the right people has become a critical process. As a consequence, more and more emphasis is put on having the right software to support HR/HC activities, and this in turn leads to the necessity of building a BI solution on top of those systems for a correct evaluation of processes and resources. One of the most commonly used HR systems is Taleo, an Oracle product that resides in the cloud, so there is no direct access to its underlying data. Nevertheless, most BI systems are still on-premise, so if we want to use Taleo data, we need to extract it from the cloud first.

 

1. Taleo data extraction methods

As mentioned before, there is no direct access to Taleo data; nevertheless, there are several ways to extract it, and once extracted we will be able to use it in the BI solution:

Querying Taleo API
Using Cloud Connector
Using Taleo Connect Client

The API is very robust, but it is the most complex of these methods, since it requires a separate application to be written. Usually, depending on the configuration of the BI system, either Oracle Cloud Connector, Taleo Connect Client or a combination of both is used.


Figure 1: Cloud connector in ODI objects tree

Oracle Cloud Connector is a component of OBI Apps, and essentially it’s Java code that replicates Taleo entities / tables. It’s also easy to use: just by creating any Load Plan in BIACM with Taleo as the source system, a series of calls to Cloud Connector are generated that effectively replicate the Taleo tables to a local schema. Although it works well, it has two significant disadvantages:

It’s only available as a component of BI Apps
It doesn’t extract Taleo UDFs

So even if we have BI Apps installed and we use Cloud Connector, there will be some columns (UDFs) that will not get extracted. This is why the use of Taleo Connect Client is often a must.

 

2. Taleo Connect Client

Taleo Connect Client is a tool used to export or import data from / to Taleo. In this article we’re going to focus on extraction. It can extract any field, including UDFs, so it can be used in combination with the BI Apps Cloud Connector or, if that’s not available, as the sole extraction tool. There are versions for both Windows and Linux operating systems. Let’s look at the Windows version first.

Part 1 - Installation & running: Taleo Connect Client can be downloaded from the Oracle e-delivery website; just type Taleo Connect Client into the search box and you will see it on the list. Choose the necessary version, select the Application Installer and the Application Data Model (required!), remembering that it must match the version of the Taleo application you will be working with; then just download and install. Important – the Data Model must be installed before the application is installed.


Figure 2: Downloading TCC for Windows

After TCC is installed, we run it, providing the necessary credentials in the initial screen:


Figure 3: Taleo Connect Client welcome screen

And then, after clicking on ‘ping’, we connect to Taleo. The window that we see is initially empty, but we can create or execute new extracts from it. Before going on to this step, though, let’s find out how to see the UDFs: in the right panel, go to the ‘Product integration pack’ tab, selecting the correct product and model. Then, in the Entities tab, we can see a list of entities / tables, and in fields / relations we can see columns and the relations with other entities / tables (through foreign keys).

After the first run, you will probably have some UDFs that are not on the list of available fields / relations. Why is this? Because what we initially see in the field list are only the Taleo out-of-the-box fields, installed with the Data Model installer. But don’t worry, this can easily be fixed: use the ‘Synchronize custom fields’ icon (highlighted on the screenshot). After clicking on it you will be taken to a log-on screen where you’ll have to provide your credentials again, and after clicking on the ‘Synchronize’ button, the UDFs will be retrieved.


Figure 4: Synchronizing out-of-the-box model with User Defined Fields


Figure 5: Synchronized list of fields, including some UDFs (marked with 'person' icon)

Part 2 - Preparing the extract: Once we have all the required fields available, preparing the extract is pretty straightforward. Go to File->New->New Export Wizard, choose the desired Entity, and click on Finish. Now, in the General window, set Export Mode to ‘CSV-Entity’, and in the Projections tab select the columns that you want to extract by dragging and dropping them from the Entity->Structure window on the right. You can also add filters or sort the result set. Finally, save the export file.

The other component needed to actually extract the data is the so-called configuration. To create it, we select File->New->New Configuration Wizard, then point to the export file created in the previous step and, in the subsequent step, to our endpoint (the Taleo instance that we will extract the data from). The following screen offers more extract parameters, like request format and encoding, extract file name format and encoding, and much more. In most cases the default values will let us extract the data successfully, so unless a change is clearly required, there is no need to touch anything.

Now the configuration file can be saved and the extraction process started, just by clicking on the ‘Execute the configuration’ button (on the toolbar just below the main menu). If the extraction is successful, all the indicators in the Monitoring window on the right will turn green, as in the screenshot below.


Figure 6: TCC successful extraction

By using a bat file created during the installation, you can schedule TCC jobs with the Windows Scheduler, but it’s much more common to have your OBI / BI Apps (or almost any other DBMS that your organization uses as a data warehouse) installed on a Linux / Unix server. This is why we’re going to have a look at how to install and set up TCC in a Linux environment.
 
Part 3 - TCC in a Linux / Unix environment: TCC setup in a Linux / Unix environment is a bit more complex. To simplify it, we will reuse some of the components already created when we worked with the Windows TCC; and although the frontend of the application is totally different (to be precise, there is no frontend at all in the Linux version, as it’s strictly command-line), the way the data is extracted from Taleo is exactly the same (using extracts designed as XML files and the Taleo APIs).

After downloading the application installer and data model from edelivery.oracle.com, we install both components. Installation is really just extracting the files, first from zip to tgz, and then from tgz to the uncompressed content. This time, unlike in Windows, we recommend installing (extracting) the application first, and then extracting the data model files into an application subfolder named ‘featurepacks’ (this must be created, as it doesn’t exist by default). It’s also necessary to create a subfolder ‘system’ in the application directory. Once this is done, you can move some components of your Windows TCC instance to the Linux one (of course, if you have no Windows machine available, you can create any of these components manually):

Copy the file default.configuration_brd.xml from the Windows TCC/system folder to the Linux TCC/system folder
Copy extract XML and configuration XML files, from wherever you had them created, to the main Linux TCC directory

There are also some changes that need to be made in the TaleoConnectClient.sh file:

Set the JAVA_HOME variable at the top of the file (just below the shebang line), pointing it to the path of your Java SDK installation (for some reason, in our installation the JAVA_HOME system variable wasn’t picked up correctly by the script)
In the line below # Execute the client, after the TCC_PARAMETERS variable, add:

✓ the proxy server parameters, if a proxy is to be used:

-Dhttp.proxyHost=ipNumber -Dhttp.proxyPort=portNumber

✓ the path of the Data Model:

-Dcom.taleo.integration.client.productpacks.dir=/u01/oracle/tcc-15A.2.0.20/featurepacks

So, in the end, the TaleoConnectClient.sh file in our environment has the following content (IP addresses have been masked):

#!/bin/sh
JAVA_HOME=/u01/middleware/Oracle_BI1/jdk

# Make sure that the JAVA_HOME variable is defined
if [ ! "${JAVA_HOME}" ]
then
    echo +-----------------------------------------+
    echo "+ The JAVA_HOME variable is not defined.  +"
    echo +-----------------------------------------+
    exit 1
fi

# Make sure the IC_HOME variable is defined
if [ ! "${IC_HOME}" ]
then
    IC_HOME=.
fi

# Check if the IC_HOME points to a valid Taleo Connect Client folder
if [ -e "${IC_HOME}/lib/taleo-integrationclient.jar" ]
then
    # Define the class path for the client execution
    IC_CLASSPATH="${IC_HOME}/lib/taleo-integrationclient.jar":"${IC_HOME}/log"

    # Execute the client
    ${JAVA_HOME}/bin/java ${JAVA_OPTS} -Xmx256m ${TCC_PARAMETERS} \
        -Dhttp.proxyHost=10.10.10.10 -Dhttp.proxyPort=8080 \
        -Dcom.taleo.integration.client.productpacks.dir=/u01/tcc_linux/tcc-15A.2.0.20/featurepacks \
        -Dcom.taleo.integration.client.install.dir="${IC_HOME}" \
        -Djava.endorsed.dirs="${IC_HOME}/lib/endorsed" \
        -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl \
        -Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl \
        -Dorg.apache.commons.logging.Log=org.apache.commons.logging.impl.Log4JLogger \
        -Djavax.xml.xpath.XPathFactory:http://java.sun.com/jaxp/xpath/dom=net.sf.saxon.xpath.XPathFactoryImpl \
        -classpath ${IC_CLASSPATH} com.taleo.integration.client.Client ${@}
else
    echo +-----------------------------------------------------------------------------------------------
    echo "+ The IC_HOME variable is defined as (${IC_HOME}) but does not contain the Taleo Connect Client"
    echo "+ The library ${IC_HOME}/lib/taleo-integrationclient.jar cannot be found."
    echo +-----------------------------------------------------------------------------------------------
    exit 2
fi

Once this is ready, we can also apply the necessary changes to the extract and configuration files, although there is no need to change anything in the extract definition (file blog_article_sq.xml). Let’s have a quick look at the content of this file:

<?xml version="1.0" encoding="UTF-8"?>
<quer:query productCode="RC1501" model="http://www.taleo.com/ws/tee800/2009/01"
            projectedClass="JobInformation" locale="en" mode="CSV-ENTITY"
            largegraph="true" preventDuplicates="false"
            xmlns:quer="http://www.taleo.com/ws/integration/query">
  <quer:subQueries/>
  <quer:projections>
    <quer:projection><quer:field path="BillRateMedian"/></quer:projection>
    <quer:projection><quer:field path="JobGrade"/></quer:projection>
    <quer:projection><quer:field path="NumberToHire"/></quer:projection>
    <quer:projection><quer:field path="JobInformationGroup,Description"/></quer:projection>
  </quer:projections>
  <quer:projectionFilterings/>
  <quer:filterings/>
  <quer:sortings/>
  <quer:sortingFilterings/>
  <quer:groupings/>
  <quer:joinings/>
</quer:query>

Just by looking at the file we can figure out how to add more columns manually: we just need to add more quer:projection tags, like:

<quer:projection><quer:field path="DesiredFieldPath"/></quer:projection>

With regard to the configuration file, we need to make some small changes: the cli:SpecificFile and cli:Folder tags contain absolute Windows paths, and once we move the files to Linux, we need to replace them with Linux filesystem paths, absolute or relative.
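As an illustration only (the surrounding XML structure is omitted and the paths are made up; only the two tag names come from the actual configuration file), the change might look like this:

<!-- Windows configuration -->
<cli:SpecificFile>C:\TCC\exports\blog_article.csv</cli:SpecificFile>
<cli:Folder>C:\TCC\exports</cli:Folder>

<!-- Linux configuration -->
<cli:SpecificFile>/u01/tcc_linux/exports/blog_article.csv</cli:SpecificFile>
<cli:Folder>/u01/tcc_linux/exports</cli:Folder>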

Once the files are ready, the only remaining task is to run the extract, which is done by running:

./TaleoConnectClient.sh blog_article_cfg.xml 

See the execution log:

[KKanicki@BIApps tcc-15A.2.0.20]$ ./TaleoConnectClient.sh blog_article_cfg.xml
2017-03-16 20:18:26,876 [INFO] Client - Using the following log file: /biapps/tcc_linux/tcc-15A.2.0.20/log/taleoconnectclient.log
2017-03-16 20:18:27,854 [INFO] Client - Taleo Connect Client invoked with configuration=blog_article_cfg.xml, request message=null, response message=null
2017-03-16 20:18:31,010 [INFO] WorkflowManager - Starting workflow execution
2017-03-16 20:18:31,076 [INFO] WorkflowManager - Starting workflow step: Prepare Export
2017-03-16 20:18:31,168 [INFO] WorkflowManager - Completed workflow step: Prepare Export
2017-03-16 20:18:31,238 [INFO] WorkflowManager - Starting workflow step: Wrap SOAP
2017-03-16 20:18:31,249 [INFO] WorkflowManager - Completed workflow step: Wrap SOAP
2017-03-16 20:18:31,307 [INFO] WorkflowManager - Starting workflow step: Send
2017-03-16 20:18:33,486 [INFO] WorkflowManager - Completed workflow step: Send
2017-03-16 20:18:33,546 [INFO] WorkflowManager - Starting workflow step: Poll
2017-03-16 20:18:34,861 [INFO] Poller - Poll results: Request Message ID=Export-JobInformation-20170316T201829;Response Message Number=123952695;State=Completed;Record Count=1;Record Index=1;
2017-03-16 20:18:34,862 [INFO] WorkflowManager - Completed workflow step: Poll
2017-03-16 20:18:34,920 [INFO] WorkflowManager - Starting workflow step: Retrieve
2017-03-16 20:18:36,153 [INFO] WorkflowManager - Completed workflow step: Retrieve
2017-03-16 20:18:36,206 [INFO] WorkflowManager - Starting workflow step: Strip SOAP
2017-03-16 20:18:36,273 [INFO] WorkflowManager - Completed workflow step: Strip SOAP
2017-03-16 20:18:36,331 [INFO] WorkflowManager - Completed workflow execution
2017-03-16 20:18:36,393 [INFO] Client - The workflow execution succeeded.

And that’s it! Assuming our files were correctly prepared, the extract will be ready in the folder declared in cli:Folder tag of the configuration file.

As for scheduling, different approaches are available, the most basic being to use the Linux crontab as the scheduler (see the example crontab entry after the script below), but you can also easily use any ETL tool already in place in your project. See the screenshot below for an ODI example:


Figure 7: TCC extracts placed into 1 ODI package

The file extract_candidate.sh contains a simple call to the TCC extraction:

[KKanicki@BIApps tcc-15A.2.0.20]$ cat extract_candidate.sh
#!/bin/bash
cd /u01/tcc_linux/tcc-15A.2.0.20/
./TaleoConnectClient.sh extracts_definitions/candidate_cfg.xml
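And, if you go for the crontab approach instead, a hypothetical entry to run the same wrapper script every night at 02:00 could look like the following (the log path is just an example):

0 2 * * * /u01/tcc_linux/tcc-15A.2.0.20/extract_candidate.sh >> /u01/tcc_linux/tcc-15A.2.0.20/extract_candidate.log 2>&1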

If your extracts fail or you have any other issues with configuring Taleo Connect Client, feel free to ask us in the comments section below!

In the last couple of years we have delivered several highly successful BI Projects in the Human Resources / Human Capital space! Don´t hesitate to contact us if you would like to receive specific information about these solutions!

Use JavaScript to embed Tableau dashboards into SharePoint


How to use JavaScript to embed Tableau dashboards into SharePoint with automatic sizing and current date quick filter selection?

As we all know, Tableau comes with its own web-based repository (Tableau Server), but sometimes we have to consume the dashboards that we create on other platforms. One of them is SharePoint, Microsoft’s web-based application, often used by enterprises to share content with their employees and departments.


Figure 1: Schema of the integration between SharePoint, Tableau Server and the enterprise accounts

Tableau allows us to easily share our dashboards by embedding them in JavaScript code or iFrames, providing ready-to-use code snippets that we just need to copy and paste in our solution.

These are very useful, but what if we want to go even further? For example, it’s currently not possible to set a quick filter to automatically select the current date. What if we want our dashboard to show data for the current year or month by default?

In this article, we’re going to explain how to embed dashboards into SharePoint, how to dynamically resize them and how to apply the default date quick filter with the help of some JavaScript power.

 

1. Embedding the dashboard into SharePoint

To embed the dashboard into SharePoint, we must first create a page with an “Embed Code” container. Once your SharePoint page is ready, click on Edit, then Insert, and select Embed Code. This particular type of container, as its name suggests, allows us to embed code inside SharePoint, and it will be our door to Tableau.


Figure 2: SharePoint’s Insert toolbar with highlighted “Embed code” button

Now that your container is ready, you will see a canvas where you can write your code. Let’s start by embedding the dashboard as it is, with an iFrame.

First of all, we need the dashboard link. We can find it in Tableau Server: just navigate to the dashboard that you want to embed into SharePoint and click on Share, in the upper-right corner. You will see two links: “Embed Code” and “Link”. The first is a JavaScript snippet, ready to be copied and pasted; the second is a pure URL to the dashboard. Click on the second and copy it.


Figure 3: Tableau Server’s “share” option and corresponding links

We are using this link because it is slightly less verbose; it’s also interesting to see how the two technologies can easily integrate.

Now that the link is ready, return to the Embed Code container canvas and write a simple HTML page that contains an iFrame, and use the link as the source URL, like this:

<!DOCTYPE html>
<html>
  <body>
    <center>
      <iframe src="(Tableau Server URL)" id="myIframe"></iframe>
    </center>
  </body>
</html>

Click on Insert, and the dashboard will appear on your page!
 

2. Dynamic Sizing

Now that your dashboard is in SharePoint, you might want to change its size. Sometimes, to make our dashboards look as good as possible on every screen, we set the size to Automatic in Tableau Desktop; this is usually a great idea, but in SharePoint we might have big headers, multiple contents on the same browser page, etc., which stop our dashboard from looking as good as it could. We can solve this issue by changing the size of the Embed Code container automatically, with just a few lines of JavaScript.

Edit the container that you just set up, and add the following script AFTER the iFrame (but before the closing body tag):

<script type="text/javaScript">
var iframe = document.getElementById('myIframe');
	
//Automatic size: the following lines of code adapt the dashboard size to the browser window
var height = window.innerHeight;
var width = window.innerWidth;					   
iframe.height = height*0.95;
iframe.width = width*0.95;

//remove the SharePoint page margins and padding around the embedded dashboard
//(contentBox and contentRow are standard SharePoint page elements; IDs and class names may differ in your SharePoint version or theme)
document.getElementById('contentBox').style.marginLeft="0px";
document.getElementById('contentBox').style.marginRight="0px";
document.getElementById('contentRow').style.paddingTop="0px";
	
var containers = document.getElementsByClassName('container-fluid');
var container = containers[1];
container.style.marginLeft="0px";
container.style.marginRight="0px";
container.style.paddingLeft="0px";
container.style.paddingRight="0px";
		
containers = document.getElementsByClassName('welcome blank-wp');
container = containers[0];
container.style.paddingLeft="0px";
container.style.paddingRight="0px";
container.style.paddingTop="0px";
container.style.paddingBottom="0px";

containers = document.getElementsByClassName('welcome-content');
container = containers[0];
container.style.marginTop="0px";	
</script>

Again, click on Insert, and your dashboard will automatically resize when it’s loaded! In this script, we take the actual size of the browser window and use it as a reference to resize the iFrame and its “contentBox” container. In the example we basically use 95% of the available space in both dimensions, but you can change this as you wish. Notice how we also remove all possible padding.
 

3. Setting default current date filter

Our dashboard is now embedded in SharePoint and it resizes automatically. The next step is to add automatic filters.

Tableau allows us to manually set parameters in the URL, with the classic name=value combination. We will exploit this by changing the parameter list when loading the page.
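For instance, a view URL with a filter applied through the query string might look like the following (the server, workbook and view names are made up; the field name matches the Year-Month dimension used later in this example):

http://myTableauServer/views/AdWordsAnalysis/Overview?Year-Month=2017-03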

Let’s say, for example, that we have a date dimension called Year-Month, in the format YYYY-MM, and we want our dashboard to automatically filter to the current month every time we open it. This is something that Tableau currently can’t do out of the box (you can vote for the idea in the Tableau Community). Luckily, JavaScript comes to the rescue.

Add the following lines of code after your iFrame, inside a script tag (the comments explain the various steps):

//get the parameters from the URL of the SharePoint page
var loc = window.location.href;
var params = "";
if (loc.indexOf('?') >= 0) params = loc.split('?')[1];

//get the actual iframe URL
var iframe = document.getElementById('myIframe');
var iframe_url = iframe.src;

//if the date parameter doesn't exist, set the actual Year-Month
var n = params.indexOf("Year-Month");
if (n < 0) { //the parameter is not in the URL yet
    var today = new Date();
    var mm = today.getMonth() + 1; //JS months go from 0 to 11
    var yyyy = today.getFullYear();
    if (mm < 10) mm = "0" + mm;

    //append the new parameter to the other ones (if any)
    var ym = "Year-Month=" + yyyy + "-" + mm;
    if (params.length > 0) params = params + '&' + ym;
    else params = ym;
}

//set the new iframe URL (or keep the original one if there are no parameters)
if (params.length > 0) {
    var separator = iframe_url.indexOf('?') >= 0 ? '&' : '?';
    iframe.src = iframe_url + separator + params;
}

That’s it! After clicking Insert, your dashboard will automatically filter to the current month. Pretty cool!

 

4. Next steps

This is a pretty basic but highly useful implementation of Tableau embedding. We used the date in our example, but you can easily set whatever parameter you want. You can also set up a URL action in your dashboard that points to a SharePoint page, and then write JavaScript code that processes the various parameters, in order to apply the filters automatically when the dashboard is loaded on that page.

For an even better use of the tool, don’t forget to check out the Tableau APIs:

JavaScript API
REST API

Contact us if you would like to hear more about the topic or need any help embedding your Tableau dashboards into SharePoint.

 

Advanced analytics: Tableau + Python


Tableau has released TabPy, the Tableau Python Server, an API that enables Python code evaluation within Tableau. Thanks to TabPy, you can create calculated fields using Python code in Tableau 10.2. Just as significant as the R connection introduced with Tableau 8.1, it means that Python lovers can now leverage all the power of advanced analytics and visualize the results. Python is a widely used general-purpose programming language, and a large number of Python libraries are available to perform statistical analysis, build predictive models or run machine learning algorithms. Tableau and Python are a perfect match, a dream team for any data scientist.

Here at ClearPeaks we are convinced that connecting TabPy with Tableau is one of the best approaches for predictive analytics. So how does this new feature work? We are already working with this technology, and are proud to offer a demonstration of Python code running within a Tableau environment.

In this article you will learn how to connect Tableau with TabPy; just follow the example to see how to use this new feature and you´ll be ready to start your own journey in this new advanced data analytics environment.

Prerequisites:

Tableau Desktop 10.2
TabPy (Tableau Python Server)
Anaconda Navigator

 

1. Installing and setting up Anaconda Navigator

Anaconda is the leading open data science platform powered by Python, whose open source version is a high performance distribution of Python and R, and includes over 100 of the most popular Python, R and Scala packages for data science.

Download Anaconda Navigator (Python 2.7, 64-bit version) here.

TabPy is only supported in a Python 2.7 environment.

Install Anaconda Navigator with the default options in the installation set-up.


Figure 1: Anaconda Navigator installation

Once Anaconda Navigator has been installed we need to create a new environment:

Click on “Create” and type “Tableau-Python-Server” in the environment name, then click “Create”.


Figure 2: Creating a new environment in Anaconda

This new environment will be used when connecting Tableau with TabPy. We also need some packages in order to perform mathematical calculations, apply machine learning algorithms, plot, define dataframes, create predictive models and edit Python code. We’re going to install the required packages: “numpy”, “scikit-learn”, “matplotlib”, “pandas”, “statsmodels” and “ipython”; search for them in the “Not Installed” window and install them all (alternatively, see the command-line equivalent after the figure below).


Figure 3: Installing Python Packages
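If you prefer the command line over the Navigator UI, something like the following should install the same packages into the new environment (a sketch, assuming a Windows prompt with conda on the PATH):

activate Tableau-Python-Server
conda install numpy scikit-learn matplotlib pandas statsmodels ipython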

You can also download “Spyder”, a powerful Python IDE with advanced editing, interactive testing, debugging and introspection features, to try your Python code before using it in Tableau calculated fields.

Go to the Home tab of Anaconda Navigator, and select the created environment “Tableau-Python-Server”; you can install “Spyder” from there.


Figure 4: Anaconda Navigator

 

2. Installing TabPy

Running Python code within a Tableau workbook requires a Python server to execute it, and the TabPy framework gets the job done. Download TabPy from github at the following link and decompress the file.


Figure 5: Download Tabpy

On the Windows prompt:

activate Tableau-Python-Server

In the TabPy folder run the following commands to install TabPy client and server as well as package dependencies.

pip install -r ./tabpy-server/requirements.txt

pip install ./tabpy-client

pip install ./tabpy-server

As the packages are installed, you will see the install locations listed in the command line. These might look like /Users/username/anaconda/envs/Tableau-Python-Server/lib/python2.7/site-packages or \Users\username\anaconda\envs\Tableau-Python-Server\lib\site-packages depending on your environment.

Navigate to the tabpy_server folder under site-packages and run startup.bat 9001, where 9001 is the port that Tableau will use to connect to the TabPy server.
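Assuming the environment was created under the default Anaconda location mentioned above (adjust the path to your own installation), the commands might look like this:

activate Tableau-Python-Server
cd \Users\username\anaconda\envs\Tableau-Python-Server\lib\site-packages\tabpy_server
startup.bat 9001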

 

3. Connecting Tableau with TabPy

The next step is to connect Tableau 10.2 with TabPy. This can be done in Help > Settings and Performance > Manage External Service Connection:


Figure 6: Connecting Tableau + Python Server

Test the connection and you should get a “successfully connected” prompt. Now you can check out an example of predictive analysis.

 

4. Using Tableau + Python for predictive analytics

In this section, you can follow through an example of how to visualize a Time Series prediction with Tableau.


We’re going to use a dataset called AirPassengers, containing historic data on the number of passengers per month of an airline company. First, we open a new Tableau workbook and connect to the AirPassengers.csv dataset, then we create all calculated fields and parameters:


Figure 7: Calculated fields and Dimensions

Define the parameters:

We can create parameters to choose the number of autoregressive terms, the number of moving average terms and the seasonal differencing order for our ARIMA time series model.


Figure 8: Autoregressive terms


Figure 9: Moving average terms


Figure 10: Seasonal differencing order


Figure 11: Months to forecast

Time dimensions need to be formatted.


Figures 12 and 13: Date parsing and Date shift

Finally, we have to create the calculated fields. The measure “#Passengers” is defined as an integer; we should format this field as float (number, decimal), so that we can combine the axis of this measure with the predicted outcomes.


Figures 14 and 15: Synchronize number of passengers with correspondent month and differentiate forecast values with predicted past values

Now we’re going to embed Python code in the new calculated fields. The code embedded in the calculated field “Time Series” will return the predicted values of an ARIMA(p, d, q) model, where p=[AR (Time lag)], d=[I Seasonal Difference] and q=[MA (Moving average)] are the parameters we created above. Find the Python code below:

SCRIPT_REAL(
'
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
dates = _arg1
passeng = _arg2
order_arima = min(_arg3)
seasonal_diff = min (_arg4)
ma_param = min (_arg5)
months_forecast = min(_arg6)
ts = pd.DataFrame({"dates": dates,"Passengers": passeng})
ts["Passengers"] = ts["Passengers"].astype("float64")
ts = ts.set_index(["dates"])
ts_log = np.log(ts)
ts_log.index = pd.to_datetime(ts_log.index)
ts_log_diff = ts_log - ts_log.shift()
ts_log_diff["Passengers"][0] = 0
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(ts_log_diff, order=(order_arima, seasonal_diff, ma_param))
results_ARIMA = model.fit(disp=-1)
predictions_value = results_ARIMA.forecast(months_forecast)[0]
from dateutil.relativedelta import relativedelta
add_month = relativedelta(months=1)
predictions_dates = list()
for i in range(months_forecast):
    predictions_dates.append(results_ARIMA.fittedvalues.index[-1] + (i+1)*add_month)
forecast_log_diff = pd.Series(predictions_value, index=predictions_dates)
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(np.asscalar(ts_log.ix[0]), index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)
forecast_log_diff_ARIMA = pd.Series(forecast_log_diff, copy=True)
forecast_ARIMA_log_diff_cumsum = forecast_log_diff_ARIMA.cumsum()
forecast_ARIMA_log = pd.Series(np.asscalar(ts_log.ix[-1]), index=forecast_log_diff_ARIMA.index)
forecast_ARIMA_log = forecast_ARIMA_log.add(forecast_ARIMA_log_diff_cumsum,fill_value=0)
forecast_ARIMA = np.exp(forecast_ARIMA_log)
return list(predictions_ARIMA)
',
ATTR([Month]), ATTR([#Passengers]), MIN([AR (Time lag)]), MIN([I Seasonal Difference]), 
MIN([MA (Moving average)]), MIN([Months Forecast]))

For the forecasting part, create the calculated field “Forecast”:

SCRIPT_REAL(
'
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
dates = _arg1
passeng = _arg2
order_arima = min(_arg3)
seasonal_diff = min (_arg4)
ma_param = min (_arg5)
months_forecast = min(_arg6)
ts = pd.DataFrame({"dates": dates,"Passengers": passeng})
ts["Passengers"] = ts["Passengers"].astype("float64")
ts = ts.set_index(["dates"])
ts_log = np.log(ts)
ts_log.index = pd.to_datetime(ts_log.index)
ts_log_diff = ts_log - ts_log.shift()
ts_log_diff["Passengers"][0] = 0
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(ts_log_diff, order=(order_arima, seasonal_diff, ma_param))
results_ARIMA = model.fit(disp=-1)
predictions_value = results_ARIMA.forecast(months_forecast)[0]
from dateutil.relativedelta import relativedelta
add_month = relativedelta(months=1)
predictions_dates = list()
for i in range(months_forecast):
    predictions_dates.append(results_ARIMA.fittedvalues.index[-1] + (i+1)*add_month)
forecast_log_diff = pd.Series(predictions_value, index=predictions_dates)
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(np.asscalar(ts_log.ix[0]), index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)
forecast_log_diff_ARIMA = pd.Series(forecast_log_diff, copy=True)
forecast_ARIMA_log_diff_cumsum = forecast_log_diff_ARIMA.cumsum()
forecast_ARIMA_log = pd.Series(np.asscalar(ts_log.ix[-1]), index=forecast_log_diff_ARIMA.index)
forecast_ARIMA_log = forecast_ARIMA_log.add(forecast_ARIMA_log_diff_cumsum,fill_value=0)
forecast_ARIMA = np.exp(forecast_ARIMA_log)
forecast_ARIMA_2 = predictions_ARIMA.append(forecast_ARIMA)
forecast_ARIMA_2 = forecast_ARIMA_2[len(forecast_ARIMA):]
return list(forecast_ARIMA_2)
',
ATTR([Month]), ATTR([#Passengers]), MIN([AR (Time lag)]), MIN([I Seasonal Difference]),
MIN([MA (Moving average)]), MIN([Months Forecast]))

We have also included the sum of squared errors calculated field so that it is possible to compare different models.

Here’s the code:

SCRIPT_REAL(
'
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
dates = _arg1
passeng = _arg2
order_arima = min(_arg3)
seasonal_diff = min (_arg4)
ma_param = min (_arg5)
months_forecast = min(_arg6)
ts = pd.DataFrame({"dates": dates,"Passengers": passeng})
ts["Passengers"] = ts["Passengers"].astype("float64")
ts = ts.set_index(["dates"])
ts_log = np.log(ts)
ts_log.index = pd.to_datetime(ts_log.index)
ts_log_diff = ts_log - ts_log.shift()
ts_log_diff["Passengers"][0] = 0
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(ts_log_diff, order=(order_arima, seasonal_diff, ma_param))
results_ARIMA = model.fit(disp=-1)
predictions_value = results_ARIMA.forecast(months_forecast)[0]
from dateutil.relativedelta import relativedelta
add_month = relativedelta(months=1)
predictions_dates = list()
for i in range(months_forecast):
    predictions_dates.append(results_ARIMA.fittedvalues.index[-1] + (i+1)*add_month)
forecast_log_diff = pd.Series(predictions_value, index=predictions_dates)
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(np.asscalar(ts_log.ix[0]), index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)
forecast_log_diff_ARIMA = pd.Series(forecast_log_diff, copy=True)
forecast_ARIMA_log_diff_cumsum = forecast_log_diff_ARIMA.cumsum()
forecast_ARIMA_log = pd.Series(np.asscalar(ts_log.ix[-1]), index=forecast_log_diff_ARIMA.index)
forecast_ARIMA_log = forecast_ARIMA_log.add(forecast_ARIMA_log_diff_cumsum,fill_value=0)
forecast_ARIMA = np.exp(forecast_ARIMA_log)
predictions_ARIMA_dataframe = pd.DataFrame(predictions_ARIMA)
SSE = sum((ts.ix[:,0]-predictions_ARIMA_dataframe.ix[:,0])**2)
return SSE
',
ATTR([Month]), ATTR([#Passengers]), MIN([AR (Time lag)]), MIN([I Seasonal Difference]),
MIN([MA (Moving average)]), MIN([Months Forecast]))

Finally, we’re going to design the view by dragging “Forecast date” (at month level) to Columns, and “Number of Passengers” and “Forecast” to Rows. Convert the plot to a dual axis and synchronize the axes, obtaining this view:


Figure 16: Tableau + Python Predictive analysis view

Here you can see the real values as bars, predicted values as a blue line and forecast values as an orange line; you can play with different parameters to forecast more months or create a new ARIMA(p,d,q) model. Notice that the title is also dynamic - you can get this with:


Figure 17: Dynamic plot title
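The figure above shows the dynamic title definition. As a rough sketch (using the parameters created earlier; the exact insert syntax shown in the title editor may differ slightly), the title text could look like:

ARIMA ( <Parameters.AR (Time lag)> , <Parameters.I Seasonal Difference> , <Parameters.MA (Moving average)> ) model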

As you can see, it’s not the best model ever, so you can take it as a challenge to improve it!

 

Conclusion

To summarize, we have seen how to use Python code with Tableau, a new feature which gives us the opportunity to implement advanced analytics on our data; an example of Time Series prediction is included in this article. We can see a clear advantage to this approach when compared to the Tableau built-in tool set: the adjustability of the model parameters.

We hope you enjoyed this article and please leave a comment if you have any doubts, suggestions or opinions.

Stay tuned for future posts!

Click here if you would like to know more about the topic.

Data Quality with EDQ – Part 3: Data Deduplication


 

In our previous article about Data Quality we reviewed some of EDQ’s basic cleansing and standardization capabilities.

In this post, which is the last in the series, we are reviewing the basic data deduplication capabilities of EDQ.

Introduction

The first step in a data deduplication or consolidation process flow is record matching, the process of finding duplicate records in the data, that is, records which may relate to a single real-world entity. This is not a trivial task, since we need to identify duplicate records despite different formats and conventions, typos, missing data, etc.

EDQ comes with some built-in transformations to match records; their names are self-explanatory in most cases:

Advanced Match: gives control over all the configuration options
Match Entities
Match Households
Match Individuals (Name, Address, DoB)
Match Individuals (Name, Address)

There are also processors for a variety of different matching scenarios:

Consolidate: matches and merges multiple datasets
Deduplicate: matches a single dataset
Enhance: adds information from one or more reference datasets
Group & Merge: gives a simple merge based on exact match only
Link: finds matches between datasets, but without merging the data

The matching processors execute a series of ordered sub-processors where we can configure the logic to be used in the matching and the decision rules for candidate duplicate records, among other things. The sub-processors are:

Input: select the attributes from the data stream included in the matching process
Identify: create identifiers to use in matching, i.e., what data will be used to identify records, and map the identifiers to attributes
Cluster: divide the data streams into clusters; this can be used to reduce the number of comparisons needed and thus the time necessary to run the process (the sketch after this list illustrates the idea)
Match: choose which comparisons to perform, and how to interpret them with match rules
Merge: use rules to merge matching records, to create a ‘best’ set of output records (optional)
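EDQ configures all of this through its GUI, so no coding is needed; but to make the clustering idea concrete, here is a minimal, purely illustrative Python sketch (not EDQ code, and the records, cluster key and match rule are toy assumptions) of how grouping records by a cluster key reduces the number of pairwise comparisons:

from itertools import combinations

# toy customer records; in EDQ these would come from the staged dataset
records = [
    {"id": 1, "name": "John Smith",  "postcode": "08001"},
    {"id": 2, "name": "Jon Smith",   "postcode": "08001"},
    {"id": 3, "name": "Maria Lopez", "postcode": "08002"},
    {"id": 4, "name": "Maria Lopes", "postcode": "08002"},
]

def cluster_key(rec):
    # simple blocking key: first letter of the name plus the postcode
    return rec["name"][0].upper() + rec["postcode"]

# group records by cluster key so that comparisons only happen within a cluster
clusters = {}
for rec in records:
    clusters.setdefault(cluster_key(rec), []).append(rec)

def is_match(a, b):
    # toy match rule: same postcode and names sharing the first two letters
    return a["postcode"] == b["postcode"] and a["name"][:2].lower() == b["name"][:2].lower()

# only pairs inside the same cluster are compared, instead of all possible pairs
candidate_pairs = [
    (a["id"], b["id"])
    for group in clusters.values()
    for a, b in combinations(group, 2)
    if is_match(a, b)
]
print(candidate_pairs)  # [(1, 2), (3, 4)]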

In the following section we are going to run through an example of data deduplication using EDQ.

 

1. Deduplicating Data using EDQ

Let’s imagine that we have a dataset with a list of people, with many fields containing personal information, including full name, date of birth, and address. We want to see if we have duplicate persons in the table, and to do this, we are going to use EDQ to find matching records based on the full name, address, and date of birth.

For this specific case, the best processor to use is Match Individuals (Name, Address, DoB). We’re going to add this processor to a new process, connecting all the fields available in the dataset as identifiers:


Figure 1: Process for data standardization and deduplication in EDQ. The steps Create GivenName, Create FullName, and Create WholeAddress are concatenations of strings to form the full name in a single string and the whole address in another string.


Figure 2: Matching identifiers of the Match Individuals processor in EDQ.

We can run the process with this configuration to review the results:


Figure 3: Running the Match Individuals processor in EDQ.

In the new window that opens we can review the matching records, set a decision for each individual case, and even add comments:


Figure 4: Results of the Matching Individuals processor in EDQ. We can review and set a decision for each pair of duplicate candidates individually.

These results can also be exported to Excel for further investigation.

Now we’re going to merge the duplicated records using the Merge sub-processor of the Match Individuals processor; we simply double-click on the Merge sub-processor and check the option ‘Output unrelated records’.

We can write the deduplicated records into a new dataset; to do so, we have to add a new Write processor connected to the Merged output of the Match Individuals processor:


Figure 5:  The data standardization and deduplication process in EDQ, with a Writer processor to export the results to a new dataset.

The records will be written into a new staged dataset called Deduplicated Customers:


Figure 6: Writing the results of the process into a new dataset in EDQ.


Figure 7: New staged data created from the results of the process of deduplication in EDQ.

After running the process, we can see that the new dataset with the deduplicated customers has been created in Staged Data:


Figure 8: Dataset created from the standardization and deduplication process in EDQ.

And if we right-click on the Writer processor and select ‘Show results in new window’, we can see that some of the records have a MatchGroupSize field equal to 2, which means that these records come from the merge of two records.

 

Conclusion

Data deduplication is a complex task, one of the most complex manual tasks in the quality assurance process, where tools like EDQ are extremely handy. With EDQ we were able to find matching records using complex rules and to merge the records belonging to a single entity, all in a few clicks. The ability to review each pair of duplicate candidates individually and to decide case-by-case proved to be really helpful in reducing the risk of merging records corresponding to different real-world entities in the deduplication process.

You can try this and other examples of EDQ usage by downloading the latest virtual machine from Oracle, available here.

This concludes our series of blog articles about Oracle EDQ.
You can find the previous posts in the links below:

Introducing Data Quality

Data quality with EDQ

Part 1: Data Profiling
Part 2: Data Standardization

Have you tried EDQ in any of your projects? What was your experience using EDQ like? If you have anything you’d like to share, or if there are any questions you’d like to raise, please leave your comments below!

In the next couple of blog posts we'll explain how to profile, standardize and deduplicate data with Informatica. Stay tuned!

Click here if you would like to know more about the Data Quality Services we offer!

 
