Data Quality with Informatica – Part 2: Data Standardization


Introduction

In the previous article in this series about Data Quality we explained how we can profile our data using Informatica.

We learned that the data in our example contained some columns that needed to be fixed:

Keyword_ID: our data contain records with value '--' in this field, which represents a null value; in order to standardize this value across all sources we are going to change it to 'NA'.
Currency: the currency field is also not consistent as the values are in both upper and lowercase, and some records contain values that do not correspond to a currency code. We are going to fix this by standardizing to uppercase.
Year: some records contain the year with the word 'year', e.g. 'year 2015', instead of just the value 2015; the field must be standardized to just the year in numerical format.
Quarter: the quarter field is also wrong, as some records contain the date or the year, and this field should only contain the quarters and the year number.

In this article, we are going to continue with this example and create a set of mapplets in Informatica Developer to fix these data issues.

 

1. Creating the standardization rules

Our first mapplet will retain numerical values only, so it can be applied to any column where we need to keep just the numbers; in our case, we are going to apply it to the year column. To do this, we open Informatica Developer, right-click on the project and click on create mapplet; we’ll call it rule_Retain_Numbers. A mapplet normally contains an input and an output transformation, as it receives values and returns results. We are going to define the mapplet logic between these two transformations, so first we add the input and the output transformations, configure the input and output ports and set the length to be long enough, for instance 200 characters.

Figure 1: Creating a mapplet in Informatica Developer

Now we have to define the mapplet logic: first, we are going to use a labeller transformation to mask the letters and spaces; the labeller performs a character-by-character analysis of the data in a field, where each character is masked according to a list of standard and user-defined types: numbers will be masked as '9', spaces as '_', letters as 'X' and symbols as 'S'. To add the transformation, we right-click inside the mapplet, select Add transformation and then add a labeller:

Figure 2: Adding a labeller transformation to the mapplet

Now we’re going to add a new strategy to the labeller: select character mode, then verify that the output precision is set to 200, as in the input:

Figure 3: Basic labeller configuration

The next step is to add an operation: we’re going to select Label characters using character sets instead of Label characters using reference table. We want to mask all the characters except the numbers, so we choose the English alphabet, spaces and symbols, as in the image below:

Figure 4: Configuration of the labeller transformation

Click on finish and skip the ignore text dialog which appears after clicking on next, as we don't want to add another operation. With the configuration as it is now, the labeller will output only the numbers and mask the rest of the characters, so we can add a standardizer transformation to remove them.

The standardizer transformation can be viewed as a series of operations that are carried out on an input string to look for matching substrings in the input and to place the correct substring in the output. It is used to fix errors such as incorrect formats, and to search for values in the data that can be removed or replaced either with reference table items or specific values.

To continue with our example, it’s time to add a standardizer transformation to the mapplet as we did before, which we can name st_remove_noise; drag the output from the labeller to the standardizer input, then create a new strategy (called remove noise). We check the space delimiter after clicking on the choose button, and also remove the trailing spaces by checking both checkboxes in the strategy properties.

Figure 5: Configuring the standardizer transformation strategy

At this point we want to remove the noise labelled with ‘S’, ‘X’ and ‘_’, so we select remove custom strings in the strategy and add these characters to the custom strings in properties.

Figure 6: Configuring the standardizer transformation to remove custom strings
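To make the combined behaviour concrete, here is a rough Python sketch of what the labeller plus this remove-custom-strings strategy achieve for the year column (purely illustrative, not Informatica code):

# Illustrative sketch of the retain-numbers logic: every character that is not
# a digit is labelled ('X' for letters, '_' for spaces, 'S' for symbols) and
# the label characters are then removed by the standardizer strategy.
def retain_numbers(value):
    masked = ''.join(
        ch if ch.isdigit() else
        '_' if ch == ' ' else
        'X' if ch.isalpha() else
        'S'
        for ch in value
    )
    for noise in ('X', 'S', '_'):  # the custom strings configured above
        masked = masked.replace(noise, '')
    return masked

print(retain_numbers('year 2015'))  # -> '2015'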

Click on finish and finally drag the output from the standardizer transformation to the port of the output transformation, then validate the mapplet. If we want the mapplet to be available in the Analyst, we have to validate it as a rule.

Figure 7: Standardization mapplet

Carrying on with our example, now we’re going to create another mapplet to replace the wrong currency codes we found in the file. We’re going to use a reference table to do this, which can be created using Informatica Analyst or Developer. We will use Analyst for this example.

Log into Analyst and open the profile. In the reference table we are about to build, the first value will be the valid one and the rest will be replaced by it. To create the reference table we go to the file profile, select the currency column and then, in actions, click on Add to - Reference Table:

Figure 8: Creating reference tables in Informatica Analyst

Once the table has been created we add three new columns with the values to be replaced, the first column being the correct one.

Figure 9: Reference table properties

After adding the new columns, we can edit the records and keep just one, as shown in image 10:

Figure 10: Final reference table for currency standardization in Analyst
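Conceptually, the reference table drives a simple lookup-and-replace. As a hedged Python sketch of that behaviour (the variant values below are hypothetical; the real ones live in the reference table shown above):

# Hypothetical variants mapped to the single valid code kept in the table.
CURRENCY_REPLACEMENTS = {'usd': 'USD', 'Usd': 'USD', 'us dollar': 'USD'}

def standardize_currency(value):
    return CURRENCY_REPLACEMENTS.get(value, value)

print(standardize_currency('usd'))  # -> 'USD'
print(standardize_currency('USD'))  # -> 'USD' (already valid)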

We could add new ports to the first mapplet and increase the complexity of the standardization there, but by keeping each rule in its own mapplet the mapplets remain as simple as possible, so we create a new mapplet for this rule. For the currency mapplet we proceed as with the first one we created above, but in this case the standardizer transformation will have a different strategy: replacing the values with those present in the currency reference table. To do this, we have to select the reference table replacement for the transformation:

Figure 11: Standardizer transformation using a reference table

The mapplet will look like this; we validate it and proceed to create a new one:

Figure 12: Mapplet for the standardization of the currency field

We need to identify the month number to replace the values for the quarter, so we will now parse the date in a new mapplet. Informatica Data Quality comes with some built-in features to parse dates, but we are not going to use them in this example. Instead, we are going to parse the date manually, using a tokenizer to split it into three columns: day, month and year.

Click on create a mapplet and add an input, an output, and a parser transformation. We will parse the date field using the slash character as the delimiter and use regular expressions to validate the day, month and year. It’s important to note that the parser transformation creates two extra output ports by default: one for data that does not match the parser regular expressions, and an overflow port that holds data when there are not enough output ports for the correctly parsed values.
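As a rough illustration of the behaviour we are configuring (the validation patterns below are assumptions for the example; the real ones are defined in the token sets later on), the parsing logic amounts to something like this:

import re

# Assumed validation patterns for the day, month and year tokens.
DAY = r'^(0?[1-9]|[12][0-9]|3[01])$'
MONTH = r'^(0?[1-9]|1[0-2])$'
YEAR = r'^(19|20)[0-9]{2}$'

def parse_date(value):
    parts = value.split('/')  # the slash is our delimiter
    if len(parts) == 3 and all(re.match(p, v) for p, v in zip((DAY, MONTH, YEAR), parts)):
        return {'day': parts[0], 'month': parts[1], 'year': parts[2], 'unparsed': ''}
    # anything that does not fit ends up in the unparsed output port
    return {'day': '', 'month': '', 'year': '', 'unparsed': value}

print(parse_date('02/01/2015'))  # split into day, month and year
print(parse_date('year 2015'))   # routed to the unparsed port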

In the parser transformation, select the token parser when prompted:

Figure 13: Configuration of the parser type in the parser transformation

Name the port in the input as date, then drag and drop the date from the input transformation to the token parser; then go to the parser transformation and add one strategy with three outputs, day, month and year. Each of these ports will have a custom regular expression with “/” as the delimiter.

Figure 14: Parser configuration

Click on next and select token parser, and then select token sets in the token definition and click on choose. In the token set selection, we create a new expression for every output port of the parser transformation:

Figure 15: Configuration of token sets for parsing data

We add the monthOfYear custom token set with the regular expression shown in image 16:

Figure 16: Token set configuration for the month number

Once the token set has been added, assign it to the month column.

We have to repeat the same process with the proper regular expressions for each column, and once all the columns have been assigned, the parser mapplet should look like this in image 17:

Figure 17: Mapplet for parsing the date into day, month, and year columns using a parser transformation

We can now add the mapplet to the mapping to get a preview of the results:

Figure 18: Results preview of the parsing mapplet

We can see that there are some records that do not meet the regular expressions we set in the parser, so we have to set a default value for those records that have populated the UnparsedOutput port.

Continuing with the mapplet, we are going to add a quarter port to the output and replace the hyphens with the string “NA”. To do this, we need to add two expressions to the mapping: one to create the quarter column and another to replace the hyphens with “NA”. We create an expression with one port that concatenates the quarter with the year; in the same expression we add a port that replaces the hyphens with “NA”. A decision then populates (or not) the quarter output, depending on the unparsed port from the parser: if it is empty, the date was parsed correctly and the quarter field is populated; if not, the date was wrong, and the quarter is populated with “NA”. The code in the decision strategy will look like this:

Figure 19: Expression to generate the quarter based on the result of the parsing of the date
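The exact syntax is the one shown in the figure; as a hedged sketch of the same decision logic (illustrative Python, not the Informatica expression itself):

# Populate the quarter only when the parser left nothing unparsed.
def derive_quarter(unparsed, month, year):
    if unparsed:  # the date could not be parsed
        return 'NA'
    quarter = (int(month) - 1) // 3 + 1
    return 'Q{} {}'.format(quarter, year)

print(derive_quarter('', '2', '2015'))     # -> 'Q1 2015'
print(derive_quarter('bad date', '', ''))  # -> 'NA'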

Our mapplet should look like image 20:

Figure 20: Standardization mapping with quarter parsing

 

2. Creation of the standardization mapping

Now we can validate all the mapplets and add them to a mapping where we can also add the source file and a new output file with the same structure. This file will be the standardized data file. We are also going to add a union to merge the data from two different dates. The mapping should look like the one in image 21:

Figure 21: Final standardization mapping

After running the mapping, we can profile the generated file and check that it is meeting the rules that we defined at the beginning. We can see the path of the output file in the Run-time tab of the properties of the target:

Figure 22: Properties of the output transformation. We can see the name and path of the output generated in the Run-time tab

 

3. Results

Now we are ready to review the profile of the output file. For the currency column, we can see that the only value available is USD. If any other value appears, we can simply add it to a new column in the reference table. Notice that NULL values are appearing as we didn’t set a rule to standardize the NULL values to “NA”.

Figure 23: Results of the standardization process for the currency column

The year column is now standardized in the merged file and we have only numerical values after the data masking and standardization:

Figure 24: Results of the standardization process for the year column

We have fixed the quarter column to obtain standard values (quarterName Year) thanks to the expressions added to the mapplet:

Figure 25: Results of the standardization process for the quarter column

We have also fixed the hyphens in the keywordID column:

Figure 26: Results of the standardization process for the Keyword ID column

Conclusion

This concludes this article about Data Standardization with Informatica Data Quality. We have seen that Informatica has a lot of useful features to standardize data, and that it is a very user-friendly tool whilst still offering enough flexibility to perform complex standardization tasks.

Stay tuned for the last article in this series, where we are going to explain Data Deduplication using Informatica Data Quality.

If you would like to know more about the Data Quality Services we offer click here!

Data Quality with Informatica – Part 1: Data Profiling


Data Quality – Part 1: Data Profiling using INFA

Welcome to the first article in the Informatica Data Quality series, where we are going to run through the basics of Informatica Analyst and the main features of Informatica Developer for data profiling.

Informatica is one of the most important data integration vendors in the market; they are behind PowerCenter, a very well-known ETL tool that can be integrated with other Informatica tools, such as Informatica Analyst, a web application used by data analysts to analyse data and create data profiles, among other tasks. In the sections below we are going to go through the necessary steps to create a data profile, a business rule for column profiling and finally a scorecard to view the results.

 

1. Create a Data Profile

To start profiling our data, first open the browser, log into the Analyst tool (the default URL is http://infaServerName:8085/AnalystTool) and create a new project, which we’ll call Data_Profiling_Example:

Figure 1: Creating a project in Informatica Analyst

Now we add a data source; in this example we are going to load a file with information from AdWords. For demonstration purposes, several errors have been introduced into the file, like different date formats. To add a file, click on the actions menu on the right-hand side of the window and click add flat file:

Figure 2: Adding data from a file in Informatica Analyst

Importing data from files is straightforward if we follow the wizard. In this example, we are going to set comma separated values, header present, data starting in line 2, and all the columns will be strings. The tool will automatically detect the length of the fields.

Figure 3: Add flat file wizard in Informatica Analyst

Now we need to create a new profile for our data, and we can do this by clicking on new profile on the menu on the right. In the wizard, we select our data file and accept all the default values.
Once the profile has been created we can review the values of the data, the percentage of nulls, and term frequencies in the results panel, as well as being able to analyse the patterns of the data values for every column. We can also view a summary of basic statistics, such as the max value, the min value and the top and bottom values for each column.

Figure 4: Profiling results in Informatica Analyst

In our example we can see several issues in the data of the file. For example, in the image below we can see  that the year is incorrect for some records (we are assuming that the file should contain just the numeric value for the year). In this example, the file should only contain data for 2nd January 2015, so it looks like the file has some invalid records, as there are some records with a different year, and others with a wrong value. This could be due to a bad extraction from the source system, or a wrong delimiter in some rows. In order to measure the quality of the file, we are now going to create some business rules, add them to the data profile, and finally create a visualization.

The data analysts from our organization have given us the following business rules:

the year must be 2015 for this file
the day column must always be 1/2/2015
the file should only contain Enabled campaigns

We will create two business rules to validate the year and the day columns, and for the Enabled campaigns we will set up the value Enabled in the campaign_status column as valid.

We can create the business rules in two ways: by using the expression builder in the Analyst tool, or by creating a mapping using Informatica Developer. To create the business rule directly in the profile we simply click on edit, then on the column profiling rules, and then on the plus sign to add a rule.

Figure 5: Creating rules in Informatica Analyst

Then we select new rule for the year column and enter the expression you can see in the following image. We can save the rule as reusable; this way we will be able to apply exactly the same rule for a different column in the file if necessary.

Figure 6: New rule wizard in Informatica Analyst
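The logic of that expression is deliberately simple; as a hedged illustration (the real rule is written in the Analyst expression builder, not in Python):

# Year rule: flag whether the year column holds the expected value.
def year_is_valid(year, expected='2015'):
    return 'Valid' if year.strip() == expected else 'Invalid'

print(year_is_valid('2015'))       # Valid
print(year_is_valid('year 2015'))  # Invalid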

We will implement the second rule in the Developer tool. To do this, we open Informatica Developer and connect to our project, then create a mapplet with an input transformation, an expression and an output transformation, and save it as DayValidator. To validate the rule, we can right-click on the rule and select validate.

Figure 7: Creating a rule in Informatica Developer

We will define the expression with three possible output values: not a date, Valid date and Invalid date.

Figure 8: Defining rules in Informatica Developer
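As a hedged Python sketch of the same three-way logic (the actual rule is an Informatica expression; the date format below is an assumption for the example):

from datetime import datetime

# DayValidator: 'not a date', 'Valid date' or 'Invalid date'.
def day_validator(value, expected='1/2/2015', fmt='%d/%m/%Y'):
    try:
        parsed = datetime.strptime(value, fmt)
    except ValueError:
        return 'not a date'
    return 'Valid date' if parsed == datetime.strptime(expected, fmt) else 'Invalid date'

print(day_validator('1/2/2015'))   # Valid date
print(day_validator('5/3/2015'))   # Invalid date
print(day_validator('not-a-day'))  # not a date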

Once the rule has been created, we can go back to Informatica Analyst, edit the profile and now, instead of creating a new rule, we are going to apply the DayValidator rule we just created in Developer to the day column in the profile. We will call the output of the rule IsValidDay:

Figure 9: New rule wizard in Informatica Analyst

Now we are ready to run the profile again and review the outcome of the two different rules:

Figure 10: Data profiling project in Informatica Analyst

Reviewing the results, we can see that the data contains wrong values for Day and Year:

Figure 11: Reviewing profiling results in Informatica Analyst

 

2. Create a Scorecard for the Profile

Now that we have executed and checked the profile, we can create a scorecard to measure the quality of the file as the last step in this data quality assessment. In order to do this, we have to go to the profile and add it to a new scorecard. We can define the valid values for each column in our data. In this example, we are going to create the scorecard with three metrics called scores (both outputs from the rules and the campaign status) and then select the valid values for each different score.

The scorecard allows us to drill down from the score to the data. We select the key of the file (first three columns), the campaign status, and the output from both rules as drilldown columns; this way we can easily export the invalid rows to a file and send the results to the owner of the data so they can fix the wrong values and re-run the proper processes to update the data.

Figure 12: Data profiling scorecard in Informatica Analyst

This concludes the first article in this series about Data Quality with Informatica.
In the next couple of blog posts we’re going to explain how to standardize and deduplicate data. Stay tuned!
If you would like to know more about the Data Quality Services we offer click here!

Extracting data from Taleo


Over the past few years we have seen companies focusing more and more on Human Resources / Human Capital activities. This is no surprise, considering that nowadays a large number of businesses depend more on people skills and creativity than on machinery or capital, so hiring the right people has become a critical process. As a consequence, more and more emphasis is put on having the right software to support HR/HC activities, and this in turn leads to the necessity of building a BI solution on top of those systems for a correct evaluation of processes and resources. One of the most commonly used HR systems is Taleo, an Oracle product that resides in the cloud, so there is no direct access to its underlying data. Nevertheless, most BI systems are still on-premise, so if we want to use Taleo data, we need to extract it from the cloud first.

 

1. Taleo data extraction methods

As mentioned before, there is no direct access to Taleo data; nevertheless, there are several ways to extract it, and once extracted we will be able to use it in the BI solution:

Querying Taleo API
Using Cloud Connector
Using Taleo Connect Client

The API is very robust, but it is the most complex of these methods, since it requires a separate application to be written. Usually, depending on the configuration of the BI system, either Oracle Cloud Connector, Taleo Connect Client or a combination of both is used.

Figure 1: Cloud connector in ODI objects tree

Oracle Cloud Connector is a component of OBI Apps, and essentially it’s Java code that replicates Taleo entities / tables. It’s also easy to use: just by creating any Load Plan in BIACM using Taleo as the source system, a series of calls to Cloud Connector are generated that effectively replicate the Taleo tables to the local schema. Although it works well, it has two significant disadvantages:

It’s only available as a component of BI Apps
It doesn’t extract Taleo UDFs

So even if we have BI Apps installed and we use Cloud Connector, there will be some columns (UDFs) that will not get extracted. This is why the use of Taleo Connect Client is often a must.

 

2. Taleo Connect Client

Taleo Connect Client is a tool that is used to export or import data from / to Taleo. In this article we’re going to focus on extraction. It can extract any field, including UDFs, so it can be used in combination with BI Apps Cloud Connector or, if that’s not available, then as a unique extraction tool. There are versions for both Windows and Linux operating systems. Let’s look at the Windows version first.

Part 1 - Installation & running: Taleo Connect Client can be downloaded from the Oracle e-delivery website; just type Taleo Connect Client into the search box and you will see it on the list. Choose the necessary version, select the Application Installer and the Application Data Model (required!), remembering that it must match the version of the Taleo application you will be working with; then just download and install. Important – the Data Model must be installed before the application is installed.

Figure 2: Downloading TCC for Windows

After TCC is installed, we run it, providing the necessary credentials in the initial screen:

Figure 3: Taleo Connect Client welcome screen

And then, after clicking on ‘ping’, we connect to Taleo.  The window that we see is initially empty, but we can create or execute new extracts from it. But before going on to this step, let’s find out how to see the UDFs: in the right panel, go to the ‘Product integration pack’ tab, also selecting the correct product and model. Then, in the Entities tab, we can see a list of entities / tables, and in fields / relations, we can see columns and relations with other entities / tables (through foreign keys). After the first run, you will probably have some UDFs that are not on the list of fields / relations available. Why is this? Because what we initially see in the field list are only Taleo out-of-the-box fields, installed with the Data Model installer. But don’t worry, this can easily be fixed: use the ‘Synchronize custom fields’ icon (highlighted on the screenshot). After clicking on it you will be taken to a log-on screen where you’ll have to provide log-on credentials again, and after clicking on the ‘Synchronize’ button, the UDFs will be retrieved.

Figure 4: Synchronizing out-of-the-box model with User Defined Fields

Figure 5: Synchronized list of fields, including some UDFs (marked with 'person' icon)

Part 2 - Preparing the extract: Once we have all the required fields available, preparing the extract is pretty straightforward. Go to File->New->New Export Wizard, then choose the desired Entity, and click on Finish. Now, in the General window, set Export Mode to ‘CSV-Entity’, and in the Projections tab, select the columns that you want to extract by dragging and dropping them from the Entity->Structure window on the right. You can also add filters or sort the result set. Finally, save the export file.

The other component necessary to actually extract the data is the so-called configuration. To create it, we select File->New->New Configuration Wizard, then we point to the export file that we created in the previous step and, in the subsequent step, to our endpoint (the Taleo instance that we will extract the data from). Then, on the following screen, there are more extract parameters, like request format and encoding, extract file name format and encoding and much more. In most cases, using the default values will let us extract the data successfully, so unless it’s clearly required, there is no need to change anything.

Now the configuration file can be saved and the extraction process can start, just by clicking on the ‘Execute the configuration’ button (on the toolbar just below the main menu). If the extraction is successful, all the indicators in the Monitoring window on the right will turn green, as in the screenshot below.

Figure 6: TCC successful extraction

By using a bat file created during the installation, you can schedule TCC jobs with Windows Scheduler to run on a regular basis, but it’s much more common to have your OBI / BI Apps (or almost any other DBMS that your organization uses as a data warehouse) installed on a Linux / Unix server. This is why we’re going to have a look at how to install and set up TCC in a Linux environment.
 
Part 3 - TCC in a Linux / Unix environment: TCC setup in a Linux / Unix environment is a bit more complex. To simplify it, we will reuse some of the components that were already created when we worked with the Windows TCC; although the frontend of the application is totally different (to be precise, there is no frontend at all in the Linux version as it’s strictly command-line), the way the data is extracted from Taleo is exactly the same (using extracts designed as XML files and the Taleo APIs). So, after downloading the application installer and data model from edelivery.oracle.com, we install both components. Installation is actually just extracting the files, first from zip to tgz, and then from tgz to the uncompressed content. But this time, unlike in Windows, we recommend installing (extracting) the application first, and then extracting the data model files to an application subfolder named ‘featurepacks’ (this must be created, as it doesn’t exist by default). It’s also necessary to create a subfolder ‘system’ in the application directory. Once this is done, you can move some components of your Windows TCC instance to the Linux one (of course, if you have no Windows machine available, you can create any of these components manually):

Copy file default.configuration_brd.xml from windows TCC/system to the Linux TCC/system
Copy extract XML and configuration XML files, from wherever you had them created, to the main Linux TCC directory

There are also some changes that need to be made in the TaleoConnectClient.sh file

Set JAVA_HOME variable there, at the top of the file (just below #!/bin/bash line), setting it to the path of your Java SDK installation (for some reason, in our installation, system variable JAVA_HOME wasn’t captured correctly by the script)
In the line below #Execute the client,  after the TCC_PARAMETERS variable, add:

✓ parameters of proxy server if it is to be used:

-Dhttp.proxyHost=ipNumber   –Dhttp.proxyPort=portNumber

✓ path of Data Model:

-Dcom.taleo.integration.client.productpacks.dir=/u01/oracle/tcc-15A.2.0.20/featurepacks

So, in the end, the TaleoConnectClient.sh file in our environment has the following content (IP addresses were ‘masked’):

#!/bin/sh
JAVA_HOME=/u01/middleware/Oracle_BI1/jdk

# Make sure that the JAVA_HOME variable is defined
if [ ! "${JAVA_HOME}" ]
then
    echo +-----------------------------------------+
    echo "+ The JAVA_HOME variable is not defined.  +"
    echo +-----------------------------------------+
    exit 1
fi

# Make sure the IC_HOME variable is defined
if [ ! "${IC_HOME}" ]
then
    IC_HOME=.
fi

# Check if the IC_HOME points to a valid Taleo Connect Client folder
if [ -e "${IC_HOME}/lib/taleo-integrationclient.jar" ]
then
    # Define the class path for the client execution
    IC_CLASSPATH="${IC_HOME}/lib/taleo-integrationclient.jar":"${IC_HOME}/log"

    # Execute the client
    ${JAVA_HOME}/bin/java ${JAVA_OPTS} -Xmx256m ${TCC_PARAMETERS} \
        -Dhttp.proxyHost=10.10.10.10 -Dhttp.proxyPort=8080 \
        -Dcom.taleo.integration.client.productpacks.dir=/u01/tcc_linux/tcc-15A.2.0.20/featurepacks \
        -Dcom.taleo.integration.client.install.dir="${IC_HOME}" \
        -Djava.endorsed.dirs="${IC_HOME}/lib/endorsed" \
        -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl \
        -Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl \
        -Dorg.apache.commons.logging.Log=org.apache.commons.logging.impl.Log4JLogger \
        -Djavax.xml.xpath.XPathFactory:http://java.sun.com/jaxp/xpath/dom=net.sf.saxon.xpath.XPathFactoryImpl \
        -classpath ${IC_CLASSPATH} com.taleo.integration.client.Client ${@}
else
    echo +-----------------------------------------------------------------------------------------------
    echo "+ The IC_HOME variable is defined as (${IC_HOME}) but does not contain the Taleo Connect Client"
    echo "+ The library ${IC_HOME}/lib/taleo-integrationclient.jar cannot be found."
    echo +-----------------------------------------------------------------------------------------------
    exit 2
fi

Once this is ready, we can also apply the necessary changes to the extract and configuration files, although there is no need to change anything in the extract definition (file blog_article_sq.xml). Let’s have a quick look at the content of this file:

<?xml version="1.0" encoding="UTF-8"?>
<quer:query productCode="RC1501" model="http://www.taleo.com/ws/tee800/2009/01"
            projectedClass="JobInformation" locale="en" mode="CSV-ENTITY"
            largegraph="true" preventDuplicates="false"
            xmlns:quer="http://www.taleo.com/ws/integration/query">
  <quer:subQueries/>
  <quer:projections>
    <quer:projection><quer:field path="BillRateMedian"/></quer:projection>
    <quer:projection><quer:field path="JobGrade"/></quer:projection>
    <quer:projection><quer:field path="NumberToHire"/></quer:projection>
    <quer:projection><quer:field path="JobInformationGroup,Description"/></quer:projection>
  </quer:projections>
  <quer:projectionFilterings/>
  <quer:filterings/>
  <quer:sortings/>
  <quer:sortingFilterings/>
  <quer:groupings/>
  <quer:joinings/>
</quer:query>

Just by looking at the file we can figure out how to add more columns manually: we just need to add more quer:projection tags, like

<quer:projection><quer:field path="DesiredFieldPath"/></quer:projection>
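If there are many fields or UDFs to add, a small throwaway script can generate the projection tags for you (the field names below are hypothetical):

# Generate <quer:projection> tags for a list of field paths.
fields = ['BillRateMedian', 'JobGrade', 'NumberToHire', 'MyCustomUDF']
for field in fields:
    print('<quer:projection><quer:field path="{}"/></quer:projection>'.format(field))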

With regard to the configuration file, we need to make some small changes: in the cli:SpecificFile and cli:Folder tags, absolute Windows paths are used. Once we move the files to Linux, we need to replace them with Linux filesystem paths, absolute or relative.

Once the files are ready, the only remaining task is to run the extract, which is done by running:

./TaleoConnectClient.sh blog_article_cfg.xml 

See the execution log:

[KKanicki@BIApps tcc-15A.2.0.20]$ ./TaleoConnectClient.sh blog_article_cfg.xml
2017-03-16 20:18:26,876 [INFO] Client - Using the following log file: /biapps/tcc_linux/tcc-15A.2.0.20/log/taleoconnectclient.log
2017-03-16 20:18:26,876 [INFO] Client - Using the following log file: /biapps/tcc_linux/tcc-15A.2.0.20/log/taleoconnectclient.log
2017-03-16 20:18:27,854 [INFO] Client - Taleo Connect Client invoked with configuration=blog_article_cfg.xml, request message=null, response message=null
2017-03-16 20:18:27,854 [INFO] Client - Taleo Connect Client invoked with configuration=blog_article_cfg.xml, request message=null, response message=null
2017-03-16 20:18:31,010 [INFO] WorkflowManager - Starting workflow execution
2017-03-16 20:18:31,010 [INFO] WorkflowManager - Starting workflow execution
2017-03-16 20:18:31,076 [INFO] WorkflowManager - Starting workflow step: Prepare Export
2017-03-16 20:18:31,076 [INFO] WorkflowManager - Starting workflow step: Prepare Export
2017-03-16 20:18:31,168 [INFO] WorkflowManager - Completed workflow step: Prepare Export
2017-03-16 20:18:31,168 [INFO] WorkflowManager - Completed workflow step: Prepare Export
2017-03-16 20:18:31,238 [INFO] WorkflowManager - Starting workflow step: Wrap SOAP
2017-03-16 20:18:31,238 [INFO] WorkflowManager - Starting workflow step: Wrap SOAP
2017-03-16 20:18:31,249 [INFO] WorkflowManager - Completed workflow step: Wrap SOAP
2017-03-16 20:18:31,249 [INFO] WorkflowManager - Completed workflow step: Wrap SOAP
2017-03-16 20:18:31,307 [INFO] WorkflowManager - Starting workflow step: Send
2017-03-16 20:18:31,307 [INFO] WorkflowManager - Starting workflow step: Send
2017-03-16 20:18:33,486 [INFO] WorkflowManager - Completed workflow step: Send
2017-03-16 20:18:33,486 [INFO] WorkflowManager - Completed workflow step: Send
2017-03-16 20:18:33,546 [INFO] WorkflowManager - Starting workflow step: Poll
2017-03-16 20:18:33,546 [INFO] WorkflowManager - Starting workflow step: Poll
2017-03-16 20:18:34,861 [INFO] Poller - Poll results: Request Message ID=Export-JobInformation-20170316T201829;Response Message Number=123952695;State=Completed;Record Count=1;Record Index=1;
2017-03-16 20:18:34,861 [INFO] Poller - Poll results: Request Message ID=Export-JobInformation-20170316T201829;Response Message Number=123952695;State=Completed;Record Count=1;Record Index=1;
2017-03-16 20:18:34,862 [INFO] WorkflowManager - Completed workflow step: Poll
2017-03-16 20:18:34,862 [INFO] WorkflowManager - Completed workflow step: Poll
2017-03-16 20:18:34,920 [INFO] WorkflowManager - Starting workflow step: Retrieve
2017-03-16 20:18:34,920 [INFO] WorkflowManager - Starting workflow step: Retrieve
2017-03-16 20:18:36,153 [INFO] WorkflowManager - Completed workflow step: Retrieve
2017-03-16 20:18:36,153 [INFO] WorkflowManager - Completed workflow step: Retrieve
2017-03-16 20:18:36,206 [INFO] WorkflowManager - Starting workflow step: Strip SOAP
2017-03-16 20:18:36,206 [INFO] WorkflowManager - Starting workflow step: Strip SOAP
2017-03-16 20:18:36,273 [INFO] WorkflowManager - Completed workflow step: Strip SOAP
2017-03-16 20:18:36,273 [INFO] WorkflowManager - Completed workflow step: Strip SOAP
2017-03-16 20:18:36,331 [INFO] WorkflowManager - Completed workflow execution
2017-03-16 20:18:36,331 [INFO] WorkflowManager - Completed workflow execution
2017-03-16 20:18:36,393 [INFO] Client - The workflow execution succeeded.

And that’s it! Assuming our files were correctly prepared, the extract will be ready in the folder declared in the cli:Folder tag of the configuration file.

As for scheduling, different approaches are available, the most basic being to use the Linux crontab as the scheduler, but you can also easily use any ETL tool already in place in your project. See the screenshot below for an ODI example:

Figure 7: TCC extracts placed into 1 ODI package

The file extract_candidate.sh contains a simple call to the TCC extraction:

[KKanicki@BIApps tcc-15A.2.0.20]$ cat extract_candidate.sh
#!/bin/bash
cd /u01/tcc_linux/tcc-15A.2.0.20/
./TaleoConnectClient.sh extracts_definitions/candidate_cfg.xml

If your extracts fail or you have any other issues with configuring Taleo Connect Client, feel free to ask us in the comments section below!

In the last couple of years we have delivered several highly successful BI Projects in the Human Resources / Human Capital space! Don't hesitate to contact us if you would like to receive specific information about these solutions!

Use JavaScript to embed Tableau dashboards into SharePoint


How to use JavaScript to embed Tableau dashboards into SharePoint with automatic sizing and current date quick filter selection?

As we all know, Tableau comes with its own web-based repository (Tableau Server), but sometimes we have to consume the dashboard that we create in other platforms. One of them is SharePoint, Microsoft’s web-based application, often used by enterprises to share content with their employees and departments.

Figure 1: Schema of the integration between SharePoint, Tableau Server and the enterprise accounts

Tableau allows us to easily share our dashboards by embedding them in JavaScript code or iFrames, providing ready-to-use code snippets that we just need to copy and paste in our solution.

These are very useful, but what if we want to go even further? For example, it’s currently not possible to set a quick filter to automatically select the current date. What if we want our dashboard to show data for the current year or month by default?

In this article, we’re going to explain how to embed dashboards into SharePoint, how to dynamically resize them and how to apply the default date quick filter with the help of some JavaScript power.

 

1. Embedding the dashboard into SharePoint

To embed the dashboard into SharePoint, we must first create a page with an “Embed Code” container. Once your SharePoint page is ready, click on Edit, then Insert, and select Embed Code. This particular type of container, as its name suggests, allows us to embed code inside SharePoint, and it will be our door to Tableau.

Figure 2: SharePoint’s Insert toolbar with highlighted “Embed code” button

Now that your container is ready, you will see a canvas where you can write your code. Let’s start by embedding the dashboard as it is, with an iFrame.

First of all, we need the dashboard link. We can find it in Tableau Server: just navigate to the dashboard that you want to embed into SharePoint and click on Share, in the upper-right corner. You will see two links: “Embed Code” and “Link”. The first is a JavaScript snippet, ready to be copied and pasted; the second is a pure URL to the dashboard. Click on the second and copy it.

Figure 3: Tableau Server’s “share” option and corresponding links

We are using this link because it is slightly less verbose; it’s also interesting to see how the two technologies can easily integrate.

Now that the link is ready, return to the Embed Code container canvas and write a simple HTML page that contains an iFrame, and use the link as the source URL, like this:

<!DOCTYPE html>
<html>
  <body><center>
    <iframe src="(Tableau Server URL)" id="myIframe"></iframe>
  </center></body>
</html>

Click on insert, and the dashboard will appear in your page!
 

2. Dynamic Sizing

Now that your dashboard is in SharePoint, you might want to change its size. Sometimes, to make our dashboards look as cool as they can on every screen, we set the resolution to automatic in Tableau Desktop; this is usually a great idea, but in SharePoint we might have big headers, multiple contents on the same browser page, etc., which don’t let our dashboard look as good as it could. We can solve this issue by changing the size of the Embed Code container automatically, with just a few lines of JavaScript.

Edit the container that you just set up, and add the following script AFTER the iFrame (but before the closing body tag):

<script type="text/javaScript">
var iframe = document.getElementById('myIframe');
	
//Automatic size: the following lines of code adapt the dashboard size to the browser window
var height = window.innerHeight;
var width = window.innerWidth;					   
iframe.height = height*0.95;
iframe.width = width*0.95;

document.getElementById('contentBox').style.marginLeft="0px";
document.getElementById('contentBox').style.marginRight="0px";
document.getElementById('contentRow').style.paddingTop="0px";
	
var containers = document.getElementsByClassName('container-fluid');
var container = containers[1];
container.style.marginLeft="0px";
container.style.marginRight="0px";
container.style.paddingLeft="0px";
container.style.paddingRight="0px";
		
containers = document.getElementsByClassName('welcome blank-wp');
container = containers[0];
container.style.paddingLeft="0px";
container.style.paddingRight="0px";
container.style.paddingTop="0px";
container.style.paddingBottom="0px";

containers = document.getElementsByClassName('welcome-content');
container = containers[0];
container.style.marginTop="0px";	
</script>

Again, click on Insert, and your dashboard will automatically resize when it’s loaded! In this script, we take the actual size of the browser and use it as a reference to change our “contentBox” size. In the example we basically use 95% of the available space in both dimensions, but you can change it as you wish. Notice how we also remove all possible padding.
 

3. Setting default current date filter

Our dashboard is now in SharePoint and it automatically resizes. The next step is to add automatic filters.

Tableau allows us to manually set parameters in the URL, with the classic name=value combination. We will exploit this, changing the parameter list when loading the page.
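For instance, a dashboard URL with a filter or parameter applied typically has this shape (the server, workbook and view names below are hypothetical):

http://tableauServer/views/SalesWorkbook/SalesOverview?Year-Month=2017-03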

Let’s say, for example, that we have a date dimension called Year-Month, in the format YYYY-MM, and we want our dashboard to automatically filter to the current month every time we open it. This is something that Tableau currently can’t do (you can vote for the idea in the Tableau Community). Luckily, JavaScript comes to the rescue.

Add the following lines of code after your iFrame (the additional comments help understand the various steps):

//get the parameters from the URL
var loc = window.location.href;
var params = "";
if(loc.indexOf('?') >= 0) params = loc.split('?')[1];	
	
//get the actual iframe URL
var iframe = document.getElementById('myIframe');
var iframe_url = iframe.src;

//if the date parameter doesn't exist, set the actual Year-Month	
n = params.indexOf("Year-Month"); 		
if(n<0){ //if the parameter is not in the URL yet…
var today = new Date();
var mm = today.getMonth()+1; //JS’s months go from 0 to 11
var yyyy = today.getFullYear();			
if(mm<10) mm="0"+mm;

//append the new parameter to the other ones (if any)
var ym = "Year-Month="+yyyy+"-"+mm;
if (params.length > 0) params=params+'&'+ym;
else params = ym;
}
	
//set the new iframe URL (or restore the original one)		
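// note: this assumes iframe_url already ends with '?' or '&' (or that params
// stays empty); if your share link has no query string, prepend '?' to params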
iframe.src = iframe_url+params;

That’s it! By clicking Insert, your dashboard will filter according to the current month, automatically. Pretty cool!

 

4. Next steps

This is a pretty basic but highly useful implementation of Tableau embedding. We used the date in our example, but you can easily set whatever parameter you want. You can also set up a URL action in your dashboard that points to a SharePoint page, and then write JavaScript code that processes the various parameters, in order to apply the filters automatically when the dashboard is loaded in that page.

For an even better use of the tool, don’t forget to check out the Tableau APIs:

JavaScript API
REST API

Contact us if you would like to hear more about the topic or need any help embedding your Tableau dashboards into SharePoint.

 

Advanced analytics: Tableau + Python


Tableau has released TabPy, Tableau Python Server, an API that enables Python code evaluation within Tableau. Thanks to TabPy, you can create calculated fields using Python code in Tableau 10.2. As significant as the R connection with Tableau 8.1, now Python lovers will be able to leverage all the power of advanced analytics and visualize the results. Python is a widely used general-purpose programming language, and a large number of Python libraries are available to perform statistical analysis, predictive models or machine learning algorithms. Tableau and Python are a perfect match, a dream team for any data scientist.

Here at ClearPeaks we are convinced that connecting TabPy with Tableau is one of the best approaches for predictive analytics. So how does this new feature work? We are already working with this technology, and are proud to offer a demonstration of Python code running within a Tableau environment.

In this article you will learn how to connect Tableau with TabPy; just follow the example to see how to use this new feature and you´ll be ready to start your own journey in this new advanced data analytics environment.

Prerequisites:

Tableau Desktop 10.2
TabPy (Tableau Python Server)
Anaconda Navigator

 

1. Installing and setting up Anaconda Navigator

Anaconda is the leading open data science platform powered by Python, whose open source version is a high performance distribution of Python and R, and includes over 100 of the most popular Python, R and Scala packages for data science.

Download Anaconda Navigator, Python 2.7 and 64-bit version here.

TabPy is only supported in a Python 2.7 environment.

Install Anaconda Navigator with the default options in the installation set-up.

Figure 1: Anaconda Navigator installation

Once Anaconda Navigator has been installed we need to create a new environment:

Click on “Create” and type “Tableau-Python-Server” in the environment name, then click “Create”.

Figure 2: Creating a new environment in Anaconda

This new environment will be used when connecting Tableau with TabPy. We also need some packages in order to perform mathematical calculations, apply machine learning algorithms, plot, define dataframes, create predictive models or edit Python code. We’re going to install the required packages: “Numpy”, “scikit-learn”, “matplotlib”, “pandas”, “statsmodels” and “ipython”; search for the packages in the “Not Installed” window and install them all.

Figure 3: Installing Python Packages
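Once the installation finishes, a quick sanity check (optional, but handy) is to open a Python prompt in the Tableau-Python-Server environment and confirm that the packages import correctly:

# These imports should all succeed inside the Tableau-Python-Server environment.
import numpy
import pandas
import matplotlib
import sklearn
import statsmodels
import IPython

print(pandas.__version__, statsmodels.__version__)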

You can also download “Spyder”, a powerful Python IDE with advanced editing, interactive testing, debugging and introspection features, to try your Python code before using it in Tableau calculated fields.

Go to the Home tab of Anaconda Navigator, and select the created environment “Tableau-Python-Server”; you can install “Spyder” from there.

Figure 4: Anaconda Navigator

 

2. Installing TabPy

Running Python code within a Tableau workbook requires a Python server to execute it, and the TabPy framework gets the job done. Download TabPy from its GitHub repository and decompress the file.

Figure 5: Download Tabpy

On Windows prompt:

activate Tableau-Python-Server

In the TabPy folder, run the following commands to install the TabPy client and server as well as the package dependencies.

pip install -r ./tabpy-server/requirements.txt

pip install ./tabpy-client

pip install ./tabpy-server

As the packages are installed, you will see the install locations listed in the command line. These might look like /Users/username/anaconda/envs/Tableau-Python-Server/lib/python2.7/site-packages or \Users\username\anaconda\envs\Tableau-Python-Server\lib\site-packages depending on your environment.

Navigate to the tabpy_server folder under site-packages and run startup.bat 9001. 9001 is the port to connect Tableau with the TabPy server.

 

3. Connecting Tableau with TabPy

The next step is to connect Tableau 10.2 with TabPy. This can be done in Help > Settings and Performance > Manage External Service Connection:

Figure 6: Connecting Tableau + Python Server

Test the connection and you should get a “successfully connected” prompt. Now you can check out an example of predictive analysis.

 

4. Using Tableau + Python for predictive analytics

In this section, you can follow through an example of how to visualize a Time Series prediction with Tableau.

We’re going to use a dataset called AirPassengers, containing historic data on the number of passengers per month of an airline company. First, we open a new Tableau workbook and connect to the AirPassengers.csv dataset, then we create all calculated fields and parameters:

Figure 7: Calculated fields and Dimensions

Define the parameters:

We can create parameters to choose the number of autoregressive terms, the number of moving average terms and the seasonal differencing order for our ARIMA time series model.

Figure 8: Autoregressive terms

Figure 9: Moving average terms

Figure 10: Seasonal differencing order

Figure 11: Months to forecast

Time dimensions need to be formatted.

Figures 12 and 13: Date parsing and Date shift

Finally we have to create the calculated fields. The measure “#Passengers” is defined as an integer; we should format this field as float (number, decimal), so that we can combine the axis of this measure with the predicted outcomes.

Figures 14 and 15: Synchronize the number of passengers with the corresponding month and differentiate forecast values from predicted past values

Now we’re going to embed Python code in the new calculated fields. The code embedded in the calculated field “Time Series” will return the predicted values of an ARIMA(p, d, q) model, where p=[AR (time lag)]; d=[I Seasonal Difference]; q=[MA (Moving Average)], with the parameters we created above. Find the Python code below:

SCRIPT_REAL(
'
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
dates = _arg1
passeng = _arg2
order_arima = min(_arg3)
seasonal_diff = min (_arg4)
ma_param = min (_arg5)
months_forecast = min(_arg6)
ts = pd.DataFrame({"dates": dates,"Passengers": passeng})
ts["Passengers"] = ts["Passengers"].astype("float64")
ts = ts.set_index(["dates"])
ts_log = np.log(ts)
ts_log.index = pd.to_datetime(ts_log.index)
ts_log_diff = ts_log - ts_log.shift()
ts_log_diff["Passengers"][0] = 0
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(ts_log_diff, order=(order_arima, seasonal_diff, ma_param))
results_ARIMA = model.fit(disp=-1)
predictions_value = results_ARIMA.forecast(months_forecast)[0]
from dateutil.relativedelta import relativedelta
add_month = relativedelta(months=1)
predictions_dates = list()
for i in range(months_forecast):
    predictions_dates.append(results_ARIMA.fittedvalues.index[-1] + ((i+1)*add_month))
forecast_log_diff = pd.Series(predictions_value, index=predictions_dates)
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(np.asscalar(ts_log.ix[0]), index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)
forecast_log_diff_ARIMA = pd.Series(forecast_log_diff, copy=True)
forecast_ARIMA_log_diff_cumsum = forecast_log_diff_ARIMA.cumsum()
forecast_ARIMA_log = pd.Series(np.asscalar(ts_log.ix[-1]), index=forecast_log_diff_ARIMA.index)
forecast_ARIMA_log = forecast_ARIMA_log.add(forecast_ARIMA_log_diff_cumsum,fill_value=0)
forecast_ARIMA = np.exp(forecast_ARIMA_log)
return list(predictions_ARIMA)
',
ATTR([Month]), ATTR([#Passengers]), MIN([AR (Time lag)]), MIN([I Seasonal Difference]), 
MIN([MA (Moving average)]), MIN([Months Forecast]))

For the forecasting part, create the calculated field “Forecast”:

SCRIPT_REAL(
'
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
dates = _arg1
passeng = _arg2
order_arima = min(_arg3)
seasonal_diff = min (_arg4)
ma_param = min (_arg5)
months_forecast = min(_arg6)
ts = pd.DataFrame({"dates": dates,"Passengers": passeng})
ts["Passengers"] = ts["Passengers"].astype("float64")
ts = ts.set_index(["dates"])
ts_log = np.log(ts)
ts_log.index = pd.to_datetime(ts_log.index)
ts_log_diff = ts_log - ts_log.shift()
ts_log_diff["Passengers"][0] = 0
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(ts_log_diff, order=(order_arima, seasonal_diff, ma_param))
results_ARIMA = model.fit(disp=-1)
predictions_value = results_ARIMA.forecast(months_forecast)[0]
from dateutil.relativedelta import relativedelta
add_month = relativedelta(months=1)
predictions_dates = list()
for i in range(months_forecast):
    predictions_dates.append(results_ARIMA.fittedvalues.index[-1] + ((i+1)*add_month))
forecast_log_diff = pd.Series(predictions_value, index=predictions_dates)
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(np.asscalar(ts_log.ix[0]), index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)
forecast_log_diff_ARIMA = pd.Series(forecast_log_diff, copy=True)
forecast_ARIMA_log_diff_cumsum = forecast_log_diff_ARIMA.cumsum()
forecast_ARIMA_log = pd.Series(np.asscalar(ts_log.ix[-1]), index=forecast_log_diff_ARIMA.index)
forecast_ARIMA_log = forecast_ARIMA_log.add(forecast_ARIMA_log_diff_cumsum,fill_value=0)
forecast_ARIMA = np.exp(forecast_ARIMA_log)
forecast_ARIMA_2 = predictions_ARIMA.append(forecast_ARIMA)
forecast_ARIMA_2 = forecast_ARIMA_2[len(forecast_ARIMA):]
return list(forecast_ARIMA_2)
',
ATTR([Month]), ATTR([#Passengers]), min([AR (Time lag)]), MIN([I Seasonal Difference]),
MIN([MA (Moving average)]), MIN([Months Forecast]))

We have also included the sum of squared errors calculated field so that it is possible to compare different models.

Here’s the code:

SCRIPT_REAL(
'
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
dates = _arg1
passeng = _arg2
order_arima = min(_arg3)
seasonal_diff = min (_arg4)
ma_param = min (_arg5)
months_forecast = min(_arg6)
ts = pd.DataFrame({"dates": dates,"Passengers": passeng})
ts["Passengers"] = ts["Passengers"].astype("float64")
ts = ts.set_index(["dates"])
ts_log = np.log(ts)
ts_log.index = pd.to_datetime(ts_log.index)
ts_log_diff = ts_log - ts_log.shift()
ts_log_diff["Passengers"][0] = 0
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(ts_log_diff, order=(order_arima, seasonal_diff, ma_param))
results_ARIMA = model.fit(disp=-1)
predictions_value = results_ARIMA.forecast(months_forecast)[0]
from dateutil.relativedelta import relativedelta
add_month = relativedelta(months=1)
predictions_dates = list()
for i in range(months_forecast):
    predictions_dates.append(results_ARIMA.fittedvalues.index[-1] + ((i+1)*add_month))
forecast_log_diff = pd.Series(predictions_value, index=predictions_dates)
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(np.asscalar(ts_log.ix[0]), index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)
forecast_log_diff_ARIMA = pd.Series(forecast_log_diff, copy=True)
forecast_ARIMA_log_diff_cumsum = forecast_log_diff_ARIMA.cumsum()
forecast_ARIMA_log = pd.Series(np.asscalar(ts_log.ix[-1]), index=forecast_log_diff_ARIMA.index)
forecast_ARIMA_log = forecast_ARIMA_log.add(forecast_ARIMA_log_diff_cumsum,fill_value=0)
forecast_ARIMA = np.exp(forecast_ARIMA_log)
predictions_ARIMA_dataframe = pd.DataFrame(predictions_ARIMA)
SSE = sum((ts.ix[:,0]-predictions_ARIMA_dataframe.ix[:,0])**2)
return SSE
',
ATTR([Month]), ATTR([#Passengers]), MIN([AR (Time lag)]), MIN([I Seasonal Difference]),
MIN([MA (Moving average)]), MIN([Months Forecast]))

Finally, we’re going to design the view by dragging “Forecast date” (at month level) to columns, and “Number of Passengers” and “Forecast” to rows. Convert the plot to a dual axis and then synchronize the axes, obtaining this view:

Figure 16: Tableau + Python Predictive analysis view

Here you can see the real values as bars, predicted values as a blue line and forecast values as an orange line; you can play with different parameters to forecast more months or create a new ARIMA(p,d,q) model. Notice that the title is also dynamic - you can get this with:

Figure 17: Dynamic plot title

As you can see, it’s not the best model ever, so you can take it as a challenge to improve it!

 

Conclusion

To summarize, we have seen how to use Python code with Tableau, a new feature which gives us the opportunity to implement advanced analytics in our data. An example of Time Series prediction is included in this article. We can see a clear advantage to using this approach when compared to using the Tableau built-in tool set: adjustability of model parameters.

We hope you enjoyed this article and please leave a comment if you have any doubts, suggestions or opinions.

Stay tuned for future posts!

Click here if you would like to know more about the topic.
