Extending Oracle DVD Map functionality


1. The challenge

Although Oracle Data Visualization Desktop (DVD) has now caught up with the leaders in the data visualization quadrant, there are, as in any product, still areas of functionality that can be enhanced. With that in mind, we are excited to share with you an enhancement we built around the map functionality.

In the example below, you can see an out-of-the-box DVD map visualisation showing traffic violations in the US.


Figure 1: US traffic violations map chart

As you can see, only a limited amount of insight can be drawn from this visualization. Let's build on it by sizing the points by a measure - in this case, the number of accidents:


Figure 2: US traffic violations map chart - sizing added

Even after colouring the points by the fine amount measure (Fine Amt), it still looks like something is missing.


Figure 3: US traffic violations - formatted version

We could perhaps improve our visualisation by colouring by a dimension instead of by a measure; however, DVD does not allow colouring by dimension values at this time.

 

2. DVD 12.2.3 Plugin

Let's try downloading the Custom Points Map Plugin from the Oracle Store instead.


Figure 4: US traffic violations - plugin used

OK, that looks nicer. The ability to control the size of the dots really helps chart comprehension, but the plugin still lacks some helpful functionality, such as colouring values by dimension. Closer inspection reveals further shortcomings: what happens if you decide to maximize or minimize the visualisation (⤢)?


Figure 5: Maximize map chart

Well, not the best vantage point. Of course, you can resolve this by simply zooming in several times… until the next time you have to maximize it again.

 

3. The ClearPeaks solution

So are you ready for a bit of magic? Let us present the solution that ClearPeaks developed to enhance the user experience. With a few adjustments to the code, we were able to address all of these issues while using the same visualisation tool (Oracle DVD 12.2.3) and plugin (Custom Points Map Plugin) as before. Alright, now we are ready to explore our data!


Figure 6: US traffic violations - using updated plugin version 1

Our first order of business was to resolve the dimension colour problem, enabling this feature for our customer. This was such an improvement! Now it's much easier to find recognizable patterns or dependencies. We also thought it might be wise to implement a colour gradient for measures, to promote further and more accurate analysis of our data, so we did that too!


Figure 7: US traffic violations - using updated plugin version 2

Oh, and the problem with the zooming in/out? Yeah, we fixed that as well. This is the final result of our tweaks:


Figure 8: US traffic violations - using updated plugin - final version

Looks much nicer, right?

Unfortunately, as nothing in life is ever truly perfect, there is still one inconvenience with our touch-up: legends are not available at this time. The good news is that there is a workaround for this problem, the “list chart”. To take advantage of it, add a second visualisation to your canvas and drag the dimension (SubAgency) into it. Select “list chart” as the visualisation type and you’ll have your legend.

Do not forget that DVD keeps the colour context within a canvas, so you should not worry about losing your legend.

Curious about how to make this solution work for you too? Luckily, you just have to follow the architecture of the plugin. Define which boxes (colour dimensions) you want to see and which values are allowed in them (i.e. colour boxes for dimension values, and a size box and gradient colours for measure values). Then make sure to include them in the plugin.xml file. After that, link them with the map visualisation in the pointsMapVizdatamodelhandler.js file. Further details about the colours, buckets, etc. can be defined in pointsMapViz.js. Of course, all changes should be kept in sync with the main map library used by Oracle - oraclemapsv2.js.
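
As a purely illustrative sketch of that idea (the element and attribute names below are assumptions made for illustration, not the actual plugin SDK schema), the declaration of the new drop targets in plugin.xml could conceptually look like this:

<!-- hypothetical structure: one drop target per new capability -->
<dataMap>
  <colorDimension name="colorDim"   accepts="dimension" maxItems="1"/>
  <colorGradient  name="colorRange" accepts="measure"   maxItems="1"/>
  <size           name="size"       accepts="measure"   maxItems="1"/>
</dataMap>

The handler scripts then pick these names up: pointsMapVizdatamodelhandler.js maps them onto the visualisation's data model, and pointsMapViz.js decides how the assigned columns translate into colours, buckets and point sizes.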

 

4. Demo

Check out a demo of this solution below:

Conclusion

Even though there are still some restrictions when using DVD as a reporting tool, Oracle is heading in the right direction with this new version (12.2.3), especially with the custom plugins concept. By giving users the opportunity to download and create new plugins and to adjust the plugin code to their own needs, Oracle DVD offers the flexibility that today's BI market demands. If we combine all this with ClearPeaks' know-how, it really is a recipe for turning every Oracle DVD project into a success.

Contact us if you’d like to receive more information about this solution.

Think Big, Think Data!


Big Data & Data Science
A recent publication in The Economist asserts that data has now surpassed oil as the world's most valuable resource. In the last two years alone, the world has collected and stored more data than in all previous years of recorded history combined. With over 60 billion connected devices worldwide, this trend will only continue. By 2020, it is estimated that over 30% of all data will be stored in cloud services. Data Mania is truly upon us!

However, you may still be pondering whether Big Data (BD) could provide true value to your organisation and how effectively you could kick off such an ambitious initiative. Perhaps you're in a quandary about how you could leverage BD for competitive advantage, or perhaps you're actively considering a digital transformation program. Rest assured that your competitors certainly are. On average, only 0.5% of corporate data is actually analysed, leading to untold numbers of missed opportunities for those that discount it, and substantial benefits for those that do not. Retailers leveraging BD alone have increased their margins by up to 60%!

Digitalization is rewriting the rules of competition, with incumbent companies most at risk of being left behind by new disrupters in the industry. It is no secret that BD has a paramount role to play in any digitalization initiative. This year alone, 73% of all Fortune 1000 companies have reported that they are investing in BD, and that number is set to grow in coming years.

But BD, as The Economist rightly suggests, is a resource: if data is not captured, analysed and exploited, its value becomes inconsequential. This is where Data Science - specifically Data Discovery and Prediction - comes in. Not only can you report on past actuals as in classic Business Intelligence (BI), but, by calling on algorithms and machine learning techniques, you can now also predict future outcomes, leveraging your BD resources to feed your predictive models to a high degree of accuracy. Imagine the business opportunities this presents in all functional areas of your organisation!

Hopefully we now have you thinking about BD and the possibilities it provides! But how does a BD initiative affect the past years of investment in your Corporate Data Warehouse (DWH)? Simply stated, any BD initiative co-exists with the DWH: the new types of data, the velocity and the increased volume that BD brings serve to enrich the DWH platform, providing a much more holistic business picture. BD technology platforms, either on premise or, more often, in the Cloud, allow for this high-volume data capture in a way that more classical relational database technologies cannot.

 

So let's get started!

At ClearPeaks, we offer our customers a pragmatic proof-of-concept (POC) service in which we will work with you to:

Define the right POC business case, the expected ROI, the problem and the desired BD solution.
Deploy a scaled-down POC environment, capturing data from diverse sources beyond what you are capturing in your DWH. This could be high-volume data, real-time streaming data, social media data, etc.
Use Cloud BD platforms to provide a full POC experience, after which an in-house hosting / cloud decision can be made. We start with the Cloud as it's quick to deploy, elastic and cost-efficient.
Demonstrate how the combination of DWH, BD, predictive modeling and powerful visualisations can bring tangible benefits to your organisation, all in an acceptable timeframe and with minimal costs.

 

Some useful Big Data definitions:

Volume, Variety & Velocity: Dealing with large, fast and varied data is now a possibility for any business. The key point now is defining what knowledge can be extracted from the data, not implementing complex software solutions.
Cloud: On-premise hosting of BD platforms is not always possible, and in some cases is not really recommended. The perfect partner for BD is the Cloud. The Cloud enables your BD solution to grow with you in a flexible and cost-effective manner, without the headaches of maintaining complex hardware systems.
Real-time: Provide up-to-the-second information about an enterprise's customers and present it in a way that allows for better and quicker business decision-making — perhaps even within the time span of a customer interaction. Real-time analytics can support instant refreshes to corporate dashboards to reflect business changes throughout the day.
Predictive analytics and machine learning: Predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. Apply them to cross-sell products to your customers, improve customer retention and credit risk assessments, optimize marketing campaigns, increase your revenues and more.

 

Authors:
Gordon, Oscar, Marc

Click here if you would like to know more about our Big Data services!

Tableau 10.3: Data-Driven Alerts and a smart trick for KPIs

With version 10.3, Tableau has introduced a very useful feature: Data-Driven Alerts.

Now, when looking at a dashboard in Tableau Server, you can set up automatic mail notifications to a set of recipients when a certain value reaches a specific threshold.

In this article, we’re going to go through the basics of this feature, explaining how to use it, and then show you how to get even more from it when dealing with KPIs.

 

1. How to set up an alert

In our example we have a dashboard that shows the yearly earnings and revenues of our company. We have two KPIs (amount of earnings and revenues for a certain year compared to a target) and three charts comparing them with expenses and with each other.


Figure 1: Dashboard example

To set up the alert on one of the measures shown, all we have to do is click on one of the axes (without applying any filters or actions) and then click on Alert. We will be able to set up our alert and save it in the pop-up window.


Figure 2: Alert creation

The various options for the condition are:

Above or equal to
Above
Below or equal to
Below
Equal to

The frequency can be set as:

Once - the first time it's true
As frequently as possible
Hourly at most
Daily at most
Weekly at most

In the example shown, we’re going to set up an alert on the Earnings measure, and we want to receive an email as soon as the value reaches €5M.

Once saved, the alert becomes active. If you want to see the status of all the alerts when browsing a site, click on Tasks (if you have permission), and then on Alerts. You will see a list of all the active alerts, with name, view, owner, recipients, email frequency, last check time, and last alert time. You can also edit or delete them by clicking on “…”.

You can set up more alerts on the same view or dashboard, and even on the same measure or axis. Just remember that you must do it on the original dashboard, with no filters or actions applied (otherwise Tableau will give you an error).


Figure 3: Alerts management pane

 

2. A smart trick for KPIs

Alerts are a really good new feature and work pretty well. In the previous example, you get an email as soon as your earnings reach €5M. There is, however, one major restriction: alerts can only be set up by clicking on a continuous axis. This means that we can’t use them on our beloved KPIs, as they usually only show a number and maybe a few indicators.

In our example, the KPIs show the value of the earnings and the revenues. To set up an alert on them, we could click on any of the axes below, and even set one up on revenues. But what if there is no axis in the dashboard for the measure that we want to track?

Don’t worry, we’ve got a neat trick to show you.

We’re going to add a third measure, a calculated field that shows the ratio between the earnings and the revenues, that we can call Quality.
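
Assuming the two measures are called [Earnings] and [Revenues] (the field names in your own workbook will differ), the calculated field is just a ratio of aggregates:

// Quality - illustrative Tableau calculation
SUM([Earnings]) / SUM([Revenues])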


Figure 4: Dashboard example with new KPI

How can we add an alert on it? It has no axis, but we can modify it and “cheat” Tableau while still showing our KPI. We can edit the sheet like this:

Create a dummy calculated field that simply has value 1
Drag it to the sheet, alongside your quality measure
Set the visualization as a bar chart and use the double axis feature
Fix the axis of the dummy field to go from 0 to 1, so that it always shows a full bar
Do the same with the second axis, synchronizing the two axes
Clean the two axes, removing text and ticks. Leave some text on the second one to indicate where to click to set up the alert
Set the size as maximum and the colours the same as the background, in order to “hide” the bars
Drag your quality field to the dummy text box, and align it in the middle

Done! Your sheet will now look like this (we added some extra fancy text to it, but you don’t need to):


Figure 5: New KPI sheet

Drag your new sheet to the dashboard, then upload it to Tableau Server and click on “Click for Alert”. You can now set up an alert on your custom KPI!


Figure 6: Alert creation on new KPI with no axis

 

Conclusion

Data-Driven Alerts are a very powerful new feature that users had been requesting for quite a long time, allowing you to track your most important measures in the easiest possible way (by receiving a notification in your business email inbox) without having to repeatedly check the dashboard! This should save you a significant amount of time…

Yes, the continuous-axis restriction is a bit of an inconvenience, but it can be overcome with the trick shown above, and we are sure that Tableau will improve the feature even more in future versions.

Data Quality with Informatica – Part 3: Data Deduplication


In the previous article in the series about Data Quality with Informatica we learned that Informatica has a lot of useful features to standardize data, and that it is a very user-friendly tool whilst still offering enough flexibility to perform complex standardization tasks. In this article, we'll focus on Data Deduplication.

Introduction

The process of data deduplication consists of two parts: the first is data matching, where we try to find duplicates or the likelihood of records being the same, and from which we obtain a score of similarity between records. The second part of the data deduplication process is data consolidation, that is, the deletion of the records that we have identified as duplicates whilst keeping the masters.

 

1. Record Matching

In this section we are going to look at some Informatica transformations that simplify the process of finding duplicates in our data. Data deduplication can consume a large amount of resources, since many comparisons have to be made across a large data file; bearing this in mind, the first step is to group the data so that comparisons are made within small groups rather than across the whole dataset. For this purpose Informatica provides the key generator transformation: grouping the data before the matching process significantly reduces the time needed to look for matches in the dataset.

Create a mapping and add the file input, a key generator transformation and a match transformation. The output of the key generator will be the input for the match transformation. We will review the data using the data preview, so there is no need for a valid mapping with a defined target.


Figure 1: Mapping for record matching

The first transformation is intended to split the data into groups in order to make the match transformation more efficient; we are going to configure the key generator transformation to generate a new sequence and to use a Soundex strategy. Soundex works on characters in the English alphabet: it uses the first character of the input string as the first character of the return value and encodes the next three unique consonants as numbers. We will use the enterprise column in the strategy:


Figure 2: Configuration of the KeyGenerator transformation to use the Soundex strategy
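
For intuition, here is a minimal Python sketch of the grouping idea (a simplified Soundex, not Informatica's implementation): values that encode to the same key land in the same group, and matching is later performed only within each group.

from collections import defaultdict
from itertools import groupby

CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(value: str) -> str:
    """Simplified Soundex: first letter plus up to three consonant codes."""
    word = "".join(ch for ch in value.lower() if ch.isalpha())
    if not word:
        return ""
    codes = [CODES.get(ch, "") for ch in word]        # vowels, h, w and y map to ""
    collapsed = [c for c, _ in groupby(codes) if c]   # drop repeats and blanks
    if collapsed and CODES.get(word[0]) == collapsed[0]:
        collapsed = collapsed[1:]                     # first letter is kept as-is
    return (word[0].upper() + "".join(collapsed))[:4].ljust(4, "0")

def group_by_key(values):
    """Group values by their Soundex key, e.g. for the enterprise column."""
    groups = defaultdict(list)
    for value in values:
        groups[soundex(value)].append(value)
    return groups

# soundex("Enterprise") and soundex("Enterprize") both return "E536",
# so the two spellings end up in the same group for matching.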

Now we can preview the data in the key generator transformation, and we can see that there are several values that are likely to be duplicated, as well as two groups that have been generated according to the Soundex strategy:


Figure 3: Results preview of the KeyGenerator transformation

Now we are going to add three strategies to the Match transformation, using the Bigram distance for field matching. We can choose between several measures of distance between the fields (Bigram, Edit, Hamming, Jaro and Reverse Hamming); each of these is intended for specific cases and will work better or worse depending on the values of the column. In this case we are going to use Bigram, as the field values are short. The transformation creates a temporary column to compare the values, so in every strategy we select the original port and the temporary one, named <port_name>_2:


Figure 4: Configuring the fields that will be used in the record match strategy
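
As a rough illustration of how a bigram-based score behaves (a Dice coefficient over character bigrams; Informatica's Bigram metric may be computed somewhat differently), consider this Python sketch:

def bigrams(text: str) -> set:
    t = text.lower()
    return {t[i:i + 2] for i in range(len(t) - 1)}

def bigram_score(a: str, b: str) -> float:
    """Dice coefficient over character bigrams: 1.0 means identical bigram sets."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

# bigram_score("employee_1", "employe_1") ≈ 0.94  -> probably the same employee
# bigram_score("employee_1", "marketing")  = 0.0  -> clearly different values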

We are going to create three strategies using the same distance, so all three columns in our file will be taken into account to find the duplicates. The first strategy will measure the likelihood for the enterprise column, the second for the department and the third for the employee:


Figure 5: Configuring strategies for record matching

Once we have defined the strategies we can preview the data of the transformation and analyse the results. The output of the transformation only shows the records that have some probability of being duplicates, and the driver score shows the degree of likelihood of being duplicates. Depending on the number of strategies defined for the columns we will view more columns in the results (two additional columns per strategy), and the values are the value of the record and the value of the record it has been compared to:


Figure 6: Results preview of the Match transformation, showing likelihood scores for three different strategies

The degree of similarity is shown in the driver score column, with 1 being a perfect match. We can adjust the weight of each strategy to give more importance to some columns. Looking at this result we can see that the first record (1-1) matches the second record of the file (1-2) with a score of 0.9257, and that the difference comes from the employee name, which contains a typo in the second row of the file (employee_2). We can also see that there are perfect matches for rows 3, 5 and 6 (as the output score is 1) and that rows 11 and 12 are duplicated, so we could consolidate the original file into a couple of records, keeping just the highlighted rows in the source file:


Figure 7: Preview of the data showing a large number of duplicates. After the consolidation only the highlighted records will be maintained.
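
To make the weighting concrete, the driver score can be thought of as a weighted average of the per-strategy scores (a simplification; the exact aggregation Informatica applies is not detailed here):

def driver_score(scores, weights):
    """Weighted average of the per-strategy match scores."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# With equal weights, two perfect field matches and an employee-name score of
# about 0.777 average out to the 0.9257 driver score discussed above:
# driver_score([1.0, 1.0, 0.777], [1, 1, 1]) ≈ 0.926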

At some point we will have to decide which are the valid records that we should keep; this process is known as data consolidation.

There is another transformation called Comparison that works in a similar way, in that we can choose different strategies and distance measures, but it compares the columns in a record instead of different records.

This is just one example of what can be done in a few simple steps, but note that the different transformations that identify duplicates, as well as the key generator transformation, can be configured in different ways. For a complete description of the transformations, the specifications of the different distance measures and how to select the one that best suits your needs, you should refer to the INFA transformation reference. Among these transformations there is another interesting one called Identity Match. The advantage of this transformation is that it offers the ability to match name and address data without the need for standardization prior to the matching. The identity match transformation performs matching operations on input data at an identity level; an identity is a set of columns providing name and address information for a person or organization. The transformation treats one or more input columns as a defined identity and performs a matching analysis between the identities it locates in the input data.

 

2. Data Consolidation

Data consolidation is the next step in data deduplication. Once we have found the matching records, we can automatically consolidate them to obtain a master list of unique rows. To do this, we add the consolidation transformation to the mapping we already had. The consolidation transformation can be configured in different ways, but basically it groups the records by a selected key column and keeps one record per group, depending on the strategy we have selected. By default it creates an output port called IsSurvivor that shows whether the record is the master or not. In our example we are going to add the consolidation transformation and drag and drop the output ports from the matching transformation into the consolidation transformation as shown below:


Figure 8: Mapping for record matching and data consolidation

We have to be careful when selecting the group field, as it will determine the output of the transformation. In this case, we are going to configure a simple strategy, grouping by groupkey1 and with a most frequent non-blank strategy on every column. We can add more complex or customized strategies by selecting the advanced option and writing code; for instance, we may want to create a custom strategy that depends on the matching score generated by the previous matching transformation.


Figure 9: Configuring the grouping strategy for the data consolidation
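
Conceptually, the grouping-plus-survivorship logic behaves like this small Python sketch (an illustration of the "most frequent non-blank" idea, not Informatica's implementation):

from collections import Counter

def most_frequent_non_blank(values):
    """Pick the most common non-empty value of a column, or '' if all are blank."""
    counts = Counter(v for v in values if v not in (None, ""))
    return counts.most_common(1)[0][0] if counts else ""

def consolidate(records, group_key="groupkey1"):
    """Build one surviving record per group, column by column."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[group_key], []).append(rec)
    return [{col: most_frequent_non_blank([r[col] for r in recs]) for col in recs[0]}
            for recs in groups.values()]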

If we preview the results of the transformation we can see the survivor records:


Figure 10: Results preview of the consolidation process. The highlighted rows are selected as the surviving records of the de-duplication.

These will be our master records in this example.

From the above result we might think that there are two surviving records when there should be only one, as the department and the employee are the same. This happens because we are consolidating the data grouped by the cluster ID, so we still have to find the duplicates by employee and department. In order to implement multiple grouping and multiple match criteria in a single mapping we can use the association transformation. With this transformation we can consolidate the data as if they belonged to the same cluster. We're going to configure the association transformation to associate the department and the employee name, make a second consolidation (consolidation2), and then check the results.


Figure 11: Final mapping for record matching and data consolidation using multiple strategies and match criteria.

The figure below shows the properties tab for the association transformation:


Figure 12: Configuration of the Association transformation

With this change, the association id will be common for both output clusters coming from the match transformation. If we check the results now we can see that we have only one survivor record:


Figure 13: Final results of the matching and de-duplication process. We can see only one surviving record for each set of duplicate records.

 

Conclusion

In this example we have seen how we can consolidate data automatically with Informatica DQ. It is important to note that automatic does not mean intelligent, as we have to set up the transformations properly, carefully selecting the measures of distance whilst paying special attention to the data grouping.

This article concludes our series about Data Quality with Informatica.

If you would like to know more about the Data Quality Services we offer click here!

Data Quality with Informatica – Part 2: Data Standardization


Introduction

In the previous article in this series about Data Quality we explained how we can profile our data using Informatica.

We learned that the data in our example contained some columns that needed to be fixed:

Keyword_ID: our data contain records with value '--' in this field, which represents a null value; in order to standardize this value across all sources we are going to change it to 'NA'.
Currency: the currency field is also not consistent as the values are in both upper and lowercase, and some records contain values that do not correspond to a currency code. We are going to fix this by standardizing to uppercase.
Year: some records contain the year with the word 'year', e.g. 'year 2015', instead of just the value 2015; the field must be standardized to just the year in numerical format.
Quarter: the quarter field is also wrong, as some records contain the date or the year, and this field should only contain the quarters and the year number.

In this article, we are going to continue with this example and create a set of mapplets in Informatica Developer to fix these data issues.

 

1. Creating the standardization rules

Our first mapplet will retain numerical values only, so it can be applied to any column where we need to keep only numerical values. In our case, we are going to apply it to the year column, and to do this, we open Informatica Developer, right-click on the project and click on create mapplet; we'll call it rule_Retain_Numbers. A mapplet normally contains an input and an output transformation, as it receives values and returns results. We are going to define the mapplet logic between these two transformations, so first we add the input and the output transformations, configure the input and output ports in the transformations and set the length to be long enough, for instance 200 characters.


Figure 1: Creating a mapplet in Informatica Developer

Now we have to define the mapplet logic: first, we are going to use a labeller transformation to mask the letters and spaces; the labeller can be used to perform a character-by-character analysis of the data in a field, where each character is masked according to a list of standard and user-defined types: numbers are masked as '9', spaces as '_', letters as 'X' and symbols as 'S'. To add the transformation, we right-click inside the mapplet, select Add transformation and then add a labeller:


Figure 2: Adding a labeller transformation to the mapplet

Now we're going to add a new strategy to the labeller: select character mode, then verify that the output precision is set to 200 as in the input:


Figure 3: Basic labeller configuration

The next step is to add an operation: we’re going to select Label characters using character sets instead of Label characters using reference table. We want to mask all the characters except the numbers, so we choose the English alphabet, spaces and symbols, as in the image below:


Figure 4: Configuration of the labeller transformation

Click on finish and skip the ignore text dialog which appears after clicking on next, as we don't want to add another operation. With the configuration as it is now, the labeller will output only the numbers and mask the rest of the characters, so we can add a standardizer transformation to remove them.

The standardizer transformation can be viewed as a series of operations that are carried out on an input string to look for matching substrings in the input and to place the correct substring in the output. It is used to fix errors such as incorrect formats, and to search for values in the data that can be removed or replaced either with reference table items or specific values.

To continue with our example, it’s time to add a standardizer transformation to the mapplet as we did before, which we can name st_remove_noise; drag the output from the labeller to the standardizer input, then create a new strategy (called remove noise). We check the space delimiter after clicking on the choose button, and also remove the trailing spaces by checking both checkboxes in the strategy properties.


Figure 5: Configuring the standardized transformation strategy

At this point we want to remove the noise labelled with ‘S’, ‘X’ and ‘_’, so we select remove custom strings in the strategy and add these characters to the custom strings in properties.


Figure 6: Configuring the standardizer transformation to remove custom strings
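
Taken together, the labeller and the standardizer implement the "retain numbers" rule; conceptually the two steps behave like this minimal Python sketch (an illustration of the logic only, not of how Informatica executes it):

def label_non_digits(value: str) -> str:
    """Labeller step: keep digits, mask letters as 'X', spaces as '_', symbols as 'S'."""
    return "".join(ch if ch.isdigit() else
                   "_" if ch == " " else
                   "X" if ch.isalpha() else "S"
                   for ch in value)

def remove_noise(labelled: str) -> str:
    """Standardizer step: strip the custom noise strings 'S', 'X' and '_'."""
    for noise in ("S", "X", "_"):
        labelled = labelled.replace(noise, "")
    return labelled

# remove_noise(label_non_digits("year 2015")) -> "2015"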

Click on finish and finally drag the output from the standardizer transformation to the port of the output transformation, then validate the mapplet. If we want the mapplet to be available in the Analyst, we have to validate it as a rule.


Figure 7: Standardization mapplet

Carrying on with our example, now we’re going to create another mapplet to replace the wrong currency codes we found in the file. We’re going to use a reference table to do this, which can be created using Informatica Analyst or Developer. We will use Analyst for this example.

Log into Analyst, open the profile, select the currency column and create a reference table: the first value will be the valid one and the rest of them will be replaced by it. To create the reference table we go to the file profile, select the currency column and then, in Actions, click on Add to - Reference Table:


Figure 8: Creating reference tables in Informatica Analyst

Once the table has been created we add three new columns with the values to be replaced, the first column being the correct one.


Figure 9: Reference table properties

After adding the new columns, we can edit the records and keep just one, as shown in image 10:


Figure 10: Final reference table for currency standardization in Analyst

We are going to create a separate mapplet for this rule: we could add new ports to the first mapplet and increase the complexity of the standardization, but by keeping each rule in its own mapplet, the mapplets remain as simple as possible. For the currency mapplet we proceed as with the first one we created above, but in this case the standardizer transformation will have a different strategy: to replace the values with those present in the currency reference table. To do this, we have to select the reference table replacement for the transformation:


Figure 11: Standardizer transformation using a reference table

The mapplet will look like this; we validate it and proceed to create a new one:


Figure 12: Mapplet for the standardization of the currency field

We need to identify the month number to replace the values for the quarter, so we will now parse the date in a new mapplet. Informatica Data Quality comes with some in-built features to parse dates, but we are not going to use them in this example. Instead, we are going to parse the date manually, using a token parser to split it into three columns: day, month and year.

Click on create a mapplet and add an input, an output, and a parser transformation. We will parse the date field using the slash character as the delimiter and use regular expressions to validate the day, month and year. It's important to note that the parser transformation creates two extra output ports by default: one for data that does not match the parser regular expressions, and an overflow port that holds data when there are not enough output ports for the correctly parsed values.

In the parser transformation, select the token parser when prompted:


Figure 13: Configuration of the parser type in the parser transformation

Name the port in the input as date, then drag and drop the date from the input transformation to the token parser; then go to the parser transformation and add one strategy with three outputs, day, month and year. Each of these ports will have a custom regular expression with “/” as the delimiter.


Figure 14: Parser configuration

Click on next and select token parser, and then select token sets in the token definition and click on choose. In the token set selection, we create a new expression for every output port of the parser transformation:


Figure 15: Configuration of token sets for parsing data

We add the monthOfYear custom token set with the regular expression shown in image 16:


Figure 16: Token set configuration for the month number

Once the token set has been added, assign it to the month column.
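
For reference, token-set regular expressions along the following lines would validate the three ports; these are illustrative assumptions, and the exact expressions are the ones configured in the token sets of Figures 15 and 16:

day:   (0?[1-9]|[12][0-9]|3[01])
month: (0?[1-9]|1[0-2])
year:  (19|20)[0-9]{2}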

We have to repeat the same process with the proper regular expressions for each column, and once all the columns have been assigned, the parser mapplet should look like this in image 17:


Figure 17: Mapplet for parsing the date into day, month, and year columns using a parser transformation

We can now add the mapplet to the mapping to get a preview of the results:


Figure 18: Results preview of the parsing mapplet

We can see that there are some records that do not meet the regular expressions we set in the parser, so we have to set a default value for those records that have populated the UnparsedOutput port.

Continuing with the mapplet, we are going to add the port quarter to the output, and replace the hyphens with the string “NA”. In order to do this, we need to add two expressions to the mapping: one to create the quarter column and the other to replace the hyphens with “NA”. We do this by creating an expression with one port to concatenate the quarter with the year; in the same expression we add a port to replace the hyphens with “NA”, and then use a decision to populate (or not) the quarter output, depending on the unparsed port from the parser: if it is empty, the date was parsed correctly and the quarter field is populated; if not, the date was wrong, and the quarter is populated with “NA”. The code in the decision strategy will look like this:


Figure 19: Expression to generate the quarter based on the result of the parsing of the date
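
As an illustrative sketch of that logic (the port names and the quarter calculation below are assumptions; the actual strategy code is the one shown in Figure 19), the decision boils down to an Informatica expression along these lines:

IIF( ISNULL(UnparsedDate) OR LENGTH(LTRIM(RTRIM(UnparsedDate))) = 0,
     'Q' || TO_CHAR(CEIL(TO_DECIMAL(month) / 3)) || ' ' || year,
     'NA' )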

Our mapplet should look like image 20:


Figure 20: Standardization mapping with quarter parsing

 

2. Creation of the standardization mapping

Now we can validate all the mapplets and add them to a mapping where we can also add the source file and a new output file with the same structure. This file will be the standardized data file. We are also going to add a union to merge the data from two different dates. The mapping should look like the one in image 21:


Figure 21: Final standardization mapping

After running the mapping, we can profile the generated file and check that it meets the rules we defined at the beginning. We can see the path of the output file in the Run-time tab of the properties of the target:


Figure 22: Properties of the output transformation. We can see the name and path of the output generated in the Run-time tab

 

3. Results

Now we are ready to review the profile of the output file. For the currency column, we can see that the only value available is USD. If any other value appears, we can simply add it to a new column in the reference table. Notice that NULL values still appear, as we didn't set a rule to standardize NULL values to “NA”.


Figure 23: Results of the standardization process for the currency column

The year column is now standardized in the merged file and we have only numerical values after the data masking and standardization:


Figure 24: Results of the standardization process for the year column

We have fixed the quarter column to obtain standard values (quarterName Year) thanks to the expressions added to the mapplet:


Figure 25: Results of the standardization process for the quarter column

We have also fixed the hyphens in the keywordID column:


Figure 26: Results of the standardization process for the Keyword ID column

Conclusion

This concludes this article about Data Standardization with Informatica Data Quality. We have seen that Informatica has a lot of useful features to standardize data, and that it is a very user-friendly tool whilst still offering enough flexibility to perform complex standardization tasks.

Stay tuned for the last article in this series, where we are going to explain Data Deduplication using Informatica Data Quality.

If you would like to know more about the Data Quality Services we offer click here!
