Data Quality with Informatica – Part 2: Data Standardization

Introduction

In the previous article in this series about Data Quality we explained how we can profile our data using Informatica.

We learned that the data in our example contained some columns that needed to be fixed:

Keyword_ID: our data contain records with value '--' in this field, which represents a null value; in order to standardize this value across all sources we are going to change it to 'NA'.
Currency: the currency field is also not consistent as the values are in both upper and lowercase, and some records contain values that do not correspond to a currency code. We are going to fix this by standardizing to uppercase.
Year: some records contain the year with the word 'year', e.g. 'year 2015', instead of just the value 2015; the field must be standardized to just the year in numerical format.
Quarter: the quarter field is also wrong, as some records contain the date or the year, and this field should only contain the quarters and the year number.

In this article, we are going to continue with this example and create a set of mapplets in Informatica Developer to fix these data issues.

 

1. Creating the standardization rules

Our first mapplet will retain numerical values only, so it can be applied to any column where we need to keep just the numbers. In our case we are going to apply it to the year column. To do this, we open Informatica Developer, right-click on the project and click on create mapplet; we’ll call it rule_Retain_Numbers. A mapplet normally contains an input and an output transformation, as it receives values and returns results. We are going to define the mapplet logic between these two transformations, so first we add the input and output transformations, configure their input and output ports, and set the length to be long enough, for instance 200 characters.

Figure 1: Creating a mapplet in Informatica Developer

Now we have to define the mapplet logic. First, we are going to use a labeller transformation to mask the letters and spaces; the labeller performs a character-by-character analysis of the data in a field, where each character is masked according to a list of standard and user-defined types: numbers are masked as '9', spaces as '_', letters as 'X' and symbols as 'S'. To add the transformation, we right-click inside the mapplet, select Add transformation and then add a labeller:

Figure 2: Adding a labeller transformation to the mapplet
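To make the masking concrete, here is a minimal sketch of the labelling logic in Python (an illustration only, not Informatica code), using the standard types described above:

def label(value: str) -> str:
    # Mask each character with its standard label: digits -> '9',
    # letters -> 'X', spaces -> '_', anything else (symbols) -> 'S'.
    out = []
    for ch in value:
        if ch.isdigit():
            out.append('9')
        elif ch.isalpha():
            out.append('X')
        elif ch == ' ':
            out.append('_')
        else:
            out.append('S')
    return ''.join(out)

print(label('year 2015'))  # XXXX_9999
print(label('Q1-2015'))    # X9S9999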

Now we’re going to add a new strategy to the labeller:  select character mode, then verify that the output precision is set to 200 as in the input:

Figure 3: Basic labeller configuration

The next step is to add an operation: we’re going to select Label characters using character sets instead of Label characters using reference table. We want to mask all the characters except the numbers, so we choose the English alphabet, spaces and symbols, as in the image below:

Figure 4: Configuration of the labeller transformation

Click on finish and skip the ignore text dialog which appears after clicking on next, as we don't want to add another operation. With the configuration as it is now, the labeller will output only the numbers and mask the rest of the characters, so we can add a standardizer transformation to remove them.

The standardizer transformation can be viewed as a series of operations that are carried out on an input string to look for matching substrings in the input and to place the correct substring in the output. It is used to fix errors such as incorrect formats, and to search for values in the data that can be removed or replaced either with reference table items or specific values.

To continue with our example, it’s time to add a standardizer transformation to the mapplet as we did before, which we can name st_remove_noise; drag the output from the labeller to the standardizer input, then create a new strategy (called remove noise). We check the space delimiter after clicking on the choose button, and also remove the trailing spaces by checking both checkboxes in the strategy properties.

Figure 5: Configuring the standardizer transformation strategy

At this point we want to remove the noise labelled with ‘S’, ‘X’ and ‘_’, so we select remove custom strings in the strategy and add these characters to the custom strings in properties.

Figure 6: Configuring the standardizer transformation to remove custom strings
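Putting the two steps together, the behaviour of rule_Retain_Numbers can be sketched roughly as follows (plain Python, not Informatica code): in the real mapplet the labeller keeps the digits unmasked, and the standardizer then strips the mask characters and spaces.

def retain_numbers(value: str) -> str:
    # Step 1 (labeller): keep digits, mask letters as 'X',
    # spaces as '_' and symbols as 'S'.
    labelled = ''.join(
        ch if ch.isdigit()
        else 'X' if ch.isalpha()
        else '_' if ch == ' '
        else 'S'
        for ch in value
    )
    # Step 2 (standardizer): remove the noise labels and trim.
    for noise in ('X', '_', 'S'):
        labelled = labelled.replace(noise, '')
    return labelled.strip()

print(retain_numbers('year 2015'))  # '2015'
print(retain_numbers('2015'))       # '2015'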

Click on finish and finally drag the output from the standardizer transformation to the port of the output transformation, then validate the mapplet. If we want the mapplet to be available in the Analyst, we have to validate it as a rule.

Figure 7: Standardization mapplet

Carrying on with our example, now we’re going to create another mapplet to replace the wrong currency codes we found in the file. We’re going to use a reference table to do this, which can be created using Informatica Analyst or Developer. We will use Analyst for this example.

Log into Analyst, open the profile and select the currency column to create a reference table; the first value will be the valid one, and the remaining values will be replaced by it. To create the reference table we have to go to the file profile, select the currency column and then, in actions, click on Add to - Reference Table:

Figure 8: Creating reference tables in Informatica Analyst

Once the table has been created we add three new columns with the values to be replaced, the first column being the correct one.

Figure 9: Reference table properties

After adding the new columns, we can edit the records and keep just one, as shown in image 10:

Figure 10: Final reference table for currency standardization in Analyst

We could add new ports to the first mapplet and increase the complexity of the standardization, but keeping each rule in a different mapplet keeps the mapplets as simple as possible, so we create a new mapplet for this rule. For the currency mapplet we proceed as with the first one we created above, but in this case the standardizer transformation will have a different strategy: to replace the values with those present in the currency reference table. To do this, we have to select the reference table replacement for the transformation:

Figure 11: Standardizer transformation using a reference table
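Conceptually, the reference table behaves like a lookup from every known variant to the standard code. A rough Python sketch of the idea (the variant values below are assumptions, not the exact contents of the table):

# Assumed variants, all mapped to the standard code held in the first column.
CURRENCY_REFERENCE = {
    'usd': 'USD',
    'us$': 'USD',
    'dollar': 'USD',
}

def standardize_currency(value: str) -> str:
    # Return the standard code if the value is a known variant,
    # otherwise return the value unchanged (in uppercase).
    return CURRENCY_REFERENCE.get(value.strip().lower(), value.upper())

print(standardize_currency('usd'))     # 'USD'
print(standardize_currency('dollar'))  # 'USD'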

The mapplet will look like this; we validate it and proceed to create a new one:

Figure 12: Mapplet for the standardization of the currency field

We need to identify the month number to replace the values for the quarter, so we will now parse the date in a new mapplet. Informatica Data Quality comes with some built-in features to parse dates, but we are not going to use them in this example. Instead, we are going to parse the date manually, using a tokenizer to split it into three columns: day, month and year.

Click on create a mapplet and add an input, an output, and a parser transformation. We will parse the date field using the slash character as the delimiter, and use regular expressions to validate the day, month and year. It’s important to note that the parser transformation creates two output ports by default: one for data that do not match the parser regular expressions, and an overflow port that holds data when there are not enough output ports to keep the correctly parsed values.
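As a rough illustration of what the token parser does here, the sketch below splits the date on '/' and validates each token with a regular expression; anything that fails validation would end up in the unparsed output. The regular expressions and the day/month/year token order are assumptions for the example, not necessarily the exact ones configured in the mapplet.

import re

# Assumed token sets: day 1-31, month 1-12, four-digit year.
DAY_RE = re.compile(r'^(0?[1-9]|[12][0-9]|3[01])$')
MONTH_RE = re.compile(r'^(0?[1-9]|1[0-2])$')
YEAR_RE = re.compile(r'^\d{4}$')

def parse_date(value: str):
    # Return (day, month, year) if the value parses, otherwise None
    # (None stands for the unparsed output of the transformation).
    tokens = value.split('/')
    if len(tokens) != 3:
        return None
    day, month, year = tokens
    if DAY_RE.match(day) and MONTH_RE.match(month) and YEAR_RE.match(year):
        return day, month, year
    return None

print(parse_date('1/2/2015'))    # ('1', '2', '2015')
print(parse_date('2015-02-01'))  # None -> unparsed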

In the parser transformation, select the token parser when prompted:

Figure 13: Configuration of the parser type in the parser transformation

Name the port in the input as date, then drag and drop the date from the input transformation to the token parser; then go to the parser transformation and add one strategy with three outputs, day, month and year. Each of these ports will have a custom regular expression with “/” as the delimiter.

Figure 14: Parser configuration

Click on next and select token parser, and then select token sets in the token definition and click on choose. In the token set selection, we create a new expression for every output port of the parser transformation:

Figure 15: Configuration of token sets for parsing data

We add the monthOfYear custom token set with the regular expression shown in image 16:

Figure 16: Token set configuration for the month number

Once the token set has been added, assign it to the month column.

We have to repeat the same process with the appropriate regular expressions for each column; once all the columns have been assigned, the parser mapplet should look like image 17:

Figure 17: Mapplet for parsing the date into day, month, and year columns using a parser transformation

We can now add the mapplet to the mapping to get a preview of the results:

Figure 18: Results preview of the parsing mapplet

We can see that there are some records that do not meet the regular expressions we set in the parser, so we have to set a default value for those records that have populated the UnparsedOutput port.

Continuing with the mapplet, we are going to add the port quarter to the output, and replace the hyphens with the string “NA”. To do this, we need to add two expressions to the mapping: one to create the quarter column, and another to replace the hyphens with “NA”. We create an expression with one port that concatenates the quarter with the year and, in the same expression, a port that replaces the hyphens with “NA”; a decision strategy then populates (or not) the quarter output depending on the unparsed port from the parser: if it is empty, the date was parsed correctly and the quarter field is populated; if not, the date was wrong and the quarter is populated with “NA”. The code in the decision strategy will look like this:

Figure 19: Expression to generate the quarter based on the result of the parsing of the date
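The exact expression is only visible in the screenshot, but the decision logic boils down to something like the following sketch (port names are assumptions, and the quarter number is derived from the month):

def build_quarter(unparsed_output: str, month: str, year: str) -> str:
    # Populate the quarter only when the date was parsed correctly;
    # otherwise default to 'NA'.
    if unparsed_output:  # the parser could not parse the date
        return 'NA'
    quarter = (int(month) - 1) // 3 + 1
    return 'Q{0} {1}'.format(quarter, year)

print(build_quarter('', '2', '2015'))      # 'Q1 2015'
print(build_quarter('bad value', '', ''))  # 'NA'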

Our mapplet should look like image 20:

Figure 20: Standardization mapping with quarter parsing

 

2. Creation of the standardization mapping

Now we can validate all the mapplets and add them to a mapping where we can also add the source file and a new output file with the same structure. This file will be the standardized data file. We are also going to add a union to merge the data from two different dates. The mapping should look like the one in image 21:

Figure 21: Final standardization mapping

After running the mapping, we can profile the generated file and check that it meets the rules we defined at the beginning. We can see the path of the output file in the Run-time tab of the properties of the target:

Figure 22: Properties of the output transformation. We can see the name and path of the output generated in the Run-time tab

 

3. Results

Now we are ready to review the profile of the output file. For the currency column, we can see that the only value available is USD. If any other value appears, we can simply add it to a new column in the reference table. Notice that NULL values still appear, as we didn’t set a rule to standardize them to “NA”.

Figure 23: Results of the standardization process for the currency column

The year column is now standardized in the merged file and we have only numerical values after the data masking and standardization:

Figure 24: Results of the standardization process for the year column

We have fixed the quarter column to obtain standard values (quarterName Year) thanks to the expressions added to the mapplet:

Figure 25: Results of the standardization process for the quarter column

We have also fixed the hyphens in the keywordID column:

Figure 26: Results of the standardization process for the Keyword ID column

Conclusion

This concludes this article about Data Standardization with Informatica Data Quality. We have seen that Informatica has a lot of useful features to standardize data, and that it is a very user-friendly tool whilst still offering enough flexibility to perform complex standardization tasks.

Stay tuned for the last article in this series, where we are going to explain Data Deduplication using Informatica Data Quality.

If you would like to know more about the Data Quality Services we offer click here!

Data Quality with Informatica – Part 1: Data Profiling

Welcome to the first article in the Informatica Data Quality series, where we are going to run through the basics of Informatica Analyst and the main features of Informatica Developer for data profiling.

Informatica is one of the most important data integration vendors in the market; they are behind PowerCenter, a very well-known ETL tool that can be integrated with other Informatica tools, such as Informatica Analyst, a web application used by data analysts to analyse data and create data profiles, among other tasks. In the sections below we are going to go through the necessary steps to create a data profile, a business rule for column profiling and, finally, a scorecard to view the results.

 

1. Create a Data Profile

To start profiling our data, first open the browser, log into the Analyst tool (the default URL is http://infaServerName:8085/AnalystTool) and create a new project, which we’ll call Data_Profiling_Example:

Figure 1: Creating a project in Informatica Analyst

Now we add a data source; in this example we are going to load a file with information from AdWords. For demonstration purposes, several errors have been introduced into the file, like different date formats. To add a file, click on the actions menu on the right-hand side of the window and click add flat file:

Figure 2: Adding data from a file in Informatica Analyst

Importing data from files is straightforward if we follow the wizard. In this example, we are going to set comma separated values, header present, data starting in line 2, and all the columns will be strings. The tool will automatically detect the length of the fields.

Figure 3: Add flat file wizard in Informatica Analyst

Now we need to create a new profile for our data, and we can do this by clicking on new profile on the menu on the right. In the wizard, we select our data file and accept all the default values.
Once the profile has been created we can review the values of the data, the percentage of nulls, and term frequencies in the results panel, as well as being able to analyse the patterns of the data values for every column. We can also view a summary of basic statistics, such as the max value, the min value and the top and bottom values for each column.

Figure 4: Profiling results in Informatica Analyst

In our example we can see several issues in the data of the file. For example, in the image below we can see  that the year is incorrect for some records (we are assuming that the file should contain just the numeric value for the year). In this example, the file should only contain data for 2nd January 2015, so it looks like the file has some invalid records, as there are some records with a different year, and others with a wrong value. This could be due to a bad extraction from the source system, or a wrong delimiter in some rows. In order to measure the quality of the file, we are now going to create some business rules, add them to the data profile, and finally create a visualization.

The data analysts from our organization have given us the following business rules:

the year must be 2015 for this file
the day column must always be 1/2/2015
the file should only contain Enabled campaigns

We will create two business rules to validate the year and the day columns, and for the Enabled campaigns we will set up the value Enabled in the campaign_status column as valid.

We can create the business rules in two ways: by using the expression builder in the Analyst tool, or by creating a mapping using Informatica Developer. To create the business rule directly in the profile we simply click on edit, then on the column profiling rules, and then on the plus sign to add a rule.

Figure 5: Creating rules in Informatica Analyst

Then we select new rule for the year column and enter the expression you can see in the following image. We can save the rule as reusable; this way we will be able to apply exactly the same rule for a different column in the file if necessary.

Figure 6: New rule wizard in Informatica Analyst

We will implement the second rule in the Developer tool. To do this, we open Informatica Developer and connect to our project, then create a mapplet with an input transformation, an expression and an output transformation, and save it as DayValidator. To validate the rule, we can right-click on the rule and select validate.

Figure 7: Creating a rule in Informatica Developer

We will define the expression with three possible output values: not a date, Valid date and Invalid date.

Figure 8: Defining rules in Informatica Developer
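The exact expressions are shown in the screenshots, but the logic behind the two rules boils down to the checks sketched below (written in Python rather than Informatica's expression language, with assumed field names and an assumed month/day/year date format, since '1/2/2015' is described as 2nd January 2015):

from datetime import datetime

def validate_year(year: str) -> str:
    # Business rule: the year must be 2015 for this file.
    return 'Valid' if year == '2015' else 'Invalid'

def validate_day(day: str) -> str:
    # DayValidator rule: three possible outputs.
    try:
        parsed = datetime.strptime(day, '%m/%d/%Y')
    except ValueError:
        return 'not a date'
    return 'Valid date' if parsed == datetime(2015, 1, 2) else 'Invalid date'

print(validate_year('year 2015'))  # 'Invalid'
print(validate_day('1/2/2015'))    # 'Valid date'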

Once the rule has been created, we can go back to Informatica Analyst, edit the profile and now, instead of creating a new rule, we are going to apply the DayValidator rule we just created in Developer to the day column in the profile. We will call the output of the rule IsValidDay:

Figure 9: New rule wizard in Informatica Analyst

Now we are ready to run the profile again and review the outcome of the two different rules:

Figure 10: Data profiling project in Informatica Analyst

Reviewing the results, we can see that the data contains wrong values for Day and Year:

Figure 11: Reviewing profiling results in Informatica Analyst

 

2. Create a Scorecard for the Profile

Now that we have executed and checked the profile, we can create a scorecard to measure the quality of the file as the last step in this data quality assessment. In order to do this, we have to go to the profile and add it to a new scorecard. We can define the valid values for each column in our data. In this example, we are going to create the scorecard with three metrics called scores (both outputs from the rules and the campaign status) and then select the valid values for each different score.

The scorecard allows us to drill down from the score to the data. We select the key of the file (first three columns), the campaign status, and the output from both rules as drilldown columns; this way we can easily export the invalid rows to a file and send the results to the owner of the data so they can fix the wrong values and re-run the proper processes to update the data.

Figure 12: Data profiling scorecard in Informatica Analyst

This concludes the first article in this series about Data Quality with Informatica.
In the next couple of blog posts we’re going to explain how to standardize and deduplicate data. Stay tuned!
If you would like to know more about the Data Quality Services we offer click here!

Data Quality with EDQ – Part 3: Data Deduplication


 

In our previous article about Data Quality we reviewed some of EDQ's basic cleansing and standardization capabilities.

In this post, which is the last in the series, we are reviewing the basic data deduplication capabilities of EDQ.

Introduction

The first step in a data deduplication or consolidation process flow is record matching: the process of finding duplicate records in the data, that is, records which may relate to a single real-world entity. This is not a trivial task, since we need to identify duplicate records despite different formats and conventions, typos, missing data, etc.

EDQ comes with some built-in transformations to match records; their names are self-explanatory in most cases:

Advanced Match: gives control over all the configuration options
Match Entities
Match Households
Match Individuals (Name, Address, DoB)
Match Individuals (Name, Address)

There are also processors for a variety of different matching scenarios:

Consolidate: matches and merges multiple datasets
Deduplicate: matches a single dataset
Enhance: adds information from one or more reference datasets
Group & Merge: gives a simple merge based on exact match only
Link: finds matches between datasets without merging the data

The matching processors execute a series of ordered sub-processors where we can configure the logic to be used in the matching and the decision rules for candidate duplicate records, among other things. The sub-processors are:

Input: select the attributes from the data stream included in the matching process
Identify: create identifiers to use in matching, i.e., what data will be used to identify records, and map the identifiers to attributes
Cluster:  divide the data streams into clusters; this can be used to reduce the number of comparisons needed and thus the time necessary to run the process
Match: choose which comparisons to perform, and how to interpret them with match rules
Merge: use rules to merge matching records, to create a ‘best’ set of output records (optional)

In the following section we are going to run through an example of data deduplication using EDQ.

 

1. Deduplicating Data using EDQ

Let’s imagine that we have a dataset with a list of people, with many fields containing personal information, including full name, date of birth, and address. We want to see if we have duplicate persons in the table, and to do this, we are going to use EDQ to find matching records based on the full name, address, and date of birth.
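EDQ's matching is much more sophisticated than this, but the underlying idea can be illustrated with a small sketch that compares two records on full name, address and date of birth using a simple string-similarity measure (an illustration of fuzzy matching only, not EDQ's actual algorithm):

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Simple string similarity between 0 and 1.
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def is_candidate_match(rec1: dict, rec2: dict, threshold: float = 0.8) -> bool:
    # Flag two records as duplicate candidates when the dates of birth are
    # identical and both name and address are similar enough.
    return (
        rec1['dob'] == rec2['dob']
        and similarity(rec1['full_name'], rec2['full_name']) >= threshold
        and similarity(rec1['address'], rec2['address']) >= threshold
    )

a = {'full_name': 'John A. Smith', 'address': '12 Main St, Boston', 'dob': '1980-05-01'}
b = {'full_name': 'Jon Smith', 'address': '12 Main Street, Boston', 'dob': '1980-05-01'}
print(is_candidate_match(a, b))  # True with this illustrative threshold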

For this specific case, the best processor to use is Match Individuals (Name, Address, DoB). We’re going to add this processor to a new process, connecting all the fields available in the dataset as identifiers:

Figure 1: Process for data standardization and deduplication in EDQ. The steps Create GivenName, Create FullName, and Create WholeAddress are concatenations of strings to form the full name in a single string and the whole address in another string.

Figure 2: Matching identifiers of the Match Individuals processor in EDQ.

We can run the process with this configuration to review the results:

Figure 3: Running the Match Individuals processor in EDQ.

In the new window that opens we can review the matching records, set decisions for actions for each individual case, and even add comments:

Figure 4: Results of the Matching Individuals processor in EDQ. We can review and set a decision for each pair of duplicate candidates individually.

These results can also be exported to Excel for further investigation.

Now we’re going to merge the duplicated records using the Merge sub-processor of the Match Individuals processor; we simply double-click on the Merge sub-processor and check the option ‘Output unrelated records’.

We can write the deduplicated records into a new dataset; to do so, we have to add a new Write processor connected to the Merged output of the Match Individuals processor:

Figure 5:  The data standardization and deduplication process in EDQ, with a Writer processor to export the results to a new dataset.

The records will be written into a new staged dataset called Deduplicated Customers:

Figure 6: Writing the results of the process into a new dataset in EDQ.

Figure 7: New staged data created from the results of the process of deduplication in EDQ.

After running the process, we can see that the new dataset with the deduplicated customers has been created in Staged Data:

Figure 8: Dataset created from the standardization and deduplication process in EDQ.

And if we right-click on the Writer processor and select ‘Show results in new window’, we can see that some of the records have a MatchGroupSize field equal to 2, which means that these records come from the merge of two records.

 

Conclusion

Data deduplication is one of the most complex tasks in the quality assurance process, and this is where tools like EDQ are extremely handy. With EDQ we were able to find matching records using complex rules and to merge the records belonging to a single entity, all in a few clicks. The ability to review each pair of duplicate candidates individually and to decide case by case proved really helpful in reducing the risk of merging records that correspond to different real-world entities.

You can try this and other examples of EDQ usage by downloading the latest virtual machine from Oracle, available here.

This concludes our series of blog articles about Oracle EDQ.
You can find the previous posts in the links below:

Introducing Data Quality

Data quality with EDQ

Part 1: Data Profiling
Part 2: Data Standardization

Have you tried EDQ in any of your projects? What was your experience using EDQ like? If you have anything you’d like to share, or if there are any questions you’d like to raise, please leave your comments below!

In the next couple of blog posts we'll explain how to profile, standardize and deduplicate data with Informatica. Stay tuned!

Click here if you would like to know more about the Data Quality Services we offer!

 

Data Quality with EDQ – Part 2: Data Standardization


 

In the first post of the Data Quality Series we introduced the main concepts of data quality. In the second article we explained how Oracle Enterprise Data Quality (EDQ) can help us profile our data.

The focus in this post is to review some of EDQ’s basic cleansing and standardization capabilities.

Introduction

EDQ provides a number of built-in transformations for data cleansing, and some of the most commonly used transformations are:

Character Replace: replaces characters according to a map
Denoise: removes noise characters from text attributes
Lower Case: converts string values to lower case
Upper Case: converts string values to upper case
Proper Case: converts string values to proper case
Normalize No Data: normalizes attributes with no data to nulls
Normalize Whitespace: removes leading and trailing whitespace, and normalizes inter-word whitespace to a single space
RegEx Replace: replaces a string value matching a regular expression with a given value, or part of the matching expression
Replace: replaces values using a map

EDQ also comes with a series of built-in processors for data type conversion that make the normalization of types in our data really easy. There are also transformations that can be used to enhance our data, for example by adding dates, concatenating or parsing fields, or adding initials from text values.

Maps can be very useful for data standardization: by keeping all the possible values representing an entity, together with the corresponding standard value for that entity, in map tables, we can standardize the data using EDQ in just a couple of simple steps. In the following example, we will see how we can improve the quality of two common fields in datasets about people: Country and Gender.

 

1. Data Standardization with EDQ

Let’s suppose that we have a dataset with a Country field. The table should contain data mostly for the United States; however, a quick Frequency profile shows the following results:

Figure 1: Frequency profile of the Country field. We can see that some countries are represented by more than one value in the database.

We can see that we have very bad data quality for this field: the values ‘USA’, ‘US’, ‘U.S.A’, ‘United States’ and ‘U.S’ all belong to the single entity that we would like to call ‘United States of America’. Moreover, there are two different values for Canada (‘Canada’ and ‘CAN’), and also some null values. With EDQ, we can easily fix issues like this using mapping tables.

The first transformation that we need is the Normalize Whitespace processor, which removes leading and trailing whitespace, and normalizes inter-word spaces to a single space. Simply create a new process, add the transformation, connect the Reader output to the normalizer input, and select the fields that need whitespace normalization:

Figure 2: Normalize Whitespace processor in EDQ. Before processing text fields for standardization, it is recommended to remove leading and trailing spaces and to normalize inter-word spaces to a single space.

We will generate country values for the null records from another field containing the city, using an external mapping called City to Country Mappings. This table maps cities to their corresponding countries:

Figure 3: City to Country Mappings

To use it in our project, we just need to add it to the Reference Data section:

Figure 4: Adding Reference Data

We will use the mapping to generate the name of the country in a new column called “Derived Country” for each city in our dataset. To do this, we add the processor Enhance from Map, connected to the output of the Normalize Whitespace processor, using City as the field to match, and the added mapping:

Figure 5: Process for normalizing whitespace and populating the Country field in EDQ

Figure 6: Enhance from Map processor in EDQ. We will populate the “Country” field using the “City to Country” mapping, matching by the field “City”.
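The enhancement itself is essentially a lookup: when the country is missing, derive it from the city using the map. A minimal sketch of the idea, with an assumed extract of the mapping:

# Assumed extract of the City to Country Mappings reference data.
CITY_TO_COUNTRY = {'boston': 'United States of America', 'toronto': 'Canada'}

def enhance_country(country: str, city: str) -> str:
    # Derive the country from the city when the country field is empty.
    if country and country.strip():
        return country
    return CITY_TO_COUNTRY.get(city.strip().lower(), '')

print(enhance_country('', 'Boston'))  # 'United States of America'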

After running this process, we can see in the Enhance from Map processor results that we were able to enhance 3338 fields:

Figure 7: Results of the country enhancement using the City to Country Mapping in EDQ

However, 2100 records remain with an unenhanced country, so we still have work to do on this field to make it fit for use. The next step is the normalization of the name, using another mapping called Country Variants. This mapping contains the most common variants for ‘United States of America’, plus variants for Canada:

Figure 8: Country Variants mapping. The Standardized Country column contains the standard that we want to use for each real world entity, in this case countries.

To normalize these different variants, we add a Replace processor to the process, calling it Standardize Country:

Figure 9: Country Standardization process in EDQ.

In the properties of the transformation, we need to define Country as the input field (this is the field that we are standardizing), and we have to specify the replacements in the Options tab. We will use the mentioned Country Variants mapping, containing the replacements.

Figure 10: Replace processor in EDQ. The processor will replace the original value of the Country field with the standard from the Country Variants mapping.

Finally, we simply add a Merge Attributes processor to merge the country derived from the city and the standardized country into a new column, ‘Best Country’, and a Frequency Profile processor to check the results:

Figure 11: Complete Country Standardization process in EDQ. The Merge Attributes processor, called “Create Best Country Attribute”, merges the values of the country derived from the city and the country obtained after standardization using the Country Variants mapping.

The transformation will merge the derived country with the standardized country into a new column called “Best Country”:

Figure 12: Merging attributes in EDQ.
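Conceptually, the Merge Attributes step is just a coalesce: take the country derived from the city when it is available, otherwise fall back to the standardized country (the precedence here is an assumption for the sketch):

def best_country(derived_country: str, standardized_country: str) -> str:
    # Prefer the country derived from the city; otherwise use the value
    # obtained from the Country Variants standardization.
    return derived_country or standardized_country

print(best_country('United States of America', ''))  # derived value wins
print(best_country('', 'Canada'))                    # falls back to the standardized value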

With a Frequency profiler we can analyse the results of the new field:

Figure 13: Frequency profiler to verify the results of the standardization of the Country field in EDQ.

We can see how the quality of the field has improved thanks to the transformations applied:

Figure 14: Country Frequency profiler before standardization in EDQ, showing null values and many values for the same countries.

 

Figure 15: Country Frequency profiler after standardization in EDQ.

We were able to eliminate all the variants of United States of America, and we also identified other countries that were not appearing correctly before the standardization process.

Now let’s suppose we have a dataset with people including a field for Gender, which should contain either ‘M’ or ‘F’ (Male or Female) for each record. However, a quick profiling of the field shows the following results:

Figure 16: Gender frequency profiling in EDQ.

We can see that we have a lot of null values, and some records with ‘U’ in the gender field. We’re going to improve the quality of the field with EDQ, by trying to generate the gender using the title and the first name fields, and merge this generated gender with the Gender field available in the data to get the best result.

We’ll begin by adding an Unstructured Name Parser. We will use it to parse the name field, so that will be the input of the transformation, specified in the Input sub-processor of the Parser processor:

Figure 17: Parsing text fields with EDQ.

Figure 18: Input sub-processor of the Unstructured Name Parser processor in EDQ. The input is specified here.

The name will be mapped to the only attribute of the processor called “FullName”:

Figure 19: Map sub-processor of the Unstructured Name Parser processor in EDQ. The FullName attribute is mapped to the input field Name.

After running the process, we can see all the steps taken and all the rules created in the results of the parser in order to parse the name field into three new fields: P1_Prefix with the title, P1_First with the first name, and P1_Last with the last name, for each record:

Figure 20: Results of the Unstructured Name Parser, with the fuzzy rules created to parse the string containing the name into three separate fields for the title, first name, and last name.

Now we can use these fields to derive the gender of each person in the dataset. To do this, we add the built-in transformation Add Gender to the process, and join it to the pass output of the parser transformation:

Figure 21: Enhancing a Gender field in EDQ. The process contains a parser transformation and EDQ's built-in transformation for adding the gender.

To derive the gender, we need to specify the fields that contain the title and the first name in the processor configuration, as well as the stated gender, which the processor will use in combination with the derived gender to create the output “Best Gender”.

Figure 22: Enhancing a Gender field in EDQ. The Add Gender processor requires three attributes: title, first name, and stated gender.
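The idea behind the Add Gender processor can be sketched as follows: derive a gender from the title (and, where possible, the first name), then combine it with the stated gender to produce the best gender. The title map below is an assumption for illustration, not EDQ's actual reference data:

TITLE_TO_GENDER = {'mr': 'M', 'mrs': 'F', 'ms': 'F', 'miss': 'F'}  # assumed map

def derive_gender(title: str) -> str:
    # Derive 'M' or 'F' from the title, or '' when unknown.
    return TITLE_TO_GENDER.get(title.strip().lower().rstrip('.'), '')

def best_gender(stated_gender: str, title: str) -> str:
    # Use the stated gender when it is valid, otherwise the derived one.
    if stated_gender in ('M', 'F'):
        return stated_gender
    return derive_gender(title) or 'U'

print(best_gender('U', 'Mrs.'))  # 'F'
print(best_gender('', 'Dr'))     # 'U'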

If we run the process now, we can see how we have improved the quality of this field in the results of the Add Gender processor:

Figure 23: Results of the Gender standardization in EDQ.

Now there are only 6 records with missing gender, and we have reduced the percentage of missing gender records from 19.5% to 0.1%.

 

Conclusion

EDQ provides some solid features that make data cleansing and standardization tasks much easier: we can apply built-in transformations, use external maps, and apply business rules to standardize, correct wrong values, normalize fields, and enhance our data.

You can try these and other usage examples of EDQ for standardization by downloading the latest EDQ virtual machine and hands-on lab from the following links:

Latest EDQ virtual machine from Oracle
Getting Started Hands-On Lab

The next article, which is the last in the series, reviews the basic data deduplication capabilities of EDQ.


Click here if you would like to know more about the Data Quality Services we offer!

Data Quality with EDQ – Part 1: Data Profiling


 

This is the second article in our blog series about Data Quality. In the first part of the series we introduced the main concepts of Data Quality.

The focus of this article is to introduce Data Profiling with EDQ.

Introduction

Enterprise Data Quality (EDQ) is Oracle’s total solution for data governance and data quality management. It is part of the Oracle Fusion Middleware product range and the current version, 12.2.1.2, can be downloaded here.

There is also a virtual machine for testing EDQ with a sample dataset, available for download at the following link.

You can also find extensive documentation on the Oracle webpage - go to the following links for documentation and more resources.

Oracle EDQ can be used to profile, audit, standardize, normalize, and deduplicate our data. The list below summarizes the key features of EDQ:

Integrated data profiling, auditing, cleansing and matching
Browser-based access to the different client applications used to operate and configure the product
Ability to handle all types of data
Connection to files (text, XML, Office) and any JDBC and ODBC compliant data sources and targets
Multi-user project support
SOA support to create processes that may be exposed to external applications as a service
A single repository to hold data, gathered statistics, and project tracking information
Fully extensible architecture allowing the insertion of any required custom processing

EDQ is fully extensible, offering extensions for Address Verification and Watchlist Screening, for example, and can be integrated as batch processing or as real-time processing with other integration tools.

In this article we will review the basic profiling capabilities of EDQ and explain how we can profile our data using EDQ in just a few steps.

 

1. Data Profiling

The first step in a data quality project is profiling the data. Data profiling is the analysis of data to clarify its structure, content, relationships and derivation rules. Profiling helps not only to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata, and so the purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not.

EDQ comes with several built-in transformations to profile our data quickly and to start discovering patterns and quality issues. Some of the most important transformations for profiling are:

Quickstats profiler: analyses high-level completeness, duplication, and value frequency across many attributes, and highlights possible issues. For example, we could use this transformation to identify duplicate values in columns that should contain unique values, or columns with more unique values than expected.

Figure 1: Example of the results of the Quickstats profiler in EDQ. We can start discovering issues in the data using this transformation, for example, fields with null values, or with more unique values than expected. 

Data types profiler: analyses attribute values for their data types, and assesses data type consistency. In the example below, we can see that some columns have inconsistent data types, and we can clearly identify wrong values (the Street column has 1 numeric value and the rest text, the Cell column has 6 Date/time values, the Active columns have 15 numeric values, etc.).

Figure 2: Example of the results of the Data types profiler in EDQ. This transformation is very useful to find data type consistency issues.

Max/Min profiler: finds minimum and maximum values – longest, shortest, lowest, and highest.

Figure 3: Example of the results of the Max/Min profiler in EDQ. One of the basic profiles that we can do on our data is an analysis of the maximum and minimum values of each field to find wrong values.

Frequency profiler: analyses value frequency across many attributes. In the image below, we can see how this transformation can be used to rapidly identify consistency problems in the values of the data.

Figure 4: Example of the results of the Frequency profiler in EDQ. We can already see data quality issues.

Patterns profiler: analyses character patterns and pattern frequency across many attributes.

Figure 5: Example of the results of the Patterns profiler in EDQ, very useful to find data quality issues in fields such as zip codes, telephone numbers, or email addresses.

Record completeness profiler: analyses records for their completeness across many attributes.

Figure 6: Example of the results of the Record completeness profiler in EDQ.

Other profiling transformations include, but are not limited to, number profiler, character profiler, date profiler, and RegEx patterns profiler.
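Most of these profilers compute fairly simple statistics per column. As an illustration of the kind of figures they report (a plain Python sketch, not how EDQ is implemented), the following computes completeness, distinct values, value frequencies and character patterns for a single column:

from collections import Counter

def char_pattern(value: str) -> str:
    # Reduce a value to its character pattern, e.g. 'AB-12' -> 'AA-99'.
    return ''.join('A' if c.isalpha() else '9' if c.isdigit() else c for c in value)

def profile_column(values):
    # Return basic profiling statistics for a list of raw values.
    non_null = [v for v in values if v not in (None, '')]
    return {
        'completeness': len(non_null) / len(values) if values else 0.0,
        'distinct_values': len(set(non_null)),
        'value_frequencies': Counter(non_null).most_common(5),
        'pattern_frequencies': Counter(char_pattern(v) for v in non_null).most_common(5),
    }

print(profile_column(['02/01/2015', '02/01/2015', 'year 2015', None]))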

 

2. Data Profiling with Oracle EDQ

EDQ developments are called projects, which contain processes that run processors (transformations). There is also the possibility to include processors in jobs, to make them accessible to external applications. All these developments are created using the Director application, accessible from the EDQ Launchpad on the EDQ server:

Figure 7: The EDQ Launchpad. All the client applications are accessible from this page.

The Director is a Java application, and is the main application for the operation of EDQ.

It is divided into five different panes:

Figure 8: The Director client application, the main application for developing in EDQ.

The projects and processors, together with the data stores, staged data, reference datasets, and other files, will be shown in the Project Browser pane (1). The main section of the application is the Project Canvas (2), where the different processes and jobs are built. The Tool Palette (3) will show the different tools available for use in different data quality processes and jobs. The Results Browser (4) will show the results of the processor selected on the canvas. Finally, the Tasks window in the bottom left corner (5) shows the progress of the execution of the processes as they are running.

To start profiling our data, the first thing we need to do is create a new project in the Director:

Figure 9: Creating a project in EDQ.

Then we need to add a data store to connect to:

Figure 10: Creating a new data store in EDQ.

Once we have a valid data store we can connect to, we have to create a snapshot of the desired tables in the data store. All the processing done in EDQ will be done on this snapshot, so the original data source will remain unchanged.

Figure 11: Creating a data snapshot in EDQ.

Finally, we need to create the process that will contain all the profiling transformations or processors:

Figure 12: Creating a process in EDQ.

After selecting our created snapshot, we can select the processors that we need on the next screen of the New Process window by checking the box ‘Add Profiling’, or we can leave it unchecked and manually drag and drop the processors one by one from the tool palette to the process canvas.

Figure 13: A simple data profiling process, containing the Reader processor and six profiling processors.

If we add the processors to the process manually, we will have to connect the output of the reader transformation to the input of the profiling transformations.

To run the process, we simply need to save it then click on the run icon:

Figure 14: Executing a process in EDQ.

The execution progress will be shown in the Tasks window, and once all the processors have finished the results of each individual transformation will be shown in the Results Browser. All these results can be exported to Excel using the buttons above the results in the Results Browser.

 

3. Creating Issues

A lot of common problems that could jeopardize the quality of BI or Analytics solutions can be identified with data profiling. EDQ includes features to create a collaborative environment between the different users involved in the data governance processes. For example, we can create issues and assign them to DBAs, application developers, ETL developers, BI developers, and business owners as they are found. To do this, simply right-click on any value that might indicate a data quality issue in the Results Browser, and select Create Issue.

Figure 15: Creating issues in EDQ.

The issues can be reviewed via the Issue Management application, accessible from the EDQ Launchpad or from the Director:

Figure 16: Link to the Issue Manager application accessible from the Director in EDQ. The Issue Manager application can also be accessed from the EDQ Launchpad.

 

4. Creating Reference Datasets

The data profiling step is useful for creating datasets that can be used later on as reference for data auditing. To do this, simply right-click on the valid values for a given field and select Create Reference Data.

Figure 17: Creating reference data in EDQ.

Figure 18: Reviewing reference data in EDQ.

We can add extra rows if there are missing values in our data.

After saving, this dataset will be the single source of truth for the field, and it can be used in other processes for auditing the data.

We can also create reference data from patterns instead of from individual values. EDQ also comes with many different types of reference data out-of-the-box, which can speed up the creation of data auditing processes.

 

5. Data Auditing

There are a number of processors available to audit data, called data checks. The most important ones are:

Data Type Check: checks that a string attribute contains data of the expected data type
Duplicate Check: checks for records that are duplicated across selected attributes
Email Check: checks email addresses are in a valid syntactic format
Invalid Character Check: checks a string attribute for invalid characters
Pattern Check: checks values for an attribute against reference data of valid and invalid character formats
No Data Check: checks whether or not values in an attribute contain any meaningful data
RegEx Check: checks values for a string attribute against reference data of valid and invalid regular expressions
Business Rules Check: checks data against a set of business rules

These processors allow the output streams to be split so that valid and invalid records can be handled separately or together as they are audited:

Figure 19: Different streams of data from a data check transformation in EDQ. The records with populated email can be treated separately from the records without email using the different outputs.
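As a rough illustration of what such a check does (a sketch, not EDQ's implementation), an email syntax check simply routes each incoming record to a valid or an invalid stream, which can then be handled separately:

import re

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

def email_check(records, field='email'):
    # Split records into valid and invalid streams based on email syntax.
    valid, invalid = [], []
    for rec in records:
        target = valid if EMAIL_RE.match(rec.get(field, '') or '') else invalid
        target.append(rec)
    return valid, invalid

valid, invalid = email_check([{'email': 'jane@example.com'}, {'email': 'not-an-email'}])
print(len(valid), len(invalid))  # 1 1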

Data can be audited using reference data created by users. For example, the List Check and Pattern Check processors allow specifying reference datasets to determine valid and invalid records:

Figure 20: Using reference data for auditing in EDQ.

This way the processors will check if the selected field of each record in the database matches the reference data. We can drill down in the results of these processors to see the matching and un-matching records, and export the records to Excel to share them with DBAs, business users, or other users.

 

6. Data Profiling with ODI

Oracle Data Integrator comes with some data profiling capabilities which, although not as complete as those of EDQ, are enough for small projects with basic profiling needs. The table below shows the comparison between the data profiling features included in ODI and the more advanced features of Oracle EDQ:

Figure 21: Data profiling feature comparison between ODI and EDQ.


Conclusion

EDQ offers some great features to profile and audit data in a quick and simple way, and in a collaborative environment to raise issues with other users involved in the data quality assurance process. Tasks that we would normally do manually on the databases can be done with EDQ much faster, speeding up the process of finding quality issues in our data. A good profiling and auditing of our data is the first requisite for a successful quality assurance process, which in turn is a requisite for the success of any Analytics / BI project.

Click here to read the next article in the Oracle EDQ series, in which we explain data standardization and cleansing using EDQ.

Don´t hesitate to contact us if you would like to know more about the Data Quality Services we offer!
