Data Quality with Informatica – Part 2: Data Standardization

Introduction

In the previous article in this series about Data Quality we explained how we can profile our data using Informatica.

We learned that the data in our example contained some columns that needed to be fixed:

Keyword_ID: our data contain records with value '--' in this field, which represents a null value; in order to standardize this value across all sources we are going to change it to 'NA'.
Currency: the currency field is also not consistent as the values are in both upper and lowercase, and some records contain values that do not correspond to a currency code. We are going to fix this by standardizing to uppercase.
Year: some records contain the year with the word 'year', e.g. 'year 2015', instead of just the value 2015; the field must be standardized to just the year in numerical format.
Quarter: the quarter field is also wrong, as some records contain the date or the year, and this field should only contain the quarters and the year number.

In this article, we are going to continue with this example and create a set of mapplets in Informatica Developer to fix these data issues.

 

1. Creating the standardization rules

Our first mapplet will retain numerical values only, so it can be applied to any column where we need to keep only numerical values. In our case, we are going to apply it to the year column. To do this, we open Informatica Developer, right-click on the project and click on create mapplet; we’ll call it rule_Retain_Numbers. A mapplet normally contains an input and an output transformation, as it receives values and returns results, and we define the mapplet logic between these two transformations. So first we add the input and the output transformations, configure their ports and set the port length to be long enough, for instance 200 characters.

Figure 1: Creating a mapplet in Informatica Developer

Now we have to define the mapplet logic: first, we are going to use a labeller transformation to mask the letters and spaces; the labeller can be used to set a character by character analysis of data in a field, where each character is masked according to a list of standard and user-defined types: numbers will be masked as '9', spaces as '_' , letters as 'X' and symbols as 'S'.  To add the transformation, we right-click inside the mapplet, select Add transformation and then add a labeller:

Figure 2: Adding a labeller transformation to the mapplet

Now we’re going to add a new strategy to the labeller:  select character mode, then verify that the output precision is set to 200 as in the input:

Figure 3: Basic labeller configuration

The next step is to add an operation: we’re going to select Label characters using character sets instead of Label characters using reference table. We want to mask all the characters except the numbers, so we choose the English alphabet, spaces and symbols, as in the image below:

Figure 4: Configuration of the labeller transformation

Click on finish and skip the ignore text dialog which appears after clicking on next, as we don't want to add another operation. With the configuration as it is now, the labeller will output only the numbers and mask the rest of the characters, so we can add a standardizer transformation to remove them.
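To make the character-mode labelling concrete, here is a minimal Python sketch of the behaviour described above. It is not Informatica code: the function name label_characters is hypothetical, and the sketch simply assumes that digits pass through unchanged while letters, spaces and symbols are masked as 'X', '_' and 'S'.

```python
import string


def label_characters(value):
    """Mask letters as 'X', spaces as '_' and symbols as 'S'; keep digits."""
    labelled = []
    for ch in value:
        if ch.isdigit():
            labelled.append(ch)      # digits are left unmasked
        elif ch in string.ascii_letters:
            labelled.append("X")     # English alphabet
        elif ch == " ":
            labelled.append("_")     # spaces
        else:
            labelled.append("S")     # anything else is treated as a symbol
    return "".join(labelled)


print(label_characters("year 2015"))  # XXXX_2015
```

With this output, the remaining work is to strip the mask characters, which is what the standardizer does in the next step.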

The standardizer transformation can be viewed as a series of operations that are carried out on an input string to look for matching substrings in the input and to place the correct substring in the output. It is used to fix errors such as incorrect formats, and to search for values in the data that can be removed or replaced either with reference table items or specific values.

To continue with our example, it’s time to add a standardizer transformation to the mapplet as we did before, which we can name st_remove_noise; drag the output from the labeller to the standardizer input, then create a new strategy (called remove noise). We check the space delimiter after clicking on the choose button, and also remove the trailing spaces by checking both checkboxes in the strategy properties.

Figure 5: Configuring the standardizer transformation strategy

At this point we want to remove the noise labelled with ‘S’, ‘X’ and ‘_’, so we select remove custom strings in the strategy and add these characters to the custom strings in properties.

Figure 6: Configuring the standardizer transformation to remove custom strings

Click on finish and finally drag the output from the standardizer transformation to the port of the output transformation, then validate the mapplet. If we want the mapplet to be available in the Analyst, we have to validate it as a rule.

Figure 7: Standardization mapplet
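The "remove custom strings" strategy can be sketched in Python as follows. This is only an illustration of the idea, assuming the input has already been masked with 'X', '_' and 'S' as described earlier; remove_noise is a hypothetical name, not part of any Informatica API.

```python
def remove_noise(labelled, noise=("X", "_", "S")):
    """Remove every mask character from an already labelled value."""
    cleaned = labelled
    for token in noise:
        cleaned = cleaned.replace(token, "")
    # Trim any leading or trailing spaces that may be left over.
    return cleaned.strip()


print(remove_noise("XXXX_2015"))  # 2015
```

Keeping the masking and the removal as two separate steps mirrors the mapplet, where the labeller and the standardizer are independent transformations.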

Carrying on with our example, now we’re going to create another mapplet to replace the wrong currency codes we found in the file. We’re going to use a reference table to do this, which can be created using Informatica Analyst or Developer. We will use Analyst for this example.

Log into Analyst and open the file profile, select the currency column and then, in actions, click on Add to - Reference Table. In the reference table, the first value of each record will be the valid one, and the remaining values will be replaced by it:

Figure 8: Creating reference tables in Informatica Analyst

Once the table has been created we add three new columns with the values to be replaced, the first column being the correct one.

Figure 9: Reference table properties

After adding the new columns, we can edit the records and keep just one, as shown in image 10:

Figure 10: Final reference table for currency standardization in Analyst

We create a separate mapplet for this rule. We could instead add new ports to the existing mapplet, but that would increase the complexity of the standardization; keeping each rule in its own mapplet keeps the mapplets as simple as possible. For the currency mapplet we proceed as with the first one we created above, but in this case the standardizer transformation will have a different strategy: to replace the values with those present in the currency reference table. To do this, we have to select the reference table replacement for the transformation:

Figure 11: Standardizer transformation using a reference table

The mapplet will look like this; we validate it and proceed to create a new one:

Figure 12: Mapplet for the standardization of the currency field
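The reference table replacement can be pictured as a simple lookup. The Python sketch below is an illustration only: the variants "usd", "US Dollar" and "$" are hypothetical examples (the article shows the real values only in its figures), and the case-insensitive matching is an assumption of this sketch.

```python
# Hypothetical reference table: in each row the first entry is the valid
# value and the remaining entries are variants to be replaced by it.
CURRENCY_REFERENCE = [
    ("USD", "usd", "US Dollar", "$"),
]


def build_lookup(reference_rows):
    """Map every variant (lowercased) to the valid value of its row."""
    lookup = {}
    for valid, *variants in reference_rows:
        lookup[valid.lower()] = valid
        for variant in variants:
            lookup[variant.lower()] = valid
    return lookup


def standardize_currency(value, lookup):
    """Replace known variants; leave NULLs and unknown values untouched."""
    if value is None:
        return None
    return lookup.get(value.strip().lower(), value)


lookup = build_lookup(CURRENCY_REFERENCE)
print(standardize_currency("usd", lookup))  # USD
```

Values that are not in the table pass through unchanged, so a later profile will still reveal them and they can then be added to the reference table.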

We need to identify the month number to replace the values for the quarter, so we will now proceed to parse the date in a new mapplet. Informatica Data Quality comes with some built-in features to parse dates, but we are not going to use them in this example. Instead, we are going to parse the date manually, using a tokenizer to split it into three columns: day, month and year.

Click on create a mapplet and add an input, an output, and a parser transformation. We will parse the date field using the slash character as the delimiter and use regular expressions to validate the day, month and year. It’s important to note that the parser transformation creates two output ports by default: one for data that do not match the parser’s regular expressions, and an overflow port that holds data when there are not enough output ports for the correctly parsed values.

In the parser transformation, select the token parser when prompted:

Figure 13: Configuration of the parser type in the parser transformation

Name the port in the input as date, then drag and drop the date from the input transformation to the token parser; then go to the parser transformation and add one strategy with three outputs, day, month and year. Each of these ports will have a custom regular expression with “/” as the delimiter.

Figure 14: Parser configuration

Click on next and select token parser, and then select token sets in the token definition and click on choose. In the token set selection, we create a new expression for every output port of the parser transformation:

Figure 15: Configuration of token sets for parsing data

We add the monthOfYear custom token set with the regular expression shown in image 16:

Figure 16: Token set configuration for the month number

Once the token set has been added, assign it to the month column.

We have to repeat the same process with the proper regular expressions for each column, and once all the columns have been assigned, the parser mapplet should look like this in image 17:

Figure 17: Mapplet for parsing the date into day, month, and year columns using a parser transformation

We can now add the mapplet to the mapping to get a preview of the results:

Figure 18: Results preview of the parsing mapplet

We can see that there are some records that do not meet the regular expressions we set in the parser, so we have to set a default value for those records that have populated the UnparsedOutput port.
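The parsing step can be sketched outside Informatica with plain regular expressions. The patterns below are assumptions, since the article's actual expressions appear only in its figures, and the month/day/year order is inferred from the sample value 1/2/2015 (2nd January 2015) used in Part 1. The function parse_date is a hypothetical name.

```python
import re

# Assumed token patterns for month, day and year.
MONTH = r"(0?[1-9]|1[0-2])"
DAY = r"(0?[1-9]|[12][0-9]|3[01])"
YEAR = r"([0-9]{4})"
DATE_PATTERN = re.compile(rf"^{MONTH}/{DAY}/{YEAR}$")


def parse_date(value):
    """Return (day, month, year, unparsed).

    'unparsed' is an empty string when the value matches the pattern,
    and the original value otherwise (like the UnparsedOutput port).
    """
    match = DATE_PATTERN.match(value.strip())
    if match is None:
        return None, None, None, value
    month, day, year = match.groups()
    return day, month, year, ""


print(parse_date("1/2/2015"))   # ('2', '1', '2015', '')
print(parse_date("year 2015"))  # (None, None, None, 'year 2015')
```

Note that this sketch only checks each token on its own; it does not reject impossible combinations such as 2/30/2015.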

Continuing with the mapplet, we are going to add the port quarter to the output, and replace the hyphens with the string “NA”. To do this, we create an expression with one port that concatenates the quarter with the year and another port that replaces the hyphens with “NA”. We then add a decision that populates the quarter output depending on the unparsed port from the parser: if it is empty, the date was parsed correctly and the quarter field is populated; if not, the date was wrong, and the quarter is populated with “NA”. The code in the decision strategy will look like this:

Figure 19: Expression to generate the quarter based on the result of the parsing of the date

Our mapplet should look like image 20:

Figure 20: Standardization mapping with quarter parsing
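As an illustration of the decision logic, here is a small Python sketch. It is not the expression shown in Figure 19: the function names are hypothetical, and the output format "Q1 2015" is an assumption based on the "quarterName Year" convention mentioned in the results.

```python
def quarter_label(month, year, unparsed):
    """Build the quarter value, or 'NA' when the date could not be parsed."""
    if unparsed:
        # The parser left something in the unparsed port: the date was wrong.
        return "NA"
    quarter = (int(month) - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, ...
    return f"Q{quarter} {year}"


def standardize_keyword_id(value):
    """Replace the '--' placeholder with 'NA'; keep any other value."""
    return "NA" if value.strip() == "--" else value


print(quarter_label("1", "2015", ""))             # Q1 2015
print(quarter_label(None, None, "year 2015"))     # NA
print(standardize_keyword_id("--"))               # NA
```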

 

2. Creation of the standardization mapping

Now we can validate all the mapplets and add them to a mapping where we can also add the source file and a new output file with the same structure. This file will be the standardized data file. We are also going to add a union to merge the data from two different dates. The mapping should look like the one in image 21:

Figure 21: Final standardization mapping

After running the mapping, we can profile the generated file and check that it is meeting the rules that we defined at the beginning. We can see the path of the output file in the Run-time tab of the properties of the target:

Figure 22: Properties of the output transformation. We can see the name and path of the output generated in the Run-time tab

 

3. Results

Now we are ready to review the profile of the output file. For the currency column, we can see that the only value available is USD. If any other value appears, we can simply add it to a new column in the reference table. Notice that NULL values still appear, as we didn’t set a rule to standardize NULL values to “NA”.

Figure 23: Results of the standardization process for the currency column

The year column is now standardized in the merged file and we have only numerical values after the data masking and standardization:

Figure 24: Results of the standardization process for the year column

We have fixed the quarter column to obtain standard values (quarterName Year) thanks to the expressions added to the mapplet:

Figure 25: Results of the standardization process for the quarter column

We have also fixed the hyphens in the Keyword_ID column:

Figure 26: Results of the standardization process for the Keyword ID column

Conclusion

This concludes this article about Data Standardization with Informatica Data Quality. We have seen that Informatica has a lot of useful features to standardize data, and that it is a very user-friendly tool whilst still offering enough flexibility to perform complex standardization tasks.

Stay tuned for the last article in this series, where we are going to explain Data Deduplication using Informatica Data Quality.

Data Quality with Informatica – Part 1: Data Profiling

Data Quality – Part 1: Data Profiling using INFA

Welcome to the first article in the Informatica Data Quality series, where we are going to run through the basics of Informatica Analyst and the main features of Informatica Developer for data profiling.

Informatica is one of the most important data integration vendors in the market; they are behind PowerCenter, a very well-known ETL tool that can be integrated with other Informatica tools, such as Informatica Analyst, a web application used by data analysts to analyse data and create data profiles, among other tasks. In the sections below we are going to go through the necessary steps to create a data profile, a business rule for column profiling and finally a scorecard to view the results.

 

1. Create a Data Profile

To start profiling our data, first open the browser, log into the Analyst tool (the default URL is http://infaServerName:8085/AnalystTool) and create a new project, which we’ll call Data_Profiling_Example :

Figure 1: Creating a project in Informatica Analyst

Now we add a data source; in this example we are going to load a file with information from AdWords. For demonstration purposes, several errors have been introduced into the file, like different date formats. To add a file, click on the actions menu on the right-hand side of the window and click add flat file:

Figure 2: Adding data from a file in Informatica Analyst

Importing data from files is straightforward if we follow the wizard. In this example, we are going to set comma separated values, header present, data starting in line 2, and all the columns will be strings. The tool will automatically detect the length of the fields.

Figure 3: Add flat file wizard in Informatica Analyst

Now we need to create a new profile for our data, and we can do this by clicking on new profile on the menu on the right. In the wizard, we select our data file and accept all the default values.
Once the profile has been created we can review the values of the data, the percentage of nulls, and term frequencies in the results panel, as well as being able to analyse the patterns of the data values for every column. We can also view a summary of basic statistics, such as the max value, the min value and the top and bottom values for each column.

Figure 4: Profiling results in Informatica Analyst

In our example we can see several issues in the data of the file. For example, in the image below we can see  that the year is incorrect for some records (we are assuming that the file should contain just the numeric value for the year). In this example, the file should only contain data for 2nd January 2015, so it looks like the file has some invalid records, as there are some records with a different year, and others with a wrong value. This could be due to a bad extraction from the source system, or a wrong delimiter in some rows. In order to measure the quality of the file, we are now going to create some business rules, add them to the data profile, and finally create a visualization.

The data analysts from our organization have given us the following business rules:

the year must be 2015 for this file
the day column must always be 1/2/2015
the file should only contain Enabled campaigns

We will create two business rules to validate the year and the day columns, and for the Enabled campaigns we will set up the value Enabled in the campaign_status column as valid.

We can create the business rules in two ways: by using the expression builder in the Analyst tool, or by creating a mapping using the Informatica Developer. To create the business rule directly in the profile we simply click on edit, then on the column profiling rules, and then on the plus sign to add a rule.

Figure 5: Creating rules in Informatica Analyst

Then we select new rule for the year column and enter the expression you can see in the following image. We can save the rule as reusable; this way we will be able to apply exactly the same rule for a different column in the file if necessary.

Figure 6: New rule wizard in Informatica Analyst

We will implement the second rule in the Developer tool. To do this, we open Informatica Developer and connect to our project, then create a mapplet with an input transformation, an expression and an output transformation, and save it as DayValidator. To validate the rule, we can right-click on the rule and select validate.

Figure 7: Creating a rule in Informatica Developer

We will define the expression with three possible output values: not a date, Valid date and Invalid date.

Figure 8: Defining rules in Informatica Developer
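To illustrate a rule with these three outputs, here is a minimal Python sketch. It is not the expression from Figure 8: the interpretation of the three outputs, the month/day/year format and the name day_validator are assumptions, with the expected day taken from the business rule that the day column must always be 1/2/2015 (2nd January 2015).

```python
from datetime import date, datetime

# Expected day from the business rule: 1/2/2015, i.e. 2nd January 2015.
EXPECTED_DAY = date(2015, 1, 2)


def day_validator(value):
    """Classify a value as 'not a date', 'Valid date' or 'Invalid date'."""
    try:
        parsed = datetime.strptime(value.strip(), "%m/%d/%Y").date()
    except ValueError:
        # The value cannot be read as a month/day/year date at all.
        return "not a date"
    return "Valid date" if parsed == EXPECTED_DAY else "Invalid date"


print(day_validator("1/2/2015"))   # Valid date
print(day_validator("1/3/2015"))   # Invalid date
print(day_validator("year 2015"))  # not a date
```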

Once the rule has been created, we can go back to Informatica Analyst, edit the profile and now, instead of creating a new rule, we are going to apply the DayValidator rule we just created in Developer to the day column in the profile. We will call the output of the rule IsValidDay:

Figure 9: New rule wizard in Informatica Analyst

Now we are ready to run the profile again and review the outcome of the two different rules:

Figure 10: Data profiling project in Informatica Analyst

Reviewing the results, we can see that the data contains wrong values for Day and Year:

Figure 11: Reviewing profiling results in Informatica Analyst

 

2. Create a Scorecard for the Profile

Now that we have executed and checked the profile, we can create a scorecard to measure the quality of the file as the last step in this data quality assessment. In order to do this, we have to go to the profile and add it to a new scorecard. We can define the valid values for each column in our data. In this example, we are going to create the scorecard with three metrics called scores (both outputs from the rules and the campaign status) and then select the valid values for each different score.

The scorecard allows us to drill down from the score to the data. We select the key of the file (first three columns), the campaign status, and the output from both rules as drilldown columns; this way we can easily export the invalid rows to a file and send the results to the owner of the data so they can fix the wrong values and re-run the proper processes to update the data.

Figure 12: Data profiling scorecard in Informatica Analyst
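Conceptually, a score is the share of rows whose value belongs to the set of valid values, and the drilldown returns the rows that fail. The Python sketch below illustrates that idea only; the function names and the sample rows are hypothetical, and it does not reproduce how Informatica Analyst computes its scores internally.

```python
def score(values, valid_values):
    """Percentage of values that belong to the set of valid values."""
    if not values:
        return 0.0
    valid = sum(1 for v in values if v in valid_values)
    return 100.0 * valid / len(values)


def invalid_rows(rows, column, valid_values):
    """Drill down: return the rows whose value in 'column' is not valid."""
    return [row for row in rows if row[column] not in valid_values]


# Hypothetical sample rows for illustration.
rows = [
    {"id": 1, "campaign_status": "Enabled"},
    {"id": 2, "campaign_status": "Enabled"},
    {"id": 3, "campaign_status": "Paused"},
    {"id": 4, "campaign_status": "Enabled"},
]
statuses = [row["campaign_status"] for row in rows]
print(score(statuses, {"Enabled"}))                      # 75.0
print(invalid_rows(rows, "campaign_status", {"Enabled"}))
```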

This concludes the first article in this series about Data Quality with Informatica.
In the next couple of blog posts we’re going to explain how to standardize and deduplicate data. Stay tuned!

Data Quality Series – Introduction

 

This article is the first in a series of blog posts about the topic "Data Quality". In the next couple of weeks we will go through the following subject matters:

Data quality with EDQ

Part 1: Data Profiling
Part 2: Data Standardization
Part 3: Data Deduplication

Data quality with Informatica

Part 1: Data Profiling
Part 2: Data Standardization
Part 3: Data Deduplication

The focus of this first article is to introduce "Data Quality".

1. Introduction to Data Quality

Data quality is the perception or assessment of the fitness of data to serve its purpose in a given context [1]. Good quality data is crucial for analysts to create reports that contain accurate and consistent information, and in some businesses, bad quality or out-of-date data may increase costs. For example, a mailing campaign that sends letters to the wrong customer addresses is a waste of money.

Moreover, reports containing poor quality data may be regarded by customers as erroneous and thus reduce their confidence in the delivered dashboards. In fact, according to Gartner, a loss of confidence by users in their reports/dashboards is the number 1 cause of Data Warehouse / Data Mart / Data Governance project failures.

Many of the data quality tasks are very close to the database level and can be performed by a DBA, for instance by adding checks in the columns to allow just valid values, or by setting default date formats. But in some scenarios we may find that these validations are not performed, and certain data quality tools can help us to analyse and fix these issues.

The term “Data Quality” involves many different aspects:

Validity:

The data conforms to the syntax (format, type, range) of its definition. Database, metadata or documentation rules as to the allowable types (string, integer, floating point etc.), the format (length, number of digits etc.) and range (minimum, maximum or contained within a set of allowable values).

Accuracy:

The data correctly describes the "real world" object or event being described. Does it agree with an identified reference of correct information?

Reasonableness:

Does the data align with the operational context? For example, a birthdate of 01/01/01 is valid, but is it reasonable in context?

Completeness:

The proportion of non-blank values against blank values. Business rules define what "100% complete" represents: optional data might be missing, but if the data meets the expectations it can be considered complete.

Consistency:

Values across all systems in an organization reflect the same information and are in sync with each other.

Currency:

The data is current and "fresh"; data lifecycle is a key factor.

Precision:

The level of detail of the data element, e.g. the number of significant digits in a number. Rounding, for example, can introduce errors.

Privacy:

The need for access control and usage monitoring.

Referential Integrity:

Constraints against duplication are in place (e.g. foreign keys in an RDBMS).

Timeliness:

The data is available for use when expected and needed.

Uniqueness:

No value occurs more than once in the data set.

Data quality is affected by the way data is entered, stored and managed, and data quality assurance (DQA) is the process of verifying the reliability and effectiveness of data. In the section below we will see the basic approach to a successful DQA project.

 

2. Data Quality Projects

Every data quality project needs to start with an understanding of the benefits that we are going to obtain from it, that is, an assessment and understanding of the value of the data as a key to better decision-making and improved business performance, or, in other words, its utility.

It is possible to analyse how data is being used to achieve business objectives, and how the achievement of these goals is impeded when flawed data is introduced into the environment. To do this, we must consider the following points:

What the business expectations for data quality are
How the business can be impacted by poor data quality
How to correlate such impacts with specific data quality issues

Once a negative impact on the ways the business operates due to poor quality data has been determined, the necessary approach to assemble a data quality management programme and institute the practices that will lead to improvement must be planned. This plan must consider:

The processes that need to be instituted
The participants that will execute those processes and their roles and responsibilities
The tools that will be used to support the processes

Normally, a data quality project lifecycle involves at least three different steps:

1. Data Profiling
2. Data Standardization or Cleansing
3. Data Matching and Deduplication

 

3. Data Profiling

Data profiling is the analysis of data in order to clarify its structure, content, relationships and derivation rules. It mainly involves gathering different aggregate statistics or informative summaries about the data, and ensuring that the values match up to expectations.

Profiling helps not only to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata; thus the purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not.

The typical outputs of the data profiling process are:

Column Profiling:

Record count, unique count, null count, blank count, pattern count
Minimum, maximum, mean, mode, median, standard deviation
Completeness & number of non-null records
Data types
Primary key candidates

Frequency Profiling:

Count/ratio of distinct values

Primary/Foreign Key Analysis:

Referential integrity checks (can be achieved by creating a join profile)
Candidate primary and foreign keys

Duplicate Analysis:

Identify potential duplicate records

Business Rules Conformance:

The data meets an initial set of rules

Outlier Analysis:

Identify possible bad records
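Several of the column profiling outputs listed above can be computed with a few lines of code. The following Python sketch uses only the standard library and treats every value as a string, as in a flat file; profile_column and pattern_of are hypothetical names, and the pattern notation (9 for digits, X for letters) is an assumption of this sketch.

```python
from collections import Counter
import re
import statistics


def pattern_of(value):
    """Reduce a value to a pattern: digits become 9, letters become X."""
    return re.sub(r"[A-Za-z]", "X", re.sub(r"[0-9]", "9", value))


def profile_column(values):
    """Compute basic column profiling statistics for a list of strings."""
    non_null = [v for v in values if v is not None]
    non_blank = [v for v in non_null if v.strip() != ""]
    numeric = [
        float(v) for v in non_blank
        if re.fullmatch(r"-?[0-9]+(\.[0-9]+)?", v.strip())
    ]
    profile = {
        "record_count": len(values),
        "null_count": len(values) - len(non_null),
        "blank_count": len(non_null) - len(non_blank),
        "unique_count": len(set(non_blank)),
        "completeness": len(non_blank) / len(values) if values else 0.0,
        "frequencies": Counter(non_blank),
        "patterns": Counter(pattern_of(v) for v in non_blank),
    }
    if numeric:
        # Numeric statistics only make sense for values that parse as numbers.
        profile["min"] = min(numeric)
        profile["max"] = max(numeric)
        profile["mean"] = statistics.mean(numeric)
        profile["median"] = statistics.median(numeric)
    return profile


result = profile_column(["2015", "2015", "year 2015", "", None])
print(result["patterns"])
```

The pattern counts are often the quickest way to spot values such as "year 2015" hiding in a column that should contain only four-digit years.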

 

4. Data Standardization

Once the data profiling process is finished, the next step is to fix the issues that have been identified, by applying rules, replacing the wrong values, and fixing data inconsistencies. Data standardization, or data cleansing, is the process of developing and applying technical standards to the data, to maximize its compatibility, interoperability, safety and quality.

 

5. Data Matching and Deduplication

Record matching is the process of finding duplicate records in the data, that is, records which may relate to a single real world entity. This may seem like a trivial task, but in reality it is far from it. The first challenge is to identify the criteria under which two records represent the same entity, considering that the data may come in different formats and using different conventions (free text fields), may be incomplete and/or incorrect (typos, etc.), and the context of the different fields (for example, four different fields may represent a single address). We want to identify duplicates regardless of these differences in the data, so we need to define measures of distance between the records and then apply rules to decide if they classify as duplicates.

Moreover, efficient computation strategies need to be used to find duplicate pairs in large volumes of data. The number of comparisons needed using traditional methods on large datasets quickly scales up to extremely long execution times, and techniques such as a previous clustering of the data become necessary.
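The two ideas above, a distance measure between records and a way to avoid comparing every pair, can be sketched in Python as follows. This is one possible approach rather than what any particular tool does: it uses difflib.SequenceMatcher from the standard library as the similarity measure and a simple blocking key to group records before comparing them. The record fields, the threshold of 0.85 and the function names are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(records, block_key, field, threshold=0.85):
    """Return id pairs of likely duplicates, comparing only within blocks."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)

    pairs = []
    for block in blocks.values():
        # Only records that share a block key are compared with each other.
        for left, right in combinations(block, 2):
            if similarity(left[field], right[field]) >= threshold:
                pairs.append((left["id"], right["id"]))
    return pairs


# Hypothetical sample records for illustration.
records = [
    {"id": 1, "name": "John Smith", "zip": "08001"},
    {"id": 2, "name": "Jon Smith", "zip": "08001"},
    {"id": 3, "name": "Mary Jones", "zip": "08001"},
    {"id": 4, "name": "John Smith", "zip": "28001"},
]
print(find_duplicates(records, lambda r: r["zip"], "name"))  # [(1, 2)]
```

Records 1 and 4 have identical names but are never compared because they fall into different blocks. That is the trade-off of this kind of grouping: far fewer comparisons, at the risk of missing duplicates whose blocking field differs.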

Data deduplication is the process of merging record pairs representing the same real world entity, also known as consolidation. It relies on record matching to find pairs of possible duplicate records.

 

6. Data Quality Tools

All the tasks involved in a data quality assurance project can be done manually on the database, or in the ETL processes if working on a data source being loaded from another source. However, there are many vendor applications on the market to make these tasks much easier. Most of these applications can be integrated with other ETL tools, and offer batch processing as well as real-time processing to ensure we always have the best quality data.

In the next article we will introduce Oracle Enterprise Data Quality (EDQ), a leading data quality tool.

References:
[1] Margaret Rouse, “data quality”. Web. TechTarget. November 2005. Accessed January 2017. http://searchdatamanagement.techtarget.com/definition/data-quality

 

Authors:
Javier Giaretta, Nicolas Ribas and Axel Bernaus

 

Data Mining & Business Intelligence

 

The term data mining refers to one of the processes involved in the task of extracting knowledge from a database, also known as KDD (Knowledge Discovery in Databases). By extension, however, the term is often applied to the whole KDD process because of its commercial appeal. Understanding data mining as a KDD sub-process, we could define it as the process of extracting underlying knowledge from a large volume of data.

It is a recent development directly linked to the scientific fields of mathematics (mainly statistics), computer science and artificial intelligence, and it can be supported by different Business Intelligence systems, which bring several advantages.

In this article we're highlighting how a Business Intelligence system is a great starting point for the data mining process and how data mining can be used for process optimization.

The topics covered are:

  1. Definition
  2. Objectives and challenges
  3. KDD Process Phases
  4. Applications:

➀ Data mining & Business Intelligence
➁ Extracting knowledge from unstructured data
➂ Optimization of processes

Click this link to read the full article on data mining: Data Mining & Business Intelligence

BI system on Amazon Cloud | Amazon Web Services

Introduction

The purpose of this blog article is to explain how to configure a BI system in the cloud using Amazon Web Services (AWS). Our system will include an ETL server (Pentaho Data Integration, also known as Kettle), a reporting server (Tableau) and a data warehouse (Redshift). Each of these components will be based on one AWS service; these services are detailed below.

Amazon provides a set of web services hosted entirely in the cloud under a single account, and these services are easy to manage through the AWS console. The services are paid for on demand, which helps us to scale up the resources as needed and to create a budget plan that can be managed and modified easily. It also gives us the flexibility to add or remove services on demand.

For payments, AWS also provides a set of dashboards where we can review the detailed costs broken down by service.

Of the wide variety of AWS services, a few are enough to build the infrastructure we need for a BI system hosted completely in the cloud.

In this blog article I will explain how to use three AWS services to create a complete BI system:

  • EC2 (used to host the servers, ETL and reporting)
  • S3 (used to store and send files to Redshift)
  • Redshift (data warehouse)

From the console we can manage all of the web services we have signed up for; in our case we will focus on the following ones:


Amazon Web Services:

1. EC2

EC2 is the AWS compute service, used to create the machine instances needed to support our infrastructure. For our BI system we will use 2 instances, one for the ETL server and another for the reporting server.

EC2 is completely dynamic: it allows us to maintain the infrastructure through a simple and intuitive front end from which we can operate on our instances. Its main feature is that the resources of an instance can be resized on demand, adding more memory, increasing the number of CPUs, or attaching new HDD volumes.

Many other features are detailed in the following video:

In this scenario we have created 2 Windows machines for our BI system. Each instance can be selected from a set of preconfigured machines and, once created, we can modify some of its properties as explained above.


Figure 1 Creating a new instance

There are different prices and payment methods for the instances; the pricing and licenses for the different instance types can be reviewed at the links below:

https://aws.amazon.com/ec2/instance-types/

https://aws.amazon.com/ec2/pricing/

 

One of the great features of EC2 is that with only a little IT knowledge we can manage the infrastructure ourselves: we can set up our network, connect to the machines using remote desktop, and share files between the instances and our local machines. We can also take snapshots of the volumes and create images of the instances that can be downloaded and deployed on premises.

Regarding network and security configuration, we can assign a static IP to an instance and restrict access so that it is only reachable from certain IPs, which lets us secure the instances.
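
The effect of limiting access to certain IPs can be illustrated with a small Python sketch. It is not AWS code: it only mimics how an inbound rule expressed as address ranges decides whether a source address is admitted. The ranges (reserved documentation addresses) and the function name is_allowed are made up for illustration.

```python
import ipaddress

# Hypothetical allow-list, using address ranges reserved for documentation.
ALLOWED_CIDRS = ["203.0.113.0/24", "198.51.100.17/32"]


def is_allowed(source_ip, allowed_cidrs=ALLOWED_CIDRS):
    """Return True if source_ip falls inside any of the allowed CIDR ranges."""
    address = ipaddress.ip_address(source_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


if __name__ == "__main__":
    for candidate in ("203.0.113.45", "192.0.2.1"):
        print(candidate, "allowed" if is_allowed(candidate) else "blocked")
```

A /32 range admits a single address, which is the usual way to open the remote desktop port to one office or home IP only.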


Figure 2 EC2 Landing page

 

In conclusion, we can use this service to create any kind of instance that fits our needs, and we pay only for the resources we use; it is flexible and can be secured.

For the BI system we want to configure, EC2 will host 2 instances:

  • ETL server running on Windows: this server will be responsible for extracting and transforming the data and sending the generated files to S3. We will use an open source ETL tool, Pentaho Data Integration; its features can be reviewed at the following link:

http://community.pentaho.com/projects/data-integration/

 

  • Reporting server running on Windows: this server will contain the dashboards and visualizations of the information hosted on Redshift. We will use Tableau as the reporting server; its features can be reviewed at the following link:

http://www.tableau.com/products/server

 

2. S3

S3 is one of the AWS storage services; basically, it is used to store data as files in a directory structure inside a bucket. We will use this service for optimization reasons.


Figure 3 S3 Buckets

One of the bottlenecks that can appear in a BI system is loading data into the tables of the data warehouse. As these tables tend to be very large, we usually want to bulk load them, and the Redshift-S3 tandem lets us do this very efficiently.

Once we have configured our bucket and assigned a user to it, we can send files to the S3 bucket, given a URL, using the AWS command line interface (AWS CLI). This will improve the performance of the table loads, as the files on S3 can be bulk loaded into tables very efficiently.

The service also lets us secure the files, add encryption, and use some other interesting features.
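
As a minimal sketch of the upload step, the following Python helper builds the AWS CLI call (aws s3 cp) that an ETL job could run after generating a file. The bucket name, key and local path are hypothetical, and the sketch assumes the AWS CLI is installed and configured with the user assigned to the bucket.

```python
def build_s3_upload_command(local_path, bucket, key):
    """Build the argument list for: aws s3 cp <local_path> s3://<bucket>/<key>"""
    # Strip a leading slash so the key does not produce "s3://bucket//key".
    destination = f"s3://{bucket}/{key.lstrip('/')}"
    return ["aws", "s3", "cp", local_path, destination]


if __name__ == "__main__":
    # Hypothetical file produced by the ETL server and a hypothetical bucket.
    command = build_s3_upload_command(
        "C:/etl/out/sales_2015.csv.gz", "my-bi-bucket", "staging/sales_2015.csv.gz"
    )
    print(" ".join(command))
```

The list can be passed to subprocess.run(command, check=True) from a script, or the printed line can be placed in a shell step of the ETL job.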

3. Redshift

Redshift completes our BI system: it is a scalable, columnar database service based on PostgreSQL.

The latest visualization tools, such as Tableau, have built-in connectors to access the information. It's easy to connect a database client to Redshift by specifying the URL. Redshift does not support table partitioning or indexing; however, we can set sort and distribution keys on our tables to improve query performance, and we can compress tables by setting an encoding on the columns.
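
To make the sort key, distribution key and column encoding concrete, here is a small Python sketch that generates a Redshift CREATE TABLE statement. The table, the columns and the chosen encodings are hypothetical examples, not a recommendation for any particular data set.

```python
def build_create_table(table, columns, distkey, sortkeys):
    """Build a Redshift CREATE TABLE statement.

    columns is a list of (name, sql_type, encoding) tuples.
    """
    column_lines = ",\n".join(
        f"    {name} {sql_type} ENCODE {encoding}"
        for name, sql_type, encoding in columns
    )
    return (
        f"CREATE TABLE {table} (\n{column_lines}\n)\n"
        f"DISTKEY ({distkey})\n"
        f"SORTKEY ({', '.join(sortkeys)});"
    )


if __name__ == "__main__":
    # Hypothetical fact table: distributed by keyword, sorted by date.
    ddl = build_create_table(
        "fact_sales",
        [
            ("sale_date", "DATE", "raw"),
            ("keyword_id", "VARCHAR(50)", "lzo"),
            ("amount", "DECIMAL(12,2)", "lzo"),
        ],
        distkey="keyword_id",
        sortkeys=["sale_date"],
    )
    print(ddl)
```

In this sketch the distribution key decides how rows are spread across the nodes, while the sort key decides the order in which they are stored on disk, which is what range filters on the date column benefit from.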

As explained above, we will use S3 to load the tables in order to improve performance. To do this, we create a set of files on our ETL server and send them to S3 with the AWS CLI command aws s3 cp; once the files are in the bucket, we launch the Redshift COPY command to load the table. The reference for the aws s3 cp command can be reviewed at the following link:

http://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
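
Once the files are in the bucket, the table load itself is a SQL statement run on Redshift. The sketch below builds a COPY statement in Python. It assumes the files are pipe-delimited and gzip-compressed and that the cluster is authorized through an IAM role; the role ARN, bucket and table names are hypothetical.

```python
def build_copy_statement(table, s3_uri, iam_role_arn, delimiter="|", gzip=True):
    """Build a Redshift COPY statement that bulk loads files from S3."""
    options = [f"IAM_ROLE '{iam_role_arn}'", f"DELIMITER '{delimiter}'"]
    if gzip:
        # Tell Redshift that the source files are gzip-compressed.
        options.append("GZIP")
    return f"COPY {table}\nFROM '{s3_uri}'\n" + "\n".join(options) + ";"


if __name__ == "__main__":
    # The S3 value is a prefix: every file whose key starts with it is loaded.
    print(
        build_copy_statement(
            "fact_sales",
            "s3://my-bi-bucket/staging/sales_",
            "arn:aws:iam::123456789012:role/HypotheticalRedshiftRole",
        )
    )
```

Splitting a large extract into several files under the same prefix lets Redshift load them in parallel, which is where most of the performance gain over row-by-row inserts comes from.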

The relationship between S3 and Redshift is tight: we can also issue commands from our SQL client to store extracts from the tables directly into files in an S3 bucket.

Redshift is configured in nodes, and we choose between different kinds of nodes (compute or storage) depending on our needs. Once created, it can be resized, snapshots of the data can be taken, and the size can scale to petabytes. We can also apply security settings and configure alerts that will be sent to an email inbox.
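
The extract in the opposite direction is done with the Redshift UNLOAD command. As a hedged sketch, the Python helper below builds such a statement; the query, bucket prefix and IAM role ARN are hypothetical, and authorization through an IAM role is an assumption.

```python
def build_unload_statement(query, s3_prefix, iam_role_arn, delimiter="|"):
    """Build a Redshift UNLOAD statement that writes query results to S3."""
    # UNLOAD takes the query as a quoted string, so single quotes inside
    # the query have to be doubled.
    escaped_query = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"DELIMITER '{delimiter}';"
    )


if __name__ == "__main__":
    print(
        build_unload_statement(
            "SELECT * FROM fact_sales WHERE currency = 'EUR'",
            "s3://my-bi-bucket/extracts/sales_eur_",
            "arn:aws:iam::123456789012:role/HypotheticalRedshiftRole",
        )
    )
```

The resulting statement can be run from any SQL client connected to the cluster; Redshift writes one or more files whose names start with the given prefix.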


Figure 4 Redshift Cluster configuration and properties

 

Another good feature of Redshift in the management console is the ability to check query status and monitor the resources used by the database, such as disk and CPU usage, query time, etc., as seen in the following figure:


Figure 5 Redshift reports

Conclusion

AWS provides a set of on demand services that can be used to create any kind of IT system.

Regarding the benefits of using it to configure a BI system, it provides scalable, high-performance services to create a data warehouse on Redshift, host BI tools on EC2 instances with easy maintenance and security configuration, and transfer data quickly using S3. Working together, these services are a great option to consider for saving time and money on our BI system infrastructure and configuration.

 

