Data Quality Series – Part 1: Introduction


This article is the first in a series of blog posts about Data Quality. Over the next couple of weeks we will go through the following topics:

Part 1: Introduction
Part 2: Data Profiling with EDQ
Part 3: Data Standardization with EDQ
Part 4: Data De-Duplication with EDQ

The focus of this first article is to introduce "Data Quality".

 

1. Introduction

Data quality is the perception or assessment of the fitness of data to serve its purpose in a given context [1]. Good quality data is crucial for analysts to create reports that contain accurate and consistent information, and in some businesses, bad quality or out-of-date data may increase costs. For example, a mailing campaign that sends letters to the wrong customer addresses is a waste of money.

Moreover, reports containing poor quality data may be regarded by customers as erroneous and thus reduce their confidence in the delivered dashboards. In fact, according to Gartner, a loss of confidence by users in their reports/dashboards is the number 1 cause of Data Warehouse / Data Mart / Data Governance project failures.

Many data quality tasks are close to the database level and can be performed by a DBA, for instance by adding constraints on columns to allow only valid values, or by setting default date formats. In some scenarios, however, these validations are not in place, and dedicated data quality tools can help us analyse and fix the resulting issues.

The term “Data Quality” involves many different aspects:

Validity:

The data conforms to the syntax (format, type, range) of its definition: database, metadata or documentation rules defining the allowable types (string, integer, floating point, etc.), the format (length, number of digits, etc.) and the range (minimum, maximum, or containment within a set of allowable values).

Accuracy:

The data correctly describes the "real world" object or event being described. Does it agree with an identified reference of correct information?

Reasonableness:

Does the data align with the operational context? For example, a birthdate of 01/01/01 is valid, but is it reasonable in context?

Completeness:

The proportion of non-blank values in the data. Business rules define what "100% complete" represents: optional data might be missing, but if the data meets expectations it can be considered complete.

Consistency:

Values across all systems in an organization reflect the same information and are in sync with each other.

Currency:

The data is current and "fresh"; data lifecycle is a key factor.

Precision:

The level of detail of the data element, e.g. the number of significant digits in a number. Rounding, for example, can introduce errors.

Privacy:

The need for access control and usage monitoring.

Referential Integrity:

Constraints are in place so that references between records remain valid (e.g. foreign keys in an RDBMS).

Timeliness:

The data is available for use when expected and needed.

Uniqueness:

No value occurs more than once in the data set.
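Several of these dimensions can be checked with simple scripted rules. As a minimal sketch (the records, field names and email rule below are made up for illustration), here is how validity, completeness and uniqueness checks might look:

```python
import re

# Hypothetical sample records; ids, emails and dates are illustrative only.
records = [
    {"id": 1, "email": "ann@example.com", "birthdate": "1985-04-12"},
    {"id": 2, "email": "bob@example",     "birthdate": ""},
    {"id": 2, "email": "ann@example.com", "birthdate": "2001-01-01"},
]

# A deliberately simple email syntax rule (real validation is stricter).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Validity: which records fail the email syntax rule?
invalid_emails = [r["id"] for r in records if not EMAIL_RE.match(r["email"])]

# Completeness: proportion of non-blank birthdates.
completeness = sum(1 for r in records if r["birthdate"]) / len(records)

# Uniqueness: ids that occur more than once.
ids = [r["id"] for r in records]
duplicate_ids = sorted({i for i in ids if ids.count(i) > 1})

print(invalid_emails)   # → [2]
print(duplicate_ids)    # → [2]
```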

Data quality is affected by the way data is entered, stored and managed, and data quality assurance (DQA) is the process of verifying the reliability and effectiveness of data. In the section below we will see the basic approach to a successful DQA project.

 

2. Data Quality Projects

Every data quality project needs to start with an understanding of the benefits we are going to obtain from it, that is, an assessment and understanding of the value of the data as a key to better decision-making and improved business performance; in other words, its utility.

It is possible to analyse how data is being used to achieve business objectives, and how the achievement of these goals is impeded when flawed data is introduced into the environment. To do this, we must consider the following points:

What the business expectations for data quality are
How the business can be impacted by poor data quality
How to correlate such impacts with specific data quality issues

Once a negative impact on the ways the business operates due to poor quality data has been determined, the necessary approach to assemble a data quality management programme and institute the practices that will lead to improvement must be planned. This plan must consider:

The processes that need to be instituted
The participants that will execute those processes and their roles and responsibilities
The tools that will be used to support the processes

Normally, a data quality project lifecycle involves at least three different steps:

1. Data Profiling
2. Data Standardization or Cleansing
3. Data Matching and De-Duplication

 

3. Data Profiling

Data profiling is the analysis of data in order to clarify its structure, content, relationships and derivation rules. It mainly involves gathering different aggregate statistics or informative summaries about the data, and ensuring that the values match up to expectations.

Profiling helps not only to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata; thus the purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not.

The typical outputs of the data profiling process are:

Column Profiling:

Record count, unique count, null count, blank count, pattern count
Minimum, maximum, mean, mode, median, standard deviation
Completeness & number of non-null records
Data types
Primary key candidates

Frequency Profiling:

Count/ratio of distinct values

Primary/Foreign Key Analysis:

Referential integrity checks (can be achieved by creating a join profile)
Candidate primary and foreign keys

Duplicate Analysis:

Identify potential duplicate records

Business Rules Conformance:

The data meets an initial set of rules

Outlier Analysis:

Identify possible bad records
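Many of the column-profiling outputs above can be computed with a few lines of code. A minimal sketch, using a made-up column of order amounts (where None stands for a null):

```python
import statistics

# Hypothetical column values; None represents a null.
amounts = [120.0, 80.0, None, 120.0, 95.5, None, 300.0]

non_null = [v for v in amounts if v is not None]

# Basic column profile: counts, completeness and summary statistics.
profile = {
    "record_count": len(amounts),
    "null_count": amounts.count(None),
    "unique_count": len(set(non_null)),
    "completeness": len(non_null) / len(amounts),
    "min": min(non_null),
    "max": max(non_null),
    "mean": round(statistics.mean(non_null), 2),
    "median": statistics.median(non_null),
    "stdev": round(statistics.stdev(non_null), 2),
}
print(profile)
```

Frequency profiling is the same idea applied per distinct value (e.g. `collections.Counter(non_null)`).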

 

4. Data Standardization

Once the data profiling process is finished, the next step is to fix the issues that have been identified by applying rules, replacing incorrect values, and fixing data inconsistencies. Data standardization, or data cleansing, is the process of developing and applying technical standards to the data to maximize its compatibility, interoperability, safety and quality.
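A typical standardization rule is normalizing dates entered in inconsistent formats to a single canonical one. A minimal sketch (the input formats listed are an assumption; real data may need more):

```python
from datetime import datetime

# Hypothetical raw values using inconsistent conventions.
raw_dates = ["12/04/1985", "1985-04-12", "12.04.1985"]

# Candidate input formats we expect to encounter (an assumption).
FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d.%m.%Y"]

def standardize_date(value):
    """Return the date in ISO 8601 (YYYY-MM-DD), or None if unparseable."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            pass
    return None

standardized = [standardize_date(d) for d in raw_dates]
print(standardized)  # → ['1985-04-12', '1985-04-12', '1985-04-12']
```

Values that fail every rule come back as None, flagging them for manual review rather than silently guessing.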

 

5. Data Matching and De-Duplication

Record matching is the process of finding duplicate records in the data, that is, records which may relate to a single real world entity. This may seem like a trivial task, but in reality it is far from it. The first challenge is to identify the criteria under which two records represent the same entity, considering that the data may come in different formats and using different conventions (free text fields), may be incomplete and/or incorrect (typos, etc.), and the context of the different fields (for example, four different fields may represent a single address). We want to identify duplicates regardless of these differences in the data, so we need to define measures of distance between the records and then apply rules to decide if they classify as duplicates.

Moreover, efficient computation strategies need to be used to find duplicate pairs in large volumes of data. The number of comparisons needed using traditional methods on large datasets quickly scales up to extremely long execution times, and techniques such as a previous clustering of the data become necessary.
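The two ideas above, a distance measure between records plus a cheap blocking key to avoid comparing every pair, can be sketched as follows. The customer names, the blocking key (first letter of the surname) and the 0.9 threshold are all illustrative assumptions:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records with free-text names.
customers = [
    (1, "John Smith"),
    (2, "Jon Smith"),
    (3, "Mary Jones"),
    (4, "John  Smith "),
]

def similarity(a, b):
    """Normalized edit-based similarity (0..1) after whitespace/case cleanup."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def block_key(name):
    # Blocking: only compare records that share this cheap key,
    # avoiding the full n^2 comparison on large volumes.
    return name.split()[-1][0].lower()

blocks = {}
for rec in customers:
    blocks.setdefault(block_key(rec[1]), []).append(rec)

matches = [
    (a[0], b[0])
    for block in blocks.values()
    for a, b in combinations(block, 2)
    if similarity(a[1], b[1]) >= 0.9   # threshold is an assumption
]
print(matches)  # → [(1, 2), (1, 4), (2, 4)]
```

Records 1, 2 and 4 land in the same block and score above the threshold despite the typo and extra whitespace; record 3 is never compared against them.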

Data de-duplication is the process of merging record pairs representing the same real world entity, also known as consolidation. It relies on record matching to find pairs of possible duplicate records.

 

6. Data Quality Tools

All the tasks involved in a data quality assurance project can be done manually on the database, or in the ETL processes if the data is being loaded from another source. However, there are many vendor applications on the market that make these tasks much easier. Most of these applications can be integrated with ETL tools, and offer both batch and real-time processing to ensure we always have the best quality data.

In the next post in this series we will introduce Oracle Enterprise Data Quality (EDQ), a leading data quality tool, so stay tuned!

References:
[1] Margaret Rouse, "Data Quality". TechTarget, November 2005. Accessed January 2017. http://searchdatamanagement.techtarget.com/definition/data-quality

 

Authors:
Javier Giaretta, Nicolas Ribas and Axel Bernaus

Click here if you would like to know more about Data Quality!

 

Data Mining & Business Intelligence


 

The term data mining refers to one of the steps in the task of extracting knowledge from a database, also known as KDD (Knowledge Discovery in Databases). By extension, however, it is often used for the whole KDD process because of its commercial appeal. Understanding data mining as a KDD sub-process, we can define the term as the process of extracting underlying knowledge from a large volume of data.

It is a relatively recent development, directly linked to the scientific fields of mathematics (mainly statistics), computer science and artificial intelligence. It can be supported by different Business Intelligence systems, from which we can obtain several advantages!

In this article we're highlighting how a Business Intelligence system is a great starting point for the data mining process, and how data mining can be used for process optimization.

The topics covered are:

  1. Definition
  2. Objectives and challenges
  3. KDD Process Phases
  4. Applications:

➀ Data mining & Business Intelligence
➁ Extracting knowledge from unstructured data
➂ Optimization of processes

Click this link to read the full article on data mining: Data Mining & Business Intelligence

BI system on Amazon Cloud | Amazon Web Services


Introduction

The purpose of this blog post is to explain how to configure a BI system in the cloud using Amazon Web Services (AWS). Our system will include an ETL server (Pentaho Data Integrator, AKA Kettle), a reporting server (Tableau) and a data warehouse (Redshift). Each of these components will be built on an AWS service; these services are detailed below.

Amazon provides a set of web services completely hosted in the cloud under a single account, and these services are easy to manage through the AWS console. The services are paid for on demand, which helps us to scale up the resources we need and create a budget plan that can be managed and modified easily. It also gives us the flexibility to remove or add on-demand services.

For billing, AWS also provides a set of dashboards where we can review the detailed costs broken down by service.

From the variety of AWS offerings, a few services are enough to create the infrastructure we need for a BI system completely hosted in the cloud.

In this blog article I will explain the three AWS services used to create a complete BI system:

  • EC2 (used to host the servers, ETL and reporting)
  • S3 (used to store and send files to Redshift)
  • Redshift (data warehouse)

From the console we can manage all of the web services we have signed up for, in our case we will focus on the following ones:


Amazon Web Services:

1. EC2

EC2 is the AWS compute service used to create the machine instances needed to support our infrastructure. For our BI system we will use two instances: one for the ETL server and another for the reporting server.

EC2 is completely dynamic and lets us maintain the infrastructure through a simple and intuitive front end from which we can manage our instances. Its main features include resizing instance resources on demand: we can add more memory, increase the number of CPUs, and attach new HDD volumes to an instance.


For our BI system scenario we have created two Windows machines; the instance type can be selected from a set of preconfigured machines, and once an instance is created we can modify some of its properties as explained above.


Figure 1 Creating a new instance

There are different prices and payment methods for the instances; the pricing and licensing for the different instance types can be reviewed at the links below:

https://aws.amazon.com/ec2/instance-types/

https://aws.amazon.com/ec2/pricing/

 

One of the great features of EC2 is that, with only a little IT knowledge, we can manage the infrastructure ourselves: we can set up our network, connect to the machines using remote desktop, and share files between the instances and our local machines. We can also take snapshots of the volumes and images of the instances that can be downloaded and deployed on premises.

Regarding network and security configuration, we can assign a static IP to an instance and limit access so that it is only reachable from certain IPs, keeping the instances secure.


Figure 2 EC2 Landing page

 

In conclusion, we can use this service to create any kind of instance that fits our needs, paying only for the resources we actually use; it is flexible and securable.

For the BI system we want to configure, EC2 will host 2 instances:

  • ETL server running on Windows: this server will be responsible for performing the data extraction and transformations and for sending the generated files to S3. We will use an open source ETL tool, Pentaho Data Integrator; its features can be reviewed at the following link:

http://community.pentaho.com/projects/data-integration/

 

  • Reporting server running on Windows: this server will host the dashboards and visualizations of the information stored in Redshift. We will use Tableau as the reporting server; its features can be reviewed at the following link:

http://www.tableau.com/products/server

 

2. S3

S3 is the AWS storage service; basically, it is used to store data in a file directory inside a bucket. We will use this service for optimization reasons.


Figure 3 S3 Buckets

One of the bottlenecks that can appear in a BI system is loading data into the tables of the data warehouse. As these tables tend to be very large, we usually want to bulk load them, and using the Redshift-S3 tandem this can be done very efficiently.

Once we have configured our bucket and assigned a user to it, we can send files to the S3 bucket at a given URL using the AWS command line interface (AWS CLI). This improves the performance of table loads, as files on S3 can be bulk loaded into tables very efficiently.

The service also allows us to secure the files, add encryption, and use some other interesting features.

3. Redshift

Redshift completes our BI system: it is a scalable, columnar database service based on PostgreSQL.

The latest visualization tools, such as Tableau, have built-in connectors to access the information, and it is easy to connect a database client to Redshift by specifying the URL. Redshift does not support table partitioning or indexing; however, we can set sort and distribution keys on our tables to improve query performance, and it also allows table compression by setting the encoding on the columns.

As explained above, to improve performance we will use S3 to load the tables. To do this, we will create a set of files on our ETL server and then send them to S3; once a file is in place, we will launch the COPY command to load the table. The reference for the copy command can be reviewed at the following link:

http://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
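To make the flow concrete, here is a minimal sketch of building the Redshift COPY statement that the SQL client would run after the ETL server has pushed a file to S3. The table, bucket, key and IAM role names are hypothetical placeholders:

```python
def build_copy_sql(table, bucket, key, iam_role):
    """Build a Redshift COPY statement that bulk-loads a CSV file
    staged on S3. All names passed in here are placeholders."""
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV;"
    )

sql = build_copy_sql(
    "staging.sales",                                  # hypothetical table
    "my-etl-bucket",                                  # hypothetical bucket
    "extracts/sales_20170101.csv",                    # hypothetical key
    "arn:aws:iam::123456789012:role/redshift-copy",   # hypothetical role
)
print(sql)
```

The statement itself would then be executed through any client connected to the Redshift cluster.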

The relationship between S3 and Redshift is tight: we can also issue commands from our SQL client (UNLOAD) to store extracts from the tables directly into files in an S3 bucket.

Redshift is configured in nodes, and there are different kinds of nodes (compute-optimized or storage-optimized) depending on our needs. Once a node has been created it can be resized, snapshots of the data can be taken, and the cluster can scale to petabytes. We can also apply security settings and configure alerts that will be delivered to an email inbox.


Figure 4 Redshift Cluster configuration and properties

 

Another good feature of Redshift in the management console is the ability to check query status and monitor the resources used by the database, such as disk and CPU usage, query time, etc., as seen in the following figure:


Figure 5 Redshift reports

Conclusion

AWS provides a set of on demand services that can be used to create any kind of IT system.

Regarding the benefits of using it to configure a BI system, AWS provides scalable, high-performance services: a data warehouse on Redshift, BI tools hosted on EC2 instances with easy maintenance and security configuration, and fast data transfers using S3. Working together, these services are a great option to consider for saving time and money on our BI system infrastructure and configuration.

 

Oracle BI Mobile App Designer v2 – Key Features


Introduction

Oracle BI Mobile App Designer (MAD) is the tool integrated into OBIEE 11g for performing analytics on mobile devices, as explained in a previous blog article. This article extends that introduction by explaining the new features of Mobile App Designer V2.

This version of BI MAD is delivered in patch 18794832.

Redesigned UI

The layout of the tool has completely changed: now we can add components to the pages using the "+" button at the top of the page designer, and filters and calculated fields appear on the left-hand side of the window, below the data sources. This new design is much more intuitive than the previous one:


 

Exploration Template

This new page template allows us to perform exploratory analysis of the data by adding filters on the right-hand side of the layout. For numerical data, the filter displays as a slider; for categorical data, we can choose several values from the list of available filters.


New Map Component

This is probably one of the most useful changes in this version: MAD now comes with a map component by default, so there is no need to code your own plugins to visualize maps. We can choose between region maps and location maps, which allow us to display lat-long coded data.


There are several country maps you can download from Oracle OTN here:

http://www.oracle.com/technetwork/middleware/bi-foundation/default-2397753.html

The map component in the first releases of MAD V2 did not have zoom buttons on the map, so you may need to download and install the April '15 patch to use them; you will just have to deploy the new .ear file. It can be downloaded here:

https://support.oracle.com/epmos/faces/PatchDetail?_afrLoop=280648419500544&patchId=20823740&_afrWindowMode=0&_adf.ctrl-state=qiyeq9qlu_66

Calculated Fields and filters

Now we are able to create calculated fields and filters. Using the formula editor we can create custom fields, and these fields will then be available like any other element in the data source.


To create a filter, select the Filters tab in the right pane and click the + icon; you can then apply this filter to the different charts added to the layout.


These filters can also be added to other components via the Data tab in the component properties.

Plugin SDK

This new feature allows you to create custom plugins without needing a complete environment. It comes with a test server based on Node.js that will help you develop your own plugins; you can download it and follow the installation guide by clicking on Create Plugin and then downloading bimad.zip. Once coded, you can deploy plugins as in the previous version, by pasting them into the /plugins folder of BI Publisher.


Oracle MAD Key Features


In 2013, mobile devices overtook PCs as the most common Web access tools. Through 2014, JavaScript performance will push HTML5 and the browser as a mainstream application development environment, and by 2015, over 80% of handsets in mature markets will be smartphones. At the same time, connection speeds are increasing, local storage is losing importance, and cloud computing is gaining importance every day.

Given these facts, don't you think it's the right time to start thinking about how to move your BI platform to the mobile world?

In this blog we’ll review some of the relevant features of the Oracle Business Intelligence Mobile App Designer (MAD), a component from Oracle BI Foundation Suite developed to deal with mobile devices.

Don’t panic! Despite the name we will not get MAD using it. As you will see it is a very simple tool and provides excellent results even for non-expert users.

In the first part of this blog post we’ll review some visualization techniques introduced in Oracle BI MAD: navigation list, tile & accordion layouts (which are the most striking ones) and a new type of filters. If you are used to Oracle BI EE reporting, you might find them quite interesting.

In the second part of the blog post we’ll focus on some other interesting features: plugins, subscriptions and links.

1. Visualization Features

As a dashboard designer you'll have to consider some aspects specific to mobile devices: the screen is smaller than a regular monitor, so graphics and data must fit into it, and navigation is different from using a mouse. Oracle MAD was developed with this in mind. Based on HTML5, it includes enhanced chart visualizations, with transitions and animations that allow us to show data in a different way than a classical BI tool, as we will see below.

1.1  Navigation list

Navigation pages enable you to define a navigable set of filters that you can use to filter the data visualized on the page. The navigation area of the page contains a hierarchy of dimensions that you tap through to drive the display on the detail side of the page; you can see successively more refined sets of data, or navigate up and down the hierarchical filters to see the subset of data that interests you. Additionally, you can also give a conditional format into the navigation pane, so it is easier to drill into the hierarchy of interest.

The navigation pane is available through the new page editor menu.

In the next sample we’ll set a navigation list to analyze the revenue by period, drilling from Year to Month, just with a few steps:

Select New Page to add a navigation pane

In the navigation list we select the top level of the time hierarchy (YEAR) and the aggregation type Summation for the revenue metric.


Next, we add the Quarter and Month in the navigation hierarchy to be able to drill to the lower level.

Then we can set conditional formatting for the revenue metric based on the target revenue metric, by first clicking the revenue text box and then the Edit Spotlight button to set up the thresholds.


In the spotlight editor, we set up the conditions based on user defined thresholds, and the target revenue measure.


Once we have configured the navigation list we can proceed to add some charts to the layout, and we will be able to check the data at different levels of the hierarchy just by tapping the desired level in the navigation pane.

1.2  Tile and Accordion Layout

Tile pages provide a scrollable grid of cells, each containing identical components that are repeated for a specific field in your data. You can use tile pages to provide an instant view of a key measure across a specific dimension. This view not only provides a complete picture within each cell, but also enables an at-a-glance comparison across the data set. You can further enhance the visual impact of the values represented in each cell by creating stoplight formatting to highlight cells whose results fall outside the range of specific threshold values.

In the following image you can check the look and feel of the tile layout.

The accordion layout allows you to view data in a set of elements that can be expanded vertically; it also allows you to set a metric for each element in the dimension and apply conditional formatting in a similar way to the navigation pages.

To create one, just select a new tile or accordion layout and apply your custom format.


1.3  Filters

MAD allows you to add filters to the layout; these filters are simply added via the Insert tab in the menu bar, and will display values that we can tap to filter the data within the page.


Just click on filter and add it to the layout, once added, drag and drop an object from the data source into it.


Notice that the top bar in the image above, where you can see years and quarters, acts as a filter on all the values in the layout.

Charts can be used as filters as well by tapping data into one chart. For example, we may want to filter a bar chart just to have the values that we have tapped in another chart.

As you will see in the following images, once you tap a graph, a text box appears showing that data has been filtered to show just the values for the object you tapped.


2. Other Features

2.1  Plugins    

A great feature of Oracle MAD is that it allows us to code JavaScript plugins to customize the BI app as we wish. Some plugins, such as geomaps, are built into the installation, but you can also create your own custom plugins. To make a plugin available you just need to copy the code (plugin_name.js) into the BI Publisher plugins directory, and it will appear in the Plugins toolbar. Once the plugin is available, you just have to drag and drop it into the layout.

Let’s see some of the plugins you can add to your mobile app:


In the screenshot above we drop the geomap plugin into the layout, together with the country dimension in the region box and a metric (in this case paid amount) in the value box. Another interesting plugin is the heat map, which shows rectangular areas and the value of a metric; we just need to add two levels of a hierarchy and a metric value. It also zooms in when the rectangular area is tapped, to show the details.


Another plugin, less sophisticated but just as effective, is the YouTube plugin, which allows us to include a video in our mobile app.


2.2  Links

The Link button is a really cool feature included in Oracle BI MAD. As you might guess, you can include a link in your report to a web page, but you can also send an e-mail, make a call or send a text message.

In your mobile app, just click on the Link button


And then select the type of link you want to set


As you can see, for a phone app there are four kinds of links: web page, email, phone and SMS. To create the link you can enter a data field value (from the list of fields on the left of the screen), a static value, or both.

2.3  Displaying the app

Oracle BI MAD has a built-in emulator that allows us to see what the mobile app will look like. You can test the app at any time by going to the emulator and selecting On PC. To test it on a physical device, just click Preview and then Mobile; the page will show a QR code that we can scan with our device to run the app, as well as the URL of the app we can connect to.


Once you display your app on a physical device, you may notice that layouts do not always display exactly as they did in the emulator. To avoid this, just use the built-in templates, or use frame objects to set the display properly.

2.4  Subscriptions

The distribution of mobile apps to users is easily managed with the Apps Library. As a developer you can publish a new app in the library once it is finished and ready to be distributed.

As a consumer, on your mobile device, you can choose the apps you want to subscribe to. When a subscribed app is updated, you are notified that a new version is available. The apps you are subscribed to appear in the My Apps Library folder.


Conclusions

Oracle BI MAD allows you to create mobile BI apps quickly, offering very good results in a short period of time. It is based on HTML5 technology, so it works on most mobile devices, including Android, iOS, BlackBerry and Windows Phone operating systems, on both phones and tablets.

The designer offers multiple options for enhanced, tap-driven visualizations of your data with easy navigation. And there is no need to install any app on the devices, as the app is web-browser based; subscriptions are controlled through the Oracle Business Intelligence App Store.

If you want to get more information about Oracle BI MAD contact us or leave a comment below this post!

 
