Oracle Data Visualization Desktop – April 2016 Release


1. Introduction - Oracle Data Visualization Desktop

With the release of Oracle Business Intelligence Enterprise Edition 12c (OBIEE 12c), Oracle announced a new Data Visualization tool, aimed at ad hoc personal data discovery tasks. Oracle is putting a great deal of effort into developing this new tool, which is available as:

* a component of Oracle BI Cloud Service (BICS)
* a standalone cloud service named Data Visualization Cloud Service (DVCS)
* a component of OBIEE 12c on-premise
* a standalone desktop tool named Data Visualization Desktop (DVD)


At the end of April 2016, Oracle released the first publicly available version of Oracle Data Visualization Desktop (DVD), version 12.2.1.0.0 (timestamp 20160422085526). In this blog post we present the main characteristics of DVD; most aspects are also common to the other three offerings listed above.


2. Data Sources and Data Management

There are 3 types of data sources that can be used in DVD:

  1. Excel sheets, which allow complete offline analysis
  2. Databases like Oracle, SQL Server, MySQL, Teradata, Redshift, MongoDB, Spark, etc.
  3. OBIEE, where the user can connect to an existing Analysis or use a Logical Query

Once a source has been added to a project, DVD offers different options to manage the data:

* Modify data types (string, number, date, etc.)
* Switch columns between Attributes and Measures
* Select the aggregation rule for Measures
* Create Calculations using a wide variety of functions


Multiple sources of different types can be added to a DVD project, joined automatically using fields with matching names; the joins can always be edited using the Source Diagram.



3. Visualizations

One of the most important characteristics of DVD is the high number of visualizations available out-of-the-box. There are 22 data visualizations in total, plus the possibility of including Text Boxes and Images. All the available visualizations are shown in the image below:


The visualizations are very easily created by dragging and dropping Data Elements (data columns in DVD) to the different Drop Targets (that is, the corresponding visual dimensions of the visualization).


The visualizations can be highly customized in DVD. The user can edit the titles and axes, modify the colour schemes, sort the data, change the data format, etc.

In addition, the Map visualization allows you to create custom maps using the GeoJSON format. The underlying map engine is able to render the new maps and join them to the corresponding data element.
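As a hedged illustration, the short Python snippet below generates a minimal GeoJSON file of the kind a custom map can be built from; the neighbourhood name, the coordinates and the output file name are invented for the example, and the value of the matching property has to correspond to the data element you plan to join the map to.

import json

# Minimal, hypothetical GeoJSON FeatureCollection for a custom DVD map.
# The "name" property is the value matched against the chosen data element.
custom_map = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"name": "Neighbourhood A"},   # hypothetical join value
        "geometry": {
            "type": "Polygon",
            "coordinates": [[                        # [longitude, latitude] pairs
                [2.15, 41.38], [2.17, 41.38],
                [2.17, 41.40], [2.15, 41.40],
                [2.15, 41.38]                        # ring closed on the first point
            ]]
        }
    }]
}

with open("custom_map.geojson", "w") as f:
    json.dump(custom_map, f, indent=2)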


Multiple visualizations can be combined in the Canvas, thus allowing the creation of complete dashboards to analyse the data. In addition, through the Data Brushing feature, the data selected in any visualization is highlighted in the others.



4. Data Discovery and Advanced Analytics

As a Data Discovery tool, DVD includes multiple features to facilitate the data analysis process. One simple data discovery tool is filters: the user can filter Attributes by values or Measures by ranges.


Together with the filters, Reference Lines and Trend Lines are available in DVD straight out-of-the-box. More Advanced Analytics features are available in combination with R: DVD therefore includes an Oracle R Distribution (version 3.1.1) installer that can be run after DVD itself is installed. Once R and the required libraries are installed, we can use Clustering, Outlier Detection and Forecasting, as well as custom R scripts.

In the example below we use Clusters to identify how the number of apartments by neighbourhood affects the price. In addition, we have a Reference Line to analyse the average apartment price for different room types. Finally, using Trend Lines, we can see that the relationship between minimum number of nights and price has been increasing over the last few years.


Thanks to the data discovery and advanced analytics capabilities of DVD, we can easily identify hidden information and patterns in our data. In order to keep track of the data discovery flow, we can use the Story Navigator, which allows different insights to be saved. These insights are very useful when we want to share a project, letting other users understand our findings quickly.



5. Managing Projects

It is very easy to share Projects in DVD. The first thing to do is to save them locally; the different Projects are shown in the Home page. From the Home page we can select the option to export the Project, which will create a DVA (.dva) file. It is possible to store the source data in this file and protect it with a password.


On the receiving end, we can similarly use the Import option to add the Project to our main folder on the Home page.


6. Oracle Data Visualization Training

We provide a wide range of training services designed to propel course delegates to a project-ready state, ensuring they have both the necessary BI theory and the hands-on practical skills needed to engage in a BI project with confidence.
Here at ClearPeaks we are experts in Oracle Data Visualization, and we can offer you this expertise in our specialized training sessions.

Get in touch with us and see what we can do for you!

Blog Article Author: Iñigo Hernáez

OBIEE Data Lineage Solution


Tracking OBIEE reporting data

BI implementations may use hundreds of data warehouse tables and thousands of reports, and one of the most common requirements in a BI project is the ability to track back the data used in reports. Being able to quickly identify the set of columns and tables used in a given report or dashboard, or to check which reports may be affected by a change to a physical column in the data warehouse, is also crucial for development. This calls for a data lineage solution which is accessible to users and developers and allows checking such data on an ad hoc basis.

Various vendors offer data lineage solutions, but these can be expensive and vendor-specific. With our simple solution, we combine Catalog Manager and Administration Tool export sources to create an easily accessible solution for tracking data lineage in OBIEE.

By implementing the OBIEE data lineage solution, we can check the following:

  1. Which physical tables and columns are used in a given report or dashboard
  2. Which reports use given physical columns or tables; this is especially important when modifying an existing table, as any change in the table’s structure must take existing reports into consideration
  3. Which are the most commonly used columns in reports in a given subject area; identifying these columns can suggest where to create indexes to improve the overall performance of the OBIEE implementation even further.

Solution

The ClearPeaks OBIEE Data Lineage Solution gathers all the required data lineage information in one place and uses it as a source for OBIEE reports within a Data Lineage Subject Area. Two sources are combined to achieve this:

  1. List of OBIEE reports and presentation columns used

Catalog Manager provides an option to export the list of reports and columns used. The export can be done either with the Catalog Manager UI tool or through the command line utility. We use the latter option, as it allows automation of the whole process later.

  2. Mappings of presentation columns back to the physical columns

Such mappings can be obtained manually with the Administration Tool's Repository Documentation utility, whose output is a comma-separated values (.csv) file containing the column mappings from the presentation layer through the logical layer to the physical layer, including column formula expressions. Another way to obtain the column mappings is by extracting the OBIEE repository in .xml format through the command line utility. Our solution uses the second option, as the repository file will be used in our automated script.


Once we have obtained both files, we need to populate the data lineage tables.

The data can be transformed and manually inserted into the tables, but in our solution we use a script (which can run on the OBIEE server) that parses the data and inserts it into the tables.
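As a rough sketch of what such a script can look like, the Python below reads a Repository Documentation CSV and inserts the mappings into a simple lineage table; the CSV headers, the table layout and the use of SQLite as a stand-in for the target database are assumptions made for the example and will differ from the actual ClearPeaks implementation.

import csv
import sqlite3

# SQLite stands in for whatever database hosts the data lineage tables.
conn = sqlite3.connect("lineage.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS column_lineage (
        subject_area        TEXT,
        presentation_column TEXT,
        logical_column      TEXT,
        physical_table      TEXT,
        physical_column     TEXT
    )
""")

# Assumed header names; check the headers of your own Repository Documentation export.
with open("repository_documentation.csv", newline="", encoding="utf-8") as f:
    rows = [
        (
            r.get("Subject Area"),
            r.get("Presentation Column"),
            r.get("Logical Column"),
            r.get("Physical Table"),
            r.get("Physical Column"),
        )
        for r in csv.DictReader(f)
    ]

conn.executemany("INSERT INTO column_lineage VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()
conn.close()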

Data Lineage Subject Area

Once we have populated the data lineage tables and their data model has been mapped in Administration Tool, we can create and run reports in OBIEE using Data Lineage Subject Area and filter the results according to our requirements.


Let us look at a few of the use cases for the Data Lineage Subject Area:

Use case 1. Which data warehouse tables and columns are used in a given report?

We would like to know which data warehouse tables and columns are used in a particular report or dashboard. We can create a report with a list of the columns used by a given OBIEE report and their data lineage:



Use case 2. Which reports use given physical tables?

We want to know how many and which reports or dashboards use given physical tables or particular columns; this can be very useful when assessing the potential impact of column formula or table structure changes on reporting. Using the Data Lineage Subject Area we can fetch the list of OBIEE reports that use a given physical table:



Use case 3. Which reports use given subject areas?

We need to know which reports and dashboards are accessing data from given subject areas. This may be particularly useful when revising users’ access permissions.



Future Improvements

OBIEE Data Lineage Subject Area can also serve as a backbone for further developments, providing complex and complete solutions for managing existing OBIEE project implementations. Here are some examples and potential benefits of merging additional information into the solution:

Usage Tracking data – enables analysis of the physical tables used in the most accessed reports, identification of the tables and columns that only appear in reports nobody uses, and removal of unused reports or underlying tables.

Data warehouse database metadata – such as table sizes and column indexes; this allows performance analysis of the tables and columns most heavily used by reports.

ETL data lineage – an additional layer of lineage tracking that allows data to be traced back to the very source – can be achieved by adding ETL transformation data obtained from the ETL systems. For example, it is possible to track all the ETL transformations on a given presentation column down to the source system.

Catalog metadata – it is possible to extract additional information regarding catalog objects, such as user permissions, report owners, last modified dates, etc., to further enhance the solution's usability.

Adding all the above components creates a strong core for the OBIEE management dashboard, allowing continuous improvement through:

* tracking data from the report to the source database column
* constant analysis of the most used tables in order to improve performance
* checking which users have permissions for given OBIEE reports and dashboards

All of the above is accessible from the OBIEE front-end, providing a convenient and quick way to facilitate many daily business-as-usual tasks in BI deployments.

Conclusion

The ClearPeaks OBIEE Data Lineage Solution can be easily deployed in any project using Oracle Business Intelligence Enterprise Edition. The solution can be run from the command line tools, which makes it possible to create automated jobs to extract and update data on a regular basis.

If you think your OBIEE project could benefit from our data lineage solution, please contact us for more information via our web contact form or by leaving your comments below!

Blog Article Author: Rafal Ostaszewski

How to control your ETL performance


ETL processes and the importance of control

The term ETL, meaning Extract-Transform-Load, comes from the Data Warehousing field and was traditionally used to refer to those processes that transform and move data from the source system to the data warehouse.

Nowadays, the data movements within a company go far beyond a simple source-to-data-warehouse flow. Data supports many different business processes and has to be shared across different business units and departments. Indeed, in highly data-oriented companies, data and ETL processes are a vital part of key business processes such as customer analysis, fraud detection, mailing or any other data-intensive activities.

This increasing relevance of data has made ETL system performance essential, and an important concern not only for the BI team (usually in charge of the ETL system) but for many other teams too. Thus, having a system that properly controls and shows the performance of all the ETL processes can have a very positive impact in terms of data reliability, early error detection and correction, and continuous process improvement.

A standard ETL Control System

At ClearPeaks, we identified this as a general requirement in all ETL systems and developed a standard, easy-to-deploy system to control and show your ETL performance. We are not talking about a simple system that merely logs the results of the executions in files or tables, but a more complete one that gives you good visual feedback on your ETL performance. Questions such as “How did the executions go tonight?”, “Why can’t I see any data?” or “Is the execution time of this ETL increasing?” become easy to answer just by taking a look at your Tableau dashboards.


Figure 1. Tableau dashboards of the ETL performance control system

Specifically, we developed two dashboards that address the two principal issues when controlling a process: the final result and the time spent. One dashboard reports on execution results, common errors and success percentages (top left of Figure 1), and the other shows execution times and time-related variables (bottom right of Figure 1).

These dashboards are designed for scalability and easy implementation, so they are useful for any ETL. Likewise, the technical actions required to implement the system are simple and compatible with any ETL tool. To support this high level of standardization, the basics of Business Intelligence methodology have been adapted to the ETL Control System architecture: the system is based on an unchanging star database model with a set of completely standard tables, plus four simple functions or boxes (the exact term varies depending on the ETL tool) that must be included in the ETL. The system was initially developed entirely with the Kettle ETL tool, so implementing it in Kettle ETLs is very straightforward; if you are using a different ETL tool (such as Oracle Data Integrator or Informatica PowerCenter), a mapping from its logs to the star database model must be done.
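To give an idea of what the logging side of such a system can look like, here is a minimal Python sketch of a start/end pair of calls writing to a control table, playing the role of the standard boxes added to each ETL; the table and column names are invented for the example, and SQLite stands in for the database that would host the star model.

import sqlite3
from datetime import datetime

# Hypothetical control table; the actual ClearPeaks star model differs.
conn = sqlite3.connect("etl_control.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS etl_execution_log (
        execution_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        etl_name      TEXT,
        start_time    TEXT,
        end_time      TEXT,
        status        TEXT,      -- RUNNING, SUCCESS or ERROR
        error_message TEXT
    )
""")

def log_start(etl_name):
    """Register the start of an ETL execution and return its id."""
    cur = conn.execute(
        "INSERT INTO etl_execution_log (etl_name, start_time, status) "
        "VALUES (?, ?, 'RUNNING')",
        (etl_name, datetime.now().isoformat()),
    )
    conn.commit()
    return cur.lastrowid

def log_end(execution_id, success=True, error_message=None):
    """Close an ETL execution with its final status."""
    conn.execute(
        "UPDATE etl_execution_log SET end_time = ?, status = ?, error_message = ? "
        "WHERE execution_id = ?",
        (datetime.now().isoformat(), "SUCCESS" if success else "ERROR",
         error_message, execution_id),
    )
    conn.commit()

# Example usage wrapped around an ETL run:
run_id = log_start("load_sales_fact")
try:
    # ... the actual ETL work would happen here ...
    log_end(run_id, success=True)
except Exception as exc:
    log_end(run_id, success=False, error_message=str(exc))

The performance dashboards described above then simply query this kind of table.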

 

So do you think you also need to control your ETLs? Would you like to do it with good visualizations? Contact us at info@clearpeaks.com or use our web form if you would like more information about the ClearPeaks ETL Control System, or leave us your comments below!

Blog Article Authors: Pere Vegas & Daniel Albendín

 

Copy rows to result in Kettle | Tips and Tricks


The ‘Copy rows to result’ step

Managing variables is one of the most complex operations in Kettle. In this entry we explain a few tips to make these operations easier.

The ‘Copy rows to result’ step is the most versatile one when it comes to sharing information between objects. It only exists in transformations, and it can convert row values into parameters, variables or arguments, or even pass them on to another transformation.

The Case

Assume the following job in Kettle (the root job):

Image 1

The transformation T1 is the following:

Image 2

The properties file content is:

Image 3

Now these properties have to be used in a new transformation or job. The root job will now look like this:

Image 4

The Possibilities

Depending on how these parameters are going to be used in T2, different methods can be applied:

1.- To be used as rows:

If these rows are needed in the flow, a ‘Get rows from result’ step has to be placed in the T2 transformation.

Image 5

Now the main job can be executed, and the log looks like the following:

code 1

2.- To be used as parameter/variable:

We have to follow these steps:

  1. In the root job, open the T2 options (double left-click on it).
  2. Open the Advanced tab.
  3. Check the ‘Execute for every input row?’ option (Image 6).
  4. Open the Parameters tab.
  5. Define every column you want to use as a parameter (A, B, C) (Image 7).
  6. Open the T2 transformation (double middle-click on it).
  7. Define the variables in the T2 transformation (Image 8).
  8. Place a ‘Get Variables’ step and fill in the fields with the proper values (Image 9).

Now the main job can be executed, and the log looks like the following:

code 2

The same approach can be used when calling a job instead of a transformation.

3.- To be used as argument:

  1. In the root job, open the T2 options (double left-click on it).
  2. Open the Advanced tab.
  3. Check the ‘Execute for every input row?’ and ‘Copy previous results to args?’ options.
  4. Open the T2 transformation (double middle-click on it).
  5. Place a ‘Get System Info’ step.
  6. Define the ‘Get System Info’ step as follows:

Image 10

Now the main job can be executed, and the log looks like the following:

code 3

Conclusion:

When working with Kettle, the rows in the flow can be used in different ways. You can execute jobs or transformations with data created or calculated in previous steps, so Kettle becomes a powerful and versatile ETL tool, offering great scalability and dynamic executions.

BI system on Amazon Cloud | Amazon Web Services


Introduction

The purpose of this blog post is to explain how to configure a BI system in the cloud using Amazon Web Services (AWS). Our system will include an ETL server (Pentaho Data Integration, AKA Kettle), a reporting server (Tableau) and a data warehouse (Redshift). Each of these components will be based on an AWS service; these services are detailed below.

Amazon provides a set of web services completely hosted in the cloud under a single account, and these services are easy to manage through the AWS console. The services are paid for on demand, which helps us to scale the resources as needed and create a budget plan that can be managed and modified easily, giving us the flexibility to remove or add new on-demand services.

For billing, AWS also provides a set of dashboards where we can review the detailed costs broken down by service.

From the wide variety of AWS services, just a few are enough to create the infrastructure we need for a BI system completely hosted in the cloud.

In this blog article I will explain the three AWS services used to create a complete BI system:

  • EC2 (used to host the ETL and reporting servers)
  • S3 (used to store and send files to Redshift)
  • Redshift (data warehouse)

From the console we can manage all of the web services we have signed up for; in our case we will focus on the following ones:


Amazon Web Services:

1. EC2

EC2 is the AWS compute service used to create the machine instances needed to support our infrastructure; for our BI system we will use two instances, one for the ETL server and another for the reporting server.

EC2 is completely dynamic, allowing us to maintain the infrastructure through a simple and intuitive front end where we can operate on our instances. Its main features include resizing the instance's resources on demand: adding more memory, increasing the number of CPUs, and attaching new HDD volumes to the instance.

There are many other features beyond those detailed in this article.

In this scenario, for our BI system we have created two Windows machines. The instance type can be selected from a set of preconfigured machines, and once created we can modify some of its properties, as explained above.


Figure 1 Creating a new instance

There are different prices and payment methods for the instances; the pricing and licences for the different sorts of instances can be reviewed in the links below:

https://aws.amazon.com/ec2/instance-types/

https://aws.amazon.com/ec2/pricing/

 

One of the great features of EC2 is that, with only a little IT knowledge, we can manage the infrastructure ourselves: we can set up our network, connect to the machines using remote desktop, and share files between the instances and our local machines; we can also take snapshots of the volumes and images of the instances that can be downloaded and deployed on premises.

Regarding network and security configuration, we can assign a static IP to an instance and limit access so that it is only reachable from certain IPs, keeping the instances secure.
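As a hedged illustration of how some of these operations can be scripted rather than done through the console, the Python snippet below uses the boto3 SDK to snapshot a volume and to restrict inbound Remote Desktop access on a security group; the region, the resource IDs and the CIDR range are placeholders.

import boto3

# Assumes AWS credentials are already configured; all IDs below are placeholders.
ec2 = boto3.client("ec2", region_name="eu-west-1")

# Take a snapshot of one of the instance's EBS volumes (e.g. before resizing it).
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="Pre-maintenance snapshot of the ETL server volume",
)
print("Snapshot started:", snapshot["SnapshotId"])

# Allow Remote Desktop (TCP 3389) only from a specific office IP range.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 3389,
        "ToPort": 3389,
        "IpRanges": [{"CidrIp": "203.0.113.0/24"}],
    }],
)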


Figure 2 EC2 Landing page

 

In conclusion, we can use this service to create any kind of instance that fits our needs, paying only for the resources and usage we make of it; it is flexible and secure.

For the BI system we want to configure, EC2 will host 2 instances:

  • ETL server running on Windows: this server will be responsible for performing the data extraction and transformations and sending the generated files to S3. We will use an open-source ETL tool, Pentaho Data Integration; the features of this ETL tool can be reviewed at the following link:

http://community.pentaho.com/projects/data-integration/

 

  • Reporting server running on Windows: this server will contain the dashboards and visualizations of the information hosted on Redshift. We will use Tableau as the reporting server; the features of Tableau can be reviewed at the following link:

http://www.tableau.com/products/server

 

2. S3

S3 is the AWS storage service; basically, it is used to store data as files in directories inside a bucket. We will use this service for optimization reasons.


Figure 3 S3 Buckets

One of the bottlenecks that can appear in a BI system is loading data into the database tables of the data warehouse. As these tables tend to be very large, we usually want to bulk-load them, and using the Redshift–S3 tandem this can be done very efficiently.

Once we have configured our bucket and assigned a user to it, we can send files to the S3 bucket, given its URL, using the AWS Command Line Interface (AWS CLI). This improves the performance of the table loads, as files on S3 can be bulk-loaded into tables very efficiently.
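For illustration, the same upload can also be done programmatically with the boto3 SDK instead of the AWS CLI; the file, bucket and key names below are placeholders.

import boto3

# Assumes AWS credentials are already configured (e.g. with `aws configure`).
s3 = boto3.client("s3")

# Roughly equivalent to:
#   aws s3 cp C:/etl/output/sales_extract.csv.gz s3://my-bi-staging-bucket/redshift/
s3.upload_file(
    Filename="C:/etl/output/sales_extract.csv.gz",   # extract produced by the ETL server
    Bucket="my-bi-staging-bucket",                   # placeholder bucket name
    Key="redshift/sales_extract.csv.gz",
)
print("Upload complete")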

The service also allows us to secure the files, add encryption, and use some other interesting features.

3. Redshift

Redshift completes our BI system: it is a scalable, columnar database service based on PostgreSQL.

The latest visualization tools, such as Tableau, have built-in connectors to access the information, and it is easy to connect a database client to Redshift by specifying its URL. Redshift does not support table partitioning or indexing; however, we can set sort and distribution keys on our tables to improve query performance, and it also allows table compression by setting the encoding on the columns.

As explained above, in order to improve performance we will use S3 to load the tables: we create a set of files on our ETL server, send them to S3, and once the files are there we launch the COPY command to load the table. The reference for the copy command can be reviewed at the following link:

http://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
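Below is a minimal sketch of this load pattern using psycopg2: a table defined with a distribution key, a sort key and a column encoding (as mentioned above) is bulk-loaded from S3 with the Redshift COPY command. The connection details, table definition, bucket path and IAM role are all placeholders.

import psycopg2

# Placeholder connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439,
    dbname="dwh",
    user="etl_user",
    password="********",
)
cur = conn.cursor()

# No partitions or indexes in Redshift: distribution/sort keys and column
# encodings do the equivalent tuning work.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales_fact (
        sale_date   DATE,
        customer_id BIGINT,
        amount      DECIMAL(12,2) ENCODE lzo
    )
    DISTKEY (customer_id)
    SORTKEY (sale_date)
""")

# Bulk-load the gzipped CSV file previously uploaded to S3.
cur.execute("""
    COPY sales_fact
    FROM 's3://my-bi-staging-bucket/redshift/sales_extract.csv.gz'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    CSV GZIP
""")

conn.commit()
cur.close()
conn.close()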

The relationship between S3 and Redshift is tight: we can also issue commands from our SQL client to store extracts from the tables directly as files in an S3 bucket.

Redshift is configured in nodes, and there are different kinds of nodes depending on our needs (compute-optimized or storage-optimized). Once the cluster has been created it can be resized, snapshots of the data can be taken, and its size can scale to petabytes. We can also apply security settings and configure alerts that will be received in an email inbox.


Figure 4 Redshift Cluster configuration and properties

 

Another good feature of Redshift in the management console is the ability to check query status and monitor the resources used by the database, such as disk and CPU usage, query times, etc., as seen in the following figure:


Figure 5 Redshift reports

Conclusion

AWS provides a set of on demand services that can be used to create any kind of IT system.

Regarding the benefits of using it to configure a BI system, it provides scalable, high-performance services: a data warehouse on Redshift, BI tools hosted on EC2 instances with easy maintenance and security configuration, and fast data transfers using S3. Working together, these services are a great option to consider for saving time and money on our BI system infrastructure and configuration.

 

