How to control your ETL performance


ETL processes and the importance of control

The term ETL, meaning Extract-Transform-Load, comes from the Data Warehousing field and was traditionally used to refer to those processes that transform and move data from the source system to the data warehouse.

Nowadays, the data movements inside a company go far beyond a simple source-to-data-warehouse flow. Data supports many different business processes and has to be shared across business units and departments. Indeed, in highly data-oriented companies, data and ETL processes are a vital part of key business processes such as customer analysis, fraud detection, mailing campaigns and other data-intensive activities.

This growing relevance of data has made ETL performance essential and an important concern not only for the BI team (usually in charge of the ETL system) but for many other teams as well. Thus, having a system that properly controls and shows the performance of all the ETL processes can have a very positive impact in terms of data reliability, early error detection and correction, and continuous process improvement.

A standard ETL Control System

At ClearPeaks, we identified this as a general requirement in all ETL systems and developed a standard, easy-to-deploy system to control and show your ETL performance. We are not talking about a simple system that logs execution results in files or tables, but a more complete one that gives you good visual feedback on your ETL performance. Questions such as “How did the executions go tonight?”, “Why can’t I see any data?” or “Is the execution time of this ETL increasing?” become easy to answer just by taking a look at your Tableau dashboards.


Figure 1. Tableau dashboards of the ETL performance control system

Specifically, we developed two dashboards that address the two principal issues when controlling a process: the final result and the time spent. One dashboard reports on execution results, common errors and success percentages (top left of Figure 1), and the other shows execution times and time-related variables (bottom right of Figure 1).

These dashboards are designed for scalability and easy implementation, so they are useful for any ETL. Likewise, the technical actions required to implement the system are simple and compatible with any ETL. To support this high level of standardization, the basics of the Business Intelligence methodology have been adapted to the ETL Control System architecture. Thus, the system is based on an unchanging star database model with a set of completely standard tables, and four simple functions or boxes (the exact term varies depending on the ETL tool) that must be included in the ETL. The system was initially developed in the Kettle ETL tool, so implementing it in Kettle ETLs is very straightforward. If you are using a different ETL tool (such as Oracle DI or Informatica PowerCenter), a mapping from the logs to the star database model must be done.
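
To give an idea of what such a star model could look like, below is a minimal sketch of an execution-logging star schema; all table and column names are hypothetical and are not the actual ClearPeaks ETL Control System tables.

-- Hypothetical sketch of a star model for ETL execution logging
-- (illustrative names only, not the ClearPeaks product schema)
CREATE TABLE dim_etl_process (
    process_key   INTEGER PRIMARY KEY,
    process_name  VARCHAR(100),
    etl_tool      VARCHAR(50)                -- e.g. Kettle, ODI, PowerCenter
);

CREATE TABLE fact_etl_execution (
    execution_key   INTEGER PRIMARY KEY,
    process_key     INTEGER REFERENCES dim_etl_process (process_key),
    execution_date  DATE,
    start_time      TIMESTAMP,
    end_time        TIMESTAMP,
    rows_processed  BIGINT,
    result          VARCHAR(20),             -- e.g. SUCCESS / WARNING / ERROR
    error_message   VARCHAR(4000)
);

With a model along these lines, the two dashboards simply query the fact table: success percentages come from the result column, and execution-time trends from start_time and end_time.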

 

So do you think you also need to control your ETLs? Would you like to do it with good visualizations? Contact us at info@clearpeaks.com or use our web form if you would like more information about the ClearPeaks ETL Control System, or leave us your comments below!

Blog Article Authors: Pere Vegas & Daniel Albendín

 

Copy rows to result in Kettle | Tips and Tricks


The ‘Copy rows to result’ step

Managing variables is one of the most complex operations in Kettle. In this entry, a few tips are explained to make these operations easier.

The ‘Copy rows to result’ step is the most versatile when it comes to sharing information between objects. This step only exists in transformations, and it can turn row values into parameters, variables or arguments, or even pass them to another transformation.

The Case

Assume the following job in Kettle (the root job):

Image 1

The transformation T1 is the following:

Image 2

The properties file content is:

Image 3

Now, these properties have to be used in a new transformation or job. The root job will now look like:

Image 4

The Possibilities

Depending on how these parameters are going to be used in T2, different methods can be used:

1.- To be used as rows:

If these rows are needed in the flow, a ‘Get rows from result’ step has to be placed in the T2 transformation.

Image 5

Now the main job can be executed and the log looks like the following.

code 1

2.- To be used as parameter/variable:

We have to follow these steps:

  1. In the root job, open the T2 options (double left click on it).
  2. Open the Advanced tab.
  3. Check the ‘Execute for every input row?’ option (Image 6).
  4. Open the Parameters tab.
  5. Define every column you want to use as a parameter (A, B, C) (Image 7).
  6. Open the T2 transformation (double middle click on it).
  7. Define the variables in the T2 transformation.
    Image 8
  8. Place a Get Variables step and fill the fields with the proper values (Image 9).

Now the main job can be executed and the log looks like the following.

code 2

The same approach can be used when calling a job instead of a transformation.

3.- To be used as argument:

  1. In the root job, open the T2 options (double left click on it).
  2. Open the Advanced tab.
  3. Check the ‘Execute for every input row?’ and ‘Copy previous results to args?’ options.
  4. Open the T2 transformation (double middle click).
  5. Place a ‘Get system info’ step.
  6. Define the Get System Info step as follows:

Image 10

Now the main job can be executed and the log looks like the following:

code 3

Conclusion:

When working with Kettle, the rows in the flow can be used in different ways. You can execute jobs or transformations with data created or calculated in previous steps, so Kettle becomes a powerful and versatile ETL tool that offers great scalability and dynamic executions.

BI system on Amazon Cloud | Amazon Web Services


Introduction

The purpose of this blog is to explain how to configure a BI system on the cloud using Amazon Web Services (AWS). Our system will include an ETL server (Pentaho Data Integration, a.k.a. Kettle), a reporting server (Tableau) and a data warehouse (Redshift). Each of these components will be based on one AWS service; these services are detailed below.

Amazon provides a set of web services completely hosted on the cloud under a single account, and these services are easy to manage through the AWS console. The services are paid for on demand, which helps us scale up the resources we need and create a budget plan that can be managed and modified easily. It also gives us the flexibility to remove or add new on-demand services.

For billing, AWS also provides a set of dashboards where we can review the detailed costs broken down by service.

From the wide variety of AWS services, a few are enough to create the infrastructure we need for a BI system completely hosted on the cloud.

In this blog article I will explain the three AWS services used to create a complete BI system:

  • EC2 (used to host the servers, ETL and reporting)
  • S3 (used to store and send files to Redshift)
  • Redshift (data warehouse)

From the console we can manage all of the web services we have signed up for; in our case we will focus on the following ones:

Picture1

Amazon Web Services:

1. EC2

EC2 is the AWS compute service used to create the machine instances needed to support our infrastructure. In our case, a BI system, we will use two instances: one for the ETL server and another for the reporting server.

EC2 is completely dynamic and allows us to maintain the infrastructure through a simple and intuitive front end where we can operate on our instances. Its main features include resizing the resources of an instance on demand: adding more memory, increasing the number of CPUs and attaching new HDD volumes to the instance.

There are many other features, detailed in Amazon's EC2 documentation and introductory videos.

For our BI system scenario, we created two Windows machines. The instance type can be selected from a set of preconfigured machines, and once created we can modify some of its properties as explained above.


Figure 1 Creating a new instance

There are different prices and payment methods for the instances; the pricing and the licenses for the different sorts of instances can be reviewed in the links below:

https://aws.amazon.com/ec2/instance-types/

https://aws.amazon.com/ec2/pricing/

 

One of the great features of EC2 is that with only a little IT knowledge we can manage the infrastructure ourselves: we can set up our network, connect to the machines using remote desktop, share files between the instances and our local machines, take snapshots of the volumes, and create images of the instances that can be downloaded and deployed on premises.

Regarding the network and security configuration, we can assign a static IP to the instances and limit access so that an instance is only reachable from certain IPs, so the instances can be secured.
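
As an illustration, RDP access to a Windows instance could be limited to a single office IP range using a security group rule created via the AWS CLI; the group ID and CIDR below are purely hypothetical.

aws ec2 authorize-security-group-ingress --group-id sg-0123abcd --protocol tcp --port 3389 --cidr 203.0.113.0/24

The same kind of rule can of course also be defined from the EC2 console instead of the command line.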


Figure 2 EC2 Landing page

 

In conclusion, we can use this service to create any kind of instance that fits our needs, and we pay only for the resources and usage we make of it; it is flexible and can be secured.

For the BI system we want to configure, EC2 will host 2 instances:

  • ETL server running on Windows: this server will be responsible for performing the data extraction and transformations and for sending the generated files to S3. We will use an open-source ETL tool, Pentaho Data Integration; the features of this ETL tool can be reviewed at the following link:

http://community.pentaho.com/projects/data-integration/

 

  • Reporting server running on Windows: this server will contain the dashboards and visualizations of the information hosted on Redshift. We will use Tableau as the reporting server; the features of Tableau can be reviewed at the following link:

http://www.tableau.com/products/server

 

2. S3

S3 is one of the AWS storage services; basically, it is used to store data in a file directory inside a bucket. We will use this service for optimization reasons.


Figure 3 S3 Buckets

One of the bottlenecks that can appear in a BI system is loading data into the database tables in the data warehouse. As these tables tend to be very large, we usually want to bulk load them; using the Redshift-S3 tandem, this can be done in a very efficient way.

Once we have configured our bucket and assigned a user to it, we can send files to the S3 bucket, given its URL, using the AWS command line interface (AWS CLI). This will improve the performance of the table loads, as the files on S3 can be bulk loaded into tables in a very efficient way.
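
For instance, a file generated by the ETL server could be uploaded to the bucket with a command along these lines; the local path and bucket name are hypothetical.

aws s3 cp C:\etl\output\sales_fact.csv s3://my-bi-bucket/staging/sales_fact.csv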

The service also allows you to secure the files, add encryption and use some other interesting features.

3. Redshift

Redshift completes our BI system: it is a scalable, columnar database service based on PostgreSQL.

The latest visualization tools, such as Tableau, have built-in connectors to access the information, and it is easy to connect a database client to Redshift by specifying the URL. Redshift does not support table partitioning or indexing; however, we can set sort and distribution keys on our tables to improve query performance, and it also allows table compression by setting the encoding on the columns.
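
Sort keys, the distribution key and column encodings are declared when the table is created; below is a minimal sketch with hypothetical table and column names.

CREATE TABLE sales_fact (
    sale_id      BIGINT,
    customer_id  INTEGER,
    sale_date    DATE,
    amount       DECIMAL(12,2) ENCODE lzo,      -- column compression encoding
    country      VARCHAR(30)   ENCODE bytedict
)
DISTKEY (customer_id)      -- distributes rows across nodes by customer
SORTKEY (sale_date);       -- keeps rows sorted by date for range scans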

As explained above, in order to improve performance we will use S3 to load the tables. To do this, we will create a set of files on our ETL server and then send them to S3; once the files are in place, we will launch the COPY command to load the table. The AWS CLI cp command used to move the files to S3 is documented at the following link:

http://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
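
The load itself is issued from a SQL client with the Redshift COPY command; a sketch, reusing the hypothetical bucket and table from above (credentials abbreviated):

COPY sales_fact
FROM 's3://my-bi-bucket/staging/sales_fact.csv'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
DELIMITER ','
IGNOREHEADER 1;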

The relationship between S3 and Redshift is tight: we can also issue commands from our SQL client to store extracts from the tables directly into files in an S3 bucket.
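
This is done with the UNLOAD command; again, the bucket, prefix and query are only illustrative:

UNLOAD ('SELECT * FROM sales_fact WHERE sale_date >= ''2015-01-01''')
TO 's3://my-bi-bucket/extracts/sales_fact_'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
DELIMITER ','
GZIP;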

Redshift is configured in nodes; there are different kinds of nodes depending on our needs, and we will choose between them (compute-optimized or storage-optimized). Once the cluster has been created it can be resized, snapshots of the data can be taken, and the size can scale up to petabytes. We can also apply security settings and configure alerts that will be received in an email inbox.


Figure 4 Redshift Cluster configuration and properties

 

Another good feature of Redshift in the management console is the ability to check query status and monitor the resources used by the database, such as disk and CPU usage, query time, etc., as seen in the following figure:


Figure 5 Redshift reports

Conclusion

AWS provides a set of on demand services that can be used to create any kind of IT system.

Regarding the benefits of using it to configure a BI system, it provides scalable, high-performance services: a data warehouse on Redshift, BI tools hosted on EC2 instances with easy maintenance and security configuration, and fast data transfers using S3. Working together, these services are a great option to consider for saving time and money on our BI system infrastructure and configuration.

 

Write Back Functionality in OBIEE


About Write Back Functionality:

One of the interesting features that OBIEE provides is the ability to let users add or update data back in the database. The user can have a column whose values can be entered in the user interface (UI) and then updated in the database. This can have multiple benefits, as end users may want to rank their customers or rate their regional business based on performance, and be able to use this data from time to time. It turns OBIEE into both a useful reporting tool and a mini application for modifying business data.

Requirements for implementing the functionality:

Implementing write back requires the configuration of multiple objects within the architecture, i.e. the database, the connection pool, the Presentation, BMM and Physical layers, UI privileges, column/table properties, etc.

Example on implementing the Write back functionality:

Here I am going to demonstrate how to make the Attribute2 column in the Product table (SampleApp) a writeable column.

  • Edit instanceconfig.xml

This is the initial step to enabling write back in OBIEE. Open the instance config file from the location <Middleware>/instances/instance1/config/OracleBIPresentationServicesComponent/coreapplication_obipsn (where n is the instance number).

Under <DSN>, add <LightWriteBack>true</LightWriteBack>

  • Enable Write Back in the Repository tables

Open the RPD in offline mode. Then expand the logical table Product in the BMM layer. Double click on the column Attribute2 and, in the General tab, enable ‘Writeable’.

image 1

In the Presentation layer expand the table Product, double click on the column Attribute2, and under Permissions set this column to Read/Write for the BI Author role.

image 2

  • Setting direct database request permission

In the RPD, go to Manage > Identity > Application Roles > BI Author > Permissions, select ‘Execute direct database requests’ and set it to Allow.

image 3

  • Disable cache for physical tables

Select the SAMP_PRODUCTS_D table in the Physical layer and disable the Cacheable option.

Double click on D2 Customer and unselect the override source table and cacheable options.

image 4

Deploy the modified RPD and restart the BI Presentation services.

  • Grant write back privilege to users

Log on to OBIEE Presentation Services > Administration > Manage Privileges > Write Back property, and change it from Denied: Authenticated User to Granted: Authenticated User.

  • Create Analysis for Write Back

Create a new analysis with the columns P1 Product and P6 Attribute2. Open the column properties of Attribute2, select the Write Back tab and enable it. Save the analysis.

image 5

  • Create write back XML template

Go to <Middleware>/instances/instance1/bifoundation/OracleBIPresentationServicesComponent/coreapplication_obips1/analyticsRes/customMessages

Append the following tags to the write back template XML file (see the attached Write Back template.xml for reference):

<WebMessage name="wb_prod_attribute">  <!-- this web message name is the reference used in the presentation -->
<XML>
<writeBack connectionPool="Sample Relational Connection">  <!-- set the connection pool name as defined in the RPD -->
<insert></insert>
<update>
UPDATE SAMP_PRODUCTS_D SET ATTRIBUTE_2='@2' WHERE PROD_DSC='@1'
</update>
<!-- the update query references the columns by their position in the analysis (@1, @2) -->
</writeBack>
</XML>
</WebMessage>

image 7 image 6

  • Enable Write Back in table view

Open the saved analysis > table view > edit view > table view properties > Write Back tab > select Enable Write Back and provide the name wb_prod_attribute (the WebMessage name saved in the XML). Save the analysis.

image 8

With this step, we have completed the configuration of Write Back in OBIEE. Now this should be tested in order to validate the Write Back configuration.

  • Testing the Write Back Option

Open the saved report > Click on Update.

This changes the column Attribute2 to writeable. Change the value and click Apply.

image 9

Edit the column to the desired value.

image 10

Click Apply and then Done.

Now open SQL Developer and check the product in the edited row.

SELECT PROD_DSC, ATTRIBUTE_2 FROM SAMP_PRODUCTS_D WHERE PROD_DSC = '7 Megapixel Digital Camera';

image 11

Now we can see that the changes made in the analysis are reflected in the database.
By using this simple technique, OBIEE can act as a front-end form for updating data in the database.

OBIEE 11.1.1.9 UPGRADE


How to upgrade your Oracle BI System to 11.1.1.9


In our previous technical post we looked at what is new in OBIEE 11.1.1.9; below we describe how to upgrade your Oracle BI system from version 11.1.1.7 to 11.1.1.9.

Not sure about the upgrade? Ask us for access to our online testing showcase.

The installation files are available from OTN - Oracle Business Intelligence (11.1.1.9.0) Downloads. As per Oracle's advice, make sure you download this OBIEE release from the Oracle Software Delivery Cloud for production environments.

Before proceeding with the upgrade, always remember to back up your Middleware Home, Domain Home and Oracle Instances, as well as the database schemas and any other additional configuration.

OBIEE upgrade procedure on a Linux x86-64 environment:

  1. Upgrade Oracle WebLogic Server from 10.3.5 to 10.3.6
  2. Upgrade OBIEE from 11.1.1.7 to 11.1.1.9
  3. Upgrade database schemas MDS and BIPLATFORM
  4. Upgrade OPMN system components

Upgrade Oracle WebLogic Server from 10.3.5 to 10.3.6

Oracle BI 11.1.1.7.x and the previous 11.1.1.6 run within Oracle WebLogic Server 10.3.5. However, the newly released OBIEE 11.1.1.9 ONLY supports 10.3.6.

To check your current version, go to your console page at http://host:7001/console. If you are on 10.3.6, proceed to the next step; otherwise follow the instructions below.

image2

Download the Oracle WebLogic 10.3.6 bin files from My Oracle Support. Go to Patches & Updates and search for Patch 13529623: PLACEHOLDER BUG FOR WEBLOGIC SERVER 11GR1 (10.3.6) UPGRADE THE INSTALLER, choosing the one corresponding to your platform; this example is based on Linux x86-64.

img0

Unzip the files once downloaded and execute the upgrade. The welcome screen for the upgrade will pop up. Note that you can use either the command line or the UI; below we use the latter.

java -jar wls1036_upgrade_generic.jar

picture 1 picture 2

Click Next and select “Use an existing Middleware Home”.

img3

On the register for security updates screen, uncheck the option to receive security updates and confirm.

img4

img5img6

The connection will fail and you have to keep the option to remain uninformed of security issues checked.

img7

On the products and components to upgrade, select WebLogic Server, leave Oracle Coherence unchecked, and confirm the selection.

img8 img9

After this, the upgrade starts and should last a couple of minutes. Upon completion, click the Done button.

img10img11

Once the upgrade has completed, start the WebLogic Admin Server from the command line.

> cd /MW_HOME/user_projects/domains/bifoundation_domain/bin
> ./startWebLogic.sh

Go to http://server:7001/console and make sure you have upgraded to Oracle WebLogic Server 10.3.6 correctly.

image 12

Upgrade OBIEE from 11.1.1.7 to 11.1.1.9

Before proceeding, make sure WebLogic Server has completely stopped:

> cd /MW_HOME/user_projects/domains/bifoundation_domain/bin
> ./stopWebLogic.sh

Change directory to the Disk1 folder inside the unzipped archive folder and start the installer. Note that the OBIEE installer has a graphical interface, but it can also be run in silent mode.

> ./runInstaller

img13img15

On the software updates screen, select the option “Skip Software Updates”.

img17

For the installation type, select “Software Only Install”, since the system has already been configured and is only being upgraded.

img18img19

Select your Oracle Middleware Home location and click Next.

img20img23

After the summary, click Next and the upgrade of the BI components will start, taking less than 10 minutes to complete. Click Finish when it has completed.

img24img26

img21

Upgrade database schemas MDS and BIPLATFORM

The next step is to upgrade the database schemas. Go to MW_HOME/Oracle_BI1/bin and run the Patch Set Assistant (./psa) to upgrade the database repository. In this case you can use the graphical interface or the command line.

Right after the welcome screen pops up, click Next.

> cd /MW_HOME/Oracle_BI1/bin
> ./psa

img27

img28

Select the “Oracle Business Intelligence” components and check all the options on the prerequisites screen.

img29 img30

Set up the connections for the MDS and BIPLATFORM schemas.

img31img32

img33 img34

Click on the Upgrade button; it will take about a minute to complete.

img36

Use the SQL command line to verify that the upgraded schemas are in VALID status and that the version is 11.1.1.9.0.

> SELECT OWNER, VERSION, STATUS, UPGRADED FROM SCHEMA_VERSION_REGISTRY;

Upgrade OPMN system components

The last post-patching task is to upgrade the system components managed by the OPMN – BI Server, BI Scheduler and BI Presentation Services.

Before starting, make sure that Node Manager and Admin Server are up and running.

> cd /MW_HOME/wlserver_10.3/server/bin
> ./startNodeManager.sh

> cd /MW_HOME/user_projects/domains/bifoundation_domain/bin
> ./startWebLogic.sh

Tip: use nohup to avoid killing the services when closing the command window.

img37

Start the Configuration Assistant:

> cd /MW_HOME/Oracle_BI1/bin
> ./config.sh

img38img40

Select the “Update BI Domain” option and add the credentials of your existing BI Domain.

img41img42

img43img44

Important note: a process could possibly get stuck during one of the steps without notification. In this event, see the Troubleshooting section on how to fix it.

Start the BI services and start enjoying your OBIEE 11.1.1.9!

img45

For more detailed information you can access the OBIEE documentation - Oracle Fusion Middleware 11g Release 1 (11.1.1.9)

 

Do you need help? Contact us!

