Informatica provides the market's leading data integration platform. ETL mappings are designed to load data into the data warehouse environment to achieve better reporting, which in turn helps you understand business trends better. The major problem faced by anyone working with Informatica ETL is designing mappings that don't compromise performance. Often we end up creating a mapping that achieves the required functionality but suffers in terms of performance.
The article explains the steps to identify performance bottlenecks and the advanced optimization techniques available in Informatica to overcome them.
The following topics are discussed in detail:
* Optimizing Sources, Targets and Buffer Blocks: Performance bottlenecks can occur when the Integration Service reads from a source database or writes to a target database. Depending on the source and target data, you may need to increase or decrease the buffer block size. In this article we list the steps to take to ensure that the sources and targets are utilized to an optimum level, and show how to calculate the buffer block size.
* Pushdown Optimization (PDO): The Pushdown Optimization option enables data transformation processing to be pushed down into a relational database, making the best use of the database's processing power.
* Session Partitioning and Dynamic Session Partitioning: The Session Partitioning option increases PowerCenter performance through parallel data processing. With dynamic partitioning, the Integration Service determines the number of partitions to create at run time, based on factors such as the number of source database partitions or the number of nodes in a grid.
* Concurrent Workflow Execution: A workflow configured for concurrent execution can run as multiple instances at the same time. In the article we illustrate this functionality with a workflow configured with multiple instances that processes data belonging to multiple sites simultaneously.
* Load Balancing: Load balancing is a mechanism that distributes workloads across the nodes in the grid. In the article we see how to assign service levels to workflows to achieve load balancing.
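To make the buffer sizing topic above concrete, here is a minimal Python sketch of the common rules of thumb (roughly 100 rows per buffer block, and at least (sources + targets) × 2 blocks per session). The exact formulas and defaults are in Informatica's performance tuning documentation; the row sizes and counts below are illustrative assumptions.

```python
def recommended_buffer_block_size(max_row_size_bytes, rows_per_block=100):
    # Rule of thumb: a buffer block should hold on the order of
    # 100 rows of the largest row moved through the session.
    return max_row_size_bytes * rows_per_block

def minimum_buffer_blocks(num_sources, num_targets):
    # Guideline: a session needs at least (sources + targets) * 2 blocks.
    return (num_sources + num_targets) * 2

def recommended_dtm_buffer_size(block_size_bytes, num_sources, num_targets):
    # The DTM buffer must be large enough to hold the minimum block count.
    return block_size_bytes * minimum_buffer_blocks(num_sources, num_targets)

# Example: largest row is 2 KB; the mapping has 1 source and 2 targets.
block_size = recommended_buffer_block_size(2048)
dtm_size = recommended_dtm_buffer_size(block_size, 1, 2)
print(block_size, dtm_size)
```

In practice you would plug in the largest row size reported for your sources and targets, then round the results to the values the session properties accept.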
The purpose of this blog is to explain how to configure a BI system on the cloud using Amazon Web Services (AWS). Our system will include an ETL server (Pentaho Data Integrator, also known as Kettle), a reporting server (Tableau) and a data warehouse (Redshift). Each of these components will run on one AWS service; these services are detailed below.
Amazon provides a set of web services completely hosted on the cloud under a single account, and these services are easy to manage through the AWS console. The services are paid for on demand, which helps us scale the resources we need and create a budget plan that can be managed and modified easily. It gives us the flexibility to add or remove on-demand services.
For payments, AWS also provides a set of dashboards where we can review the detailed amounts broken down by service.
From the variety of services AWS offers, a few are enough to create the infrastructure we need for a BI system completely hosted on the cloud.
In this blog article I will explain the three AWS services used to create a complete BI system:
EC2 (used to host the servers, ETL and reporting)
S3 (used to store and send files to Redshift)
Redshift (data warehouse)
From the console we can manage all of the web services we have signed up for; in our case we will focus on the following ones:
Amazon Web Services:
EC2 is AWS's compute service, used to create the machine instances needed to support our infrastructure. For our BI system we will use two instances: one for the ETL server and another for the reporting server.
EC2 is completely dynamic: it allows maintenance of the infrastructure through a simple and intuitive front end from which we can operate our instances. Its main features include resizing an instance's resources on demand, adding more memory, increasing the number of CPUs and attaching new HDD volumes to the instance.
Many other features are detailed in the following video:
In this scenario, for our BI system we have created two Windows machines. An instance can be selected from a set of preconfigured machines, and once created we can modify some of its properties as explained above.
Figure 1 Creating a new instance
There are different prices and payment methods for the instances; the pricing and licenses for the different instance types can be reviewed in the links below:
One of the great features of EC2 is that with only a little IT knowledge we can manage the infrastructure ourselves: we can set up our network, connect to the machines using remote desktop, and share files between the instances and our local machines. We can also take snapshots of the volumes and images of the instances that can be downloaded and deployed on premises.
Regarding network and security configuration, we can assign a static IP to an instance and limit access so that it is reachable only from certain IPs, keeping the instances secure.
Figure 2 EC2 Landing page
In conclusion, we can use this service to create any kind of instance that fits our needs, paying only for the resources we use; it is flexible and securable.
For the BI system we want to configure, EC2 will host two instances:
ETL server running on Windows: this server will be responsible for performing the data extraction and transformations and sending the generated files to S3. We will use an open-source ETL tool, Pentaho Data Integrator; the features of this ETL tool can be reviewed at the following link:
Reporting server running on Windows: this server will host the dashboards and visualizations of the information stored in Redshift. We will use Tableau as the reporting server; the features of Tableau can be reviewed at the following link:
S3 is AWS's storage service; basically, it is used to store data as files in a directory inside a bucket. We will use this service for optimization reasons.
Figure 3 S3 Buckets
One of the bottlenecks that can appear in a BI system is loading data into the database tables of the data warehouse. As these tables tend to be very large, we usually want to bulk load them; using Redshift and S3 in tandem, this can be done very efficiently.
Once we have configured our bucket and assigned a user to it, we can send files to the S3 bucket, given its URL, using the AWS Command Line Interface (AWS CLI). This improves the performance of table loads, as files on S3 can be bulk loaded into tables very efficiently.
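As a minimal illustration of such an upload, the following Python sketch builds the AWS CLI command that copies a local file into a bucket. The bucket name and file paths are placeholder assumptions, and running the command requires an installed and configured AWS CLI.

```python
def s3_cp_command(local_path, bucket, key):
    # Build the AWS CLI command that copies a local file into an S3 bucket,
    # e.g. aws s3 cp <local_path> s3://<bucket>/<key>
    return ["aws", "s3", "cp", local_path, f"s3://{bucket}/{key}"]

cmd = s3_cp_command("C:/etl/out/sales_20170101.csv",
                    "my-bi-bucket", "staging/sales_20170101.csv")
print(" ".join(cmd))
# To actually run it from the ETL server (with the AWS CLI configured):
# subprocess.run(cmd, check=True)
```

An ETL job in Pentaho can invoke the same command as a shell step at the end of the transformation that produces the file.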
The service also allows us to secure the files, add encryption and use some other interesting features.
Redshift completes our BI system: it is a scalable, columnar database service based on PostgreSQL.
The latest visualization tools, such as Tableau, have built-in connectors to access the information, and it's easy to connect a database client to Redshift by specifying its URL. Redshift does not support table partitioning or indexing; however, we can set sort and distribution keys on our tables to improve query performance, and it also allows table compression by setting the encoding on the columns.
As explained above, to improve performance we will use S3 to load the tables. To do this, we will create a set of files on our ETL server and then send them to S3; once a file is in place, we launch the COPY command to load the table. The reference for the COPY command can be reviewed at the following link:
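As a sketch of what this load step looks like, the following Python helper assembles a Redshift COPY statement. The table name, bucket and IAM role ARN are placeholder assumptions; the COPY reference documents the full list of options (credentials, file format, compression, etc.).

```python
def redshift_copy_statement(table, bucket, key, iam_role_arn, delimiter=","):
    # Assemble a COPY statement that bulk loads an S3 file into a table.
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"DELIMITER '{delimiter}';"
    )

sql = redshift_copy_statement(
    "sales_fact", "my-bi-bucket", "staging/sales_20170101.csv",
    "arn:aws:iam::123456789012:role/redshift-load")
print(sql)
```

The resulting statement would be executed against the cluster from any SQL client, or from the ETL job itself once the file upload finishes.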
The relationship between S3 and Redshift is tight: we can also issue commands from our SQL client to store extracts from the tables directly into files in an S3 bucket.
Redshift can be configured in nodes, and there are different kinds of nodes depending on our needs; we will choose between the different node types (compute or storage optimized). Once a node has been created it can be resized, snapshots of the data can be taken, and storage can scale to petabytes. We can also apply security settings and configure alerts that will be received in an email inbox.
Figure 4 Redshift Cluster configuration and properties
Another good feature of Redshift in the management console is the ability to check query status and monitor the resources used by the database, such as disk and CPU usage, query time, etc., as seen in the following figure:
Figure 5 Redshift reports
AWS provides a set of on-demand services that can be used to create any kind of IT system.
Regarding the benefits of using it to configure a BI system: it provides scalable, high-performance services to create a data warehouse on Redshift, host BI tools on EC2 instances with easy maintenance and security configuration, and transfer data quickly using S3. Working together, these services are a great option to consider for saving time and money on our BI system's infrastructure and configuration.
As seen in one of our previous blog articles, "Step-by-step on how to generate a Metadata Dictionary", sometimes you might need to make information about metrics or attributes from repository objects accessible to users. It's useful to be able to check this information quickly when you're developing or creating a report using only a web browser, or when you don't have permissions to access the repository.
A metadata dictionary is a static set of XML documents where each XML document describes a metadata object with its properties and relationships with other metadata objects. The metadata dictionary has to be generated each time repository changes are made.
In one of my previous blog articles, "Deduplication using Analytic Function", I gave a simple introduction to Oracle analytic functions. In this article (aimed at SQL developers) I will focus on the LAG function, which from my point of view is one of the most useful. Let's see the use of this function in a common data warehouse scenario: the Slowly Changing Dimension. Continue reading this post >
In February of this year Oracle launched Oracle Advanced Analytics, a new option for Oracle Database 11g that extends the database into a platform for big data business analytics. Oracle Advanced Analytics packages Oracle R Enterprise (ORE) with Oracle Data Mining (ODM), providing the capability to perform predictive analysis, statistical computing and interactive graphics inside the database.
With the Oracle Advanced Analytics option, analytical capabilities are accessible through the SQL and R programming languages. ODM provides predictive algorithms that run as SQL functions, while ORE extends the database's analytical capabilities using R's statistical function library.
ODM is an existing option of Oracle Database that allows the use of predictive data models in the database; but how exactly does Oracle R Enterprise contribute to Oracle? How is it integrated into Oracle Advanced Analytics? And what are the benefits of this integration?