Real Time Business Intelligence with Oracle Technologies




In our previous blog article we discussed the necessity of real time solutions for Business Intelligence (BI) platforms and presented a user case with Microsoft Technologies (namely Azure and Power BI). In this case, we are analysing the same scenario, but we instead propose a design using Oracle Cloud and Oracle On Premise technologies.

We recommend going through the previous blog article to understand completely the scenario under analysis and its requirements.


1. Oracle Technologies for Real Time BI

Oracle offers both in cloud and on-premises solutions, focusing on Big Data. These solutions can be used for a variety of Big Data applications, including real-time streaming analytics.

a. Oracle Cloud Services

From all the services offered on Oracle Cloud, the following suit the needs of a real-time BI application:

Oracle Event Hub Service. This service provides a managed Apache Kafka cluster, where we can easily assign resources and create topics using the web interface.
Oracle Big Data Compute Edition Service (OBDCE). This service provides a managed Big Data cluster with most of the Apache Hadoop/Spark stack elements, where the resource management and execution of tasks is also managed from the web interface.

Both the Event Hub and the OBDCE services are part of the PaaS offerings of Oracle Cloud, which means that the whole hardware and software infrastructure behind them is managed by Oracle. The key benefit is that, even though we are using standard open source technologies, we don’t have to worry about resource provisioning, software updates, connectivity, etc. With this, the developers can focus on building the solutions, without losing time on administrative tasks.

Another important point is that the connectivity between services on the cloud can be configured very easily using the web console, which ensures a reliable and safe path for the data.

b. On Premise Technologies

In addition to the cloud solution, a similar environment can be built on premises. For this we are using the Oracle Big Data Appliance. The Big Data Appliance consists, generally speaking, of the following components:

Hardware: Several server nodes and other accessory elements for networking, power, etc. The configuration can be a starter rack with 6 nodes to a full rack with 18 nodes. Multi-rack configurations allow for an even larger number of nodes.
Software: All the components of the Cloudera Distribution for Hadoop (CDH) and additional components from Oracle like Oracle Database, Oracle NoSQL Database, Oracle Spatial and Graph, Oracle R Enterprise, Oracle Connectors, etc.

For the purpose of our Real Time project, the required components are Kafka and Spark, as we will see later. Both are part of CDH and key elements of the standard open source real-time analytics scenario. In this case, all the components will be available in the same cluster.

It is also important to know that, for demo projects like this, Oracle offers a Big Data Lite virtual machine, that contains most of the components of the Big Data Appliance.

c. Real-Time Visualization

At the time of writing this article, there is no BI tool from Oracle (OBIEE, DV) that allows visualization of real-time data. To tackle this, we decided to build a custom front-end that could be used both as a standalone web application or as one integrated into OBIEE.
The key technologies we are using for this purpose are:

Flask, which is a lightweight web framework for Python.
SocketIO, which is a framework based on WebSockets for real-time web applications using asynchronous communication.
ChartJS, which is a JavaScript library for building charts.


2. Solution Design and Development

The architectural design for the solutions is shown in the figure below and explained throughout the rest of this section:

Realtime BI solution design diagram, using both cloud and on premise Oracle technologies

Figure 1: Realtime BI solution design diagram, using both cloud and on premise Oracle technologies

a. On Premise Source System
The on-premises source system part of the solution simulates a real-time operational system using a public data feed provided by Network Rail. The data from the feed is processed and inserted into a PostgreSQL database. An event monitor script listens for notifications sent from the database and forwards them to the real-time processing queue (Kafka), either located on Oracle Cloud or on the Oracle Big Data Appliance.

For more details on this part of the solution, please, refer to the previous article using Microsoft Technologies. The design is exactly the same in this case, except that, in this solution, the Event Monitor sends the events to a Kafka queue (cloud or on premises), instead of sending them to an Azure Event Hub.

b. Oracle Cloud and Oracle Big Data Appliance

As explained in previous sections and shown in the diagram above, the Oracle solution for real-time stream processing can be developed both in the Oracle Cloud and using the Oracle Big Data Appliance. Both solutions are similar, as the underlying technologies are Kafka for event queueing and Spark Streaming for event processing.

Apache Kafka is a message queueing system where Producers and Consumers can send and receive messages from the different queues, respectively. Each queue is called Topic, and there can be many of them per Kafka Broker (or node). The Topics can be optionally split into Partitions, which means that the messages will be distributed among them. Kafka uses Zookeper for configuration and managing tasks.

In our scenario, we are using a simple configuration with just a single Kafka Broker, three Topics, each of them with a single partition. The process works as follows:

The Event Monitor sends the events received from the database to Topic 1.
Spark Streaming consumes the messages from Topic 1, processes them and sends the results to Topics 2 and 3.
In the Flask Web Server a couple of Kafka Consumers are listening to Topics 2 and 3, and forwarding them to the web application.
Interaction between main components of the solution and the different Kafka topics

Figure 2: Interaction between main components of the solution and the different Kafka topics

Spark Streaming is one of the multiple tools of the Apache Spark stack, built on top of the Spark Core. Basically, it converts a continuous stream of messages (called DStream) into a batched stream using a specific time window. Each batch is treated as a normal Spark RDD (Resilient Distributed Dataset), which is the basic unit of data in Spark. The Spark Core can apply most of the available operations to process this RDDs of the batched stream.

Spark Streaming processing workflow

Figure 3: Spark Streaming processing workflow

In our scenario, Spark Streaming is being used to aggregate the input data and to calculate averages of the PPM metric by timestamp and by operator. The process to calculate these averages requires few operations, as shown in the Python code snippet below:

# Create Kafka consumer (using Spark Streaming API)
consumer = KafkaUtils.createDirectStream(streamingContext,
{"": kafkaBroker})

# Create Kafka Producer (using Kafka for Python API)
producer = KafkaProducer(bootstrap_servers=brokers,
value_serializer=lambda v: json.dumps(v).encode('utf-8'))

# Consume the input topic
# Calculate average by timestamp
# Produce to the output topic
ppmAvg = x: json.loads(x[1]))
                 .map(lambda x: (x[u'timestamp'], float(x[u'ppm']))
                 .reduceByKey(lambda a, b: (a+b)/2)\
                 .transform(lambda rdd: rdd.sortByKey())\
                 .foreachRDD(lambda rdd: sendkafka(producer, topicOut, rdd))

From this sample code, we can see that the RDD processing is similar to Spark Core, with operations such as map, reduceByKey and transform. The main difference in Spark Streaming is that we can use the foreachRDD operation, which executes the specified function for each of the processed batches.

It is also important to know that, at the time of writing this article, the Spark Streaming API in Python does not offer the option to create a Kafka Producer. Therefore, we are using the Kafka for Python library to create it and send the processed messages.

Together with the average by timestamp shown above, we are also generating an average by operator. As we have two sets of processed data, Spark needs to send the data to two separate Kafka Topics (2 and 3).

One key problem with Spark Streaming is that it does not allow data processing in Event Time. This basically means that we can’t synchronize the windowing applied by Spark to create the batches to the timestamps of the source events. Therefore, as shown in the diagram below, this can lead to the events of different source timestamps being mixed in with the aggregates created by Spark.

Misalignment between Kafka and Spark Streaming windows causes events to be processed inside the incorrect time window

Figure 4: Misalignment between Kafka and Spark Streaming windows causes events to be processed inside the incorrect time window

In fact, in our scenario, we have events that are timestamped in the source, but we were not able to align the Spark Streaming batching process to this time.

There are some possible solutions for this issue, namely:

Spark Streaming with Updates.This is a workaround, were we can tell Spark Streaming to update the results of a batch with data coming in a “future” processing window. We tested this approach but, unfortunately, it lead to compatibility errors with Kafka and we couldn’t go ahead with it.
Spark Structured Streaming.This is a separate tool of the stack built on top of Spark SQL, which is meant to solve the problem with Event Time processing. Unfortunately, at the time of writing this article, it is still only available in “alpha” version as part of Spark 2.X. Again, we were able to test this experimental feature but couldn’t get it to properly work with Kafka.
Other streaming processing tools.There are other existing tools in the Big Data ecosystem that can work with Event Time. Kafka itself has a Streams API that can be used for simple data processing. There is also Storm and others.

c. On Premise BI System

As introduced earlier, Oracle standard BI solutions don’t offer the possibility to connect to real-time sources. Therefore, we decided to build our own custom platform and integrate it with OBIEE.

The key component of this part of the solution is the real-time messaging between the web server and the browser provided by SocketIO. In this library, the client requests the server to start a session. If the server accepts, a continuous stream of bidirectional messages is opened (in our case the messages are unidirectional, from the server to the client). Both the client and the server react to the received messages. Finally, either the client or the server can close the connection (in our system the client closes the connection when the browser is closed).

Continuous bidirectional communication channel using SocketIO

Figure 5: Continuous bidirectional communication channel using SocketIO

Although the message channel is bidirectional, only the server is sending messages to the clients. What it does is consume the events coming from the Kafka Topics populated by Spark Streaming and forwards them, with a slight manipulation, to a SocketIO namespace using two different named channels.

The web server sends the messages received from the Kafka topics to the SocketIO channels

Figure 6: The web server sends the messages received from the Kafka topics to the SocketIO channels

# Create a SocketIO object using the Flask-SocketIO add-in for Flask
socketio = SocketIO(app, async_mode=async_mode)

# Create a Kafka Consumer using PyKafka
client = KafkaClient(hosts=kafka_host, zookeeper_hosts=zookeeper_host)
topic = client.topics[kafka_topic]
consumer = topic.get_simple_consumer(consumer_group=kafka_consumer_group,

# Start the Consumer, monitor the input
# and send the received data through the SocketIO channel
while True:
message = consumer.consume(block=False)
if message is not None:
data = json.loads(message.value.decode('utf-8'))
{'key': data[0], 'value': data[1]}, namespace=namespace)

On the client side we have two possibilities to display the data, a standalone website served by the same web server or a set of custom analysis developed in OBIEE. In both cases, the key elements are the SocketIO and ChartJS JavaScript libraries. The first one establishes the connection with the server and the second one is used to create the charts. The SocketIO object is configured so that anytime it receives a message from any of the channels, it will update the data of the chart and ask ChartJS to refresh it accordingly. The required code in Javascript is shown in the following snippet:

# Create the socket using the JavaScript SocketIO client library
socket = io(location.protocol + '//'
+ document.domain + ':'
+ location.port
+ namespace);

# Create an event listener for the “realtime_data01” channel
# and update the corresponding charts
socket.on('realtime_data01', function(msg) {
if ( > 20) {;[0].data.shift();

Here, we are asking the socket object to update the values of the lineChart object, shift the values if required and update the chart when a new message is received in the realtime_data01 channel. The result will be a set of charts updating automatically as soon as the new data is sent through the socket:

: Standalone web application with real time visualizations

Figure 7: Streaming tiles showing real time data coming from Stream Analytics

Moreover, using dummy analysis with static text visualizations, we can embed the HTML and JavaScript into a normal OBIEE dashboard. In the example below, we can see exactly the same visualisations as in the standalone web. However, we will always be able to combine these real time visualisations with normal RPD-based OBIEE analyses.

Real-time visualizations embedded into an OBIEE dashboard

Figure 8: Real-time visualizations embedded into an OBIEE dashboard


3. Scenario Analysis and Conclusions

Based on our experience developing this solution with Oracle technologies for Real Time BI scenario, we have identified the following benefits, as well as areas for future improvement:

Advantageous features:

Oracle solutions for Big Data are available both in the Cloud and On Premise, suiting perfectly to different companies and scenarios.
The Oracle Big Data stack is based on open source standard software, which makes it really easy to develop solutions and to find guidance.
Kafka topics are really easy to create and configure in a matter of minutes.
Spark Streaming leverages all the processing power of Spark to real time streams, thus allowing for very sophisticated analytics in a few lines of code.
SocketIO allows creating bidirectional channels between web servers and applications and suits very well the needs of a real time application.
The web based front end is very flexible as it can be served standalone or integrated into other tools as OBIEE.

Areas for improvement:

The Oracle Big Data stack comes with a plethora of components pre-installed, which can be unnecessary in many applications.
Spark Streaming windowing is not ready yet for event time processing, which makes it flawed for some real-time applications. Other solutions from the Spark stack are still not yet completely ready for production environments.
Oracle standard BI solutions do not offer real-time visualisation solutions.

Authors: Iñigo Hernáez, Oscar Martinez

Click here if you would like to receive more information about the Real Time BI Services we offer!

Real Time Business Intelligence with Azure and Power BI




The purpose of this blog article is to analyse the increasing need for real time solutions applied to BI platforms. We will first understand the context and then present a user case and its corresponding solution. The technical development takes advantage of the Event Hub, Stream Analytics and SQL Database services of Microsoft Azure, together with visualisations in Power BI.

1. Traditional and Real Time BI

Traditionally, a BI solution consists of a centralized DWH that is loaded incrementally in batches. The most common approach is to run the ETL process once a day, outside of working hours, to avoid undesired workloads and performance issues on the source transactional systems. The main drawback of this approach is that the information analysed by business users is always slightly outdated.

In most cases, business processes can be successfully analysed even if the available information is from previous days. However, in recent years, the number of business processes that require real-time monitoring has increased dramatically. Fraud detection, sensor networks, call centre overloads, mobile traffic monitoring, and transport fleet geolocation are just a few examples of where real-time analysis has transformed from a luxury to a necessity.

To understand the difference between traditional and real-time BI necessities, let’s use the Value-Time Curve, as proposed by Richard Hackathorn. This curve represents the value of an action initiated by a business event along time.

Real Time Business Intelligence with Azure and Power BI

Figure 1: Value-Time curve for business processes, as proposed by Richard Hackathorn

In the figure above, we can see different types of business processes. In all cases, there is an event happening at time 0 and an action triggered at time 8. We can see that:

For business process 1, the decay is quadratic, so we are losing value from the very beginning.
For business process 2, the decay is exponential, so we don’t lose much value at the beginning, but there is a sudden drop in value at one point in time.
For business process 3, the increase is exponential, so, in contrast to the previous cases, the value of the action increases with time.

In these examples, a real-time action is especially critical for Business Process 1, while traditional BI with a batch incremental load can be enough for Business Process 2. The special case of Business Process 3 is clearly not suitable for real-time analytics.

Considering these special business processes, BI is starting to adopt solutions that provide real-time analytic capabilities to business users. The key objective of these solutions is to reduce the latency between the event generated by the business process and the corresponding actions taken by business users.

2. Case Study Overview

For this case study, we are using an open data feed provided by Network Rail, the owner of a large part of the rail network of England, Scotland and Wales. From all their feeds, we selected the Real Time PPM (Public Performance Measure), with the following characteristics:

The feed is provided in JSON format
The data is provided on a 60 seconds basis for around 120 rail services
PPM is measuring the ratio of delayed to total trains
Along with PPM, the total number of on-time, delayed and very late trains are provided

In this case study, we consider the data for each service and minute as a business event. This means that the feed provides around 120 events per minute. This is around 2 events per second. As this event rate is a bit low for analysing the limitations of a real-time solution, we built an event interpolator, which basically takes 2 consecutive feeds and generates interpolated events at a higher rate.

3. Solution Design and Development

The overall architecture of the solution is as displayed in the below figure and is explained in the following sections:

Real Time Business Intelligence with Azure and Power BI

Figure 2: Realtime BI solution design diagram


4. On Premise

Using the data feed provided by Network Rail, we simulated an Operational System with continuous data events. This module does the following operations:

A feed receiver script downloads the feed every minute using the Network Rail API and stores a JSON file in the server
A feed splitter script reads the JSON files and, if required, interpolates the data to create events at a higher frequency. Finally, it inserts the data in a PostgreSQL database.
In the PostgreSQL, a trigger calls a custom function every time a row is inserted in the operational table. This function launches a notification including the data of the new row in JSON format.

Once the Operational System is ready, we setup an Event Monitor, which does the following:

An event monitor script listens to the notifications created by the PostgreSQL database and sends these events to an Azure Event Hub. If required, the script can encapsulate multiple events in one single message sent to the Event Hub. We will explain in detail the necessity of this feature later on.

Apart from the elements explained above, which compose the main information flow in the scenario, we also used other elements in our development:

We used Power BI Desktop to connect to the PostgreSQL database locally and analyse the events being inserted.
We also installed the Power BI Data Gateway and configured it so that we were able to connect to our on-premises PostgreSQL database from Power BI Online


5. Azure

Microsoft’s cloud ecosystem Azure offers a plethora of services — many of them oriented to the BI/Data market. Among them, some services are specially designed for real time applications. In this solution, we have used the following services from Azure:

Event Hub, which is a cloud end-point capable of receiving, queueing and storing (temporally) millions of events per second.
Stream Analytics, which allows the querying and analysing of real time events coming from an Event Hub and sends them to a variety of services like Power BI, SQL Databases and even other Event Hubs.
SQL Database, which is a transparent and scalable database that works as a classical SQL Server.

In this scenario, we are using an Event Hub as the single entry point of events. That is, our on-premises event monitor script is using the Azure SDK for Python to connect to the Azure Service Bus and to send the events to this Event Hub. It is very straightforward to send events using the API. In our case, as the PostgreSQL notifications are sending a JSON object, we have to pass it to the Event Hub with very limited manipulation (just adding a few timestamps for monitoring).

As we introduced before, the event monitor can encapsulate multiple events to reduce the number of messages to be sent to the Event Hub. The reason to do this is that, despite the ingest throughput of the Event Hub being high (and scalable if required), the network latency impacts the number of events we can send. Basically, each event sent is a call to the REST API of the Event Hub, which means that until a response is received, the process is blocked. Depending on the network latency, this can take a long time and limit the overall application throughput.

Luckily, the Event Hub API allows encapsulating multiple events in one message. To do this we create a JSON Array and push multiple event objects into it. The Event Hub is able to extract each individual event and process it independently.

Real Time Business Intelligence with Azure and Power BI

Figure 3: Schematic view of multiple events encapsulated into one packet to overcome the limitation imposed by network latency

By increasing the number of events per message, we can overcome the network latency limitation, as shown in the figure below. It is also important to select correctly the Azure Region where we deploy our services, so that latency is minimised. There are online services that can estimate the latency to different regions. However, it is always better to test it for each specific case and environment.


Real Time Business Intelligence with Azure and Power BI

Figure 4: As the number of events per packet increases, the system can deal with higher network latencies.

Once the stream is queued by the Event Hub, we can use it as the input of a Stream Analytics service. With this service we can apply SQL-like queries to our stream of events and create aggregates and other calculations. Two of the most important points of the queries are the Timestamp and Time Window functions. See the example below:

AVG(ppm) avgPpm
INTO powerBIOutput1
FROM eventHubInput1
GROUP BY operatorName, TUMBLINGWINDOW(second, 5)

In this query we are doing the following:

Reading data from the eventHubInput1 input
Aggregating the ppm metric by applying an average
Applying a Timestamp to the input stream using the eventDate column
Grouping the values for each operatorName using a tumblingwindow of 5 seconds
Sending the result to the powerBIOutput1 output

There are 3 types of Time Windows that can be applied in Stream Analytics, but the most common one is the Tumbling Window, which uses non-overlapping time windows to aggregate the events.

The data processes in the Stream Analytics are then forwarded to the following outputs:

Other Event Hubs and Stream Analytics so that more complex aggregations can be created. In our case, we used this pipe to create a Bottom 10 aggregation.
An Azure SQL Database service to store the values. We are using it to store the last 1 hour of data.
Power BI Cloud as a Streaming Dataset for real-time analysis.


6. Power BI Cloud

As explained before, one of the outputs from Azure Stream Analytics is Power BI Cloud. When we create this type of output, one Power BI account is authorized to access the stream. Automatically, when the data starts to flow, the new stream will be added to the Streaming Datasets section in Power BI Cloud:

Real Time Business Intelligence with Azure and Power BI

Figure 5: Real time datasets of Power BI generated in Stream Analytics

Once our Streaming Dataset is ready, we can create real-time visualizations, using the Real-Time Data Tiles that can be added to any of our dashboards, and select the appropriate dataset and visualisation:

Real Time Business Intelligence with Azure and Power BI

Figure 6: Process of adding a streaming tile to a Power BI dashboard

At the time of writing this article, the following visualisations are available:


The Real-Time Data Tiles are automatically refreshed depending on the Windowing applied to the incoming dataset in Stream Analytics. In the query shown above, we were applying a Tumbling Window of 5 seconds, so the data in the tiles will be refreshed accordingly:

Real Time Business Intelligence with Azure and Power BI

Figure 7: Streaming tiles showing real time data coming from Stream Analytics

The main drawback that we have experienced with real-time visualisations is that they are currently limited both in terms of quantity and customisation, as compared to the Power BI reports that we can create with standard datasets. We believe this is due to the relatively early stage of development of these visualisations.

To overcome this issue, we also tested the Direct Query feature, which allows direct connection to the data source without prefetching the data. We can use this feature only with some technologies, including on-premises and cloud databases. In our case, we tested the following scenarios:

On-premises SQL Server through Power BI Data Gateway
Azure SQL Database

For this type of datasets, we can configure a cache refresh interval, which, as of today, is limited to 15 minutes.

Real Time Business Intelligence with Azure and Power BI

Figure 8: Setting up the refresh schedule of a Direct Query dataset

The reports that we create using these data sources will be automatically refreshed at the specified frequency:

Real Time Business Intelligence with Azure and Power BI

Figure 9: Power BI report created using the Azure SQL Database data with Direct Query

The combination of Real-Time Data Tiles using Streaming Datasets and Reports using Datasets with Direct Query, both on-premises and on Azure, provides the best results for a real-time BI dashboard in Power BI.

Real Time Business Intelligence with Azure and Power BI

Figure 10: Final Power BI dashboard that combines Streaming and Direct Query tiles

7. Scenario Analysis and Conclusions

Considering this experience developing a solution with Azure and Power BI for a Real Time BI platform, we have identified the following benefits, as well as areas for future improvement:

Advantageous features:

Setting up the Azure services for real-time data queueing and processing is very simple.
Scaling out the Azure services when we need more resources is just a matter of a few clicks and is completely transparent for the developer.
The integration between Azure and Power BI is very powerful.
The Real Time Data Tiles of Power BI Cloud, especially their automatic refreshing, is something that can differentiate the product from its competitors.
Using Direct Query data sources, we can complement the real-time dashboard with near real-time data.

Areas for improvement:

In a BI environment, events will mainly be generated in transactional systems, so a continuous monitoring is required, which involves custom developments for each scenario.
The network latency of the Azure service imposes a limit on the number of events that can be sent per second. We can packetize multiple events to increase the throughput, but there might be a bottleneck.
Real Time Data Tiles in Power BI seem to be at an early stage of development and they are almost non-customizable.
Cache refreshing for Direct Query sources in Power BI is limited to 15 minutes, which might be not enough in some scenarios.
Even though starting the Event Hub and Stream Analytics services is rather fast and easy, it can take up to one hour until we start seeing the real-time tiles showing data and being refreshed. However, once the real-time tiles start showing data they are stable.
The SQL Database service of Azure gets unresponsive sometimes and requires manual intervention to fix it (restarting, truncating, etc.). The current information provided by Azure is not clear enough to help in finding why the service gets unresponsive.

Authors: Iñigo Hernáez, Oscar Martinez

Click here if you would like to receive more information about the Real Time BI Services we offer!

privacy policy - Copyright © 2000-2010 ClearPeaks