Real Time Business Intelligence with Oracle Technologies


 

Introduction

In our previous blog article we discussed the necessity of real-time solutions for Business Intelligence (BI) platforms and presented a use case built with Microsoft technologies (namely Azure and Power BI). In this article we analyse the same scenario, but instead propose a design using Oracle Cloud and Oracle on-premises technologies.

We recommend going through the previous blog article to fully understand the scenario under analysis and its requirements.

 

1. Oracle Technologies for Real Time BI

Oracle offers both cloud and on-premises solutions focused on Big Data. These solutions can be used for a variety of Big Data applications, including real-time streaming analytics.

a. Oracle Cloud Services

Of all the services offered on Oracle Cloud, the following suit the needs of a real-time BI application:

Oracle Event Hub Service. This service provides a managed Apache Kafka cluster, where we can easily assign resources and create topics using the web interface.
Oracle Big Data Compute Edition Service (OBDCE). This service provides a managed Big Data cluster with most of the Apache Hadoop/Spark stack elements, where the resource management and execution of tasks is also managed from the web interface.

Both the Event Hub and the OBDCE services are part of the PaaS offerings of Oracle Cloud, which means that the whole hardware and software infrastructure behind them is managed by Oracle. The key benefit is that, even though we are using standard open source technologies, we don’t have to worry about resource provisioning, software updates, connectivity, etc. With this, the developers can focus on building the solutions, without losing time on administrative tasks.

Another important point is that the connectivity between services on the cloud can be configured very easily using the web console, which ensures a reliable and safe path for the data.

b. On Premise Technologies

In addition to the cloud solution, a similar environment can be built on premises. For this we are using the Oracle Big Data Appliance. The Big Data Appliance consists, generally speaking, of the following components:

Hardware: Several server nodes and other accessory elements for networking, power, etc. The configuration can range from a starter rack with 6 nodes to a full rack with 18 nodes. Multi-rack configurations allow for an even larger number of nodes.
Software: All the components of the Cloudera Distribution for Hadoop (CDH) and additional components from Oracle like Oracle Database, Oracle NoSQL Database, Oracle Spatial and Graph, Oracle R Enterprise, Oracle Connectors, etc.

For the purpose of our Real Time project, the required components are Kafka and Spark, as we will see later. Both are part of CDH and key elements of the standard open source real-time analytics scenario. In this case, all the components will be available in the same cluster.

It is also important to know that, for demo projects like this one, Oracle offers a Big Data Lite virtual machine, which contains most of the components of the Big Data Appliance.

c. Real-Time Visualization

At the time of writing this article, there is no BI tool from Oracle (OBIEE, DV) that allows visualization of real-time data. To tackle this, we decided to build a custom front end that can be used either as a standalone web application or integrated into OBIEE.
The key technologies we are using for this purpose are listed below, followed by a minimal sketch of how they fit together on the server side:

Flask, which is a lightweight web framework for Python.
SocketIO, which is a framework based on WebSockets for real-time web applications using asynchronous communication.
ChartJS, which is a JavaScript library for building charts.
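
To give an idea of how these pieces fit together on the server side, the following minimal sketch shows a Flask application wired to Flask-SocketIO; the file name, host and port are illustrative placeholders rather than the actual project configuration.

# Minimal Flask + Flask-SocketIO server skeleton (illustrative sketch, not the actual project code)
from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app, async_mode='threading')

@app.route('/')
def index():
    # Serves the page that loads the SocketIO and ChartJS JavaScript libraries
    return app.send_static_file('index.html')

if __name__ == '__main__':
    # socketio.run wraps app.run and enables the WebSocket transport
    socketio.run(app, host='0.0.0.0', port=5000)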

 

2. Solution Design and Development

The architectural design for the solutions is shown in the figure below and explained throughout the rest of this section:


Figure 1: Realtime BI solution design diagram, using both cloud and on premise Oracle technologies

a. On Premise Source System
The on-premises source system part of the solution simulates a real-time operational system using a public data feed provided by Network Rail. The data from the feed is processed and inserted into a PostgreSQL database. An event monitor script listens for notifications sent from the database and forwards them to the real-time processing queue (Kafka), located either on Oracle Cloud or on the Oracle Big Data Appliance.

For more details on this part of the solution, please refer to the previous article using Microsoft technologies. The design is exactly the same in this case, except that here the Event Monitor sends the events to a Kafka queue (cloud or on premises) instead of sending them to an Azure Event Hub.
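
As an illustration of how an event monitor of this kind could forward the database notifications to Kafka, the sketch below uses psycopg2 to listen for PostgreSQL notifications and kafka-python to publish the JSON payloads; the connection string, channel name, broker address and topic name are placeholders, not the actual project configuration.

# Hypothetical event monitor: forwards PostgreSQL NOTIFY payloads to Kafka Topic 1
import select
import psycopg2
from kafka import KafkaProducer

conn = psycopg2.connect("dbname=rail user=monitor")             # placeholder connection string
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()
cur.execute("LISTEN rail_events;")                              # placeholder channel raised by the DB trigger

producer = KafkaProducer(bootstrap_servers='kafkabroker:9092')  # cloud or Big Data Appliance broker

while True:
    # Wait until the database reports a pending notification
    if select.select([conn], [], [], 5) != ([], [], []):
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            # The trigger sends the new row as a JSON string in the notification payload
            producer.send('topic1', notify.payload.encode('utf-8'))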

b. Oracle Cloud and Oracle Big Data Appliance

As explained in previous sections and shown in the diagram above, the Oracle solution for real-time stream processing can be developed both in the Oracle Cloud and using the Oracle Big Data Appliance. Both solutions are similar, as the underlying technologies are Kafka for event queueing and Spark Streaming for event processing.

Apache Kafka is a message queueing system where Producers and Consumers send messages to and receive messages from the different queues, respectively. Each queue is called a Topic, and there can be many of them per Kafka Broker (or node). Topics can optionally be split into Partitions, which means that the messages will be distributed among them. Kafka uses ZooKeeper for configuration and management tasks.

In our scenario, we are using a simple configuration with a single Kafka Broker and three Topics, each with a single partition. The process works as follows:

The Event Monitor sends the events received from the database to Topic 1.
Spark Streaming consumes the messages from Topic 1, processes them and sends the results to Topics 2 and 3.
In the Flask web server, a couple of Kafka Consumers listen to Topics 2 and 3 and forward the messages to the web application.

Figure 2: Interaction between main components of the solution and the different Kafka topics

Spark Streaming is one of the multiple tools of the Apache Spark stack, built on top of the Spark Core. Basically, it converts a continuous stream of messages (called a DStream) into a batched stream using a specific time window. Each batch is treated as a normal Spark RDD (Resilient Distributed Dataset), which is the basic unit of data in Spark. The Spark Core can then apply most of the available operations to process these RDDs of the batched stream.


Figure 3: Spark Streaming processing workflow

In our scenario, Spark Streaming is used to aggregate the input data and calculate averages of the PPM metric by timestamp and by operator. Calculating these averages requires just a few operations, as shown in the Python code snippet below:

# Imports needed by this snippet (Spark Streaming Kafka API and kafka-python)
import json
from pyspark.streaming.kafka import KafkaUtils
from kafka import KafkaProducer

# Create Kafka consumer (using Spark Streaming API)
consumer = KafkaUtils.createDirectStream(streamingContext,
                                         [topicIn],
                                         {"metadata.broker.list": kafkaBroker})

# Create Kafka Producer (using Kafka for Python API)
producer = KafkaProducer(bootstrap_servers=brokers,
                         value_serializer=lambda v: json.dumps(v).encode('utf-8'))

# Consume the input topic
# Calculate the average by timestamp
# Produce to the output topic
ppmAvg = consumer.map(lambda x: json.loads(x[1])) \
                 .map(lambda x: (x[u'timestamp'], float(x[u'ppm']))) \
                 .reduceByKey(lambda a, b: (a+b)/2) \
                 .transform(lambda rdd: rdd.sortByKey()) \
                 .foreachRDD(lambda rdd: sendkafka(producer, topicOut, rdd))

From this sample code, we can see that the RDD processing is similar to Spark Core, with operations such as map, reduceByKey and transform. The main difference in Spark Streaming is that we can use the foreachRDD operation, which executes the specified function for each of the processed batches.

It is also important to know that, at the time of writing this article, the Spark Streaming API in Python does not offer the option to create a Kafka Producer. Therefore, we are using the Kafka for Python library to create it and send the processed messages.
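
For completeness, the sketch below shows how the streamingContext and the sendkafka helper referenced in the snippet above could look; the 5-second batch interval and the collect-based send are illustrative assumptions rather than the exact project implementation.

# Illustrative sketch of the pieces referenced above (assumptions, not the exact project code)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="RealTimePPM")
streamingContext = StreamingContext(sc, 5)   # 5-second processing window (assumed value)

def sendkafka(producer, topic, rdd):
    # Collect the aggregated batch on the driver and publish each (key, value) pair
    # as a small JSON array, which is what the web server consumer expects
    for record in rdd.collect():
        producer.send(topic, [record[0], record[1]])
    producer.flush()

# Once the DStream transformations shown above have been defined:
streamingContext.start()
streamingContext.awaitTermination()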

Together with the average by timestamp shown above, we are also generating an average by operator. As we have two sets of processed data, Spark needs to send the data to two separate Kafka Topics (2 and 3).

One key problem with Spark Streaming is that it does not allow data processing in Event Time. This basically means that we can’t synchronize the windowing applied by Spark to create the batches with the timestamps of the source events. Therefore, as shown in the diagram below, events with different source timestamps can end up mixed into the aggregates created by Spark.


Figure 4: Misalignment between Kafka and Spark Streaming windows causes events to be processed inside the incorrect time window

In fact, in our scenario, we have events that are timestamped at the source, but we were not able to align the Spark Streaming batching process with this time.

There are some possible solutions for this issue, namely:

Spark Streaming with Updates. This is a workaround where we can tell Spark Streaming to update the results of a batch with data coming in a “future” processing window (see the sketch after this list). We tested this approach but, unfortunately, it led to compatibility errors with Kafka and we couldn’t go ahead with it.
Spark Structured Streaming. This is a separate tool of the stack, built on top of Spark SQL, which is meant to solve the problem with Event Time processing. Unfortunately, at the time of writing this article, it is still only available as an “alpha” version in Spark 2.x. Again, we were able to test this experimental feature but couldn’t get it to work properly with Kafka.
Other stream processing tools. There are other tools in the Big Data ecosystem that can work with Event Time. Kafka itself has a Streams API that can be used for simple data processing, and there are also Storm and other frameworks.
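
As a reference for the first workaround listed above, the following sketch shows the kind of stateful logic we experimented with, based on the updateStateByKey operation of Spark Streaming; the running-average state is a simplified assumption and, as mentioned, we could not get this approach to work reliably with Kafka in our environment.

# Simplified sketch of the "Spark Streaming with Updates" workaround (updateStateByKey)
# (reuses the consumer DStream, producer and sendkafka helper from the snippets above)
streamingContext.checkpoint("/tmp/spark-checkpoint")   # stateful operations require checkpointing

def update_ppm(new_values, state):
    # State per timestamp: (sum of ppm values, count), updated when late events arrive
    total, count = state or (0.0, 0)
    for ppm in new_values:
        total += ppm
        count += 1
    return (total, count)

ppmState = consumer.map(lambda x: json.loads(x[1])) \
                   .map(lambda x: (x[u'timestamp'], float(x[u'ppm']))) \
                   .updateStateByKey(update_ppm)

# Emit the current average for every timestamp seen so far
ppmState.map(lambda kv: (kv[0], kv[1][0] / kv[1][1])) \
        .foreachRDD(lambda rdd: sendkafka(producer, topicOut, rdd))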

c. On Premise BI System

As introduced earlier, Oracle standard BI solutions don’t offer the possibility to connect to real-time sources. Therefore, we decided to build our own custom platform and integrate it with OBIEE.

The key component of this part of the solution is the real-time messaging between the web server and the browser, provided by SocketIO. With this library, the client requests the server to start a session. If the server accepts, a continuous stream of bidirectional messages is opened (in our case the messages are unidirectional, from the server to the client). Both the client and the server react to the received messages. Finally, either the client or the server can close the connection (in our system the client closes the connection when the browser is closed).


Figure 5: Continuous bidirectional communication channel using SocketIO
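
As an illustration of this session lifecycle, the snippet below sketches how the connection and disconnection events could be handled on the server with Flask-SocketIO; the namespace name is a placeholder.

# Illustrative Flask-SocketIO handlers for the session lifecycle described above
from flask import Flask, request
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app)

@socketio.on('connect', namespace='/realtime')      # placeholder namespace
def handle_connect():
    # The browser has requested a session and the server has accepted it
    print('Client connected: %s' % request.sid)

@socketio.on('disconnect', namespace='/realtime')
def handle_disconnect():
    # The client closed the page, so the connection is released
    print('Client disconnected: %s' % request.sid)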

Although the message channel is bidirectional, only the server sends messages to the clients. It consumes the events coming from the Kafka Topics populated by Spark Streaming and forwards them, with a slight manipulation, to a SocketIO namespace using two different named channels.


Figure 6: The web server sends the messages received from the Kafka topics to the SocketIO channels

# Imports needed by this snippet (Flask-SocketIO and PyKafka)
import json
from flask_socketio import SocketIO
from pykafka import KafkaClient
from pykafka.common import OffsetType

# Create a SocketIO object using the Flask-SocketIO add-in for Flask
socketio = SocketIO(app, async_mode=async_mode)

# Create a Kafka Consumer using PyKafka
client = KafkaClient(hosts=kafka_host, zookeeper_hosts=zookeeper_host)
topic = client.topics[kafka_topic]
consumer = topic.get_simple_consumer(consumer_group=kafka_consumer_group,
                                     auto_offset_reset=OffsetType.LATEST,
                                     auto_commit_enable=True,
                                     auto_commit_interval_ms=1000,
                                     auto_start=False)

# Start the Consumer, monitor the input topic
# and send the received data through the SocketIO channel
consumer.start()
while True:
    socketio.sleep(0.1)
    message = consumer.consume(block=False)
    if message is not None:
        data = json.loads(message.value.decode('utf-8'))
        socketio.emit(channel,
                      {'key': data[0], 'value': data[1]},
                      namespace=namespace)

On the client side we have two possibilities to display the data: a standalone website served by the same web server, or a set of custom analyses developed in OBIEE. In both cases, the key elements are the SocketIO and ChartJS JavaScript libraries. The first establishes the connection with the server and the second is used to create the charts. The SocketIO object is configured so that any time it receives a message from any of the channels, it updates the data of the chart and asks ChartJS to refresh it accordingly. The required JavaScript code is shown in the following snippet:

// Create the socket using the JavaScript SocketIO client library
socket = io(location.protocol + '//'
            + document.domain + ':'
            + location.port
            + namespace);

// Create an event listener for the "realtime_data01" channel
// and update the corresponding charts
socket.on('realtime_data01', function(msg) {
    if (lineChart.data.labels.length > 20) {
        lineChart.data.labels.shift();
        lineChart.data.datasets[0].data.shift();
    }
    lineChart.data.labels.push(msg.key);
    lineChart.data.datasets[0].data.push(msg.value);
    lineChart.update();
});

Here, we are asking the socket object to update the values of the lineChart object, shift the values if required and refresh the chart whenever a new message is received on the realtime_data01 channel. The result is a set of charts that update automatically as soon as new data is sent through the socket:

Figure 7: Standalone web application with real-time visualizations

Moreover, using a dummy analysis with a static text visualization, we can embed the HTML and JavaScript into a normal OBIEE dashboard. In the example below, we can see exactly the same visualisations as in the standalone web application, and we can always combine these real-time visualisations with normal RPD-based OBIEE analyses.


Figure 8: Real-time visualizations embedded into an OBIEE dashboard

 

3. Scenario Analysis and Conclusions

Based on our experience developing this solution with Oracle technologies for a real-time BI scenario, we have identified the following benefits, as well as areas for future improvement:

Advantageous features:

Oracle solutions for Big Data are available both in the Cloud and On Premise, making them a good fit for different companies and scenarios.
The Oracle Big Data stack is based on standard open source software, which makes it really easy to develop solutions and to find guidance.
Kafka topics are really easy to create and configure in a matter of minutes.
Spark Streaming brings all the processing power of Spark to real-time streams, allowing for very sophisticated analytics in a few lines of code.
SocketIO allows the creation of bidirectional channels between web servers and applications and suits the needs of a real-time application very well.
The web-based front end is very flexible, as it can be served standalone or integrated into other tools such as OBIEE.

Areas for improvement:

The Oracle Big Data stack comes with a plethora of pre-installed components, which can be unnecessary for many applications.
Spark Streaming windowing is not yet ready for event-time processing, which is a limitation for some real-time applications. Other solutions from the Spark stack are not yet completely ready for production environments.
Oracle standard BI solutions do not offer real-time visualisation capabilities.

Authors: Iñigo Hernáez, Oscar Martinez

Click here if you would like to receive more information about the Real Time BI Services we offer!

Real Time Business Intelligence with Azure and Power BI


 

Introduction

The purpose of this blog article is to analyse the increasing need for real-time solutions applied to BI platforms. We will first understand the context and then present a use case and its corresponding solution. The technical development takes advantage of the Event Hub, Stream Analytics and SQL Database services of Microsoft Azure, together with visualisations in Power BI.
 

1. Traditional and Real Time BI

Traditionally, a BI solution consists of a centralized DWH that is loaded incrementally in batches. The most common approach is to run the ETL process once a day, outside of working hours, to avoid undesired workloads and performance issues on the source transactional systems. The main drawback of this approach is that the information analysed by business users is always slightly outdated.

In most cases, business processes can be successfully analysed even if the available information is from previous days. However, in recent years, the number of business processes that require real-time monitoring has increased dramatically. Fraud detection, sensor networks, call centre overloads, mobile traffic monitoring, and transport fleet geolocation are just a few examples of where real-time analysis has transformed from a luxury to a necessity.

To understand the difference between traditional and real-time BI necessities, let’s use the Value-Time Curve proposed by Richard Hackathorn. This curve represents the value of an action initiated by a business event over time.


Figure 1: Value-Time curve for business processes, as proposed by Richard Hackathorn

In the figure above, we can see different types of business processes. In all cases, there is an event happening at time 0 and an action triggered at time 8. We can see that:

For business process 1, the decay is quadratic, so we are losing value from the very beginning.
For business process 2, the decay is exponential, so we don’t lose much value at the beginning, but there is a sudden drop in value at one point in time.
For business process 3, the increase is exponential, so, in contrast to the previous cases, the value of the action increases with time.

In these examples, a real-time action is especially critical for Business Process 1, while traditional BI with a batch incremental load can be enough for Business Process 2. The special case of Business Process 3 is clearly not suitable for real-time analytics.

Considering these special business processes, BI is starting to adopt solutions that provide real-time analytic capabilities to business users. The key objective of these solutions is to reduce the latency between the event generated by the business process and the corresponding actions taken by business users.
 

2. Case Study Overview

For this case study, we are using an open data feed provided by Network Rail, the owner of a large part of the rail network of England, Scotland and Wales. From all their feeds, we selected the Real Time PPM (Public Performance Measure), with the following characteristics:

The feed is provided in JSON format
The data is provided on a 60-second basis for around 120 rail services
PPM measures the ratio of delayed to total trains
Along with PPM, the total numbers of on-time, delayed and very late trains are provided

In this case study, we consider the data for each service and minute as a business event. This means that the feed provides around 120 events per minute, which is around 2 events per second. As this event rate is a bit low for analysing the limitations of a real-time solution, we built an event interpolator, which basically takes 2 consecutive feeds and generates interpolated events at a higher rate.
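
To give an idea of what the interpolator does, the sketch below generates intermediate events between two consecutive PPM readings by linear interpolation; the field names and the number of interpolation steps are illustrative assumptions about the feed structure.

# Illustrative event interpolator: expands 2 consecutive PPM readings into N interpolated events
# (the field names and the number of steps are assumptions about the feed structure)
def interpolate_events(prev_event, next_event, steps=10):
    events = []
    dt = (next_event['timestamp'] - prev_event['timestamp']) / float(steps)
    dppm = (next_event['ppm'] - prev_event['ppm']) / float(steps)
    for i in range(1, steps + 1):
        events.append({
            'operatorName': prev_event['operatorName'],
            'timestamp': prev_event['timestamp'] + i * dt,
            'ppm': prev_event['ppm'] + i * dppm,
        })
    return events

# Example: two readings 60 seconds apart become 10 events, one every 6 seconds
previous = {'operatorName': 'Operator A', 'timestamp': 0, 'ppm': 91.0}
current = {'operatorName': 'Operator A', 'timestamp': 60, 'ppm': 94.0}
for event in interpolate_events(previous, current):
    print(event)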
 

3. Solution Design and Development

The overall architecture of the solution is as displayed in the below figure and is explained in the following sections:


Figure 2: Realtime BI solution design diagram

 

4. On Premise

Using the data feed provided by Network Rail, we simulated an Operational System with continuous data events. This module does the following operations:

A feed receiver script downloads the feed every minute using the Network Rail API and stores a JSON file on the server.
A feed splitter script reads the JSON files and, if required, interpolates the data to create events at a higher frequency. Finally, it inserts the data into a PostgreSQL database.
In PostgreSQL, a trigger calls a custom function every time a row is inserted into the operational table. This function launches a notification including the data of the new row in JSON format.

Once the Operational System is ready, we set up an Event Monitor, which does the following:

An event monitor script listens to the notifications created by the PostgreSQL database and sends these events to an Azure Event Hub. If required, the script can encapsulate multiple events in one single message sent to the Event Hub. We will explain in detail the necessity of this feature later on.

Apart from the elements explained above, which compose the main information flow in the scenario, we also used other elements in our development:

We used Power BI Desktop to connect to the PostgreSQL database locally and analyse the events being inserted.
We also installed the Power BI Data Gateway and configured it so that we were able to connect to our on-premises PostgreSQL database from Power BI Online.

 

5. Azure

Microsoft’s cloud ecosystem, Azure, offers a plethora of services, many of them oriented to the BI/Data market. Among them, some services are specially designed for real-time applications. In this solution, we have used the following services from Azure:

Event Hub, which is a cloud endpoint capable of receiving, queueing and temporarily storing millions of events per second.
Stream Analytics, which allows querying and analysing real-time events coming from an Event Hub and sending the results to a variety of services like Power BI, SQL Databases and even other Event Hubs.
SQL Database, which is a transparent and scalable database service that works like a classical SQL Server.

In this scenario, we are using an Event Hub as the single entry point for events. That is, our on-premises event monitor script uses the Azure SDK for Python to connect to the Azure Service Bus and send the events to this Event Hub. It is very straightforward to send events using the API. In our case, as the PostgreSQL notifications already carry a JSON object, we pass it on to the Event Hub with very limited manipulation (just adding a few timestamps for monitoring).

As we introduced before, the event monitor can encapsulate multiple events to reduce the number of messages sent to the Event Hub. The reason for this is that, although the ingest throughput of the Event Hub is high (and scalable if required), the network latency limits the number of events we can send. Basically, each event sent is a call to the REST API of the Event Hub, which means that the process is blocked until a response is received. Depending on the network latency, this can take a long time and limit the overall application throughput.

Luckily, the Event Hub API allows encapsulating multiple events in one message. To do this we create a JSON Array and push multiple event objects into it. The Event Hub is able to extract each individual event and process it independently.


Figure 3: Schematic view of multiple events encapsulated into one packet to overcome the limitation imposed by network latency
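
The sketch below illustrates this packing approach with the legacy azure-servicebus Python SDK that was current at the time of writing; the namespace, access key and hub names are placeholders and the batch size is an arbitrary example value.

# Illustrative sketch: accumulate events and send them as one JSON array per message
# (legacy azure-servicebus Python SDK; namespace, key and hub names are placeholders)
import json
from azure.servicebus import ServiceBusService

sbs = ServiceBusService(service_namespace='mynamespace',
                        shared_access_key_name='SendPolicy',
                        shared_access_key_value='<access-key>')

BATCH_SIZE = 20   # arbitrary example value
event_buffer = []

def queue_event(event):
    # Append the event to the buffer and flush it as a single message when full
    event_buffer.append(event)
    if len(event_buffer) >= BATCH_SIZE:
        sbs.send_event('myeventhub', json.dumps(event_buffer))
        del event_buffer[:]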

By increasing the number of events per message, we can overcome the network latency limitation, as shown in the figure below. It is also important to select correctly the Azure Region where we deploy our services, so that latency is minimised. There are online services that can estimate the latency to different regions. However, it is always better to test it for each specific case and environment.

 


Figure 4: As the number of events per packet increases, the system can deal with higher network latencies.

Once the stream is queued by the Event Hub, we can use it as the input of a Stream Analytics service. With this service we can apply SQL-like queries to our stream of events and create aggregates and other calculations. Two of the most important points of the queries are the Timestamp and Time Window functions. See the example below:

SELECT
operatorName,
AVG(ppm) avgPpm
INTO powerBIOutput1
FROM eventHubInput1
TIMESTAMP BY eventDate
GROUP BY operatorName, TUMBLINGWINDOW(second, 5)

In this query we are doing the following:

Reading data from the eventHubInput1 input
Aggregating the ppm metric by applying an average
Applying a Timestamp to the input stream using the eventDate column
Grouping the values for each operatorName using a Tumbling Window of 5 seconds
Sending the result to the powerBIOutput1 output

There are 3 types of Time Windows that can be applied in Stream Analytics, but the most common one is the Tumbling Window, which uses non-overlapping time windows to aggregate the events.

The data processed in Stream Analytics is then forwarded to the following outputs:

Other Event Hubs and Stream Analytics so that more complex aggregations can be created. In our case, we used this pipe to create a Bottom 10 aggregation.
An Azure SQL Database service to store the values. We are using it to store the last 1 hour of data.
Power BI Cloud as a Streaming Dataset for real-time analysis.

 

6. Power BI Cloud

As explained before, one of the outputs from Azure Stream Analytics is Power BI Cloud. When we create this type of output, a Power BI account is authorized to access the stream. When the data starts to flow, the new stream is automatically added to the Streaming Datasets section in Power BI Cloud:


Figure 5: Real time datasets of Power BI generated in Stream Analytics

Once our Streaming Dataset is ready, we can create real-time visualizations using the Real-Time Data Tiles, which can be added to any of our dashboards by selecting the appropriate dataset and visualisation:


Figure 6: Process of adding a streaming tile to a Power BI dashboard

At the time of writing this article, the following visualisations are available:

Card
Gauge
Line
Bar
Column

The Real-Time Data Tiles are automatically refreshed depending on the Windowing applied to the incoming dataset in Stream Analytics. In the query shown above, we were applying a Tumbling Window of 5 seconds, so the data in the tiles will be refreshed accordingly:


Figure 7: Streaming tiles showing real time data coming from Stream Analytics

The main drawback that we have experienced with real-time visualisations is that they are currently limited both in terms of quantity and customisation, as compared to the Power BI reports that we can create with standard datasets. We believe this is due to the relatively early stage of development of these visualisations.

To overcome this issue, we also tested the Direct Query feature, which allows a direct connection to the data source without prefetching the data. This feature can only be used with certain technologies, including on-premises and cloud databases. In our case, we tested the following scenarios:

On-premises SQL Server through Power BI Data Gateway
Azure SQL Database

For this type of dataset, we can configure a cache refresh interval, which, as of today, is limited to 15 minutes.


Figure 8: Setting up the refresh schedule of a Direct Query dataset

The reports that we create using these data sources will be automatically refreshed at the specified frequency:


Figure 9: Power BI report created using the Azure SQL Database data with Direct Query

The combination of Real-Time Data Tiles using Streaming Datasets and Reports using Datasets with Direct Query, both on-premises and on Azure, provides the best results for a real-time BI dashboard in Power BI.


Figure 10: Final Power BI dashboard that combines Streaming and Direct Query tiles


7. Scenario Analysis and Conclusions

Considering this experience developing a solution with Azure and Power BI for a Real Time BI platform, we have identified the following benefits, as well as areas for future improvement:

Advantageous features:

Setting up the Azure services for real-time data queueing and processing is very simple.
Scaling out the Azure services when we need more resources is just a matter of a few clicks and is completely transparent for the developer.
The integration between Azure and Power BI is very powerful.
The Real-Time Data Tiles of Power BI Cloud, and especially their automatic refreshing, are something that can differentiate the product from its competitors.
Using Direct Query data sources, we can complement the real-time dashboard with near real-time data.

Areas for improvement:

In a BI environment, events will mainly be generated in transactional systems, so continuous monitoring is required, which involves custom development for each scenario.
The network latency of the Azure service imposes a limit on the number of events that can be sent per second. We can packetize multiple events to increase the throughput, but it can still become a bottleneck.
Real Time Data Tiles in Power BI seem to be at an early stage of development and they are almost non-customizable.
Cache refreshing for Direct Query sources in Power BI is limited to 15 minutes, which might not be enough in some scenarios.
Even though starting the Event Hub and Stream Analytics services is rather fast and easy, it can take up to one hour until the real-time tiles start showing data and being refreshed. However, once the real-time tiles start showing data, they are stable.
The SQL Database service of Azure sometimes becomes unresponsive and requires manual intervention to fix it (restarting, truncating, etc.). The information currently provided by Azure is not clear enough to help identify why the service becomes unresponsive.

Authors: Iñigo Hernáez, Oscar Martinez

Click here if you would like to receive more information about the Real Time BI Services we offer!

Oracle Data Visualization Desktop – April 2016 Release


1. Introduction - Oracle Data Visualization Desktop

With the release of Oracle Business Intelligence Enterprise Edition 12c (OBIEE 12c), Oracle announced a new Data Visualization tool aimed at ad hoc personal data discovery tasks. Oracle is putting a great deal of effort into developing this new tool, which is available as:

* a component of Oracle BI Cloud Service (BICS)
* a standalone cloud service named Data Visualization Cloud Service (DVCS)
* a component of OBIEE 12c on premise
* a standalone desktop tool named Data Visualization Desktop (DVD)


At the end of April 2016, Oracle released the first publicly available version of Oracle Data Visualization Desktop (DVD), version 12.2.1.0.0 (with timestamp 20160422085526). In this blog post we will present the main characteristics of this tool; most aspects are also common to the other three modalities listed above.


2. Data Sources and Data Management

There are 3 types of data sources that can be used in DVD:

  1. Excel sheets, which allow complete offline analysis
  2. Databases like Oracle, SQL Server, MySQL, Teradata, Redshift, MongoDB, Spark, etc.
  3. OBIEE, where the user can connect to an existing Analysis or use a Logical Query

Once a source has been added to a project, DVD offers different options to manage the data:

* Modify data types (string, number, date, etc.)
* Alternate between Attributes and Measures
* Select the Aggregation rule of Measures
* Create Calculations using a wide variety of functions


Multiple sources of different types can be added to a DVD project, joined automatically using fields with matching names; the joins can always be edited using the Source Diagram.



3. Visualizations

One of the most important characteristics of DVD is the high number of visualizations available out-of-the-box. There are 22 data visualizations in total, plus the possibility of including Text Boxes and Images. All the available visualizations are shown in the image below:


The visualizations are very easily created by dragging and dropping Data Elements (data columns in DVD) to the different Drop Targets (that is, the corresponding visual dimensions of the visualization).


The visualizations can be highly customized in DVD. The user can edit the titles and axis, modify the colour schemes, sort the data, change the data format, etc.

In addition, the Map visualization allows you to create custom maps using the GeoJSON format. The underlying map engine is able to render the new maps and join them to the corresponding data element.


Multiple visualizations can be combined in the Canvas, thus allowing the creation of complete dashboards to analyse the data. In addition, through the Data Brushing feature, the data selected in any visualization is highlighted in the others.



4. Data Discovery and Advanced Analytics

As a Data Discovery tool, DVD includes multiple features to facilitate the data analysis process. One simple tool used for data discovery is filters: the user can decide to filter Attributes based on values or Measures based on ranges.


Together with the filters, Reference Lines and Trend Lines are available in DVD straight out of the box. More Advanced Analytics tools are available in combination with R; for this reason, DVD includes an Oracle R Distribution (version 3.1.1) installer that can be executed after DVD is installed. Once R and the required libraries are installed, we can use Clustering, Outlier Detection and Forecasting, as well as custom R scripts.

In the example below we use Clusters to identify how the number of apartments by neighbourhood affects the price. In addition, we have a Reference Line to analyse the average apartment price for different room types. Finally, using Trend Lines, we can see that the relationship between minimum number of nights and price has been increasing over the last few years.


Thanks to the data discovery and advanced analytics capabilities of DVD, we can easily identify hidden information and patterns in our data. To keep track of the data discovery flow, we can use the Story Navigator, which allows different insights to be saved. These insights are very useful when we want to share a project, letting other users understand our findings quickly.



5. Managing Projects

It is very easy to share Projects in DVD. The first thing to do is to save them locally; the different Projects are shown in the Home page. From the Home page we can select the option to export the Project, which will create a DVA (.dva) file. It is possible to store the source data in this file and protect it with a password.


At the other end, we can similarly use the Import option to add the Project to our main folder in the Home page.


6. Oracle Data Visualization Training

We provide a wide range of training services designed to propel course delegates to a project-ready state, ensuring they have both the necessary BI theory and the hands-on practical skills needed to engage in a BI project with confidence.
Here at ClearPeaks we are experts in Oracle Data Visualization, and we can offer you this expertise in our specialized training sessions.

Get in touch with us and see what we can do for you!

Blog Article Author: Iñigo Hernáez

Kerberos authentication and delegation with Tableau and MSSAS


Introduction

 

Kerberos is a three-party authentication protocol developed by the Massachusetts Institute of Technology (MIT). As of Windows 2000, it has been the default authentication method in Microsoft Windows systems. Tableau Server has supported Kerberos since the release of version 8.3.

In this article we will explain how to configure Tableau Server with Kerberos to enable:

- Single Sign-On (SSO) authentication of Active Directory (AD) users.

- Authentication delegation to Microsoft SQL Server Analysis Services (MSSAS).

Kerberos configuration

Kerberos is configured using the “Configure Tableau Server” application. The first step is to enable it in the “Kerberos” tab as shown below:


After enabling Kerberos, you must create the configuration script. In most cases you can simply use the default script generated by Tableau when you click the “Export Kerberos Configuration Script” button of the “Kerberos” tab:

This script has two steps:

  • SPN registration: The Service Principal Name (SPN) registers the host in the Kerberos environment, so that services running on it can authenticate against other services.
  • Keytab generation: The Keytab file contains the shared password provided by Kerberos that is used to authenticate against other services.

The default script registers the SPNs and generates the Keytab for the HTTP service in the host where Tableau Server is installed for the “Run as User” account using the setspn and ktpass commands. For example:


setspn -s HTTP/tableauhost DOMAIN\runasuser

setspn -s HTTP/tableauhost.domain DOMAIN\runasuser

ktpass /princ HTTP/tableauhost.domain@DOMAIN /pass runasuserpass /ptype KRB5_NT_PRINCIPAL /out keytabs\kerberos.keytab


There are some scenarios where the default configuration can be problematic:

  • Scenario 1: Duplicate SPN

In some environments, another service might have already registered an SPN for the HTTP service in the host where Tableau Server is installed. An example could be a web server installed in the same host. If this already-existing SPN has been registered for a user other than the Tableau Server “Run as User” you will get an error, since duplicate SPNs for the same service/host pair are not allowed:


There are two ways to overcome this problem. If the existing SPN is not used anymore, you can delete it using the setspn -D command, and then create the new SPN without any problem. If the old SPN is still being used, it will be mandatory to use the user for which the existing SPN was created as the Tableau Server “Run as User”.


  • Scenario 2: Multiple host aliases

It is common to use an alias that is different from the host name to access Tableau Server. For example, you might connect to http://tableau.domain.com instead of using http://tableauhost. If you are using aliases registered in your Domain Name Server (DNS), you will have to create SPNs for all of them. For example:


setspn -s HTTP/tableau DOMAIN\runasuser

setspn -s HTTP/tableau.domain.com DOMAIN\runasuser


In this case, you will also need to generate a Keytab for each alias. However, Tableau Server only allows one Keytab file. Therefore, you will have to nest all the Keytab files, one by one (for each alias). To do so, you can use the /in option of the ktpass command. For example:


ktpass /princ HTTP/tableauhost.domain@DOMAIN /pass runasuserpass /ptype KRB5_NT_PRINCIPAL /out keytabs\kerberos_1.keytab

ktpass /princ HTTP/tableau.domain.com@DOMAIN.COM /pass runasuserpass /in keytabs\kerberos_1.keytab /ptype KRB5_NT_PRINCIPAL /out keytabs\kerberos.keytab


The intermediate file names are not important, but the last Keytab file must keep the name “kerberos.keytab”.


 

The script will have to be executed from the Command Prompt (CMD) by a user with domain administration rights. After running the script, check that no error messages appeared while registering the SPNs and that the Keytab file was generated.

Finally, import the Keytab file into Tableau Server and test the configuration. If everything is correct, you will receive an OK message after the test.


If you restart Tableau Server now, Active Directory users will be able to use the SSO feature.

Authentication delegation to MSSAS

The use of Kerberos delegation to Microsoft SQL Server Analysis Services (MSSAS) is very useful from a security point of view. If you have security roles defined in your OLAP cubes for Active Directory users, each user will only see the information they are allowed to see, without the need for any extra security in the Tableau workbooks.

In order to enable the delegation to MSSAS, first configure the “Run as User” account to act as part of the operating system. For that purpose, open the Local Security Policy application from “Start -->Administrative Tools --> Local Security Policy”.


Once there, go to “Local Policies -->User Rights Assignment --> Act as part of the operating system (right click) --> Properties”.


Finally, click on “Add User or Group…” and add the “Run as User” account.


After this step, you will have to configure the services that will be delegated. This has to be done by an Active Directory administrator using the Active Directory Administrative Center. Find the “Run as User” account in the list and right click to enter the “Properties” menu. From there go to the “Delegation” tab and select the options “Trust this user for delegation to specific services only” and “Use any authentication method”. Finally, click on “Add…” and select the MSSAS service from the list of available services. The MSSAS services will have names like “MSOLAPSvc.3”.


It is important to keep in mind that, whenever you upload a workbook to Tableau Server, if you want the Kerberos delegation to be used, you will have to configure the authentication options. In this case, select “Viewer credentials” as the authentication method for the MSSAS data source.


