Big Data Ecosystem – Spark and Tableau


In this article we give you the big picture of how Big Data fits into your current BI architecture and how to connect Tableau to Spark, enriching your BI reports and dashboards with data that you were not able to analyse before. Give your reports and dashboards a 360º view, and understand the what, when, why, who, where and how.

After reading this article you will understand what Big Data can offer you, and you will be able to load your own data into HDFS and analyse it with Tableau powered by Apache Spark.

The Big Data ecosystem

When considering a Big Data solution, it is important to keep in mind the architecture of a traditional BI system and how Big Data comes into play.

Until now we have mostly been working with structured data, coming mainly from RDBMSs and loaded into a DWH, ready to be analysed and presented to the end user. Before considering how this architecture changes when Big Data comes into the field, one could ask how exactly Big Data technology benefits the current solution. It allows the system to process much higher volumes of more diverse data, much faster, and to extract information efficiently and reliably (with high fault tolerance) from data that a traditional solution cannot handle.

In addition, a Big Data platform allows the hardware to scale horizontally, which is more economical and flexible.

So, how does Big Data enter this ecosystem? The main architectural concepts remain largely the same, but there are big changes: a whole new set of data sources, particularly unstructured ones, and a completely new environment in which to store and process data.

Big Data - Spark and Tableau

In the picture above, at the top we have our traditional BI architecture. Below it we can see how the new Big Data architecture still preserves the same concepts: Data Acquisition, Data Storage, and so on. We are showing just a few of the Big Data tools available in the Apache Hadoop project.

What is important to point out is that reporting & visualization must combine data from both traditional and Big Data storage to provide a 360º view, which is where the true value resides.

There are different options for combining them. We could run aggregation calculations over HDFS, Cassandra and similar stores to feed the Data Warehouse with information we were unable to compute before, or we could use a reporting & visualization tool capable of combining the traditional Data Warehouse with Big Data storage or engines, as Tableau does.

A Big Data implementation: Apache Spark + Tableau

When approaching a Big Data implementation, there are quite a lot of different options and possibilities available, from new data sources and connectors to the final visualization layer, passing through the cluster and its components for storing and processing data.

A good approach to a Big Data solution is the combination of Apache Spark for processing in Hadoop clusters, consuming data from storage systems such as HDFS, Cassandra, HBase or S3, with Tableau as the visualisation software that makes the information available to the end users.

Spark has demonstrated a great performance improvement over the original Hadoop MapReduce model. It also stands out as a one-component solution for Big Data processing, with support for ETL, interactive queries, advanced analytics and streaming.

The result is a unified engine for Big Data that excels in low-latency applications where fast performance is required, such as iterative processing, interactive querying, large-scale batch computations, streaming and graph computations.

Tableau is growing really quickly and has already proven to be one of the most powerful data discovery and visualisation tools. It has connectors to nearly any data source, such as Excel, corporate Data Warehouses or Spark SQL. But where Tableau really stands out is in transforming data into compelling, interactive dashboards and visualisations through its intuitive user interface.

The combination of Apache Spark with Tableau stands out as a complete end-to-end Big Data solution, relying on Spark's capabilities for processing the data and on Tableau's expertise for visualisation. Integrating Tableau with Apache Spark gives you the chance to visually analyse Big Data in an easy and business-friendly way; no Spark SQL code is needed here.

Connecting Tableau with Apache Spark

Here at ClearPeaks we are convinced that connecting Apache Spark to Tableau is one of the best approaches to processing and visualising Big Data. So, how does this solution work? We are already working with this technology, and are proud to show a demonstration of Tableau connected to Apache Spark.

Prerequisites:

  • Tableau Desktop, any version that supports the SparkSQL connector.
  • Apache Spark installed either on your machine or on an accessible cluster.

Integration

Tableau uses a specific SparkSQL connector, which communicates with the Spark Thrift Server, which in turn uses the Apache Spark engine.

Big Data Spark & Tableau

Software components

Tableau Desktop

Apache Spark Driver for ODBC with SQL Connector

Apache Spark (includes Spark Thrift Server)

Set up the environment

Installing Tableau Desktop and Apache Spark is out of the scope of this article. We assume that you have already installed Tableau Desktop and Apache Spark.

Apache Spark needs to be built with Hive support, i.e.: adding -Phive and -Phive-thriftserver profiles to your build options. More details here.
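
For reference, building Spark from source with these profiles could look like the following (a sketch; the extra -Pyarn profile and the use of the bundled build/mvn wrapper are assumptions that depend on your cluster and Spark version):


./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package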

Install Apache Spark Driver for ODBC with SQL Connector

Install the Apache Spark connector from the Simba webpage. They offer a free trial period which can be used to follow this article.

It has an installation wizard which makes installation a straightforward process.

Configure and start Apache Spark Thrift Server

Configuration files

Spark Thrift Server uses the Hive Metastore by default unless another database is specified. We need to copy the hive-site.xml config file from Hive to the Spark conf folder.


cp /etc/hive/hive-site.xml /usr/lib/spark/conf/


Spark needs access to the Hive libraries in order to connect to the Hive Metastore. If those libraries are not already in the Spark CLASSPATH variable, they need to be added.

Add the following line to /usr/lib/spark/bin/compute-classpath.sh


CLASSPATH="$CLASSPATH:/usr/lib/hive/lib/*"


Start Apache Spark Thrift Server

We can start Spark Thrift Server with the following command:


./sbin/start-thriftserver.sh --master <master-uri>


<master-uri> might be yarn-cluster if you are running YARN, or spark://host:7077 if you are running Spark in standalone mode.
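
For instance, if you run Spark in standalone mode on a host called spark-master (a placeholder name), the command would be:


./sbin/start-thriftserver.sh --master spark://spark-master:7077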

Additionally, you can specify the host and port using the following properties:


./sbin/start-thriftserver.sh \

  --hiveconf hive.server2.thrift.port=<listening-port> \

  --hiveconf hive.server2.thrift.bind.host=<listening-host> \

  --master <master-uri>
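
A complete invocation, assuming YARN and the default Thrift port and bind address (all values below are illustrative), might look like this:


./sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=10000 \
  --hiveconf hive.server2.thrift.bind.host=0.0.0.0 \
  --master yarn-cluster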


To check whether Spark Thrift Server has started successfully, you can look at the Thrift Server log. The <thriftserver-log-file> path is shown in the console output after starting Spark Thrift Server.


tail -f <thriftserver-log-file>


Spark Thrift Server is ready to serve requests as soon as the log file shows the following lines:

INFO AbstractService: Service:ThriftBinaryCLIService is started.

INFO AbstractService: Service:HiveServer2 is started.

INFO HiveThriftServer2: HiveThriftServer2 started

INFO ThriftCLIService: ThriftBinaryCLIService listening on 0.0.0.0/0.0.0.0:10000
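
Before moving on to Tableau, you can optionally sanity-check the server from the command line with beeline, which ships with Spark (the host and port below assume the default configuration shown above):


./bin/beeline -u jdbc:hive2://localhost:10000 -e "SHOW DATABASES;"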

Connect Tableau using SparkSQL connector

Start Tableau and select the option to connect to Spark SQL.

Select the appropriate Type depending on your Spark version and the appropriate Authentication depending on your security setup.

Big Data Spark Tableau

The next steps are selecting the schema, tables and desired relations, just as with any other Tableau connector.

Now you are able to run your own analysis on Big Data powered by Spark!

Spark Tableau Big Data

The dashboard above was created in Tableau 9.0 following the instructions provided. Tableau uses Apache Spark to transparently retrieve and perform calculations over our data stored in HDFS.

Show us a capture of your Spark-powered dashboards and reports, and share your impressions of the Apache Spark and Tableau tandem in the comment section at the bottom.

Happy analytics!

 

Eduard Gil & Pol Oliva

Bonus: Add data to Hive Metastore to consume it in Tableau

If you are not familiar with the process of loading data into the Hive Metastore, you will find this section very useful.

This section describes how to load a CSV file from your file system into the Hive Metastore. After this process you will be able to use it from Tableau following the steps described in this article.

For this example we are going to use the following file, which contains the well-known employees example:

my_employees.csv

123234877,Michael,Rogers,IT
152934485,Anand,Manikutty,IT
222364883,Carol,Smith,Accounting
326587417,Joe,Stevens,Accounting
332154719,Mary-Anne,Foster,IT
332569843,George,ODonnell,Research
546523478,John,Doe,Human Resources
631231482,David,Smith,Research
654873219,Zacary,Efron,Human Resources
745685214,Eric,Goldsmith,Human Resources
845657245,Elizabeth,Doe,IT
845657246,Kumar,Swamy,IT

As we can see, it follows the schema: Employee ID, First Name, Last Name and Department.

We are going to use beeline to connect to the Thrift JDBC server. Beeline is shipped with both Spark and Hive.

Start beeline from the command line

beeline

Connect to Thrift JDBC Server


beeline> !connect jdbc:hive2://localhost:10000


Create the table and specify its schema


beeline> CREATE TABLE employees (employee_id INT, name STRING, last_name STRING, department STRING)  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';


Now you are ready to load your my_employees.csv file into the previously created table


beeline> LOAD DATA LOCAL INPATH '/home/user/my_employees.csv' INTO TABLE employees;


We can even perform operations over the employees table using beeline


beeline> SELECT COUNT(*) FROM employees;
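
For example, a simple aggregation per department can be run through the same beeline session:


beeline> SELECT department, COUNT(*) FROM employees GROUP BY department;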


 

OBIEE 11g Installation in Silent Mode


At ClearPeaks we recently received a request to perform an OBIEE installation on an Oracle Enterprise Linux (OEL) server without Graphical User Interface (GUI).

The Repository Creation Utility (RCU) and the Oracle Universal Installer (OUI) can both be executed without a graphical assistant; it is only necessary to run them in silent mode.

Since a database was already installed, only the RCU and OBIEE silent installation process is described in this post.

1. Schema Creation with RCU

1.1 Prerequisites

Make sure that the database and the listener are running.

1.2  Schema creation

1.2.1  Passwords file creation

As it is a silent installation, the RCU installer will require a text file containing the following passwords (in this order):

  • Database password
  • Component 1 schema password (BIPLATFORM)
  • Component 2 schema password (MDS)

vi rcu_passwords.txt


OBIEE Silent Mode
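
For illustration, the file simply contains one password per line, in the order listed above (the values below are placeholders only):


MyDatabasePassword1
MyBiplatformPassword1
MyMdsPassword1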

Ensure that the file belongs to the oracle user before running the rcu command, as shown below.
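
A minimal way to do this, assuming the standard oracle user and oinstall group, is:


chown oracle:oinstall rcu_passwords.txt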

1.2.2 Execution in silent mode

As in every schema creation through RCU, it is necessary to obtain the software from the Oracle site and extract it. The executable is located in rcuHome/bin/.

Execute the following command to start the installer in silent mode:


./rcu -silent -createRepository -connectString localhost:1521:orcl -dbUser SYS -dbRole SYSDBA -schemaPrefix DEV -component BIPLATFORM -component MDS -f < ../../rcu_passwords.txt


After a while, this should be the result:

OBIEE Silent Mode

2. OBIEE Installation

2.1  Prerequisites

2.1.1   Database and listener

As in the RCU execution, the database and the listener need to be started and working before starting the OUI.

2.1.2  Schemas created through RCU

The BIPLATFORM and MDS schemas must have been created during the RCU execution described above.

2.1.3  Unset the ORACLE_HOME variable

If you have already installed an Oracle database on the same server where you are going to install the OBIEE server, the ORACLE_HOME environment variable must be unset. Bear in mind that the variable remains unset only for the current terminal session.

Execute the following command (as root):


unset ORACLE_HOME


OBIEE Silent Mode

2.1.4  Set Kernel Parameters

The last step is to modify the kernel parameters (as root).

The following lines must be added to the limits.conf file:

  • oracle hard nofile 4096
  • oracle soft nofile 4096

vi /etc/security/limits.conf


OBIEE Silent Mode
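
If you prefer not to edit the file interactively, the same lines can be appended as root (a sketch; check first that equivalent oracle entries do not already exist):


echo "oracle hard nofile 4096" >> /etc/security/limits.conf
echo "oracle soft nofile 4096" >> /etc/security/limits.conf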

2.2  Silent validation

2.2.1  Response file creation

If you don't have a GUI on your server, you can edit the response file we used for this installation:

response_file

It will be necessary to replace the <SECURE_VALUE> entries with your actual passwords.
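
A quick way to locate every placeholder that still needs a value is (the file name is the one referenced above):


grep -n '<SECURE_VALUE>' response_file.rsp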

2.2.2  Silent validation execution

Before installing OBIEE, a silent validation is required. The OUI needs the response file in order to be executed in silent mode.

Ensure that the response file belongs to the oracle user before running the installer.

Execute the following command as the oracle user (the full path of the response file is required):


./runInstaller -silentvalidate -response /home/oracle/Desktop/bi_binaries/obiee_binaries/bishiphome/Disk1/response_file.rsp


You can ignore the following error:

OBIEE Silent Mode

2.3  Silent installation

2.3.1  Location file

If you already have an oraInst.loc file on your system, you can use it:


vi /home/oracle/app/oracle/product/11.2.0/dbhome_1/oraInst.loc


OBIEE Silent Mode

If this file does not exist on the system, the installation program creates it automatically.
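
For reference, an oraInst.loc file typically contains just two entries; the inventory path below is illustrative and should match your own environment:


inventory_loc=/home/oracle/app/oraInventory
inst_group=oinstall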

2.3.2  Silent installation execution

This is the last and most critical step of the installation. Please make sure that all the previous steps have been performed successfully.

Execute the OUI in silent mode (as the oracle user):


./runInstaller -silent -response /home/oracle/Desktop/bi_binaries/obiee_binaries/bishiphome/Disk1/response_file.rsp -invPtrLoc /home/oracle/app/oracle/product/11.2.0/dbhome_1/oraInst.loc


This step will take several minutes. If the previous steps have been performed correctly, the installation should end successfully.

Conclusion

This post outlines how to fully install OBIEE 11g on a Linux server without a GUI.

Advantages of the silent mode installation include:

  • No need to consume extra resources with a graphical user interface.
  • The whole installation can be executed automatically by a script.
  • The possibility of performing identical installations if the response files don't change.
  • No need to spend extra time clicking through the graphical wizard manually.

For more information, consult the official Oracle documentation.

Tableau 9.0 – another game changing release for Data Discovery and Visualisation!



We are very excited to inform you of the recent (April 7) release of Tableau 9.0.

Our in-house team of Tableau experts has been beta testing the product for some time now and has highlighted some key new features which will undoubtedly delight our user community, namely:

 

  • A redesigned start experience that gives you quick access to resources and inspiring examples
  • New data prep tools designed to reduce the amount of time you spend wrestling with cross-tab style spreadsheets that have messy headers and footers
  • Smart maps with geographic search and lasso or radial select.
  • Story Points formatting options that let you customize the fonts and colors of your data stories
  • Powerful new analytics features that enable you to ask and answer questions of your data without leaving your workflow
  • Instant calculation editing and LOD calculations
  • Significant performance enhancements and much more

We would be more than happy to help you in discovering and visualising your data with Tableau 9.0 – simply get in touch!

 

Data Discovery and Analysis: making the best of Oracle Endeca Information Discovery


With the acquisition of Endeca in 2011, Oracle enhanced their already powerful Business Analytics products portfolio with Information Discovery capabilities, potentially allowing customers to analyse structured and unstructured data within the same framework.

Version 3.1 of the tool, released in November 2013, introduced new features such as Data Mash-up from multiple sources, easier and deeper unstructured analysis tools available directly to the business users, a tighter integration with the Oracle BI platform, Enterprise Class Data Discovery and the Web Acquisition Toolkit, a crawler to harvest and analyse website content.

Where, therefore, is Endeca positioned now within the context of Business Intelligence, and how should it be used to make the best of its capabilities? Can it be used as an alternative to OBIEE or other traditional, established Business Intelligence tools? How does its web crawling tool fare against the existing competition? To answer these questions and more, we have put Endeca on the test bench and seen it in action.

In today’s business landscape, analysis of unstructured and large volume data (Big Data) is morphing from a nice-to-have task for cutting-edge, BI-savvy companies to an important driver for business processes in enterprises everywhere. Customer data stored and displayed in social media like Twitter, Facebook and LinkedIn are virtual gold for any marketing department, while sensors capture millions of snapshots on a daily basis that can be used to monitor manufacturing processes and improve their performance. It is not difficult, therefore, to see why Oracle considers Endeca a strategic investment and a key component of its Business Analysis product stack.

In the following paragraphs of this article you will find our honest, no-frills opinion on Endeca Information Discovery 3.1, its new features and our suggestions for making the best of it within your enterprise BI stack.

Integration with Oracle Business Intelligence

One of the most recurring complaints from early Endeca adopters was the lack of integration with the Oracle Common EIM (Enterprise Information Model). As often happens with recent acquisitions, the first Oracle-branded versions of Endeca – starting with Version 2.3 in April 2012 - were mostly a revamp of the existing Latitude tool. Endeca felt like, and actually was, a stand-alone data discovery tool with its own data processing engine and front-end studio.

This has radically improved with Version 3.1. Oracle BI is now a native data source for Endeca, and users can create their Discovery Applications sourcing data from OBI in an easy two-step process. Moreover, the Integration Knowledge Module for Oracle Data Integrator now enables the latter ETL tool to load data directly into the Endeca Server.


There is still room for improvement, of course. Administration tasks are still performed separately from the rest of the Oracle EIM architecture, and Endeca Server does not interface with WebLogic and Enterprise Manager, the core of the Oracle Middleware. We would also like to see CloverETL better integrated and possibly merged with ODI, to avoid splitting the overall data workflow and transformation logic across two separate tools. We see a lot of potential in using Endeca Server as a data source for the OBIEE repository, a capability that is currently limited to BI Publisher.

We like, however, the concept of the e-BS Extensions for Endeca. Based on pre-defined views in Oracle e-Business Suite, the Extensions consist of a set of Studio applications with pre-built content for a broad range of horizontal functions, from Supply Chain Management (Discrete and Process Manufacturing, Cost Management, Warehouse Management, Inventory, …) to Human Capital Management, Asset Management and more. The good level of integration within e-BS makes them a lightweight, easy-to-implement alternative to the Oracle BI Analytic Applications module. As with its bigger brother, however, the customization effort required to use the pre-built dashboard content successfully remains a question mark.

Self Service Analysis (Data Mash-up, Provisioning, Applications)

These are the topics that most excited our team when testing the new Endeca capabilities. The range of sources and databases available for data mash-up has been broadened, covering databases as well as semi-structured data in JSON format, and the Applications' look and feel has been improved with new visualization options; but in our opinion the most compelling feature of Endeca is the new Provisioning and Application creation process.

The workflow to create a new Discovery Application is now based on a wizard so user-friendly that we believe the classic buzz-phrase “Business users creating their own applications! No more IT overhead!” is no longer a chimaera but a serious possibility. Yes, establishing the Provisioning Service, connecting to the data source (JDBC, for example) and configuring it might require some hand-holding, but once that is done, the wizard simplifies and streamlines the Application creation tasks, allowing business users to perform their data discovery autonomously.

It is also a fact that Endeca Applications look good - definitely good, actually better than OBIEE dashboards - and we can see why business users are usually more impressed by the former than by the latter during product demos.


Web Acquisition Toolkit

“A tool within a tool within a tool” is how our testing team has defined the new web crawling tool embedded in Endeca.

The toolkit looks and feels separate from the Endeca Server (it actually is) and features its own Design Studio where crawling rules and workflows can be defined, organized and scheduled, adding a third layer of data processing complexity: from Design Studio to CloverETL to ODI. In fact, the Web Acquisition Toolkit does not use Endeca Server as a target, so a third-party ETL tool is necessary to move the data accordingly.

However, even if right now there are cheaper and more powerful options on the market, the tool does its job and – if Oracle continues investing in product integration, which we think is very likely – has the potential to become a very interesting feature of future Endeca versions.


Best fit for Endeca?

Wrapping up, we can safely say that Endeca is evolving into a compelling component of the Business Intelligence stack for enterprises looking to enable their users to perform rapid-fire data discovery (up to a certain extent, of course – data management, especially in complex enterprise environments, will still be required).

The stand-alone nature of the Endeca architecture is a weakness but also a strength, allowing Endeca to be purchased and installed independently of the rest of the Oracle BI stack. However, we can see how the e-BS Extensions make Endeca extremely appealing to existing Oracle ERP users.

Could Endeca, therefore, be considered as an alternative to OBIEE (and Oracle BI Applications) as the enterprise Business Intelligence tool? We do not think so. Although its Applications visualization capabilities are very powerful, the best fit for Endeca is to complement OBIEE. While the solid back-end (repository metadata layers, reports and dashboards catalog) of the latter provides corporate reporting in a structured and organised way, Endeca’s real power lies in enabling the business user to individually analyze data patterns on the fly: mix and match different data sources and quickly create new applications to find out the answers they need.

To enable all of the above, self-service provisioning is where the strength of Endeca shows. Web sources and unstructured information such as flat files can be mashed together, and setting up and configuring another provisioning service to mix in with the rest is a very easy task.

We at ClearPeaks will keep on the lookout for future enhancements and features of Oracle Endeca Information Discovery. If in the meantime you want to know more about Endeca and how it could add value to your enterprise, contact us.

 

Big Data Strategy


As the dust begins to settle, the hype around Big Data is slowly turning into more realistic thinking about how to actually make it work. At this point almost everyone has heard of Big Data and has a clear understanding of what it is and what the benefits of putting a Big Data strategy in place could be. However, one of the main adoption barriers is still the technical complexity of envisioning and deploying a Big Data solution.

These kinds of projects are usually driven by IT and the CIO in the early stages. The key at this point is to identify a use case that proves the investment in a Big Data project is worthwhile; this use case has to clearly demonstrate that, thanks to the new data analysis strategy, unprecedented insights can be unleashed, enabling game-changing decisions for the company.

For many companies this is just the easy part: many CIOs were already conscious of the huge value of the data they produced, and yet they were not able to bring that data into their BI systems due to its size, speed, lack of structure, etc. Now Big Data seems to make that possible, but the question that remains is: how are we going to do it? Continue reading this post >

