Hive vs. Impala with Tableau


As I explained in a previous post, Cloudera is an active contributor to the Hadoop project, and within this ecosystem it has launched Impala as part of the CDH4 package.

Impala makes it possible to run native queries in Apache Hadoop. It lets users query the data stored in HDFS with live SQL, with no data extraction or additional transformation.

But for those of you who already know Hive, this is nothing new, right? So what is the difference? On its official website, Cloudera presents Impala as the real solution for real-time data analysis, since it can be implemented quickly and queried in SQL through Business Intelligence tools. In fact, Impala is not a complete substitute for Hive (Impala does not cover batch processing and ETL, which Hive offers), but it is the option with shorter execution times for SQL queries and better integration with the leading BI tools.

I decided to run a test comparing Hive and Impala in the same environment, and for that I used Tableau. Remember that version 8 of Tableau can use either connection, or both. In what follows, I analyze the same data source and launch the same SQL query with the same BI tool.

Cloudera's website offers a VM for download with everything you need for a quick start with Hadoop (https://ccp.cloudera.com/display/SUPPORT/Cloudera+QuickStart+VM). This VM has Hive and Impala installed, so it is perfect for the comparison I want to run.

Inside the VM, Cloudera Manager (the CDH4 app that shows all active services) initially looks like this:

Among the installed services we can find the two main Hadoop members: MapReduce (mapreduce1) and HDFS (hdfs1). We also see that hive1 and impala1 are already started (the Impala service has to be started manually, as it is off by default). The HUE application (a query editor for Hive, Pig, and Impala that includes a file explorer for HDFS) offers two samples for download, with data on workers and wages. We will use these two tables, sample_07 and sample_08, as data sources for the comparison.

This virtual machine saves plenty of time compared with installing the components one by one. With all the necessary services running, the next step is to set up the connection with Tableau.

Cloudera Hadoop connection in Tableau

Before starting, it is necessary to install the Cloudera Hadoop driver for Tableau. This is the link to download the driver:

https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Tableau+Download

1. Once downloaded, follow the instructions to set up the driver.

2. Once installed, open Tableau and create one Cloudera Hadoop connection for each database.

3. I start with the Hive connection. The port for this connection is 10000. Select the default schema and the single table sample_08.

4. As I mentioned before, I will create the same report for both connections in order to compare the execution time of the same query. The report is the following one:

5. Then we set up a second connection, this time with Impala. The port is 21000.

6. I build the same report shown above and compare the execution time of the queries:

As we can see, the query is the same, but the execution time is very different: with Impala it is almost 1 second, and with the Hive connection it is more than 1 minute! That is a huge performance difference.
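If you want to repeat this kind of comparison outside Tableau, a small timing harness is enough. What follows is only a sketch under my own assumptions: the names time_query and speedup are hypothetical, the query string is illustrative (it is not the statement Tableau generated), and the two runners merely sleep instead of talking to a real cluster. The only values taken from the steps above are the two ports.

```python
import time
from typing import Callable, Dict

# Ports quoted in the steps above for each connection type.
ENDPOINTS: Dict[str, int] = {"hive": 10000, "impala": 21000}


def time_query(run_query: Callable[[str], object], sql: str, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(sql)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the second engine answered."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    # Stand-in runners that only sleep; replace them with functions that
    # execute the statement through real Hive and Impala connections.
    def fake_hive(sql: str) -> None:
        time.sleep(0.05)

    def fake_impala(sql: str) -> None:
        time.sleep(0.005)

    sql = "SELECT COUNT(*) FROM sample_08"  # illustrative query only
    hive_s = time_query(fake_hive, sql)
    impala_s = time_query(fake_impala, sql)
    print(f"hive={hive_s:.3f}s impala={impala_s:.3f}s "
          f"speedup={speedup(hive_s, impala_s):.1f}x")
```

With the rounded figures from my test, speedup(60.0, 1.0) gives 60.0, that is, Impala answering roughly sixty times faster. In a real run, each runner would typically wrap the execute call of a database cursor opened against the corresponding port.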

We have therefore verified that, on equal ground, Impala is the better option in terms of performance. Impala bypasses the MapReduce layer in Hadoop, which results in much faster query response times than Hive. It is safe to say that most customers wanting to do ad-hoc visual analytics on Hadoop will turn to a technology like Impala.

Big Data in Tableau: Hadoop Connection in Tableau Desktop


Hello Folks! In this article I want to explore the Cloudera Hadoop connection that Tableau Desktop v8 offers.

Tableau’s strategy lately revolves around big data, and the company is doing its best to become the reference visual analytics tool that customers use to see and understand their big data. Along this line, Tableau can connect to more than 25 types of data sources, and version 8 adds new APIs that provide additional ways to move data into Tableau.

And what is Hadoop? Hadoop presents itself as the new solution for unstructured data. Companies now generate and store far more data than ever before. Relational databases and data warehouses are really helpful for structured data. However, Hadoop was designed to solve another problem: fast and reliable analysis of both structured and complex (unstructured) data.

Hadoop was inspired by papers published by Google describing the difficulties of handling the flood of data, and it has become the reference system for storing, processing and analyzing thousands of terabytes or even petabytes of data.

Technically, Apache Hadoop has two main components: HDFS (Hadoop Distributed File System) and MapReduce (which processes large data sets with a parallel, distributed algorithm on a cluster). However, many other products complement the Hadoop ecosystem. The most popular are Hive, HBase, Apache Flume, Pig and, as Cloudera’s best contribution to Hadoop, Impala.
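To give an idea of what "a parallel, distributed algorithm" means in MapReduce, here is a toy word count written in plain Python. It is only a single-machine sketch of the map, shuffle and reduce steps, not Hadoop code; all function names are my own and purely illustrative. On a real cluster the map and reduce calls would run on many nodes at once and the framework would take care of the shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as happens between the two phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all the values collected for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Each line can be mapped independently and each key can be reduced independently, which is what lets the work be spread across a cluster.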

Cloudera Hadoop and Tableau

Cloudera is an active contributor to the Hadoop project. It provides a commercial Hadoop distribution that is 100% open source and ready to use, called CDH (Cloudera’s Distribution Including Apache Hadoop). As a new product in the Hadoop ecosystem, Cloudera has launched Impala in the CDH4 package.

Here is a picture that summarizes all the CDH4 components:

Impala, like Hive, makes it possible to run native queries in Apache Hadoop. It lets the user query the data stored in HDFS with SQL, without moving the data or any other additional processing.

Nevertheless, where is Tableau in the open source universe? Tableau, following its strategy of getting closer to Big Data, has become a Cloudera partner and provides a native connector for Cloudera Hadoop. The main idea is to extend the usability of Apache Hadoop through Tableau.

This is the architecture inside the connection:

How does it work inside Tableau? Let’s look at how this connection works and what it looks like in Tableau Desktop. In the list of connections we can see one called Cloudera Hadoop.

As I mentioned before, Cloudera Hadoop is composed of a group of applications and services that you need to install on the server. Therefore, to avoid conversions, Tableau includes different options for connecting to the database: Hive, Impala, Beeswax, and "Beeswax and Kerberos".

Once all the information has been entered in the window shown above, it is time to see your data in the Desktop interface. This is how a table loaded from a Hive connection looks.

As you can see, the layout of dimensions and measures is much the same as with other kinds of connections. It’s important to note that each type of connection exposes special functions in the Calculated Field window, and the Cloudera Hadoop connections are no exception. Below you can see some of the special functions available when connecting to Hadoop.

Let’s start with the XPATH functions. Because Hive or Impala tables can be linked to a collection of XML files or document fragments stored in the Hadoop file system, Hadoop is very flexible when analyzing XML content. Tableau provides a number of functions for processing XML data, which allow you to extract content, create calculated fields and filter XML data. In the list above we can see some of these XML functions.
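To illustrate what extracting content with a path expression means, here is a small Python sketch using the standard library's xml.etree.ElementTree. It is not how Tableau or Hive implement their XPATH functions; the helper names first_text and all_texts and the sample document are my own. Note also that ElementTree supports only a subset of XPath and resolves paths relative to the root element.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def first_text(xml_text: str, path: str) -> Optional[str]:
    """Text of the first element matching `path`, or None if nothing matches."""
    root = ET.fromstring(xml_text)
    return root.findtext(path)


def all_texts(xml_text: str, path: str) -> List[str]:
    """Text of every element matching `path`, in document order."""
    root = ET.fromstring(xml_text)
    return [element.text or "" for element in root.findall(path)]


if __name__ == "__main__":
    # Hypothetical XML fragment, as it might be stored in one column of a table.
    doc = (
        "<worker><name>Ann</name>"
        "<skills><skill>sql</skill><skill>etl</skill></skills></worker>"
    )
    print(first_text(doc, "name"))
    print(all_texts(doc, "skills/skill"))
```

A calculated field built on an XPATH function plays the same role as first_text here: it turns a fragment of XML stored in a column into a plain value you can filter or aggregate.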

In addition to the XPATH operators, the Hadoop connection offers several ways to work with common web and text data, for instance JSON fields. The GET_JSON_OBJECT function retrieves elements from a JSON string based on a JSON path.
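The idea behind a JSON path lookup can be sketched in a few lines of Python. This is only an approximation for illustration: the function name get_json_value is hypothetical, it understands nothing beyond dotted keys such as $.owner.name (no array indexes or wildcards), and it returns a Python value rather than whatever the real function returns.

```python
import json
from typing import Any, Optional


def get_json_value(json_text: str, path: str) -> Optional[Any]:
    """Follow a dotted path such as "$.owner.name" through a JSON document.

    Returns None when the text is not valid JSON, the path does not start
    with "$", or any key along the path is missing.
    """
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(json_text)
    except json.JSONDecodeError:
        return None
    # Drop the leading "$" and walk one key at a time.
    for key in filter(None, path[1:].split(".")):
        if isinstance(node, dict) and key in node:
            node = node[key]
        else:
            return None
    return node


if __name__ == "__main__":
    # Hypothetical JSON value stored in a text column.
    doc = '{"owner": {"name": "Ann", "age": 30}}'
    print(get_json_value(doc, "$.owner.name"))
```

In Tableau the equivalent call would go in a calculated field, with the JSON column as the first argument and the path as the second.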

As you can see, although Tableau can connect to unstructured and a priori less intuitive data stores such as Hadoop, the interface and the way it works inside Tableau lose none of the rich display and user-friendly environment that make us like Tableau so much. Furthermore, it implements functions specific to this type of connection in order to preserve functionality. So, now you know you can bring your Hadoop data to life with Tableau!

More info at:

http://blog.cloudera.com/blog/2013/05/cloudera-impala-and-partners-tableau/

http://www.cloudera.com/content/cloudera/en/solutions/partner/Tableau.html

Copyright © 2000-2010 ClearPeaks