Big Data in Tableau: Hadoop Connection in Tableau Desktop

In this article I want to explore the Cloudera Hadoop connection that Tableau Desktop v8 offers.

Tableau’s recent strategy revolves around big data, and the company is doing its best to become the reference visual analytics tool that customers use to see and understand their big data. Along these lines, Tableau can connect to more than 25 types of data sources and provides additional ways to move data into Tableau through the new APIs offered in version 8.

And what is Hadoop? Hadoop presents itself as the new solution for unstructured data. Nowadays companies generate and store more data than ever before. Relational databases and data warehouses are well suited to structured data. However, Hadoop was designed to solve a different problem: fast and reliable analysis of both structured and complex (unstructured) data.

Hadoop was inspired by papers published by Google describing the difficulty of handling huge volumes of data, and it has become the reference system for storing, processing and analyzing thousands of terabytes or even petabytes of data.

Technically, Apache Hadoop has two main components: HDFS (Hadoop Distributed File System) and MapReduce (which processes large data sets with a parallel, distributed algorithm on a cluster). However, many other products complement the Hadoop ecosystem. The most popular are Hive, HBase, Apache Flume, Pig, and Cloudera’s best contribution to Hadoop, Impala.
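To make the MapReduce idea more concrete, here is a minimal single-process sketch in Python of the classic word count. It only illustrates the map, shuffle and reduce steps; it is not how Hadoop itself runs jobs, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical names chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key (done by the framework between phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data in Tableau", "big data in Hadoop"]
    print(word_count(sample))
```

On a real cluster the map and reduce steps run in parallel on many nodes, each close to the HDFS blocks it reads, which is what lets the same pattern scale to very large data sets.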

Cloudera Hadoop and Tableau

Cloudera is an active contributor to the Hadoop project. It provides a commercial Hadoop distribution that is 100% open source and ready to use, called CDH (Cloudera’s Distribution Including Apache Hadoop). Cloudera has also launched Impala, a new product in the Hadoop ecosystem, as part of the CDH4 package.

The following picture summarizes all the CDH4 components:

Impala, like Hive, makes it possible to run queries natively in Apache Hadoop. It allows the user to query data stored in HDFS with SQL, without moving the data or any additional processing.

So where does Tableau fit in the open source universe? Following its strategy of getting closer to big data, Tableau has become a Cloudera partner and provides a native connector for Cloudera Hadoop. The main idea is to extend the usability of Apache Hadoop through Tableau.

This is the architecture behind the connection:

How does it work inside Tableau? Let’s look at how the connection works and what it looks like in Tableau Desktop. In the list of connections there is one called Cloudera Hadoop.

As I mentioned before, Cloudera Hadoop is composed of a group of applications and services that you need to install on the server. Therefore, Tableau includes different options for connecting to the database in order to avoid conversions: Hive, Impala, Beeswax, and "Beeswax and Kerberos".

Once all the information has been entered in the window shown above, it is time to see your data in the Desktop interface. This is how a table loaded through a Hive connection looks.

As you can see, the layout of the dimensions and measures is much the same as with other kinds of connections. It’s important to note that each type of connection exposes special functions in the Calculated Field window, and the Cloudera Hadoop connections are no exception. Below you can see some of the special functions available when connecting to Hadoop.

Let’s start with the XPATH functions. Because Hive or Impala tables can be linked to a collection of XML files or document fragments stored in the Hadoop file system, Hadoop is quite flexible when it comes to analyzing XML content. Tableau provides a number of functions for processing XML data, which allow you to extract content, create calculated fields and filter XML data. In the list above we can see some of these XML functions.
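To give an idea of what this kind of XPath extraction does to a single XML fragment, here is a small Python sketch using the standard library. It is only an illustration of the concept, not Tableau’s or Hive’s implementation: the helper names xpath_string and xpath_float and the sample XML are assumptions made for this example, and ElementTree supports only a subset of XPath, with paths evaluated relative to the root element.

```python
import xml.etree.ElementTree as ET


def xpath_string(xml_text, path):
    """Return the text of the first node matching path, or None if no match."""
    root = ET.fromstring(xml_text)
    return root.findtext(path)


def xpath_float(xml_text, path):
    """Return the first matching node's text as a float, or None if absent/empty."""
    text = xpath_string(xml_text, path)
    return float(text) if text else None


if __name__ == "__main__":
    # One XML fragment, as it might be stored in a single column of a table.
    row = "<order><customer>ACME</customer><total>19.90</total></order>"
    print(xpath_string(row, "customer"))  # the customer name as text
    print(xpath_float(row, "total"))      # the order total as a number
```

In Tableau the equivalent step is written as a calculated field that takes the XML column and an XPath expression, so the extracted value can then be used as a dimension, a measure or a filter.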

In addition to the XPATH operators, the Hadoop connection offers several ways to work with common web and text data, for instance JSON fields. The GET_JSON_OBJECT function retrieves elements from a JSON string based on a JSON path.
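The idea of a JSON path lookup can be sketched in a few lines of Python. This is a simplified stand-in written for this article, not the actual GET_JSON_OBJECT implementation: it assumes a path that starts with $ and uses only .key and [n] steps, and the sample document is made up.

```python
import json
import re

# One path step: either ".key" or "[index]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def get_json_object(json_text, path):
    """Follow a simple JSON path such as '$.user.tags[1]'; return None on failure."""
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(json_text)
    except ValueError:
        return None  # the input is not valid JSON
    pos = 1
    while pos < len(path):
        match = _STEP.match(path, pos)
        if match is None:
            return None  # path syntax not supported by this sketch
        key, index = match.groups()
        if key is not None:
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        else:
            i = int(index)
            if not isinstance(node, list) or i >= len(node):
                return None
            node = node[i]
        pos = match.end()
    return node


if __name__ == "__main__":
    doc = '{"user": {"name": "Ana", "tags": ["bi", "hadoop"]}}'
    print(get_json_object(doc, "$.user.name"))
    print(get_json_object(doc, "$.user.tags[1]"))
```

A missing key or an out-of-range index simply yields no value in this sketch, which is convenient for semi-structured data where not every record has every field.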

As you can see, although Tableau can connect to unstructured and, a priori, less intuitive data stores such as Hadoop, the interface keeps the rich display and user-friendly environment that make us like Tableau so much. Furthermore, it implements functions that are specific to this type of connection in order to preserve functionality. So now you know you can bring your Hadoop data to life with Tableau!

More info at:

http://blog.cloudera.com/blog/2013/05/cloudera-impala-and-partners-tableau/

http://www.cloudera.com/content/cloudera/en/solutions/partner/Tableau.html

Jessikha G
jessikha.garcia@clearpeaks.com