26 Feb Big Data Strategy
As the dust begins to settle, the hype around Big Data is slowly giving way to more realistic thinking about how to actually make it work. At this point, almost everyone has heard of Big Data and has a clear understanding of what it is and what the benefits of putting a Big Data strategy in place could be. However, one of the main adoption barriers is still the technical complexity of envisioning and deploying a Big Data solution.
These kinds of projects are usually driven by IT and the CIO in the early stages. The key at this point is to identify a use case that proves the investment in a Big Data project is worthwhile; this use case has to clearly demonstrate that the new data analysis strategy can unleash unprecedented insights, enabling game-changing decisions for the company.
For many companies this is the easy part: many CIOs were already aware of the huge value of the data they produced, and yet they were not able to bring that data into their BI systems due to its size, speed, lack of structure, and so on. Now Big Data seems to make that possible, but the question that remains is: how are we going to do it?
The Big Data ecosystem can be a bit overwhelming when you first try to put all the pieces together; there are just too many products, vendors and technologies. So, where do you start?
A simplified view of a Big Data strategy would depict the following stages:
1. Acquire and ingest information from federated sources, both structured (such as relational databases) and unstructured (log files, sensor information, social media streams…).
2. Organize this information and store it in a distributed file system for integration. After that it is processed by MapReduce programs that extract the value out of all that data. The results are then moved to a more traditional store, like a data warehouse.
3. Analyze that information in the data warehouse using a BI tool that allows users to formulate questions that leverage all the original data.
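The map and reduce steps in stage two can be sketched in a few lines of Python, in the style of a Hadoop Streaming job. This is a minimal sketch, not any specific product's API: the log format, the position of the page field and the sample lines are all hypothetical.

```python
from itertools import groupby

def mapper(lines):
    """Emit (page, 1) for every raw web-server log line.
    Hypothetical format: 'timestamp page user-agent', whitespace-separated."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[1], 1

def reducer(pairs):
    """Sum the counts per page; the framework delivers pairs sorted by key,
    which is what groupby relies on."""
    for page, group in groupby(pairs, key=lambda kv: kv[0]):
        yield page, sum(count for _, count in group)

if __name__ == "__main__":
    # Local stand-in for the shuffle/sort phase between map and reduce.
    logs = [
        "2012-02-26T10:00 /home Mozilla",
        "2012-02-26T10:01 /products Mozilla",
        "2012-02-26T10:02 /home Chrome",
    ]
    for page, total in reducer(sorted(mapper(logs))):
        print(page, total)
```

In a real cluster the mapper and reducer run on many nodes in parallel and Hadoop handles the sorting and data movement; the logic the developer writes, however, is essentially this.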
Stage three is nothing more than traditional BI so far. Stages one and two, on the other hand, present new and interesting challenges. The complex MapReduce programs must be written by expert developers with advanced skills in a modern programming language, who should also have strong analytical knowledge of the data. The Hadoop cluster and NoSQL databases will also require your DBAs to acquire a new set of skills in order to administer the new environments. And don’t forget to save some time to code the connectors that move data between components. Well, no one said this Big Data business was an easy thing…
It’s undeniable that implementing a Big Data solution is a challenging process that will require new expertise at most companies. However, many vendors are trying to minimize the trouble by offering packaged sets of preinstalled, integrated and optimized software on custom hardware appliances that cater to most of the requirements stated above.
Oracle’s solution for Big Data offers a broad set of software and hardware components. Companies that are already working with Oracle’s stack of products or looking to implement a new Big Data architecture from scratch should definitely take a look at the following products.
Oracle Big Data Appliance
Oracle Big Data Appliance comes in a full rack configuration with 18 Sun servers for a total storage capacity of 648TB. Every server in the rack has 2 CPUs, each with 6 cores for a total of 216 cores per full rack. Each server has 48GB memory for a total of 864GB of memory per full rack.
This powerful appliance is loaded with a combination of open source and specialized software developed by Oracle. This includes:
Cloudera Hadoop Distribution
The Big Data Appliance contains Cloudera’s Hadoop distribution and Cloudera Manager. This is probably one of the biggest assets of this solution, as Cloudera is in charge of one of the most successful Hadoop distributions out there.
Cloudera Manager also provides a single, central console where you can change the configuration and run diagnostics on the whole Hadoop cluster. A wealth of performance reporting is available here as well.
Oracle NoSQL database
Oracle NoSQL Database is a distributed, highly scalable, key-value database based on Oracle Berkeley DB. The product is available in both an open source community edition and a priced enterprise edition for large distributed data centers. The former is installed as part of the Big Data Appliance integrated software.
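To see what the key-value model means in practice, here is a toy sketch in Python. An in-memory dict stands in for the distributed store; this is not the Oracle NoSQL Database API, just the access pattern such a store exposes: opaque values read and written by key, with no joins or ad-hoc queries.

```python
class KeyValueStore:
    """Toy stand-in for a distributed key-value store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # In a real store this write would be routed to a partition
        # (shard) chosen by hashing the key.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
# Keys are typically structured strings; this naming scheme is made up.
store.put("user:1001:last_page", "/products")
print(store.get("user:1001:last_page"))
```

The trade-off is the usual one: you give up SQL-style querying in exchange for horizontal scalability and predictable latency on simple lookups.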
Oracle Data connectors
While Oracle Big Data Appliance makes it easy for organizations to acquire and organize new types of data, Oracle Big Data Connectors enable an integrated data set for analyzing all of it.
Oracle Big Data Connectors can be installed on Oracle Big Data Appliance or on a generic Hadoop cluster.
At the moment, four connectors are available:
- Oracle Loader for Hadoop
- Oracle Data Integrator Application Adapter for Hadoop
- Oracle R Connector for Hadoop
- Oracle SQL Connector for HDFS
One very interesting option is the new Oracle Data Integrator Application Adapter for Hadoop. This connector offers developers a familiar GUI in which to develop and orchestrate MapReduce jobs. The integration with HDFS is done using Hive (part of Cloudera’s Hadoop distribution), which lets us write data transformations in a SQL-like language called HiveQL.
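To get a feel for why this matters, compare a HiveQL-style aggregation with the hand-written logic it replaces. The query below is a hypothetical example (the `web_logs` table and its columns are made up); Hive compiles a declarative GROUP BY like this into MapReduce jobs, so the developer never has to write them by hand.

```python
from collections import Counter

# Hypothetical HiveQL: Hive would compile this into MapReduce jobs.
HIVEQL = """
SELECT page, COUNT(*) AS hits
FROM web_logs
GROUP BY page
"""

def hits_per_page(rows):
    """Pure-Python equivalent of the one-line GROUP BY above.
    rows: iterable of dicts standing in for the web_logs table."""
    return Counter(row["page"] for row in rows)

rows = [{"page": "/home"}, {"page": "/home"}, {"page": "/products"}]
print(hits_per_page(rows))
```

One declarative statement versus custom map and reduce code is precisely the productivity argument for putting Hive between the developer and HDFS.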
Oracle’s version of the widely used R statistical environment enables statisticians to use R on very large data sets without any modification to the end-user experience. Examples of R usage include predicting airline delays at a particular airport and the submission of clinical trial analyses and results.
So clearly, the Big Data Appliance is ready to cover the acquisition and organization of your Big Data. When it comes to the analysis phase, one of the most exciting additions is the Endeca software, which Oracle has rebranded as Oracle Endeca Information Discovery. This is an enterprise data discovery platform for rapid, intuitive exploration and analysis of information from any combination of structured and unstructured sources.
Right now, the integration of Endeca with OBIEE is still at a very early stage, but I wouldn’t be surprised to see a tighter integration of these products in the future, offering a unified layer of analysis and reporting.
As you can see, the available components cover pretty much all the requirements of a Big Data strategy. In my opinion, the most interesting part is the Big Data Appliance itself, because with a single acquisition you cover all the storage and networking requirements for your HDFS file system, which you can then manage and control thanks to the excellent Cloudera software.
Once you have gathered and organized your Big Data sources, you can move the data with ease to your data warehouse thanks to the Oracle data connectors for Hadoop. And if you are lucky enough to be running your data warehouse on Exadata, you can leverage the 40Gb/s InfiniBand connection available across all the Exa systems.
The analysis part is probably the most important, as this is where all the work is put to the test by the users and their questions. Again we are presented with several appealing options. We could go for traditional data-driven BI analysis using Oracle Business Intelligence, great for a unified view of enterprise data and ad-hoc reporting. You could even run OBIEE on Exalytics to leverage the TimesTen in-memory database and offer your users a unique experience in terms of speed.
But you probably didn’t put a Big Data solution in place just to have “traditional” reporting. Fear not: Endeca Information Discovery will give you one of the most advanced in-memory analytic engines on the market, with faceted search across both structured and unstructured information.
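Faceted search is easy to sketch even if the real engine is far more sophisticated. In this toy Python example (the records, fields and facet names are all made up), a free-text query filters the documents and the engine then counts how many matches carry each value of the chosen structured attributes, so users can drill down step by step.

```python
from collections import defaultdict

def facet_counts(docs, facets, query=None):
    """For each facet field, count how many matching documents carry each
    value. `query` is an optional substring matched against the unstructured
    'text' field; facets are structured fields on the same records."""
    matches = [d for d in docs if query is None or query in d.get("text", "")]
    counts = {f: defaultdict(int) for f in facets}
    for doc in matches:
        for f in facets:
            if f in doc:
                counts[f][doc[f]] += 1
    return {f: dict(v) for f, v in counts.items()}

docs = [
    {"text": "battery drains fast", "product": "phone", "region": "EU"},
    {"text": "battery life is great", "product": "laptop", "region": "US"},
    {"text": "screen flickers", "product": "phone", "region": "US"},
]
print(facet_counts(docs, ["product", "region"], query="battery"))
```

Each facet value the user clicks simply becomes another filter, and the counts are recomputed: that combination of keyword search over unstructured text with drill-down over structured attributes is the essence of data discovery.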
And the options don’t end there. Oracle Real-Time Decisions (RTD) or Oracle R can be of great use when applied to the new information made available by the Big Data strategy.
So Big Data is finally here. If your company has a clear use case for it, go for it. Trying to put the whole thing in place by yourself is possible but painful, and can put the whole endeavor at risk. So if the technical barrier is a showstopper for you, don’t despair: try to overcome it by looking at the solutions we are offering.