Twitter and Oracle Endeca Information Discovery – Part 2


PART 2: Following up on the first part of Oracle Endeca Information Discovery

Endeca Studio analysis

Once the loading is done, a new application can be created in Endeca Studio with this new data. Then, meaningful and dynamic dashboards can be built to start “surfing” that data and discovering interesting facts hidden among the hundreds of thousands of tweets that we would otherwise have missed in Twitter’s never-ending stream. Let’s get to it.

Observing the repercussion of Messi’s and Neymar’s performance on Twitter

To observe the repercussion of the star players’ performance throughout the capture window, the following multi-bar graph/timeline was created:

Figure 1. Messi and Neymar performance repercussion on Twitter


It shows how many tweets about Messi, Neymar or both of them were published over time (note that the “Instant” value is encoded as day-hour-minute: “21945” means day 2, 19h 45min). A peak in the tweet count means that some important event happened around the topic represented by the bar, and for pinning down that event the tag cloud component is really useful.
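The instant encoding described above is simple enough to decode programmatically. As a quick illustration (the fixed-width day/hour/minute layout is inferred from the example in the text):

```python
def decode_instant(instant):
    """Decode an instant code like 21945 into (day, hour, minute).

    The encoding is assumed to be the digits D, HH, MM concatenated,
    e.g. "21945" -> day 2, 19:45.
    """
    s = str(instant)
    return int(s[0]), int(s[1:3]), int(s[3:5])
```

For example, `decode_instant(22141)` yields day 2, 21:41, the instant of the first goal discussed below.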

Tag clouds provide comprehensible information about the most frequent or relevant words or phrases in a given attribute. In our data domain there are two attributes worth representing this way: “Tweet Themes” and “Tweet Words”. “Tweet Themes” were extracted using the text enrichment engine and provide a meaningful way of understanding the contents of a tweet in a few words. “Tweet Words” is just the list of words inside a tweet, which have no meaning by themselves. Building a tag cloud for each attribute provides an easy way of understanding, at a glance, what the tweeters are talking about. It is advisable to sort the “Tweet Words” tag cloud by relevancy instead of frequency, since relevancy weights the terms more intelligently.
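Endeca computes relevancy internally, but as a rough illustration of why a relevancy ranking differs from a raw frequency ranking, here is a toy TF-IDF-style weighting (this is not Endeca’s actual algorithm, just a sketch of the idea that words concentrated in few tweets should outrank words spread everywhere):

```python
from collections import Counter
import math

def tag_weights(documents):
    """Toy frequency and TF-IDF-style relevancy weights for a tag cloud.

    NOT Endeca's internal algorithm: a word's relevancy here is its total
    frequency scaled by how rare it is across documents.
    """
    words = [w for doc in documents for w in doc.split()]
    freq = Counter(words)
    n_docs = len(documents)
    relevancy = {}
    for word, count in freq.items():
        docs_with = sum(1 for doc in documents if word in doc.split())
        relevancy[word] = count * math.log(n_docs / docs_with + 1)
    return freq, relevancy
```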

Let’s see some examples of tag cloud usage. At 21:41 (instant 22141) something happened involving Messi, because 2,310 tweets were talking about him. If we click on that bar to filter by this instant and then look at the previously created tag clouds, we observe the following:

Figure 2. First goal instant tag cloud


We have easily found the precise instant at which Messi scored the first goal.

At 22:36 there is a peak in the “Messi & Neymar” graph so let’s see why through the same process:

Figure 3. First time Neymar and Messi playing together instant tag cloud


People were excited because it was the moment Neymar entered the field, playing side by side with Messi for the first time. One more example of how easy it is to detect important events in a timeline and identify their cause using Endeca.

Sentiment analysis

The text enrichment component also provides sentiment analysis information, which tells us whether people are satisfied or not by observing how many positive or negative messages they are tweeting.

Using the metric component, this information can be displayed textually at a glance. We created a “Metrics Bar” using the following EQL query:

Figure 4. EQL sentiment query

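The exact EQL of Figure 4 is not reproduced here; conceptually, it aggregates the per-tweet sentiment scores into the totals shown in the Metrics Bar. An equivalent aggregation over exported records might look like this (the `sentiment` attribute name is hypothetical):

```python
def sentiment_metrics(records):
    """Aggregate per-record sentiment scores into Metrics Bar-style totals:
    overall sum plus counts of positive and negative tweets.
    The 'sentiment' field name is an assumption about the data model.
    """
    scores = [r["sentiment"] for r in records]
    return {
        "total": sum(scores),
        "positive": sum(1 for s in scores if s > 0),
        "negative": sum(1 for s in scores if s < 0),
    }
```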

The resulting bar can be set up as follows:


Many charts can be plotted to observe a sentiment attribute. Figure 5 shows two of them:

  • The left one sums up all the sentiment evaluations for the two star players. The sum is positive for both, but almost three times higher for “Messi”.
  • The chart on the right shows the sentiment evolution over time. Using the value axis dropdown menu, the min (most negative sentiment), max (most positive sentiment) and avg (average sentiment) series are plotted.

Figure 5. Sentiment sum and evolution charts

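Conceptually, the evolution chart groups the sentiment scores by instant and plots the min, max and avg of each group. A sketch of that aggregation, with hypothetical field names:

```python
from collections import defaultdict

def sentiment_evolution(records):
    """Group tweet sentiment by instant and compute the min/max/avg series
    plotted in the evolution chart. The 'instant' and 'sentiment' field
    names are assumptions about the data model."""
    by_instant = defaultdict(list)
    for r in records:
        by_instant[r["instant"]].append(r["sentiment"])
    return {
        t: {"min": min(v), "max": max(v), "avg": sum(v) / len(v)}
        for t, v in by_instant.items()
    }
```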

If we filter by the minute when Messi scored the first goal (21:41), we can see that the global sentiment is very positive and that the summed sentiment of the tweets mentioning Messi is very high:

Figure 6. Global sentiment at a Messi’s goal


Where are the tweeters located?

Using the map component, the geographical coordinates embedded in the tweets can be used to place the messages on a map and see, in this case, from which countries the match was being followed.

Since not all tweets carry geolocation information, we can drop the ones without it using an excluding filter:

Figure 7. Excluding filter

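Outside Endeca, that exclusion amounts to dropping every record whose geolocation attribute is empty; a minimal sketch (the `coordinates` field name is hypothetical):

```python
def with_coordinates(records):
    """Keep only the tweets carrying geolocation, mimicking an excluding
    filter. The 'coordinates' field name is an assumption."""
    return [r for r in records if r.get("coordinates") is not None]
```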

The filtering can also be done inside the map component using an EQL statement. In our example we used three different record sets inside the map to show positive, negative and neutral tweets in separate tabs by sentiment.

Figure 8. Tweets geographical distribution by sentiment.


Unstructured text search

The powerful Endeca search engine also allows textual search through structured and unstructured data, with interesting features such as spelling corrections and “did you mean” suggestions.

Let’s suppose that we want to look into a certain event that happened during our capture window. Searching for “goal” and looking at the timeline graph, we can observe that the peaks in published tweets occur at the instants when the goals were scored. Additionally, an uncommon event happened during the match: someone from the audience jumped onto the field. Searching for the word “espontaneo” (the Spanish term for a pitch invader), we can identify that something related to it happened at 22:37 (instant 22237) and that it had something to do with Neymar (as the timeline in Figure 9 suggests).

Figure 9. Searched event timeline


By clicking on it and looking at the tag clouds of that instant (Figure 10), we can guess that an intruder ran onto the field and tried to hug Neymar.

Figure 10. Event tag clouds


Setting up alerts

The last Endeca Studio feature I am going to cover in this post is how to set up alerts that quickly flag certain events in our data when given conditions are met. Let’s suppose we want to monitor the minutes with more than 1,000 tweets (important events). Each alert has to be defined using a filtering EQL query; for the proposed case, the following query was defined:

Figure 11. Sample EQL query to get minutes with more than 1000 tweets


Using an alert message such as “{twts} tweets have been published at hour {h} minute {m}.”, the result shown in Figure 12 is obtained. The data can even be refined through the alerts: for example, if we click on the third alert under “Important minutes” (2,481 tweets at 22:55) and look at the tag clouds, we can see that Dongou replaced Messi at that moment.

Figure 12. Alert results

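The alert logic itself (count tweets per minute, keep the minutes above the threshold, fill in the message template) can be sketched as follows; the `h` and `m` field names follow the message template quoted above and are assumptions about the data model:

```python
from collections import Counter

def important_minutes(records, threshold=1000):
    """Find the minutes with more than `threshold` tweets and format the
    alert message used in the post. The 'h' and 'm' field names are taken
    from the message template; they are assumptions."""
    counts = Counter((r["h"], r["m"]) for r in records)
    return [
        f"{twts} tweets have been published at hour {h} minute {m}."
        for (h, m), twts in sorted(counts.items())
        if twts > threshold
    ]
```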

Endeca Studio offers an easy and intuitive way of discovering and visualising information from both structured and unstructured data, providing a user-friendly interface and useful components. However, this simplicity depends heavily on how well the integrator prepares the initial data, so it is important to remember that properly structuring unstructured data (as far as possible) paves the way for better analysis later on.

Twitter and Oracle Endeca Information Discovery


PART 1: Twitter Extraction, Integrator Transformation and Endeca Loading

In recent years we have witnessed how the explosion of social media, beyond providing an easy way of generating and publishing new personal content, collaterally offers a real-time source of information that no company aiming for customer satisfaction (and which one isn’t?) can afford to ignore. That’s why today we want to talk about Twitter and Oracle Endeca Information Discovery.

Nowadays, Twitter is the most representative example of a social network providing real-time data from users all over the world, including their reactions, opinions, criticisms and praises regarding any social event, such as a new product release or a football match.

The main drawback of this kind of data source is that it is heterogeneous and unstructured, and some processing needs to be done to extract useful information from it. That’s where Oracle Endeca Information Discovery (OEID) comes into play.

This blog article will explain how to capture fresh real-time Twitter data in a football match scenario and feed it to Endeca in order to perform unstructured text enrichment and analytics.

Oracle Endeca Framework

Three main components can be identified in the OEID platform: the Oracle Endeca Server, the OEID Studio and the OEID Integrator.

  • Oracle Endeca Server. Stores the data in a key-value-pair data model, enabling flexible management of changing and diverse data types, which also reduces the need for up-front data modelling.
  • OEID Integrator. Gathers and unifies diverse data sources and enriches unstructured text with entity and theme detection as well as sentiment analysis.
  • OEID Studio. Allows the final business user to visualise and discover new insights by analysing the stored data through the intuitive building of interactive dashboards.

Figure 1: Oracle Endeca Framework

In this blog’s use case, the OEID Integrator will be used to parse and classify the raw Twitter data as well as to perform some text enrichment and sentiment analysis over the captured tweets content before loading them into the Endeca Server. Afterwards, the OEID Studio will be used to perform the analytics over the extracted information.

Twitter stream

The public streaming APIs provide low-latency, real-time access to Twitter’s global stream of data. With a “filtering” API call, we can open a connection to Twitter through which we receive all the tweets that match our query.

For our football match example, we wrote some Java code to capture all the tweets related to the August 2nd match of the 48th edition of the Joan Gamper Trophy between F.C. Barcelona and Santos, focusing on the two most popular players (Messi and Neymar), with the following query:

https://stream.twitter.com/1.1/statuses/filter.json?track=messi,neymar,gamper
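Opening the connection itself requires OAuth credentials and a streaming HTTP client (our capture code was written in Java), but composing the filter call shown above is straightforward; a sketch:

```python
from urllib.parse import urlencode

def build_filter_url(track_terms):
    """Compose the streaming-API filter call used in the post. OAuth
    signing and the long-lived HTTP connection are omitted here."""
    base = "https://stream.twitter.com/1.1/statuses/filter.json"
    return base + "?" + urlencode({"track": ",".join(track_terms)})
```

Note that `urlencode` percent-encodes the commas, which is equivalent to the literal URL quoted above.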

The analysis conducted in this article covers the period between 19:45 and 0:00 CET (the match started at 21:30). A total of about 180,000 multilingual tweets (mainly in English, Spanish and Portuguese) were captured.

Tweets are received in JSON format, so some parsing was done “on the fly” while reading from the stream in order to extract the most important characteristics of the tweets and their authors: tweet ID, date, text, coordinates, user ID, user name, user language, user location, and other useful information.

Figure 2: Example of captured tweet in JSON format
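A sketch of that on-the-fly parsing, using the field names from Twitter’s JSON status format (missing fields come back as None):

```python
import json

def parse_tweet(raw_line):
    """Extract the fields listed above from one raw JSON status line.
    Field names follow Twitter's v1.1 status format."""
    t = json.loads(raw_line)
    user = t.get("user", {})
    return {
        "tweet_id": t.get("id_str"),
        "date": t.get("created_at"),
        "text": t.get("text"),
        "coordinates": t.get("coordinates"),
        "user_id": user.get("id_str"),
        "user_name": user.get("screen_name"),
        "user_lang": user.get("lang"),
        "user_location": user.get("location"),
    }
```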

Loading tweets into Endeca

After building the data file containing the tweets captured from the stream and adding the influence information, it is time to perform the necessary transformations on the data and load it into Endeca. This process, performed using the OEID Integrator, comprises the following steps (it is advisable to reuse the sample graphs from the “GettingStarted” demo project for common tasks such as domain creation or configuration):

1. Initialise the data domain. The graph in Figure 3 checks whether the data domain has already been created. If it exists, the domain is enabled; if not, it is created.

Figure 3: Data domain creation

It is based on the “InitDataDomain.grf” graph from the sample “GettingStarted” project. For it to work, it is only necessary to change the appropriate properties in the “workspace.prm” file to match our Endeca Server and data domain settings:

Figure 4: Endeca Server and domain configuration in workspace.prm file

2. Load the attributes configuration. As mentioned before, it is not necessary to define the data model up front (hence Endeca’s flexibility). However, sometimes some default attribute settings must be overridden. Since the latest version of Endeca, attributes are “single assign” by default (they cannot take multiple values for the same attribute). So if, for example, we want to store in a single field the list of words that make up a tweet, we need to set the “IsSingleAssign” property of that attribute profile to “false”. The following images show the content of the configuration files in our scenario (profiles, metadata and groups).

Figure 5: Attribute profiles configuration file

Figure 6: Attributes metadata and groups configuration files

The “LoadConfiguration.grf” graph from the demo project can also be used here to correctly set up the attribute properties, metadata, groups and basic search configuration (see the figure below). Some of these settings can be changed later by running a modified configuration graph or directly through the OEID Studio.

Figure 7: Attribute configuration loading

3. Transform and load the data. The last step is to take the captured Twitter data, pass it through the text enrichment components to obtain semantic information (such as the topics the tweets talk about, or the sentiment analysis) and, finally, load everything into the Endeca Server.

Figure 8: Data transformation and loading

Figure 8 depicts the ETL graph of this stage. The data extracted from the Twitter stream was stored in two separate data files: tweets and users. Both data flows are merged and then processed through four text enrichment components. These elements are in charge of calling the external Lexalytics[1] analysis module to extract the sentiment, named entities and themes (among other information) from the tweet texts in three different languages (English, Spanish and Portuguese), using a special data set prepared for analysing Twitter messages. Afterwards, the processed data is merged together and loaded into the Endeca Server, ready to be used in Endeca Studio to perform the analytics.
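The merge-and-route part of that flow can be sketched as follows: join the tweets and users files on user id, then send each record to the enrichment branch for its language (this is an illustration of the data flow, not the actual Integrator graph; all field names are assumptions):

```python
def merge_and_route(tweets, users, languages=("en", "es", "pt")):
    """Join tweet and user records on user id, then route each merged
    record to the enrichment branch for its language; unsupported
    languages fall into the 'other' branch."""
    by_id = {u["user_id"]: u for u in users}
    branches = {lang: [] for lang in languages}
    branches["other"] = []
    for t in tweets:
        record = {**t, **by_id.get(t["user_id"], {})}
        lang = record.get("user_lang")
        branches[lang if lang in languages else "other"].append(record)
    return branches
```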

Finally, the data domain in the Endeca Server is loaded with our data, ready to be exploited through an Endeca Studio application to start extracting useful information. In the next blog post, I will talk about how to build an OEID Studio application.

Read the second part of Twitter and Oracle Endeca Information Discovery here:

[1] Lexalytics Text Analysis Software Engine: http://www.lexalytics.com/.


privacy policy - Copyright © 2000-2010 ClearPeaks
