This is another blog entry of the unstructured data series, dedicated to retrieval and processing of unstructured (or, rather, semi-structured) data by SAP BusinessObjects Data Services, using JSONAdapter and Text Data Processing. After consumer sentiment analysis of Twitter messages, I would like to focus more on richer content -- news articles.

 

An RSS news feed “Yahoo! Australia & NZ Finance” has been selected as data source. Earlier in this blog series I demonstrated how  JSONAdapter retrieves data from RSS sources. As I mentioned there, Yahoo! only gives away a short snippet of each RSS item, not the full text of a news article – a URL for the latter is provided, though. It would be good to have a special “web crawler” Data Services adapter, that could reach a specified URL from a Function Call and return the full text of an article -- and in some future I may develop one. Until then, a curl.exe has to do it (but with extra crutches), so I created a Data Services Job that would generate a BAT-file and load every article into an HTML-file with name NNNN.html corresponding to the ID of the RSS item in the database:

5-1.PNG                   

 

That way, 1000 news articles have been collected over the period 1st-16th of May, 2012. Then I ran a simple Text Data Processing job over those files, getting ~81000 entities in the “raw” analysis table:

5-2.PNG 

 

The Entity Extraction transform result delivers a ‘spaghetti’ data consisting of entities and facts: from a sentence ‘Apples Ltd. acquired Oranges Inc.’, the transform would extract two entities of type Organization and one fact of type BuyEvent, all listed in the result dataset one after another. These records, however, have attributes helping to navigate in the context:

  • ID uniquely identifies an entity in the document;
  • PARENT_ID links a simpler entity to a more complex one: let’s say that, in the example above, the BuyEvent would have an ID=12 – then, both Apples Ltd. and Oranges Inc. will get PARENT_ID=12;
  • PARAGRAPH_ID provides the paragraph number in the text, where the entity occurs. Data Services counts paragraphs not only for plain text, but for HTML, as well;
  • SENTENCE_ID provides sentence number in the text, where the entity occurs.
  • OFFSET provides the entity’s exact position in the text. If an entity is extracted twice by different dictionary or rule (that is allowed an), the offset helps to catch those duplicates.

 

Making appropriate joins will then do the trick. Later in this blog I will demonstrate how that works.

 

Another issue that one has to handle when working with unstructured content is the natural variability of entities’ names in the unstructured content. In the heart of data flows consolidating those varying names lies Match transform of Data Quality. It is probably the most complex transform in Data Services to configure – luckily, SAP provides a few blueprints that may serve as a good starting point.

 

A person may be mentioned by their full name or by family name – business news style does not usually allow mentioning a person by their first name only. Obviously, there may be two persons sharing the same family name. A news article from my set typically referred to someone as John Doe in the beginning and then just Doe or Mr Doe further throughout the article. I used that fact in names cleansing dataflow and consolidated mentions of persons (there is of course a chance of two persons sharing the same family name to be mentioned in the same news article, but let us ignore it for this exercise). The screenshot below demonstrates how different forms of mentioning Tony Abbott are consolidated into a standardized record – the left pane lists ‘consolidated’ persons, and the right one leads to mentions of each person in all documents of the set :

5-3.PNG

 

Company names have been processed in a similar way. Actually, the idea was to not only consolidate them between the news articles, but also to standardize, using the externally acquired ASX list as master data (whether ASX list contains legally correct names is another question), and enrich with ASX tickers, 3-5 letter acronyms of company names. That way, the structured data – ASX trading history, in this case – was made available to augment the unstructured part using joins by those ticker fields. Below, the left pane, again, shows consolidated and standardized company names and the right pane leads to individual entries in the articles:

5-4.PNG 

 

Company names are somehow trickier than people names. In the screenshot above you may see that I set Qantas and Qantas Airways to count as the same company. However, the same setting made Bank Of Queensland and Bank Of China counting as the same company, either. This could be solved by having a list of ‘generic’ parts of company names – Bank Of, in this case – which would lead to a special matching process. Another finding was that a company name not accompanied with Ltd/Inc/Co/etc is sometimes recognized by Data Services as PROP_MISC, a generic catch-all bucket. Data Quality, obviously, may help to extract those, too.

 

There is a large amount of NOUN_GROUP entities, which by themselves seem to be not of much use, but can add some context – which is why I refer to them further as Topics. Using Data Quality Match and Associate functionality, those topics have been grouped into clusters – a trick from freely available blueprints by SAP. My data collection time interval included an event when Reserve Bank Of Australia had cut its cash rate. Surely, that event had been reflected in the context:

5-5.PNG 

Particular settings of the process to cluster the topics may differ, but you get the idea.

 

One thing I consciously did not do was creation of custom dictionaries. They would have helped to extract more information and structure it better, however that development requires deeper dive into the subject area and more mundane task of collecting the related terms. In the real-world case, that would be a requirement.

 

Of many approaches to analysis of this data, so far I am picking the simplest one, a sort of descriptive/profiling one. I focused on the following cases:

  1. Company performance at ASX -- simply, for a Company extracted from the news articles, the graph of its ASX trading history to be produced.
  2. Companies and Topics -- by ‘relation’ here I defined co-occurrence of Topics and Companies in the same paragraph or the same article – the last criteria, obviously, more relaxed. I could then quantify the number of Topic mentions and produce the Top 5 Topics for the given Company and Date. The same model has been implemented for Companies and People. That quantification is prone to some skewing: for example, if a news article is duplicated in the RSS feed (and I have seen such cases), all entities mentioned in there will get a bump to their mentioning score.
  3. Related Events – a subset of facts, namely, BuyEvent, SellEvent or Action that may be related (again, on the basis of co-occurrence in the text) to be picked for the given Company.
  4. People Events/Relationships – here, for each Person I searched for facts extracted from the unstructured data and reconstructed the Events (Hire or Resign) these people might have participated in, or Organization relationships they might have.
  5. Topics and People – for each Topic, a list of possibly related mentioned Persons to be produced, the frequency of mentioning might then be quantified. Technically, it’s the same as Related Topics exercise.

 

A couple of BusinessObjects Universes have been created to report on the processed data. In order to make the data more interactive, I have created a Dashboard (Xcelsius). I turned out to be a good pick, with one exception: display of textual data grids in Xcelsius is limited. On the other hand, it allowed to take the best from both worlds: MS Excel grid and formulas and live data connectivity.

 

Xcelsius enabled navigation between analysis cases outlined above: as soon as I selected a Company, from its ASX performance graph I could navigate to Company-related Topics, People or Events:

5-6.PNG 

 

The Company-related Events part for that selection was empty, hence not displayed. Here’s the Events part for another company, David Jones. The Events chart in the screenshot below is supposed to display a star for every day for which events are recorded, and navigate to the captured details in the box on the right:

5-7.PNG

 

More into People Events – I made it possible in Xcelsius to get list of Events by Person or by Date:

5-8.PNG 

 

As I mentioned, these analyses belong more to data profiling of some sort, rather than to quantitative analysis. However, parsed and cleansed unstructured data may provide a foundation for analytical scenarios, especially if integrated with ‘traditional’ structured data – forming what is buzzing these days as Big Data.

 

Data amounts and potential complexity of joins promise to be high, and, in SAP ecosystem, suggest using HANA to boost performance of analytics: not only that in-memory data access is generally faster, but also a) as of SPS4 HANA is capable of full text search and b) includes a few statistical libraries that can be integrated with SAP BusinessObjects Predictive Analysis or R. As for unstructured data processing, this blog series concludes here.

 

- Roman Bukarev

In this blog I will discuss using Data Services for consumer sentiment analysis of the data collected from Twitter using JSONAdapter.

 

As discussed in the first blog entry, Twitter Search API has been accessed, with the word ‘cityrail’ as the search term. To those not in the know, Cityrail is the train network of Sydney (Australia) metropolitan area. It was a very obvious target: with relatively big customer base it was guaranteed to get enough unstructured data. Over time (2-3 months) Data Services collected a few thousands tweets supposedly related to Cityrail.

 

It is worth to elaborate on the data collection process. In one request-response session, Twitter Search API returns up to 100 most recent tweets. Provided that within every 15 minutes the number of tweets about Cityrail does not exceed 100 (that assumption has been confirmed), Data Services Batch Job running every 15 minutes can collect all such tweets almost in real-time.

 

I would also like to mention a conversation I had with a colleague recently. He wondered if JSONAdapter may help to obtain a large amount of tweets for analysis instantly, and if Twitter Streaming API might help with that. The answer to the last question seems to be negative: once you open a Stream, Twitter feeds entries there in real time, but those are current entries; they can be grouped into time slots or used in another ETL process by Data Services. Streaming API should be used, for example, when for some topic the number of tweets exceeds, say, 100 per minute, hence Batch Job described above may not be able to cope, even if executed every minute.

 

Otherwise, the only difference between Stream and Search APIs becomes that the Stream API would provide raw data, while Search would apply some extra filtering/ranking by relevance to the search term. In fact, it is possible to build a Data Services job to get historical results, executing consecutive search requests to Search API deeper and deeper into the past (by restricting the TweetID field in a request) -- the process would not be instant, though, but probably running for 1-2 hours (consider it an Initial Load), and it is hard to tell how far into the past it can go.

The bottom line is: if there is an immediate need to analyze the historical data, you may have to contact the Twitter’s partner data providers. Otherwise, JSONAdapter may help to start collecting the data and implement (near) real-time analysis.

 

The further discussion will be around the following points:

  • text parsing using using Data Services,
  • ‘noise’ reduction,
  • Topic-Sentiment links rebuilding,
  • the sarcasm problem (no pun intended!).

 

Text parsing itself is simpler than one might expect. A special Transform in Data Services v.4.0, called Entity Extraction, parses the input unstructured text and extracts entities and facts about them. Its output is a number or ID’ed records containing one entity/fact each, accompanied with location attributes (paragraph number, sentence number and offset) and categorized accordingly to the rules specified in the Transform options.

 

Provided out of the box are dictionary of categorized entities and a few rulesets for facts extraction – they are located in the folder TextAnalysis of a standard Data Services installation (availability for use is subject to the license: either full DI+DQ or DI Premium). One of those rulesets, Voice Of Customer (VOC), is used for this work. SAP allows customization of rulesets (at your own risk, of course) and implementation of user-defined dictionaries and rules. SAP has also published several blueprints, which could be used to start new text analyses developments. For this blog, a blueprint for sentiment analysis has been used, it does the following:

  1. parse the incoming unstructured text into Topic and Sentiment entities using Voice Of Customer ruleset: for example, a phrase “I like apples” would be parsed into Topic=apples and Sentiment=like (accompanied with Sentiment Type ‘StrongPositive’),
  2. process Topics data and put Topics into groups, to enable some measures, like number of Sentiments per group, and ensure topics like ‘apple’ and ‘apples’ would fall into the same group,
  3. build a SAP BusinessObjects Universe on top of resulting tables, to enable WebI reporting with slice and dice capabilities.

 

A few important changes have been made to the blueprint design to deal with Twitter data, the first one covering the issue of noise elimination. For starters, the blueprint assumed the original text to be in plain English; in reality, tweets constitute quite a lingo, full of abbreviations, expletives of all kinds, and with incomplete grammar. In the upcoming SP release for Data Services, SAP makes an attempt to keep in touch with social media and adds new entities for trend/hashtag and for handle/mention. That did not seem enough, and a custom ruleset has been implemented and added to the Entity Extraction transform to detect and mark words that should be excluded from further processing. The picture below demonstrates options of the transform, including two out of the box rulesets followed by the custom one:

3-1.PNG

                   

This way, if an entity is extracted as both e.g. Topic and a (custom-defined) Blather type of entity, it will be detected in a simple join:

3-2.PNG 

 

The following screenshot displays a sample output from the transform ‘Blather’, the second column contains the entity extracted from the original tweet and categorized as an expletive:

3-3.PNG 

 

All such entities would be filtered, thus clearing the output from most of the ‘yucks’, ‘hells’, and ‘lols’. There is one more use of those, which will be discussed a couple paragraphs below.

 

Noise may also occur on macro-level. Tweets analysis is different from the blueprint in one more way: while the blueprint assumed the source text is completely relevant (for example, customer feedbacks on the imaginary Swiftmobile, collected in a separate folder), tweets don’t have to be. Filtering tweets by a word X returns not just a customer’s view on X, but all aspects of people lives that somehow involve X and that people care to write about. The amount of such macro-noise in ‘cityrail’ selection is, actually, small, but in a selection for, say, ‘westfield’ (a major chain of shopping malls in Australia) it becomes much bigger, for obvious reasons. A possible way to further filter the results would be by having a predefined list of topics specific to the bigger theme.

 

By default, the output of Entity Extraction transform looks like what might be called a ‘spaghetti’ type of data, i.e. it doesn’t care about relationships between Topics and Sentiments. While it may be considered sufficient, a need to relate Topics and Sentiments may be considered. Assuming that in a sentence related topic and sentiment should be closely located, it’s possible to derive Topic-Sentiment pairs from ParentID and Offset fields of Entity Extraction transform output:

3-4.PNG

 

This design obviously ignores topics not accompanied by sentiments and vice versa, and those could be added to the reporting data model.

 

‘Raw’ tweets preview in the database revealed that tweets mostly expressed negative feedback on Cityrail: people tend to complain more often than praise, and – by the way – I wrote the first draft of this blog on the day and hour when some Cityrail’s equipment failure caused major suspension of service and delays. Therefore, it was surprising to see significant ‘StrongPositiveSentiment’-related numbers in the reporting. The reason was that many tweets were sarcastic and should not have been taken literally, but, rather, their sentiment would be opposite to their literal meaning. So, if a tweet is deemed sarcastic, its positive Sentiment should be reverted; while negative Sentiment still counts.

 

Apparently, sarcasm detection in user feedbacks is a much bigger problem without a general solution. Even a human cannot detect sarcasm perfectly (73% accuracy has been reported from one research), as familiarity with the context is often required. Given the Data Services’ ability to process Python scripts in User Defined Transforms, one might attempt to build a sarcasm detection functionality in Data Services based on a few published approaches, using not only words, but markers of emotions: emoticons, ‘blather’-words discussed above, words highlighting using ‘*’ (like in “I *love* when trains go slowly in rainy weather”) or enclosing into quotation marks, and, of course, the hashtag #sarcasm. Coincidence of negative and positive (rather, strong positive, in terms of VOC’s ruleset) sentiments or, rather, emotions in one tweet is also a potential sarcasm marker. The last one, actually, can be implemented with regular Data Services ETL:

3-5.PNG

 

The results below could have been slightly better if VOC knew that ‘con’ is a short form for ‘conditioner’, not a negative sentiment expression. Some extra customization of the dictionary may be required.

3-6.PNG 

 

Implementation of the full outlined above sarcasm detection functionality scope, however, seems to be a project by its own and beyond this blog.

 

Setting up reports on the analysed data was not a primary goal of this work, as the SAP blueprint’s approach of BusinessObjects Universe was adopted. The original plan was to use SAP BW BEx reporting, but as storage of texts longer than 60 characters in BW InfoProviders is not trivial, the idea had been discarded.

 

Consumer sentiment is quantified here by counting the number of feedbacks, restricted measures have been created for each feedback type. The screenshot below demonstrates how Data Quality grouped topics into larger groups using fuzzy matching logic:

3-7.PNG

 

A drilldown into a group is then possible, like below:

3-8.PNG

 

An extra characteristic, time, has been added to reporting as an obvious choice: there is clear correlation between number of ‘cityrail’ tweets and morning/afternoon transport peak hours. One might think of implementing a “rolling total negative sentiment” of 30 minute scope and raise an alert if that value exceeds some threshold.

3-9.PNG

 

Lastly, beyond consumer sentiment analysis, another obvious idea would be to geocode Tweets using either geolocation information (GPS coordinates) from tweets metadata, or geographical names from tweets themselves (post-processing is required for the latter, of course, to eliminate noise). The geocoded data could be made available for visualization in Business Objects or provided to a GIS product like ArcGIS for use in spatial analyses.

 

- Roman Bukarev

In the recent couple of years I became convinced in what researchers had been telling for quite a long time – in the rise of unstructured data, or, simply put, plain text in natural language. Some analysts go as far as predicting that volume of unstructured data available to companies will exceed one of the ‘traditional’ structured data, which we read from database tables and are accustomed to build intelligence on. Analytical potential of unstructured data is well discussed (to mention a couple of uses, that would be consumer sentiment analysis and entities finding), so why don’t we talk about some practicalities of text data processing, and how SAP products could be of use.

 

I would like to start this series of blogs with discussion on how unstructured data from social media can be loaded to a data warehouse.

In this context appears the name JSON -- an open source text-based data interchange format. The acronym stands for JavaScript Object Notation – the name points to the roots of the format, but, actually, the standard is used outside of Javascript, with implementations of JSON for various platforms referenced at its homepage, http://json.org.

 

Have a glance at JSON code:

{

   “Plant”:{

     “Colour”:”green”,

     “Height”:{

         “Measure”:”50”,

         “Units”:”cm”

      }

   }

}

As you might guess, it describes a green plant of 50 cm height.

Just so you might compare – in XML, a well-known standard for data interchange, one possible way to express the same would be:

<Plant>

     <Colour>green</Colour>

     <Height>

         <Measure>50</Measure>

         <Units>cm</Units>

     </Height>

</Plant>

 

On large data volumes the JSON representation is more lightweight than XML, mainly because of missing closing tags. JSON became widely used in web development (“Ajax programming” is your keywords for further reading), which brings us to the topic of those famous social web-based applications, including Twitter, Facebook, Friendfeed, or RSS feeds collected by Google Reader.

 

So this is why JSON looks like a good candidate to be the medium for unstructured data extraction into a data warehouse: it is supported by source web-applications, and it can be converted relatively easy into XML, understood by major ETL tools – in our case, SAP BusinessObjects Data Services.

 

Even out of the box, Data Services can integrate with wide range of databases and data formats from various vendors, and that range can be extended by using plugins called Adapters. SAP provides Java Software Development Kit to create such Adapters. A very helpful introduction into Data Services Adapter SDK can be found at SDN website; sample code is also provided as a part of the product installation.

 

With that SDK I have developed JSONAdapter that obtains data via HTTP in JSON-format, converts it to XML and passes to Data Services – so I will describe the steps to configure such interface with Twitter, assuming the JSONAdapter has been installed, and its Datastore in the Local Repository has been created.

Twitter provides a search interface, where search parameters are included into a URL.  For example, a URL “http://search.twitter.com/search.json?q=intelligence&rpp=50” would render Twitter to return a JSON-formatted result with 50 most recent tweets containing the word “intelligence”.

 

The first step in ETL setup is to describe the data structure (metadata) in XML DTD form, one that Data Services understands. Unfortunately, it is not always possible to obtain JSON Schema, used to describe the structure of a JSON document – nor it is actually standardized yet. That is why the Adapter makes an attempt to derive metadata from a sample data. While the sample may be incomprehensive, and there may be several ways to describe the same XML data by DTD, it is still a good start.

 

So, let’s use “Import By Name” JSONAdapter’s functionality for metadata import.

1-1.PNG

 

..which leads to --

1-2.PNG

While actually creating a Function Call that may be used straight away, the import process also (optionally) generates a DTD file.  That file may be reviewed and adjusted, for example, to denote some XML elements as optional rather than required. “Import By Name” functionality should be then used again, but this time to import metadata from the adjusted DTD file, not a sample URL.

 

With that second import or without, the final result would be a Function returning data in Nested Relational Data Model (NRDM), the Data Services’ internal representation of XML data (screenshot below is partial):

1-3.PNG

 

In the new Function, the Input parameters are always URL and EXTRA_PARAM, which both form the Url submitted to the web-application, but at development time such split provides more clarity regarding constant and variable parts of the request.

 

The Output is a nested schema that may be processed using standard Data Services tools. Let’s have a closer look at Data Flow design for that.

1-4.PNG

 

Function cannot be placed into a Data Flow by itself; it rather should be inserted into a Query transform, which, in turn, should be preceded by a Row_Generation transform, to provide an input. Hence, the Row_Generation transform generates exactly 1 row, and Query transform Twitter_Search calls the JSONAdapter-based Function:

1-5.PNG

 

Greater flexibility may be added to the Function call’s input parameters using variables:

1-6.PNG

 

These variables may be globals or the data flow’s input parameters, so that is the way to parameterise Function calls from outside of Data Services.

The NRDM data received by Data Services may then be unnested and stored in table format, voi la!

 

1-7.PNG

 

From here, the social media data can be used and reused for analytical processing as part of the data warehouse.

 

In the next blogs I will discuss data acquisition by JSONAdapter from Facebook and RSS Feeds, and, using collected tweets as a model, discuss text data processing options in Data Services.


Filter Blog

By author:
By date:
By tag: