We had a POC that required loading large flat files into a Sybase IQ data warehouse. We used Data Services 4.0 (DS) to load the files into the staging area of the data warehouse.

 

Loading files into a database table is an easy job for an ETL tool. In DS, all you need to do is create a file format and import the staging table from IQ. Then connect the file format with the staging table in a dataflow, for instance:

 

 

Please note that this is a typical scenario if you are loading large files into an analytic or columnar database such as SAP Sybase IQ or SAP HANA. Even if you need to do some transformations at the beginning of your ETL process, you would rather do them in a second step, because the database can most often process transformations much faster than Data Services could. So, instead of using a Data Services query transform right after the file format, you may want to simply import the file into the database first. In the next dataflow you would then use the imported table (IMP_Customers in the example above) as a source table and specify the transformation logic in a query transform. Ideally the transformation logic gets pushed down to IQ and thus benefits from the high performance of the columnar database. In most of the cases I have seen so far this approach proved to be quicker than executing the transformation logic within the ETL tool.
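Just to illustrate the idea: if the staging and target tables live in the same IQ datastore, that second dataflow would ideally translate into a single INSERT ... SELECT statement executed entirely inside IQ. A minimal sketch, assuming the staging table IMP_Customers and a hypothetical target table DIM_CUSTOMER with made-up columns and transformation logic:

INSERT INTO DIM_CUSTOMER (CUSTOMER_ID, CUSTOMER_NAME, CITY)
SELECT CUSTOMER_ID,
       UPPER(CUSTOMER_NAME),      -- transformation logic executed inside IQ
       CITY
FROM   IMP_Customers
WHERE  CUSTOMER_ID IS NOT NULL;

You can check the optimized SQL that Data Services generates for the dataflow to verify whether the statement really gets pushed down in this form.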

 

So far so good, but there are two things to keep in mind when loading flat files directly into a Sybase IQ table. First you need to decide on the loading mechanism that you specify in the target table options in Data Services:

 

Available loading mechanisms:

  1. Single-record insert statements: this is the default option in Data Services. The tool will send a SQL insert statement for each record to be loaded. This can work fine for a few hundred or a few thousand records, but it is a clear no-go for large amounts of data and a definite showstopper if you are loading hundreds of thousands or millions of records. The only feasible solution in that case is to use the bulk loading option in the target table.
  2. With the bulk loading option the client (Data Services) simply sends the LOAD TABLE ... USING FILE ... command to IQ. IQ will then handle the data load process completely on its own. Here again, IQ offers various other options, which are all supported by Data Services. You can specify these options in the advanced datastore configuration:

 

  • JS and DB on same machine: the file can reside locally on the Sybase IQ server or remotely. If it is remote, the IQ ODBC client will handle the transfer of the file from the client machine (in this case the Data Services server) to the IQ server.
  • Use named pipe: the bulk load can be managed using a named pipe as data file: in this scenario another process keeps writing data into the file (the named pipe) while at the same time IQ picks up the new incoming records from the file and loads them into the table. Again, named pipes can reside locally on the IQ server or remotely on the client machine.

 

In all the tests I have done so far in various setups, loading a complete file that resides locally on the IQ server was always the fastest option. Named pipes seem to slow down the bulk load. Still, I would recommend that you test these various options in your own environment. If you decide not to use named pipes there is still one important issue to keep in mind:

 

Bulk loading from Data Services into IQ:

The screenshot above might suggest that Data Services simply sends a LOAD TABLE ... USING FILE ... command to IQ, that the whole loading process is managed completely by IQ, and that Data Services actually has nothing to do in the dataflow. This is not true, though:

  • First, Data Services loads the file into its memory. In this step it will also perform all the error handling that is specified at the level of the file format!
  • Second, Data Services writes the data it has just loaded to another file in its bulk loader directory (option Bulk loader directory in the advanced datastore settings - if not specified, it is the ~/BusinessObjects/DataServices/log/bulkloader directory).
  • Third, it sends the LOAD TABLE ... USING FILE ... command to IQ to initiate the bulk loading process within IQ.

 

You can read these processing steps from the trace log, for instance:


In this example we are loading 130 million records with the dataflow shown in the screenshot above. From 10:53 until 11:07 Data Services performed the first two steps: loading the file into its memory and writing it into the bulk load directory. At 11:07 Data Services sent the LOAD TABLE statement to IQ. From 11:07 until 11:11 IQ loaded the interim file from the bulk load directory into its table.

If you do not depend on the file error handling of Data Services, the extra work from 10:53 until 11:07 looks like useless overhead and a waste of time. Furthermore, a similar kind of error handling can be defined at the level of the bulk load statement in IQ. This is also supported by Data Services: you can specify the error handling options of the IQ bulk load in the target table options (obviously, your ETL process would still need to check the IQ error log files if you need tighter control in case of loading errors):

 

 

How to avoid the extra work of Data Services?

If you don't need the error handling features of Data Services while loading a file (or if you can handle these with the IQ bulk load error handling) you may want to get rid of the extra loading/writing steps that Data Services is doing. There are two options:

  1. Write your own LOAD TABLE statement in a Data Services script using the sql function. I did this by copying the LOAD TABLE statement from the Data Services trace log. You will still need to adapt this LOAD TABLE statement, because Data Services will probably write its own interim file in a slightly different format than the original file; most likely the row delimiter and field delimiter need to be adapted. A sketch of such a statement follows this list.
  2. (Less realistic:) vote for my improvement suggestion in the SAP Idea Place: the solution above definitely works, but this is not the way we want to code an ETL process when using an ETL tool. I therefore suggested that the IQ bulk load from a data file (which is provided by some supplier outside Data Services) should bypass loading the data into the memory of Data Services. In this case features like error handling in Data Services would need to be deactivated.
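For option 1, here is a minimal sketch of such a hand-written statement, assuming a pipe-delimited file that already resides on the IQ server and the staging table IMP_Customers; the file path, column list and options are made up and need to be taken from your own file format and trace log:

LOAD TABLE IMP_Customers ( CUSTOMER_ID, CUSTOMER_NAME, CITY )
USING FILE '/data/staging/customers.dat'
QUOTES OFF
ESCAPES OFF
DELIMITED BY '|'
ROW DELIMITED BY '\n'
-- optional IQ-side error logging instead of the Data Services file checks
-- (check the LOAD TABLE documentation of your IQ version for the exact clauses):
-- MESSAGE LOG '/data/staging/customers_msg.log' ROW LOG '/data/staging/customers_row.log'
NOTIFY 1000000;

In a Data Services script this whole statement would be passed as the second parameter of the sql() function against the IQ datastore (with embedded quotes escaped accordingly).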

This is another blog entry of the unstructured data series, dedicated to retrieval and processing of unstructured (or, rather, semi-structured) data by SAP BusinessObjects Data Services, using JSONAdapter and Text Data Processing. After consumer sentiment analysis of Twitter messages, I would like to focus more on richer content -- news articles.

 

An RSS news feed “Yahoo! Australia & NZ Finance” has been selected as the data source. Earlier in this blog series I demonstrated how JSONAdapter retrieves data from RSS sources. As I mentioned there, Yahoo! only gives away a short snippet of each RSS item, not the full text of a news article – a URL for the latter is provided, though. It would be good to have a special “web crawler” Data Services adapter that could reach a specified URL from a Function Call and return the full text of an article -- and at some point in the future I may develop one. Until then, curl.exe has to do the job (with some extra crutches), so I created a Data Services Job that generates a BAT-file and loads every article into an HTML-file named NNNN.html, corresponding to the ID of the RSS item in the database:

5-1.PNG                   

 

That way, 1000 news articles have been collected over the period 1st-16th of May, 2012. Then I ran a simple Text Data Processing job over those files, getting ~81000 entities in the “raw” analysis table:

5-2.PNG 

 

The Entity Extraction transform delivers ‘spaghetti’ data consisting of entities and facts: from a sentence ‘Apples Ltd. acquired Oranges Inc.’, the transform would extract two entities of type Organization and one fact of type BuyEvent, all listed in the result dataset one after another. These records, however, have attributes that help to navigate the context:

  • ID uniquely identifies an entity in the document;
  • PARENT_ID links a simpler entity to a more complex one: let’s say that, in the example above, the BuyEvent would have an ID=12 – then, both Apples Ltd. and Oranges Inc. will get PARENT_ID=12;
  • PARAGRAPH_ID provides the paragraph number in the text where the entity occurs. Data Services counts paragraphs not only for plain text, but for HTML as well;
  • SENTENCE_ID provides the sentence number in the text where the entity occurs;
  • OFFSET provides the entity’s exact position in the text. If an entity is extracted twice by different dictionaries or rules (which is allowed), the offset helps to catch those duplicates.

 

Making appropriate joins will then do the trick. Later in this blog I will demonstrate how that works.
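To give a flavour of such a join, here is a sketch in SQL, assuming the raw Entity Extraction output has been landed in a table TDP_ENTITIES that contains, next to the attributes listed above, a document ID, the entity type and the standard form of the entity (all table and column names here are illustrative, and the type literals depend on the extraction configuration):

-- list the organizations that belong to each extracted BuyEvent fact
SELECT f.DOC_ID,
       f.STANDARD_FORM AS BUY_EVENT,
       e.STANDARD_FORM AS ORGANIZATION
FROM   TDP_ENTITIES f
JOIN   TDP_ENTITIES e
       ON  e.DOC_ID    = f.DOC_ID
       AND e.PARENT_ID = f.ID
WHERE  f.TYPE = 'BuyEvent'
  AND  e.TYPE = 'ORGANIZATION';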

 

Another issue that one has to handle when working with unstructured content is the natural variability of entities’ names. At the heart of the dataflows consolidating those varying names lies the Match transform of Data Quality. It is probably the most complex transform in Data Services to configure – luckily, SAP provides a few blueprints that may serve as a good starting point.

 

A person may be mentioned by their full name or by family name – business news style does not usually allow mentioning a person by their first name only. Obviously, there may be two persons sharing the same family name. A news article from my set typically referred to someone as John Doe in the beginning and then just Doe or Mr Doe further throughout the article. I used that fact in the name-cleansing dataflow and consolidated mentions of persons (there is of course a chance of two persons sharing the same family name being mentioned in the same news article, but let us ignore it for this exercise). The screenshot below demonstrates how different forms of mentioning Tony Abbott are consolidated into a standardized record – the left pane lists ‘consolidated’ persons, and the right one leads to mentions of each person in all documents of the set:

5-3.PNG

 

Company names have been processed in a similar way. Actually, the idea was not only to consolidate them across the news articles, but also to standardize them, using the externally acquired ASX list as master data (whether the ASX list contains legally correct names is another question), and to enrich them with ASX tickers, the 3-5 letter acronyms of company names. That way, the structured data – ASX trading history, in this case – was made available to augment the unstructured part using joins on those ticker fields. Below, the left pane, again, shows consolidated and standardized company names and the right pane leads to individual entries in the articles:

5-4.PNG 

 

Company names are somewhat trickier than people names. In the screenshot above you may see that I set Qantas and Qantas Airways to count as the same company. However, the same setting made Bank Of Queensland and Bank Of China count as the same company, too. This could be solved by having a list of ‘generic’ parts of company names – Bank Of, in this case – which would lead to a special matching process. Another finding was that a company name not accompanied by Ltd/Inc/Co/etc is sometimes recognized by Data Services as PROP_MISC, a generic catch-all bucket. Data Quality, obviously, may help to extract those, too.

 

There is a large number of NOUN_GROUP entities, which by themselves do not seem to be of much use, but can add some context – which is why I refer to them further as Topics. Using Data Quality Match and Associate functionality, those topics have been grouped into clusters – a trick from freely available blueprints by SAP. My data collection interval included an event when the Reserve Bank Of Australia cut its cash rate. Surely, that event was reflected in the context:

5-5.PNG 

Particular settings of the process to cluster the topics may differ, but you get the idea.

 

One thing I consciously did not do was the creation of custom dictionaries. They would have helped to extract more information and structure it better; however, that development requires a deeper dive into the subject area and the more mundane task of collecting the related terms. In a real-world case, that would be a requirement.

 

Of many approaches to analysis of this data, so far I am picking the simplest one, a sort of descriptive/profiling one. I focused on the following cases:

  1. Company performance at ASX -- simply, for a Company extracted from the news articles, the graph of its ASX trading history to be produced.
  2. Companies and Topics -- by ‘relation’ here I mean co-occurrence of Topics and Companies in the same paragraph or the same article – the latter criterion obviously being more relaxed. I could then quantify the number of Topic mentions and produce the Top 5 Topics for the given Company and Date; a sketch of such a co-occurrence count follows this list. The same model has been implemented for Companies and People. That quantification is prone to some skewing: for example, if a news article is duplicated in the RSS feed (and I have seen such cases), all entities mentioned in there will get a bump to their mention score.
  3. Related Events – a subset of facts, namely BuyEvent, SellEvent or Action, that may be related to the given Company (again, on the basis of co-occurrence in the text) is picked.
  4. People Events/Relationships – here, for each Person I searched for facts extracted from the unstructured data and reconstructed the Events (Hire or Resign) these people might have participated in, or Organization relationships they might have.
  5. Topics and People – for each Topic, a list of possibly related mentioned Persons is produced; the frequency of mentions might then be quantified. Technically, it’s the same as the Related Topics exercise.
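As promised in case 2, here is a sketch of such a co-occurrence count in SQL, against the same illustrative TDP_ENTITIES table as above; again, the type literals are examples only:

-- top topics for each company, counted by paragraph co-occurrence
SELECT c.STANDARD_FORM AS COMPANY,
       t.STANDARD_FORM AS TOPIC,
       COUNT(*)        AS MENTIONS
FROM   TDP_ENTITIES c
JOIN   TDP_ENTITIES t
       ON  t.DOC_ID       = c.DOC_ID
       AND t.PARAGRAPH_ID = c.PARAGRAPH_ID
WHERE  c.TYPE = 'ORGANIZATION'
  AND  t.TYPE = 'NOUN_GROUP'
GROUP BY c.STANDARD_FORM, t.STANDARD_FORM
ORDER BY MENTIONS DESC;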

 

A couple of BusinessObjects Universes have been created to report on the processed data. In order to make the data more interactive, I have created a Dashboard (Xcelsius). It turned out to be a good pick, with one exception: the display of textual data grids in Xcelsius is limited. On the other hand, it allowed me to take the best from both worlds: the MS Excel grid and formulas, and live data connectivity.

 

Xcelsius enabled navigation between analysis cases outlined above: as soon as I selected a Company, from its ASX performance graph I could navigate to Company-related Topics, People or Events:

5-6.PNG 

 

The Company-related Events part for that selection was empty, hence not displayed. Here’s the Events part for another company, David Jones. The Events chart in the screenshot below is supposed to display a star for every day for which events are recorded, and navigate to the captured details in the box on the right:

5-7.PNG

 

More on People Events – I made it possible in Xcelsius to get a list of Events by Person or by Date:

5-8.PNG 

 

As I mentioned, these analyses belong more to data profiling of some sort than to quantitative analysis. However, parsed and cleansed unstructured data may provide a foundation for analytical scenarios, especially if integrated with ‘traditional’ structured data – forming what is hyped these days as Big Data.

 

Data volumes and the potential complexity of joins promise to be high and, in the SAP ecosystem, suggest using HANA to boost the performance of analytics: not only is in-memory data access generally faster, but also a) as of SPS4 HANA is capable of full-text search and b) it includes a few statistical libraries that can be integrated with SAP BusinessObjects Predictive Analysis or R. As for unstructured data processing, this blog series concludes here.

 

- Roman Bukarev

In this blog I will discuss using Data Services for consumer sentiment analysis of the data collected from Twitter using JSONAdapter.

 

As discussed in the first blog entry, the Twitter Search API has been accessed with the word ‘cityrail’ as the search term. For those not in the know, Cityrail is the train network of the Sydney (Australia) metropolitan area. It was a very obvious target: with its relatively big customer base, it was guaranteed to yield enough unstructured data. Over time (2-3 months) Data Services collected a few thousand tweets supposedly related to Cityrail.

 

It is worth elaborating on the data collection process. In one request-response session, the Twitter Search API returns up to 100 most recent tweets. Provided that within every 15 minutes the number of tweets about Cityrail does not exceed 100 (that assumption has been confirmed), a Data Services Batch Job running every 15 minutes can collect all such tweets almost in real time.

 

I would also like to mention a conversation I had with a colleague recently. He wondered if JSONAdapter might help to obtain a large amount of tweets for analysis instantly, and if the Twitter Streaming API might help with that. The answer to the last question seems to be negative: once you open a Stream, Twitter feeds entries there in real time, but those are current entries; they can be grouped into time slots or used in another ETL process by Data Services. The Streaming API should be used, for example, when for some topic the number of tweets exceeds, say, 100 per minute, so that the Batch Job described above may not be able to cope, even if executed every minute.

 

Otherwise, the only difference between the Stream and Search APIs is that the Stream API provides raw data, while Search applies some extra filtering/ranking by relevance to the search term. In fact, it is possible to build a Data Services job to get historical results, executing consecutive search requests to the Search API deeper and deeper into the past (by restricting the TweetID field in a request) -- the process would not be instant, though, but would probably run for 1-2 hours (consider it an Initial Load), and it is hard to tell how far into the past it can go.

The bottom line is: if there is an immediate need to analyze historical data, you may have to contact Twitter’s partner data providers. Otherwise, JSONAdapter may help to start collecting the data and implement (near) real-time analysis.

 

The further discussion will be around the following points:

  • text parsing using Data Services,
  • ‘noise’ reduction,
  • Topic-Sentiment links rebuilding,
  • the sarcasm problem (no pun intended!).

 

Text parsing itself is simpler than one might expect. A special Transform in Data Services v.4.0, called Entity Extraction, parses the input unstructured text and extracts entities and facts about them. Its output is a number of ID’ed records containing one entity/fact each, accompanied by location attributes (paragraph number, sentence number and offset) and categorized according to the rules specified in the Transform options.

 

Provided out of the box are a dictionary of categorized entities and a few rulesets for fact extraction – they are located in the folder TextAnalysis of a standard Data Services installation (availability for use is subject to the license: either full DI+DQ or DI Premium). One of those rulesets, Voice Of Customer (VOC), is used for this work. SAP allows customization of rulesets (at your own risk, of course) and implementation of user-defined dictionaries and rules. SAP has also published several blueprints, which can be used to start new text analysis developments. For this blog, a blueprint for sentiment analysis has been used; it does the following:

  1. parse the incoming unstructured text into Topic and Sentiment entities using Voice Of Customer ruleset: for example, a phrase “I like apples” would be parsed into Topic=apples and Sentiment=like (accompanied with Sentiment Type ‘StrongPositive’),
  2. process Topics data and put Topics into groups, to enable some measures, like number of Sentiments per group, and ensure topics like ‘apple’ and ‘apples’ would fall into the same group,
  3. build a SAP BusinessObjects Universe on top of resulting tables, to enable WebI reporting with slice and dice capabilities.

 

A few important changes have been made to the blueprint design to deal with Twitter data, the first one covering the issue of noise elimination. For starters, the blueprint assumed the original text to be in plain English; in reality, tweets constitute quite a lingo, full of abbreviations, expletives of all kinds, and with incomplete grammar. In the upcoming SP release for Data Services, SAP makes an attempt to keep in touch with social media and adds new entities for trend/hashtag and for handle/mention. That did not seem enough, and a custom ruleset has been implemented and added to the Entity Extraction transform to detect and mark words that should be excluded from further processing. The picture below demonstrates options of the transform, including two out of the box rulesets followed by the custom one:

3-1.PNG

                   

This way, if an entity is extracted as both e.g. Topic and a (custom-defined) Blather type of entity, it will be detected in a simple join:

3-2.PNG 
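In SQL terms the check boils down to something like the following sketch, assuming the transform output has been landed in a table TWEET_ENTITIES (table, column and type names are illustrative):

-- topics that were also flagged by the custom Blather ruleset at the same position
SELECT t.TWEET_ID,
       t.STANDARD_FORM
FROM   TWEET_ENTITIES t
JOIN   TWEET_ENTITIES b
       ON  b.TWEET_ID = t.TWEET_ID
       AND b.OFFSET   = t.OFFSET
WHERE  t.TYPE = 'Topic'
  AND  b.TYPE = 'Blather';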

 

The following screenshot displays a sample output from the transform ‘Blather’, the second column contains the entity extracted from the original tweet and categorized as an expletive:

3-3.PNG 

 

All such entities are then filtered out, thus clearing the output of most of the ‘yucks’, ‘hells’, and ‘lols’. There is one more use for those, which will be discussed a couple of paragraphs below.

 

Noise may also occur at the macro level. Tweet analysis is different from the blueprint in one more way: while the blueprint assumed the source text to be completely relevant (for example, customer feedback on the imaginary Swiftmobile, collected in a separate folder), tweets don’t have to be. Filtering tweets by a word X returns not just a customer’s view on X, but all aspects of people’s lives that somehow involve X and that people care to write about. The amount of such macro-noise in the ‘cityrail’ selection is actually small, but in a selection for, say, ‘westfield’ (a major chain of shopping malls in Australia) it becomes much bigger, for obvious reasons. A possible way to further filter the results would be to have a predefined list of topics specific to the bigger theme.

 

By default, the output of the Entity Extraction transform looks like what might be called ‘spaghetti’ data, i.e. it does not care about relationships between Topics and Sentiments. While that may be sufficient for some purposes, there may be a need to relate Topics and Sentiments. Assuming that within a sentence a related topic and sentiment are closely located, it is possible to derive Topic-Sentiment pairs from the ParentID and Offset fields of the Entity Extraction transform output:

3-4.PNG
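The derivation itself can be sketched in SQL against the same illustrative TWEET_ENTITIES table as above: pair each Topic with a Sentiment that shares the same ParentID and prefer the closest one by offset.

-- candidate Topic-Sentiment pairs within the same parent fact,
-- ordered so that the closest pair by offset comes first
SELECT t.TWEET_ID,
       t.STANDARD_FORM          AS TOPIC,
       s.STANDARD_FORM          AS SENTIMENT,
       ABS(s.OFFSET - t.OFFSET) AS DISTANCE
FROM   TWEET_ENTITIES t
JOIN   TWEET_ENTITIES s
       ON  s.TWEET_ID  = t.TWEET_ID
       AND s.PARENT_ID = t.PARENT_ID
WHERE  t.TYPE = 'Topic'
  AND  s.TYPE LIKE '%Sentiment'
ORDER BY t.TWEET_ID, t.STANDARD_FORM, DISTANCE;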

 

This design obviously ignores topics not accompanied by sentiments and vice versa, and those could be added to the reporting data model.

 

A preview of the ‘raw’ tweets in the database revealed that tweets mostly expressed negative feedback on Cityrail: people tend to complain more often than praise, and – by the way – I wrote the first draft of this blog on the day and hour when a Cityrail equipment failure caused a major suspension of service and delays. Therefore, it was surprising to see significant ‘StrongPositiveSentiment’-related numbers in the reporting. The reason was that many tweets were sarcastic and should not have been taken literally; rather, their sentiment would be the opposite of their literal meaning. So, if a tweet is deemed sarcastic, its positive Sentiment should be inverted, while negative Sentiment still counts.

 

Apparently, sarcasm detection in user feedback is a much bigger problem without a general solution. Even a human cannot detect sarcasm perfectly (73% accuracy has been reported in one study), as familiarity with the context is often required. Given Data Services’ ability to process Python scripts in User Defined Transforms, one might attempt to build sarcasm detection functionality in Data Services based on a few published approaches, using not only words but markers of emotions: emoticons, ‘blather’ words discussed above, words highlighted using ‘*’ (as in “I *love* when trains go slowly in rainy weather”) or enclosed in quotation marks, and, of course, the hashtag #sarcasm. The coincidence of negative and positive (rather, strong positive, in terms of the VOC ruleset) sentiments or, rather, emotions in one tweet is also a potential sarcasm marker. The last one, actually, can be implemented with regular Data Services ETL:

3-5.PNG
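In database terms, that last marker is roughly a self-join like the following sketch (same illustrative TWEET_ENTITIES table; the exact VOC category names should be taken from your own extraction output):

-- tweets containing both a strong positive and a negative sentiment:
-- candidates for a sarcasm flag
SELECT p.TWEET_ID
FROM   TWEET_ENTITIES p
JOIN   TWEET_ENTITIES n
       ON n.TWEET_ID = p.TWEET_ID
WHERE  p.TYPE = 'StrongPositiveSentiment'
  AND  n.TYPE LIKE '%NegativeSentiment'
GROUP BY p.TWEET_ID;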

 

The results below could have been slightly better if VOC knew that ‘con’ is a short form for ‘conditioner’, not a negative sentiment expression. Some extra customization of the dictionary may be required.

3-6.PNG 

 

Implementation of the full sarcasm detection functionality outlined above, however, seems to be a project of its own and beyond this blog.

 

Setting up reports on the analysed data was not a primary goal of this work, as the SAP blueprint’s approach of a BusinessObjects Universe was adopted. The original plan was to use SAP BW BEx reporting, but as storage of texts longer than 60 characters in BW InfoProviders is not trivial, the idea was discarded.

 

Consumer sentiment is quantified here by counting the number of feedback entries; restricted measures have been created for each feedback type. The screenshot below demonstrates how Data Quality grouped topics into larger groups using fuzzy matching logic:

3-7.PNG

 

A drilldown into a group is then possible, like below:

3-8.PNG

 

An extra characteristic, time, has been added to the reporting as an obvious choice: there is a clear correlation between the number of ‘cityrail’ tweets and the morning/afternoon transport peak hours. One might think of implementing a “rolling total negative sentiment” with a 30-minute scope and raising an alert if that value exceeds some threshold.
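Such a rolling check could be as simple as the following sketch, assuming a staging table TWEET_SENTIMENTS with a tweet timestamp column (names are illustrative, and the timestamp arithmetic has to be adapted to your database):

-- number of negative sentiments collected in the last 30 minutes;
-- raise an alert when the count exceeds a chosen threshold
SELECT COUNT(*) AS NEGATIVE_LAST_30_MIN
FROM   TWEET_SENTIMENTS
WHERE  SENTIMENT_TYPE LIKE '%Negative%'
  AND  TWEET_TS >= DATEADD(MINUTE, -30, CURRENT_TIMESTAMP);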

3-9.PNG

 

Lastly, beyond consumer sentiment analysis, another obvious idea would be to geocode tweets using either geolocation information (GPS coordinates) from the tweet metadata, or geographical names from the tweets themselves (post-processing is required for the latter, of course, to eliminate noise). The geocoded data could be made available for visualization in BusinessObjects or provided to a GIS product like ArcGIS for use in spatial analyses.

 

- Roman Bukarev

Published By : Satish Boda, SAP Global Delivery | SAP Consulting.

 

Technology: SAP Data Services, SAP Bank Analyzer.

 

Below are the end-to-end steps for writing legacy “Financial Transaction” data to SAP Bank Analyzer by calling the RFC BAPI “FIN_TRANS_MODIFY”.

 

Establish connectivity :-

Install R/3 functions on SAP BANK ANALYZER Web Application Server (WAS).

Establish RFC connectivity between SAP BANK ANALYZER and SAP DATA SERVICES.

 

Build SAP Data Services Job :-

Login to the SAP Bank Analyzer system. Launch “BAPI Explorer” via T-code “BAPI” in SAP Bank Analyzer.

 

b1.png

 

Create DATASTORE of type "SAP Applications" by entering the parameters of BANK ANALYZER system.

 

b_2.png

 

Import BAPI metadata in SAP DATA SERVICES.

B_3.png

 

The BAPI metadata looks like below with input and output parameters.

 

b_4.png

The input parameters are of two types: scalar and table parameters.

 

B_5.png

 

Create a nested NRDM structure for the source data similar to BAPI Input structure.

 

B_6.png

 

Make a “new function call” to call the BAPI.

 

b7.png

 

Map the input columns to the BAPI scalar parameters using the wizard. Use the scrolling option if needed.

 

b8.png

 

Map the input schema to the BAPI Table parameters as below.

Prerequisite:-

  • The column names of the input schema should match the names of the BAPI input table. The input columns are mapped automatically to the BAPI table columns by matching column names.
  • The data types of the input columns should be compatible with the column data types of the BAPI.

 

b9.png

 

Select the RETURN table as output schema.

 

b10.png

 

After hitting finish, the query appears as below.

 

b11.png

 

Un-nest the RETURN table to write the details to a database table.

 

b12.png

 

Run the Job :-

 

The RETURN table captures the messages returned by BAPI. This data can be analyzed to find successful entries.

 

b13.png

SAP Information Steward 4.1 is currently in the ramp-up phase. I tested the new match review feature. The match review is a tool for a data steward to review/approve potential duplicate records. It relies on the match transform executed in a batch process in Data Services.

The official documentation describes what the workflow looks like: what the job of the data steward/data reviewers is, when the Data Services job needs to run and so on. But it does not explain well how the link between Data Services and Information Steward works technically and how to set up the match process in Data Services. I will focus more on these topics in this post. It is mainly dedicated to Information Steward administrators and ETL developers.

Before starting

Some things to consider beforehand:

  • If you already have an ETL job using the match transform, you will obviously need to adapt the ETL process: it is assumed that the match transform runs in the staging area. Those matches that fall into a configurable range of match scores are meant to be reviewed/approved in Information Steward. The match review in Information Steward will pick up the match groups from a staging table that keeps the results of the match transform. After reviewing/approving, IS will update the status of the match groups in the same staging table.
  • The match review configuration requires dedicated database connections to the staging area of your ETL process. Unfortunately it only supports the databases listed below. This is odd, because it is a smaller set than the major database vendors available in the other tools: Data Services as well as data profiling and data validation rules within IS. Even ODBC connections don’t work.
    • DB2
    • SQL Server
    • Oracle
    • Sybase ASE
    • SAP Hana
    • SAP Master Data Services

IS (Data Steward or IS Administrator): setup match review configuration

You need to set up the match configuration in IS. The most important settings are:

  • Match review thresholds: the Data Services match transform calculates a match score. Only match groups having a match score within the specified threshold range are inspected by IS. The others are completely ignored by IS.
  • Match review result table: IS needs to know where the match transform in Data Services stores its intermediate results in the staging table. The following columns are required (a DDL sketch of such a table follows this list):
    • Source Record Id: simply the primary key of the input records
    • Job Run Id: every ETL process should have one. IS keeps track of the match groups that it retrieved the last time. Next time it will pick up the match groups of the next job run id.
    • Match Group Number, Match Group Rank, Match Score: these fields should be output by the match transform anyway. IS will only retrieve those match groups that have match scores within the specified range.
    • Match Review Result: this is a new column that you will need to add to the match transform result table. After match groups have been reviewed / approved IS will save the status code in that column. I could not find the codes in the manuals, but I have seen these codes in my tests so far:
      • M (master record)
      • S (subordinate record, in this context it means that the subordinate had been approved as a match in IS)
      • U (un-match: the record has been approved as a non-matching record in IS)
    • Source System: most ETL processes have such a field. Even if you have just one source system and don’t maintain such a field, you will need to provide it here. At the moment I see just one need for this field: later in the match review configuration you can set up different subsets of match groups that need to be reviewed/approved by different groups of persons. For instance, Team A would review/approve only match groups that come from source system A, and so on:
  • Job Status Table:
    • Job Status / Job Status Value: IS will check this field and value when retrieving the match groups from the staging table. So the ETL process should flag here that the records for this job run are ready for a match review.
    • Indicate Status Change: when activated, IS will update the field accordingly after all match reviews/approvals for a job run have been completed. More details on the job status table are in the next section.
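Put together, a minimal version of the match result staging table could look like the sketch below; column names and data types are illustrative and have to follow your own ETL conventions and the IS configuration:

CREATE TABLE STG_MATCH_RESULT (
    SOURCE_RECORD_ID    INTEGER      NOT NULL,   -- primary key of the input record
    SOURCE_SYSTEM       VARCHAR(20)  NOT NULL,
    JOB_RUN_ID          INTEGER      NOT NULL,
    MATCH_GROUP_NUMBER  INTEGER,
    MATCH_GROUP_RANK    VARCHAR(1),              -- M = master, S = subordinate
    MATCH_SCORE         INTEGER,
    MATCH_REVIEW_RESULT VARCHAR(1)               -- written by IS: M, S or U
);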

Data Services (ETL developer): adapt ETL process

The ETL process in DS needs to be adapted. Only match records that were not subject to a match review, or that have been approved in the match review, may be further processed in the ETL process. You might need to introduce a new match result staging table (containing the columns described above) if you did not save these results somewhere already. You definitely need to add a column for the match review result. Obviously you cannot simply truncate the staging table with each batch run.

Once the match review is completed in IS, the result code of the match review is stored in the match review result field in the staging table. Important: IS does not update this field before a decision has been approved:

The ETL developer now needs to understand which of these match groups can be further processed (meaning loaded as part of the ETL process) and which of them need to stay because they are either still under review or not subject to a review at all.
At the moment I can think of these solutions:

  1. Wait for all match reviews to be completed
    As mentioned above, IS can update the job status field once ALL match groups of ALL tasks for a match configuration are completed (this functionality needs to be configured). In this case all match groups for a batch run either have a status code from the match review (M, S, or U) or a null entry, which then means that they were not subject to a match review at all.
    This is a safe mechanism, but it has the disadvantage that a lot of match groups won’t make their way to the DWH until the very last match group has been approved in IS. Depending on the size of the DQ team this might indeed take quite a while!
  2. Process match groups as soon as they are approved:
    This is a bit more tricky. You can process all match groups that have a non-null entry in the match review result field; a filter along these lines is sketched after this list. For the others you don’t know whether they are still under review in IS or not subject to a review in IS at all. You can of course filter out the latter ones, because you know the corresponding threshold configuration in IS (see section above). But this is of course very risky: the Data Steward might change these settings and would need to coordinate this with the ETL development team. I cannot believe that this will work reliably …
    To overcome this situation I will suggest an improvement to the ramp-up team: when a match configuration is run, IS could update the match result field with a code saying that this match group is currently under review in IS.
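The filter mentioned in option 2 could look like the following sketch, based on the staging table sketched in the previous section; the job run id and the threshold values are example values only and have to match the IS match review configuration:

-- records of the current run that may be processed further:
-- either finally reviewed/approved in IS, or never subject to a review
SELECT *
FROM   STG_MATCH_RESULT
WHERE  JOB_RUN_ID = 4711                       -- current job run (example value)
  AND  ( MATCH_REVIEW_RESULT IS NOT NULL       -- review finished in IS
         OR MATCH_SCORE < 80                   -- below the review threshold range
         OR MATCH_SCORE > 95 );                -- above the range: automatic match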

Job Status Table:
This table needs to be configured in IS and helps to synchronize the ETL and match review tasks (see above). In many environments you won’t have a specific ETL job just for the matching process. Instead, the matching process will only be one dataflow or workflow in the ETL job that is responsible for loading all facts and dimensions into the DWH. In this case I recommend creating a dedicated status table and calling it something like MATCH_REVIEW_STATUS, because:

  • We are not synchronizing the complete ETL job with the match review in IS; we are just synchronizing the matching process. After the match transform in Data Services has completed, the job will go ahead with loading all match groups that are not subject to a match review. If we were using only one job status table for both the ETL job and the matching process, then the match groups could only be picked up by IS after the complete ETL job has finished. This might be a waste of time for the DQ staff, especially when the ETL job is long running.
  • The normal ETL job status table will contain a status like Completed after the job has finished successfully. If we were using only one job status table for both the ETL job and the matching process, then IS would finally overwrite this status with its own status (if this feature is configured).

IS (Data Steward or IS Administrator): running match review configuration

Running a match review configuration means that IS starts a process on the DS server that does the following:

  • collects the match groups from the match transform table that have relevant match scores
  • if it finds such match groups, it creates tasks in the work lists of the configured reviewers in IS

A match review configuration can be started manually in IS. More likely, it will be scheduled in the CMC.

IS (Data Steward, Reviewers/Approvers): review/approve the match groups

I have only tested the scenario in which a match group first needs to be reviewed (meaning the reviewer confirms a match or un-match) and this decision then needs to be approved by somebody else (when setting up the match review configuration you can specify that only a review and no approval is required).

Every reviewer/approver has a task in his work list which contains all those match groups that were retrieved by the corresponding match review configuration run:


The task will stay in the work list until all match groups have been reviewed/approved. In my test case (review as well as approval required) the Match Review Result field in the staging table is updated when the match group has been finally approved. (I assume that in the case when only a review is required, the field gets updated as soon as the reviewer has decided on match or un-match.)
In the example above, match groups 1, 10 and 11 have already been approved, meaning the approver has confirmed the decision of the reviewer. The reviewer can no longer change these match groups. In match group 12, on the other hand, the decision of the reviewer has been rejected by the approver. The reviewer will need to look into the match group again. The other match groups still need to be reviewed by the reviewers.

(By the way: in case of a rejection the process can become impractical: the approver cannot simply overwrite a decision. The approver needs to reject, then the reviewer has to change his decision, and finally the approver has to approve again. This looks a bit too rigid to me.)

Over the recent couple of years I have become convinced of what researchers have been telling us for quite a long time – the rise of unstructured data or, simply put, plain text in natural language. Some analysts go as far as predicting that the volume of unstructured data available to companies will exceed that of the ‘traditional’ structured data, which we read from database tables and are accustomed to building intelligence on. The analytical potential of unstructured data is well discussed (to mention a couple of uses, consumer sentiment analysis and entity finding), so why don’t we talk about some practicalities of text data processing, and how SAP products could be of use.

 

I would like to start this series of blogs with a discussion of how unstructured data from social media can be loaded into a data warehouse.

In this context the name JSON appears – an open, text-based data interchange format. The acronym stands for JavaScript Object Notation – the name points to the roots of the format, but the standard is actually used outside of JavaScript, with implementations of JSON for various platforms referenced at its homepage, http://json.org.

 

Have a glance at JSON code:

{
   "Plant": {
      "Colour": "green",
      "Height": {
         "Measure": "50",
         "Units": "cm"
      }
   }
}

As you might guess, it describes a green plant of 50 cm height.

Just so you might compare – in XML, a well-known standard for data interchange, one possible way to express the same would be:

<Plant>
   <Colour>green</Colour>
   <Height>
      <Measure>50</Measure>
      <Units>cm</Units>
   </Height>
</Plant>

 

On large data volumes the JSON representation is more lightweight than XML, mainly because there are no closing tags. JSON became widely used in web development (“Ajax programming” is your keyword for further reading), which brings us to the topic of those famous social web-based applications, including Twitter, Facebook, Friendfeed, or RSS feeds collected by Google Reader.

 

So this is why JSON looks like a good candidate to be the medium for unstructured data extraction into a data warehouse: it is supported by the source web applications, and it can be converted relatively easily into XML, which is understood by major ETL tools – in our case, SAP BusinessObjects Data Services.

 

Even out of the box, Data Services can integrate with a wide range of databases and data formats from various vendors, and that range can be extended by using plugins called Adapters. SAP provides a Java Software Development Kit to create such Adapters. A very helpful introduction to the Data Services Adapter SDK can be found on the SDN website; sample code is also provided as part of the product installation.

 

With that SDK I have developed JSONAdapter, which obtains data via HTTP in JSON format, converts it to XML and passes it to Data Services – so I will describe the steps to configure such an interface with Twitter, assuming the JSONAdapter has been installed and its Datastore in the Local Repository has been created.

Twitter provides a search interface where search parameters are included in a URL. For example, the URL “http://search.twitter.com/search.json?q=intelligence&rpp=50” would make Twitter return a JSON-formatted result with the 50 most recent tweets containing the word “intelligence”.

 

The first step in the ETL setup is to describe the data structure (metadata) in XML DTD form, one that Data Services understands. Unfortunately, it is not always possible to obtain a JSON Schema, used to describe the structure of a JSON document – nor is it actually standardized yet. That is why the Adapter makes an attempt to derive metadata from sample data. While the sample may be incomplete, and there may be several ways to describe the same XML data by a DTD, it is still a good start.

 

So, let’s use the JSONAdapter’s “Import By Name” functionality for metadata import.

1-1.PNG

 

..which leads to --

1-2.PNG

While actually creating a Function Call that may be used straight away, the import process also (optionally) generates a DTD file.  That file may be reviewed and adjusted, for example, to denote some XML elements as optional rather than required. “Import By Name” functionality should be then used again, but this time to import metadata from the adjusted DTD file, not a sample URL.

 

With that second import or without, the final result would be a Function returning data in Nested Relational Data Model (NRDM), the Data Services’ internal representation of XML data (screenshot below is partial):

1-3.PNG

 

In the new Function, the input parameters are always URL and EXTRA_PARAM, which together form the URL submitted to the web application; at development time such a split provides more clarity regarding the constant and variable parts of the request.

 

The Output is a nested schema that may be processed using standard Data Services tools. Let’s have a closer look at Data Flow design for that.

1-4.PNG

 

A Function cannot be placed into a Data Flow by itself; rather, it should be inserted into a Query transform, which, in turn, should be preceded by a Row_Generation transform to provide an input. Here, the Row_Generation transform generates exactly one row, and the Query transform Twitter_Search calls the JSONAdapter-based Function:

1-5.PNG

 

Greater flexibility may be added to the Function call’s input parameters using variables:

1-6.PNG

 

These variables may be globals or the data flow’s input parameters, so that is the way to parameterise Function calls from outside of Data Services.

The NRDM data received by Data Services may then be unnested and stored in table format, voilà!

 

1-7.PNG

 

From here, the social media data can be used and reused for analytical processing as part of the data warehouse.

 

In the next blogs I will discuss data acquisition by JSONAdapter from Facebook and RSS Feeds, and, using collected tweets as a model, discuss text data processing options in Data Services.


If you are using the PM Rapid Mart for SAP version 3.2, you may notice that C_ServiceConfirmation_Section_SAP takes forever to pull data. The join should be optimized for a heavily loaded ERP system.

 

The solution would be to:

 

a) Move the join conditions from the "where" tab to the "join" tab and make it explicitly an "inner" join.

b) Give the AFRU table a high join rank. For the initial load this might sound like a bad idea, but we will take care of that in the next step; on delta loads it will ensure better performance and save you time.

c) Filter the AFRU table on the ORIND field (it should be "3"). This will ensure you are only getting PM confirmations (not the PP or other ones). A SQL sketch of the intended join follows.
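In SQL terms, the intent of a) to c) is roughly the following; this is a sketch only: the actual dataflow joins more tables, and AUFK is used here merely as an illustrative join partner on the order number:

-- inner join driven by the confirmations table, with the PM filter pushed to the source
SELECT AFRU.RUECK,
       AFRU.RMZHL,
       AUFK.AUFNR
FROM   AFRU
INNER JOIN AUFK
       ON AUFK.AUFNR = AFRU.AUFNR
WHERE  AFRU.ORIND = '3';     -- PM confirmations only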

 

This decreased my load time from forever to only a few seconds. Before optimization, it was trying to parse all 1 million confirmation records for only 50 thousand PM confirmations.

 

Remember to always check the generated ABAP for optimization opportunities.

Rapid Marts have been around for many years, but with the most recent release it seems worth a revisit and an introduction for newcomers to SAP Rapid Marts. In this blog we will cover what they are, when you should consider using them, what they look like, what is required to use SAP Rapid Marts, and how to install them.

 

Basics of SAP Rapid Marts

 

SAP Rapid Marts are rapid deployment solutions for data marts based on SAP ECC subject areas. They provide extraction jobs from ECC, data models for the extracted data, relevant semantic layers and delivered reports. Each SAP Rapid Mart provides between 15 and 20 reports, and there are 12 delivered rapid marts. The following details each rapid mart and what is delivered:

 

Financial

 

 

General Ledger: posting audits and control, financial statement analysis, cost of sales accounting, profit center accounting, financial hierarchies, new and old GL support. 

 

 

Accounts Receivable: aged receivables, configurable periods, largest balances, customer payments, clearing versus invoices, dunning and credit, short term cash flow, department workload. 

 

 

Accounts Payable: open payables, configurable periods, payment history, po/invoice reconciliation, checks issued, short-term cash outflow, department workload, financial effectiveness, purchasing document assignment.

 

 

Cost Center: cost allocations, plan-actual-variance, cost center hierarchies, internal orders, fiscal and reporting periods, cost commitment analysis, cost details planning.

 

 

Fixed Assets:  asset balance and transaction detail analysis, asset trend and summary analysis, post asset depreciation analysis, asset performance, fixed asset value, public sector fund and grant center. 

 

 

Operational

 

 

Sales:  bookings, returns, lifetime sales, delivery efficiency, delivery service levels, margins, discounting, backlogs, shipment handling, sales order pricing. 

 

 

Inventory: daily trends, stock turnover/outages, batch management, inventory allocation, stock level optimization, expenditure, replenishment, make-to-order status, inventory valuation, historical daily balances.  

 

 

Purchasing: PO ranked listings, cycle times, purchasing performance, PO history, PO status, vendor performance, price analysis, exception reporting, purchasing document confirmation, PO related partner analysis. 

 

 

Manufacturing

 

 

Plant Maintenance: bad actors, life time performance, maintenance hours, planning efficiency, material consumption, damage analysis, downtime, MTBR and MTTR.

 

 

Production Planning: bill of material explosion, routing explosion, where used, capacity planning, cost analysis, yield analysis, lead time, MRP requirements, historical BOM analysis. 

 

 

Project Systems: project costs, plan-actual-variance, earned value, progress analysis, material reservations, capacity planning, project schedules, project elements.

 

 

Human Capital Management

 

 

Human Resources: employee benefits, eligibility and enrollment, salary comparisons, EEO comparable worth, employee pay history, turnover analysis, employee events, vacation reporting.

 

You can get details on each of the SAP Rapid Marts listed above in the product documentation.  This is available at http://help.sap.com/boall_en/.  Select Rapid Marts, select 4.0, then select the Rapid Mart you are interested in.  The product documentation includes details for all reports and content included with each Rapid Mart. 

 

 

SAP Rapid Marts assume that you will do some customization: adding a field to a report, customizing the report layout, extracting additional data from SAP. Depending upon the project, SAP Rapid Marts deliver 70-80% of what is needed, and the remainder is customized specifically for the project.

 

 

When you should use them

 

 

Use SAP Rapid Marts when you are looking for departmental solutions to a specific subject area and you want a quick implementation time, starting with the default reports delivered by SAP. SAP Rapid Marts do not replace the need for an enterprise data warehousing strategy or enterprise solutions such as SAP Business Warehouse.

 

 

What they look like

 

 

Watch this 5-minute video on SAP Rapid Marts. The video shows the reports in the sales rapid mart and the extraction, transformation and load jobs in SAP Data Services. The demo uses SAP HANA as the database, but Microsoft SQL Server and Oracle are also supported.

 

 

How to install them

 

 

This 3-part eLearning series walks through a complete installation of SAP Rapid Marts. The videos have voice explanations and cover all phases of the installation, including the ETL jobs in SAP Data Services, the semantic layer, and the reporting setup in the SAP BusinessObjects Business Intelligence platform. Thanks to Patrick Krivens, who did an awesome job of recording and explaining the installation!

 

 

SAP Rapid Marts installation: part 1

 

 

SAP Rapid Marts installation: part 2

 

 

SAP Rapid Marts installation: part 3

 

 

What has been your experience with SAP Rapid Marts? You may remember that SAP used to offer Rapid Marts for non-SAP data. If you need a rapid mart for other ERP applications, check out Rapid Decisions. These rapid marts still use SAP Data Services and the semantic layer in SAP BusinessObjects Business Intelligence, but provide the extraction and rapid mart for non-SAP ERP applications.
