Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
Former Member

HANA Text Analysis with Custom Dictionaries


Prerequisites:

  • How to create a developer workspace in HANA Studio
  • How to create and share a project in HANA Studio
  • How to run HANA Text Analysis on a table

With the release of SAP HANA SPS07, many new features became available. One of the most important is support for custom dictionaries in Text Analysis. By default, HANA ships with three configurations for text analysis:

  • Core Extraction
  • Linguistic Analysis
  • Voice of Customer

One of the main issues you may come across while working with HANA Text Analysis is defining your own custom configuration for the Text Analysis engine. The following sections show how to create your own custom dictionary, so that you can get more out of HANA's text analysis capabilities.

Scenario:

Assume that your company manufactures laptops and has recently launched a new laptop series. You want to know whether the consumers who have bought the machines are facing any problems. Those consumers will certainly be tweeting, posting, and blogging about the product on social media.

You are now harvesting massive amounts of unstructured data from social media, blogs, forums, e-mails, and other channels. The main motivation is to gauge customer perception of the products (laptops): you want early warning of product defects and shortfalls, and you want to hear channel- and market-specific customer concerns and delights.

With HANA SPS07 we can create custom dictionaries which can be used to detect word/term/phrase occurrences which may not be detected while we run Text Analysis without any custom dictionary.

Follow these steps to get started with custom dictionaries:

1. Create the source XML file

I have created some dummy data in a table with “ID” and “TEXT” columns.

User_tweets table structure:

| ID | TEXT |
|----|------|
| 1  | The #lenovo T540 laptop's latch are very loose. |
| 2  | my laptop's mic is too bad. It can't record any voice. will not be buying #lenovo in near future |
| 3  | LCD display is gone for my T520. Customer care too is pathetic. |
| 4  | T530 performance is awesome. Only problem I am facing is with microphone. 😞 |
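For reference, the dummy table above can be created and populated with SQL along these lines (the table name comes from the post; the column types and lengths are my assumptions):

```sql
-- Sketch of the sample table; NVARCHAR length is an assumption.
CREATE COLUMN TABLE User_tweets (
   ID   INTEGER PRIMARY KEY,
   TEXT NVARCHAR(500)
);

INSERT INTO User_tweets VALUES (1, 'The #lenovo T540 laptop''s latch are very loose.');
INSERT INTO User_tweets VALUES (2, 'my laptop''s mic is too bad. It can''t record any voice. will not be buying #lenovo in near future');
INSERT INTO User_tweets VALUES (3, 'LCD display is gone for my T520. Customer care too is pathetic.');
INSERT INTO User_tweets VALUES (4, 'T530 performance is awesome. Only problem I am facing is with microphone.');
```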

The mycustomdict.xml file has the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary name="LAPTOP_COMPONENTS">
   <entity_category name="Internal Parts">
      <entity_name standard_form="Inverter Board">
         <variant name="InverterBoard"/>
         <variant name="InvertrBoard"/>
      </entity_name>
      <entity_name standard_form="LCD Cable">
         <variant name="lcdcable"/>
         <variant name="cable lcd"/>
      </entity_name>
   </entity_category>
</dictionary>

Please refer to the guide http://help.sap.com/hana/SAP_HANA_Text_Analysis_Extraction_Customization_Guide_en.pdf to learn more about creating the source XML file for custom dictionaries.

Using the custom dictionary above, the HANA text analysis engine will detect "Inverter Board" and "LCD Cable" (and their listed variants) as entities of the category "Internal Parts".

2. Compiling the mycustomdict.xml file to a .nc file

First, copy the XML file to your HANA machine using an FTP or SCP client.

I have copied mycustomdict.xml to the /home/root/customDict folder.

You can find the dictionary compiler "tf-ncc" in your HANA installation at:

/<INSTALLATION_DIR>/<SID>/HDB<INSTANCE_NO>/exe/dat_bin_dir

Text analysis configuration files can be found at the following path:

/<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang

Run the compiler on the source mycustomdict.xml file:

export LD_LIBRARY_PATH=/<INSTALLATION_DIR>/<SID>/SYS/exe/hdb:/<INSTALLATION_DIR>/<SID>/SYS/exe/hdb/dat_bin_dir

/<INSTALLATION_DIR>/<SID>/HDB<INSTANCE_NO>/exe/hdb/dat_bin_dir/tf-ncc -d /<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang -o /<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang/mycustomdict.nc /home/root/customDict/mycustomdict.xml

After executing the above command, a file named mycustomdict.nc will be generated in the /<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang folder; the text analysis engine will use this file later.


3. Create custom HANA Text Analysis configuration file

After compiling the XML file, we need to create a custom text analysis configuration that refers to the compiled .nc file created in the previous step. The configuration file specifies the text analysis processing steps to be performed, and the options to use for each step.

In HANA Studio, create a workspace, then create and share a project. Under this project, create a new file with the extension ".hdbtextconfig". Copy into it the contents of one of the predefined configurations delivered by SAP (listed above); they are located in the HANA repository package "sap.hana.ta.config". For this scenario, I have copied the contents of the configuration file "EXTRACTION_CORE_VOICEOFCUSTOMER".

Creating a Text Analysis Configuration: Section 10.1.3.2.1 of the HANA developer guide SPS07: http://help.sap.com/hana/SAP_HANA_Developer_Guide_en.pdf

After copying, modify the "Dictionaries" property under the configuration node named "SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF" and add a <string-list-value> child node:

<string-list-value>mycustomdict.nc</string-list-value>
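For orientation, the relevant fragment of the .hdbtextconfig file looks roughly like the sketch below. The configuration node name and the "Dictionaries" property come from the predefined configurations; the exact surrounding structure and attributes may differ slightly in your copied SPS07 file, so treat this only as a guide to where the new line goes:

```xml
<configuration name="SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF">
   <property name="Dictionaries" type="string-list">
      <!-- Add the compiled custom dictionary here -->
      <string-list-value>mycustomdict.nc</string-list-value>
   </property>
</configuration>
```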


Now save, commit, and activate the .hdbtextconfig file. After activation, we can run the Text Analysis engine using the custom configuration. To do so, run the following SQL command:

CREATE FULLTEXT INDEX <indexname> ON <tablename>(<column>) CONFIGURATION '<custom_configuration_file>'
TEXT ANALYSIS ON;
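For this scenario, a concrete call might look like the following. The index name TWEETS_IDX is hypothetical, and I am assuming the custom configuration is referenced by its repository package path and file name (here a made-up project package "myproject" and file "myconfig.hdbtextconfig"); substitute your own package and configuration names:

```sql
-- Index the TEXT column of User_tweets with the custom configuration.
CREATE FULLTEXT INDEX TWEETS_IDX ON User_tweets("TEXT")
CONFIGURATION 'myproject::myconfig'
TEXT ANALYSIS ON;
```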

The text analysis results are written to a table named "$TA_<indexname>", created in the same schema as the source table. For our scenario table, the Text Analysis engine identifies terms such as "LCD", "latch", and "mic" as internal parts. These results can then be used for data mining or analytical purposes.
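To inspect the extracted entities, you can query the generated $TA_ table directly. TA_TOKEN, TA_TYPE, and TA_NORMALIZED are standard columns of the text analysis result table; the index name TWEETS_IDX below is a hypothetical example:

```sql
-- List all tokens tagged with our custom "Internal Parts" category.
SELECT TA_TOKEN, TA_NORMALIZED, TA_TYPE
FROM "$TA_TWEETS_IDX"
WHERE TA_TYPE = 'Internal Parts';
```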
