Using Custom Dictionaries with Text Analysis in HA...

Ian_Henry · ‎03-25-2015

Having done some work with the unstructured text engine within the SAP HANA Platform I wanted to capture and share how to do this. For this example I have used Twitter data looking at Formula One hashtags and F1 accounts.

The linguistic engine is just one of the engines in the HANA Platform but is not often talked about but it is very easy to use to extract structured information from unstructured text. This text could be held in a simple character field or it could be within a binary document, we support many binary formats including TXT, RTF, HTML, PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX and MSG. The official Text Analysis, Text Search and Text Mining documentation can be found here and here.

To check the supported binary file formats you can query the system table M_TEXT_ANALYSIS_MIME_TYPES.

SELECT * FROM "PUBLIC"."M_TEXT_ANALYSIS_MIME_TYPES"

For this example I have used the Text Analysis (TA) engine straight out of the box and yes it works, the results were OK, but as you would expect with any Industry, Line of Business or sport F1 has its own terms, the drivers and teams(constructors) being prime example of these so I wanted to create a custom dictionary to improve the understanding of these.

There's a good blog that shows the old way (HANA SP7) of doing this SAP HANA Custom Dictionary

With HANA 1.0 SP9, this is even easier, there's only really 3 steps.

Create the XML dictionary

Reference the dictionary in a TA configuration file

Call the Text Analysis Configuration with SQL

1.1 HANA Web Development Workbench

Go to the HANA Web-based Development Workbench Editor, for me this is at

http://ukhana.mo.sap.corp:8001/sap/hana/ide/editor/

For others it would be

http(s)://<HANA HOSTNAME>:80<HANA INSTANCE>/sap/hana/ide/editor/

1.2 Create Dictionary File

Create a New "File" for the dictionary, I used the path sap.hana.ta.config

The file needs to end in .hdbtextdict, for later versions of HANA the file should reside within the content path sap.hana.ta.dict

1.3 Create the dictionary

Here's a snippet of mine, I have attached the full XML file below. Check your XML file opens in a web browser, also be careful of the double quotes " - sometimes you may find the "smart quotes" like “ and ” which are not smart for XML files!

<?xml version="1.0" encoding="UTF-8"?>

<dictionary xmlns="http://www.sap.com/ta/4.0">

   <entity_category name="F1 Driver">

      <entity_name standard_form="Lewis Hamilton">

            <variant name ="Lewis"/>

            <variant name ="Hamilton"/>

            <variant name ="HAM"/>

            <variant name ="@LewisHamilton"/>

            <variant name ="#TeamLH"/>

            <variant name ="LewisHamilton"/>

      </entity_name>

      <entity_name standard_form="Jenson Button">

            <variant name ="Jenson"/>

            <variant name ="Button"/>

            <variant name ="#JB22"/>

            <variant name ="@JensonButton"/>

            <variant name ="JensonButton"/>

      </entity_name>

      <entity_name standard_form="Kimi Raikkonen">

            <variant name ="Kimi"/>

            <variant name ="Raikkonen"/>

            <variant name ="Kimi Räikkönen"/>

            <variant name ="Räikkönen"/>

            <variant name ="Ferrari Kimi Raikkonen"/>

      </entity_name>

     </entity_category>

</dictionary>

Full details of the dictionary syntax can be found here

Below you can see the full Dictionary XML file inside the HANA Web-based Development Workbench.

Once you click Save you should see in the black console as below that it gets saved and activated (compiled auto-magically) immediately.

2.1 Create configuration file

The easiest way is to chose one if the other .hdbtextconfig file that you see. Whichever one is the most appropriate.

This can be done easily - Right click copy and paste. I chose the VOICEOFCUSTOMER one as I was initially using some Twitter data for the unstructured analysis. Give the new file a sensible name, remember to keep the .hdbtextconfig extension.

2.2 Edit Configuration file

Open your newly copied file and scroll to the bottom.

Add, an entry that references your Dictionary file you created above for me I added

<string-list-value>sap.hana.ta.dict::F1.hdbtextdict</string-list-value>

2.3 Save configuration file

You should see it also activates at the same time, which will check for any errors too.

3.1 Database Table

You can now use your new configuration. I loaded some Twitter data using the HANA Data Provisioning Agent that's also part of HANA SP9. I created a simple table F1-TWEETS with 3 columns, It must have a primary key and also a text field in either an NVARCHAR, VARCHAR, BLOB or CLOB

INSERT INTO "F1-TWEETS" (

SELECT "Id", "ScreenName", "Tweet"  FROM "F1"."F1HANA-Twitter_Status");

3.2 Create FullText index with the new configuration

CREATE FULLTEXT INDEX "F1-TWEETS-FTI" ON "F1"."F1-TWEETS"("Tweet")

CONFIGURATION 'F1'

FAST PREPROCESS OFF

TEXT ANALYSIS ON;

This creates a new table in my case $TA_F1-TWEETS-FTI which contains the structured version of the unstructured data.

When you use a dictionary the TA_NORMALIZED column is populated enhanced definitions that you have defined in the custom dictionary.

3.3 Visualisation of the Restults

To illustrate the difference that the dictionary makes compare the 2 visualisations that I created with Lumira using a calculation view against the $TA_F1-TWEETS-FTI table.

Without the Dictionary - TA_TOKEN

With the Dictionary - TA_NORMALIZED

For me, it is clear that there's enormous benefit to using the Text Analysis to turn unstructured data into meaning information and when you combine that with the custom dictionaries you have a very powerful tool.

Using Custom Dictionaries with Text Analysis in HANA SPS9, for Formula One Twitter Analysis

Get Your SAP HANA Idea Incubator Badge Today!

SCN Mission - SAP HANA Quiz Challenge is now retired

Share your #HANAStory and Win