Having done some work with the unstructured text engine within the SAP HANA Platform I wanted to capture and share how to do this. For this example I have used Twitter data looking at Formula One hashtags and F1 accounts.
The linguistic engine is just one of the engines in the HANA Platform but is not often talked about but it is very easy to use to extract structured information from unstructured text. This text could be held in a simple character field or it could be within a binary document, we support many binary formats including TXT, RTF, HTML, PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX and MSG. The official Text Analysis, Text Search and Text Mining documentation can be found here and here.
To check the supported binary file formats you can query the system table M_TEXT_ANALYSIS_MIME_TYPES.
SELECT * FROM "PUBLIC"."M_TEXT_ANALYSIS_MIME_TYPES"
For this example I have used the Text Analysis (TA) engine straight out of the box and yes it works, the results were OK, but as you would expect with any Industry, Line of Business or sport F1 has its own terms, the drivers and teams(constructors) being prime example of these so I wanted to create a custom dictionary to improve the understanding of these.
There's a good blog that shows the old way (HANA SP7) of doing this SAP HANA Custom Dictionary
With HANA 1.0 SP9, this is even easier, there's only really 3 steps.
- Create the XML dictionary
- Reference the dictionary in a TA configuration file
- Call the Text Analysis Configuration with SQL
1.1 HANA Web Development Workbench
Go to the HANA Web-based Development Workbench Editor, for me this is at
http://ukhana.mo.sap.corp:8001/sap/hana/ide/editor/
For others it would be
http(s)://<HANA HOSTNAME>:80<HANA INSTANCE>/sap/hana/ide/editor/
1.2 Create Dictionary File
Create a New "File" for the dictionary, I used the path sap.hana.ta.config
The file needs to end in .hdbtextdict, for later versions of HANA the file should reside within the content path sap.hana.ta.dict
1.3 Create the dictionary
Here's a snippet of mine, I have attached the full XML file below. Check your XML file opens in a web browser, also be careful of the double quotes " - sometimes you may find the "smart quotes" like “ and ” which are not smart for XML files!
<?xml version="1.0" encoding="UTF-8"?>
<dictionary xmlns="http://www.sap.com/ta/4.0">
<entity_category name="F1 Driver">
<entity_name standard_form="Lewis Hamilton">
<variant name ="Lewis"/>
<variant name ="Hamilton"/>
<variant name ="HAM"/>
<variant name ="@LewisHamilton"/>
<variant name ="#TeamLH"/>
<variant name ="LewisHamilton"/>
</entity_name>
<entity_name standard_form="Jenson Button">
<variant name ="Jenson"/>
<variant name ="Button"/>
<variant name ="#JB22"/>
<variant name ="@JensonButton"/>
<variant name ="JensonButton"/>
</entity_name>
<entity_name standard_form="Kimi Raikkonen">
<variant name ="Kimi"/>
<variant name ="Raikkonen"/>
<variant name ="Kimi Räikkönen"/>
<variant name ="Räikkönen"/>
<variant name ="Ferrari Kimi Raikkonen"/>
</entity_name>
</entity_category>
</dictionary>
Full details of the dictionary syntax can be found here
Below you can see the full Dictionary XML file inside the HANA Web-based Development Workbench.
Once you click Save you should see in the black console as below that it gets saved and activated (compiled auto-magically) immediately.
2.1 Create configuration file
The easiest way is to chose one if the other .hdbtextconfig file that you see. Whichever one is the most appropriate.
This can be done easily - Right click copy and paste. I chose the VOICEOFCUSTOMER one as I was initially using some Twitter data for the unstructured analysis. Give the new file a sensible name, remember to keep the .hdbtextconfig extension.
2.2 Edit Configuration file
Open your newly copied file and scroll to the bottom.
Add, an entry that references your Dictionary file you created above for me I added
<string-list-value>sap.hana.ta.dict::F1.hdbtextdict</string-list-value>
2.3 Save configuration file
You should see it also activates at the same time, which will check for any errors too.
3.1 Database Table
You can now use your new configuration. I loaded some Twitter data using the HANA Data Provisioning Agent that's also part of HANA SP9. I created a simple table F1-TWEETS with 3 columns, It must have a primary key and also a text field in either an NVARCHAR, VARCHAR, BLOB or CLOB
INSERT INTO "F1-TWEETS" (
SELECT "Id", "ScreenName", "Tweet" FROM "F1"."F1HANA-Twitter_Status");
3.2 Create FullText index with the new configuration
CREATE FULLTEXT INDEX "F1-TWEETS-FTI" ON "F1"."F1-TWEETS"("Tweet")
CONFIGURATION 'F1'
FAST PREPROCESS OFF
TEXT ANALYSIS ON;
This creates a new table in my case $TA_F1-TWEETS-FTI which contains the structured version of the unstructured data.
When you use a dictionary the TA_NORMALIZED column is populated enhanced definitions that you have defined in the custom dictionary.
3.3 Visualisation of the Restults
To illustrate the difference that the dictionary makes compare the 2 visualisations that I created with Lumira using a calculation view against the $TA_F1-TWEETS-FTI table.
Without the Dictionary - TA_TOKEN
With the Dictionary - TA_NORMALIZED
For me, it is clear that there's enormous benefit to using the Text Analysis to turn unstructured data into meaning information and when you combine that with the custom dictionaries you have a very powerful tool.