HANA Text Analysis with Custom Dictionaries
Prerequisites:
With the release of HANA SPS07, a lot of new features became available. One of the main ones is support for custom dictionaries in Text Analysis. By default, HANA comes with three predefined configurations for text analysis.
A common requirement when working with HANA Text Analysis is defining your own custom configuration for the Text Analysis engine. The following sections show how to create your own custom dictionary so you can get more out of HANA's text analysis capabilities.
Scenario:
Assume that your company manufactures laptops and has recently launched a new laptop series. You want to know whether the consumers who have bought the machines are facing any problems. Those consumers will certainly be tweeting, posting, and blogging about the product on social media.
You are now harvesting massive amounts of unstructured data from social media, blogs, forums, e-mails, and other channels. The main motivation is to gauge customer perception of the products (laptops): you may want early warning of product defects and shortfalls, and to listen to channel- and market-specific customer concerns and delights.
With HANA SPS07 we can create custom dictionaries to detect word, term, and phrase occurrences that would not be detected when running Text Analysis without a custom dictionary.
Follow these steps to get started with custom dictionaries:
1. Create the source XML file
I have created some dummy data in a table with “ID” and “TEXT” columns.
User_tweets table structure:
ID | TEXT |
---|---|
1 | The #lenovo T540 laptop's latch are very loose. |
2 | my laptop's mic is too bad. It can't record any voice. will not be buying #lenovo in near future |
3 | LCD display is gone for my T520. Customer care too is pathetic. |
4 | T530 performance is awesome. Only problem I am facing is with microphone. 😞 |
The mycustomdict.xml file has the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<dictionary name="LAPTOP_COMPONENTS">
  <entity_category name="Internal Parts">
    <entity_name standard_form="Inverter Board">
      <variant name="InverterBoard"/>
      <variant name="InvertrBoard"/>
    </entity_name>
    <entity_name standard_form="LCD Cable">
      <variant name="lcdcable"/>
      <variant name="cable lcd"/>
    </entity_name>
  </entity_category>
</dictionary>
Please refer to the following guide to learn more about creating the source XML file for custom dictionaries: http://help.sap.com/hana/SAP_HANA_Text_Analysis_Extraction_Customization_Guide_en.pdf
Using the above custom dictionary, the HANA text analysis engine will detect “Inverter Board” and “LCD Cable” (and their variants) as entities of type Internal Parts of a laptop.
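Before compiling the dictionary, it is worth sanity-checking the XML. The sketch below (a hypothetical helper, not part of the HANA tooling) parses a dictionary file with Python's standard library and summarizes its standard forms and variants, following the structure shown above:

```python
import xml.etree.ElementTree as ET

def summarize_dictionary(xml_text):
    """Map each standard_form to its list of variant names."""
    root = ET.fromstring(xml_text)
    assert root.tag == "dictionary", "root element must be <dictionary>"
    summary = {}
    for category in root.findall("entity_category"):
        for entity in category.findall("entity_name"):
            variants = [v.get("name") for v in entity.findall("variant")]
            summary[entity.get("standard_form")] = variants
    return summary

# Shortened copy of the dictionary from the post, used as a self-contained sample.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<dictionary name="LAPTOP_COMPONENTS">
  <entity_category name="Internal Parts">
    <entity_name standard_form="Inverter Board">
      <variant name="InverterBoard"/>
      <variant name="InvertrBoard"/>
    </entity_name>
  </entity_category>
</dictionary>"""

print(summarize_dictionary(sample))
# {'Inverter Board': ['InverterBoard', 'InvertrBoard']}
```

A check like this catches malformed XML before you copy the file to the server, which is cheaper than debugging a failed tf-ncc run.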
2. Compiling the mycustomdict.xml file to a .nc file
First, copy the XML file to your HANA machine using an FTP/SCP client.
I have copied mycustomdict.xml to the /home/root/customDict folder.
You can find the dictionary compiler “tf-ncc” in your HANA installation at:
/<INSTALLATION_DIR>/<SID>/HDB<INSTANCE_NO>/exe/dat_bin_dir
Text analysis configuration files can be found at the following path:
/<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang
Run the compiler on the source mycustomdict.xml file:
export LD_LIBRARY_PATH=/<INSTALLATION_DIR>/<SID>/SYS/exe/hdb:/<INSTALLATION_DIR>/<SID>/SYS/exe/hdb/dat_bin_dir
/<INSTALLATION_DIR>/<SID>/HDB<INSTANCE_NO>/exe/dat_bin_dir/tf-ncc -d /<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang -o /<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang/mycustomdict.nc /home/root/customDict/mycustomdict.xml
After executing the above command, a file named mycustomdict.nc will be generated in the
/<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang folder, which the text analysis engine will use later.
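The long command line above is easier to keep straight if you assemble it from its parts. The Python sketch below does that; the -d and -o flags are the ones used in the command above, while the concrete values ("/hana/shared", SID "HDB", instance "00") are placeholder assumptions you must adapt to your installation:

```python
def build_tfncc_command(install_dir, sid, instance_no, source_xml, dict_name):
    """Assemble the tf-ncc invocation as a list suitable for subprocess.run."""
    exe_dir = f"{install_dir}/{sid}/HDB{instance_no}/exe/dat_bin_dir"
    lexicon_dir = f"{install_dir}/{sid}/SYS/global/hdb/custom/config/lexicon/lang"
    return [
        f"{exe_dir}/tf-ncc",
        "-d", lexicon_dir,                      # directory holding the dictionaries
        "-o", f"{lexicon_dir}/{dict_name}.nc",  # compiled output file
        source_xml,                             # source XML to compile
    ]

cmd = build_tfncc_command("/hana/shared", "HDB", "00",
                          "/home/root/customDict/mycustomdict.xml", "mycustomdict")
print(" ".join(cmd))
# On the HANA host (with LD_LIBRARY_PATH exported as shown above) you would
# then run it with: subprocess.run(cmd, check=True)
```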
3. Create custom HANA Text Analysis configuration file
After compiling the XML file, we need to create a custom text analysis configuration that refers to the compiled .nc file from the previous step. The configuration file specifies the text analysis processing steps to be performed and the options to use for each step.
In HANA Studio, create a workspace, then create and share a project. Under this project, create a new file with the extension “.hdbtextconfig”. Copy into it the contents of one of the predefined configurations delivered by SAP, as mentioned above; they are located in the HANA repository package “sap.hana.ta.config”. For this scenario, I have copied the contents of the configuration file “EXTRACTION_CORE_VOICEOFCUSTOMER”.
Creating a Text Analysis Configuration: Section 10.1.3.2.1 of the HANA developer guide SPS07: http://help.sap.com/hana/SAP_HANA_Developer_Guide_en.pdf
After copying, modify the “Dictionaries” node under the configuration node named “SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF” and add a <string-list-value> child node:
<string-list-value>mycustomdict.nc</string-list-value>
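If you script this edit rather than doing it by hand in HANA Studio, Python's ElementTree is enough. The fragment below is a simplified stand-in for the copied configuration (the real EXTRACTION_CORE_VOICEOFCUSTOMER file has more content, and its exact layout should be taken from the file you copied); the sketch only shows inserting the new <string-list-value> entry:

```python
import xml.etree.ElementTree as ET

# Simplified, assumed fragment of the copied .hdbtextconfig: a "Dictionaries"
# property under the ExtractionAnalyzer.TF configuration node.
config = ET.fromstring(
    '<configuration name="SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF">'
    '<property name="Dictionaries" type="string-list">'
    '</property>'
    '</configuration>'
)

# Append the compiled dictionary to the Dictionaries list.
dictionaries = config.find('.//property[@name="Dictionaries"]')
entry = ET.SubElement(dictionaries, "string-list-value")
entry.text = "mycustomdict.nc"

print(ET.tostring(config, encoding="unicode"))
```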
Now save, commit, and activate the .hdbtextconfig file. After activation, the Text Analysis engine can be run with the custom configuration. To do so, run the following SQL command:
CREATE FULLTEXT INDEX <indexname> ON <tablename> CONFIGURATION '<custom_configuration_file>'
TEXT ANALYSIS ON;
The fulltext index table will be created as “$TA_<indexname>”. For our scenario table, the Text Analysis engine has identified “LCD”, “latch”, and “Mic” as internal parts. These results can then be used for data mining or analytical purposes.
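To read those results back programmatically, you query the generated $TA_ table (its columns include TA_TOKEN and TA_TYPE). The sketch below builds such a query; the schema, index name, and the hdbcli connection details are assumptions to adapt, and the helper function is hypothetical:

```python
def build_ta_query(schema, index_name, entity_type):
    """SQL to list tokens of one entity type from a $TA_<indexname> table."""
    return (
        f'SELECT "TA_TOKEN", "TA_TYPE" '
        f'FROM "{schema}"."$TA_{index_name}" '
        f"WHERE \"TA_TYPE\" = '{entity_type}'"
    )

sql = build_ta_query("MYSCHEMA", "IDX_TWEETS", "Internal Parts")
print(sql)

# Against a real system (requires SAP's hdbcli package and valid credentials):
# from hdbcli import dbapi
# conn = dbapi.connect(address="hanahost", port=30015, user="USER", password="...")
# for token, ta_type in conn.cursor().execute(sql):
#     print(token, ta_type)
```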