Most hardcore HANA nuts find themselves at a Linux shell console a lot of the time. Data quality issues are endemic when dealing with data from external sources, and we often end up with 50-100GB data files on a HANA appliance that need cleaning up. Because HANA runs on Linux, it's often convenient to do this on the appliance itself.
So I thought I'd reach out to the world's leading HANA experts, the HANA Distinguished Engineers, to see what their hints and tips are. But first, a few of my own.
Get the files on the HANA appliance and fix them there
It's not supported in a productive instance, but for development, test, pilots and PoCs we often cheat. So copy the files onto the HANA appliance data drive or log drive and crunch them there. Just make sure you don't run out of disk space, because bad things happen if you do. You can take advantage of HANA's fast local disks and immense CPU/RAM power.
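Checking free space before you copy is cheap insurance. A minimal sketch (the /hana/data mount point is an assumption, adjust it to your appliance's layout):

```shell
# Check free space on the target volume before copying large files in.
# /hana/data is a typical HANA layout but an assumption here; the
# fallback to / just makes this runnable on any box.
df -h /hana/data 2>/dev/null || df -h /
```

Only copy the file if the free space is comfortably larger than the file itself, remembering that your cleaned-up output will need room too.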
HANA only supports tab, comma, semi-colon and pipe delimited files
It's a bit inconvenient but HANA doesn't support other delimiters, even when you specify the delimiter in the load control file. So if you have files that are delimited some other way, you need to convert.
Think very carefully about what tool you use
If you want to do a simple substitution of Ctrl-A (Start of Header) delimited files to CSV then there are several ways to do it:
tr '\01' ',' < input.dat > output.csv
sed -e 's/\x01/,/g' input.dat > output.csv
awk '{gsub("\001",",")}1' input.dat > output.csv
It might surprise you that the performance is very different, even though all three run in a single thread: in my test, awk ran in 22 seconds, tr in 32 seconds and sed in 57 seconds.
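If you want to reproduce these timings on your own data, plain time(1) is enough. A sketch (the generated sample file stands in for whatever large file you are converting; output goes to /dev/null so only the conversion cost is measured):

```shell
# Build a small sample file of Ctrl-A delimited records (in practice
# this would be your real 50-100GB file), then time each converter.
yes "$(printf 'aaa\001bbb')" | head -n 100000 > /tmp/input.dat

time tr '\001' ',' < /tmp/input.dat > /dev/null
time sed -e 's/\x01/,/g' /tmp/input.dat > /dev/null
time awk '{gsub("\001",",")}1' /tmp/input.dat > /dev/null
```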
One interesting trick is to avoid doing all the work in a single sed with the /g flag. /g is always expensive because it replaces every occurrence on every line. Splitting the job into a pipeline lets two sed processes share the work across two CPU cores:
sed -e 's/\x01/,/' input.dat | sed -e 's/\x01/,/g' > output.csv
The first sed handles the first occurrence on each line, the second mops up the rest, and the pipeline actually runs about 15% faster.
Useless use of cat
cat is the friend of the lazy UNIX admin, and it generally just adds overhead. For example, `cat input.dat | tr '\01' ','` does exactly the same thing as `tr '\01' ',' < input.dat`, but with an extra process and an extra pipe.
There is even a page dedicated to the useless use of cat. Awesome.
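A quick way to convince yourself the cat buys nothing, using the tr conversion from above as a minimal sketch:

```shell
# Both pipelines produce byte-identical output; the cat only adds an
# extra process and an extra pipe.
printf 'a\001b\001c\n' > /tmp/uuoc_demo.dat
cat /tmp/uuoc_demo.dat | tr '\001' ',' > /tmp/with_cat.csv
tr '\001' ',' < /tmp/uuoc_demo.dat > /tmp/without_cat.csv
cmp /tmp/with_cat.csv /tmp/without_cat.csv && echo "identical"
# prints: identical
```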
Learn awk
My biggest piece of advice is to learn how to use awk, especially if you are processing fixed format files into CSV. Here is an example of an awk file to turn a fixed-format file with a datestamp and a text element into a CSV. This runs incredibly quickly!
#!/bin/bash
# Requires GNU awk: FIELDWIDTHS is a gawk extension. The datestamp
# (YYYYMMDD) is read from characters 9-16 of the input file name.
awk -v DATESTAMP="$(date -d "${1:8:8}" +%F)" 'BEGIN{FIELDWIDTHS="2 2 2 3 10"}{
  HOUR=$1
  MIN=$2
  SEC=$3
  MSEC=$4
  TEXT=$5
  printf("%s %s:%s:%s.%s,%s\n", DATESTAMP, HOUR, MIN, SEC, MSEC, TEXT)
}' "$1" > "$1.csv"
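A caveat: FIELDWIDTHS only works in GNU awk. If you are stuck with a plain POSIX awk, substr does the same job. A sketch matching the 2 2 2 3 10 layout above (the sample record and datestamp are made up for illustration):

```shell
# Split a fixed-width HHMMSSmmmTEXT record with substr instead of
# FIELDWIDTHS; this works in any POSIX awk.
printf '143059123heartbeats\n' |
awk -v DATESTAMP=2024-01-01 '{
  printf("%s %s:%s:%s.%s,%s\n", DATESTAMP,
         substr($0,1,2), substr($0,3,2), substr($0,5,2),
         substr($0,7,3), substr($0,10,10))
}'
# prints: 2024-01-01 14:30:59.123,heartbeats
```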
Download a few tools to help you on your way
I put a short document together that details my favorite HANA tools. I hope you enjoy!
Parallelize
If you use pigz, you will be parallelizing already. One of my favorite tricks is to write awk scripts and then run them on 20 files at once. Because these scripts are usually CPU bound, each will take up a single thread. With the example above we would do
ls *.fixedfile | parallel ./awk.sh
Go and make a cup of tea, and by the time you're back all your files will be converted.
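If GNU parallel isn't installed on the box, xargs -P gives you much the same effect. A sketch (the stand-in files and `echo` are just for demonstration; swap in your real ./awk.sh and *.fixedfile pattern):

```shell
# xargs -P is the poor man's GNU parallel: run up to 20 jobs at once,
# one file argument per invocation.
cd "$(mktemp -d)"
touch a.fixedfile b.fixedfile c.fixedfile
printf '%s\n' *.fixedfile | xargs -n 1 -P 20 echo
# In real use: printf '%s\n' *.fixedfile | xargs -n 1 -P 20 ./awk.sh
# pigz -p 20 bigfile.csv   # multi-threaded compression, if needed
```

Note that with -P the jobs finish in whatever order the scheduler picks, which is fine here since each file is independent.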
Go and load your data into HANA in parallel
I put together a Best Practices for HANA Data Loads document a long time ago. I think it's still pretty relevant today, and will help you get data into HANA fast.
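One of the simplest levers for parallel loads is the IMPORT FROM CSV FILE statement itself, which supports multi-threaded, batched inserts. A hedged sketch (schema, table, path and tuning numbers are all made-up placeholders; check the exact syntax against your HANA revision):

```sql
-- Multi-threaded, batched CSV load; THREADS and BATCH values are
-- illustrative only -- tune them for your appliance.
IMPORT FROM CSV FILE '/hana/data/work/output.csv'
INTO "MYSCHEMA"."MYTABLE"
WITH RECORD DELIMITED BY '\n'
     FIELD DELIMITED BY ','
     THREADS 8
     BATCH 10000;
```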
Final Words
None of this is good practice in a real productive HANA environment and you will probably have an ETL tool like SAP Data Services or a streaming tool like SAP Event Stream Processor. But when it comes to throwing together demos fast, these tips may get you out of trouble. I hope you enjoy!