
Most hardcore HANA nuts find themselves at a Linux shell console a lot of the time. Data quality issues are endemic when dealing with data from external sources, and we often end up with 50-100GB data files on a HANA appliance that need cleaning up. Because HANA runs on Linux, it's often convenient to do the clean-up directly on the appliance.

So I thought I'd reach out to the world's leading HANA experts, the HANA Distinguished Engineers, to see what their hints and tips are. But first, a few of my own.

Get the files on the HANA appliance and fix them there

It's not supported in a productive instance, but for development, test, pilots and PoCs we often cheat. So copy the files onto the HANA appliance data drive or log drive and crunch them there. Just make sure you don't run out of disk space, because bad things happen if you do. You can take advantage of HANA's fast local disks and immense CPU/RAM power.
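A minimal sketch of that workflow, assuming a hypothetical staging directory on the appliance (the host name and paths are examples, not a recommendation):

# Copy the raw files onto the appliance
scp bigfile.dat hanaadm@hanahost:/hana/data/staging/

# Keep an eye on free space before and during the clean-up
df -h /hana/data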

HANA only supports tab, comma, semi-colon and pipe delimited files

It's a bit inconvenient but HANA doesn't support other delimiters, even when you specify the delimiter in the load control file. So if you have files that are delimited some other way, you need to convert.
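If you are not sure which delimiter a file actually uses, looking at the raw bytes of the first record will tell you (input.dat is a placeholder name):

# Show the first line with non-printing characters made visible
head -n 1 input.dat | od -c | head

A Ctrl-A delimiter shows up as octal 001 in the od output.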

Think very carefully about what tool you use.


If you want to do a simple conversion of Ctrl-A (Start of Heading) delimited files to CSV, there are several ways to do it:

tr '\01' ',' < input.dat > output.csv

sed -e 's/\x01/,/g' input.dat > output.csv

awk '{gsub("\x01",",")}1' input.dat > output.csv


It might surprise you that the performance is very different, even though all three run in a single thread. In my test, awk runs in 22 seconds, tr in 32 seconds and sed in 57 seconds.
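If you want to repeat the comparison on your own data, the shell's time builtin is enough; these are presumably the sort of runs behind the numbers above (input.dat is a placeholder):

time tr '\01' ',' < input.dat > /dev/null
time awk '{gsub("\x01",",")}1' input.dat > /dev/null
time sed -e 's/\x01/,/g' input.dat > /dev/null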

One interesting trick is to split the work between two sed processes: the first replaces only the first match on each line (no /g, which is expensive because it matches every occurrence), and pipes into a second sed that cleans up the rest with /g.

sed -e 's/\x01/,/' input.dat | sed -e 's/\x01/,/g' > output.csv

This actually runs about 15% faster, because the two sed processes run concurrently on separate cores.

Useless use of cat

Cat is the friend of the lazy UNIX admin, and it generally just adds overhead. For example, the following two commands do the same thing, but the first spawns an extra process and copies all the data through a pipe:
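cat input.dat | tr '\01' ',' > output.csv

tr '\01' ',' < input.dat > output.csv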

There is even a page dedicated to the useless use of cat. Awesome.

Learn awk

My biggest piece of advice is to learn how to use awk, especially if you are processing fixed-format files into CSV. Here is an example of a small bash script that uses awk to turn a fixed-format file with a timestamp and a text element into a CSV; the date comes from the file name. This runs incredibly quickly!

#!/bin/bash
# Turn a fixed-width file (HHMMSSmmm plus a 10-character text field)
# into a CSV, taking the date from characters 9-16 of the file name ($1).
# Note: FIELDWIDTHS is a GNU awk (gawk) feature.
awk -v DATESTAMP="$(date -d "${1:8:8}" +%F)" 'BEGIN{FIELDWIDTHS="2 2 2 3 10"}{
    HOUR=$1
    MIN=$2
    SEC=$3
    MSEC=$4
    TEXT=$5
    printf("%s %s:%s:%s.%s,%s\n", DATESTAMP, HOUR, MIN, SEC, MSEC, TEXT)
}' "$1" > "$1.csv"
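The script expects the date to sit in characters 9-16 of the file name; EVENTLOG20130601.dat is a made-up name that fits that layout. Saved as awk.sh (as used in the parallel example below), it runs like this:

chmod +x awk.sh
./awk.sh EVENTLOG20130601.dat   # writes EVENTLOG20130601.dat.csv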

Download a few tools to help you on your way

I put a short document together that details my favorite HANA tools. I hope you enjoy!

Parallelize

If you use pigz for compression, you will be parallelizing already. One of my favorite tricks is to write awk scripts and then run them on 20 files at once with GNU parallel. Because these scripts are usually CPU bound, each one occupies a single thread. With the example above we would do:

ls *.fixedfile | parallel ./awk.sh

Go and make a cup of tea, and you will have all your files converted.
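pigz parallelizes the compression of a single file across cores; for lots of smaller result files, the same parallel idiom works just as well (the file extension is an example):

# Compress the generated CSVs, one pigz process per file
ls *.csv | parallel pigz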

Go and load your data into HANA in parallel

I put together a Best Practices for HANA Data Loads document a long time ago. I think it's still pretty relevant today, and will help you get data into HANA fast.
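As one hedged example of what "in parallel" can mean in practice: HANA's CSV import can run with multiple threads, and it also helps to split a huge file into chunks that can be imported as separate jobs. The chunk size and file names below are assumptions, not figures from that document:

# Split a large CSV into 10-million-line chunks (chunk_00, chunk_01, ...)
split -l 10000000 -d output.csv chunk_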

Final Words

None of this is good practice in a real productive HANA environment and you will probably have an ETL tool like SAP Data Services or a streaming tool like SAP Event Stream Processor. But when it comes to throwing together demos fast, these tips may get you out of trouble. I hope you enjoy!
