Solved: Compression taking more space than non-compressed?

Former Member · ‎10-05-2015

Hello,

I did an analysis on our top 100 column store objects in HANA to see if we have a potential to save some memory (This is a SAP BW system).

I found one sample in M_CS_COLUMNS where the compression ratio was > 100 %, meaning the uncompressed column was smaller than the compressed column.

Overall the compression rates look brilliant, but after checking all our top 100 column store tables by size in M_CS_COLUMNS I found ~500 columns that show a smaller UNCOMPRESSED_SIZE than MEMORY_SIZE_IN_MAIN. In total, the uncompressed columns would use about 19 GB less memory than the compressed ones.

UNCOMPRESSED_SIZE < MEMORY_SIZE_IN_MAIN

This is the top column entry, where the compressed size is even three times the uncompressed size:

HOST	xxxxx
PORT	xxxx
SCHEMA_NAME	xxx
TABLE_NAME	/BIC/Xxxxxx
COLUMN_NAME	SID
PART_ID	0
MEMORY_SIZE_IN_TOTAL	1.097.650.812
MEMORY_SIZE_IN_MAIN	1.096.242.904
MEMORY_SIZE_IN_DELTA	1.407.908
UNCOMPRESSED_SIZE	330.789.114
COMPRESSION_RATIO_IN_PERCENTAGE	331,83
COUNT	82.696.006
DISTINCT_COUNT	82.652.260
COMPRESSION_TYPE	DEFAULT
INDEX_TYPE	FULL
INDEX_LOADED	LOADED
IMPLEMENTATION_FLAGS	17
LAST_ACCESS_TIME	05.10.2015 07:05:24.448912
LOADED	TRUE
LAST_LOAD_TIME	04.10.2015 08:02:54.739835
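As a quick sanity check of this row, the arithmetic can be sketched in a few lines of Python. The values are copied from the entry above. That the ratio is MEMORY_SIZE_IN_TOTAL relative to UNCOMPRESSED_SIZE is an assumption, but it does reproduce the 331,83 shown. The variable names are only illustrative.

```python
# Values copied from the M_CS_COLUMNS entry above
# (the dots in the original are thousands separators).
memory_size_in_total = 1_097_650_812
uncompressed_size = 330_789_114
count = 82_696_006
distinct_count = 82_652_260

# Assumption: the ratio compares MEMORY_SIZE_IN_TOTAL with UNCOMPRESSED_SIZE.
ratio_percent = memory_size_in_total / uncompressed_size * 100

# Share of rows that carry a distinct value (1.0 would mean fully unique).
distinct_share = distinct_count / count

# Uncompressed bytes per row, a hint at the width of the stored value.
bytes_per_row = uncompressed_size / count

print(f"compression ratio:  {ratio_percent:.2f} %")
print(f"distinct share:     {distinct_share:.4%}")
print(f"uncompressed bytes per row: {bytes_per_row:.2f}")
```

The distinct share is above 99.9 %, and the uncompressed size works out to roughly 4 bytes per row.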

How can this be explained, and should I change the COMPRESSION_TYPE?

Thank you

Florian

lbreddemann · ‎10-06-2015

Ok, this is not a mystery at all.

Compression in SAP HANA mostly works by replacing duplicate data entries with pointers or aggregated entries (e.g. the repeating value 10 10 10 10 could be stored as 4 x 10 and so on).

Now the column store keeps the information for a column in two separate parts:

1. It stores every distinct value once (that's what we call the dictionary).

2. It records the occurrence of a value in a row by storing a pointer (an integer of up to 4 bytes) to the dictionary entry.

This approach works pretty well for values that take more space than the pointer and that are repeated often. Think of the three character (6 Unicode bytes) long, ever repeating client (MANDT) column.

Here we store the value once and only point to it with actually just 2 bits - since the dictionary only contains 1 value.
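To make the dictionary idea concrete, here is a minimal toy sketch in Python. It is not how HANA implements it internally; the function names `dictionary_encode` and `bits_per_value_id` are made up for illustration, and the minimum pointer width of 1 bit is an assumption of this toy model (the real column store may use a different minimum, as the 2 bits mentioned above suggest).

```python
from math import ceil, log2


def dictionary_encode(values):
    """Return (dictionary, value_ids): each distinct value is stored once,
    and every row only holds the position of its value in the dictionary."""
    dictionary = sorted(set(values))
    position = {value: i for i, value in enumerate(dictionary)}
    value_ids = [position[value] for value in values]
    return dictionary, value_ids


def bits_per_value_id(distinct_count):
    """Bits needed to address every dictionary entry (toy minimum: 1 bit)."""
    return max(1, ceil(log2(distinct_count)))


# A MANDT-like column: one value repeated in every row.
mandt_dict, mandt_ids = dictionary_encode(["100"] * 1000)

# A SID-like column: every row has its own value.
sid_dict, sid_ids = dictionary_encode(list(range(1000)))

print(len(mandt_dict), bits_per_value_id(len(mandt_dict)))
print(len(sid_dict), bits_per_value_id(len(sid_dict)))
```

In the first case the dictionary holds a single entry and the pointers are tiny. In the second case the dictionary is as long as the column itself, so the pointers come on top of the full set of values instead of replacing them.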

Now look at your example:

The number of distinct values is nearly as large as the number of rows in your table.

That means the SID here is close to being unique, which leaves little room to compress 'duplicates'.

Then, the data type of the SID is usually an integer, which means that even if we had lots of duplicates, the saving from using a reference instead of the value might be rather small.

Finally, this column has an inverted index set up on it. This means that the pointers to the value occurrences are actually stored twice:

once linking the occurrence to the value and once linking the value to the occurrences (hence inverted index).

And there we have your threefold space requirement.
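A rough back-of-the-envelope model shows how these three parts add up. This is only a sketch under simplifying assumptions, not HANA's actual storage layout: the dictionary is taken as distinct values times value width, the pointers are bit-packed, and the inverted index is assumed to cost about as much as the pointer vector. The function `estimate_sizes` and its parameters are hypothetical, and the real figures reported by the system will differ.

```python
from math import ceil, log2


def estimate_sizes(rows, distinct, value_bytes, inverted_index):
    """Toy estimate (in bytes) of dictionary, pointer vector and index."""
    bits = max(1, ceil(log2(distinct)))      # bits per dictionary pointer
    dictionary = distinct * value_bytes       # every distinct value once
    data_vector = ceil(rows * bits / 8)       # one bit-packed pointer per row
    # Assumption: the inverted index costs about as much as the data vector.
    index = data_vector if inverted_index else 0
    return {
        "dictionary": dictionary,
        "data_vector": data_vector,
        "index": index,
        "total": dictionary + data_vector + index,
        "uncompressed": rows * value_bytes,
    }


# SID-like column: 4-byte integers, nearly unique, with inverted index.
sid = estimate_sizes(82_696_006, 82_652_260, value_bytes=4, inverted_index=True)

# MANDT-like column: one 6-byte value repeated in every row, no index.
mandt = estimate_sizes(82_696_006, 1, value_bytes=6, inverted_index=False)

for name, est in (("SID", sid), ("MANDT", mandt)):
    print(name, est, f"ratio = {est['total'] / est['uncompressed']:.2%}")
```

Even in this simplified model the near-unique integer column ends up at well over twice its uncompressed size, while the single-value column shrinks to a small fraction of it.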

There's plenty of information available by now on how column store compression works (not least Richard's and my book), so I think this should answer the question well enough for now.

MAIN_MEMORY_SIZE_IN_DATA	476.503.120
MAIN_MEMORY_SIZE_IN_DICT	651.832.232
MAIN_MEMORY_SIZE_IN_INDEX	289.211.536
MAIN_MEMORY_SIZE_IN_MISC	1.968
MEMORY_SIZE_IN_MAIN (TOTAL)	1.417.548.856
