cancel
Showing results for 
Search instead for 
Did you mean: 

Understanding the value of hash partition

patrickbachmann
Active Contributor
0 Kudos

Hi folks,

In the case of Range partition I can clearly see and understand the value.  ie: if I create a range partition on MSEG based upon Material Doc year (MJAHR) I can have a partition for each year.  Let's say I have 10 years and 10 partitions.  If I query the year 2006 then only that partition is read and the advantage is obvious and I get much faster response times.

But in contrast, if I were to create a HASH partition it is supposedly an equal distribution.  So let's say I created a HASH 10 partition on sales item table VBAP.   Since there is no nice key field based on YEAR in this table like there is in MSEG lets say I use Sales Document Number (VBELN) and create HASH 10 VBELN.  Now lets say our sales document numbers start at 0000000001 all the way through 9999999999.  Now I run a query that pulls all sales documents with number somewhere in the middle such as 5555555555.  Couldn't this sales document exist in ANY of the 10 equally distributed partitions?  Meaning wouldn't each and every partition need to be searched to find the value 5555555555?  This is the part that I'm trying to understand about HASH partition.  Is there still some sort of unseen range within the 10 hash buckets?  I need to understand the algorithm how these records are distributed because if ALL hash partitions are read when I run my query then I do not see any advantage in this type of partition.

Many thanks to anyone who can help clarify HASH type partition!

-Patrick

Accepted Solutions (1)

Accepted Solutions (1)

lbreddemann
Active Contributor

HI Patrick

the point of a HASH function H(x) is that it

  • distributes all input data x equally to a predefined output set (say the partition numbers)
  • gives the same value every single time for the same data

So, at the one hand we can evenly distribute the data, e.g. over several nodes/hosts.

And on the other hand, whenever a value x is requested, all we need to do is to evaluate the hash function H(x) to find out in which partition the value is stored.

That way we can simply skip all other partitions (we 'prune' them from the execution path).

A very easy example would be the modulo function that returns the remainder of a division.

Say you have 4 partitions and a partition key column with these values {0, 3, 4, 1, 3, 5, 7, 20, 22}.

In that case the modulo function for a division by 4 would deliver the following partition assignments:

key ->    partition

0   -> [ 1 |   |   |   ]

3   -> [   |   |   | 4 ]

4   -> [ 1 |   |   |   ]

1   -> [   | 2 |   |   ]

3   -> [   |   |   | 4 ]

5   -> [ 1 |   |   |   ]

7   -> [   |   |   | 4 ]

20  -> [ 1 |   |   |   ]

22  -> [   |   | 3 |   ]

Note that the partition numbers start with 1, while the modulo function will "start" at 0 - thus the shift by one to match the partition numbers.

Easy to see, when I am asking for key = 21 I only need to look into partition 2.

There's a lot more to hash functions (read Knuth to know the most important stuff!) and partitioning, but I hope that the example gets you over the little mental hurdle.

- Lars

patrickbachmann
Active Contributor
0 Kudos

Lars thanks for the very detailed explanation. I'm still not quite understanding how key 21 is in partition 2 however.  I do not see 21 in your key column so I'm not clear in my head how it knows to jump to partition 2.  I'm sure I'm missing something simple that you can clarify.   Also the key of 3 is in there twice.  Is that how you meant to type it?

Thanks!

-Patrick

lbreddemann
Active Contributor
0 Kudos

Hi Patrick,

yes, "key" 3 is duplicated in order to illustrate that this is the partitioning-key (the input into the hash function H(x)) and not a table primary key or something like that.

Looking at it, "key" really is a very ambiguous term in today's database world.

Referring to the example with x = 21: OK, that was a sharp turn there, as I simply wanted to show that even looking for values that are not even present in the whole table, the hash function will lead us to the only possible partition.

For 21 the result of modulo (21, 4) is 1, as 20 can be divided by 4 without remainder.

The remainders map to the partitions like this:

remainder  -> partition

0          ->     1

1          ->     2

2          ->     3

3          ->     4

Thus, when we want to find records with the partitioning key 21, we only need to go to partition 2 and search there.

Hope that makes it clearer.

- Lars

patrickbachmann
Active Contributor
0 Kudos

Excellent, thanks for the clarification Lars.  Also reading up on Knuth hash algorithm now so thanks for that extra tidbid as well.

-Patrick

Former Member
0 Kudos

Hi Lars,

The partitioning in HANA doesnot seem to be working that simply. When you do a HASH partitioning by specifying lets say num=4 partitions and suppose the number to be stored is 21, ideally by MOD function it should go in partition #2 (as reminder is 1)  but in HANA it goes into partition #3. Seems that HANA uses a different approach/algorithm for HASH partitioning. Can you confirm which algo HANA uses ?

Regards,

Dhwanit

lbreddemann
Active Contributor
0 Kudos

Hi Dhwanit,

this post was to illustrate how - in principle - the hash partitioning works.

The modulo function was simply used as an example.

What hash-function SAP HANA internally uses to distribute the keys to the available partitions is not externally documented and might change.

However, it should not matter at all, since it does not play any role for the actual usage of the HASH partitioning feature.

- Lars

0 Kudos

Hi Lars

Im trying to define a partitioning scheme over a TIMESTAMP column (part of the PK) that will divide the data in a partition per day fashion

I tried the following and it fails

ALTER TABLE REQUEST_INSTANCES PARTITION BY RANGE (MOD(dayofyear (START_TIME),31))

(

PARTITION 1 <= VALUES < 2,

PARTITION 2 <= VALUES < 3,

PARTITION 3 <= VALUES < 4,

PARTITION 4 <= VALUES < 5,

PARTITION 5 <= VALUES < 6,

... till day 31

)

I've checked the function syntax in the following select and it works

"select MOD(dayofyear (START_TIME),31) from REQUEST_INSTANCES;"

So why wont it let me do it in the TABLE definition ?

Any way to achieve this kind of scheme ?

lbreddemann
Active Contributor
0 Kudos

Well, this is because the partitioning function is not as versatile as general SQL functions.

The documentation lists what is possible to do here.

However, with SPS 8 the partitioning function does support date functions.

Looking at your example I wonder what additional value you expect from the MOD() function here.

The days_of_month will be re-distributed from 28-30 days to 31 days. Doesn't strike me as a big benefit.

0 Kudos

Hi Lars

The idea is to allow me to delete old data per day.

If I'm requested to keep data for 31 days, it will allow me to drop the whole partition with no need to run a delete where date =  SQL.

lbreddemann
Active Contributor
0 Kudos

That's not working out.

So you have your 31 partitions here with the data.

Let's say it's not day #32 - data deletion day - what partition do you drop?

If you drop partition #1 you will also drop data that was entered today.

A more robust approach would be to actually create a partition for every period of time (day, week, month) and drop/create partitions as the time window moves on.

Dropping partitions instead of delete is surely a nice thing to do. However, it does make your solution more complex.

Personally I would only include partition handling in my application logic, if it is actually required.If the simpler DELETE does the job sufficiently, I'd rather stick with that.

But this is really stuff for a new discussion.

- Lars

Answers (1)

Answers (1)

Former Member
0 Kudos

Hello Bachmann,


As far as i know If you're searching for a single Sales_Document_Number then no. SAP HANA's hashing algorithm is consistent, i.e. records with the same partition key gets populated in the same partition. on the contrary if you are searching for Sales_Document_Numbers of the last one year then the hashing algorithm will go through a lot of stress to distribute the Tuples evenly.


SAP HANA's hashing algorithm distributes the data in a repeatable manner which makes the Identical Sales_Document_numbers fall under the same partition.


Regards,

Krishna