Algorithm behind duplicate check rule

thomas_rischar · ‎05-13-2015

The Business Partner Duplicate Check is used to find out potential business partner duplicates existing in the system at the time of creation of a new business partner. It depends on HANA capabilities for performing a fuzzy search on the database with the given data and for calculating the overall rank for the records that have some match. HANA uses the Levenshtein algorithm for calculating the similarity of terms. Combining the capabilities of HANA’s error tolerant search and ranking with FSI’s flexibility to create dynamic SQL ‘Where’ clauses, business partner duplicate check finds duplicates reliably and also efficiently.

Levenshtein algorithm:
Named after its developer Vladimir Levenshtein, 'Levenshtein Algorithm' employs the calculation of the Levenshtein Distance i.e. the distance between two given strings or sequences by finding out the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. It is often used in applications that need to determine how similar, or different, two strings are, such as spell checkers.

The Duplicate Check considers these “nodes” (= collection of attributes) when calculating the similarity score: name, postal address, e-mail, fax, phone, birthdate (only for persons), additional identifiers. I. e. it calculates the single similarity of the above mentioned nodes compared to an existing business partner on the database and finally aggregates the single similarities to a total similarity. In this step the nodes contribute to the total similarity with different weights which are configured in the Business Configuration Finetuning “Duplicate Check Weighting For Business Partners”.

A short example shall illustrate this:
Let’s assume you are creating a new contact with these data
Name = Bond
E-mail = j.bond@mi6.com

Let’s say there is already this similar contact existing on the database:
Name = Band
E-mail = j.bond@mi6.com

The similarity of the names is 1 – ¼ = 0.75 = 75%.
The similarity of the e-mail addresses is 1 = 100%.

Now the total similarity depends on the weights for name and e-mail configured in “Duplicate Check Weighting For Business Partners”. Only the weights for name and e-mail are relevant, as the similarity of the other nodes is 0.

Let’s assume we have these weights:
Name weight = 30
E-mail weight = 70

To combine the single similarities the Duplicate Check uses the so called OR-formula:

with
R = rank or fuzzy similarity of a node

w = node weight

i = index value of the node

n = number of individual nodes

For this example we get this total similarity:
SIM = SQRT( (30² * 0.75² + 70² * 1²) / (30² + 70²) ) = SQRT( 5406.25 / 5800 ) = 0.97 = 97%

If you decrease the weight for Name and increase the weight for E-mail the total similarity will increase as well, as the contribution of the 100% matching node ‘e-mail’ increases compared to the node ‘name’ which has got a lower similarity.

Now let’s have a look at a second example where we enter only the name for the new contact:
Name = Bond

Now the configuration in “Duplicate Check Weighting For Business Partners” is completely irrelevant as there are no parts to be combined.
That means the total similarity is just the single similarity of the name part independent of any weights:
SIM = 0.75 = 75%

Summary:
The node weights you are configuring in the “Duplicate Check Weighting For Business Partners” only change the influence how much each individual of the above mentioned nodes (e. g. name or address) contribute to the total similarity compared to the other nodes.
Note, that the node weights are relative weights, i. e. only the ratio of a node weight compared to the other node weights is relevant and not the absolute value. If you e. g. double all node weights the calculated fuzzy similarity keeps the same.

Former Member · ‎03-08-2016

Hello Thomas,

besides the standard delivery which you've described above, is there an option to adapt the standard algorithm? Or even to replace it by a custom logic?

If YES - could you please provide directions or an example?

Thanks in advance for any any hints.

Best regards,

Christian.

Former Member · ‎06-12-2015

Hi Thomas,

thank you for the very good and detailed reply.

Our Client's IT Department asked us a quick question: how much is then the threshold on the Levenshtein distance in order to include a Business Partner in the Checked Duplicates list? We guess should be around 80%~90%, is that correct?

Thanks a lot,

Davide

Former Member · ‎05-13-2015

Thomas, thanks a LOT for this complete answer. Now I will stop telling my customers the duplicate check is just a fancy Black Box

I can now give them the whole formula and even suggest an evening course for Mathematics!

Algorithm behind duplicate check rule

Accepted Solutions (0)

Answers (3)

Answers (3)

Re: Hybris user not authenticated after SAML succe...

C4C - COD_SALESORDER_SIMULATE return the error: Ma...

Re: How to get all logs or list of all events trig...

Excel / SAP does not overwrite exciting file name

Re: How to get all logs or list of all events trig...