SAP Information Steward 4.1 is currently in ramp-up phase. I tested the new match review feature. The match review is a tool for a data steward for reviewing/approving potential duplicate records. It relies on the match transform executed in a batch process in Data Services.
The official documentation states how the workflow will look like: what is the job of the data steward/data reviewers, when the Data Services job need to run and so on. But it doesn’t explain well how the link between Data Services and Information Steward works technically and how to set up the match process in Data Services. I’ll focus on more on these topics in this post. It is mainly dedicated to Information Steward administrators and ETL developers.
Some things to consider beforehand:
- If you already have an ETL job using the match transform you will need tp adapt the ETL process obviously: it is supposed that the match transform will run in the staging area. Those matches that are in a configurable range of match scores are dedicated to be reviewed/approved in Information Steward. The match review in Information Steward will pick up the match groups from a staging table that keeps the results of the math transform. After reviewing/approving IS will update the status of the match groups in the same staging table.
- The match review configuration requires dedicated database connections to the staging area of your ETL process. Unfortunately it only supports: This is odd, because it differs from the whole set of major database vendors that are available in the other tools: Data Services as well as data profiling and data validation rules within IS. Even ODBC connections don’t work.
- SQL Server
- Sybase ASE
- SAP Hana
- SAP Master Data Services
IS (Data Steward or IS Administrator): setup match review configuration
You need to configure the match configuration in IS. The most import configurations are:
- Match review thresholds: the Data Services match transform calculates a match score. Only match groups having a match score within the specified threshold range are
inspected by IS. The others are completely ignored by IS:
- Match review result table: IS need to know where the match transform in Data Services stores its immediate results in the staging table. The following columns are required:
- Source Record Id: simply the primary key of the input records
- Job Run Id: every ETL process should have one. IS keeps track of the match groups that it retrieved the last time. Next time it will pick up the match groups of the next job run id.
- Match Group Number, Match Group Rank, Match Score: these fields should be output by the match transform anyway. IS will only retrieve those match groups that have match scores the specified range.
- Match Review Result: this is a new column that you will need to add to the match transform result table. After match groups have been reviewed / approved IS will save the status code in that column. I could not find the codes in the manuals, but I have seen these codes in my tests so far:
- M (master record)
- S (subordinate record, in this context it means that the subordinate had been approved as a match in IS)
- U (un-match: the record has been approved as a non-matching record in IS)
- Source System: most of the ETL processes have such a field. Even if you have just one source system and don’t maintain such a field you will need to provide it here. At the moment I just see one need for this field:Later in the match review configuration you can set up different sub sets of match groups that need to be reviewed/approved by different groups of persons. For instance Team A would review/approve only match groups that come from source system A and so on:
- Job Status Table:
- Job Status / Job Status Value: IS will check this field and value when retrieving the match groups from the staging table. So the ETL process should flag here that the records for this job run are ready for a match review.
- Indicate Status Change: when activated IS will update the field accordingly after all match reviews/approvals for a job run have been completed.More details on the job status table are in the next section.
Data Services (ETL developer): adapt ETL process
The ETL process in DS need to be adapted. Only match records that were not subject of a match review or had been approved by the match review may be further processed in the ETL process. You might need to introduce a new match result staging table (containing the columns described above) if you didn’t save these results already somewhere. You definitely need to add a column for the match review result. Obviously you cannot simply truncate the staging table with each batch run.
Once the match review is completed in IS the result code of the match review is stored in the match review result field in the staging table. Important: IS is not updating this field before a decision has been approved:
The ETL developer now need to understand which of these match groups can be further processed (means loaded as part of the ETL process) and which of them need to stay because they are either still under review or they are even not subject to a review.
At the moment I can think of these solutions:
- Wait for all match reviews to be completed
As mentioned above IS can update the job status field once ALL match groups of ALL tasks for a match configuration are completed (this functionality need to be configured). In this case all match groups for a batch run have either a status code from the match review (M, S, or U). Or they have a null entry, which then means that they were not subject to a match review at all.
This is a save mechanism but has the disadvantage that a lot of match groups wont make their way to the DWH until even the last match group had been approved in IS. Depending on the size of the DQ team this might indeed take some longer!
- Process match groups as soon as they are approved:
This is a bit more tricky. You can process all match groups that have a non-null entry in the match review result field. For the others you don’t know whether they are still under review in IS or whether they are even not subject to a review in IS. You can of course filter out the latter ones because you know the corresponding threshold configuration in IS (see section above). But this is of course very risky. The Data Steward might change these settings. He needed to coordinate this with the ETL development team. I cannot believe that this will work reliably …
To overcome this situation I will suggest an improvement to the ramp-up team: when a match configuration is run IS can update the match result field with a code that is saying that this match group is currently under review by IS.
Job Status Table:
This table need to be configured in IS and helps to synchronize the ETL and match review tasks (see above). In many environments you won’t have a specific ETL job just for the matching process. Instead, the matching process will only be one dataflow or workflow in the ETL job which is responsible for loading all facts and dimensions in one job into the DWH. In this case I recommend creating a dedicated status table and call it something like MATCH_REVIEW_STATUS or so, because:
- We are not synchronizing the complete ETL job with the match review in IS. Instead we are just synchronizing the matching processes. After the match transform in Data Services has completed, the job will go ahead with loading all match groups that are not subject to a match review. If we were using only one job status table for both, the ETL job and the matching process then the match groups can only be picked up by IS after the complete ETL job has finished. This might be a waste of time for the DQ staff, especially when the ETL job is long running.
- The normal ETL job status table will contain a status like Completed after it has finished successfully. If we were using only one job status table for both, the ETL job and the matching process then IS would finally overwrite this status with its own status (if this feature is configured).
IS (Data Steward or IS Administrator): running match review configuration
Running a match review configuration means that IS starts a process on the DS server that does the following:
- collect the match groups from the match transform table that have relevant match scores
- if it found such match groups it creates tasks in the work lists of the configured reviewers in IS
A match review configuration can be started manually in IS. More likely, it will be scheduled in the CMC.
IS (Data Steward, Reviewers/Approvers): review/approve the match groups
I have just tested the scenario that a match group first need to be reviewed (means confirm match or un-match). Second, this decision need to be approved again by somebody else (when setting up the match review configuration you can specify that only a review and no approval is required).
Every reviewer/approver has a task in his work list which contains all those match groups that were retrieved by the corresponding match review configuration run:
The task will stay in the work list until all match groups had been reviewed / approved. In my test case (review as well as approval required) the Match Review Result field in the staging table will be updated when the match group had been finally approved. (I assume that in the case when only a review is required the field will get updated as soon as the reviewer has decided on match or un-match).
In the example above match groups 1,10 and 11 have already been approved, means the approver has confirmed the decision of the reviewer. The reviewer can no longer change these match groups. Instead, in match group 12 the decision of the reviewer has been rejected by the approvre. The reviewer will need to look into the match group again. The other match groups still need t be reviewed by the reviewers.
(btw: in case of a rejection the process can become unpractical: the approver cannot simply overwrite a decision. The approver need to reject. Then the reviewer has to change his decision and finally the approver has to approve again. This looks a bit too rigid to me)