Data Assurance (DA) : Not your father's rs_subcmp

Mark_A_Parsons · ‎05-23-2014

At one of my current clients I've been cursed blessed with the delegated task opportunity to put Sybase's Data Assurance (DA) product through its paces.

From a functional perspective DA is quite similar to the rs_subcmp utility, ie, both tools allow the operator to compare the schema and/or data between source and target tables.

From an operational perspective DA provides some much needed improvements over the capabilities of rs_subcmp.

rs_subcmp

If you've worked with the Sybase Replication Server (Repserver) product you've probably noticed, and perhaps even tried to use, the rs_subcmp utility that comes packaged with the repserver distribution [see ${SYBASE}/${SYBASE_REP}/bin/rs_subcmp].

rs_subcmp is a tool that allows the operator to compare schemas and/or data between a source and target table (eg, to verify a replicated table is in sync between the PDB and RDB).

To compare the data set between 2 tables (source vs target) the operator must pull together several pieces of information: source dataserver/database/table/login/password and query to extract the source data, target dataserver/database/table/login/password and query to extract the target data, and a handful of configuration parameters. [As any (in)sane person who's decided to use rs_subcmp on a regular basis has discovered, you quickly learn how to write a shell script to collect and manage the gazillion components needed to run the rs_subcmp jobs necessary for your replication environment.]

A major issue with rs_subcmp has been its inability to work with largish data sets, typically requiring the operator to break large data sets into (much) smaller data comparisons (eg, by supplying 'where' clauses that pull a subset of data from the source/target tables); and heaven help you if your comparison finds a lot of rows that need to be reconciled as this requires yet more memory which in turn means whittling down the size of the data set you're initially comparing.

My past use of rs_subcmp has shown it typically dies if it requires more than 2GB of memory to compare a data set, ymmv. [For a comparison between a pair of tables with 10's of millions of rows it's not uncommon to have to cobble together dozens of rs_subcmp jobs, with each job working on a subset of the table in question. Did I mention having to manage a gazillion components?]

Other issues with rs_subcmp include: having no idea if/when a comparison will complete, bogging down a dataserver to select/sort the data set for a comparison, and limited reporting on the results of a comparison (failed or successful).

DA

As with rs_subcmp, DA is packaged with the Repserver distribution media (starting with RS 15.7), and while rs_subcmp is installed with/under the Repserver installation directory, DA has its own setup/installation directory within the Repserver distribution (ie, you need to run a separate setup process, and you provide a different $SYBASE value).

As with rs_subcmp, DA has the ability to compare schemas and/or data between a source and target table. One nice performance improvement we get with DA is the ability to configure a single source for comparison against multiple targets and yes, DA really does perform those multiple comparisons in parallel. [With rs_subcmp you need a separate job to compare the source against each target, which means incurring the overhead of repeated selects/sorts of the source table.]

Unfortunately you still need to gather up several pieces of information for the DA comparison (source/target dataserver/database/table/login/password details as well as some configuration settings). And while DA provides a couple commands ('create compareset/foreach' and 'import job') that are suppose to help with automating the creation of some DA components, you'll likely find yourself having to write some shell scripts to help with managing the not-quite-a-gazillion components.

NOTE: The 'create compareset/foreach' command auto-generates DA components at the database level, ie, generate a comparison for each matching table in source and target databases

NOTE: The 'import job' command builds DA components based on table-level repdefs/subs in a given RSSD; unfortunately this command is in its infancy and not very reliable at this time.

Where DA starts to shine is when it comes to performing data comparisons. DA is actually a set of java programs that have the ability to use as much memory as you wish to allocate, with an additional ability to write intermediate sort sets to disk; this in effect eliminates the memory limit rs_subcmp routinely runs into for large data sets, which in turn means you only need to run a single DA comparison job against your largest tables (as opposed to having to run multiple rs_subcmp comparisons against subsets of the your largest tables)

DA provides options on how to process raw data including where to sort your data: a) have the dataserver sort the data for you (thus eating up dataserver resources and thrashing your data caches) or b) have DA sort the data for you out in the OS. [NOTE: DA has some parallel processing capabilities and is thread capable so it will use more than 1 cpu; with some configuration settings and/or running a few DA jobs in parallel it's quite easy to tie up all of your cpu resources ... so make sure you're running DA on a machine where you don't mind giving at/near 100% of your cpu resources to DA.]

NOTE: DA has the ability to install java components (called agents) on multiple machine and have those components managed by a single java process on another machine (called the server); net result is you can spread your DA processing out over multiple machines.

Another nice feature of DA is the ability to monitor a job while it's running. The 'monitor job' command displays start time, high level information about the source and target(s) being compared, initial row counts, how many rows have been compared to date, and an estimate of when the comparison will complete. Repeated runs of 'monitor job' provide updates on number of rows processed, expected completion time and a running count of issues ('M'issing rows, 'O'rphan rows, 'I'nconsistent data) it's found up to that point in time.

As with rs_subcmp, DA has the ability to re-compare inconsistencies and auto-generate reconciliation scripts, and then apply the reconciliation scripts or leave it up to the operator to apply them manually.

The 'report history' command provides a result-set layout of the job's history, high level details of what was compared, run times, numbers of rows processed, counts of the various types of issues ('M'issing/'O'rphan/'I'nconsistent), and the location of any auto-generated reconciliation scripts.

After having used rs_subcmp and DA to perform data comparisons ... DA is by far easier to use, has more features, and simply runs circles around rs_subcmp when it comes to performance.

Unfortunately DA does have one major drawback when compared to rs_subcmp ... one 'gotcha' that may sway your decision to (not) use DA ... that proverbial shoe you've been waiting to drop ... DA requires its own licen$e.

Conclusion

Both products provide schema and data comparison capabilities. [NOTE: My experience with rs_subcmp and DA has been primary in the realm of data comparisons; it's very rare that I need to perform schema comparisons, but when necessary I have other options which are easier/cheaper to use. YMMV]

Both products require managing a library of components (with rs_subcmp typically requiring a larger library due to having to break large table comparisons into smaller, manageable pieces.)

rs_subcmp is limited by how much data can be sorted, compared and reconciled within a 2GB memory limit. DA appears to have none of these limits (though it can and will use up a good bit of disk space, cpu and/or memory).

DA allows the operator to offload sorting operations from the dataserver (by transferring the overhead to an OS process, preferably on a different machine than where the dataserver is running).

DA comparison operations can be parallelized at the cpu and machine/host level thus providing some sizeable performance improvements (over rs_subcmp).

rs_subcmp is a free tool that is installed with Repserver; DA is a licen$ed product that is distributed with the Repserver product but can be installed separately from the Repserver product.