Currently Being Moderated

This past week I was reminded not once, but twice, about the key part generation numbers play in a replication environment.

 

Background

 

Issue #1 : subscription status of VALID/Defined

 

One of my client's DBAs had issued a create subscription command followed by repeated check subscription commands.  While the PRS was showing the subscription as 'VALID', the RRS continued to show the subscription as 'Defined'.  Turns out the DSI had been suspended for some reason, so once we resumed the DSI the subscription status at the RRS finally switched to 'VALID'.

 

NOTE: create subscription places a rs_marker 'validate subscription <subid>' transaction in the PDB transaction log.  When the PRS receives the rs_marker transaction it marks the subscription (rs_subscriptions.subid = <subid>) as 'VALID' and then forwards the rs_marker transaction to the RRS.  Once the DSI processes the rs_marker transaction the RRS updates the subscription's status to 'VALID'.  In this case the DSI was down/suspended so it could not process the rs_marker transaction hence the subscription's status of 'Defined'.

 

Hah! That was an easy one to fix!  Hmmmm, the DBA didn't appear to be amazed (at all) by my display of genius in solving her problem ... noooooo ... she went right into "OK smarty pants, try to explain this one" mode ...

 

The DBA asked me about a pair of subscriptions that were also showing a status of 'VALID' at the PRS and 'Defined' at the RRS (as per rs_helpsub) where the associated DSI was up.

 

The PRS status was 'VALID' so that meant the rs_marker transaction had successfully flowed from the PDB to the PRS.  In this case the PRS and RRS were the same repserver so I could rule out route issues.  And neither suspending/resuming the DSI, nor bouncing the repserver, had any effect on the RRS status of 'Defined'. So why was the rs_marker transaction not getting to the DSI?

 

Hmmmm ... this looked familiar but it wasn't (at the time) obvious what I was missing.

 

Since this pair of subscriptions had been in this VALID/Defined state for at least a few months, and no one had complained about missing data in the RDB (this was a TEST environment and the RDB was being fed from 20+ PDBs), and those other 19 PDBs were successfully replicating into the RDB, I told the DBA I'd have to think about the issue and get back to her.

 

Issue #2 : missing rs_ticket

 

The next day (while the VALID/Defined subscription issue was still simmering in the subconscious) I found myself running some rs_ticket tests of the TEST replication environment.  I submitted rs_tickets in 7 dozen PDBs, and verified that 267 (out of 269) RDBs received their expected rs_tickets.

 

One missing rs_ticket was easily explained by the aforementioned issue with the VALID/Defined subscriptions.

 

So here I am with a missing rs_ticket for a single PDB/RDB pair. (For the sake of discussion we'll call them pdb1 and rdb1.) Looking over the other 267 rs_ticket/PDB/RDB pairs I found the pdb1 rs_ticket had been successfully replicated into rdb2, and a rs_ticket submitted to pdb2 had been successfully replicated to rdb1.  So I was working with the following scenario:

 

PDBRDBrs_ticket received in RDB ?
pdb1rdb1no
pdb1rdb2yes
pdb2rdb1yes

 

NOTE: In this scenario pdb1 and rdb1 are managed by the same repserver, ie, PRS==RRS.

 

Since the pdb1 rs_ticket was successfully replicated into rdb2 I could rule out any replication-related issues with pdb1 and the PRS.

 

Since the pdb2 rs_ticket was successfully replicated into rdb1 I could rule out any issues with the RRS and the DSI *connection* into rdb1.

 

So my missing rs_ticket appeared to be related to an issue with the specific PDB/RDB pair of pdb1/rdb1.  Hmmmm ... I was obviously missing something basic but what ... oh sh*t ... late for a meeting ...

 

The tea kettle starts whistling

 

So I'm sitting in the meeting, a couple folks are going back-n-forth about the status of some project deliverable, and that VALID/Defined subscription issue in the subconscious progressed from simmering to a light boil ...

 

We know the physical replication paths are up and working (ie, other PDBs can replicate into the RDB) so there's got to be an issue with the actual data ... so what can keep data from reaching its destination? or put another way ... what can cause data to be discarded before it reaches its destination? ... ahhh! generation numbers ... but there can't be a problem with the PDB generation number since the data is getting out of the PDB ... but isn't there another generation number check? ...

 

<the boiling becomes more intense>

 

... isn't there a generation number check at the DSI? ... yeah, and we can get its value from the RDB's rs_lastcommit table, right!? ... ooooo! ooooo! ... this could explain the missing rs_ticket issue, too! ... so if the DSI has a higher generation number than the PDB, then even if the rs_marker and rs_ticket transactions find their way to the DSI they would simply be discarded ... but why are some RDBs functioning properly and others are not? ... hmmm ... ahhhhh ...

 

<kettle starts whistling>

 

... OF COURSE! ... (oops, I must've said that out loud as a couple folks sitting next to me in the meeting looked my way, and the current speaker gave me a nod of appreciation for agreeing to ... whatever it was he was talking about) ... we're dumping and loading those RDBs from a different database so the rs_lastcommit table is going to contain garbage and/or incorrect generation number values! ... when's this meeting end, I have a Nobel-qualifying breakthrough to verify!

 

So the meeting finally concluded and I rushed back to my desk ... bypassing that pile of donuts in the kitchen as I was onto something *BIG* and couldn't stop for distractions ... ooooo, that apple fritter looked lonely ... ok, just one, but *THEN* I had to get back to my desk ...

 

Resolution

 

Solution #2 : missing rs_ticket

 

- dbcc gettrunc: showed a generation number of 10 in pdb1

 

- admin get_generation: confirmed a generation number of 10 for pdb1

 

- admin who,dsi: showed a non-zero value in the Xacts_ignored column; after sending another rs_ticket from pdb1 the Xacts_ignored value was incremented by one (Excellent! we're onto something here!)

 

- select origin,oqid from rdb1..rs_lastcommit: scrolled through the list to find the record for origin=345 (pdb1's dbid) and found the first 2 bytes of the oqid as 0x006e (= 110 decimal), the generation number the DSI thought was associated with pdb1

 

Our PDB generation number (10) was smaller than what the DSI thought was pdb1's generation number (110), so any transactions received by the DSI must have an associated generation number greater than or equal to 110.  Since my rs_ticket transactions had an associated generation number of 10 (as assigned when leaving pdb1) that would explain why the rs_ticket transactions were being discarded (ie, 10 < 110). Excellent!  Now, now how to fix this ...

 

While the manuals suggest running sysadmin set_dsi_generation to reset the generation number, I needed a less-tedious fix ... not just for this pdb1->rdb1 issue, but to address the potential issues with all the other garbage/invalid generation numbers in the rs_lastcommit table.  Well, it just so happens that there is another way to reset the generation number managed by DSI ... delete the row from the rs_lastcommit table for the PDB in question; when the DSI finds no matching row for a given PDB it will accept the next PDB transaction's oqid (more specifically the first 2 bytes) as the new/valid generation number for the PDB.

 

Back to the issue at hand ...

 

- suspend connection to rdb1

 

- delete rdb1..rs_lastcommit where origin = 345 (dbid for pdb1)

 

- resume connection to rdb1

 

Well, the DSI connection stayed up and I verified no rows in rdb1..rs_lastcommit for pdb1.  Now for the moment of truth ...

 

- submit rs_ticket in pdb1

 

The DSI connection stayed up, admin who,dsi showed no increase in the Xacts_ignored column, I found a record in the rdb1..rs_ticket_history table (ie, the rs_ticket successfully flowed from pdb1 to rdb1), and I had a new entry in rdb1..rs_lastcommit for origin=345 with the first 2 bytes of oqid = 0x000a (10).  SUCCESS!  Now to clean up the rest of the mess:

 

- suspend connection to rdb1


- delete rdb1..rs_lastcommit (delete *all* rows)

 

- resume connection to rdb1

 

Another full rs_ticket test showed my 7 dozen PDB rs_tickets being received in 268 RDBs.

 

Just 1 rs_ticket missing now and I knew how to fix that. Time to go impress that not-so-easily-impressed DBA ...

 

Solution #1 : subscription status of VALID/Defined

 

I headed over to the DBA to explain the root cause of her VALID/Defined subscriptions issue. I started by having her run a few queries:

 

- dbcc gettrunc: showed a generation number = 7 for the PDB

 

- admin get_generation: confirmed a generation number = 7 for the PDB

 

- admin who,dsi: confirmed a non-zero value in the Xacts_ignored column for the RDB in question

 

- select origin,oqid from rs_lastcommit: scrolled through the list to find the record for the PDB in question and found the first 2 bytes of the oqid as 0x0039 (= 57 decimal)

 

Explained that the rs_marker transaction would have been assigned a generation number = 7 when leaving the PDB, but since this number was lower than the generation number (57) known to the DSI the rs_marker transaction would have been discarded by the DSI; with no rs_marker transaction being processed by the DSI the subscription was left with a status of 'Defined' at the RRS!

 

To fix the issue I had her type a few more commands:

 

- suspend connection to <RDB>

 

- delete rs_lastcommit where origin = <dbid_for_PDB>

 

- resume connection to <RDB> (to clear the DSI's in-memory generation number for the PDB)

 

- delete rs_subscriptions where subname = <names_of_problematic_subs> (this was a simple subscription with no where clause so no need to delete rows from rs_rules or rs_whereclauses)

 

- create subscription(s) ...

 

- check subscription(s) ...: both subscriptions were confirmed VALID at the PRS and RRS!

 

Excellent!  Another problem solved by yours truly! Hmmmm ... this DBA (still) wasn't showing any signs of being blinded by my brilliance ... this is a tough crowd!

 

Went ahead and had the DBA reset all of the generation numbers known by this DSI:

 

- suspend connection to <RDB>


- delete rs_lastcommit (delete *all* rows)


- resume connection to <RDB>

 

Since this client performs overwrites of their PDB and RDB databases on a regular basis I suggested the DBA go ahead and add the 'delete rs_lastcommit' query as part of their post-load processing.

 

Another full rs_ticket test showed my 7 dozen PDB rs_tickets being received in all 269 RDBs.

 

In conclusion

 

If you ever find yourself overwriting your databases (PDB or RDB) with dumps from other dataservers (or from older dumps), keep in mind those generation numbers; a pair of mismatched PDB/DSI generation numbers is all it takes to keep your PDB transactions from never reaching the RDB.

Comments