Last month I was tracking down missing seconds from system clocks, but this month I've been assigned to scope out some missing persons capers. It's a sleazy racket, but an ABAP detective has to go where the work takes him. There's no room for crybabies in this business.
You might think that "missing persons" would all be about folks who run out when the going gets tough, or who expire from natural causes, when in fact the truth is a lot messier than that. Software licenses expire, or become flawed due to innocent hardware changes. Keys that worked for months and months suddenly turn out to have built-in certificates that trace back to paper as phony as Monopoly money. And service accounts set up by trustworthy salarymen turn out to be flimsy excuses for a house of cards that tumbles in the slightest breeze. I've solved a number of these cases, and while the details can't always be shared due to data privacy rules, I can describe a recent investigation that bore fruit.
I was working the foreign desk, looking at the traces that came in from borderline jobs, overnight batch stuff, like the Orient Express or the night mail train. I spotted a couple of documents I'd describe as needing to be red-balled, or yellow-carded if that's your sport. The jobs should have come to a smooth finish, but from what I could see there were loose ends. In ABAP terms, maybe not a "code 16" like I'd seen in recent episodes, but red marks nonetheless.
Digging into missing person files can be time-consuming and often fruitless. A process runs, it goes astray, but if no one is there to notice the aberrant results, resources are being used for no purpose. The search via SM37, the job overview, is like peering through mug shots or phone books. To cut down on the overload, I picked "Active" or "Canceled" jobs only. Two sets of prints came up; what distinguished them was that one failed immediately, while the other ran for a short period before an abnormal end (or what the insiders call an "abend").
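For the programmatically inclined, the same lineup can be pulled without the SM37 screen. This is a hedged sketch, not a production report: TBTCO is the standard job-header table and 'R'/'A' are the standard status codes for active and canceled jobs, but the date window and output fields here are purely illustrative.

```abap
" Sketch: the programmatic equivalent of an SM37 search restricted
" to Active ('R') and Canceled ('A') jobs from the last week.
" Assumes a 7.40+ system for the inline declarations.
REPORT z_job_mugshots.

SELECT jobname, jobcount, status, sdluname, strtdate, strttime
  FROM tbtco
  WHERE status IN ( 'R', 'A' )           " R = active, A = canceled
    AND strtdate >= @( CONV d( sy-datum - 7 ) )
  INTO TABLE @DATA(lt_suspects).

LOOP AT lt_suspects INTO DATA(ls_job).
  WRITE: / ls_job-jobname, ls_job-status, ls_job-sdluname,
           ls_job-strtdate, ls_job-strttime.
ENDLOOP.
```

Reading TBTCO directly is fine for detective work; for anything official, the released job APIs are the cleaner route.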
Though the Job (Name) and Job Created (By) are redacted here, the 08:20 failure looks like an inside job (set up by a user internally) while the 02:46 has an outside accomplice (external job scheduling). We'll detail the later job, the 08:20, first, as it's the easiest to explain once the facts are known; since it fails so quickly, the list of possible root causes is short. To interrogate the job, and by inference the mastermind that set it up, one only needs to drill into the job log. Standard operating procedure, to be sure, but let's see what the evidence tells us. It's a Z job, and we all know what that means: a custom frame job. But set up by amateur grifters, possibly out on probation, with an almost guaranteed short street life.
The full message, truncated below for quick sizing in the case file, says "Logon of user H******* in client **0 failed when starting a step". Pretty clear what's going on here. The safety deposit key doesn't work - it's been changed after the temp worker aged out of the active case files. Either it was a short step to the account locker, or it was bearer bonds with a short redemption time. That would wrap up the case as far as laying blame, but society will want a chance at reform. So this one has to get bounced back upstairs to the Ops Squad, to either automate this gimcrackery with a zombie process run by the graveyard shift, or put a stake in the body of code and get it out of circulation. I have a sneaking suspicion that if the work hasn't been done for 18 months or more, then shutting off the big switch would not be noticed by Joe Public, or Jane Public for that matter.
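A step in a background job runs under a named step user, and if that user is locked or past its validity date, the logon fails exactly as the message above describes. A hedged sketch of the frisking one might do on the suspect account, assuming standard DDIC names (USR02 is the logon-data table; `lv_step_user` is a stand-in for the redacted user):

```abap
" Sketch: check whether a job's step user is locked or expired -
" the usual causes of "Logon of user ... failed when starting a step".
DATA lv_step_user TYPE usr02-bname VALUE 'HXXXXXXX'.  " redacted suspect

SELECT SINGLE bname, gltgb, uflag
  FROM usr02
  WHERE bname = @lv_step_user
  INTO @DATA(ls_user).

IF sy-subrc <> 0.
  WRITE / 'User no longer exists'.
ELSEIF ls_user-uflag <> 0.
  WRITE / 'User is locked'.          " e.g. 64 = admin lock, 128 = failed logons
ELSEIF ls_user-gltgb IS NOT INITIAL AND ls_user-gltgb < sy-datum.
  WRITE / 'User validity period has expired'.
ENDIF.
```

Either finding points the same way: the account aged out, and the job kept knocking on a locked door.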
With the low-hanging fruit eaten, it's time to look at the rest of the small caseload, the 02:46. This one has the look of the classic bag job, I mean, batch job, running between 2AM and 3AM. The other one, running at 8 in the morning, was more likely set up by a street user for their own workflow only. The batch jobs would be for a general-purpose business process. Looking into the job log once again finds the culprit. It's a "Z" program being called from the batch, possibly a sign of being non-standard, though not automatically cause for suspicion. Then again, ABAP detectives tend to be more suspicious than most.
The relevant error text says "E-mail address not maintained for vendor 0000####", followed by: "Job cancelled after system exception ERROR_MESSAGE". That is probably sufficient evidence for an indictment, but we'd want to go further up the hierarchy, finding out what succeeded before the failure and considering what should have happened instead of what did transpire. Besides the 02:46 error, there's a related fault at 12:15 we need to examine. All of the evidence, or at least quality samples, needs to be viewed.
This job was done by the same family, and the difference in time stamps is a subject for perhaps another time and case file. There are several successful steps before the fault, though the end result is identical to the 02:46 case. Step 001 has 'Sending document "0000###Weekly Delivery Schedule 02.07.2012" to recipient address "%.%%%%@%%%%.fr".' Subsequent to that, though, is the same fault as before, with "E-mail address not maintained for vendor 0000####".
And what's typical in cases like this is the bad ending. Steps proceed like gumshoes, one after the other, until the trail reaches a sudden halt. Armed with the evidence gathered, I prepared to file my missing persons report.
When I set up surveillance more recently, the remote job was working normally. So it looks like my reports came in handy. I'm left with a few loose ends.
- The user's custom job is still failing. It's set up to run weekly, and because it's not connected to any outside escalation or alarms, it will continue to fail. I'll need to file another missing persons report and wait for the results to come back from the lab.
- Though the outside-controlled batch job is working, the way that it failed leaves puzzling questions. Did the designer expect that some vendor email addresses might not be on file, or assume everything would always be correct? Is there a way to check that all the data exist before continuing? Is there a better way to trap and report the error condition than to have the step fail?
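On that last loose end: in a background job, a MESSAGE of type E or A raises the ERROR_MESSAGE system exception and cancels the whole step, which is exactly the abend in the log above. A defensive alternative is to pre-flight the data and log the gaps without stopping. This is a hedged sketch, not the vendor program itself: the table path (LFA1-ADRNR to ADR6-SMTP_ADDR) is the standard vendor-to-address route, but `lt_vendors` and the send step are stand-ins for whatever the Z program actually does.

```abap
" Sketch: validate vendor e-mail addresses up front, collect the
" misses, and report them as information instead of abending the job.
DATA lt_vendors TYPE TABLE OF lfa1-lifnr.   " assumed input list
DATA lt_missing TYPE TABLE OF lfa1-lifnr.

LOOP AT lt_vendors INTO DATA(lv_lifnr).
  SELECT SINGLE adr6~smtp_addr
    FROM lfa1
    INNER JOIN adr6 ON adr6~addrnumber = lfa1~adrnr
    WHERE lfa1~lifnr = @lv_lifnr
    INTO @DATA(lv_email).

  IF sy-subrc <> 0 OR lv_email IS INITIAL.
    APPEND lv_lifnr TO lt_missing.          " note it, keep processing
    CONTINUE.
  ENDIF.

  " ... queue the weekly delivery schedule for lv_email ...
ENDLOOP.

LOOP AT lt_missing INTO lv_lifnr.
  " Type I goes to the job log without raising ERROR_MESSAGE;
  " message 398(00) is the generic placeholder message.
  MESSAGE i398(00) WITH 'E-mail missing for vendor' lv_lifnr.
ENDLOOP.
```

The design choice is the point: one vendor with a blank address card shouldn't hold up the whole delivery route, and the job log still names every suspect.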
- Another mystery, which I'm working on in another context, is what happens afterward? Just because an email was successfully queued up does not mean that the delivery succeeded. Is there a scan of the dead letter box for return to sender? What happens when the outgoing queue rules change and the current process needs to be rewritten?
What's next for the ABAP detective? Stay tuned.
[Another saga is here: The ABAP Detective Never Signals an Interrupt]