As a short introduction:
I work in the Primary Support group of AGS (Active Global Support). One of the main activities in Primary Support is message solving. This means that the customer messages you create in the service marketplace usually end up with us. When talking to customers, I often get the impression that it is a complete mystery how the SAP end of a support ticket works and what our general approach to problem solving is.
To give you a better understanding of what we do and how support works, we are going to launch a series of blogs. This is just the start, so stay tuned for more.
Ever wondered what a typical day in SAP support looks like? Here we go…
The necessary dose of coffee has been consumed, the notebook booted up and all the tools needed for daily work started. The first thing to check is email. As usual, I have a bunch of mails from my colleagues all over the world who were working while I was asleep. Luckily, this time my coworkers only seek confirmation of things they have already found out on their own. However, one colleague is asking a question about a specific behavior of our CCMS monitoring that leaves me scratching my head. I need more time to investigate this, so I move it to the afternoon as one of today’s action items.
The regular shift of our Global Support Center (EMEA) is starting. This means that it is time for me to check the inbox of our support system for open customer messages that are assigned to me. This is usually the moment of great surprise at the number of messages that came back while I was off duty. Fortunately, not much has happened since I left the office yesterday.
As you know, customer messages have priorities ranging from Very High to Low, so this is also the sequence in which I start processing the open cases. Of course, I still judge whether a message needs immediate attention or whether it can wait until the more urgent issues have been dealt with. Message priority is, after all, just one factor that determines a message’s importance.
At the top of my list I have a performance issue that has been dragging on for quite some time (“some time” in this context means about 2 months). The reason for this was not inactivity or some trial and error approach from our side, but simply the complexity of the issue. The customer opened the message on a particular application component because a CO job was running too slowly from time to time and consequently prolonging his batch processing until the early morning. The application colleagues analyzed the issue, recommended changes to the variants used and even provided coding that was pilot released just to address possible performance issues in this particular scenario. Some progress was made, but the jobs were still too slow sometimes. This part alone took quite some time, so after about 3 weeks, the message arrived on my component with the request to check it from a database perspective. The next couple of weeks were spent identifying the responsible statement (the Open SQL was using a dynamic WHERE clause, hence the intermittent nature of the problem), implementing index and parameter changes, observing tests on the Q system and finally checking the runtime on the P system. The message returned today just to inform me that everything is working fine and that the customer will confirm the case soon. That’s what I call a good start.
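Why a dynamic WHERE clause makes a problem intermittent can be shown with a small sketch. It is not the customer's scenario: it uses SQLite from Python as a stand-in for the real database, and the table `cost_items`, its columns and the index `idx_order_no` are made up for illustration. The point is only that the same piece of code can produce different statements at runtime, and only some of them are covered by an index.

```python
import sqlite3


def query_plan(conn, where_clause):
    """Return SQLite's query plan text for a SELECT with the given WHERE clause.

    The WHERE clause is concatenated on purpose to mimic a dynamically built
    statement; this is for demonstration only, not a pattern for real code.
    """
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM cost_items WHERE " + where_clause
    ).fetchall()
    # The fourth column of each plan row holds the human-readable detail text.
    return " | ".join(row[3] for row in rows)


# Hypothetical table with an index on only one of the possible filter columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cost_items (order_no INTEGER, cost_center TEXT, amount REAL)"
)
conn.execute("CREATE INDEX idx_order_no ON cost_items (order_no)")

# Same program, two different WHERE clauses depending on the selection variant.
plan_by_order = query_plan(conn, "order_no = 4711")
plan_by_cost_center = query_plan(conn, "cost_center = 'CC01'")

print("filter on order_no:   ", plan_by_order)
print("filter on cost_center:", plan_by_cost_center)
```

The first plan should mention the index, while the second should fall back to a scan of the whole table. On a large table that difference is what makes a job fast on one night and slow on the next, depending on which selection the variant produces.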
It is code analysis time again. I have to deal with one of those cases where one of our kernel functions that is callable from the ABAP side returns an undocumented error code, leading to an ST22 dump. As always, I set up a test case on one of my systems to check the input/output behavior in order to see whether the problem is reproducible and whether it can be isolated. But no luck so far, everything is working as expected on my side. Since the problem cannot really be debugged, I have to resort to the good old practice of reading the code line by line and using my brain debugger to verify what those C functions are trying to do. Fortunately, all our major code bases are searchable via TREX, which at least makes locating the relevant coding quite easy.
Finally, it becomes clear that our coding assumes that a certain parameter is set and, if it is not, just throws an exception that leads to a strange return code. I consequently recommend the necessary parameter change to our customer and put the topic on my list of topics that are probably relevant for a note or KBA.
I am now working on the first Very High priority message for today. The customer is facing the issue that his database upgrade has not finished successfully. After initial contact via phone and assuring the customer that we are working on the issue, I come to the conclusion that more log files are necessary than the ones currently attached to the message. This does not sound very spectacular, but it is a major part of support work: asking for logs and analyzing them. After receiving the log files and checking them, it becomes clear that it is just a false alarm. Because of a non-critical misconfiguration, the final check script reported a failure of the upgrade although everything was working fine. After I explain this to the customer, the priority can be reduced and the case closed.
Lunch time: 30 minutes to relax.
Escalated Very High message incoming! This is now what you would call uber critical. The message has been going on for quite some time, has experienced multiple component changes because different application components were thought to be responsible, and has lots of notes and log files attached. Even though it is critical, I have to take the time to read through the 10 pages of message processing that have accumulated so far to get the big picture. I finally figure out that it is all about some inserts done by an SCM planning application that have gone “missing”. Unfortunately, it is a very complex scenario with multiple parallel RFCs involved, and just understanding the steps to debug the scenario takes me nearly half an hour. I quickly come to the conclusion that there is an issue with the implicit RFC commits here. Due to the criticality of the issue, a conference call with application development is set up to discuss the next steps. In the call I emphasize that this has to be a problem with the application logic and that the commit behavior has to be analyzed from their side. In particular, they have to clarify which data is supposed to be committed at which point in time. I also point out a flaw in the provided test case: when the new ABAP debugger is not able to start a new internal mode for RFC debugging because of a parameter-imposed limit, the debugging is done in the same process in which the debugger itself runs, causing implicit commits and resulting in unexpected behavior. We agree that SCM development will review the case, and especially the debugging behavior, once again and will get back to me in case further questions from the database side have to be clarified.
I am now getting back to one of the questions I received via email last night. The CCMS monitoring transactions are reporting wrong numbers for the amount of physical IO done, while the customer’s operating system tools are reporting totally different figures. Again, the first step is to reproduce the problem internally. When I am able to do this, it is usually a win for me, because it means that it is either a bug or I can find an explanation by choosing a suitable weapon from my debugging/code analysis arsenal. Since it is a bad habit of mine to completely misconfigure/damage my test systems while doing my research, I first of all create a new system by cloning one of my existing stable environments via VMware. After doing the same thing the customer is doing, putting some load on the system and starting the relevant monitoring scripts, I am able to observe exactly the same differences. After checking the coding and SQL statements of our collectors and comparing this to the way the OS monitoring tools work, I come to the conclusion that this is not a bug, but just a matter of defining what physical reads/writes mean in the CCMS. This is one of the many cases where everything is fine but you have to point out what the numbers mean and when it makes sense to look at them and when not. I reply to my colleague, including a detailed test case to show that the behavior is expected under certain circumstances.
I continue working on one of my notes, for which I received feedback that adding an example would make my point clearer.
That’s it: a typical support day is over now. My colleagues from AMERICAS will continue, followed by APJ, until it is my turn again.