At this point we were 6 months into the project. My team had completed the PoC, Dev and QAS upgrades on all the applications, working with the new hosting provider as they conducted the data centre transition.
During this time we faced many issues performing upgrades and Unicode conversions using Parallel Export/Import on VMware systems (that is a subject for another day). As a result of these issues I had a nervous client who was not convinced that we could pull this off within the downtime window. This upgrade (designated TC1) was to be the proof of the pudding: the last 6 months of work and preparation would be in full view, and it had to succeed within the window or we needed very good reasons why it did not.
As my team got closer to the upgrade time, we refined the plan more and more. I had worked with my project manager on a previous project and knew he was an absolute whizz at MS Project, but he outdid himself this time. The project plan was based on ASAP, with nested plans and cascading dependencies throughout, which made it very easy to update the top level and see the knock-on effects of slippage. Finally, I conducted walkthroughs with people to make sure the timings were reasonable and that no one was staying up too late.
As we had 4 days of downtime, I had a technical time budget of 60 hours to do all the upgrade work across all the applications. We worked out a shift system that gave the 6 people on the upgrade rest periods during the upgrade.
It is important that responsibility be shared among the team members, but ultimately there has to be a single person responsible for each application. That person has to be the expert in that particular system during upgrades. If there is an issue, they have to be involved in the troubleshooting; although they do not get the final decision (that's the team lead's responsibility), they are pretty close to it.
The table below outlines the known issues that we had going into the trial cutover and the measures we had developed to try and overcome them.
Apart from the known technical issues, we also had non-technical issues.
In every upgrade it is vitally important to be able to gauge the performance of the upgrade so that you can answer questions like: are we on course, do I need to reschedule people, what do I report to the project, when can I sleep? To help with this, SAP provides an excellent analysis file called UPGANA.xml, which you will find in the HTDOC\ directory under DIR_PUT. As you can see from the picture below, it has much of the top-level information.
It also has all the timings for the upgrade phases, which is vital for keeping an eye on your general upgrade performance. You should use the log files for specific timings during the process, but I find this file works well for project managers :-)
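To illustrate how phase timings can answer the "are we on course" question, here is a minimal sketch in Python. It assumes you have already copied planned and actual phase runtimes out of the analysis file or the logs by hand. The phase names and hours are placeholders made up for the example, not figures from this project, and `project_remaining` is a hypothetical helper, not part of any SAP tool.

```python
def project_remaining(phases):
    """Estimate the hours still to run from the phases completed so far.

    phases: list of (name, planned_hours, actual_hours), where
    actual_hours is None for a phase that has not finished yet.
    Returns (slippage_factor, projected_remaining_hours).
    """
    planned_done = sum(p for _, p, a in phases if a is not None)
    actual_done = sum(a for _, p, a in phases if a is not None)
    # Slippage factor: how much slower (or faster) than plan we are running.
    factor = actual_done / planned_done if planned_done else 1.0
    planned_left = sum(p for _, p, a in phases if a is None)
    # Assume the remaining phases slip by the same factor (a crude guess).
    return factor, planned_left * factor


# Placeholder values for illustration only.
example_phases = [
    ("activation", 4.0, 6.0),
    ("import", 2.0, 3.0),
    ("conversion", 10.0, None),
]
factor, remaining = project_remaining(example_phases)
print(f"running at {factor:.2f}x plan, about {remaining:.1f} hours still to go")
```

Applying one slippage factor to every remaining phase is a simplification. A CPU-bound phase and an I/O-bound phase will not necessarily slip by the same amount, so treat the result as a prompt for a conversation rather than a forecast.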
The table below shows 5 of the longest running phases in the Uptime and Downtime part of the PoC and TC1 upgrades for comparison.
As you can see, the PoC was a much slower upgrade, and the risk averse among you are probably wondering how I managed to keep my client on my side with timings like those of the PoC (I'm not telling). The TC1 times show that the PoC was massively underpowered in terms of CPU. This can be seen in the difference in the activation times, which are CPU bound, and the similarity of the Import phase, which is I/O bound.
In terms of overall runtime, the table below summarises the main phases as I have them recorded.
As you can see from the table above, we massively blew our time budget, which was concerning for everyone, and we had a lot of explaining to do. However, we had captured a lot of good data on why we had issues, what we did to resolve them and how we could mitigate them for TC2.
This is shown (as usual) in the table below.
As I have said above, we blew the transaction log several times during this Unicode conversion, and it is important that I explain why this happened. When running a Unicode conversion, the exports are usually fine, as they are read operations. Imports are write operations (DML), and these will be captured in the transaction logs/online logs unless they are flagged as part of a bulk load (Oracle, DB2, SQL Server); there are SAP Notes that explain how to enable this. Similarly, if you are deleting the records of an entire table, you use the Truncate table command, which is non-logged as well. So far so good.
Truncate table BKPF
The issue with transaction/online logs raises its head when you are deleting records because a Unicode import process for a table you have split has failed and you have restarted it. The restart uses this type of command:
Delete from BKPF where
"GJAHR" <= '1009' and "BELNR" < '006919999' and "BUKRS" = 'XX01' and "MANDT" = '100'
This is a logged operation and, depending on the size of the record set, it could blow your transaction log. It is unlikely to do this on its own; it is more likely to happen when a group of repeated packages perform deletes together.
The final issue we had was not being smart about how we restarted the failed Unicode processes. We did not manually restart failed processes until the end of each server's Unicode run, so when we did restart them we had over 50 Unicode processes all trying to delete from the DB at the same time. As shown above, this is an easy way to blow your transaction/online log. It caused us a great deal of pain and contributed to the database crash, but we learnt a great deal from it and we applied those lessons to TC2.
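One way to avoid 50 simultaneous deletes is to restart failed packages in a throttled fashion, so that only a few are running at any moment. The Python sketch below shows the general idea of bounded concurrency. It is an illustration, not what we actually ran. `restart_with_limit` and `make_probe` are hypothetical names, and the probe function is a stand-in for however a failed package is really restarted in your environment.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def restart_with_limit(packages, restart_one, max_parallel=4):
    """Restart failed packages with at most max_parallel running at once.

    restart_one is whatever callable restarts a single package
    (hypothetical here); the pool size caps how many run concurrently.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(restart_one, packages))


def make_probe():
    """Build a fake restart function that records peak concurrency."""
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def restart_one(package):
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(0.01)  # stand-in for the logged delete and re-import
        with lock:
            state["running"] -= 1
        return package

    return restart_one, state


# 50 failed packages, but never more than 4 deleting at the same time.
failed = [f"package_{i:02d}" for i in range(50)]
probe, stats = make_probe()
restarted = restart_with_limit(failed, probe, max_parallel=4)
print(len(restarted), "packages restarted, peak concurrency", stats["peak"])
```

The right value for `max_parallel` depends on the size of your transaction/online log and on how many rows each delete touches, so it is something to establish during a trial run rather than guess.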
So we finally made it through TC1, collected a wealth of important data (which we'll analyse in the next post) and ultimately reaffirmed we were on the right path. Next up was Trial Cutover 2; all we had to do was make it through Christmas alive!