We had been on SAP BusinessObjects XI Release 2 for a while and it proved to be a very stable platform. While we had few issues, when an issue did come up, I was able to fix it quickly. However, just when you think you are safe and comfy in a stable environment, your boss will throw you back into the water.
The Mandate from Management
Due to some new requests from the business users, last November our boss told us that we must upgrade to SAP BusinessObjects XI 3.1 by the end of December. This meant we had roughly a month and a half to prepare, design, and implement the upgrade. To make things worse, we had only four days (management originally allowed only three; we had to twist some arms to get the fourth) to shut down the production system, install the software, configure the system, migrate the content and the users, conduct testing, and then go live. It was a very tall order.
In order to understand the magnitude of the upgrade, let me first tell you the size of our environment. We have eight physical and dedicated production servers, five dedicated test servers, and two dedicated development servers; we also have more than 20,000 users in our production system. Each Monday morning at peak time, we have about 800 concurrent users. Therefore, upgrading our environment is not a small task. It was sort of a "make it work or else..." situation, and therefore an assignment that "we could not refuse" (or postpone for that matter).
The Test Environment
Resolving the Major Show-Stopper -- Custom Java Code
First I installed XI 3.1 on our test environment, and the installation and content migration went well. However, I discovered a big problem. We had written a lot of custom Java code using the BOE SDK and Report Engine SDK, and the relative paths of the JSP files and jar files in XI 3.1 are different from those in XI Release 2. That was an easy fix because all we needed to do was update the paths, but there was a bigger problem.
Some of the Java functions were deprecated and we couldn't find their replacements, which set us back by about two weeks. We began to worry as this was a major show-stopper. We could not proceed with the upgrade without resolving this problem. We opened a message with SAP Tech Support, which was very responsive and helpful, and they pointed us in the right direction. We finally found the workaround, but we had to re-code some of our functions. I had my BusinessObjects Administrator focus on rewriting the code while I focused on the preparation and everything else. This put us further behind schedule. Unfortunately, we then encountered another problem, this time with SSL (Secure Sockets Layer). The test domain name that we used to access Infoview didn't work, but we worked diligently with the IT network administrators to resolve the DNS problem.
Preparing for the User Migration
The biggest challenge on this upgrade was the user migration. As I mentioned before, we have 20,000+ users in our production system and there's no easy way to set them up. Since XI 3.1 has a different security scheme, it was best to rebuild the security from the ground up. I manage users by groups, which I think is the best approach. The first thing I needed to do was set up the report folders and user groups. Next, I assigned object permissions to those report folders by user group. After that, I arranged all 20,000+ users in CSV format, which is the easiest way to load a large number of users in bulk.
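For readers who want to prepare a similar bulk load, here is a minimal Python sketch of splitting one master user list into per-group batches and catching duplicate IDs before loading. It is not the process I actually used, and the column names (`user_id`, `full_name`, `group`) are hypothetical; they are not the layout that the Import Wizard expects, so adapt them to whatever your loader requires.

```python
import csv
import io
from collections import Counter, defaultdict


def split_users_by_group(csv_text):
    """Return {group name: [user rows]} from a CSV with a 'group' column.

    The column names are assumptions made for this illustration only.
    """
    batches = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        batches[row["group"].strip()].append(row)
    return dict(batches)


def find_duplicate_ids(csv_text):
    """Return the sorted user IDs that appear more than once in the file."""
    counts = Counter(
        row["user_id"].strip() for row in csv.DictReader(io.StringIO(csv_text))
    )
    return sorted(uid for uid, n in counts.items() if n > 1)


# Hypothetical sample data; real files would be read from disk.
SAMPLE = (
    "user_id,full_name,group\n"
    "jdoe,Jane Doe,Store Managers\n"
    "asmith,Al Smith,District Managers\n"
    "bwong,Bea Wong,Store Managers\n"
    "jdoe,Jane Doe,District Managers\n"
)

if __name__ == "__main__":
    for group, rows in split_users_by_group(SAMPLE).items():
        print(group, [r["user_id"] for r in rows])
    print("duplicates:", find_duplicate_ids(SAMPLE))
```

Checking for duplicates up front matters because loading one group at a time makes it easy to miss a user who appears under two groups in the source file.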
Next, I faced a key decision: should I load all 20,000+ users in test and migrate them from test to production during the four-day downtime, or should I migrate the user groups shell from test to production during the four-day downtime and then load the 20,000+ users directly to production? I chose the latter because I trust my loading process more than the migration process. This method might take a little more of the valuable down time, but I have better control over my loading process, and it is easier for me to troubleshoot any potential errors.
It was a good thing that I set up the user groups and object permissions first in the test environment. The CMC (Central Management Console) is not meant to be used for a large number of users, and certainly not for 20,000+ users as in our case. The new CMC is built in Java so it is rather slow. It might be an oversight, but every time you set up the object permissions for a particular user group and click "OK" to save, the whole screen is refreshed. It does not keep your place so that you can go back to the item you were working on. Instead, you are back at the top of the page. If you have a lot of user groups to work with, like we do, you need to pay attention to where you are. I hope SAP's Product Development will change this behavior in the next release (i.e. XI 4.0).
Keeping Key Operational Reports Available
With the limited amount of time to prepare, we were somewhat ready for the upgrade. Since Operations uses our reports on a daily basis for key operational decisions, they cannot afford any downtime. In retail, Operations is king, so naturally we made a workaround for them. I migrated all the report and universe content from production (XI R2) to test (XI 3.1), and then I changed the universe connections to point to production data. All the senior executives at Home Office (the name we use for our corporate headquarters) and in the field had user accounts set up in test so that they could continue to run the reports without interruption.
Migration Day 1 - Shutting Down Production and Installing Software
Finally, it's show time! We were pumped up and ready to "slay the dragon"! On December 29, 2009, day one of our upgrade implementation, I shut down production, uninstalled XI Release 2 from all eight servers, and then began installing XI 3.1. The installation and configuration (e.g. clustering) went smoothly without problems.
Migration Day 2 - Migrating Reports, Universes, and User Groups
On day two, I migrated all the reports, universes, and user groups from test to production. Again, it went well with no errors. Now, it was time to perform the big task. I took all the users in CSV format and loaded them one group at a time into the new Enterprise XI 3.1 system by using the Import Wizard. Together with the Account Manager tool and the prefsCopyUtil.jar file, setting up and configuring all 20,000+ users became relatively easy, even though it was very tedious.
Since I knew exactly how it worked and I had great control of the whole process, the loading of users also went well. After loading all 20,000+ users into the system, I started testing their object security and data security. Of course, the object security is controlled by groups in the CMC. For data security, I believe the best way to manage it is in the data mart where it belongs (I prefer not to use the Row Level Security feature in the Semantic Layer). Basically, I set up a security table in the data mart with each user's ID and organizational level. Then our Data Warehouse team maintains an organizational hierarchy and alignment table (actually we have two...one vertical table and one horizontal table). When a user logs on to Infoview, we capture the @variable('BOUSER'), then join it with the organizational hierarchy table to find out who the user is and what data he or she can access. These tables are easy to set up and easy to maintain or update.
When you have a large number of users, ease of maintenance is very important. This is how it works: if the user is the Regional Director in Dallas, Texas, for example, he or she can see all of his or her districts and stores (the Regional Director is the ancestor and the Districts and Stores underneath are the descendants in Data Warehouse terminology), but he or she cannot see the stores in New York that belong to another Regional Director under a different Division. The user loading process took longer than expected because of the sluggishness of the CMC's Java interface. It was not completed until the morning of day three. Nevertheless, the user access permissions and the reports were all working properly. I felt like a big load was off my shoulders.
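To make the ancestor/descendant rule for data security concrete, here is a small Python sketch of the access check. It only illustrates the logic; in our system the check is a join in the data mart driven by @variable('BOUSER'), not application code. The table contents, the node names, and the function names below are all hypothetical.

```python
# Hypothetical "vertical" hierarchy table: each org unit maps to its parent.
PARENT = {
    "Store 101": "District Dallas North",
    "District Dallas North": "Region Dallas",
    "Region Dallas": "Division South",
    "Store 201": "District Manhattan",
    "District Manhattan": "Region New York",
    "Region New York": "Division East",
}

# Hypothetical security table: user ID -> org unit the user is assigned to.
USER_ORG = {
    "rd_dallas": "Region Dallas",
    "rd_newyork": "Region New York",
}


def self_and_ancestors(node):
    """Return the node followed by its ancestors, walking up the hierarchy."""
    chain = []
    seen = set()
    while node is not None and node not in seen:  # 'seen' guards against cycles
        chain.append(node)
        seen.add(node)
        node = PARENT.get(node)
    return chain


def can_access(user_id, node):
    """A user may see a node if the user's org unit is that node or an ancestor of it."""
    assigned = USER_ORG.get(user_id)
    return assigned is not None and assigned in self_and_ancestors(node)


if __name__ == "__main__":
    print(can_access("rd_dallas", "Store 101"))  # Dallas store, Dallas director
    print(can_access("rd_dallas", "Store 201"))  # New York store, Dallas director
```

Because access follows from a single row per user plus the shared hierarchy, moving a store to a new district or reassigning a director means updating one row rather than touching per-user rules.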
Migration Day 3 - Dealing with Oracle JDBC Issues
But then, we ran into another problem. In the afternoon of day three (i.e. New Year's Eve), when we tested some of our custom Java code, we were getting an Oracle JDBC error. It puzzled me because we did not have such an error in test. It threw us out of our groove, and we had only ONE day left. I started looking into the differences between the two systems, and after some investigation I discovered that our management had updated the Oracle client on the production web servers from 9i to 11g. It was a big mistake. Apparently, our management was not aware that our custom Java code was written to work with Oracle 9i. In order to work with Oracle 11g, we had to update or possibly rewrite some of our custom code.
We examined the code and we knew that it wouldn't be a quick fix. Since we had only one day left and we still had a lot of remaining tasks to complete before we could go live, I decided not to pursue a fix for 11g. Instead, I reverted the Oracle client back to 9i. However, the Tomcat installation was still messed up. I did not want to reinstall the web servers because we did not have the luxury of time. After examining different alternatives, an idea suddenly came to mind. Our test environment was still using the 9i client. This is the beauty of Java being portable; all I had to do was copy the entire webapps folder from test and use it to replace the webapps folders on our production web servers. And voila, the custom Java code was working. Unfortunately, this issue sidetracked us and used up some valuable time. We were behind schedule and we started to worry...
Migration Day 4 - Testing the Reports and Murphy's Law
On day four (i.e. New Year's Day), we asked our report developers to start testing the reports. Of course, there was no way we had the time and the resources to test every single report in less than a day. So we tested only some random samples and made sure we covered the important reports (those used by senior executives at Home Office and in the field).
But remember Murphy's Law...anything that can go wrong will go wrong? The production domain name and the SSL were not working. We mistakenly thought IT would have their ducks in a row after the same issue we experienced in the test environment. Due to the holiday, we had only one network administrator on call to help us with this upgrade, but he couldn't figure out what was wrong. It was late in the day and we were all exhausted, so we made the sensible decision to hold back the "go live" for one more day. We would rather be late than rush through it. It was a good decision after all. This not only allowed the network folks to fix the DNS problem properly, but also allowed us more time to test our reports.
Migration Day 5 - Going Live
So on day five, Saturday, January 2, 2010, our new system went live...finally. We then realized there were some more user interface issues that we had not discovered in our test environment. Users on Internet Explorer 8 experienced different behavior when interacting with our custom Java code that handled the report windows. Also, users who had installed Google Toolbar on their computers had to disable it before running the Web Intelligence reports or else they experienced abnormal behavior. To keep the users from overwhelming the Help Desk, I quickly compiled a self-help document that covered all the common issues the users might have and their respective solutions, and then published it on our report logon page where everyone could access it. Other than that and some login account issues (which were expected), everything was working fine.
With the limited amount of time to prepare and to execute, I consider this upgrade a great success. We fought through many hurdles and delivered the new system just one day over schedule. We congratulated each other on a job well done, and then we had to ramp up again to tackle the next challenge coming down the pipe. There was no time to celebrate or reflect (until now) on this accomplishment.
Learning the Lessons in Hindsight
Hindsight is always 20/20. The lesson learned is that, before making any important decision (like a major upgrade), management might want to solicit input from the team members first. You never know, sometimes the most junior member of your team might come up with the best idea. I understand that due to the rapid change in business needs, we have to react at the "speed of the business", which is perfectly fine. However, insofar as possible, it is always better to allow adequate time for planning because not every story has a happy ending.