I was originally planning to write my first blog post on the new SCN about its platform architecture, the integration aspects of the landscape, our migration approach, and the delivery model. But with the events of the last few days, I came to the conclusion that it would be better to give you some background information on what happened over the past five days.

Sunday

The production system had been live since the middle of December, and a small group of beta testers (~1000 SAP Mentors, Moderators and SAP employees) had been invited to get a first look and provide feedback. During that time the focus was mostly on the overall performance of the system and on getting bugs fixed. Twice a week we deployed to the platform and ran a delta migration of the latest blog and forum posts published on the old platform. The idea was to have lots of regular small go-lives every week in order to test the process and avoid any surprises on the actual go-live weekend.

The last delta migration ran this Sunday, and aside from some tasks specific to this special weekend, we had done the whole procedure many times before. The final switch was to enable the content redirect from the old platform and to enable the contributor permissions for the community members. Around 5 pm CET the core SAP IT team was in our connect room and we made the switch. I didn't expect any big surprises, but it was still a very exciting moment to see the first community members exploring the site. We did the basic checks on the system, watching the traffic and load on the app and DB servers and checking the logs, but in the end it was still Sunday.

We knew the next morning was when it would get interesting, so we all headed home to get some sleep and be fit for the real traffic.

Monday

And it got interesting. The Indian part of the community had already been on the system for about 3-4 hours, and with increasing load we saw the servers going down. This was unexpected, because we had done very extensive load tests beforehand. We knew what traffic to expect, and our load tests had shown that the system was able to handle even peak situations. What was going wrong?

We got heap dumps of the dying app servers and quickly found that a certain widget, which was integrated into basically every content page, was the troublemaker. The so-called MoreLikeThis widget shows a list of content similar to the one currently displayed. To identify the similar content objects, expensive queries against the metadata of the page were necessary, and these were eating up memory and CPU resources needed for normal site operations. Because the widget is loaded via AJAX, our load test scripts did not trigger these queries, and since it was not considered typical end user activity, we had not covered it in the load test.
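
To make the gap concrete, here is a rough, purely illustrative sketch in Java (URLs and endpoint names are invented, not the real SCN ones): a load-test step that requests only the HTML page never executes the widget's queries, because a script, unlike a browser, does not fire the follow-up AJAX call on its own. Covering that call explicitly would have surfaced the cost during the tests.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical load-test step: hit the content page *and* the asynchronous
// widget endpoint a real browser would call afterwards, so widget-triggered
// queries show up under load. All URLs below are invented for illustration.
public class PageLoadStep {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    static void hitContentPage(String baseUrl, long contentId) throws Exception {
        get(baseUrl + "/docs/" + contentId);                             // the HTML page
        get(baseUrl + "/widgets/morelikethis?contentId=" + contentId);  // the AJAX call
    }

    private static void get(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
            CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + url);
    }
}
```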

Once we had identified the issue, and since the system was down anyway, we removed the widget and started the app servers again.

The traffic came back, but the system did not recover very well. The DB was under permanent load, mostly due to specific long-running queries, which were easily identifiable as queries from the "Browse Content" tab of each space. Even more interesting, probably 99% of these queries were unique, so they could not be properly cached by the DB. The content browse functionality is specifically designed for high load and large tables, in our case ~2.1 million content objects. But the high number of unique queries was what got the DB into trouble: it could not deliver the results quickly enough, and with 50 and more requests in parallel, the app servers were either waiting for query results or running into timeouts.
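
To illustrate why uniqueness matters for the DB (this is a generic example with invented table and column names, not the platform's actual code or the fix that was later applied): when filter values are concatenated into the SQL text, every user produces a new statement that the DB has to parse and plan from scratch, while a parameterized statement keeps one reusable text.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Generic illustration, not the platform code: the same browse filter expressed
// as a concatenated query (unique text per user, nothing for the DB to reuse)
// versus a parameterized query (one statement text, only the bound values vary).
public class BrowseQueries {

    // Every distinct content type / search term yields a distinct SQL string.
    static String concatenatedVariant(String contentType, String term) {
        return "SELECT id, title FROM content"
             + " WHERE content_type = '" + contentType + "'"
             + " AND title LIKE '%" + term + "%'"
             + " ORDER BY modified_date DESC";
    }

    // One statement text for all users; the DB can cache and reuse its plan.
    static ResultSet parameterizedVariant(Connection connection, String contentType,
                                          String term) throws Exception {
        PreparedStatement statement = connection.prepareStatement(
            "SELECT id, title FROM content"
          + " WHERE content_type = ? AND title LIKE ? ORDER BY modified_date DESC");
        statement.setString(1, contentType);
        statement.setString(2, "%" + term + "%");
        return statement.executeQuery();
    }
}
```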

What was happening, and why? We knew we had to expect a very different usage pattern on the new platform. With all the social capabilities we were expecting a much higher share of write accesses, even if only simple likes or ratings, and we had all of this covered in our load test. What we did not expect was that users would use the content browser to search for content, but this is exactly what happened. The content browser allows filtering by content type (blog posts, discussions, documents etc.) but also by custom text entry. Although we had some variants of these filters in our test suite, we were not prepared for this large number of different variants, which is what made the actual difference performance-wise.

With the widget removed the app servers were doing OK, but due to the DB timeouts the actual transactions were failing, resulting in front-end errors and a very bad end user experience.

The situation improved slightly as the near caches became better utilized, reducing the need to query the DB. The goal at this point was to keep the app up and running and to monitor it to get better insight into what was going wrong. Even with these problems we saw high load on the system and frequent contributions, often multiple per minute, while being hit hard with read requests.
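
For readers unfamiliar with the term: a near cache is a small local copy of frequently read objects that each app server keeps in front of the remote cache and the DB. The sketch below is a minimal, simplified illustration of that read-through idea, not the platform's implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal read-through near cache: serve repeated reads from local memory and
// only fall back to the remote lookup (remote cache or DB) on a miss.
// Simplified illustration only; the real platform cache is far more involved.
public class NearCache<K, V> {

    private final Map<K, V> local = new ConcurrentHashMap<>();
    private final Function<K, V> remoteLookup;

    public NearCache(Function<K, V> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    public V get(K key) {
        // Once warmed up, this answers locally and saves the DB round trip.
        return local.computeIfAbsent(key, remoteLookup);
    }

    public void invalidate(K key) {
        local.remove(key);   // drop the local copy when the object changes elsewhere
    }
}
```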

Tuesday

Overnight the system was very shaky, and our great 24/7 support had to restart individual app servers on a regular basis due to unresponsiveness and the resulting removal from the cluster. In the late morning the problems became more critical again, but not as bad as the day before. More requests could be served from the local or near caches, which gave the DB a bit of air to breathe.

But this was when the sign-out issues started to grow. We saw increasing feedback from the community about losing their sign-in sessions, of course paired with the overall bad performance of the application. Still, every app server was able to keep up for multiple hours, and rolling restarts of dying servers allowed us to keep the cluster intact and serve the requests as much as possible.

The goal in this situation was basically not to make things worse, to allow the community to use the platform as much as possible, and in parallel to gather more information about the issues we were facing. The one thing we wanted to avoid at all costs was a longer downtime. You have to keep in mind that thousands of consultants, partners and employees rely each day on the information made available on SCN and on the tremendous feedback of its regular contributors. Even though the overall experience was really bad, we still saw high traffic, with both read-only requests and contributions. Members were already working on the system, posting to the discussion forums, asking questions and providing answers. Blog posts started to show up, and the content team was working with the community on getting more content into the platform.

We were gathering as much information as necessary to plan the next steps for mitigating the identified issues. The DB was still under unusually high load, and we investigated short-term remedies like adding server resources and table indices, but also considered upgrading the DB version. The app servers were losing their connection to the near caches before dying. With support from our platform vendor we considered moving from an HA setup with three cache servers to a single one, which sounds counterintuitive at first. In a three cache server setup, each app server has to write its updates to at least two caches, so by reducing it to only one cache server we would actually take load off the app servers. We were also considering restricting the browse functionality in the front end to reduce the number of unique queries.
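
The arithmetic behind that counterintuitive step, as a back-of-the-envelope sketch (the write rate below is an assumed number, not a measurement): halving the number of copies each update has to reach directly halves the cache writes every app server performs.

```java
// Back-of-the-envelope illustration of the fan-out argument. The write rate is
// an assumed figure for the sake of the example, not a measured one.
public class CacheFanOut {
    public static void main(String[] args) {
        int updatesPerSecond = 200;   // assumed per-app-server update rate
        int copiesHaSetup = 2;        // copies each update must reach with three cache servers
        int copiesSingle = 1;         // copies with a single cache server

        System.out.println("three cache servers: "
            + updatesPerSecond * copiesHaSetup + " cache writes/s per app server");
        System.out.println("single cache server: "
            + updatesPerSecond * copiesSingle + " cache writes/s per app server");
    }
}
```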

With a lot of opinions and assumptions at hand, we started our load tests again, this time with different DB and cache server configurations, in order to verify that the possible options would not get us into more trouble than we already were.

In parallel we identified the most critical functional issues that we would be able to address with any hot fix required. The issues had to be critical to the current user experience, and the fixes must not introduce any new ones, so only small fixes were considered.

In the late afternoon we had our regular call with the vendor support team to discuss our findings, share the information we had gathered, and go through possible proposals. With their support located in the US, they were able to work on a patch that would allow us to configure the number of thread connections to the cache server, which was one of the bottlenecks identified.
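
In general terms, "making the number of connections configurable" looks like the sketch below: a bounded pool whose size comes from configuration instead of being hard-coded. The property name and pool shape are my own illustration, not the vendor's patch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative bounded pool of cache connections whose size is read from a
// system property instead of being hard-coded. Property name and types are
// invented; this is not the vendor's actual patch.
public class CacheConnectionPool {

    private final BlockingQueue<Object> connections;

    public CacheConnectionPool() {
        int size = Integer.getInteger("cache.connection.pool.size", 4);
        connections = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            connections.offer(new Object());   // stand-in for a real cache connection
        }
    }

    public Object borrow() throws InterruptedException {
        return connections.take();             // blocks once all connections are in use
    }

    public void release(Object connection) {
        connections.offer(connection);
    }
}
```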

Wednesday

Our 24/7 support team in Dresden takes care of the operations side while the rest of the team in the European time zone gets some sleep. App server restarts were again required during the night, but the last two days had allowed us to develop a pattern for keeping the system up and running in a controlled state, although the issues were ongoing.

One of the pressing issues has been the irregular sign-outs from the platform, which we initially attributed to the ongoing errors. With further analysis of the logs we saw a correlation between certain spikes in the app server load and corresponding 502 errors, which led to user sessions being transferred by the load balancer to a different app server. The load spikes happened at regular intervals and resulted in high memory usage and a stalling JVM. During that time the load balancer was not getting a response from its health check, assumed the server wasn't responding anymore, and redirected the request to a different app server on which the user session wasn't known, so a new sign-in was required.
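
The chain of events, modelled very roughly (all class and method names below are invented): the load balancer pins a session to one app server, the health check fails while that server's JVM is stalled in a full GC, the next request is failed over to another node that has never seen the session, and the user lands on the sign-in page.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Very simplified model of the failure chain; all names are invented and the
// real load balancer logic is of course more sophisticated.
public class StickyRouter {

    interface AppServer {
        String name();
        boolean healthCheckOk();          // fails / times out while the JVM is stalled
        boolean hasSession(String sessionId);
    }

    private final Map<String, AppServer> sessionPins = new ConcurrentHashMap<>();

    AppServer route(String sessionId, List<AppServer> cluster) {
        AppServer pinned = sessionPins.get(sessionId);
        if (pinned != null && pinned.healthCheckOk()) {
            return pinned;                                  // normal sticky routing
        }
        // Pinned node looks dead: fail over to any healthy node.
        for (AppServer candidate : cluster) {
            if (candidate.healthCheckOk()) {
                sessionPins.put(sessionId, candidate);
                if (!candidate.hasSession(sessionId)) {
                    System.out.println("session " + sessionId + " unknown on "
                        + candidate.name() + " -> user has to sign in again");
                }
                return candidate;
            }
        }
        throw new IllegalStateException("no healthy app server -> 502 to the client");
    }
}
```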

The regularity of the spikes raised the suspicion that they could be triggered by some background task. The most likely candidate was a task responsible for collecting newly created content and feeding it into the internal search index. This index has a size of roughly 25 GB, and the unfortunate architecture of its implementation requires a copy on every single app server. Most operations on this index are fairly memory intensive, and this particular activity led to a temporary out-of-memory situation and multiple successive full GC runs. We gave it a try, disabled this task on a single node, and restarted the server.
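
The kind of switch we were after looks, in general terms, like the sketch below: a periodic index-feed job guarded by a per-node flag, so a single app server can run without it for comparison. The property, interval, and class names are invented for illustration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative periodic index-feed task that can be switched off per node via a
// system property. Names and the interval are invented; this is not the
// platform's actual scheduler code.
public class IndexFeedTask implements Runnable {

    @Override
    public void run() {
        // memory-heavy work: collect new content and merge it into the local index
        System.out.println("feeding new content into the search index ...");
    }

    public static void main(String[] args) {
        boolean enabled = Boolean.parseBoolean(
            System.getProperty("search.indexfeed.enabled", "true"));
        if (!enabled) {
            System.out.println("index feed disabled on this node");
            return;   // nothing scheduled, so no periodic memory spike on this server
        }
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new IndexFeedTask(), 0, 5, TimeUnit.MINUTES);
    }
}
```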

The logs for this server showed a substantially lower number of 502 errors even after 60 minutes, so we rolled out the configuration to the rest of the cluster, followed by a rolling restart.
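
The before/after comparison itself is nothing fancy; conceptually it boils down to counting 502 responses per node, along the lines of the sketch below (log path and format are assumptions, real access-log formats differ).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Toy comparison helper: count responses with status 502 in an access log.
// The default path and the " 502 " pattern are assumptions about the log format.
public class Count502 {
    public static void main(String[] args) throws IOException {
        Path log = Path.of(args.length > 0 ? args[0] : "access.log");
        try (Stream<String> lines = Files.lines(log)) {
            long count = lines.filter(line -> line.contains(" 502 ")).count();
            System.out.println(log + ": " + count + " responses with status 502");
        }
    }
}
```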

It worked! [Twitter feedback]

It did not solve all our issues, but the application was finally closer to our expectations performance-wise, and with the sign-out issue (mostly) solved, the end user experience improved a lot. It was finally possible to get some work done on the system without the massive frustration of getting kicked out while posting a forum reply or drafting a blog post.

Thursday

There are still issues with the application that don't allow us to go into normal operations mode. The servers are performing much better, but a couple of restarts were still required during the day. There is much more work ahead.

One thing we did was to separate the DB schema responsible for the activity stream and move it to a different server. We also reduced the number of cache servers, which lowered the network traffic and took some load off the app servers, since they no longer have to send the content to multiple cache servers.

The load on the DB is still way too high, and we are working closely with the vendor to improve the responsible queries, which are built dynamically by the application. This requires code changes tailored to our specific setup (which is not unusual, but more on that in a different blog post).

Work has started on fixes for pressing issues we saw in your feedback, like redirect loops at login, problems with updating the user profile page, and usernames not getting changed during the initial login to the app. These fixes will be rolled out soon, along with further performance improvements. We are listening to your feedback and we are aware of some missing blog posts and forum threads. All the reports are being collected and we will provide the missing content soon.

It has been a rough start, and I apologize for the subpar experience you had in the first three days. Projects of this size are always a challenge in some form, and this one is no exception. Thanks a lot for your amazing support and the tremendous feedback you have provided during this week. Please continue to send us bug reports in our SCN Bug Reporting [Read-only] space, as this is invaluable feedback.

I'll keep you updated.

Oliver
