This blog post will cover some part of this topic which seems to be the most painful and unpredictable situation: database corruption. They're mostly physical inconsistencies, associated with storage and raid controllers (physical devices), drivers, software controllers, etc.
Database corruptions can happen at any time due to many factors, but most of them caused by hardware issues. You should be prepared for it.
Of course this blog post is intended to help BASIS administrators or new DBAs starting with SQL Server. If you're already an experienced SQL Server DBA and is new to SAP, this blog has some useful information, but nothing that you haven't seen before I hope. If you're an "old sea dog" on SQL Server and SAP, then probably the content of this blog is irrelevant with redundant information that bothers you every day .
Identifying database corruptions:
I can say that 2 DBCC CHECKDB are enough, most of the times, to check the database consistency. 1 result against the database and a second result against either the same database on the same database server/instance after you clear the buffers (it may be only a transient corruption, but still a corruption!) or you restore the current database against a second server (preferable) or another instance on the same server and get the results to compare.
Make sure you read completely the SAP Note 142731 - DBCC checks of SQL Server.
Performing the first DBCC CHECKDB:
In the SQL Server Management Studio perform the command:
DBCC CHECKDB ('<SID>') with no_infomsgs;
Save the results for later analysis.
Re-validating the results:
When occurs a first access a database page, it's first stored in the data buffer of the SQL Server. Due to software bugs or defective RAM modules, a transient corruption may occur. When you invalidate the data cache, the database corruption can simply go away - but you shouldn't sit and forget about it! Always investigate the source of the corruption! For this it's necessary to have a second result to ensure that the database is really free of corruption or the corruption was transient. Or in the worst case, confirm a database corruption.
When you invalidate the data cache, the SQL Server will have to read all the data from the disks again. If you use a RAID system, some pages can be out of sync (RAID system mirror not identical). Due to optimization the reads are taken sometimes from the original and sometimes from the mirror. With this, the results of the DBCC CHECKDB can differ each time you execute it after you clear the data cache.
I personally suggest to backup / restore your database into another server/instance. Preferably, to another hardware, to eliminate a complete investigation over a system that is suspect.
Reset the data cache:
To clear the buffer cache, perform the command:
Note: This command invalidate the buffer cache and if you perform it in a production system, the database will have to read all the data from the disk again. The users will observe bad performance and even TIME OUTS when running programs and transactions. If it's not feasible for you to invalidate the buffer cache, then backup and restore the database into another server/instance!
- Backup the database and restore it into another server/instance.
This option will eliminate some questions about the effectiveness of the analysis and efforts in order to fix/restore your database/system. I won't enter in details on how to backup / restore your database.
- Repeat the DBCC CHECKDB, save the results to analyze it.
An inconsistency has been detected. What now?
Well it's time to identify them in the results. Depending on the following message, you're able to repair the tables:
CHECKDB found 0 allocation errors and 8 consistency errors in database 'NW1'.
repair_rebuild is the minimum repair level for the errors found by DBCC CHECKDB (NW1).
The above results indicates that the corruption is not severe and you can fix without data loss, which is the opposite of the following error message:
CHECKDB found 10 allocation errors and 1739 consistency errors in database 'NW1'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (NW1).
It indicates that at least one table is severely corrupted and cannot be repaired.
I won't go in deep on methods that can help you to extract the data from a severely damaged table because it's not the goal of this blog post. You may also keep in mind that depending on the table that is affected you can also rebuild the table with data from other tables (it depends on the component that is responsible for the table involved) or it's a table that doesn't store sensitive data and can simply be truncated or dropped and recreated. But for this I strongly suggest you to read the entire blog. I'll approach later with some suggestions if your minimum repair level is "repair_allow_data_loss" - also known as RADL.
For each table that is listed in the CHECKDB results is necessary to perform another DBCC CHECKTABLE to check the level of the damage to each table:
DBCC CHECKTABLE ('<sid>.<TABLE_NAME>');
For the tables listed with "repair_rebuild" or "repair_fast" as the minimum repair level, you can repair the tables with the DBCC CHECKTABLE and REPAIR_REBUILD. However, you have to set the database to SINGLE_USER first:
ALTER DATABASE <SID> SET SINGLE_USER;
After completing, you can repair each table:
DBCC CHECKTABLE ('<sid>.<TABLE_NAME>',REPAIR_REBUILD);
After completing the repairs, perform another DBCC CHECKDB whether your database is now clear. If the result is positive, set the database back to MULTI_USER:
ALTER DATABASE <SID> SET MULTI_USER;
Before releasing the system back online to the users, take a full backup for safety
The minimum repair level is "repair_allow_data_loss". What to do?
If the table was marked as the minimum repair level as "repair_allow_data_loss" it means that this corruption cannot be fixed. In some cases (most of them), the database catalog is still consistent and the affected objects are application tables. Depending on which data was affected, maybe it can be (or in most of the cases, cannot be) reconstructed from redundant application data. You can identify the application areas that are responsible for those tables and try to approach them with the list of the affected tables and see if they can recover. It's explained on SAP Note 1597910.
It's not guaranteed that it's possible to recreate a database table with the complete data it had before. But you can give a try.
For all instances, in case you have a database object corruption with "repair_allow_data_loss", the best solution is to rely on your backups. Restore your database into another server. Start with the latest good full backup and start restoring the transaction log backups until you recover your system completely.
If a restore is not possible due to retention policy of your backups or your backups are also corrupted/not consistent, then the best approach is to contact Microsoft directly or a third party. SAP urges to customers to maintain a minimal (and efficient) backup strategy and even provide functions which assist the administrators in the areas of backup and database consistency checks. Disaster recovery solutions and efficient backup strategies have proved to be more efficient than to try to remedy those situations.
To understand the SAP Goals and Policy with the handling of the database corruptions on SQL Server, read carefully and completely the SAP Note 1597910.
Action required! Investigate and eliminate the source of corruptions!
All work done so far to fix/restore your database may be jeopardized if you simply ignore the fact that something went (and probably still) wrong in your database server. Investigate and eliminate the source of the corruption is as important as fixing/restoring the database. Even that the corruption can be caused by software or a bug, you can almost always find the root cause of the corruption in the hardware or driver problems.
In order to prevent database corruptions, what can I do?
You can't prevent database corruptions. It's something that is beyond your control. It can happens even if follow all the best practices. The difference is the impact / downtime and the risk that you lose your data is even lesser or 0 if you follow all the recommendations from SAP and Microsoft.
You can work proactively to:
- Keep your software updated: regularly apply patches to your Windows and SQL Server;
- Keep your drivers tested and updated.
An efficient backup strategy - database corruptions do not spread through transaction logs (unless the storage where you keep your backups is also damaged). Read the SAP Note 1297986 for suggestions to backup/restore strategies for SQL Server. Testing the consistency of the backups is also part of the backup strategy.
- Regularly perform DBCC checks to make sure your database is free of corruption.
- As of SQL Server 2012 you can benefit from AlwaysOn. As there is no shared storage between the primary and secondary replica, if the primary replica is corrupted, there is a chance that the secondary is not. Consider implementing an efficient disaster recovery strategy, it can save a lot of time and your heart may live longer.
- Do not neglect a backup strategy only because you have a high availability like Windows Failover Cluster or AlwaysOn. Backup is a must and you shouldn't even consider think about giving it up!
- DO NEVER RUN DBCC WITH REPAIR_ALLOW_DATA_LOSS! Unless it's explicitly asked to you from SAP. And I haven't asked!
Support Engineer, SAP NetWeaver DB/OS Platforms
SAP Labs Latin America.