Saturday, May 11, 2013

Active Directory Database Corruption - Investigate & Fix it

Suddenly, our script master reported that we might have a replication issue, so I started looking into it. To give a brief background of the environment: we have almost 48 Windows 2008 R2 domain controllers globally, so we needed to find out where and how replication was broken..

Now, I needed a tool that could check all domain controllers and summarize their inbound and outbound replication status, so I pulled up "REPADMIN".
I ran "repadmin /replsummary" and started counting the dots on the command screen, which represent the progress.

After a few minutes of processing, I had a summary report of the servers, and unfortunately I found that one of our DCs hadn't replicated in the last 16 hours (quite worrying, huh!!). Right next to it was the reason for the failure: "The replication operation encountered a database error". Oops, this is getting interesting now..
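
With this many DCs, the same check can also be scripted instead of read off the screen. A minimal sketch in Python, assuming the status is exported with "repadmin /showrepl * /csv" rather than /replsummary. The column names below are what that CSV header is expected to contain (the real header has more columns) and should be verified against your own output; the sample rows and DC names are made up for illustration.

```python
import csv
import io

# Assumed column names from the header row of `repadmin /showrepl * /csv`.
# Verify them against real output before relying on this.
DEST_COL = "Destination DSA"
SRC_COL = "Source DSA"
FAIL_COL = "Number of Failures"
STATUS_COL = "Last Failure Status"


def failing_links(csv_text):
    """Return (destination, source, failures, status) for links with failures."""
    problems = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        failures = int(row.get(FAIL_COL) or 0)
        if failures > 0:
            problems.append((row[DEST_COL], row[SRC_COL], failures, row[STATUS_COL]))
    return problems


# Made-up sample data, not real repadmin output.
SAMPLE = """Destination DSA,Source DSA,Number of Failures,Last Failure Status
DC01,DC02,0,0
TEST,DC02,37,8451
"""

if __name__ == "__main__":
    for dest, src, count, status in failing_links(SAMPLE):
        print(f"{dest} <- {src}: {count} failures, last status {status}")
```

A last failure status of 8451 is the "database error" case discussed in this post.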

So, I logged in to the domain controller reporting the database issue to investigate further and fix it. The Directory Service event log showed database index corruption errors.. hmm, interesting..

Log Name:      Directory Service
Source:        NTDS ISAM
Date:          10.5.2013 10:03:21
Event ID:      467
Task Category: Database Corruption
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Test.domain.local
Description:
NTDS (492) NTDSA: Database C:\Windows\NTDS\ntds.dit: Index DRA_USN_index of table datatable is corrupted (0).

Corrupt database? This will definitely make most AD administrators' hearts skip a beat.. :(

So we ran a little PowerShell script to quickly check all domain controllers for Event ID 467 and make sure the corruption was not spreading to other servers. Thankfully, no other DC was experiencing the corruption..
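
The PowerShell script itself is not shown here. As an illustration of the same idea (not the script that was actually used), here is a rough Python sketch that asks each DC for its newest Event ID 467 through the built-in wevtutil tool. The function names and any host names you pass in are hypothetical, and the query only runs on Windows with rights to read the remote logs.

```python
import subprocess


def build_query(dc, event_id=467, log="Directory Service"):
    """Build a wevtutil command that asks one DC for its newest matching event."""
    xpath = f"*[System[(EventID={event_id})]]"
    return [
        "wevtutil", "qe", log,
        f"/q:{xpath}",   # XPath filter on the event ID
        f"/r:{dc}",      # remote computer to query
        "/c:1",          # one event is enough to flag the DC
        "/rd:true",      # newest events first
        "/f:text",
    ]


def dcs_with_event(dcs, event_id=467):
    """Return the DCs whose query printed at least one event (Windows only)."""
    hits = []
    for dc in dcs:
        result = subprocess.run(build_query(dc, event_id),
                                capture_output=True, text=True)
        if result.returncode == 0 and result.stdout.strip():
            hits.append(dc)
    return hits
```

A DC that cannot be reached returns a non-zero exit code and is silently skipped in this sketch, so unreachable servers should be checked separately.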

Generally, corruption can have numerous causes, but I had a few in mind that needed checking there and then...

  • Hardware
  • Outdated Drivers/firmware especially disk controller & controller cache.
  • Sudden power loss
  • Lingering objects

Time to fix it then.. Most of the time, domain administrators prefer to go ahead and rebuild the domain controller and sync everything back, but the real concern is how many changes this box holds and what the impact would be if we went ahead with a demote and re-promote of the server. So in our case we decided to go a bit further and look for clues to fix the issue instead of going for a demotion....

So, the question was how to find more details about the error.. and, like always, the answer was to enable more logging. To increase NTDS diagnostic logging, change the following REG_DWORD values in the registry of the destination domain controller under the following registry key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics

Set the following values to 5:
5 Replication Events
9 Internal Processing

Be careful while editing the registry. Once diagnostic logging is enabled, it will start writing a hell of a lot of information to the event log, so if you want to keep the old information, save it before you enable diagnostic logging.
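
One way to avoid typos in regedit is to generate a .reg file and import it. A minimal sketch in Python that only renders the file text for the two values named above; the function name is made up for this example, and it assumes the usual 0 to 5 range for NTDS diagnostic levels, with 0 switching the extra logging back off.

```python
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics"
VALUES = ("5 Replication Events", "9 Internal Processing")


def diagnostics_reg(level):
    """Render .reg file text that sets both logging values to `level`."""
    if not 0 <= level <= 5:
        raise ValueError("NTDS diagnostic levels run from 0 to 5")
    lines = ["Windows Registry Editor Version 5.00", "", f"[{KEY}]"]
    # REG_DWORD values are written as eight hex digits in .reg syntax.
    lines += [f'"{name}"=dword:{level:08x}' for name in VALUES]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(diagnostics_reg(5))  # turn the extra logging on
    print(diagnostics_reg(0))  # turn it back off afterwards
```

Saving the level 0 version next to the level 5 one makes it easy to turn the noise off again once the investigation is done.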

Review the event logs for the new events generated by the increased logging, looking for error values that give a more definitive view of the original 8451 error. For example, an Internal Processing event ID 1173 with an error value of -1526 would indicate corruption in the long-value tree.

Based on the additional information from the increased logging, consult the table below for a potential resolution.

Error (decimal) | Error (hex) | Symbolic name | Error message | Potential resolution
-1018 | 0xfffffc06 | JET_errReadVerifyFailure | Checksum error on a database page | Hardware + firmware + driver check. Restore from backup. Demote/promote
-1047 | 0xfffffbe9 | JET_errInvalidBufferSize | Data buffer doesn't match column size | See KB 832851: Inbound Replication Fails on Domain Controllers with Event ID: 1699, Error 8451 or jet error -1601
-1075 | 0xfffffbcd | JET_errOutOfLongValueIDs | Long-value ID counter has reached maximum value (perform offline defrag to reclaim free/unused LongValueIDs) | Offline Defrag
-1206 | 0xfffffb4a | JET_errDatabaseCorrupted | Non database file or corrupted db | Hardware + firmware + driver check. ESENTUTIL /K + NTDSUTIL FILE INTEGRITY + NTDSUTIL Semantic Database Analysis + Offline Defrag. Otherwise restore from backup or demote/promote
-1414 | 0xfffffa7a | JET_errSecondaryIndexCorrupted | Secondary index is corrupt. The database must be defragmented | Offline Defrag
-1526 | 0xfffffa0a | JET_errLVCorrupted | Corruption encountered in long-value tree | Hardware + firmware + driver check. ESENTUTIL /K + NTDSUTIL FILE INTEGRITY + NTDSUTIL Semantic Database Analysis + Offline Defrag. Otherwise restore from backup or demote/promote
-1601 | 0xfffff9bf | JET_errRecordNotFound | The key was not found | Hardware + firmware + driver check. ESENTUTIL /K + NTDSUTIL FILE INTEGRITY + NTDSUTIL Semantic Database Analysis + Offline Defrag. Otherwise restore from backup or demote/promote
-1603 | 0xfffff9bd | JET_errNoCurrentRecord | Currency not on a record | Hardware + firmware + driver check. ESENTUTIL /K + NTDSUTIL FILE INTEGRITY + NTDSUTIL Semantic Database Analysis + Offline Defrag. Otherwise restore from backup or demote/promote
8451 | 0x2103 | ERROR_DS_DRA_DB_ERROR | The replication operation encountered a database error | Hardware + firmware + driver check. ESENTUTIL /K + NTDSUTIL FILE INTEGRITY + NTDSUTIL Semantic Database Analysis + Offline Defrag. Otherwise restore from backup or demote/promote

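
Event logs and tools do not always show these codes in the same form; the decimal and hex columns of the table are just the same 32-bit value written two ways. A small Python sketch of the conversion, using only codes and names that appear in the table (the helper names are made up for this example):

```python
# Symbolic names copied from the table above.
JET_ERRORS = {
    -1018: "JET_errReadVerifyFailure",
    -1047: "JET_errInvalidBufferSize",
    -1075: "JET_errOutOfLongValueIDs",
    -1206: "JET_errDatabaseCorrupted",
    -1414: "JET_errSecondaryIndexCorrupted",
    -1526: "JET_errLVCorrupted",
    -1601: "JET_errRecordNotFound",
    -1603: "JET_errNoCurrentRecord",
}


def to_hex32(code):
    """Format a signed 32-bit error code as unsigned hex (two's complement)."""
    return f"0x{code & 0xFFFFFFFF:x}"


def from_hex32(text):
    """Turn a 32-bit hex string back into the signed decimal code."""
    value = int(text, 16)
    return value - 0x100000000 if value >= 0x80000000 else value


if __name__ == "__main__":
    code = from_hex32("0xfffffa0a")
    print(code, to_hex32(code), JET_ERRORS.get(code, "not in the table"))
```

Positive Win32 codes such as 8451 pass through unchanged, so the same helpers work for both kinds of rows.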
In our event viewer we found error ID 1404, which is quite close to the 1414 in the table above, so we decided to go ahead with:

NTDSUTIL -> Semantic database analysis
+
NTDSUTIL -> Offline Defrag

The beauty of a Windows 2008 R2 domain controller is that you can simply stop the NTDS service and perform the defrag, unlike in Windows Server 2003 and earlier, where you need to boot the system into "Directory Services Restore Mode" to do anything with the DB..

I know some of you know the commands by heart, but I always prefer to open the article/steps just to be sure I don't make any mistakes..

Offline Defrag article (http://support.microsoft.com/kb/232122)
Semantic database analysis (http://support.microsoft.com/kb/315136)
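
For reference, the overall sequence on a 2008 R2 DC looks roughly like the dry run below. This is a sketch under assumptions, not the exact procedure from the articles above: the Python function names and the compaction folder are hypothetical, and it only prints the commands instead of running them. It also leaves out the manual step after compaction, where the new ntds.dit is copied over the old one and the old log files are removed as ntdsutil instructs.

```python
def repair_plan(compact_dir):
    """Return the commands, in order, for a semantic check plus offline defrag."""
    return [
        # Stop AD DS; /y also agrees to stopping the dependent services.
        ["net", "stop", "ntds", "/y"],
        ["ntdsutil", "activate instance ntds",
         "semantic database analysis", "go fixup", "quit", "quit"],
        # Writes a compacted copy of ntds.dit into compact_dir.
        ["ntdsutil", "activate instance ntds",
         "files", f"compact to {compact_dir}", "quit", "quit"],
        # Manual step not shown: replace the old ntds.dit with the compacted one.
        ["net", "start", "ntds"],
    ]


def as_command_line(args):
    """Quote arguments containing spaces so the line can be pasted into cmd."""
    return " ".join(f'"{a}"' if " " in a else a for a in args)


if __name__ == "__main__":
    # C:\temp is only an example target folder without spaces in its path.
    for step in repair_plan(r"C:\temp"):
        print(as_command_line(step))
```

Printing first and pasting the lines one by one keeps a human in the loop, which is probably what you want when the database is already suspect.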

Hmm, surprisingly, everything went smoothly (you usually don't see that, especially on Friday evenings.. lol). Anyway, the good news was that we were ready to go ahead and pull the trigger, and that's what we did..

To my surprise, the errors went away and I could see the server replicating again. Just to make sure everything was back up and running, we brought back our friend REPADMIN ;-) .. We ran "repadmin /replsummary" and it showed successful delta replication :)

Wooohhhooo... I can go home now and enjoy my weekend :-)

But if, in your case, the above steps don't fix the issue, you can always demote and promote the server (worst case, an AD restore)...
