Monday, October 17, 2016

Database Corruption/ Torn Page Repair

Sometimes bad things happen to good databases and corruption can be one of those things. This post goes through a few different methods of recovering from the corruption with minimal or no data loss when all other standard methods have been exhausted. If you find that the corruption exists in a non-clustered index then you can fix the issue by simply scripting out the create index syntax from Management Studio, dropping the index, and using the script to recreate it (you cannot just run the create script while specifying the Drop_Existing option, it doesn’t work). If the non-clustered index index happens to be a primary key then be sure to add a unique constraint on the table with the same definition to help maintain integrity, although this will not help any foreign keys referencing the table so it will take much more work to take care of those and is beyond the scope of this post

Basic Information

A torn page error is a severity 24 error that will display similar to the following:
SQL Server Alert System: 'Severity 24 - Fatal Error: Hardware Error' occurred on
DESCRIPTION: SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0xaaaaaaaa; actual signature: 0x55555556). It occurred during a read of page (1:30857339) in database ID 8 at offset 0x00003adb0f6000 in file 'F:\MSSQL\Data\DBfile_data.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.


The best way to recover from data corruption is to restore the database from a known good backup and then restore any subsequent differential and transaction log backups. However, this method is for situations in which that is not an option for one reason or another. Also, this process will take the entire database off line so be sure to involve all stakeholders and take appropriate measures to accommodate. This process will remove data and should only be performed when all other courses of action have been considered.

For Page-Level Restore Instead of Repair

Note: For an article on how to do a repair with a page-level restore instead of DBCC CheckTable ('', REPAIR_ALLOW_DATA_LOSS)) see the link below on MSSQLTips to the article: "Using Page Level Restore as a Disaster Recovery Procedure in SQL Server 2005". I have actually used this method on a production machine and it worked perfectly.

Troubleshooting and repair

1) ****The very first thing you should do is start a restore on another server of the last known good database for later use and/ or possibly restoring transaction logs if available****
2) The error message contains the information that you will need to start the repair process, specifically, the database ID and the page number. You will first want to get the name of the database via the database ID and filename. Next, use the numbers in parenthesis which specify the file number and page number to determine the ObjectID, previous page number and next page number with the following command (of course, replacing with the correct name):
DBCC TraceOn (3604)
DBCC Page (, 1, 30857339)
DBCC TraceOff (3604)
Note that this is the corrupt page so you will receive an error, but you will also receive command output in which you will see a row that looks like this:
Metadata: ObjectId = 1108303108 m_prevPage = (1:30857338) m_nextPage = (1:30857340)
3) Make sure you are using the correct database and determine the table name from the ObjectID and note the previous and next pages for later use:
Select Object_Name(1108303108)
4) Get information on the primary key, clustered index (if different), computed columns, identity columns, guid columns, and triggers:
Exec sp_help
Exec sp_helptrigger
5) Next, determine the last value of the previous page and the first value of the next page in order to determine the contents of the bad page:
DBCC Page(,1,30857338,3) With TableResults
This will display the contents of the previous page in tabular format and in clustered index order so note the last value in the clustered index/ primary key, then do the same for the next page, but noting the first value instead:
DBCC Page(,1,30857340,3) With TableResults
6) Once available you should run the DBCC Page command against the restored good database to confirm your results then write a query for the values:
DBCC Page(, 1, 30857339, 3) With TableResults
7) Once you are confident that you know the values in the corrupt page and have a way to transfer them back in to the table (e.g. restored good database or another data center), you need to repair the table:
***At this point it would be a good idea to make a full backup of the database should something not go as expected with the repair operation***
Set the database in restricted mode to help mitigate the potential for another connection intercepting yours once in single user mode:
Alter Database Set restricted_user With RollBack Immediate
Set the database to single user in order to perform the repair (make sure you do this in a single batch as below):
Alter Database Set Single_User With RollBack Immediate
Get the row count before the repair:
Exec sp_spaceused
Then run the repair statement:
**Please note that this will result in loss of data and should only be used if other alternatives (restoring from backup, etc.) are not feasible due to age of backup etc. !!!**
DBCC CheckTable ('', REPAIR_ALLOW_DATA_LOSS) With All_ErrorMsgs
Once complete, get the row count after the repair to validate your findings and query for re-insertion:
Exec sp_spaceused
Then set the database back to multi user:
Alter Database Set Multi_User With RollBack Immediate
8) Re-insert the rows, disabling any triggers and accounting for identity, computed, and guid columns. 

Sunday, October 16, 2016

Process for Changing IP Address on Cassandra Nodes in Windows Environment

I needed to perform an IP address change for Cassandra nodes in a Windows environment and couldn't find anything documented so I created the below procedure for performing the task. This process was for a Cassandra v1.2 cluster. 

1. Cleanly shut-down the node (only 1 node should be brought down at a time)
a. Perform drain
nodetool -h drain
b. Disable Gossip
nodetool -h disablegossip
c. Disable Thrift
nodetool -h disablethrift
d. Disable binary (native protocol)
nodetool -h disablebinary
e. Stop Cassandra service

2. Set Cassandra service to disabled

3. Configure purge gossip state information:
Edit file adding: JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"

4. Change yaml file setting start_rpc = false and start_native_transport = false (this is only for the node involved in the IP change)

5. Change hosts files on all nodes to the new IP address for the node involved in the change

6. Perform IP change

7. Reboot machine

8. Set Cassandra service to automatic

9. Start service

10. Verify node state, ring, load, activity, etc

11. Run repair on current column families that may have missed writes to re-IP'd node

12. Confirm that data is in sync

13. Edit file removing the line to purge gossip state: JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"

14. Change yaml file for the node involved in the IP change setting start_rpc = true and start_native_transport = true

15. Enable binary
nodetool -h enablebinary

16. Enable thrift
nodetool -h enablethrift