
The DBA's Worst Nightmares

[The DBA's Worst Nightmares] [Rick Kutschera] Welcome everyone. There are quite a lot of you. I hope nobody falls asleep. I know the first session after lunch break is always a bad one. This session is called The DBA's Worst Nightmares. I hope you don't expect spiders or visiting mothers-in-law or something like that. It's not that session. My name is Rick Kutschera. I've been a SQL DBA for quite a long time now. I'll skip the slide about myself—there are just 2 things that I want you to know. One is my native tongue is not English, so I might be a little hard to understand for some of you. Sorry. I can't help it. If I should rephrase something, please let me know. I'll do my best to make myself understood. The second thing is there are people around here that think that I'm crazy. I will not argue with that, so you'll see me do some stunts here that you should not normally do with your environment if you have any other chance. This is basically what it's all about here. What I'll do here is describe some problem scenarios, scenarios that I've seen in the past, scenarios that are somewhat academic sometimes but should show you a way, demonstrate how to troubleshoot it if you run into a problem like this, and show a way to fix it with as little impact as possible on both the data quality and the availability of your system. I have to add this disclaimer. I don't want to read it. The key message here is what I'm trying to do is show you some routes to handle problems. This is not the supported scenario for everything. This is not the easiest scenario for everything. It's a way of giving you another tool if you need it. Sometimes it might be better to do it differently, but when you're with your back to the wall, then it's good to have another way to work with. If you have any questions, please, at any time just interrupt me and ask. There are microphones there. You can also shout if you'd like. I'll leave it to you. 
Please mute your cell phones, and if you should fall asleep, please don't snore. Most of the session will be done in demos. The way I do this is I show you a scenario, tell you a little bit about what it's like, and then I switch to my demo machine and show you what it's actually like, which is one of the parts where some people think that I'm crazy, because unlike some other people, I will not show you prepared scripts. I will do that on the fly. Very well. The setup I'm using is SQL Server 2012 SP1 with cumulative updates that at least were somewhat current when I built the presentation. I used the same database for all the scenarios. It's a pretty simple one. It has 3 tables in it and 1 stored procedure or something like that. I use a single data file, which doesn't mean anything in particular. You can have databases with multiple files, of course, and much more complex scenarios, but just for the ease of it, I have a single transaction log, and all my databases are running full recovery model. They are all running SQL 2012 compatibility, but in most of the scenarios it doesn't really matter. Unless I state something otherwise, everything I show you will be available in pretty much any edition of SQL Server except maybe for Compact. I don't know too much about that one, so I can't tell. That was about it for my PowerPoint slides. Let's get the fun started. The first scenario is a pretty simple one. Maybe some of you have seen that before. Your database goes offline, and when you take a deeper look you find that a data file is missing. Now let's hope that this works. Okay, pretty much what you can see here in the beginning is I have this database here— let me blow that up—which you can see is in recovery pending right now. When you look at the SQL error log, which is the preferred way of doing this in a scenario like that, you pretty easily find out that we're missing an MDF file, and let's scroll over here a little bit. The system can't find the file specified. 
Well, that's a pretty simple problem to have. When you look into the directories, you'll see that there is no Zoo1.mdf. Look at the other directory. The transaction log file is still there. Okay, so that scenario is pretty simple. It's a standard recovery scenario. It's one of those that you can actually handle without running into data loss or consistency problems or anything else. The solution is pretty simple to that. First of all, you start doing a log backup to get the tail of the log. The parameters I'm using I'll go over quickly afterwards. Okay, and the STATS option helps to see progress. What I'm using here is INIT and FORMAT. It's mostly because I might have leftover files on that machine, so I'm making sure that I clear everything out that's in there. COMPRESSION I use because some of the scenario demos that I have have a lot of data in there, and my machine is old and not the best one, so I prefer not to generate too much data. NO_TRUNCATE is the important one here. If your database is offline because of a missing data file, it has not recovered, so you can't truncate the log with a backup. You need to specify NO_TRUNCATE. Just backup the log without changing anything in it. Generally, when you do any kind of scenario like this you should always keep it on NO_TRUNCATE unless you have a good reason to truncate the log with your manual backups. If you're using log shipping or any technology like that you will otherwise end up corrupting your log shipping as well. Overall that's a pretty easy one, and of course, the important part here is—where is it? Here it is. We do have a full backup and a log backup for the machine. I can just do a restore of the whole thing. That's where we backup the database. We need 2 more. Not lob—log and log. Not much to it. 
One thing you will notice here— I just started because this takes a little while— one thing you'll notice here is that I do specify NORECOVERY, and if I scroll that down, we will see that I actually also specify NORECOVERY here in the last restore that I do. Why do I do that? The key reason is if you recover your database, there is no way back. If for some reason I find out here that I missed something and I need to run another tail backup or whatever because something has changed, and I ran recovery with my last restore, then I'm out of luck and have to start over again. On the other hand, it's pretty easy to do a RESTORE DATABASE FunDB_Zoo1 WITH RECOVERY. Firing that command is something that doesn't take too much time, and you can at all times do it at the end when you are assured everything else is okay and you're good to go. Hitting the refresh button, if I can hit it, and all the databases are back online. Not a really big deal but a good thing to get warmed up. Let's go through this. Everything I do is in the PowerPoint slides as well, so you'll get the things afterwards. The basic thing is troubleshooting. If you have a database that is not recovering for whatever reason, look into the SQL error log. It will tell you exactly what the problem is, more or less at least, and you can go from there. In this scenario the important part to see is we still do have a transaction log, so we're actually good to go and start from there. If you don't have the transaction log, well, we'll come to that later. The thing that I do to fix it is take the tail log backup with the NO_TRUNCATE option and then run a full restore of the database. Okay, part 2 of the same story. If you followed my little walkthrough of the directory structure before you can guess what's coming next. Same scenario. The database is offline. But in this case we're not missing a data file but actually the transaction log. Let me get back to my SQL box. 
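The tail-log backup and full restore sequence described above can be sketched like this. The database name comes from the demo; the backup file names and paths are assumptions for illustration:

```sql
-- Tail-log backup first. NO_TRUNCATE is required because the data
-- file is gone and the database has not recovered.
BACKUP LOG FunDB_Zoo1
    TO DISK = N'C:\MSSQL\Backup\Zoo1Tail.bak'
    WITH INIT, FORMAT, COMPRESSION, NO_TRUNCATE, STATS = 5;

-- Restore the full backup and every log backup, keeping NORECOVERY
-- throughout so the restore can still be extended if something is missing.
RESTORE DATABASE FunDB_Zoo1
    FROM DISK = N'C:\MSSQL\Backup\Zoo1Full.bak' WITH NORECOVERY, REPLACE;
RESTORE LOG FunDB_Zoo1
    FROM DISK = N'C:\MSSQL\Backup\Zoo1Log.bak'  WITH NORECOVERY;
RESTORE LOG FunDB_Zoo1
    FROM DISK = N'C:\MSSQL\Backup\Zoo1Tail.bak' WITH NORECOVERY;

-- Only when you are sure nothing else needs to be restored:
RESTORE DATABASE FunDB_Zoo1 WITH RECOVERY;
```

Keeping every step on NORECOVERY and issuing the final RECOVERY as a separate statement is the point of the demo: recovery is a one-way door, so it should be the very last thing you fire.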
When you look into the error log here, here it is. In this case, we are missing the log, and when you look into the directory, the data file is still there, but the transaction log actually isn't. The bad thing in this scenario is that you most likely lose transactional consistency. Now, you can be in the lucky situation that nothing is happening on the database. You had a checkpoint before the log crashed, or the database was shut down or whatever, and you're in the lucky situation that you actually don't lose data or don't lose consistency. In this case, which is a very rare one, it's easy to work against it. You can detach the database, reattach it, and it will regenerate the log if it finds out that everything is consistent. I'm not assuming that, so we have to go for an inconsistent database. Again, this means inconsistency. Once you did this, once you recovered from that, you have to make sure that your data is actually valid, and there is nothing that can help you around that. You need your business logic and whatever to look into the data itself. How do you regenerate the log? The thing is pretty easy in SQL 2012. It was a lot harder before. You can do a DBCC CheckDB, and the key thing is you have to use REPAIR_ALLOW_DATA_LOSS. The problem is you can't do this while the database is not in single user mode—CheckDB with this option requires single user mode—so what we need to do first is ALTER DATABASE FunDB_Zoo2 SET SINGLE_USER, and maybe, you guessed it—whoops, I should learn to type at some point— this will not work. A database that has not recovered can't be set into single user like that. It's a bad thing. Fortunately not too bad. The easy workaround, and again, this is easier than it was in the previous versions of SQL, is to set the database to emergency mode. Those of you who have been with SQL a long time know that in the past you had to go into the system catalogs, update the system catalogs, stuff like that. 
Now you do an ALTER DATABASE SET EMERGENCY, which will bring the database into an emergency state where you can actually access it. And from that state, interestingly enough, you can set it to single user, and that one works. And with single user mode now, I can run my CheckDB on it, and it will regenerate the log files, and now all I have to do is take the database and set it to multi-user, hit it, done, and we're back online, and this is how we recover from a transaction log that was gone. Okay, let's break this down quickly. Again, it's the same way of troubleshooting. You go into the SQL error log, and it pretty much tells you what the problem is. You check that the MDF file is still there. If you are in the unlucky situation that both the MDF and the LDF are gone, well, tough luck. You could call that an even worse nightmare than the 2 I just showed you. I wouldn't, because this is a standard restore scenario then. Nothing fancy on that. I said it before. You lose transactional consistency with that, so make sure you check your data before you do anything with it. Okay, you need to set the database into emergency, otherwise you can't do anything on a database that has recovery pending. Then you can set it into single user, run CheckDB, set it back to multi-user, and you're pretty much good. Okay, we set the stage with the easy ones. Let's make this a little more fun. I had this scenario in a real-world environment, and it took a while to figure it out at first. You have a user reporting in that he's querying a table that everyone else is querying as well, and he gets an error message. Everyone else doesn't. Based on that, it takes a while until you figure out that the query only causes the problem when you query the name field of the table; everything else is not as much of a problem. Without further ado, let's have a look at this. I'm going to switch this. Use this one. Let me show you what's happening when I do something like that. 
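The emergency-mode sequence for rebuilding a lost transaction log, as walked through above, looks roughly like this (database name from the demo):

```sql
-- The database is in RECOVERY_PENDING, so SINGLE_USER directly fails.
-- EMERGENCY makes the database accessible again.
ALTER DATABASE FunDB_Zoo2 SET EMERGENCY;
ALTER DATABASE FunDB_Zoo2 SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- REPAIR_ALLOW_DATA_LOSS rebuilds the missing log file.
-- Transactional consistency may be lost; validate the data afterwards.
DBCC CHECKDB (FunDB_Zoo2, REPAIR_ALLOW_DATA_LOSS);

-- Back to normal operation.
ALTER DATABASE FunDB_Zoo2 SET MULTI_USER;
```

As the speaker stresses, after this repair the engine considers the database healthy, but only your business logic can confirm the data is actually valid.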
It takes a while because there is a lot of data in it. You see the 500,000 rows. The query succeeded. Now when I do this, I get an error message and a very interesting one. What's happening? Anyone have a good idea to start with? [inaudible audience statement] [Rick Kutschera] You said it right. It has something to do with an index. The select star queries the entire table, so it goes over the clustered index. This one obviously doesn't. The other message that you would get normally is a pretty good one, and you could see it at first glance. The only problem is by default I run all my management studios with read uncommitted isolation to not impact anything. And if you encounter a page error like this with read uncommitted isolation you get this nonsense error message that doesn't say anything. Why not do this in read committed mode and run this again? What we'll get is an actual error message, and it tells us right here that we have a page inconsistency, page 6740 in database ID 7. Now, that's the important information to start with. What do we do with that? How do you find out what's on that page? The easiest way to do that is DBCC TRACEON. For those of you who are not familiar with this trace flag, there is an undocumented command in SQL Server where you can look at the contents of a page, but this command, the DBCC PAGE command, only works if you have trace flag 3604 enabled, so if I run this without the trace flag, I don't get any output. If I run it with the trace flag, it tells me what is on that page, and there are 2 actually important things on this page. One is down here. You see the metadata object ID, which basically tells us which object is on the page. Well, you already knew that one. The other one is on this side. Let me see if I can get this right here. The index ID tells which index we are actually looking at. So what do I do with this now? Let's have a look at some DMVs. Sys.indexes, where object_id = this one, and index_id = 2. 
As we already expected, this is a non-clustered index. This one has caused the problem. The problematic thing here is normally I would say just rebuild this index: ALTER INDEX NcLupus ON DeadLupus REBUILD. Unfortunately that doesn't work. Once you have a corrupted page in an index you can't reorganize it, and you can't rebuild it. It's just history. What is the easiest way to go around that? Well, I'm not normally the guy who likes to use an interface very much, but for that one it's pretty easy to go out, take the table, get the index, it's this one, script it, drop and create, fire away, and when I do that again now it will suddenly start to work again. What you have to be aware of is that when you do this drop and create like this you will get some locks unless you do this with ONLINE = ON, which is an Enterprise feature. And of course, while the recreate of the index is running, all queries that access this index will be a little slow, with full table scans, and if this is a big table and the index rebuild takes a while, you might even end up filling up your buffer cache with full table scans of this table, which might have further impacts in the end on your performance. But the good part is in Enterprise Edition you can do this online, so it means you don't have any locks on it, and even in the Standard Edition you only get a schema modification lock for a while, but it should not impact your whole system too much. Okay, I see I'm far too quick for this. The troubleshooting in this case was relatively easy once we figured it out. Be aware that read uncommitted isolation will not give you the right error message. Once you have the error message that includes the page ID, use DBCC page to look at what's actually on that page. The important part is find the right object, find the right index in this case. 
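A sketch of the troubleshooting steps just described. The database ID, page ID, and object names come from the demo; the index key column is an assumption, since the transcript only mentions the name field:

```sql
-- DBCC PAGE is undocumented and only produces output with
-- trace flag 3604 enabled.
DBCC TRACEON (3604);
DBCC PAGE (7, 1, 6740, 3);   -- database id, file id, page id, print option

-- The page dump shows the metadata object ID and the index ID.
-- Map them to a name via sys.indexes (not the old sys.sysindexes view).
SELECT name, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'dbo.DeadLupus')
  AND index_id  = 2;

-- REBUILD and DROP_EXISTING both fail on a corrupt page,
-- so the index really has to be dropped and created again.
DROP INDEX NcLupus ON dbo.DeadLupus;
CREATE NONCLUSTERED INDEX NcLupus
    ON dbo.DeadLupus (name);  -- key column assumed; ONLINE = ON needs Enterprise
```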
Of course, if it is not an index page— but, I don't know, a GAM page or something else— then the scenario will be different, and you will need to have different solutions coming up for the problem. Sys.indexes is the view of choice here. Be careful. There is another view called sys.sysindexes that doesn't give you the information you need, so the DMV jungle is getting pretty big these days. And don't forget, DBCC PAGE needs the trace flag enabled. And the fix, as I showed you, is drop and recreate. I also tried doing this with CREATE INDEX with DROP_EXISTING; for reasons that go past my understanding, this doesn't work. You really have to drop the index before you can create it again. I don't know why. I reported it to Microsoft. I have not gotten feedback so far on this, and I guess I won't get it anytime soon, because the problem is not that big after all. Okay, let's go to the next one. The scenario looks pretty similar to the one before, but you might find out that it's actually not so much. In this case, I'll take another table. The error message is the same. [cell phone ringing] I will pretend that I didn't hear that and did not talk about switching off the phones before. Okay. Whoops, I should do it right. Then it might work actually. It's the same thing. Of course, it's another page in another database. The problematic thing here is if I copy that over here, and it's 5, 1, 2480, the problem is when you look at this—there we go. You of course have, again, your object ID over here, and surprisingly, those of you who know SQL Server a little will already know what this one means, but just to show you what I'm talking about: index ID 1. There is your problem. Why is this a bigger problem than the one before? [inaudible audience statement] [Rick Kutschera] The key point is in a nonclustered index you just have the structure, so the ordering of the data and the pointer to the actual data. The clustered index holds the data, so you can't just drop it and recreate it. 
When you do drop the clustered index the table will be converted into a heap, which means it needs to access those exact same pages that are corrupt in this system. There is no way to rebuild it in any way. The way I caused the corruption in this database means that the usual things to do on this will also not work. The way I would go on this first is to run DBCC CheckDB on it, run a repair and rebuild, but it doesn't work in that case. The data is actually corrupt on the page, and it's more than a bit flip, so you can't rebuild it with the CRC checksum. So what do you do now? You come pretty much back to the scenario here, that there is no way to fix it directly. The only way you have to actually fix this is to restore the database. Now I show you what pretty much everyone from my team that I give this test to has done so far, and then you will see why I call this a nightmare. Okay, let's see. We started with something that we have already seen. I should switch out of the database. It might help. Do a backup log. MSSQL\Backup\BirdTail.bak. Format. This database is actually online right now, so NO_TRUNCATE is something you wouldn't need here. But as I'll show you later, there is a good reason I do it that way, and other applications could still do something here. It's good to have a consistent log chain for whatever you do afterwards. Now let's do the restore of this. I know I said something about no prepared script, but I'm getting a little lazy here. And I did use the same database more or less. It's pretty much the same restore that I need to do with one exception for some particular technical reasons. In this case, I have 2 transaction logs to run. Okay, that's pretty simple. I took a log backup. If I run this now, the database will of course go offline, which is an impact on your production system if you do that, but other than that, in principle you would be good. Let's just take that off. Oh, yeah, I should not copy paste. It was a bad idea. 
If I typed it, it would be faster. Let's do this again. Do it right, and it does work. Okay, and of course, like always, do a recovery on the end. Now when I go back to my Birds table and I run the query it works perfectly fine, so what's the problem? I'll show you what the problem is. That one actually did work before. I ran into this problem in a practical scenario too, and then I learned to enjoy Enterprise Edition features. What happened here is I had a corrupted clustered index on the table. I did what every normal DBA would do in this situation, run a restore, and I ended up having a corrupted backup. Tough luck. In this scenario, I won't go into the details of this page. What I did here is I forced a corrupt page on the clustered index of another table. You're in the same situation now just with a different set of data. The only thing you can do once you're here is start crying or hope that you have an older backup that is not corrupted, which brings me to another point. Why do I restore over the databases and not beside and just copy the data over? Well, there are 2 reasons for that. One is those databases are simple and small, so you can copy the one table over if you need to. If your database is more complex and the problem was more complex to start with it might not be that easy, and I don't know about you, but I normally work with databases that are 7-8 terabytes in size, and I don't normally have another 7-8 terabytes to spare on any of my servers. In some scenarios, you are forced to run over the old database. With something like this, you better not. The problem is that you don't know it beforehand. The only way to do this if you don't have Enterprise Edition is to restore it in another place and copy the data over manually. What's the right way to do it? And as you can see here, I actually did set up another one of those. I must have a reason. Let me switch over to that one just to show you that this one is actually the same thing. 
You get the same logical consistency here, and I should have shown you that before, but it would have taken the fun away. This one actually works on this database. What do I do now? The database is the same, so I don't need to go over it and find the page again. I do the same thing I did before. Oh, thank you. Take a tail log backup, and the difference now is I'm not taking the database offline again. What I will do is start with the same command. Actually, I should switch this to master. The one key thing here is that I need to run this again after all. Get more information out of here. The database page is the same, of course. It's 2480. What I will do here is run a page-level restore. Anyone in the audience done that before? Anyone knew that the feature existed? Okay. [inaudible audience question] [Rick Kutschera] Okay, the question is if this was a mirrored solution, Enterprise mirror or even Standard Edition mirror, would the problem still exist? I tested it, and the way I caused this problem yes, it will still exist for the main reason that the mirror doesn't pick it up in time. What I did here is when I created this on a mirror I forced inconsistency on the page, and then I ran a statement that changed the page, which caused the page to be replicated to the mirror, and then you're in trouble. Normally you're very right. If you have a mirror or an AlwaysOn availability group in SQL 2012 the autocorrect feature should give you a fighting chance that this will be corrected. But unfortunately, not in every scenario. Okay, so what happens when I do this page-level restore? At first nothing, or at least not much. When I run this query now just the error message gets different. It tells me that I have a restore pending here, and there is no way that I can read this table right now. The key point, though, is the database is still online, and I can still query anything else. Anything that is not touching that one page is still running. The rest is old news. 
Copy that over. FunDB_Log. Okay, I have the first log. I have the second log. The same thing as in a full restore. You have to run every single log file. The interesting difference here is, as you can see, the log processed 0 pages in both cases because it does not actually restore the entire transaction log. It looks at whether the page has been touched within this transaction log space and goes from there. And of course, I need the tail log that we created. In my case, this also has no pages in it, and then, as always, interestingly enough, you still need to run recovery, even though the database is actually online. And now when I run this one— [inaudible audience question] [Rick Kutschera] Pardon? [inaudible audience question] [Rick Kutschera] Actually, the comment was how to tell whether you're finished with the restore. If you try to run recovery without having all the transaction logs restored before, it will not allow you to run recovery. SQL should know from its own transaction log chain when I'm finished with the restore, because there is no way it can do anything else, anything more. But it's a good thing that it is that way so at least it's consistent in the way that the system behaves. And now this one works, and just to prove that I did my magic in the background this one also works. Even though I used the same backup file, the one that does have a corruption in it, I now end up with a perfectly fine database, and everything is working well. As I said, page-level restore is Enterprise Edition only, which is a pity, because this feature is really a cool thing, and it really saves you a lot of trouble, and it saves you a lot of downtime, because if you have to run a full restore in a scenario like this, it takes time, and it bears the risk that you'll do something wrong. Let me switch back. I always forget to switch to the demo slide. You have the logical consistency error. DBCC page shows it's on the clustered index. 
Index recreate is not an option with clustered indexes. We know that already. CheckDB can't rebuild it without data loss. You can of course run DBCC CheckDB with repair, typically with a lot of data loss, but that's not a good option. Yes, please. [inaudible audience question] [Rick Kutschera] The question was if you have this— [inaudible audience question] [Rick Kutschera] The question is if you know you're running into this scenario and the customer has the Standard Edition, can you go with Enterprise Edition to fix this problem? The answer to that is yes, you can, and you can even switch back to the Standard Edition afterwards. There is a possibility in SQL setup that's called instance upgrade. You can take the instance upgrade to Enterprise Edition and downgrade it again to Standard Edition. [inaudible audience question] [Rick Kutschera] No, it won't. The nice part is it doesn't touch the databases when you change the edition. The other way, of course, would be to take the files and attach them to another system. When you have all the files, you can attach them, because sp_attach_db also doesn't do edition checks, and if you don't have all files, you can still do the weird trick of creating a database that has the same file names, shutting it down, copying the files over, and letting it attach this way. There are a lot of ways you can go about this. But yeah, in this scenario, if this is a really important database and a lot of data in there that needs to be saved, maybe buying an Enterprise license for this might be worthwhile. Yeah, and you can copy the data files to another box and buy an Enterprise license for 1 CPU, because you won't need it for anything else. Okay, yeah, that's the syntax. One thing to note. You specify the page only at the RESTORE DATABASE level. Every subsequent log restore assumes that you're just restoring that page. 
If it's more than 1 page, you can specify multiple pages in this command comma separated, or you can issue multiple restore database commands with 1 page each or multiple pages each, and the restore log will still know which pages have been touched and marked for restore and will just go through those. Restore the entire log chain. You can't skip anything, even if there is nothing in the log that needs to be changed. You need to run through every single transaction log backup file, otherwise this will not work. If you lost your transaction log chain somewhere it doesn't work at all. It also doesn't work with simple recovery model, which is kind of obvious, because you don't have the transaction log chain there. [male speaker] When you were going through this, how did you identify which page— or I saw how you identified which page. You were going through the restore. How do you identify which backup you would have that is valid, because you were indicating that you knew how to corrupt the backup. [Rick Kutschera] That's a good question. The question is we know how to get to the page. How do we know that the backup isn't corrupt as well? Because we did have a corrupt backup with something on another page. There is no easy way to do that. The point, though, is when you do page-level restore the page is already damaged. Restoring it from a backup that's also damaged just doesn't do anything. It will give you the same damaged page again, and then you're screwed up and have a problem, and at some point if you don't have a valid copy of this data anywhere anymore you just can't help it. But you can try. Take the last backup. See if it's there. Take the one before if it wasn't in the last one, because the only impact is that you're losing time with every restore cycle, but the system is online. Everyone else can work, so there is no real harm done. 
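The page-level restore sequence walked through above looks roughly like this. The file:page ID (1:2480) comes from the demo; the database name and backup paths are assumptions:

```sql
-- Tail-log backup first, to keep the log chain complete.
BACKUP LOG FunDB_Birds
    TO DISK = N'C:\MSSQL\Backup\BirdTail.bak'
    WITH INIT, FORMAT, COMPRESSION, NO_TRUNCATE, STATS = 5;

-- The PAGE clause goes only on the RESTORE DATABASE statement.
-- The database stays online; only this page becomes restore pending.
RESTORE DATABASE FunDB_Birds
    PAGE = '1:2480'   -- file id : page id; comma-separate for multiple pages
    FROM DISK = N'C:\MSSQL\Backup\BirdFull.bak' WITH NORECOVERY;

-- Every log backup in the chain, in order, then the tail, then recovery.
RESTORE LOG FunDB_Birds FROM DISK = N'C:\MSSQL\Backup\BirdLog1.bak' WITH NORECOVERY;
RESTORE LOG FunDB_Birds FROM DISK = N'C:\MSSQL\Backup\BirdLog2.bak' WITH NORECOVERY;
RESTORE LOG FunDB_Birds FROM DISK = N'C:\MSSQL\Backup\BirdTail.bak' WITH NORECOVERY;
RESTORE DATABASE FunDB_Birds WITH RECOVERY;
```

Note the two points the speaker makes: the log restores process 0 pages unless the page was touched in that log interval, and recovery is still required at the end even though the database never went offline.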
If you are in the situation that you only have 3 or 4 backups in your system and all of them are already corrupted, you can't restore something that's not existing anymore. Okay, I said it again twice already. This is an Enterprise Edition feature only. Unfortunate but nice. If you do not have Enterprise Edition—yes, please. [inaudible audience question] [Rick Kutschera] The question is, in a scenario with multiple pages, will it only name the first in the error message, or will it name all of them? It will unfortunately only name the first, so you have to run this iteratively, take the first one, run through, take the next one, run through, something like that, unless you have another way of finding it. If you run DBCC CheckDB on the database it will show you all the consistency errors, and then you can do it in one line. The question is what is quicker? Because the database restore and the transaction log file restores are pretty quick if you only have 1 page. Running a query again might be quicker than running CheckDB on a 7-terabyte database. Was there another question there? [inaudible audience question] [Rick Kutschera] If you don't have Enterprise Edition and you know that you're running into this scenario, your only option is to restore a copy somewhere else and copy the data manually with some kind of an SSIS merge or whatever. That's a bad one. They have to have some reason to sell you an Enterprise license for 10 times the money or something like that. Okay, this next scenario is a little— let's call it academical. I like it because it's something that shows you a little insight in how SQL Server actually works. The scenario is, in my case it was a panicking developer that stormed into my office and said, "Hey, help, help, help!" "I forgot a where clause in the delete statement." Yeah, that was a stupid one. I actually had a second one who had a big where clause in the statement, and at the very end of the where clause was an OR 1 = 1. This is a bad one. 
This is a really bad one, because there is actually not much you can do. In many scenarios, in many ways you do this in practice: in big environments, for scenarios like this, you have log shipping with a delayed restore so you can react in time. The thing that I will show you in the demo is something that I would not do unless you really know what you're doing. It's a last resort that you can use. If I have enough time I'll show you why I'm going into this direction. Let me show you what that is. That was not that one. One of those. Okay, when I do an sp_spaceused on the DeadBunnies table— oops, what's this called? I should be more consistent with my database setup. What you will see here is this has 100,000 rows. Just as a comparison, all my tables are populated with 500,000, so we're actually missing about 400,000 here. What do I do now? The key thing here is, in the scenario I'll show you now, everything that happened after the delete statement is gone with it. You lose everything up to the point of the delete statement if you go that route, so this is definitely something where in the real world scenario you need a second copy of the database and do it right. The way I'll tackle this is the data is gone in the table, so you can't see it, but SQL Server still has it. I just need to know where, and the where in this case is the transaction log. What I'll do is use another one of my almost famous undocumented functions. Fn_dblog. Anyone know this? No? Okay. Fn_dblog is a SQL function that allows you to read the transaction log. The 2 parameters I put in here are the first LSN and the last LSN that I want to read. I don't know those in those scenarios, so I pass NULL, NULL, which means read the entire transaction log. If I run this, what you'll see is a lot of fun data just to give a quick overview of what I'm running on. The key point here is this operation thing, and what I'm looking for is actually this one. 
What I do is I filter out only those— this is normal SQL; you can handle it like any other table in this case— and this alone will not help you very much, because most likely a lot more is going on in your database. The second key point here is this one. Let me blow that up a little: the transaction ID. Even if you don't use an explicit transaction, every statement that you run has an implicit transaction below it, and this is the ID it has internally in SQL Server. SQL Server needs this in case the statement fails and needs to be rolled back. What I'll do now is just assume that maybe other people are deleting stuff in my database, but nobody is stupid enough to delete 400,000 rows at the same time. What I do here is take the transaction ID, do a COUNT(*), and group that by transaction ID. This will take a little while. What I basically do here is tell the system: show me how many lines, how many rows, each transaction has deleted. Once I have that, which as I said takes a little while, and there is a lot in there, for a good reason. I should have ordered this. Here it was. Let's run this again. I have to note I don't do anything on this database normally, so all those delete statements are just from some metadata operation, schema upgrades or whatever. What you see here, this is the one, and it says 800,000, although it just deleted 400,000, because it needs to delete the rows from the index pages as well. In this case, it's basically easy: this is the one transaction that should not have happened. What I do now is go to my fn_dblog again: where transaction ID = that one, order by current LSN descending. Current LSN is the log sequence number, which basically gives the chronological order of the transaction log. This will also take a while. This is one of the scenarios where I have about 150 MB or something like that in the transaction log, and my old notebook is not very happy about handling data like this. I'm lucky that I have a little time left.
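The grouping step just described might look like this (the transaction ID value is hypothetical):

```sql
-- How many rows did each transaction delete? The runaway delete
-- stands out with a huge count.
SELECT [Transaction ID], COUNT(*) AS delete_records
FROM fn_dblog(NULL, NULL)
WHERE Operation = N'LOP_DELETE_ROWS'
GROUP BY [Transaction ID]
ORDER BY COUNT(*) DESC;

-- List that transaction newest-first; the top row is its commit
-- record, whose LSN we will need for the restore.
SELECT [Current LSN], Operation
FROM fn_dblog(NULL, NULL)
WHERE [Transaction ID] = N'0000:000004d2'  -- hypothetical ID from above
ORDER BY [Current LSN] DESC;
```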
[inaudible audience question] [Rick Kutschera] The question is, are we able to see which user deleted the rows? Honestly, I'm not sure. Let me have a quick look. I don't think so, though. I don't think the security information is stored in here. Nope. It doesn't look like it. I have to check. If you want to know, leave me your email address. I'll find out for you. It's still executing, so we won't wait. What you see in here is every single operation that this delete statement did internally, each in one row of the transaction log. The important one that I need is just this one. This is LOP_COMMIT_XACT, which is the commit record of this implicit transaction. The thing I need is this ID. This is the log sequence number, which uniquely identifies this log record. So what do I do now? Something that we have done twice already today. But I'll do it again: back up the tail of the log. At this point I definitely want compression, otherwise this will take forever. NO_TRUNCATE, STATS = 5. Now I have that backed up. Let me copy this into a new window and switch to master. I might have a problem with this database now, because I had it open in the explorer. Okay, what I do here now is a restore of the database. Then of course we have the log. Okay, so that's the easy part. Let's see if this works, and it works. Now comes the fun part. Now I could restore the tail of the log, but this would not help me at all, because it would run the delete statement again, and I'd be doomed. Here comes another one of those features that I like so much. RESTORE actually has more options, and one of them is STOPBEFOREMARK. I can tell the system I want to run the restore of this file, but I don't want to restore it to the end. I only want to restore it to a certain point, and that point can be an LSN.
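The backup and restore steps just performed might look like this (database name and paths are hypothetical):

```sql
-- Tail-log backup: capture what is still in the live log without
-- truncating it.
BACKUP LOG MyDb TO DISK = N'C:\Backups\MyDb_tail.trn'
    WITH NO_TRUNCATE, COMPRESSION, STATS = 5;

-- Restore the last full backup without recovering, so that the
-- log restores can follow.
USE master;
RESTORE DATABASE MyDb
    FROM DISK = N'C:\Backups\MyDb_full.bak'
    WITH REPLACE, NORECOVERY;
```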
It's unfortunately not that easy, because for whatever stupid reason— sorry, Microsoft guys— while fn_dblog and DBCC LOG give me the internal form of the log sequence number, the one you see up here in green, the restore statement needs this in a decimal format. Okay, so I'll show you how to go about this. The first thing is, these are hex numbers up here, so you need to convert them into decimals. I take a calculator. I don't know if anyone speaks hexadecimal, but I don't. This is the first one. Now we have 23B0, which is 9136. And then we have 00CD, which translates to 205. Okay, now you might think you can just do something like that, but unfortunately, it's not that easy. Don't ask me exactly why it is the way it is. I'll just tell you how it's done. The log sequence number in its decimal format is a 25-digit number. The first part is a 10-digit part, the second part is a 10-digit part, and the third part is a 5-digit part. For the first part it doesn't matter, because leading zeros are ignored. But those other 2 need to be exactly in that format and exactly with that padding, otherwise it doesn't work. Let's see if I did it right. I obviously did. Run recovery on it, and where is it? And when I have the right database, sp_spaceused on this table. You see I have my 500,001 rows. Why is it 500,001? Maybe I've told you about this before. When I first created this scenario (this is a test scenario for some of my junior DBAs to see how good they really are), the way the transaction is set up is I add one row and then delete 400,000 just below that, so I can ensure that you really got the right point, because if you miss the last part of the backup and do a recovery, you'll only have 500,000 in there. Okay, one little side note on the LSNs. What you can do here, and I'll just show you so you have a little idea of it, is a RESTORE HEADERONLY from a backup file. What you have in here is the first LSN and the last LSN.
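The conversion and padding can be done in T-SQL instead of a calculator. A sketch, using a hypothetical LSN of 00000014:000023b0:00cd (the second and third parts are the values from the demo; names and paths are assumptions):

```sql
DECLARE @p1 BIGINT = CONVERT(BIGINT, 0x00000014);  -- 20 (hypothetical)
DECLARE @p2 BIGINT = CONVERT(BIGINT, 0x000023b0);  -- 9136
DECLARE @p3 BIGINT = CONVERT(BIGINT, 0x00cd);      -- 205

-- Part 1 as-is, part 2 zero-padded to 10 digits, part 3 to 5 digits.
SELECT CAST(@p1 AS VARCHAR(10))
     + RIGHT(REPLICATE('0', 10) + CAST(@p2 AS VARCHAR(10)), 10)
     + RIGHT(REPLICATE('0', 5)  + CAST(@p3 AS VARCHAR(5)),  5)
     AS decimal_lsn;  -- 20000000913600205

-- Stop the log restore just before the commit of the bad delete.
RESTORE LOG MyDb FROM DISK = N'C:\Backups\MyDb_tail.trn'
    WITH STOPBEFOREMARK = 'lsn:20000000913600205', RECOVERY;

-- FirstLSN / LastLSN in a backup header are reported in this same
-- decimal format, which helps verify the padding.
RESTORE HEADERONLY FROM DISK = N'C:\Backups\MyDb_tail.trn';
```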
If you're unsure about the number of zeros, you can always copy that one out here and start guessing. But if you're someone like me and don't like guessing very much, I actually placed the algorithm in this slide deck. Yeah. [inaudible audience question] [Rick Kutschera] The question is, what happens if somebody deletes a row after that? Well, that delete statement will also not be executed. Everything that happens after that big delete statement has just vanished. The way you would normally do this is you restore the database in a second location, stop it there, and then make sure you get the data over to the original location. And this is exactly why I said this is an academic scenario, because you need to make sure that nothing happens afterwards. Let me go through what I just did, and then I'll show you why I showed you this in the first place. So the way to figure out which transaction was the bad one is to use fn_dblog and group it by the transaction ID. Then you need the highest LSN. I did not use that exact statement; I just looked into the LSNs themselves and looked for the exact commit, but it is always the highest LSN of a transaction that has the commit record. We already talked about this one: ten digits in the second block, five digits in the third block. That's all that has to be said. Then of course you do need the tail-log backup, because only if you do a tail backup can you afterwards do a restore from it. And then you run the normal restore statement and end it with a STOPBEFOREMARK. There is also the possibility of STOPATMARK, depending on which LSN you use. STOPBEFOREMARK is the easiest one, because you can take the exact commit and tell it to do everything up to that point. Okay, I'm through with the demos I wanted to show you. But as I do have some time left, I'll show you what the real purpose of this scenario is.
I'm sure that everyone here has a good version control system and everyone has smart developers in their company. The reaction says it all. I once had a scenario where a developer came in and said, "We rolled out a change to a stored procedure a while back, and we had a bug in it. We developed a bug fix and rolled out that bug fix, and we had another bug in it, and we rolled out another bug fix, and then a third one and a fourth one. And now the version we have in production is worse than the one before." I said, "Well, okay, you have version control. Just get your old version and roll it out." "Yeah, that's the problem. We don't have all these versions in version control. We just fixed them on the fly. Can you somehow magically restore the old version for me?" Sound familiar to anyone? There is a way to do that, and in this scenario, where you need very, very specific data from a database, this is actually a feasible approach, because you can take the same route. As you can see, I have another one, and it's called the dead developer for a reason, because this guy gets killed afterwards. The cool thing is that the transaction log also stores a name for each transaction, and this is called the transaction name. It's a plain-text field, and you can search through it for alter statements. And just to make that easier—transaction, whoops. I have 3 updates here. I'm just scrolling through to show you that. The transaction name says create or alter procedure. That by itself doesn't tell you much, especially if you have more updates on your database than that one. But if you look at this a little deeper, and we're back to identifying a transaction ID, I take the second one of these: fn_dblog, null, null, where transaction ID = that one, order by current LSN ascending. The nice thing about this one is that this is the entire transaction that goes on just to alter 1 stored procedure.
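A sketch of the transaction-name search just described (the exact transaction names vary by SQL Server version, so the LIKE pattern here is an assumption):

```sql
-- Each transaction's begin record carries its name; DDL such as
-- CREATE/ALTER PROCEDURE shows up there.
SELECT [Transaction ID], [Transaction Name], [Current LSN], [Begin Time]
FROM fn_dblog(NULL, NULL)
WHERE Operation = N'LOP_BEGIN_XACT'
  AND [Transaction Name] LIKE N'%Proc%';
```

From the matching transaction IDs you can then list each transaction ascending by Current LSN and inspect it, as in the demo.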
This is what happens internally, and I'll scroll back a little here, because there is a field in here called Lock Information, and the first thing an alter procedure does is take a schema modification lock on that stored procedure, and that's how you find it. And now I know, looking at the object ID in sysobjects— I won't show you that one— that this is actually one of the statements that had an old procedure version. Now I can do the same thing we did before: look up where the exact commit is, take a tail-log backup, run it with a STOPBEFOREMARK, and restore exactly to the point where the stored procedure was not changed yet. And I can do that again and again for every stored procedure version that I have in the transaction log. It's still a crappy solution. I would prefer Team Foundation Server any day over that, but it is a solution that works. And as I said in the very beginning, if you are with your backs to the wall, that's a way you can go. As long as you don't need it, well, fair enough. Good for you. Okay, now I have about 9 minutes or so left. Opening up the floor for questions. Yes, please. [male speaker] In the Dead Bunny scenario, you finished with a restore of the transaction log right up to the point before the massive delete. Do you then need to roll back the transaction with the massive delete? [Rick Kutschera] The question is around the restore right up to the commit, and do I need to do a rollback of this transaction? Yes, you do need to do a rollback. The reason I'm doing this and not stopping at the beginning of the transaction is that if you have interlocking transactions, you would lose everything else in between as well. I try to take the latest possible point. Yes, please. [male speaker] Taking that scenario, when you're doing the recovery, if you're restoring all the logs, the recovery portion of that restore would take care of all of those interlocking transactions as part of the actual recovery process.
[Rick Kutschera] Yes, so the statement was: if you do all the restores of the transaction log, the recovery process will take care of interlocking transactions. Yes, that's exactly the point, but if you stopped at the begin transaction instead, so you don't have to roll back that big one, everything that happens after that begin transaction would not get rolled into the restore. [male speaker] I see. [Rick Kutschera] Question in the back. [male speaker] The function that you're running, fn_dblog, does that table get truncated at all when you do backups? [Rick Kutschera] The question was, if I run fn_dblog, does that table get truncated over time? This is not a table. What the function does is really just read what's in the transaction log file. If you do a log backup, the transaction log will get truncated, and you won't get anything back anymore, unless you run the backup with no truncate, of course. [inaudible audience question] [Rick Kutschera] I don't understand the question. [inaudible audience question] [Rick Kutschera] The question is, can you read the old transaction log backup, because you obviously have the log chain. You can read it, but you need to restore it first; I have not found any supported or usable way to read a transaction log backup file natively. Very well. If there are no more questions, I'll be around for a little while if you want a one-on-one. Please take the time to fill out the evaluation of the session. Thank you very much. Enjoy the rest of TechEd. [applause]

Video Details

Duration: 1 hour, 7 minutes and 48 seconds
Country: United States
Language: English
Genre: None
Views: 6
Posted by: asoboleva99 on Jul 9, 2013
