
My SQL Dump

MySQL musings by a self-professed MySQL Geek

A broken MySQL backup is useless - Make sure you test both your backup AND your recovery processes
This post was sparked by a recent message on an internet mailing list, where an engineer tried to restore a 40GB MySQL production backup for disaster recovery purposes. Unfortunately, the backup could not be restored and the recovery program returned an error. It is unclear (to me) whether this is due to a problem with the recovery tool, the original source database, or the backup itself.

I guess the individual never tested RESTORING the backups they were taking, which any production DBA will tell you is absolutely necessary.

Don't let yourself get into the unthinkable situation where all your data is "backed up", but you have no way to restore it. A corrupted, improperly taken or unrecoverable backup is worse than useless, particularly if one's faith is placed in it.

In order to protect your operation:

1) Make regular, logical backups of your data with a tool such as 'mysqldump'. Store at least one copy of these backups off site, with one or more copies preferably in a geographically remote location. These represent a "last prayer" for data recovery if all other methods fail, or if a severe event destroys local copies of data and backups. The DBA responsible for backup/recovery should verify the backup and fingerprint the contents to ensure that any copies are not corrupted.

2) Use hot backup tools to maintain an additional binary backup locally. LVM snapshots or other technology can be used for these backups.

3) TEST both of these backups with full restores to staging machines. This is a great way to keep your staging environment up-to-date with regular production refreshes while at the same time verifying the integrity of your backup system.

4) Use local replication to maintain a dual master/single writer replication "ring". DO NOT balance reads between both masters.

5) In addition, use replication to maintain a geographically remote second copy of your data.

6) Set up two data centers, and reserve one for failover operations. Once again, avoid the temptation to load balance between them, as this makes capacity planning for an outage much more difficult. Serving geographically remote regions from a local datacenter can be very advantageous, however, so careful capacity planning, taking into account increased latency during failover, is very important should you choose to do so. Servers can quickly exceed the expected maximum number of connections when latency is high.

7) Code your application to be tolerant of service degradation, including service latency.

8) Set up regular backup infrastructure in both data centers. You need backups when you fail over, and you need to continue to maintain backups should one datacenter fail completely with little or no hope of restoration.
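As a concrete sketch of point 1 (the directory, schedule and dump options here are illustrative; it assumes a running local server with credentials available, e.g. in ~/.my.cnf):

```shell
#!/bin/sh
# Sketch of point 1: take a nightly logical dump, compress it, and
# "fingerprint" it so every off-site copy can be verified for corruption
# before anyone attempts a restore. Paths and options are illustrative.
BACKUP_DIR=${BACKUP_DIR:-/tmp/mysql-backups}
STAMP=$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"

# --single-transaction gives a consistent snapshot of InnoDB tables;
# MyISAM tables would need FLUSH TABLES WITH READ LOCK instead.
mysqldump --single-transaction --all-databases \
  | gzip > "$BACKUP_DIR/full-$STAMP.sql.gz"

# Fingerprint the dump; ship the .sha256 file with every copy.
( cd "$BACKUP_DIR" && sha256sum "full-$STAMP.sql.gz" > "full-$STAMP.sql.gz.sha256" )

# On each remote copy (and always before a restore), verify:
( cd "$BACKUP_DIR" && sha256sum -c "full-$STAMP.sql.gz.sha256" )
```

Run it nightly from cron; the .sha256 file travels with every off-site copy so corruption is detected before a restore is attempted, not during one.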


Checksum your slaves!

Make sure that you do consistency checks on your masters and slaves, especially if you use your slaves for backups.
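One way to do this is with maatkit's mk-table-checksum in --replicate mode, which computes checksums on the master so they replicate to each slave and can be compared there. A sketch only; the host names and schema are illustrative, and the exact option names should be verified against the maatkit documentation for your version:

```shell
# Checksum tables on the master; the results land in test.checksum and
# replicate downstream. Hosts/credentials are illustrative.
mk-table-checksum --replicate test.checksum h=master.example.com,u=dba

# Then, on each slave, look for tables whose checksum disagrees with
# the master's (column names per the maatkit checksum table layout):
mysql -e "SELECT db, tbl, chunk FROM test.checksum WHERE this_crc <> master_crc"
```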

Re: Checksum your slaves!

Yes, a very good point. Also, keeping a slightly out-of-date slave using maatkit or other tools is a good idea, to help recover from user or administrative error, like dropping a table or database.

Re: Checksum your slaves!

The problem with that is: how out of date should your slave be?

In general the problem with your list is that you don't always explain why things should be done, or you over-generalize the "why" -- for example, you say keep the slave out-of-date, but it's not practical in production, because it might take someone 5 minutes to call you to say "oh I messed up" or it might take them 30 minutes.

In general I advise a logical backup with mysqldump --skip-extended-insert, with each table being exported to a different file -- that way you can retrieve part of the data.

You explain that the logical backup can be a "last resort" but there's something you left out: Not many people can do a good, consistent mysqldump -- that requires locking, and there's no way within mysqldump to lock more than one database at a time (if I recall correctly).

The testing and explanation is good :)

You don't explain why you should do 4 and 5 at all, nor that 6 is related to 5. And in general, it's best not to demand #6, but to ask the questions that #6 poses -- what is your plan in case of a disaster on the machine? A disaster in the rack? A disaster in the data center? A disaster that affects the entire coastline / 100 mile area around you?

For some, the plan may be "we take the downtime, sometimes disasters happen".

Again, #7 is a good idea, but give some examples.

#8 is good, but it should be in the context of 2 larger points: When you have to fail over a machine/rack/data center/whatever, do you still retain ALL your functionality? If you have dual purposes for a server, what happens when failover occurs? (ie, a slave used for disaster recovery and reporting, where does reporting go when the slave is used for DR?)

So, in general, your points are good and solid, but I fear they will become legends and people will do them without understanding why, and eventually they will be myths.

Re: Checksum your slaves!

Sheeri, you are right, this article contains hints that pertain both to database backup/replication tools and to 'disaster recovery', which is related because it is facilitated with those tools.

There is no hard/fast rule about keeping an out of date slave, but....
I tend to keep a slave between 1hr and 4hr behind, depending on the replication volume. If you have very high write volumes it may not be possible to keep a slave very far behind without the risk of it never being able to catch up at all, though if you have this much volume it is time to consider horizontal partitioning. In general, if your infrastructure is write-scalable you won't have a hard time keeping one slave per replication master an hour or two out of date. If you are taking LVM backups, you can probably restore a copy of a dropped table by running another copy of the database on a read-only snapshot. Thus a time-delay slave is less important if you have LVM hot backup infrastructure in place and can perform periodic snapshots.

Regarding using mysqldump w/ single insert statements:
Instead of single insert statements, consider backing up to tab delimited files, as this adds much less overhead to the data as compared with single insert statements. It can be loaded more quickly too, because LOAD DATA INFILE has much lower parser overhead than millions of INSERT statements. You can get a transactionally consistent (for InnoDB) mysqldump using the --single-transaction option, though this won't take a consistent snapshot of MyISAM tables - you should use FLUSH TABLES WITH READ LOCK for that.
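A sketch of the tab-delimited variant (database, table and path are illustrative; note the .txt data files are written by the server process, so the directory must be writable by mysqld and local to it):

```shell
# Dump each table of 'mydb' to a pair of files: table.sql (the DDL,
# written by mysqldump) and table.txt (tab-delimited data, written by
# the server itself). Names and paths are illustrative.
mysqldump --tab=/tmp/dump --single-transaction mydb

# Restoring one table is then the DDL plus a fast bulk load:
mysql mydb < /tmp/dump/mytable.sql
mysql -e "LOAD DATA INFILE '/tmp/dump/mytable.txt' INTO TABLE mytable" mydb
```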

You should always do #4 (multi-master, single writer) as it is the easiest way to maintain a 'hot standby' for your database. There are numerous articles that espouse the benefits of dual-master/single-writer, and tools (such as MMM or flipper) to automate such systems, so I won't go into them further here. You can instead use a storage replication tool like DRBD, but there are various reasons why this is problematic. You can also use four machines (two DRBD clusters) in a dual-master/single-writer ring for increased uptime.
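One classic safeguard for a dual-master pair, even with a single writer, is to interleave auto-increment values so the two masters can never hand out the same id. A my.cnf sketch; server ids and log names are illustrative:

```ini
# my.cnf fragment for master A
[mysqld]
server-id                = 1
log-bin                  = mysql-bin
auto_increment_increment = 2    # two masters in the ring
auto_increment_offset    = 1    # A hands out 1, 3, 5, ...

# my.cnf fragment for master B
[mysqld]
server-id                = 2
log-bin                  = mysql-bin
auto_increment_increment = 2
auto_increment_offset    = 2    # B hands out 2, 4, 6, ...
```

This way, if writes ever do land on the standby master (during a botched flip, say), they cannot collide with ids generated on the active one.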

#5/#6 are both disaster recovery related. A 'disaster' is any event which leads to a situation where services cannot be served from a data center, and such service is not recoverable by any means except complete restoration from backup. Such an event could stem from power outage, flood, fire, earthquake or human error. Such events often require reprovisioning of hardware in the data center. Note that some services are conveniently served from the datacenter with the fewest number of transits from the client, which is why I have the special note about latency. The further apart two network objects are from each other (in physical cabling distance), the longer it takes to communicate between those nodes. For example, a data center connected by a backup satellite link may have as much as 500 - 1000ms of latency added to packets.

#8 assumes that you (like a Yahoo!, or a Google, or an AdBrite) must be able to continue all operations should primary data center operations be lost due to an act of God. I worked for a very big company that could completely fail over all services to a secondary data center, but had no core backup/recovery infrastructure in that data center. If the primary data center was lost, and operations were forced to fail over for a considerable amount of time, serious infrastructure problems could arise: no additional backups would be created, and access to the original backups would be lost, with nothing to restore from. In this case, a double disaster would kill the service, which they felt was acceptable. They failed to realise that within a few weeks of failover operations, the system would be in a completely unrecoverable state - which I consider a second disaster.

Re: Checksum your slaves!

If you have very high write volumes it may not be possible to keep a slave very far behind without the risk of it never being able to catch up at all

Your logic is flawed. There's no difference in write volume if I keep my slave 0 seconds behind or 1 hour behind or 4 hours behind. The write volume is the write volume, no matter how long I keep my slave behind. It's just time-shifted, which doesn't affect write volume.

You're assuming that you must be able to continue all operations, and that works for "big" companies, but startups cannot afford that kind of cost, and many startups could "start over" or have a loss of 1 days' worth of data without affecting their business too much. That's my point; your suggestions are *good* but they assume too much.

Re: Checksum your slaves!

My logic is that if your slave is four hours behind, and something happened two hours ago, you might have to roll the slave forward by two hours to get your copy "up-to-date" to the point at which the drop happened.

Also, since writes can happen simultaneously on the master, but are serialized on the slave, it can take longer for a slave with a very high write volume to catch up to that period than the actual two intervening wallclock hours would suggest.

I don't think my logic is flawed.

Re: Checksum your slaves!

You said:
There is no hard/fast rule about keeping an out of date slave, but....
I tend to keep a slave between 1hr and 4hr behind, depending on the replication volume. If you have very high write volumes it may not be possible to keep a slave very far behind without the risk of it never being able to catch up at all, though if you have this much volume it is time to consider horizontal partitioning.

This implies that an out-of-date slave might lag farther and farther behind, but a regular slave wouldn't. That logic is flawed.

Writes are serialized on the slave regardless of how far you keep the slave behind. Let's take 3 servers: master, slave1, slave2. slave1 is a regular slave, and slave2 is kept at a 4-hour lag.

at time 0: a 5-minute write query starts on the master (q1)
at 1 min: another 5-minute write query starts on the master (q2)
at 5 min: q1 finishes on master, starts on slave1
at 6 min: q2 finishes on master
at 10 min: q1 finishes on slave1, q2 starts on slave1
at 15 min: q2 finishes on slave1
at 4 hour 5 min: q1 starts on slave2
at 4 hour 10 min: q1 finishes on slave2, q2 starts on slave2
at 4 hour 15 min: q2 finishes on slave2

So your logic is flawed because you're implying that slave2 somehow might lag farther and farther behind simply *because* you're keeping it 4 hours behind.

Re: Checksum your slaves!

It takes time to copy the data from the slave, during which point replication MUST BE STOPPED.

You have to catch up the replication for the downtime of the apply thread. If you can't catch this up, you are in trouble.

Re: Checksum your slaves!

Ah, so it's not the theory of keeping a slave behind, it's the application -- the way the tool works, it adds the extra time to copy the data -- as opposed to something like letting the relay log be caught up, but only allow the SQL thread of replication to apply statements from 4 hours ago or more.

Then certainly, that's a good caveat to have....given the particular tool you're using.

Re: Checksum your slaves!

You have to catch up the apply thread to "10 minutes ago", when the table was dropped. If your slave is at 100% capacity w/ respect to writes, then this will take "wall clock" time, which may or may not be acceptable. Then you have to get the table to the master. The io thread can continue to accumulate logs from the master, and in most circumstances the apply thread can continue as long as you skip the DROP TABLE, accidental DELETE FROM, TRUNCATE, or UPDATE (the bad DDL/DML) which caused the problem.

Note: you might want to keep the slave at the point right before the drop and not restart the apply thread for some time. You may want to do this for various reasons. The "catch up time" matters if this is the case.

The primary questions are:
a) How long does it take your slave to catch up to "10 minutes ago"?

b) How do you want to restore the table?
A) using mysqldump
B) using a binary copy

c) Do circumstances allow the SQL apply thread to continue during b)?

The answer to a) affects the downtime. The downtime is a) plus the amount of time it takes to complete b). If circumstances do not allow the apply thread to continue while restoring the table, then the slave must be able to catch up for the time lost during b).

If you are going to use A) mysqldump, then you can catch up to and skip the bad SQL statement(s), then resume deferred operation. You'll use mysqldump for restoring the necessary table(s). Pipe the output to the master while also making a backup copy to a file. I recommend using the unix 'tee' command for this. It adds a proverbial "T fitting" to a unix pipe. Once this process is complete, drop the table on the slave, as it will be automatically repopulated from the master via replication.
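The tee-based pipeline might look like this (host, database and table names are of course illustrative):

```shell
# Stream the recovered table from the delayed slave straight into the
# master, keeping a file copy as it flows past. Names are illustrative.
mysqldump -h delayed-slave mydb mytable \
  | tee /tmp/restore-mytable.sql \
  | mysql -h master mydb
```

The file copy means that if the load into the master fails partway, you can inspect and replay the dump without touching the slave again.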

Making a binary copy is easy with MyISAM but more problematic if you use InnoDB, because you will have to a) be using file-per-table and b) let the SQL apply thread idle for some time (it depends) until InnoDB goes into a quiet state so that you may export the table from the tablespace. You have to leave the apply thread quiesced until you can export the table in this case. This shouldn't really cause a problem though.

If you are using MyISAM, then skip the bad statement and copy the table.

The binary copy method only works if you can spray the copy to all of your slaves (or dual masters), or you don't have a complex slave infrastructure yet.

Re: Checksum your slaves!

And all of this is a LOT of work, for something that rarely happens.

Is it better to do all that work in preparation for a DROP TABLE that rarely happens, but when it does, it can be restored from a backup + incrementals?

This comment is showing the weaknesses of "keeping a lagging slave". As I said, I have actually done a DROP TABLE by accident. I went to the backups, got the most recent logical backup of the table (6 am that morning), imported that data, then trolled the binary logs using

mysqlbinlog --start-datetime='xxxx-xx-xx 06:00:00' --database="db_name" mysql-bin.* | grep -i tblname

and applied all those statements. If the table is regulated by a primary key (most are) you could even go farther back, say 05:55:00, and ignore errors when you're applying those statements.

Total time to restore the table was 10 minutes. It would have been much faster if I had a slave that I could mysqldump right away, but I've been working with MySQL since 2001, and it's only happened once. Having a whole slave server set up just to make it easier on the off-chance that I fat-finger a command is pretty difficult to sell to management.

I think this is just a factor of environment; if there's an extra server, then sure, have a time-delay slave. But every single company I have worked for, and worked with, in the past 8 years, has had better uses for an extra server -- for dev/test environments, partitioning/taking load off other servers, or re-allocated for a different purpose (ie, if the sysadmins needed the machine for something else). The bigger the environment, the more likely they have the "extra" server. I've never worked at a place like Google that has thousands of machines; My experience is with tiny through large environments (ie, a few hundred machines) -- not extra large ones.

Multiple day backups are also needed

Making regular backups and having copies offsite is still not enough.

You actually need to have multiple days' worth of backups available, even if you consider your system non-critical, as I discovered with my blog. See http://ronaldbradford.com/blog/has-your-blog-been-hacked-2009-09-08/

Re: Multiple day backups are also needed

Yes, you should always have at least two weeks of backups available if your infrastructure can support that. I personally suggest retaining at least 2 FULL backups, and of course, you must maintain the incremental backups too.

Re: Multiple day backups are also needed

Why 2 weeks? Why not 1 week or 1 month? Again, you have these rules without reasons. How about "keep the backups since the last application/server change"? It really depends on the application and what it is used for.

And yes, you need to copy the binary logs, and any slave position, and the configuration files too!

Re: Multiple day backups are also needed

In the organizations I've worked with, application changes are usually rolled out once or twice a month. The two weeks recommendation comes from that experience, along with the knowledge that two weeks of backups consumes a LOT of space in a fireproof safe, which is where you should be storing them.

Re: Multiple day backups are also needed

Right, so better to say something like, "keep enough backups so you can revert to a previous application change. If you don't make application changes often, think about how long it might take to find a grievous logical error (2 days? a week?) and keep that many."

I advocate for a grandfather/father/son policy -- if space permits, keep something like 19 backups --

6 monthly backups -- full backups from the 1st of the month, for the last 6 months (or whatever makes sense, if changes happen the first week of the month, maybe the 15th or 30th is a better backup to keep -- just before or just after changes...).

6 weekly backups -- full backups from one particular day a week (Wednesday or Sunday is often chosen for a variety of reasons) for the last 6 weeks

7 daily backups -- full backups from the previous week

That way you can go back in time if you need to. And at one company, I had to go back in time (using the logical backups) to restore customer data for the bigwig customers....

Of course, if you are using the database to, say, store e-mail type messages back and forth on your social network, and your policy is that messages are kept a max of 30 days, you only need to back up about 30 or so days (depending on legal stuff, you may need to refer back to e-mails, so you may want/need more than 30 days). Every application has different backup needs.
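A grandfather/father/son rotation like the one above can be sketched as a pruning pass over a directory of dated fulls. Everything here is an assumption to adapt: the full-YYYYMMDD.sql.gz naming scheme, the retention counts, and GNU date for the arithmetic. It only echoes; swap the echo for rm -f once you trust it.

```shell
#!/bin/sh
# Dry-run G/F/S pruning: keep last week's dailies, ~6 Sunday weeklies,
# and ~6 monthlies taken on the 1st; report everything else as prunable.
gfs_prune_dryrun() {
  dir=$1
  now=$(date +%s)
  for f in "$dir"/full-*.sql.gz; do
    [ -e "$f" ] || continue
    stamp=$(basename "$f" .sql.gz)
    stamp=${stamp#full-}                                      # YYYYMMDD
    age_days=$(( (now - $(date -d "$stamp" +%s)) / 86400 ))
    dow=$(date -d "$stamp" +%u)                               # 1=Mon .. 7=Sun
    dom=$(date -d "$stamp" +%d)
    keep=no
    [ "$age_days" -le 7 ] && keep=yes                         # son: dailies
    [ "$age_days" -le 42 ] && [ "$dow" -eq 7 ] && keep=yes    # father: weeklies
    [ "$age_days" -le 186 ] && [ "$dom" = "01" ] && keep=yes  # grandfather: monthlies
    [ "$keep" = no ] && echo "would prune: $f"
  done
  return 0
}

gfs_prune_dryrun "${BACKUP_DIR:-/tmp/mysql-backups}"
```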

Re: Multiple day backups are also needed

Good suggestion. Some automated backup software will let you manage a G/F/S backup automatically. I know Omnibackup (if HP still sells that...) will let you do this.

Re: Multiple day backups are also needed

*nod* Note that these are all *suggestions*, including your points -- basically:

If you want X, we suggest you do Y.

You're starting off with the assumption that everyone wants to do X. While it's true that many people don't realize that they want X, making rules with unstated assumptions is the path to myth-making. (ie, if you give your list to a startup they'll spend all their money following the rules on your list, when it's very likely they don't need more than 1 or 2 items on your list.)

At Pythian, we have scripts to manage G/F/S backups (including easy ways to store/retrieve on Amazon S3...retrieving isn't so easy due to "buckets" and storing isn't easy because they have a max filesize of 5Gb (or maybe 4 Gb?)). We also have many clients who do backup in many different ways, and only a few hit all your bullet points. While it's true that other clients are improving their backups, it's also true that some clients just don't need a Facebook/Yahoo/Google kind of uptime.

Most of our clients have maintenance windows they can schedule, which means that it's OK for them to have some downtime. We go through with them the possible scenarios and what's OK.

For example, if a company told people on September 12, 2001 that they lost half a day's worth of data, nobody would complain. If they lost data every time there was a hurricane, everyone would complain. So there's a balance to be made there, and it's OK to have a calculated risk so long as it's *conscious*.

But yes, if you need that kind of uptime, then yes, you should have everything on the list -- and make sure your geographical redundancy is really redundant. Having a data center in San Francisco and San Jose, California won't help you if a massive earthquake affects the area. Having data centers in Boston and New York also won't help if there's a bad hurricane or blizzard that affects the northeast. Similarly, data centers in both coastal Texas and the west coast of Florida aren't really redundant, because one natural disaster can take them both out.

However, even if you're Facebook or Yahoo! or Google, if aliens destroy both New York and Silicon Valley, your customers will probably be OK with the fact that you lost some data.

To me, disaster recovery is like load testing -- just like at some point your load testing should get to the point where the load is too much, your disaster recovery plan should get to the point where you say "in that case, it's OK to lose data".

Retention policy, redundancy policy and backup frequency vary by data

Some data is very important. Due to legal requirements, some data must be kept for N years, or sometimes even in perpetuity! Loss of such data, even in a disaster, if it were at all preventable, could be considered a violation of state or federal law.

Some data is not important and should not be backed up at all.

Some services are very important. Not maintaining a service SLA can mean millions of dollars in lost revenue plus penalties due to legal contractual requirements.

Some services are not essential and loss during a disaster is acceptable.

A good disaster recovery plan identifies and categorizes all assets of value by financial value and also by cost of loss of service or downtime.

You need a fully managed and organized computational infrastructure to get to the point where disaster recovery can be effectively implemented in a way that constrains cost and man-hour effort.

Re: Retention policy, redundancy policy and backup frequency vary by data

Exactly! :) I would have been much happier if the post were mostly stuff like this, with your list of points from the original post being an example of how one company kept good, useful backups.

For example, how useful is a backup if you don't have the right config to go with it? You can corrupt InnoDB if you have the wrong my.cnf settings. The backup itself isn't broken, but nonetheless you can't restore.

(Your ideas are very good, I hope I made that clear. They're just not *universally* applicable, which is why I've been nit-picking).

Re: Multiple day backups are also needed

Also, I suggest two full backups, and daily incrementals, which adds up to two weeks :D

Re: Multiple day backups are also needed

No, actually two full backups don't mean two weeks. You've made another assumption -- that a full backup is done once a week. In my experience, most MySQL databases I've worked with are backed up with a full backup nightly -- the ones that aren't do so on purpose because:

1) they don't have the resources to have a separate server just for taking backups (many medium-to-big companies do have the resources, but many don't!)

2) they cannot have the downtime to do a cold backup

3) a "hot" backup using XtraBackup or InnoDB hot backup takes more than 24 hours *or* taxes the system too much (we've found that both tools cause more slave lag than usual)

4) they do not have a method of using a snapshot (LVM, ZFS, etc).

In those cases, a full backup is done every few days or weekly...

So again, you need to state your assumptions -- full backup = weekly. If you're taking "daily incrementals" -- what is that, once per day you're copying binary logs? Assuming you haven't set expire-logs-days to less than 7 or 8, and assuming you aren't purging binary logs some other way, you still could lose up to a day's worth of data with that method.

Some clients decide they want to backup binary logs every hour, so that they lose at most 1 hour of data.
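An hourly binary-log copy can be sketched like this (host names and paths are illustrative; it assumes log-bin is enabled and credentials are available to cron):

```shell
# Hourly from cron: close out the current binary log, then ship all
# finished logs off-box. Paths and host are illustrative.
mysql -e 'FLUSH LOGS'
rsync -a --exclude='*.index' /var/lib/mysql/mysql-bin.* \
    backuphost:/backups/binlogs/$(hostname)/
```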

I wish that it was easy to have a list of rules to live by, but really the only rule is "think about the disaster possibilities and then think about what is OK to lose in what scenarios".

Remember that relay logs have the same information as binary logs, if the slave is up-to-date....

Honestly? In practice, DROP TABLE happens a LOT less often than is necessary to warrant a completely separate mysql instance. If this isn't the case, then privileges should be revoked. Having a slave be X behind is one solution, but a much better solution is one that will work even if your time threshold of X is passed.

I've done a DROP TABLE by accident a few times -- all but once the problem was recoverable, like I dropped a table, imported new data, then accidentally dropped it again....so I just reimported -- that last time, I got the previous day's backup from the mysqldump logical export, and used the binary logs to recover the rest of the table.

That's why I am not thrilled with a separate server instance -- sure, it takes a bit longer to restore, but the binary logs are already there. Setting up an additional server *just* to be a few hours behind doesn't seem worth it. Maybe if it's also a backup server, sure, or if it's 10 minutes behind and it's a disaster recovery server, sure, or if it's also a test server. But otherwise it seems like a waste of resources. Which, again, are much more easily absorbed in a medium-to-big company than in a small one.

I just don't think it's a good idea to give out "rules" without clarifying when they apply. How often has having that "behind" slave really helped? Especially when there's another way to get data?


It's amazing to me that more organizations don't do this... The best example of what can go wrong would be Ma.gnolia, who lost all of their data because they never bothered to check whether their backup script was working properly (I could be wrong about the specifics, but that's what I remember reading about).

In the past I've set up monitoring of the backup scripts as well, to make sure they ran successfully and that the right files were generated. This doesn't replace periodic testing that the backups are sound, but it does help; especially if you're using InnoDB Hot Backup, which will just fail quietly when the license expires.

I agree! The company I work for, www.pythian.com, has reactive daily scripts to check the freshness and status of the backups via the log files the backups make, proactively checks that the backups are working every 2 months, and periodically restores backups (the period depends on the client, we automatically restore 2 instances on 1 development machine every week!)

