Monday, February 20, 2006

So what was the answer, part IV

Finishing up the four questions thread – we are now on “Continuity of operations”. Basically, the discussion was about fail-over – disaster recovery.

The topic of conversation was data guard and/or remotely mirrored disk: “We need to provide for continuity of operations and want to understand the options”.

What I find unnerving sometimes in these conversations is the theoretical desire some people have to operate a heterogeneous disaster recovery (DR) site. That is, one group really, really wanted me to tell them how to use data guard across different operating systems (the answer is: you do not, and no matter how hard you pressure me, I will not say otherwise). I had a question about using data guard between 9i and 10g (the answer is: you do not). DR is supposed to be something that is relatively bulletproof – easy to have happen when you need it to.

Trying to do DR for your 9i database to 10g (or vice versa) would be less than useful in my opinion. When the day comes for you to fail over, you really want things to go smoothly. The fact that you are failing over indicates you are already having a really bad day. The data center has burnt down, exploded, flooded, whatever. Maybe people are injured or worse. Maybe the lead person who knows everything about everything isn’t around to wave their magic hands and fix things. You just want it to work. Period. You don’t want to be faced with not only activating the standby database – but upgrading (or downgrading) the database your application runs on at the same time. When I state it that way – “you really want to activate the standby and upgrade your database at the same time?” (usually I throw in: when was the last time you upgraded anything with zero errors on the first try?) – they usually get it.

The same is true for cross operating system – you need the standby/fail-over site to basically be the same as production. Maybe standby is not as large (fewer CPUs, less expensive disk setup, whatever) but it is “the same”. Running production on Solaris and trying to have a Windows machine as a fail-over is a recipe for disaster itself (or to be fair, the converse is true as well).

The problem, I think, is that most people have never actually had to fail over (that is a good thing, I suppose). It is something they’ve heard has happened to a friend – but they have never experienced it themselves. This leads me to the next point people seem to forget with DR sites.

You do not want to fail over for the first time the very day you actually need to. No more than you want to test your ability to recover a database for the first time when you actually need the recovery to work!

Probably the best way to ensure your DR plan won’t work is to not test it. Data guard is pretty good about letting you test – it allows you to do graceful switchovers and switchbacks so you can verify that if you actually ever need to run on your standby site – YOU CAN.
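For a physical standby, a switchover test boils down to a couple of commands on each side. A rough sketch only (database names omitted, and the exact steps vary a bit between 9i and 10g, so check the Data Guard documentation for your release):

-- on the current primary: convert it into a standby
alter database commit to switchover to physical standby with session shutdown;
shutdown immediate
startup mount

-- on the current standby: convert it into the primary and open it
alter database commit to switchover to primary;
shutdown immediate
startup

-- on the new standby (the old primary): resume redo apply
alter database recover managed standby database disconnect from session;

Run the same thing in the other direction and you have your switchback – and, more importantly, proof that you really can run on the standby site.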

Oh, and you need to do this on a recurring basis. Because, as we all know, software has a shelf life; it goes stale over time. Just because you tested the failover (via a switchover) 4 months ago doesn’t mean it’ll work today. Things change. This is one of the things you want to do on a scheduled basis and after major changes (like an application upgrade – you need to make sure the standby has the upgraded application and can function as well!).

Some of the people I was talking to had questions about data guard versus remote disk mirroring. I myself would prefer to use database methods to protect database data. The problem with remote disk mirroring and databases is that databases tend to write a ton of stuff. In reality, however, all we need is the redo to be mirrored. Consider an insert into a table with three indexes on it, using an 8k blocksize. Oracle will modify at least 4 blocks (one table block, three index blocks), generate at least one block of UNDO, write that redo to at least 2 redo log members (assuming multiplexed logs), and eventually archive that redo. Remote disk mirroring will be forced to perform all of that work over the network (it just sees 8k block writes all over the place). Data guard, however, will just transmit the redo stream. The reduction in data transferred over the network can be huge when you compare data guard to remote disk mirroring. Not only that, but the DR site using data guard can be used for some things – like a reporting database, or as the database to be backed up (offloading that from production), and so on.
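You can see the difference for yourself with a quick test. This is just a sketch (the table and index names are made up, and your numbers will vary):

create table t ( a int, b int, c int, d varchar2(100) );
create index t_idx1 on t(a);
create index t_idx2 on t(b);
create index t_idx3 on t(c);

-- the session's 'redo size' statistic before the insert
select b.name, a.value
  from v$mystat a, v$statname b
 where a.statistic# = b.statistic#
   and b.name = 'redo size';

insert into t values ( 1, 2, 3, rpad('x',100,'x') );
commit;

-- ...and after the insert
select b.name, a.value
  from v$mystat a, v$statname b
 where a.statistic# = b.statistic#
   and b.name = 'redo size';

The delta in 'redo size' is roughly what data guard would ship; remote disk mirroring has to re-send every one of those modified table, index, undo and redo blocks it sees being written.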

Does that mean “no remote disk mirroring needed?” No, not really – you still have software, configuration files, setup information – other data that needs to be over at the standby, but that isn’t in the database. Remote disk mirroring and a standby database are complementary; it generally takes a bit of both to get it done.

This’ll be the topic of conversation (well, one of them) for me tomorrow in fact. I’ll be speaking at this conference in Orlando for 2 hours about Availability, Manageability, and Security.

13 Comments:

Anonymous Sam said....

Hi Tom,

We have a customer who has a DR site using Data Guard 9i. He wants to change it to use SRDF (EMC), which is a disk-to-disk replication tool. The reason he gives is easier administration, and the fact that his production team is used to SRDF (though we have had the Data Guard DR working for 2 years). The only loss I see concerns the backup strategy: we were using the standby site for backing up. Do you see any other disadvantages besides the cost (which obviously doesn't matter that much)? Network traffic is not a big deal either, apparently.
It seems that SRDF is being used widely for DR sites... Thanks

Mon Feb 20, 11:28:00 AM EST  

Blogger Thomas Kyte said....

Seems funny - they have it running, it is running, they are managing it (and data guard isn't rocket science to manage)...

disk mirroring:

o there is the cost of the disk (must be the same in both)

o as well as the mirroring software itself

o you won't be using that DR site for anything - with data guard there is a real database on the standby, so backups, reporting and such can take place there; with remote mirroring you lose all of that. Heck, with 10g you can even flash back your standby, scrape out data that was "broken" in the production system, flash the standby forward and continue applying changes (see the sketch after this list).

o disk mirroring is immediate. You lose the ability to use the standby to recover "accidentally" wiped out data.

o disk mirroring is immediate - write a corrupt block on production, have it faithfully reproduced on the standby. Using data guard, a physical failure on production like that won't affect the standby site - the redo is transmitted and turned into changes against the blocks.

o the network traffic will likely go up by a factor of at least 5 - if not more. (internally we observe about 1/7th the network traffic using data guard compared to remote disk mirroring, on one of our largest and most active databases)
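The "flash back your standby" bit above looks roughly like this on 10g (a sketch only; the timestamp is made up and flashback logging must already be enabled on the standby):

-- on the standby: stop redo apply and rewind to just before the "oops"
alter database recover managed standby database cancel;
flashback database to timestamp
  to_timestamp( '20-feb-2006 09:00:00', 'dd-mon-yyyy hh24:mi:ss' );
alter database open read only;

-- scrape the "broken" data out of the standby here (query it, export it...)

-- then let redo apply roll the standby forward again
shutdown immediate
startup mount
alter database recover managed standby database disconnect from session;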



I guess if your customer has lots of money they just don't know what to do with, tons of network capacity they cannot figure out how to use, and wants to limit what they can do with the standby when it is not failed over to - sure, go for it.

Mon Feb 20, 11:35:00 AM EST  

Anonymous sam said....

Thanks for your valuable opinion. I didn't think of the "block write corruption" risk. I'll add it to the disadvantages list, but I'm afraid they're so scared of failing over with Data Guard that they'll change to SRDF. I agree with you that you don't change something that works for something that does the same thing (or even less), differently and more expensively.
It seems that they're massively using SRDF, and we are maintaining one of the few of their applications using Data Guard. I'll try to present the reality, pros and cons, but I have a small chance of "winning". But who will lose more? :) I guess them...

Mon Feb 20, 12:23:00 PM EST  

Blogger Thomas Kyte said....

Sam -

schedule a graceful switchover and switchback to demonstrate that "hey, this works"

Mon Feb 20, 12:27:00 PM EST  

Blogger kevin loney said....

When people are testing out DR, they should hand off their DR process documents to outsiders. In a real disaster, you lose people or you have people who cannot get to your DR site. In a real disaster you lose your email server. In a real disaster you lose your voice mail system. You lose your office space. You lose people.

In a real disaster you start by re-establishing C3I - Command, Control, Communications, and Intelligence. Who survived, where are they, how can they help you, what is left to work with, and how do we move it to someplace where we can use it effectively?

If you're not dealing with that level of disaster preparedness then you are just preparing for fake disasters. I've been on fake disaster trials before - we all drive downtown to the DR site and do our steps that we wrote. In a real disaster no one would drive downtown, or the downtown may not exist.

I agree 100% with the simplification advice. You must be able to do a blind handoff to people uninvolved with your systems. Imo if you aren't preparing at that level then you'll only be prepared for fake disasters. The technology for the recovery should be the easy part of the process; you'll have much more difficult issues with the other parts of the recovery.

Mon Feb 20, 01:56:00 PM EST  

Blogger Thomas Kyte said....

Kevin - well said.

I truly believe most people believe "it won't happen to us, if we have to fail over it'll be because we lost power or something"

Some people get visibly uncomfortable when I mention "people might have died, you might not be there, so the fact that YOU know what to do is not really the point"

Mon Feb 20, 01:59:00 PM EST  

Blogger David Aldridge said....

Couple of points: it seems as if for many companies a failover site might be a good candidate for hosting through an external organization (hint: I've got a lot of space available in my basement, and I could move the kids to a smaller bedroom ...)

Also, how is the data guard vs. other methods advice different for systems making extensive use of non-logged operations, such as data warehouses? With the much lower overhead from index logging, undo, redo and whatnot, it seems like the balance might move more in favour of disk mirroring, or even something less synchronous.

Mon Feb 20, 03:20:00 PM EST  

Blogger kevin loney said....

I've seen real DRs. And in the best DR test I've heard of, the manager called together the DR team and went around the room - "You're gone, you're gone, your brother is gone, you're home and your phone is dead, you can't reach your parents, you're gone...". Now who knows the number for the tape storage service?

Back to the original topic, you still have to have some method for keeping the remote site updated at the O/S patch level, O/S parameter level, user accounts, etc. There are lots of components outside the database that must be in place and properly configured for the application to properly use the database. The database failover is just one part of the picture and there are other parts that are just as critical. The database failover strategy should be part of the overall Business Recovery strategy, not just some technical problem the DBA tries to solve in isolation.

Mon Feb 20, 03:25:00 PM EST  

Blogger Wil said....

Hi,

I was wondering about your thoughts on using ASM on a stretched cluster with two failure groups for mirroring the data, and no Dataguard.

This should protect against physical corruption.

As far as I understand, if Oracle manages to write a soft-corrupt block from memory to disk, Dataguard would do the same. Or does this come back to the fact that Dataguard does not write the block, but rather applies the redo (SQL)?

Tue Aug 22, 11:52:00 AM EDT  

Blogger Thomas Kyte said....

To what end (the stretch cluster)? You have a single point of failure - your disk array.

data guard uses redo mirroring; it is doubtful the same memory fault that caused the soft corruption would occur with data guard - the block is not modified in the same way on the failover site as it was on the production machine.

I've always wondered "to what end these stretch clusters", what is the interest in them - without remote disk mirroring anyway - in which case I'd just say "data guard" again.

Tue Aug 22, 04:18:00 PM EDT  

Blogger Wil said....

Hi Tom,

Thanks for your reply.

But we would have two separate SANs located in two physical locations.

ASM would be doing the remote mirroring.
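Something along these lines, with one failure group per SAN (the disk paths here are just placeholders):

create diskgroup data normal redundancy
  failgroup site_a disk '/dev/san_a/disk1', '/dev/san_a/disk2'
  failgroup site_b disk '/dev/san_b/disk1', '/dev/san_b/disk2';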

So I cannot see any hardware issues, unless (and this is what I have been tasked to look into):

1) ASM could write a bad block to both failure groups.

2) A soft-corrupt block from the SGA would be written to both failure groups - and if that is possible, would Dataguard save us?


PS.
Great speaking to you; your books have always been a great help to me.

Wed Aug 23, 04:00:00 AM EDT  

Blogger Wil said....

Think I've answered this myself, with your help.

Because Data Guard applies the redo (SQL) against its own data blocks, any corruption in the Oracle block should be detected as the redo logs are recovered against the standby database.

Logical corruptions produced by a process which wrongly modifies the Oracle data block could be propagated to both the primary and secondary extents of an ASM diskgroup.

So yes we need Data Guard. :)


Thanks
Wil

Wed Aug 23, 08:36:00 AM EDT  
