Wednesday, May 09, 2012

Another debugging story...

There is a saying "correlation is not causation" meaning - just because you observe A and then observe B, it does not mean that A causes B - even if every single time you observe A - you see B as a 'side effect'.

Here is a great story to back that up - a debugging session where the end users had determined that they could not send email more than 500 miles.  Read it and note how much work the end users had gone into 'proving' that email cannot be sent more than 500 miles...

I use a similar story when talking about bind variables and bind variable peeking.  In the story - I make the claim that "when ever it rains on a Monday morning - you have to restart the database in the afternoon to make it perform correctly".  That is - rain on Mondays implies poor performance, and the only fix is to reboot.

This conclusion was reached via empirical observation - every single time, *every single time*, it would rain on Monday mornings - the database would respond with really poor response times to certain critical queries in the afternoon.  Every single time, without failure.  It must be that rain affects database performance somehow.

But here is what really was happening.  The database in question was a database that was backed up using cold backups (DBA's were of the archaic type).  These cold backups took place Sunday nights - late at night.  That meant that every Monday morning all of the caches would be virtually empty.  No parsed SQL for the applications in particular would be present.  So - Monday morning was the morning of hard parses every single week.

Now, in the critical application - there was a query that went against some very skewed data.  99.9999% of the time - the queries would be very selective, they would return 1 or 2 rows.  The other times - they would be very non-selective returning much of the table - but these queries were rare.

Now, the main user of this application was a person that came in early most of the time - they would get into the office at 6am and start working.  They always used bind inputs that returned very few rows - hence the problem queries would hard parse and tend to use indexes.  When this happened - everyone was happy.

The other set of users - they came in later, usually around 9am or so.  They would run the non-selective query and were OK with the index performance.  It might not have been the *best* possible performance - but it worked OK for them.

So, what was the link to rain?  Well, the office where everyone worked was located in Washington DC - when it rains in DC it doesn't matter if you leave home at 6am - it'll been noon before you get to work.  If you leave at 7am - it'll be noon when you get to work.  If you leave at 8am - it'll still be noon before you get to work.  In short - it really didn't matter when you left for work - if you are driving into the city - it is going to take a long time.  So, when it rained, the person that did the selective queries would just hit the snooze button on the alarm and go back to sleep.  Not worth driving in yet.

However, the people that did the non-selective queries - they lived in an apartment across the street from the  office.  They were not affected by the rain.  They came in at 9am regardless.  Since they ran the non-selective queries and the cache was empty - they would hard parse the queries and result with full scan plans.  This group of people was in fact a little convinced that rain *helped* performance a bit - when it rained on Mondays - they saw slightly better performance in the morning.

When the person that slept in finally got to work - and ran the application - they would definitely notice the full scans and how slow the database was.  Since it was raining - and they had observed this hundreds of times before - they call the data center - tell them "It is Monday, it is raining, you know what to do", after the reboot, everything is running fast again.

But - did it have anything to do with rain - or was it something entirely different :)

Getting to the root cause (bind peeking issue) can lead you to an efficient, effective corrective action (perhaps query plan stability for a certain set of queries for example).

Going down the empirical path of "correlation must imply causation" will lead to never actually fixing the problem - and the invention of many myths...


Blogger Meer athil said....

Good Analysis as always

Wed May 09, 12:08:00 PM EDT  

Anonymous Sokrates said....

... But - did it have anything to do with rain ...

Yes, the rain was the root cause.
A great example for the butterfly effect in database administration in my eyes.

Wed May 09, 01:50:00 PM EDT  

Blogger Thomas Kyte said....


beg to differ, rain didn't have anything to do with it. What was not mentioned (but was implied) was that there are other "non-performant" times.

All they noticed was "rain seems to imply bad performance".

They did not say "bad performance means it has to be raining".

But in any case - even if it only happened during rain, even if rain was apparently a necessary precursor to bad performance - it wasn't the rain.

It was not the rain.

The root cause was bind peeking, plain and simple.

It was the fact that someone parsed SQL with non-selective binds. Period.

And the fix isn't to reboot.

False causality.

Wed May 09, 01:55:00 PM EDT  

Anonymous Naresh said....

What a trhiller!!! :)

Tue May 15, 06:31:00 AM EDT  

Anonymous Naresh said....

oops...*thriller :(

(should have used the preview :)

Tue May 15, 06:33:00 AM EDT  

Blogger Uday said....

liked the 500 mile story.

Wed May 16, 03:43:00 AM EDT  

Blogger colemar said....

I would say that the rain was one of the causes, just not the "root" cause.
So there was correlation and causality between rain (A) and bad database performance (B).

Pure correlation is when A and B tend to happen together because of some other condition Z. If you remove A, Z still makes B happen.

Conversely, if you remove the rain (teleworking) then the bad performance disappears.

Thu Jun 07, 09:00:00 AM EDT  

Blogger Thomas Kyte said....


rain had nothing to do with it. If this guy that got in early was transferred - the problem would have happened, but the transfer didn't cause it.

If the database aged the sql out of the shared pool while this guy went to lunch, the problem would happen. But lunch didn't cause it.

and so on.

bind peeking is the root cause, the only cause. The other things are events, just a few of the infinite number of events, that seem to cause it.

If you remove the rain, the lunch problem persists, the "the guy isn't feeling well" problem persists, the "we gathered stats at noon over lunch" problem persists and so on. Getting rid of rain would not fix the problem entirely.

Thu Jun 07, 09:03:00 AM EDT  

Anonymous Anonymous said....

Good Story Tom - I didn't follow how "after the reboot, everything is running fast again" though. Shouldn't the cache be empty after the reboot and the problem persist?

Btw, were you trekking in Harpers Ferry, WV last weekend? I passed someone resembling your photograph, and it took a moment to register in my brain that I might have just crossed path with Tom Kyte!


Thu Jun 07, 10:26:00 PM EDT  

Blogger Thomas Kyte said....


the fact the caches are empty are what made the performance "good", they got the bad plan out of the shared pool and immediately parsed that sql with bind inputs that lead to the "good" plan.

I was in Latvia last weekend until late Sunday night when I returned to Northern Virginia, wasn't me!

Fri Jun 08, 07:02:00 AM EDT  

Anonymous Anonymous said....

Thanks for the response Tom. I am relieved to know it wasn't you - I was admonishing myself for not stopping and shaking hands with "you". Well, the weather is nice and hopefully I will run into you "again" one of these days!

Have a nice day.


Fri Jun 08, 08:04:00 AM EDT  


<< Home