Monday, June 21, 2010

I learned I don't think I like this...

I was reading around and stumbled on an article from Dr Dobb's online (I have a long history with Dr Dobb's - had it not been for them, you wouldn't be reading this!). The article described a 'feature' of Windows 7 that I have mixed feelings about. The same sort of mixed (well, not so mixed - I lean far to one side on the use of this feature) feelings I have for cursor_sharing being set to anything other than EXACT.

Here is the article

Ok, so why don't I like it? It seems to be a way to 'self correct' a program. It "seems" like a "good thing".

I don't like it because it won't help get the problem fixed (same with cursor_sharing :( ). In fact, it will promote *more* code being developed that suffers from heap overwrites. It lets bad developers develop bad code even faster and distribute it - thinking they are seriously good developers. That is, it leads to delusion and the bad coders getting more senior without learning from their mistakes.

In short, it instills a false sense of "hey, I'm pretty good" in developers that probably shouldn't have that sense.

It could definitely lead to some really strange issues - think about it, a program that used to crash - stops crashing - for a while - then crashes more (as the overwrite occasionally gets bigger than normal, requiring more "pad" bytes). And who knows how allowing a memory overwrite to propagate into other bits of the code will affect it. I prefer code that works right - not code that sometimes seems to work.

I'm reminded of a code review I did some 20 years ago. I asked the developer "why do you have this massive static array defined - we don't seem to use it". The answer "if you take it out, the program crashes, the compiler must have a bug". The look on my face - I wish I had a picture - it would have been priceless.

I'm not a fan of this "let's try to fix it for you and let you pretend you know what you're doing" approach to software. We rely on software way too much.

Oh well, just a 30 second thought - I just read the article and felt the need to gripe... Things like this scare me.

I have to say - I wrote something similar myself many years ago. It was a C library I called xalloc. It replaced the malloc, calloc, realloc, free, etc. functions of C. It worked by allocating (or freeing) the requested memory plus a few extra bytes. It would set some bytes at the FRONT and the END of the allocated memory, record the source code file and line number that allocated the memory, and return a pointer to the memory to be used by the program.

Every time you called any of the xalloc functions, it would inspect the allocated memory (all of it) and CRASH the program if any of the magic bytes in front of/at the end of the memory block had been changed. When the program exited, it would report on all allocated memory that hadn't been freed. You could turn off the checking with an environment variable if you wanted, but it was always ready to be "on".

I made everyone on my team use it - it saved us countless hours (and it found the bug in the code of the person who needed to allocate that big static array in a few seconds)...

My approach differed from Windows 7 in that I would prefer a program to crash and burn immediately rather than live for another second if it made such a grievous error. I'd still rather the program die than continue today...


Blogger s said....

The primary duty of an exception handler is to get the error out of the lap of the programmer and into the surprised face of the user. Provided you keep this cardinal rule in mind, you can't go far wrong.

-- Verity Stob, quoted in issue 184 of The Embedded Muse.

I thought of you and your fight against "when other null;" when I read that.


Mon Jun 21, 11:28:00 PM EDT  

Blogger Brian Tkatch said....

Good stuff, Tom

Tue Jun 22, 12:04:00 AM EDT  

Blogger Narendra said....

One of the "self-correcting improvements" introduced in 10g (I guess) falls in the same category. I am talking about PL/SQL internally doing a bulk fetch even though the developer writes a FOR..LOOP. Isn't that allowing developers to not learn "good practices"?

Tue Jun 22, 03:42:00 AM EDT  

Anonymous David Aldridge said....

I think that this is a tricky issue to generalise on.

In the context of a Windows 7 application such as a word processor, I think you're right that a certain amount of "brittleness" does promote more robust development techniques. Certainly when testing a program, a feature like this Fault Tolerant Heap should be disabled (unless there is some process for taking note of the times when it has saved a crash and attempting to fix the underlying bug).

On the other hand, you do not want your flight control system to be so fragile in use -- "graceful recovery from failure conditions" is one of the features I'd be looking for there. "Fail-deadly" is not.

In the middle ground we have the sort of business processes that we design on an everyday basis. How fragile do we want our invoice generation process to be? And the near-realtime ETL process that the pricing team use everyday? I think that the answer is that in development and testing we want them to be as fragile as possible, but in production we want them to be as tough as possible.

Wouldn't it be handy to have a big dial to control the level of brittleness/toughness of the system? Not least because of the humour inherent in secretly making your coworkers' development environment so fragile that they are pulling their hair out by lunchtime.

Tue Jun 22, 06:15:00 AM EDT  

Blogger Thomas Kyte said....


@Narendra - the bulk fetch is helping a performance issue - not attempting to self-correct an obviously buggy program.


I doubt the developers would 'disable' this during testing. They would treat it much like the developer I was talking about treated the array. I can hear it now "Hmm, program crashed without this, doesn't crash with it, must be an OS bug in their heap - we'll just run with it".

How fragile do we want our invoice generation process to be?

I'd like it to be robust enough to run without buffer overwrites for sure. If it is making *that* mistake - how many other mistakes is it making??

I see this 'feature' becoming a 'necessary runtime option' for programs you and I know have serious bugs - but the developer thinks the OS is at fault.

Tue Jun 22, 06:57:00 AM EDT  

Blogger Christopher said....

I had a similar experience with some code. I was once modifying an Apache module written by a well respected colleague with the initials TK (sound familiar) and found a line with the following:

malloc( sizeof(x) * 1.2 )

I asked him what the purpose of the *1.2 was, but he could not explain it. I guess he was adding a few extra bytes just in case.

Tue Jun 22, 09:12:00 AM EDT  

Blogger Joel Garry said....

I think you are missing a scaling effect of complexity. This is designed to be part of a feedback loop where the programmers have written something with so many complex interrelations that no one can possibly test them all.

So the real wtf becomes the black hole of "reporting this issue to microsoft," not a simple matter of good programming.

I think the sum of the parts of a modern Oracle system has not addressed this. It's not just a database with a couple of programs any more.

word: kermod

Tue Jun 22, 05:43:00 PM EDT  

Blogger Thomas Kyte said....


a buffer overwrite is a rather simple thing to catch - it doesn't equate to the complexity of the code.

I don't see how this is a feedback loop. It silently makes the error seem to disappear. To me, this is just like cursor_sharing - only hiding perhaps a more evil bug than cursor_sharing does.

The fault tolerant heap (FTH) feeds off of the supposed feedback loop (the windows error log) and tries to silently and magically make the error disappear. There is no feedback after that - the programmer thinks they've done good - their program isn't randomly crashing *as often*. It almost surely has other subtle bugs in it as a side effect of this fix (if you overrun a buffer - who is to say what happens to the extra bits and bytes you put there that another bit of logic isn't expecting - maybe they just get skipped).

Don't like it.

Tue Jun 22, 06:04:00 PM EDT  

Blogger Daemoncoder said....


I can't imagine any flight control software containing something like an FTH, since one could no longer predict how it would behave under (near) failure conditions and still cater for them.

A good read on the Shuttle's software:

Tue Jun 22, 06:26:00 PM EDT  

Anonymous David Aldridge said....

@Tom: I doubt the developers would 'disable' this during testing ...

Well, firstly that's a very pessimistic statement about the state of the modern software development industry. Secondly, I entirely agree with you. Thirdly, I'm starting to feel old and cynical -- I always imagined that would come in the 50's, not the 40's.


Yes, that's probably the case. But of course there are other (very expensive) fail safes for catching errors, such as multiple flight control systems developed independently with comparative checking of the outputs of each. Now wouldn't that be quite the feature for an invoicing system? Three development teams all working independently to produce three code bases that each generate invoices, then you compare the output of each to look for differences? Slight overkill, perhaps.

Wed Jun 23, 04:00:00 AM EDT  

Blogger al0 said....

Your xalloc reminded me of a very similar library that I wrote at the end of the 80s or the very beginning of the 90s.

Was quite useful.

Wed Jul 07, 11:19:00 AM EDT  

Anonymous Basil said....

I just learned today that our internally written "object persistence" layer (one that long predates things like Hibernate) now attempts to overcome deadlocks. When a deadlock exception is received by the failed (and rolled back) transaction, the code now automatically tries the task again, under the presumption that this is, basically, a data consistency problem.

*sigh* I guess it's just too hard to expect people to write code the right way.

Mon Jul 19, 05:53:00 PM EDT  

Blogger Andrew not the Saint said....

Look, I'm all for building frameworks/utilities that completely automate certain error-prone tasks, but this just isn't such a thing. So, I'm completely with Tom on this one - it's one thing having compile-time or development-time utilities that will check for programmers' mistakes, and it's something else having a 'clever' agent that tucks bugs under the carpet... That surely won't help the overall state of mediocre code quality in the industry.

By the way, Oracle has another example of 'automagically-fix-and-forget' functionality:

Sun Sep 05, 10:19:00 PM EDT  

Blogger Grant Johnson said....

@David Aldridge

Wow, you made it your 40's before you became cynical? You are doing better than most of us.

Thu Sep 09, 06:05:00 PM EDT  

Anonymous Anonymous said....

Hey All,

My views are the converse of some others expressed.

For sure, ignorance is not bliss.

I would rather expect that "fail-safes" are initiated at some level of Dev SDLC - where functionality is allowed to pass to other areas; but all is resolved before Production - and final testing.

Best Practices should advise against it in Production - Production Systems, by their nature, should aim for robustness.

Using protection mechanisms to correct issues points to 3 alternatives (or a measured composition thereof):
1. We don't trust our Application and/or its inter-dependencies
2. We give vent to paranoia
3. It's all become too embedded for us to remove

If my Production System is mission-critical, like running an airline, I'd rather it be free from fragility than wrapped in protection mechanisms that absolve and promote its fragile nature.

I guess Sql Profiles still make it to your New (Cool) Features list, under the premise:
They have their place.
Sql Profiles should not form part of the permanent fix.

Best regards

Wed Nov 24, 05:49:00 AM EST  

Blogger Thomas Kyte said....

@Anonymous regarding sql profiles -

if you say that - then I don't think you understand what sql profiles are or how they work.

Saying "sql profiles should not form part of the permanent fix" would be identical to saying "statistics should not form...."

a SQL profile is an in-depth analysis of a sql statement - if we are getting the wrong plan (that is the motivation behind using a profile - you have the wrong plan), it is due to incorrectly estimated cardinality values.

Now, to correct those incorrect estimates, we need more or better inputs to the optimizer - maybe we need a histogram (that would become part of the permanent fix). Or maybe - due to the complexity of the SQL statement or the relationship between some column pairs in separate tables - "normal" statistics don't cut it - are not good enough.

Enter a sql profile. It is simply "more detailed, more focused, better statistics than can be gleaned from dbms_stats"

If you want an analogy between this (thing I didn't like, what the blog entry was about) and Oracle, it would be:


and yes, that (cursor sharing force/similar) should not be part of the permanent fix.

So, there are things in Oracle that are similar to this thing I don't like (and I write that I don't like them)

but sql profiles is NOT one of them, not even close!

Wed Nov 24, 08:27:00 AM EST  

Anonymous Anonymous said....

Hi Tom,

Firstly, you're right - I've not spent much time on SQL Profiles - so forgive me if I'm getting this wrong (that's my own ignorance)..

I'm not saying that Sql Profiles should not be part of the process.

Without even speaking of it as a working technology - just the idea is brilliant!

But I understand the Sql Profile hints at the Optimizer, to generate a plan a certain way. So why should the Optimizer not determine the most efficient logic:
a) because the optimizer is not intelligent enough, or
b) because we're passing it the incorrect inputs (as you put it)

If the resolution is that we attain better inputs eg. altering histogram configuration from height-balanced to frequency histograms ; then that would be the "fix".

After fixing, the Sql Profile has served its purpose (for this query).

Aren't hints in any case a bad idea - for the long term?

ps. sorry if i'm still getting this wrong - and wasting space on your blog.

Best Regards

Wed Nov 24, 09:54:00 AM EST  

Blogger Thomas Kyte said....

@anonymous regarding sql profiles

But I understand the Sql Profile hints at the Optimizer, to generate a plan a certain way. So why should the Optimizer not determine the most efficient logic

therein lies the problem - it doesn't do that. You have a complete misunderstanding of the process.

I tried to tell you above what it does:

Or maybe - due to the complexity of the SQL statement or the relationship between some column pairs in separate tables - "normal" statistics don't cut it - are not good enough.

Enter a sql profile. It is simply "more detailed, more focused, better statistics than can be gleaned from dbms_stats"

a sql profile is a detailed analysis of a sql statement that provides better estimated cardinalities, which in turn give the optimizer more information than it had in the past - better information, more detailed information - so it can do its job correctly.

a sql profile works by taking the sql statement - ripping it apart into its component pieces and EXECUTING the bits and pieces to see "about how many rows come back" - and then remembering those row counts

so the next time the optimizer parses that query and optimizes it - it will use THOSE estimated cardinalities to optimize the query and come up with a better plan.

a sql profile is to a query what "dbms_stats" is to a table or index - a method of gathering STATISTICS about that query.

A sql profile is not a set of hints that tell the optimizer HOW to process the query.

A sql profile is a set of extended statistics about the actual observed cardinalities of the query - statistics in short.

Wed Nov 24, 09:59:00 AM EST  

