I'm now the proud and happy owner of a new reservoir for our tankless hot water heater. (Tankless is a relative term when it comes to hot water heaters.) A week ago, the reservoir part of our system finally broke. We had a new reservoir installed on Wednesday, and our lives have changed for the better. We have much more and much hotter water than we've had at any point in the past 9 years.
Looking back, I can see the graceful degradation of the hot water heater. First, a few years ago, we had slightly cooler water and we had to keep the temperature higher to maintain enough hot water. We actually ran out of hot water this past winter (but we thought Teenage Daughter Showers were responsible). Then came the drips. We've had trouble with the pressure relief valves for the past couple of years, and the plumber has been out monthly since January to fix the drips. When the reservoir finally went, it was not a drip, it was a full-fledged flood!
Our system degraded gracefully, but as with many complex systems, we couldn't tell it was degrading. It would have been better if it had failed. At least then, we would have spent much less time and aggravation on the current system.
That happens in software all the time, too. We want graceful degradation, but I maintain we need quick failure. Quick failure helps us see where the system works and doesn't work and tells us where to look for problems.
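To make the contrast concrete, here is a minimal Python sketch of the two behaviors. Everything in it is made up for illustration: the `HeaterFault` exception, the function names, and the 45 C threshold are assumptions, not anything from a real heater or library. The "tolerant" version hides a bad reading; the fail-fast version complains the moment it sees one.

```python
class HeaterFault(Exception):
    """Hypothetical error: a reading fell outside the healthy range."""


def tolerate_outlet_temp(temp_c: float, minimum_c: float = 45.0) -> float:
    """'Graceful' version: quietly substitute the minimum and carry on.

    The caller never learns that the real reading was too low.
    """
    return max(temp_c, minimum_c)


def check_outlet_temp(temp_c: float, minimum_c: float = 45.0) -> float:
    """Fail-fast version: raise as soon as a reading is out of range."""
    if temp_c < minimum_c:
        raise HeaterFault(
            f"outlet temperature {temp_c:.1f} C is below the "
            f"minimum {minimum_c:.1f} C"
        )
    return temp_c


if __name__ == "__main__":
    reading = 40.0  # assumed sample reading, below the assumed threshold
    print("tolerant result:", tolerate_outlet_temp(reading))
    try:
        check_outlet_temp(reading)
    except HeaterFault as fault:
        print("fail-fast result:", fault)
```

With the tolerant version, the problem only surfaces later and somewhere else. With the fail-fast version, the error message points at the reading that went wrong.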
Inscrutable Failures: my new house’s lawn-watering system looks harder to program than x86 assembly code, and when I try to make it do “manual” watering, an element of the LCD display says “Faulty” – no clue as to what’s wrong.
Johanna,
I’d say that your water heater wasn’t really degrading gracefully, though.
Graceful degradation would be failing in one function or feature while keeping other functions available. Since a water heater really has only one function, there’s not very far it can go!
Partial failure might still be preferable to catastrophic. I’d rather have slightly less water available at a lower temperature than having the tank rupture, for example.
Still, a device undergoing even partial failure should complain loudly about it.
Invisible degradation is nearly the worst combination. (Gas leak with explosion is still worse!)
Cheers,
-Mike
Folks just need to follow the Unix philosophy on other items, to wit:
“Rule of Repair: When you must fail, fail noisily and as soon as possible.”
“Basics of the Unix Philosophy”, http://www.faqs.org/docs/artu/ch01s06.html
Nice analogy!
What about performance monitoring? That should always help us to spot degradations in complex systems. Those regular/periodic reports fed back are then acted on to put fixes/improvements back into the pipeline.
This might be the equivalent of the gradual water temp drop-off, just that there was no action taken.
Yes, quick failures are useful pointers – ideally under lab conditions and not out in the field.
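As a rough sketch of what that kind of monitoring could look like, the Python below compares a recent window of readings against an early baseline and flags the drift. The function name `drifted`, the window sizes, the 10% tolerance, and the sample temperatures are all assumptions chosen for illustration.

```python
from statistics import mean


def drifted(samples, baseline_n=5, recent_n=5, tolerance=0.10):
    """Return True when the recent average has moved more than
    `tolerance` (as a fraction) away from the baseline average."""
    if len(samples) < baseline_n + recent_n:
        raise ValueError("not enough samples for baseline and recent windows")
    baseline = mean(samples[:baseline_n])   # the first readings
    recent = mean(samples[-recent_n:])      # the latest readings
    if baseline == 0:
        raise ValueError("baseline average is zero; relative drift undefined")
    return abs(recent - baseline) / abs(baseline) > tolerance


if __name__ == "__main__":
    # Assumed water temperatures in C: each step down is small,
    # but the total slide is large.
    temps = [60, 60, 59, 60, 61, 58, 57, 56, 54, 53, 52, 51, 50]
    if drifted(temps):
        print("Water temperature has drifted from baseline; call the plumber.")
```

No single reading here looks alarming next to the one before it, which is exactly why the comparison is made against the baseline rather than the previous value. The report only helps if somebody acts on it.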
I see this from the software bugs perspective: It is far easier to identify and fix syntax bugs because they break quickly and obviously. Logic bugs (“graceful” bugs) are much more insidious and hard to resolve because the cause and effect is not obvious.
Jim Gray called this “failfast” operation, and it is a useful technique for building fault-tolerant systems.
http://en.wikipedia.org/wiki/Fail-fast
Oh if the whole world were ones and zeros. Sadly (for some of us) it is mostly analog.
This kind of begs a discussion of knowing when to cut your losses. With imperceptible entropy, it’s hard to know. We went through a similar plumbing problem that we ignored for three years, mostly because we thought it’d be expensive to repair. Eventually I said we had to do it because we wouldn’t be able to sell the house like that, and in the end it cost $100. I wish I had those three years back.
I read something the other day that was tangentially related. Contrary to what we all believe, research shows we learn more from success than failure…
http://bit.ly/d8VhV
Johanna,
I assume that you are familiar with the bathtub curve model of software system life. Failures are frequent at the beginning. Hopefully, we don’t implement until we have them eliminated, but this does not always happen. Then we have a relatively stable period of performance, hopefully also relatively long. Finally, through accumulated fixes and changing conditions, the software no longer adequately performs the functions it was developed for, and failures become frequent. Again, hopefully, we replace the software before the failures become too frequent. Quick or catastrophic failure is rare with software; it just keeps working less well and less dependably, requiring more maintenance, which invariably degrades it further. The real trick is knowing when to pull the plug and replace it.
Couldn’t agree more! I wish that would happen with running shoes… they wear out so quickly but I don’t know how to tell when they start going. I just feel like I’m getting old or out of shape, and then eventually I get a new pair and whoa! I can run again.
Never thought about it in these terms for software before… I can definitely see it for performance degradations. But maybe also for those things that don’t quite fit your way of thinking/doing business that you keep having to work around and work around and work around until finally you realize you’ve painted yourself into a corner? Hmm.
Charles: re: Unix philosophy, I like it 🙂