July 20, 2005
Just Try Again
It's funny because it's true:
A Software Engineer, a Hardware Engineer and a Departmental Manager were on their way to a meeting in Switzerland. They were driving down a steep mountain road when suddenly the brakes on their car failed. The car careened almost out of control down the road, bouncing off the crash barriers, until it miraculously ground to a halt, scraping along the mountainside. The car's occupants, shaken but unhurt, now had a problem: they were stuck halfway down a mountain in a car with no brakes. What were they to do?

"I know", said the Departmental Manager, "Let's have a meeting, propose a Vision, formulate a Mission Statement, define some Goals, and by a process of Continuous Improvement find a solution to the Critical Problems, and we can be on our way."
"No, no", said the Hardware Engineer, "That will take far too long, and besides, that method has never worked before. I've got my Swiss Army knife with me, and in no time at all I can strip down the car's braking system, isolate the fault, fix it, and we can be on our way."
"Well", said the Software Engineer, "Before we do anything, I think we should push the car back up the road and see if it happens again."
In all seriousness, I can't recall a single week that I haven't done this exact thing at least once: Geez, I dunno, just run it again and see if the problem recurs. I don't know if it's a sad indictment of the state of software engineering or a not-so-subtle hint that software engineers deal with thousands of variables in even the simplest of programs.
The problem with a lot of software is that it's a sealed box operating in the wild.
It's sealed, so the user (or sometimes the developer) can't just peer in and say, "Oh, I see what's going wrong there."
Also, when a program goes *bang*, unless I'm there seeing and working with the problem, it's a lot harder to figure out what's going wrong.
For a new website I've been developing for the last year, one of the key components is that every single error that occurs on the website is logged with as much detail as possible.
In addition, it will text message/page/call pre-selected person(s) for certain critical problems.
The site isn't launched yet, but I'm hoping this will result in a much better experience for all concerned!
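As a rough sketch of that "log every error with as much detail as possible" idea, here's what the capture step might look like in Python (the function name and usage are assumptions for illustration, not the site's actual code):

```python
import traceback

def format_error_detail(exc):
    # Capture as much detail as possible about an exception:
    # its type, its message, and the full traceback leading up to it.
    return "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__))

# Usage: wrap the top level of the app so nothing escapes unlogged.
try:
    1 / 0
except ZeroDivisionError as e:
    detail = format_error_detail(e)  # write this string to the error log
```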
Peter Bridger on July 21, 2005 08:11 AM

While it may sound like an unreasonable and funny approach to working on a car, software isn't like a car. Take any analogy too far and it falls apart.
Re-running the software is a perfectly logical approach to troubleshooting. What was the cause of the problem? Does it happen every time? If so, why? If not, why not? Could it be some outside interference that only affected the program that one time, or is it something inherent to the program itself that will happen every time?
Once you do narrow down the cause, you can address it.
@Peter, one extension we've made to that "log every error" approach is to create customizable RSS feeds. All apps on a server log to a central reporter which sends out feeds. The feeds have minimal detail for security reasons, but the link takes you to a suitably secured page that displays the relevant info. Saves you from checking religiously and also reminds you to go look when needed.
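That central-reporter feed could be sketched roughly like this in Python (the entry fields, URL scheme, and function name are my assumptions, not the actual system described above):

```python
from xml.sax.saxutils import escape

def errors_to_rss(entries, base_url):
    # Each item carries minimal detail (just a short summary) for
    # security; the link points at a secured page with the full info.
    items = "".join(
        "<item><title>{0}</title><link>{1}/errors/{2}</link></item>".format(
            escape(entry["summary"]), base_url, entry["id"])
        for entry in entries)
    return ('<rss version="2.0"><channel><title>App errors</title>'
            "{0}</channel></rss>").format(items)

feed = errors_to_rss(
    [{"id": 42, "summary": "Unhandled exception in checkout"}],
    "https://example.com")
```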
Tom Clancy on July 21, 2005 09:54 AM

> While it may sound like an unreasonable and funny approach to working on a car, software isn't like a car. Take any analogy too far and it falls apart.
Right; there are no physical consequences to trying software again, which is why the joke is funny.
I do think we're (or at least I am) occasionally guilty of blindly trying again without doing any kind of postmortem.
Jeff Atwood on July 21, 2005 11:17 AM

I hope the software you run is never as dangerous as a car!
Terrier on July 21, 2005 11:26 AM

Sure, that's what they said about Skynet, too... ;)
Jeff Atwood on July 21, 2005 05:47 PM

> Once you do narrow down the cause, you can address it.
And if you can't reliably reproduce the problem, how can you be sure you've actually fixed it?
Bruce McGee on July 22, 2005 08:40 AM

While I agree with the posts here, I think we're missing a key issue. Another reason software developers like to see an error repeated is to make sure their users are actually reporting what they are seeing. I've been in countless situations where well-meaning users call/email to report an issue, only to have said issue be a non-issue. I'm sure most level 1 support folks can attest to trigger-happy users calling up when the slightest "out of the norm" thing happens.
zigzag on July 22, 2005 11:29 AM

I'll admit that my first attempt is often to reproduce the error in a controlled environment (my own). The more complex the problem, the less chance of this succeeding, though.
I have no problem admitting that I do this more out of laziness, when it's simpler to reproduce the error than to analyze the relevant code. Then again, there are probably more moving parts in your average enterprise app than there are in any car. A car might need only one or two engineers with an understanding of the whole system, but that's rarely the case for those of us in software.
The 'log everything' approach can work if you spend enough time refactoring (I know I never log _everything_ while designing the code; the 'should never fail' case always will), but it has the obvious problem that the log analyzer becomes a critical piece of software in itself, needed to wade through the mountains of information spewed from any long-lived app.
Since the systems I work on tend to be distributed workflows, I switched over to a multicast socket scenario. That way, I can (if I want) have a listener that records to a database, another that jumps in mid-stream to display current system activity on screen, and another that escalates conditions to email/pager notification.
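A rough sketch of that multicast sender in Python (the group address, port, and payload format are assumptions for illustration; the database recorder, live console, and pager escalator would each join the same group as listeners):

```python
import json
import socket

MCAST_GROUP = "224.1.1.1"  # hypothetical multicast group
MCAST_PORT = 5007          # hypothetical port

def format_event(source, level, message):
    # One log event, serialized as a JSON datagram payload.
    return json.dumps(
        {"source": source, "level": level, "message": message}
    ).encode("utf-8")

def send_event(payload):
    # Fire-and-forget multicast: any number of listeners can
    # subscribe without the sender knowing or caring about them.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.sendto(payload, (MCAST_GROUP, MCAST_PORT))
    sock.close()

payload = format_event("order-service", "ERROR", "workflow stalled")
```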
But then of course, that logging system needs to be thoroughly tested....
Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.