OK, so that title is a lie.
This article won’t teach you how to debug, because the simple answer to the question “how can I debug this code?” is this: use your intuition. Intuition comes from years of practice and directed learning, not from a blog post. But where intuition fails, science prevails, and that is what this post is really about: a scientific approach to debugging software. This approach may be slow, and it may be tedious, but it does work.
Step 1 : look, but don’t touch
When you first hit a bug, you’ve got a golden window of opportunity to make observations. Don’t waste it! Forget debuggers and text editors: your primary tools at this stage are a pen and a notebook. Put one of each in your hands and leave the keyboard alone while you look at what’s going on, and write everything down. I’m not even kidding! Write it down, on paper. This will force you to slow down and actually look at the system, as well as providing a record for later that you can use to sanity-check your hypotheses about what’s causing the problem. The most important thing to do at this stage, besides careful observation, is to keep an open mind. Don’t even think about what the cause of the bug might be. If you do, you’ll subconsciously ignore information that doesn’t correspond to your initial analysis, and you can’t afford to miss details at this stage.
The first step in the process of observation is to note down everything you can observe without interacting with the system. Did the code core dump? Did it output anything on stdout? Does it appear to have locked up? What happened just before the event? What state is the rest of the system in? Are there unusual peripherals attached? Are you running a special build? The details will depend on your system, of course, but you need to get it all down on paper.
Once you’ve captured all you can without touching the system, you can move on to making unobtrusive probes of system state. By this I mean you can run tools to show you what the CPU is doing or which processes are using memory. You can look at logfiles of other applications to attempt to capture any IPC interactions leading up to the bug. You can look at working files in the filesystem, examine timestamps, check open file descriptors, and so on. Use your imagination and your domain knowledge to capture more information without directly changing the state of the system. Just as before, make sure you write everything down!
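As one illustration of an unobtrusive probe, here is a minimal Python sketch that records the name, size and modification time of each working file in a directory. It only reads file metadata, so it should not alter file contents or modification times. The function names (snapshot_files, format_snapshot) are made up for this example, and the output is meant to be copied into your notes rather than to replace them.

```python
import time
from pathlib import Path


def snapshot_files(directory):
    """Return name, size and modification time for each regular file.

    Only metadata is read (a directory listing plus stat calls), so
    file contents and modification times are left untouched.
    """
    records = []
    for entry in sorted(Path(directory).iterdir()):
        if not entry.is_file():
            continue  # skip subdirectories and other non-regular entries
        info = entry.stat()
        records.append({
            "name": entry.name,
            "size_bytes": info.st_size,
            "modified": time.strftime(
                "%Y-%m-%d %H:%M:%S", time.localtime(info.st_mtime)
            ),
        })
    return records


def format_snapshot(records):
    """Render the records as plain text lines, ready to copy into notes."""
    return [
        f"{r['modified']}  {r['size_bytes']:>10}  {r['name']}"
        for r in records
    ]
```

Calling format_snapshot(snapshot_files(some_directory)) gives one line per file, where some_directory is whatever working directory your application uses.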
Finally, when you’ve captured absolutely everything you can extract without changing anything, you can start probing more invasively. At this point, you can try attaching to the process with a debugger, or analyzing the core dump. You can try sending signals to applications that appear to have locked up. You can start removing lockfiles, unplugging hardware, and poking buttons. In short, the gloves are now off and you can do what you like. By the end of this the system state will be completely destroyed, and your notebook will be full of observations.
Step 2 : come up with a hypothesis
I hope you enjoyed that time at the keyboard, because now you’re going to walk away from the computer for a bit. Take your notebook and pen, find a quiet place, and sit down with a coffee for a quarter of an hour to review your notes. Maybe get a colleague in to bounce ideas off. Your goal now is to come up with an idea about what’s going on with your system. Once again, try to keep an open mind! Don’t shortcut the process by running at the first idea you have: instead, take the time to fully think out the whole of the domain and figure out what could possibly account for the observations you’ve made. You should be able to come up with loads of ideas. I want everything from the obvious (“I think we overran our string buffer”) to the outlandish (“I think the kernel’s corrupted the VM page table”). Get them all written down in your notebook.
When you’ve finished brainstorming all these ideas, it’s time to wander around the lab for a bit to decide which seems most likely. At this point you’re engaging your intuition, hopefully for the first time in this process. Use your experience to filter your previous suggestions into three categories, ranging from highly likely to highly unlikely. Then pick your favorite from the pool of highly likely causes. This idea is now your hypothesis, so write it in your book under the heading “Hypothesis 1”.
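If it helps to see the bookkeeping spelled out, the three-category filter can be sketched in a few lines of Python. This is purely illustrative: the paper notebook is still the real tool, the class and function names are invented for this example, and the middle category name (POSSIBLE) is my assumption, since only the two extremes are named above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Likelihood(IntEnum):
    HIGHLY_LIKELY = 1
    POSSIBLE = 2          # assumed name for the middle category
    HIGHLY_UNLIKELY = 3


@dataclass
class Hypothesis:
    statement: str
    likelihood: Likelihood
    ruled_out: bool = False


def next_hypothesis(hypotheses):
    """Return the most likely hypothesis not yet ruled out, or None."""
    candidates = [h for h in hypotheses if not h.ruled_out]
    if not candidates:
        return None
    # min() keeps the first entry on ties, so list order is the tie-break.
    return min(candidates, key=lambda h: h.likelihood)
```

Striking an idea from the list is then a matter of setting ruled_out, and moving one between categories is a matter of changing its likelihood.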
Step 3 : design an experiment to validate your hypothesis
Now you’ve figured out your hypothesis, please refrain from running back to your computer to hack the code. You’re not ready to code yet! Instead, go back to the quiet place you used to brainstorm possible causes, and turn to a fresh page in your notebook. What you need to do now is design an experiment. If this sounds a lot like a high-school physics lesson, you’re probably getting the hang of this ;-)
What you’re looking for from your experiment is a set of output data which will tell you something about your hypothesis. Ideally the data should either prove or disprove it, so think about what you’d need to do to conclusively demonstrate your hypothesis to be true or to be false. Write these things down. Think also about how much data you need to give you confidence in your results. At this point you don’t know whether the bug happens regularly or is a freak occurrence, so you’ll need to gather a “control” dataset to compare your experimental results against. Write down exactly what data you need to gather, and in what quantity. Write down how you’re going to capture the data and how you’ll store it. Make notes on any special equipment you’ll need, and what versions of software you propose to test.
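To make the idea of a control dataset concrete, here is a rough Python sketch of the comparison you might plan for. It assumes your experiment can be expressed as a callable that returns True when a run succeeds; that callable, and the names TrialSummary and run_trials, are hypothetical stand-ins for whatever your real system needs.

```python
from dataclasses import dataclass


@dataclass
class TrialSummary:
    label: str
    runs: int
    failures: int

    @property
    def failure_rate(self):
        # Guard against division by zero when no runs were made.
        return self.failures / self.runs if self.runs else 0.0


def run_trials(label, trial, runs):
    """Call trial(i) for i in range(runs) and count the failures.

    A trial counts as failed if it returns a falsy value or raises.
    """
    failures = 0
    for i in range(runs):
        try:
            ok = trial(i)
        except Exception:
            ok = False
        if not ok:
            failures += 1
    return TrialSummary(label, runs, failures)
```

You would run the same number of trials against the unmodified build (the control) and against the build that tests your hypothesis, then compare the two failure rates. How many runs you need for the difference to be convincing depends on how rarely the bug shows up, which is exactly the sort of thing to decide and write down before you start.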
Step 4 : run the experiment, and capture the results
Happy days: it’s your chance to go back to your computer! But don’t get giddy at this point. All you need to do is run the experiment you just designed. Make sure you capture the data you need, and make sure you note any extra observations you make during the course of the experiment. But don’t do anything more. In particular, don’t start trying to test something else while you’re at it. Stick to the script: run the experiment.
Step 5 : analyze the results
Here’s the fun part. Take a look at the data you’ve gathered, and figure out whether they prove or disprove your hypothesis. If the results are conclusive, then you’re done! You can code up a fix, check it in (along with your experimental results!), and move on. But even if the results are not conclusive, that’s still a valid and useful outcome. You should now know more about the problem than you did previously, and you should be able to design a further experiment to glean yet more. Whatever your analysis, it will probably not surprise you to learn that I want you to write it down in your notebook. Read it back to yourself and check it makes sense. Remember you’re doing science here, and a key part of good science is peer review. You should be able to present your notebook to your most esteemed colleagues with pride and confidence, so make those notes good!
Step 6 : rinse and repeat
By this stage in the process, you may have solved your problem. But if you have not, you will have at least uncovered something more about the nature of the problem. In this case, your next step is to retreat from the keyboard once more to review your previous list of hypotheses in light of your new knowledge. You may be able to strike some from the list, or add new ones. You’ll almost certainly be able to move some between the three categories of likelihood that I proposed earlier.
While you’re reviewing your progress, try your best to get out of the mindset of investigating your original hypothesis. Instead, return to the open and objective outlook you had before when you were doing your brainstorming. With this open mind, decide what hypothesis to explore next, and then head on back to Step 3 for the next stage of the investigation.
Step 7 : there is no step 7
There is no step 7 because steps 1 to 6, applied enough times, will solve your problem. It may take you a long time — but eventually you’ll get there. That’s the power of science!
Despite the surety of this debugging method, I bet you a mixed bag of sweets that you don’t know anyone who actually debugs like this. This is for one of two reasons. Either your colleagues are just running in circles and debugging by suspicion (“back that last change out, and see if that fixes it!”), or they’ve done enough debugging work to be balancing intuition and science in their heads all the time as they work. This latter possibility is some trick to pull off, but it’s basically what all software engineers do. The trouble with this is that it’s really easy to lose time through failed intuition or biased thinking. In my experience, it’s rare to avoid these pitfalls entirely. Some wizards can do it, but mere mortals may struggle.
With that in mind, I recommend giving this method a try the next time you’re faced with a difficult bug. At the worst, it’ll take longer than a purely intuitive approach. But at best it’ll yield incremental progress as opposed to intuition’s random stabs in the dark. I know which one I prefer.