Looking back on my activities over the past few months, I realize that there have been a few occasions where my role has been to help with the troubleshooting process. In the past year I have given talks at PGEast and PGEU on troubleshooting Slony, but the process I use applies to most technology problems.
I feel that it is worthwhile to review my rules of troubleshooting.
Rule One: Take Ownership
My first rule of troubleshooting is that someone needs to take ownership of the troubleshooting process. If no one else volunteers then this should be you. Taking ownership of the troubleshooting doesn’t mean you are accepting responsibility for fixing the problem and it doesn’t mean you are accepting responsibility for causing the problem. Taking ownership of the process means that you’re taking responsibility for identifying the problem so that solutions can be proposed.
Most people who have worked in I.T. for more than a few weeks have witnessed the finger-pointing game. In the finger-pointing game one department or vendor points its finger at another department or vendor and blames them for the problem. That department or vendor in turn blames someone else, who will then point the finger back at the first department or vendor. The problem bounces around between departments, often for weeks, without any real progress being made.
Don’t play the finger-pointing game. Instead, say “This problem probably isn’t the fault of my component, but the problem causes me pain, so I will stick with this problem until solutions are identified, no matter where the root cause may lie or whose fault the problem is.”
Rule Two: Ask What is going on
In order to understand why something isn’t working you have to understand what is going on. When troubleshooting a problem I am constantly asking myself “What the **** is going on?” The conversation typically looks like this:
me: What is going on?
me: I don’t know
me: Well what do you know?
me: That the database is giving me an error instead of returning my result
me: What do you know about the error you are getting?
me: Well it means that the database is trying to execute a bad query
me: and what query is it trying to execute?
Asking yourself the same question again and again might get you strange looks when you are walking down the street, but it is a great way to stay focused and to remind yourself what you have discovered so far.
Rule Three: Understand Why it is happening
Not only do you need to know what is happening but you have to figure out why it is happening. This is a lot harder than finding out what is going on. One way of understanding why something is happening is to have some theoretical knowledge of the system you’re working with. First I like to get a high-level understanding of how the software components fit together. Next I will pick a starting point in the information flow and start tracing the flow of information through the system.
To troubleshoot a broken web application I might start at the incoming HTTP request and follow the request through the different layers of the system. What is the presentation layer doing in response to the request and why is it doing so? What does the business-rules layer receive as an input and what does it do in response? What queries are being issued to the database and why?
The key thing to remember is that you are tracing cause and effect. Why did X happen? X happened because it was caused by Y.
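One way to make this kind of tracing easier is to tag each incoming request with an identifier and log it at every layer, so the chain of cause and effect can be read straight out of the log. Here is a minimal Python sketch of the idea. The function names, the customers table, and the canned database result are all hypothetical stand-ins, not part of any real application.

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("trace")


def run_query(request_id, sql, params):
    # Database layer: record the exact query being issued.
    log.info("[%s] db: executing %s with %r", request_id, sql, params)
    # Stand-in for a real database call (assumption for this sketch).
    return [{"id": params[0], "name": "example"}]


def lookup_customer(request_id, customer_id):
    # Business-rules layer: record the input received and what is done with it.
    log.info("[%s] rules: lookup_customer(customer_id=%r)", request_id, customer_id)
    sql = "SELECT id, name FROM customers WHERE id = %s"
    return run_query(request_id, sql, (customer_id,))


def handle_request(path, query):
    # Presentation layer: tag the request so later log lines can be tied back to it.
    request_id = uuid.uuid4().hex[:8]
    log.info("[%s] http: %s %r", request_id, path, query)
    rows = lookup_customer(request_id, int(query["id"]))
    log.info("[%s] http: responding with %d row(s)", request_id, len(rows))
    return request_id, rows


request_id, rows = handle_request("/customer", {"id": "42"})
```

Searching the log for one request identifier then answers the questions above in order: what the presentation layer received, what the business-rules layer was asked to do, and which query reached the database.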
Rule Four: Make a hypothesis and test it
This is the scientific method you should have learned in middle school. Come up with a theory, then figure out what type of experiment will test your theory. Test the theory and observe the results. This will help you validate or, more importantly, disprove your theory.
When troubleshooting software problems you need to test your theories and you need to be prepared for your tests to show that your theories are wrong. There is nothing wrong with coming up with a theory that turns out to be wrong. You should tell people (managers, coworkers, etc.) about the theories you have tested that turned out to be wrong because it shows you are working on the problem and making progress (ruling out a theory or cause is progress).
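The loop of theory, experiment, and result can be written down quite literally. The sketch below is only an illustration in Python: the Hypothesis class, the observations, and the two example theories are invented for this example. Each theory is paired with a check that can fail, and the outcome is kept whether the theory is supported or ruled out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    statement: str                  # the theory, stated so that it can be proven wrong
    experiment: Callable[[], bool]  # returns True if the observation matches the theory
    outcome: Optional[str] = None   # filled in once the experiment has been run


def run_experiments(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    # Run every experiment and record the result, including the failures.
    for h in hypotheses:
        h.outcome = "supported" if h.experiment() else "ruled out"
    return hypotheses


def summarize(hypotheses: List[Hypothesis]) -> str:
    # One line per theory, suitable for pasting into a status update.
    return "\n".join(f"{h.outcome}: {h.statement}" for h in hypotheses)


# Made-up observations standing in for real measurements.
observations = {"disk_free_percent": 40, "db_connections": 100, "db_max_connections": 100}

hypotheses = [
    Hypothesis("The disk is full",
               lambda: observations["disk_free_percent"] == 0),
    Hypothesis("The database has run out of connections",
               lambda: observations["db_connections"] >= observations["db_max_connections"]),
]

summary = summarize(run_experiments(hypotheses))
print(summary)
```

The "ruled out" lines are worth as much as the "supported" ones, since they are exactly the progress you want to report to managers and coworkers.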
Rule Five: Record your findings
After the problem has been identified and fixed (or at least mitigated) it is important to circle back and make a written record of your findings. I like sending an email with my report because emails are easily searchable and I tend not to delete the emails I send. Wiki pages or your company’s ticketing system might also be a good place to record the results of your investigation. The important thing is that the report is written, stored somewhere and is searchable. This report should contain:
- A description of what started the troubleshooting investigation. You want to record what was observed before any in-depth troubleshooting began. For example, ‘the site was down, and the server console had a kernel panic message’.
- A list of the items investigated during the troubleshooting process: what log files were checked, what sub-systems were found to be operating normally and which ones showed issues. It is important to include components that were checked and verified as working normally because later on it might be useful to know which components were unaffected by the issue, and you can only know that if you keep a record of what was checked.
- Any theories that were considered and ruled out during the investigation.
- The cause of the problem and what testing was done to verify the hypothesis.
This record will help the next time a similar problem shows up. You can’t trust your memory to remember these details six months or three years from now.
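If you write these reports often, a small template helps make sure none of the four parts gets skipped. The following Python sketch is one possible shape for it, not a prescribed format. The class and field names are my own invention, and apart from the kernel panic description quoted above, the example values are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TroubleshootingReport:
    trigger: str                                                  # what was observed before digging in
    items_checked: List[str] = field(default_factory=list)        # include components found to be working
    theories_ruled_out: List[str] = field(default_factory=list)   # theories tested and disproved
    cause: str = ""                                               # the identified cause
    verification: str = ""                                        # testing done to verify the hypothesis

    def to_text(self) -> str:
        # Plain text so the report can go into an email, a wiki page or a ticket.
        lines = ["What started the investigation:", f"  {self.trigger}", ""]
        lines += ["Items investigated:"]
        lines += [f"  - {item}" for item in self.items_checked]
        lines += ["", "Theories considered and ruled out:"]
        lines += [f"  - {theory}" for theory in self.theories_ruled_out]
        lines += ["", "Cause:", f"  {self.cause}", ""]
        lines += ["How the cause was verified:", f"  {self.verification}"]
        return "\n".join(lines)


# Example values are hypothetical, apart from the trigger taken from the list above.
report = TroubleshootingReport(
    trigger="The site was down, and the server console had a kernel panic message",
    items_checked=["web server logs: normal", "database: operating normally"],
    theories_ruled_out=["disk full"],
    cause="(fill in once identified)",
    verification="(describe the test that confirmed the cause)",
)
report_text = report.to_text()
print(report_text)
```

Because the output is plain text, it stays searchable wherever you store it.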