Have you ever had one of those bugs crop up that seems completely confounding? Something seemingly unrelated to any recent change, arisen from the abyss. This can be daunting, and sometimes it's hard to even know where to start.
One lesson I've learnt along the way is, first of all, don't be daunted: you're not expected to just know the answer off the top of your head. It's extremely rare that this happens. Most bugs are solved by a period of investigation, then trial and error. Let's discuss the investigation phase first.
Before you tackle a complex, or at least seemingly complex, bug (often they turn out to be something trivial, so bear that in mind before sinking your head into your hands), start with a period of investigation. If you weren't the person who found the bug, first and foremost try to recreate it. It could be that the person using the software used it incorrectly, in which case it could be a UX problem. For example, some additional validation may be needed, or some contextual help like tooltips to guide the user.
If you can recreate the bug yourself, try to recreate it a few more times, perhaps with different inputs, and make sure there are logs and sensible error handling in place.
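As a rough illustration of recreating a bug with different inputs, here's a minimal Python sketch. Everything in it is made up for the example: `parse_quantity` stands in for whatever code you're investigating, and `try_inputs` is just a hypothetical helper that records which inputs blow up and how.

```python
from typing import Any, Callable, Iterable


def try_inputs(func: Callable[[Any], Any], inputs: Iterable[Any]) -> dict:
    """Run func against each input and record which ones raise."""
    results = {"passed": [], "failed": []}
    for value in inputs:
        try:
            func(value)
            results["passed"].append(value)
        except Exception as exc:  # deliberately broad: we are surveying failures
            results["failed"].append((value, type(exc).__name__, str(exc)))
    return results


def parse_quantity(raw: str) -> int:
    """Hypothetical function under investigation."""
    return int(raw.strip())


# A mix of inputs we expect to work and inputs a real user might plausibly type.
report = try_inputs(parse_quantity, ["3", " 7 ", "", "2.5", "ten"])
for value, error_type, message in report["failed"]:
    print(f"{value!r} -> {error_type}: {message}")
```

Even a throwaway loop like this tells you whether the failure is tied to one specific input or a whole class of them, which is often the first real clue.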
A note on error handling
This part's crucial, and should be treated as a continual aspect of good engineering: ensuring your software not only handles error cases, but does something sensible with those errors. This means not burying valid error messages in terabytes of useless event logs. It might mean formatting errors consistently and using tools that let you search through them and order them by time. Most cloud providers offer something to this effect with little set-up cost. Without these considerations, your software will become a black box, and you will create silos of knowledge around the engineers who become good at dealing with that particular black box. The point of error handling is ultimately transparency. If your support person can come to you and say 'feature x has stopped working, it started throwing this error, at this time', you're probably in a good place.
Why specifically support? Well, because as engineers we become familiar with bizarre or overly technical processes for finding faults, and we find shortcuts. But if someone who didn't create this piece of software, someone without engineering knowledge for example, can tell you why your software failed and when, then you've made it transparent. If you're not there yet, this is a positive and realistic aim to set.
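To make the 'this feature, this error, at this time' idea a bit more concrete, here's one possible sketch using Python's standard `logging` module. The JSON layout, the `feature` field, the `orders` logger name and the `export_orders` function are all assumptions for the sake of the example, not a prescribed format. The in-memory stream stands in for wherever your logs actually go.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line, so it can be searched and sorted."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            # "feature" is a hypothetical field passed in via `extra`.
            "feature": getattr(record, "feature", "unknown"),
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["error"] = repr(record.exc_info[1])
        return json.dumps(entry)


stream = io.StringIO()  # stands in for a file or a log collector
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def export_orders(path: str) -> None:
    """Hypothetical feature: log what failed and where, then let the error surface."""
    try:
        with open(path) as handle:
            handle.read()
    except OSError:
        logger.error(
            "export failed for %s", path,
            extra={"feature": "order-export"}, exc_info=True,
        )
        raise


try:
    export_orders("/no/such/dir/orders.csv")
except OSError:
    pass

print(stream.getvalue())
```

The point isn't the JSON itself. It's that every error carries a time, a feature name and the underlying cause, so someone who didn't write the code can still report what broke and when.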
Back to the investigation
Once you've grepped your software's logs, and have narrowed down the scenario that causes the failure to arise, you can start gathering a 'case' or a 'picture' of the situation. For example, it happens when a user logs in, or when this table is accessed, etc, etc. Once you build up a picture of the situation, you can begin to infer and deduce possible lines of enquiry.
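Building that picture can be as simple as counting. The sketch below assumes a made-up log format (timestamp, level, action, user, message) and made-up log lines, purely to show the idea of grouping errors by context to see where they cluster.

```python
import re
from collections import Counter

# Hypothetical log excerpt; the line format is an assumption for this sketch.
LOG_LINES = """\
2024-03-01T09:14:02Z INFO  login user=41 ok
2024-03-01T09:14:07Z ERROR login user=17 session table locked
2024-03-01T09:15:31Z ERROR report user=17 session table locked
2024-03-01T09:16:10Z INFO  report user=41 ok
2024-03-01T09:18:45Z ERROR login user=17 session table locked
""".splitlines()

LINE = re.compile(
    r"^(?P<time>\S+)\s+(?P<level>\w+)\s+(?P<action>\w+)"
    r"\s+user=(?P<user>\d+)\s+(?P<message>.*)$"
)

# Keep only the error lines, parsed into dictionaries.
errors = [
    m.groupdict()
    for line in LOG_LINES
    if (m := LINE.match(line)) and m["level"] == "ERROR"
]

by_action = Counter(e["action"] for e in errors)
by_user = Counter(e["user"] for e in errors)
# Timestamps in one consistent ISO 8601 format sort correctly as plain strings.
first_seen = min(e["time"] for e in errors)

print("errors by action:", dict(by_action))
print("errors by user:", dict(by_user))
print("first seen:", first_seen)
```

In this invented data, every error involves the same user and the same message, which is exactly the kind of pattern that turns into a line of enquiry.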
If a colleague is debugging something, and you touched that code recently and spotted something strange, mention it, even if you're pretty sure it's unrelated. It's all information. Even if it does turn out to be unrelated, it can sometimes help to narrow the problem down further. It can even create new 'leads'. So don't be afraid to throw ideas around.
We see this in cop dramas: the troubled detective skulks around rainy back streets, in a kind of noir cliche, talking to the drug dealer that saw the victim last, or sweet talking the bar owner into showing him the CCTV footage from the car park. So go and talk to other engineers. Ask them how this piece of software works. Have they seen this error before? Has anyone changed anything lately? Keep building up your case, gathering information and context.
At this point the problem may still seem completely shrouded and obscure. But that's okay. Keep chipping away at it: write it all down, diagram the process out on a piece of paper, visualise it.
Empathy helps too. I'm not talking about soft skills here, I'm talking about empathising with the code itself. Which sounds ridiculous, but I sometimes find myself saying out loud, 'what would I do if I was this piece of code, and I had this value at this point?' (it's a good job I'm working from home these days). I step through the code and think about each step given a certain parameter or input.
Trial and error
At this point, you should have built up a few lines of enquiry, formed a few theories, and ruled out the most obviously incorrect ones.
Run through each theory or scenario in turn. These are your 'leads'. Try to learn more about the situation each time you investigate one. Print out a snapshot of the state, for example. If it's a complex piece of code with multiple return points, use a debugger or good old-fashioned print statements to see how far through the code the process gets. Is it returning earlier than expected? Is it reaching the end of the method call when it should have exited already?
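Here's a small sketch of what that can look like in Python. The `apply_discount` function, its rules and the order data are all invented for the example. Each return point gets a labelled debug line, and the state is logged on the way in.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("discount")


def apply_discount(order: dict) -> float:
    """Hypothetical function with several return points."""
    log.debug("enter apply_discount order=%r", order)  # snapshot of the state

    total = order.get("total", 0.0)
    if total <= 0:
        log.debug("exit 1: nothing to discount")
        return 0.0

    if not order.get("loyalty_member"):
        log.debug("exit 2: not a member, total=%s", total)
        return total

    discounted = round(total * 0.9, 2)
    log.debug("exit 3: member discount applied, discounted=%s", discounted)
    return discounted


# The theory under test: non-members are somehow getting the member discount.
# Here the flag arrives as the string "no", which is truthy in Python,
# so the trace shows "exit 3" where we expected "exit 2".
apply_discount({"total": 50.0, "loyalty_member": "no"})
```

The labelled exits answer the 'how far did it get?' question directly, and the logged snapshot shows the value that sent it down the wrong path.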
Eventually, you will start to build up a greater context of the problem. You will have narrowed down the possible causes, and you will understand that piece of software more and more. From a daunting, clueless starting point, you will piece it together bit by bit. It's important to start small, accept that you don't know, and accept that it might take time and patience.
The best analogy I've found for this process is the classic Command & Conquer game series. When you start a new battle or mission, the whole map is black, covered in what's referred to as 'shroud'. The more you move your units around the map, the more you uncover. In that process you eventually stumble across enemy bases.
It's also important to understand that knowledge often compounds. Once you understand one aspect of a problem, it can have a snowball-like effect, and you can home in on the problem quicker based on your most recent findings.
So next time you're faced with a WTF category bug, consider that you could be just a few discoveries and deductions away from the cause.
Finally, this process is also a skill. You build up a memory of past bugs and start to spot patterns, and like anything, you can practise it. If you shy away from bugs rather than diving head first into them, you can actually get worse at debugging, which is a vital skill in software. Unless of course you and your team always write absolutely perfect bug-free software. In which case, I'm sorry to have wasted your time with this post, and please get in touch!