Some Thoughts on Software Defects

Software defects are a part of software. This is a negative subject, and I don’t want to give the impression that the software I write is full of defects and bugs. This post is about how I have seen teams turn a roadblock into a success.

I have read about, and heard at conferences about, teams who take small failures and create a culture of failure around them. The following are some examples of small failures I have seen and the great successes built around them. Sometimes I see small failures as a path to larger positive culture changes.

I think it is safe to say all software that is used and sufficiently complex has defects. There are many reasons for the defects. Here are a few of the defect situations I have been in and how our team solved them.

I think the important lesson I learned after evaluating these is that the resolution of these situations needs to be fast and high quality. The resolution needs to leave the project better off than it was before. Success for the customer is the only true result of a project; defects may not be the ideal path, but they must lead there.

The speed is important, but it doesn’t mean I should rush a solution and throw it in as soon as possible. To me it means giving the defect the red carpet treatment: showing craftsmanship in the face of adversity.

Every project has its ups and its downs, and a craftsman will take pride in succeeding through any obstacle. Instead of letting defects become a slippery slope downhill, we treat each one as one step back followed by two steps forward.

Bugs

Bugs are a part of every piece of software I have written. That statement sounds a lot worse than it is. I have worked on systems where the requirements are dramatically changed week to week (which can be pretty exciting). There are situations I didn’t take into account, or some behavior I didn’t imagine until a real user started hitting the system.

As I gain experience as a developer, I do more mental exercises and think through my solutions more thoroughly. This has never made my code bug free. For that reason, my team needs to know how to deal with bugs (I am fairly certain I am not the only creator of bugs).

Here is a line from The Pragmatic Programmer:

“It doesn’t matter whether the bug is your fault or someone else’s. It is still your problem.”

So, the team has a bug list. We keep the bugs as a to-do list in Basecamp, informal enough that they stay in front of us instead of being filed away and forgotten. Then we try to address them while also working on new stories.

This worked well enough until there was a big release coming up and our customer came back to us with a big list of bugs. We put them on the list and continued to fix them, as well as work on stories. They were getting completed, but there was never an empty bug list.

Then, the customer (who can directly add to and edit the bug list) started assigning priorities to the bugs. This one is high priority. This one is critical. This one is immediate.

I made the suggestion, “We need some real tool to manage our bug list,” because I couldn’t fit all this in my head. There isn’t that much room up there and I need to use it wisely. One of my team members suggested that shipping any bugs at all wasn’t the craftsmanship way. I was trying to solve the wrong end of the equation. So, the team lead put forth a no-bugs policy that we all agreed with.

You cannot pick up a new story unless the bug list is empty.

This made sense to me, but I had reservations about a small or insignificant bug taking priority over an important story. To date, this has not happened, and the bug list has stayed near empty.
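As a rough illustration, the policy can be thought of as a simple gate on what a developer picks up next. The sketch below is not how our team tracked anything (we used a plain to-do list); the `Board` class and its fields are hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Board:
    """Hypothetical team board: open bugs plus a backlog of stories."""
    bugs: List[str] = field(default_factory=list)
    stories: List[str] = field(default_factory=list)

    def next_task(self) -> Optional[str]:
        """Return the next item to work on under the no-bugs policy."""
        # Any open bug takes precedence over every story.
        if self.bugs:
            return self.bugs[0]
        # Only an empty bug list lets a developer pick up a story.
        if self.stories:
            return self.stories[0]
        return None


board = Board(bugs=["login fails on empty password"],
              stories=["export report as PDF"])
print(board.next_task())  # the bug is returned first
board.bugs.clear()
print(board.next_task())  # with no bugs left, the story is returned
```

Note that the gate deliberately ignores priority labels on bugs: under this policy every bug outranks every story, so there is nothing to rank.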

That doesn’t mean fewer bugs are found; it just means they are fixed, and the code is refactored to prevent a future occurrence. Most importantly, a bug-tracking tool never made it into our system. That idea would have desensitized the team to bugs, whereas with our current policy we are very sensitive to them.

Challenging the team members, as craftsmen, to take buggy code personally was the right choice. Bugs got a first-class ticket to termination in our system.

Now, the definition of a bug versus a small feature enhancement is a fine line. I know I have failed to define it well, and that might contribute to what gets called a “bug.”

Oftentimes, the customer’s urgency takes up more mental space than thinking the issue through: looking up the acceptance criteria for the story where it was implemented, and going back to the customer and saying, “No, that was clearly not a defined scenario, we are going to need a story to turn that button green.”

Now the next time, it is more than turning a button green, but the precedent has already been set. All I have been able to do is to strike a balance based upon how much effort it would take to make the bug/feature enhancement work.

If it is a lot of effort, I will double check the bug to make sure it is a bug. If it is, I fix it. If not, I will push back to the customer to write a story card.

Production Support

Once a system goes into production, support begins. Following along with one of Paul Graham’s ideas, we have the developers doing the production support.

We are the ones who wrote the system and know the system the best. When I look at a production support request, I can not only solve it, but make sure it doesn’t happen again. Or if it does, make sure it is easy to correct.

So, during our first deployment of a system, a single team member stepped up as the “production support” developer. I don’t know if he embraced it or was cornered into it, but as a craftsman, he took the responsibility and ran with it.

As further systems were released, he would sometimes spend an entire day on production support. Production support can be a lot of debugging and fixing data, which can be fun, but more often than not is tedious and repetitive.

Oftentimes when I saw a production support email, I would look to the “production support guy,” who could fix it in about half the time I could. That looked a lot like a silo to me.

Everyone should be able to do production support on any system. I should have to, because it is a perspective on the system that is important to have.

In response to this, we came up with a system of triage. Each day of the week is assigned to a specific developer. If a support item comes up, it is the job of the triage developer to tell the client/customer that we are working on it.

If it is addressed to a specific person, the triage developer will inform that person. Otherwise it is on the triage developer’s shoulders to fix the support request before they continue their work for the day.

This ensures the client always has an open line of communication with a developer. No email slips through unaddressed. There is clear responsibility for who should be addressing the support item.
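The triage rules are simple enough to write down as code. This is only a sketch of the idea, assuming a Monday-to-Friday rotation; the developer names and the `ROTATION`, `triage_developer`, and `route_request` identifiers are invented for the example.

```python
from datetime import date
from typing import Dict, Optional

# Hypothetical rotation: weekday index (Monday = 0) -> developer on triage.
ROTATION: Dict[int, str] = {0: "Ann", 1: "Ben", 2: "Cara", 3: "Dev", 4: "Eli"}


def triage_developer(day: date) -> Optional[str]:
    """Return the triage developer for the day, or None on a weekend."""
    return ROTATION.get(day.weekday())


def route_request(day: date, addressed_to: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Decide who replies to the customer and who owns the fix."""
    triage = triage_developer(day)
    return {
        # The triage developer always tells the customer we are on it.
        "acknowledged_by": triage,
        # A request addressed to someone is handed to that person;
        # otherwise the triage developer fixes it before other work.
        "fixed_by": addressed_to or triage,
    }


monday = date(2024, 1, 1)  # this date falls on a Monday
print(route_request(monday))
print(route_request(monday, addressed_to="Cara"))
```

The point of the split between `acknowledged_by` and `fixed_by` is that the customer gets a response from a known person every day, even when someone else ends up doing the fix.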

I know the “production support” developer is in favor of this system. So is the customer, who asks who the triage developer is for the day and has no qualms about interrupting their work, which is as it should be.

Communication and Managing Expectations

Recently, I did some integration work with a third-party vendor. They were developing their side of the integration at the same time as we were developing ours.

Not wanting to slow development and wait for their functionality, we decided to write a mock server and integrate with the host according to the spec from the third party vendor. We received a story from the customer for that and proceeded to make the client for the third party system calls. We finished our story.
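For readers who have not built one, a mock server of this kind can be very small. The sketch below uses only the Python standard library; the `/orders/42` endpoint, its payload, and the function names are hypothetical stand-ins for whatever a vendor’s spec defines, not the actual integration we built.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Canned responses keyed by path (hypothetical endpoint and payload).
CANNED = {"/orders/42": {"id": 42, "status": "shipped"}}


class MockVendorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = CANNED.get(self.path)
        status = 200 if body is not None else 404
        payload = json.dumps(body if body is not None else {"error": "not found"}).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # suppress per-request logging


def start_mock_server() -> HTTPServer:
    """Start the mock on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), MockVendorHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_order(host: str, port: int, order_id: int) -> dict:
    """Client call written against the spec; it only knows host and port."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", f"/orders/{order_id}")
        return json.load(conn.getresponse())
    finally:
        conn.close()


server = start_mock_server()
print(fetch_order("127.0.0.1", server.server_port, 42))
server.shutdown()
server.server_close()
```

A client that passes against a mock like this has only been shown to match the team’s reading of the spec. It says nothing about whether the vendor’s real server behaves the same way.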

In the demo portion of the iteration planning meeting, we could only demo against our mock server. This caused some nervousness in the customer (rightly so). I replied, “Once their side of the system is done, we should be able to send our calls across.”

Then we received three more similar stories for different system calls. We did them, removed the duplication, and felt really good about the job we did. Then came their test server.

Nothing worked! There were all sorts of communication problems, questions about who implemented the system to what spec, and political questions. Despite us thinking we were in the right, the stories were signed and the customer said, “Well, you said this would work.”

At first, I tried to communicate the reason why it didn’t work and how we could move forward. We ended up spending a lot of time on this, and the team felt the integration should be its own story.

The customer pushed back, “Well, you said this would work.” There was some tension, because both sides were right. We didn’t believe it was our fault it didn’t work (neither did the customer), but we told them it would. Rather than let out the righteous indignation I was feeling, one of the team members mentioned:

“We should not have assured you it would work.”

That one line brought the real problem to the front for both sides. We didn’t know whether it would work or not, just that we wrote the right code to the specification we had at the time.

That code by itself has no value to the customer without it working, though. Once that line was said, everyone in the room sat for a second, then understood. Expectations were out of sync.

A little ownership over the defect was all we needed to ease the tension and move forward. It is unproductive to get stuck in a stalemate of expectations. When in doubt, the customer is always right.

Paul Pagel, CEO and Co-founder

Paul Pagel has been a driving force in the software craftsmanship movement since its inception.