Wednesday, February 20, 2008

End of the Line for ComputerWorld Canada?

I just got my ComputerWorld Canada last week. It seems like it's been a while since the previous one, and I've noticed the issues have been getting thinner over the last few months.

I think January 2008 is the slimmest I've seen: only 18 pages. While the content is still good, the apparent gaps between issues and the shrinking page count are a concern. I wonder if they will cease publication soon.

It would be too bad if that happens. I've been reading ComputerWorld Canada and its merged predecessors like InfoWorld Canada for as long as I've been in tech full time. They are useful for getting a high-level idea of what's going on outside your own company and projects, although ComputerWorld Canada has always been more focused on corporate IT departments than on my roles in systems integration and development at an ISV.

I was getting electronic delivery for a while, but around the middle of last year I switched back to the printed issues and I've been reading them as they arrive.

The tech magazines have had it tough in recent years, probably due to the rise of the Internet. It would be unfortunate if ComputerWorld Canada goes away, but I suppose I could find a site or feed with similar content easily enough on the Web.

Thursday, February 07, 2008

The cost of bugs in the field

It's definitely true that the cost of dealing with software bugs that make it into the field is orders of magnitude larger than the cost of finding and fixing them before production.

After our recent reorg I'm now responsible for maintenance on earlier releases of the flagship product out of our office. On Monday I got a trouble ticket about a customer who was experiencing problems after upgrading between minor releases. They wisely trial the upgrades in their lab before going live.

The customer helpfully provided a detailed description of the symptoms, log file excerpts, and a packet capture. The packet capture turned out to be particularly useful. In the wire trace I could see we were unexpectedly returning an HTTP 500 response for a certain common, valid configuration. Embedded in the HTTP 500 response was a stack trace generated by Tomcat, and the stack trace pointed directly to the issue: a bug in our code introduced by the previous maintenance developer in the minor release.
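The post doesn't show the actual product code, but purely as an illustration, here is a minimal sketch of how this kind of failure typically surfaces. The class name, the "profile" parameter and the handler logic are all invented; the only point is that an unhandled NullPointerException escaping a servlet gets answered by Tomcat with an HTTP 500 whose default error page includes the stack trace.

    // Hypothetical sketch only; the real product code isn't shown in the post.
    // A missed null check on an optional value throws NullPointerException,
    // which Tomcat turns into an HTTP 500 error page carrying the stack trace.
    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class ProvisioningServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            // "profile" is optional in a common, valid configuration,
            // so getParameter() can legitimately return null here.
            String profile = req.getParameter("profile");

            // Missing null check: dereferencing profile throws
            // NullPointerException when the parameter is absent.
            resp.getWriter().println("Profile: " + profile.toUpperCase());
        }
    }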

Looking at the code diff from revision history against the stack trace, it was obvious what the error was. The code fix was just a couple of lines: add a null check the original developer had missed and it would be good again. However, getting that change "done" on a system installed at a customer site takes a tremendous amount of effort.
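Before getting into that effort, here is the shape of the fix itself, again using the hypothetical servlet sketched above. The fallback to a default value is my own assumption; the post doesn't say what the correct behaviour was, only that a null check was missing.

    // Patched version of the hypothetical doGet() above: the whole fix is
    // a guard on the optional value before it is dereferenced.
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String profile = req.getParameter("profile");
        if (profile == null) {
            profile = "default";   // assumed fallback; the post doesn't say
        }
        resp.getWriter().println("Profile: " + profile.toUpperCase());
    }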

First I had to set up my development environment for the earlier code base: Eclipse, the application server, the database, Perforce, Tomcat and all of that. Task switching between releases is tedious, and that burned pretty much a day. Reproducing the issue also required more than one Tomcat instance, so I couldn't just run everything off my own PC, and it took a while to get a separate Tomcat up with the correct Tomcat version and the maintenance version of the application source code. Altogether it took more than a day just to get set up and reproduce the issue.

After reproducing the issue the actual code change only took a few minutes to implement. Then I had to install the fixed code and verify it was now working properly.

All done, right? With shipped code, far from it. Next I had to package up a new release using the official procedures, then assemble a patch to upload to the customer, along with patch install instructions I had to write. The code fix also had to be merged into the later releases that will need it, and the wiki sites tracking all of this had to be updated.

In addition to my own time, several days for about a three-line code change, there was the support rep at my company who had to manage the ticket, communicate with the customer, and update his own running site for that release. On top of that, the customer lost a lot of time diagnosing this issue and now has to spend more time doing the upgrade.

All in all, more than a person-week has been consumed by a code error that was five minutes of work to correct. That's what happens when code bugs go into the wild.


Compare this to the cost of finding and fixing it earlier. If it had been detected by the original developer, or by a peer during code review, it would have been about five minutes to fix.

If the developer had found it in his own testing, it would have cost around an hour to do the fix, rebuild, redeploy and rerun the test.

If the test team had found it, it would have been about half a day to do the fix, do another baseline build, update the ticket tracking system, have the testers rerun their test, and close the ticket.

So at every stage it gets progressively more expensive to fix serious bugs. That's why it's so important to find the bugs before they get into production.