Robert Važan

Software bugs are eating the world

Some years ago, I wrote a post about software getting slower and slower. Now I look back at those years with nostalgia, the good old times when performance was the most obvious problem. Performance only got worse over time of course, but the spotlight was taken by something much worse: bugs. Bugs everywhere.

Today was a dogfooding festival. I had first-hand experience with products of my profession, the infamous software developers. The numerous bugs in those products essentially consumed my entire day, so here I am, wondering what went wrong with software development.

Of course, when I was about to write about this and opened Eclipse (I write content as code, seriously), Eclipse greeted me with several overlapping bugs. But I have grown used to that and by now I have a catalog of workarounds to deal with them. The real problem is in how bugs are creeping into the real, physical world, how they become an inseparable part of everyday products used by millions of people.

Case #1: Computer hardware

I bought a computer, because the increasing energy efficiency of new computers was gradually turning the old one into an expensive electric heater. So far so good. Then I started testing. The rear microphone jack does not work whereas the front one does. I am left wondering whether it's a bad part or, more likely, the nasty habit of manufacturers to embed audio port configuration in Windows drivers instead of storing it in the hardware itself where other OSes can find it too. This is why hdajackretask gets a lot of use. But that's just a nitpick.

I connected an HDMI monitor. Works. Then I connected another monitor via DP. Works, but the HDMI monitor stopped working. Turns out this is a common problem with many (most?) mainboards. So I messed around with BIOS settings per Asus instructions. No effect. (BTW, the BIOS section is titled "NB Configuration". Is that for "notebook"? What is it doing in a desktop mainboard?) I messed around a bit longer and then came up with the idea of hot-unplugging and then hot-plugging the monitor while the OS is running. Voila! Two monitors are working! Why it worked only after hot-plug is a mystery to me. What I don't get is why manufacturers don't make it work by default like my old mainboard does. Why do we have to mess around in BIOS settings and then do some hot-plug trickery? It's not like a multi-monitor setup is something exotic. FullHD monitors cost ~100€. It's perfectly reasonable for people to have several of them.

Anyway, as I was testing the new build, I noticed that the CPU cooler (AMD Wraith Stealth) is buzzing/rattling like a small chainsaw. I find it surprising that something so simple and mass-produced cannot be made with consistent quality, but I don't know that much about manufacturing. What I know for sure is that the guy who built it (for 50€ in labor, no less) at Alza eshop must have heard the noise and decided to ship it anyway even though testing the build is supposed to be included in the price. Now I have to ship it back to get a replacement. If this had been caught at assembly time, the guy could have just reached for a spare cooler to save everyone's time.

Case #2: Parcel lockers

Now comes the exciting part. I am shipping the computer using Alza's parcel locker service (a bunch of automated mailboxes next to the local supermarket). So I fill out a warranty form and head to the parcel lockers. Instructions say to scan the barcode or enter the parcel number. There's only one barcode on the printed form, so I scan it, but it is rejected as invalid. Huh, okay, let's enter the parcel number manually. Except there are two of them, which is confusing, but I choose to type the one under the barcode. The parcel locker terminal freezes after I enter the first one or two digits, then resets to the main screen only to freeze again in repeated attempts. This device has like 10 UI widgets, very minimal functionality, and it still manages to freeze? These parcel lockers are all over the country and nobody has noticed yet? It's not like there are too many use cases to test. While calling support (more on this below), I found an online barcode generator and created a barcode for the second number on the form. No luck. The terminal shows the number on the screen, so it scans it correctly, but it is rejected. So I am seeing three overlapping bugs here. Firstly, the warranty form is confusing as it contains two identifying numbers and one barcode (likely the wrong one). Secondly, Alza does not recognize its own warranty claim numbers. Thirdly, manual entry of the number freezes the terminal.

Case #3: AI support

Okay, so I am calling the number shown on the terminal. "Hello, this is Alza's intelligent assistant. Please briefly describe the problem." Wow, that's a really quick adoption of language models. Unfortunately, it soon becomes painfully obvious the assistant does not understand half a word of what I am saying, both at the speech recognition level and at the cognitive and knowledge level. I am all for automation, but deployment of automation is supposed to reduce cost while improving quality at the same time. The assistant is utterly useless, so I ask for a live operator. Fortunately, the developers of this dysfunctional experimental AI prototype had the good sense to make the chatbot comply with such requests without a fuss. The live operator reboots the system remotely (because that's what tech support does) and I am told to wait 15 minutes until the reboot completes. Wait, what, 15 minutes? I am relieved when this turns out to be an exaggeration and the terminal reboots in three minutes. It's still equally broken though. I will have to take the computer to their brick-and-mortar shop.

Case #4: Self-service checkout

So I head to the supermarket to get some value out of this trip and there I am reminded of another bug. Most self-service checkout terminals are card-only. Cash-accepting terminals exist, but having used one, I understand why most of them are card-only. The cash-accepting ones are slow. Apparently, the software constantly keeps checking for the presence of cash. That involves some slow mechanical operations that obviously don't run in the background, because they lock the whole UI. The terminal inexplicably checks for cash all the time, not just when you are about to pay. When you are interrupted every few seconds by the terminal clicking, running electric motors, and flashing its LEDs, you learn to avoid such terminals and always use the card-only ones. Now this is the kind of device that literally everyone uses. How come nobody has fixed it yet? How can such an obvious (and probably expensive) bug stick around for so long?
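
I obviously have no access to the terminal's code, so the following is only a guess at what a fix could look like, sketched in Python. It assumes the slow cash check can be moved to a background thread and enabled only on the payment screen. All the names here (`CashAcceptor`, `Terminal`, `check_for_cash`) are made up for illustration and the hardware is simulated with a sleep.

```python
import queue
import threading
import time


class CashAcceptor:
    """Stand-in for the slow mechanical cash unit (hypothetical interface)."""

    def __init__(self, check_seconds=0.05):
        self.check_seconds = check_seconds
        self.checks = 0

    def check_for_cash(self):
        # Simulates motors spinning and sensors settling.
        time.sleep(self.check_seconds)
        self.checks += 1
        return 0  # no cash is ever inserted in this simulation


class Terminal:
    """Keeps hardware polling off the UI path and limits it to payment time."""

    def __init__(self, acceptor):
        self.acceptor = acceptor
        self.events = queue.Queue()       # poll results handed over to the UI
        self.paying = threading.Event()   # set only while on the payment screen
        self.stopping = threading.Event()
        self.worker = threading.Thread(target=self._poll, daemon=True)
        self.worker.start()

    def _poll(self):
        while not self.stopping.is_set():
            # Idle until the customer reaches the payment screen.
            if not self.paying.wait(timeout=0.01):
                continue
            self.events.put(self.acceptor.check_for_cash())

    def scan_item(self, name):
        # UI work never touches the hardware, so it returns immediately.
        return f"added {name}"

    def start_payment(self):
        self.paying.set()

    def finish_payment(self):
        self.paying.clear()

    def shutdown(self):
        self.stopping.set()
        self.worker.join()
```

In this sketch, scanning items never waits for the cash unit, and the motors stay quiet until `start_payment()` is called. A real terminal would have to cope with vendor drivers and hardware quirks that this toy version ignores.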

Case #5: E-shop payments

Bugs can sometimes be a positive experience though. Recently, I bought some clothes at the H&M eshop. I wasn't asked to confirm the payment after entering my card, so I assumed I would pay when I picked up the package. At their brick-and-mortar shop however, I was not asked for money either. I was distracted, so I only realized this on my way home. I assumed they would charge my card later, which they did, but then they refunded the full sum on their own. Probably a glitch, but a good one. I got about 100€ worth of clothes for free. I am certainly going to be shopping at H&M again. Management at H&M however isn't going to be happy if this is happening at scale.

Conclusion

If this goes on, a day will come when some bug will kill me. Maybe some medical equipment malfunctions. A manual reboot will take minutes while I am dying. And the reboot will not fix the problem anyway. A replacement device will have the same bug, because the bug was introduced by a recent over-the-air update. A device by another manufacturer will have it too, because the bug is in some underlying library. The bug will take five years to fix and many more patients will die in that time. Unrealistic? Consider that even today, some overconfident self-driving cars are murdering people and then dragging them around like they are mud on wheels.

How did we get into this mess? I can see several contributing factors. Bugs are tracked and prioritized instead of just being fixed on the spot. New features that add more bugs are prioritized above existing bugs. Sometimes the bug is in an underlying library or other dependency and you have to wait (possibly indefinitely) for an upstream fix. This is aggravated by commercial code, which you cannot fork and fix yourself. There are often no high-level automated tests, especially ones that could test integrated hardware. Non-technical (and often also technical) management opts for measures that make it harder to produce quality, especially hourly tracking of developers' time and weekly or even daily progress reports. And the suits always hire the cheapest candidates instead of the most capable ones. Dynamic languages allow gross bugs to get into production unless there's high test coverage (there isn't). The practice of maintaining stable branches is almost extinct. The only way to get a bugfix is to upgrade to the latest version, which introduces new bugs. The common denominator here is cost. Software is cheapened by sacrificing quality. We have given up on the ideal of automation: improving efficiency and quality at the same time.