
1.3 Software Defects — Real-World Tragedies

In the previous section, we were casually chatting about moths and the etymology of "Debug." In this section, the tone needs to change.

Using software to control complex electromechanical systems is not just "common" today—it is virtually omnipresent. But the unfortunate reality is that software engineering is still too young, and we humans are far too prone to error. When these two factors converge at the wrong moment—when software fails to execute as designed—the result is often more than just an error dialog popping up on a screen. It translates to massive financial losses, or even the loss of human life.

This is the real-world cost of what we commonly call a "Bug."

Each of the following cases deserves to be written in bold. My brief descriptions here are merely a starting point—to truly understand the complex technical details behind these disasters, you need to dive into the thick official accident investigation reports (links are in the "Further Reading" section at the end of the chapter). I am dredging up these old cases here not to scare you, but to emphasize two things:

  1. Even in large-scale, rigorously tested systems, software failures can still occur—and when they do, they are devastating.
  2. For those of us involved in any stage of the software lifecycle, this serves as a wake-up call: less hubris, and more rigorous design, careful implementation, and thorough testing.

The Patriot Missile Tragedy: The Cost of Precision

Let's turn the clock back to the 1991 Gulf War. The U.S. deployed a battery of Patriot missile air defense systems in Dhahran, Saudi Arabia. Its mission was clear: track, intercept, and destroy incoming Iraqi Scud missiles.

But on February 25, 1991, one of these Patriot systems failed. It never tracked or intercepted the incoming Scud, which struck a military barracks directly, killing 28 soldiers and injuring around 100.

The subsequent investigation report pointed the finger at the core of the software tracking system—a fatal flaw in time calculation.

Simply put, the system recorded its uptime as a monotonically increasing integer count of tenths of a second. For the tracking calculations, the software needed to convert this count into a real (floating-point) number of seconds. The approach was to multiply the count by 1/10 (i.e., 0.1).

Wait, let's pause right here.

If you know a bit about computer science, or even just rely on intuition, you might think 0.1 is a very simple number. But in the binary world, 0.1 is a repeating fraction that never terminates:

0.000110011001100110011001100110011...

The Patriot missile system's computer stored that value of 0.1 in a 24-bit fixed-point register. Anything beyond what those bits could hold was simply chopped off. This truncation is the root cause of the "precision loss."

Under normal circumstances, this wasn't a major issue. But on the day of the tragedy, the system had been running continuously for about 100 hours.

That tiny per-tick truncation error in the stored 0.1 was multiplied by an enormous tick count (100 hours is 3,600,000 tenths of a second), and by then the system's computed time had drifted from the true time by approximately 0.34 seconds.

0.34 seconds sounds negligible, right?

But don't forget, the Scud missile traveled at roughly 1,676 meters/second. In those 0.34 seconds, the Scud had already traveled about 570 meters.

For a radar system, a 570-meter error meant the target had moved out of the tracking "range gate." It was no longer visible on the radar screen, and naturally, the missile couldn't intercept it.

This is a classic disaster caused by precision loss when converting integer counts into real-number time: a constant that binary cannot represent exactly, chopped to fit a fixed-width register, then multiplied by a count that never stops growing.
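
If you want to see the arithmetic for yourself, here is a minimal C sketch of the mechanism. The exact layout of the Patriot's fixed-point register is simplified here (the widely cited analyses work out to roughly 23 retained fractional bits), but the chopping error and the 100-hour drift it prints line up with the figures above:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Approximate the Patriot's constant: 0.1 chopped (truncated, not
           rounded) to 23 fractional bits. The exact register layout is a
           simplification; the point is the chopping error. */
        const double scale    = (double)(1u << 23);
        const double chopped  = (double)(uint32_t)(0.1 * scale) / scale;
        const double per_tick = 0.1 - chopped;        /* ~9.5e-8 s of error */

        /* The clock counts tenths of a second; 100 hours of uptime is: */
        const double ticks = 100.0 * 3600.0 * 10.0;   /* 3,600,000 ticks */
        const double drift = per_tick * ticks;        /* ~0.34 s */

        printf("chopping error per tick : %.3e s\n", per_tick);
        printf("clock drift after 100 h : %.2f s\n", drift);
        printf("Scud travel in that time: %.0f m\n", drift * 1676.0);
        return 0;
    }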

The Ariane 5 Explosion: The Trap of Reuse

If you thought 0.34 seconds was absurd enough, the story of June 4, 1996, reveals an even more insidious killer in software engineering—the trap of "reuse."

Early that morning, the European Space Agency's (ESA) Ariane 5 heavy-lift launch vehicle lifted off from the Kourou spaceport in French Guiana, South America. Just 40 seconds later, the expensive rocket lost control and exploded into a massive fireball in the sky.

The final investigation report was shocking: the direct cause was a software overflow error.

But the story behind it goes far beyond the word "overflow." Let's break down this domino-like chain of failures:

  1. The Overflow: The code attempted to convert a 64-bit floating-point value into a 16-bit signed integer.
  2. No Protection: The conversion was unguarded. When the value was too large to fit, it raised an unhandled exception (a minimal sketch of this failure mode follows the list).
  3. Source of the Exception: The oversized value was an internal variable, BH (horizontal bias), derived from the sensed horizontal velocity. On Ariane 5's early trajectory, horizontal velocity grows far faster than on Ariane 4, but the code, and the assumptions baked into it, had been carried over from Ariane 4 unchanged.
  4. Chain Reaction: The exception shut down the Inertial Reference System (SRI). The On-Board Computer (OBC) then interpreted the SRI's diagnostic output as flight data and issued completely wrong commands to the nozzle actuators.
  5. Final Result: The nozzles of the solid boosters and the main engine swung to extreme deflections, and the rocket veered violently off course and broke apart.
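
To make steps 1 and 2 concrete, here is a small C sketch of a guarded narrowing conversion, the kind of check that was missing. The function name and the sample value are purely illustrative; the original code was Ada, and the unprotected conversion there raised an exception that nothing handled:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical guarded narrowing conversion: clamp out-of-range values
       and report the fault instead of letting the conversion blow up. */
    static int16_t to_int16_checked(double value, int *overflowed)
    {
        if (value > INT16_MAX) { *overflowed = 1; return INT16_MAX; }
        if (value < INT16_MIN) { *overflowed = 1; return INT16_MIN; }
        *overflowed = 0;
        return (int16_t)value;
    }

    int main(void)
    {
        /* An illustrative horizontal-bias-like value, far too large for a
           16-bit signed integer; not the real in-flight magnitude. */
        double bh = 65000.0;
        int overflowed = 0;

        int16_t clamped = to_int16_checked(bh, &overflowed);
        printf("guarded conversion: %d (overflowed = %d)\n", clamped, overflowed);

        /* The unguarded equivalent, (int16_t)bh, is undefined behavior in C
           when the value does not fit; in the Ariane 5 Ada code it raised
           an exception that nothing caught. */
        return 0;
    }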

What's the most ironic part?

The SRI alignment function that failed wasn't even needed after liftoff. However, to allow a quick restart of the countdown in the event of a late hold (a requirement carried over from Ariane 4), the design kept it running for roughly 50 seconds after the start of flight mode. This gave that fatal Bug a roughly 40-second window after liftoff in which to strike.

Subsequent technical analyses (such as the report by Jean-Marc Jézéquel) hit the nail on the head: this was a reuse error.

The problematic SRI horizontal bias module was copied directly from the 10-year-old Ariane 4 software. The designers assumed: since it ran perfectly fine on Ariane 4, it would be fine on Ariane 5 too.

Assumption. That is the most dangerous word for an engineer.

The Mars Pathfinder Reboot: Priority Inversion

Let's turn our gaze to Mars.

On July 4, 1997, NASA's Pathfinder lander successfully touched down on Mars, deploying the famous "Sojourner" rover—the first wheeled vehicle to operate on another planet in human history.

The mission started smoothly, but it wasn't long before ground control noticed the lander experiencing periodic reboots.

The lander was more than a hundred million miles away; there was no way to manually press a reset button. Engineers had to perform remote diagnostics from Earth. Ultimately, they identified this as a textbook concurrency issue: priority inversion.

How did this happen?

In a real-time operating system, we typically assign priorities to tasks. High-priority tasks should execute first. However, if a high-priority task is waiting for a resource (like a lock) held by a low-priority task, that high-priority task has to wait.

This sounds fine on its own—the low-priority task will release the lock soon enough.

But here enters a "third party": a medium-priority task.

The medium-priority task doesn't depend on that lock, but it has a higher priority than the low-priority task. So, it preempts the CPU, preventing the low-priority task (the one holding the lock) from getting CPU time to release it. As a result, the high-priority task keeps waiting and starves.

On Pathfinder, the high-priority task starved for so long that it triggered another mechanism in the system—the watchdog timer.

The watchdog timer's logic is simple: if nothing "feeds the dog" (resets the timer) within a set period, it assumes the system has hung and forces a reboot. So the system rebooted over and over.
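
On Linux you can see this contract directly: the kernel exposes the watchdog through /dev/watchdog, and a minimal "feeder" looks like the sketch below. Pathfinder's watchdog enforced the same contract, just not through this interface:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* As long as something keeps writing to /dev/watchdog, the driver
           resets its countdown; if the writer hangs and the countdown
           expires, the hardware reboots the machine. */
        int fd = open("/dev/watchdog", O_WRONLY);
        if (fd < 0)
            return 1;                       /* no watchdog device available */

        for (;;) {
            if (write(fd, "\0", 1) != 1)    /* the "feed" */
                break;
            sleep(1);                       /* must beat the watchdog timeout */
        }
        close(fd);
        return 0;
    }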

Ironically, the solution to this problem was well-established.

The VxWorks real-time operating system natively supported "priority inheritance." With this option enabled, a low-priority task holding a lock automatically "inherits" the priority of the high-priority task waiting for it, letting it finish its critical section quickly, release the lock, and head off the starvation.

But the Jet Propulsion Laboratory (JPL) team had disabled this option when configuring VxWorks.

Fortunately, while they made a mistake, they also left themselves a fallback. The JPL team had intentionally reserved a debug data stream during the mission, continuously sending telemetry data back to Earth. It was precisely thanks to these detailed logs that they were able to reproduce and pinpoint the Bug on Earth.

The fix was simple: send a command from Earth to enable priority inheritance on the semaphore.
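
On Linux, the analogous switch lives in the pthread mutex attributes rather than a VxWorks semaphore option. A minimal sketch of creating a priority-inheritance mutex (the helper name is mine, not a library API):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>

    /* Create a mutex with priority inheritance: a low-priority thread
       holding the lock temporarily runs at the priority of the highest-
       priority thread blocked on it, so a medium-priority thread can no
       longer starve it. */
    static int init_pi_mutex(pthread_mutex_t *m)
    {
        pthread_mutexattr_t attr;
        int err;

        if ((err = pthread_mutexattr_init(&attr)) != 0)
            return err;
        if ((err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT)) == 0)
            err = pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
        return err;
    }

    int main(void)
    {
        pthread_mutex_t lock;

        if (init_pi_mutex(&lock) != 0) {
            fprintf(stderr, "failed to create priority-inheritance mutex\n");
            return 1;
        }
        puts("priority-inheritance mutex ready");
        pthread_mutex_destroy(&lock);
        return 0;
    }

Compile with gcc -pthread; the inheritance only matters in practice once the threads contending for the lock run with real-time scheduling priorities.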

The reboots stopped, and the mission continued.

Glenn Reeves, the JPL team lead, later summarized it with a powerful statement:

"We test what we fly and we fly what we test."

This is worth writing on the first page of every embedded and systems software developer's notebook.

The Boeing 737 MAX Crash: A Single Point of Failure

Compared to the previous cases, the tragedy of the Boeing 737 MAX might feel closer to home and more heartbreaking.

On October 29, 2018, a Lion Air flight took off from Jakarta and crashed into the Java Sea minutes later. On March 10, 2019, an Ethiopian Airlines flight took off from Addis Ababa and crashed minutes later.

The two crashes claimed a total of 346 lives.

This is the MCAS (Maneuvering Characteristics Augmentation System) disaster of the Boeing 737 MAX. The origin of the issue traces back to the 737 MAX's hardware changes: larger engines, mounted further forward and higher on the wing, altered the aircraft's aerodynamics and gave it a tendency to pitch up toward a stall at high angles of attack.

The engineers' "fix" was a pure software patch: MCAS. When the system detected an excessively high angle of attack, MCAS would automatically push the nose down to "correct" the aircraft's attitude.

This sounded reasonable. But there was a fatal design flaw: MCAS took its input from a single angle-of-attack (AoA) sensor.

Furthermore, this software logic was granted immense authority—it could override the pilots' inputs and forcibly push the nose down.

When that single sensor failed, MCAS would mistakenly believe the aircraft was about to stall and would aggressively push the nose down. The pilots, often unaware that MCAS even existed, had no idea how to quickly assess the situation and disable it in the midst of the chaos.

A software system that failed to account for sensor failure modes during design, and equally failed to account for the human factor, ultimately led to irreversible consequences.
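
One standard defense against this kind of single point of failure is to require agreement between redundant sensors before letting automation act. A hedged sketch; the function, thresholds, and sample values below are illustrative, not Boeing's actual logic:

    #include <math.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define AOA_DISAGREE_LIMIT_DEG  5.5   /* illustrative tolerance */
    #define AOA_STALL_THRESHOLD_DEG 14.0  /* illustrative trigger angle */

    /* Only allow an automated nose-down command when two independent
       angle-of-attack readings agree within tolerance and both indicate a
       dangerously high angle. */
    bool should_command_nose_down(double aoa_left_deg, double aoa_right_deg)
    {
        /* Sensors disagree: do not act; flag the fault to the crew instead. */
        if (fabs(aoa_left_deg - aoa_right_deg) > AOA_DISAGREE_LIMIT_DEG)
            return false;

        return aoa_left_deg  > AOA_STALL_THRESHOLD_DEG &&
               aoa_right_deg > AOA_STALL_THRESHOLD_DEG;
    }

    int main(void)
    {
        /* One failed sensor pinned at a bogus value: the cross-check refuses
           to push the nose down on its word alone. */
        printf("%d\n", should_command_nose_down(22.0, 3.0));  /* 0: disagree */
        printf("%d\n", should_command_nose_down(16.0, 15.2)); /* 1: agree, high */
        return 0;
    }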

Other Wake-Up Calls

Beyond these headline-grabbing major disasters, the software world is full of bugs that are either absurdly comical or terrifying upon reflection:

  • Altitude Set to Zero: In June 2002, at Fort Drum in the U.S., an Army report indicated that a software defect led to soldiers' deaths. The reason was absurdly simple: if the operator did not explicitly input a target altitude, the system defaulted to 0 (sea level). Fort Drum's actual elevation is 679 feet. The error threw off the artillery fire-control calculations, resulting in friendly-fire deaths.
  • Unencrypted Live Broadcasts: In November 2001, a British engineer named John Locker was astonished to discover that he could intercept U.S. military satellite signals using a standard satellite TV receiver: real-time reconnaissance footage from U.S. spy planes over the Balkans. The reason? The data stream was transmitted unencrypted. The same oversight remains common in today's IoT devices.
  • Pitfalls in the Linux Kernel: If you ever feel like the code you write is terrible, search for "Linux kernel bug story." You'll see that even the world's top hackers have left seemingly stupid but severely consequential bugs in the kernel.

By this point, I hope you feel a sense of urgency.

These aren't just stories; they are lessons. They constantly remind us that even a single line of erroneous code can be amplified infinitely.

Alright, enough of the heavy topics. If you're now itching to start debugging Linux, let's not waste any more time—

Let's start by setting up our workbench.