Debugger

Dad – below is the article I was telling you about. I submitted it to Wired, The New Yorker, Harper’s, The Paris Review, Granta, and a couple of computer industry publications that you’ve never heard of. No one wanted it. The reasons for rejection were varied: too long, incoherent, poorly written, lacking in pictures, and obscenity. Oh well, I had fun writing it. I don’t know if you want to show this to mom. Let me know what you think. Love, Steve


A debugger—what’s that?

Let’s start with the word itself: debugger.

Consider the de – a negation, and the bugger, from bug, as in insect. Yet debuggers have nothing to do with any insect negation, that is, real extermination: ants, termites, roaches, rats, snakes, or any other varmints. An exterminator, maybe somebody named Tom Delay, would never show up at your house and say, “Ma’am, I’m here to debug your home.” Nor would he introduce himself as a debugger, because a debugger is not a person.

For those given to thinking about these sorts of things, let’s be clear: the word debugger and its variant forms, debugging (continuous present), to debug (infinitive), debugged (past participle), have nothing to do with undoing sodomy. For new English speakers, let me explain that sodomy means anal sex, and if that’s too hard, then let’s just call it butt fucking.

Debuggers: nothing to do with that.

Alternative terms are unacceptable. Uninsecting, dispesting, and anti-arthropoding are too bizarre, too abstract, and too Latinate. Debugging, with its concrete Anglo-Saxon tone, is preferable to any of these pretentious expressions.

Real world analogies are not helpful to understanding a debugger. A debugger is not like a delousing station, and really, who has ever seen a delousing station? None of us were ever World War I soldiers, coming back from the trenches after the Battle of the Somme, cleaning up before going on leave in Paris or Bar le Duc, where we hoped to get laid by a pretty French girl named Michele.

A debugger is not like a dehumidifier, which sits in your basement cycling on and off, and whose tray you forget to empty until it overflows, and then that moisture must again be re-absorbed by the dehumidifier.

A debugger bears no resemblance to a desalinization plant.

A debugger is software that observes the operation of other software, specifically for finding and fixing problems in the software under observation.

Debuggers are not like real world observation tools: binoculars or telescopes or microscopes. They are somewhat akin to stethoscopes: debuggers and stethoscopes give insight into what is going on under the surface. A debugger is like a continuous and simultaneous x-ray and MRI, although let’s not get too carried away with metaphors of digital systems being like organisms (think virus).

To understand a debugger you must first understand the history of the word—how the word bug came to denote a defect or problem with the operation of a computer. Before the dawn of the computer age, the word bug had been used to describe problems in systems, but a real insect was the source of a computer malfunction in September, 1947: a moth was found trapped inside a relay in the Harvard Mark II electromechanical computer, and caused an error in that computer’s program. The event occurred on September 9, the location was panel F at relay #70.

By the time the computer scientists found it the moth was dead.

Although Nabokov and his net were in Boston at the time, he was teaching at Wellesley, not Harvard, and it is lost to history whether he knew of the moth, its species, or was otherwise involved in the whole affair.

The moth can be found taped to a notebook at the National Museum of American History.

Parts of a Mark II computer can be found at a museum in Japan.

Nabokov died in Switzerland in 1977.

Debuggers are not the same as monitors (not the Civil War Union ironclad nor a teacher’s assistant who can give you demerits). Monitors can also observe the action of software, but they can only point the way, scout out the terrain.

Start a particular debugger, and it can tell you what is going on with the program you are coding, the state and value of variables, and other minutiae of software nerdery engineering.

A developer and her debugger are like a playwright, script in hand, directing a dress rehearsal of her play. Events, actions, are started, then stopped. Events are replayed several times. She backs up, breaks, starts over, then continues. The playwright, after seeing two actors perform some part she has written, scribbles some improvements to her text, changing the original to something better; so too the developer, seeing the interaction of two modules, makes changes to her source files. What is written is not often well understood until it is performed. And in both fields, the distribution of talent is very uneven: there is only the occasional Shakespeare, the rare Hopper (Grace, not Edward), and too many hacks.

A monitor program, as mentioned, is different. Start a monitor, say your computer’s task manager, and there is an array of information, a dashboard, but instead of a dashboard for your car it’s for your computer. While driving your car you pay attention to speed, rpms, and most important of all, oil pressure. When driving your computer you might observe CPU utilization (central processing unit, the main chip that does everything), applications running, and the amount of free RAM. RAM and oil pressure are similar in that you don’t want to run out of either: no more and everything stops working. If your computer runs out of RAM, you can restart it. If your engine runs out of oil, you need a new engine.

On dragsters, the only gauge is an oil pressure gauge.

The best automotive monitors are the Smith analog gauges on a 1973 Jensen Healey. The Smith gauges, like the display of software debuggers, are black and white, elegant and functional. On the Jensen the oil pressure gauge is located left of center (on the left hand drive versions of the car), easily viewed through the steering wheel.

Other monitors watch network activity, telling you about various protocols: not the type of protocols that tell you which fork to use or what title to use when writing a letter to a disingenuous politician (redundant), but rather protocols that govern digital communications, often referred to via a series of three letter acronyms (TLAs): UDP, TCP, FTP.

I like debuggers and monitors. Running them makes me feel like I am smarter than I really am. I like to see a debugger or monitor on my screen, as if I am doing something useful, even if I am just sitting there spacing out; it’s a productive looking space out.

Using a debugger is the closest I can come to putting on airs of tech macho. I’m just an English major, and can do some simple scripting and programming, but I’m not good at it. Sometimes I use a debugger out of feelings of inadequacy, in the same way that non-engineering types, desperate to be digitally chic, prominently display their gold-plated, landscape-layout RPN[1] Hewlett Packard calculator, whose interface is optimized for the way mathematics is done by the machine (number, number, then operation) as opposed to how humans learned and do arithmetic (number, operation, number). Who at HP made that interface decision? How was HP able to sell so many of those calculators?
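For readers who have never held one of those calculators, here is a minimal sketch of how number, number, then operation can be evaluated with a stack. It illustrates the notation only and is not a model of any actual HP firmware; the function name eval_rpn and the space-separated input format are assumptions made for this example.

```python
import operator

# Map each operator token to the function that applies it.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(expression):
    """Evaluate a space-separated RPN expression such as '3 4 +'.

    Hypothetical helper for illustration: numbers are pushed onto a
    stack, and each operator pops the two most recent numbers and
    pushes the result back.
    """
    stack = []
    for token in expression.split():
        if token in OPS:
            right = stack.pop()  # the most recent number is the right operand
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]


print(eval_rpn("3 4 +"))  # what humans write as 3 + 4
```

The machine never needs parentheses or precedence rules, which is presumably the appeal: "5 1 2 + 4 * + 3 -" means 5 + (1 + 2) * 4 - 3 with no brackets in sight.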

*

In truth I’ve only used a debugger effectively once. But, in debuggeris veritas: what follows is a true story, but first a bit of background.

Whenever I interview SQA (software quality assurance, testers) people, I always ask them: what’s the worst bug you ever missed? Everyone has their stories, and of course they are reluctant to tell them. Who wants to admit to that sort of thing? I tell them not to worry: whatever mistake they made, I’ve made one far, far worse.

At one company we sold a software application, a utility program that ran every time a user started his computer. The utility was installed on tens of millions of computers. It was designed such that once a day the utility would check a remote server to see if there was a new version; we called this the phone home feature. If there was a new version, the program on the user’s computer would download the newer version, then run the update program, thereby replacing itself with a newer version.

If there was not a new version, nothing would happen. The phone home call was over until the next day.
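The phone home logic described above might be sketched roughly as follows. This is an illustration under stated assumptions, not the company's actual code: the function names, the 24-hour interval, and the representation of versions as tuples are all invented for the example.

```python
from datetime import datetime, timedelta

# Assumption: "once a day" is modeled as at least 24 hours between checks.
CHECK_INTERVAL = timedelta(days=1)


def should_phone_home(last_check, now):
    """Return True if the utility is due to contact the update server."""
    return last_check is None or now - last_check >= CHECK_INTERVAL


def phone_home(installed_version, latest_version):
    """Decide what happens after the server reports its latest version.

    Versions are hypothetical tuples such as (1, 2), which Python
    compares element by element.
    """
    if latest_version > installed_version:
        return "download and run update"
    return "do nothing until tomorrow"


# Example: last checked yesterday morning, server has a newer version.
now = datetime(2000, 1, 2, 9, 0)
if should_phone_home(datetime(2000, 1, 1, 8, 0), now):
    print(phone_home((1, 0), (1, 1)))
```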

Three of us in engineering were responsible for this update process. The first of us was a software developer for the code that would run the update on the end user’s computer; we called this the patch code; we’ll call him PatchDev. PatchDev had an air of blue collar about him: looking at him you thought he was just off a construction site. A mason’s trowel or sheet rock hammer suited him as much as a keyboard and mouse. He was one of the best all around programmers I had ever worked with. Still is.

A second software developer maintained all the server code for the update; we’ll call him ServDev. ServDev at first appeared like so many other programmers in Silicon Valley: quiet, shy, intelligent, a little pasty from being inside too much. He was all those things, but unlike other developers his thinking and imagination were broad: he thought beyond just his own code and his own modules, casting a wider net of considerations, options, problems, and possibilities.

For both of these guys, there were no better co-workers, and in time, no better friends.

Last was me, the software test engineer; no special name needed, but as already implied, I’m not an engineer.

We got to be pretty good at updating our software. After several years, and running hundreds of millions (yes, that many, really) of successful updates, we released yet another update. Call it update X. For update X we followed our tried and tested engineering process: develop the patch code and test in our engineering environment. When we thought it was ready, we would then update just 1,000 machines, no more, then check the results of that update. If those numbers looked okay, we’d then run an update for twenty-four hours targeting 10,000 machines (just over 400 machines per hour, or almost 7 per minute); this gave us a statistically significant population distributed over the course of an entire day. If this went well, we’d gradually ramp up the updates, to hundreds of thousands of computers per day.
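The arithmetic behind the twenty-four hour stage is easy to check. The snippet below only reproduces the figures quoted above; the function name rollout_rate is made up for the example.

```python
def rollout_rate(machines, hours=24):
    """Return (machines per hour, machines per minute) for an even rollout."""
    per_hour = machines / hours
    per_minute = per_hour / 60
    return per_hour, per_minute


per_hour, per_minute = rollout_rate(10_000)
print(f"{per_hour:.1f} machines per hour, {per_minute:.2f} per minute")
# prints "416.7 machines per hour, 6.94 per minute"
```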

So it began with upgrade X.

Another word about the company before going further: the company was organized like a typical software company, the major departments being finance, sales, marketing, business development, analytics, and engineering. Engineering was sub-divided into software development, software quality assurance, documentation, and technical support.

The business development department was a funny combination of sales, marketing, and engineering, and the vice-president of the department favored hiring people with financial and analytical backgrounds. This turned out to be a good thing. The vice president of business development, let’s call him Vpbd, was a curious mix of intelligence, youth, inexperience, insight, likability, obnoxiousness, and finally, graciousness (this is not a word you would think of when you first met him). For his first company photo, Vpbd wanted a picture of himself talking on the telephone, until the chief executive officer told him that’s what real estate agents did, and instead a standard three-quarter profile photograph with a no-teeth smile was sufficient.

The analytics department was run by an intelligent but slightly misanthropic manager, who supervised a group of statisticians, most of whom now work at the Centers for Disease Control, extra-national pharmaceutical companies, Connecticut hedge funds, or anyone else interested in exploiting big data. Their job was to sift through the tens of gigabytes of data that came into our servers every day; this was a respectable volume of data at the turn of the millennium.

A few weeks after upgrade X had been running, both business development and analytics contacted engineering to state that based upon their reports, the size of the user base was shrinking, when it should have been growing. Even worse, the latest revenue numbers indicated that as a result of this declining user base, the company was losing hundreds of thousands of dollars per month. They believed the problem was related to upgrade X, since these problems began after we released the upgrade.

Of course this was impossible: engineering, with its ironclad process, had irrefutable data that upgrade X worked fine. We had run through all our tests, exercised appropriate prudence and caution, behaved soberly and rationally.

The mills of the gods grind slowly, but they grind exceedingly fine.

To humor our comrades in other departments, the engineering team began an investigation as to what might have gone wrong with upgrade X. The CEO took a keen interest in all this, joined the meeting and asked pointed and disconcerting questions. Other developers joined in, adding to the mix of different points of view. At the end of the meeting, PatchDev suggested that a problem might occur if, during the patch process, the utility failed to close correctly; thereafter the utility would no longer start when the user started his computer.

According to PatchDev, the order of operations (pace HP) of the patch program was critical: the first thing the patch program did was to remove the function that started our utility when the computer started. The last thing the patch program did was restore the function that started our utility when the computer started. If in between these two steps, something went wrong in the patch process, then our utility would no longer start when the user’s computer started.

If the utility no longer started, then it would no longer run its daily phone home function, and therefore it appeared our utility was no longer on that computer, even if it still was. If this happened to enough computers, it would appear that we were losing users, and therefore revenue.

Under this scenario, the utility was functionally intact and fine; it just no longer started—it was dormant. If a user had technical savvy to locate the program file and manually start it, everything would be fine.
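PatchDev’s scenario can be played out in a few lines. This is a toy simulation, not the real patch code: the dictionary standing in for a user’s machine, the field names, and the PatchInterrupted exception are all assumptions made to show why the order of operations matters.

```python
class PatchInterrupted(Exception):
    """Stands in for anything that stops the patch partway through."""


def run_patch(machine, fail_midway=False):
    # First step: remove the entry that starts the utility at boot.
    machine["starts_at_boot"] = False
    # Middle step: replace the program (where an interruption may strike).
    if fail_midway:
        raise PatchInterrupted("patch stopped before the final step")
    machine["version"] += 1
    # Last step: restore the entry that starts the utility at boot.
    machine["starts_at_boot"] = True


def phones_home(machine):
    # The utility can only phone home if it is started with the computer.
    return machine["installed"] and machine["starts_at_boot"]


healthy = {"installed": True, "starts_at_boot": True, "version": 1}
run_patch(healthy)

dormant = {"installed": True, "starts_at_boot": True, "version": 1}
try:
    run_patch(dormant, fail_midway=True)
except PatchInterrupted:
    pass  # nobody is watching, so the failure goes unnoticed

print(phones_home(healthy))  # True: patched and still starting at boot
print(phones_home(dormant))  # False: still installed, but silent
```

From the server’s side the second machine looks exactly like an uninstall, which is how a perfectly intact program can show up as a lost user.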

The action item fell to me to try to recreate this scenario: start an update, disrupt the patching process, cause it to fail, then check the results. Cue the montage of rolling up sleeves, coffee brewing, keyboard clickity-clackiting, stoic expressions, a mouse making circles on a mouse pad, formulas and numbers passing through the air, eye glasses being decisively pushed back on noses, and splash screens of debuggers being started.

On my test computer I installed an older version of the utility. Next I started the debugger, then changed the computer’s date and time so that our utility would think it was time to phone home. I watched this on two 21” screens: screen left the debugger pane, an airline cockpit of commands and controls, along with various consoles conveying state information, and screen right, more screens to help watch what file is doing what.

I started all the programs, then watched and waited: as expected the utility phoned home, then started to download our patch program. After the patch program was downloaded, I watched as it removed the function to start the utility with the computer. Timing was critical and this was where I intervened: I used the debugger to stop further execution of the patch program, hoping to simulate the problem as it had occurred. Nothing happened further with any program. It looked to be hung (that means it stopped working).

After a few more minutes, I restarted my computer and inspected everything. All the program files for the utility were intact, untouched by the update process. However, our utility had not started with the computer. It was not running. PatchDev’s hypothesis had been correct. But that wasn’t the worst.

Business development was right.

Analytics was right.

We were wrong.

Clearly the first thing the patch program should NOT do is remove the function that starts the utility with the computer. Our flawed patch process had killed, or at least disabled, our own program.

In war this is known as death by friendly fire.

I sat looking at the screen for a while. Soon I’d have to inform everyone of my findings, but not just yet—I wanted to wallow in it alone for a few minutes. I would have liked to have had a drink, but that would have looked bad before lunch; I’d have to sneak out to the day time drinker bar near the office. The shock of seeing things go wrong slowly gave way to black despair. It was like the exam I thought I aced, that I had always aced before, but this time was returned with an F, and there was no hope of redemption: it was the end of the semester, there were no do overs, and the chance for extra credit, well, that went out in middle school.

I wondered why this problem had not occurred before; perhaps there had been additional changes in the patching process that led to this bug.

Although I was not alone in the transgression (PatchDev and ServDev had also missed the bug), I was the one who had tested it and missed the bug. But those wankers in business development and analytics had found it, or at least the symptoms, even if they didn’t know exactly what was going on.

So when I interview SQA people and tell them about what happened to me: killed or at least temporarily disabled hundreds of thousands, maybe even into the millions, of our users, they look at me and wonder if they still want the job.

We first fixed the updating bug; that was pretty easy. You can be sure I ran the debugger to make sure that turd was gone. Then we kicked off an interesting project code named Search and Rescue. We knew our software was out there and okay, it was just dormant: it did not start when the user’s computer started. And if it did not start, it did not phone home. How would we reach the software? We did it, but that’s a story for another time.

We were never sure of what had caused the aborted patch process. However, we had at least one scenario that demonstrated the problem, and that was enough to lead to a fix, a rare case of one fix fitting all instances of the problem. Indeed, there may have been many reasons our software got into that particular frozen state: when your software runs on tens of millions of computers, and each computer is configured differently, diagnosing the pathology of a bug is a sort of epidemiological problem.

I don’t know if I would have caught the bug had I been using the debugger at the start of the testing process; I would have had to have been creative enough to think to interrupt the update process at that particular point. Good testing is a combination of being methodical and being creative, but sometimes we are neither methodical enough nor creative enough.

For a long time afterwards I was depressed by this whole affair. But the shadow was less dark after I got an email from Vpbd—remember him? Early in the process of trying to get to the root of the problem, he sent me the following email:

Steve –

One note, IF (…stressing “if”…) this [the friendly fire problem] proves to be the case, I know nobody will feel worse about it than you. And regardless of whether or not this is what happened, I want to be the first to acknowledge that this is the exact kind of risk we take when we pursue super-aggressive software development schedules at light-speed- that’s just a fact. And I also promise you there have been plenty of mistakes made all around the company that have been painful at one time or another-so keep your chin up (…again, stressing IF this is the case…), & let’s rally. Onward to the greatness that awaits!…

References

1. Reverse Polish notation
