romanhn a month ago

Once I had to track down an issue where, very rarely and with no discernible pattern, the web app would produce garbled PDFs. Turned out this happened when an admin account remotely connected to the app server, which caused a reset of the default screen resolution, which messed up the PDF library that relied on a specific resolution (it was HTML-to-PDF conversion). It happened rarely and randomly because there were multiple web servers that were occasionally restarted, which would fix the problem until the next time.

Another fun problem I dealt with was when I was moving my employer's codebase from Subversion to Mercurial version control ages ago. Everything looked good, except a directory named CVS (after the pharmacy, a customer) was missing. Was banging my head on the table before realizing that the default .hgignore file instructed Mercurial to ignore all contents of .*/CVS (another old version control system).
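The collision described above looks something like this in a regexp-syntax .hgignore. This fragment is illustrative, not the exact stock file (which varied by Mercurial version):

```
syntax: regexp
# Intended to ignore metadata directories left behind by the CVS version
# control system, but it equally matches any directory named CVS --
# including a customer directory that has nothing to do with CVS:
.*/CVS
```

Deleting or narrowing that line brings the directory back; explicitly `hg add`-ing the files also works, since ignore rules only apply to untracked files.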

mikewarot 25 days ago

1997 - Windows 98, HP 4000 printer drivers assumed you had the floating point libraries loaded into Windows, and just dynamically unloaded them whenever they felt like it. So, everything would work fine, until you did something that had to compute the page dimensions (and thus use floating point).

Forcing the loading of the floating point libraries fixed it, but it took months to track it down.

It turned out to be an optimizing compiler whose options HP hadn't set properly in their make setup.

  • ZevsVultAveHera 25 days ago

    Probably the wildest story involving loading/unloading libraries I have ever read. How did you notice that? Did you have to resort to memory debugging?

    • mikewarot 25 days ago

      It was about 6 months of nagging errors. The busier the user was, the more likely the blue screen. In the end it was just sheer persistence in searching the Internet for answers until I finally found it.

Terr_ a month ago

Recycling a comment, where part of the annoyance came from the feeling that they should have been asking someone else to solve it: https://news.ycombinator.com/item?id=37859771

_____

[That's like] Me, with zero C/C++ experience, being asked to figure out why the newer version of the Linux kernel is randomly crash-panicking after getting cross-compiled for a custom hardware box.

("He's familiar with the build-system scripts, so he can see what changed.")

-----

I spent weeks of testing slightly different code-versions, different compile settings, different kconfig options, knocking out particular drivers, waiting for recompiles and walking back and forth to reboot the machine, and generally puzzling over extremely obscure and shifting error traces... And guess what? The new kernel was fine.

What was not fine were some long-standing hexadecimal arguments to the hypervisor, which had been memory-corrupting a spot in all kernels we'd ever loaded. It just happened to be that the newer compiles shifted bytes around so that something very important was in the blast zone.

Anyway, that's how 3 weeks of frustrating work can turn into a 2-character change.

  • ZevsVultAveHera 25 days ago

    Ah, the joy of kernel debugging. Even with years of C experience, it can take weeks to debug trivial mistakes. Been there, and seen others (with long careers in kernels) there too.

bcrl 25 days ago

I had a bug in journal replay that only occurred when a transaction of a specific size extended past the last block of the journal device and wrapped around to the first block. While QA managed to come up with a reliable reproducer of the issue, it took writing a comprehensive tool that replayed and checked the contents of the backing store at every single state that was present on the disk during replay of the journal. It turned out that a mask was missing from the calculation of the offset into an in-memory buffer, which would otherwise have been impossible to spot, as the copy still landed at a valid location within the buffer, just the wrong one. Garbage into memcpy(), garbage out. Oops.
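A rough sketch of that bug class (all names, block counts, and sizes here are hypothetical, not the actual code): a power-of-two journal is typically walked with a mask, and dropping the mask makes a wrapping transaction copy from a valid-but-wrong offset.

```python
# Hypothetical journal geometry for illustration only.
JOURNAL_BLOCKS = 8   # power of two, so wraparound can be done with a mask
BLOCK_SIZE = 16

def replay(journal: bytes, nblks: int, start_blk: int) -> bytes:
    """Replay nblks journal blocks starting at start_blk into a flat buffer."""
    out = bytearray()
    for i in range(nblks):
        # The mask is the fix: without "& (JOURNAL_BLOCKS - 1)" the source
        # offset keeps growing past the wrap point. If the buffer happens to
        # be big enough, the copy still "works" -- just from the wrong place,
        # which is exactly why the bug was so hard to spot.
        src = (start_blk + i) & (JOURNAL_BLOCKS - 1)
        out += journal[src * BLOCK_SIZE:(src + 1) * BLOCK_SIZE]
    return bytes(out)
```

A transaction starting at block 6 and spanning 4 blocks correctly reads blocks 6, 7, 0, 1; the buggy version would have read past block 7 instead of wrapping.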

C/C++ certainly gives one enough rope to shoot your foot off in the most unexpected places. That one took a heck of a long time to solve.

JoeAltmaier a month ago

Combination PIT/serial interrupt issue involving microsecond-resolution system programmable interval timer and multi-port serial driver. Would crash every day or so.

Had to create a stress test to reproduce it in minutes, not days. Then trace code paths through timers and serial events to find the problematic path. Turned out there were many - the timer interrupt callback could cancel an interrupt, reschedule a timer, change an interval, or cancel and then reschedule. All in the presence of other channel interrupts occurring and overlapping unpredictably. Timers rescheduled for intervals that had already passed by the time the callback completed. And on and on.

Took a weekend alone with the code and a set of machines, desk-time getting my head around it all, then coding bullet-proof paths for all calls and callbacks for every related system call.

Once it worked, it worked for days and then months under test. No bug is too hard to resist a methodical approach.

  • ZevsVultAveHera 25 days ago

    Ah yes, one of "the funniest" problems. They teach you a lot or drive you insane. Have you ever written about this adventure in an article or a narrative story? It would be a great read, I'm sure.

erdaniels a month ago

Upgrading from Qt 4 to 5 broke the appending of QStrings to QByteArrays such that it stored half the data from a QString (some wonkiness with UTF-8 and UTF-16, IIRC). It took a rewrite of the RTMP/AMF layer in the codebase to figure it out.

bravetraveler 25 days ago

Anything made because "we didn't have time"... when an educated participant wouldn't have entertained any of this to begin with

When the bugs near flaws... I'll burn it, you, and myself down

1970-01-01 25 days ago

Anything that has a completely useless error message qualifies. Please do not use ambiguous error messages in your code!

Very old examples that live in my head, rent-free:

Error - Error.

Something is wrong.

_kb 25 days ago

Equal parts annoying and fun. Copying an old comment (https://news.ycombinator.com/item?id=35523969#35531850).

---

In an environment I worked in, multichannel audio recordings were archived. The archival recordings all had a perfect 4kHz tone appearing, seemingly out of nowhere. This was happening on every channel, across every room, but only in one building. Nowhere else. Absolutely nothing of the sort showed up on live monitoring. The systems across all sites were the same, and yet this behaviour was consistent across all systems only at one location.

The full system was reviewed: from processing, recording, signal distribution, audio capture, and in room. Maybe there was a test gen that had accidentally deployed? Nope. Some odd bug in an echo canceller? Also no. Something weird with interference from lighting or power? Slim chance, but also no. Complete mystery.

When looking for acoustic sources there was an odd little blip on the RTA at 20kHz. This was traced back to a test tone emitted from the fire safety system (an ultrasonic signal for continuous monitoring). It's inaudible to most people and gets filtered before any voice-to-text processing, so no reason for concern. Anyway, 20kHz is nowhere near 4kHz, so the search continued.

The dissimilarity of 20kHz and 4kHz holds true, until you consider what happens in a signal that isn't band-limited. The initial capture was taking place at a 48kHz sampling rate. It turns out the archival was downsampling to 24kHz without applying an anti-aliasing filter. Without filtering, any frequency content above the Nyquist frequency 'folds' back over the reproducible range. So in this case a clean 24kHz-bandwidth signal with a little bit of inaudible ultrasonic background noise was being folded at 12kHz to create a very audible 4kHz tone. It was essentially a capture-the-flag for signals nerds and a whole lot of fun to trace.
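The folding arithmetic above is easy to sanity-check. This is a generic sketch of the standard alias-frequency formula, not code from the system in question; the numbers match the story (20kHz ultrasonic tone, 24kHz archival rate, 12kHz Nyquist):

```python
def alias_frequency(f, fs):
    """Frequency (Hz) at which a tone of f Hz appears after sampling at fs Hz
    with no anti-aliasing filter: content above Nyquist (fs/2) reflects back
    into the representable band."""
    f = f % fs                       # fold into one sampling period
    return fs - f if f > fs / 2 else f

print(alias_frequency(20_000, 24_000))   # -> 4000
```

So the 20kHz monitoring tone lands exactly on 4kHz after the unfiltered downsample, while anything already below 12kHz passes through unchanged.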

mike_hearn 25 days ago

I used to work on Wine, first as a volunteer and later as a job, so spent a lot of time staring at gigantic multi-gigabyte sized logs trying to work out why an app was crashing when running on Linux. Sometimes apps would work fine for me but be reported as crashing by a user, or we wouldn't have access to the app at all, so logs were the only way to work out what was going wrong.

We got a bug report that an app would crash, and I couldn't reproduce it. So we asked the user, are you using the latest version of Wine from our website? "Yes I am". OK, that's odd, send us some logs then. The crash was some sort of memory corruption during startup of the app. Everything seemed to be running fine, the app was loading files and reading registry entries happily, and then suddenly it would segfault in a random place. No opportunity to debug directly, as everything was binary only and only crashing on this guy's machine.

I spent days working painstakingly through hundreds of millions of lines of API call traces, until eventually I found what seemed to be a difference between his logs and mine. In his logs, some registry reads were failing, and in mine they worked. But why?

It turned out that the guy had been lying to us. He hadn't actually installed the app using the Wine downloads from winehq.org, he'd installed it from the Debian repositories. The packages provided by Debian were badly broken: they had split various tools out into a separate -utils package which wasn't installed by default because that complied with Debian standards better. But that was an error because Windows doesn't care about Debian standards and those tools aren't optional there, so many programs assumed those tools were always available. One of them was regedit.exe, which this app's installer was running with some flags to add default registry entries. On Windows this would never fail, so the installer didn't check the error codes and the install failure was silent. And then the app didn't check the error codes when reading the entries either, because again, that would never fail on Windows. So the reads silently did nothing, the memory the app expected to be initialized wasn't, it tried to use it and corrupted its heap which then led to a random crash about a million API calls away. The original failure wasn't even in the logs I was looking at.

At the time we had an explicit policy of not supporting anyone who installed Wine from their distribution packages, exactly because of bugs like this. Instead the project provided its own apt repositories. The distro-centric model Linux used was just broken, because it led to packagers who weren't part of the upstream communities "fixing" software they didn't understand as they packaged it. The notorious SSH bug was another case of that, but such stories are commonplace. Debian users in particular were hard to deal with because the large set of community-built packages was part of the distro's appeal and moat, even though upstream developers often hated it (lots of obsolete bug reports or distro-created bugs). So they had become defensive, and some had taken to deceiving upstreams when filing bugs because they thought they knew better.

Needless to say, a multi-day memory corruption debugging session that ended with "there is no bug, follow the install instructions on our website and stop lying to us about it" was by far the most annoying bug I ever had to work on.

  • ZevsVultAveHera 25 days ago

    I feel you. People who don't tell the truth in their bug reports waste the time of developers and tech support far too often.

billconan a month ago

A rendering corruption issue, or a Wayland performance issue that involved 100 processes.