Last post on leap seconds.

I thought I was done with this. Then, today I saw this. To the best of my knowledge, Fedora 8 didn’t suffer from the bug I originally described several posts ago. I think this one happening at nearly midnight UTC is coincidence.

There’s a “me too” in the comments, but it seems odd that two people on Slashdot saw it while we never heard a peep on the Fedora mailing lists, in Bugzilla, or even upstream at kernel.org. It could just be coincidence; the story is unsurprisingly short on details. I guess Slashdot stories are easier to write than bug reports, but without additional debugging info we won’t ever know. Bear in mind that the last time we saw a crash of this nature, it didn’t affect everyone either.

It was only by chance that I managed to catch the backtrace in the ’06 crash. I actually had two locked-up machines, but one had its screen blanked and wouldn’t unblank. The other machine had blanking disabled (setterm -blank 0) and, thankfully, had also been set up to use a VGA screen resolution, so it had plenty of lines to display the whole backtrace.

Update: a problem has been found, and fixed.

More on leap seconds.

Jesse Keating made a comment in my previous post on leap seconds, which I thought was worth highlighting in another post, for the benefit of those who don’t read the comments.

This is why rarely executed codepaths suck. Whilst it is tempting to gloat over another Microsoft failure, this could easily have been any other OS. I already mentioned that Linux had suffered something similar once. A bug like this in consumer devices is nightmarish, but imagine if such a bug ended up in something more critical? “Sorry, your life support system went offline because there was a leap second”. In safety-critical systems, rare codepaths are kind of terrifying.

Writing test cases for bugs like this is also not particularly fun. You’d have to have a fake ntp server for testing the rare case.
Now think about all the other potential ‘only runs once every blue moon’ codepaths in your apps, and imagine the effort required to write test plans for all of them. Not impossible, but certainly a lot of potential job security there for QA folks. Just like fuzz-testing, traditional coverage-testing by just running common workloads isn’t a panacea when there are variables outside your control.
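
One cheaper alternative to a fake ntp server, at least for application code, is to make the time source something a test can swap out. Here is a minimal sketch of that idea; `format_time_of_day` and `fake_source` are hypothetical names invented for illustration, not anything from a real codebase.

```python
from typing import Callable, Tuple

# A time source returns (seconds_since_midnight, leap_pending).
# Both the type and the functions below are hypothetical, for illustration.
TimeSource = Callable[[], Tuple[int, bool]]

SECS_PER_DAY = 86400


def format_time_of_day(source: TimeSource) -> str:
    """Render HH:MM:SS, including the rare 23:59:60 case."""
    secs, leap_pending = source()
    if leap_pending and secs == SECS_PER_DAY:
        # Rare path: only reachable while a leap second is being inserted.
        return "23:59:60"
    secs %= SECS_PER_DAY
    return f"{secs // 3600:02d}:{secs % 3600 // 60:02d}:{secs % 60:02d}"


def fake_source(secs: int, leap_pending: bool) -> TimeSource:
    """Build a canned time source so a test can force the rare branch."""
    return lambda: (secs, leap_pending)


if __name__ == "__main__":
    print(format_time_of_day(fake_source(86399, False)))
    print(format_time_of_day(fake_source(86400, True)))
    print(format_time_of_day(fake_source(86400, False)))
```

This only covers code you control, of course. It does nothing for the kernel or for a device whose clock handling you can't reach into.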

What’s still puzzling to me, though: the Zunes died several hours before 00:00:00 UTC.
Quirk of MSFT’s ntp implementation I guess. *shrug*

Leap seconds.

Tonight, a leap second will occur. After 23:59:59, we have 23:59:60 before rolling over to 00:00:00. Most people won’t even notice. Most electronic devices won’t notice. Those unaware of the event (like the clock on my microwave oven) will end up a second ahead. (Not that it really matters: it doesn’t display seconds in its clock, and I surely wasn’t second-accurate when I set it.)
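
As a small aside on how unevenly software treats that extra second, here is a sketch using only Python's standard library. The helper name `datetime_accepts_leap_second` is made up for this example.

```python
import datetime
import time

# The C-style parser allows a seconds field of 60, so this parses fine.
parsed = time.strptime("23:59:60", "%H:%M:%S")


def datetime_accepts_leap_second() -> bool:
    """Return True if datetime.time can hold 23:59:60 (hypothetical helper)."""
    try:
        datetime.time(23, 59, 60)
    except ValueError:
        # datetime.time only allows seconds in the range 0..59.
        return False
    return True


if __name__ == "__main__":
    print(parsed.tm_sec)
    print(datetime_accepts_leap_second())
```

So within one language's standard library, one type can carry a second numbered 60 and another refuses it outright.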

Of slightly more concern are the more clever devices: the ones that are aware of leap seconds and know when to insert one. On these internet-connected devices, ntpd tells the kernel “insert or deduct a second” as necessary.
This all sounds fairly benign, but it has been known to be problematic. For reasons I’m not entirely sure of, ntp still calls into the kernel twice a year, regardless of whether a leap second is inserted or not. So, twice a year, we end up in different code paths that we don’t execute the rest of the year.
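
For the curious, here is a rough Python model of the kind of state machine involved. The state and flag names (TIME_OK, TIME_INS, STA_INS and friends) follow the NTP kernel clock model, but the `Clock` class and its `tick` method are invented for this sketch. It is a simplification under those assumptions, not the kernel's actual code.

```python
from dataclasses import dataclass

# Names follow the NTP kernel clock model; treat the values as labels.
TIME_OK, TIME_INS, TIME_DEL, TIME_OOP, TIME_WAIT = range(5)
STA_INS, STA_DEL = 0x10, 0x20
SECS_PER_DAY = 86400


@dataclass
class Clock:
    """Hypothetical model of a clock that honours leap second requests."""
    secs: int              # seconds counter, midnight UTC at multiples of 86400
    status: int = 0        # request flags, as a time daemon might set them
    state: int = TIME_OK

    def tick(self) -> None:
        """Advance one second, then run the once-per-second leap check."""
        self.secs += 1
        if self.state == TIME_OK:
            if self.status & STA_INS:
                self.state = TIME_INS
            elif self.status & STA_DEL:
                self.state = TIME_DEL
        elif self.state == TIME_INS:
            if self.secs % SECS_PER_DAY == 0:
                self.secs -= 1      # repeat the last second: 23:59:60
                self.state = TIME_OOP
        elif self.state == TIME_DEL:
            if (self.secs + 1) % SECS_PER_DAY == 0:
                self.secs += 1      # skip 23:59:59 entirely
                self.state = TIME_WAIT
        elif self.state == TIME_OOP:
            self.state = TIME_WAIT
        elif self.state == TIME_WAIT:
            if not self.status & (STA_INS | STA_DEL):
                self.state = TIME_OK


if __name__ == "__main__":
    clock = Clock(secs=SECS_PER_DAY - 3, status=STA_INS)
    for _ in range(4):
        clock.tick()
        print(clock.secs, clock.state)
```

With no flag set, every tick takes the first branch and falls straight through. Everything below it only runs once a request has been made, which is exactly the sort of code that gets very little exercise.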

Whilst I was travelling in June 2006, I noticed I couldn’t get at my email. A week passed before I found out, on returning home, that the kernel had oopsed in that code path. There was no leap second in June that year, nor has there been a June leap second in any year this decade. Thankfully, that particular oops was only fatal if you were running a build with certain debugging CONFIG options turned on (I was), so the vast majority of users never saw a problem. Here’s the fix that went into 2.6.22 for this bug.

The very few that did see the problem (I don’t recall anyone else mentioning it when I posted to lkml) likely just rebooted, with the “if it happens again, I’ll report it” mindset, which, of course, it didn’t.

Hopefully at midnight, all will be well and that code will just do its thing with no dire consequences :-)

Update: anti-climax, just as we like it :-)

Dec 31 18:59:59 localhost kernel: Clock: inserting leap second 23:59:60 UTC