When acpi-cpufreq fails.

The majority of modern CPUs that support CPU scaling now use a common driver (acpi-cpufreq). Judging by the search queries that hit my blog, and the amount of mail I get on the subject, many people are hitting one failure mode of this driver that isn’t well documented anywhere.

The failure mode looks like this:

$ modprobe acpi-cpufreq
FATAL: Error inserting (/lib/modules/…/kernel/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.ko): No such device

Not particularly informative. We don’t spit out anything helpful to dmesg either. So what is the cause of this problem?
In many cases, /proc/cpuinfo shows that the CPU supports SpeedStep (the ‘est’ flag). The answer in nearly all of these cases is the BIOS. The ACPI tables in the BIOS list which P-states a particular CPU supports. If your CPU was manufactured after your BIOS was written, the tables probably won’t describe it, and the driver has nothing to work with. Sometimes there are BIOS updates on the motherboard manufacturer’s website that add support for newer processors. Sometimes we aren’t so lucky, and in those cases there’s nothing we can do.
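For anyone who wants to check the ‘est’ flag without eyeballing the output, here is a minimal sketch in Python. The function name cpus_with_flag and the sample text are made up for illustration. Note that it only tells you what the CPU advertises; it says nothing about whether the BIOS exports usable P-states, which is the actual problem described above.

```python
def cpus_with_flag(cpuinfo_text, flag="est"):
    """Map each processor number to True/False: does its flags line list `flag`?

    Hypothetical helper; expects text in the format of /proc/cpuinfo on x86.
    """
    result = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        if key == "processor":
            current = int(value)
        elif key == "flags" and current is not None:
            # Split on whitespace so "est" never matches inside another flag name.
            result[current] = flag in value.split()
    return result


# Made-up sample in the same shape as /proc/cpuinfo, two processors.
SAMPLE = (
    "processor\t: 0\n"
    "flags\t\t: fpu vme est tm2\n"
    "\n"
    "processor\t: 1\n"
    "flags\t\t: fpu vme\n"
)

print(cpus_with_flag(SAMPLE))
```

On a real machine you would pass it the contents of /proc/cpuinfo instead of the sample string.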

There is another possibility for the error message above: kernel bugs. We have introduced bugs in the ACPI interpreter in the past which have broken parsing of the P-states on some platforms. These kinds of bugs tend to get noticed very quickly, and fixed in equally short time, but it’s worth making the point that it’s important to be running the latest kernel version before reporting bugs.

SATA disasters with the Silicon Image 3114

I spent way too much time over the last few days chasing bugs which turned out to have nothing to do with Linux.
I bought a SATA controller which arrived just before the weekend. It seems there is a fundamental flaw with the Silicon Image 3114 chips. Or to be more precise, with the firmware on some of the boards using this chip.

This thread is a summary of all manner of problems with it, but in short, it corrupts data past a certain block number. This took a lot of tracking down (and badblocks takes forever to run in destructive mode).

There is mention in that thread that a firmware update fixes the problem. Unfortunately, the DOS-based flasher program seems completely unable to even write to my card.

I guess I’ll only use this controller for smaller disks, unless someone comes up with a workaround.

Last post on leap seconds.

I thought I was done with this. Then, today I saw this. To the best of my knowledge, Fedora 8 didn’t suffer from the bug I originally described several posts ago. I think this one happening at nearly midnight UTC is coincidence.

There’s a “me too” in the comments, but it seems odd that two people on Slashdot saw it while we never heard a peep on the Fedora mailing lists, in bugzilla, or even upstream at kernel.org. The story is unsurprisingly short on details. I guess Slashdot stories are easier to write than bug reports, but without additional debugging info we won’t ever know. Bear in mind that the last time we saw a crash of this nature, it didn’t affect everyone either.

It was only by chance I managed to catch the backtrace in the ’06 crash. I actually had two locked up machines, but one had its screen blanked, and wouldn’t unblank. The other machine had blanking disabled (setterm -blank 0) and thankfully, had also been set up to use a VGA screen resolution, so it had plenty of lines to display the whole backtrace.

Update: a problem has been found, and fixed.

More on leap seconds.

Jesse Keating made a comment in my previous post on leap seconds, which I thought was worth highlighting in another post, for the benefit of those who don’t read the comments.

This is why rarely executed codepaths suck. Whilst it is tempting to gloat over another Microsoft failure, this could easily have been any other OS. I already mentioned that Linux had suffered something similar once. A bug like this in consumer devices is nightmarish, but imagine if such a bug ended up in something more critical? “Sorry, your life support system went offline because there was a leap second”. In safety critical systems, rare codepaths are kind of terrifying.

Writing test cases for bugs like this is also not particularly fun. You’d have to have a fake ntp server for testing the rare case.
Now think about all the other potential ‘only runs once every blue moon’ codepaths in your apps, and imagine the effort required to write test plans for all of them. Not impossible, but certainly a lot of potential job security there for QA folks. Just like fuzz-testing, traditional coverage-testing by just running common workloads isn’t a panacea when there are variables outside your control.
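One cheaper alternative to a fake ntp server, sketched here in Python purely as an illustration (this is not how the kernel or ntpd is structured, and the names advance and run_day_end are made up): make the “insert a leap second” decision an explicit input to the timekeeping code, so a test can force the rare path on demand instead of waiting for it.

```python
SECONDS_PER_DAY = 86400


def advance(second_of_day, insert_leap):
    """Advance a toy clock by one second.

    Returns (new_second_of_day, day_rolled_over). The value 86400 stands
    for the inserted 23:59:60 and is only reachable when insert_leap is set.
    """
    last = SECONDS_PER_DAY if insert_leap else SECONDS_PER_DAY - 1
    if second_of_day >= last:
        return 0, True
    return second_of_day + 1, False


def run_day_end(insert_leap, start=SECONDS_PER_DAY - 2, ticks=4):
    """Drive the toy clock across midnight and record every state it visits."""
    states = []
    sec = start
    for _ in range(ticks):
        sec, rolled = advance(sec, insert_leap)
        states.append((sec, rolled))
    return states


# The common path and the rare path, both exercised on demand.
print(run_day_end(insert_leap=False))
print(run_day_end(insert_leap=True))
```

The point is only that the rare branch gets run every time the tests do, rather than twice a year in production.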

What’s still puzzling to me, though: the Zunes died several hours before 00:00:00 UTC.
Quirk of MSFT’s ntp implementation, I guess. *shrug*

Leap seconds.

Tonight, a leap second will occur. After 23:59:59, we have 23:59:60 before rolling over to 00:00:00. Most people won’t even notice. Most electronic devices won’t notice. Those unaware of the event (like the clock on my microwave oven) will end up a second ahead of UTC. (Not that it really matters: it doesn’t display seconds on its clock, and I surely wasn’t second-accurate when I set it.)
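To make the arithmetic concrete: a day with an inserted leap second has 86401 seconds instead of 86400, and the extra one is labelled 23:59:60. Python’s own datetime.time refuses a seconds value of 60, so this little sketch formats the time of day by hand. The function name is made up for illustration.

```python
def format_time_of_day(second_of_day):
    """Format a second index within a UTC day as HH:MM:SS.

    Index 86400 only exists on a day with an inserted leap second,
    and is shown as 23:59:60.
    """
    if second_of_day == 86400:
        return "23:59:60"
    if not 0 <= second_of_day < 86400:
        raise ValueError("second_of_day out of range")
    hours, rem = divmod(second_of_day, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# The last two seconds of a leap-second day, then the rollover.
for s in (86399, 86400, 0):
    print(format_time_of_day(s))
```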

Of slightly more concern are the more clever devices. The devices that are aware of leap seconds know when to insert one. On these internet-connected devices, ntpd tells the kernel “insert or deduct a second” as necessary.
This all sounds fairly benign, but it has been known to be problematic. For reasons I’m not entirely sure of, ntp still calls into the kernel twice a year, regardless of whether a leap second is inserted or not. So, twice a year, we end up in different code paths that we don’t execute the rest of the year.

Whilst I was travelling in June 2006, I noticed I couldn’t get at my email. A week passed before I found out on returning home that the kernel had oopsed in that code path. There was no leap second in June that year. Nor has there been in any year this decade. Thankfully, that particular oops was only fatal if you were running a build with certain debugging CONFIG options turned on (I was), so the vast majority of users never saw a problem. Here’s the fix that went into 2.6.22 for this bug.

The very few that did see the problem (I don’t recall anyone else mentioning it when I posted to lkml) likely just rebooted, with the “if it happens again, I’ll report it” mindset. Which, of course, it didn’t.

Hopefully at midnight, all will be well and that code will just do its thing with no dire consequences :-)

Update: anti-climax, just as we like it :-)

Dec 31 18:59:59 localhost kernel: Clock: inserting leap second 23:59:60 UTC
