[Linux-aus] CPU errors
Dan Kortschak
dan at kortschak.io
Tue Feb 4 11:44:03 AEDT 2025
On Tue, 2025-02-04 at 11:12 +1100, Russell Coker via linux-aus wrote:
> https://www.theregister.com/2021/06/04/google_chip_flaws/
>
> I've been considering this issue of flaws in CPUs ever since it first
> was reported 4 years ago.
>
> https://arxiv.org/pdf/2102.11245
>
> There is no good data published about how common such problems are
> but Facebook states "hundreds" of CPUs out of "hundreds of thousands"
> of systems which implies something like 1/1000.
>
> Over the years the number of machines I've run (with actual root
> access - not counting cloud VMs) adds up to more than 1000. I expect
> that there are people on this list who have run 1000+ machines at one
> time.
>
> If something has an incidence of 1/1000 there's a good chance that it
> has happened to a system that I run, and the probability that none of
> the systems run by people on this list have had the problem would be
> very low.
>
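As a rough check on that arithmetic, here is a minimal sketch. It
assumes each machine is affected independently and that the 1/1000
incidence applies uniformly to all of them; both are simplifying
assumptions, and the function name is only illustrative. Under those
assumptions, 1000 machines give roughly a 63% chance that at least one
was affected.

```python
def p_at_least_one(incidence: float, machines: int) -> float:
    """Chance that at least one of `machines` systems is affected.

    Assumes independent failures with the same per-machine incidence,
    which is a simplification (real fleets share CPU models and ages).
    """
    # P(none affected) = (1 - incidence) ** machines
    return 1.0 - (1.0 - incidence) ** machines


if __name__ == "__main__":
    # Fleet sizes chosen for illustration only.
    for n in (100, 1000, 5000):
        print(f"{n:>5} machines: {p_at_least_one(1 / 1000, n):.3f}")
```

So "a good chance" holds for one person's 1000 machines, and across
everyone on the list the chance that nobody has been hit gets small
quickly as the combined machine count grows.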
> Has anyone seen such things and known it? If not does that imply that
> some of us are just losing data for ourselves and our clients without
> knowing it?
>
> The nearest I've come to seeing this is a Pentium D system that I got
> from corporate rubbish back when AMD64 systems were still new and
> rare. I tried to install Debian on it and it got SEGVs on
> uncompressing packages. I replaced the RAM with no change (desktop
> system without ECC support) and then just sent it to e-waste without
> any further thought. In retrospect I should have done more research
> on that system to find out what was wrong, but at the time I was more
> focussed on getting working systems than on studying computer
> engineering.
>
> Modern CPUs have caches that are bigger than the hard drives in early
> Linux systems. The PC I'm using to write this message has 46M of CPU
> cache which is larger than the storage of the iPaQs I was running
> Linux on 20 years ago. Has anyone written a recovery image that will
> lock itself into the cache, verify checksums, and then test things
> like RAM for errors?
I have an old i5 laptop that stays up for about 2-4 hours and then
black-screens with the same kernel panic each time, which suggests a
specific logic fault that has developed in the CPU. It's used for
playing DVDs now, since that uptime is enough for that use. It worked
fine for many years, so this is presumably wear degradation, but the
linked article also covers failures of that kind.
Dan