[Linux-aus] pcie errors
Russell Coker
russell at coker.com.au
Sat Oct 11 14:56:53 AEDT 2025
On Saturday, 11 October 2025 12:43:27 AEDT Al Maclang wrote:
> 1. What the messages mean
> Lines like:
> pcieport 0000:00:02.0: AER: Correctable error message received
> amdgpu 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer
>
> show correctable PCIe Data Link Layer errors.
>
> These are not fatal — the link detected a bad packet (BadTLP, Timeout,
> Rollover), retried it, and succeeded. So the system continued working fine,
> but it logged the noise.
However, it didn't work fine: it displayed visual artifacts when playing a
movie and caused at least one lockup of KDE.
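
A minimal sketch, assuming the kernel log uses the "PCIe Bus Error" line
format quoted above, of tallying such messages per device to see how often
each one reports them. The name count_aer_errors is hypothetical and the
regular expression only covers that one line shape.

```python
import re
from collections import Counter

# Matches lines such as:
#   amdgpu 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer
# Assumption: the device address is lower-case hex in domain:bus:dev.fn form.
AER_LINE = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): PCIe Bus Error: "
    r"severity=(?P<severity>[^,]+), type=(?P<layer>[^,\n]+)"
)


def count_aer_errors(log_text):
    """Count 'PCIe Bus Error' lines per (device, severity, layer)."""
    counts = Counter()
    for m in AER_LINE.finditer(log_text):
        key = (
            m.group("dev"),
            m.group("severity").strip(),
            m.group("layer").strip(),
        )
        counts[key] += 1
    return counts
```

It could be fed the kernel log as text, for example the output of
"journalctl -k" read from standard input.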
> 2. Possible causes
>
> • Dust or poor contact — especially if you recently cleaned the system and
> reseated parts.
>
> • Slightly oxidized or stressed PCIe slot pins.
Apart from contact cleaner, what can be done about that? If I insert and
remove the GPU a few times, would that rub the oxide off?
> • Marginal PCIe link stability (e.g., too long riser cable, dirty slot, or
> flex in GPU).
There is no riser and I have confidence in HP designing things with
appropriate track limits. Dirt is a definite possibility as I don't know what
the previous owner did; the seller didn't know much about computers. Flex is
no more than expected, as GPUs always have some flex due to being big and
bulky and the tolerance stacking issues of PC expansion.
> • BIOS / firmware quirk — some boards log excessive AER spam even for
> benign link retraining.
It's running the latest BIOS update and I'm confident in the ability of HP
programmers to do this correctly. I've not seen any errors like this which
weren't correlated with a functionality issue.
> • Power fluctuation or grounding — the GPU draws heavily from the slot;
> small transient errors can appear.
I have an AMD GPU. radeontop reports 63% video RAM use, 100% memory clock, and
about 70% shader clock. The GPU has no PCIe power cable (I originally bought
it for a system which had a small PSU). With the low power GPU and lack of
hard drives I think I'm well below the limit of the PSU.
> • Q2. Is reseating the CPU the thing to do for that?
>
> A2. Not yet. Start smaller:
>
> - Reseat the GPU, not the CPU.
>
> - Check the PCIe slot for dust or bent pins.
>
> - Clean contacts with isopropyl alcohol.
>
> - Reseat power connectors (GPU & motherboard).
OK thanks for these suggestions. At the moment I'll leave it running until it
happens again. If/when it happens again I'll start with reseating the GPU
(which I really should have done previously) and go through the list.
https://www.bunnings.com.au/wd-40-290g-specialist-fast-drying-contact-cleaner_p6100409
How does isopropyl alcohol compare with the WD-40 contact cleaner from
Bunnings?
> Only consider reseating the CPU after ruling those out — because removing
> the CPU introduces more risk (bent pins, paste reapplication, etc.).
https://en.wikipedia.org/wiki/Land_grid_array
The CPU in question is LGA and I accidentally touched the contacts. Should I
have washed it with isopropyl alcohol after that?
On my previous swap of that CPU I used the "buttering the toast" method of
spreading the heatsink paste and made it too thick, so it got everywhere.
It took a lot of cleaning.
> 4. What to try next (in order)
>
> • Monitor – if you only see a few correctable errors and the system runs
> fine, you can safely ignore them.
>
> • Reseat GPU – pull it out, clean contacts, reseat firmly.
>
> • Inspect slot & connectors – look for dust, corrosion, flexing.
How would I inspect a PCIe slot? I literally can't even see the contacts.
Nowadays I use my phone with 3x magnification to read the ID numbers of small
parts like NVMe devices.
> • Check BIOS updates – sometimes AER handling or signal tuning is improved.
>
> • Try another PCIe slot – if supported and doesn’t affect performance.
>
> • Disable AER logging (optional) – to reduce syslog spam (pci=noaer kernel
> parameter).
That's the one all the Google results gave, which isn't suitable in my case
as I have had functionality problems with the system caused by PCIe errors.
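
As an alternative to silencing the log, here is a minimal sketch of watching
the error counters directly. It assumes a kernel that exposes per-device AER
statistics in sysfs as an aer_dev_correctable file with one "name count" pair
per line. The function names are hypothetical, and the default device address
is just the GPU address from the log lines above.

```python
from pathlib import Path


def parse_aer_counters(text):
    """Parse 'NAME COUNT' lines into a dict; names may contain spaces."""
    counters = {}
    for line in text.splitlines():
        # Split on the last space so multi-word names stay intact.
        name, _, value = line.strip().rpartition(" ")
        if name and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def read_correctable(bdf="0000:02:00.0", sysfs_root="/sys/bus/pci/devices"):
    """Read the correctable-error counters for one PCI device.

    Assumption: the file exists only when the kernel has AER support
    and the device reports AER capability.
    """
    path = Path(sysfs_root) / bdf / "aer_dev_correctable"
    return parse_aer_counters(path.read_text())


def new_errors(before, after):
    """Return only the counters that increased between two readings."""
    return {
        name: count - before.get(name, 0)
        for name, count in after.items()
        if count > before.get(name, 0)
    }
```

Taking one reading before starting a movie and another after a glitch, then
passing both to new_errors, would show which counters moved in between.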
> • If it persists or worsens (uncorrectable errors, lockups) – then you
> might suspect board or GPU hardware degradation.
>
> 5. Take away: why it’s quieter now?
>
> “Before cleaning out the dust the system was annoyingly loud when BOINC was
> using 9 CPU cores; now I can hardly hear it even at full load.”
>
> That’s a great sign — dust removal improved thermal performance, so fans
> spin slower for the same workload. It might also indicate improved airflow
> around the PCIe slot (which can indirectly reduce link temperature and
> errors too).
I didn't deliberately blow dust out of the GPU, but I sprayed it around the
area so it would have got some of it. In retrospect I should have
deliberately cleaned that out too.
Here's what lm-sensors gives about the GPU:
amdgpu-pci-0200
Adapter: PCI adapter
vddgfx: 850.00 mV
fan1: 1314 RPM (min = 1800 RPM, max = 6900 RPM)
edge: +67.0°C (crit = +97.0°C, hyst = -273.1°C)
PPT: 10.13 W (cap = 32.00 W)
pwm1: 24%
sclk: 955 MHz
mclk: 2 GHz
> Conclusion:
>
> • Yes, those are correctable PCIe link errors (not kernel bugs).
>
> • Start with reseating/cleaning the GPU and connectors, not the CPU.
>
> • If they persist, try another slot or BIOS update.
>
> • Only worry if you start seeing uncorrectable errors or crashes.
Thanks. I have seen crashes, so I am worried. I also have a spare z640, so
replacing the entire system is a possibility. I could sell the current one
to someone who wants a SOHO server if it's only a GPU issue.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/