[Linux-aus] pcie errors

Sat Oct 11 14:56:53 AEDT 2025

On Saturday, 11 October 2025 12:43:27 AEDT Al Maclang wrote:
> 1. What the messages mean
> Lines like:
> pcieport 0000:00:02.0: AER: Correctable error message received
> amdgpu 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer
> 
> show correctable PCIe Data Link Layer errors.
> 
> These are not fatal — the link detected a bad packet (BadTLP, Timeout,
> Rollover), retried it, and succeeded. So the system continued working fine,
> but it logged the noise.

However, it didn't work fine: it displayed visual artifacts when playing a 
movie and caused at least one lockup of KDE.
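
To correlate the errors with the visible problems it helps to count them 
rather than eyeball the log.  The following is a minimal sketch, assuming the 
kernel messages look like the two lines quoted above; the function name and 
the regular expression are my own and not part of any existing tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   amdgpu 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer
BUS_ERROR = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): PCIe Bus Error: "
    r"severity=(?P<severity>[^,]+), type=(?P<type>[A-Za-z ]+)"
)


def count_bus_errors(log_text):
    """Count 'PCIe Bus Error' lines per (device, severity, layer)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = BUS_ERROR.search(line)
        if m:
            key = (m["dev"], m["severity"].strip(), m["type"].strip())
            counts[key] += 1
    return counts
```

The input could be the saved output of "journalctl -k".  A count that jumps 
while a movie is playing would point at the link rather than at KDE.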

> 2. Possible causes
> 
> • Dust or poor contact — especially if you recently cleaned the system and
> reseated parts.
> 
> • Slightly oxidized or stressed PCIe slot pins.

Apart from contact cleaner what can be done about that?  If I insert and 
remove the GPU a few times would that rub oxide off?

> • Marginal PCIe link stability (e.g., too long riser cable, dirty slot, or
> flex in GPU).

There is no riser and I have confidence in HP designing things with 
appropriate track limits.  Dirt is a definite possibility as I don't know what 
the previous owner did - the seller didn't know much about computers.  Flex is 
no more than expected, GPUs always have some flex due to being big and bulky 
and the tolerance stacking issues of PC expansion.
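
One way to check for a marginal link without opening the case is to compare 
the negotiated link speed and width with the maximum the device supports.  
This is a rough sketch, assuming the kernel exposes the usual 
current_link_speed, max_link_speed, current_link_width and max_link_width 
files in sysfs for the GPU at 0000:02:00.0; the function names are my own.

```python
from pathlib import Path


def parse_speed(text):
    """Return the GT/s figure from a string such as '8.0 GT/s PCIe'.

    Raises ValueError if the kernel reports something non-numeric.
    """
    return float(text.split()[0])


def link_is_degraded(cur_speed, max_speed, cur_width, max_width):
    """True if the link negotiated a lower speed or fewer lanes than supported."""
    slower = parse_speed(cur_speed) < parse_speed(max_speed)
    narrower = int(cur_width) < int(max_width)
    return slower or narrower


def read_link(device="0000:02:00.0", root="/sys/bus/pci/devices"):
    """Read the four link attributes for one PCI device from sysfs."""
    base = Path(root) / device
    names = ("current_link_speed", "max_link_speed",
             "current_link_width", "max_link_width")
    return {name: (base / name).read_text().strip() for name in names}
```

A lower speed is not proof of a fault, as GPUs can drop the link speed at idle 
to save power, so the comparison is only meaningful under load.  A reduced 
width would be more suspicious.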

> • BIOS / firmware quirk — some boards log excessive AER spam even for
> benign link retraining.

It's running the latest BIOS update and I'm confident in the ability of HP 
programmers to do this correctly.  I've not seen any errors like this which 
weren't correlated with a functionality issue.

> • Power fluctuation or grounding — the GPU draws heavily from the slot;
> small transient errors can appear.

I have an AMD GPU, radeontop reports 63% video RAM use, 100% memory clock, and 
about 70% shader clock.  The GPU has no PCIe power cable (I originally bought 
it for a system which had a small PSU).  With the low power GPU and lack of 
hard drives I think I'm well below the limit of the PSU.

> • Q2. Is reseating the CPU the thing to do for that?
> 
> A2. Not yet. Start smaller:
> 
>  - Reseat the GPU, not the CPU.
> 
>  - Check the PCIe slot for dust or bent pins.
> 
>   - Clean contacts with isopropyl alcohol.
> 
>  - Reseat power connectors (GPU & motherboard).

OK thanks for these suggestions.  At the moment I'll leave it running until it 
happens again.  If/when it happens again I'll start with reseating the GPU 
(which I really should have done previously) and go through the list.

https://www.bunnings.com.au/wd-40-290g-specialist-fast-drying-contact-cleaner_p6100409

How does isopropyl alcohol compare with the WD-40 contact cleaner from 
Bunnings?

> Only consider reseating the CPU after ruling those out — because removing
> the CPU introduces more risk (bent pins, paste reapplication, etc.).

https://en.wikipedia.org/wiki/Land_grid_array

The CPU in question is LGA and I accidentally touched the contacts.  Should I 
have washed it with isopropyl alcohol after that?

On my previous swap of that CPU I used the "buttering the toast" method of 
spreading the heatsink paste, made it too thick, and it got everywhere.  It 
took a lot of cleaning.

> 4. What to try next (in order)
> 
> • Monitor – if you only see a few correctable errors and the system runs
> fine, you can safely ignore them.
> 
> • Reseat GPU – pull it out, clean contacts, reseat firmly.
> 
> • Inspect slot & connectors – look for dust, corrosion, flexing.

How would I inspect a PCIe slot?  I literally can't even see the contacts.  
Nowadays I use my phone with 3x magnification to read the ID numbers of small 
parts like NVMe devices.

> • Check BIOS updates – sometimes AER handling or signal tuning is improved.
> 
> • Try another PCIe slot – if supported and doesn’t affect performance.
> 
> • Disable AER logging (optional) – to reduce syslog spam (pci=noaer kernel
> parameter).

That's the one all the Google results gave, which isn't suitable for my case as 
I have had functionality problems with the system caused by PCIe errors.
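
Rather than hiding the messages, the per-device AER counters can be watched.  
A minimal sketch, assuming the kernel provides the aer_dev_correctable file in 
sysfs for the device (one "name count" pair per line); the function names are 
my own.

```python
from pathlib import Path


def parse_aer_stats(text):
    """Parse 'name count' lines, e.g. 'BadTLP 3', into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def new_errors(before, after):
    """Return only the counters that increased between two snapshots."""
    return {name: count - before.get(name, 0)
            for name, count in after.items()
            if count > before.get(name, 0)}


def read_aer_stats(device="0000:02:00.0", kind="aer_dev_correctable"):
    """Read one AER statistics file for a PCI device from sysfs."""
    path = Path("/sys/bus/pci/devices") / device / kind
    return parse_aer_stats(path.read_text())
```

Taking a snapshot before and after reseating the GPU, or before and after 
playing a movie, would show whether a change made any difference.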

> • If it persists or worsens (uncorrectable errors, lockups) – then you
> might suspect board or GPU hardware degradation.
> 
> 5. Take away: why it’s quieter now?
> 
> “Before cleaning out the dust the system was annoyingly loud when BOINC was
> using 9 CPU cores; now I can hardly hear it even at full load.”
> 
> That’s a great sign — dust removal improved thermal performance, so fans
> spin slower for the same workload. It might also indicate improved airflow
> around the PCIe slot (which can indirectly reduce link temperature and
> errors too).

I didn't deliberately blow dust out of the GPU, but I sprayed air around that 
area so it would have got some of the dust.  In retrospect I should have 
deliberately cleaned that out too.

Here's what lm-sensors gives about the GPU:

amdgpu-pci-0200
Adapter: PCI adapter
vddgfx:      850.00 mV 
fan1:        1314 RPM  (min = 1800 RPM, max = 6900 RPM)
edge:         +67.0°C  (crit = +97.0°C, hyst = -273.1°C)
PPT:          10.13 W  (cap =  32.00 W)
pwm1:             24%
sclk:         955 MHz 
mclk:           2 GHz 
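
To put those numbers in context, here is a small sketch that works out the 
margins from the output above.  The parsing is my own and assumes the 
"label: value (limit = value)" layout shown; it is not a general parser for 
sensors output.

```python
import re

SENSORS_OUTPUT = """\
amdgpu-pci-0200
Adapter: PCI adapter
vddgfx:      850.00 mV
fan1:        1314 RPM  (min = 1800 RPM, max = 6900 RPM)
edge:         +67.0°C  (crit = +97.0°C, hyst = -273.1°C)
PPT:          10.13 W  (cap =  32.00 W)
pwm1:             24%
sclk:         955 MHz
mclk:           2 GHz
"""

NUMBER = r"([+-]?\d+(?:\.\d+)?)"


def field(text, label, limit=None):
    """Return the reading for `label`, or the named limit in its parentheses."""
    line = next(l for l in text.splitlines() if l.startswith(label + ":"))
    if limit is None:
        pattern = r":\s*" + NUMBER
    else:
        pattern = re.escape(limit) + r"\s*=\s*" + NUMBER
    return float(re.search(pattern, line).group(1))


def summarise(text):
    """Compute temperature margin, power headroom and a fan-speed check."""
    return {
        "temp_margin_c": field(text, "edge", "crit") - field(text, "edge"),
        "power_headroom_w": field(text, "PPT", "cap") - field(text, "PPT"),
        "fan_below_min": field(text, "fan1") < field(text, "fan1", "min"),
    }
```

For this output that gives 30 degrees of margin below the critical 
temperature and about 21.9 W of headroom below the power cap.  The fan reading 
is below the reported minimum, which may just reflect how the driver reports 
its limits.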

> Conclusion:
> 
> • Yes, those are correctable PCIe link errors (not kernel bugs).
> 
> • Start with reseating/cleaning the GPU and connectors, not the CPU.
> 
> • If they persist, try another slot or BIOS update.
> 
> • Only worry if you start seeing uncorrectable errors or crashes.

Thanks.  I have seen crashes so I am worried.  I also have a spare z640 
so replacing the entire system is a possibility.  I could sell the current one 
to someone who wants a SOHO server if it's only a GPU issue.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/