[Linux-aus] pcie errors

Sat Oct 11 12:43:27 AEDT 2025

Hi mate, this looks like a classic PCIe AER (Advanced Error Reporting) log
dump. Please refer to the details below.

1. What the messages mean
Lines like:
pcieport 0000:00:02.0: AER: Correctable error message received from 0000:00:02.0
amdgpu 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)

show correctable PCIe Data Link Layer errors.

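These are not fatal. The link saw a corrupted packet (BadTLP), a replay
timer that expired before an acknowledgement arrived (Timeout), or a replay
counter that overflowed after repeated retries (Rollover), and the hardware
retransmitted the data. Each transfer was recovered, so the kernel only logs
the event, but a sustained burst like yours points to a marginal link, which
fits the hang and screen corruption you saw.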
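
As a rough illustration of how the status/mask values in your log map to
those error names, here is a small Python sketch. The bit numbers for
BadTLP (6), Rollover (8) and Timeout (12) are taken from your log; the
other names in the table are filled in from memory of the AER correctable
status register and should be checked against the spec. The function name
decode_correctable is just something I made up for this example.

```python
# Bit positions in the PCIe AER Correctable Error Status register.
# Bits 6, 8 and 12 are confirmed by the quoted log; the remaining
# names are an assumption and should be verified against the spec.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable(status_mask: str) -> list[str]:
    """Decode a 'status/mask' pair such as '00001040/00002000'."""
    status_hex, mask_hex = status_mask.split("/")
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    names = []
    for bit in range(32):
        if status & (1 << bit):
            name = CORRECTABLE_BITS.get(bit, f"bit{bit}")
            if mask & (1 << bit):
                # A masked error is recorded but not reported upstream.
                name += " (masked)"
            names.append(f"[{bit:2d}] {name}")
    return names


# Values copied from the root port and GPU entries in the quoted log.
print(decode_correctable("00001040/00002000"))
print(decode_correctable("00001100/00002000"))
```

The first value decodes to BadTLP plus Timeout and the second to Rollover
plus Timeout, matching the detail lines the kernel printed.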

Key points:

• 0000:00:02.0 → PCIe root port (on this machine the CPU's Root Port 2, per
your lspci line)

• 0000:02:00.0 → GPU (in your case AMD)

• 0000:02:00.1 → GPU’s onboard audio device
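
If it helps to see how those addresses break down, this is a minimal
Python sketch of the domain:bus:device.function layout. PciAddress and
parse_bdf are names invented for the example, not part of any tool.

```python
from typing import NamedTuple


class PciAddress(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


def parse_bdf(text: str) -> PciAddress:
    """Split 'DDDD:BB:dd.f' (all fields hexadecimal) into its parts."""
    domain, bus, devfn = text.split(":")
    device, function = devfn.split(".")
    return PciAddress(
        int(domain, 16), int(bus, 16), int(device, 16), int(function, 16)
    )


gpu = parse_bdf("0000:02:00.0")
audio = parse_bdf("0000:02:00.1")

# Same domain, bus and device: two functions of one physical card,
# so both sit behind the same link to the root port.
print(gpu[:3] == audio[:3])
```

That is why the GPU and its audio function report the same errors: they
share one physical link.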

These errors often reflect electrical signal quality issues on the PCIe
lane between CPU and GPU — not necessarily a bad CPU or motherboard.

2. Possible causes

• Dust or poor contact — especially if you recently cleaned the system and
reseated parts.

• Slightly oxidized or stressed PCIe slot pins.

• Marginal PCIe link stability (e.g., too long riser cable, dirty slot, or
flex in GPU).

• BIOS / firmware quirk — some boards log excessive AER spam even for
benign link retraining.

• Power fluctuation or grounding — the GPU draws heavily from the slot;
small transient errors can appear.
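
One way to look for a marginal link is to compare the negotiated link speed
and width with the maximum the device supports. The sketch below reads the
current_link_speed, max_link_speed, current_link_width and max_link_width
attributes that Linux exposes under /sys/bus/pci/devices/. The helper names
link_state and is_downtrained are made up for this example, and a lower
speed is not proof of a fault: many GPUs drop the link speed at idle to
save power, so check while the card is busy.

```python
from pathlib import Path

ATTRS = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def link_state(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> dict[str, str]:
    """Read the PCIe link attributes for one device, e.g. '0000:02:00.0'."""
    dev = Path(sysfs_root) / bdf
    state = {}
    for attr in ATTRS:
        path = dev / attr
        # Report "unknown" rather than failing on machines without the file.
        state[attr] = path.read_text().strip() if path.exists() else "unknown"
    return state


def is_downtrained(state: dict[str, str]) -> bool:
    """True if the link runs below its maximum speed or width."""
    return (
        state["current_link_speed"] != state["max_link_speed"]
        or state["current_link_width"] != state["max_link_width"]
    )


state = link_state("0000:02:00.0")
print(state, "downtrained:", is_downtrained(state))
```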

3. Answers to your specific questions

• Q1. Am I right in interpreting this as a PCIe error related to the CPU
root port?

A1. Yes. The CPU root complex (00:02.0) is the upstream reporting agent.

• Q2. Is reseating the CPU the thing to do for that?

A2. Not yet. Start smaller:

 - Reseat the GPU, not the CPU.

 - Check the PCIe slot for dust or bent pins.

 - Clean contacts with isopropyl alcohol.

 - Reseat power connectors (GPU & motherboard).

Only consider reseating the CPU after ruling those out — because removing
the CPU introduces more risk (bent pins, paste reapplication, etc.).

• Q3. Is the kernel change likely connected?

A3. Extremely unlikely. The kernel just reports what the PCIe hardware
tells it; newer kernels might log more verbosely, which can make it seem
new.

4. What to try next (in order)

• Monitor – if you only see a few correctable errors and the system runs
fine, you can safely ignore them.

• Reseat GPU – pull it out, clean contacts, reseat firmly.

• Inspect slot & connectors – look for dust, corrosion, flexing.

• Check BIOS updates – sometimes AER handling or signal tuning is improved.

• Try another PCIe slot – if one is available and using it doesn’t reduce
performance.

• Disable AER (optional) – the pci=noaer kernel parameter stops the syslog
spam, but it turns off AER handling altogether, so only use it once you are
sure the errors are harmless.

• If it persists or worsens (uncorrectable errors, lockups) – then you
might suspect board or GPU hardware degradation.
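
For the monitoring step, a short script can tally the error types per
device from saved kernel log text, so you can tell whether the rate changes
after each thing you try. This is only a sketch: the regular expression is
written against the detail lines in your log (for example "[12] Timeout"),
and the names DETAIL and tally are mine.

```python
import re
from collections import Counter

# Matches detail lines such as:
#   "... kernel: amdgpu 0000:02:00.0:    [12] Timeout"
DETAIL = re.compile(
    r"(\S+) ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+\[\s*(\d+)\]\s+(\w+)"
)


def tally(log_text: str) -> Counter:
    """Count AER detail lines per (device address, error name)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DETAIL.search(line)
        if match:
            _driver, bdf, _bit, name = match.groups()
            counts[(bdf, name)] += 1
    return counts


# A few lines copied from the quoted log.
SAMPLE = """\
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [ 6] BadTLP
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [12] Timeout
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [12] Timeout
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Correctable error message received from 0000:00:02.0
"""

for (bdf, name), count in sorted(tally(SAMPLE).items()):
    print(bdf, name, count)
```

Feed it the kernel messages from before and after reseating the GPU and
compare the totals.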

5. Takeaway: why it’s quieter now

“Before cleaning out the dust the system was annoyingly loud when BOINC was
using 9 CPU cores. Now I can hardly hear it when BOINC is using all 18
cores.”

That’s a great sign — dust removal improved thermal performance, so fans
spin slower for the same workload. It might also indicate improved airflow
around the PCIe slot (which can indirectly reduce link temperature and
errors too).

Conclusion:

• Yes, those are correctable PCIe link errors (not kernel bugs).

• Start with reseating/cleaning the GPU and connectors, not the CPU.

• If they persist, try another slot or BIOS update.

• Only worry if you start seeing uncorrectable errors or crashes.

I hope this helps.

Best regards,

Al Maclang

Founder & Proprietor | Al-Masih

www.al-masih.com.au

linkedin.com/in/albertomaclang <https://www.linkedin.com/in/albertomaclang/>

On Fri, 10 Oct 2025 at 11:41 pm, Russell Coker via linux-aus
<linux-aus at lists.linux.org.au> wrote:

> 00:02.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI
> Express Root Port 2 (rev 02)
>
> I had been getting the below errors on my PC (HP z640 with E5-2696v3
> CPU).
> Above is the lspci line that matches.  I have had another problem with
> that PC
> in that the 3rd DIMM slot (identified as CPU0-DIMM6) doesn't work (5 beeps
> from the BIOS on boot if a DIMM is installed).  The errors started when I
> upgraded from linux-image-6.12.27-amd64 (Debian/Testing) to linux-
> image-6.12.48+deb13-amd64 (latest Debian/Trixie update) and blasted the
> inside
> of the PC with compressed air to get 5 months of dust and fluff out of it.
>
> Some of the errors seemed to have no effect.  But one time kwin_wayland (the
> graphics program) hung and I needed to use loginctl terminate-session, and
> another time I was playing a movie with mpv and the screen repeatedly got
> corrupted in a way that appeared to be hardware decoding with missing data.
>
> My google searches for this only returned results on how to make the
> kernel
> stop displaying such warnings if the error is not a problem.  But for me
> it is
> a problem and the google hits weren't helpful.
>
> I took the CPU out and reseated it with new heatsink paste.  This has been
> reported as a solution to a problem of E5-26xx CPUs having some banks of
> RAM
> not work.  This did not affect the RAM issue, the DIMM socket could be
> damaged
> - I bought the system cheap in "unknown condition" and the previous owner
> could have damaged it.  The same CPU had previously worked correctly in a
> HP
> ML-110 Gen9 with 8 DIMMs installed.  I can't rule out the possibility that
> I
> damaged the CPU when transferring it from the ML-110 to the z640 in a way
> that
> caused the issue with one DIMM socket.
>
> The system is now working again, for the moment at least.  I completed
> watching the movie in question without screen corruption.
>
>
> What I would like from the experts here is any suggestions about things I
> may
> have missed or misunderstood.  Am I right in interpreting this as a PCIe
> error
> related to the CPU root port?  Is reseating the CPU the thing to do for
> that?
> Am I right in thinking that the change of kernel version is extremely
> unlikely
> to be connected to the problem?
>
> If the problem comes back is it likely to be caused by the CPU or the
> motherboard?
>
> If the problem comes back would it be a possible solution to never use the
> PCIe slot that the GPU is currently in?
>
> If anyone here has had such a situation before please let me know how it
> went.
>
>
> As an aside before cleaning out the dust the system was annoyingly loud
> when
> BOINC was using 9 CPU cores,  Now I can hardly hear it when BOINC is using
> all
> 18 cores and my head is within 1 meter of it.
>
> Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details for 0000:00:02.0
> Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details for 0000:00:02.0
> Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: Correctable error message received from 0000:00:02.0
> Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details for 0000:00:02.0
> Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: Correctable error message received from 0000:00:02.0
> Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details for 0000:00:02.0
> Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: Correctable error message received from 0000:00:02.0
> Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details for 0000:00:02.0
> Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: Correctable error message received from 0000:00:02.0
> Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details for 0000:00:02.0
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Multiple Correctable error message received from 0000:00:02.0
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:   device [8086:2f04] error status/mask=00001040/00002000
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [ 6] BadTLP
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [12] Timeout
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER:   Error of this Agent is reported first
> Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
> Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:   device [1002:6987] error status/mask=00001000/00002000
> Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [12] Timeout
> Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
> Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:   device [1002:aae0] error status/mask=00001000/00002000
> Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [12] Timeout
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Correctable error message received from 0000:00:02.0
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: found no error details for 0000:00:02.0
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Multiple Correctable error message received from 0000:00:02.0
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: found no error details for 0000:00:02.0
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Multiple Correctable error message received from 0000:00:02.0
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:   device [8086:2f04] error status/mask=00001040/00002000
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [ 6] BadTLP
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [12] Timeout
> Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER:   Error of this Agent is reported first
> Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
> Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:   device [1002:6987] error status/mask=00001100/00002000
> Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [ 8] Rollover
> Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [12] Timeout
> Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
> Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:   device [1002:aae0] error status/mask=00001100/00002000
> Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [ 8] Rollover
> Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [12] Timeout
>
>
> --
> My Main Blog         http://etbe.coker.com.au/
> My Documents Blog    http://doc.coker.com.au/