<div dir="auto">Hi mate, this looks like a classic PCIe AER (Advanced Error Reporting) log dump. Please refer to the details below.</div><div dir="auto"><br></div><div dir="auto">1. What the messages mean</div><div dir="auto">Lines like:</div><div dir="auto"><div>pcieport 0000:00:02.0: AER: Correctable error message received

amdgpu 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer</div></div><div dir="auto"><div>

<div>

<p dir="auto">show correctable PCIe Data Link Layer errors.</p><p dir="auto">These are not fatal: the link saw a corrupted packet (BadTLP) or a missing acknowledgement (Timeout is the replay timer expiring, Rollover is the replay counter wrapping), retransmitted, and recovered. Each error was corrected in hardware and the kernel only logged it.</p><p dir="auto">Key points:</p><p dir="auto">• 0000:00:02.0 → PCIe root port (per your lspci line, Root Port 2 on the Xeon E5 v3 CPU)</p><p dir="auto">• 0000:02:00.0 → GPU (in your case AMD)</p><p dir="auto">• 0000:02:00.1 → GPU’s onboard audio device</p>
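<p dir="auto">If it helps to read the status/mask values in your log, here is a minimal Python sketch that decodes them. The bit positions are those of the AER Correctable Error Status register and the short names are the labels the kernel prints; the function name decode_correctable is just something I made up for illustration.</p>

```python
# Bit positions in the PCIe AER Correctable Error Status register.
# The short names follow the labels the Linux kernel prints in the log.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable(value_hex):
    """Return the names of the bits set in a hex status or mask value."""
    value = int(value_hex, 16)
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Bits without a known name are shown by position (hypothetical label).
            names.append(CORRECTABLE_BITS.get(bit, f"bit{bit}"))
    return names


if __name__ == "__main__":
    # Status values copied from the log in the original message.
    for status in ("00001000", "00001040", "00001100"):
        print(status, decode_correctable(status))
    # The mask value that appears on every status/mask line.
    print("mask 00002000", decode_correctable("00002000"))
```

<p dir="auto">So 00001040 on the root port is BadTLP plus Timeout, which matches the [ 6] and [12] lines the kernel printed underneath it, and the 00002000 mask only hides the advisory non-fatal bit.</p>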

<p dir="auto">These errors usually point to marginal signal quality on the PCIe lanes between CPU and GPU, not necessarily a bad CPU or motherboard.</p><p dir="auto">2. Possible causes</p><p dir="auto">• Dust or poor contact, especially if you recently cleaned the system and reseated parts.<br></p><p dir="auto">• Slightly oxidized or stressed PCIe slot pins.</p><p dir="auto">• Marginal PCIe link stability (e.g., an overly long riser cable, a dirty slot, or GPU sag).</p><p dir="auto">• BIOS / firmware quirk: some boards log excessive AER spam even for benign link retraining.</p><p dir="auto">• Power fluctuation or grounding: the GPU draws heavily from the slot, so small transient errors can appear.</p><p dir="auto">3. Answers to your specific questions</p><p dir="auto">• Q1. Am I right in interpreting this as a PCIe error related to the CPU root port?</p><p dir="auto">A1. Yes. The root port on the CPU (00:02.0) is the upstream end of the link and the agent that reports the error.</p><p dir="auto">• Q2. Is reseating the CPU the thing to do for that?</p><p dir="auto">A2. Not as the first step. Start smaller:</p><p dir="auto"> - Reseat the GPU, not the CPU.</p><p dir="auto"> - Check the PCIe slot for dust or bent pins.</p><p dir="auto"> - Clean contacts with isopropyl alcohol.</p><p dir="auto"> - Reseat power connectors (GPU & motherboard).</p><p dir="auto">Only consider reseating the CPU after ruling those out, because removing the CPU introduces more risk (bent pins, paste reapplication, etc.).</p><p dir="auto">• Q3. Is the kernel change likely connected?</p><p dir="auto">A3. Extremely unlikely. The kernel just reports what the PCIe hardware tells it; a newer kernel might log more verbosely, which can make an old problem look new.</p><p dir="auto">4. 
What to try next (in order)</p><p dir="auto">• Monitor – if you only see a few correctable errors and the system runs fine, you can safely ignore them.</p><p dir="auto">• Reseat GPU – pull it out, clean contacts, reseat firmly.</p><p dir="auto">• Inspect slot & connectors – look for dust, corrosion, flexing.</p><p dir="auto">• Check BIOS updates – sometimes AER handling or signal tuning is improved.</p><p dir="auto">• Try another PCIe slot – if one is available and using it doesn’t cost performance.</p><p dir="auto">• Disable AER (optional) – the pci=noaer kernel parameter turns off AER altogether, which stops the syslog spam but also hides any real link problem.</p><p dir="auto">• If it persists or worsens (uncorrectable errors, lockups) – then you might suspect board or GPU hardware degradation.</p><p dir="auto">5. Aside: why it’s quieter now</p><p dir="auto">“Before cleaning out the dust the system was annoyingly loud when BOINC was using 9 CPU cores; now I can hardly hear it even at full load.”</p><p dir="auto"><div><div>

<p dir="auto">That’s a great sign: dust removal improved cooling, so the fans spin slower for the same workload. Better airflow around the PCIe slot might also help indirectly by keeping the GPU and the link cooler.</p><p dir="auto">Conclusion:</p><p dir="auto">• Yes, those are correctable PCIe link errors (not kernel bugs).</p><p dir="auto">• Start with reseating/cleaning the GPU and connectors, not the CPU.</p><p dir="auto">• If they persist, try another slot or a BIOS update.</p><p dir="auto">• Only worry if you start seeing uncorrectable errors or crashes.</p><p dir="auto"><div>

<div>
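<p dir="auto">For the monitoring step, one rough way to see whether the errors are getting more or less frequent is to tally the per-bit detail lines from the kernel log. The sketch below is only that, a sketch: the regular expression assumes the log format shown in your message, the name tally is made up, and SAMPLE holds lines copied from your log (unwrapped onto single lines).</p>

```python
import re
from collections import Counter

# Matches the per-bit detail lines, for example:
#   "... kernel: amdgpu 0000:02:00.0:    [12] Timeout"
# Assumption: the log format is the one shown in the original message.
DETAIL = re.compile(
    r"kernel: \S+ (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):"
    r"\s+\[\s*(?P<bit>\d+)\]\s+(?P<name>\w+)"
)


def tally(lines):
    """Count (device address, error name) pairs across kernel log lines."""
    counts = Counter()
    for line in lines:
        match = DETAIL.search(line)
        if match:
            counts[(match.group("bdf"), match.group("name"))] += 1
    return counts


# Lines taken from the log in the original message.
SAMPLE = [
    "Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Multiple Correctable error message received from 0000:00:02.0",
    "Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [ 6] BadTLP",
    "Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [12] Timeout",
    "Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [ 8] Rollover",
    "Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [12] Timeout",
    "Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [ 8] Rollover",
    "Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [12] Timeout",
]

if __name__ == "__main__":
    for (bdf, name), count in sorted(tally(SAMPLE).items()):
        print(bdf, name, count)
```

<p dir="auto">To run it on real data you could feed it the lines of a saved kernel log instead of SAMPLE. If the counts stay flat after reseating the GPU, that is a good sign; if they climb, move on to the later steps.</p>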

<p><br></p><p>I hope this helps.</p>

<p>Best regards,</p>

<p>Al Maclang</p>

<p>Founder & Proprietor | Al-Masih</p>

<p dir="auto"><a href="https://www.al-masih.com.au">www.al-masih.com.au</a></p>

<p dir="auto"><a href="https://www.linkedin.com/in/albertomaclang/">linkedin.com/in/albertomaclang</a></p>

</div>

</div></p></div></div></p></div></div><div><div><blockquote style="margin:0px 0px 0px 15px;color:rgb(255,255,255)"><br></blockquote></div></div></div><div><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Fri, 10 Oct 2025 at 11:41 pm, Russell Coker via linux-aus <<a href="mailto:linux-aus@lists.linux.org.au">linux-aus@lists.linux.org.au</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)">00:02.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI <br>

Express Root Port 2 (rev 02)<br>

<br>

I had been getting the below errors on my PC (HP z640 with E5-2696v3 CPU).  <br>

Above is the lspci line that matches.  I have had another problem with that PC <br>

in that the 3rd DIMM slot (identified as CPU0-DIMM6) doesn't work (5 beeps <br>

from the BIOS on boot if a DIMM is installed).  The errors started when I <br>

upgraded from linux-image-6.12.27-amd64 (Debian/Testing) to <br>

linux-image-6.12.48+deb13-amd64 (latest Debian/Trixie update) and blasted the inside <br>

of the PC with compressed air to get 5 months of dust and fluff out of it.<br>

<br>

Some of the errors seemed to have no effect.  But one time kwin_wayland (the <br>

graphics program) hung and I needed to use loginctl terminate-session, and <br>

another time I was playing a movie with mpv and the screen repeatedly got <br>

corrupted in a way that appeared to be hardware decoding with missing data.<br>

<br>

My google searches for this only returned results on how to make the kernel <br>

stop displaying such warnings if the error is not a problem.  But for me it is <br>

a problem and the google hits weren't helpful.<br>

<br>

I took the CPU out and reseated it with new heatsink paste.  This has been <br>

reported as a solution to a problem of E5-26xx CPUs having some banks of RAM <br>

not work.  This did not affect the RAM issue, the DIMM socket could be damaged <br>

- I bought the system cheap in "unknown condition" and the previous owner <br>

could have damaged it.  The same CPU had previously worked correctly in a HP <br>

ML-110 Gen9 with 8 DIMMs installed.  I can't rule out the possibility that I <br>

damaged the CPU when transferring it from the ML-110 to the z640 in a way that <br>

caused the issue with one DIMM socket.<br>

<br>

The system is now working again, for the moment at least.  I completed <br>

watching the movie in question without screen corruption.<br>

<br>

<br>

What I would like from the experts here is any suggestions about things I may <br>

have missed or misunderstood.  Am I right in interpreting this as a PCIe error <br>

related to the CPU root port?  Is reseating the CPU the thing to do for that?  <br>

Am I right in thinking that the change of kernel version is extremely unlikely <br>

to be connected to the problem?<br>

<br>

If the problem comes back is it likely to be caused by the CPU or the <br>

motherboard?<br>

<br>

If the problem comes back would it be a possible solution to never use the <br>

PCIe slot that the GPU is currently in?<br>

<br>

If anyone here has had such a situation before please let me know how it went.<br>

<br>

<br>

As an aside before cleaning out the dust the system was annoyingly loud when <br>

BOINC was using 9 CPU cores.  Now I can hardly hear it when BOINC is using all <br>

18 cores and my head is within 1 meter of it.<br>

<br>

Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details <br>

for 0000:00:02.0<br>

Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details <br>

for 0000:00:02.0<br>

Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: Correctable error <br>

message received from 0000:00:02.0<br>

Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details <br>

for 0000:00:02.0<br>

Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: Correctable error <br>

message received from 0000:00:02.0<br>

Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details <br>

for 0000:00:02.0<br>

Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: Correctable error <br>

message received from 0000:00:02.0<br>

Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details <br>

for 0000:00:02.0<br>

Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: Correctable error <br>

message received from 0000:00:02.0<br>

Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details <br>

for 0000:00:02.0<br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Multiple Correctable <br>

error message received from 0000:00:02.0<br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: PCIe Bus Error: <br>

severity=Correctable, type=Data Link Layer, (Transmitter ID)<br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:   device [8086:2f04] error <br>

status/mask=00001040/00002000<br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [ 6] BadTLP                <br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [12] Timeout               <br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER:   Error of this Agent <br>

is reported first<br>

Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0: PCIe Bus Error: <br>

severity=Correctable, type=Data Link Layer, (Transmitter ID)<br>

Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:   device [1002:6987] error <br>

status/mask=00001000/00002000<br>

Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [12] Timeout               <br>

Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1: PCIe Bus Error: <br>

severity=Correctable, type=Data Link Layer, (Transmitter ID)<br>

Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:   device [1002:aae0] <br>

error status/mask=00001000/00002000<br>

Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [12] Timeout               <br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Correctable error <br>

message received from 0000:00:02.0<br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: found no error details <br>

for 0000:00:02.0<br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Multiple Correctable <br>

error message received from 0000:00:02.0<br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: found no error details <br>

for 0000:00:02.0<br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Multiple Correctable <br>

error message received from 0000:00:02.0<br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: PCIe Bus Error: <br>

severity=Correctable, type=Data Link Layer, (Transmitter ID)<br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:   device [8086:2f04] error <br>

status/mask=00001040/00002000<br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [ 6] BadTLP                <br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [12] Timeout               <br>

Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER:   Error of this Agent <br>

is reported first<br>

Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0: PCIe Bus Error: <br>

severity=Correctable, type=Data Link Layer, (Transmitter ID)<br>

Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:   device [1002:6987] error <br>

status/mask=00001100/00002000<br>

Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [ 8] Rollover              <br>

Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [12] Timeout               <br>

Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1: PCIe Bus Error: <br>

severity=Correctable, type=Data Link Layer, (Transmitter ID)<br>

Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:   device [1002:aae0] <br>

error status/mask=00001100/00002000<br>

Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [ 8] Rollover              <br>

Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [12] Timeout        <br>

<br>

-- <br>

My Main Blog         <a href="http://etbe.coker.com.au/" rel="noreferrer" target="_blank">http://etbe.coker.com.au/</a><br>

My Documents Blog    <a href="http://doc.coker.com.au/" rel="noreferrer" target="_blank">http://doc.coker.com.au/</a><br>


</blockquote></div></div>