From russell at coker.com.au  Fri Feb 18 16:42:16 2022
From: russell at coker.com.au (Russell Coker)
Date: Fri, 18 Feb 2022 16:42:16 +1100
Subject: [Linux-aus] strange ATI video issue
Message-ID: <5226418.CLqGRgimMR@xev>

I have a system running typical KDE desktop stuff with Chrome and Thunderbird.
After running for a while it gets the following in /var/log/Xorg.0.log:

[928485.680] (WW) RADEON(0): flip queue failed: Device or resource busy
[928485.680] (WW) RADEON(0): Page flip failed: Device or resource busy
[928485.680] (EE) RADEON(0): present flip failed
[928485.890] (WW) RADEON(0): flip queue failed: Device or resource busy
[928485.890] (WW) RADEON(0): Page flip failed: Device or resource busy
[928485.890] (EE) RADEON(0): present flip failed

Then after that it gives messages like the following:

[928541.507] (WW) RADEON(0): flip queue failed: Cannot allocate memory
[928541.507] (WW) RADEON(0): Page flip failed: Cannot allocate memory
[928541.507] (EE) RADEON(0): present flip failed
[928542.008] (WW) RADEON(0): flip queue failed: Cannot allocate memory
[928542.008] (WW) RADEON(0): Page flip failed: Cannot allocate memory
[928542.008] (EE) RADEON(0): present flip failed

At that time normal X operations start failing, if the system is in use the
window manager blocks, if the system isn't in use then it becomes impossible
to unlock the screen (it doesn't give a password prompt and doesn't respond to
"loginctl unlock-session" commands).

This seems to be correlated with BOINC accessing the GPU when the system is
idle (I haven't yet done sufficient testing to prove this), but other systems
don't have such problems. Chrome also seems involved in the problems, but I
don't know if it's causing it or just making it noticeable.

The system with the problem is a Dell PowerEdge T320 with 96G of RAM and the
following video card according to lspci running at 2560x1440 resolution:
0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
Bonaire XTX [Radeon R7 260X/360]

A system without that problem is a HP Proliant ML110 Gen 9 with 64G of RAM and
the following video card running at 3840x2160 resolution:
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] (rev e5)

The Dell has this in the Radeon section of lspci -vv output:
        Capabilities: [200 v1] Physical Resizable BAR
                BAR 0: current size: 256MB, supported: 256MB 512MB 1GB 2GB 4GB

The HP has this in the lspci -vv output:
        Capabilities: [200 v1] Physical Resizable BAR
                BAR 0: current size: 4GB, supported: 256MB 512MB 1GB 2GB 4GB

Could a mere 256M of buffer memory be contributing to video card problems or
be a symptom of some deeper problem?

Here's an article on the resizable BAR:
https://www.tomshardware.com/news/geforce-driver-465-89-resizable-bar-support

Both those systems are running Debian/Bullseye (Stable). The problems on the
T320 started occurring about 4-8 weeks ago with no deliberate changes that
seem relevant. Of course there were new versions of Chrome and Chromium
(which certainly had other changes than just bug fixes) and bug fixes to
various Debian packages which shouldn't break things (but computers are
complex).

My current test is to deny BOINC access to X and see if that makes things more
reliable. While running BOINC with only the CPU would be OK, I'd really like
to get the GPU working as without BOINC it's entirely idle 16 hours a day.

Any ideas?

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

From quozl at laptop.org  Fri Feb 18 18:12:55 2022
From: quozl at laptop.org (James Cameron)
Date: Fri, 18 Feb 2022 18:12:55 +1100
Subject: [Linux-aus] strange ATI video issue
In-Reply-To: <5226418.CLqGRgimMR@xev>
References: <5226418.CLqGRgimMR@xev>
Message-ID: <20220218071255.GC2073664@laptop.org>

That's surprisingly similar to my week, but the details don't match up
entirely.

At a telescope we had a performance problem with a four headed
workstation, which uses XDMCP to host an X11 display from an
instrument computer, then uses x11vnc to provide that to multiple
remote observers. The limit was CPU vs GPU bandwidth. Once we had
enough display updates happening, the TCP queues for the X server
would begin to grow, and interactive latency for the VNC users would
climb from 250ms to about 10 seconds. I reproduced it with many
instances of `xclock -update 1`.

What looked like your scenario for me was that an strace of Xorg
showed bursts of -ERESTARTSYS in response to a
DRM_IOCTL_NOUVEAU_GEM_PUSHBUF as the problem grew, then ioctls began
pausing entirely.

So, entirely different graphics driver, just a similar behaviour
suggesting difficulty handling the rate of buffer pushes.

Try using strace (over SSH) to check for what code path is giving
grief.

In the end, I ran out of time and was urged to switch to a
proprietary driver. I was able to move from 96 xclocks with some lag,
to 192 xclocks with no visible lag, filling all four displays.

Kernel 5.4.0 plus Ubuntu 20.04 patches.

From russell at coker.com.au  Fri Feb 18 21:00:22 2022
From: russell at coker.com.au (Russell Coker)
Date: Fri, 18 Feb 2022 21:00:22 +1100
Subject: [Linux-aus] strange ATI video issue
In-Reply-To: <20220218071255.GC2073664@laptop.org>
References: <5226418.CLqGRgimMR@xev> <20220218071255.GC2073664@laptop.org>
Message-ID: <14692772.zFbMJxpPmS@xev>

On Friday, 18 February 2022 18:12:55 AEDT James Cameron via linux-aus wrote:
> At a telescope we had a performance problem with a four headed
> workstation, which uses XDMCP to host an X11 display from an
> instrument computer, then uses x11vnc to provide that to multiple
> remote observers. It was limited CPU vs GPU bandwidth. Once we had
> enough display updates happening, the TCP queues for the X server
> would begin to grow, and interactive latency for the VNC users would
> climb from 250ms to about 10 seconds. I reproduced it with many
> instances of `xclock -update 1`.

So you are saying that the GPU isn't fast enough to keep up with requests and
has some sort of buffer overflow and congestion control problem?

So the solution would be to have the system do less graphics stuff or get a
faster GPU?

I'm not going to use a proprietary driver.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

From quozl at laptop.org  Sat Feb 19 08:19:52 2022
From: quozl at laptop.org (James Cameron)
Date: Sat, 19 Feb 2022 08:19:52 +1100
Subject: [Linux-aus] strange ATI video issue
In-Reply-To: <14692772.zFbMJxpPmS@xev>
References: <5226418.CLqGRgimMR@xev> <20220218071255.GC2073664@laptop.org>
 <14692772.zFbMJxpPmS@xev>
Message-ID: <20220218211952.GA2101911@laptop.org>

On Fri, Feb 18, 2022 at 09:00:22PM +1100, Russell Coker wrote:
> On Friday, 18 February 2022 18:12:55 AEDT James Cameron via linux-aus wrote:
> > At a telescope we had a performance problem with a four headed
> > workstation, which uses XDMCP to host an X11 display from an
> > instrument computer, then uses x11vnc to provide that to multiple
> > remote observers. It was limited CPU vs GPU bandwidth. Once we had
> > enough display updates happening, the TCP queues for the X server
> > would begin to grow, and interactive latency for the VNC users would
> > climb from 250ms to about 10 seconds. I reproduced it with many
> > instances of `xclock -update 1`.
>
> So you are saying that the GPU isn't fast enough to keep up with
> requests and has some sort of buffer overflow and congestion control
> problem?

It wasn't possible to tell if the GPU was slow, the kernel was slow to
pass buffers, or the buffer management was faulty. But slow GPU was
the most likely.

We knew the hardware was capable of more, before we upgraded the
operating system. So it looked like a regression.

> So the solution would be to have the system do less graphics stuff
> or get a faster GPU?

We suspect we got a faster GPU without changing it, by using a driver
that engaged a proprietary thermal management algorithm. Presumably
it ran various clocks at a higher rate and knew how to shift those
clocks as needed. That's often a special sauce that is slow to be
reverse engineered.

We could have verified that with our thermal cameras, by measuring
temperature distribution across the card using the same processing
load but with different drivers.

> I'm not going to use a proprietary driver.

Indeed. I feel the same way. Yet I use proprietary silicon, so I'm
not sure where to draw the line.

From russell at coker.com.au  Tue Feb 22 15:08:13 2022
From: russell at coker.com.au (Russell Coker)
Date: Tue, 22 Feb 2022 15:08:13 +1100
Subject: [Linux-aus] Flounder: new list and updated web site,
 next meeting 5th March
Message-ID: <1995813.GcUPgKebJS@xev>

https://flounder.linux.org.au/

Above is the new URL for the web site. It has a link for an iCal file with
future meetings. I will add meetings without agenda many months ahead of time
and then replace them with proper events when we have an agenda.

This is the new list, I subscribed some people who are interested (hope
everyone wanted that).

Please forward to your friends!

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

From russell at coker.com.au  Tue Feb 22 22:27:00 2022
From: russell at coker.com.au (Russell Coker)
Date: Tue, 22 Feb 2022 22:27:00 +1100
Subject: [Linux-aus] SLUG status
Message-ID: <12192784.1c4RISRJPY@xev>

What's the status of SLUG?

https://slug.org.au/contact.html

This contact page has a broken link for Conor Buckley, the mailing list link
gives a 403, and the Meetup page says that there's been no events since 2020.

Does SLUG still operate?

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

From mkaan at abdcomputers.net  Wed Feb 23 07:21:14 2022
From: mkaan at abdcomputers.net (Mohammad Kaan - ABD Computers and Networks)
Date: Wed, 23 Feb 2022 07:21:14 +1100
Subject: [Linux-aus] SLUG status
In-Reply-To: <12192784.1c4RISRJPY@xev>
References: <12192784.1c4RISRJPY@xev>
Message-ID:

Hi Russell,

I do know that the mailing list works, so I'm guessing there is some
kind of operation. Given Covid and the rest we have all been fragmented.

Anyway, I'm here if that helps.

Regards

-- 
Mohammad Kaan
Senior Development and Operations Engineer
ABD Computer Installations
W: www.abdcomputers.net [1]
E: mkaan at abdcomputers.net
P: 0450 592 017

On 2022-02-22 22:27, Russell Coker via linux-aus wrote:
> What's the status of SLUG?
>
> https://slug.org.au/contact.html
>
> This contact page has a broken link for Conor Buckley, the mailing list link
> gives a 403, and the Meetup page says that there's been no events since 2020.
>
> Does SLUG still operate?

This e-mail message may contain confidential or legally privileged
information and is intended only for the use of the intended
recipient(s). Any unauthorized disclosure, dissemination, distribution,
copying or the taking of any action in reliance on the information
herein is prohibited. E-mails are not secure and cannot be guaranteed to
be error free as they can be intercepted, amended, or contain viruses.
Anyone who communicates with us by e-mail is deemed to have accepted
these risks. 'ABD Computers and Networks' is not responsible for errors
or omissions in this message and denies any responsibility for any
damage arising from the use of e-mail. Any opinion and other statement
contained in this message and any attachment are solely those of the
author and do not necessarily represent those of the company.

Links:
------
[1] http://www.abdcomputers.net

From james at french.id.au  Fri Feb 25 17:41:21 2022
From: james at french.id.au (James French)
Date: Fri, 25 Feb 2022 14:41:21 +0800
Subject: [Linux-aus] strange ATI video issue
In-Reply-To: <5226418.CLqGRgimMR@xev>
References: <5226418.CLqGRgimMR@xev>
Message-ID:

Afternoon Russell,

I had similar issues with an RX 570 and kernels 5.8 - 5.10 (running
Debian testing pre bullseye and then with the bullseye release kernel)
that I never quite worked out. In my case by the time the problem
manifested the system was grossly unstable. Slightly different error
messages if memory serves, but similar pattern of behaviour. In my
case, the problem would be triggered intermittently after a monitor
woke from sleep.

To keep my system usable I was running the buster 4.19 kernel with
testing and then bullseye release for quite a while.

Those issues have been resolved since I moved back to testing and
started using kernels newer than 5.12.

If you can tolerate running a backports kernel, I'd suggest upgrading
to 5.15 and seeing if that helps.

Kind Regards,

James

On Fri, 18 Feb 2022 at 13:43, Russell Coker via linux-aus wrote:
>
> I have a system running typical KDE desktop stuff with Chrome and Thunderbird.
> After running for a while it gets the following in /var/log/Xorg.0.log:
>
> [928485.680] (WW) RADEON(0): flip queue failed: Device or resource busy
> [928485.680] (WW) RADEON(0): Page flip failed: Device or resource busy
> [928485.680] (EE) RADEON(0): present flip failed
> [928485.890] (WW) RADEON(0): flip queue failed: Device or resource busy
> [928485.890] (WW) RADEON(0): Page flip failed: Device or resource busy
> [928485.890] (EE) RADEON(0): present flip failed
>
> Then after that it gives messages like the following:
>
> [928541.507] (WW) RADEON(0): flip queue failed: Cannot allocate memory
> [928541.507] (WW) RADEON(0): Page flip failed: Cannot allocate memory
> [928541.507] (EE) RADEON(0): present flip failed
> [928542.008] (WW) RADEON(0): flip queue failed: Cannot allocate memory
> [928542.008] (WW) RADEON(0): Page flip failed: Cannot allocate memory
> [928542.008] (EE) RADEON(0): present flip failed
>
> At that time normal X operations start failing, if the system is in use the
> window manager blocks, if the system isn't in use then it becomes impossible
> to unlock the screen (it doesn't give a password prompt and doesn't respond to
> "loginctl unlock-session" commands).
>
> This seems to be correlated with BOINC accessing the GPU when the system is
> idle (I haven't yet done sufficient testing to prove this), but other systems
> don't have such problems. Chrome also seems involved in the problems, but I
> don't know if it's causing it or just making it noticeable.
>
> The system with the problem is a Dell PowerEdge T320 with 96G of RAM and the
> following video card according to lspci running at 2560x1440 resolution:
> 0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> Bonaire XTX [Radeon R7 260X/360]
>
> A system without that problem is a HP Proliant ML110 Gen 9 with 64G of RAM and
> the following video card running at 3840x2160 resolution:
> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] (rev e5)
>
> The Dell has this in the Radeon section of lspci -vv output:
>         Capabilities: [200 v1] Physical Resizable BAR
>                 BAR 0: current size: 256MB, supported: 256MB 512MB 1GB 2GB 4GB
>
> The HP has this in the lspci -vv output:
>         Capabilities: [200 v1] Physical Resizable BAR
>                 BAR 0: current size: 4GB, supported: 256MB 512MB 1GB 2GB 4GB
>
> Could a mere 256M of buffer memory be contributing to video card problems or
> be a symptom of some deeper problem?
>
> Here's an article on the resizable BAR:
> https://www.tomshardware.com/news/geforce-driver-465-89-resizable-bar-support
>
> Both those systems are running Debian/Bullseye (Stable). The problems on the
> T320 started occurring about 4-8 weeks ago with no deliberate changes that
> seem relevant. Of course there were new versions of Chrome and Chromium
> (which certainly had other changes than just bug fixes) and bug fixes to
> various Debian packages which shouldn't break things (but computers are
> complex).
>
> My current test is to deny BOINC access to X and see if that makes things more
> reliable. While running BOINC with only the CPU would be OK, I'd really like
> to get the GPU working as without BOINC it's entirely idle 16 hours a day.
>
> Any ideas?
>
> --
> My Main Blog http://etbe.coker.com.au/
> My Documents Blog http://doc.coker.com.au/
>
> _______________________________________________
> linux-aus mailing list
> linux-aus at lists.linux.org.au
> http://lists.linux.org.au/mailman/listinfo/linux-aus
>
> To unsubscribe from this list, send a blank email to
> linux-aus-unsubscribe at lists.linux.org.au
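
The Xorg log excerpts in this thread show the flip failures changing from
"Device or resource busy" to "Cannot allocate memory" before X stops
responding. A minimal Python sketch for summarising when each failure reason
first appears, and how often, might look like the following. The function name
summarize_flip_failures and the regular expression are illustrative
assumptions derived only from the log lines quoted in the thread; they are not
part of Xorg or the radeon driver, and other driver versions may format these
messages differently.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpts quoted in the thread, e.g.
# [928485.680] (WW) RADEON(0): flip queue failed: Device or resource busy
FLIP_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+\((?:WW|EE)\)\s+RADEON\(\d+\):\s+"
    r"(?:flip queue failed|Page flip failed):\s+(?P<reason>.+?)\s*$"
)


def summarize_flip_failures(lines):
    """Return {reason: (count, first_timestamp)} for RADEON flip failures.

    Lines without a reason after the colon (such as "present flip failed")
    are ignored, since they carry no error text to group by.
    """
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = FLIP_RE.match(line)
        if match is None:
            continue
        reason = match.group("reason")
        counts[reason] += 1
        # Remember only the earliest timestamp for each reason.
        first_seen.setdefault(reason, float(match.group("ts")))
    return {reason: (counts[reason], first_seen[reason]) for reason in counts}


# Sample lines copied from the original post.
SAMPLE = [
    "[928485.680] (WW) RADEON(0): flip queue failed: Device or resource busy",
    "[928485.680] (WW) RADEON(0): Page flip failed: Device or resource busy",
    "[928485.680] (EE) RADEON(0): present flip failed",
    "[928541.507] (WW) RADEON(0): flip queue failed: Cannot allocate memory",
    "[928541.507] (WW) RADEON(0): Page flip failed: Cannot allocate memory",
    "[928541.507] (EE) RADEON(0): present flip failed",
]

# Print reasons in the order they first appeared in the log.
summary = summarize_flip_failures(SAMPLE)
for reason, (count, first_ts) in sorted(summary.items(), key=lambda kv: kv[1][1]):
    print(f"{first_ts:.3f} {count} {reason}")
```

To use it on a real system, pass an open file object for /var/log/Xorg.0.log
instead of SAMPLE. Comparing the first timestamp of each reason against the
times BOINC starts GPU work is one way to test the correlation suspected in
the original post.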