tag:blogger.com,1999:blog-4530460124602916146 2025-12-14T18:12:17.556-08:00 Dave Airlie Linux Graphics blog Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com Blogger 56 1 25 tag:blogger.com,1999:blog-4530460124602916146.post-7294973619105776654 2025-11-23T17:42:00.000-08:00 2025-11-23T17:42:03.781-08:00 fedora 43: bad mesa update oopsie <p>F43 picked up the two patches I created to fix a bunch of deadlocks on laptops reported in my previous blog posting. It turns out Vulkan layers have a subtle requirement I missed: I had removed a line from the device select layer that only matters if another layer is loaded, which happens under Steam.</p><p>The Fedora update process caught this, but the update still got published, which was a mistake; I probably need to set higher karma thresholds for changes like this.</p><p>I've released a new update <a href="https://bodhi.fedoraproject.org/updates/FEDORA-2025-2f4ba7cd17">https://bodhi.fedoraproject.org/updates/FEDORA-2025-2f4ba7cd17</a> that hopefully fixes this. I'll keep an eye on the karma. </p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-4794045441973043517 2025-11-09T19:16:00.000-08:00 2025-11-09T19:16:37.704-08:00 a tale of vulkan/nouveau/nvk/zink/mutter + deadlocks <p> I had a bug appear in my email recently which led me down a rabbit hole, and I'm going to share it for future people wondering why we can't have nice things.</p><h2 style="text-align: left;">Bug:</h2><p>1. Get an intel/nvidia (newer than Turing) laptop.</p><p>2. Log in to GNOME on Fedora 42/43 </p><p>3. Hotplug an HDMI port that is connected to the NVIDIA GPU.</p><p>4. Desktop stops working.</p><p>My initial reproduction got me a hung mutter process with a nice backtrace which pointed at the Vulkan Mesa device selection layer, trying to talk to the wayland compositor to ask it what the default device is. 
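To make the failure mode concrete: the backtrace showed a blocking round-trip request to the compositor. Here is a toy model of that situation in plain Python sockets (nothing here is Vulkan or mutter code, just an illustrative sketch; the real hang has no timeout at all):

```python
import socket

def blocking_roundtrip(sock, timeout=0.2):
    """Send a request and block waiting for the reply."""
    sock.settimeout(timeout)
    sock.sendall(b"which-gpu?")
    try:
        return sock.recv(64)   # blocks waiting for an answer...
    except socket.timeout:
        return None            # ...that this same thread was supposed to send

# A single-threaded "compositor" asking itself a question: nobody is
# free to read server_end and reply, so the round-trip never completes.
server_end, client_end = socket.socketpair()
reply = blocking_roundtrip(client_end)
print(reply)  # None: the round-trip stalls (here it times out)
```

In the real bug the client is the device select layer and the "server" is the very same mutter process, which is blocked inside the layer, so the wait is forever.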
The problem was that the process was the wayland compositor itself, so this was never going to work. The Vulkan device selection was called because zink called EnumeratePhysicalDevices, and zink was being loaded because we recently switched to it as the OpenGL driver for newer NVIDIA GPUs.</p><p>I looked into zink and the device select layer code, and lo and behold someone had already hacked around this badly, and probably wrongly; I've no idea what the code does, because I think there is at least one logic bug in it. Nice things can't be had because hacks were done instead of just solving the problem. </p><p>The hacks in place ensured that, under certain circumstances involving zink/xwayland, the device select code that probes the window system was disabled, due to deadlocks seen. I'd no idea if more hacks were going to help, so I decided to step back and try and work out something better.</p><p>The first question I had was why WAYLAND_DISPLAY is set inside the compositor process; it is, and if it weren't I would never hit this. It's pretty likely on the initial compositor start this env var isn't set, so the problem only becomes apparent when the compositor gets a hotplugged GPU output and goes to load the OpenGL driver, zink, which enumerates and hits device select with the env var set and deadlocks.</p><p>I wasn't going to figure out a way around WAYLAND_DISPLAY being set at this point, so I leave the above question as an exercise for mutter devs.</p><h2 style="text-align: left;">How do I fix it?</h2><h3 style="text-align: left;">Attempt 1:</h3><p>At the point where zink is loading in mesa for this case, we have the file descriptor of the GPU device that we want to load a driver for. We don't actually need to enumerate all the physical devices, we could just find the ones for that fd. There is no API for this in Vulkan. I wrote an initial proof of concept instance extension called VK_MESA_enumerate_devices_fd. 
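The underlying idea of selecting by fd is simple: a DRM fd identifies a device node, and each Vulkan physical device can report which DRM node it sits on (VK_EXT_physical_device_drm exposes the major/minor numbers). A toy sketch of just the matching step, in Python; the device list is made-up stand-in data, not real driver output:

```python
import os

def drm_major_minor(fd):
    """(major, minor) of the device node backing an open fd."""
    st = os.fstat(fd)
    return os.major(st.st_rdev), os.minor(st.st_rdev)

def pick_physical_device(fd, devices):
    """devices: [{'name', 'major', 'minor'}, ...] as a driver might
    report via VK_EXT_physical_device_drm. Returns the device whose
    node matches the fd, or None."""
    want = drm_major_minor(fd)
    for dev in devices:
        if (dev["major"], dev["minor"]) == want:
            return dev
    return None
```

The pain point is that today you can only read those per-device properties after a full EnumeratePhysicalDevices, which is exactly the call that drags in the device select layer.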
I wrote initial loader code to play with it, and wrote zink code to use it. Because this is a new instance API, device-select will also ignore it. However this ran into a big problem in the Vulkan loader. The loader is designed around internal assumptions that PhysicalDevices will enumerate in similar ways, and it has to trampoline PhysicalDevice handles to underlying driver pointers so that if an app enumerates once, and enumerates again later, the PhysicalDevice handles remain consistent for the first user. There is a lot of code, and I've no idea how hotplug GPUs might fail in such situations. I couldn't find a decent path forward without knowing a lot more about the Vulkan loader. I believe this is the proper solution: since we know the fd, we should be able to get the right device without doing a full enumeration and then picking the answer using the fd info. I've asked the Vulkan WG to take a look at this, but I still need to fix the bug.</p><h3 style="text-align: left;">Attempt 2:</h3><p style="text-align: left;">Maybe I can just turn off device selection, like the current hacks do, but in a better manner. Enter VK_EXT_layer_settings. This extension allows layers to expose settings that can be set at instance creation. I can have the device select layer expose a setting which says don't touch this instance. Then, in the zink code where we have a file descriptor being passed in and create an instance, we set the layer setting to avoid device selection. This seems to work; it has some caveats I need to consider, but I think it should be fine.</p><p style="text-align: left;">zink uses a single VkInstance for its device screen. This is shared between all pipe_screens. Now I think this is fine inside a compositor, since we shouldn't ever be loading zink via the non-fd path, and I hope for most use cases it will work fine, better than the current hacks and better than some other ideas we threw around. 
The code for this is in [1].</p><h2 style="text-align: left;">What else might be affected:</h2><p style="text-align: left;">If you have a vulkan compositor, it might be worth setting the layer setting if the mesa device select layer is loaded, especially if you set DISPLAY/WAYLAND_DISPLAY and do any sort of hotplug later. You might be safe if you EnumeratePhysicalDevices early enough; the reason it's a big problem in mutter is that it doesn't use Vulkan directly, it uses OpenGL, and we only enumerate Vulkan physical devices at runtime through zink, never at startup.</p><p style="text-align: left;">AMD and NVIDIA, I think, have proprietary device selection layers; these might also deadlock in similar ways. I think we've seen some weird deadlocks in NVIDIA driver enumeration as well that might be a similar problem. </p><p style="text-align: left;"> [1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38252 </p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-3625276940218502229 2025-09-15T12:08:00.000-07:00 2025-09-15T12:08:54.814-07:00 radv takes over from AMDVLK <p><br /></p><p><a data-preview="" href="https://www.google.com/search?ved=1t:260882&q=AMD&bbid=4530460124602916146&bpid=3625276940218502229" target="_blank">AMD</a> have announced the end of the <a data-preview="" href="https://www.google.com/search?ved=1t:260882&q=AMDVLK&bbid=4530460124602916146&bpid=3625276940218502229" target="_blank">AMDVLK</a> open driver in favour of focusing on <a data-preview="" href="https://www.google.com/search?ved=1t:260882&q=radv+vulkan+driver&bbid=4530460124602916146&bpid=3625276940218502229" target="_blank">radv</a> for Linux use cases.</p><p>When Bas and I started radv in 2016, AMD were promising their own Linux vulkan driver, which arrived in Dec 2017. At this point radv was already shipping in most Linux distros. 
AMD's strategy of developing AMDVLK via over-the-wall open source releases from internal closed development was always going to leave it a second-place option at that point.</p><p>When Valve came on board and brought dedicated developer power to radv, and the aco compiler matured, there really was no point putting effort into AMDVLK, which was hard to package and nearly impossible for external developers to contribute to meaningfully.</p><p>radv is probably my proudest contribution to the Linux ecosystem, finally disproving years of idiots saying an open source driver could never compete with a vendor provided driver; now it is the vendor provided driver.</p><p>I think we will miss the open source PAL repo as a reference source and I hope AMD engineers can bridge that gap, but it's often hard to ask about workarounds you don't know exist. I'm also hoping AMD will add more staffing beyond the current levels, especially around hardware enablement and workarounds.</p><p>Now onwards to NVK victory :-)</p><p>[1] https://github.com/GPUOpen-Drivers/AMDVLK/discussions/416</p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 1 tag:blogger.com,1999:blog-4530460124602916146.post-7033773954585939955 2025-07-24T14:12:00.000-07:00 2025-07-24T15:19:49.192-07:00 ramalama/mesa : benchmarks on my hardware and open source vs proprietary <p>One of my pet peeves around running local LLMs and inferencing is the sheer mountain of shit^W^W^W complexity of compute stacks needed to run any of this stuff in a mostly optimal way on a piece of hardware.</p><p>CUDA, ROCm, and Intel oneAPI all, to my mind, scream over-engineering on a massive scale, at least for a single task like inferencing. The combination of closed source, over-the-wall open source, and open source that is insurmountable for anyone outside the vendor to support or fix, screams that there has to be a simpler way. 
Combine that with the pytorch ecosystem and the insanity of deploying python and I get a bit unstuck.</p><p>What can be done about it?</p><p>llama.cpp to me seems like the best answer to the problem at present (a rust version would be a personal preference, but can't have everything). I like how ramalama wraps llama.cpp to provide a sane container interface, but I'd like to eventually get to the point where container complexity for a GPU compute stack isn't really needed except for exceptional cases.</p><p>On the compute stack side, Vulkan exposes most features of GPU hardware in a possibly suboptimal way, but with extensions all can be forgiven. Jeff Bolz from NVIDIA's talk at <a href="https://www.vulkan.org/user/pages/09.events/vulkanised-2025/T47-Jeff-Bolz-NVIDIA.pdf">Vulkanised 2025</a> started to give me hope that maybe the dream was possible.</p><p>The main issue I have is that Jeff is writing driver code for the NVIDIA proprietary vulkan driver, which reduces complexity but doesn't solve my open source problem.</p><p>Enter NVK, the open source driver for NVIDIA GPUs. Karol Herbst and I are taking a look at closing the feature gap with the proprietary one. For mesa 25.2 the initial support for VK_KHR_cooperative_matrix was landed, along with some optimisations, but there is a bunch of work to get VK_NV_cooperative_matrix2 and a truckload of compiler optimisations to catch up with NVIDIA.</p><p>But since mesa 25.2 was coming soon I wanted to try and get some baseline figures out.</p><p>I benchmarked on two systems (because my AMD 7900XT wouldn't fit in the case), both with Ryzen CPUs. In the first system I put an RTX5080, then an RTX6000 Ada, and then the Intel A770. The second I used for the RX7900XT. The Intel SYCL stack unfortunately failed to launch inside ramalama, so I hacked llama.cpp to use the A770 MMA accelerators. 
</p><p>ramalama bench hf://unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL </p><p>I picked this model at random, and I've no idea if it was a good idea.<br /> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgE9P_GUUQhRLygCqidIZq3eOkwqG23mpfOMJw65IXDHPrxmEAJeGG1V5Bw5G8Ju919gUGqBB5NHozFea02m9znEgosoRW8RtaIuSCHo8T4GPb5cbqsbRiTSrkcJExO1871NdtVVI6QZBv_SETDzGlioVbaUix1DSP9V19MP5ms8G6iFkHfA0E0Sk4ZotpI/s1200/tg.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="800" data-original-width="1200" height="385" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgE9P_GUUQhRLygCqidIZq3eOkwqG23mpfOMJw65IXDHPrxmEAJeGG1V5Bw5G8Ju919gUGqBB5NHozFea02m9znEgosoRW8RtaIuSCHo8T4GPb5cbqsbRiTSrkcJExO1871NdtVVI6QZBv_SETDzGlioVbaUix1DSP9V19MP5ms8G6iFkHfA0E0Sk4ZotpI/w578-h385/tg.png" width="578" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgApp3To7bs2i9dmNix1D38y_-9W3rqAZUbBm3IXeUSlRuD6poJQ36IAKqmd7pUyry5qOcZCiEukfPSL_X9ZaN_Q2CiIrsPdjMubFiYNXBtXqoG-TydfmYBK0_QUlqiADutrrYb8-tUuu-TcflPpVNHfhYGHjZRp49WbEkMvcsEJUtynz1OkJNuoGa98Mvr/s1200/pp.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="800" data-original-width="1200" height="368" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgApp3To7bs2i9dmNix1D38y_-9W3rqAZUbBm3IXeUSlRuD6poJQ36IAKqmd7pUyry5qOcZCiEukfPSL_X9ZaN_Q2CiIrsPdjMubFiYNXBtXqoG-TydfmYBK0_QUlqiADutrrYb8-tUuu-TcflPpVNHfhYGHjZRp49WbEkMvcsEJUtynz1OkJNuoGa98Mvr/w554-h368/pp.png" width="554" /></a></div><h4 style="text-align: left;">Some analysis:</h4><p>The token generation workload is a lot less matmul heavy than prompt processing, it also does a lot more synchronising. Jeff has stated CUDA wins here mostly due to CUDA graphs and most of the work needed is operation fusion on the llama.cpp side. 
Prompt processing is a lot more matmul heavy; extensions like NV_coopmat2 will help with that (NVIDIA vulkan already uses it in the above), but there may be further work to help close the CUDA gap. On AMD, radv (open source) Vulkan is already better at TG than ROCm, but behind in prompt processing. Again, coopmat2-like extensions should help close the gap there.</p><p>NVK is starting from a fair way behind; we just pushed support for the most basic coopmat extension and we know there is a long way to go, but I think most of it is achievable as we move forward, and I hope to update with new scores on a semi-regular basis. We also know we can definitely close the gap on the NVIDIA proprietary Vulkan driver if we apply enough elbow grease and register allocation :-)</p><p>I think it might also be worth putting some effort into radv coopmat2 support; if radv could overtake ROCm for both of these it would remove a large piece of complexity from the basic user's stack.</p><p>As for Intel I've no real idea. I hope to get their SYCL implementation up and running, and maybe I should try and get my hands on a B580 card as a better baseline. When I had SYCL running once before I kinda remember it being 2-4x the vulkan driver, but there's been development on both sides. </p><p>(The graphs were generated by Gemini.)</p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 4 tag:blogger.com,1999:blog-4530460124602916146.post-5828009373139869373 2025-07-01T02:44:00.000-07:00 2025-07-01T03:20:36.719-07:00 nvk: blackwell support <p>Blog posts are like buses sometimes...</p><p>I've spent time over the last month enabling Blackwell support on NVK, the Mesa vulkan driver for NVIDIA GPUs. Faith from Collabora, the NVK maintainer, has cleaned up and merged all the major pieces of this work and landed them into mesa this week. Mesa 25.2 should ship with a functioning NVK on blackwell. 
The code currently in mesa main passes all tests in the Vulkan CTS.</p><p>Quick summary of the major fun points:</p><p>Ben @ NVIDIA had done the initial kernel bringup to the r570 firmware in the nouveau driver. I worked with Ben on solidifying that work and ironing out a bunch of memory leaks and regressions that snuck in.</p><p>Once the kernel was stable, there were a number of differences between Ada and Blackwell that needed to be resolved. Thanks to Faith, Mel and Mohamed for their help, and NVIDIA for providing headers and other info.</p><p>I did most of the work on a GB203 laptop and a desktop 5080.</p><p>1. Instruction encoding: a bunch of instructions changed how they were encoded. Mel helped sort out most of those early on.</p><p>2. Compute/QMD: the QMD, which is used to launch compute shaders, has a new encoding. NVIDIA released the official QMD headers, which made this easier in the end.</p><p>3. Texture headers: texture headers were encoded differently from Hopper on, so we had to use new NVIDIA headers to encode those properly.</p><p>4. Depth/Stencil: NVIDIA added support for separate d/s planes and this also has some knock-on effects on surface layouts. </p><p>5. Surface layout changes: NVIDIA attaches a memory kind to memory allocations; due to changes in Blackwell, they now use a generic kind for all allocations. You no longer know the internal bpp-dependent layout of the surfaces. This required changes to the dma-copy engine to provide that info, and means we have some modifier changes to cook with NVIDIA over the next few weeks, at least for 8/16 bpp surfaces. Mohamed helped get this work and host image copy support done.</p><p>6. One thing we haven't merged is bound texture support. Currently blackwell is using bindless textures, which might be a little slower. Due to changes in the texture instruction encoding, you have to load texture handles to intermediate uniform registers before using them as bound handles. 
This causes a lot of fun with flow control and with when you can spill uniform registers. I've made a few attempts at using bound textures, so we understand how to use them; we just have some compiler issues to sort out to maybe get it across the line.</p><p>7. Proper instruction scheduling isn't landed yet. I have a spreadsheet with all the figures, and I started typing, so I will try and get that into an MR before I take some holidays. </p><p> </p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-4614788804553986944 2025-07-01T02:27:00.000-07:00 2025-07-01T02:27:02.704-07:00 radv: VK_KHR_video_encode_av1 support <p> I should have mentioned this here a week ago. The Vulkan AV1 encode extension has been out for a while, and I'd done the initial work on enabling it with radv on AMD GPUs. I then left it in a branch, which Benjamin from AMD picked up and fixed a bunch of bugs in, and then we both got distracted. I realised when doing VP9 that it hadn't landed, so I did a bit of cleanup. Then David from AMD picked it up and carried it over the last mile and it got merged last week.</p><p>So radv on supported hw now supports all vulkan decode/encode formats currently available. </p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-6459495505951487416 2025-06-09T12:42:00.000-07:00 2025-06-09T12:42:20.713-07:00 radv: vulkan VP9 video decode <p>The Vulkan WG has released VK_KHR_video_decode_vp9. I did initial work on a Mesa extension for this a good while back, and I've updated the radv code with help from AMD and Igalia to the final specification.</p><p>There is an open MR[1] for radv to add support for vp9 decoding on navi10+ with the latest firmware images in linux-firmware. 
It is currently passing all VK-GL-CTS tests for VP9 decode.</p><p>Adding this decode extension is a big milestone for me, as I think it now signs off on all the reasons I originally got involved in Vulkan Video. There is still lots to do and I'll stay involved, but it's been great to see the contributions from others and how there is a bit of a Vulkan Video community upstream in Mesa.</p><p> [1] <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35398">https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35398</a></p><p> </p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-5025649228261713838 2024-08-29T17:26:00.000-07:00 2024-08-29T18:52:37.634-07:00 On Rust, Linux, developers, maintainers <p>There have been a couple of mentions of Rust4Linux in the past week or two, one from Linus on the speed of engagement and one about Wedson departing the project due to non-technical concerns. This got me thinking about project phases and developer types.<br /></p><h2 style="text-align: left;">Archetypes:</h2><div style="text-align: left;">I will regret making an analogy in an area I have no experience in, but let's give it a go with a road building analogy.</div><div style="text-align: left;"> </div><div style="text-align: left;">Let's sort developers into 3 rough categories. Let's preface by saying not all developers fit in a single category throughout their careers, and some developers can do different roles on different projects, or on the same project simultaneously.</div><h3 style="text-align: left;">1. Wayfinders/Mapmakers</h3><div style="text-align: left;">I want to go build a hotel somewhere but there exists no map or path. I need to travel through a bunch of mountains, valleys, rivers, weather, animals, friendly humans, antagonistic humans and some unknowns. I don't care deeply about them, I want to make a path to where I want to go. 
I hit a roadblock, I don't focus on it, I get around it by any means necessary and move on to the next one. I document the route by leaving maps and signs. I build a hotel at the end. </div><h3 style="text-align: left;">2. Road builders</h3><div style="text-align: left;">I see the hotel and path someone has marked out. I foresee that larger volumes will want to traverse this path and build more hotels. The roadblocks the initial finder worked around, I have to engage with. I engage with each roadblock differently. I build a bridge, dig a tunnel, blow up some stuff, work with/against humans, whatever is necessary to get a road built to the place the wayfinder built the hotel. I work on each roadblock until I can open the road to traffic. I can open it in stages, but it needs a completed road.<br /></div><h3 style="text-align: left;">3. Road maintainers</h3><div style="text-align: left;">I've got a road; I may have built the road initially, but I may no longer build new roads. I've no real interest in hotels. I deal with intersections with other roads controlled by other people, I interact with builders who want to add new intersections for new roads, and remove old intersections for old roads. I fill in the holes, improve safety standards, handle the odd wayfinder wandering across my 8 lanes.</div><h2 style="text-align: left;">Interactions:</h2><p style="text-align: left;">Wayfinders and maintainers make for the most difficult interaction. Wayfinders like to move freely and quickly; maintainers have other priorities that slow them down. I believe there need to be road builders engaged between the wayfinders and maintainers.</p><p style="text-align: left;">Road builders have to be willing to expend the extra time to resolve roadblocks in the best way possible for all parties. The time it takes to resolve a single roadblock may be greater than the time expended on the whole wayfinding expedition, and this frustrates wayfinders. 
The builder has to understand what the maintainers' concerns are and where they come from, and why the wayfinder made certain decisions. They work via education and trust building to get everyone aligned to move past the block. They then move down the road and repeat this process until the road is open. How this is done might change depending on the type of maintainers.<br /></p><h2 style="text-align: left;">Maintainer types:<br /></h2><div style="text-align: left;">Maintainers can fall into a few different groups on a per-new-road basis, and how road builders deal with existing road maintainers depends on where they are for this particular intersection:<br /></div><h4 style="text-align: left;">1. Positive and engaged </h4><div style="text-align: left;">Aligned with the goal of the road, want to help out, design intersections, help build more roads and more intersections. Will often have helped wayfinders out.<br /></div><div style="text-align: left;"></div><h4 style="text-align: left;">2. Positive with real concerns</h4><p style="text-align: left;">Agrees with the road's direction, might not like some of the intersections, willing to be educated and give feedback on newer intersection designs. Moves to group 1 or trusts that others are willing to maintain intersections on their road.</p><h4 style="text-align: left;">3. Negative with real concerns</h4><div style="text-align: left;">Don't agree fully with the road's direction or choice of building material. Might have some resistance to changing intersections, but may believe in a bigger picture so won't actively block. Hopefully can move to 1 or 2 with education and trust building. <br /></div><h4 style="text-align: left;">4. Negative and unwilling</h4><div style="text-align: left;">Don't agree with the goal, don't want the intersection built, won't trust anyone else to care about their road enough. 
Education and trust building is a lot more work here, and often it's best to leave these intersections until later, where they may be swayed by other maintainers having built their intersections. It might be possible to build a reduced intersection, but if they are a major enough roadblock in a very busy road, then a higher authority might need to be brought in.<br /></div><h4 style="text-align: left;">5. Don't care/Disengaged</h4><div style="text-align: left;">Doesn't care where your road goes and won't talk about intersections. This category often just needs to be told that someone else will care about it and they will step out of the way. If they are active blockers or refuse interaction then again a higher authority needs to be brought in.<br /></div><div style="text-align: left;"><br /></div><h3 style="text-align: left;">Where are we now?</h3><div style="text-align: left;">I think the r4l project has had a lot of excellent wayfinding done, has a lot of wayfinding in progress and probably has a bunch of future wayfinding to do. There are some nice hotels built. However, now we need to build the roads to them so others can build hotels.<br /></div><div style="text-align: left;"> </div><div style="text-align: left;">To the higher authority, the road building process can look slow. They may expect cars to be driving on the road already, and they see roadblocks from a different perspective. A roadblock might look smaller to them but have a lot of fine details, or a large roadblock might be worked through quickly once it's engaged with.<br /></div><div style="text-align: left;"> </div><div style="text-align: left;">For the wayfinders the process of interacting with maintainers is frustrating and slow, and they don't enjoy it as much as wayfinding; because they still only care about the hotel at the end, when a maintainer gets into the details of their particular intersection they don't want to do anything but go stay in their hotel. 
</div><div style="text-align: left;"> </div><div style="text-align: left;">The road will get built, and it will get traffic on it. There will be tunnels where we should have intersections, there will be bridges that need to be built from both sides, but I do think it will get built.</div><p></p><p>I think my request from this is that contributors should try and identify the archetype they currently resonate with and find the next group over to interact with.</p><p>For wayfinders, it's fine to just keep wayfinding, just don't be surprised when the road building takes longer, or the road that gets built isn't what you envisaged.</p><p>For road builders, just keep building, find new techniques for bridging gaps and blowing stuff up when appropriate. Figure out when to use higher authorities. Take the high road, and focus on the big picture.</p><p>For maintainers, try and keep up with modern road building, don't say 20 year old roads are the pinnacle of innovation. Be willing to install the rumble strips, widen the lanes, add crash guardrails, and truck safety offramps. Understand that wayfinders show you opportunities for longer term success and that road builders are going to keep building the road, and the result is better if you engage positively with them.<br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-1193090290366043379 2024-02-01T22:41:00.000-08:00 2024-02-04T19:16:58.582-08:00 anv: vulkan av1 decode status <p> Vulkan Video AV1 decode has been released, and I had some partly working support in the Intel ANV driver previously, but I let it lapse.</p><p>The branch is currently [1]. It builds but is totally untested; I'll get some time next week to plug in my DG2 and see if I can persuade it to decode some frames.</p><p>Update: the current branch decodes one frame properly; reference frames unfortunately need more work. 
<br /></p><p>[1] <a href="https://gitlab.freedesktop.org/airlied/mesa/-/commits/anv-vulkan-video-decode-av1">https://gitlab.freedesktop.org/airlied/mesa/-/commits/anv-vulkan-video-decode-av1</a><br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-7389427519505546223 2024-02-01T18:27:00.000-08:00 2024-02-01T18:27:20.364-08:00 radv: vulkan av1 video decode status <p>The Khronos Group announced VK_KHR_video_decode_av1 [1]; this extension adds AV1 decoding to the Vulkan specification. There is a radv branch [2] and merge request [3]. I did some AV1 work on this in the past, but I need to take some time to see if it has made any progress since. I'll post an ANV update once I figure that out.</p><p>This extension is one of the ones I've been wanting for a long time, since having a royalty-free codec is something I can actually care about and ship, as opposed to the painful ones. I started working on a MESA extension for this a year or so ago with Lynne from the ffmpeg project and we made great progress with it. We submitted that to Khronos and it has gone through the committee process and been refined and validated amongst the hardware vendors.</p><p>I'd like to say thanks to Charlie Turner and Igalia for taking over a lot of the porting to the Khronos extension and fixing up bugs that their CTS development brought up. 
This is a great feature of having open source drivers: it allows much quicker turnaround on bug fixes when devs can fix them themselves!<br /></p><p>[1]: <a href="https://www.khronos.org/blog/khronos-releases-vulkan-video-av1-decode-extension-vulkan-sdk-now-supports-h.264-h.265-encode">https://www.khronos.org/blog/khronos-releases-vulkan-video-av1-decode-extension-vulkan-sdk-now-supports-h.264-h.265-encode</a> </p><p>[2] <a href="https://gitlab.freedesktop.org/airlied/mesa/-/tree/radv-vulkan-video-decode-av1">https://gitlab.freedesktop.org/airlied/mesa/-/tree/radv-vulkan-video-decode-av1</a></p><p>[3] <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27424">https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27424</a><br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-569405926022189585 2023-12-17T17:00:00.000-08:00 2023-12-19T12:29:27.433-08:00 radv: vulkan video encode status <p>Vulkan 1.3.274 moves the Vulkan encode work out of BETA and moves h264 and h265 into KHR extensions. radv support for the Vulkan video encode extensions has been in progress for a while.</p><p>The latest branch is at [1]. 
This branch has been updated for the new final headers.<br /></p><p>Updated: It passes all of the h265 CTS now, but it is failing one h264 test.<br /></p>Initial ffmpeg support is at [2].<br /><p>[1] <a href="https://gitlab.freedesktop.org/airlied/mesa/-/tree/radv-vulkan-video-encode-h2645-spec-latest?ref_type=heads">https://gitlab.freedesktop.org/airlied/mesa/-/tree/radv-vulkan-video-encode-h2645-spec-latest?ref_type=heads</a></p><p>[2] <a href="https://github.com/cyanreg/FFmpeg/commits/vulkan/">https://github.com/cyanreg/FFmpeg/commits/vulkan/</a><br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-716927746082641383 2023-11-05T12:23:00.002-08:00 2023-11-05T12:23:40.025-08:00 nouveau GSP firmware support - current state <p>Linus has pulled the initial GSP firmware support for nouveau. This is just the first set of work to use the new GSP firmware and there are likely many challenges and improvements ahead.</p><p>To get this working you need to install the firmware, which hasn't landed in linux-firmware yet.</p><p>For Fedora this copr has the firmware in the necessary places:<br /></p><p><a href="https://copr.fedorainfracloud.org/coprs/airlied/nouveau-gsp/build/6593115/ ">https://copr.fedorainfracloud.org/coprs/airlied/nouveau-gsp/build/6593115/ </a></p><p>Hopefully we can upstream that in the next week or so.<br /></p><p>If you have an Ada-based GPU then it should just try and work out of the box; if you have Turing or Ampere you currently need to pass nouveau.config=NvGspRm=1 on the kernel command line to attempt to use GSP.</p><p>Going forward, I've got a few fixes and stabilization bits to land, which we will concentrate on for 6.7; after that we have to work out how to keep the firmware up to date, how to support new hardware, and how to add new features.<br /></p><p><br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 
tag:blogger.com,1999:blog-4530460124602916146.post-434127364135516523 2023-09-01T11:56:00.003-07:00 2023-09-01T12:12:17.350-07:00 Talk about compute and community and where things are at. <p> Sriram invited me to the oneAPI meetup, and I felt I hadn't summed up the state of compute and community development in a while. Enjoy 45 minutes of opinions!</p><p><a href="https://www.youtube.com/watch?v=HzzLY5TdnZo">https://www.youtube.com/watch?v=HzzLY5TdnZo</a><br /></p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/HzzLY5TdnZo" width="487" youtube-src-id="HzzLY5TdnZo"></iframe></div><br /><p><br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-5114295166447861442 2023-08-04T15:26:00.002-07:00 2023-08-04T15:26:49.303-07:00 nvk: the kernel changes needed <p>The initial NVK (nouveau vulkan) experimental driver has been merged into mesa master[1], and although there's lots of work to be done before it's application ready, the main reason it was merged was because the initial kernel work needed was merged into drm-misc-next[2] and will then go to drm-next for the 6.6 merge window. (This work is separate from the GSP firmware enablement required for reclocking, that is a parallel development, needed to make nvk useable). Faith at Collabora will have a blog post about the Mesa side, this is more about the kernel journey.<br /></p><h3 style="text-align: left;">What was needed in the kernel?</h3><p style="text-align: left;">The nouveau kernel API was written 10 years or more ago, and was designed around OpenGL at the time. 
There were two major restrictions in the current uAPI that made it unsuitable for Vulkan.</p><ol style="text-align: left;"><li style="text-align: left;">buffer objects (physical memory allocations) were allocated 1:1 with virtual memory allocations for a file descriptor. This meant the kernel managed the virtual address space. For proper Vulkan support, the bo allocation and vm allocation have to be separate, and userspace should control the virtual address space.</li><li style="text-align: left;">Command submission didn't use sync objects. The nouveau command submission wasn't wired up to the modern sync objects. These are pretty much a requirement for Vulkan fencing and semaphores to work properly.</li></ol><h3 style="text-align: left;">How to implement these?</h3><div style="text-align: left;"><p style="text-align: left;">When we kicked off the nvk idea I made a first pass at implementing a new user API to allow the above features. I took a look at how the GPU VMA management was done in current drivers and realized there was scope for a common component to manage the GPU VA space. I did a hacky implementation of some common code and a nouveau implementation. Luckily at the time, Danilo <span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">Krummrich had joined my team at Red Hat and needed more kernel development experience in GPU drivers. I handed my sketchy implementation to Danilo and let him run with it. He spent a lot of time learning and writing copious code. 
His GPU VA manager code was merged into drm-misc-next last week and his nouveau code landed today.</span></span></p><h3 style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">What is the GPU VA manager?</span></span></h3><p style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">The idea behind the GPU VA manager is that there is no need for every driver to implement something that should essentially not be a hardware-specific problem. The manager is designed to track VA allocations from userspace, and keep track of what GEM objects they are currently bound to. The implementation went through a few twists and turns and experiments. </span></span></p><p style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">For a long period we considered using maple tree as the core of it, but we hit a number of messy interactions between the dma-fence locking and the memory allocations required to add new nodes to the maple tree. The dma-fence critical section rules are a hard requirement everyone has to deal with. In the end Danilo used an rbtree to track things. We will revisit whether we can deal with maple tree again in the future. </span></span></p><p style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">We had a long discussion, and a couple of implement-it-both-ways-and-see experiments, on whether we needed to track empty sparse VMA ranges in the manager or not. nouveau wanted these, but generically we weren't sure they were helpful, and they also affected the uAPI as it needed explicit operations to create/drop them. 
In the end we started tracking these in the driver and left the core VA manager cleaner.<br /></span></span></p><p style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">Now the code is in tree we will start to push future drivers to use it instead of spinning their own.</span></span></p><h3 style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">What changes are needed for nouveau?</span></span></h3><p style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">Now that the VAs are being tracked, the nouveau API needed two new entrypoints. Since BO allocation will no longer create a VM, a new API is needed to bind BO allocations with VM addresses. This is called the VM_BIND API. It has two variants</span></span></p><ol style="text-align: left;"><li style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121"> a synchronous version that immediately maps a BO to a VM and is used for the common allocation paths. </span></span></li><li style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">an asynchronous version that is modeled after the Vulkan sparse API, and takes in/out sync objects, which use the drm scheduler to schedule the vm/bo binding.</span></span></li></ol><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">The VM BIND backend then does all the page table manipulation required.</span></span></div><div style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121"> </span></span><br /></div><div style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">The second API added was an EXEC call. 
This takes in/out sync objects and a set of addresses that point to command buffers to execute. This uses the drm scheduler to deal with the synchronization and hands the firmware the command buffer address to execute.</span></span></div><div style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">Internally for nouveau this meant having to add support for the drm scheduler, adding new internal page table manipulation APIs, and wiring up the GPU VA. </span></span></div><div style="text-align: left;"><h3 style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">Shoutouts:</span></span></h3><p style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">My input was the sketchy sketch at the start, and doing the userspace changes to the nvk codebase to allow testing.<br /></span></span></p><p style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">The biggest shoutout to Danilo, who took a sketchy sketch of what things should look like, created a real implementation, did all the experimental ideas I threw at him, and threw them and others back at me, negotiated with other drivers to use the common code, and built a great foundational piece of drm kernel infrastructure.<br /></span></span></p><p style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">Faith at Collabora who has done the bulk of the work on nvk did a code review at the end and pointed out some missing pieces of the API and the optimisations it enables.</span></span></p><p style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">Karol at Red Hat on the main nvk driver and Ben at Red Hat for nouveau advice on how things worked, while he smashed away at the GSP 
rock.</span></span></p><p style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">(and anyone else who has contributed to nvk, nouveau and even NVIDIA for some bits :-) </span></span></p><p style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">[1] <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24326">https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24326</a><br /></span></span></p><p style="text-align: left;"><span class="gI"><span data-hovercard-id="dakr@redhat.com" data-hovercard-owner-id="121">[2] <a href="https://cgit.freedesktop.org/drm-misc/log/">https://cgit.freedesktop.org/drm-misc/log/</a><br /></span></span></p></div> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 2 tag:blogger.com,1999:blog-4530460124602916146.post-3149985633709750682 2023-07-13T21:30:00.003-07:00 2023-07-13T21:30:51.144-07:00 tinygrad + rusticl + aco: why not? <p>I recently came across tinygrad as a small powerful nn framework that had an OpenCL backend target and could run the LLaMA model.</p><p>I've been looking out for rusticl workloads, and this seemed like a good one: I could jump on the AI train and run an LLM in my house!</p><p>I started it going on my Radeon 6700XT with the latest rusticl using radeonsi with the LLVM backend, and I could slowly interrogate a model with a question, and it would respond. I've no idea how performant it is vs ROCm yet, which seems to be where tinygrad is more directed, but I may get to that next week.</p><p>While I was there, though, I decided to give the Mesa ACO compiler backend a go; it's been tied into radeonsi recently, and I'd done some hacks before to get compute kernels to run. 
I reproduced said hacks on the modern code and gave it a run.</p><p>tinygrad comes with a benchmark script called benchmark_train_efficientnet so I started playing with it to see what low hanging fruit I could find in an LLVM vs ACO shootout.</p><p>The bench does 10 runs; the first is where lots of compilation happens, the last is well primed cache-wise. Here are the figures from the first and last runs with a release build of llvm and mesa (and the ACO hacks).<br /></p><p>LLVM:</p><p><span style="font-family: courier;">215.78 ms cpy, 12245.04 ms run, 120.33 ms build, 12019.45 ms realize, 105.26 ms CL, -0.12 loss, 421 tensors, 0.04 GB used, 0.94 GFLOPS</span></p><p><span style="font-family: courier;">10.25 ms cpy, 221.02 ms run, 83.50 ms build, 36.25 ms realize, 101.27 ms CL, -0.01 loss, 421 tensors, 0.04 GB used, 52.11 GFLOPS</span></p><p></p><p>ACO:</p><p><span style="font-family: courier;">71.10 ms cpy, 3443.04 ms run, 112.58 ms build, 3214.13 ms realize, 116.34 ms CL, -0.04 loss, 421 tensors, 0.04 GB used, 3.35 GFLOPS<br />10.36 ms cpy, 234.90 ms run, 84.84 ms build, 36.51 ms realize, 113.54 ms CL, 0.05 loss, 421 tensors, 0.04 GB used, 49.03 GFLOPS</span><br /></p><p>So ACO is about 4 times faster to compile but produces binaries that are less optimised.</p><p>The benchmark produces 148 shaders:</p><p>LLVM:</p><div style="text-align: left;"><span style="font-family: courier;">126 Max Waves: 16 </span></div><div style="text-align: left;"><span style="font-family: courier;"> 6 Max Waves: 10<br /></span></div><div style="text-align: left;"><span style="font-family: courier;"> 5 Max Waves: 9</span></div><div style="text-align: left;"><span style="font-family: courier;"> 6 Max Waves: 8 <br /></span></div><div style="text-align: left;"><span style="font-family: courier;"> 5 Max Waves: 4</span></div><p><br />ACO:</p><div style="text-align: left;"><span style="font-family: courier;"> 96 Max Waves: 16</span></div><div style="text-align: left;"><span 
style="font-family: courier;"> 36 Max Waves: 12 <br /></span></div><div style="text-align: left;"><span style="font-family: courier;"> 2 Max Waves: 10</span></div><div style="text-align: left;"><span style="font-family: courier;"> 10 Max Waves: 8</span></div><div style="text-align: left;"><span style="font-family: courier;"> 4 Max Waves: 4</span></div><div style="text-align: left;"><br /></div><p>So ACO doesn't quite get the optimal shaders for a bunch of paths, even with some local hackery I've done to make it do better.[1]</p><p>I'll investigate ROCm next week maybe, got a bit of a cold/flu, and large GPU stacks usually make me want to wipe the machine after I test them :-P <br /></p><p>[1] <a href="https://gitlab.freedesktop.org/airlied/mesa/-/commits/radeonsi-rusticl-aco-wip">https://gitlab.freedesktop.org/airlied/mesa/-/commits/radeonsi-rusticl-aco-wip</a><br /></p><p><br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-1594656116798422162 2023-05-21T20:12:00.002-07:00 2023-05-21T20:12:30.977-07:00 lavapipe and sparse memory bindings: part two <p> Thanks for all the suggestions, on here, and on twitter and on mastodon, anyway who noted I could use a single fd and avoid all the pain was correct!</p><p>I hacked up an ever growing ftruncate/madvise memfd and it seemed to work fine. In order to use it for sparse I have to use it for all device memory allocations in lavapipe which means if I push forward I probably have to prove it works and scales a bit better to myself. 
I suspect layering some of the pb bufmgr code on top of an ever growing fd might work, or maybe just having multiple 2GB buffers might be enough.</p><p>Not sure how best to do shaderResourceResidency; userfaultfd might be somewhat useful, and mapping with PROT_NONE and then using write(2) to get a -EFAULT is also promising, but I'm not sure how best to avoid segfaults for reads/writes to PROT_NONE regions.<br /></p><p>Once I got that going, though, I ran headfirst into something that should have been obvious to me, but I hadn't thought through.</p><p>llvmpipe allocates all its textures linearly; there is no tiling (even for Vulkan optimal tiling). Sparse textures are incompatible with linear implementations. For sparseImage2D you have to be able to give the sparse tile sizes from just the image format. This typically means you have to work out how large the tile that fits into a hw page is in w/h. Of course for a linear image, this would be dependent on the image stride not just the format, and you just don't have that information.</p><p>I guess it means texture tiling in llvmpipe might have to become a thing; we've thought about it over the years but I don't think there's ever been a solid positive for implementing it.</p><p>Might have to put sparse support on the back burner for a little while longer. 
<br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-1651358466954117101 2023-05-17T00:28:00.000-07:00 2023-05-17T00:28:48.285-07:00 lavapipe and sparse memory bindings <p>Mike nerdsniped me into wondering how hard sparse memory support would be in lavapipe.</p><p>The answer is, unfortunately, extremely.<br /></p><p>Sparse binding essentially allows creating a vulkan buffer/image of a certain size, then plugging in chunks of memory to back it in page-size multiple chunks.</p><p>This works great with GPU APIs where we've designed this, but it's actually hard to pull off on the CPU.</p><p>Currently lavapipe allocates memory with an aligned malloc. It allocates objects with no backing and non-sparse bindings connect objects to the malloced memory.</p><p>However with sparse objects, the object creation should allocate a chunk of virtual memory space, then sparse binding should bind allocated device memory into the virtual memory space. Except Linux has no interfaces for doing this without using a file descriptor.</p><p>You can't mmap a chunk of anonymous memory that you allocated with malloc to another location. So if I malloc backing memory A at 0x1234000, but the virtual memory I've used for the object is at 0x4321000, there's no nice way to get the memory from the malloc to be available at the new location (unless I missed an API).</p><p>However you can do it with file descriptors. You can mmap a PROT_NONE area for the sparse object, then allocate the backing memory into file descriptors, then mmap areas from those file descriptors into the correct places.</p><p>But there are limits on file descriptors: you get a soft limit of 1024 and a hard limit of 4096 by default, which is woefully low for this. 
Also *all* device memory allocations would need to be fd backed, not just ones going to be used in sparse allocations.</p><p>Vulkan has a limit maxMemoryAllocationCount that could be used for this, but setting it to the fd limit is a problem because some fds are being used by the application and just in general by normal operations, so reporting 4096 for it is probably going to explode if you only have 3900 of them left.</p><p>Also the sparse CTS tests don't respect the maxMemoryAllocationCount anyways :-)</p><p>I shall think on this a bit more, please let me know if anyone has any good ideas!<br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 3 tag:blogger.com,1999:blog-4530460124602916146.post-7968821791383818554 2023-04-23T20:29:00.003-07:00 2023-04-23T20:29:57.630-07:00 Fedora 38 LLVM vs Team Fortress 2 (TF2) <p>F38 just released, and I'm seeing a bunch of people complain that TF2 dies on AMD or other platforms when lavapipe is installed. Who's at fault? I've no real idea. How to fix it? I've no real idea.</p><h4 style="text-align: left;">What's happening?</h4><p>AMD OpenGL drivers use LLVM as the backend compiler. Fedora 38 updated to LLVM 16. LLVM 16 is built with C++17 by default. C++17 introduces new "operator new/delete" interfaces[1].</p><p>TF2 ships with its own libtcmalloc_minimal.so implementation; tcmalloc expects to replace all the new/delete interfaces, but the version in TF2 must not support, or has broken support for, the new aligned interfaces.</p><p>What happens is when TF2 probes OpenGL and LLVM is loaded, when DenseMap initializes, one "new" path fails to go into tcmalloc, but the "delete" path does, and this causes tcmalloc to explode with</p><p>"src/tcmalloc.cc:278] Attempt to free invalid pointer"</p><h4 style="text-align: left;">Fixing it?</h4><p style="text-align: left;">I'll talk to Valve and see if we can work out something; LLVM 16 doesn't seem to support building with C++14 anymore. 
Statically linking libstdc++ into LLVM might avoid the tcmalloc overrides, but I'm not sure, and it might also not be acceptable to the wider Fedora community.<br /></p><p>[1] <a href="https://www.cppstories.com/2019/08/newnew-align/">https://www.cppstories.com/2019/08/newnew-align/</a><br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 3 tag:blogger.com,1999:blog-4530460124602916146.post-2534212502842672837 2023-04-18T22:16:00.003-07:00 2023-04-18T22:16:50.069-07:00 nouveau/gsp + kernel module firmware selection for initramfs generation <p>There are plans for nouveau to support using the NVIDIA-supplied GSP firmware in order to support new hardware going forward </p><p>The nouveau project doesn't have any input or control over the firmware. NVIDIA have made no promises around stable ABI or firmware versioning. The current status quo is that NVIDIA will release versioned signed gsp firmwares as part of their driver distribution packages that are version-locked to their proprietary drivers (open source and binary). They are working towards allowing these firmwares to be redistributed in linux-firmware.</p><p>The NVIDIA firmwares are quite large. 
The nouveau project will control the selection of what versions of the released firmwares are to be supported by the driver; it's likely a newer firmware will only be pulled into linux-firmware for:</p><ol style="text-align: left;"><li>New hardware support (new GPU family or GPU support)</li><li>Security fixes in the firmware</li><li>New features that are required to be supported</li></ol><p>This should at least limit the number of firmwares in the linux-firmware project.</p><p>However a secondary effect of the size of the firmwares is that having the nouveau kernel module add more and more MODULE_FIRMWARE lines for each iteration will mean the initramfs sizes will get steadily larger on systems, and after a while the initramfs will contain a few gsp firmwares that the driver doesn't even need to run.</p><p>To combat this I've looked into adding some sort of module grouping which dracut can pick one out of.</p><p></p><p>It currently looks something like:</p><p></p><pre class="notranslate"><code class="notranslate">MODULE_FIRMWARE_GROUP_ONLY_ONE("ga106-gsp");
MODULE_FIRMWARE("nvidia/ga106/gsp/gsp-5258902.bin");
MODULE_FIRMWARE("nvidia/ga106/gsp/gsp-5303002.bin");
MODULE_FIRMWARE_GROUP_ONLY_ONE("ga106-gsp"); </code></pre><p class="notranslate" style="text-align: left;"><code class="notranslate"><span style="font-family: verdana;">This "group only one" marker will end up in the module info section, and dracut will only pick one module from the group to install into the initramfs. Due to how the module info section is constructed this will end up picking the last module in the group first.</span></code></p><p class="notranslate" style="text-align: left;"><code class="notranslate"><span style="font-family: verdana;">The dracut MR is: </span></code></p><p class="notranslate" style="text-align: left;"><code class="notranslate"><span style="font-family: verdana;"><a href="https://github.com/dracutdevs/dracut/pull/2309">https://github.com/dracutdevs/dracut/pull/2309</a></span></code></p><p class="notranslate" style="text-align: left;"><code class="notranslate"><span style="font-family: verdana;">The kernel one liner is:</span></code></p><p class="notranslate" style="text-align: left;"><code class="notranslate"><span style="font-family: verdana;"><a href="https://lore.kernel.org/all/20230419043652.1773413-1-airlied@gmail.com/T/#u">https://lore.kernel.org/all/20230419043652.1773413-1-airlied@gmail.com/T/#u</a><br /></span></code></p><p><br /> </p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 1 tag:blogger.com,1999:blog-4530460124602916146.post-1973583421132584520 2023-03-12T21:14:00.003-07:00 2023-03-12T21:14:41.659-07:00 vulkan video vp9 decode - radv update <p>While going over the AV1 work, a few people commented on the lack of VP9, and a few people said it would be an easier place to start etc.</p><p>Daniel Almeida at Collabora took a first pass at writing the spec up, and I decided to go ahead and take it to a working demo level.</p><p>Lynne was busy, and they'd already said it should take an afternoon, so I decided to have a go at writing the ffmpeg side for it as well as finish off Daniel's radv 
code.</p><p>About 2 mins before I finished for the weekend on Friday, I got a single frame to decode, and this morning I finished off the rest to get at least 2 test videos I downloaded to work.</p><p>Branches are at [1] and [2]. There is only 8-bit support so far and I suspect some cleaning up is required.<br /></p><p>[1] https://github.com/airlied/FFmpeg/tree/vulkan-vp9-decode</p><p>[2] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-decode-mesa-vp9</p><p><br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 1 tag:blogger.com,1999:blog-4530460124602916146.post-7315168286832959805 2023-02-07T20:07:00.001-08:00 2023-02-07T20:07:36.875-08:00 vulkan video: status update (anv + radv) <p> Okay just a short status update.</p><h4 style="text-align: left;">radv H264/H265 decode: <br /></h4><p>The radv h264/h265 support has been merged to the mesa main branch. It is still behind the RADV_PERFTEST=video_decode flag, and should work for basics from VI/GFX8+. It still has not passed all the CTS tests.</p><h4 style="text-align: left;">anv H264 decode:<br /></h4><p>The anv h264 decode support has been merged to the mesa main branch. It has been tested from Skylake up to DG2. It has no enable flag, just make sure to build with h264dec video-codec support. It passes all current CTS tests.</p><h4 style="text-align: left;">hasvk H264 decode:</h4><p style="text-align: left;">I ported the anv h264 decoder to hasvk, the vulkan driver for Ivybridge/Haswell. This is in a draft MR (<a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21183">HASVK H264</a>). I haven't given this much testing yet; it has worked in the past. I'll get to testing it before trying to get it merged.</p><h4 style="text-align: left;">radv AV1 decode:<br /></h4><p>I created an MR for spec discussion (<a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21173">radv av1</a>). 
I've also cleaned up the radv AV1 decode code.</p><h4 style="text-align: left;">anv AV1 decode: <br /></h4><p>I've started on anv AV1 decode support for DG2. I've gotten one very simple frame to decode. I will attempt to do more. I think filmgrain is not going to be supported in the short term. I'll fill in more details on this when it's working better. I think there are a few things that might need to be changed in the AV1 decoder provisional spec for Intel, there are some derived values that ffmpeg knows that it would be nice to not derive again, and there are also some hw limits around tiles and command buffers that will need to be figured out.<br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-2510847156020839505 2023-01-18T19:53:00.002-08:00 2023-01-18T19:53:52.304-08:00 vulkan video decoding: anv status update <p>After hacking the Intel media-driver and ffmpeg I managed to work out how the anv hardware mostly works now for h264 decoding.</p><p>I've pushed a branch [1] and a MR[2] to mesa. The basics of h264 decoding are working great on gen9 and compatible hardware. 
I've tested it on my one Lenovo WhiskeyLake laptop.</p><p>I have ported the code to hasvk as well, and once we get moving on this I'll polish that up and check we can h264 decode on IVB/HSW devices.</p><p>The one feature I know is missing is status reporting, radv can't support that from what I can work out due to firmware, but anv should be able to so I might dig into that a bit.<br /></p><p>[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-decode<br /></p><p>[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20782<br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-947859915703992647 2023-01-16T23:54:00.001-08:00 2023-01-16T23:54:07.824-08:00 vulkan video decoding: av1 (yes av1) status update <p>Needless to say h264/5 weren't my real goals in life for video decoding. Lynne and myself decided to see what we could do to drive AV1 decode forward by creating our own extensions called VK_MESA_video_decode_av1. This is a radv only extension so far, and may expose some peculiarities of AMD hardware/firmware.<br /></p><p>Lynne's blog entry[1] has all the gory details, so go read that first. (really read it first).</p><p>Now that you've read and understood all that, I'll just rant here a bit. Figuring out the DPB management and hw frame ref and curr_pic_idx fields was a bit of a nightmare. I spent a few days hacking up a lot of wrong things before landing on the thing we agreed was the least wrong which was having the ffmpeg code allocate a frame index in the same fashion as the vaapi radeon implementation did. I had another hacky solution that involved overloading the slotIndex value to mean something that wasn't DPB slot index, but it wasn't really any better. 
I think there may be something about the hw I don't understand so hopefully we can achieve clarity later.<br /></p><p>[1] <a href="https://lynne.ee/vk_mesa_video_decode_av1.html">https://lynne.ee/vk_mesa_video_decode_av1.html</a><br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-8818700810401668777 2022-12-28T23:22:00.003-08:00 2022-12-28T23:22:38.173-08:00 vulkan video encoding: radv update <p>After the video decode stuff was fairly nailed down, Lynne from ffmpeg nerdsniped^Wtalked me into looking at h264 encoding.</p><p>The AMD VCN encoder engine is a very different interface to the decode engine and required a lot of code porting from the radeon vaapi driver. Prior to Xmas I burned a few days on typing that all in, and yesterday I finished typing and moved to debugging the pile of trash I'd just typed in.</p><p>Lynne meanwhile had written the initial ffmpeg side implementation, and today we threw them at each other, and polished off a lot of sharp edges. We were rewarded with valid encoded frames.<br /></p><p>The code at this point is only doing I-frame encoding, we will work on P/B frames when we get a chance.</p><p>There are also a bunch of hacks and workarounds for API/hw mismatches, that I need to consult with Vulkan spec and AMD teams about, but we have a good starting point to move forward from. I'll also be offline for a few days on holidays so I'm not sure it will get much further until mid January.<br /></p><p>My branch is [1]. 
Lynne's ffmpeg branch is [2].<br /></p><p>[1] <a href="https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-enc-wip">https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-enc-wip</a><br /></p><p>[2] <a href="https://github.com/cyanreg/FFmpeg/tree/vulkan_decode">https://github.com/cyanreg/FFmpeg/tree/vulkan_decode</a><br /></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 0 tag:blogger.com,1999:blog-4530460124602916146.post-3638865595474696426 2022-12-16T14:36:00.003-08:00 2022-12-16T14:37:15.731-08:00 vulkan video decoding: anv status <p>After cleaning up the radv stuff I decided to go back and dig into the anv support for H264.<br /></p><p>The current status of this work is in a branch [1]. This work is all against the current EXT decode beta extensions in the spec.<br /></p><p>This
contains an initial implementation of H264 decode for the Intel GPUs that anv supports. I've only tested it on Kabylake equivalents so far. It decodes some of the basic streams I've thrown at it from ffmpeg. Now this isn't as far along as the AMD implementation, but I'm also not sure I'm programming the hardware correctly. The Windows DXVA API has two ways to decode H264, "short" and "long". I believe, but I'm not 100% sure, that the current Vulkan API is quite close to "short", but the only Intel implementations I've found source for are for "long". I've bridged this gap by writing a slice header parser in mesa, but I think the hw might be capable of taking over that task, and I could in theory dump a bunch of code. But the programming guides for the hw block are a bit vague on some of the details around how "long" works. Maybe at some point someone at Intel can tell me :-)<br /></p><h3 style="text-align: left;">Building:</h3><p style="text-align: left;"></p><span style="font-family: courier;">git clone https://gitlab.freedesktop.org/airlied/mesa<br /><br />cd mesa<br /><br />git checkout anv-vulkan-video-prelim-decode<br /><br />mkdir build<br /><br />meson build -Dvulkan-beta=true -Dvulkan-drivers=intel -Dvideo-codecs=h264dec --prefix=<prefix><br /><br />cd build<br /><br />ninja<br /><br />ninja install</span><p style="text-align: left;"></p><h3 style="text-align: left;">Running:</h3><span style="font-family: courier;">export VK_ICD_FILENAMES=<prefix>/share/vulkan/icd.d/intel_icd.x86_64.json<br /><br />export ANV_VIDEO_DECODE=1<br /><br />vulkaninfo</span><p>This
should show support for VK_KHR_video_queue, VK_KHR_video_decode_queue,
VK_EXT_video_decode_h264.</p><p>[1] <a href="https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-prelim-decode">https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-prelim-decode</a></p> Dave Airlie https://www.blogger.com/profile/03386351362681039664 noreply@blogger.com 2
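As an aside, the slice header parsing mentioned in the anv post is mostly a matter of pulling exponential-Golomb codes out of the RBSP bitstream (H.264 spec, section 9.1). A minimal sketch of a ue(v) reader in C, purely illustrative and not the actual mesa code, looks something like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal MSB-first bit reader over a byte buffer. */
struct bitreader {
    const uint8_t *buf;
    size_t len;  /* buffer length in bytes */
    size_t pos;  /* current position in bits */
};

/* Returns the next bit, or -1 when the buffer is exhausted. */
static int get_bit(struct bitreader *br)
{
    if (br->pos >= br->len * 8)
        return -1;
    int bit = (br->buf[br->pos / 8] >> (7 - (br->pos % 8))) & 1;
    br->pos++;
    return bit;
}

/*
 * ue(v): count leading zero bits up to the first 1, then read that many
 * more bits as a suffix; the decoded value is (1 << zeros) - 1 + suffix.
 * Most syntax elements in an H264 slice header are coded this way.
 */
static int32_t get_ue_golomb(struct bitreader *br)
{
    int zeros = 0;
    int b;

    while ((b = get_bit(br)) == 0)
        zeros++;
    if (b < 0)
        return -1;  /* ran out of bits */

    uint32_t suffix = 0;
    for (int i = 0; i < zeros; i++) {
        b = get_bit(br);
        if (b < 0)
            return -1;
        suffix = (suffix << 1) | (uint32_t)b;
    }
    return (int32_t)(((1u << zeros) - 1) + suffix);
}
```

A real parser also has to strip the Annex B emulation-prevention bytes (0x00 0x00 0x03) first and needs the signed se(v) variant on top, but the above is the core of it.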