Summary
The latest stable kernel is Linux 6.3, released by Linus Torvalds on Sunday, April 23rd, 2023.
The latest mainline (development) kernel is 6.3. The Linux 6.4 “merge window” is open.
Linux 6.3
Linus Torvalds announced the release of Linux 6.3, noting, “It’s been a calm release this time around, and the last week was really no different. So here we are, right on schedule”. As usual, the KernelNewbies website has a summary of Linux 6.3, including links to the appropriate LWN (Linux Weekly News) articles with deep dives for each new feature (if you like this podcast and want to support Linux Kernel journalism, please subscribe to Linux Weekly News).
Linux 6.3 includes additional support for the Rust programming language, a new red-black tree data structure for BPF programs, and the removal of a large number of legacy Arm systems.
With the release of Linux 6.3 comes the opening of the “merge window” (period of time during which disruptive changes are allowed to be merged into the kernel source code) for what will be Linux 6.4 in another couple of months. The next podcast release will include a full summary.
Thorsten Leemhuis has been doing his usual excellent work tracking regressions. He posted multiple updates during the Linux 6.3 development cycle, at one point saying that “The list of regressions from the 6.3 cycle I track is still quite short”. Most seemed to relate to build problems whose fixes had stalled. He had been concerned that there “are two regressions from the 6.2 cycle still not fixed”. These included that “Wake-on-lan (WOL) apparently is broken for a huge number of users” and “a huge number of DISCARD request on NVME devices with Btrfs” causing “a performance regression for some users”. With the final release of Linux 6.3, he has “nothing much to report”, with just “two regression[s] from the 6.3 cycle…worth mentioning”.
Sebastian Andrej Siewior announced pre-empt RT (Real Time) patch v6.3-rc5-rt8.
Shuah Khan posted a summary of complaints addressed by the Linux Kernel Code of Conduct Committee from October 1, 2022 through March 31, 2023. During that time, they received six reports of “Unacceptable behavior of comments in email”. Most were resolved with “Clarification on the Code of Conduct related to maintainer rights and responsibility to reject code”. Overall, “The reports were about the decisions made in rejecting code and these actions are not viewed as violations of the Code of Conduct”.
Russia
It cannot have escaped anyone’s attention that there is an active military conflict ongoing in Europe. I try to keep politics out of this podcast. We are, after all, not lacking for other places in which to debate our opinions. Similarly, for the most part, it can be convenient as Open Source developers to attempt to live in an online world devoid of politics and physical boundaries, but the real world very much continues to exist, and in the real world there are consequences (in the form of sanctions) faced by those who invade other sovereign nations. Those consequences can be imposed by governments, but also by fellow developers. The latter was the case over the past month with a patch posted to the Linux “netdev” networking development list.
An engineer from (sanctioned) Russian company Baikal Electronics attempted to post some network patches. His post was greeted by a terse response from one of the maintainers: “We don’t feel comfortable accepting patches from or relating to hardware produced by your organization. Please withhold networking contributions until further notice”. Baikal is known for its connections to the Russian state. The question of official policy was subsequently raised by James Harkonnen, citing a message allegedly from Linus in which he reportedly said “I will not stop any kernel developer I trust from taking patches from Russian sources that they in turn trust, but at the same time I will also not override anybody who goes “I don’t want to have anything to do with this” and doesn’t want to work with Russian companies”. James wanted a clarification as to any official position. As of this date, no follow-up discussion appears to have taken place, and there does not appear to be an official kernel-wide policy on Russian patches.
Introducing Bugbot
Konstantin Ryabitsev, who is responsible for running kernel.org on behalf of the Linux Foundation, posted “Introducing bugbot”, in which he described a new tool that aims to be “a bridge between bugzilla [as in bugzilla.kernel.org] and public-inbox (the mailing list)”. The tool is “still a very early release” but it is able to “Create bugs from mailing list discussions, with full history”, and “Start mailing list threads from pre-triaged bugzilla bugs”. He closed (presciently) with “bugbot is very young and probably full of bugs, so it will still see a lot of change and will likely explode a couple of times”. True to that prediction, bugbot decided it had been summoned by the announcement of its own existence and replied to the thread, which Konstantin used as an example of the “may explode” comment he had made. Feedback on the new tool was generally positive.
Ongoing Development
Anjali Kulkarni posted version 3 of “Process connector bug fixes & enhancements”, a patch series to improve the performance of monitoring the exit of dependent threads. According to Anjali, “Oracle DB runs on a large scale with 100000s of short lived processes, starting up and exiting quickly. A process monitoring DB daemon which tracks and cleans up after processes that have died without a proper exit needs notifications only when a process died with a non-zero exit code (which should be rare)”. The patches allow a “client [to] register to listen for only exit or fork or a mix of all events. This greatly enhances performance”.
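For background, the process connector delivers these events to userspace over a netlink socket: a monitoring daemon subscribes to the CN_IDX_PROC multicast group and then receives fork/exec/exit notifications. The sketch below shows only the existing subscription flow; the ability to ask the kernel itself to filter for, e.g., non-zero exit codes is what Anjali’s patches add and is not shown here (error handling omitted for brevity).

    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>
    #include <linux/connector.h>
    #include <linux/cn_proc.h>

    int main(void)
    {
        /* Connector messages travel over a NETLINK_CONNECTOR socket. */
        int sock = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
        struct sockaddr_nl sa = {
            .nl_family = AF_NETLINK,
            .nl_groups = CN_IDX_PROC,   /* process-events multicast group */
            .nl_pid    = getpid(),
        };
        bind(sock, (struct sockaddr *)&sa, sizeof(sa));

        /* Ask the kernel to start sending process events. */
        struct {
            struct nlmsghdr nl;
            struct cn_msg cn;
            enum proc_cn_mcast_op op;
        } req = {0};
        req.nl.nlmsg_len  = sizeof(req);
        req.nl.nlmsg_type = NLMSG_DONE;
        req.nl.nlmsg_pid  = getpid();
        req.cn.id.idx     = CN_IDX_PROC;
        req.cn.id.val     = CN_VAL_PROC;
        req.cn.len        = sizeof(req.op);
        req.op            = PROC_CN_MCAST_LISTEN;
        send(sock, &req, sizeof(req), 0);

        /* Events now arrive as struct proc_event payloads: a daemon loops
         * on recv() and looks for PROC_EVENT_EXIT with a non-zero
         * exit_code (today that filtering happens here, in userspace). */
        return 0;
    }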
Vlastimil Babka posted “remove SLOB and allow kfree() with kmem_cache_alloc()”. In the patch posted, Vlastimil notes that “The SLOB allocator was deprecated in 6.2 so I think we can start exposing the complete removal in for-next and aim at 6.4 if there are no complaints”.
Thorsten Leemhuis (“the Linux kernel’s regression tracker”) poked an older thread about a 20% UDP performance degradation that Tariq Toukan (NVIDIA) had reported a few months ago. The report observed that a specific CFS (Completely Fair Scheduler, the current default Linux scheduler) patch was the culprit, but that the team discovering it “couldn’t come up with a good explanation how this patch causes this issue”. Thorsten tagged the mail for followup tracking.
Lukas Bulwahn posted “Updating information on lanana.org”. Lanana was set up to be “The Linux Assigned Names and Numbers Authority”, a play on organizations like the IANA (Internet Assigned Numbers Authority), which assigns e.g. IP addresses on the internet. As the patches note, “As described in Documentation/admin-guide/devices.rst, the device number register (or linux device list) is at Documentation/admin-guide/devices.txt and no longer maintained at lanana.org”. Lanana still technically hosts some of the LSB (Linux Standard Base) IDs.
On the Rust front, Asahi Lina posted “rust: add uapi crate” that “introduce[s] a new ‘uapi’ crate that will contain only these [uapi] publicly usable definitions” for use by userspace APIs.
Marcelo Tosatti posted “fold per-CPU vmstats remotely”, a patch that notes a (Red Hat) customer had encountered a system in which 48 out of 52 CPUs were in a “nohz_full” state (i.e. with the periodic “tick” interrupt stopped), where a process on the system was “trapped in throttle_direct_reclaim” (a low memory “reclaim” codepath) but was not making progress because the counters the reclaim code wanted to use were stale (coming from a completely idle CPU) and not updating. The patch series causes the “vmstat_shepherd” kernel thread to “flush the per-CPU counters to the global counters from remote [other] CPUs”.
Reinette Chatre posted “vfio/pci: Support dynamic allocation of MSI-X interrupts”. MSIs are “Message Signaled Interrupts”, typically used by modern buses, such as PCIe, in which an interrupt is not signaled using a traditional wiggling of a wire, but instead by a memory write to a special magic address that subsequently causes an actual hard-wired interrupt to be asserted. In the patch posting, Reinette noted that “Qemu allocates interrupts incrementally at the time the guest unmasks an interrupt, for example each time a Linux guest runs request_irq(). Dynamic allocation of MSI-X interrupts was not possible until v6.2. This prompted Qemu to, when allocating a new interrupt, first release a previously allocated interrupts (including disable of MSI-X) followed by re-allocation of all interrupts that includes the new interrupt”. This of course may not be possible while a device or accelerator is running. The patches are marked as RFC (Request For Comments) because “vfio support for dynamic MSI-X needs to work with existing user space as well as upcoming user space that takes advantage of this feature”. Reinette adds, “I would appreciate guidance on the expectations and requirements surrounding error handling when considering existing user space”. She provides several scenarios to consider.
Tejun Heo posted version 3 of “sched: Implement BPF extensible scheduler class”, which “proposed a new scheduler class called ‘ext_sched_class’, or sched_ext, which allows scheduling policies to be implemented as BPF programs”. BPF (Berkeley Packet Filter) programs are small specially processed “bytecode” programs that can be loaded into the kernel and run within a special form of sandbox. They are commonly used to implement certain tracing logic and come with restrictions (for obvious reasons) on the nature of the modifications they can make to a running kernel. Due to their complexity, and potential intrusiveness of allowing scheduling algorithms to be implemented in BPF programs, the patches come with a (lengthy) “Motivation” section, describing the “Ease of experimentation and exploration”, among other reasons for allowing BPF extension of the scheduler instead of requiring traditional patches. An example provided includes that of implementing an L1TF (L1 Terminal Fault, a speculation execution security side-channel bug in certain x86 CPUs) aware scheduler that performs co-scheduling of (safe to pair) peer threads using sibling hyperthreads using BPF.
Joel Fernandes sent a patch adding himself as a maintainer for RCU, noting “I have spent years learning / contributing to RCU with several features, talks and presentations, with my most recent work being on Lazy-RCU. Please consider me for M[aintainer], so I can tell my wife why I spend a lot of my weekends and evenings on this complicated and mysterious thing — which is mostly in the hopes of preventing the world from burning down because everything runs on this one way or another”. RCU (Read-Copy-Update) is a notoriously difficult subsystem to understand, yet it is a feature of modern operating systems that yields significant performance gains by allowing readers to proceed without locks: updates create new versions of shared data, and old versions are reclaimed only once all pre-existing readers have finished with them. Joel later followed up with “Core RCU patches for 6.4”, including the shiny new MAINTAINERS change and several other fixes.
Separately, Paul McKenney (the original RCU author, and co-inventor) posted assorted updates to sleepable RCU (SRCU) reducing cache footprint and marking it non-optional in Kconfig (kernel build configuration), “courtesy of new-age printk() requirements”.
Mike Kravetz raised a concern about THP (Transparent Huge Page) “backed thread stacks”. In his mail, he cited a “product team” that had “recently experienced ‘memory bloat’ in their environment” due to the alignment of the allocations they had used for thread local stacks within the Java Virtual Machine (JVM) runtime. Mike questioned whether stacks should always be THP given that “Stacks by their very nature grow in somewhat unpredictable ways over time”. Most replies were along the lines that the JVM should alter how it does allocations to use the MADV_NOHUGEPAGE parameter to madvise when allocating space for thread stacks.
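For illustration, the userspace-side change suggested in that thread amounts to marking each thread-stack mapping with madvise(MADV_NOHUGEPAGE) right after it is reserved. A minimal sketch (the size and flags here are illustrative, not the JVM’s actual values):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Reserve a thread stack and opt it out of transparent huge pages,
     * so sparse use of the stack is not backed by (mostly wasted) 2MB pages. */
    void *alloc_thread_stack(size_t size)
    {
        void *stack = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (stack == MAP_FAILED)
            return NULL;

        /* Hint to the kernel: do not back this range with THP. */
        madvise(stack, size, MADV_NOHUGEPAGE);
        return stack;
    }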
Carlos Llamas posted “Using page-fault handler in binder” about “trying to remove the current page handling in [Android’s userspace IPC] binder and switch to using ->fault() and other mm/ infrastructure”. He was seeking pointers and input on the direction from other developers.
Mike Rapoport posted a patch series that “move[s] core MM initialization to mm/mm_init.c”.
Randy Dunlap noted that uclinux.org was dead and requested references to it be removed from the Linux kernel MAINTAINERS file.
Jonathan Corbet (of LWN) posted various cleanups to the kernel documentation (which he maintains), including an “arch reorg” to clean up architecture specific docs.
Architectures
Arm
Lukasz Luba posted “Introduce runtime modifiable Energy Model”, a patch set that “adds a new feature which allows to modify Energy Model (EM) power values at runtime. It will allow to better reflect power model of a recent SoCs and silicon. Different characteristics of the power usage can be leverage[d] and thus better decisions made during task placement”. Thus, the kernel’s (CFS) scheduler can (with this patch) make a decision about where to schedule (place, or migrate) a running process (known as a task within the kernel) according to power usage that the silicon knows will vary with the nature of the workload and its use of the hardware. For example, heavy GPU use will cause a GPU to heat up and alter a chip’s (SoC’s) thermal properties in a manner that may make it better to migrate other tasks to a different core.
Itanium
Reports of Itanium’s demise may not have been greatly exaggerated, but when it comes to the kernel they may have been a little premature by a month or two. Florian Weimer followed up to “Retire IA64/Itanium support” with a question, “Is this still going ahead? In userspace, ia64 is of course full of special cases, too, so many of us really want to see it gone, but we can’t really start the removal process while there is still kernel support”.
LoongArch
Tianrui Zhao posted version 5 of “Add KVM LoongArch support”.
Huacai Chen posted a patch, “LoongArch: Make WriteCombine configurable for ioremap()” that aims to work around a PCIe protocol violation in the implementation of the LS7A chipset.
Separately, Huacai also posted a patch enabling the kernel itself to use FPU (Floating Point Unit) functions. Quoting the patch, “They can be used by some other kernel components, e.g. the AMDGPU graphic driver for DCN”.
WANG Xuerui posted “LoongArch: Make bounds-checking instructions useful”, referring to “BCE” (Bounds Checking Error) instructions, similar to those of other architectures, such as x86_64.
POWER
Laurent Dufour posted “Online new threads according to the current SMT level”, which aims to balance a hotplugged CPU’s SMT level against the current one used by the overall system. For example, a system capable of SMT8 but booted in SMT4 will currently nonetheless online all 8 SMT threads of a subsequently added CPU, rather than only 4 (to match the system).
RISC-V
Evan Green posted the fourth version of “RISC-V Hardware Probing User Interface”, which aims to handle the number of (potentially incompatible) ISA extensions present in implementations of the RISC-V architecture. The basic idea is to provide a vDSO (virtual Dynamic Shared Object – a kind of library that appears in userspace and is fast to link against, but is owned by the kernel) and backing syscall (for fallback use by the vDSO in certain cases) that can quickly hand an application key/value pairs representative of potential ISA features present on a system. The previous attempts had experienced pushback, so this time Evan came with performance numbers showing the (many) orders of magnitude differences in performance between using a vDSO/syscall approach vs. the sysfs file interface originally counter proposed by Greg KH (Greg Kroah-Hartman). Greg had preferred an application perform many open calls to parse sysfs files in order to determine the capabilities of a system, but this would be expensive for every binary. This patch series was later merged by Palmer Dabbelt (the RISC-V kernel maintainer) and should therefore make its way into the Linux 6.4 kernel series in the next couple of months.
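As a rough sketch of what calling such an interface from userspace might look like (the struct layout, key name, and syscall number below are assumptions for illustration based on the posting; the ABI that actually lands should be checked against the kernel documentation):

    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Illustrative only: one key/value pair as described in the posting. */
    struct riscv_hwprobe {
        int64_t  key;
        uint64_t value;
    };

    #define HWPROBE_KEY_MVENDORID 0   /* assumed key number */

    int main(void)
    {
        struct riscv_hwprobe pair = { .key = HWPROBE_KEY_MVENDORID };

    /* A single syscall (normally reached via the vDSO) fills in values for
     * all requested keys; an empty cpu set means "all CPUs". The syscall
     * number is assumed to come from the kernel headers. */
    #ifdef __NR_riscv_hwprobe
        long ret = syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0);
        if (ret == 0)
            printf("mvendorid: 0x%llx\n", (unsigned long long)pair.value);
    #endif
        return 0;
    }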
Sia Jee Heng posted version 5 of a patch series implementing hibernation support for RISC-V. According to the posting, “This series adds RISC-V Hibernation/suspend to disk support. Low level Arch functions were created to support hibernation”. The cover letter explains how e.g. swsusp_arch_resume “creates a temporary page table that [covers only] the linear map. It copies the restore code to a ‘safe’ page, then [starts to] restore the memory image”.
Heiko Stuebner posted “RISC-V: support some cryptography accelerations”. These rely on version 14 of a previous patch series adding experimental support for the “v” (vector) extension, which has not been ratified (made official) by the RISC-V International organization yet. And speaking of this, a recent discussion of the non-standard implementation of the RISC-V vector extension in the “T-Head C9xx” cores suggests describing those as an “errata” implementation.
The PINE64 project recently began shipping a RISC-V development board known as “Star64”. This board uses the StarFive JH7110 SoC for which Samin Guo recently posted an updated ethernet driver, apparently based on the DesignWare MAC from Synopsys. Separately, Walker Chen posted a DMA driver for the same SoC, and Mason Huo posted cpufreq support (which included enabling “the axp15060 pmic for the cpu power source”). Seems an effort is underway to upstream support for this low-cost “Raspberry Pi”-like alternative in the RISC-V ecosystem.
Greg Ungerer posted “riscv: support ELF format binaries in nommu mode” which does what it says on the tin: “add the ability to run ELF format binaries when running RISC-V in nommu mode. That support is actually part of the ELF-FDPIC loader, so these changes are all about making that work on RISC-V”. Greg notes, “These changes have not been used to run actual ELF-FDPIC binaries. It is used to load and run normal ELF – compiled -pie format. Though the underlying changes are expected to work with full ELF-FDPIC binaries if or when that is supported on RISC-V in gcc”.
Anup Patel posted version 18 of “RISC-V IPI Improvements” which aims to teach RISC-V (on suitable hardware) how to use “normal per-CPU interrupts” to send IPIs (Inter-Processor Interrupts), as well as remote TLB (Translation Lookaside Buffer) flushes and cache maintenance operations without having to resort to calls into “M” mode firmware.
x86 (x86_64)
Rick Edgecombe posted version 8 of “Shadow stacks for userspace”, to which Borislav Petkov replied “Yes, finally! That was loooong in the making. Thanks for the persistence and patience”. He signed off as having reviewed the patches.
Ian Rogers posted “Event updates for GNR, MTL and SKL”. Apparently these perf events are generated automatically using a script on Intel’s github (that’s pretty sweet).
Usama Arif posted version 15 of “Parallel CPU bringup for x86_64”. This is about doing parallel calls to INIT/SIPI/SIPI (the initialization sequences used by x86 CPUs to bring them up) rather than the single threaded process that previously was used by the Linux kernel.
Tony Luck posted version 2 of “Handle corrected machine check interrupt storms”, which includes additional patches from Smita Koralahalli that “Extend the logic of handling Intel’s corrected machine check interrupt storms to AMD’s threshold interrupts”.
Yi Liu posted “iommu: Add nested domain support”, which “Introduce[s] a new domain type for a user space I/O address, which is nested on top of another address space address represented by a UNMANAGED domain”.
Kirill A. Shutemov posted version 16 of “Linear Address Masking enabling”. As he noted, “(LAM) modifies the checking that is applied to 64-bit linear addresses, allowing software to use of the untranslated address bits for metadata. The capability can be used for efficient address sanitizers (ASAN) implementation and for optimizations in JITs and virtual machines”. It’s also been present in architectures such as Arm for many, many years as TBI (Top Byte Ignore), etc.
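To make the idea concrete, here is a small, purely illustrative sketch of the kind of pointer tagging LAM (or Arm’s TBI) enables: metadata is packed into address bits that the hardware has been told to ignore during translation, so a tagged pointer can be dereferenced directly rather than having to be masked first. The bit position below is an arbitrary example, not the actual LAM configuration.

    #include <stdint.h>

    /* Example only: stash a small tag in the upper bits of a pointer.
     * With LAM/TBI enabled the CPU ignores these bits on dereference;
     * without it, software must mask them off before each use. */
    #define TAG_SHIFT 56

    static inline void *tag_pointer(void *p, uint8_t tag)
    {
        return (void *)((uintptr_t)p | ((uintptr_t)tag << TAG_SHIFT));
    }

    static inline uint8_t pointer_tag(void *p)
    {
        return (uint8_t)((uintptr_t)p >> TAG_SHIFT);
    }

    static inline void *untag_pointer(void *p)
    {
        return (void *)((uintptr_t)p & (((uintptr_t)1 << TAG_SHIFT) - 1));
    }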
Kuppuswamy Sathyanarayanan posted “TDX Guest Quote generation support”, which enables “TDX” (Trust Domain Extensions – aka Confidential Compute) guests to attest to their “trustworthiness to other entities before provisioning secrets to the guest”. The patch describes a two step process including a “TDREPORT generation” and a “Quote generation”. The TDREPORT captures measurements; it is then sent to a “Quoting Enclave” (QE) that generates a “remotely verifiable Quote”. A special conduit is provided for guests to send these quotes.
Shan Kang posted some benchmark results from KVM for Intel’s new “FRED” (Flexible Return and Event Delivery) architecture, which provides an enhanced replacement for the legacy syscall/sysenter kernel entry and event delivery mechanisms.
Mario Limonciello posted “Add vendor agnostic mechanism to report hardware sleep”, noting that “An import[ant] part of validating that S0ix [an SoC level idle power state] worked properly is to check how much of a cycle was spent in a hardware sleep state”.
Summary
The latest stable kernel is Linux 6.2.2, released by Greg Kroah-Hartman on March 3rd 2023. The latest mainline (development) kernel is 6.3-rc1, released by Linus on March 5th 2023.
Mathieu Desnoyers has announced Userspace RCU release 0.14.0 which adopts a baseline requirement of C99 and C++11, and introduces new APIs for C++.
Alejandro Colomar announced man-pages-6.03 is released. Among the “most notable changes” is “We now have a hyperlinked PDF book of the Linux man-pages”.
Junio C Hamano announced that “A release candidate Git v2.40.0-rc2 is now available”.
Takashi Sakamoto has stepped up to become the owner of the FireWire subsystem.
Linux 6.2 released
Linux 6.2 was released “right on (the extended) schedule” on February 19th following an extra RC (Release Candidate) motivated by the end of year holidays. Linus noted in his release announcement that “Nothing unexpected happened” toward the end of the cycle but there were a “couple of small things” on the regression side that Thorsten Leemhuis is tracking. Since “they weren’t actively pushed by maintainers…they will have to show up for stable [kernel releases]”.
Thorsten diligently followed up with his summary of regressions, noting that “There are still quite a few known issues from this cycle mentioned below. Afaics none of them affect a lot of people”. He also recently posted “docs: describe how to quickly build a trimmed kernel” as “that’s something users will often have to do when they want to report an issue or test proposed fixes”.
Among the fixes that will come in for stable is a build fix for those running Linux 6.2 on a Talos II (IBM POWER9) machine, who may notice an “undefined reference to ‘hash__tlb_flush’” during kernel compilation. A fix is being tracked for backport to stable.
Speaking of regressions, Nick Bowler identified an older regression beginning in Linux 6.1 that caused “random crashes” on his SPARC machine. Peter Xu responded that it was likely a THP (Transparent Huge Page) problem, perhaps showing up because THP was disabled (which it was in Nick’s configuration). Nick tested a fix from Peter that seemed to address the issue.
As you’ll see below, ongoing discussions are taking place about the removal of various legacy architectures from the kernel. Another proposal recently made (by Christoph Hellwig) is to “orphan JFS” (the “Journalling File System”). Stefan Tibus was among those who stood up and claimed to still be “a happy user of JFS from quite early on all my Linux installations”.
Linux 6.3-rc1
Linus announced the closure of the merge window (the period of time during which disruptive changes are allowed to be merged into the kernel) with the release of Linux 6.3-rc1, noting, “So after several releases where the merge windows had something odd going on, we finally had just a regular “two weeks of just merge window”. It was quite nice. In fact, it was quite nice in a couple of ways: not only didn’t I have a huge compressed merge window where I felt I had to cram as much as possible into the first few days, but the fact that we _have_ had a couple of merge windows where I really asked for people to have everything ready when the merge window opened seems to have set a pattern: the bulk of everything really did come in early”.
As usual, Linux Weekly News has an excellent summary of part 1 and part 2 of the merge window (across two weeks). I encourage you to subscribe and read it for a full breakdown.
Ongoing Development
Linux 6.2 brought with it initial support for the Rust programming language. Development continues apace upstream, with proposed patches extending the support to include new features. Miguel Ojeda (the Rust for Linux maintainer) posted a pull request for Linux 6.3, including support for various new types. Daniel Almeida recently posted “rust: virtio: add virtio support”, which “adds virtIO support to the rust crate. This includes the capability to create a virtIO driver (through the module_virtio_driver macro and the respective Driver trait)”.
And the work extends to the architectural level also, with Conor Dooley recently posting “RISC-V: enable rust”, which he notes is a “somewhat blind (and maybe foolish) attempt at enabling Rust for RISC-V. I’ve tested this on Icicle [a prominent board], and the modules seem to work. I’d like to play around with Rust on RISC-V, but I’m not interested in using downstream kernels, so figured I should try and see what’s missing…”.
But probably the most interesting development in Rust language land has nothing to do with Rust as a language at all. Instead, it is a patch series titled “Rust DRM subsystem abstractions (& preview AGX driver)” from Asahi Lina. In the patch, Lina notes “This is my first take on the Rust abstractions from the DRM [graphics] subsystem. It includes the abstractions themselves, some minor prerequisite changes to the C side, as well as drm-asahi GPU driver (for reference on how the abstractions are used, but not necessarily intended to land together)”. It’s that last part, patch 18, the one titled “drm/asahi: Add the Asahi driver for Apple AGX GPUs”, which we refer to here. In it, Lina implements support for the GPUs used by the Apple M1, M1 Pro, M1 Max, M1 Ultra, and the Apple M2 silicon. This is not a small driver, and it is an interesting demonstration of the level of capability already being reached by Linux upstream Rust language support.
Lokesh Gidra posted an “RFC for new feature to move pages from one vma to another without split” which allows an “anonymous” (not file backed) page (the page being the fundamental granule by which memory is managed and accounted) to be moved from one part of a runtime heap (VMA) to another without otherwise impacting the state of the overall heap. The intended benefit is to managed runtimes with garbage collection, allowing for simplified “coarse-grained page-level compaction” garbage collection algorithms “wherein pages containing live objects are slid next to each other without touching them, while reclaiming in-between pages which contain only garbage”. The patch posting includes a lengthy writeup explaining the details.
Alison Schofield posted patches titled “CXL Poison List Retrieval & Tracing” targeting the CXL 3.0 specification, which allows OS management software to obtain a list of memory locations that have been poisoned (corrupted due to a RAS event, such as an ECC failure), for example in a “CXL.mem” DDR memory device attached to a system using the serial CXL interconnect.
Dexuan Cui noted that “earlyprintk=ttyS0” was broken on AMD SNP (Confidential Compute) guests running under KVM. This turned out to be due to a particular code branch taken during initialization that varied based upon whether a kernel was entered in 64-bit mode via EFI or through a direct (e.g. kexec/qemu KVM device modeling userspace) type of load.
Zhangjin Wu posted “Add dead syscalls elimination support” intended to remove support from the kernel for “dead” syscalls “which are not used in target system”. Presumably this is to benefit deeply embedded architectures where any excess memory used by the kernel is precious.
Nick Alcock posted “MODULE_LICENSE removals, first tranche” intended to “remove the MODULE_LICENSE usage from files/objects that are not tristate” [meaning that they are not actually set up to be used as modules to begin with].
Bobby Eshleman posted “vsock: add support for sockmap”. Bytedance are apparently “testing usage of vsock as a way to redirect guest-local UDS [Unix Domain Socket] requests to the host and this patch series greatly improves the performance of such a setup” – by 121% in throughput, according to the posting.
Chih-En Lin posted version 4 of a patch series “Introduce Copy-On-Write to Page Table” which aims to add support for COW to the other half of the equation. Copy-on-Write is commonly used as an optimization whereby a cloned process (for example, during a fork used to exec a new program) doesn’t actually get a copy of the entire memory used by the original process. Instead, the tracking structures (page tables) are modified to mark all the pages in the new process as read only. Only when it attempts to write to the memory are the actual pages copied. The COW page table patches aim to do the same for the page tables themselves, so that full copies are not needed until the new address space is modified. Pulling off this trick requires that some of the tables are copied, but not the leaf (PTE) entries themselves, which are shared between the two processes. David Hildenbrand thanked Chih-En for the work, and the measurements, but expressed concern about “how intrusive even this basic deduplication approach already is”.
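As a reminder of the baseline behavior these patches build upon, the classic copy-on-write contract of fork() is that parent and child initially share physical pages, and a private copy is made only when one side writes. A trivial sketch of that existing behavior:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* A large buffer: fork() does not copy it up front, it only marks
         * the pages read-only and shares them with the child. */
        size_t len = 64 * 1024 * 1024;
        char *buf = malloc(len);
        memset(buf, 0xaa, len);

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: this write faults and copies just the touched page;
             * the rest of the 64MB remains shared with the parent. */
            buf[0] = 1;
            _exit(0);
        }
        waitpid(pid, NULL, 0);

        /* Parent still sees its own, unmodified data. */
        printf("parent sees 0x%02x\n", (unsigned char)buf[0]);
        return 0;
    }

The page tables describing those shared pages are still fully copied at fork() time today; Chih-En’s series extends the same lazy-copy idea to the tables themselves.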
On the subject of page tables, Matthew (Willy) Wilcox posted version 3 of “New page table range API” that allows for setting up multiple page table entries at once, noting “The point of all this is better performance, and Fengwei Yin has measured improvement on x86”.
Architectures
Arm
Kristina Martsenko posted “arm64: support Armv8.8 memcpy instructions in userspace”, which adds support for (you guessed it) the memcpy instructions that were added in Armv8.8. These are described by the FEAT_MOPS documentation in the Arm ARM. As Kristina puts it, “The aim is to avoid having many different performance-optimal memcpy implementations in software (tailored to CPU model and copy size) and the overhead of selecting between them. The new instructions are intended to be at least as fast as any alternative instruction sequence”.
Various Apple Silicon patches have been posted. As the Asahi Linux project noted recently in “an update and reality check”, Linux “6.2 notably adds device trees and basic support for M1 Pro/Max/Ultra machines. However, there is still a long road before upstream kernels are usable on laptops”. Nonetheless, patches continue to fly, with the latest including “Apple M2 PMU support” from Janne Grunau, which notes that “The PMU itself appears to work in the same way as o[n] M1”, and support for the Broadcom BCM4387 WiFi chip used by Apple M1 platforms from Hector Martin. Hector also posted “Apple T2 platform support” patches.
Itanium
The Intel Itanium architecture, also known as “IA-64”, was originally announced on October 4th 1999. It was intended as the successor to another legacy architecture that Intel had previously introduced back in 1978. That legacy architecture (known as “x86”) had a number of design challenges that could limit its future scalability, but it was also quite popular, and there were a relatively large number of systems deployed. Nonetheless, Intel was determined to replace x86 with a modern architecture designed with the future in mind. Itanium was co-designed with Hewlett-Packard, who created the original ISA specification. It featured 128 64-bit general purpose registers, 128 floating point registers, 64 one-bit predicate registers, and more besides.
Itanium was a VLIW (Very Long Instruction Word) machine that leveraged fixed-width “bundles”, each containing three 41-bit instructions plus a 5-bit template describing which types of instructions are present in the bundle. The Itanium implementation of VLIW is referred to as “EPIC” (Explicitly Parallel Instruction Computing) – which one must be careful not to confuse with the highly successful x86 architecture implementation from AMD known as “EPYC”. In Itanium, modern high performance microprocessor innovations such as hardware speculation and Out-of-Order execution take a back seat to software managed speculation, requiring an extremely complicated compiler toolchain that took many years to develop. Even then, it was clear early on that software management of dependencies and speculation could not compete with a hardware implementation, such as that used by contemporary x86 and RISC CPUs.
Intel Itanium processors were officially discontinued in January of 2020. As Ard Biesheuvel noted across several patch postings attempting to remove or mark IA-64 as broken, various support for Itanium has already been removed from dependent projects (such as upstream Tianocore – the EFI implementation needed to boot such systems, from which Intel itself removed such support in 2018), “QEMU no longer implements support for it”, and given the lack of systems and ongoing firmware maintenance, “there is zero test coverage using actual hardware” (“beyond a couple of machines used by distros to churn out packages”). Even this author has long since decommissioned his Itanium system (named “Hamartia” after the tragic hero, which he acquired during the upstreaming of PCI support for Arm as both of Itanium’s users had expressed concern that Arm support for PCI might break Itanium and it thus seemed important to be able to test that this mission-critical architecture was not broken in the process).
As of this writing support for Itanium has not (yet) been removed from the kernel.
LoongArch
A lot of work is going into the LoongArch [aside: could someone please let me know how to pronounce it properly?]. Recent patches include a patch from Youling Tang (“Add support for kernel relocation”) that “allows to compile [the] kernel as PIE and to relocate it at any virtual address at runtime” (to “pave the way to KASLR”, added in a later patch). Another patch “Add hardware breakpoints/watchpoints support” does what it says on the tin. Finally, Tianrui Zhao posted “Add KVM LoongArch support”, which adds KVM support noting that the Loongson (the company behind the architecture) “3A5000” chip “supports hardware assisted virtualization”.
RISC-V
Evan Green posted “RISC-V: Add a syscall for HW probing” which started an extremely long discussion about the right (and wrong) ways to handle the myriad (sometimes mutually incompatible) extensions supported by the RISC-V community. Traditionally, architectures were quite standardized with a central authority providing curation. But while RISC-V does have the RISC-V International organization, and the concept of ratification for extensions with a standard set of extensions defined in various profiles, the practical reality is somewhat less rigid than folks may be used to. As a result, there are in fact a very wide range of implementations, and the kernel needs to somehow be able to handle all of the hundreds of permutations.
Most architectures handle minor variation between implementations using the “HWCAP” infrastructure and the “Auxiliary Vectors” which are special environment variables exported into every running process. This allows (e.g.) userspace software to quickly determine whether a particular feature is supported or not. For example, the feature might be some novel atomic or vector support that isn’t present in older processors. But when it comes to RISC-V this approach isn’t as easy. As Evan said in his posting, “We don’t have enough space for these all in ELF_HWCAP and there’s no system call that quite does this, so let’s just provide an arch-specific one to probe for hardware capabilities. This currently just provides m{arch,imp,vendor}id, but with the key-value pairs we can pass more in the future”.
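For comparison, the existing HWCAP mechanism can be queried with a single library call, since the kernel places the bitmask in every process’s auxiliary vector; for example (the bitmask layout is architecture specific, so the bit you would test differs per architecture):

    #include <stdio.h>
    #include <sys/auxv.h>   /* getauxval(), AT_HWCAP */

    int main(void)
    {
        /* The kernel hands a bitmask of CPU features to each process via
         * the auxiliary vector; no file I/O or extra syscalls needed. */
        unsigned long hwcap = getauxval(AT_HWCAP);

        /* Each architecture defines its own bit layout (see <asm/hwcap.h>);
         * checking a feature is a single AND. RISC-V simply has far more
         * possible extensions than such a bitmask can describe. */
        printf("AT_HWCAP = 0x%lx\n", hwcap);
        return 0;
    }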
The response was swift, and negative, with Greg Kroah-Hartman responding, “Ick, this is exactly what sysfs is designed to export in a sane way. Why not just use that instead? The “key” would be the filename, and the value the value read from the filename”. The response was that this would slow down future RISC-V systems because of the large number of file operations that every process would need to perform on startup in order for the standard libraries to figure out what features were supported or not. Worse, some of the infrastructure for file operations might not be available at the time when it would be needed. This situation is a good reminder of the importance of standardization and the value that it can bring to any modern architecture.
Speaking of standardization, several rounds of patches were posted titled “Add basic ACPI support for RISC-V” which “enables the basic ACPI infrastructure for RISC-V”. According to Sunil V L, who posted the patch series, “Supporting external interrupt controllers is in progress and hence it is tested using poll based HVC SBI console and RAM disk”.
Other patches recently posted for RISC-V include “Introduce virtual kernel mapping KASLR”. The patches note that “The seed needed to virtually move the kernel is taken from the device tree, so we rely on the bootloader to provide the correct seed”. Later patches may add support for the RISC-V “Zkr” random extension so that this can be provided by hardware instead. As a dependent patch, Alexandre Ghiti posted “Introduce 64b relocatable kernel”.
Deepak Gupta posted “riscv control-flow integrity for U mode” in which he notes he has “been working on linux support for shadow stack and landing pad instruction on riscv for a while. These are still RFC quality. But at least they’re in a shape which can start a discussion”. The RISC-V extension adding support for control flow integrity is called Zisslpcfi, which rolls off the tongue just as easily as all of the other extension names, chosen by cats falling on keyboards.
Jesse Taube posted “Add RISC-V 32 NOMMU support”, noting, “This patch-set aims to add NOMMU support to RV32. Many people want to build simple emulators or HDL models of RISC-V. [T]his patch makes it possible to run linux on them”.
Returning to the topic of incompatible vendor extensions, Heiko Stuebner posted “RISC-V: T-Head vector handling”, which notes “As is widely known, the T-Head C9xx cores used for example in the Allwinner D1 implement an older non-ratifed variant of the vector spec. While userspace will probably have a lot more problems implementing support for both, on the kernel side the needed changes are actually somewhat small’ish and can be handled via alternatives somewhat nicely. With this patchset I could run the same userspace program (picked from some riscv-vector-test repository) that does some vector additions on both qemu and a d1-nezha board. On both platforms it ran successfully and even produced the same results”.
Super-H
Returning to the subject of dying architectures once again, an attempt was made by Christoph Hellwig to “Drop arch/sh and everything that depends on it” since “all of the support has been barely maintained for almost 10 years, and not at all for more than 1 year”. Geert Uytterhoeven noted that “The main issue is not the lack of people sending patches and fixes, but those patches never being applied by the maintainers. Perhaps someone is willing to stand up to take over maintainership?” This caused John Paul Adrian Glaubitz to raise his hand and say he “actually would be willing to do it but I’m a bit hesitant as I’m not 100% sure my skills are sufficient”. Rob Landley offered to help out too. It seems sh might survive this round.
x86
Mathieu Desnoyers was interested in formal documentation from Intel concerning concurrent modification of code while it is executing (specifically, updating instructions to patch them as calling a debug handler via “INT3”). He wrote to Peter Anvin saying “I have emails from you dating from a few years back unofficially stating that it’s OK to update the first byte of an instruction with a single-byte int3 concurrently…Olivier Dion is working on the libpatch project aiming to use this property for low-latency/low-overhead live code patching in user-space as well, but we cannot find an official statement from Intel that guarantees this breakpoint-bypass technique is indeed OK without stopping the world while patching”. Steven Rostedt was among those who noted “The fact that we have been using it for over 10 years without issue should be a good guarantee”. Mathieu was able to find comprehensive documentation in the AMD manual that allows it, but noted again “I cannot find anything with respect to asynchronous cross-modification of code stated as clearly in Intel’s documentation”. Anyone want to help him?
Development continues toward implementing support for “Flexible Return and Event Delivery” aka “FRED” on Intel architecture. Among the latest patches, Ammar Faizi includes a fix to the “sysret_rip” selftest that handles the fact that FRED’s “syscall” instruction (to enter the kernel from userspace) no longer clobbers (overwrites) the x86 “rcx” and “r11” registers. On the subject of tests, Mingwei Zhang posted patches updating the “amx_test” suite to add support for several of the “new entities” that are present in Intel’s AMX (matrix extension) architecture.
Sean Christopherson posted “KVM: x86: Add “governed” X86_FEATURE framework”, which is intended to “manage and cache KVM-governed features, i.e. CPUID based features that require explicit KVM enabling and/or need to be queried semi-frequently by KVM”. According to Sean, “The idea originally came up in the context of the architectural LBRs [Last Branch Record, a profiling mechanism to precisely record the last N branches taken] series as a way to avoid querying guest CPUID in hot paths without needing a dedicated flag, but as evidenced by the shortlog, the most common usage is to handle the ever-growing list of SVM [AMD’s Secure Virtual Machine virtualization] features that are exposed to L1”. Reducing calls to CPUID is generally a good thing since it results in a (possibly lengthy) trap into microcode, and CPUID is also a context serializing instruction.
Paolo Bonzini posted “Cross-Thread Return Address Predictions vulnerability”, noting that “Certain AMD processors are vulnerable to a cross-thread return address predictions bug. When running in SMT [Simultaneous Multi-Threading] mode and one of the sibling threads transitions out of C0 state, the other thread gets access to twice as many entries in the RSB [Return Stack Buffer], but unfortunately the predictions of the now-halted logical processor are not purged”. Paolo is referring to the fact that x86 processors include two logical “threads” (which Intel calls “Hyperthreads” – a trademarked name – and which more generally are known as SMT or Simultaneous Multi-Threading). Most modern x86 processors include an optimization whereby, when one logical thread is not being used and transitions into what software sees as a “low power” state, its partitioned resources are given to the other thread, which consequently sees a boost in performance as it is no longer contending on the back end for execution units, and now has double the store buffer and predictor entries.
But in this case, the RSB [Return Stack Buffer] entries are not zeroed out in the process, meaning that it is possible for a malicious thread to “train” the RSB predictor later used by the peer thread to guess that certain function call return paths will be used. This opens up an opportunity to cause a sibling thread to speculatively execute down a wrong path that leaves cache breadcrumbs that can be measured in order to potentially leak certain information. Paolo addresses this by adding a KVM (hypervisor) parameter that “if set, will prevent the user from disabling the HLT, MWAIT, and CSTATE exits”, ensuring the hypervisor retains the opportunity to stuff the RSB with dummy safe values when the sibling thread goes to sleep.
Dionna Glaze posted “Add throttling detection to sev-guest”, noting that “The guest request synchronous API from SEV-SNP [AMD’s Confidential Computing feature] to the host’s security processor consumes a global resource. For this reason, AMD’s docs recommend that the host implements a throttling mechanism. In order for the guest to know it’s been throttled and should try its request again, we need some good-faith communication from the host that the request has been throttled. These patches work with the existing dev/sev-guest ABI”.
On the subject of Confidential Compute, Kai Huang posted version 9 of a patch series “TDX host kernel support” aiming to add support for Intel’s TDX Confidential Compute extensions, while Jeremi Piotrowski posted “Support nested SNP KVM guests on Hyper-V” intending to add support for nested (hypervisor inside hypervisor) support for AMD’s Confidential Compute to the Hyper-V hypervisor as used by Microsoft Azure. Nested Confidential Compute sounds fun.
Rick Edgecombe posted version 6 of “Shadow stacks for userspace”, a series that “implements Shadow Stacks for userspace using x86’s Control-flow Enforcement Technology (CET)”. As he reminds us, CET supports both shadow stacks and indirect branch tracking (landing pads), but these patches “implements just the shadow stack part of this feature, and just for userspace”.
Michael S. Tsirkin posted “revert RNG seed mess” noting “All attempts to fix up passing RNG [random entropy] seed via setup_data entry failed. Let’s just rip out all of it. We’ll start over”.
Arnd Bergmann posted “x86: make 64-bit defconfig the default” noting that 32-bit kernel builds were “rarely what anyone wants these days”. The patch changes “the default so that the 64-bit config gets used unless the user asked for i686_defconfig, uses ARCH=i386 or runs on a system that “uname -m” identifies as i386/i486/i586/i686”.
The latest stable kernel is Linux 6.1.11, released by Greg K-H on February 9th 2023.
The latest mainline (development) kernel is 6.2-rc7, released on February 5th 2023.
Linux 6.2 progress
A typical kernel development cycle begins with the “merge window” (period of time during which disruptive changes are allowed to be merged into the kernel) followed by a series of (weekly) Release Candidate (RC) kernels, and then the final release. In most cases, RC7 is the final RC, but it is not all that unusual to have an extra week, as is likely the case this time around. Linus said a few weeks ago, “I am expecting to do an rc8 this release regardless, just because we effectively had a lost week or two in the early rc’s”, and indeed fixes for RC8 were still coming in as recently as today. We should at this rate see RC8 tomorrow (Sunday is the normal release day), and the 6.3 merge window in another week, meaning we’ll cover the 6.3 merge window in the next edition of this podcast. In the meantime, I encourage listeners to consider subscribing and supporting LWN (Linux Weekly News), who always have a great merge window summary.
Confidential Compute (aka “CoCo”)
If there were a “theme of the moment” for the industry (other than layoffs), it would probably be Confidential Compute. It seems one can’t go more than 10 minutes without seeing a patch for some new confidential compute feature in one of the major architectures, or the system IP that goes along with it. Examples in just the past few weeks (and which we’ll cover in a bit) include patches from both Intel (TDX) and AMD (SEV-SNP) for their Confidential Compute solutions, as well as PCI pass-through support in Hyper-V for Confidential VMs. At the same time, thought is going into revising the kernel’s “threat model” to update it for a world of Confidential Compute.
A fundamental tenet of Confidential Compute is that guests no longer necessarily have to trust the hypervisor on which they are running, and quite possibly also don’t trust the operator of the system either (whether a cloud, edge network, OEM, etc.). The theory goes that you might even have a server sitting in some (less than friendly) geographical location but still hold out a certain amount of trust for your “confidential” workloads based on properties provided by the silicon (and attested by introspecting the other physical and emulated devices provided by the system). In this model, you necessarily have to trust the silicon vendor, but maybe not much beyond that.
Elena Reshetova (Intel) posted “Linux guest kernel threat model for Confidential Computing” in which she addressed Greg Kroah-Hartman (“Greg K-H”), who apparently previously requested “that we ought to start discussing the updated threat model for kernel”. She had links to quite detailed writeups on Intel’s github. Greg replied to a point about not trusting the hypervisor with “That is, frankly, a very funny threat model. How realistic is it really given all of the other ways that a hypervisor can mess with a guest?”. And that did indeed use to be a good point. Some of the earlier attempts at Confidential Compute included architectural designs in which guest registers were not protected against single step debug (and introspection) from a hypervisor, for example. And so one can be forgiven for thinking that there are some fundamental gaps, but a lot has changed over the past few years, and the architectures have advanced quite a bit since.
Greg also noted that he “hate[s] the term ‘hardening’” as applied to “hardening” device drivers against malicious hardware implementations (as opposed to just potentially buggy ones). He added, “Please just say it for what it really is, “fixing bugs to handle broken hardware”. We’ve done that for years when dealing with PCI and USB and even CPUs doing things that they shouldn’t be doing. How is this any different in the end? So what you also are saying here now is “we do not trust any PCI devices”, so please just say that (why do you trust USB devices?) If that is something that you all think that Linux should support, then let’s go from there.” David Alan Gilbert piled on with some context around Intel’s and AMD’s implementations, and in particular noted that more than mere memory encryption is used; register state, guest VMSA (control), etc. – all of that and much more – is carefully managed under the new world order.
Daniel Berrange further clarified, in response to a discussion about deliberately malicious implementations of PCI and USB controllers, that, “As a baseline requirement, in the context of confidential computing the guest would not trust the hypervisor with data that needs to remain confidential, but would generally still expect it to provide a faithful implementation of a given device.” A lot of further back and forth took place with others piling on comments indicating a few folks weren’t aware of the different technical pieces involved (e.g. PCI IDE, CMA, DOE, SPDM and other acronyms) for device attestation prior to trusting it from within a guest, or that this was even possible. The thread was more informative for revealing that general knowledge of technology involved in Confidential Compute is not broadly pervasive. Perhaps there is an opportunity there for sessions at the newly revived in-person conferences taking place in ‘23.
Ongoing Development
Miguel Ojeda posted a patch introducing a new “Rust fixes” branch, noting, “While it may be a bit early to have a “fixes” branch, I guessed it would not hurt to start practicing how to do things for the future when we may get actual users. And since the opportunity presented itself, I wanted to also use this PR to bring up a “policy” topic and ideally get kernel maintainers to think about it.” He went on to describe the PR as containing a fix for a “soundness issue” related to UB (Undefined Behavior) in which “safe” rust code can nonetheless trigger UB in C code. He wanted to understand whether such fixes were truly considered fixes suitable for backport to stable and was mostly interested in addressing the policy aspect of the development process. Linus took the pull request without discussion, so presumably it wasn’t a big deal for him.
Saurabh Sengar posted “Device tree support for Hyper-V VMBus driver”, which “expands the VMBus driver to include device tree support. This feature allows for a kernel boot without the use of ACPI tables, resulting in a smaller memory footprint and potentially faster boot times. This is tested by enabling CONFIG_FLAT and OF_EARLY_FLATTREE for x86.” It isn’t articulated in the patch series, but this smells like an effort to support a special case minimal kernel – like the kind used by Amazon’s “Firecracker” for fast spinup VMs used to back ephemeral services like “functions”. It will be interesting to see what happens with this.
Elliot Berman (QUIC, part of Qualcomm) posted version 9 of a patch series “Drivers for gunyah hypervisor”. Gunyah is, “a Type-1 hypervisor independent of any high-level OS kernel, and runs in a higher CPU privilege level. It does not depend on any lower-privileged OS kernel/code for its core functionality. This increases its security and can support a much smaller trusted computing base than a Type-2 hypervisor.” The Gunyah source is available on github.
Breno Leitao posted “netpoll: Remove 4s sleep during carrier detection” noting that “Modern NICs do not seem to have this bouncing problem anymore, and this sleep slows down the machine boot unnecessarily”. What he meant is that traditionally the carrier on a link might be reported as “up” while autonegotiation was still underway. As Jakub Kicinski noted, especially on servers the “BMC [is often] communicating over NC-SI via the same NIC as gets used for netconsole. BMC will keep the PHY up, hence the carrier appearing instantly.”
Robin Murphy (Arm) posted a patch series aiming to “retire” the “iommu_ops” per bus IOMMU operations and reconcile around a common kernel implementation.
SeongJae Park continues to organize periodic “Beer/Coffee/Tea” chat series virtual sessions for those interested in DAMON. The agenda and info are in a shared Google doc.
Architectures
Arm
Suzuki K Poulose posted several sets of (large) related RFC (Request For Comment) patches beginning with, “Support for Arm CCA VMs on Linux”. Arm CCA is a new feature introduced as part of the Armv9 architecture, including both the “Realm Management Extension” (RME) and associated system level IP changes required to build machines that support Confidential Compute “Realms”. In the CCA world, there are additional security states beyond the traditional Secure/Non-Secure. There is now a Realm state in which e.g. a Confidential Compute VM communicates with a new “RMM” (Realm Management Monitor) over an “RSI” (Realm Service Interface) to obtain special services on the Realm’s behalf. The RMM is separated from the traditional hypervisor and “provides standard interfaces – Realm Management Interface (RMI) – to the Normal world hypervisor to manage the VMs running in the Realm world (also called Realms in short)”. The idea is that the RMM is well known (e.g. Open Source) code that can be attested and trusted by a Realm to provide it with services on behalf of an untrusted hypervisor.
Arm includes links to an updated “FVP” (Fixed Virtual Platform) modeling an RME-enabled v9 platform, alongside patched TF-A (Trusted Firmware), an RMM (Realm Management Monitor), kernel patches, an updated kvmtool (a lightweight alternative to qemu for starting VMs), and updated kvm-unit-tests. Suzuki notes that what they are seeking is feedback on:
- KVM integration of the Arm CCA
- KVM UABI for managing the Realms, seeking to generalise the operations wherever possible with other Confidential Compute solutions.
- Linux Guest support for Realms
kvx
Yann Sionneau (Kalray) posted version 2 of a patch series, “Upstream kvx Linux port”, which adds support for yet another architecture, as used in the “Coolidge (aka MPPA3-80)” SoC. The architecture is a little-endian VLIW (Very Long Instruction Word) with 32 and 64-bit execution modes, 64 GPRs, SIMD instructions, and (but of course) a “deep learning co-processor”. The architecture appears to borrow nomenclature from elsewhere, having both an “APIC” and a “GIC” as part of its interrupt controller story. Presumably these mean something quite different. In the mail, Yann notes that this is only an RFC at this stage, “since kvx support is not yet upstreamed into gcc/binutils”. The most infamous example of a VLIW architecture is, of course, Intel’s Itanium. It is slowly being removed from the kernel in a process that began in 2019 with the shipping of the final Itanium systems and deprecation of GCC and GLIBC support for it. If things go well, perhaps this new VLIW architecture can take Itanium’s place as the only one.
RISC-V
Anup Patel (Ventana Micro) is heavily involved in various RISC-V architecture enablement, including for the new “AIA” (Advanced Interrupt Architecture) specification, replacing the de facto use of SiFive’s “PLIC” interrupt controller. The spec has now been frozen (Anup provided a link to the frozen AIA specification) and initial patches are posted by Anup enabling support for guests to see a virtualized set of CSRs (Configuration and Status Registers). AIA is designed to be fully virtualizable, although as this author has noted from reading the spec, it does require an interaction with the IOMMU to interdict messages in order to allow for device live migration.
Sunil V L (Ventana Micro) posted patches to “Add basic ACPI support for RISC-V”. The patches come alongside others for EDK2 (UEFI, aka “Tianocore”), and Qemu (to run that firmware and boot RISC-V kernels enabled with ACPI support). This is an encouraging first step toward an embrace of the kinds of technologies required for viability in the mainstream. This author recalls the uphill battle that was getting ACPI support enabled for Arm. Perhaps the community has more experience to draw upon at this point, and a greater understanding of the importance of such standards to broader ecosystems. In any case, there were no objections this time around.
x86-64
Early last year (2022), David Woodhouse (Amazon) posted the 4th version of a patch series he had been working on titled “Parallel CPU bringup for x86_64”, which aims to speed up the boot process for large SMP x86 systems. Traditionally, x86 systems would enter the Linux kernel in a single-threaded mode with a “bootcpu” being the first core that happened to start Linux (not necessarily “cpu0”). Once early initialization was complete, this CPU would use a SIPI (Startup IPI, or “Inter Processor Interrupt”) to signal to the “secondary” cores that they should start booting. The entire process could take quite some time, and it would therefore be better if these “secondary” cores could start their initialization earlier – while the first core was getting things set up – and then rendezvous, waiting for a signal to proceed.
Usama Arif (Bytedance) noted that these older patches “brought down the smpboot time from ~700ms to 100ms”. That’s a decent savings, especially when using kexec as Usama is doing (perhaps in a “Linuxboot” type of configuration with Linux as a bootloader), and at the scale of a large number of systems. Usama was interested to know whether these patches could be merged. David replied that the last time around there had been some AMD systems that broke with the patches, “We don’t *think* there are any remaining software issues; we think it’s hardware. Either an actual hardware race in CPU or chipset, or perhaps even something as simple as a voltage regulator which can’t cope with an increase in power draw from *all* the CPUs at the same time. We have prodded AMD a few times to investigate, but so far to no avail. Last time I actually spoke to Thomas [Gleixner – one of the core x86 maintainers] in person, I think he agreed that we should just merge it and disable the parallel mode for the affected AMD CPUs.”. The suggestion was to proceed to merge but to disable this feature on all AMD CPUs for the moment out of an abundance of caution.
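To make the rendezvous idea concrete, here is a small userspace analogy (plain pthreads, compile with -pthread) rather than kernel code: the “secondary” workers each do their own setup in parallel and then wait at a barrier until the “boot CPU” releases them. The actual patches work with per-CPU state machines and SIPIs, so treat this purely as an illustration.

```c
/* Userspace analogy only: "secondary CPUs" are threads that do their own
 * early setup in parallel, then rendezvous until the boot CPU releases them. */
#include <pthread.h>
#include <stdio.h>

#define NR_SECONDARIES 4
static pthread_barrier_t rendezvous;

static void *secondary(void *arg)
{
    long id = (long)arg;
    /* ... per-CPU early init that does not depend on the boot CPU ... */
    pthread_barrier_wait(&rendezvous);   /* wait for the go signal */
    printf("secondary %ld online\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NR_SECONDARIES];
    pthread_barrier_init(&rendezvous, NULL, NR_SECONDARIES + 1);

    for (long i = 0; i < NR_SECONDARIES; i++)
        pthread_create(&threads[i], NULL, secondary, (void *)i);

    /* ... boot CPU finishes global init while the secondaries prepare ... */
    pthread_barrier_wait(&rendezvous);   /* release everyone at once */

    for (int i = 0; i < NR_SECONDARIES; i++)
        pthread_join(threads[i], NULL);

    pthread_barrier_destroy(&rendezvous);
    return 0;
}
```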
Nikunj A Dadhania (AMD) posted patches enabling support for a “Secure TSC” for SNP (Secure Nested Paging) guests. SNP is part of AMD’s Confidential Compute strategy, and securing the TSC (Time Stamp Counter) is a necessary part of enabling confidential guests to not have to trust the host hypervisor. Prior to these patches, a hypervisor could interdict the TSC, providing the guest with a different view of the passage of CPU time than reality. With the patches, “Secure TSC allows guest to securely use RDTSC/RDTSCP instructions as the parameters being used cannot be changed by hypervisor once the guest is launched. More details in the AMD64 APM Vol 2, Section ‘Secure TSC’.” According to Nikunj, “During the boot-up of the secondary cpus, SecureTSC enabled guests need to query TSC info from Security processor (PSP). This communication channel is encrypted between the security processor and the guest, hypervisor is just the conduit to deliver the guest messages to the security processor. Each message is protected with an AEAD (AES-256 GCM).”
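For context, the TSC in question is the counter a guest reads with the RDTSC/RDTSCP instructions. A minimal userspace read (x86-64, GCC/Clang inline asm) looks like the sketch below; the point of Secure TSC is that the scaling and offset behind this value can no longer be tampered with by the hypervisor once the guest has launched.

```c
/* A raw TSC read from userspace (x86-64, GCC/Clang inline asm). */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t a = rdtsc();
    uint64_t b = rdtsc();
    printf("TSC advanced by %llu cycles between two reads\n",
           (unsigned long long)(b - a));
    return 0;
}
```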
Rick Edgecombe (Intel) posted an updated patch series titled “Shadow stacks for userspace” that “implements Shadow Stacks for userspace using x86’s Control-flow Enforcement Technology (CET). CET consists of two related security features: shadow stacks and indirect branch tracking. This series implements just the shadow stack part of this feature, and just for userspace.” As Rick notes, “The main use case for shadow stack is providing protection against return oriented programming attacks”. ROP attacks aim to string together pre-existing “gadgets” (existing pieces of code, not necessarily well-defined functions in themselves) by finding a vulnerability that can cause a function to jump (return) into a gadget sequence. Shadow stacks mitigate this by adding an additional, separate hardware-maintained stack that tracks all function entry/exit sequences and ensures returns only come from real function calls (or from special-cased longjmp-like sequences that usually require special handling).
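As a purely illustrative software model of what the hardware does (this is not the CET mechanism itself, which is maintained by the CPU in specially typed shadow-stack memory), imagine keeping a second copy of every return address and comparing the two copies on return:

```c
/* Illustrative software model only: real shadow stacks are maintained by
 * the CPU, not by the program itself. */
#include <stdio.h>
#include <stdlib.h>

#define SHSTK_DEPTH 64
static void *shadow_stack[SHSTK_DEPTH];
static int shstk_top;

static void shstk_push(void *ret)
{
    shadow_stack[shstk_top++] = ret;     /* CALL: push a copy of the return address */
}

static void shstk_check(void *ret)
{
    /* RET: compare the normal-stack return address with the shadow copy
     * and raise a control-protection fault on mismatch. */
    if (shadow_stack[--shstk_top] != ret) {
        fprintf(stderr, "control-protection fault: return address tampered\n");
        abort();
    }
}

static void callee(void)
{
    /* A ROP attack would try to overwrite the on-stack return address here. */
}

static void caller(void)
{
    void *ret = __builtin_return_address(0);  /* stand-in for the pushed address */
    shstk_push(ret);
    callee();
    shstk_check(ret);
}

int main(void)
{
    caller();
    puts("return verified against the shadow stack");
    return 0;
}
```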
Jarkko Sakkinen posted some fixes for AMD’s SEV-SNP hypervisor support, addressing issues discovered by the Enarx developers. I’m mentioning it because this patch series may have been the final one to go out from the startup “Profian”, which had been seeking to commercialize support for Enarx. Profian closed its doors in the past few weeks due to the macro-economic environment. Some great developers are now on the market and looking for new opportunities. If you are hiring, or know folks who are, you can see posts from the Profian engineers on LinkedIn.
Final words
The Open Source Summit North America returns in person (and virtual) this year, from May 10th in Vancouver, British Columbia, Canada. There are several other events planned to be colocated alongside the Open Source Summit. These include the (invite only) Linux Storage, Filesystem, Memory Management, and BPF (LSF/MM/BPF) Summit for which a CFP is open. Another colocated event is the Linux Security Summit North America, the CfP of which was announced by James Morris with a link for submitting proposals.
Cyril Hrubis (SUSE) posted an announcement that the Linux Test Project (LTP) release for January 2023 was out. It includes a number of new tests, among them “dirtyc0w_shmem aka CVE-2022-2590”. They have also raised the minimum C requirement to -std=gnu99. Linux itself moved to a baseline of C11 (from the much older C89/gnu89 standard) as of Linux 5.18.
This is the pilot episode for what will become season 2 of the Linux Kernel Podcast. Back in 2008-2009 I recorded a daily “kernel podcast” that summarized the happenings of the Linux Kernel Mailing List (LKML). Eventually, daily became a little too much, and the podcast went weekly, followed by…not. This time around, I’m not committing to any specific cadence – let’s call it “periodic” (every few weeks). In each episode, I will aim to broadly summarize the latest happenings in the “plumbing” of the Linux kernel, and occasionally related bits of userspace “plumbing” (glibc, systemd, etc.), as well as impactful toolchain changes that enable new features or rebaseline requirements. I welcome your feedback. Please let me know what you think about the format, as well as what you would like to see covered in future episodes. I’m going to play with some ideas over time. These may include “deep diving” into topics of interest to a broader audience. Keep in mind that this podcast is not intended to editorialize, but only to report on what is happening. Both this author, and others, have their own personal opinions, but this podcast aims to focus only on the facts, regardless of who is involved, or their motives.
On with the show.
For the week ending January 21st 2023, I’m Jon Masters and this is the Linux Kernel Podcast.
Summary
The latest stable kernel is Linux 6.1.7, released by Greg K-H on January 18th 2023.
The latest mainline (development) kernel is 6.2-rc4, released on January 15th 2023.
Long Term Stable 6.1?
The “stable” kernel series is maintained by Greg K-H (Kroah-Hartman), who posts hundreds of patches with fixes to each Linus kernel. This is where the “.7” comes in on top of Linux 6.1. Such stable patches are maintained between kernel releases, so when 6.2 is released, it will become the next “stable” kernel. Once every year or so, Greg will choose a kernel to be the next “Long Term Stable” (LTS) kernel that will receive even more patches, potentially for many years at a time. Back in October, Kaiwan N Billimoria (author of a book titled “Linux Kernel Programming”), seeking a baseline for the next edition, asked if 6.1 would become the next LTS kernel. A great amount of discussion has followed, with Greg responding to a recent ping by saying, “You tell me please. How has your testing gone for 6.1 so far? Does it work properly for you? Are you and/or your company willing to test out the -rc releases and provide feedback if it works or not for your systems?” and so on. This motivated various others to pile on with comments about their level of testing, though I haven’t seen an official 6.1 LTS as of yet.
Linux 6.2 progress
Linus noted in his 6.2-rc4 announcement mail that this came “with pretty much everybody back from winter holidays, and so things should be back to normal. And you can see that in the size, this is pretty much bang in the middle of a regular rc size for this time in the merge window.” The “merge window” is the period of time during which disruptive changes are allowed to be merged (typically the first two weeks of a kernel cycle, prior to the first “RC”), so Linus presumably means “cycle” rather than “merge window” in his announcement.
Speaking of Linux 6.2, it counts among its new features additional support for Rust. Linux 6.1 had added initial Rust patches capable of supporting a “hello world” kernel module (but not much more). 6.2 adds support for accessing certain kernel data structures (such as “task_struct”, the per-task/process structure) and handles converting C-style structure “objects” containing collections of (possibly null) pointers into the “memory safe” structures understood by Rust. As usual, Linux Weekly News (LWN) has a great article going into much more detail.
Ongoing Development
Richard Guy Briggs posted the 6th version of a patch series titled “fanotify: Allow user space to pass back additional audit info”, which “defines a new flag (FAN_INFO) and new extensions that define additional information which are appended after the response structure returned from user space on a permission event”. This allows audit logging to much more usefully capture why a policy allowed (or disallowed) certain access. The idea is to “enable the creation of tools that can suggest changes to the policy similar to how audit2allow can help refine labeled security”.
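For those unfamiliar with fanotify permission events, the existing flow has a userspace daemon read an event and write back a response containing FAN_ALLOW or FAN_DENY; the new FAN_INFO flag lets that write also carry extra information (such as an audit rule number) that can then land in the audit log. Here is a minimal sketch of the pre-existing allow/deny loop (requires CAP_SYS_ADMIN; the “/tmp” path and allow-everything policy are made up for illustration):

```c
/* Minimal sketch of the existing fanotify permission-event loop. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/fanotify.h>
#include <unistd.h>

int main(void)
{
    int fan = fanotify_init(FAN_CLASS_CONTENT, O_RDONLY);
    if (fan < 0) { perror("fanotify_init"); return 1; }

    /* Ask to arbitrate every open of files under /tmp. */
    if (fanotify_mark(fan, FAN_MARK_ADD, FAN_OPEN_PERM, AT_FDCWD, "/tmp") < 0) {
        perror("fanotify_mark"); return 1;
    }

    for (;;) {
        char buf[4096] __attribute__((aligned(8)));
        ssize_t len = read(fan, buf, sizeof(buf));
        if (len <= 0) break;

        struct fanotify_event_metadata *md = (struct fanotify_event_metadata *)buf;
        while (FAN_EVENT_OK(md, len)) {
            if (md->mask & FAN_OPEN_PERM) {
                /* Today the response is just { fd, FAN_ALLOW/FAN_DENY };
                 * the FAN_INFO extension appends extra info (e.g. an
                 * audit rule number) after this structure. */
                struct fanotify_response resp = {
                    .fd = md->fd,
                    .response = FAN_ALLOW,
                };
                write(fan, &resp, sizeof(resp));
            }
            if (md->fd >= 0)
                close(md->fd);
            md = FAN_EVENT_NEXT(md, len);
        }
    }
    return 0;
}
```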
Maximilian Luz posted a patch series titled “firmware: Add support for Qualcomm UEFI Secure Application” that allows regular access to EFI variables via proxy calls to the “UEFI Secure Application” (uefisecapp) running in Qualcomm’s “secure world” implementation of Arm TrustZone. He has tested this on a variety of tablets, including a Surface Pro X. The application interface was reverse-engineered from the Windows QcTrEE8180.sys driver.
Kees Cook requested a stable kernel backport of support for “oops_limit”, a new kernel feature that seeks to limit the number of “oopses” allowed before a kernel will “panic”. An “oops” is what happens when, for example, the kernel attempts to dereference a null pointer. Normal application software will crash (with a “segmentation fault”) when this happens. Inside the kernel, the access is caught (provided it happened while in process context), and the associated (but perhaps unrelated) userspace task (process) is killed in the process of generating an “oops” with a backtrace. The kernel may at that moment leak critical resources associated with the process, such as file handles, memory areas, or locks, since these aren’t cleaned up. Consequently, it is possible that repeated oopses can be generated by an attacker and used for privilege escalation. The “oops_limit” patches mitigate this by limiting the number of such oopses allowed before the kernel will give up and “panic” (properly crash, and possibly reboot, depending on config).
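On kernels carrying these patches, the limit is exposed as a sysctl. A tiny sketch that reads it back, assuming the kernel.oops_limit knob is present at the usual procfs path on your kernel:

```c
/* Read the current oops limit back from procfs (path assumed). */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/kernel/oops_limit", "r");
    if (!f) {
        perror("kernel.oops_limit not available on this kernel");
        return 1;
    }

    unsigned long limit;
    if (fscanf(f, "%lu", &limit) == 1)
        printf("kernel will panic after %lu oopses\n", limit);

    fclose(f);
    return 0;
}
```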
Vegard Nossum posted version 3 of a patch series titled “kmod: harden user namespaces with new kernel.ns_modules_allowed sysctl”, which seeks to “reduce the attack surface and block exploits by ensuring that user namespaces cannot trigger module (auto-)loading”.
Arseniy Lesin reposted an RFC (Request For Comments) of a “SIGOOM Proposal” that would enable the kernel to send a signal whenever a task (process) was in danger of being killed by the “OOM” (Out Of Memory) killer due to consuming too much anonymous (regular) memory. Willy Tarreau and Ted Ts’o noted that we are essentially out of space for new signals, so rather than declaring a new “SIGOOM”, it would be better to allow a process to select which of the existing signals should be used when it registers to receive such notifications. Arseniy said they would follow up with patches taking this approach.
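No interface has been merged, but the suggested direction would look roughly like a process installing a handler for an existing signal of its choosing and then registering that signal for memory-pressure notification. In the sketch below only the (standard POSIX) handler part is real; the registration step is invented to show the shape of the idea, and SIGUSR1 is simply a stand-in:

```c
/* Hypothetical sketch: the handler is standard POSIX, but the
 * register_oom_signal() step is invented here; no such API exists. */
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t pressure;

static void on_pressure(int sig)
{
    (void)sig;
    pressure = 1;          /* e.g. drop caches, shrink pools */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_pressure;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);    /* SIGUSR1 is just a stand-in */

    /* register_oom_signal(SIGUSR1);     <-- hypothetical, not a real API */

    for (;;) {
        pause();
        if (pressure) {
            const char msg[] = "memory pressure notification received\n";
            write(STDOUT_FILENO, msg, sizeof(msg) - 1);
            pressure = 0;
        }
    }
}
```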
Architectures
On the architecture front, Mark Brown posted the 4th version of a patch series enabling support for Arm’s SME (Scalable Matrix Extension) version 2 and 2.1. Huang Ying posted patches enabling “migrate_pages()” (which moves memory between NUMA nodes – memory chips specific to e.g. a certain socket in a server) to support batching of the new(er) memory “folios”, rather than doing them one at a time. Batching allows associated TLB invalidation (tearing down the MMU’s understanding of active virtual to physical addresses) to be batched, which is important on Intel systems using IPIs (Inter-Processor-Interrupts), which are reduced by 99.1% during the associated testing, increasing pages migrated per second on a 2P server by 291.7%.
Xin Li posted version 6 of a patch series titled “x86: Enable LKGS instruction”. The “LKGS instruction is introduced with Intel FRED (flexible return and event delivery) specification. As LKGS is independent of FRED, we enable it as a standalone feature”. LKGS (which is an abbreviation of “load into IA32_KERNEL_GS_BASE”) “behaves like the MOV to GS instruction except that it loads the base address into the IA32_KERNEL_GS_BASE MSR instead of the GS segment’s descriptor cache.” This means that an Operating System can perform the necessary work to context switch a user-level thread by updating IA32_KERNEL_GS_BASE and avoiding an explicit set of balanced calls to SWAPGS. This is part of the broader “FRED” architecture defined by Intel in the Flexible Return and Event Delivery (FRED) Specification.
David E. Box posted version 2 of a patch series titled “Extend Intel On Demand (SDSi) support”, noting that “Intel Software Defined Silicon (SDSi) is now known as Intel On Demand”. These patches enable support for the Intel feature intended to allow users to load signed payloads into their CPUs to turn on certain features after purchasing a system. This might include (for example) certain accelerators present in future chips that could be enabled as needed, similar to how certain automobiles now include subscription-locked heated seats and other features.
Meanwhile, Anup Patel posted patches titled “RISC-V KVM virtualize AIA CSRs” that enable support for the new AIA (Advanced Interrupt Architecture), which replaces the legacy “PLIC”, and Sia Jee Heng posted patches that enable “RISC-V Hibernation Support”.
Final words
A number of conferences are returning in 2023, including the Linux Storage, Filesystem, Memory Management, and BPF (LSF/MM/BPF) Summit, which will be held from May 8 to May 10 at the Vancouver Convention Center. Josef Bacik noted that the CFP was now open.
Don’t forget to give me your feedback on this pilot episode! jcm@jonmasters.org.