Paul E. McKenney's Journal
https://paulmck.livejournal.com/
Parallel Programming: Cooperation
paulmck, Thu, 10 Oct 2024 17:16:06 GMT
https://paulmck.livejournal.com/70231.html
<p>First, let me paraphrase something from my LiveJournal profile: These posts are my own, and in particular do not necessarily reflect my employer's positions, strategies, or opinions.</p>
<p>With that said, some say that the current geopolitical outlook is grim. And far be it from me to minimize the present-day geopolitical problems, nor am I at all interested in comparing them to their counterparts in the "good old days". But neither do I wish to obsess on these problems. I will instead call attention to a few instances of global cooperation, current and past.</p>
<p>Last month, NASA's oldest active astronaut traveled to Kazakhstan's Baikonur Cosmodrome, entered a Soyuz capsule atop a Roscosmos rocket and flew to the International Space Station. For me, this is especially inspiring: If he can do that at age 69, I should certainly be able to continue doing my much less demanding job for many years to come.</p>
<p>Some decades ago, during the Cold War, I purchased an English translation of Gradshteyn's and Ryzhik's classic "Table of Integrals, Series, and Products". Although computer-algebra systems have largely replaced this book, I used it heavily in the 1980s and early 1990s, and I have consulted it within the past few years. Thus, along with many others, I am indebted to the longstanding Russian tradition of excellence in mathematics.</p>
<p>So just this past month, I was happy to receive hard copies of "<a href="https://bhv.ru/product/parallelnoe-programmirovanie-tak-li-eto-slozhno/" target="_blank" rel="nofollow">Параллельное программирование – так ли это сложно?</a>", which is a Russian translation of "<a href="https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html" target="_blank" rel="nofollow">Is Parallel Programming Hard, And, If So, What Can You Do About It?</a>" I would like to think that this might be a down payment on my aforementioned debt.</p>
<p>Many other countries have also made many excellent contributions to mathematics, science, and technology. For example, the smartphone that I use hails from South Korea. And earlier this year, SeongJae (SJ) Park completed a <a href="https://github.com/sjp38/perfbook-ko_KR" target="_blank" rel="nofollow">Korean translation</a> of the Second Edition of "Is Parallel Programming Hard, And, If So, What Can You Do About It?"</p>
<p>Returning to rocketry, China started working with rockets in the 1200s, if not earlier, and has made a great deal of more recent progress in a wide variety of fields. And rumor has it that a Chinese translation of the Second Edition will be appearing shortly.</p>
<p>So if you tried reading this book, but the English got in the way, you now have two other options and hopefully soon a third! But what if you want a fourth option? Then you, too, can do a translation! Just send me a translated chapter and I will add it to the list in the book's FAQ.txt file.</p>
Stupid RCU Tricks: So You Want to Torture RCU With a Non-Trivial Userspace?
paulmck, Fri, 02 Feb 2024 13:35:05 GMT
https://paulmck.livejournal.com/70127.html
<p>In order to save mass-storage space and to reduce boot times, rcutorture runs out of a tiny initrd filesystem that consists only of a root directory and a statically linked init program based on <a href="https://lwn.net/Articles/920158/" target="_blank" rel="nofollow">nolibc</a>. This init program binds itself to a randomly chosen CPU, spins for a short time, sleeps for a short time, and repeats, the point being to inject at least a little userspace execution for the benefit of nohz_full CPUs.</p>
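<p>For concreteness, here is a rough sketch of such an init loop. This is not the actual rcutorture init program, which is built against nolibc and differs in its details, but rather an illustration of the bind/spin/sleep cycle using ordinary POSIX calls:</p>
<p>#define _GNU_SOURCE<br>
#include <sched.h><br>
#include <stdlib.h><br>
#include <time.h><br>
#include <unistd.h><br>
<br>
int main(void)<br>
{<br>
    cpu_set_t mask;<br>
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);<br>
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };<br>
<br>
    /* Bind to a randomly chosen CPU. */<br>
    CPU_ZERO(&mask);<br>
    CPU_SET(rand() % ncpus, &mask);<br>
    sched_setaffinity(0, sizeof(mask), &mask);<br>
    for (;;) {<br>
        /* Spin for a short time to inject userspace execution... */<br>
        for (volatile long i = 0; i < 10 * 1000 * 1000; i++)<br>
            continue;<br>
        /* ...then sleep for a short time, and repeat. */<br>
        nanosleep(&ts, NULL);<br>
    }<br>
}</p>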
<p>This works very well most of the time. But what if you want to use a full userspace when torturing RCU, perhaps because you want to test runtime changes to RCU's many sysfs parameters, run heavier userspace loads on nohz_full CPUs, or even to run BPF programs?</p>
<p>What you do is go back to the old way of building rcutorture's initrd.</p>
<p>Which raises the question as to what the new way might be.</p>
<p>What rcutorture does is to look at the tools/testing/selftests/rcutorture/initrd directory. If this directory does not already contain a file named init, the tools/testing/selftests/rcutorture/bin/mkinitrd.sh script builds the aforementioned statically linked init program.</p>
<p>Which means that you can put whatever initrd file tree you wish into that initrd directory, and as long as it contains a /init program, rcutorture will happily bundle that file tree into an initrd for each of the resulting rcutorture kernel images.</p>
<p>And back in the old days, that is exactly what I did. I grabbed any convenient initrd and expanded it into my tools/testing/selftests/rcutorture/initrd directory. This still works, so you can do this too!</p>
Parallel Programming: June 2023 Update
paulmck, Mon, 12 Jun 2023 16:40:46 GMT
https://paulmck.livejournal.com/69651.html
<p>The v2023.06.11a release of <a href="https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html" target="_blank" rel="nofollow">Is Parallel Programming Hard, And, If So, What Can You Do About It?</a> is now available! The double-column version is also available from <a href="https://arxiv.org/abs/1701.00854" target="_blank" rel="nofollow">arXiv.org</a>.</p>
<p>This release contains a new section on thermal throttling (along with a new cartoon), improvements to the memory-ordering chapter (including intuitive subsets of the Linux-kernel memory model), fixes to the deferred-processing chapter, additional clocksource-deviation material in the "What Time Is It?" section, and numerous fixes inspired by questions and comments from readers. Discussions with Yariv Aridor were especially fruitful. Akira Yokosawa contributed some quick quizzes and other upgrades of the technical discussions, along with a great many improvements to grammar, glossaries, epigraphs, and the build system. Leonardo Bras also provided some much-appreciated build-system improvements, and also started up continuous integration for some of the code samples.</p>
<p>Elad Lahav, Alan Huang, Zhouyi Zhou, and especially SeongJae Park contributed numerous excellent fixes for grammatical and typographical errors. SeongJae's fixes were from his Korean translation of this book.</p>
<p>Elad Lahav, Alan Huang, and Patrick Pan carried out some much-needed review of the code samples and contributed greatly appreciated fixes and improvements. In some cases, they dragged the code kicking and screaming into the 2020s. :-)</p>
What Does It Mean To Be An RCU Implementation?
paulmck, Thu, 26 Jan 2023 01:26:10 GMT
https://paulmck.livejournal.com/69622.html
<h1>Under Construction</h1>
<p>A correspondent closed out 2022 by sending me an off-list email asking whether or not a pair of Rust crates (<a href="https://docs.rs/rcu-clean/latest/rcu_clean/" target="_blank" rel="nofollow">rcu_clean</a> and <a href="https://docs.rs/left-right/latest/left_right/" target="_blank" rel="nofollow">left_right</a>) were really implementations of read-copy update (<a href="https://docs.kernel.org/RCU/Design/Requirements/Requirements.html" target="_blank" rel="nofollow">RCU</a>), with an LWN <a href="https://lwn.net/Articles/921412/" target="_blank" rel="nofollow">commenter</a> throwing in <a href="https://crates.io/crates/crossbeam" target="_blank" rel="nofollow">crossbeam</a>'s <a href="https://docs.rs/crossbeam/0.8.2/crossbeam/epoch/index.html" target="_blank" rel="nofollow">epoch crate</a> for good measure. At first glance, this is a pair of simple yes/no questions that one should be able to answer off the cuff.</p>
<h1>What Is An RCU?</h1>
<p>Except that there is quite a variety of RCU implementations in the wild. Even if we remain within the cozy confines of the Linux kernel, we have: (1) The original "vanilla" RCU, (2) Sleepable RCU (<a href="https://lwn.net/Articles/202847/" target="_blank" rel="nofollow">SRCU</a>), (3) Tasks RCU, (4) Tasks Rude RCU, and (5) Tasks Trace RCU. These differ in more than just performance characteristics: it is not in general possible to mechanically convert (say) SRCU to RCU. The key attributes of RCU implementations are the marking of read-side code regions and data accesses on the one hand and some means of waiting on all pre-existing readers on the other. For more detail, see the <a href="https://lwn.net/Articles/777036/" target="_blank" rel="nofollow">2019 LWN article</a> and for more background, see the Linux Foundation RCU presentations <a href="https://linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries/" target="_blank" rel="nofollow">here</a> and <a href="https://linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries-additional-use-cases/" target="_blank" rel="nofollow">here</a>.</p>
<p>The next sections provide an overview of the Linux-kernel RCU implementations' functional properties, with performance and scalability characteristics left as an exercise for the interested reader.</p>
<h2>Vanilla RCU</h2>
<p>Vanilla RCU has quite a variety of bells and whistles (see the sketch following this list):</p>
<ul>
<li>Explicit nesting read-side markers, rcu_read_lock(), rcu_read_unlock(), rcu_dereference(), and friends.</li>
<li>Pointer-update function, rcu_assign_pointer().</li>
<li>Synchronous grace-period-wait primitives, synchronize_rcu() and synchronize_rcu_expedited().</li>
<li>An asynchronous grace-period wait primitive, call_rcu(). And additionally a synchronous callback wait primitive, rcu_barrier().</li>
<li><a href="https://paulmck.livejournal.com/65800.html" target="_blank">Polled grace-period wait primitives</a>.</li>
</ul>
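<p>As a concrete illustration, here is a minimal sketch of the classic usage pattern. The APIs shown are the real Linux-kernel ones listed above, but struct foo, my_global_ptr, and the update-side locking are hypothetical stand-ins:</p>
<p>#include <linux/rcupdate.h><br>
#include <linux/slab.h><br>
<br>
struct foo {<br>
    int a;<br>
    struct rcu_head rh;<br>
};<br>
<br>
struct foo __rcu *my_global_ptr; /* Hypothetical RCU-protected pointer. */<br>
<br>
/* Reader: marked critical section plus marked pointer access. */<br>
int reader(void)<br>
{<br>
    int ret;<br>
<br>
    rcu_read_lock();<br>
    ret = rcu_dereference(my_global_ptr)->a; /* Assumes non-NULL pointer. */<br>
    rcu_read_unlock();<br>
    return ret;<br>
}<br>
<br>
/* Updater: publish a new version, then wait for pre-existing readers. */<br>
void updater(struct foo *newp)<br>
{<br>
    struct foo *oldp;<br>
<br>
    oldp = rcu_dereference_protected(my_global_ptr, 1); /* Update-side lock held. */<br>
    rcu_assign_pointer(my_global_ptr, newp);<br>
    synchronize_rcu(); /* Or use call_rcu(&oldp->rh, ...) for the asynchronous version. */<br>
    kfree(oldp);<br>
}</p>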
<h2>Sleepable RCU (SRCU)</h2>
<p>SRCU has a similar variety of bells and whistles, but with some important differences (again, see the sketch following the list). The most important difference is that SRCU supports multiple domains, each represented by an srcu_struct structure. A reader in one domain does not block a grace period in another domain. In contrast, RCU is global in nature, with exactly one domain. On the other hand, the price SRCU pays for this flexibility is reduced amortization of grace-period overhead.</p>
<ul>
<li>Explicit read-side markers, srcu_read_lock(), srcu_read_unlock(), srcu_dereference(), and friends. Except that, unlike rcu_read_lock() and rcu_read_unlock(), srcu_read_lock() and srcu_read_unlock() do not nest. Instead, the return value from srcu_read_lock() must be passed to the corresponding srcu_read_unlock(). This means that SRCU (but not RCU!) can represent non-nested partially overlapping read-side critical sections. Not that this capability was considered a good thing in its own right; rather, it was a way of avoiding the need for T*S storage, where T is the number of tasks and S is the number of srcu_struct structures.</li>
<li>Synchronous grace-period-wait primitives, synchronize_srcu() and synchronize_srcu_expedited().</li>
<li>An asynchronous grace-period wait primitive, call_srcu(). And additionally a synchronous callback wait primitive, srcu_barrier().</li>
<li>Polled grace-period wait primitives, although less variety than RCU enjoys. (Does this enjoyment extend to RCU's users? You decide.)</li>
</ul>
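<p>Here is a correspondingly hedged sketch showing the index returned from srcu_read_lock() flowing to srcu_read_unlock(). The my_srcu domain and struct foo (as in the previous sketch) are hypothetical:</p>
<p>#include <linux/srcu.h><br>
<br>
DEFINE_SRCU(my_srcu); /* One SRCU domain among possibly many. */<br>
<br>
struct foo { int a; };<br>
struct foo __rcu *my_global_ptr;<br>
<br>
int reader(void)<br>
{<br>
    int idx, ret;<br>
<br>
    idx = srcu_read_lock(&my_srcu); /* Index must reach srcu_read_unlock(). */<br>
    ret = srcu_dereference(my_global_ptr, &my_srcu)->a;<br>
    srcu_read_unlock(&my_srcu, idx);<br>
    return ret;<br>
}<br>
<br>
void updater(void)<br>
{<br>
    /* Waits only for readers in the my_srcu domain. */<br>
    synchronize_srcu(&my_srcu);<br>
}</p>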
<h2>Tasks RCU</h2>
<p>Tasks RCU was designed specifically to handle the trampolines used in Linux-kernel tracing (see the sketch following this list).</p>
<ul>
<li>It has no explicit read-side markers. Instead, voluntary context switches separate successive Tasks RCU read-side critical sections.</li>
<li>A synchronous grace-period-wait primitive, synchronize_rcu_tasks().</li>
<li>An asynchronous grace-period wait primitive, call_rcu_tasks(). And additionally a synchronous callback wait primitive, rcu_barrier_tasks().</li>
<li>No polled grace-period wait primitives.</li>
</ul>
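<p>Here is a hedged sketch of the trampoline use case. Everything other than the Tasks RCU API member call_rcu_tasks() is hypothetical:</p>
<p>#include <linux/kernel.h><br>
#include <linux/rcupdate.h><br>
#include <linux/slab.h><br>
<br>
struct my_tramp { /* Hypothetical trampoline descriptor. */<br>
    void *insns; /* Instructions that tasks might be executing. */<br>
    struct rcu_head rh;<br>
};<br>
<br>
extern void my_tramp_unregister(struct my_tramp *tp); /* Hypothetical. */<br>
<br>
static void my_tramp_free_cb(struct rcu_head *rhp)<br>
{<br>
    struct my_tramp *tp = container_of(rhp, struct my_tramp, rh);<br>
<br>
    kfree(tp); /* Safe: all tasks have since voluntarily context-switched. */<br>
}<br>
<br>
void my_tramp_remove(struct my_tramp *tp)<br>
{<br>
    my_tramp_unregister(tp); /* Prevent new entries into the trampoline. */<br>
    call_rcu_tasks(&tp->rh, my_tramp_free_cb); /* Asynchronous grace-period wait. */<br>
}</p>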
<h2>Tasks Rude RCU</h2>
<p>By design, Tasks RCU does not wait for idle tasks. Something about them never doing any voluntary context switches on CPUs that remain idle for long periods of time. So trampolines that might be involved in tracing of code within the idle loop need something else, and that something is Tasks Rude RCU.</p>
<ul>
<li>It has no explicit read-side markers. Instead, any preemption-disabled region of code is a Tasks Rude RCU reader.</li>
<li>A synchronous grace-period-wait primitive, synchronize_rcu_tasks_rude().</li>
<li>An asynchronous grace-period wait primitive, call_rcu_tasks_rude(). And additionally a synchronous callback wait primitive, rcu_barrier_tasks_rude().</li>
<li>No polled grace-period wait primitives.</li>
</ul>
<h2>Tasks Trace RCU</h2>
<p>Both Tasks RCU and Tasks Rude RCU disallow sleeping while executing in a given trampoline. Some BPF programs need to sleep, hence Tasks Trace RCU.</p>
<ul>
<li>Explicit nesting read-side markers, rcu_read_lock_trace() and rcu_read_unlock_trace().</li>
<li>A synchronous grace-period-wait primitive, synchronize_rcu_tasks_trace().</li>
<li>An asynchronous grace-period wait primitive, call_rcu_tasks_trace(). And additionally a synchronous callback wait primitive, rcu_barrier_tasks_trace().</li>
<li>No polled grace-period wait primitives.</li>
</ul>
<h2>DYNIX/ptx rclock</h2>
<p>The various Linux examples are taken from a code base in which RCU has been under active development for more than 20 years, which might yield an overly stringent set of criteria. In contrast, the 1990s DYNIX/ptx implementation of RCU (called "rclock" for "read-copy lock") was only under active development for about five years. The implementation was correspondingly minimal, as can be seen from this February 2001 <a href="https://lse.sourceforge.net/locking/rcu/patches/rclock-2.4.1-01.patch" target="_blank" rel="nofollow">patch</a> (hat tip to Greg Lehey):</p>
<ul>
<li>Explicit nesting read-side markers, RC_RDPROTECT() and RC_RDUNPROTECT(). The lack of anything resembling rcu_dereference() shows just how small DYNIX/ptx's installed base was.</li>
<li>Pointer-update barrier, RC_MEMSYNC(). This is the counterpart of smp_wmb() in early Linux-kernel RCU use cases.</li>
<li>No synchronous grace-period-wait primitive.</li>
<li>An asynchronous grace-period wait primitive, rc_callback(). However, there was no synchronous callback wait primitive, perhaps because DYNIX/ptx did not have modules, let alone unloadable ones.</li>
<li>No polled grace-period wait primitives.</li>
</ul>
<p>Perhaps this can form the basis of an RCU classification system, though some translation will no doubt be required to bridge from C to Rust. There is ownership, if nothing else!</p>
<h1>RCU Classification and Rust RCU Crates</h1>
<p>Except that the first RCU crate, rcu_clean, throws a monkey wrench into the works. It does not have any grace-period primitives, but instead a clean() function that takes a reference to an RCU-protected data item. The user invokes this at some point in the code where it is known that there are no readers, either within this thread or anywhere else. In true Rust fashion, in some cases, the compiler is able to prove the presence or absence of readers and issue a diagnostic when needed. The <a href="https://docs.rs/rcu-clean/latest/rcu_clean/" target="_blank" rel="nofollow">documentation</a> notes that the addition of grace periods (also known as "epochs") would allow greater accuracy.</p>
<p>This sort of thing is not unprecedented. The <a href="https://liburcu.org/" target="_blank" rel="nofollow">userspace RCU library</a> has long had an rcu_quiescent_state() function that can be invoked from a given thread when that particular thread is in a quiescent state, and thus cannot have references to any RCU-protected object. However, rcu_clean takes this a step further by having no RCU grace-period mechanism at all.</p>
<p>Nevertheless, rcu_clean could be used to implement the <a href="https://www.linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries-additional-use-cases?hsLang=en" target="_blank" rel="nofollow">add-only list RCU use case</a>, so it is difficult to argue that it is not an RCU implementation. But it is clearly a very primitive implementation. That said, primitive implementations do have their place, for example:</p>
<ul>
<li>Languages with garbage collectors have built-in RCU updaters.</li>
<li>Programs with short runtimes can just leak memory, cleaning up when restarted.</li>
<li>Other synchronization primitives can be used to protect and exclude readers.</li>
</ul>
<p>In addition, an RCU implementation even more primitive than rcu_clean would omit the clean() function, instead leaking memory that had been removed from an RCU-protected structure.</p>
<p>The left_right crate definitely uses RCU in the guise of epochs, and it can be used for at least some of the things that RCU can be used for. It does have a single-writer restriction, though as the documentation says, you could use a Mutex to serialize at least some multi-writer use cases. In addition, it has long been known that RCU use cases involving only a single writer thread permit wait-free updaters as well as wait-free readers.</p>
<p>One might argue that the fact that the left_right crate uses RCU means that it cannot possibly be itself an implementation of RCU. Except that in the Linux kernel, RCU Tasks uses vanilla RCU, RCU Tasks Trace uses SRCU, and previous versions of SRCU used vanilla RCU. So let's give the left_right crate the benefit of the doubt, at least for the time being, but with the understanding that it might eventually instead be classified as an RCU use case rather than an RCU implementation.</p>
<p>The crossbeam epoch crate again presents RCU in the guise of epochs. It has explicit read-side markers in RAII guard form using the <a href="https://docs.rs/crossbeam/0.8.2/crossbeam/epoch/fn.pin.html" target="_blank" rel="nofollow">pin</a> function and its <a href="https://docs.rs/crossbeam/0.8.2/crossbeam/epoch/struct.Atomic.html" target="_blank" rel="nofollow">Atomic pointers</a>. Grace periods are computed automatically, and the <a href="https://docs.rs/crossbeam/0.8.2/crossbeam/epoch/struct.Guard.html#method.defer" target="_blank" rel="nofollow">defer</a> method provides an asynchronous grace-period-wait function. As with DYNIX/ptx, the crossbeam epoch crate lacks any other means of waiting for grace periods, and it also lacks a callback-wait API. However, to its credit, and unlike DYNIX/ptx, this crate does provide safe means for handling pointers to RCU-protected data.</p>
<p>Here is a prototype classification system, again, leaving performance and scalability aside:</p>
<ol>
<li>Are there explicit RCU read-side markers? Of the Linux-kernel RCU implementations, RCU Tasks and RCU Tasks Rude lack such markers. Given the Rust borrow checker, it is hard to imagine an implementation without such markers, but feel free to prove me wrong.</li>
<li>Are grace periods computed automatically? (If not, as in rcu_clean, none of the remaining questions apply.)</li>
<li>Are there synchronous grace-period-wait APIs? All of the Linux-kernel implementations do, and left_right appears to as well.</li>
<li>Are there asynchronous grace-period-wait APIs? If so, are there callback-wait APIs? All of the Linux-kernel implementations have both, but left_right does not appear to. Providing them seems doable, but might result in more than two copies of recently-updated data structures. The crossbeam epoch crate provides an asynchronous grace-period-wait function in the form of defer, but lacks a callback-wait API.</li>
<li>Are there polled grace-period-wait APIs? The Linux-kernel RCU and SRCU implementations do.</li>
<li>Are there multiple grace-period domains? The Linux-kernel SRCU implementation does.</li>
</ol>
<p>But does this classification scheme work for your favorite RCU implementation? What about your favorite RCU use case?</p>
<h1>History</h1>
<ul>
<li><em>January 25, 2023: Initial version.</em></li>
<li><em>January 26, 2023: Add DYNIX/ptx RCU equivalent, note that left_right might be a use of RCU rather than an implementation, and call out the fact that some of the Linux-kernel RCU implementations are based on others.</em></li>
<li><em>January 30, 2023: Respond to LWN request for the crossbeam crate. Expand the section summarizing RCU.</em></li>
</ul>
Tags: rcu, lkmm, rust, linux
Stupid RCU Tricks: So You Want To Add Kernel-Boot Parameters Behind rcutorture's Back?
paulmck, Thu, 15 Dec 2022 00:59:47 GMT
https://paulmck.livejournal.com/69158.html
<p>A previous <a href="https://paulmck.livejournal.com/58077.html" target="_blank">post</a> in this series showed how you can use the --bootargs parameter and .boot files to supply kernel boot parameters to the kernels under test. This works, but it turns out that there is another way, which is often the case with the Linux kernel. This other way is Masami Hiramatsu's bootconfig facility, which is nicely documented in detail <a href="https://dri.freedesktop.org/docs/drm/admin-guide/bootconfig.html" target="_blank" rel="nofollow">here</a>. This blog post is a how-to guide on making use of bootconfig when running rcutorture.</p>
<p>The bootconfig facility allows kernel boot parameters to be built into the initrd or directly into the kernel itself, the latter being the method used here. This requires that the kernel build system be informed of the parameters. Suppose that these parameters are placed in a file named /tmp/dump_tree.bootparam as follows:</p>
<p>kernel.rcutree.dump_tree=1<br>kernel.rcutree.blimit=15</p>
<p>Note well the "kernel." prefix, which is required here. The other option is an "init." prefix, which would cause the parameter to instead be passed to the init process.</p>
<p>Then the following three Kconfig options inform the build system of this file:</p>
<p>CONFIG_BOOT_CONFIG=y<br>CONFIG_BOOT_CONFIG_EMBED=y<br>CONFIG_BOOT_CONFIG_EMBED_FILE="/tmp/dump_tree.bootparam"</p>
<p>The resulting kernel image will then contain the above pair of kernel boot parameters. Except that you also have to tell the kernel to look for these parameters, which is done by passing in the "bootconfig" kernel boot parameter. And no, it does not work to add a "kernel.bootconfig" line to the /tmp/dump_tree.bootparam file! You can instead add it to a .boot file or to the kvm.sh command line like this: "--bootargs bootconfig".</p>
<p>For example, given this command:</p>
<p>tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 30s --configs TREE05 \<br> --bootargs "bootconfig" --trust-make</p>
<p>The resulting console.log file would contain the following text, showing that these boot parameters were in fact processed correctly, as indicated by the "Boot-time adjustment of callback invocation limit to 15." line and by the last three lines, which begin with "rcu_node tree layout dump".</p>
<p>-----------<br>Running RCU self tests<br>rcu: Preemptible hierarchical RCU implementation.<br>rcu: CONFIG_RCU_FANOUT set to non-default value of 6.<br>rcu: RCU lockdep checking is enabled.<br>rcu: Build-time adjustment of leaf fanout to 6.<br>rcu: Boot-time adjustment of callback invocation limit to 15.<br>rcu: RCU debug GP pre-init slowdown 3 jiffies.<br>rcu: RCU debug GP init slowdown 3 jiffies.<br>rcu: RCU debug GP cleanup slowdown 3 jiffies.<br> Trampoline variant of Tasks RCU enabled.<br>rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.<br>rcu: rcu_node tree layout dump<br>rcu: 0:7 ^0<br>rcu: 0:3 ^0 4:7 ^1<br>-----------</p>
<p>What happens if you use both CONFIG_BOOT_CONFIG_EMBED_FILE and the --bootargs parameter? The kernel boot parameters passed to --bootargs will be processed first, followed by those in /tmp/dump_tree.bootparam. Please note that the semantics of repeated kernel-boot parameters are subsystem-specific, so be careful.</p>
<p>The requirement that the "bootconfig" parameter be specified on the normal kernel command line can be an issue in environments where this command line is not easily modified. One way of avoiding such issues is to create a Kconfig option that causes the kernel to act as if the "bootconfig" parameter had been specified. For example, the following -rcu commit does just this with a new CONFIG_BOOT_CONFIG_FORCE Kconfig option:</p>
<p>674b57ddd75e ("bootconfig: Allow forcing unconditional bootconfig processing")</p>
<p>It is important to note that although these embedded kernel-boot parameters show up at the beginning of the "/proc/cmdline" file, they may also be found in isolation in the "/proc/bootconfig" file, for example, like this:</p>
<p>$ cat /proc/bootconfig<br>kernel.rcutree.dump_tree = 1<br>kernel.rcutree.blimit = 15</p>
<p>Why do the embedded kernel-boot parameters show up at the beginning of "/proc/cmdline" instead of at the end, given that they are processed after the other parameters? Because of the possibility of a "--" in the non-embedded traditionally sourced kernel boot parameters, which would make it appear that the embedded kernel-boot parameters were intended for the init process rather than for the kernel.</p>
<p>But what is the point of all this?</p>
<p>Within the context of rcutorture, probably not much. But there are environments where external means of setting per-kernel-version kernel-boot parameters are inconvenient, and rcutorture is an easy way of testing the embedding of those parameters directly into the kernel image itself.</p>
Tags: rcutorture, rcu, scalability, stupid rcu
Hiking Hills
paulmck, Sun, 06 Nov 2022 21:15:29 GMT
https://paulmck.livejournal.com/68926.html
<p>A few years ago, I <a href="https://paulmck.livejournal.com/56773.html" target="_blank">posted</a> on the challenges of maintaining low weight as one ages. I have managed to stay near my target weight, with the occasional excursion in either direction, though admittedly more often up than down. My suspicion that maintaining weight would prove 90% as difficult as losing it has proven to be all too well founded. As has the observation that exercise is inherently damaging to muscles (see for example <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6510035/" target="_blank" rel="nofollow">here</a>), especially as one's body's ability to repair itself decreases inexorably with age.</p>
<p>It can be helpful to refer back to those old college physics courses. One helpful formula is the well-worn Newtonian formula for kinetic energy, which is equal to half your mass times the square of your velocity. Now, the human body does not maintain precisely the same speed while moving (that is after all what bicycles are for), and the faster you are going, the more energy your body must absorb when decreasing your velocity by a set amount on each footfall. In fact, this amount of energy increases linearly with your average velocity. So you can reduce the energy absorption (and thus the muscle and joint damage) by decreasing your speed. And here you were wondering why old people often move much more slowly than do young people!</p>
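<p>To see where the linearity comes from, consider a footfall that reduces your speed from v to v − Δv. The kinetic energy your body must then absorb is</p>
<p>\Delta E = \tfrac{1}{2} m v^2 - \tfrac{1}{2} m (v - \Delta v)^2 = m v \, \Delta v - \tfrac{1}{2} m (\Delta v)^2</p>
<p>which, for a fixed Δv, grows linearly with v.</p>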
<p>But moving more slowly decreases the quality of the exercise, for example, requiring more time to gain the same cardiovascular benefits. One approach is to switch from (say) running to hiking uphill, thus decreasing velocity while increasing exertion. This works quite well, at least until it comes time to hike back down the hill.</p>
<p>At this point, another formula comes into play, that for potential energy. The energy released by lowering your elevation is your mass times the acceleration due to gravity times the difference in elevation. With each step you take downhill, your body must dissipate this amount of energy. Alternatively, you can transfer the potential energy into kinetic energy, but please see the previous discussion. And again, this is what bicycles are for, at least for those retaining sufficiently fast reflexes to operate them safely under those conditions. (Not me!!!)</p>
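<p>In symbols: each downhill step of height \Delta h releases \Delta E = m g \, \Delta h, where m is your mass and g is the acceleration due to gravity (about 9.8 m/s^2).</p>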
<p>The potential energy can be dissipated by your joints or by your muscles, with muscular dissipation normally being less damaging. In other words, bend your knee and hip before, during, and after the time that your foot strikes the ground. This gives your leg muscles more time to dissipate that step's worth of potential energy. Walking backwards helps by bringing your ankle joint into play and also by increasing the extent to which your hip and knee can flex. Just be careful to watch where you are going, as falling backwards down a hill is not normally what you want to be doing. (Me, I walk backwards down the steepest slopes, which allow me to see behind myself just by looking down. It is also helpful to have someone else watching out for you.)</p>
<p>Also, take small steps. This reduces the difference in elevation, thus reducing the amount of energy that must be dissipated per step.</p>
<p>But wait! This also increases the number of steps, so that the effect of reducing your stride cancels out, right?</p>
<p>Wrong.</p>
<p>First, longer stride tends to result in higher velocity, the damaging effects of which were described above. Second, the damage your muscles incur while dissipating energy is non-linear with both the force that your muscles are exerting and the energy per unit time (also known as "power") that they are dissipating. To see this, recall that a certain level of force/power will cause your muscle to rupture completely, so that a (say) 10x reduction in force/power results in much greater than a 10x reduction in damage.</p>
<p>These approaches allow you to get good exercise with minimal damage to your body. Other strategies include the aforementioned bicycling as well as swimming. Although I am fond of swimming, I recognize that it is my exercise endgame, and that I will therefore need to learn to like it. But not just yet.</p>
<p>To circle back to the subject of the earlier <a href="https://paulmck.livejournal.com/56773.html" target="_blank">blog post</a>, one common term in the formulas for both kinetic and potential energy is one's mass. And I do find hiking easier than it was when I weighed 30 pounds more than I do now. Should I lose more weight? On this, I defer to my wife, who is a dietitian. She assures me that 180 pounds is my target weight.</p>
<p>So here I am! And here I will endeavor to stay, despite my body's continued fear of the food-free winter that it has never directly experienced.</p>
Tags: aging hacker, advice, confessions
Stupid RCU Tricks: CPP Summit Presentation
paulmck, Fri, 07 Oct 2022 11:05:30 GMT
https://paulmck.livejournal.com/68686.html
<p>I had the privilege of presenting <a href="https://drive.google.com/file/d/143W28Yg_oRY1CwCrFGtKdk2QQSV5io2b/view?usp=sharing" target="_blank" rel="nofollow"><ins>Unraveling Fence & RCU Mysteries (C++ Concurrency Fundamentals)</ins></a> to the <a href="https://cpp-summit.org/en" target="_blank" rel="nofollow">CPP Summit</a>. As the title suggests, this covered RCU from a C++ viewpoint. At the end, there were several excellent questions:</p>
<ol>
<li>How does the rcu_synchronize() wait-for-readers operation work?</li>
<li>But the rcu_domain class contains lock() and unlock() member functions!!!</li>
<li>Lockless things make me nervous!</li>
</ol>
<p>There was limited time for questions, and each question's answer could easily have consumed the 50 minutes allotted for the entire talk. Therefore, I address these questions in the following sections.</p>
<h1>How Does rcu_synchronize() Work?</h1>
<p>There are a great many ways to make this work. Very roughly speaking, userspace RCU implementations usually have per-thread counters that are updated by readers and sampled by updaters, with the updaters waiting for all of the counters to reach zero. There are a large number of pitfalls and optimizations, some of which are covered in the 2012 Transactions On Parallel and Distributed Systems <a href="https://www.computer.org/csdl/journal/td/2012/02/ttd2012020375/13rRUxjQybC" target="_blank" rel="nofollow">paper</a> (<a href="https://www.rdrop.com/users/paulmck/RCU/urcu-main-accepted.2011.08.30a.pdf" target="_blank" rel="nofollow">non-paywalled draft</a>). The most detailed discussion is in the <a href="https://www.rdrop.com/users/paulmck/RCU/urcu-supp-accepted.2011.08.30a.pdf" target="_blank" rel="nofollow">supplementary materials</a>.</p>
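<p>To make the per-thread-counter idea concrete, here is a deliberately simplistic C11 sketch along those general lines. It is a toy: it can livelock updaters behind a steady stream of readers, it ignores thread registration, and it papers over the memory-ordering subtleties (which the paper covers in detail) by using sequentially consistent atomics throughout:</p>
<p>#include <stdatomic.h><br>
<br>
#define MAX_THREADS 128<br>
<br>
/* Per-thread read-side nesting counts. */<br>
static atomic_int rcu_nesting[MAX_THREADS];<br>
static _Thread_local int my_id; /* Assumed assigned at thread start. */<br>
<br>
void toy_rcu_read_lock(void)<br>
{<br>
    atomic_fetch_add(&rcu_nesting[my_id], 1);<br>
}<br>
<br>
void toy_rcu_read_unlock(void)<br>
{<br>
    atomic_fetch_sub(&rcu_nesting[my_id], 1);<br>
}<br>
<br>
/* Wait until each thread has been observed outside of a read-side<br>
 * critical section at least once, so that all pre-existing readers<br>
 * are guaranteed to have completed. */<br>
void toy_synchronize_rcu(void)<br>
{<br>
    for (int t = 0; t < MAX_THREADS; t++)<br>
        while (atomic_load(&rcu_nesting[t]) != 0)<br>
            continue; /* Production implementations block or back off. */<br>
}</p>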
<p>More recent information may be found in Section 9.5 of <a href="https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html" target="_blank" rel="nofollow">Is Parallel Programming Hard, And, If So, What Can You Do About It</a>?</p>
<h1>The rcu_domain Class Contains lock() and unlock() Member Functions?</h1>
<p>Indeed it does!</p>
<p>But names notwithstanding, these lock() and unlock() member functions need not contain memory-barrier instructions, let alone read-modify-write atomic operations, let alone acquisition and release of actual locks.</p>
<p>So why these misleading names??? These misleading names exist so that the rcu_domain class meets the requirements of Cpp17BasicLockable, which provides RAII capability for C++ RCU readers. Earlier versions of the RCU proposal for the C++ standard rolled their own RAII capability, but the committee wisely insisted that Cpp17BasicLockable's existing RAII capabilities be used instead.</p>
<p>So it is that rcu_domain::lock() simply enters an RCU read-side critical section and rcu_domain::unlock() exits that critical section. Yes, RCU read-side critical sections can be nested.</p>
<h1>Lockless Things Make Me Nervous!!!</h1>
<p>As well they should!</p>
<p>The wise developer will be at least somewhat nervous when implementing lockless code because that nervousness will help motivate the developer to be careful, to test and stress-test carefully, and, when necessary, make good use of formal verification.</p>
<p>In fact, one of the purposes of RCU is to package lockless code so as to make it easier to use. This presentation dove into one RCU use case, and other materials called out in this CPP Summit presentation looked into many other RCU use cases.</p>
<p>So proper use of RCU should enable developers to be less nervous. But hopefully not completely lacking in nervousness! :-)</p>
Tags: stupid rcu tricks
Parallel Programming: September 2022 Update
paulmck, Sun, 25 Sep 2022 22:43:50 GMT
https://paulmck.livejournal.com/68537.html
The v2022.09.25a release of <a href="https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html" target="_blank" rel="nofollow">Is Parallel Programming Hard, And, If So, What Can You Do About It?</a> is now available! The double-column version is also available from <a href="https://arxiv.org/abs/1701.00854" target="_blank" rel="nofollow">arXiv.org</a>.<br /><br />This version boasts an expanded index and API index, and also adds a number of improvements, perhaps most notably boldface for the most pertinent pages for a given index entry, courtesy of Akira Yokosawa. Akira also further improved the new ebook-friendly PDFs, expanded the list of acronyms, updated the build system to allow different perfbook formats to be built concurrently, adjusted for Ghostscript changes, carried out per-Linux-version updates, and did a great deal of formatting and other cleanup.<br /><br />One of the code samples now uses C11 thread-local storage instead of the GCC __thread storage class, courtesy of Elad Lahav. Elad also added support for building code samples on QNX.<br /><br />Johann Klähn, SeongJae Park, Xuwei Fu, and Zhouyi Zhou provided many welcome fixes throughout the book.<br /><br />This release also includes a number of updates to RCU, memory ordering, locking, and non-blocking synchronization, as well as additional information on the combined use of synchronization mechanisms.
Tags: is parallel programming hard, perfbook
Kangrejos 2022: The Rust for Linux Workshop
paulmck, Tue, 13 Sep 2022 07:59:52 GMT
https://paulmck.livejournal.com/68136.html
I had the honor of attending <a href="https://kangrejos.com" target="_blank" rel="nofollow">this workshop</a>; however, this is not a full report. Instead, I am reporting on what I learned and how it relates to my <a href="https://paulmck.livejournal.com/62436.html" target="_blank">So You Want to Rust the Linux Kernel?</a> blog series that I started in 2021, and of which this post is a member. I will leave more complete coverage of a great many interesting sessions to others (for example, <a href="https://lwn.net/Articles/907405/" target="_blank" rel="nofollow">here</a>, <a href="https://lwn.net/Articles/907685/" target="_blank" rel="nofollow">here</a>, and <a href="https://lwn.net/Articles/907876/" target="_blank" rel="nofollow">here</a>), who are probably better versed in Rust than am I in any case.<br /><br /><h2>Device Communication and Memory Ordering</h2>Andreas Hindborg talked about his work on a Rust-language <a href="https://en.wikipedia.org/wiki/PCI_Express" target="_blank" rel="nofollow">PCI</a> <a href="https://en.wikipedia.org/wiki/NVM_Express" target="_blank" rel="nofollow">NVMe</a> driver for the Linux kernel x86 architecture. I focus on the driver's interaction with device firmware in main memory. As is often the case, there is a shared structure in normal memory in which the device firmware reports the status of an I/O request. This structure has a special 16-bit field that is used to report that all of the other fields are now filled in, so that the Linux device driver can now safely access them.<br /><br />This of course requires some sort of memory ordering, both on the part of the device firmware and on the part of the device driver. One straightforward approach is for the driver to use <tt>smp_load_acquire()</tt> to access that 16-bit field, and only if the returned value indicates that the other fields have been filled in, access those other fields.<br /><br />To the surprise of several Linux-kernel developers in attendance, Andreas's driver instead does a volatile load of the entire structure, 16-bit field and all. If the 16-bit field indicates that all the fields have been updated by the firmware, it then does a second volatile load of the entire structure and then proceeds to use the values from this second load. Apparently, the performance degradation from these duplicate accesses, though measurable, is quite small. However, given the possibility of larger structures for which the performance degradation might be more profound, Andreas was urged to come up with a more elaborate Rust construction that would permit separate access to the 16-bit field, thus avoiding the redundant memory traffic.<br /><br />In a subsequent conversation, Andreas noted that the overall performance degradation of this Rust-language prototype compared to the C-language version is at most 2.61%, and that in one case there is an as-yet-unexplained performance <i>increase</i> of 0.67%. Of course, given that both C and Rust are compiled, statically typed, non-garbage-collected, and handled by the same compiler back-end, it should not be a surprise that they achieve similar performance.
An interesting question is whether the larger amount of information available to the Rust compiler will ever allow it to consistently provide better performance than does C.<br /><br /><h2>Rust and Linux-Kernel Filesystems</h2>Wedson Almeida Filho described his work on a Rust implementation of portions of the Linux kernel's <a href="https://en.wikipedia.org/wiki/9P_(protocol)" target="_blank" rel="nofollow">9p filesystem</a>. The part of this effort that caught my attention was use of the Rust language's asynchronous facilities to implement some of the required state machines and use of a Rust-language type-aware packet-decoding facility.<br /><br />So what do these two features have to do with concurrency in general, let alone memory ordering in particular?<br /><br />Not much. Sure, <tt>call_rcu()</tt> is (most of) the async equivalent of <tt>synchronize_rcu()</tt>, but that should be well-known and thus not particularly newsworthy.<br /><br />But these two features might have considerable influence over the success or lack thereof of the Rust-for-Linux effort.<br /><br />In the past, much of the justification for use of Rust in Linux has been memory safety, based on the fact that 70% of the Linux exploits have involved memory unsafety. Suppose that Rust manages to address the entire 70%, as opposed to relegating (say) half of that to Rust-language unsafe code. Then use of Rust might at best provide a factor-of-three improvement.<br /><br />Of course, a factor of three is quite good in absolute terms. However, in my experience, at least an order of magnitude is usually required to motivate, with high probability, the kind of large change represented by the C-to-Rust transition. Therefore, in order for me to rate the Rust-for-Linux project's success as highly probable, another factor of three is required.<br /><br />One possible source of the additional factor of three might come from the Rust-for-Linux community giving neglected drivers some much-needed attention. Perhaps the 9p work would qualify, but it seems unlikely that the NVMe drivers are lacking attention.<br /><br />Perhaps Rust async-based state machines and type-aware packet decoding will go some ways towards providing more of the additional factor of three. Or perhaps a single factor of three will suffice in this particular case. Who knows?<br /><br /><h2>“Pinning” for the Linux Kernel in Rust</h2>Benno Lossin and Gary Guo reported on their respective efforts to implement pinning in Rust for the Linux kernel (a more detailed report may be found <a href="https://lwn.net/Articles/907876/" target="_blank" rel="nofollow">here</a>).<br /><br />But perhaps you, like me, are wondering just what pinning is and why the Linux kernel needs it.<br /><br />It turns out that the Rust language allows the compiler great freedom to rearrange data structures by default, similar to what would happen in C-language code if each and every structure was marked with the Linux kernel's <tt>__randomize_layout</tt> directive. In addition, by default, Rust-language data can be moved at runtime.<br /><br />This apparently allows additional optimizations, but it is not particularly friendly to Linux-kernel idioms that take addresses of fields in structures. Such idioms include pretty much all of Linux's linked-list code, as well as things like RCU callbacks. And who knows, perhaps this situation helps to explain recent hostility towards linked lists expressed on social media.
;-)<br /><br />The Rust language accommodates these idioms by allowing data to be pinned, thus preventing the compiler from moving it around.<br /><br />However, pinning data has consequences, including the fact that such pinning is visible in Rust's type system. This visibility allows Rust compilers to complain when (for example) <tt>list_for_each_entry()</tt> is passed an unpinned data structure. Such complaints are important, given that passing unpinned data to primitives expecting pinned data would result in memory unsafety. The code managing the pinning is expected to be optimized away, but there are concerns that it would make itself all too apparent during code review and debugging.<br /><br />Benno's and Gary's approaches were compared and contrasted, but my guess is that additional work will be required to completely solve this problem. Some attendees would like to see pinning addressed at the Rust-language level rather than at the library level, but time will tell what approach is most appropriate.<br /><br /><h2>RCU Use Cases</h2>Although RCU was not an official topic, for some reason it came up frequently during the hallway tracks that I participated in. Which is probably a good thing. ;-)<br /><br />One question is exactly how the various RCU use cases should be represented in Rust. Rust's atomic-reference-count facility has been put forward, and it is quite possible that, in combination with liberal use of atomics, it will serve for some of the more popular use cases. Wedson suggested that <a href="https://rust-for-linux.github.io/docs/kernel/revocable/struct.Revocable.html" target="_blank" rel="nofollow">Revocable</a> covers another group of use cases, and at first glance, it appears that Wedson might be quite right. Still others might be handled by wait-wakeup mechanisms. There are still some use cases to be addressed, but some progress is being made.<br /><br /><h2>Rust and Python</h2>Some people suggested that Rust might take over from Python, adding that Rust solves many problems that Python is said to encounter in large-scale programs, and that Rust should prove more ergonomic than Python. The first is hard to argue with, given that Python seems to be most successful in smaller-scale projects that make good use of Python's extensive libraries. The second seems quite unlikely to me; in fact, it is hard for me to imagine that anything about Rust would seem in any way ergonomic to a dyed-in-the-wool Python developer.<br /><br />One particularly non-ergonomic attribute of current Rust is likely to be its libraries, which last I checked were considerably less extensive than those of Python. Although the software-engineering shortcomings of large-scale Python projects can motivate and have motivated conversion to Rust, it appears to me that smaller Python projects making heavier use of a wider variety of libraries might need more motivation.<br /><br />Automatic conversion of Python libraries, anyone? ;-)<br /><br /><h2>Conclusions</h2>I found Kangrejos 2022 to be extremely educational and thought-provoking, and am very glad that I was able to attend. I am still uncertain about Rust's prospects in the Linux kernel, but the session provided some good examples of actual work done and problems solved. Perhaps Rust's async and type-sensitive packet-decoding features will help increase its probability of success, and perhaps there are additional Rust features waiting to be applied, for example, use of Rust iterators to simplify lockdep cycle detection. And who knows?
Perhaps future versions of Rust will be able to offer consistent performance advantages. Stranger things have happened!<br /><br /><h4>History</h4><i>September 19, 2022: Changes based on <a href="https://lpc.events/" target="_blank" rel="nofollow">LPC</a> and <a href="https://lwn.net/Kernel/Index/#Development_tools-Rust" target="_blank" rel="nofollow">LWN</a> reports.</i>
Tags: lkmm, rust, linux
Confessions of a Recovering Proprietary Programmer, Part XIX: Concurrent Computational Models
paulmck, Thu, 01 Sep 2022 15:06:16 GMT
https://paulmck.livejournal.com/67992.html
The September 2022 issue of the <a href="https://cacm.acm.org/magazines/2022/9" target="_blank" rel="nofollow">Communications of the ACM</a> features an entertaining pair of point/counterpoint articles, with William Dally advocating for concurrent computational models that more accurately model both energy costs and latency <a href="https://cacm.acm.org/magazines/2022/9/263792-on-the-model-of-computation-point/fulltext" target="_blank" rel="nofollow">here</a> and Uzi Vishkin advocating for incremental modifications to the existing Parallel Random Access Machine (PRAM) model <a href="https://cacm.acm.org/magazines/2022/9/263793-on-the-model-of-computation-counterpoint/fulltext" target="_blank" rel="nofollow">here</a>. Some cynics might summarize Dally's article as “please develop software that makes better use of our hardware so that we can sell you more hardware” while other cynics might summarize Vishkin's article as “please keep funding our research project”. In contrast, I claim that both articles are worth a deeper look. The next section looks at Dally's article, the section after that at Vishkin's article, followed by discussion.<br /><br />TL;DR: In complete consonance with a depressingly large fraction of 21<sup>st</sup>-century discourse, they are talking right past each other.<br /><br /><h3>On the Model of Computation: Point: We Must Extend Our Model of Computation to Account for Cost and Location</h3><a href="https://cacm.acm.org/magazines/2022/9/263792-on-the-model-of-computation-point/fulltext" target="_blank" rel="nofollow">Dally</a> makes a number of good points. First, he notes that past decades have seen a shift from computation being more expensive than memory references to memory references being more expensive than computation, and by a good many orders of magnitude. Second, he points out that in production, constant factors can be more important than asymptotic behavior. Third, he describes situations where hardware caches are not a silver bullet. Fourth, he calls out the necessity of appropriate computational models for performance programming, though to his credit, he does clearly state that not all programming is performance programming. Fifth, he calls out the global need for reduced energy expended on computation, motivated by projected increases in computation-driven energy demand. Sixth and finally, he advocates for a new computational model that builds on PRAM by (1) assigning different costs to computation and to memory references and (2) making memory-reference costs a function of a distance metric based on locations assigned to processing units and memories.<br /><br /><h3>On the Model of Computation: Counterpoint: Parallel Programming Wall and Multicore Software Spiral: Denial Hence Crisis</h3><a href="https://cacm.acm.org/magazines/2022/9/263793-on-the-model-of-computation-counterpoint/fulltext" target="_blank" rel="nofollow">Vishkin</a> also makes some interesting points. First, he calls out the days of Moore's-Law frequency scaling, calling for the establishment of a similar “free lunch” scaling in terms of numbers of CPUs. He echoes Dally's point about not all programming being performance programming, but argues that in addition to a “memory wall” and an “energy wall”, there is also a “parallel programming wall” because programming multicore systems is “simply too difficult”. Second, Vishkin calls on hardware vendors to take more account of ease of use when creating computer systems, including systems with GPUs.
Third, Vishkin argues that compilers should take up much of the burden of squeezing performance out of hardware. Fourth, he makes several arguments that concurrent systems should adapt to PRAM rather than adapting PRAM to existing hardware. Fifth and finally, he calls for funding for a “multicore software spiral” that he hopes will bring back free-lunch performance increases driven by Moore's Law, but based on increasing numbers of cores rather than increasing clock frequency.<br /><br /><h3>Discussion</h3>Figure 2.3 of <a href="https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html" target="_blank" rel="nofollow">Is Parallel Programming Hard, And, If So, What Can You Do About It?</a> (AKA “perfbook”) shows the “Iron Triangle” of concurrent programming. Vishkin focuses primarily on the upper edge of this triangle, which is primarily about productivity. Dally focuses primarily on the lower left edge of this triangle, which is primarily about performance. Except that Vishkin's points about the cost of performance programming are all too valid, which means that defraying the cost of performance programming in the work-a-day world requires very large numbers of users, which is often accomplished by making performance-oriented concurrent software quite general, thus amortizing its cost over large numbers of use cases and users. This puts much performance-oriented concurrent software at the lower tip of the iron triangle, where performance meets generality.<br /><br />One could argue that Vishkin's proposed relegation of performance-oriented code to compilers is one way to break the iron triangle of concurrent programming, but this argument fails because compilers are themselves low-level software (thus at the lower tip of the triangle). Worse yet, there have been many attempts to put the full concurrent-programming burden on the compiler (or onto transactional memory), but rarely to good effect. Perhaps SQL is the exception that proves this rule, but this is not an example that Vishkin calls out.<br /><br />Both Dally and Vishkin correctly point to the ubiquity of multicore systems, so it is only reasonable to look at how these systems are handled in practice. After all, it is only fair to check their apparent assumption of “badly”. And Section 2.3 of perfbook lists several straightforward approaches: (1) Using multiple instances of a sequential application, (2) Using the large quantity of existing parallel software, ranging from operating-system kernels to database servers (SQL again!) to numerical software to image-processing libraries to machine-learning libraries, to name but a few of the areas that Python developers could teach us about, and (3) Applying sequential performance optimizations such that single-threaded performance suffices. Those taking door 1 or door 2 will need only a few parallel performance programmers, and it seems to have proven eminently feasible to have those few use a sufficiently accurate computational model. The remaining users simply invoke the software produced by these parallel performance programmers, and need not worry much about concurrency. The cases where none of these three doors apply are more challenging, but surmounting the occasional challenge is part of the programming game, not just the parallel-programming game.<br /><br />One additional historical trend is the sharply decreasing cost of concurrent systems, both in terms of purchase price and energy consumption.
Where an entry-level late-1980s parallel system cost millions of dollars and consumed kilowatts, an entry-level 2022 system can be had for less than $100. This means that there is no longer an overwhelming economic imperative to extract every ounce of performance from most year-2022 concurrent systems. For example, much smartphone code attains the required performance while running single-threaded, which means that additional performance from parallelism could well be a waste of energy. Instead, the smartphone automatically powers down the unused hardware, thus providing only the performance that is actually needed, and does so in an energy-efficient manner. The occasional heavy-weight application might turn on all the CPUs, but only such applications need the ministrations of parallel performance programmers and complex computational models.<br /><br />Thus, Dally and Vishkin are talking past each other, describing different types of software produced by different types of developers, who can each use whatever computational model is appropriate to the task at hand.<br /><br />Which raises the question of whether there might be a one-size-fits-all computational model. I have my doubts. In my 45 years of supporting myself and (later) my family by developing software, I have seen many models, frameworks, languages, and libraries come and go. In contrast, to the best of my knowledge, the speed of light and the sizes of the various types of atoms have remained constant. Don't get me wrong, I do hope that the trend towards vertical stacking of electronics within CPU chips will continue, and I also hope that this trend will result in smaller systems that are correspondingly less constrained by speed-of-light and size-of-atoms limitations. However, the laws of physics appear to be exceedingly stubborn things, and we would therefore be well-advised to work together. Dally's attempt to dictate current hardware conveniences to software developers is about as likely to succeed as is Vishkin's attempt to dictate current software conveniences to hardware developers. Both of these approaches are increasingly likely to fail should the trend towards special-purpose hardware continue full force. Software developers cannot always ignore the laws of physics, and by the same token, hardware developers cannot always ignore the large quantity of existing software and the large number of existing software developers. To be sure, in order to continue to break new ground, both software and hardware developers will need to continue to learn new tricks, but many of these tricks will require coordinated hardware/software effort.<br /><br />Sadly, there is no hint in either Dally's or Vishkin's article of any attempt to learn what the many successful developers of concurrent software actually do. Thus, it is entirely possible that neither Dally's nor Vishkin's preferred concurrent computational model is applicable to actual parallel programming practice. In fact, in my experience, developers often use the actual hardware and software as the model, driving their development and optimization efforts from actual measurements and from intuitions built up from those measurements. Some might look down on this approach as being both non-portable and non-mathematical, but portability is often achieved simply by testing on a variety of hardware, courtesy of the relatively low prices of modern computer systems.
As for mathematics, those of us who appreciate mathematics would do well to admit that it is often much more efficient to let the objective universe carry out the mathematical calculations in its normal implicit and ineffable manner, especially given the complexity of present-day hardware and software.<br /><br />Circling back to the cynics, there are no doubt some cynics who would summarize this blog post as &#8220;please download and read my <a href="https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html" target="_blank" rel="nofollow">book</a>&#8221;. Except that in contrast to the hypothesized cynical summarization of Dally's and Vishkin's articles, you can download and read my book free of charge. Which is by design, given that the whole point is to promulgate information. Cue cynics calling out which of the three of us is the greater fool! ;-)
https://paulmck.livejournal.com/67992.html?view=comments#comments
is parallel programming hard
performance
parallel
readings
perfbook
confessions
public
0
-
https://paulmck.livejournal.com/67832.html
Fri, 12 Aug 2022 20:06:40 GMT
Stupid SMP Tricks: A Review of Locking Engineering Principles and Hierarchy
paulmck
https://paulmck.livejournal.com/67832.html
Daniel Vetter put together a pair of intriguing blog posts entitled <a href="https://blog.ffwll.ch/2022/07/locking-engineering.html" target="_blank" rel="nofollow">Locking Engineering Principles</a> and <a href="https://blog.ffwll.ch/2022/08/locking-hierarchy.html" target="_blank" rel="nofollow">Locking Engineering Hierarchy</a>. These appear to be an attempt to establish a set of GPU-wide or perhaps even driver-tree-wide concurrency coding conventions.<br /><br />Which would normally be none of my business. After all, to establish such conventions, Daniel needs to negotiate with the driver subsystem's developers and maintainers, and I am neither. Except that he did <a href="https://twitter.com/danvet/status/1554799358096334848?ref_src=twsrc%5Etfw" target="_blank" rel="nofollow">call me out</a> on Twitter on this topic. So here I am, <a href="https://twitter.com/paulmckrcu/status/1555327265642147840" target="_blank" rel="nofollow">as promised</a>, offering color commentary and the occasional suggestion for improvement, both of Daniel's proposal and of the kernel itself. The following sections review his two posts, and then summarize and amplify suggestions for improvement.<br /><br /><h1><a href="https://blog.ffwll.ch/2022/07/locking-engineering.html" target="_blank" rel="nofollow">Locking Engineering Principles</a></h1><br />&#8220;<b>Make it Dumb</b>&#8221; should be at least somewhat uncontroversial, as it is quite similar to a rather more flamboyant sound bite from my 1970s Mechanical Engineering university coursework. Kernighan's Law, asserting that debugging is twice as hard as coding, should also serve as a caution.<br /><br />This leads us to &#8220;<b>Make it Correct</b>&#8221;, which many might argue should come first in the list. After all, if correctness is not a higher priority than dumbness (or simplicity, if you prefer), then the do-nothing null solution would always be preferred. And to his credit, Daniel does explicitly state that &#8220;simple doesn't necessarily mean correct&#8221;.<br /><br />It is hard to argue with Daniel's choosing to give <a href="https://lwn.net/Articles/185666/" target="_blank" rel="nofollow">lockdep</a> pride of place, especially given how many times it has saved me over the years, and not just from deadlock. I never have felt the need to teach lockdep RCU's locking rules, but then again, RCU is much more monolithic than is the GPU subsystem. Perhaps if RCU's locking were to become more ornate, I too would feel the need to acquire RCU's key locks in the intended order at boot time. Give or take the fact that much of RCU can be used very early, even before the call to <tt>rcu_init()</tt>.<br /><br />The validation point is also a good one, with Daniel calling out <tt>might_lock()</tt>, <tt>might_sleep()</tt>, and <tt>might_alloc()</tt>. I go further and add <tt>lockdep_assert_irqs_disabled()</tt>, <tt>lockdep_assert_irqs_enabled()</tt>, and <tt>lockdep_assert_held()</tt>, but perhaps these additional functions are needed in low-level code like RCU more than they are in GPU drivers.
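<br /><br />For those who have not used these functions, here is a minimal sketch of <tt>lockdep_assert_held()</tt> documenting and enforcing a locking contract. The <tt>foo_lock</tt>, <tt>foo_count</tt>, and <tt>foo_add()</tt> names are hypothetical, invented for this example:<br /><br /><blockquote><pre>
#include &lt;linux/spinlock.h&gt;
#include &lt;linux/lockdep.h&gt;

static DEFINE_SPINLOCK(foo_lock);
static unsigned long foo_count;

/* Caller must hold foo_lock. */
static void foo_add(unsigned long n)
{
	/* In lockdep kernels, complain if the caller forgot the lock. */
	lockdep_assert_held(&foo_lock);
	foo_count += n;
}
</pre></blockquote>The assertion costs nothing in production kernels, but in <tt>CONFIG_PROVE_LOCKING=y</tt> kernels it converts a silent race into an immediate and informative splat.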
<br /><br />Daniel's admonishments to avoid reinventing various synchronization wheels of course comprise excellent common-case advice. And if you believe that your code is the uncommon case, you are most likely mistaken. Furthermore, it is hard to argue with &#8220;pick the simplest lock that works&#8221;.<br /><br />It is hard to argue with the overall message of &#8220;<b>Make it Fast</b>&#8221;, especially the bit about the pointlessness of crashing faster. However, a minimum level of speed is absolutely required. For a 1977 example, an old school project required keeping up with the display. If the code failed to keep up with the display, that code was useless. More recent examples include deep sub-second request-response latency requirements in many Internet datacenters. In other words, up to a point, performance is not simply a &#8220;Make it Fast&#8221; issue, but also a &#8220;Make it Correct&#8221; issue. On the other hand, once the code meets its performance requirements, any further performance improvements instead fall into the lower-priority &#8220;Make it Fast&#8221; bucket.<br /><br />Except that things are not always quite so simple. Not all performance improvements can be optimized into existence late in the game. In fact, if your software's performance requirements are expected to tax your system's resources, it will be necessary to make performance a first-class consideration all the way up to and including the software-architecture level, as discussed <a href="https://www.usenix.org/system/files/conference/hotpar12/hotpar12-final18.pdf" target="_blank" rel="nofollow">here</a>. And, to his credit, Daniel hints at this when he notes that &#8220;the right fix for performance issues is very often to radically update the contract and sharing of responsibilities between the userspace and kernel driver parts&#8221;. He also helpfully calls out <a href="https://lwn.net/Kernel/Index/#io_uring" target="_blank" rel="nofollow">io_uring</a> as one avenue for radical updates.<br /><br />&#8220;<b>Protect Data, not Code</b>&#8221; is good general advice that goes back to Jack Inman's 1985 USENIX paper entitled &#8220;Implementing Loosely Coupled Functions on Tightly Coupled Engines&#8221;, if not further. However, there are times when what needs to be locked is not data, but perhaps a particular set of state transitions, or maybe a rarely accessed state, which might be best represented by the code for that state. That said, there can be no denying the big-kernel-lock hazards of excessive reliance on code locking. Furthermore, I have only rarely needed abstract non-data locking.<br /><br />Nevertheless, a body of code as large as the full set of Linux-kernel device drivers should be expected to have a large number of exceptions to the &#8220;Protect Data&#8221; rule of thumb, reliable though that rule has proven in my own experience. The usual way of handling this is a waiver process. After all, given that every set of safety-critical coding guidelines I have seen has a waiver process, it only seems reasonable that device-driver concurrency coding conventions should also involve a waiver process. But that decision belongs to the device-driver developers and maintainers, not to me.<br /><br /><h1><a href="https://blog.ffwll.ch/2022/08/locking-hierarchy.html" target="_blank" rel="nofollow">Locking Engineering Hierarchy</a></h1><br />Most of the <b>Level 0: No Locking</b> section should be uncontroversial: Do things the easy way, at least when the easy way works. This approach goes back decades, for but one example, single-threaded user-interface code delegating concurrency to a database management system. And it should be well understood that complicated things can be made easy (or at least easier) through use of carefully constructed and time-tested APIs, for example, the <tt>queue_work()</tt> family of APIs called out in this section.
One important question for all such APIs is this: &#8220;Can someone with access to only the kernel source tree and a browser quickly find and learn how to use the given API?&#8221; After all, people are able to properly use the APIs they see elsewhere <b>if</b> they can quickly and easily learn about them.<br /><br />And yes, given Daniel's comments regarding <tt>struct dma_fence</tt> later in this section, the answer for <tt>SLAB_TYPESAFE_BY_RCU</tt> appears to be &#8220;no&#8221; (but see <tt>Documentation/RCU/rculist_nulls.rst</tt>). The trick with <tt>SLAB_TYPESAFE_BY_RCU</tt> is that readers must validate the allocation. A common way of doing this is to have a reference count that is set to the value one immediately after obtaining the object from <tt>kmem_cache_alloc()</tt> and set to the value zero immediately before passing the object to <tt>kmem_cache_free()</tt>. And yes, I have a patch queued to fix the misleading text in <tt>Documentation/RCU/whatisRCU.rst</tt>, and I apologize to anyone who might have attempted to use locking without a reference count. But please also let the record show that there was no bug report.<br /><br />A reader attempting to obtain a new lookup-based reference to a <tt>SLAB_TYPESAFE_BY_RCU</tt> object must do something like this:<br /><ol><br /><li> Invoke <tt>atomic_inc_not_zero()</tt>, which will return <tt>false</tt> and not do the increment if the initial value was zero. Presumably <tt>dma_fence_get_rcu()</tt> handles this for <tt>struct dma_fence</tt>.<br /><li> If the value returned above was <tt>false</tt>, pretend that the lookup failed. Otherwise, continue with the following steps.<br /><li> Check the identity of the object. If this turns out to not be the desired object, release the reference (cleaning up if needed) and pretend that the lookup failed. Otherwise, continue. Presumably <tt>dma_fence_get_rcu_safe()</tt> handles this for <tt>struct dma_fence</tt>, in combination with its call to <tt>dma_fence_put()</tt>.<br /><li> Use the object.<br /><li> Release the reference, cleaning up if needed.<br /></ol>This of course underscores Daniel's point (leaving aside the snark), which is that you should not use <tt>SLAB_TYPESAFE_BY_RCU</tt> unless you really need to, and even then you should make sure that you know how to use it properly.
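<br /><br />To make that recipe concrete, here is a minimal sketch of such a lookup. The <tt>struct foo</tt>, <tt>foo_table_find()</tt>, and <tt>foo_put()</tt> names are hypothetical, standing in for whatever type, search structure, and release function a real use case would have:<br /><br /><blockquote><pre>
#include &lt;linux/atomic.h&gt;
#include &lt;linux/rcupdate.h&gt;

struct foo {
	atomic_t refcount;
	unsigned long id;
	/* ... */
};

static struct foo *foo_lookup(unsigned long id)
{
	struct foo *p;

	rcu_read_lock();
	p = foo_table_find(id);		/* hypothetical search function */
	/* Step 1: the increment fails if the object has been freed. */
	if (p && !atomic_inc_not_zero(&p->refcount))
		p = NULL;		/* Step 2: pretend lookup failed. */
	/* Step 3: object now pinned, but possibly reallocated, so
	 * recheck its identity. */
	if (p && p->id != id) {
		foo_put(p);		/* hypothetical reference release */
		p = NULL;
	}
	rcu_read_unlock();
	return p;			/* Steps 4 and 5 are the caller's job. */
}
</pre></blockquote>The identity recheck in step 3 is the part most often forgotten: between the search and the <tt>atomic_inc_not_zero()</tt>, the object might have been freed and reallocated to represent something else entirely.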
<br /><br />But just when do you need to use <tt>SLAB_TYPESAFE_BY_RCU</tt>? One example is when you need RCU protection on such a hot fastpath that you cannot tolerate RCU-freed objects becoming cold in the CPU caches due to RCU's grace-period delays. Another example is where objects are being allocated and freed at such a high rate that if each and every <tt>kmem_cache_free()</tt> was subject to grace-period delays, excessive memory would be tied up waiting for grace periods, perhaps even resulting in out-of-memory (OOM) situations.<br /><br />In addition, carefully constructed semantics are required. Semantics based purely on ordering tend to result in excessive conceptual complexity and thus in confusion. More appropriate semantics tend to be conditional, for example: (1) Objects that remain in existence during the full extent of the lookup are guaranteed to be found, (2) Objects that do not exist at any time during the full extent of the lookup are guaranteed not to be found, and (3) Objects that exist for only a portion of the lookup might or might not be found.<br /><br />Does <tt>struct dma_fence</tt> need <tt>SLAB_TYPESAFE_BY_RCU</tt>? Is the code handling <tt>struct dma_fence</tt> doing the right thing? For the moment, I will just say that: (1) The existence of <tt>dma_fence_get_rcu_safe()</tt> gives me at least some hope, (2) Having two different slab allocators stored into different static variables having the same name is a bit code-reader-unfriendly, and (3) The use of <tt>SLAB_TYPESAFE_BY_RCU</tt> for a structure that contains an <tt>rcu_head</tt> structure is a bit unconventional, but perhaps these structures are sometimes obtained from <tt>kmalloc()</tt> instead of from <tt>kmem_cache_alloc()</tt>.<br /><br />One thing this section missed (or perhaps intentionally omitted): If you do need memory barriers, you are almost always better off using <tt>smp_store_release()</tt> and <tt>smp_load_acquire()</tt> than the old-style <tt>smp_wmb()</tt> and <tt>smp_rmb()</tt>.
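<br /><br />For those who have not made this switch, here is a minimal publish/consume sketch using release and acquire. The <tt>foo_data</tt> and <tt>foo_ready</tt> variables are hypothetical:<br /><br /><blockquote><pre>
#include &lt;linux/types.h&gt;
#include &lt;asm/barrier.h&gt;

static int foo_data;
static int foo_ready;

static void foo_publish(int v)
{
	foo_data = v;
	/* Orders the foo_data store before the foo_ready store. */
	smp_store_release(&foo_ready, 1);
}

static bool foo_consume(int *v)
{
	/* Orders the foo_ready load before the foo_data load. */
	if (!smp_load_acquire(&foo_ready))
		return false;
	*v = foo_data;
	return true;
}
</pre></blockquote>Unlike <tt>smp_wmb()</tt> and <tt>smp_rmb()</tt>, the release and acquire operations name the variable doing the communicating, which makes the code's intent far easier to review.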
<br /><br />The <b>Level 1: Big Dumb Lock</b> section certainly summarizes the pain of finding the right scope for your locks. Despite Daniel's suggesting that lockless tricks are almost always the wrong approach, such tricks are in common use. I would certainly agree that randomly hacking lockless tricks into your code will almost always result in disaster. In fact, this process will use your own intelligence against you: The smarter you think you are, the deeper a hole you will have dug for yourself before you realize that you are in trouble.<br /><br />Therefore, if you think that you need a lockless trick, first actually measure the performance. Second, if there really is a performance issue, do the work to figure out which code and data is actually responsible, because your blind guesses will often be wrong. Third, read documentation, look at code, and talk to people to learn and fully understand how this problem has been solved in the past, then carefully apply those solutions. Fourth, if there is no solution, review your overall design, because an ounce of proper partitioning is worth some tons of clever lockless code. Fifth, if you must use a new solution, verify it beyond all reason. The code in <tt>kernel/rcu/rcutorture.c</tt> will give you some idea of the level of effort required.<br /><br />The <b>Level 2: Fine-grained Locking</b> section provides some excellent advice, guidance, and cautionary tales. Yes, lockdep is a great thing, but there are deadlocks that it does not detect, so some caution is still required.<br /><br />The yellow-highlighted <b>Locking Antipattern: Confusing Object Lifetime and Data Consistency</b> describes how holding locks across non-memory-style barrier functions, that is, things like <tt>flush_work()</tt> as opposed to things like <tt>smp_mb()</tt>, can cause problems, up to and including deadlock. In contrast, whatever other problems memory-barrier functions such as <tt>smp_mb()</tt> cause, it does take some creativity to add them to a lock-based critical section so as to cause them to participate in a deadlock cycle. I would not consider <tt>flush_work()</tt> to be a memory barrier in disguise, although it is quite true that a correct high-performance implementation of <tt>flush_work()</tt> will require careful memory ordering.<br /><br />I would have expected a rule such as &#8220;Don't hold a lock across <tt>flush_work()</tt> that is acquired in any corresponding workqueue handler&#8221;, but perhaps Daniel is worried about future deadlocks as well as present-day deadlocks. After all, if you do hold a lock across a call to <tt>flush_work()</tt>, perhaps there will soon be some compelling reason to acquire that lock in a workqueue handler, at which point it is game over due to deadlock.<br /><br />This issue is of course by no means limited to workqueues. For but one example, it is quite possible to generate similar deadlocks by holding spinlocks across calls to <tt>del_timer_sync()</tt> that are acquired within timer handlers.<br /><br />And that is why both <tt>flush_work()</tt> and <tt>del_timer_sync()</tt> tell lockdep what they are up to. For example, <tt>flush_work()</tt> invokes <tt>__flush_work()</tt>, which in turn invokes <tt>lock_map_acquire()</tt> and <tt>lock_map_release()</tt> on a fictitious lock, and this same fictitious lock is acquired and released by <tt>process_one_work()</tt>. Thus, if you acquire a lock in a workqueue handler that is held across a corresponding <tt>flush_work()</tt>, lockdep will complain, as shown in the 2018 commit 87915adc3f0a ("workqueue: re-add lockdep dependencies for flushing") by Johannes Berg. Of course, <tt>del_timer_sync()</tt> uses this same trick, as shown in the 2009 commit 6f2b9b9a9d75 ("timer: implement lockdep deadlock detection"), also by Johannes Berg.<br /><br />Nevertheless, deadlocks involving <tt>flush_work()</tt> and workqueue handlers can be subtle. It is therefore worth investing some up-front effort to avoid them.
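<br /><br />For concreteness, here is a minimal sketch of the deadlock pattern in question, using the hypothetical names <tt>foo_lock</tt>, <tt>foo_work</tt>, and friends:<br /><br /><blockquote><pre>
#include &lt;linux/mutex.h&gt;
#include &lt;linux/workqueue.h&gt;

static DEFINE_MUTEX(foo_lock);

static void foo_work_fn(struct work_struct *work)
{
	mutex_lock(&foo_lock);	/* Handler wants foo_lock... */
	/* ... */
	mutex_unlock(&foo_lock);
}
static DECLARE_WORK(foo_work, foo_work_fn);

static void foo_shutdown(void)
{
	mutex_lock(&foo_lock);
	/* ...but foo_lock is held across the flush.  If foo_work_fn()
	 * is running but has not yet acquired foo_lock, flush_work()
	 * waits on a handler that is waiting on us: deadlock.  Thanks
	 * to the fictitious-lock trick described above, lockdep will
	 * complain even on runs where the timing happens to work out. */
	flush_work(&foo_work);
	mutex_unlock(&foo_lock);
}
</pre></blockquote>The fix is equally simple, at least in sketch form: drop <tt>foo_lock</tt> before invoking <tt>flush_work()</tt>, or arrange for the handler to avoid acquiring it.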
<br /><br />The orange-highlighted <b>Level 2.5: Splitting Locks for Performance Reasons</b> discusses splitting locks. As the section says, there are complications. For one thing, lock acquisitions are anything but free, having overheads of hundreds or thousands of instructions at best, which means that adding additional levels of finer-grained locks can actually slow things down. Therefore, Daniel's point about prioritizing architectural restructuring over low-level synchronization changes is an extremely good one, and one that is all too often ignored.<br /><br />As Daniel says, when moving to finer-grained locking, it is necessary to avoid increasing the common-case number of locks being acquired. As one might guess from the tone of this section, this is not necessarily easy. Daniel suggests reader-writer locking, which can work well in some cases, but suffers from performance and scalability limitations, especially in situations involving short read-side critical sections. The fact that readers and writers exclude each other can also result in latency/response-time issues. But again, reader-writer locking can be a good choice in some cases.<br /><br />The last paragraph is an excellent cautionary tale. Take it from Daniel: Never let userspace dictate the order in which the kernel acquires locks. Otherwise, you, too, might find yourself using wait/wound mutexes. Worse yet, if you cannot set up a two-phase locking discipline in which all required locks are acquired before any work is done, you might find yourself writing deadlock-recovery code, or wounded-mutex recovery code, if you prefer. This recovery code can be surprisingly complex. In fact, one of the motivations for the early 1990s introduction of RCU into DYNIX/ptx was that doing so allowed deletion of many thousands of lines of such recovery code, along with all the yet-undiscovered bugs that code contained.<br /><br />The red-highlighted <b>Level 3: Lockless Tricks</b> section begins with the ominous sentence &#8220;Do not go here wanderer!&#8221;<br /><br />And I agree. After all, if you are using the facilities discussed in this section, namely RCU, atomics, <tt>preempt_disable()</tt>, <tt>local_bh_disable()</tt>, <tt>local_irq_save()</tt>, or the various memory barriers, you had jolly well better not be wandering!!!<br /><br />Instead, you need to understand what you are getting into and you need to have a map in the form of a principled design and a careful well-validated implementation. And new situations may require additional maps to be made, although there are quite a few well-used maps already in <tt>Documentation/RCU</tt> and <a href="https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit" target="_blank" rel="nofollow">over here</a>. In addition, as Daniel says, algorithmic and architectural fixes can often provide much better results than can lockless tricks applied at low levels of abstraction. Past experience suggests that some of those algorithmic and architectural fixes will involve lockless tricks on fastpaths, but life is like that sometimes.<br /><br />It is now time to take a look at Daniel's alleged antipatterns.<br /><br />The &#8220;<b>Locking Antipattern: Using RCU</b>&#8221; section does have some &#8220;interesting&#8221; statements:<br /><ol><br /><li> RCU is said to mix up lifetime and consistency concerns. It is not clear exactly what motivated this statement, but this view is often a symptom of a strictly temporal view of RCU. To use RCU effectively, one must instead take a combined spatio-temporal view, as described <a href="https://linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries/" target="_blank" rel="nofollow">here</a> and <a href="https://linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries-additional-use-cases/" target="_blank" rel="nofollow">here</a>.<br /><li> <tt>rcu_read_lock()</tt> is said both to provide a read-side critical section and to extend the lifetime of any RCU-protected object. This is true in common use cases because the whole purpose of an RCU read-side critical section is to ensure that any RCU-protected object that was in existence at any time during that critical section remains in existence up to the end of that critical section. It is not clear what distinction Daniel is attempting to draw here. Perhaps he likes the determinism provided by full mutual exclusion. If so, never forget that the laws of physics dictate that such determinism is often surprisingly expensive.<br /><li> The next paragraph is a surprising assertion that RCU readers' deadlock immunity is a bad thing! Never forget that although RCU can be used to replace reader-writer locking in a great many situations, RCU is not reader-writer locking. Which is a good thing from a performance and scalability viewpoint as well as from a deadlock-immunity viewpoint.<br /><li> In a properly designed system, locks and RCU are not &#8220;papering over&#8221; lifetime issues, but instead properly managing object lifetimes. And if your system is not properly designed, then any and all facilities, concurrent or not, are weapons-grade dangerous. Again, perhaps more maps are needed to help those who might otherwise wander into improper designs. Or perhaps existing maps need to be more consistently used.<br /><li> RCU is said to practically force you to deal with &#8220;zombie objects&#8221;. Twitter discussions with Daniel determined that such &#8220;zombie objects&#8221; can no longer be looked up, but are still in use.
Which means that many use cases of good old reference counting also force you to deal with zombie objects: After all, removing an object from its search structure does not invalidate the references already held on that object. But it is quite possible to avoid zombie objects for both RCU and for reference counting through use of per-object locks, as is done in the Linux-kernel code that maps from a System-V semaphore ID to the corresponding in-kernel data structure, as first described in <a href="https://www2.rdrop.com/~paulmck/RCU/rcu.FREENIX.2003.06.14.pdf" target="_blank" rel="nofollow">this paper</a>. In short, if you don't like zombies, there are simple RCU use cases that avoid them. You get RCU's speed and deadlock immunity within the search structure, but full ordering, consistency, and zombie-freedom within the searched-for object.<br /><li> The last bullet in this section seems to argue that people should upgrade from RCU read-side critical sections to locking or reference counting as quickly as possible. Of course, such locking or reference counting can add problematic atomic-operation and cache-miss overhead to those critical sections, so specific examples would be helpful. One non-GPU example is the System-V semaphore ID example mentioned above, where immediate lock acquisition is necessary to provide the necessary System-V semaphore semantics.<br /><li> On freely using RCU, again, proper design is required, and not just with RCU. One can only sympathize with a driver dying in <tt>synchronize_rcu()</tt>. Presumably Daniel means that one of the driver's tasks hung in <tt>synchronize_rcu()</tt>, in which case there should have been an RCU CPU stall warning message which would point out what CPU or task was stuck in an RCU read-side critical section. It is quite easy to believe that diagnostics could be improved, both within RCU and elsewhere, but if this was intended to be a bug report, it is woefully insufficient.<br /><li> It is good to see that Daniel found at least one RCU use case that he likes (or at least doesn't hate too intensely), namely <tt>xarray</tt> lookups combined with <tt>kref_get_unless_zero()</tt> and <tt>kfree_rcu()</tt> (see the sketch following this list). Perhaps this is a start, especially if the code following that <tt>kref_get_unless_zero()</tt> invocation does additional atomic operations on the <tt>xarray</tt> object, which would hide at least some of the <tt>kref_get_unless_zero()</tt> overhead.<br /></ol>
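<br />Here is a minimal sketch of that approved pattern. The <tt>struct bar</tt>, <tt>bar_xa</tt>, and related names are hypothetical:<br /><br /><blockquote><pre>
#include &lt;linux/kref.h&gt;
#include &lt;linux/rcupdate.h&gt;
#include &lt;linux/xarray.h&gt;

struct bar {
	struct kref ref;
	struct rcu_head rh;
	/* ... */
};

static DEFINE_XARRAY(bar_xa);

static void bar_release(struct kref *kref)
{
	struct bar *b = container_of(kref, struct bar, ref);

	/* Defer the actual free until a grace period has elapsed,
	 * so that concurrent RCU-protected lookups remain safe. */
	kfree_rcu(b, rh);
}

static struct bar *bar_lookup(unsigned long index)
{
	struct bar *b;

	rcu_read_lock();
	b = xa_load(&bar_xa, index);
	if (b && !kref_get_unless_zero(&b->ref))
		b = NULL;	/* lost the race with the last kref_put() */
	rcu_read_unlock();
	return b;	/* caller eventually does kref_put(&b->ref, bar_release) */
}
</pre></blockquote>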
<br />The following table shows how intensively the RCU API is used by various v5.19 kernel subsystems:<br /><br /><blockquote><pre>
Subsystem    Uses         LoC  Uses/KLoC
---------   -----  ----------  ---------
ipc            91       9,822       9.26
virt           68       9,013       7.54
net          7457   1,221,681       6.10
security      599     107,622       5.57
kernel       1796     423,581       4.24
mm            324     170,176       1.90
init            8       4,236       1.89
block         108      65,291       1.65
lib           319     214,291       1.49
fs           1416   1,470,567       0.96
include       836   1,167,274       0.72
drivers      5596  20,861,746       0.27
arch          546   2,189,975       0.25
crypto          6     102,307       0.06
sound          21   1,378,546       0.02
---------   -----  ----------  ---------
Total       19191  29,396,128       0.65
</pre></blockquote>
</pre></blockquote><br />As you can see, the drivers subsystem has the second-highest total number of RCU API uses, but it also has by far the largest number of lines of code. As a result, the RCU usage intensity in the drivers subsystem is quite low, at about 27 RCU uses per 100,000 lines of code. But this view is skewed, as can be seen by looking more deeply within the drivers subsystem, this time leaving out (aside from the Total line) drivers containing fewer than 100 instances of the RCU API:<br /><br /><blockquote><pre>
Subsystem             Uses         LoC  Uses/KLoC
------------------    ----  ----------  ---------
drivers/target         293      62,532       4.69
drivers/block          355      97,265       3.65
drivers/md             334     147,881       2.26
drivers/infiniband     548     434,430       1.26
drivers/net           2607   4,381,955       0.59
drivers/staging        114     618,763       0.18
drivers/scsi           150   1,011,146       0.15
drivers/gpu            399   5,753,571       0.07
------------------    ----  ----------  ---------
Total                 5596  20,861,746       0.27
</pre></blockquote><br />The <tt>drivers/infiniband</tt> and <tt>drivers/net</tt> subtrees account for more than half of the RCU usage in the Linux kernel's drivers, and could be argued to be more about networking than about generic device drivers. And although <tt>drivers/gpu</tt> comes in third in terms of RCU usage, it also comes in first in terms of lines of code, making it one of the least intense users of RCU. So one could argue that Daniel already has his wish, at least within the confines of <tt>drivers/gpu</tt>.<br /><br />This data suggests that Daniel might usefully consult with the networking folks in order to gain valuable guidelines on the use of RCU, and perhaps atomics and memory barriers as well. On the other hand, it is quite possible that such consultations actually caused some of Daniel's frustration. You see, a system implementing networking must track the state of the external network, and it can take many seconds or even minutes for changes in that external state to propagate to that system. Therefore, expensive synchronization within that system is less useful than one might think: No matter how many locks and mutexes that system acquires, it cannot prevent external networking hardware from being reconfigured or even from failing completely.<br /><br />Moving to <tt>drivers/gpu</tt>: in theory, if the GPU hardware initiates state changes without full-system synchronization, networking's RCU usage patterns should apply directly to GPU drivers. In contrast, there might be significant benefits from more tightly synchronizing state changes initiated by the system. Again, perhaps the aforementioned System-V semaphore ID example can help in such cases. But to be fair, given Daniel's preference for immediately acquiring a reference to RCU-protected data, perhaps <tt>drivers/gpu</tt> code is already taking this approach.<br /><br />The &#8220;<b>Locking Antipattern: Atomics</b>&#8221; section is best taken point by point:<br /><ol><br /><li> For good or for ill, the ordering (or lack thereof) of Linux-kernel atomics predates C++ atomics by about a decade. One big advantage of the Linux-kernel approach is that all accesses to <tt>atomic*_t</tt> variables are marked. This is a great improvement over C++, where a sequentially consistent load or store looks just like a normal access to a local variable, which can cause a surprising amount of confusion.<br /><li> Please please please do not &#8220;sprinkle&#8221; memory barriers over the code!!! Instead, actually design the required communication and ordering, and then use the best primitives for the job. For example, instead of &#8220;<tt>smp_mb__before_atomic(); atomic_inc(&myctr); smp_mb__after_atomic();</tt>&#8221;, maybe you should consider invoking <tt>atomic_inc_return()</tt>, discarding the return value if it is not needed.<br /><li> Indeed, some atomic functions operate on non-<tt>atomic*_t</tt> variables. But in many cases, you can use their <tt>atomic*_t</tt> counterparts. For example, instead of <tt>READ_ONCE()</tt>, <tt>atomic_read()</tt>. Instead of <tt>WRITE_ONCE()</tt>, <tt>atomic_set()</tt>. Instead of <tt>cmpxchg()</tt>, <tt>atomic_cmpxchg()</tt>. Instead of <tt>set_bit()</tt>, in many situations, <tt>atomic_or()</tt>.
On the other hand, I will make no attempt to defend the naming of <tt>set_bit()</tt> and <tt>__set_bit()</tt>.<br /></ol><br />To Daniel's discussion of &#8220;unnecessary trap doors&#8221;, I can only agree that reinventing read-write semaphores is a very bad thing.<br /><br />I will also make no attempt to defend ill-thought-out hacks involving weak references or RCU. Sure, a quick fix to get your production system running is all well and good, but the real fix should be properly designed.<br /><br />And Daniel makes an excellent argument when he says that if a counter can be protected by an already-held lock, that counter should be implemented using normal C-language accesses to normal integral variables. For those situations where no such lock is at hand, there are a lot of atomic and per-CPU counting examples that can be followed, both in the Linux kernel and in Chapter 5 of &#8220;<a href="https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-e2.pdf" target="_blank" rel="nofollow">Is Parallel Programming Hard, And, If So, What Can You Do About It?</a>&#8221;. Again, why unnecessarily re-invent the wheel?
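<br /><br />The already-held-lock case deserves a sketch, if only because it is so refreshingly boring. The <tt>foo_lock</tt> and <tt>foo_events</tt> names are hypothetical:<br /><br /><blockquote><pre>
#include &lt;linux/spinlock.h&gt;

static DEFINE_SPINLOCK(foo_lock);
static unsigned long foo_events;	/* plain C, no atomic_t needed */

static void foo_record_event(void)
{
	spin_lock(&foo_lock);
	foo_events++;	/* already fully serialized by foo_lock */
	spin_unlock(&foo_lock);
}
</pre></blockquote>Of course, if <tt>foo_events</tt> were ever to be read without <tt>foo_lock</tt> held, that reader would need <tt>READ_ONCE()</tt>, at which point the design conversation starts over.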
<br /><br />However, the last paragraph, stating that atomic operations should only be used for locking and synchronization primitives in the core kernel, is a bridge too far. After all, a later section allows for memory barriers to be used in libraries (at least driver-hacker-proof libraries), so it seems reasonable that atomic operations can also be used in libraries.<br /><br />Some help is provided by the executable Linux-kernel memory model (LKMM) in <tt>tools/memory-model</tt> along with the kernel concurrency sanitizer (KCSAN), which is documented in <tt>Documentation/dev-tools/kcsan.rst</tt>, but there is no denying that these tools currently require significant expertise. Help notwithstanding, it almost always makes a lot of sense to hide complex operations, including complex operations involving concurrency, behind well-designed APIs.<br /><br />And help is definitely needed, given that there are more than 10,000 invocations of atomic operations in the drivers tree, more than a thousand of which are in <tt>drivers/gpu</tt>.<br /><br />There is not much to say about the &#8220;<b>Locking Antipattern: preempt/local_irq/bh_disable() and Friends</b>&#8221; section. These primitives are not heavily used in the drivers tree. However, lockdep does have enough understanding of these primitives to diagnose misuse of irq-disabled and bh-disabled spinlocks. Which is a good thing, given that there are some thousands of uses of irq-disabled spinlocks in the drivers tree, along with a good thousand uses of bh-disabled spinlocks.<br /><br />The &#8220;<b>Locking Antipattern: Memory Barriers</b>&#8221; section suggests that memory barriers should be packaged in a library or core kernel service, which is in the common case excellent advice. Again, the executable LKMM and KCSAN can help, but again these tools currently require some expertise. I was amused by Daniel's &#8220;I love to read an article or watch a talk by Paul McKenney on RCU like anyone else to get my brain fried properly&#8221;, and I am glad that my articles and talks provide at least a little entertainment value, if nothing else. ;-)<br /><br />Summing up my view of these two blog posts, Daniel recommends that most driver code avoid concurrency entirely. Failing that, he recommends sticking to certain locking and reference-counting use cases, albeit including a rather complex acquire-locks-in-any-order use case. For the most part, he recommends against atomics, RCU, and memory barriers, with a very few exceptions.<br /><br />For me, reading these blog posts induced great nostalgia, taking me back to my early 1990s days at Sequent, when the guidelines were quite similar, give or take a large number of non-atomically manipulated per-CPU counters. But a few short years later, many Sequent engineers were using atomic operations, and yes, a few were even using RCU. Including one RCU use case in a device driver, though that use case could instead be served by the Linux kernel's <tt>synchronize_irq()</tt> primitive.<br /><br />Still, the heavy use of RCU and (even more so) of atomics within the drivers tree, combined with Daniel's distaste for these primitives, suggests that some sort of change might be in order.<br /><br /><h1>Can We Fix This?</h1>Of course we can!!!<br /><br />But will a given fix actually improve the situation? <i>That</i> is the question.<br /><br />Reading through this reminded me that I need to take another pass through the RCU documentation. I have queued a commit to fix the misleading wording for <tt>SLAB_TYPESAFE_BY_RCU</tt> on the -rcu tree: <tt>08f8f09b2a9e ("doc: SLAB_TYPESAFE_BY_RCU uses cannot rely on spinlocks")</tt>. I also expect to improve the documentation of reference counting and its relation to <tt>SLAB_TYPESAFE_BY_RCU</tt>, and will likely find a number of other things in need of improvement.<br /><br />This is also as good a time as any to announce that I will be holding an RCU Office Hours birds-of-a-feather session at the <a href="https://lpc.events/" target="_blank" rel="nofollow">2022 Linux Plumbers Conference</a>, in case that is helpful.<br /><br />However, the RCU documentation must of necessity remain fairly high level. And to that end, GPU-specific advice about use of <tt>xarray</tt>, <tt>kref_get_unless_zero()</tt>, and <tt>kfree_rcu()</tt> really needs to be in <tt>Documentation/gpu</tt> as opposed to <tt>Documentation/RCU</tt>. This would allow that advice to be much more specific and thus much more helpful to the GPU developers and maintainers. Alternatively, perhaps improved GPU-related APIs are required in order to confine concurrency to functions designed for that purpose. This alternative approach has the benefit of allowing GPU device drivers to focus more on GPU-specific issues and less on concurrency. On the other hand, given that the GPU drivers comprise some millions of lines of code, this might be easier said than done.<br /><br />It is all too easy to believe that it is possible to improve the documentation for a number of other facilities that Daniel called on the carpet. At the same time, it is important to remember that the intent of documentation is communication, and that the optimal mode of communication depends on the target audience. At its best, documentation builds a bridge from where the target audience currently is to where they need to go. Which means the broader the target audience, the more difficult it is to construct that bridge. Which in turn means that a given subsystem likely needs usage advice and coding standards specific to that subsystem. One size does not fit all.<br /><br />The <tt>SLAB_TYPESAFE_BY_RCU</tt> facility was called on the carpet, and perhaps understandably so. Would it help if <tt>SLAB_TYPESAFE_BY_RCU</tt> were to be changed so as to allow locks to be acquired on objects that might at any time be passed to <tt>kmem_cache_free()</tt> and then reallocated via <tt>kmem_cache_alloc()</tt>?
In theory, this is easy: Just have the <tt>kmem_cache</tt> in question zero pages allocated from the system before splitting them up into objects. Then a given object could have an &#8220;initialized&#8221; flag, and if that flag was cleared in an object just returned from <tt>kmem_cache_alloc()</tt>, then and only then would that lock be initialized. This would allow a lock to be acquired (under <tt>rcu_read_lock()</tt>, of course) on a freed object, and would allow a lock to be held on an object despite its being passed to <tt>kmem_cache_free()</tt> and returned from <tt>kmem_cache_alloc()</tt> in the meantime.<br /><br />In practice, some existing <tt>SLAB_TYPESAFE_BY_RCU</tt> users might not be happy with the added overhead of page zeroing, so this might require an additional <tt>GFP_</tt> flag to allow zeroing on a <tt>kmem_cache</tt>-by-<tt>kmem_cache</tt> basis. However, the first question is &#8220;Would this really help?&#8221;, and answering that question requires feedback from developers and maintainers who are actually using <tt>SLAB_TYPESAFE_BY_RCU</tt>.<br /><br />Some might argue that the device drivers should all be rewritten in Rust, and cynics might argue that Daniel wrote his pair of blog posts with exactly that thought in mind. I am happy to let those cynics make that argument, especially given that I have already held forth on Linux-kernel concurrency in Rust <a href="https://paulmck.livejournal.com/62436.html" target="_blank">here</a>. However, a possible desire to rust Linux-kernel device drivers does not explain Daniel's distaste for what he calls &#8220;zombie objects&#8221;, because Rust is in fact quite capable of maintaining references to objects that have been removed from their search structure.<br /><br /><h1>Summary and Conclusions</h1>As noted earlier, reading these blog posts induced great nostalgia, taking me back to my time at Sequent in the early 1990s. A lot has happened in the ensuing three decades, including habitual use of locking across the industry, and sometimes even correct use of locking.<br /><br />But will generic developers ever be able to handle more esoteric techniques involving atomic operations and RCU?<br /><br />I believe that the answer to this question is &#8220;yes&#8221;, as laid out in my 2012 paper <a href="https://www2.rdrop.com/~paulmck/scalability/paper/beyondmacho.2012.09.17b.pdf" target="_blank" rel="nofollow">Beyond Expert-Only Parallel Programming?</a>.
As in the past, tooling (including carefully designed APIs), economic forces (including continued ubiquitous multi-core systems), and acculturation (assisted by a vast quantity of open-source software) have done the trick, and I see no reason why these trends will not continue.<br /><br />But what happens in <tt>drivers/gpu</tt> is up to the GPU developers and maintainers!<br /><br /><h1>References</h1><br />Atomic Operations and Memory Barriers, though more description than reference:<ol><br /><li> <tt>Documentation/atomic_t.txt</tt><br /><li> <tt>Documentation/atomic_bitops.txt</tt><br /><li> <tt>Documentation/memory-barriers.txt</tt><br /><li> Sometimes the docbook header is on the x86 <tt>arch_</tt> function, for example, <tt>arch_atomic_inc()</tt> in <tt>arch/x86/include/asm/atomic.h</tt> rather than <tt>atomic_inc()</tt>.<br /><li> The LKMM references below can also be helpful.<br /></ol><br />Kernel Concurrency Sanitizer (KCSAN):<ol><br /><li> <tt>Documentation/dev-tools/kcsan.rst</tt><br /><li> <a href="https://lwn.net/Articles/802128/" target="_blank" rel="nofollow">Finding race conditions with KCSAN</a><br /><li> <a href="https://lwn.net/Articles/816850/" target="_blank" rel="nofollow">Concurrency bugs should fear the big bad data-race detector (part 1)</a><br /><li> <a href="https://lwn.net/Articles/816854/" target="_blank" rel="nofollow">Concurrency bugs should fear the big bad data-race detector (part 2)</a><br /><li> <a href="https://lwn.net/Articles/877200/" target="_blank" rel="nofollow">Detecting missing memory barriers with KCSAN</a><br /></ol><br />Linux-Kernel Memory Model (LKMM):<ol><br /><li> <tt>tools/memory-model</tt>, including its <tt>Documentation</tt> subdirectory.<br /><li> <a href="https://lwn.net/Articles/718628/" target="_blank" rel="nofollow">A formal kernel memory-ordering model (part 1)</a><br /><li> <a href="https://lwn.net/Articles/720550/" target="_blank" rel="nofollow">A formal kernel memory-ordering model (part 2)</a><br /><li> <a href="https://lwn.net/Articles/793253/" target="_blank" rel="nofollow">Who's afraid of a big bad optimizing compiler?</a><br /><li> <a href="https://lwn.net/Articles/799218/" target="_blank" rel="nofollow">Calibrating your fear of big bad optimizing compilers</a><br /><li> <a href="https://dl.acm.org/doi/10.1145/3173162.3177156" target="_blank" rel="nofollow">Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel</a> (<a href="https://diy.inria.fr/linux/" target="_blank" rel="nofollow">non-paywalled extended edition</a>)<br /></ol><br />Read-Copy Update (RCU):<ol><br /><li> <tt>Documentation/RCU</tt><br /><li> <a href="https://lwn.net/Articles/777036/" target="_blank" rel="nofollow">The RCU API, 2019 edition</a><br /><li> <a href="https://linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries/" target="_blank" rel="nofollow">Unraveling RCU-Usage Mysteries (Fundamentals)</a><br /><li> <a href="https://linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries-additional-use-cases/" target="_blank" rel="nofollow">Unraveling RCU-Usage Mysteries (Additional Use Cases)</a><br /><li> Sections 9.5 and 9.6 of &#8220;<a href="https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-e2.pdf" target="_blank" rel="nofollow">Is Parallel Programming Hard, And, If So, What Can You Do About It?</a>&#8221;<br /><li> Many other references, some of which are listed <a href="https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit?usp=sharing" target="_blank" rel="nofollow">here</a>.<br /></ol>
https://paulmck.livejournal.com/67832.html?view=comments#comments
advice
validation
is parallel programming hard
linux bugs
stupid smp tricks
locking
linux
bugs
public
1
-
https://paulmck.livejournal.com/67547.html
Sun, 29 May 2022 04:08:25 GMT
Stupid RCU Tricks: How Read-Intensive is The Kernel's Use of RCU?
paulmck
https://paulmck.livejournal.com/67547.html
RCU is a specialized synchronization mechanism, and is typically used where there are far more readers (<tt>rcu_read_lock()</tt>, <tt>rcu_read_unlock()</tt>, <tt>rcu_dereference()</tt>, and so on) than there are updaters (<tt>synchronize_rcu()</tt>, <tt>call_rcu()</tt>, <tt>rcu_assign_pointer()</tt>, and so on). But does the Linux kernel really make heavier use of RCU's read-side primitives than of its update-side primitives?<br /><br />One way to determine this would be to use something like ftrace to record all the calls to these functions. This works, but trace messages can be lost, especially when applied to frequently invoked functions. Also, dumping out the trace buffer can perturb the system. Another approach is to modify the kernel source code to count these function invocations in a cache-friendly manner, then come up with some way to dump this to userspace. This works, but I am lazy. Yet another approach is to ask the tracing folks for advice.<br /><br />This last is what I actually did, and because the tracing person I happened to ask happened to be Andrii Nakryiko, I learned quite a bit about BPF in general and the <tt>bpftrace</tt> command in particular. If you don't happen to have Andrii on hand, you can do quite well with Appendix A and Appendix B of Brendan Gregg's &#8220;BPF Performance Tools&#8221;. You will of course need to install <tt>bpftrace</tt> itself, which is reasonably straightforward on many Linux distributions.<br /><br /><h2>Linux-Kernel RCU Read Intensity</h2>Those of you who have used <tt>sed</tt> and <tt>awk</tt> have a bit of a running start because you can invoke <tt>bpftrace</tt> with a <tt>-e</tt> argument and a series of tracepoint/program pairs, where a program is <tt>bpftrace</tt> code enclosed in curly braces. This code is compiled, verified, and loaded into the running kernel. When the code finishes executing, the results are printed right there for you on <tt>stdout</tt>. For example:<br /><br /><blockquote><pre>
bpftrace -e 'kprobe:__rcu_read_lock { @rcu_reader = count(); } kprobe:rcu_gp_fqs_loop { @gp = count(); } interval:s:10 { exit(); }'
</pre></blockquote><br />This command uses the <tt>kprobe</tt> facility to attach a program to the <tt>__rcu_read_lock()</tt> function and to attach a very similar program to the <tt>rcu_gp_fqs_loop()</tt> function, which happens to be invoked exactly once per RCU grace period. Both programs count the number of calls, with <tt>@gp</tt> being the <tt>bpftrace</tt> “variable” accumulating the count, and the <tt>count()</tt> function doing the counting in a cache-friendly manner. The final <tt>interval:s:10</tt> in effect attaches a program to a timer, so that this last program will execute every 10 seconds (“<tt>s:10</tt>”). Except that the program invokes the <tt>exit()</tt> function that terminates this <tt>bpftrace</tt> program at the end of the very first 10-second time interval. Upon termination, <tt>bpftrace</tt> outputs the following on an idle system:<br /><br /><blockquote><pre>
Attaching 3 probes...
@gp: 977
@rcu_reader: 6435368
</pre></blockquote><br />In other words, there were about a thousand grace periods and more than six million RCU readers during that 10-second time period, for a read-to-grace-period ratio of more than six thousand. This certainly qualifies as read-intensive.<br /><br />But what if the system is busy? Much depends on exactly how busy the system is, as well as exactly how it is busy, but let's use that old standby, the kernel build (but using the <tt>nice</tt> command to avoid delaying <tt>bpftrace</tt>). Let's also put the <tt>bpftrace</tt> script into a creatively named file <tt>rcu1.bpf</tt> like so:<br /><br /><blockquote><pre>
kprobe:__rcu_read_lock
{
@rcu_reader = count();
}
kprobe:rcu_gp_fqs_loop
{
@gp = count();
}
interval:s:10
{
exit();
}
</pre></blockquote><br />This allows the command <tt>bpftrace rcu1.bpf</tt> to produce the following output:<br /><br /><blockquote><pre>
Attaching 3 probes...
@gp: 274
@rcu_reader: 78211260
</pre></blockquote><br />Where the idle system had about one thousand grace periods over the course of ten seconds, the busy system had only 274. On the other hand, the busy system had 78 million RCU read-side critical sections, more than ten times that of the idle system. The busy system had more than one quarter million RCU read-side critical sections per grace period, which is seriously read-intensive.<br /><br />RCU works hard to make the same grace-period computation cover multiple requests. Because <tt>synchronize_rcu()</tt> invokes <tt>call_rcu()</tt>, we can use the number of <tt>call_rcu()</tt> invocations as a rough proxy for the number of updates, that is, the number of requests for a grace period. (The more invocations of <tt>synchronize_rcu_expedited()</tt> and <tt>kfree_rcu()</tt>, the rougher this proxy will be.)<br /><br />We can make the <tt>bpftrace</tt> script more concise by assigning the same action to a group of tracepoints, as in the <tt>rcu2.bpf</tt> file shown here:<br /><br /><blockquote><pre>
kprobe:__rcu_read_lock,
kprobe:call_rcu,
kprobe:rcu_gp_fqs_loop
{
@[func] = count();
}
interval:s:10
{
exit();
}
</pre></blockquote><br />With this file in place, <tt>bpftrace rcu2.bpf</tt> produces the following output in the midst of a kernel build:<br /><br /><blockquote><pre>
Attaching 4 probes...
@[rcu_gp_fqs_loop]: 128
@[call_rcu]: 195721
@[__rcu_read_lock]: 21985946
</pre></blockquote><br />These results look quite different from the earlier kernel-build results, confirming any suspicions you might harbor about the suitability of kernel builds as a repeatable benchmark. Nevertheless, there are about 170K RCU read-side critical sections per grace period, which is still seriously read-intensive. Furthermore, there are also roughly 1.5K <tt>call_rcu()</tt> invocations per RCU grace period, which means that RCU is able to amortize the overhead of a given grace period down to almost nothing per grace-period request.<br /><br /><h2>Linux-Kernel RCU Grace-Period Latency</h2>The following <tt>bpftrace</tt> program makes a histogram of grace-period latencies, that is, the time from the call to <tt>rcu_gp_init()</tt> to the return from <tt>rcu_gp_cleanup()</tt>:<br /><br /><blockquote><pre>
kprobe:rcu_gp_init {
@start = nsecs;
}
kretprobe:rcu_gp_cleanup {
if (@start) {
@gplat = hist((nsecs - @start)/1000000);
}
}
interval:s:10 {
printf("Internal grace-period latency, milliseconds:\n");
exit();
}
</pre></blockquote><br />The <tt>kretprobe</tt> attaches the program to the return from <tt>rcu_gp_cleanup()</tt>. The <tt>hist()</tt> function computes a log-scale histogram. The check of the <tt>@start</tt> variable avoids a beginning-of-time value for this variable in the common case where this script starts in the middle of a grace period. (Try it without that check!)<br /><br />The output is as follows:<br /><blockquote><pre>
Attaching 3 probes...
Internal grace-period latency, milliseconds:
@gplat:
[2, 4) 259 |@@@@@@@@@@@@@@@@@@@@@@ |
[4, 8) 591 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16) 137 |@@@@@@@@@@@@ |
[16, 32) 3 | |
[32, 64) 5 | |
@start: 95694642573968
</pre></blockquote><br />Most of the grace periods complete in between four and eight milliseconds, with most of the remainder completing in between two and four milliseconds or between eight and sixteen milliseconds, but with a few stragglers taking up to 64 milliseconds. The final <tt>@start</tt> line shows that <tt>bpftrace</tt> simply dumps out all the variables. You can use the <tt>delete(@start)</tt> function to prevent printing of <tt>@start</tt>, but please note that the next invocation of <tt>rcu_gp_init()</tt> will re-create it.<br /><br />It is nice to know the internal latency of an RCU grace period, but most in-kernel users will be more concerned about the latency of the <tt>synchronize_rcu()</tt> function, which will need to wait for the current grace period to complete and also for callback invocation. We can measure this function's latency with the following <tt>bpftrace</tt> script:<br /><br /><blockquote><pre>
kprobe:synchronize_rcu {
@start[tid] = nsecs;
}
kretprobe:synchronize_rcu {
if (@start[tid]) {
@srlat = hist((nsecs - @start[tid])/1000000);
delete(@start[tid]);
}
}
interval:s:10 {
printf("synchronize_rcu() latency, milliseconds:\n");
exit();
}
</pre></blockquote><br />The <tt>tid</tt> variable contains the ID of the currently running task, which allows this script to associate a given return from <tt>synchronize_rcu()</tt> with the corresponding call by using <tt>tid</tt> as an index to the <tt>@start</tt> variable.<br /><br />As you would expect, the resulting histogram is weighted towards somewhat longer latencies, though without the stragglers:<br /><br /><blockquote><pre>
Attaching 3 probes...
synchronize_rcu() latency, milliseconds:
@srlat:
[4, 8) 9 |@@@@@@@@@@@@@@@ |
[8, 16) 31 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 31 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
@start[4075307]: 96560784497352
</pre></blockquote><br />In addition, we again see a leftover <tt>@start</tt> entry. The <tt>delete</tt> statement gets rid of old entries, but any new call to <tt>synchronize_rcu()</tt> will create more of them.<br /><br /><h2>Linux-Kernel Expedited RCU Grace-Period Latency</h2>Linux kernels will sometimes execute <tt>synchronize_rcu_expedited()</tt> to obtain a faster grace period, and the following command will further cause <tt>synchronize_rcu()</tt> to act like <tt>synchronize_rcu_expedited()</tt>:<br /><br /><blockquote><pre>
echo 1 > /sys/kernel/rcu_expedited
</pre></blockquote><br />Doing this on a dual-socket system with 80 hardware threads might be ill-advised, but you only live once!<br /><br />Ill-advised or not, the following <tt>bpftrace</tt> script measures <tt>synchronize_rcu_expedited()</tt> latency, but in microseconds rather than milliseconds:<br /><br /><blockquote><pre>
kprobe:synchronize_rcu_expedited {
@start[tid] = nsecs;
}
kretprobe:synchronize_rcu_expedited {
if (@start[tid]) {
@srelat = hist((nsecs - @start[tid])/1000);
delete(@start[tid]);
}
}
interval:s:10 {
printf("synchronize_rcu() latency, microseconds:\n");
exit();
}
</pre></blockquote><br />The output of this script run concurrently with a kernel build is as follows:<br /><br /><blockquote><pre>
Attaching 3 probes...
synchronize_rcu() latency, microseconds:
@srelat:
[128, 256) 57 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512) 14 |@@@@@@@@@@@@ |
[512, 1K) 1 | |
[1K, 2K) 2 |@ |
[2K, 4K) 7 |@@@@@@ |
[4K, 8K) 2 |@ |
[8K, 16K) 3 |@@ |
@start[4140285]: 97489845318700
</pre></blockquote><br />Most <tt>synchronize_rcu_expedited()</tt> invocations complete within a few hundred microseconds, but with a few stragglers around ten milliseconds.<br /><br />But what about linear histograms? This is what the <tt>lhist()</tt> function is for, with added minimum, maximum, and bucket-size arguments:<br /><br /><blockquote><pre>
kprobe:synchronize_rcu_expedited {
        @start[tid] = nsecs;
}
kretprobe:synchronize_rcu_expedited {
        if (@start[tid]) {
                @srelat = lhist((nsecs - @start[tid])/1000, 0, 1000, 100);
                delete(@start[tid]);
        }
}
interval:s:10 {
        printf("synchronize_rcu_expedited() latency, microseconds:\n");
        exit();
}
</pre></blockquote><br />Running this with the usual kernel build in the background:<br /><br /><blockquote><pre>
Attaching 3 probes...
synchronize_rcu_expedited() latency, microseconds:
@srelat:
[100, 200) 26 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[200, 300) 13 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[300, 400) 5 |@@@@@@@@@@ |
[400, 500) 1 |@@ |
[500, 600) 0 | |
[600, 700) 2 |@@@@ |
[700, 800) 0 | |
[800, 900) 1 |@@ |
[900, 1000) 1 |@@ |
[1000, ...) 18 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
@start[4184562]: 98032023641157
</pre></blockquote><br />The final bucket is overflow, containing measurements that exceeded the one-millisecond limit.<br /><br />The above histogram had only a few empty buckets, but that is mostly because the 18 <tt>synchronize_rcu_expedited()</tt> instances that overflowed the one-millisecond limit are consolidated into a single <tt>[1000, ...)</tt> overflow bucket. This is sometimes what is needed, but other times losing the maximum latency can be a problem. This can be dealt with by using the following <tt>bpftrace</tt> program:<br /><br /><blockquote><pre>
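// No upper bound this time: round each latency down to the nearest
// 100 microseconds and count the number of samples in each such bucket.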
kprobe:synchronize_rcu_expedited {
        @start[tid] = nsecs;
}
kretprobe:synchronize_rcu_expedited {
        if (@start[tid]) {
                @srelat[(nsecs - @start[tid])/100000*100] = count();
                delete(@start[tid]);
        }
}
interval:s:10 {
        printf("synchronize_rcu_expedited() latency, microseconds:\n");
        exit();
}
</pre></blockquote><br />Given the usual kernel-build background load, this produces the following output:<br /><br /><blockquote><pre>
Attaching 3 probes...
synchronize_rcu_expedited() latency, microseconds:
@srelat[1600]: 1
@srelat[500]: 1
@srelat[1000]: 1
@srelat[700]: 1
@srelat[1100]: 1
@srelat[2300]: 1
@srelat[300]: 1
@srelat[400]: 2
@srelat[600]: 3
@srelat[200]: 4
@srelat[100]: 20
@start[763214]: 17487881311831
</pre></blockquote><br />This is a bit hard to read, but simple scripting can be applied to this output to produce something like this:<br /><br /><blockquote><pre>
100: 20
200: 4
300: 1
400: 2
500: 1
600: 3
700: 1
1000: 1
1100: 1
1600: 1
</pre></blockquote><br />This produces compact output despite outliers such as the last entry, corresponding to an invocation that took somewhere between 1.6 and 1.7 milliseconds.<br /><br /><h2>Summary</h2>The <tt>bpftrace</tt> command can be used to quickly and easily script compiled in-kernel programs that can measure and monitor a wide variety of things. This post focused on a few aspects of RCU, but quite a bit more material may be found in Brendan Gregg's “BPF Performance Tools” book.
https://paulmck.livejournal.com/67547.html?view=comments#comments
performance
linux
stupid rcu tricks
public
2
-
https://paulmck.livejournal.com/67073.html
Fri, 27 May 2022 00:43:33 GMT
Stupid RCU Tricks: Is RCU Watching?
paulmck
https://paulmck.livejournal.com/67073.html
It is just as easy to ask why RCU wouldn't be watching all the time. After all, you never know when you might need to synchronize!<br /><br />Unfortunately, an eternally watchful RCU is impractical in the Linux kernel due to energy-efficiency considerations. The problem is that if RCU watches an idle CPU, RCU needs that CPU to execute instructions. And making an idle CPU unnecessarily execute instructions (for a rather broad definition of the word “unnecessarily”) will terminally annoy a great many people in the battery-powered embedded world. And for good reason: Making RCU avoid watching idle CPUs can provide 30-40% increases in battery lifetime.<br /><br />In this, CPUs are not all that different from people. Interrupting someone who is deep in thought can cause them to lose 20 minutes of work. Similarly, when a CPU is deeply idle, asking it to execute instructions will consume not only the energy required for those instructions, but also much more energy to work its way out of that deep idle state, and then to return back to that deep idle state.<br /><br />And this is why CPUs must tell RCU to stop watching them when they go idle. This allows RCU to ignore them completely, in particular, to refrain from asking them to execute instructions.<br /><br />In some kernel configurations, RCU also ignores portions of the kernel's entry/exit code, that is, the last bits of kernel code before switching to userspace and the first bits of kernel code after switching away from userspace. This happens only in kernels built with <a href="https://lwn.net/Articles/549580/" target="_blank" rel="nofollow"><tt>CONFIG_NO_HZ_FULL=y</tt></a>, and even then only on CPUs mentioned in the CPU list passed to the <tt>nohz_full</tt> kernel parameter. This enables carefully configured HPC applications and CPU-bound real-time applications to get near-bare-metal performance from such CPUs, while still having the entire Linux kernel at their beck and call. Because RCU is not watching such applications, the scheduling-clock interrupt can be turned off entirely, thus avoiding disturbing such performance-critical applications.<br /><br />But if RCU is not watching a given CPU, <tt>rcu_read_lock()</tt> has no effect on that CPU, which can come as a nasty shock to the corresponding RCU read-side critical section, which naively expected to be able to safely traverse an RCU-protected data structure. This can be a trap for the unwary, which is why kernels built with <tt>CONFIG_PROVE_LOCKING=y</tt> (lockdep) complain bitterly when <tt>rcu_read_lock()</tt> is invoked on CPUs that RCU is not watching.<br /><br />But suppose that you have code using RCU that is invoked both from deep within the idle loop and from normal tasks.<br /><br />Back in the day, this was not much of a problem. True to its name, the idle loop was not much more than a loop, and the deep architecture-specific code on the kernel entry/exit paths had no need of RCU. This has changed, especially with the advent of idle drivers and governors, to say nothing of tracing. So what can you do?<br /><br />First, you can invoke <tt>rcu_is_watching()</tt>, which, as its name suggests, will return <tt>true</tt> if RCU is watching. And, as you might expect, lockdep uses this function to figure out when it should complain bitterly. The following example code lays out the current possibilities:<br /><br /><blockquote><pre>
if (rcu_is_watching())
        printk("Invoked from normal or idle task with RCU watching.\n");
else if (is_idle_task(current))
        printk("Invoked from deep within the idle task where RCU is not watching.\n");
else
        printk("Invoked from nohz_full entry/exit code where RCU is not watching.\n");
</pre></blockquote><br />Except that even invoking <tt>printk()</tt> is an iffy proposition while RCU is not watching.<br /><br />So suppose that you invoke <tt>rcu_is_watching()</tt> and it helpfully returns <tt>false</tt>, indicating that you cannot invoke <tt>rcu_read_lock()</tt> and friends. What now?<br /><br />You could do what the v5.18 Linux kernel's <tt>kernel_text_address()</tt> function does, which can be abbreviated as follows:<br /><br /><blockquote><pre>
no_rcu = !rcu_is_watching();
if (no_rcu)
        rcu_nmi_enter(); // Make RCU watch!!!
do_rcu_traversals();
if (no_rcu)
        rcu_nmi_exit(); // Return RCU to its prior watchfulness state.
</pre></blockquote><br />If your code is not so performance-critical, you can do what the arm64 implementation of the <tt>cpu_suspend()</tt> function does:<br /><br /><blockquote><pre>
RCU_NONIDLE(__cpu_suspend_exit());
</pre></blockquote><br />This macro forces RCU to watch while it executes its argument as follows:<br /><br /><blockquote><pre>
#define RCU_NONIDLE(a) \
        do { \
                rcu_irq_enter_irqson(); \
                do { a; } while (0); \
                rcu_irq_exit_irqson(); \
        } while (0)
</pre></blockquote><br />The <tt>rcu_irq_enter_irqson()</tt> and <tt>rcu_irq_exit_irqson()</tt> functions are essentially wrappers around the aforementioned <tt>rcu_nmi_enter()</tt> and <tt>rcu_nmi_exit()</tt> functions.<br /><br />Although <tt>RCU_NONIDLE()</tt> is more compact than the <tt>kernel_text_address()</tt> approach, it is still annoying to have to pass your code to a macro. And this is why Peter Zijlstra has been reworking the various idle loops to cause RCU to be watching a much greater fraction of their code. This might well be an ongoing process as the idle loops continue gaining functionality, but Peter's good work thus far at least makes RCU watch the idle governors and a much larger fraction of the idle loop's trace events. When combined with the kernel entry/exit work by Peter, Thomas Gleixner, Mark Rutland, and many others, it is hoped that the functions not watched by RCU will all eventually be decorated with something like <tt>noinstr</tt>, for example:<br /><br /><blockquote><pre>
static noinline noinstr unsigned long rcu_dynticks_inc(int incby)
{
        return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
}
</pre></blockquote><br />We don't need to worry about exactly what this function does. For this blog entry, it is enough to know that its <tt>noinstr</tt> tag prevents tracing this function, making it less problematic for RCU to not be watching it.<br /><br />What exactly are you prohibited from doing while RCU is not watching your code?<br /><br />As noted before, RCU readers are a no-go. If you try invoking <tt>rcu_read_lock()</tt>, <tt>rcu_read_unlock()</tt>, <tt>rcu_read_lock_bh()</tt>, <tt>rcu_read_unlock_bh()</tt>, <tt>rcu_read_lock_sched()</tt>, or <tt>rcu_read_unlock_sched()</tt> from regions of code where <tt>rcu_is_watching()</tt> would return <tt>false</tt>, lockdep will complain.<br /><br />On the other hand, using SRCU (<tt>srcu_read_lock()</tt> and <tt>srcu_read_unlock()</tt>) is just fine, as is RCU Tasks Trace (<tt>rcu_read_lock_trace()</tt> and <tt>rcu_read_unlock_trace()</tt>). RCU Tasks Rude does not have explicit read-side markers, but anything that disables preemption acts as an RCU Tasks Rude reader no matter what <tt>rcu_is_watching()</tt> would return at the time.<br /><br />RCU Tasks is an odd special case. Like RCU Tasks Rude, RCU Tasks has implicit read-side markers, which are any region of non-idle-task kernel code that does not do a voluntary context switch (the idle tasks are instead handled by RCU Tasks Rude). Except that in kernels built with <tt>CONFIG_PREEMPTION=n</tt> and without any of RCU's test suite, the RCU Tasks API maps to plain old RCU. This means that code not watched by RCU is ignored by the remapped RCU Tasks in such kernels. Given that RCU Tasks ignores the idle tasks, this affects only user entry/exit code in kernels built with <tt>CONFIG_NO_HZ_FULL=y</tt>, and even then, only on CPUs mentioned in the list given to the <tt>nohz_full</tt> kernel boot parameter. However, this situation can nevertheless be a trap for the unwary.<br /><br />Therefore, in post-v5.18 mainline, you can build your kernel with <tt>CONFIG_FORCE_TASKS_RCU=y</tt>, in which case RCU Tasks will always be built into your kernel, avoiding this trap.<br /><br />In summary, energy-efficiency, battery-lifetime, and application-performance/latency concerns force RCU to avert its gaze from idle CPUs, and, in kernels built with <tt>CONFIG_NO_HZ_FULL=y</tt>, also from <tt>nohz_full</tt> CPUs on the low-level kernel entry/exit code paths. Fortunately, recent changes have allowed RCU to watch more code, but this being the kernel, corner cases will always be with us. This corner-case code from which RCU must avert its gaze requires the special handling described in this blog post.
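To make the reader-API distinction concrete, here is a minimal C sketch, not from the original post, in which <tt>struct my_data</tt>, <tt>my_srcu</tt>, and <tt>read_my_data()</tt> are hypothetical names invented for illustration. It shows an SRCU reader being used from code that might run where RCU is not watching:<br /><br /><blockquote><pre>
#include <linux/srcu.h>

struct my_data {
        int val;
};

DEFINE_STATIC_SRCU(my_srcu);

/* Sketch only: an SRCU reader is legal even where rcu_is_watching()
 * would return false, because SRCU does not need RCU to watch the CPU. */
static int read_my_data(struct my_data __rcu *p)
{
        int idx;
        int val;

        idx = srcu_read_lock(&my_srcu);
        val = srcu_dereference(p, &my_srcu)->val;
        srcu_read_unlock(&my_srcu, idx);
        return val;
}
</pre></blockquote><br />Of course, the price of this pattern is that the corresponding updaters must use <tt>synchronize_srcu()</tt> or <tt>call_srcu()</tt> rather than their RCU counterparts.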
https://paulmck.livejournal.com/67073.html?view=comments#comments
stupid rcu tricks
public
4
-
https://paulmck.livejournal.com/66892.html
Fri, 24 Dec 2021 20:22:09 GMT
Parallel Programming: December 2021 Update
paulmck
https://paulmck.livejournal.com/66892.html
It is past time for another release of <a href="https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html" target="_blank" rel="nofollow">Is Parallel Programming Hard, And, If So, What Can You Do About It?</a>. But first, what is the difference between an <a href="https://paulmck.livejournal.com/60291.html" target="_blank">edition</a> and a release?<br /><br />The main difference is the level of validation. For example, during the several months leading up to the second edition, I read the entire book, fixing issues as I found them. So where an edition is a completed work, a release is primarily for the benefit of people who would like to see a recent snapshot of this book, but without having to <a href="https://github.com/paulmckrcu/perfbook/blob/master/FAQ-BUILD.txt" target="_blank" rel="nofollow">install and run LaTeX and its dependencies</a>.<br /><br />Having cleared that up, here are the big-animal changes since the <a href="https://paulmck.livejournal.com/60291.html" target="_blank">second edition</a>:<br /><ol><br /><li> A lot of detailed work to make the new <a href="https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-eb.2021.12.22a.pdf" target="_blank" rel="nofollow">ebook-sized PDF</a> usable, along with other formatting improvements, courtesy of Akira Yokosawa, SeongJae Park, and Balbir Singh.<br /><li> The <tt>nq</tt> build now places quick quizzes at the end of each chapter, courtesy of Akira Yokosawa.<br /><li> There are a few new sections in the “Locking” chapter, the “Putting It All Together” chapter, the “Advanced Synchronization: Memory Ordering” chapter, and the “Important Questions” appendix.<br /><li> There is now more validation information associated with much of the sample code throughout the book.<br /><li> The “RCU Usage” section has received a much-needed upgrade.<br /><li> There is now an index and a list of acronyms, courtesy of Akira Yokosawa.<br /><li> A new <tt>install_latex_package.sh</tt> script, courtesy of SeongJae Park.<br /><li> Greatly improved error handling, including scripts that check for common LaTeX formatting mistakes, courtesy of Akira Yokosawa.<br /></ol><br /><br />Yang Lu and Zhouyi Zhou are translating the Second Edition to Chinese, and would greatly appreciate additional help.<br /><br />Everyone mentioned above contributed a great many wordsmithing fixes, as did Chin En Lin, Elad Lahav, Zhouyi Zhou, and GitHub user “rootbeer”. A grateful “thank you” to everyone who contributed!
https://paulmck.livejournal.com/66892.html?view=comments#comments
is parallel programming hard
perfbook
public
1
-
https://paulmck.livejournal.com/66807.html
Mon, 20 Dec 2021 22:14:32 GMT
Stupid RCU Tricks: Removing CONFIG_RCU_FAST_NO_HZ
paulmck
https://paulmck.livejournal.com/66807.html
The <tt>CONFIG_RCU_FAST_NO_HZ</tt> Kconfig option was added many years ago to improve energy efficiency for systems having significant numbers of short bursts of idle time. Prior to the addition of <tt>CONFIG_RCU_FAST_NO_HZ</tt>, RCU would insist on keeping a given idle CPU's scheduling-clock tick enabled until all of that CPU's RCU callbacks had been invoked. On certain types of battery-powered embedded systems, these few additional scheduling-clock ticks would consume up to 40% of the battery lifetime. The people working on such systems were not amused, and were not shy about letting me know of their dissatisfaction with RCU's life choices. Please note that “letting me know” did not take the form of flaming me on LKML. Instead, they called me on the telephone and yelled at me.<br /><br />Given that history, why on earth would I even be thinking about removing <tt>CONFIG_RCU_FAST_NO_HZ</tt>, let alone <a href="https://lore.kernel.org/all/20211202001838.GA3126627@paulmck-ThinkPad-P17-Gen-1/" target="_blank" rel="nofollow">queuing a patch series intended for the v5.17 merge window</a>???<br /><br />The reason is that everyone I know of who builds their kernels with <tt>CONFIG_RCU_FAST_NO_HZ=y</tt> also boots those systems with each and every CPU designated as a <tt>rcu_nocbs</tt> CPU. With this combination, <tt>CONFIG_RCU_FAST_NO_HZ=y</tt> is doing nothing but placing a never-taken branch in the fastpath to and from idle. Such systems should therefore run slightly faster and with slightly better battery lifetime if their kernels were instead built with <tt>CONFIG_RCU_FAST_NO_HZ=n</tt>, which would get rid of that never-taken branch.<br /><br />But given that battery-powered embedded folks badly wanted <tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, and given that they are no longer getting any benefit from it, why on earth haven't they noticed?<br /><br />They have not noticed because <tt>rcu_nocbs</tt> CPUs do not invoke their own RCU callbacks. This work is instead delegated to a set of per-CPU <tt>rcuoc</tt> kthreads, with a smaller set of <tt>rcuog</tt> kthreads managing those callbacks and requesting grace periods as needed. By default, these <tt>rcuoc</tt> and <tt>rcuog</tt> kthreads are not bound, which allows both the scheduler (and for that matter, the systems administrator) to take both performance and energy efficiency into account and to run those kthreads wherever is appropriate at any given time. In contrast, non-<tt>rcu_nocbs</tt> CPUs will always run their own callbacks, even if that means powering up an inconveniently placed portion of the system at an inconvenient time. This includes <tt>CONFIG_RCU_FAST_NO_HZ=y</tt> kernels, whose only advantage is that they power up inconveniently placed portions of systems at inconvenient times only 25% as often as would a non-<tt>rcu_nocbs</tt> CPU in a <tt>CONFIG_RCU_FAST_NO_HZ=n</tt> kernel.<br /><br />In short, the <tt>rcu_nocbs</tt> CPUs' practice of letting the scheduler decide where to run the callbacks is especially helpful on asymmetric systems (AKA big.LITTLE systems), as shown by <a href="https://pdfs.semanticscholar.org/7205/b808c0ddab0b017f4d89fd066448d029cdcc.pdf" target="_blank" rel="nofollow">data collected by Dietmar Eggeman and Robin Randhawa</a>. 
This point is emphasized by the aforementioned fact that everyone I know of who builds their kernels with <tt>CONFIG_RCU_FAST_NO_HZ=y</tt> also boots those systems with each and every CPU designated as a <tt>rcu_nocbs</tt> CPU.<br /><br />So if no one is getting any benefit from building their kernels with <tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, why keep that added complexity in the Linux kernel? Why indeed, and hence the <a href="https://lore.kernel.org/all/20211202001838.GA3126627@paulmck-ThinkPad-P17-Gen-1/" target="_blank" rel="nofollow">patch series intended for the v5.17 merge window</a>.<br /><br />So if you know of someone who is getting significant benefit from <tt>CONFIG_RCU_FAST_NO_HZ=y</tt> who could not get that benefit from booting with <tt>rcu_nocbs</tt> CPUs, this would be a most excellent time to let me know!
https://paulmck.livejournal.com/66807.html?view=comments#comments
rcu
embedded
smartphone
linux
stupid rcu tricks
public
0
-
https://paulmck.livejournal.com/66371.html
Fri, 03 Dec 2021 01:46:47 GMT
Stupid RCU Tricks: Creating Branches For the -rcu Tree
paulmck
https://paulmck.livejournal.com/66371.html
Several people have expressed interest in how I go about creating the topic branches in the <tt>-rcu</tt> tree. So when I created branches earlier this week, I actually kept track.<br /><br />But why bother with topic branches? In my case, the reason is to allow anyone reviewing RCU patches to focus on a particular topic and to be able to keep that topic “in cache” while reviewing all of that topic's patches. Even more important, it makes it easy for reviewers to completely ignore patches that are not of interest to them.<br /><br /><h2>Preparation</h2>If you intend to follow along by actually executing the commands (highly recommended), first run “<tt>git checkout dev.2021.11.30b</tt>” in <a href="https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html" target="_blank" rel="nofollow">your clone</a> of the <tt>-rcu</tt> tree.<br /><br />Before creating branches, you need some essential tools. First, you absolutely must be able to see what you are doing. For this, there is the <tt>gitk</tt> tool, or, for those forswearing GUIs, the <tt>tig</tt> tool. If you cannot see what you are doing, your picture of your repository will diverge from that of <tt>git</tt>, and <tt>git</tt> always wins. If you can tolerate GUIs at all, install <tt>gitk</tt>. It will change your life.<br /><br />Once you have <tt>gitk</tt> or <tt>tig</tt> installed, use it to review the commits that you are looking to organize into branches. This allows checking for bugs and also helps identify potential topics. Pro tip: Expand the tool's window to the full height of the screen. Another pro tip: Restrict <tt>gitk</tt> to the commits of interest, in my case by running “<tt>gitk v5.16-rc1..&</tt>”.<br /><br />From here on out, I will just say “gitk”. If you are smart enough to use a non-GUI tool like <tt>tig</tt>, you are smart enough to translate. ;–)<br /><br /><h2>Sorting Commits Into Topic Branches</h2>The next step is to run this script, placing the output in a file:<br /><br /><blockquote><pre>
#!/bin/bash
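# List the commits newest-first (as "git log" emits them), number them in
# that order, then reverse the numeric sort so that the oldest commit comes
# out first, which is the order that "git rebase -i" expects.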
git log --pretty=format:"%h %s" $1 | \
        awk '{ print NR, "pick " $0 }' | \
        sort -k1nr
</pre></blockquote><br />I have this script in my <tt>bin</tt> directory under the name <tt>git-rebase-prep.sh</tt>, so I simply type:<br /><br /><blockquote><pre>
git-rebase-prep.sh v5.16-rc1..dev.2021.11.30b > /tmp/rebase.txt
</pre></blockquote><br />The first five lines of this file are as follows:<br /><br /><blockquote><pre>
126 pick 12637ec9a1505 tools/memory-model: Document locking corner cases
125 pick 5665cde49ec08 tools/memory-model: Make judgelitmus.sh note timeouts
124 pick 2e8007e79af5b tools/memory-model: Make cmplitmushist.sh note timeouts
123 pick 7a31810649915 tools/memory-model: Make judgelitmus.sh identify bad macros
122 pick 4e11469f67f2e tools/memory-model: Make judgelitmus.sh detect hard deadlocks
</pre></blockquote><br />The first column is the commit number, with lower numbers meaning newer commits (that is, numbered in “<tt>git log</tt>” order), but the commits are ordered with the older commits first (that is, in the order that “<tt>git rebase</tt>” expects them). The second column is the commit's SHA-1 ID, and the remaining columns are the commit-log subject line.<br /><br />Edit this file in your choice of editor, but in a window that is the full height of your monitor. Adjust your editor so that you have access to two different regions of this file, for example, in <tt>vim</tt>, type control-W followed by the letter “s”. The upper pane will eventually have topic-branch names each followed by the commits in that topic branch.<br /><br />This time was a bit of a special case because this past merge window's change to the <tt>CONFIG_PREEMPT_DYNAMIC</tt> Kconfig option broke all non-preemptible <tt>CONFIG_SMP=n</tt> torture-test scenarios. (The default choice of <tt>CONFIG_PREEMPT_DYNAMIC=y</tt> forces <tt>CONFIG_PREEMPTION=y</tt>, which breaks tests of Tiny RCU and Tiny SRCU.) Therefore, the first “topic branch” is a single commit that adds <tt>CONFIG_PREEMPT_DYNAMIC=n</tt> to all affected torture-test scenarios. All RCU-related topic branches are then placed on top of this commit, thus allowing each topic branch to be torture-tested separately.<br /><br />This means that the first few lines of the “<tt>/tmp/rebase.txt</tt>” file are as follows:<br /><br /><blockquote><pre>
Underneath:
26 pick 37cf3c8c1ebda rcutorture: Add CONFIG_PREEMPT_DYNAMIC=n to tiny scenarios
@@@
</pre></blockquote><br />And this line has been removed from the remainder of this file. The “<tt>@@@</tt>” is a marker separating the topic-branched commits from those still waiting to be classified.<br /><br />From my scan of the commits, I tentatively created the following topic branches, so that the first few lines of the “<tt>/tmp/rebase.txt</tt>” file are now as follows:<br /><br /><blockquote><pre>
Underneath:
26 pick 37cf3c8c1ebda rcutorture: Add CONFIG_PREEMPT_DYNAMIC=n to tiny scenarios
clocksource.2021.11.30c
doc.2021.11.30c
exp.2021.11.30c
fastnohz.2021.11.30c
fixes.2021.11.30c
nocb.2021.11.30c
nolibc.2021.11.30c
rcutorture.2021.11.30c
srcu.2021.11.30c
tasks.2021.11.30c
torture.2021.11.30c
torturescript.2021.11.30c
Merge above.
kcsan.2021.11.30c (merge)
lkmm.2021.11.30c (merge, then merge lkmm-dev)
clocksource.2021.11.30c (merge)
@@@
</pre></blockquote><br />In other words, we will cherry-pick the first commit, rebase 11 branches on top of it (<tt>clocksource.2021.11.30c</tt> through <tt>torturescript.2021.11.30c</tt>, that is, the commits subject to some form of torture testing), and then merge the last ten of these branches together. The <tt>kcsan</tt> commits will be rebased on top of <tt>v5.16-rc1</tt>, as will the <tt>lkmm</tt> commits (but with the <tt>lkmm-dev</tt> commits rebased on top of the <tt>lkmm</tt> commits). Each of the <tt>kcsan</tt>, <tt>lkmm</tt>, and <tt>lkmm-dev</tt> branches will be merged separately on top of the ten-way merge point, and finally the <tt>clocksource.2021.11.30c</tt> branch will be merged on top of all of that. There are always a few commits not ready for the next merge window or that are to be upstreamed elsewhere, and these will be rebased on top of the <tt>clocksource.2021.11.30c</tt> merge point.<br /><br />Please feel free to run “<tt>gitk v5.16-rc1..dev.2021.11.30c</tt>” in order to get a firm picture of the desired state.<br /><br />The next step is to go through the remaining commits and assign them to branches. This process brought home the fact that there are no <tt>kcsan</tt> commits this time around (but there were plenty last time and likely will be plenty more next time), so I removed the “<tt>kcsan.2021.11.30c (merge)</tt>” line from <tt>/tmp/rebase.txt</tt>. In addition, there were not all that many commits combined for the <tt>rcutorture.2021.11.30c</tt> and <tt>torture.2021.11.30c</tt> topic branches, so I merged them all into <tt>torture.2021.11.30c</tt>. This is why the numbers in the first column of each commit line in <tt>/tmp/rebase.txt</tt> are important: In some cases, it is necessary to rearrange the topic branches, and it is sometimes important to get the commits in the correct order.<br /><br />At this point, the <tt>/tmp/rebase.txt</tt> file looks (almost) like <a href="https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/Answers/RCU/rebase-v5.16-rc1.txt" target="_blank" rel="nofollow">this</a>.<br /><br />Time to start actually creating the topic branches!<br /><br /><h2>Create Topic Branches</h2>We need two <tt>gitk</tt> instances, one for the old state (“<tt>gitk v5.16-rc1..dev.2021.11.30b</tt>”) and another for the new state which we will launch shortly:<br /><br /><blockquote><pre>
git checkout v5.16-rc1
git cherry-pick 37cf3c8c1ebda
gitk v5.16-rc1..&
</pre></blockquote><br />The first of the above commands puts us at the desired <tt>v5.16-rc1</tt> base commit for the topic branches, which just so happens to be where <tt>dev.2021.11.30b</tt> is based. Important safety tip: Don't try creating branches while also moving the pile of commits to some other base commit at the same time. There are just too many things that can go wrong when you try that.<br /><br />The second of the above commands rebases (or “cherry-picks”) the aforementioned <tt>CONFIG_PREEMPT_DYNAMIC</tt>-compensation commit. The final command launches the other <tt>gitk</tt> instance that will show us the evolving state of the topic branches.<br /><br />Note that my rebased <tt>CONFIG_PREEMPT_DYNAMIC</tt>-compensation commit happened to have the SHA-1 commit ID <tt>8c0abfd6d2f6b0221194241ac2908751a2a0385f</tt>. If you are following along, you will need to change the <tt>git rebase</tt> argument for <tt>--onto</tt> to the corresponding SHA-1 commit ID in your tree. If you instead use my ID, everything will work, except that you will be rebasing on my rebased <tt>CONFIG_PREEMPT_DYNAMIC</tt>-compensation commit instead of your own. Rebasing some on mine and some on yours will probably actually work, but I leave the decision as to whether to carry out that experiment to your better judgment.<br /><br />I suggest placing the old-state <tt>gitk</tt> at the left of your screen, the <tt>rebase.txt</tt> and your command window in the middle, and the new-state <tt>gitk</tt> at the right of your screen. Being able to see everything all the time is key to successful creation of topic branches.<br /><br />The next set of commands rebases the clocksource-related commits on top of the rebased <tt>CONFIG_PREEMPT_DYNAMIC</tt>-compensation commit:<br /><br /><blockquote><pre>
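# Point the new topic branch at the existing pile of commits, then use an
# interactive rebase (--onto) to replay just this topic's commits on top of
# the cherry-picked CONFIG_PREEMPT_DYNAMIC-compensation commit.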
git branch clocksource.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
d069e38e66dbfeca541d7a2f9790ef8e8d3c911a clocksource.2021.11.30c
</pre></blockquote><br />The <tt>-i</tt> argument to the “<tt>git rebase</tt>” command specifies interactive mode, which will place you in an editor with a very long series of commits. In <tt>vim</tt>, type “<tt>c}</tt>” to delete all of those commits and enter <tt>vim</tt> insertion mode. Then copy and paste the two lines following <tt>clocksource.2021.11.30c</tt> from the <a href="https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/Answers/RCU/rebase-v5.16-rc1.txt" target="_blank" rel="nofollow"><tt>rebase.txt</tt></a> file. Hit the “escape” key to exit <tt>vim</tt> insertion mode and then strip the commit numbers from all of the lines, for example, by using the “<tt>:1,.s/^[0-9]* //</tt>” <tt>vim</tt> command. Write out the file and exit the editor (in <tt>vim</tt>, your choice of <tt>ZZ</tt>, <tt>:x</tt>, <tt>:wq</tt>, and probably others as well), at which point <tt>git</tt> will commence rebasing.<br /><br />Don't forget to hit the <F5> key in the new <tt>gitk</tt> window after the completion of each rebase command.<br /><br />The next several branches are constructed in a similar manner:<br /><br /><blockquote><pre>
git branch doc.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
d069e38e66dbfeca541d7a2f9790ef8e8d3c911a doc.2021.11.30c
git branch exp.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
d069e38e66dbfeca541d7a2f9790ef8e8d3c911a exp.2021.11.30c
git branch fastnohz.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
d069e38e66dbfeca541d7a2f9790ef8e8d3c911a fastnohz.2021.11.30c
git branch fixes.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
d069e38e66dbfeca541d7a2f9790ef8e8d3c911a fixes.2021.11.30c
git branch nocb.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
d069e38e66dbfeca541d7a2f9790ef8e8d3c911a nocb.2021.11.30c
git branch nolibc.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
d069e38e66dbfeca541d7a2f9790ef8e8d3c911a nolibc.2021.11.30c
</pre></blockquote><br />For each “<tt>git rebase</tt>” command, copy and paste the corresponding section of the <tt>rebase.txt</tt> file. Don't forget to hit <F5> in the new-state <tt>gitk</tt> window after running each “<tt>git rebase</tt>” command.<br /><br />Remember that “almost” I mentioned above about the <a href="https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/Answers/RCU/rebase-v5.16-rc1.txt" target="_blank" rel="nofollow"><tt>rebase.txt</tt></a> file? We now get to one of the deviations. At this point, I belatedly realized that there was only one SRCU commit (<tt>441a467cd9793</tt> “srcu: Prevent redundant __srcu_read_unlock() wakeup”) and that it would be better to add that single commit to the <tt>fixes.2021.11.30c</tt> topic branch:<br /><br /><blockquote><pre>
git checkout fixes.2021.11.30c
git cherry-pick 441a467cd9793
</pre></blockquote><br />Note that I ignored commit ordering in this case. It is safe to do so because this is the only commit in the whole pile that touches <tt>kernel/rcu/srcutiny.c</tt>, it touches no other file, and there are no other types of dependencies against the other commits. Sometimes you get lucky!<br /><br />Next, the rest of the RCU-related branches:<br /><br /><blockquote><pre>
git branch tasks.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
d069e38e66dbfeca541d7a2f9790ef8e8d3c911a tasks.2021.11.30c
git branch torture.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
d069e38e66dbfeca541d7a2f9790ef8e8d3c911a torture.2021.11.30c
git branch torturescript.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
d069e38e66dbfeca541d7a2f9790ef8e8d3c911a torturescript.2021.11.30c
</pre></blockquote><br />Once again, copy and paste the commits from the corresponding section of the <tt>rebase.txt</tt> file and once again do not forget to hit <F5> after each “<tt>git rebase</tt>” command.<br /><br />It is now time to do a merge! Unfortunately, I messed up the copy-and-paste. Twice. And then forgot that “<tt>git reset --hard</tt>” takes the current branch with it. Again, more than once. Then I accidentally included <tt>clocksource.2021.11.30c</tt> in the merge.<br /><br />Perhaps building branches late in the day was a mistake, but here are the mistaken commands:<br /><br /><blockquote><pre>
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c \
fixes.2021.11.30c nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c \
torture.2021.11.30c torturescript.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git checkout -B torturescript.2021.11.30c 90b21bcfb2846625c4f2e4a0bf5969543cef8ba7
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c \
fixes.2021.11.30c nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c \
torture.2021.11.30c torturescript.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git checkout -B torturescript.2021.11.30c 90b21bcfb2846625c4f2e4a0bf5969543cef8ba7
git checkout 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c \
fixes.2021.11.30c nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c \
torture.2021.11.30c torturescript.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
</pre></blockquote><br />If nothing else, this sorry story shows that you can recover from mistakes. And making sure to hit <F5> in the new-state <tt>gitk</tt> means that all the commits are right there, enabling you to piece things back together.<br /><br />But I suggest that you instead execute the following single command:<br /><br /><blockquote><pre>
git checkout 8c0abfd6d2f6b0221194241ac2908751a2a0385f
</pre></blockquote><br />Then hit <F5> in the new-state <tt>gitk</tt> window and verify that each topic branch has its branch name in place, that none of those branch names are in bold face, and that none of the little per-commit circles are yellow, except for the v5.16-rc1 commit. If all of that is in place, it is time to do the merge:<br /><br /><blockquote><pre>
git merge doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c fixes.2021.11.30c \
nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c torture.2021.11.30c \
torturescript.2021.11.30c
</pre></blockquote><br />This will put you into your editor, where you can fill out a commit log for the merge. This is what I did, documenting the topics:<br /><br /><blockquote><pre>
Merge branches 'doc.2021.11.30c', 'exp.2021.11.30c', 'fastnohz.2021.11.30c', 'fixes.2021.11.30c', 'nocb.2021.11.30c', 'nolibc.2021.11.30c', 'tasks.2021.11.30c', 'torture.2021.11.30c' and 'torturescript.2021.11.30c' into HEAD
doc.2021.11.30c: Documentation updates.
exp.2021.11.30c: Expedited-grace-period fixes.
fastnohz.2021.11.30c: Remove CONFIG_RCU_FAST_NO_HZ.
fixes.2021.11.30c: Miscellaneous fixes.
nocb.2021.11.30c: No-CB CPU updates.
nolibc.2021.11.30c: Tiny in-kernel library updates.
tasks.2021.11.30c: RCU-tasks updates, including update-side scalability.
torture.2021.11.30c: Torture-test in-kernel module updates.
torturescript.2021.11.30c: Torture-test scripting updates.
</pre></blockquote><br />The text starting with “<tt>Merge branches</tt>” that extends up to but not including the blank line is a single long line, just so you know.<br /><br />Yet again, be sure to hit <F5> in the new-state <tt>gitk</tt> to see the results of the merge. Copy the SHA-1 commit ID of this merge point somewhere, keeping in mind that clicking on a commit in <tt>gitk</tt> places that commit's SHA-1 ID in your clipboard. Alternatively, you can use <tt>bash</tt> variables: “<tt>mergepoint=`git rev-parse HEAD`</tt>”<br /><br />The next task is to rebase the various Linux-kernel memory model (LKMM) commits, as always substituting the commits from the corresponding portions of the <a href="https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/Answers/RCU/rebase-v5.16-rc1.txt" target="_blank" rel="nofollow"><tt>rebase.txt</tt></a> file:<br /><br /><blockquote><pre>
git branch lkmm.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto v5.16-rc1 d069e38e66dbfeca541d7a2f9790ef8e8d3c911a lkmm.2021.11.30c
git branch lkmm-dev.2021.11.30c d069e38e66dbfeca541d7a2f9790ef8e8d3c911a
git rebase --onto c438b7d860b4c1acb4ebff6d8d946d593ca5fe1e v5.16-rc1 lkmm-dev.2021.11.30c
</pre></blockquote><br />Don't forget <F5>!<br /><br />The next step is to check out the previous merge point (you will need to substitute the SHA-1 ID recorded above) and do the remaining merges:<br /><br /><blockquote><pre>
git checkout 32e5555b62e6a5f7304b13470cd1881a49a38d18
git merge lkmm.2021.11.30c
git merge lkmm-dev.2021.11.30c
git merge clocksource.2021.11.30c
</pre></blockquote><br />Finally, rebase the five remaining commits from the “On top:” section of the <a href="https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/Answers/RCU/rebase-v5.16-rc1.txt" target="_blank" rel="nofollow"><tt>rebase.txt</tt></a> file:<br /><br /><blockquote><pre>
git branch dev.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto ce410b77746063e1c03f6e5bf85d68cca3e1c7ea \
d069e38e66dbfeca541d7a2f9790ef8e8d3c911a dev.2021.11.30c
</pre></blockquote><br />Alternatively, you might instead choose to check out the new branch and then cherry-pick the five commits as follows:<br /><br /><blockquote><pre>
git checkout -b dev.2021.11.30c
git cherry-pick 2a1d7ed8553da ca507a89ffbb9 0ebafffe561ac e824a258c37d4 74f780ac279b0
</pre></blockquote><br />This checkout-and-cherry-pick operation is completely equivalent to the prior branch-and-rebase operation.<br /><br />Either way, as always, don't forget to hit <F5> in your new-state <tt>gitk</tt>.<br /><br /><h2>Check Your Work!</h2>And finally, the most important thing is to check your work:<br /><br /><blockquote><pre>
git diff dev.2021.11.30b
</pre></blockquote><br />If these diffs are non-empty, there was some mistake somewhere in the process.<br /><br /><h2>Possible Complications</h2>This topic-branch exercise went quite smoothly and took about an hour to complete. Sometimes life is harder:<br /><ol><br /><li> There might be rebase and/or merge conflicts. I normally take this as a hint that my assignment of commits to topic branches was suboptimal. The commit numbers assigned by the <tt>git-rebase-prep.sh</tt> script are quite helpful when rejuggling commits. Yes, sometimes the resulting conflicts must be resolved manually, but that is a blog post for another day.<br /><li> One of the topic branches might depend on another. In this case, I rebase the dependent topic branch on top of the topic branch that it depends on. Looking at it another way, sometimes the topic branches are arranged in series instead of in parallel.<br /><li> A commit might depend on a number of topics, and the rest of the topics might depend on that commit. This is rare, but it does happen. The trick in this case is to create and merge the topic branches that the commit depends on, cherry-pick that commit on top of the resulting merge point, and then create and merge the remaining topic branches on top of that commit.<br /></ol><br />If you made it all the way down here, thank you for your time and attention, and I hope that this material was helpful. If you remember nothing else from this blog post, let it be <tt>gitk</tt> (or <tt>tig</tt>, as the case may be). Being able to see what you are doing allows you to learn to <tt>git</tt> much faster and allows you to safely take on much more complicated <tt>git</tt>-related tasks.
https://paulmck.livejournal.com/66371.html?view=comments#comments
stupid rcu tricks
public
0
-
https://paulmck.livejournal.com/66175.html
Wed, 03 Nov 2021 17:14:54 GMT
What Memory Model Should the Rust Language Use?
paulmck
https://paulmck.livejournal.com/66175.html
This blog post discusses a few alternative Rust-language memory models. I hope that this discussion is of value to the Rust community, but in the end, it is their language, so it is also their choice of memory model.<br /><br />This discussion takes the Rust fearless-concurrency viewpoint, tempered by discussions I have observed and participated in while creating this <a href="https://paulmck.livejournal.com/62436.html" target="_blank">blog series</a>. Different members of that community of course have different viewpoints, and thus might reasonably advocate for different choices. Those who know me will understand that these viewpoints differ significantly from my own. However, my viewpoint is dictated by my long-standing privilege of living at the edge of what is possible in terms of performance, scalability, real-time response, energy efficiency, robustness, and much else besides. Where I live, significant levels of fear are not just wise, but also necessary for survival. To borrow an old saying from aviation, there are old pilots, and there are bold pilots, but there are no old bold pilots.<br /><br />Nevertheless, I expect that my more than three decades of experience with concurrency, my work on the C/C++ memory model (<tt>memory_order_consume</tt> notwithstanding), and my role as lead maintainer of the <a href="https://github.com/torvalds/linux/tree/master/tools/memory-model" target="_blank" rel="nofollow">Linux-kernel memory model</a> (LKMM) will help provide a good starting point for the more work-a-day situation that I believe that the Rust community wishes to support.<br /><br /><h2>But Doesn't Rust Already Have a Memory Model?</h2>Kinda sorta?<br /><br />Some in academia have assumed that Rust's memory model will be based on that of C/C++, for example, <a href="https://plv.mpi-sws.org/rustbelt/rbrlx/" target="_blank" rel="nofollow">see here</a>. And the <a href="https://doc.rust-lang.org/nomicon/atomics.html" target="_blank" rel="nofollow">Rustnomicon agrees</a>, though the author of that text does not appear to be very happy about it:<br /><br /><blockquote>Rust pretty blatantly just inherits the memory model for atomics from C++20. This is not due to this model being particularly excellent or easy to understand. Indeed, this model is quite complex and known to have several flaws. Rather, it is a pragmatic concession to the fact that everyone is pretty bad at modeling atomics. At very least, we can benefit from existing tooling and research around the C/C++ memory model. (You'll often see this model referred to as "C/C++11" or just "C11". C just copies the C++ memory model; and C++11 was the first version of the model but it has received some bugfixes since then.)<br /><br />Trying to fully explain the model in this book is fairly hopeless. It's defined in terms of madness-inducing causality graphs that require a full book to properly understand in a practical way. If you want all the nitty-gritty details, you should check out the <a href="https://en.cppreference.com/w/cpp/atomic/memory_order" target="_blank" rel="nofollow">C++ specification</a>. Still, we'll try to cover the basics and some of the problems Rust developers face.<br /><br />The C++ memory model is fundamentally about trying to bridge the gap between the semantics we want, the optimizations compilers want, and the inconsistent chaos our hardware wants. We would like to just write programs and have them do exactly what we said but, you know, fast. 
Wouldn't that be great?<br /></blockquote><br />I would argue that compiler optimizations are the source of much more inconsistent chaos than hardware would ever dream of providing, but that argument has been going on for many years and I doubt that we will be able to settle it here.<br /><br />Leaving that argument aside, entertaining though it might be, the wording of this passage of the Rustnomicon certainly invites alternatives to the C/C++ memory model. Let's see what we can do.<br /><br /><h2>Where to Start?</h2>Let's first eliminate a number of candidate memory models. Rust's presumed portability goals rule out any of the hardware memory models. Rust's growing connection to the deep embedded world rules out the memory models of dynamic languages such as Java and Javascript. The growing number of memory models derived from the C/C++ memory model is represented by the actual C/C++ memory model, so we will not consider those variants separately.<br /><br />This leaves the aforementioned C/C++ memory model and of course LKMM, the latter inspired by the Rust community's ambitions within the Linux kernel. Because this blog series focuses on the Linux kernel, this post will start with LKMM, move to the C/C++ memory model, and end with a specific recommendation.<br /><br /><h2>Linux-Kernel Memory Model (LKMM)</h2>This section will consider aspects of LKMM, starting with the most fear-inducing aspects and moving to the more work-a-day aspects.<br /><br /><a href="https://paulmck.livejournal.com/63151.html" target="_blank">Control dependencies</a> are not the most fearless portion of LKMM; in fact, in my role as lead maintainer of LKMM, I need people using control dependencies not to merely feel fearful, but instead to feel downright terrified. Control dependencies are fragile and easy for control-dependency-oblivious compiler optimizations to destroy. Therefore, for the time being, control dependencies should be excluded from any Rust memory model. Perhaps the day will come when we have a good way to communicate the intent of control dependencies to compilers, but today is not that day.<br /><br /><a href="https://paulmck.livejournal.com/63316.html" target="_blank">Address and data dependencies</a> carried by pointers are not free of risk, but they do provide dependency-oblivious compilers significantly less scope for destructive optimizations. They should thus require much less fear than do control dependencies, little though that might be saying. Nevertheless, address and data dependencies are mainly used in conjunction with RCU, and we do not yet have a known good way of expressing all RCU use cases within the confines of Rust's ownership model. Therefore, address and data dependencies should also be excluded from Rust's memory model until such time as significant RCU use cases are handled or some other clear and present use case militates for address and data dependencies. It would be increasingly reasonable to permit control and data dependencies within Rust <tt>unsafe</tt> mode should use of RCU within Rust increase, keeping in mind that things like epoch-based reclamation (EBR) are particular classes of implementations of RCU.<br /><br />At first glance, it seems entirely reasonable to countenance use of <tt>READ_ONCE()</tt> and <tt>WRITE_ONCE()</tt> within prospective Linux-kernel Rust code, but this post is discussing Rust in general, not just Rust in the Linux kernel. 
And the fact is that address, data, and control dependencies are the only things that prevent the Linux kernel's use of <tt>READ_ONCE()</tt> and <tt>WRITE_ONCE()</tt> from resulting in <a href="https://paulmck.livejournal.com/63517.html" target="_blank">out-of-thin-air (OOTA)</a> behavior. Except that these operations (along with the Linux kernel's unordered atomic read-modify-write (RMW) operations) are implemented so as to prevent the compiler from undertaking code-motion optimizations that might otherwise reorder such operations. Furthermore, all of the underlying hardware memory models of which I am aware preserve dependency orderings. Therefore, one might expect that these unordered operations might reasonably be part of a Rust memory model.<br /><br />Unfortunately, one important benefit of a memory model is the tooling that analyzes concurrent code fragments, and if this tooling is to exclude OOTA behaviors, it is absolutely necessary for that tooling to understand dependencies. Except that we have already excluded such dependencies from the Rust memory model.<br /><br />Therefore, the Rust memory model should restrict its support for Linux-kernel atomic operations to those that provide ordering. These would be the value-returning non-relaxed read-modify-write (RMW) atomic operations along with the <tt>_acquire()</tt> and <tt>_release()</tt> variants of non-value-returning RMW atomic operations. It might also make sense to allow combinations of unordered RMW operations with combination memory barriers, for example, <tt>atomic_inc()</tt> followed by <tt>smp_mb__after_atomic()</tt>, but it would make even more sense to wrapper these in combination as a single Rust-accessible primitive. This combined Rust primitive would no longer be unordered, and could thus be included as an ordered unit in the Rust memory model. Alternatively, unordered atomics might be relegated to Rust's <tt>unsafe</tt> mode.<br /><br />It is hard to imagine a useful Rust memory model that excludes locking.<br /><br />Thus, starting from LKMM, we arrive at a model that supports ordered atomic operations and locking, possibly including unordered atomics in <tt>unsafe</tt> mode.<br /><br /><h2>C/C++ Memory Model</h2>This section will instead start from the C/C++ memory model.<br /><br />Because <tt>memory_order_relaxed</tt> accesses can yield OOTA results, they seem inconsistent with Rust's fearless-concurrency goals. These cannot actually occur in practice with any implementation that I am aware of, but fearless concurrency requires accurate tools, and this accuracy requires excluding even the theoretical possibility of OOTA results. Another reasonable approach would permit use of <tt>memory_order_relaxed</tt> only within Rust <tt>unsafe</tt> code.<br /><br />The <tt>memory_order_consume</tt> ordering is primarily useful in conjunction with RCU. In addition, in all implementations that I am aware of, <tt>memory_order_consume</tt> is simply promoted to <tt>memory_order_acquire</tt>. It therefore seems unnecessary to include <tt>memory_order_consume</tt> within a Rust memory model. As before, a reasonable alternative would permit its use only within Rust <tt>unsafe</tt> code. <br /><br />In contrast, <tt>memory_order_acquire</tt>, <tt>memory_order_release</tt>, and <tt>memory_order_acq_rel</tt> are all easily analyzed and have been heavily used in practice for decades. 
Although the celebrated <tt>memory_order_seq_cst</tt> (sequentially consistent) memory ordering can be <a href="https://plv.mpi-sws.org/scfix/" target="_blank" rel="nofollow">difficult to analyze</a>, its strong ordering is absolutely required by some use cases, including a sadly large fraction of concurrent algorithms published by academics and industrial researchers. In addition, there is a decades-long tradition of proofs and tooling handling sequential consistency, difficulties aside. Use of all four of these orderings should therefore be permitted from safe Rust code.<br /><br />With the C/C++ memory model as it was with LKMM, it is hard to imagine a useful Rust memory model that excludes locking.<br /><br />Thus, starting from the C/C++ memory model, we also arrive at a model that supports ordered atomic operations and locking, possibly along with unordered atomic operations in <tt>unsafe</tt> mode.<br /><br /><h2>Recommendation</h2>The fact that starting with LKMM got us to roughly the same place as did starting with the C/C++ memory model should give us considerable confidence in that destination. Why the “roughly”? Because there are subtle differences, as can be seen by <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p0124r7.html" target="_blank" rel="nofollow">comparing the C and C++ standards to LKMM</a>.<br /><br />One reaction to this situation would be to go through these memory models one corner case at a time, and for each corner case making a choice for Rust, now and forever. Perhaps a better approach is to pick one and adapt as needed based on real situations as they arise. At this time, there are far more projects living within the C/C++ memory model than within LKMM. Therefore, despite my role as lead maintainer of LKMM, it is my painful duty to recommend the following, based on the C/C++ memory model:<ol><br /><li> Locking may be used from both safe and <tt>unsafe</tt> modes.<br /><li> Atomic operations using <tt>memory_order_acquire</tt>, <tt>memory_order_release</tt>, <tt>memory_order_acq_rel</tt>, and <tt>memory_order_seq_cst</tt> may be used from both safe and <tt>unsafe</tt> modes.<br /><li> Atomic operations using <tt>memory_order_relaxed</tt> and <tt>memory_order_consume</tt> may be used only from <tt>unsafe</tt> mode.<br /><li> All atomic operations in Rust code should be marked, that is, Rust should avoid following the C/C++ practice of interpreting an unmarked mention of an atomic variable as a <tt>memory_order_seq_cst</tt> access to that variable. Requiring marking allows accesses to concurrently accessed shared variables to be identified at a glance, and also makes the choice of any default <tt>memory_order</tt> value much less pressing.<br /></ol>This provides explicit support for the operations and orderings that have a long history of heavy use and for which analysis tools have long been available. It also provides provisional support of operations and orderings that, while also having a long history of heavy use, are beyond the current state of the art for accurate and complete analysis.<br /><br /><h2>Other Options</h2>One could argue that <tt>memory_order_relaxed</tt> also has many simple use cases, for but one example, constructing distributed counters. However, given the <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2055r0.pdf" target="_blank" rel="nofollow">difficulty verifying its full range of use cases</a>, <tt>unsafe</tt> mode seems the safest bet at the moment. 
Should any of the current efforts to more straightforwardly verify <tt>memory_order_relaxed</tt> accesses bear fruit, then perhaps <tt>memory_order_relaxed</tt> might be permitted in Rust safe mode.<br /><br />Finally, one could argue that <tt>memory_order_consume</tt> should be excluded entirely, rather than simply being relegated to <tt>unsafe</tt> mode. However, some Rust libraries already use RCU internally, and the ability to flag RCU pointer traversals could prove helpful should Rust someday fully support address and data dependencies.<br /><br />In contrast, as noted earlier, the other four <tt>memory_order</tt> <tt>enum</tt> members are heavily used and routinely analyzed. It is therefore reasonable to permit use of <tt>memory_order_acquire</tt>, <tt>memory_order_release</tt>, <tt>memory_order_acq_rel</tt>, and <tt>memory_order_seq_cst</tt> within Rust safe code.<br /><br /><h2>What Does This Mean for the Linux Kernel?</h2>As it turns out, not much.<br /><br />Please keep in mind that the Linux kernel currently interoperates with the full range of memory models used by arbitrary userspace code written in arbitrary languages when running on a wide range of hardware memory models. Any adjustments required are currently handled by architecture-specific code, for example, in the system-call layer or in exception entry/exit code. Strong ordering is also provided deeper within the Linux kernel, with but one example being the context-switch code, which must provide full ordering when processes migrate from one CPU to another.<br /><br />It turns out that Rust code in Linux has wrappers around any C code that it might invoke, and it is through these wrappers that Rust code will make use of the non-Rust portions of the Linux kernel. These wrappers will therefore contain calls to whichever memory-ordering primitives might be required at any particular point in Rust's memory-model evolution.<br /><br /><h2>But Does Rust Really Need to Define Its Memory Model Right Now?</h2>This is of course not my decision.<br /><br />But it is only fair to point out that the longer the Rust community waits, the more the current “kinda sorta” choice of the full C/C++ memory model will become the actual long-term choice. Including those portions of the C/C++ memory model that have proven troublesome, and that are therefore likely to be subject to change.<br /><br />My recommendation is therefore to adopt the untroublesome portions of the C/C++ memory model in safe mode, and the rest in <tt>unsafe</tt> mode. This approach allows people to write work-a-day concurrent algorithms in Rust and to be confident that the resulting code will still work in the future. It also allows those needing to live more dangerously to confine their code to <tt>unsafe</tt> blocks so as to attract an appropriate level of skepticism and attention.<br /><br />But again, this decision rests not with me, but with the Rust community.<br /><br /><h2>History</h2><i>November 4, 2021: Unmarked Rust-language accesses are never atomic operations plus some wordsmithing.</i>
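For readers who would like something concrete, below is a minimal C11-flavored sketch of the classic acquire/release message-passing pattern, an illustration of the orderings recommended above for safe mode rather than anything taken from the Rust community; the <tt>producer()</tt> and <tt>consumer()</tt> functions are hypothetical:<br /><br /><blockquote><pre>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static int payload;               /* Plain, non-atomic data. */
static atomic_bool ready;         /* Explicitly marked atomic flag. */

static void producer(void)
{
        payload = 42;
        /* Release store: orders the payload store before the flag store. */
        atomic_store_explicit(&ready, true, memory_order_release);
}

static void consumer(void)
{
        /* Acquire load: pairs with the release store, so a consumer that
         * sees ready == true is also guaranteed to see payload == 42. */
        if (atomic_load_explicit(&ready, memory_order_acquire))
                printf("%d\n", payload);
}
</pre></blockquote><br />The Rust equivalent would use an explicitly marked <tt>AtomicBool</tt> with <tt>Ordering::Release</tt> and <tt>Ordering::Acquire</tt>, which is exactly the sort of code that the above recommendation would admit in safe mode.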
https://paulmck.livejournal.com/66175.html?view=comments#comments
lkmm
rust
linux
public
6
-
https://paulmck.livejournal.com/65800.html
Mon, 01 Nov 2021 19:40:10 GMT
Stupid RCU Tricks: Waiting for Grace Periods From NMI Handlers
paulmck
https://paulmck.livejournal.com/65800.html
Suppose that you had a state machine implemented by NMI handlers, and that some of the transitions in this state machine need to wait for an RCU grace period to elapse. How could these state transitions be implemented?<br /><br />Before we start, let's dispense with a couple of silly options. First, we clearly are not going to invoke <tt>synchronize_rcu()</tt> from within an NMI handler. This function can sleep, and we cannot sleep even within non-threaded interrupt handlers, let alone within NMI handlers. Second, we are not going to invoke <tt>call_rcu()</tt> from within an NMI handler. This function disables interrupts to exclude concurrent invocations on a single CPU, and disabling interrupts does nothing to keep NMI handlers from running. Worse yet, when running on <tt>rcu_nocbs</tt> CPUs (which offload callback invocations to <tt>rcuo</tt> kthreads), <tt>call_rcu()</tt> acquires a lock. Therefore, invoking <tt>call_rcu()</tt> from within an NMI handler would at best result in deadlock, and might also result in corruption of RCU's lists of callbacks.<br /><br />So what <i>can</i> we do?<br /><br />One approach would be to use a self-spawning RCU callback, that is, a callback that invokes <tt>call_rcu()</tt> on itself. (Yes, this is perfectly legal, just as it is with timer handlers. See the function <tt>rcu_torture_fwd_prog_cb()</tt> in <tt>kernel/rcu/rcutorture.c</tt> of v5.14 of the Linux kernel for an example.) This callback could also increment a counter that could be checked by the NMI handler. When the NMI handler wished to defer a state transition until the end of a future RCU grace period, it could transition to an additional “waiting for RCU” state while recording the value of that counter. Later, when the counter had advanced sufficiently, a subsequent NMI handler could complete the transition to the desired state.<br /><br />Of course, it is not sufficient to wait for the counter to advance just once. The reason is that the initial NMI might have occurred just before the RCU callback executed, and the next NMI might happen just afterwards. Yes, the counter has advanced, but almost no time has elapsed, much less a full RCU grace period. It is instead necessary to wait for the counter to advance by two, and also to have the needed memory barriers in place.<br /><br />But there is a better way that makes use of RCU's existing counters and memory barriers. RCU provides these four functions for this purpose, two of which are usable from NMI handlers:<br /><ol><br /><li> <tt>get_state_synchronize_rcu()</tt>, which was first used in v4.10 in 2015, returns a “cookie” that can be checked later. SRCU provides a similar <tt>get_state_synchronize_srcu()</tt> API.<br /><li> <tt>start_poll_synchronize_rcu()</tt> returns a cookie as well, but also ensures that the needed RCU grace period gets scheduled. Unfortunately, this last requires locks to be acquired, which precludes its use in an NMI handler. SRCU provides a similar <tt>start_poll_synchronize_srcu()</tt> API, which was first used in v5.14 in 2021.<br /><li> <tt>poll_state_synchronize_rcu()</tt> takes a cookie from the first two functions and returns true if the corresponding RCU grace period has elapsed. SRCU provides a similar <tt>poll_state_synchronize_srcu()</tt> API, which was first used in v5.14 in 2021.<br /><li> <tt>cond_synchronize_rcu()</tt>, which was first used in v4.10 in 2015, also takes a cookie from the first two functions, but waits (if needed) until the corresponding RCU grace period has elapsed. 
Unfortunately, the possibility of waiting precludes this function's use in an NMI handler.<br /></ol><br />So the first NMI handler can invoke <tt>get_state_synchronize_rcu()</tt> to obtain a cookie, then transition to the additional state. Later NMI handlers residing in this additional state could pass that cookie to <tt>poll_state_synchronize_rcu()</tt>, completing the transition if this function returns <tt>true</tt>. On busy systems, RCU grace periods are being initiated by many other things, so that there is almost always a grace period in progress, but if this must work on quiet systems, the aforementioned self-spawning callback could be used to force the issue when needed.<br /><br />Of course, RCU has made internal use of grace-period polling for a very long time, starting prior to the beginning of the Linux-kernel git repository in 2005.<br /><br />In short, NMI handlers can now work with both RCU and SRCU grace periods without the need to invent counters or to worry about memory-ordering issues!
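<br /><br />As a concrete illustration of the foregoing, the state machine described above might look like the following C sketch. The handler and its three-state machine are hypothetical, and the sketch ignores the question of multiple CPUs racing on the state variable (a real implementation might use per-CPU state); only <tt>get_state_synchronize_rcu()</tt> and <tt>poll_state_synchronize_rcu()</tt> are actual Linux-kernel APIs:<br /><br /><blockquote><pre>
#include <linux/rcupdate.h>

/* Hypothetical NMI-handler state machine waiting for an RCU grace period. */
enum nmi_state { NMI_STATE_A, NMI_STATE_WAIT_RCU, NMI_STATE_B };

static enum nmi_state state = NMI_STATE_A;
static unsigned long rcu_cookie;

void my_nmi_handler(void) /* hypothetical */
{
  switch (state) {
  case NMI_STATE_A:
    rcu_cookie = get_state_synchronize_rcu(); /* NMI-safe. */
    state = NMI_STATE_WAIT_RCU;
    break;
  case NMI_STATE_WAIT_RCU:
    if (poll_state_synchronize_rcu(rcu_cookie)) /* Also NMI-safe. */
      state = NMI_STATE_B; /* A full grace period has elapsed. */
    break;
  case NMI_STATE_B:
    /* Act on the completed state transition. */
    break;
  }
}
</pre></blockquote>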
https://paulmck.livejournal.com/65800.html?view=comments#comments
rcu
linux
stupid rcu tricks
public
0
-
https://paulmck.livejournal.com/65721.html
Fri, 15 Oct 2021 18:22:33 GMT
Verification Challenges
paulmck
https://paulmck.livejournal.com/65721.html
You would like to do some formal verification of C code? Or you would like a challenge for your formal-verification tool? Either way, here you go!<br /><br /><ol><br /><li> <a href="https://paulmck.livejournal.com/37782.html" target="_blank">Stupid RCU Tricks: rcutorture Catches an RCU Bug</a>, also known as Verification Challenge 1.<br /><li> <a href="https://paulmck.livejournal.com/38016.html" target="_blank">Verification Challenge 2: RCU NO_HZ_FULL_SYSIDLE</a>.<br /><li> <a href="https://paulmck.livejournal.com/38997.html" target="_blank">Verification Challenge 3: cbmc</a>.<br /><li> <a href="https://paulmck.livejournal.com/39343.html" target="_blank">Verification Challenge 4: Tiny RCU</a>.<br /><li> <a href="https://paulmck.livejournal.com/39793.html" target="_blank">Verification Challenge 5: Uses of RCU</a>.<br /><li> <a href="https://paulmck.livejournal.com/46993.html" target="_blank">Verification Challenge 6: Linux-Kernel Tree RCU</a>.<br /><li> <a href="https://paulmck.livejournal.com/50441.html" target="_blank">Verification Challenge 7: Heavy Modifications to Linux-Kernel Tree RCU</a>.<br /></ol>
https://paulmck.livejournal.com/65721.html?view=comments#comments
verification
parallel
validation
public
0
-
https://paulmck.livejournal.com/65341.html
Wed, 13 Oct 2021 21:08:50 GMT
TL;DR: Memory-Model Recommendations for Rusting the Linux Kernel
paulmck
https://paulmck.livejournal.com/65341.html
These recommendations assume that the initial Linux-kernel targets for Rust developers are device drivers that do not have unusual performance and scalability requirements, meaning that wrappering of small C-language functions is tolerable. (Please note that most device drivers fit into this category.) It also assumes that the main goal is to reduce memory-safety bugs, although other bugs might be addressed as well. Or, Murphy being Murphy, created as well. But that is a risk in all software development, not just Rust in the Linux kernel.<br /><br />Those interested in getting Rust into Linux-kernel device drivers sooner rather than later should look at the short-term recommendations, while those interested in extending Rust's (and, for that matter, C's) concurrency capabilities might be more interested in the long-term recommendations.<br /><br /><h2>Short-Term Recommendations</h2>The goal here is to allow the Rust language considerable memory-model flexibility while also providing Rust considerable freedom in what it might be used for within the Linux kernel. The recommendations are as follows:<br /><ol><br /><li> Provide wrappers around the existing Linux kernel's locking, MMIO, I/O-barrier, and I/O-access primitives. If I were doing this work, I would add wrappers incrementally as they were actually needed, but I freely admit that there are benefits to providing a full set of wrappers from the get-go.<br /><li> Either wrapper <tt>READ_ONCE()</tt> or be careful to take Alpha's, Itanium's, and ARMv8's requirements into account as needed (see the sketch following this list). The situation with <tt>WRITE_ONCE()</tt> appears more straightforward. The same considerations apply to the <tt>atomic_read()</tt> and <tt>atomic_set()</tt> family of primitives.<br /><li> Atomic read-modify-write primitives should be made available to Rust programs via wrappers around C code. But who knows? Maybe it is once again time to try out <a href="https://lwn.net/Articles/691128/" target="_blank" rel="nofollow">intrinsics</a>. Again, if I were doing the work, I would add wrappers/implementations incrementally.<br /><li> Although there has been significant discussion surrounding how sequence locking and RCU might be handled by Rust code, more work appears to be needed. For one thing, it is not clear that people proposing solutions are aware of the wide range of Linux-kernel use cases for these primitives. In the meantime, I recommend keeping direct use of sequence locking and RCU in C code, and providing Rust wrappers for the resulting higher-level APIs. In this case, it might be wise to make good use of compiler directives in order to limit Rust's ability to apply code-motion optimizations. However, if you need sequence-locking or RCU Rust-language wrappers, there are a number that have been proposed. (Except that you should first take another look or three at wrappering higher-level APIs. Yes, APIs are hard, but there are huge benefits to proper APIs!)<br /><li> Control dependencies should be confined to C code. If needed, higher-level APIs whose C-language implementations require control dependencies can be wrappered for Rust use. But my best guess is that it will be some time before Rust code needs control dependencies.<br /></ol><br />Taking this approach should avoid memory-model-mismatch issues between Rust and C code in the Linux kernel.
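<br /><br />As one concrete illustration of the wrappering called out in the second item, the C side might export small helpers along the following lines. This is a sketch, not existing Linux-kernel API: the helper names are hypothetical, although <tt>READ_ONCE()</tt> and <tt>WRITE_ONCE()</tt> are of course real:<br /><br /><blockquote><pre>
#include <linux/compiler.h>

/* Hypothetical helpers letting Rust code use the kernel's READ_ONCE()
 * and WRITE_ONCE() semantics without reimplementing them in Rust. */
unsigned long helper_read_once_ulong(const unsigned long *p)
{
  return READ_ONCE(*p);
}

void helper_write_once_ulong(unsigned long *p, unsigned long v)
{
  WRITE_ONCE(*p, v);
}
</pre></blockquote>Because each helper compiles to a single load or store, the function-call overhead matters only until inlining or LTO removes it, which is one reason these recommendations assume drivers that can tolerate wrappering of small C-language functions.<br /><br />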
And discussions indicate that much (and maybe all) of the wrappering work called out in the first three items above has already been done.<br /><br />Of course, situations that do not fit this set of recommendations can be addressed on a case-by-case basis. I would of course be happy to help. Specifically, in my role as lead Linux-kernel RCU maintainer:<br /><ol><br /><li> My first reaction to submission of Rust wrappers for the RCU API will be to ask hard questions about the possibility of higher-level APIs.<br /><li> If higher-level APIs are infeasible, I will look carefully at which RCU use cases are supported by the proposed wrappering. I am working to improve documentation of the range of RCU use cases in order to help submitters to do this as well. (This work will likely be helpful elsewhere as well.)<br /><li> Again, if higher-level APIs are infeasible, I will look carefully at how the Rust wrappers are helping to find bugs. This will clearly require me to continue learning Rust, as doing so requires a detailed view of Rust's ownership mechanisms and of how things like reference-counting use cases work around limitations in these mechanisms.<br /></ol><br />This procedure should help buy the time required for me to learn more about the Rust language and for the Rust community to learn more about RCU and its many use cases.<br /><br /><h2>Long-Term Recommendations</h2>This section takes a more utopian view. What would a perfect Rust sequence-locking implementation/wrapper look like? Here are some off-the-cuff desiderata, none of which are met by the current C-code Linux-kernel implementation:<br /><ol><br /><li> Use of quantities computed from sequence-lock-protected variables in a failed reader should result in a warning. But please note that reliably associating variables with sequence locks may not be easy. One approach suggested in response to this series is to supply a closure containing the read-side critical section, thus restricting such leakage to (unsafe?) side effects.<br /><li> Improper access to variables not protected by the sequence lock should result in a warning. But please note that it is quite difficult to define "improper" in this context, let alone detect it in real code.<br /><li> Data races in failed sequence-lock readers should not cause failures. But please note that this is extremely difficult in general if the data races involve non-<tt>volatile</tt> C-language accesses in the reader. For example, the compiler would be within its rights to refetch the value after the old value had been checked. (This is why the <a href="https://paulmck.livejournal.com/63957.html" target="_blank">sequence-locking</a> post suggests marked accesses to sequence-lock-protected variables.)<br /><li> Data races involving sequence-locking updaters should be detected, for example, via KCSAN.<br /></ol><br />As noted elsewhere, use of a wrapper around the existing C-language implementation allows C and Rust to use the same sequence lock. This might or might not prove to be important.<br /><br />Similarly, what would a perfect Rust RCU wrapper look like? Again, here are some off-the-cuff desiderata, similarly unmet by existing C-code Linux-kernel implementations:<br /><ol><br /><li> Make <tt>call_rcu()</tt> and friends cause the specified object to make an end-of-grace-period transition in other threads' ownership from readers to unowned.
Again, not an immediate transition at the time of <tt>call_rcu()</tt> invocation, but rather a deferred transition that takes place at the end of some future grace period.<br /><li> Some Rust notion of type safety would be useful in slab caches flagged as <tt>SLAB_TYPESAFE_BY_RCU</tt>.<br /><li> Some Rust notion of existence guarantee would be useful for RCU readers.<br /><li> Detect pointers to RCU-protected objects being improperly leaked from RCU read-side critical sections, where proper leakage is possible via locks and reference counters. Perhaps an emphasis on closures is at least part of the answer.<br /><li> Detect mishandling of dependency-carrying pointers returned by <tt>rcu_dereference()</tt> and friends. Or perhaps introduce some marking of such pointers so that the compiler will avoid breaking the ordering. See the <a href="https://paulmck.livejournal.com/63316.html" target="_blank">address/data dependency</a> post for a list of C++ working papers moving towards this goal.<br /></ol><br />There has recently been some work attempting to make the C compiler understand control dependencies. Perhaps Rust could make something useful happen in this area, for example by providing markings that allow the compiler to associate the reads with the corresponding control-dependent writes.<br /><br />Perhaps Rust can better understand more of the <a href="https://paulmck.livejournal.com/64392.html" target="_blank">ownership schemes</a> that are used within the Linux kernel.<br /><br /><h2>Rationale</h2>Please note that middle-term approaches are likely to be useful, for example, those that simply provide wrappers around the C-language RCU and sequence-locking implementations. However, such approaches also carry risks, especially during that time when the intersection of the set of people deeply understanding Rust with the set of people deeply understanding RCU and sequence locking is the empty set. For example, one risk that became apparent during the effort to add RCU to the C++ standard is that people new to RCU and sequence locking will latch onto the first use case that they encounter and focus solely on that use case. This has also been a recurring issue for concurrency experts who encounter sequence locking and RCU for the first time. This of course means that I need to do a better job of documenting RCU, its use cases, and the relationships between them.<br /><br />This is not to say that per-use-case work is pointless. In fact, such work can be extremely valuable, if nothing else, in helping to build understanding of the overall problem. It is also quite possible that current RCU and sequence-locking use cases will resemble those of the future in the same way that the venerable "while" loop resembles the wide panoply of iterator-like constructs found in modern languages, including constructs written in C in the Linux kernel. In other words, perhaps Rust will focus on specific fearless-concurrency-friendly RCU/sequence-locking use cases. Except that the Linux kernel still contains "while" loops, as does a great deal of other software, which suggests that the low-level RCU and sequence-locking APIs will always be required. There will also be questions as to whether a given new-age use case is there to help developers using RCU and sequence locking on the one hand or whether its main purpose is instead to work around a shortcoming in Rust on the other. "This should be good clean fun!" 
;-)<br /><br />Experience indicates that it will take significant time to sort out all of these issues. We should therefore make this time available by proceeding initially as described in the short-term recommendations.<br /><br />Criteria for judging proposals include safety, performance, scalability, build time, development cost, maintenance effort, and so on, not necessarily in that order.<br /><br /><h2>History</h2><i>October 18, 2021: Add "Rationale" section.</i><br /><i>October 20, 2021: Add a few expansions and clarifications.</i><br /><i>October 21, 2021: RCU-specific recommendations.</i>
https://paulmck.livejournal.com/65341.html?view=comments#comments
lkmm
rust
linux
public
3
-
https://paulmck.livejournal.com/65056.html
Wed, 06 Oct 2021 21:48:30 GMT
Rusting the Linux Kernel: Summary and Conclusions
paulmck
https://paulmck.livejournal.com/65056.html
We have taken a quick trip through history, through a number of the differences between the Linux kernel and the C/C++ memory models, sequence locks, RCU, ownership, zombie pointers, and KCSAN. I give a big "thank you" to everyone who has contributed to this discussion, both publicly and privately. It has been an excellent learning experience for me, and I hope that it has also been helpful to all of you.<br /><br />To date, <a href="https://paulmck.livejournal.com/63957.html" target="_blank">Can Rust Code Own Sequence Locks?</a> has proven the most popular by far. Porting Linux-kernel code using sequence locking to Rust turns out to be trickier than one might expect, in part due to the inherently data-racy nature of this synchronization primitive.<br /><br />So what are those advocating use of Rust within the Linux kernel to do?<br /><br />The first thing is to keep in mind that the Linux kernel comprises tens of millions of lines of code. One disadvantage of this situation is that the Linux kernel is not going to be ported to much of anything quickly or easily. One corresponding advantage is that it allows Rust-for-Linux developers to carefully pick their battles, focusing first on those portions of the Linux kernel where conversion to Rust might do the most good. Given that order-of-magnitude performance wins are unlikely, the focus is likely to be on proof of concept and on reduced bug rates. Given Rust's heavy focus on bugs stemming from undefined behavior, and given the Linux kernel's use of the <tt>-fno-strict-aliasing</tt> and <tt>-fno-strict-overflow</tt> compiler command-line options, bug-reduction choices will need to be made quite carefully. In addition, order-of-magnitude bug-rate differences across the source base provide a high noise floor that makes it more difficult to measure small bug-reduction rates. But perhaps enabling the movement of code into mainline and out of staging can be another useful goal. Or perhaps there is an especially buggy driver out there somewhere that could make good use of some Rust code.<br /><br />Secondly, careful placement of interfaces between Rust and C code is necessary, especially if there is truth to the rumors that Rust does not inline C functions without the assistance of LTO. In addition, devilish details of Linux-kernel synchronization primitives and dependency handling may further constrain interface boundaries.<br /><br />Finally, keep in mind that the Linux kernel is not written in standard C, but rather in a dialect that relies on gcc extensions, code-style standards, and external tools. Mastering these extensions, standards, and tools will of course require substantial time and effort, but on the other hand this situation provides precedent for Rust developers to also rely on extensions, style standards, and tools.<br /><br />But what about the question that motivated this blog in the first place? What memory model should Linux-kernel Rust code use?<br /><br />For Rust outside of the Linux kernel, the current state of compiler backends and the path of least resistance likely leads to something resembling the C/C++ memory model. Use of Rust in the Linux kernel is therefore not a substitute for continued participation in the C/C++ standards committees, much though some might wish otherwise.<br /><br />Rust non-<tt>unsafe</tt> code does not depend much on the underlying memory model, at least assuming that <tt>unsafe</tt> Rust code and C code avoid undermining the data-race-free assumption of non-<tt>unsafe</tt> code.
The need to preserve this assumption in fact inspired the blog post discussing <a href="https://paulmck.livejournal.com/64970.html" target="_blank">KCSAN</a>.<br /><br />In contrast, Rust <tt>unsafe</tt> code must pay close attention to the underlying memory model, and in the case of the Linux kernel, the only reasonable choice is of course the Linux-kernel memory model. That said, any useful memory-ordering tool will also need to pay attention to safe Rust code in order to correctly evaluate outcomes. However, there is substantial flexibility, depending on exactly where the interfaces are placed:<br /><ol><br /><li> As noted above, forbidding use of Rust <tt>unsafe</tt> code within the Linux kernel would render this memory-model question moot. Data-race freedom for the win!<br /><li> Allowing Rust code to access shared variables only via atomic operations with acquire (for loads), release (for stores) or stronger ordering would allow Rust <tt>unsafe</tt> code to use that subset of the Linux-kernel memory model that mirrors the C/C++ memory model.<br /><li> Restricting use of sequence locks, RCU, control dependencies, lockless atomics, and non-standard locking (e.g., stores to shared variables within read-side reader-writer-locking critical sections) to C code would allow Rust <tt>unsafe</tt> code to use that subset of the Linux-kernel memory model that closely (but sadly, not exactly) mirrors the C/C++ memory model.<br /><li> Case-by-case relaxation of the restrictions called out in the preceding pair of items pulls in the corresponding portions of the Linux-kernel memory model. For example, adding sequence locking, but only in cases where readers access only a limited co-located set of objects, might permit some of the simpler Rust sequence-locking implementations to be used (see for example the comments to the <a href="https://paulmck.livejournal.com/63957.html" target="_blank">sequence-locking</a> post).<br /><li> Insisting that Rust be able to do everything that Linux-kernel C code currently does pulls in the entirety of the Linux-kernel memory model, including those parts that have motivated many of my years of C/C++ standards-committee fun and excitement.<br /></ol><br />With the first three options, a Rust compiler adhering to the C/C++ memory model will work just fine with Linux-kernel C code. In contrast, the last two options would require code-style restrictions (preferably automated), just as is already the case for current Linux-kernel C code. See the <a href="https://paulmck.livejournal.com/65341.html" target="_blank">recommendations</a> post for more detail.<br /><br />In short, choose wisely and be very careful what you wish for! ;-)<br /><br /><h4>History</h4><i>October 7, 2021: Added atomic-operation-only option to memory-model spectrum.</i><br /><i>October 8, 2021: Fix typo noted by Miguel Ojeda</i><br /><i>October 12, 2021: Self-review.</i><br /><i>October 13, 2021: Add reference to recommendations post.</i>
https://paulmck.livejournal.com/65056.html?view=comments#comments
lkmm
rust
linux
public
0
-
https://paulmck.livejournal.com/64970.html
Wed, 06 Oct 2021 19:11:32 GMT
Can the Kernel Concurrency Sanitizer Own Rust Code?
paulmck
https://paulmck.livejournal.com/64970.html
Given the data-race-freedom guarantees of Rust's non-<tt>unsafe</tt> code, one might reasonably argue that there is no point in the Kernel Concurrency Sanitizer (KCSAN) analyzing such code. However, the Linux kernel is going to need <tt>unsafe</tt> Rust code, and although there has been significant progress in formal verification of such code, one of the leading researchers in this area says “<a href="https://cacm.acm.org/magazines/2021/4/251364-safe-systems-programming-in-rust/fulltext" target="_blank" rel="nofollow">we hope to eventually develop formal methods that can be put directly in the hands of programmers</a>”. We would be wise to take careful note of the word “eventually”. Furthermore, even given unanticipated universal acclamation of Rust within the Linux kernel community combined with equally unanticipated advances in C-to-Rust translation capabilities, a significant fraction of the existing tens of millions of lines of Linux-kernel C code will persist for some time to come. Both the <tt>unsafe</tt> Rust code and the C code can interfere with Rust non-<tt>unsafe</tt> code, and furthermore there are special cases where <a href="https://paulmck.livejournal.com/63957.html?thread=475605#t475605" target="_blank">safe code can violate unsafe code's assumptions</a>. Therefore, run-time analysis of Rust code (safe code included) is likely to be able to find issues that compile-time analysis cannot.<br /><br />As a result, Rust seems unlikely to render KCSAN obsolete any time soon. This suggests that Rust code should include KCSAN instrumentation, both via the compiler and via atomic and volatile primitives, as is already the case for the C compilers and primitives currently in use. The public KCSAN interface is as follows:<br /><ol><br /><li> <tt>__kcsan_check_access()</tt>: This function makes a relevant memory access known to KCSAN. It takes a pointer to the object being accessed, the size of the object, and an access type (for example, zero for read, <tt>KCSAN_ACCESS_WRITE</tt> for write, and <tt>KCSAN_ACCESS_ATOMIC</tt> for atomic read-modify-write).<br /><li> The <tt>__tsan_{read,write}{1,2,4,8}()</tt> family of functions serve the same purpose as <tt>__kcsan_check_access()</tt>, but the size and access type are implicit in the function name instead of being passed as arguments, which can provide better performance. This family of functions is used by KCSAN-enabled compilers.<br /><li> <tt>ASSERT_EXCLUSIVE_ACCESS()</tt> is invoked from C code and causes KCSAN to complain if there is a concurrent access to the specified object. A companion <tt>ASSERT_EXCLUSIVE_ACCESS_SCOPED()</tt> expands the scope of the concurrent-access complaint to the full extent of the compound statement invoking this macro.<br /><li> <tt>ASSERT_EXCLUSIVE_WRITER()</tt> is invoked from C code and causes KCSAN to complain if there is a concurrent write to the specified object. A companion <tt>ASSERT_EXCLUSIVE_WRITER_SCOPED()</tt> expands the scope of the concurrent-writer complaint to the full extent of the compound statement invoking this macro.<br /><li> <tt>ASSERT_EXCLUSIVE_BITS()</tt> is similar to <tt>ASSERT_EXCLUSIVE_WRITER()</tt>, but focuses KCSAN's attention on changes to bits set in the mask passed as this macro's second argument.<br /></ol><br />These last three categories of interface members are designed to be directly invoked in order to check concurrency designs.
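<br /><br />For example, a C-language update-side function might encode part of its concurrency design as follows. This is a sketch: the structure and function are hypothetical, but <tt>ASSERT_EXCLUSIVE_WRITER()</tt> and <tt>WRITE_ONCE()</tt> are real Linux-kernel primitives:<br /><br /><blockquote><pre>
#include <linux/compiler.h>
#include <linux/kcsan-checks.h>

struct foo {
  int state; /* Concurrent reads are expected, concurrent writes are not. */
};

void foo_set_state(struct foo *fp, int new_state)
{
  /* KCSAN complains if another thread writes ->state concurrently. */
  ASSERT_EXCLUSIVE_WRITER(fp->state);
  WRITE_ONCE(fp->state, new_state);
}
</pre></blockquote>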
See for example the direct call to <tt>ASSERT_EXCLUSIVE_WRITER()</tt> from the <tt>rcu_cpu_starting()</tt> function in Linux-kernel RCU. It would of course be quite useful for these interface members to be available from within Rust code.<br /><br />KCSAN also provides a <tt>data_race()</tt> primitive that suppresses data-race diagnostics, but allows the compiler free rein on optimization. This is used in the Linux kernel for things like infrequent diagnostics that are not part of the overall concurrency design.<br /><br />KCSAN has helped me fix a number of embarrassing bugs in Linux-kernel RCU that might otherwise have inconvenienced end users, so it just might be helpful to people introducing Rust code into the Linux kernel. It is therefore quite encouraging that Marco Elver (KCSAN lead developer and maintainer) points out off-list that the <tt>rustc</tt> compiler already has <a href="https://rustc-dev-guide.rust-lang.org/sanitizers.html" target="_blank" rel="nofollow">sanitizer support</a>, which should suffice not only for KCSAN, but for the equally helpful Kernel Address Sanitizer (KASAN) as well. He also notes that some care is required to pass in the relevant <tt>-mllvm</tt> options to <tt>rustc</tt> and to ensure that it is not attempting to link against <tt>compiler-rt</tt> because doing so is not appropriate when building the Linux kernel. Marco also notes that KCSAN relies on the front-end to properly handle volatile accesses, and Miguel Ojeda kindly <a href="https://godbolt.org/z/hsnozhvc4" target="_blank" rel="nofollow">confirmed that it does</a>. So there is hope!<br /><br />For more information on KCSAN and its reason for being:<br /><ol><br /><li> "Concurrency bugs should fear the big bad data-race detector" (<a href="https://lwn.net/Articles/816850/" target="_blank" rel="nofollow">Part 1</a> and <a href="https://lwn.net/Articles/816854/" target="_blank" rel="nofollow">Part 2</a>).<br /><li> <a href="https://lwn.net/Articles/793253/" target="_blank" rel="nofollow">Who's afraid of a big bad optimizing compiler?</a>.<br /><li> <a href="https://lwn.net/Articles/799218/" target="_blank" rel="nofollow">Calibrating your fear of big bad optimizing compilers</a>.<br /><li> Sections 4.3.4 ("Accessing Shared Variables"), 15.2 ("Tricks and Traps"), and 15.3 ("Compile-Time Consternation") of <a href="https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-e2.pdf" target="_blank" rel="nofollow">perfbook</a>.<br /></ol><br /><br /><h4>History</h4><i>October 7, 2021: Add current state of KCSAN/Rust integration. Clarify Rust module per Gary Guo email.</i><br /><i>October 12, 2021: Self-review.</i><br /><i>October 28, 2021: Add <tt>data_race()</tt> primitive.</i>
https://paulmck.livejournal.com/64970.html?view=comments#comments
lkmm
rust
linux
public
0
-
https://paulmck.livejournal.com/64730.html
Tue, 05 Oct 2021 18:40:36 GMT
Will Your Rust Code Survive the Attack of the Zombie Pointers?
paulmck
https://paulmck.livejournal.com/64730.html
Some of the previous posts in this series have been said to be quite difficult, so I figured I owed you all an easy one. And the zombie-pointer problem really does have a trivial solution, at least in the context of the Linux kernel. In other environments, all bets are off.<br /><br />But first, what on earth is a zombie pointer?<br /><br />To answer that question, let's start with a last-in-first-out (LIFO) stack. We don't know when this algorithm was invented or who invented it, but a very similar algorithm was described in R. J. Treiber's classic 1986 technical report (IBM Almaden Research Center RJ 5118), and it was referred to as prior art in (expired) US Patent 3,886,525, which was filed in 1973 (hat tip to Maged Michael). Fun though it might be to analyze this code's original IBM assembly-language implementation, let's instead look at a C11 implementation:<br /><br /><blockquote><pre>
 1 struct node_t* _Atomic top;
 2
 3 void list_push(value_t v)
 4 {
 5   struct node_t *newnode = (struct node_t *) malloc(sizeof(*newnode));
 6
 7   set_value(newnode, v);
 8   newnode->next = atomic_load(&top);
 9   do {
10     // newnode->next may have become invalid
11   } while (!atomic_compare_exchange_weak(&top, &newnode->next, newnode));
12 }
13
14 void list_pop_all()
15 {
16   struct node_t *p = atomic_exchange(&top, NULL);
17
18   while (p) {
19     struct node_t *next = p->next;
20
21     foo(p);
22     free(p);
23     p = next;
24   }
25 }
</pre></blockquote><br />Those used to the Linux-kernel <tt>cmpxchg()</tt> atomic operation should keep in mind that when the C11 <tt>atomic_compare_exchange_weak()</tt> fails, it writes the current value referenced by its first argument to the pointer passed as its second argument. In other words, instead of passing <tt>atomic_compare_exchange_weak()</tt> the expected old value, you instead pass it a pointer to that expected value. Each time <tt>atomic_compare_exchange_weak()</tt> fails, it updates the pointed-to value to match the value just loaded. This sometimes allows this function to be used in a tight loop, as in line 11 above.<br /><br />With these <tt>atomic_compare_exchange_weak()</tt> semantics in mind, let's step through execution of <tt>list_push(C)</tt> on a stack initially containing objects A and B:<br /><ol><br /><li> Line 5 allocates memory for the new object C.<br /><li> Line 7 initializes this newly allocated memory.<br /><li> Line 8 sets the new object's <tt>->next</tt> pointer to the current top-of-stack pointer.<br /><li> Line 11 invokes <tt>atomic_compare_exchange_weak()</tt>, which atomically compares the value of <tt>newnode->next</tt> to the value of <tt>top</tt>, and if they are equal, stores the pointer <tt>newnode</tt> into <tt>top</tt> and returns <tt>true</tt>. (As noted before, if the comparison is instead not-equal, <tt>atomic_compare_exchange_weak()</tt> instead stores the value of <tt>top</tt> into <tt>newnode->next</tt> and returns <tt>false</tt>, so that the loop repeats until the pointers compare equal.)<br /><li> One way or another, the stack ends up containing objects C, A, and B, ordered from the top of stack down.<br /></ol><br />Oddly enough, this code would still work even if line 8 were omitted. However, that would result in a high probability that the first call to <tt>atomic_compare_exchange_weak()</tt> would fail, which would not be so good for common-case performance. It would also result in compile-time complaints about uninitialized variables, so line 8 is doubly unlikely to be omitted in real life.<br /><br />While we are at it, let's step through <tt>list_pop_all()</tt>:<br /><ol><br /><li> Line 16 atomically stores <tt>NULL</tt> into <tt>top</tt> and returns its previous value. After this line executes, the stack is empty and its previous contents are referenced by local variable <tt>p</tt>.<br /><li> Lines 18-24 execute for each object on list <tt>p</tt>:<ol><li> Line 19 prefetches a pointer to the next object.<br /><li> Line 21 passes the current object to function <tt>foo()</tt>.<br /><li> Line 22 frees the current object.<br /><li> Line 23 advances to the next object.<br /></ol></ol><br />Thus far, we have seen no zombie pointers. Let's therefore consider the following sequence of events, with the stack initially containing objects A and B:<br /><ol><br /><li> Thread 1 starts pushing object C, but is interrupted, preempted, or otherwise impeded just after executing line 8.<br /><li> Thread 2 pops the entire list and frees all of the objects. Object C's <tt>->next</tt> pointer is now invalid because it points to now-freed memory formerly known as object A.<br /><li> Thread 2 pushes an object, and happens to allocate memory for it at the same address as the old object A, so let's call it A'.<br /><li> Object C's <tt>->next</tt> pointer is still invalid as far as an omniscient compiler is concerned, but from an assembly-language viewpoint happens to reference the type-compatible object A'.
This pointer is now a "zombie pointer" that has come back from the dead, or that has at least come to have a more entertaining form of invalidity.<br /><li> Thread 1 resumes and executes the <tt>atomic_compare_exchange_weak()</tt> on line 11. Now this primitive, like its Linux-kernel counterpart <tt>cmpxchg()</tt>, operates not on the compiler's idea of what the pointer is, but rather on the actual bits making up that pointer. Therefore, that <tt>atomic_compare_exchange_weak()</tt> succeeds despite the fact that object C's <tt>->next</tt> pointer is invalid.<br /><li> Thread 1 has thus completed its push of object C, and the resulting stack looks great from an assembly-language viewpoint. But the pointer from object C to A' is a zombie pointer, and compilers are therefore within their rights to do arbitrarily strange things with it.<br /></ol><br />As noted earlier, this has a trivial solution within the Linux kernel. Please do not <a href="https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/Answers/Rust/linuxzombiesolution.html" target="_blank" rel="nofollow">peek</a> too early. ;-)<br /><br />But what can be done outside of the Linux kernel?<br /><br />The traditional solution has been to hide the allocator from the compiler. After all, if the compiler cannot see the calls to <tt>malloc()</tt> and <tt>free()</tt>, it cannot figure out that a given pointer is invalid (or, in C-standard parlance, indeterminate). Creating your own allocator with its own function names suffices, and back in the day you often needed to create your own allocator if performance and scalability were important to you.<br /><br />However, this would also deprive Rust of information that it needs to check pointer operations. So maybe Rust needs to know about allocation and deallocation, but needs to be very careful not to pass this information on to the optimizer, assuming that this even makes sense in the context of the implementation of Rust.<br /><br />Perhaps code working with invalid and zombie pointers needs to be carefully written in C. And perhaps there is some way to write it safely in <tt>unsafe</tt> Rust.<br /><br />The final approach is to wait until backend support for safe handling of invalid/zombie pointers arrives, and then add Rust syntax as appropriate. Here is some documentation of current efforts towards adding such support to C++:<br /><ol><br /><li> <a href="https://www.youtube.com/watch?v=7bZQeOGhK84" target="_blank" rel="nofollow">CPPCON 2020 presentation</a><br /><li> <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1726r5.pdf" target="_blank" rel="nofollow">Pointer lifetime-end zap (informational/historical)</a><br /><li> <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2414r1.pdf" target="_blank" rel="nofollow">Pointer lifetime-end zap proposed solutions</a><br /></ol><br />So there is good progress, but it might still be some time before this is supported by either the standard or compilers.<br /><br /><h4>History</h4><i>October 6, 2021: Expand description per feedback from Jonathan Corbet</i><br /><i>October 12, 2021: Self-review.</i>
https://paulmck.livejournal.com/64730.html?view=comments#comments
lkmm
rust
linux
public
2
-
https://paulmck.livejournal.com/64392.html
Mon, 04 Oct 2021 23:37:09 GMT
How Much of the Kernel Can Rust Own?
paulmck
https://paulmck.livejournal.com/64392.html
Rust concurrency makes heavy use of ownership and borrowing. The purpose of this post is not to give an exposition of Rust's capabilities and limitations in this area, but rather to give a series of examples of ownership in the Linux kernel.<br /><br />The first example involves Linux-kernel per-CPU variables. In some cases, such variables are protected by per-CPU locks, for example, a number of fields in the per-CPU <tt>rcu_data</tt> structure are used by the kernel threads that manage grace periods for offloaded callbacks, and these fields are protected by the <tt>->nocb_gp_lock</tt> field in the same instance of that same structure. In other cases, access to a given per-CPU variable is permitted only by the corresponding CPU, and even then only if that CPU has disabled preemption. For example, the per-CPU <tt>rcu_data</tt> structure's <tt>->ticks_this_gp</tt> field may be updated only from the corresponding CPU, and only when preemption is disabled. In this particular case, preemption is disabled as a side-effect of having disabled interrupts.<br /><br />The second example builds on the first. In kernels built with <tt>CONFIG_RCU_NOCB_CPU=n</tt>, the per-CPU <tt>rcu_data</tt> structure's <tt>->cblist</tt> field may be updated from the corresponding CPU, but only when preemption is disabled. However, it is also allowed from some other CPU when the corresponding CPU has been taken offline, but only when that other CPU is the one orchestrating the offlining of the corresponding CPU.<br /><br />(What about kernels built with <tt>CONFIG_RCU_NOCB_CPU=y</tt>? They must also acquire a <tt>->nocb_lock</tt> that is also contained within the per-CPU <tt>rcu_data</tt> structure.)<br /><br />The third example moves to the non-per-CPU <tt>rcu_node</tt> structures, which are arranged into a combining tree that processes events from an arbitrarily large number of CPUs while maintaining bounded lock contention. Each <tt>rcu_node</tt> structure has a <tt>->qsmask</tt> bitmask that tracks which of its children need to report a quiescent state for the current grace period, and a <tt>->gp_tasks</tt> pointer that, when non-<tt>NULL</tt>, references a list of tasks that blocked while in an RCU read-side critical section that blocks the current grace period. The children of each leaf <tt>rcu_node</tt> structure are the <tt>rcu_data</tt> structures feeding into that <tt>rcu_node</tt> structure. Finally, there is a singleton <tt>rcu_state</tt> structure that contains a <tt>->gp_seq</tt> field that numbers the current grace period and also indicates whether or not it is in progress.<br /><br />We now have enough information to state the ownership rule for the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt> field. This field may be updated only if all of the following hold:<br /><ol><br /><li> The current CPU holds the root <tt>rcu_node</tt> structure's <tt>->lock</tt>.<br /><li> All <tt>rcu_node</tt> structures' <tt>->qsmask</tt> fields are zero.<br /><li> All <tt>rcu_node</tt> structures' <tt>->gp_tasks</tt> pointers are <tt>NULL</tt>.<br /></ol><br />This might seem excessively ornate, but it is the natural consequence of RCU's semantics and design:<br /><ol><br /><li> The next grace period is not permitted to start until after the previous one has ended.
This is a design choice that allows RCU to function well in the face of update-side overload from large numbers of CPUs.<br /><li> A grace period is not allowed to end until all CPUs have been observed in a quiescent state, that is, until all <tt>rcu_node</tt> structures' <tt>->qsmask</tt> fields have become zero.<br /><li> A grace period is not allowed to end until all tasks that were preempted while executing within a critical section blocking that grace period have resumed and exited their critical sections, that is, until all <tt>rcu_node</tt> structures' <tt>->gp_tasks</tt> pointers have become <tt>NULL</tt>.<br /><li> If multiple CPUs would like to start a grace period at the same time, there has to be something that works out which CPU is actually going to do the starting, and the last part of that something is the root <tt>rcu_node</tt> structure's <tt>->lock</tt>.<br /></ol><br />Trust me, there are far more complex ownership models in the Linux kernel!<br /><br />The fourth example involves per-CPU data that is protected by a reader-writer lock. A CPU is permitted to both read and update its own data only if it read-holds the lock. A CPU is permitted to read other CPUs' data only if it write-holds the lock. That is right, CPUs are allowed to write when read-holding the lock and are allowed to read while write-holding the lock. Perhaps reader-writer locks should instead have been called exclusive-shared locks, but it is some decades too late now!<br /><br />The fifth and final example involves multiple locks, for example, when some readers must traverse the data structure from an interrupt handler and others must sleep while traversing that same structure. One way to implement this is to have one interrupt-disabled spinlock (for readers in interrupt handlers) and a mutex (for readers that must sleep during the traversal). Updaters must then hold both locks.<br /><br />So how on earth does the current C-language Linux kernel code keep all this straight???<br /><br />One indispensable tool is the assertion, which in the Linux kernel includes <tt>WARN()</tt>, <tt>BUG()</tt>, and friends. For example, RCU uses these to verify the values of the aforementioned <tt>->qsmask</tt> and <tt>->gp_tasks</tt> fields during between-grace-periods traversals of the <tt>rcu_node</tt> combining tree.<br /><br />Another indispensable tool is lockdep, which, in addition to checking for deadlocks, allows code to verify that conditions are right. A few of the many available lockdep assertions are shown below:<br /><ol><br /><li> <tt>lockdep_assert_held()</tt>: Complains if the specified lock is not held by the current CPU/task.<br /><li> <tt>lockdep_assert_held_write()</tt>: Complains if the specified reader-writer lock is not write-held by the current CPU/task.<br /><li> <tt>lockdep_assert_held_read()</tt>: Complains if the specified reader-writer lock is not read-held by the current CPU/task.<br /><li> <tt>lockdep_assert_none_held_once()</tt>: Complains if the current CPU/task holds any locks.<br /><li> <tt>lockdep_assert_irqs_disabled()</tt>: Complains if the current CPU has interrupts enabled.<br /><li> <tt>rcu_read_lock_held()</tt>: Complains if the current CPU/task is not within an RCU read-side critical section.<br /></ol><br />Of course, none of these assertions are going to do anything for you unless you make use of them.
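<br /><br />For example, an update function might enforce its ownership rule at runtime as follows. This is a sketch with hypothetical structure and function names, but <tt>lockdep_assert_held()</tt> is the real Linux-kernel primitive:<br /><br /><blockquote><pre>
#include <linux/lockdep.h>
#include <linux/spinlock.h>

struct my_dev {
  spinlock_t lock;
  int state; /* Protected by ->lock. */
};

static void my_dev_set_state(struct my_dev *md, int new_state)
{
  lockdep_assert_held(&md->lock); /* Complain unless the caller holds ->lock. */
  md->state = new_state;
}
</pre></blockquote>Assertions of this sort document the ownership rule at the point of use and recruit lockdep to enforce it on every call.<br /><br />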
On the other hand, some of the more complex ownership criteria are going to be quite difficult for compilers or other tooling to intuit without significant help from the developer.<br /><br />A more entertaining ownership scenario is described in Section 5.4.6 ("Applying Exact Limit Counters") of <a href="https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-e2.pdf" target="_blank" rel="nofollow">perfbook</a>. Chapter 8 ("Data Ownership") describes several other ownership scenarios.<br /><br /><h4>History</h4><i>October 12, 2021: Self-review.</i>
https://paulmck.livejournal.com/64392.html?view=comments#comments
lkmm
rust
linux
public
0
-
https://paulmck.livejournal.com/64209.html
Mon, 04 Oct 2021 19:41:18 GMT
Can Rust Code Own RCU?
paulmck
https://paulmck.livejournal.com/64209.html
<a href="https://www2.rdrop.com/users/paulmck/RCU/RCU.2018.11.21c.PSU-full.pdf" target="_blank" rel="nofollow">Read-copy update (RCU)</a> vaguely resembles a reader-writer lock [1], but one in which readers do not exclude writers. This change in semantics permits RCU readers to be exceedingly fast and scalable. In fact, in the most aggressive case, <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> generate no code, and <tt>rcu_dereference()</tt> emits but a single load instruction. This most aggressive case is achieved in production via Linux-kernel <tt>CONFIG_PREEMPT_NONE=y</tt> builds. Additional information on RCU is presented at the end of this post.<br /><br /><h2>Can Rust Ownership be Adapted to RCU?</h2>The fact that RCU readers do not exclude writers opens the door to data races between RCU readers and updaters. This in turn suggests that at least some RCU use cases are likely to pose challenges to Rust's ownership semantics. However, discussions with Wedson Almeida Filho at <a href="https://www.linuxplumbersconf.org/" target="_blank" rel="nofollow">Linux Plumbers Conference</a> suggested that these semantics might be flexed a little bit, perhaps in a manner similar to the way that Linux-kernel RCU <a href="https://lwn.net/Articles/371986/" target="_blank" rel="nofollow">flexes</a> <a href="https://lwn.net/Articles/185666/" target="_blank" rel="nofollow">lockdep's</a> semantics. In addition, the data races between the read-side <tt>rcu_dereference()</tt> primitive and the update-side <tt>rcu_assign_pointer()</tt> primitive might reasonably be resolved by having the Rust implementation of these two primitives use <tt>unsafe</tt> mode. Or, perhaps better yet, the Rust implementation of these two primitives might simply wrapper the Linux-kernel primitives in order to get the benefit of existing tools such as the aforementioned lockdep and the <a href="https://lwn.net/Articles/87538/" target="_blank" rel="nofollow">sparse</a> static-analysis checker. One downside of this approach is potential performance degradation due to the extra function call implied by the wrappering. Longer term, LTO might eliminate this function call, but if the initial uses of Rust are confined to performance-insensitive device drivers, the extra overhead should not be a problem.<br /><br /><h2>Linux-Kernel Tooling and RCU</h2>Here are a few examples of these Linux-kernel tools:<br /><ol><br /><li> A pointer can be marked <tt>__rcu</tt>, which will cause the <tt>sparse</tt> static-analysis checker to complain if that pointer is accessed directly, instead of via <tt>rcu_dereference()</tt> and <tt>rcu_assign_pointer()</tt>. This checking can help developers avoid accidental destruction of the <a href="https://paulmck.livejournal.com/63316.html" target="_blank">address and data dependencies</a> on which many RCU use cases depend.<br /><li> In most RCU use cases, the <tt>rcu_dereference()</tt> primitive that accesses RCU-protected pointers must itself be protected by <tt>rcu_read_lock()</tt>. The Linux-kernel lockdep facility has therefore been adapted to complain about unprotected uses of <tt>rcu_dereference()</tt>.<br /><li> Passing the same pointer twice in quick succession to a pair of <tt>call_rcu()</tt> invocations is just as bad as doing the same with a pair of <tt>kfree()</tt> invocations: Both situations are a double free. The Linux kernel's debug-objects facility checks for this sort of abuse.<br /></ol><br />This is by no means a complete list.
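<br /><br />To make the first two items concrete, a reader might look like the following sketch, in which the structure, its field, and the global pointer are hypothetical, while the RCU API members and the <tt>__rcu</tt> marking are real:<br /><br /><blockquote><pre>
#include <linux/rcupdate.h>

struct foo {
  int a;
};

static struct foo __rcu *gp; /* Hypothetical RCU-protected pointer. */

static int foo_get_a(void)
{
  struct foo *p;
  int ret = -1;

  rcu_read_lock();
  p = rcu_dereference(gp); /* lockdep complains if the reader is unprotected, */
  if (p)                   /* and sparse complains if gp is accessed directly. */
    ret = p->a;
  rcu_read_unlock();
  return ret;
}
</pre></blockquote>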
Although I am by no means claiming that these sorts of tools provide the same degree of safety as does Rust safe mode, it is important to understand that Rust <tt>unsafe</tt> mode is not to be compared to standard C, but rather to the dialect of C used in the Linux kernel, augmented by the associated tools, processes, and coding guidelines.<br /><br /><h2>RCU Use Cases</h2>There are in fact Rust implementations of variants of RCU, including:<br /><ol><br /><li> <a href="https://github.com/crossbeam-rs/crossbeam/commit/d9b1e3429450a64b490f68c08bd191417e68f00c" target="_blank" rel="nofollow">crossbeam-rs</a><br /><li> <a href="https://docs.rs/arc-swap/1.4.0/arc_swap/struct.ArcSwapAny.html#method.rcu" target="_blank" rel="nofollow">ArcSwapAny</a><br /></ol>I currently have no opinion on whether or not these could be used within the Linux kernel, and any such opinion is irrelevant. You see, Rust's use of RCU within the Linux kernel would need to access RCU-protected data structures provided by existing C code. In this case, it will not help for Rust code to define its own RCU. It must instead interoperate with the existing Linux-kernel RCU implementations.<br /><br />But even interoperating with Linux-kernel RCU is not trivial. To help with this task, this section presents a few RCU use cases within the Linux kernel.<br /><br />The first use case is the textbook example where once an RCU-protected object is exposed to readers, its data fields remain constant. This approach allows Rust's normal read-only mode of ownership to operate as intended. The pointers will of course change as objects are inserted or deleted, but these are handled by <tt>rcu_dereference()</tt> and <tt>rcu_assign_pointer()</tt> and friends. These special primitives might be implemented using <tt>unsafe</tt> Rust code or C code, as the case might be.<br /><br />In-place updates of RCU-protected objects are sometimes handled using the <tt>list_replace_rcu()</tt> primitive. In this use case, a new version of the object is allocated, the old version is copied to the new version, any required updates are carried out on the new version, and then <tt>list_replace_rcu()</tt> is used to make the new version available to RCU readers. Readers then see either the old version or the new version, but either way they see an object with constant data fields, again allowing Rust ownership to work as intended.<br /><br />However, one complication arises for objects removed from an RCU-protected data structure. Such objects are not owned by the updater until after a grace period elapses, and they can in fact be accessed by readers in the meantime. It is not clear to me how Rust would handle this. For purposes of comparison, within the Linux kernel, the <a href="https://lwn.net/Articles/802128/" target="_blank" rel="nofollow">Kernel Concurrency Sanitizer (KCSAN)</a> handles this situation using dynamic runtime checking. With this checking enabled, when an updater takes full ownership of a structure before readers are done with it, KCSAN reports a data race. These issues should be most immediately apparent to anyone attempting to create a Rust wrapper for the <tt>call_rcu()</tt> function.<br /><br />Because RCU readers do not exclude RCU updaters, it is possible for an RCU reader to upgrade itself to an updater while traversing an RCU-protected data structure. This is usually handled by a lock embedded in the object itself. 
From what I understand, the easiest way to apply Rust ownership to these situations is to make the RCU-protected object contain a structure that in turn contains the data protected by the lock. People wishing to apply Rust to these sorts of situations are invited to review the Linux-kernel source code to check for possible complications, for example, cases where readers locklessly read fields that are updated under the protection of the lock, and cases where multiple locks provide one type of protection or another to overlapping subsets of fields in the RCU-protected object.<br /><br />Again because RCU readers do not exclude RCU updaters, there can also be lockless in-place updates (either from readers or updaters) to RCU-protected objects, for example, to set flags, update statistics, or to provide type-safe memory for any number of lockless algorithms (for example, using <tt>SLAB_TYPESAFE_BY_RCU</tt>). Rust could presumably use appropriate atomic or volatile operations in this case.<br /><br />One important special case of RCU readers locklessly reading updated-in-place data is the case where readers cannot be allowed to operate on an object that has been removed from the RCU-protected data structure. These situations normally involve a per-object flag and lock. Updaters acquire the object's lock, remove the object from the structure, set the object's flag, and release the lock. Readers check the flag, and if the flag is set, pretend that they did not find the corresponding object. This class of use cases illustrates how segregating read-side and update-side fields of an RCU-protected object can be non-trivial.<br /><br />RCU is sometimes used to protect combinations of data structures, sometimes involving nested RCU readers. In some cases, a given RCU read-side critical section might span a great many functions across several translation units.<br /><br />It is not unusual for RCU use cases to include a search function that is invoked both by readers and updaters. This function will therefore sometimes be within an RCU read-side critical section and other times be protected by the update-side lock (or by some other update-side synchronization mechanism).<br /><br /><a href="https://paulmck.livejournal.com/63957.html" target="_blank">Sequence locking</a> is sometimes used in conjunction with RCU, so that RCU protects traversal of the data structure and sequence locking detects updates profound enough to cause problems for the RCU traversals. The poster boy for this use case is in the Linux kernel's directory-entry cache. In this case, certain sequences of rename operations could fool readers into traversing pathnames that never actually existed. Using the sequence lock named <tt>rename_lock</tt> to protect such traversals allows readers to reject such bogus pathnames. More information on the directory-entry cache is available in Neil Brown's excellent Linux Weekly News series (<a href="https://lwn.net/Articles/649115/" target="_blank" rel="nofollow">part 1</a>, <a href="https://lwn.net/Articles/649729/" target="_blank" rel="nofollow">part 2</a>, <a href="https://lwn.net/Articles/650786/" target="_blank" rel="nofollow">part 3</a>). One key take-away of this use case is that sequence locking is sometimes used to protect arbitrarily large data structures.<br /><br />Sequence locking can be used to easily provide readers with a consistent view of a data structure, for a very wide range of definitions of "consistent". However, all of these techniques allow updaters to block readers.
There are a number of other approaches to read-side consistency both within and across data structures, including the <a href="https://www.rdrop.com/~paulmck/RCU/Updates.2017.09.11c.SVLUG.pdf" target="_blank" rel="nofollow">Issaquah Challenge</a>.<br /><br />One final use case is phased state change, where RCU readers check a state variable and take different actions depending on the current state. Updaters can change the state variable, but must wait for the completion of all readers that might have seen the old value before taking further action. The rcu-sync functionality (<tt>rcu_sync_init()</tt> and friends) implements a variant of RCU-protected phased state change. This use case is notable in that there are no linked data structures and nothing is ever freed.<br /><br />This is by no means an exhaustive list of RCU use cases. RCU has been in the Linux kernel for almost 20 years, and during that time kernel developers have come up with any number of new ways of applying RCU to concurrency problems.<br /><br /><h2>Rust RCU Options</h2>The most straightforward approach is to provide Rust wrappers for the relevant C-language RCU API members. As noted earlier, wrappering the low-overhead read-side API members will introduce function-call overhead in non-LTO kernels, but this should not be a problem for performance-insensitive device drivers. However, fitting RCU into Rust's ownership model might not be as pretty as one might hope.<br /><br />A more utopian approach would be to extend Rust's ownership model, one way or another, to understand RCU. One approach would be for Rust to gain "type safe" and "guaranteed existence" ownership modes, which would allow Rust to better diagnose abuses of the RCU API. For example, this might help with pointers to RCU-protected objects being erroneously leaked from RCU read-side critical sections.<br /><br />So how might a pointer to an RCU-protected object be non-erroneously leaked from an RCU read-side critical section, you ask? One way is to acquire a reference via a reference count within that object before exiting that critical section, as sketched below. Another way is to acquire a lock within that object, again before exiting that critical section.<br /><br />So what does the Linux kernel currently do to find this type of pointer leak? One approach is to enable KCSAN, build the kernel with <tt>CONFIG_RCU_STRICT_GRACE_PERIOD=y</tt>, and run the result on a small system. 
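<br /><br />Here is a rough C-language sketch of the reference-count variant of such a non-erroneous leak. The <tt>struct baz</tt> type, the <tt>find_baz()</tt> lookup, and the <tt>get_baz()</tt> wrapper are all hypothetical, but <tt>refcount_inc_not_zero()</tt> is the actual Linux-kernel primitive for this job.<br /><pre>
/* Sketch only: struct baz, find_baz(), and get_baz() are hypothetical. */
#include &lt;linux/rcupdate.h&gt;
#include &lt;linux/refcount.h&gt;

struct baz {
        refcount_t refcnt;
        int key;
        /* ... other fields ... */
};

struct baz *find_baz(int key);  /* Hypothetical RCU-protected lookup. */

static struct baz *get_baz(int key)
{
        struct baz *p;

        rcu_read_lock();
        p = find_baz(key);
        if (p &amp;&amp; !refcount_inc_not_zero(&amp;p->refcnt))
                p = NULL;       /* Object being freed: pretend not to find it. */
        rcu_read_unlock();
        return p;       /* Caller holds a reference, so the pointer stays valid. */
}
</pre>Any leak detector, whether based on KCSAN or on a hypothetical extended Rust ownership model, must avoid flagging pointers that escape the critical section in this legitimate manner.<br /><br />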
Nevertheless, this might be one area where Rust could improve Linux-kernel diagnostics, albeit not without effort.<br /><br /><h2>For More Information</h2>Additional information on RCU may be found here:<br /><ol><br /><li> <a href="https://www.kernel.org/doc/Documentation/RCU/" target="_blank" rel="nofollow">Linux-kernel documentation</a>, with special emphasis on <a href="https://www.kernel.org/doc/Documentation/RCU/Design/Requirements/Requirements.html" target="_blank" rel="nofollow">Linux-kernel RCU's requirements</a>.<br /><li> The following sections of <a href="https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-e2.pdf" target="_blank" rel="nofollow">perfbook</a> discuss other aspects of RCU:<ol><br /><li> Section 9.5 ("Read-Copy Update").<br /><li> Section 10.3 ("Read-Mostly Data Structures").<br /><li> Section 13.5 ("RCU Rescues").<br /><li> Section 15.4.2 ("RCU [memory-ordering properties]").<br /></ol><li> <a href="https://lwn.net/Articles/777036/" target="_blank" rel="nofollow">The RCU API, 2019 Edition</a>.<br /><li> <a href="https://liburcu.org" target="_blank" rel="nofollow">Userspace RCU</a>.<br /><li> <a href="https://github.com/facebook/folly/blob/master/folly/synchronization/Rcu.h" target="_blank" rel="nofollow">Folly-library RCU</a>.<br /><li> Section 9.6.3.3 of <a href="https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-e2.pdf" target="_blank" rel="nofollow">perfbook</a> lists several other RCU implementations.<br /><li> <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1122r4.pdf" target="_blank" rel="nofollow">Proposed Wording for Concurrent Data Structures: Read-Copy-Update (RCU)</a> (C++ Working Paper).<br /></ol><br />RCU's semantics, ordered by increasing formality:<br /><ol><br /><li> The "Fundamental Requirements" section of the aforementioned <a href="https://www.kernel.org/doc/Documentation/RCU/Design/Requirements/Requirements.html" target="_blank" rel="nofollow">Linux-kernel RCU requirements</a>. Extremely informal, but you have to start somewhere!<br /><li> <a href="https://www2.rdrop.com/~paulmck/techreports/rcu-semantics.2005.01.30a.pdf" target="_blank" rel="nofollow">RCU Semantics: A First Attempt</a>. Also quite informal, but with a bit more math involved.<br /><li> <a href="https://software.imdea.org/~gotsman/papers/recycling-esop13-ext.pdf" target="_blank" rel="nofollow">Verifying Highly Concurrent Algorithms with Grace (extended version)</a>. Formal semantics expressed in separation logic.<br /><li> <a href="https://doi.acm.org/10.1145/3173162.3177156" target="_blank" rel="nofollow">Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel</a>. Executable formal semantics expressed in the Cat language. 
This paper also includes a proof that the traditional "wait for all pre-existing readers" semantic is equivalent to these executable formal semantics within the context of the Linux-kernel memory model.<br /><li> <a href="https://github.com/torvalds/linux/tree/master/tools/memory-model" target="_blank" rel="nofollow">Linux-kernel memory model (LKMM)</a>, see especially the <tt>Sync-rcu</tt> and <tt>Sync-srcu</tt> relations in file <tt>linux-kernel.cat</tt>.<br /></ol><br />Furthermore, significant portions of Linux-kernel RCU have been subjected to mechanical proofs of correctness:<br /><ol><br /><li> Lihao Liang applied the <a href="https://www.cprover.org/cbmc/" target="_blank" rel="nofollow">C Bounded Model Checker (CBMC)</a> to Tree RCU (<a href="https://arxiv.org/abs/1610.03052" target="_blank" rel="nofollow">paper</a>).<br /><li> Michalis Kokologiannakis applied <a href="https://github.com/nidhugg/nidhugg" target="_blank" rel="nofollow">Nidhugg</a> to Tree RCU (<a href="https://michaliskok.github.io/papers/spin2017-rcu.pdf" target="_blank" rel="nofollow">paper</a>).<br /><li> Lance Roy applied CBMC to Classic SRCU, represented by Linux-kernel commit <a href="https://lore.kernel.org/all/1484520155-21017-3-git-send-email-paulmck@linux.vnet.ibm.com/" target="_blank" rel="nofollow">418b2977b343 ("rcutorture: Add CBMC-based formal verification for SRCU")</a>. Sadly, the scripting has not aged well.<br /></ol><br />For more information, see <a href="https://paulmck.livejournal.com/46993.html" target="_blank">Verification Challenge 6</a>.<br /><br />For those interested in the userspace RCU library, there have been both <a href="https://www.computer.org/csdl/trans/td/2012/02/ttd2012020375-abs.html" target="_blank" rel="nofollow">manual</a> and <a href="https://dl.acm.org/doi/10.1145/2506164.2506174" target="_blank" rel="nofollow">mechanical</a> proofs of correctness. <a href="https://www.mpi-sws.org/~dreyer/papers/rcu/paper.pdf" target="_blank" rel="nofollow">A manual proof of correctness of userspace RCU in the context of a linked list has also been carried out</a>.<br /><br /><h4>Endnotes</h4><table><tr valign="top"><td>[1] </td><td>At its core, RCU is nothing more nor less than a way of waiting for already-started things to finish. Within the Linux kernel, something to be waited on is delimited by <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. For their part, <tt>synchronize_rcu()</tt> and <tt>call_rcu()</tt> do the waiting, synchronously and asynchronously, respectively.<br /><br />It turns out that you can do quite a lot with this simple capability, including approximating some reader-writer-lock use cases, emulating a number of reference-counting use cases, providing type-safe memory, and providing existence guarantees, this last sometimes being pressed into service as a poor man's garbage collector. Section 9.3.3 ("RCU Usage") of <a href="https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-e2.pdf" target="_blank" rel="nofollow">perfbook</a> provides more detail on these and other RCU use cases.<br /></td></tr></table><br /><br /><h4>History</h4><i>October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.</i><br /><i>October 13, 2021: Add RCU-protected phased state change use case.</i><br /><i>October 18, 2021: Add read/update common code and Tassarotti citation.</i>