SBOMs for Python packages project

sethmlarson · 2024-11-05T18:02:08.455Z

I’m announcing a new cross-functional project for SBOMs and Python packages. This project is specifically looking to solve these issues:

Enable Python users that require SBOM documents (likely due to regulations like CRA or SSDF) to self-serve using existing SBOM generation tools.
Solve the “phantom dependency” problem, where non-Python software is bundled in Python packages but not recorded in any metadata. This makes the job of software composition analysis (SCA) tools difficult or impossible.
Keep the adoption work required of relevant projects, such as build backends and auditwheel-esque tools, as minimal as possible. Empower users who are interested in having better SBOM data for the Python projects they are using to be able to contribute engineering time towards that goal.

Expected future work will be:

Surveying relevant build backends, auditwheel-like tools, vendoring tools, and build environments.
Authoring a PEP for how to include an SBOM document in a Python package that is self-referential about the Python package itself. This will include a new core metadata field, Sbom-File, and a method to manually specify SBOM documents to include from pyproject.toml.
Authoring an informational PEP for how to transform Python package metadata (including the above PEP) into an SBOM document.
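To make the last two items a bit more concrete, here is a minimal sketch of what reading the proposed field and transforming package metadata into an SBOM could look like. Everything specific is an assumption: Sbom-File is only the field name proposed above, the file paths and package name are made up, the name normalization is PEP 503 style, and the output is a bare CycloneDX-style skeleton rather than whatever the eventual PEPs specify.

```python
import re
from email.parser import HeaderParser

# Hypothetical core metadata. "Sbom-File" is only the field name proposed in
# this thread; the package name and file paths are made up for illustration.
EXAMPLE_METADATA = """\
Metadata-Version: 2.4
Name: Example_Package
Version: 1.0.0
Sbom-File: sboms/example-package.cdx.json
Sbom-File: sboms/vendored-libs.spdx.json
"""


def normalize_name(name: str) -> str:
    """Normalize a project name in the PEP 503 style (an assumed choice)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def listed_sbom_files(metadata_text: str) -> list[str]:
    """Return every value of the (proposed) Sbom-File field, in order."""
    headers = HeaderParser().parsestr(metadata_text)
    return headers.get_all("Sbom-File", [])


def metadata_to_sbom(metadata_text: str) -> dict:
    """Build a minimal CycloneDX-style document describing the package itself."""
    headers = HeaderParser().parsestr(metadata_text)
    name = normalize_name(headers["Name"])
    version = headers["Version"]
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.6",
        "version": 1,
        "metadata": {
            # The primary component is the Python package being described.
            "component": {
                "type": "library",
                "name": name,
                "version": version,
                "purl": f"pkg:pypi/{name}@{version}",
            }
        },
    }


sbom = metadata_to_sbom(EXAMPLE_METADATA)
print(sbom["metadata"]["component"]["purl"])
print(listed_sbom_files(EXAMPLE_METADATA))
```

Core metadata uses an email-header style format, so the standard library parser is enough for a sketch like this; a real tool would more likely go through importlib.metadata or a packaging library.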

If you are interested in this work or are willing to provide opinions at the various stages, please follow along with the GitHub repository. I’ve also been allocated a dedicated channel in the PyPA Discord (if you’re not in the PyPA Discord, ask me, @sethmlarson, for an invite). I’m happy to answer any questions or concerns in this thread too.

h-vetinari · 2024-11-06T00:40:40.845Z

Seth Michael Larson:

Solve the “phantom dependency” problem

Is this going to help push PEP 725, or are you planning a completely different approach (e.g. separate metadata or somehow inspecting wheel contents)?

I see PURLs appearing in the readme already, but no mention about that PEP. IMO it would be much more beneficial to tackle this problem at the source (packaging metadata that’s used for building; generate SBOM from that), rather than create a separate set of sbom-related metadata that needs to be kept in sync.

sethmlarson · 2024-11-08T15:23:13.884Z

Thanks for sending me this, my current read of PEP 725 (correct me where I’m wrong @pradyunsg and @rgommers) is that it’s an attempt to standardize the “requirements” for external dependencies, not necessarily describing the set of dependencies that a Python package actually contains.

For example, a source distribution describing a dependency on libssl might get built into a wheel, at which point the Requires-External: pkg:generic/libssl field is removed. Since that dependency is then vendored, the SBOM PEP would allow writing an SBOM document inside the wheel describing the exact libssl package, version, and distro (pkg:deb/ubuntu/libssl-dev@3.0.2-0ubuntu1.18), along with other information like hashes, license, and more.
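The difference between the two identifiers is easier to see when they are split into parts. The following is a rough sketch only: parse_simple_purl and SimplePurl are hypothetical helpers written for this illustration, and they handle just the plain pkg:type/namespace/name@version shape (no qualifiers, subpaths, or percent-encoding), which is much less than the Package URL specification covers.

```python
from typing import NamedTuple, Optional


class SimplePurl(NamedTuple):
    type: str
    namespace: Optional[str]
    name: str
    version: Optional[str]


def parse_simple_purl(purl: str) -> SimplePurl:
    """Split a plain ``pkg:type/namespace/name@version`` string.

    Qualifiers (``?...``), subpaths (``#...``) and percent-encoding are
    deliberately not handled in this sketch.
    """
    scheme, _, rest = purl.partition(":")
    if scheme != "pkg" or not rest:
        raise ValueError(f"not a package URL: {purl!r}")
    if "@" in rest:
        # The version follows the last "@".
        path, _, version = rest.rpartition("@")
    else:
        path, version = rest, None
    parts = path.split("/")
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"missing type or name: {purl!r}")
    purl_type, *namespace, name = parts
    return SimplePurl(
        purl_type.lower(),
        "/".join(namespace) or None,
        name,
        version or None,
    )


requirement = parse_simple_purl("pkg:generic/libssl")
bundled = parse_simple_purl("pkg:deb/ubuntu/libssl-dev@3.0.2-0ubuntu1.18")
print(requirement)
print(bundled)
```

The first identifier only says "some libssl is needed", while the second pins a distro, package name, and exact version, which is the kind of detail a scanner needs.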

If this is correct, I think that both PEPs are necessary parts of the story and don’t conflict with each other. I can definitely reference how the upcoming SBOM PEP would interact with PEP 725.

sethmlarson · 2024-11-08T17:28:40.623Z

So I’ve started authoring a PEP draft using the README for the GitHub repo. I am looking for folks who are interested in being a PEP delegate for this project. Let me know if you’re interested, thanks again for everyone’s comments and feedback.

pf_moore · 2024-11-08T17:36:05.666Z

I don’t know if the intention is for this to be a core PEP or a packaging PEP but in case it’s the latter, I’d like to explicitly not be considered for PEP delegate for this. I’m normally the default delegate for packaging interoperability PEPs, but I don’t feel like I have sufficient understanding of (or interest in) this topic to pick this one up.

sethmlarson · 2024-11-08T17:39:13.770Z

Paul Moore:

I don’t know if the intention is for this to be a core PEP or a packaging PEP but in case it’s the latter, I’d like to explicitly not be considered for PEP delegate for this.

I believe this should be a packaging PEP because it will add a new core metadata field. I hope you’ll still be available to review the proposal; I suspect your experience with interop will be very useful.

pradyunsg · 2024-11-08T19:11:50.636Z

Seth Michael Larson:

Thanks for sending me this, my current read of PEP 725 (correct me where I’m wrong @pradyunsg and @rgommers) is that it’s an attempt to standardize the “requirements” for external dependencies, not necessarily describing the set of dependencies that a Python package actually contains.

That’s right, although I’d say they do benefit from each other in the same way that having explicit dependency metadata in METADATA files is useful.

pf_moore · 2024-11-08T20:12:37.954Z

Seth Michael Larson:

I hope you’ll still be available to review the proposal

I’ll certainly read this thread, and comment if I have anything useful to say (the problem is usually trying to stop me posting, rather than the other way around). But my attitude towards SBOMs tends towards a grumpy “why should we do work in our free time to make life easier for big companies” so I’d rather keep quiet instead of disrupting the discussion.

pitrou · 2024-11-09T14:49:19.042Z

Paul Moore:

But my attitude towards SBOMs tends towards a grumpy “why should we do work in our free time to make life easier for big companies” so I’d rather keep quiet instead of disrupting the discussion.

I’m not sure what your current professional status is @pf_moore , but I think it might actually be possible to get companies to fund the required PEP and PR review and lifecycle process?

pf_moore · 2024-11-09T15:25:13.842Z

Antoine Pitrou:

I’m not sure what your current professional status is @pf_moore , but I think it might actually be possible to get companies to fund the required PEP and PR review and lifecycle process?

Personally, it’s not a matter of funding^[1]. I’m not actually sure if we want companies funding the PEP process, as that significantly increases the risk (or at least the perception) of a possible conflict of interest. In all honesty, though, I don’t think there’s a need as far as the PEP process is concerned. I’m sure there are plenty of other people qualified to be PEP delegate.

What would be useful is, if there’s any work required to implement the PEP, then for companies to provide good-quality PRs and an ongoing maintenance commitment to the relevant code. Note that it’s important to get the maintenance commitment here - we’ve had instances in pip where we’ve had features contributed, but without follow-up maintenance support, and that’s ultimately created more work for the pip team than not having the feature would have done. Maintenance commitment doesn’t even need a presence on the project maintainer team, simply having someone do timely bug diagnosis and fix development as a normal 3rd party user is a great help.

[1] I’m retired, and Python is purely a hobby for me these days.

pradyunsg · 2024-11-10T06:28:33.863Z

Paul Moore:

follow-up maintenance

This is the piece that I’m concerned about here as well.

I do expect that we’ll find institutional users of Python being interested in funding/implementing this stuff “upstream” of them, especially since it looks like western regulatory bodies have already started investigating/encouraging/requiring SBOMs in various industries.

I don’t think we’re set up particularly well right now, in Python’s packaging tooling space, to handle the sort of (regulatory pressure induced!) cross-project work necessary here; especially since I don’t expect this to be a “one burst and done” project, but rather an ongoing thing. The underlying SBOM standards are gonna evolve and Python packaging standards after the introduction of proper bill-of-materials metadata will need to cater to any constraints forced on us by things we do to cater for this stuff. And, the code that’s written will need to be maintained.

That said, I’d like to believe that no one involved is looking to just shove the long-term maintenance burden of this stuff onto open source projects, without thinking about how those projects will deal with that – not least because most open source projects have very little incentive to actually invest in this stuff themselves and will just say “no thanks” otherwise.

Plus, I know that Seth is aware of these aspects, and also trust that he’ll work to find a reasonable solution for the various tradeoffs here while engaging with the relevant parties appropriately.

sethmlarson · 2024-11-12T22:36:59.741Z

Keeping the “maintain X forever” box checked is always the toughest one and it’s a top focus for me on this project.

I am intentionally designing the “include SBOM” mechanism to be as simple as possible, essentially having the documents catch a mostly-free ride in archives. I don’t expect a heavy lift to implement anything in installers, for example.

SBOM standards evolving likely won’t have any effect on existing code (aside from package indexes, which might need to adopt new “major” SBOM standard versions). The design of the mechanism defers completely to SBOM standards about contents, we only provide a few basic rules about the primary component to make the semantics work out. SBOM scanning tools are incentivized to support whatever Python package tools emit (lest they start dropping information on the floor), so long-term backwards compatibility is incentivized across the toolchain.
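As a rough illustration of "deferring to the SBOM standards", a packaging tool would only need to locate the primary component and could ignore everything else in the document. The sketch below is an assumption about what such a check could look like, not the rules from the draft: primary_component is a hypothetical helper, it reads metadata.component for CycloneDX JSON and documentDescribes plus packages for SPDX 2.x JSON, and both sample documents are made up.

```python
from typing import Optional, Tuple


def primary_component(sbom: dict) -> Optional[Tuple[str, str]]:
    """Return (name, version) of the component an SBOM document describes.

    Only the primary component is inspected; all other content is left to
    the SBOM standard itself.
    """
    if sbom.get("bomFormat") == "CycloneDX":
        component = sbom.get("metadata", {}).get("component")
        if component:
            return component.get("name"), component.get("version")
        return None
    if "spdxVersion" in sbom:
        described = set(sbom.get("documentDescribes", []))
        for package in sbom.get("packages", []):
            if package.get("SPDXID") in described:
                return package.get("name"), package.get("versionInfo")
    return None


# Made-up documents, trimmed to the fields this sketch reads.
cyclonedx_doc = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.6",
    "metadata": {"component": {"type": "library", "name": "demo-pkg", "version": "1.0"}},
    "components": [{"type": "library", "name": "libexample", "version": "2.3"}],
}
spdx_doc = {
    "spdxVersion": "SPDX-2.3",
    "documentDescribes": ["SPDXRef-Package-demo"],
    "packages": [
        {"SPDXID": "SPDXRef-Package-libexample", "name": "libexample", "versionInfo": "2.3"},
        {"SPDXID": "SPDXRef-Package-demo", "name": "demo-pkg", "versionInfo": "1.0"},
    ],
}

print(primary_component(cyclonedx_doc))
print(primary_component(spdx_doc))
```

A new major version of either standard would mean touching a function like this one and little else, since nothing here looks at the rest of the document.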

Pradyun Gedam:

I don’t think we’re set up particularly well right now, in Python’s packaging tooling space, to handle the sort of (regulatory pressure induced!) cross-project work necessary here; especially since I don’t expect this to be a “one burst and done” project, but rather an ongoing thing.

This is why I think spending my paid time on this project will be useful.

I am hoping that by providing a mechanism to have SBOM information included in Python packages at all (and as a solution to a named problem “phantom dependency” where Python is the unfortunate star) we’ll see increased interest and contributions from users who need SBOMs.

As it stands today, if a company wanted to improve projects’ SBOMs that they depend on they would need to start where I am now with packaging PEPs and a plan to contribute to a bunch of Python tools and then advocate for projects they care about to accept their contributions. My goal with the PEP is to at least unblock the first few things in the way of those types of contributions.

The areas which will need continuous human-touch are primarily “upstream” SBOM documents, likely from dependencies which are in their source tree. I have some plans to replicate some of the work I did for CPython in having a tool that provides automation and shoulder-taps to update SBOMs when checked-in dependencies change. Many projects, even ones with binary dependencies, won’t need to do this if they’re using auditwheel/cibuildwheel or an ecosystem with an easier software ID story like Rust or JS.

With some luck I’ll hopefully be around for a while to help track down ecosystem compatibility issues. I am also planning on creating patches for many of the mentioned projects like auditwheel, maturin, vendoring, etc. ahead of a PEP going for review so that “implementations” are ready ahead of time. For example, I already have a local patch for auditwheel which generates and includes an SBOM into a repaired wheel. This SBOM document alone means that scanning tools can detect vulnerable versions of libwebp in Pillow, for example.
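For readers who have not looked inside a repaired wheel, a small sketch of why scanners struggle today: repaired Linux wheels commonly carry their bundled shared libraries in a top-level *.libs/ directory, and the file names there are all a scanner has to go on. This is not the auditwheel patch mentioned above. bundled_libraries is a hypothetical helper, and the stand-in wheel and every name inside it are made up.

```python
import io
import zipfile
from pathlib import PurePosixPath


def bundled_libraries(wheel) -> list[str]:
    """Return wheel members stored under a top-level ``*.libs/`` directory."""
    with zipfile.ZipFile(wheel) as archive:
        found = []
        for member in archive.namelist():
            parts = PurePosixPath(member).parts
            # Keep files only: a bare directory entry has a single part.
            if len(parts) > 1 and parts[0].endswith(".libs"):
                found.append(member)
        return sorted(found)


# Build a tiny stand-in wheel in memory; every name below is made up.
demo_wheel = io.BytesIO()
with zipfile.ZipFile(demo_wheel, "w") as archive:
    archive.writestr("demo_pkg/__init__.py", "")
    archive.writestr("demo_pkg.libs/libexample-0a1b2c3d.so.1", b"not a real library")
    archive.writestr("demo_pkg-1.0.dist-info/METADATA", "Name: demo-pkg\nVersion: 1.0\n")

libraries = bundled_libraries(demo_wheel)
print(libraries)
```

The listing shows that something is bundled, but not which upstream project or exact version it came from. That gap is what an SBOM written at repair time would fill.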

Paul Moore:

Maintenance commitment doesn’t even need a presence on the project maintainer team, simply having someone do timely bug diagnosis and fix development as a normal 3rd party user is a great help.

This makes sense to me; in a way, I am providing this sort of support to CPython with their SBOMs. Maybe I can work this into the framing of “how to contribute SBOM data to your upstream projects”. I do plan on talking about this whole project mostly as “opening a new data distribution path, it’s up to you, dear user, to contribute to keep the data up to your standards”.

brettcannon · 2024-11-12T22:59:09.183Z

Paul Moore:

I’m normally the default delegate for packaging interoperability PEPs, but I don’t feel like I have sufficient understanding of (or interest in) this topic to pick this one up.

If you’re asking for a volunteer then I can at least put my hand up since this is sort of like licenses where I suspect MS would be supportive of me helping to improve the situation.

pf_moore · 2024-11-12T22:59:30.989Z

One thing that would be worth considering is how this will work with projects that (for whatever reason) do not want to include/maintain SBOM data. In some ways this feels very like typing information - not everyone wants to add typing to their projects, so there are facilities (typeshed, stubs) for typing data to be maintained externally by interested 3rd parties. It seems to me that a similar mechanism for SBOM data might be useful.

steve.dower · 2024-11-12T23:19:10.124Z

Paul Moore:

One thing that would be worth considering is how this will work with projects that (for whatever reason) do not want to include/maintain SBOM data

It’s possible (and legitimate) for the recipient to construct the SBOM from the package after it’s been installed (strictly speaking, “after it’s been integrated into their product,” which is the same as installation for Python packages).

Ultimately, the SBOM is going to matter for the purchaser, and the seller is responsible to provide it.^[1] The seller is also responsible for supporting all the software they sell, and the vast majority of licenses used for Python packages don’t allow the seller to transfer liability back to the developer, so it really doesn’t matter if there’s an “upstream SBOM” or not.

However, when there is an upstream SBOM, and especially if we ever figure out good ways to sign and verify them, they are probably the most consistent way we’re going to get to enable signed/verifiable wheels (traced to the publisher, not just signed repository metadata a.k.a. TUF). For some publishers, this will matter (I already put SBOMs into my $work wheels).

But there shouldn’t be any reason to require/force OSS-licensed packages to include an SBOM. It might help tools that try to construct referential graphs of software, but when they’re missing there’s really no value in getting the information from anywhere but the actual files in the package (unlike, say ClearlyDefined, which collects license info in a consistent way - that plus hashes of actual files becomes your SBOM).
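The "hashes of actual files" half of that can be sketched in a few lines. This is only an illustration of a file-level inventory, under the assumption that SHA-256 is the digest wanted: file_inventory is a hypothetical helper, the demo directory is created on the fly, and license data would have to come from elsewhere.

```python
import hashlib
import tempfile
from pathlib import Path


def file_inventory(root: Path) -> dict[str, str]:
    """Map each file under ``root`` (as a relative POSIX path) to its SHA-256."""
    inventory = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            inventory[path.relative_to(root).as_posix()] = digest
    return inventory


# Demo on a throwaway directory standing in for an installed package.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "demo_pkg").mkdir()
    (root / "demo_pkg" / "__init__.py").write_bytes(b"")
    (root / "demo_pkg" / "core.py").write_bytes(b"VALUE = 1\n")
    inventory = file_inventory(root)

for name, digest in inventory.items():
    print(name, digest)
```

Installed distributions already record per-file hashes in their RECORD file, so a real tool could start from that instead of rehashing.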

[1] In case it’s not clear at this point, “people who publish to PyPI” are neither the purchaser nor the seller (99.9% of the time). The seller is the software company who “pip installs” something and sells the software that bundles it. The primary need for SBOMs is to inform someone of the details of what they’ve just bought, so if there’s no software sale, there’s likely no SBOM.

pitrou · 2024-11-14T08:39:38.437Z

Seth Michael Larson:

and as a solution to a named problem “phantom dependency” where Python is the unfortunate star

Do you have a well-defined plan for how to tackle this?

Steve Dower:

Ultimately, the SBOM is going to matter for the purchaser, and the seller is responsible to provide it.

But the “seller” will have a hard time generating a correct SBOM if the upstream project doesn’t provide one, right? The PyArrow wheels, for example, bundle a bunch of third-party C++ libraries that the entity downloading the wheel doesn’t know about.

sethmlarson · 2024-11-14T14:30:00.692Z

Brett Cannon:

If you’re asking for a volunteer then I can at least put my hand up since this is sort of like licenses where I suspect MS would be supportive of me helping to improve the situation.

I think this is a great match! The mechanics of the PEP are quite similar to PEP 639 for licenses, and if your work is supportive then even better.

Steve Dower:

It’s possible (and legitimate) for the recipient to construct the SBOM from the package after it’s been installed (strictly speaking, “after it’s been integrated into their product,” which is the same as installation for Python packages).

This is the current state of affairs for most software. Built software artifacts don’t encode enough information for those scanners, so they only capture top-level dependencies instead of bundled ones. This project would provide a mechanism to send information to SBOM scanners, either from the source stage or build stage of a project.

Antoine Pitrou:

sethmlarson:

and as a solution to a named problem “phantom dependency” where Python is the unfortunate star

Do you have a well-defined plan for how to tackle this?

I’ve documented the overall plan in this GitHub repository README. In short: create a PEP for a simple mechanism for including SBOMs in Python packages, look into implementing automated support for creating SBOMs from dependencies (build backends and wheel processing tools), and finally create a simple tool (like what has been working for CPython) for projects which can’t automate their SBOMs but either want to contribute these themselves or have eager users.

I don’t think there’s an “easy switch” for all projects to suddenly have high-quality SBOMs, but with the above I think we can certainly make a dent in the problem through automated means alone. For example, I tracked hundreds of projects’ dependencies that would get covered by adding support for this prospective PEP to auditwheel. I’m looking to quantify the total number of projects which would be covered by each solution in my surveys on that same GitHub repo.

sethmlarson · 2024-11-14T14:33:04.658Z

Thanks @alex_Gaynor for documenting cryptography’s desire to adopt such an SBOM packaging standard:

In short, if there is a Python packaging standard for expressing SBOMs, we’ll adopt it.

cryptography is one of my top candidates that I’ve been using in surveys and will be using once I start working on an SBOM-capable fork of Maturin, so this is lovely to see that this work would be adopted by the project.

pitrou · 2024-11-14T14:44:28.478Z

Seth Michael Larson:

For example, I tracked hundreds of projects’ dependencies that would get covered by adding support for this prospective PEP to auditwheel.

I’m curious, do you expect wheels to package .a files? Python extensions with statically-linked third-party dependencies would typically not ship .a files separately.

sethmlarson · 2024-11-14T14:59:32.609Z

Antoine Pitrou:

sethmlarson:

For example, I tracked hundreds of projects’ dependencies that would get covered by adding support for this prospective PEP to auditwheel.

I’m curious, do you expect wheels to package .a files? Python extensions with statically-linked third-party dependencies would typically not ship .a files separately.

Maybe I am misunderstanding your question, please correct me if I am. I wasn’t anticipating asking any tool to change its behavior, instead working around existing behavior and having tools (where possible) note down when they use or pull in a third-party dependency into the final artifact. Is there something about .a files that I’m missing?