Description
What's the problem this feature will solve?
As part of our ongoing collaboration to find exposed secrets in PyPI packages, we are working on a scanning pipeline that automatically scans newly released packages. In order to report our findings, we will need an endpoint we can call, with an agreed-upon schema.
Describe the solution you'd like
Schema
Ideally, the endpoint’s payload would be on a per-artifact basis, allowing us to include metadata about the artifact alongside the list of secrets that were found. Here is a possible schema for the payload:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "Artifact scanning report",
"description": "The detail of all the findings for a given artifact",
"type": "object",
"required": [
"release",
"scan_info",
"scan_results"
],
"properties": {
"release": {
"type": "object",
"required": [
"title",
"package_name",
"version"
],
"properties": {
"title": {
"type": "string",
"examples": [
"ggshield 1.0.2"
]
},
"package_name": {
"type": "string",
"examples": [
"ggshield"
]
},
"version": {
"type": "string",
"examples": [
"1.0.2"
]
}
}
},
"scan_info": {
"type": "object",
"required": [
"scanner_version",
"scanned_at"
],
"properties": {
"scanner_version": {
"type": "string",
"examples": [
"2.99.0"
]
},
"scanned_at": {
"type": "string",
"format": "date-time",
"examples": [
"2023-11-16T17:10:25Z"
]
}
}
},
"scan_results": {
"type": "array",
"items": {
"type": "object",
"required": [
"artifact",
"secrets"
],
"properties": {
"artifact": {
"type": "object",
"required": [
"name",
"sha256_digest"
],
"properties": {
"name": {
"type": "string",
"examples": [
"ggshield-1.0.2.zip"
]
},
"sha256_digest": {
"type": "string",
"examples": [
"13550350a8681c84c861aac2e5b440161c2b33a3e4f302ac680ca5b686de48de"
]
}
}
},
"secrets": {
"type": "array",
"items": {
"type": "object",
"required": [
"detector_name",
"detector_display_name",
"company_name",
"filepath",
"matches",
"validity_status"
],
"properties": {
"detector_name": {
"type": "string",
"examples": [
"google_aiza"
]
},
"detector_display_name": {
"type": "string",
"examples": [
"Google API Key"
]
},
"company_name": {
"type": "string",
"examples": [
"Google"
]
},
"documentation_url": {
"type": "string",
"format": "uri",
"examples": [
"https://docs.gg.com/google_aiza"
]
},
"filepath": {
"type": "string",
"examples": [
"/ggshield/connect/google.py"
]
},
"matches": {
"type": "array",
"items": {
"type": "object",
"required": [
"match_name",
"index_start",
"index_end"
],
"properties": {
"match_name": {
"type": "string",
"examples": [
"apikey"
]
},
"index_start": {
"type": "integer",
"examples": [
12
]
},
"index_end": {
"type": "integer",
"examples": [
32
]
}
}
}
},
"validity_status": {
"type": "string",
"enum": [
"NO_CHECKER",
"FAILED_TO_CHECK",
"VALID",
"INVALID"
],
"examples": [
"VALID"
]
}
}
}
}
}
}
}
},
"examples": [
{
"release": {
"title": "ggshield 1.0.2",
"package_name": "ggshield",
"version": "1.0.2"
},
"scan_info": {
"scanner_version": "2.99.0",
"scanned_at": "2023-11-16T17:10:25Z"
},
"scan_results": [
{
"artifact": {
"name": "ggshield-1.0.2.zip",
"sha256_digest": "13550350a8681c84c861aac2e5b440161c2b33a3e4f302ac680ca5b686de48de"
},
"secrets": [
{
"detector_name": "google_aiza",
"detector_display_name": "Google API Key",
"company_name": "Google",
"documentation_url": "https://docs.gg.com/google_aiza",
"filepath": "/ggshield/connect/google.py",
"matches": [
{
"match_name": "apikey",
"index_start": 12,
"index_end": 32
}
],
"validity_status": "VALID"
}
]
}
]
}
]
}
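To make the shape of the payload concrete, here is a minimal Python sketch of how a scanner might assemble one report following the schema above. The function name `build_report` and its parameters are hypothetical and only serve as illustration; nothing here is an existing API on either side.

```python
from datetime import datetime, timezone

# Values allowed by the "validity_status" enum in the proposed schema.
VALIDITY_STATUSES = {"NO_CHECKER", "FAILED_TO_CHECK", "VALID", "INVALID"}


def build_report(package_name, version, scanner_version, scan_results, scanned_at=None):
    """Assemble a report dict shaped like the proposed schema (hypothetical helper).

    `scan_results` is a list of {"artifact": {...}, "secrets": [...]} dicts.
    `scanned_at` should be a timezone-aware datetime; it defaults to now (UTC).
    """
    if scanned_at is None:
        scanned_at = datetime.now(timezone.utc)
    # Reject statuses that are not part of the enum before sending anything.
    for result in scan_results:
        for secret in result["secrets"]:
            if secret["validity_status"] not in VALIDITY_STATUSES:
                raise ValueError(f"unknown validity_status: {secret['validity_status']!r}")
    return {
        "release": {
            "title": f"{package_name} {version}",
            "package_name": package_name,
            "version": version,
        },
        "scan_info": {
            "scanner_version": scanner_version,
            # Render as a UTC timestamp with a "Z" suffix, as in the schema example.
            "scanned_at": scanned_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
        "scan_results": scan_results,
    }
```

A full validation against the JSON Schema itself would need a schema validator library; the sketch only checks the enum to stay dependency-free.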
Response
We do not expect the endpoint to return any data; we only need to be able to distinguish a successful call from a failed one, so standard HTTP status codes should be more than enough.
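On the calling side, this could look roughly like the following sketch, which treats any 2xx status as success and everything else (including network errors) as failure. The function names, the `Authorization: Bearer` header, and the 10-second timeout are assumptions made for illustration, not part of any agreed interface.

```python
import json
import urllib.error
import urllib.request


def is_success(status_code):
    """Treat any 2xx status code as a successful call."""
    return 200 <= status_code < 300


def send_report(url, payload, api_key, opener=urllib.request.urlopen):
    """POST one report and return True on success, False otherwise.

    `opener` is injectable so the function can be exercised without a network.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed header format; the actual scheme is still to be decided.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    try:
        with opener(request, timeout=10) as response:
            return is_success(response.status)
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return is_success(exc.code)
    except urllib.error.URLError:
        # Connection refused, DNS failure, timeout, and similar.
        return False
```

The response body is never read, which matches the expectation that the endpoint returns no data.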
API versioning
We have no strong requirement on this point, and will be fine with whichever solution you choose for the versioning of the schema.
Call volume and rate limiting
Since we are planning to call the endpoint once per artifact in which we find secrets, the worst case would be that we find secrets in every single artifact. In that case, our volume of calls would be directly proportional to the number of releases. We therefore don’t expect our call volume to be high enough to be restricted by rate limiting.
Authentication
This endpoint should not be publicly available. A possible approach would be to use both authentication via a secret (ideally just an API key) and an IP allowlist, to guarantee that only known entities have access to the endpoint.
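As an illustration of combining the two checks, here is a minimal sketch of what the receiving side could do, assuming the client IP and the presented key have already been extracted from the request. The function name `is_authorized` and the allowlisted network (a documentation-only range) are placeholders.

```python
import hmac
import ipaddress

# Placeholder allowlist: 203.0.113.0/24 is reserved for documentation.
ALLOWED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_authorized(client_ip, presented_key, expected_key, allowed_networks=ALLOWED_NETWORKS):
    """Return True only if the caller's IP is allowlisted AND the API key matches."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed address: reject outright.
        return False
    ip_ok = any(address in network for network in allowed_networks)
    # Constant-time comparison to avoid leaking key contents through timing.
    key_ok = hmac.compare_digest(presented_key.encode("utf-8"), expected_key.encode("utf-8"))
    return ip_ok and key_ok
```

Requiring both conditions means a leaked key alone is not enough to reach the endpoint from an unknown address.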
Remediation
In the case of prolonged downtime of the endpoint, we won’t be able to upload our findings. They will be persisted on our end, and can be re-uploaded at a later point. We do not plan to automate this: it will be done “manually”, in an ad hoc fashion.
We would also probably need to have an automated way of revoking / renewing our own API key, to be able to remediate any leak on our end immediately.
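The persist-then-re-upload approach described above could be sketched as follows, assuming findings are spooled as JSON files in a local directory. The directory layout and the function names are hypothetical; `send` stands for any callable that uploads one payload and returns `True` on success.

```python
import json
import uuid
from pathlib import Path


def persist_report(payload, spool_dir):
    """Write one report to the spool directory and return its path."""
    spool_dir = Path(spool_dir)
    spool_dir.mkdir(parents=True, exist_ok=True)
    path = spool_dir / f"{uuid.uuid4().hex}.json"
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path


def reupload_pending(spool_dir, send):
    """Try to send every spooled report; return how many are still pending.

    A report file is deleted only after `send` reports success.
    """
    remaining = 0
    for path in sorted(Path(spool_dir).glob("*.json")):
        payload = json.loads(path.read_text(encoding="utf-8"))
        if send(payload):
            path.unlink()
        else:
            remaining += 1
    return remaining
```

An operator would run `reupload_pending` by hand once the endpoint is reachable again, which keeps the process manual as intended.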