(or: The Big Mystery of Spooky EPUB Relative URLs 👻🎃)
TL;DR: EPUB 3.3 now normatively references the URL Standard. But URL parsing is ambiguous in some cases, because base URLs are not clearly defined.
Current situation
In an EPUB, files reference each other via relative URL strings (see Relative URLs, in Open Container Format).
In the URL Standard, to parse a relative URL string into a URL record, the URL parser needs a base URL.
The base URL used to parse a URL string is defined by host languages (like in CSS, or HTML). Typically, it is the URL of the document containing the URL string.
EPUB defines what base URL to use for URL parsing in two cases:
- relative URL strings found in documents located in the META-INF directory
- relative URL strings in Package Documents
Parsing a URL in documents located in the META-INF directory
For documents in the META-INF directory, URL strings must be parsed using the root directory as the base URL (see Relative URLs, in Open Container Format).
The problem is that the Root Directory is not defined as a URL, but rather abstractly as "the base of the OCF Abstract Container". The spec also says the root directory is "virtual in nature". In fact, a Reading System (RS) may or may not generate a physical directory for the root directory (see OCF ZIP Container RS processing).
Parsing a URL in the Package Document
For Package Documents, URL strings must be parsed using the URL of the Package Document as the base URL (see Parsing Relative URLs, in Package Documents RS processing).
Here again, the URL of the Package Document is not well-defined. But the spec says (in the same section) that for zipped EPUBs, the URL of the package document is obtained "from the URL of the EPUB Container together with a fragment identifier that specifies the path to Package Document (relative to the Root Directory)".
Problems
The URL of the container’s root directory is undefined
The current specification leaves many questions unanswered:
- What is the URL of the root directory? Is it the URL of the ZIP file, or of the extracted directory? Is it constructed from the URL of the ZIP file and, if so, how? Or is it up to the RS to define it?
- The RS may generate a physical directory for the container's Root Directory if it unzips the EPUB. What if the RS unzips only a subdirectory rather than the root? What if the EPUB is not unzipped as a whole, but streamed on demand?
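To make the ambiguity concrete, here is a minimal sketch that resolves one relative URL string, as it might appear in META-INF/container.xml, against three candidate root directory URLs. All three candidates (and the `reader.example` host) are hypothetical; none of them comes from the spec. The sketch uses Python's `urllib.parse.urljoin`, which follows RFC 3986 rather than the URL Standard, but the two should agree for inputs this simple.

```python
from urllib.parse import urljoin

# Relative URL string as it might appear in META-INF/container.xml.
relative = "EPUB/package.opf"

# Hypothetical candidates for "the URL of the root directory".
candidates = {
    "URL of the ZIP file": "https://example.org/acme/mobydick.epub",
    "ZIP file URL treated as a directory": "https://example.org/acme/mobydick.epub/",
    "location of an unzipped copy": "https://reader.example/unzipped/mobydick/",
}

# Resolve the same string against each candidate base URL.
resolved = {label: urljoin(base, relative) for label, base in candidates.items()}

for label, url in resolved.items():
    print(f"{label}: {url}")
```

Each candidate yields a different URL for the same Package Document, so two Reading Systems that pick different candidates would disagree on what the string identifies.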
The current way to obtain the URL of the Package Document is flawed
Parsing a relative URL in the Package Document always results in a URL of a resource outside the container.
Examples:
For an EPUB mobydick.epub located at https://example.org/acme/mobydick.epub, the URL of the Package Document would be something like https://example.org/acme/mobydick.epub#path=/EPUB/package.opf. This is how a few relative URL strings are then parsed:
# | URL string | Base URL | Resulting URL |
---|---|---|---|
1 | nav.xhtml | https://example.org/acme/mobydick.epub#path=/EPUB/package.opf | https://example.org/acme/nav.xhtml |
2 | nav.xhtml | https://example.org/acme/tomsawyer.epub#path=/EPUB/package.opf | https://example.org/acme/nav.xhtml |
3 | ../video/cat.mp4 | https://example.org/acme/mobydick.epub#path=/EPUB/package.opf | https://example.org/video/cat.mp4 |
4 | /secret | https://example.org/acme/mobydick.epub#path=/EPUB/package.opf | https://example.org/secret |
5 | ../../../secret | https://example.org/acme/mobydick.epub#path=/EPUB/package.opf | https://example.org/secret |
- Example 1 shows that the parsed URL of a navigation document identifies a (possibly existing) resource outside the EPUB.
- Examples 1 and 2 show that the URLs of two documents from two different EPUBs are parsed into the same URL.
- Example 3 shows that a legitimate relative URL of an in-container video resource is parsed into a URL that:
  - may conflict with the URL of another legitimate remote resource (remote resources are allowed for video content).
  - leaks outside the container, and points to a space possibly owned by another publisher.
- Examples 4 and 5 show that it is very easy to forge URL strings that are parsed into URLs of arbitrary files on a server or file system. This is true not only for path-absolute URL strings like 4, but also for path-relative URL strings like 5.
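The five rows of the table can be reproduced with a short script. This is only a sketch: it uses Python's `urllib.parse.urljoin`, which implements RFC 3986 resolution rather than the URL Standard parser, though both should give the same results for these inputs. The fragment on the base URL is discarded during resolution, so its exact spelling has no effect on the outcome.

```python
from urllib.parse import urljoin

# (relative URL string, base URL) pairs mirroring the table rows.
examples = [
    ("nav.xhtml", "https://example.org/acme/mobydick.epub#path=/EPUB/package.opf"),
    ("nav.xhtml", "https://example.org/acme/tomsawyer.epub#path=/EPUB/package.opf"),
    ("../video/cat.mp4", "https://example.org/acme/mobydick.epub#path=/EPUB/package.opf"),
    ("/secret", "https://example.org/acme/mobydick.epub#path=/EPUB/package.opf"),
    ("../../../secret", "https://example.org/acme/mobydick.epub#path=/EPUB/package.opf"),
]

# urljoin takes the base first, then the relative reference.
results = [urljoin(base, relative) for relative, base in examples]

for number, url in enumerate(results, start=1):
    print(number, url)
```

None of the resulting URLs contains `mobydick.epub` or `tomsawyer.epub`: the base fragment is dropped and the last path segment of the base is replaced, so every result points outside the container.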
To summarize:
- the current way Package Document URLs are defined is flawed (potential conflicts between two legitimate URL strings)
- the current way Package Document URLs are defined is possibly a security or privacy vulnerability
Possible Solutions
An ideal solution would ensure that parsed URLs are:
- unambiguous: the results of parsing two URL strings should not be two identical URLs for one processor and two different URLs for another processor.
  - Why? Because otherwise it is impossible to tell whether an EPUB is conforming (it may be for one processor and not for another).
- contained: the result of parsing a relative URL string should not be the URL of a resource outside of the container. At the very least, a URL string representing a legitimate in-container resource should not be parsed into the URL of a remote resource.
  - Why? To avoid conflicts between publication resources and remote resources, and to avoid possible vulnerabilities.
- unique: the result of parsing two relative URL strings from two different EPUBs should not be two identical URLs.
  - Why? To avoid conflicts within an RS implementation (to be confirmed).
- origin-safe: the URLs parsed from two relative URL strings from two different EPUB instances should not be same-origin. If possible, the URLs parsed from two relative URL strings in the same EPUB should be same-origin.
  - Why? Resources within the same publication share the same trusted authority; resources within different publications (or copies of the same publication) do not.
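As a rough way to evaluate a candidate against three of these goals, the sketch below checks "contained", "unique" and "origin-safe" for the current Package Document URL scheme. The helpers `origin` and `is_contained` are hypothetical and deliberately simplified: the origin is approximated as a (scheme, host, port) tuple, and containment as a plain prefix test on the container URL. The "unambiguous" goal compares different processors and cannot be checked this way.

```python
from urllib.parse import urljoin, urlsplit


def origin(url: str) -> tuple:
    """Simplified origin: the (scheme, host, port) tuple of a URL."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)


def is_contained(url: str, container_url: str) -> bool:
    """Assumption: 'contained' means the URL sits under the container URL."""
    return url.startswith(container_url.rstrip("/") + "/")


# Package Document URLs of two different EPUBs, built as in the examples above.
moby = "https://example.org/acme/mobydick.epub#path=/EPUB/package.opf"
tom = "https://example.org/acme/tomsawyer.epub#path=/EPUB/package.opf"

moby_nav = urljoin(moby, "nav.xhtml")
tom_nav = urljoin(tom, "nav.xhtml")

contained = is_contained(moby_nav, "https://example.org/acme/mobydick.epub")
unique = moby_nav != tom_nav
origin_safe = origin(moby_nav) != origin(tom_nav)

print(contained, unique, origin_safe)  # prints: False False False
```

Under these simplified definitions, the current scheme fails all three checks. The same checks could be run against each proposed solution by swapping in the base URLs it would produce.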
Note: the ideal solution might not exist, or might not be practical to use, to implement, or to specify. But the goals listed above may help us evaluate a solution.
Possible solutions will be listed below as individual comments, for easier referencing in the discussion.
Comments and ideas welcome! 😊
I may have missed important things…