I gave a talk at the Haskell Implementors’ Workshop about Hackage, which you can find at https://vimeo.com/15464003. It’s 35 minutes in total.
The presentation part is a straightforward overview. Open discussion starts at about 16:30. You can get the slides here [PDF].
I hope it gives you a better idea of where Hackage is going. During the weekend, I had some great discussions about Hackage, and comparing functional languages, and finance (of all things). Even better, there was actual solid planning. On Sunday, Duncan Coutts and I materialized a plan for switching over to the new Hackage. It’s up online at the Hackage trac wiki, and the current revision looks something like this:
- *Live mirroring (user-immutable, all accounts are historical)
- Get archive.tar.gz of all ~10,000 packages on Hackage
- Investigate unmirrorable packages (e.g. binembed-example, network-info, old-time)
- Get cabal-installs pointing at it
- Implement backup for newer features (not all essential):
- Download statistics
- Candidates
- Preferred versions + deprecation
- Get data migration (schema updates) working more smoothly
- *Live server beta testing (user-mutable, all accounts are active)
- Disable registration; main Hackage accounts imported in
- Still mirroring the main Hackage
- Changes made here will be wiped out when server is fully deployed
- Configure server with Apache to support the tracs, support https on Hackage
- When ready to deploy: turn off upload on current Hackage
- Construct export tarball with these features:
- core (packages, user db, admin list)
- upload (trustees, maintainers)
- tags (based on categories, initially)
- distro (from current files: arch + debian, eventually exherbo + ubuntu)
- download (from logs, give expected format to Galois log holders)
- versions (deprecated packages, preferred-versions)
- Wipe server state and restore from tarball
- *Switch!
Throughout all this there will be testing for backups and performance. The starred items are the significant ones that’ll be announced. They look like “use it with cabal-install!”, “use it as you please unofficially!”, and “use it as you please officially!”. If you’d like to learn more about some of the ideas behind hackage-server, the architecture document is a good starting point, as well as past blog posts and the features themselves.
User groups
There are three important user groups: admins, package trustees, and package maintainers. Some server updates require membership in these groups; membership can be edited with a simple interface.
- Admins perform administrative tasks. They can create accounts, change anyone’s password, delete an account, make server backups, and modify the members of the other user groups. They can also modify the package index in ways not allowed by normal uploads.
- There is one package maintainer group per package. When a package is uploaded and no versions of it existed previously, the maintainer group is created with the uploader as the sole member. Maintainers can add other maintainers. Members of this group can then upload new versions of the package, edit its preferred versions and deprecated status, upload documentation, manage build reports, and other maintenance tasks. Of course, they don’t have to.
- Package trustees are package maintainers for all packages. They can add and remove maintainers for any package, and perform any action per package that maintainers can.
It’s not set in stone, or even etched on papyrus, who the admins and trustees are actually going to be. Initially, package maintainers will be anyone who’s uploaded a version of a given package.
Other features provide their own user groups as well. One thing about their implementation is that they are entirely decentralized: there’s no section in the code which lists all of the user groups. There is a user → group mapping, but it’s updated only in response to the groups themselves being modified. Other groups, editable by admins, include:
- Distro maintainers: can indicate which packages are available under which Linux distributions in their binary repositories. This information is available on package pages for those who prefer distro packages, as well as in list form.
- Mirrorers: these are accounts for scripts which copy packages from one Hackage to another. Presently this is implemented in a batch-difference mode from hackage-scripts to hackage-server and is run periodically.
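A minimal sketch of this decentralized group model, with a derived user → group reverse index updated only when groups change (all names and types here are hypothetical stand-ins, not the real hackage-server ones):

```haskell
import qualified Data.Map as Map
import Data.Map (Map)
import qualified Data.Set as Set
import Data.Set (Set)

type UserId    = String
type GroupName = String

-- Each feature owns its groups; the reverse index (user -> groups)
-- is only updated in response to group edits, mirroring the
-- decentralized design described above.
data GroupState = GroupState
  { groups      :: Map GroupName (Set UserId)
  , memberships :: Map UserId (Set GroupName)  -- derived reverse index
  }

addMember :: GroupName -> UserId -> GroupState -> GroupState
addMember g u st = GroupState
  { groups      = Map.insertWith Set.union g (Set.singleton u) (groups st)
  , memberships = Map.insertWith Set.union u (Set.singleton g) (memberships st)
  }

isMember :: GroupName -> UserId -> GroupState -> Bool
isMember g u st = maybe False (Set.member u) (Map.lookup g (groups st))
```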
Uploading
Uploading follows these steps:
- Uploader POSTs to /packages/ with package=[package tarball]
- Make sure the user is logged in and get their user info
- Put the package file in a temporary directory for incoming files
- Get the package’s cabal file, parse it, and check it’s valid. Get the package name and version.
- Fail if the package version is already in the main database
- If maintainers exist for the package, make sure the user is in the maintainer group
- Run pre-upload hooks; these can indicate errors and cause the upload to fail
- Move the package to blob file storage and add it to the main index
- Run all of the post-upload hooks, updating secondary indices to keep them in sync
- Redirect to the new package page (or display an error)
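The validation part of this checklist can be sketched as one pipeline (all types and helpers here are hypothetical stand-ins, not the real hackage-server API):

```haskell
import Control.Monad (unless, when)

-- Hypothetical types standing in for the real server's.
data User = User { userName :: String }
data Pkg  = Pkg  { pkgName :: String, pkgVersion :: [Int] }

data UploadError
  = VersionExists
  | NotMaintainer
  | HookRejected String
  deriving Show

-- Each step either fails with an UploadError or passes the
-- package along, mirroring the checklist above.
processUpload
  :: User
  -> Pkg
  -> (Pkg -> Bool)           -- already in the main database?
  -> (Pkg -> Maybe [String]) -- maintainer group, if one exists
  -> [Pkg -> Maybe String]   -- pre-upload hooks (Just = error message)
  -> Either UploadError Pkg
processUpload user pkg exists maintainers preHooks = do
  when (exists pkg) (Left VersionExists)
  case maintainers pkg of
    Just ms -> unless (userName user `elem` ms) (Left NotMaintainer)
    Nothing -> Right ()   -- first upload: the group is created afterwards
  case [e | h <- preHooks, Just e <- [h pkg]] of
    (e:_) -> Left (HookRejected e)
    []    -> Right pkg
```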
Account registration
I wish I could give you the process for account registration, but the truth is that it’s still undecided. The present system involves requesting an account via email. This could still work with the new hackage-server, technically. There are a few reasons why this kind of process could be refined: there can be several admins; account creation no longer requires access to the server’s filesystem (.htpasswd); and account maintenance of all of Hackage is a lot for one person.
Possible approaches include:
- Admins create accounts, possibly requested from some kind of web interface.
- Anyone can self-register partial accounts which can do everything but upload, but can e.g. edit tags, write comments, or vote. These can be transformed from partial to total accounts by admins (perhaps also using a ticket system).
- Let anyone self-register for an account and start uploading (it’s worked for rubygems.org).
So…
The newer hackage-server comes with some nifty defaults, including more detailed ways to maintain packages. There are some guiding principles to consider when making policies: for example, packages shouldn’t be any harder to upload than they are now (which is not very). Another principle is quality assurance (see “A radical Hackage social experiment“). The above system has been developed with these in mind. Last but not least, there’s the community’s experience with current Hackage policies. How can they be improved?
https://sparky.haskell.org:8080/
This was imported from Hackage package data a day or so ago—no user account data. The features currently enabled on the server are package pages, uploading packages, uploading candidates, distribution information, user groups, documentation, build reports, preferred versions, package deprecation, reverse dependencies, download statistics, tags, name search, and a handful of others.
The most important feature, though, and the reason this was a complete rewrite instead of just extending the old server, is that the internal design is modular and meant to be extended easily. If there’s a feature you don’t like (say, doing download statistics), it should take very little time to gut it from the application and not compile it in at all. The NameSearch module, as an example, adds two search indices, a simple search page (at /packages/find), and an OpenSearch plugin with suggestions. Installing it entails adding a line to Features.hs and writing an HTML view for it.
Performance
As far as performance goes: the process of routing a URI, querying data from several sources, and rendering the resultant page takes anywhere from 15ms (for an unadorned package page) to 3 seconds (for long lists of packages with descriptions and tags) on the sparky server. This is the amount of time it takes to fully generate the document as a ByteString, which is then given to the Happstack web framework. Here are some example times. I expect that switching from xhtml to BlazeHTML, based on the benchmarks so far, would definitely reduce the rendering time; I’m looking into other places to cut corners, though I’m no expert here.
Routing itself takes around 1ms, based on the dynamic approach I described in this post. On my laptop, which has faster cores but far fewer of them, crafting a response takes anywhere from 2ms to half a second, and routing takes around 0.2ms, for the same server configuration and package collection.
Unfortunately, sparky itself seems a bit laggy: yesterday it took 30 seconds (!) to request and retrieve a 350KB HTML document which is fully cached in memory, even though it took a fraction of a millisecond to get a ByteString for it. I’m looking into this.
Try it out!
So, take a look around and tell me what you think! If you want to try out your own copy, these should work as bash shell scripts if you have ghc+cabal-install+alex+happy on your system: import current Hackage data or start a completely new server. (These install the server and use its command-line interface.) Importing the current Hackage dataset requires somewhere in the neighborhood of 750MB of memory (I’m looking to reduce this) and 600MB to run the server (sparky has 32GB of memory). A brand new server requires just 2MB of memory.
To do
The primary goal this summer was to create a server architecture that could handle whatever we as a community need, and implement as much of it in Haskell as possible. I’m only one person, so there’s still a lot left to do, short-term and long-term, to get a better Hackage. I’ve outlined some of these tasks below.
What needs to be done before deploying to hackage.haskell.org?
- Documentation. It’s one of the most important things Hackage provides. hackage-server lets maintainers upload documentation tarballs, but ticket 517 should be resolved so documentation can be more easily generated with Cabal.
- Importing download statistics from the last few years. Granted, this is a minor one, but it’s a big help to have these without a gap in recording.
- Stress-testing, in terms of making sure the server performs well and maintains the consistency of internal indices. Make sparky a bit more responsive. Ensure compatibility with cabal-install, including old versions. Double-check security in order to minimize the risk of attacks (replay, DDOS, etc.).
- Deciding policy for things like account creation and uploading. I’ll put up a blog post soon about the policy that hackage-server currently has for these sorts of things, including an overview of the user group system.
- Implementing backup for some of the newer features and creating an interface for admins to download backup tarballs.
- Make sure the URI scheme is convenient for everyone.
- Write a robots.txt and set noindex on pages as appropriate.
- Arrange for distribution maintainers (for Debian and Arch, presently) to send us updates about which packages they have available. Haskell packages in distribution repositories tend to be simpler to install and more stable, so connecting to them is important.
- We need site admins and package trustees!
In the short-term future? (these should be implemented, sooner better than later)
- Build reports: get a system working for cabal-install clients to send build reports, anonymous or non-anonymous, as a replacement/enhancement of the build bot’s functionality. At present Hackage can accept basic build reports, but this should be gotten right before it’s enabled, particularly for anonymous reports.
- Web interface redesign. Since Hackage has more information to serve, it needs a better way to visually organize it. Anyone with web design chops is welcome. Other things to do here: expose JSON representations for Ajax functionality; rewrite HTML generating-code to use Blaze.
- Serve the internals of packages and set up a sitemap.xml so they can go on Google Code Search.
- Allow modifications to the cabal file without bumping the package version number. Admins can do this, but under some circumstances package maintainers might want to as well.
- See if user group information can be stored better internally.
- Get an SMTP client running on the server to send automated email notifications.
- More server-side logging of actions (with user and timestamp): this makes it easier to find out what’s going on and provide historical data.
In the long term future? (looking into the crystal ball)
- Social features. This includes reviews, voting, contributing content: the little things that let you know your fellow Haskellers are humans and not code-generating automatons (besides mailing lists, IRC, reddit, meetups, conferences, blog posts…). The more effectively we can connect maintainers and users, the better. Most of these social features would be simple to implement technically. It’s more difficult to decide which features would actually benefit us as a community and get better-quality packages.
- Allow the creation of arbitrary groups of packages. Currently, there’s a Haskell Platform feature, which puts a little star next to every package that’s in the platform. Why not lay the groundwork for other package groups?
- Insert your idea here
There’s a document in progress about the server internals, and how you can extend Hackage with new features. For the next week, I’ll be tidying up the code, bug-hunting, writing documentation, and seeing what I can do with transition preparations. Come join #hackage on freenode, if you like, since we’ll be discussing some of these things in the coming weeks.
The original schedule
When I applied to do the Hackage for Summer of Code, I included a tentative schedule. I have not strictly followed it so far, though I didn’t quite expect to. Here’s why.
1. 2 weeks. Become familiar with codebase and add documentation to declarations as I understand them. Find functionality not in the old server and not covered by the coming weeks and fully port it. Do the same for items in the hackage-server TODO list.
I didn’t anticipate all of the restructuring that needed to be done, thinking I could mostly append rather than modify. Well, I have substantially altered the already-great codebase into a modular form I’m pretty happy with, but that takes a long time when you’re starting with a 10,000-line codebase developed over 2 years (now it’s around 12,000 lines). The old server’s functionality is now mostly ported, although it wasn’t done within the space of these two weeks.
2. 1.5 weeks. Get build reports to display and gather useful information: already partially implemented. Use this feature as an opportunity to become even more comfortable refactoring and enhancing the hackage-server source.
Non-anonymous build reports are essentially complete. Anonymous ones are a bundle of privacy pitfalls, so we’ll have them as a separate feature, using a variant on the data structure currently used to house per-package reports. The idea is to publish them to everyone but do so in a way that mostly eliminates identification or cross-referencing. More on this below.
3. 1.5 weeks. Get user accounts and settings working, writing a system for web forms, both the dynamic JavaScript kind and static kind. Use this system to get package configurating [sic] settings editable by both package maintainers and Hackage administrators.
I’ve written precious little HTML and no JavaScript, instead using curl to prod the server and setting up an Arch VM to ensure compliance with the current (soon to be old?) cabal-install. User accounts, digest authentication, and user groups — essentially access control lists — are all here. Most of this information is served in text/plain at the moment. Given that the new server will probably require a redesign by more design-minded Haskellers, I’d rather keep everything minimalistic for the time being. As I mentioned last post, I think the server architecture has a good separation of model and view.
4. 1 week. If a viable solution for changelogs comes up by this point, I’ll implement it here. This might be as simple as a ./changelog file with a simple prescribed format.
That’s this week! At least 100 packages on Hackage already have changelogs. Of those, about two dozen are named changelog.md (they use the markdown fix/feature structure, which git uses). The rest have whichever format the author chose, and these formats are all over the place. Some use darcs changes, which is too fine-grained for Hackage. All this is too non-uniform for an automatic uniform interface. One approach that I can probably code up in a day or two is to have a changelog editable on Hackage. It could be entered on upload and possibly edited afterwards by maintainers. Otherwise, I’ll leave this one until “a viable solution for changelogs comes up”.
What’s been done
All of the features I listed in the last blog post have been implemented, although not all of them are exposed through HTML yet. Brief descriptions of them are there. The most interesting one, also proving to be the most challenging, is the candidate packages feature, which is an enhanced version of the check-pkg CGI script. Here’s what you can do with it.
- /packages/candidates/: see the list of candidate packages. POST here to upload a candidate package; candidates for existing packages can only be added by maintainers.
- /package/{package}/candidate: see a preview package page for a candidate tarball, with any warnings or errors that would prevent putting it in the main index
- /package/{package}/candidate/publish: POST here to put the package in the main index. It has to be a later version than the latest one existing under that name, and only maintainers can do this. If no package exists under the name, these restrictions don’t apply.
- /package/{package}/candidate/{cabal}: get the cabal file for this package
- /package/{package}/candidate/{tarball}: get the tarball
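The publish check can be sketched as a pure function (names are hypothetical; the real feature also validates the candidate tarball itself):

```haskell
-- Versions compare lexicographically, e.g. [0,9,8] < [0,10,0].
type Version = [Int]

-- Mirrors the /candidate/publish rules above: for an existing
-- package, only a maintainer may publish, and only a version newer
-- than anything already in the main index. For a brand-new package
-- name, the restrictions don't apply.
canPublish
  :: Bool       -- is the user a maintainer of this package?
  -> [Version]  -- versions already in the main index
  -> Version    -- candidate version
  -> Either String ()
canPublish _ [] _ = Right ()
canPublish isMaint existing v
  | not isMaint           = Left "only maintainers may publish"
  | v <= maximum existing = Left "candidate must be newer than the latest version"
  | otherwise             = Right ()
```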
In the immediate future
I’d like to get the newer server ready for running on sparky by the end of the week. It doesn’t yet look very different from the current Hackage in terms of what web browsers can access.
Currently there are four ways to start up the server. The first is to initialize it on a blank slate and go from there with hackage-server --initialise. Second, you can start it normally with an existing dataset stored by happstack-state, just hackage-server. Otherwise, you can import from an existing source. You can import mostly everything from the old Hackage server, as I described in my first post. Alternatively, you can initialize it from a single backup tarball produced by the server.
I’d like to revamp the interface to make it easier to deploy. Instead of importing directly from old sources, there’s going to be an auxiliary mode to convert legacy data into a newer backup tarball. Then, the new tarball can be imported directly. I haven’t had any backup tarballs on hand to test the newer import/export system, though it compiles. This is next on the todo list.
Some features that I’d like to get done soon are documentation uploading and package deprecation. Deprecated packages might still be needed as dependencies, so they’re kept around and will probably go in the index tarball, but they won’t be highly visible on any of the HTML pages. Documentation, for now, will be implemented by uploading tarballs. This is compatible with the current solution, which is to have a dedicated build client, while also letting maintainers upload their own docs when the build client can’t generate them. This would be even simpler if .haddock files provided everything necessary for generating HTML docs and linking them with hscolour pages; I’m not sure if that’s the case. Holding onto .haddock files would also make documentation statistics a lot easier. For now, documentation tarball upload is the route I’m taking.
Another nice feature would be serving directly from package tarballs, preferably without having to store them in memory or unpack them on the server filesystem. Like the documentation feature, it would use a data structure defined in the hackage-server source: a TarIndexMap. Given a file path, it can efficiently give you the byte offset of the tar entry where that file is stored, and from that retrieve the file directly. There are some downsides here. First, package tarballs are stored not as .tar but as .tar.gz, so serving this way means also keeping an uncompressed copy, which might more than double the storage required (though unpacking to the filesystem would cost the same). Second, the TarIndexMap of every single package tarball would be kept in memory, although this uses an efficient trie structure, so it’s not so bad.
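The byte-offset idea can be sketched with a simplified stand-in for TarIndexMap (a Map instead of the real trie; the standard 512-byte tar block layout is assumed):

```haskell
import qualified Data.Map as Map
import Data.Map (Map)

-- A toy index from file path to (offset, length) within a tarball.
-- The real hackage-server structure is a trie; a Map shows the idea.
type TarIndexMap = Map FilePath (Int, Int)

-- Tar entries are 512-byte aligned: one header block, then
-- ceil(size / 512) content blocks.
indexEntries :: [(FilePath, Int)] -> TarIndexMap
indexEntries = Map.fromList . go 0
  where
    go _ [] = []
    go off ((path, size) : rest) =
      (path, (off + 512, size))          -- content starts after the header
        : go (off + 512 + blocks size) rest
    blocks size = 512 * ((size + 511) `div` 512)

-- Efficient lookup: from path straight to the byte range to serve.
lookupEntry :: FilePath -> TarIndexMap -> Maybe (Int, Int)
lookupEntry = Map.lookup
```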
There are also some internal server design challenges, which I’ll describe in the next two paragraphs; skip them if you like. One of them is making URI generation less clunky. Every resource provides enough information to generate its canonical URI given an association list of string pairs. However, this requires passing around the resource itself, which also contains the server function and other things. I’m considering making a global map that, given the string name of a resource, gives a URI-generating function, which means either passing this mapping to every single server function or setting up a ReaderT monad around Happstack’s ServerPartT. The other issue is that a URI is not guaranteed; it’s wrapped in a Maybe, since this system doesn’t provide the type safety guarantees of libraries like web-routes: it’s ‘stringly typed’.
In addition, user groups are currently totally decentralized, but perhaps they could use some more coordination. The MediaWiki system of having a global mapping for which groups are allowed to execute which permissions is pretty good, though in a typical PHP manner, it uses strings to do this. It might be better for each type of group to list what permissions it can do, rather than having this check in code itself, but again this might require passing this mapping to every single server function.
Memory and performance
I’ve done some rudimentary statistics-gathering, but much more will need to be done soon.
For instance, importing from the old Hackage server causes the memory used by the server to reach around 700MB and stay there (any memory allocated by GHC always stays there), and this is only for the current tarball versions. However, this is only needed for initialization, as I mentioned I plan on making a separate mode for legacy import.
By contrast, starting up the server with the current set of package versions occupies 390 MB of memory, although only 148 MB is used by the RTS at any given time. When initializing the server in this mode, 40% of the CPU time is used on garbage collection, but things seem reasonably stable afterwards. The directory storage with the current tarball versions occupies 130 MB disk space, and the happstack-state database is just 17 MB. This database is pretty small comparatively, likely because it doesn’t include the parsed PackageDescription data structure, which contains lots of fields and lots of strings.
In general, I suspect I’ll need some modifications to ensure that GHC isn’t too heap-hungry. Heap profiling has proven suspect thus far, since apparently the server has a special affinity for ghc-prim:GHC.Types.:, and if I’m reading it right, I find it somewhat hard to believe that over 90% of the server’s memory is used on cons cells. On the other hand, maybe there really are that many Strings and [Dependency]s. I think later on I’ll be asking the advice of some more senior Haskell hackers to keep memory usage down, even if one of the selling points of Happstack is that all data’s in memory. (Not entirely true here: the blob storage is used for package tarballs.)
In the eventual future
Build reports are a must-do, and at present authenticated clients can submit build reports and build logs. Anonymous reports are tricky, though (but still immensely useful), and I know many of you wouldn’t submit reports without them. Statistics need to be done as well: the question is how to take a large number of reports like this one:
package: cabal-install-0.6.2
os: linux
arch: i386
compiler: ghc-6.10.4
client: cabal-install-0.6.0
flags: -bytestring-in-base -old-base
dependencies: Cabal-1.6.0.3 HTTP-4000.0.8 [...]
install-outcome: DependencyFailed zlib-0.5.2.0
docs-outcome: NotTried
tests-outcome: NotTried
and tell you something useful about them. Perhaps it could tell you that the above report is not recent.
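A first step toward statistics is just parsing such a report into key/value pairs; a sketch (the real cabal-install report format is richer than this, so treat it as illustrative only):

```haskell
import Data.Char (isSpace)

-- Split each "key: value" line at the first colon; values may
-- contain spaces (e.g. the dependencies field).
parseReport :: String -> [(String, String)]
parseReport = map field . filter (not . null) . lines
  where
    field l = case break (== ':') l of
      (k, ':' : v) -> (k, dropWhile isSpace v)
      (k, _)       -> (k, "")
```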
Also, a solution for systematic client-side and server-side caching of HTML hasn’t come up yet, if this is in the cards at all. Making an ETag-generating function is not a simple matter, particularly when multiple representations of the same resource are served in multiple formats at multiple URIs (sadly, I can’t rely solely on the Accept header, because browser implementers seemingly read RFCs highlighted with black markers).
Finally, there’s no clear procedure for migrating data, and I’m still not fully familiar with Happstack state’s data versioning system. Apparently both data types need to exist at the same time, and then the old one can be discarded. I could probably write a startup mode for this.
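The “both data types exist at the same time” idea looks roughly like this (a self-contained analog; happstack-state’s actual versioning machinery layers serialization metadata on top of the migration function):

```haskell
-- Both the old and new type exist side by side, with a total
-- function from old to new. Once all stored data has been migrated,
-- the old type can be discarded.

data UserV0 = UserV0 { v0Name :: String }

data UserV1 = UserV1
  { v1Name    :: String
  , v1Enabled :: Bool   -- new field; migrated accounts default to enabled
  }

migrateUser :: UserV0 -> UserV1
migrateUser old = UserV1 { v1Name = v0Name old, v1Enabled = True }
```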
The most eventual of future elements is more shiny features. This future will extend beyond this summer, so while some individual features might deserve Summer of Code projects in their own right, I’ll try to knock out as many of the others as possible. Let the other 4/7ths begin!
Feature graphs
Each feature has a listing of the URIs it provides, the user groups it needs to authenticate, and the data it needs to store with methods to back up and restore that data. A feature might also define caches for its pages, IO hooks to execute on certain events (like uploading a package), and pretty much anything else: features are arbitrary datatypes that implement a HackageFeature typeclass. If feature A depends on feature B, then feature A can extend B’s URIs with new formats and HTTP methods, use B’s data and user groups, and register for any of B’s hooks.
The barebones features are:
- core: the central functionality set for something to reasonably be called a Hackage server. This serves tarballs, cabal files, and basic listings. The data it maintains are the user database and a map from PackageName to [PkgInfo] (see previous post). It is possible to create a core-only server with an archive.tar, but it’s effectively immutable after initialization.
- mirror: this allows tarballs to be uploaded directly by special clients, and it is intended for use by secondary Hackages (if any) which need to stay up to date without having to support a userbase. This doesn’t use its own data, instead manipulating the core’s.
Now, take a look at the packages, upload, check, users, distros, and build features. Some of them depend on each other. They all depend on core. html depends on all of them. One way to look at the organization is that they provide the model and controller for the data, and html provides a view. They are interfaces which provide their own data in a way that html/json/xml/yaml/whichever other features can render in their particular format with a minimal amount of effort.
For example, the packages feature doesn’t define any of its own URIs, but has a function, PackageId -> IO (Maybe PackageRender), which the HTML package page calls. The PackageRender type is essentially the One True Resource Representation of a package, and it looks like this:
data PackageRender = PackageRender {
-- using the most recently uploaded package as of now
rendPkgId :: PackageIdentifier,
-- Vec-0.9.8
rendAllVersions :: [Version],
-- [0.9.0, 0.9.1, 0.9.2, 0.9.3, 0.9.4, 0.9.5, 0.9.6, 0.9.7, 0.9.8]
rendDepends :: [[Dependency]],
-- [[array, base (≤5), ghc-prim, QuickCheck (2.*)]]
rendExecNames :: [String],
-- [] (no executables)
rendLicenseName :: String,
-- BSD 3
rendMaintainer :: Maybe String,
-- Just "Scott"
rendCategory :: [String],
-- ["Data", "Math"]
rendRepoHeads :: [(RepoType, String, SourceRepo)],
-- [] (no repository)
rendModules :: Maybe ModuleForest,
-- Just a tree containing Data.Vec.*
rendHasTarball :: Bool,
-- True
rendUploadInfo :: (UTCTime, String),
-- (Jun 17 2010, "Scott")
rendOther :: PackageDescription
-- the package description
}
From this, the html feature can make a package page that looks like the current one, where 95% of its work is HTML formatting via Text.XHtml.Strict. A json feature could use the same information to make a data-rich nest of curly braces.
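For instance, a hypothetical json feature could flatten a few of these fields into an object; a minimal sketch over a cut-down record (proper string escaping omitted, plain ASCII values assumed):

```haskell
import Data.List (intercalate)

-- A cut-down PackageRender with just the fields serialized here.
data PackageRender' = PackageRender'
  { rendPkgId'       :: String
  , rendLicenseName' :: String
  , rendMaintainer'  :: Maybe String
  }

-- Render as a JSON object by hand; real code would use a JSON
-- library and escape strings properly.
toJson :: PackageRender' -> String
toJson r = "{" ++ intercalate ","
  [ pair "package"    (str (rendPkgId' r))
  , pair "license"    (str (rendLicenseName' r))
  , pair "maintainer" (maybe "null" str (rendMaintainer' r))
  ] ++ "}"
  where
    pair k v = str k ++ ":" ++ v
    str s = "\"" ++ s ++ "\""
```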
Now, a paragraph or two about a failed approach. I also considered having each feature provide its own HTML. This is perhaps the simplest approach on the face of it. However, it gets tricky to, say, define a package page and then later append to it for newer features. I considered HTML hooks where a feature could provide an interface to anyone who wants to inject Html blocks into its pages. For example, a build reports feature would have to register for a hook so that the main package page can link to the reports page.
This has several disadvantages, the most prominent of which is that it makes it cumbersome to switch to a different HTML-generating library or add new formats; it amounts to accepting that HTML is an exceptionally unmodular thing. Instead, what the HTML feature now does is depend on both the build reports feature and the packages feature, and this also allows free-form HTML instead of copy+paste amalgamations, which I’ve heard can be rather ugly. The metric to go by here is “out of all modifications one could imagine making to the server, how can I make them implementable by modifying the minimum number of modules?” (I haven’t considered using partial derivatives to optimize the minimum… yet.)
Here is a brief description of the middle features:
- packages: just package pages
- upload: authenticated users can upload new packages, with some checking in place: can’t overwrite packages, can only upload a new version if a maintainer, and so on. Adds a maintainer/author group for each package. By contrast, the mirror feature overwrites packages without question.
- check: checking packages before indexing them and providing candidate packages (see previous post)
- users: user pages, password-changing, currently using core and not storing any data of its own
- distros: linking Hackage with Arch, Debian, and any other distribution with package repositories with Haskell binaries. These distributions can PUT and DELETE to Hackage to indicate the addition and removal of these packages.
- build: submission of build reports, both anonymous and with full compilation logs
And finally, an ad hoc but nonetheless important feature:
- legacy: a pile of 301 redirects so that old URIs can mostly work (in particular, links to /cgi-bin/hackage-scripts/package/foobar posted on mailing lists 4 years ago will still work)
Features each have their own particular init functions. For instance, the function to initialize the HTML module is currently:
initHtmlFeature :: CoreFeature -> PackagesFeature -> IO HtmlFeature
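In that style, wiring a few features together reads naturally in IO; a sketch with the feature contents elided (all bodies here are hypothetical stand-ins):

```haskell
-- Hypothetical stand-ins for the real feature types.
data CoreFeature     = CoreFeature
data PackagesFeature = PackagesFeature
data HtmlFeature     = HtmlFeature

initCoreFeature :: IO CoreFeature
initCoreFeature = pure CoreFeature

initPackagesFeature :: CoreFeature -> IO PackagesFeature
initPackagesFeature _ = pure PackagesFeature

initHtmlFeature :: CoreFeature -> PackagesFeature -> IO HtmlFeature
initHtmlFeature _ _ = pure HtmlFeature

-- Dependencies are made explicit by the argument lists: html needs
-- core and packages, so those are initialized first.
initAllFeatures :: IO HtmlFeature
initAllFeatures = do
  core <- initCoreFeature
  pkgs <- initPackagesFeature core
  initHtmlFeature core pkgs
```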
URI trees
I would have written this up yesterday but I’ve spent the last 24 hours implementing a new and improved routing system. All of the magic happens in impl, the ServerPart Response which is given to Happstack’s simpleHTTP.
impl :: Server -> ServerPart Response
impl server =
renderServerTree (serverConfig server) []
-- ServerTree ServerResponse
. fmap serveResource
-- ServerTree Resource
. foldl' (\acc res -> addServerNode (resourceLocation res) res acc)
serverTreeEmpty
-- [Resource]
$ concatMap resources (serverFeatures server)
This seems pretty terse for what’s effectively the server’s main method, but complexity lurks just beneath the surface. It all starts with lists of resources, each server feature providing its own list, which are concatenated into a [Resource]. A Resource contains a URI, and how to respond when that URI is visited for certain combinations of HTTP methods and content-types. Although I’ve never coded a line of Ruby in my life, I stole some of Rails’ routing syntax for this task (also stolen by the Pylons web framework, apparently). Here’s how it works:
- A resource at “/users/login” will be run only when /users/login is visited, assuming it’s a GET request.
- A resource at “/package/:package” will be run when /package/HDBC is visited, but also when /package/nonexistent-1.0 is entered. It’s passed [("package", "HDBC")] in the former case, and there are combinators to turn assoc lists into data values (type DynamicPath = [(String, String)] and a combinator withPackagePath :: DynamicPath -> (PackageId -> PkgInfo -> ServerPart Response) -> ServerPart Response). It’s up to the resource to return a 404 if it can’t abide by the URI.
- A resource at “/package/:package/doc/…” will be run when /package/uvector/doc/ or any subdirectory is visited, and it’s likewise passed an appropriate assoc list.
- I can specify “/package/:package/:cabal.cabal”, and when /package/parsec-3.1.0/parsec.cabal is visited, the resource is given [("package", "parsec-3.1.0"), ("cabal", "parsec")] (the extension is stripped off).
- And the most complicated one: “/package/:package.:format”. This works for /package/QuickCheck ([("package", "QuickCheck"), ("format", "")]), or /package/llvm-0.8.0.2.json ([("package", "llvm-0.8.0.2"), ("format", "json")]). An empty format means to go for the default, in this case HTML.
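A handler built on these combinators might read as follows (a sketch: servePackagePage is a made-up name, but withPackagePath has the type given above):

```haskell
-- Sketch: look up the ":package" component in the DynamicPath and
-- hand the parsed PackageId and its PkgInfo to the handler body;
-- withPackagePath is responsible for the 404 when parsing or
-- lookup fails.
servePackagePage :: DynamicPath -> ServerPart Response
servePackagePage dpath = withPackagePath dpath $ \pkgid _pkginfo ->
    ok . toResponse $ "Package page for " ++ display pkgid
```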
Server trees provide a way to serve an entire tree of URIs efficiently. Starting from an empty server tree, resources are incrementally added, and when two share the same URI format they are combined into a single node; the full Hackage URI hierarchy is built up this way.
The relevant types are:
data ServerTree a = ServerTree {
nodeResponse :: Maybe a,
nodeForest :: Map BranchComponent (ServerTree a)
}
data BranchComponent = StaticBranch String -- /foo
| DynamicBranch String -- /:bar
| TrailingBranch -- /...
addServerNode :: Monoid a => [BranchComponent] -> a
-> ServerTree a -> ServerTree a
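One plausible implementation of addServerNode against these types (a sketch, not necessarily what the repository does):

```haskell
import Data.Monoid (Monoid, mappend)
import qualified Data.Map as Map

serverTreeEmpty :: ServerTree a
serverTreeEmpty = ServerTree Nothing Map.empty

-- Walk the branch components, creating intermediate nodes as needed;
-- when two resources land on the same node, combine them with mappend.
addServerNode :: Monoid a => [BranchComponent] -> a
              -> ServerTree a -> ServerTree a
addServerNode [] a tree =
    tree { nodeResponse = Just $ maybe a (`mappend` a) (nodeResponse tree) }
addServerNode (b:bs) a tree =
    tree { nodeForest = Map.insert b subtree' (nodeForest tree) }
  where
    subtree  = Map.findWithDefault serverTreeEmpty b (nodeForest tree)
    subtree' = addServerNode bs a subtree
```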
Finally, I have a 100-line function for converting resources into something Happstack can read (to be broken up shortly, I hope). It’s called serveResource, and it’s how I convert a ServerTree Resource into a ServerTree ServerResponse via ServerTree’s Functor instance. Then the tree is converted to its final flat form, using Happstack’s path-munching combinators to traverse each node’s forest.
serveResource :: Resource -> ServerResponse
renderServerTree :: Config -> DynamicPath
-> ServerTree ServerResponse
-> ServerPart Response
If this effort is a success, I won’t have to deal with the ServerTree type in any great detail for the rest of the summer. I’ve pushed all of the above code to the hackage-server repository.
Thanks for perusing my run-through of some of the internal server design and my exploration of the problem domain. I’ve also been reading the other GSOC blogs, including Marco’s progress on Immix. I had idly considered applying for that, but given my near-total unfamiliarity with the GHC RTS, I think it would’ve been more than a bit difficult for me. I can see he’s doing a great job, too. Still, there are some things I appreciate about using Haskell and not C in my project. Not only does the type system prevent a host of runtime errors, it also forces me to consider all possible sorts of values which can inhabit a given type and write a case for each one. This is something that’s come in handy a lot in the past few days. Well, now on to actually implementing features in detail. I’ll keep you all posted.
The goal is to make a REST API which can be read and manipulated by automated clients, and of course perused by web browsers just like the current Hackage. One of the purposes of REST (Representational State Transfer) is simplifying state between the client and server by manipulating representations of resources using plain old HTTP.
Those Haskellers who are familiar with REST might point out that documenting an API and setting up URI conventions (like I did on the Hackage wiki) are partly antithetical to the goals of REST, which eschew servers that only understand highly specific remote procedure calls and clients which construct URIs based on hard-coded conventions (coupling). Roy Fielding, the inventor of REST, stresses that REST APIs must be hypertext-driven. Don’t worry: my intention is to make all of the URIs fully discoverable from the server root, whether browsing the HTML representation or, say, a JSON version. The URI page is an aid in design, not documentation. Since there tends to be a one-to-one mapping between each URI/method pair I’ve listed and each feature I’d like to implement, it tells me what I have left to do.
Fine-tuning data structures
To this end, I’ve made and committed some changes to hackage-server. Some of the time was spent adjusting a few important types, and the rest dealing with the subsequent code breakages. It is still better and safer than doing similar things in a dynamically typed programming language, where I’d end up either sweeping the entire code base or analyzing call graphs manually to determine what broke. Here’s an example of a type I altered, the PkgInfo type, which holds information about a specific package version:
data PkgInfo = PkgInfo {
-- | The name and version represented here
pkgInfoId :: !PackageIdentifier,
-- | Parsed information from the cabal file.
pkgDesc :: !GenericPackageDescription,
-- | The current .cabal file text.
pkgData :: !ByteString,
-- | The actual package .tar.gz file, where BlobId
-- is a filename in the state/blobs/ folder.
-- The head of the list is the current package tarball.
pkgTarball :: ![(BlobId, UploadInfo)],
-- | Previous .cabal file texts, and when they were uploaded.
pkgDataOld :: ![(ByteString, UploadInfo)],
-- | When the package was created with the .cabal file.
pkgUploadData :: !UploadInfo
} deriving (Typeable, Show)
type UploadInfo = (UTCTime, UserId)
The global Hackage state defines a mapping from PackageName to [PkgInfo]. Subtle differences in which types of values are allowed to inhabit PkgInfo have important consequences for package uploading policy. There are a few notable results of this definition.
- A package can exist without a tarball. This is more significant for importing data to create secondary Hackages than the normal upload process. The more incrementally importing can happen, the simpler it will be. Alternatively, this would allow for a metadata-only Hackage mirror.
- Cabal files can be updated, with a complete history, without having to change the version number. This would allow maintainers to expand version brackets or compiler flags, so long as the changes don’t break anything (constricting version brackets is more dangerous).
- Tarballs can be updated, also with a complete history, without having to change the version number. This probably won’t be enabled on the main Hackage, but exceptions can be granted by admins. If an ultra-unstable Hackage mirror came about, as opposed to the somewhat-unstable model we have now, this might be allowed.
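The cabal-file update described above amounts to a pure record manipulation. A sketch (updateCabalFile is a hypothetical name, and whether pkgUploadData should track the latest text or the original creation is a design choice):

```haskell
-- Hypothetical: install a new .cabal text for an existing version,
-- archiving the previous text together with the upload info recorded
-- for it, so the full history survives.
updateCabalFile :: ByteString -> UploadInfo -> PkgInfo -> PkgInfo
updateCabalFile newText newInfo pkg = pkg
    { pkgData       = newText
    , pkgDataOld    = (pkgData pkg, pkgUploadData pkg) : pkgDataOld pkg
    , pkgUploadData = newInfo
    }
```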
Modular RESTful features
The HackageFeature data structure is intended to encapsulate the behavior of a feature and its state. Features include the core feature set—the minimal functionality that a server must have to be considered a Hackage server, which is serving package tarballs and cabal files—supplemented by user accounts, package pages, reverse dependencies, Linux distro integration, and so on.
The most important field of a feature is locations :: [(BranchPath, ServerResponse)]. The BranchPath is the generic form of a URI, a list of BranchComponents. Taking inspiration from Ruby on Rails routing, you can construct one with the syntax "/package/:package/reports/:id", where visiting https://hackage.haskell.org/package/HDBC/reports/4/ will pass [("package", "HDBC"), ("id", "4")] to the code serving build reports. You can define arbitrary ServerPart Responses at a path, or you can use a Resource abstraction which lets you specify different HTTP methods (GET, POST, PUT, and DELETE). This system is still in development.
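In this scheme, a feature’s locations list might look like the following sketch (trunkAt is assumed to parse the Rails-style string into a BranchPath, and the serve* handlers are made-up names):

```haskell
-- Sketch: a build-reports feature exposing two locations, one for
-- the report list and one for an individual report.
reportLocations :: [(BranchPath, ServerResponse)]
reportLocations =
    [ (trunkAt "/package/:package/reports",     serveReportList)
    , (trunkAt "/package/:package/reports/:id", serveReport)
    ]
```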
HTTP goodies
Because each resource defines its method set upfront, it’s possible to make an HTTP OPTIONS method for each one. This is an example of something you get “for free” by structuring resources in certain ways. As I’ve discovered, there can be an unfortunate trade-off: requiring too much structure makes it unpleasant to extend Hackage with new functionality (having to deal with all of the guts of the server). Too little structure means that those implementing new features can accidentally break the site’s design principles and generally cause havoc. A reasonable middle ground is the convention over configuration approach: I’d have plenty of configurable structure internally, and combinators which build on that structure by filling in recommended conventions. This applies particularly to getting the most out of HTTP.
The idea of content negotiation in HTTP is simple, although there’s no clear path ahead for implementing it yet. For Hackage, content negotiation consists of responding to preferences in the client’s Accept header, which contains MIME types with various priorities. (Other sorts of negotiation include those for languages and encodings.) A web browser like Firefox might send text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, and a hypothetical newer cabal-install client would send application/json for miscellaneous data it needs to read.
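A first cut at parsing such a header could be as simple as this sketch (whitespace handling and wildcard matching are omitted for brevity; this is not the server’s actual code):

```haskell
import Data.List (sortBy)

-- Split "text/html,application/xml;q=0.9,*/*;q=0.8" into (mime, q)
-- pairs, ordered by descending priority; a missing q defaults to 1.0.
parseAccept :: String -> [(String, Double)]
parseAccept =
    sortBy (\a b -> compare (snd b) (snd a)) . map entry . splitOn ','
  where
    entry s = case splitOn ';' s of
        (mime:params) -> (mime, qValue params)
        []            -> (s, 1.0)
    qValue ps = case [v | 'q':'=':v <- ps] of
        (v:_) -> read v
        _     -> 1.0
    splitOn c xs = case break (== c) xs of
        (a, [])     -> [a]
        (a, _:rest) -> a : splitOn c rest
```

For the Firefox header above, this yields text/html and application/xhtml+xml at priority 1.0, followed by application/xml at 0.9 and */* at 0.8.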
Authentication functionality is essentially done, although I had originally planned to work on it later this month. The types might still need some tweaking, of course. There is a system of access control lists called UserLists, each of which is a Data.IntSet of UserIds. With this, we can have an extensible hierarchy of user groups, such as site administrators, per-package maintainers, and trustees (allowed to manipulate all packages without possessing other admin functionality). The type signature for the main authentication function is:
requireHackageAuth :: MonadIO m => Users -> Maybe UserList
-> Maybe AuthType -> ServerPartT m UserId
Users is the type containing the site’s entire user database. AuthType, either BasicAuth or DigestAuth, can be passed in to force the type of authentication: either basic or digest (all passwords are currently hashed in crypt form). This method either returns the UserId, assuming authentication succeeded, or forces a 401 Unauthorized or 403 Forbidden. With this, we can easily extend it to handle specific tasks:
requirePackageAuth :: (MonadIO m, Package pkg) => pkg
                   -> ServerPartT m UserId
requirePackageAuth pkg = do
    userDb <- query $ GetUserDb
    pkgm   <- query $ GetPackageMaintainers (packageName pkg)
    trust  <- query $ GetHackageTrustees
    let groupSum = Groups.unions [trust, fromMaybe Groups.empty pkgm]
    requireHackageAuth userDb (Just groupSum) Nothing
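Guarding a state-changing handler then takes a single line; a sketch with a made-up deletion operation:

```haskell
-- Hypothetical handler: demand maintainer or trustee credentials
-- before mutating package state.
handleDeletePackage :: Package pkg => pkg -> ServerPart Response
handleDeletePackage pkg = do
    uid <- requirePackageAuth pkg  -- 401/403 short-circuits here
    -- ... perform the deletion, attributed to user 'uid' ...
    ok $ toResponse "Package deleted"
```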
Import/export
To paraphrase Don’s comment on my previous post, we absolutely can’t afford to lose any data. Although the state/db/ directory contains all of happstack-state’s MACID data, and can be periodically backed-up off-site, binary data is by nature easy to mess up and hard to recover. A bit of redundancy in storage is a reasonable safeguard, and there’s little more redundant than English, at least compared to bit-packing.
Antoine had implemented an extensive Hackage-to-CSV export/import system, where instead of e.g. having a single bit represent whether an account is enabled or disabled, we use the words “enabled” and “disabled”, and put the resulting export tarball in a safe place. Instead of having one centralized system, each HackageFeature should take care of its own data, and so I’d like to work on decentralizing the system in the days ahead. The type signatures, suggested by Duncan, are:
data HackageFeature = HackageFeature {
...
dumpBackup :: IO [BackupEntry],
restoreBackup :: [BackupEntry] -> IO (),
...
}
type BackupEntry = ([FilePath], ByteString)
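A feature holding, say, download counts could then dump itself as plain CSV; a sketch (dumpDownloadCounts and the file layout are illustrative):

```haskell
import qualified Data.ByteString.Lazy.Char8 as BS

-- Illustrative: serialize per-package download counts as a readable
-- CSV file at download/counts.csv inside the backup tarball, rather
-- than as opaque binary.
dumpDownloadCounts :: [(String, Int)] -> IO [BackupEntry]
dumpDownloadCounts counts = return
    [ ( ["download", "counts.csv"]
      , BS.pack (unlines [ name ++ "," ++ show n | (name, n) <- counts ])
      ) ]
```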
Bringing features together
There are notable tasks remaining for the basic infrastructure, such as implementing this import/export system. Another major one is creating a hook system with the usual dual nature: one part that responds to actions (like uploading packages) and another which edits documents on the fly (like adding sections to a package page). If you have experience with website plugin systems, what are your thoughts on getting this done in a strongly, safely typed manner?
Having taken a brief tour of the internal server proto-design and the types of functionality that can be implemented with it, I’d like to show how we can leverage these to implement some useful features, some this summer if we as a community approve of them:
- Build reports, to see if a given package builds on your OS (might save time for unfortunately oft-neglected Windows users), on your architecture, with your set of dependencies. I would strongly encourage all of you to at least submit anonymized build reports once the feature goes live (check out the client-side implementation), if not the full build log, although I promise we won’t stoop to a “Do you want to submit a build report?” query every single time a build fails: maybe only just the first time :) Submitting or not is more of a configuration option. Build reports will probably be anonymized for the public interface, but available in full to package maintainers through the requireHackageAuth authentication mechanism.
- Reverse dependencies. This is a HackageFeature that doesn’t need to define any of its own persistent data, just its own compact index of dependencies that subscribes to a package uploading hook. You can peruse Roel’s revdeps Hackage, and if you feel like setting up hackage-scripts with Apache, you can apply his patch to run your own.
- Hackage mirrors. It should be simple to write a mirroring client that polls hackage.haskell.org’s recent changes, retrieves the new tarballs, and HTTP PUTs them to a mirror with proper authentication.
- Candidate package uploads: improved package checking. This would allow you to create a temporary package resource, perhaps available at /package/:package/candidate/, to augment the current checking system. Currently, checking gives you a preview of the package page with any Cabal-generated warnings. Here, you could set up a package on the Hackage website that’s not included on the package list or index tarball. It would employ its own mapping from PackageName to PkgInfo. You can make a candidate package go live at any time, even allowing others to install your candidate package before then. This is a slightly different idea from the ultra-unstable zoo of packages I mentioned with PkgInfo, but has similar quality assurance goals.
Thanks for reading what I’ve been up to. Critique is welcomed.
The current Hackage implementation is hackage-scripts, which you can get here:
$ darcs get https://darcs.haskell.org/hackage-scripts/
Its primary goal is to serve both cabal files (package metadata) for the cabal-install tool to parse and package tarballs for it to compile, and the server uses a glorified directory tree to accomplish this. It also has a minimalistic web interface for finding packages, viewing their metadata, and perusing Haddock-generated documentation. hackage-scripts uses a combination of static files and Network.CGI executables, which are invoked by the web server, read information about the request using the CGI specification, and then print the HTML response to standard output. Not the least of these scripts is the one that uploads new packages, using either cabal upload or the web interface.
hackage-scripts is portable in that it should run on any standard Apache installation. Unfortunately, it usually doesn’t run out of the box. The directory tree and static files have to be set up manually, and the Makefile and source code need to be hardcoded with pathnames indicating where the setup is. Even if you can’t get it running on your own, it is happily chugging away on hackage.haskell.org, which your cabal configs (~/.cabal/config) undoubtedly point to.
The candidate replacement is known simply as hackage-server, and you can get it here in its pre-summer-of-code state:
$ darcs get https://code.haskell.org/hackage-server/
It uses the Happstack web framework to deconstruct URIs by their path hierarchy, rather than letting Apache root through a large directory tree of mostly static files. It also uses the happstack-state package, at present keeping approximately 186 MiB of package data for 8376 package versions in memory to serve requests, falling back to the disk for larger files such as the package tarballs.
This summer’s project is particular in that it involves work on a code base which most Haskellers won’t install themselves, but provides a service most of us will end up dealing with frequently. This makes it important to get right from an architecture standpoint. Nonetheless, I hope to make it painless to set up a secondary Hackage repository as a drop-in replacement for the main one, potentially allowing you to pull from a variety of sources of varying stabilities. Setting up a server on https://localhost:8080/ over an empty repository is as easy as changing to the repository’s top-level Darcs directory and running
$ cabal install
$ hackage-server --initialise
(albeit not as easy if the dependencies end up failing: I had to change the Happstack dependency brackets in hackage-server.cabal from ==0.4.* to ==0.5.* because I use an older base). Setting up a hackage.haskell.org clone with the current tarballs is a bit more complex, but within the realm of science to solve!
$ cabal install
$ wget -P /tmp https://hackage.haskell.org/cgi-bin/hackage-scripts/archive.tar
$ wget -P /tmp https://hackage.haskell.org/packages/archive/00-index.tar.gz
$ wget -P /tmp https://hackage.haskell.org/packages/archive/log
$ echo 'admin:wywGGkc7Qc/6I' > /tmp/htpasswd
$ echo 'admin' > /tmp/adminlist
$ hackage-server --import-index=/tmp/00-index.tar.gz \
    --import-log=/tmp/log --import-accounts=/tmp/htpasswd \
    --import-archive=/tmp/archive.tar \
    --import-admins=/tmp/adminlist
Be warned: archive.tar is 128MB at the moment! As for wywGGkc7Qc/6I, it is one of 4096 crypt-salted hashings of the password admin. On Wednesday I implemented digest authentication, which would instead hash admin:hackage:admin in MD5 and use a nonce challenge/response for reasonably secure authentication (the current scheme sends your password in near-plaintext with every request). I found a minor Chromium bug in the process, too!
Tersely put, the design goals are for hackage-server to become a more consistent, extensible, modular and (most importantly) runnable Hackage server. This means duplicating the existing functionality, a task mostly done by Antoine Latter and Duncan Coutts in the span of the last two years, and organizing the modules into a URI hierarchy that obeys REST and ROA principles. I’ve outlined all of the resources Hackage currently provides (partially listed on the trac wiki), and I’m working on a mapping to a new and improved set of URIs.
For the more commonly accessed Hackage URIs (those that have been linked from other websites or hardcoded in cabal), backwards-compatibility is a priority, and mostly already implemented as a series of 301 redirects. Such a legacy redirect system might be considered a “feature”, a plug-in functionality which can be enabled and disabled. Part of making the new Hackage modular and hackable is defining a consistent interface for features. Much like lambdabot‘s Module typeclass, each feature can be defined discretely, and the behavior of the web server becomes the msum of each feature’s ServerPart Response.
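In Happstack terms, that composition is just the following sketch (featureHandler is an assumed accessor on the feature type, not the repository’s actual name):

```haskell
import Control.Monad (msum)
import Happstack.Server (ServerPart, Response)

-- Sketch: the whole site is the alternation of its features'
-- handlers; the first feature whose routes match the request wins.
siteHandler :: [HackageFeature] -> ServerPart Response
siteHandler = msum . map featureHandler
```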
The above is the state of affairs on Day 1 (well, Day 4, but I’m still getting started with these new-fangled blags!). The title of my proposal is “Infrastructure for a more social Hackage 2.0”, not “A more social Hackage 2.0”. I expect that the exact array of social services that Hackage will provide will need a hefty bout of fine-tuning and analysis (see also some insightful thoughts on this), so my job is to provide the technical base to make the shiny new features easy to plug in and modify, as well as implementing as many as possible in a mad rush of coding in late July and early August.
If you have any kind of wish list for Hackage features, it is imperative that you let me know—eventually. Duncan and others have encouraged me to concentrate on setting up the infrastructure before building features, so at some point I’ll try to facilitate a community discussion about what you all want to see in our favorite package repository. If you need me, you can find me as Gracenotes on the #haskell and #hackage channels on irc.freenode.net. And best of luck to my fellow gsoc-ers, whose blogs I’ve linked in the sidebar.