journal-scrapers

Journal scraper definitions for the ContentMine framework.

Summary

This repo is a collection of scraperJSON definitions targeting academic journals. They can be used to extract and download data from the URLs of journal articles, including:

Title, author list, date
Figures and their captions
Fulltext PDF, HTML, XML, RDF
Supplementary materials
Reference lists

Scraper collection status

All the scrapers in the collection are tested automatically every day, and again whenever any scraper is changed. The expected results for a set of URLs are stored for each scraper; a test picks one of those URLs at random and re-scrapes it. If the results match the stored ones, the test passes. If the badge is green and says build|passing, all the scrapers are OK. If the badge is red and says build|failing, one or more of the scrapers has stopped working. You can click on the badge to open the test report and see which scrapers are failing and how.
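
The test procedure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the test code used in this repo: the function name run_regression_test and the scrape callable are hypothetical, and the scrape function is assumed to return a dict of element names to values.

```python
import random


def run_regression_test(expected_by_url, scrape, rng=random):
    """Re-scrape one randomly chosen URL and compare with the stored result.

    expected_by_url: dict mapping a test URL to its expected result dict.
    scrape: hypothetical callable that takes a URL and returns a result dict.
    rng: object with a choice() method (the random module or random.Random).
    Returns the chosen URL and a dict of mismatching elements (empty = pass).
    """
    url = rng.choice(sorted(expected_by_url))
    expected = expected_by_url[url]
    actual = scrape(url)
    # Record every element whose scraped value differs from the stored one.
    mismatches = {
        key: (expected.get(key), actual.get(key))
        for key in set(expected) | set(actual)
        if expected.get(key) != actual.get(key)
    }
    return url, mismatches
```

The test passes when the returned mismatches dict is empty. Returning the differing elements, rather than a bare pass/fail, mirrors the idea that the test report shows which scrapers are failing and how.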

Test coverage of the scrapers is also checked. Coverage should be 100%, meaning that every element of every scraper is checked at least once during testing. If coverage is below 100%, you can see exactly which parts of which scrapers are not covered by clicking the coverage badge.

ScraperJSON definitions

Scrapers are defined in JSON, using a schema called scraperJSON, which is still evolving. The current schema is described at the scraperJSON repo.
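
To give a feel for what a definition looks like, here is a minimal sketch that loads one and checks whether it applies to a URL. The field names (url, elements, selector, attribute, download) reflect our understanding of scraperJSON at the time of writing and should be checked against the scraperJSON repo, since the schema is evolving. The example.org pattern, the selectors, and the helper names load_definition and applies_to are hypothetical.

```python
import json
import re

# A small scraperJSON-style definition (illustrative, not a real scraper).
DEFINITION_TEXT = r"""
{
  "url": "example\\.org/articles/",
  "elements": {
    "title": {
      "selector": "//meta[@name='citation_title']",
      "attribute": "content"
    },
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "download": true
    }
  }
}
"""


def load_definition(text):
    """Parse a definition and check that the two top-level keys are present."""
    definition = json.loads(text)
    for key in ("url", "elements"):
        if key not in definition:
            raise ValueError("definition is missing the '%s' key" % key)
    return definition


def applies_to(definition, url):
    """Return True if the definition's URL pattern matches the given URL."""
    return re.search(definition["url"], url) is not None
```

For example, applies_to(load_definition(DEFINITION_TEXT), "http://example.org/articles/123") evaluates to True, while a URL on another domain does not match.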

Contributing scrapers

If your favourite publisher or journal is not covered by a scraper in our collection, we'd love you to submit a new scraper.

We ask that all contributions follow some simple rules that help us maintain a high-quality collection.

The scraper covers all the data elements used in the ContentMine.
The scraper is submitted with a set of 5-10 test URLs.
The scraper comes with a regression test (which can be auto-generated).
You agree to release the scraper definition and tests under the Creative Commons Zero license.

Usage

Currently these definitions can be used with the quickscrape tool.
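
As a rough sketch, a quickscrape run against a single article could be assembled like this. The --url, --scraper and --output flags are assumed from our reading of the quickscrape documentation and should be confirmed with quickscrape --help; the helper name quickscrape_command, the article URL, and the file paths are hypothetical.

```python
import shlex


def quickscrape_command(url, scraper_path, output_dir):
    """Build the argument list for one quickscrape run (flags are assumed)."""
    return [
        "quickscrape",
        "--url", url,               # article page to scrape
        "--scraper", scraper_path,  # scraperJSON definition from this repo
        "--output", output_dir,     # directory where results are written
    ]


command = quickscrape_command(
    "http://example.org/articles/123",  # hypothetical article URL
    "scrapers/example.json",            # hypothetical definition file
    "output",
)
# shlex.join gives a copy-pasteable shell line (Python 3.8+).
command_line = shlex.join(command)
```

The list form can be passed directly to subprocess.run(command, check=True) once quickscrape is installed; command_line is the same invocation as a single shell string.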

License

All scrapers are released under the Creative Commons Zero (CC0) license.
