Text mining of 15 million full-text scientific articles | bioRxiv
New Results
Text mining of 15 million full-text scientific articles
doi: https://doi.org/10.1101/162099

Abstract
Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics over this period. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
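The extract-then-benchmark workflow summarized above can be illustrated with a minimal sketch. The abstract does not specify the named entity recognition system or the scoring scheme, so the sketch assumes a simple dictionary lookup and sentence-level co-occurrence as stand-ins; the dictionary, documents, and gold standard pairs are hypothetical toy data, not taken from the study.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy dictionary mapping lowercase surface forms to entity IDs.
# This is an assumption for illustration, not the study's NER resource.
DICTIONARY = {
    "tp53": "TP53",
    "p53": "TP53",
    "mdm2": "MDM2",
    "brca1": "BRCA1",
}


def tag_entities(sentence):
    """Return the set of dictionary entities mentioned in one sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    return {DICTIONARY[t] for t in tokens if t in DICTIONARY}


def cooccurrence_counts(documents):
    """Count how often each entity pair shares a sentence."""
    counts = Counter()
    for doc in documents:
        # Naive sentence split on whitespace after ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            entities = sorted(tag_entities(sentence))
            for pair in combinations(entities, 2):
                counts[pair] += 1
    return counts


def precision_recall(predicted, gold):
    """Score a set of predicted pairs against a gold standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy corpus and gold standard (pairs sorted alphabetically).
DOCS = [
    "MDM2 binds p53 and promotes its degradation. BRCA1 was also measured.",
    "TP53 and MDM2 form a feedback loop.",
    "BRCA1 and MDM2 were both expressed.",
]
GOLD = {("MDM2", "TP53"), ("BRCA1", "TP53")}

counts = cooccurrence_counts(DOCS)
predicted = {pair for pair, n in counts.items() if n >= 1}
precision, recall = precision_recall(predicted, GOLD)
print(counts)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running the same functions on a full-text corpus and on an abstracts-only corpus, then comparing the two scores against one gold standard, mirrors the kind of comparison the abstract reports, though a real evaluation would need a far more careful tagger and scoring scheme.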
Copyright
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
It is made available under a CC-BY 4.0 International license.