scispacy | spaCy models for biomedical text processing
scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text.
Interactive Demo
Just looking to test out the models on your data? Check out our demo.
Installing
pip install scispacy
pip install <Model URL>
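Once installed with pip, a model is an ordinary Python package named after the model (for example `en_core_sci_sm`). A minimal sketch for checking that a model is present before calling `spacy.load`, using only the standard library. The helper name `is_model_installed` is hypothetical and is not part of scispaCy or spaCy.

```python
import importlib.util


def is_model_installed(model_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    # pip-installed spaCy models are regular Python packages, so
    # find_spec can locate them without loading the model itself.
    return importlib.util.find_spec(model_name) is not None


if __name__ == "__main__":
    for name in ("en_core_sci_sm", "en_ner_bc5cdr_md"):
        status = "installed" if is_model_installed(name) else "missing"
        print(f"{name}: {status}")
```

This only confirms that the package is importable; it does not check that the model version is compatible with the installed spaCy version.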
Models
| Model | Description | Install URL |
|---|---|---|
| en_core_sci_sm | A full spaCy pipeline for biomedical data. | Download |
| en_core_sci_md | A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors. | Download |
| en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model. | Download |
| en_core_sci_lg | A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors. | Download |
| en_ner_craft_md | A spaCy NER model trained on the CRAFT corpus. | Download |
| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus. | Download |
| en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | Download |
| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. | Download |
Performance
Our models achieve performance within 3% of published state-of-the-art dependency parsers and within 0.4% accuracy of state-of-the-art biomedical POS taggers.
| model | UAS | LAS | POS | Mentions (F1) | Web UAS |
|---|---|---|---|---|---|
| en_core_sci_sm | 89.18 | 87.15 | 98.18 | 67.89 | 87.36 |
| en_core_sci_md | 90.08 | 88.16 | 98.46 | 68.86 | 88.04 |
| en_core_sci_lg | 89.97 | 88.18 | 98.51 | 68.98 | 87.89 |
| en_core_sci_scibert | 92.12 | 90.58 | 98.18 | 67.70 | 92.58 |
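UAS and LAS in the table above are the standard unlabeled and labeled attachment scores for dependency parsing: the percentage of tokens whose predicted head is correct, and the percentage whose head and dependency label are both correct. A minimal sketch of that computation follows. The function name and the toy data are illustrative only; this is not the evaluation script behind the reported numbers, and details such as punctuation handling may differ.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned token lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS counts tokens with the correct head, ignoring the label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens where both head and label match.
    full_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n


# Toy example: 4 tokens, one wrong label and one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.1f} LAS={las:.1f}")  # UAS=75.0 LAS=50.0
```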
| model | F1 | Entity Types |
|---|---|---|
| en_ner_craft_md | 78.01 | GGP, SO, TAXON, CHEBI, GO, CL |
| en_ner_jnlpba_md | 72.06 | DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN |
| en_ner_bc5cdr_md | 84.28 | DISEASE, CHEMICAL |
| en_ner_bionlp13cg_md | 77.84 | AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE |
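Because each NER model covers a different label set, picking a model usually starts from the entity types needed. The sketch below encodes the table above as a dictionary and looks up which models cover a requested set of labels. `NER_MODEL_LABELS` and `models_for_labels` are hypothetical names for illustration, not part of scispaCy.

```python
from typing import Dict, List, Set

# Entity types per NER model, copied from the table above.
NER_MODEL_LABELS: Dict[str, Set[str]] = {
    "en_ner_craft_md": {"GGP", "SO", "TAXON", "CHEBI", "GO", "CL"},
    "en_ner_jnlpba_md": {"DNA", "CELL_TYPE", "CELL_LINE", "RNA", "PROTEIN"},
    "en_ner_bc5cdr_md": {"DISEASE", "CHEMICAL"},
    "en_ner_bionlp13cg_md": {
        "AMINO_ACID", "ANATOMICAL_SYSTEM", "CANCER", "CELL",
        "CELLULAR_COMPONENT", "DEVELOPING_ANATOMICAL_STRUCTURE",
        "GENE_OR_GENE_PRODUCT", "IMMATERIAL_ANATOMICAL_ENTITY",
        "MULTI-TISSUE_STRUCTURE", "ORGAN", "ORGANISM",
        "ORGANISM_SUBDIVISION", "ORGANISM_SUBSTANCE",
        "PATHOLOGICAL_FORMATION", "SIMPLE_CHEMICAL", "TISSUE",
    },
}


def models_for_labels(wanted: Set[str]) -> List[str]:
    """Return the models whose label set contains every wanted label."""
    return sorted(
        name for name, labels in NER_MODEL_LABELS.items() if wanted <= labels
    )


print(models_for_labels({"DISEASE", "CHEMICAL"}))  # ['en_ner_bc5cdr_md']
print(models_for_labels({"CANCER", "ORGAN"}))      # ['en_ner_bionlp13cg_md']
```

If no single model covers all requested labels, the function returns an empty list, in which case several models would need to be run over the same text.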
Example Usage
import scispacy
import spacy
nlp = spacy.load("en_core_sci_sm")
text = """
Myeloid derived suppressor cells (MDSC) are immature
myeloid cells with immunosuppressive activity.
They accumulate in tumor-bearing mice and humans
with different types of cancer, including hepatocellular
carcinoma (HCC).
"""
doc = nlp(text)
print(list(doc.sents))
# Output:
# ["Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity.",
#  "They accumulate in tumor-bearing mice and humans with different types of cancer, including hepatocellular carcinoma (HCC)."]
# Examine the entities extracted by the mention detector.
# Note that they don't have types like in spaCy, and they
# are more general (e.g. including verbs) - these are any
# spans which might be an entity in UMLS, a large
# biomedical database.
print(doc.ents)
# Output:
# (Myeloid derived suppressor cells,
#  MDSC,
#  immature,
#  myeloid cells,
#  immunosuppressive activity,
#  accumulate,
#  tumor-bearing mice,
#  humans,
#  cancer,
#  hepatocellular carcinoma,
#  HCC)
# We can also visualise dependency parses
# (This renders automatically inside a jupyter notebook!):
from spacy import displacy
displacy.render(next(doc.sents), style='dep', jupyter=True)
# The parse is rendered as an SVG.
# Zoom your browser in a bit to read it!
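The mention detector in the `en_core_sci_*` pipelines yields untyped spans, whereas the NER models listed in the Models table assign the entity types shown in the Performance section. A hedged sketch of reading those types through spaCy's `ent.text` and `ent.label_` attributes follows. The helper names `typed_entities` and `demo` are hypothetical, and `demo` assumes `en_ner_bc5cdr_md` has already been installed. No expected output is shown because the spans a model returns depend on the model version.

```python
from typing import List, Tuple


def typed_entities(doc) -> List[Tuple[str, str]]:
    """Collect (text, label) pairs from a spaCy Doc's entities."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def demo() -> None:
    # Requires scispacy and the en_ner_bc5cdr_md model to be installed.
    import spacy

    nlp = spacy.load("en_ner_bc5cdr_md")
    doc = nlp(
        "Myeloid derived suppressor cells accumulate in patients "
        "with hepatocellular carcinoma."
    )
    for text, label in typed_entities(doc):
        print(f"{label}\t{text}")
```

Calling `demo()` prints one line per detected entity, with the label drawn from that model's label set (DISEASE or CHEMICAL for `en_ner_bc5cdr_md`).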
Data Sources
scispaCy models are trained on data from a variety of sources. In particular, we use:
- The GENIA 1.0 Treebank, converted to basic Universal Dependencies using the Stanford Dependency Converter. We have made this dataset available along with the original raw data.
- word2vec word vectors trained on the PubMed Central Open Access Subset.
- The MedMentions Entity Linking dataset, used for training a mention detector.
- OntoNotes 5.0, used to make the parser and tagger more robust to non-biomedical text. Unfortunately, this dataset is not publicly available.