scispacy | SpaCy models for biomedical text processing
scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text.
Interactive Demo
Just looking to test out the models on your data? Check out our demo.
Installing
pip install scispacy
pip install <Model URL>
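After installing, it can help to confirm that both the package and a model are importable before loading anything. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical (it is not part of scispaCy), and the check assumes that each model installs as a Python package named after the model.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


# Assumption: the model installs as a top-level package named after the model.
required = ["scispacy", "en_core_sci_sm"]
missing = missing_packages(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("scispacy and en_core_sci_sm are installed.")
```

If a name is reported as missing, rerun the corresponding `pip install` command above in the same environment.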
Models
| Model | Description | Install URL |
|---|---|---|
| en_core_sci_sm | A full spaCy pipeline for biomedical data. | Download |
| en_core_sci_md | A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors. | Download |
| en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model. | Download |
| en_core_sci_lg | A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors. | Download |
| en_ner_craft_md | A spaCy NER model trained on the CRAFT corpus. | Download |
| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus. | Download |
| en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | Download |
| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. | Download |
Performance
Our models perform within 3% of published state-of-the-art dependency parsers and within 0.4% accuracy of state-of-the-art biomedical POS taggers.
| model | UAS | LAS | POS | Mentions (F1) | Web UAS |
|---|---|---|---|---|---|
| en_core_sci_sm | 89.18 | 87.15 | 98.18 | 67.89 | 87.36 |
| en_core_sci_md | 90.08 | 88.16 | 98.46 | 68.86 | 88.04 |
| en_core_sci_lg | 89.97 | 88.18 | 98.51 | 68.98 | 87.89 |
| en_core_sci_scibert | 92.12 | 90.58 | 98.18 | 67.70 | 92.58 |
| model | F1 | Entity Types |
|---|---|---|
| en_ner_craft_md | 78.01 | GGP, SO, TAXON, CHEBI, GO, CL |
| en_ner_jnlpba_md | 72.06 | DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN |
| en_ner_bc5cdr_md | 84.28 | DISEASE, CHEMICAL |
| en_ner_bionlp13cg_md | 77.84 | AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE |
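Since each NER model covers a different label set, one way to pick a model is to look up which ones emit the entity type you need. The sketch below copies the label sets from the table above into a plain dictionary. The names `NER_MODEL_LABELS` and `models_for_label` are hypothetical helpers for illustration, not part of scispaCy.

```python
# Entity types per NER model, transcribed from the table above.
NER_MODEL_LABELS = {
    "en_ner_craft_md": {"GGP", "SO", "TAXON", "CHEBI", "GO", "CL"},
    "en_ner_jnlpba_md": {"DNA", "CELL_TYPE", "CELL_LINE", "RNA", "PROTEIN"},
    "en_ner_bc5cdr_md": {"DISEASE", "CHEMICAL"},
    "en_ner_bionlp13cg_md": {
        "AMINO_ACID", "ANATOMICAL_SYSTEM", "CANCER", "CELL",
        "CELLULAR_COMPONENT", "DEVELOPING_ANATOMICAL_STRUCTURE",
        "GENE_OR_GENE_PRODUCT", "IMMATERIAL_ANATOMICAL_ENTITY",
        "MULTI-TISSUE_STRUCTURE", "ORGAN", "ORGANISM",
        "ORGANISM_SUBDIVISION", "ORGANISM_SUBSTANCE",
        "PATHOLOGICAL_FORMATION", "SIMPLE_CHEMICAL", "TISSUE",
    },
}


def models_for_label(label):
    """Return the sorted names of the NER models whose label set contains `label`."""
    wanted = label.upper()  # labels in the table are upper case
    return sorted(
        name for name, labels in NER_MODEL_LABELS.items() if wanted in labels
    )


print(models_for_label("DISEASE"))
print(models_for_label("protein"))
```

The lookup matches exact label names only, so a query such as `CELL` does not match `CELL_TYPE` or `CELL_LINE`.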
Example Usage
import scispacy
import spacy
nlp = spacy.load("en_core_sci_sm")
text = """
Myeloid derived suppressor cells (MDSC) are immature
myeloid cells with immunosuppressive activity.
They accumulate in tumor-bearing mice and humans
with different types of cancer, including hepatocellular
carcinoma (HCC).
"""
doc = nlp(text)
print(list(doc.sents))
# Prints the two sentences (whitespace normalized here for readability):
#   Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity.
#   They accumulate in tumor-bearing mice and humans with different types of cancer, including hepatocellular carcinoma (HCC).
# Examine the entities extracted by the mention detector.
# Note that they don't have types as in spaCy, and they
# are more general (e.g. including verbs): these are any
# spans which might be an entity in UMLS, a large
# biomedical database.
print(doc.ents)
# Output:
# (Myeloid derived suppressor cells,
#  MDSC,
#  immature,
#  myeloid cells,
#  immunosuppressive activity,
#  accumulate,
#  tumor-bearing mice,
#  humans,
#  cancer,
#  hepatocellular carcinoma,
#  HCC)
# We can also visualise dependency parses
# (This renders automatically inside a jupyter notebook!):
from spacy import displacy
displacy.render(next(doc.sents), style='dep', jupyter=True)
# See below for the generated SVG.
# Zoom your browser in a bit!
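The mention detector in the example above yields untyped spans. The specialised NER models listed in the Performance section do assign labels, which spaCy exposes as `ent.label_`. The following is a minimal sketch of reading them, assuming `en_ner_bc5cdr_md` has been installed from its model URL. The helper names `typed_entities` and `run_bc5cdr` and the sample sentence are illustrative, not part of scispaCy.

```python
def typed_entities(doc):
    """Return (text, label) pairs for every entity span in a spaCy Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def run_bc5cdr(text):
    """Load en_ner_bc5cdr_md (must be installed separately) and tag `text`."""
    # Imported lazily so that typed_entities itself has no dependencies.
    import scispacy  # noqa: F401  (imported as in the example above)
    import spacy

    nlp = spacy.load("en_ner_bc5cdr_md")
    return typed_entities(nlp(text))


# Example call (requires spaCy, scispacy and the model to be installed):
# for text, label in run_bc5cdr("Cisplatin can cause kidney injury."):
#     print(label, text)
```

According to the entity-type table, this model should only produce `DISEASE` and `CHEMICAL` labels. Swap in another `en_ner_*` model name to get a different label set.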
Data Sources
scispaCy models are trained on data from a variety of sources. In particular, we use:
- The GENIA 1.0 Treebank, converted to basic Universal Dependencies using the Stanford Dependency Converter. We have made this dataset available along with the original raw data.
- word2vec word vectors trained on the PubMed Central Open Access Subset.
- The MedMentions Entity Linking dataset, used for training a mention detector.
- OntoNotes 5.0, used to make the parser and tagger more robust to non-biomedical text. Unfortunately, this dataset is not publicly available.