Danish Gigaword
A billion-word corpus of Danish text, freely distributed with attribution.
Introduction
It’s hard to develop good tools for processing Danish with computers when no large, wide-coverage dataset of Danish text is readily available. To address this, the Danish Gigaword Project (DAGW) maintains a corpus of Danish with over a billion words. DAGW is a project of the IT University of Copenhagen (ITU), with contributions from over a dozen other universities and businesses in Denmark; ITU has published an official press release about it. This is the homepage for the project. The general goals are to create a dataset that is:
- representative;
- accessible;
- a suitable common starting point for Danish NLP models.
The corpus is managed and communicated in English so that the world beyond Denmark can also use the resource.
Download
Danish Gigaword is available via Hugging Face:
huggingface.co/datasets/danish-foundation-models/danish-gigaword
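As a rough illustration of working with the download, the sketch below streams records from the Hugging Face Hub and counts words in a small sample. It is a minimal sketch under stated assumptions, not an official loader: it assumes the third-party `datasets` package is installed, that the dataset exposes a split named "train", and that each record carries its text in a field named "text". The helper names (`stream_danish_gigaword`, `count_words`, `sample_word_count`) are hypothetical and exist only for this example; check the dataset card for the actual splits and fields.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator

# Dataset id as listed on this page.
DATASET_ID = "danish-foundation-models/danish-gigaword"


def stream_danish_gigaword(split: str = "train") -> Iterator[Dict]:
    """Yield records one at a time instead of downloading the full corpus.

    Assumptions: the `datasets` package is installed and the dataset
    has a split called "train".
    """
    from datasets import load_dataset  # third-party, imported lazily

    yield from load_dataset(DATASET_ID, split=split, streaming=True)


def count_words(records: Iterable[Dict], text_field: str = "text") -> int:
    """Count whitespace-separated tokens in one field of each record.

    The field name "text" is an assumption; records without the field
    contribute zero words.
    """
    return sum(len(str(r.get(text_field, "")).split()) for r in records)


def sample_word_count(n_records: int = 1000) -> int:
    """Count words in the first `n_records` streamed records (needs network)."""
    return count_words(islice(stream_danish_gigaword(), n_records))
```

Calling `sample_word_count(1000)` would fetch only the first 1,000 records, which is usually enough to inspect the data before committing to a full download. Whitespace splitting is a crude word count and will not match the tokenization used for the corpus statistics in the paper.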
Documentation
Read the paper about The Danish Gigaword Corpus.
License & Reference
If you use the data, you MUST acknowledge it. The data is licensed under CC BY 4.0 (Creative Commons Attribution 4.0 International).
Sample attributions:
In a press release:
Modellen er præ-trænet på et datasæt fra The Danish Gigaword Project (https://gigaword.dk), der er udviklet af forskere fra IT-Universitetet i København
The model is pre-trained using the Danish Gigaword Corpus (https://gigaword.dk), developed at the IT University of Copenhagen
In academic writing:
Derczynski, L., Ciosici, M. R., et al. (2021). The Danish Gigaword Corpus. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021).
@inproceedings{dagw,
  title = {{The Danish Gigaword Corpus}},
  author = {Leon Derczynski and Manuel R. Ciosici and Rebekah Baglini and Morten H. Christiansen and Jacob Aarup Dalsgaard and Riccardo Fusaroli and Peter Juel Henrichsen and Rasmus Hvingelby and Andreas Kirkedal and Alex Speed Kjeldsen and Claus Ladefoged and Finn Årup Nielsen and Jens Madsen and Malte Lau Petersen and Jonathan Hvithamar Rystrøm and Daniel Varab},
  year = 2021,
  booktitle = {Proceedings of the 23rd Nordic Conference on Computational Linguistics},
  publisher = {NEALT}
}
In a software product, tool, or service:
Denne service er lavet med data fra The Danish Gigaword Corpus
This service was made with data from The Danish Gigaword Corpus
That’s all we ask in return for our work; no money, no signed agreement, no royalties - just acknowledgment. We hope you think that’s fair.
If you cannot acknowledge the project like this, you are not licensed to use the data.
Models using Danish Gigaword
- Ælæctra - A Step Towards More Efficient Danish Natural Language Processing. huggingface.co/Maltehb/aelaectra-danish-electra-small-cased
We’re interested in how DAGW is used; please contact us if you train a model over it.
Tools using Danish Gigaword
- A&ttack and Ha&te by Analyse & Tal
- Implementation in Sketch Engine
We’re interested in how DAGW is used; please contact us if you build a tool from it.
Press Coverage
- Heste-nettet kan blive grundlag for kunstig intelligens på dansk - Danmarks Radio
- Hestenet, tørstige prompts og chatbot, der kan høre og se - Prompt
- Danish AI Trained on Data From a Web Forum About Horses - Bloomberg
- ChatGPT blev trænet af danske hestetøser
- I Danmark har vi vores egne grundmodeller til dansk sprog. Det udvikles dog udelukkende af ihærdige frivillige, som gør et fantastisk arbejde. - Børsen
- Featured in the Foreign Ministry’s “Invest in Denmark”
- A Danish billion-word corpus appears - Import AI
- Danish Gigaword Project - et historisk stort dansk tekstkorpus - Sprogteknologi.dk / Digitaliseringsstyrelsen
- ITU led project will make automated translation more reliable - ITU
- Superalgoritme kortlægger det danske had og afslører yndlingsofrene på Facebook - Politiken
- Sprogmodellen Ælæctra vil forbedre dansk sprogteknologi på en klimavenlig måde - KMD
- This Powerful AI Technique Led to Clashes at Google and Fierce Debate in Tech. Here’s Why. - Morning Brew
Contact
The project is managed by Leon Derczynski (ld@itu.dk, PI) and Manuel R. Ciosici (manuelc@isi.edu, Co-I).
Credits
Background image of Henne Kirkeby by Sven Huls