Corpora


Overview

Synchronic
- German
- English
- French
- Spanish
- Italian
- Catalan
- Swedish
- Czech
- Finnish
- Russian
- Turkish
- Multilingual
Diachronic
- German
- English
- Portuguese
- French
- Italian
- Dutch
- Spanish
Further Resources

Synchronic

German

DWDS Core Corpus
http://www.dwds.de/resource/kerncorpus/
Corpus of the Berlin-Brandenburgische Akademie der Wissenschaften, on which the Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS) is based.
Deutscher Wortschatz Project
http://wortschatz.uni-leipzig.de/
Deutscher Wortschatz Online. Contains 35 million sentences with 500 million words.
Hamburg Dependency Treebank
http://hdl.handle.net/11022/0000-0000-7FC7-2
The Hamburg Dependency Treebank is, to our knowledge, the largest dependency treebank available (as of its publication date). It consists of genuine dependency annotations, i.e. they have not been converted from phrase structures. All sentences were sourced from the German news site heise.de, from articles published between 1996 and 2001. The mapping from sentences to articles and authors is retained, allowing, e.g., analysis of individual style. The manual annotation of the treebank was largely interleaved with the development of a standard for morphological and syntactic annotation and of a constraint-based parser.
IDS-Corpora
http://www.ids-mannheim.de/kt/corpora.html
Corpora of the Institut für Deutsche Sprache. The world's largest collection of German-language text corpora for empirical linguistic research. Can be searched online with COSMAS.
LIMAS-Korpus
http://www.korpora.org/Limas/
Representative corpus of contemporary written German of the 1970s: 500 texts or text fragments from various genres, with a total of 1 million word forms. Fully searchable online.
Korpus Südtirol
http://www.korpus-suedtirol.it/index_EN
An initiative for the collection, archiving and corpus-linguistic processing of South Tyrolean German texts.

English

British National Corpus (BNC)
http://www.natcorp.ox.ac.uk
The British National Corpus contains 100 million words of written and spoken language from various fields and aims to represent contemporary British English. Also available on CD.
American National Corpus (ANC)
http://americannationalcorpus.org/
The ANC aims to be the American equivalent of the BNC.
Loyola Computer-Mediated Communication Corpus
http://cmccorpus.cs.loyola.edu/
900 text samples of computer-mediated communication from Loyola College in Baltimore, Maryland (USA)
Michigan Corpus of Academic Spoken English: MiCASE
http://quod.lib.umich.edu/m/micase/
Freely available, with online search function and flat annotation. Comprises 152 transcripts (1,848,364 words).
International Corpus of English (ICE)
http://ice-corpora.net/ice/
Corpora of regional varieties of English. Each corpus consists of one million words of spoken and written English produced after 1989. All share a common corpus design and grammatical annotation scheme. Many of the corpora are free for non-commercial academic research.

French

Corpus de Référence du Français parlé
http://sites.univ-provence.fr/delic/corpus/index.html
440,000 words, 134 recordings, over 36 hours of spoken language
Un corpus d’entretiens spontanés
http://www.llas.ac.uk/resources/mb/80
95 conversations/speakers

Spanish

Arthus
http://www.bds.usc.es/corpus.html
Various text types. Contemporary. All scanned.

Italian

CORpus di Italiano Scritto (CORIS)
http://corpora.dslo.unibo.it/coris_eng.html
100 million words.
Banca dati dell'italiano parlato (BADIP)
http://languageserver.uni-graz.at/badip/badip/home.php
Various corpora of spoken Italian
Corpus OVI dell'Italiano antico (corpus TLIO)
http://www.vocabolario.org/
21,817,929 words in 1978 texts

Catalan

Corpus del català contemporani
http://www.ub.edu/cccub/
Corpus of contemporary colloquial Catalan.

Swedish

The Bank of Swedish
http://spraakbanken.gu.se/
A linguistic reference databank at the University of Gothenburg.

Czech

Cesky Národní Korpus (CNK)
http://ucnk.ff.cuni.cz
Czech national corpus. Queries can be made online or via the "Bonito" GUI.

Finnish

The Advanced Finnish Learners’ Corpus
http://www.hum.utu.fi/oppiaineet/suomi/en/research/Siitonen_Ivaska.html
Longitudinal essay corpus with texts written by students learning Finnish in MA courses.

Russian

Narusco
http://narusco.ru/
National Corpus of Written Russian

Turkish

Turkish National Corpus
http://www.tnc.org.tr/

Multilingual Corpora

OPUS - Open Source Parallel Corpus
http://opus.lingfil.uu.se/
OPUS comprises 30 million words in 60 languages. The corpus also includes OpenOffice documentation (OO), PHP manuals (PHP), and KDE manuals (KDEdoc) with KDE system news.
Multext Project
http://www.lpl.univ-aix.fr/projects/multext/
Multilingual Text Tools and Corpora
Multext-East
http://nl.ijs.si/ME/
MULTEXT-East is a corpus covering six languages: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovenian. English is the "hub language" of the project.
Bohemica.com
http://www.bohemica.com/index.php
Annotated Czech-English translation corpus containing 100,000 words (24 written documents of 1,000-4,000 words each). The corpus contains both fiction and non-fiction and is available for download.
RuN-Euro Corpus
http://www.nevmenandr.net/run/index.php#
Parallel corpus, originally of Norwegian and Russian texts, that also covers other European languages. The texts are aligned at the sentence level and tagged for grammatical information at the word level.

Diachronic

German

Bibliotheca Augustana
http://www.fh-augsburg.de/~harsch/augustana.html
Collection of literature and art (litteraturae et artis collectio).
Kali Korpus
http://www.kali.uni-hannover.de
The German Kali corpus (Kali: Korpusarbeit Linguistik, "corpus work in linguistics") is a partially annotated diachronic corpus designed for research and teaching. The project started at the end of 2003 for the German course at the University of Hannover under the supervision of Prof. Gabriele Diewald.
Text corpus of Thomas Gloning
http://www.uni-giessen.de/gloning/etexte.htm
Freely available.
Middle High German Corpus (Bochum)
http://www.ruhr-uni-bochum.de/wegera/archiv_1.htm
Middle High German Terms and Notions Data Bank (Mittelhochdeutsche Begriffsdatenbank, MHDBDB)
http://mhdbdb.sbg.ac.at
Contains 4.7 million words.
CEEC (Codices Electronici Ecclesiae Coloniensis)
http://www.ceec.uni-koeln.de
Digitised codices of the Archiepiscopal Diocesan and Cathedral Library in Cologne (DDB).
TITUS
http://titus.uni-frankfurt.de/indexd.htm
Thesaurus of Indo-European text and language materials.
mediaevum
http://www.mediaevum.de
Links to historical texts.

English

Penn-Helsinki Parsed Corpus of Middle English
http://www.ling.upenn.edu/mideng
Syntactically annotated corpus of prose samples. Structures can be queried. Available on CD-ROM.
Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
http://www-users.york.ac.uk/~sp20/corpus.html
Syntactically annotated corpus of prose samples. Structures can be queried. Available on CD-ROM.
Lampeter Corpus of Early Modern English
http://khnt.hit.uib.no/icame/manuals/LAMPETER/LAMPHOME.HTM
Collection of texts from various fields, published between 1640 and 1740.
Corpus of Early English Correspondence (CEEC)
http://www.helsinki.fi/varieng/domains/CEEC.html
2.7 million words. Texts published between 1417 and 1681.
The English language of the north-west in the late Modern English period: A Corpus of late 18c Prose
http://www.llc.manchester.ac.uk/subjects/lel/staff/david-denison/corpus-late-18th-century-prose/
Approx. 300,000 words. Letters written between 1761 and 1789.
Corpus of Early Modern Playtexts in English: KEMPE
http://corp.hum.sdu.dk
Can be queried online; freely available. Part-of-speech (POS) and syntactically annotated corpus of 8.9 million words.

Portuguese

O Corpus do Portugues
http://www.corpusdoportugues.org/
Corpus of 45 million words from 50,000 texts published between the 14th and 20th centuries. Lemmas and POS are annotated. A powerful web interface allows searching by text, register, dialect, and time period. Statistical calculations based on the search results are also possible.
Tycho Brahe Parsed Corpus of Historical Portuguese
http://www.tycho.iel.unicamp.br/~tycho/corpus/index.html
Syntactically annotated. Downloadable.

French

Frantext
http://zeus.inalf.fr/frantext.htm
http://setis.library.usyd.edu.au/frantext (description)

Italian

Corpus OVI dell'Italiano antico (corpus TLIO)
http://www.vocabolario.org/
21,817,929 words in 1978 texts

Dutch

Taalbank
http://gtb.inl.nl/

Spanish

Corpus del espanol (RAE)
http://www.corpusdelespanol.org/
Date range: 1200-2000.

Further Resources

Technical Report "Eine vergleichende Analyse von historischen und diachronen digitalen Korpora" (A comparative analysis of historical and diachronic digital corpora)
http://www.deutschdiachrondigital.de/publikationen/TRHistorischeKorpora.pdf
Authors: Emil Kroymann, Sebastian Thiebes, Anke Lüdeling, Ulf Leser
Internet Grammar
http://www.tu-chemnitz.de/phil/english/InternetGrammar/shared/
German-English translation corpus. Texts from the last 15 years covering politics and tourism, as well as academic texts. 1 million words per language.
A Glossarial DataBase of Middle English
http://www.hti.umich.edu/english/gloss
Johnson's Dictionary
http://www.hti.umich.edu/english/johnson
Access available via password.
Dictionnaire du Moyen francais
http://atilf.atilf.fr/dmf.htm
Middle English
http://ets.umdl.umich.edu/m/mec/
Electronic version of the Middle English Dictionary.
The Perseus Digital Library
http://www.perseus.tufts.edu/
Celt Corpus of Electronic Texts
http://www.ucc.ie/celt/
Online resource for Irish history, literature and politics.
Medieval and Early Modern Data Bank (MEMDB)
http://www.scc.rutgers.edu/memdb/
The Thesaurus Linguae Graecae (TLG)
http://www.tlg.uci.edu/
The Early Modern English Dictionaries Database (EMEDD)
http://www.chass.utoronto.ca/~ian/emedd.html
The Patrologia Latina Database (PLD)
http://etext.virginia.edu/pld.html
Comprises the most influential works of Roman and Medieval theology, philosophy, history, and literature. Commercial.
A Dictionary of the Welsh Language
http://www.aber.ac.uk/~gpcwww/
Thesaurus Linguae Aethiopicae
http://www.uni-mainz.de/Organisationen/TLA/index.html
Latin and Greek texts
http://www.ulg.ac.be/cipl/bdlasla/
Wörterbuchnetz
http://www.woerterbuchnetz.de/
Network of dictionaries
Electronic Text Corpus of Sumerian Literature (ETCSL)
http://etcsl.orinst.ox.ac.uk/
Transcriptions of clay tablets containing over 350 literary works in Sumerian from Mesopotamia (present-day Iraq), dating from the late 3rd and early 2nd millennia BCE.

Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology
