Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology

Corpora

Overview

Synchronic

German

  • DWDS Core Corpus
    http://www.dwds.de/resource/kerncorpus/

    Corpus of the Berlin-Brandenburgische Akademie der Wissenschaften, on which the Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS) is based.

  • Deutscher Wortschatz Project
    http://wortschatz.uni-leipzig.de/

    Deutscher Wortschatz Online. Contains 35 million sentences with 500 million words.

  • Hamburg Dependency Treebank
    http://hdl.handle.net/11022/0000-0000-7FC7-2

    The Hamburg Dependency Treebank is, to our knowledge, the largest dependency treebank available at the time of its publication. It consists of genuine dependency annotations, i.e., annotations that have not been converted from phrase structures. All sentences were sourced from the German news site heise.de, from articles published between 1996 and 2001. The mapping from sentences to articles and authors is retained, allowing, e.g., analysis of individual style. The manual annotation of the treebank was largely interleaved with the development of a standard for morphological and syntactic annotation of sentences and of a constraint-based parser.

  • IDS-Corpora
    http://www.ids-mannheim.de/kt/corpora.html

    Corpora of the Institut für Deutsche Sprache. The world's largest collection of German-language text corpora for empirical linguistic research. Online search is possible with COSMAS.

  • LIMAS-Korpus
    http://www.korpora.org/Limas/

    Representative corpus of contemporary written German of the 1970s: 500 texts or fragments from various text genres, with a total of 1 million word forms. The entire corpus can be searched online.

  • Korpus Südtirol
    http://www.korpus-suedtirol.it/index_EN

    An initiative to collect, archive, and process South Tyrolean German texts using corpus-linguistic methods.

English

  • British National Corpus (BNC)
    http://www.natcorp.ox.ac.uk

    The British National Corpus contains 100 million words of written and spoken language from various fields and aims to represent contemporary British English. Also available on CD.

  • American National Corpus (ANC)
    http://americannationalcorpus.org/

    The ANC aims to be the American equivalent of the BNC.

  • Loyola Computer-Mediated Communication Corpus
    http://cmccorpus.cs.loyola.edu/

    900 text samples of computer-mediated communication from Loyola College in Baltimore, Maryland (USA).

  • Michigan Corpus of Academic Spoken English: MiCASE
    http://quod.lib.umich.edu/m/micase/

    Freely available, with an online search function and flat annotation. Comprises 152 transcriptions (1,848,364 words).

  • International Corpus of English (ICE)
    http://ice-corpora.net/ice/

    Corpora of regional varieties of English. Each corpus consists of one million words of spoken and written English produced after 1989. All corpora share a common design and a common scheme for grammatical annotation. Many of the corpora are free for non-commercial academic research.

French

Spanish

Italian

Catalan

Swedish

Czech

  • Český národní korpus (ČNK)
    http://ucnk.ff.cuni.cz

    Czech National Corpus. Queries can be made online or via the GUI "Bonito".

Finnish

Russian

Turkish

  • Turkish National Corpus
    http://www.tnc.org.tr/

Multilingual Corpora

  • OPUS - Open Source Parallel Corpus
    http://opus.lingfil.uu.se/

    OPUS comprises 30 million words in 60 languages. The corpus also includes OpenOffice documentation (OO), PHP manuals (PHP), and KDE manuals (KDEdoc) with KDE system news.

  • Multext Project
    http://www.lpl.univ-aix.fr/projects/multext/

    Multilingual Text Tools and Corpora

  • Multext-East
    http://nl.ijs.si/ME/

    MULTEXT-East is a corpus covering six languages: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovenian. English is the "hub language" of the project.

  • Bohemica.com
    http://www.bohemica.com/index.php

    Translation corpus annotated in Czech and English, containing 100,000 words (24 written documents of 1,000-4,000 words each). The corpus contains both fiction and non-fiction and is available for download.

  • RuN-Euro Corpus
    http://www.nevmenandr.net/run/index.php#

    Parallel corpus that originally consisted of Norwegian and Russian texts and also includes texts in other European languages. The texts are aligned at the sentence level and tagged for grammatical information at the word level.

Diachronic

German

English

Portuguese

  • O Corpus do Português
    http://www.corpusdoportugues.org/

    Corpus of 45 million words in 50,000 texts published between the 14th and 20th centuries. Lemmas and parts of speech are annotated. A powerful web interface allows searching by text, register, dialect, and time period. Statistical calculations based on the search results are also possible.

  • Tycho Brahe Parsed Corpus of Historical Portuguese
    http://www.tycho.iel.unicamp.br/~tycho/corpus/index.html

    Syntactically annotated. Downloadable.

French

Italian

Dutch

Spanish

Further Resources