Corpora

Overview

Synchronic

German

English

  • British National Corpus (BNC)
    http://www.natcorp.ox.ac.uk

    The British National Corpus contains 100 million words of written and spoken language from various fields and aims to represent contemporary British English. Also available on CD.

  • American National Corpus (ANC)
    http://americannationalcorpus.org/

    The ANC aims to be the American equivalent of the BNC.

  • Loyola Computer-Mediated Communication Corpus
    http://cmccorpus.cs.loyola.edu/

    900 text samples of computer-mediated communication from Loyola College in Baltimore, Maryland (USA).

  • Michigan Corpus of Academic Spoken English: MiCASE
    http://quod.lib.umich.edu/m/micase/

    Freely available, with an online search function and flat annotation. Comprises 152 transcripts (1,848,364 words).

  • International Corpus of English (ICE)
    http://ice-corpora.net/ice/

    Corpora of regional varieties of English. Each corpus consists of one million words of spoken and written English produced after 1989. All share a common corpus design and a common scheme for grammatical annotation. Many of the corpora are free for non-commercial academic research.

French

Spanish

Italian

Catalan

Swedish

Czech

  • Cesky Národní Korpus (CNK)
    http://ucnk.ff.cuni.cz

    The Czech National Corpus. Queries can be made online or via the "Bonito" GUI.

Finnish

Russian

Turkish

Multilingual Corpora

  • OPUS - Open Source Parallel Corpus
    http://urd.let.rug.nl/tiedeman/OPUS/

    OPUS comprises 30 million words in 60 languages. The corpus also includes Open Office documentation (OO), PHP manuals (PHP), and KDE manuals (KDedoc) with KDE system news.

  • Multext Project
    http://www.lpl.univ-aix.fr/projects/multext/

    Multilingual Text Tools and Corpora

  • Multext-East
    http://nl.ijs.si/ME/

    MULTEXT-East is a corpus covering six languages: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovenian. English is the "hub language" of the project.

  • Bohemica.com
    http://www.bohemica.com/index.php

    Translation corpus annotated in Czech and English, containing 100,000 words (24 written documents of 1,000-4,000 words each). The corpus contains both fiction and non-fiction and is available for download.

Diachronic

German

English

Portuguese

  • O Corpus do Portugues
    http://www.corpusdoportugues.org/

    Corpus of 45 million words in 50,000 texts published between the 14th and 20th centuries. Lemmas and parts of speech (POS) are annotated. A powerful web interface allows searching by text, register, dialect, and time period. Statistical calculations based on the search results are also possible.

  • Tycho Brahe Parsed Corpus of Historical Portuguese
    http://www.tycho.iel.unicamp.br/~tycho/corpus/index.html

    Syntactically annotated. Downloadable.

French

Italian

Dutch

Spanish

Further Resources

last modified 11-09-23 by Burkhard Dietterle (Stud. Hilfskraft)