Sprach- und literaturwissenschaftliche Fakultät - Korpuslinguistik und Morphologie

Amir Zeldes - Datasets

This page provides raw datasets used in my publications and teaching. Instructions on how to cite the use of the data are given for each dataset.

Falko Noun Compounding Data

This dataset includes all nouns (POS tag "NN") in the Falko essay corpora FalkoEssayL2v2.2 (advanced German learners) and FalkoEssayL1v2.2 (comparable native speaker data). The data gives for each noun a classification into 'compound' or 'simplex', as well as lemma, head and modifier (for compounds) and the first native language of the writer.
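
As a rough illustration of how such per-noun records might be used, the following Python sketch computes the share of compounds per first language. The field names (`lemma`, `type`, `head`, `modifier`, `l1`) and the sample records are assumptions made for this example; they are not the actual column names or contents of the dataset.

```python
from collections import Counter


def compound_rate_by_l1(records):
    """Return {first language: share of nouns classified as 'compound'}.

    ASSUMPTION: each record is a dict with the hypothetical keys
    'type' ('compound' or 'simplex') and 'l1' (writer's first language).
    """
    totals = Counter()
    compounds = Counter()
    for rec in records:
        totals[rec["l1"]] += 1
        if rec["type"] == "compound":
            compounds[rec["l1"]] += 1
    return {l1: compounds[l1] / totals[l1] for l1 in totals}


# Invented records, for illustration only (not taken from Falko).
sample = [
    {"lemma": "Haustür", "type": "compound", "head": "Tür", "modifier": "Haus", "l1": "en"},
    {"lemma": "Haus", "type": "simplex", "head": "", "modifier": "", "l1": "en"},
    {"lemma": "Stadtplan", "type": "compound", "head": "Plan", "modifier": "Stadt", "l1": "de"},
]
rates = compound_rate_by_l1(sample)
```

With the real data, the records would first have to be read from the distributed file and mapped onto whatever column names it actually uses.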

The corpus itself is described in Reznicek et al. (2010), and the extraction and analysis of the nouns are described in Zeldes (2013). Please cite both references when making use of the data.



  • Reznicek, Marc, Maik Walter, Karin Schmid, Anke Lüdeling, Hagen Hirschmann, Cedric Krummes & Thorsten Andreas (2010). Das Falko-Handbuch. Korpusaufbau und Annotationen. Version 1.0.1. Technical report, Humboldt-Universität zu Berlin.
  • Zeldes, Amir (2013), "Komposition als Konstruktionsnetzwerk im fortgeschrittenen L2-Deutsch". Zeitschrift für germanistische Linguistik 41(2), 240-276.

PCC11 Information Status and Topicality Data

This dataset includes information-structural annotations based on the guidelines in Dipper et al. (2007) for all discourse referents from pcc11, a sample of the Potsdam Commentary Corpus (Stede 2004). The data was extracted using ANNIS (Zeldes et al. 2009).

The data contains the string representation of each referent (with spaces replaced by underscores), its information status, with the values "giv" (given), "new", "acc" (accessible) and "idiom" (for non-referential idiomatic phrases), and its topicality, with the values "ab" (aboutness topic), "fs" (framesetter) and "nt" (non-topic). A further column gives the more fine-grained information status tagset of Dipper et al. (2007), which adds the subtypes active and inactive for given referents, and inferable, general, situational and aggregate for accessible referents.
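
The tag values listed above can be expanded into readable labels with a small lookup, sketched here in Python. The tab-separated layout and the column order (referent, information status, topicality) are assumptions for this example, as are the sample rows; the sketch also reads the description as meaning that spaces in referent strings were replaced by underscores.

```python
import csv
import io

# Tag values as listed in the dataset description.
INFO_STATUS = {
    "giv": "given",
    "new": "new",
    "acc": "accessible",
    "idiom": "non-referential idiomatic phrase",
}
TOPICALITY = {
    "ab": "aboutness topic",
    "fs": "framesetter",
    "nt": "non-topic",
}


def decode_rows(handle):
    """Yield one dict per referent from a tab-separated file-like object.

    ASSUMPTION: the first three columns are referent string,
    information status and topicality, in that order. Any further
    columns (e.g. the fine-grained tagset) are ignored here.
    """
    for row in csv.reader(handle, delimiter="\t"):
        referent, status, topic = row[:3]
        yield {
            "referent": referent.replace("_", " "),  # restore spaces
            "info_status": INFO_STATUS[status],
            "topicality": TOPICALITY[topic],
        }


# Invented sample rows, for illustration only (not taken from pcc11).
sample = io.StringIO("die_Stadt\tgiv\tab\nim_Sommer\tnew\tfs\n")
rows = list(decode_rows(sample))
```

An unknown tag raises a `KeyError` rather than passing through silently, which makes it easier to spot a mismatch between these assumptions and the real file.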



  • Dipper, Stefanie, Michael Götze & Stavros Skopeteas (eds.) (2007), "Information Structure in Cross-Linguistic Corpora: Annotation Guidelines for Phonology, Morphology, Syntax, Semantics, and Information Structure". Interdisciplinary Studies on Information Structure 7.
  • Stede, Manfred (2004), "The Potsdam Commentary Corpus". In: Bonnie Webber & Donna K. Byron (eds.), Proceedings of the ACL-04 Workshop on Discourse Annotation. Barcelona, Spain, 96–102.
  • Zeldes, Amir, Ritz, Julia, Lüdeling, Anke & Chiarcos, Christian (2009), "ANNIS: A Search Tool for Multi-Layer Annotated Corpora". In: Proceedings of Corpus Linguistics 2009, July 20-23, Liverpool, UK.

Sahidic Coptic Corpora

Several richly annotated corpora of Sahidic Coptic, created in collaboration with Prof. Caroline T. Schroeder (University of the Pacific), are now available under a CC-BY license. See details here: