Corpus Linguistics and Morphology


Current projects


ANNIS is an open source, versatile web browser-based search and visualization architecture for complex multilevel linguistic corpora with diverse types of annotation. ANNIS, which stands for Annotation of Information Structure, has been designed to provide access to the data of the SFB 632 "Information Structure: The Linguistic Means for Structuring Utterances, Sentences and Texts". Since information structure interacts with linguistic phenomena on many levels, ANNIS2 addresses the SFB's need to concurrently annotate, query and visualize data from such varied areas as syntax, semantics, morphology, prosody, referentiality, lexis and more. For projects working with spoken language, support for audio/video annotations is also required.



The LangBank (Digital Infrastructure to Support the Study of Latin and Historical German) project is dedicated to creating a resource of annotated texts in Classical Latin and Historical German. Access to a wide range of fully annotated texts is an important asset for research in the humanities as well as for language acquisition: teachers and students need texts suited both to the intended illustrative purpose and to the learner's proficiency level, while scholars need to access texts with respect to specific language properties, such as grammatical constructions, vocabulary and spelling differences.



The management and archiving of digital research data is an overlapping field for linguistics, library and information science (LIS) and computer science. These disciplines are cooperating in the LAUDATIO project. The name LAUDATIO is an abbreviation for Long term Access and Usage of Deeply Annotated Information. The project is funded by the German Research Foundation from 2011 to 2014. The departments of Corpus Linguistics and Historical Linguistics and the Computer and Media Service (CMS) at Humboldt-Universität zu Berlin, together with the National Institute for Research in Computer Science and Control (INRIA France), are project partners cooperating with the Berlin School of Library and Information Science (BSLIS).

LAUDATIO aims to build an open access research data repository for historical linguistic data with respect to the above-mentioned requirements of historical corpus linguistics. For the access and (re-)use of historical linguistic data, the LAUDATIO repository uses a flexible documentation schema based on a subset of TEI customized with TEI ODD. The extensive metadata schema contains information about the preparation and checking methods applied to the data, the tools, formats and annotation guidelines used in the project, as well as bibliographic metadata and information on the research context (e.g. the research project). To provide complex and comprehensive search in the linguistic annotation data, the linguistic search and visualization tool ANNIS will be integrated into the LAUDATIO repository infrastructure.


Mind Research Repository (MRR)

The Mind Research Repository (MRR) provides access to publications along with data and scripts for analyses and figures reported in them. It is a further development of a project started as the Potsdam Mind Research Repository (PMR2) in August 2010.
A combination of paper plus data plus scripts is referred to as a paper package. The main goals of the Mind Research Repository are the following:

  • Document data and analyses used in publications in a public forum.
  • Invite readers (a) to reproduce analyses/figures, (b) to try out and possibly publish alternative analyses, or (c) to adopt scripts for their own data.
  • Enable readers to provide authors with feedback about their scripts, both about necessary corrections of errors and more elegant alternative code.
  • Serve as a site for experimental results that were not published because they did not turn out as expected, assuming that there were no technical or other obvious reasons for the failure of the experiment. Making such data available in the context of research that did yield the desired results may inspire others to take a new look. Perhaps this way the problem associated with the well-known bias for publications with positive results can be (slightly) reduced.



With SaltNPepper we provide two powerful frameworks for dealing with linguistically annotated data. SaltNPepper is an open source project developed at the Humboldt University of Berlin. In linguistic research a variety of formats exists, but there is no common way of dealing with them. We therefore developed a metamodel called Salt, which abstracts over linguistic data. Salt is based on a general graph structure and treats linguistic data as sets of nodes and edges, which makes it usable in very different contexts of linguistic analysis. Pepper is a pluggable framework into which new modules can be plugged (using OSGi). Its architecture is flexible and makes it possible to benefit from already existing modules.
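The idea of treating linguistic data as nodes and edges can be illustrated with a small sketch. The Python below is not Salt's actual API: the class names, the method names, the "dominance" edge type and the example tokens are all hypothetical and only show how tokens and a syntactic constituent might be held in one general graph.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node, e.g. a token or a constituent (hypothetical class)."""
    node_id: str
    annotations: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A typed, directed edge between two node ids (hypothetical class)."""
    source: str
    target: str
    edge_type: str


@dataclass
class AnnotationGraph:
    """Linguistic data held as a collection of nodes and edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, **annotations):
        self.nodes[node_id] = Node(node_id, dict(annotations))
        return self.nodes[node_id]

    def add_edge(self, source, target, edge_type):
        # Refuse edges whose endpoints are not part of the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(source, target, edge_type))

    def children(self, node_id, edge_type=None):
        """Return the ids of nodes reached by one outgoing edge."""
        return [
            e.target
            for e in self.edges
            if e.source == node_id
            and (edge_type is None or e.edge_type == edge_type)
        ]


# Two tokens with part-of-speech annotations and one phrase node above them.
graph = AnnotationGraph()
graph.add_node("tok1", text="Die", pos="ART")
graph.add_node("tok2", text="Katze", pos="NN")
graph.add_node("np1", cat="NP")
graph.add_edge("np1", "tok1", "dominance")
graph.add_edge("np1", "tok2", "dominance")

print(graph.children("np1", edge_type="dominance"))
```

Because annotations are plain key-value pairs on nodes and relations are typed edges, further layers (for example dependency or coreference edges) could be added to the same graph without changing the data structure.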


<tiger2/> is a standard-conformant XML format serializing the ISO SynAF model (ISO 24615:2010) for expressing syntactic annotation for a wide variety of theoretical formalisms and corpus architectures. It is closely related to and develops the ideas found in TigerXML (http://www.ims.uni-stuttgart.de/projekte/TIGER/). The format is conceived as theory neutral: it is suited to both shallow and deep parsing in any number of theories and supports both pure constituency and dependency trees, as well as combinations of the two. For more information (schemas, API, etc.) see:





The Berlin Map Task Corpus (BeMaTaC) is a freely available corpus of spoken German. It consists of an L1 subcorpus recorded with native speakers of German and an identically designed L2 subcorpus with speakers of German as a foreign language. BeMaTaC uses a map-task design, where one speaker (the instructor) instructs another speaker (the instructee) to reproduce a route on a map with landmarks. The dialogues are recorded with two separately placed microphones and a video showing the drawing hand of the instructee. Transcriptions are consistently tokenized, time-aligned and annotated on a wide and easily extendable range of different layers. Extensive and anonymized metadata are provided with every dialogue.



Deutsche Diachrone Baumbank

The DDB (Deutsche Diachrone Baumbank) is a small (ca. 8000 tokens), deeply syntactically annotated corpus consisting of three subcorpora from different language periods of German (Old High German, Middle High German, Early New High German). The setup of the corpus mainly follows the TIGER corpus, one of the largest freely accessible treebanks of German. DDB was developed within the project "Interdisciplinary research network linguistics – bioinformatics for the computation of kinship and descent", supported by the Senate of Berlin.

Homepage: http://korpling.german.hu-berlin.de/ddb-doku/index.htm
Corpus: http://korpling.german.hu-berlin.de/ddd/search.html

Fairy tales corpus (Märchenkorpus)

The fairy tales corpus contains 201 "Kinder- und Hausmärchen" and the 10 children's legends (Kinderlegenden) that are printed in the second volume of the final edition by the Brothers Grimm. The corpus was designed, compiled and edited for the seminar "Drama pedagogy of fairy tales: Linguistics, Pedagogy and Theatre". The seminar, led by Maik Walter, took place in the summer term 2013 at the German Department of the University of Tübingen (see Maik Walter (in press): Es VERBte (ein)mal. Linguistisches Forschungstheater im Grimm-Jahr 2013. Zeitschrift für Theaterpädagogik 63. 29. Jahrgang. Themenheft: Forschung, Fachdiskurse & Labore).



Falko is a freely available error-annotated learner corpus of German as a foreign language.



KanDeL (Kansas Developmental Learner corpus) is a freely available longitudinal learner corpus of beginning to intermediate learners of German as a foreign language, constructed at the University of Kansas by Nina Vyatkina.



CLARIN-D curation project: Linguistic annotation of nonstandard varieties — guidelines and „best practices“ (F-AG 7 | KP 2)

The RIDGES project (Register in Diachronic German Science) is an investigation into the development of the German scientific language in the early modern and modern periods, ranging from the mid-16th to the late 19th century.



INDUS network

Individualized language learning (as a counterpart to standardized mass courses) has come within reach thanks to the latest developments in language technology. This makes it possible to cover not only widely spoken languages but also "small" ones. It turns out, however, that embedding these technologies in real learning situations raises many new questions that can only be answered by a research effort spanning many disciplines.

To this end, the INDUS network brings together researchers from language technology, linguistics, educational research, the psychology of learning, educational psychology, language acquisition research and language teaching methodology who have already dealt with language learning in the context of their specific expertise. Together they work on concrete research questions that relate primarily to aspects of individualization, e.g. modeling the learner, adapting teaching material to different starting conditions such as native language and prior knowledge, and generating helpful feedback.


Kobalt-DAF network


Annotation and analysis of argumentative learner texts

Converging approaches to a written corpus of German as a foreign language



Finished projects and networks



The network, which is funded by the German Research Foundation (DFG), combines skills from German Linguistics, Computational Linguistics, Computer Science and Psychology in order to achieve two goals: first, based on a set of concrete research questions, to compile suggestions for standards and for the processing of linguistic data from German internet-based communication and, second, to develop methods and tools for their empirical computer-assisted analysis. The findings will be documented in publications, and the suggestions for standards and procedures will successively be provided online.



Using methods from computational linguistics, this project will identify indicators of the quality of students’ texts in the German language. Special emphasis will be placed on the evolution of those quality indicators across competence levels, i.e. the development of observable parameter values over time as the students’ language skills improve. The study will be based on essays, test results, students’ attitudes and personal information from the city of Hamburg’s longitudinal KESS study, as well as material from other surveys. The core of this dataset comprises approximately 9000 essays, which were rated along several dimensions.


This project seeks to systematically identify linguistic structures of German that pose a specific difficulty for the acquisition of German as a foreign language (GFL). Conventionally, this is done by observing learner errors (see Borin & Prütz 2004 or Westergren-Axelsson & Hahn 2001). However, if learners avoid difficult elements, this method fails. We claim that the relative underrepresentation of structures in learner data implies that these structures are difficult to acquire. Therefore, we propose a systematic study of underrepresented structures.
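One way to make "relative underrepresentation" concrete is to compare how often a structure occurs per token in learner data with its rate in some reference data. The sketch below is not the project's method: the comparison against a native-speaker reference corpus, the function names, the structure labels and all counts are assumptions invented for illustration.

```python
def relative_frequency(count, corpus_size):
    """Occurrences of a structure per token."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count / corpus_size


def underuse_ratio(learner_count, learner_size, native_count, native_size):
    """Learner rate divided by reference rate.

    A value below 1.0 means the structure is rarer in the learner data
    than in the reference data.
    """
    native_rate = relative_frequency(native_count, native_size)
    if native_rate == 0:
        raise ValueError("structure does not occur in the reference corpus")
    return relative_frequency(learner_count, learner_size) / native_rate


# Invented toy figures: (count in learner corpus, count in reference corpus).
LEARNER_TOKENS = 10_000
NATIVE_TOKENS = 20_000
toy_counts = {
    "passive": (30, 120),
    "verb-final subordinate clause": (90, 200),
    "modal particle": (5, 100),
}

ratios = {
    name: underuse_ratio(learner, LEARNER_TOKENS, native, NATIVE_TOKENS)
    for name, (learner, native) in toy_counts.items()
}

# List the most strongly underrepresented structures first.
ranked = sorted(ratios.items(), key=lambda item: item[1])
for name, ratio in ranked:
    print(f"{name}: {ratio:.2f}")
```

A raw ratio like this ignores corpus size and chance variation, so a real study would pair it with a significance test and control for topic and text type before treating a structure as difficult to acquire.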