
Humboldt-Universität zu Berlin - Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology

BeMaTaC

A deeply annotated multimodal map-task corpus of spoken learner and native German

This website is also available in German.


About


The Berlin Map Task Corpus (BeMaTaC) is a freely available corpus of spoken German. It consists of an L1 subcorpus recorded with native speakers of German and an identically designed L2 subcorpus with advanced speakers of German as a foreign language. BeMaTaC uses a map-task design, where one speaker (the instructor) instructs another speaker (the instructee) to reproduce a route on a map with landmarks. The speakers cannot see each other and are thus unable to communicate non-verbally. The dialogues are recorded with two separately placed microphones and a video showing the drawing hand of the instructee. Transcriptions are consistently tokenized, time-aligned and annotated on a wide and easily extendable range of different layers. Extensive and anonymized metadata are provided with every dialogue.

The current 2.1 / 2013-02.1 release contains the L1 subcorpus with 12 dialogues (66 minutes total, 8900 normalized tokens) as well as the L2 subcorpus with 5 dialogues (77 minutes total, 9228 normalized tokens). 9 more dialogues (101 minutes total) are currently being transcribed and annotated. The next release of our corpus is projected for late 2015.

[Figure: sample maps for the instructor and the instructee]

Access


BeMaTaC can be accessed using ANNIS, an open-source browser-based search and visualization tool for deeply annotated corpora.


Annotation


The current 2.1 / 2013-02.1 release contains the following layers:

  • Loosely orthographic transcription including fillers, truncations, colloquial contractions and idiosyncratic pronunciations
  • Normalized orthographic transcription
  • Automatically generated lemmatization
  • Automatically generated part-of-speech tags using the STTS (Stuttgart-Tübingen-TagSet)
  • Syntactically motivated utterance spans
  • Backchanneling (in the L1 subcorpus, only the instructee's backchanneling is annotated)
  • Disfluencies: fillers (filled pauses), prolongations, mispronunciations, explicit editing terms and repetitions
  • Repairs: reparandum, interregnum, reparans
  • Repair subcategorizations: repetitions, substitutions, insertions
  • Extralinguistic events
  • Breaks (unfilled pauses)
  • Token length
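The three-part repair structure listed above (reparandum, interregnum, reparans) together with its subcategory can be pictured as spans over the token sequence. The following is only a minimal sketch: the class and field names, the index-based span format and the example utterance are assumptions made for illustration and do not reflect the corpus's actual layer names or data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical span format: inclusive (first, last) token indices.
Span = Tuple[int, int]

# Repair subcategories named in the annotation list.
SUBTYPES = {"repetition", "substitution", "insertion"}


@dataclass(frozen=True)
class Repair:
    """Illustrative container for one repair; not the corpus's own schema."""
    reparandum: Span             # material that is being repaired
    interregnum: Optional[Span]  # optional material in between, e.g. a filler
    reparans: Span               # material that repairs the reparandum
    subtype: str                 # one of SUBTYPES

    def __post_init__(self) -> None:
        if self.subtype not in SUBTYPES:
            raise ValueError(f"unknown repair subtype: {self.subtype!r}")
        parts = [p for p in (self.reparandum, self.interregnum, self.reparans)
                 if p is not None]
        for first, last in parts:
            if first > last:
                raise ValueError("span start must not exceed span end")
        # The parts have to follow each other without overlapping.
        for (_, prev_last), (next_first, _) in zip(parts, parts[1:]):
            if next_first <= prev_last:
                raise ValueError("repair parts must be ordered and non-overlapping")


def span_text(tokens: List[str], span: Span) -> str:
    """Return the tokens covered by an inclusive span as one string."""
    first, last = span
    return " ".join(tokens[first:last + 1])


# Invented example utterance, not taken from the corpus.
tokens = ["geh", "nach", "links", "äh", "nach", "rechts"]
repair = Repair(reparandum=(1, 2), interregnum=(3, 3), reparans=(4, 5),
                subtype="substitution")
```

In this invented example, `span_text(tokens, repair.reparandum)` covers the replaced stretch, the filler forms the interregnum, and `span_text(tokens, repair.reparans)` covers the replacement.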

The following data is available as part of the NoSta-D corpus:

  • Syntactic dependencies
  • Named entity recognition and disambiguation
  • Coreferences

We are currently working on the following annotations:

  • Automatic annotation of breaks, fillers and repetitions
  • Improved part-of-speech tagging by taking utterance spans into account
  • Semi-automatic normalization
  • Manually corrected part-of-speech tags (L1 subcorpus)

Long-term annotation plans:

  • Hyperlemma annotation for idiosyncratic lexical items
  • Manually corrected lemmatization
  • Manually corrected part-of-speech tags (L2 subcorpus)
  • Phonetic/phonological transcription/annotation
  • Syntactic features
  • Information structure

Documentation


The following documents apply to the most recent release; previous versions may contain data incompatible with these guidelines.


Download


BeMaTaC is licensed under a Creative Commons Attribution 3.0 Unported License.

If you are using our corpus for research or if you are planning on extending BeMaTaC with further annotations, please tell us about it.


L1 subcorpus: 2.1 / 2013-02.1 release


L2 subcorpus: 2.1 / 2013-02.1 release


Other releases

  • Syntactic dependencies, named entities and coreferences are available as part of the NoSta-D corpus.
  • Previous releases are available for download in the release history section of this website.

Team & Contact



Publications


How to cite BeMaTaC

  • Please always cite this website, using the following address: http://u.hu-berlin.de/bematac

  • If mandated by your citation requirements, you may cite Simon Sauer as the primary editor.

  • Currently, there is no published paper that has a general description of BeMaTaC. In addition to the website, however, you may cite the following posters:

  • When citing specific data from within the corpus, please refer to the subcorpus (L1 or L2), the corpus version (e.g. 2013-02.1), the specific document (e.g. 2011-12-14-A), and the token range as given in the tok layer.
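The elements requested for citing specific data (subcorpus, corpus version, document and token range) can be assembled mechanically. The sketch below is only one possible way to do this: the function name, the output wording and the token numbers are assumptions, since this page does not prescribe a fixed string format, and pairing the example document with the L1 subcorpus is likewise assumed.

```python
# URL as given in the citation instructions on this page.
BEMATAC_URL = "http://u.hu-berlin.de/bematac"


def format_data_reference(subcorpus: str, version: str, document: str,
                          first_tok: int, last_tok: int) -> str:
    """Build a reference string for a token range (hypothetical format)."""
    if subcorpus not in ("L1", "L2"):
        raise ValueError("subcorpus must be 'L1' or 'L2'")
    if first_tok > last_tok:
        raise ValueError("first_tok must not exceed last_tok")
    return (f"BeMaTaC {subcorpus} subcorpus, version {version}, "
            f"document {document}, tok {first_tok}-{last_tok}. {BEMATAC_URL}")


# Version and document are the example values from the text above;
# the token range 15-32 is invented for illustration.
reference = format_data_reference("L1", "2013-02.1", "2011-12-14-A", 15, 32)
print(reference)
```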


2015

  • Malte Belz, Simon Sauer, Anke Lüdeling, Christine Mooshammer. 2015. Repair Behaviour of Advanced German Learners in the Berlin Map Task Corpus. IFCASL Workshop on Phonetic Learner Corpora, satellite workshop of ICPhS2015, Glasgow, 12.08.2015.

  • Anke Lüdeling, Malte Belz, Hagen Hirschmann, Martin Klotz, Carolin Odebrecht, Laura Perlitz, Simon Sauer, Vivian Voigt. 2015. BeMaTaC, Falko, RIDGES. Linguistische Mehrebenenkorpora für Nichtstandard-Varietäten des Deutschen. Digital-Humanities-Tag 2015, Philosophische Fakultät II, Humboldt-Universität zu Berlin. [poster]

  • Simon Sauer. 2015. BeMaTaC: Ein tief annotiertes multimodales Map-Task-Korpus gesprochener Lerner- und Muttersprache. Gesprochene Fremdsprache Deutsch – Forschung und Vermittlung, Universidade de Lisboa, 26.–28.02.2015. [abstract]

2014

  • Malte Belz. 2014. Managing referential mismatches in German map task dialogues. RefNet Workshop, Edinburgh, 31.08.2014. [abstract]

  • Oxana Rasskazova, Simon Sauer, Christine Mooshammer. 2014. Berlin Dialog Corpus (BeDiaCo) – ein multimodales Korpus für Konvergenz- und Dialogforschung. Workshop Sprachdatenbanken – von der Aufnahme zur Publikation, CLARIN-D. [poster]

  • Simon Sauer & Oxana Rasskazova. 2014. BeMaTaC – eine digitale multimodale Ressource für Sprach- und Dialogforschung. Workshop Grenzen überschreiten – Digitale Geisteswissenschaft heute und morgen, Digital Humanities Berlin 2014. [poster]

  • Malte Belz. 2014. Repair disfluencies in German native and non-native speech. Linguistic Evidence 2014. [poster]

2013

  • Myriam Klapi. 2013. Disfluency Patterns: A Contrastive Corpus Study. Master's thesis. Humboldt-Universität zu Berlin, December 2013.

  • Malte Belz. 2013. Disfluencies und Reparaturen bei Muttersprachlern und Lernern – eine kontrastive Analyse. Master's thesis. Humboldt-Universität zu Berlin, November 2013. [online]

  • Oxana Rasskazova & Simon Sauer. 2013. BeMaTaC: ein multimodales Map-Task-Dialogkorpus. Pre-conference workshop Gesprochene Sprache und Sprachverarbeitung, GSCL 2013. [abstract]

  • Anke Lüdeling. 2013. Corpora of Spoken Language. Invited talk. From Hand to Mouth: A Dialogue between Spoken and Sign Language Research 2013. [slides]

  • Malte Belz & Myriam Klapi. 2013. Pauses following Fillers in L1 and L2 German Map Task Dialogues. Proceedings of Disfluency in Spontaneous Speech. DiSS 2013, 9–12. [online]

  • Clara Becker. 2013. Doing Backchanneling – Verhalten von Frauen und Männern beim Backchanneling im aufgabenorientierten Dialog. Bachelor's thesis. Humboldt-Universität zu Berlin, July 2013. [online]

  • Simon Sauer & Anke Lüdeling. 2013. BeMaTaC: A Flexible Multilayer Spoken Dialogue Corpus for Contrastive SLA Analyses. ICAME 34, 46–47. [abstract]

  • Linda Giesel, Myriam Klapi, Daisy Krüger, Isabelle Nunberger, Oxana Rasskazova, Simon Sauer. 2013. Gesprochene Muttersprache vs. Lernersprache – Aufbau und Auswertung eines Korpus. Forschendes Lernen an der Humboldt-Universität zu Berlin, 81–86. [online]

  • Linda Giesel, Myriam Klapi, Daisy Krüger, Isabelle Nunberger, Oxana Rasskazova, Simon Sauer. 2013. Berlin Map Task Corpus – A deeply annotated multimodal map-task corpus of spoken learner and native German. DGfS-CL 2013. [poster]

Teaching


A key aim of BeMaTaC is to promote the use of corpora and to teach the necessary expertise. This is accomplished not only by using BeMaTaC data in linguistics courses but also by actively extending the corpus in class.


Winter term 2014/2015


Winter term 2013/2014


Summer term 2013


Winter term 2012/2013


Winter term 2011/2012


Tools & References


  • Original map-task design by HCRC
    Anne H. Anderson, Miles Bader, Ellen Gurman Bard, Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, Stephen Isard, Jacqueline Kowtko, Jan McAllister, Jim Miller, Catherine Sotillo, Henry Thompson & Regina Weinert. 1991. The HCRC Map Task Corpus. Language and Speech 34, 351–366.

  • Original corpus design based on HAMATAC
    Thomas Schmidt, Hanna Hedeland, Timm Lehmberg & Kai Wörner. 2010. HAMATAC – The Hamburg MapTask Corpus. [online]

  • Maps courtesy of IDS Mannheim
    Caren Brinckmann, Stefan Kleiner, Ralf Knöbl & Nina Berend. 2008. German Today: an areally extensive corpus of spoken Standard German. Proceedings 6th International Conference on Language Resources and Evaluation. LREC 2008. [online]

  • Automatic segmentation and alignment: MAUS
    Florian Schiel, Christoph Draxler & Jonathan Harrington. 2011. Phonemic Segmentation and Labelling using the MAUS Technique. Workshop New Tools and Methods for Very-Large-Scale Phonetics Research. University of Pennsylvania, 2011, January, 28–31. [online]

  • Manual alignment and normalization: Praat
    Paul Boersma. 2010. Praat, a system for doing phonetics by computer. Glot International 5 (9/10), 341–345.

  • Annotation and metadata: EXMARaLDA
    Thomas Schmidt & Kai Wörner. 2009. EXMARaLDA – Creating, analysing and sharing spoken language corpora for pragmatic research. Pragmatics (19:4), 565–582.

  • Lemmatization and part-of-speech tagging: TreeTagger
    Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing. [online]

  • Part-of-speech tagset: STTS
    Anne Schiller, Simone Teufel, Christine Stöckert & Christine Thielen. 1999. Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset). [online]

  • Converter framework: SaltNPepper
    Florian Zipser & Laurent Romary. 2010. A model oriented approach to the mapping of annotation formats using standards. Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. [online]

  • Search and visualization interface: ANNIS
    Amir Zeldes, Julia Ritz, Anke Lüdeling & Christian Chiarcos. 2009. ANNIS: A Search Tool for Multi-Layer Annotated Corpora. Proceedings of Corpus Linguistics 2009, July, 20–23. [online]


Last update: 06 October 2015