KiDKo: Main corpus and complementary corpus

Giving Credit

The text corpus is licensed under CC-BY 4.0 with the following reference:

Heike Wiese, Ines Rehbein, Sören Schalowski, Ulrike Freywald & Katharina Mayr (2010ff): KiDKo - Ein Korpus spontaner Unterhaltungen unter Jugendlichen im multiethnischen und monoethnischen urbanen Raum.

KiDKo - Ein Korpus spontaner Unterhaltungen unter Jugendlichen im multiethnischen und monoethnischen urbanen Raum by Heike Wiese, Ines Rehbein, Sören Schalowski, Ulrike Freywald & Katharina Mayr is licensed under a Creative Commons Attribution 4.0 International License.

Data collection

Spontaneous speech data of young people, from self-recordings: informal conversations between friends, mostly in German.

Speakers

The speakers were 9th-grade students aged 14 to 17 at the time of recording. Initial contact was made via two schools, one in Berlin-Kreuzberg and one in Berlin-Hellersdorf, where 84.4% and 4.8% of students, respectively, had a "non-German background language" (i.e. parents indicated on a questionnaire issued by the Berlin school administration that the main language spoken at home is not German; see also Wiese et al. 2012).

You can find detailed information about the anchor speakers here.

Here is a table giving the figures for individual speakers' shares of the corpus.

Size

Subcorpus	Number of tokens	Number of anchor speakers
Main corpus	~228,000	17 (10 male, 7 female)
Complementary corpus	~105,000	6 (5 male, 1 female)
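
The figures in the table can be combined into rough totals. The following is a minimal Python sketch that uses only the approximate token counts given above; the variable and function names are illustrative and not part of any KiDKo tooling.

```python
# Approximate figures copied from the size table above.
corpus_sizes = {
    "Main corpus": {"tokens": 228_000, "anchor_speakers": 17},
    "Complementary corpus": {"tokens": 105_000, "anchor_speakers": 6},
}

total_tokens = sum(part["tokens"] for part in corpus_sizes.values())
total_speakers = sum(part["anchor_speakers"] for part in corpus_sizes.values())


def token_share(name: str) -> float:
    """Return a subcorpus's share of all tokens as a fraction between 0 and 1."""
    return corpus_sizes[name]["tokens"] / total_tokens


if __name__ == "__main__":
    print(f"Total: ~{total_tokens:,} tokens, {total_speakers} anchor speakers")
    for name in corpus_sizes:
        print(f"{name}: {token_share(name):.1%} of tokens")
```

Because the token counts are approximate, the resulting shares should be read as rough proportions only.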

Corpus features

(cf. Rehbein, Schalowski & Wiese 2014)

The corpus consists of audio recordings with aligned, anonymised transcriptions. It is annotated with part-of-speech (POS) tags (Rehbein & Schalowski 2013) and provides an additional orthographic normalisation layer as well as translations of Turkish code-switching. A further annotation layer marks syntactic chunks and topological fields.
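
One way to picture the annotation layers described above is as parallel attributes on each token. The sketch below is purely illustrative: the class, its field names, the example utterance and the chunk and topological-field labels are all assumptions made for this example, not the corpus's actual data model or label inventory.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical container for one token and its annotation layers."""
    transcription: str                  # form as transcribed
    normalisation: str                  # orthographic normalisation layer
    pos: str                            # part-of-speech tag
    chunk: Optional[str] = None         # syntactic chunk label (assumed name)
    topo_field: Optional[str] = None    # topological field label (assumed name)
    translation: Optional[str] = None   # gloss for Turkish code-switching


def normalised_text(tokens: List[Token]) -> str:
    """Join the normalisation layer into a plain string."""
    return " ".join(t.normalisation for t in tokens)


def by_pos(tokens: List[Token], tag: str) -> List[Token]:
    """Return all tokens carrying the given POS tag."""
    return [t for t in tokens if t.pos == tag]


# Invented utterance for illustration only; it is not taken from the corpus.
example = [
    Token("isch", "ich", "PPER", chunk="NC", topo_field="VF"),
    Token("geh", "gehe", "VVFIN", chunk="VC", topo_field="LK"),
    Token("jetz", "jetzt", "ADV", chunk="ADVC", topo_field="MF"),
]

if __name__ == "__main__":
    print(normalised_text(example))
    print([t.transcription for t in by_pos(example, "VVFIN")])
```

Keeping the transcribed form and the normalised form side by side lets a search match either layer while the other annotations stay attached to the same token.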

The data were transcribed in EXMARaLDA (Extensible Markup Language for Discourse Annotation) (Schmidt & Wörner 2005). Transcription conventions are based on a modified form of 'GAT basic' (Selting et al. 1998), i.e. a mostly orthographic transcription that marks certain prosodic features: upper case for stress, specific characters for pauses and lengthening, and parentheses for non-verbal material.
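
To illustrate how such conventions can be handled programmatically, here is a minimal Python sketch that strips prosodic markup from a transcribed line and lists stressed words. It assumes GAT-style markers, namely "(.)" and "(-)" to "(---)" for pauses, ":" for lengthening, and parenthesised text for non-verbal material. The modified conventions used in KiDKo may differ, and both the function names and the sample line are invented for this example.

```python
import re
from typing import List

# Assumed GAT-style markers; the modified KiDKo conventions may differ.
PAUSE = re.compile(r"\((?:\.|-{1,3})\)")      # (.) (-) (--) (---)
NON_VERBAL = re.compile(r"\(+[^()]*\)+")      # e.g. ((lacht))
LENGTHENING = re.compile(r"(?<=\w):+")        # colon(s) directly after a sound
STRESS = re.compile(r"[A-ZÄÖÜ]{2,}")          # run of upper-case letters


def _remove_markup(line: str) -> str:
    """Remove pause, non-verbal and lengthening markers; keep letter case."""
    line = PAUSE.sub(" ", line)
    line = NON_VERBAL.sub(" ", line)
    return LENGTHENING.sub("", line)


def strip_prosody(line: str) -> str:
    """Return the plain lower-case word string without prosodic markup."""
    return " ".join(_remove_markup(line).lower().split())


def stressed_words(line: str) -> List[str]:
    """Return words containing a run of two or more upper-case letters."""
    return [w for w in _remove_markup(line).split() if STRESS.search(w)]


if __name__ == "__main__":
    sample = "das is SO: (.) krass ((lacht))"  # invented example line
    print(strip_prosody(sample))
    print(stressed_words(sample))
```

Lower-casing everything is a simplification: it removes the stress marking but also any capitalisation the transcript may otherwise carry.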

Each transcript contains meta-information on socio-demographic features and the linguistic background of the speakers (for all anchor speakers: sex, residential area, family language).

Corpus access

The corpus is available online via the Hamburger Zentrum für Sprachkorpora (HZSK).

For legal reasons, we are not allowed to make the audio files accessible online. Instead, we have set up a local workstation at Humboldt-Universität zu Berlin where you can access the audio data. If you would like to do so, please contact us to arrange an appointment (heike.wiese at hu-berlin.de).

Alternatively, you can access the data via the repository of Universität Hamburg: https://www.fdr.uni-hamburg.de/record/8247

Additionally, the corpus is available for reading as PDF files. Due to the amount of data, the subcorpus KiDKo/Mu is split into five files:

Information on KiDKo and ANNIS

Here you can find a general overview and introduction, with some examples on KiDKo searches.
Here you can find information on the transcription and normalisation in KiDKo.
STTS Guidelines (Stuttgart-Tübingen Tagset)
Overview of the STTS POS tagset
Extended POS-Tagset used for the annotation of parts-of-speech in KiDKo
Quickstart - working with ANNIS and KiDKo
ANNIS User Guide

References

Rehbein, I., Schalowski, S., and Wiese, H. (2014). The KiezDeutsch Korpus (KiDKo) Release 1.0. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), May 24-31, 2014. Reykjavik, Iceland.
Rehbein, I., and Schalowski, S. (2013). STTS goes Kiez - Experiments on Annotating and Tagging Urban Youth Language. Journal for Language Technology and Computational Linguistics 28: 199-227 (Themenheft "Das STTS-Tagset für Wortartentagging - Stand und Perspektiven").
Selting, M., Auer, P., Barden, B., Bergmann, J., Couper-Kuhlen, E., Günthner, S., Meier, C., Quasthoff, U., Schlobinski, P., and Uhmann, S. (1998). Gesprächsanalytisches Transkriptionssystem (GAT). Linguistische Berichte 173: 91-122.
Wiese, H., Freywald, U., Schalowski, S., and Mayr, K. (2012). Das KiezDeutsch-Korpus. Spontansprachliche Daten Jugendlicher aus urbanen Wohngebieten. Deutsche Sprache 40: 97-123.
Zeldes, A., Ritz, J., Lüdeling, A., and Chiarcos, C. (2009). ANNIS: A search tool for multi-layer annotated corpora. In Proceedings of Corpus Linguistics, July 20-23, 2009. Liverpool, UK.

Faculty of Language, Literature and Humanities - German in Multilingual Contexts