Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology




Date and Venue

06 - 07 January 2011
Humboldt-University Berlin


Anke Lüdeling (Humboldt University)
Gregory Crane (Tufts University)

Workshop Description

The rise of linguistic corpora and of analytical methods from corpus linguistics has begun to open up new pathways for language learning. This workshop will examine applications for students of both modern languages (such as English, German, Chinese and Arabic) and of historical languages (such as Greek and Latin) for which no native speakers survive.

Students can, for example, define corpora that represent those areas of the language on which they choose to focus – these could be documents from news media or canonical texts. Here we are investigating methods whereby students can assess their ability to acquire and then apply knowledge from these corpora as they work with new linguistic sources. We are also investigating ways for faculty to assess the competence of students exploring disparate genres within a given language, fostering scalable assessment of students who pursue different pathways.

We also look at the applications of annotations. How can learners benefit from developing and/or executing system annotation of linguistic corpora? How much can students learn about linguistics and/or about a particular language? A project such as the Greek and Latin Treebanks contains more than 1 million corrected syntactic annotations on individual words. How well can we detect patterns of difficulty among individual students and then use those patterns to personalize instruction?

Finally, we look at the opportunities that corpora provide for students to make tangible contributions and then conduct their own research. Thus, students may begin by annotating data that either constitutes a stand-alone (but repurposable) corpus or augments existing annotation. Such annotations are, if well-executed, themselves tangible contributions to scholarship. Students can then conduct their own research based on these annotations, generating new knowledge that can be automatically linked to the passages on which it sheds light.

Workshop Program


January 06, 2011

10:00 – 10:20

Welcome and Introduction: Anke Lüdeling & Greg Crane

10:20 – 11:10

Detmar Meurers

Enhancing Authentic Texts for Language Learners


11:10 – 11:30             


Coffee Break

11:30 – 12:20 

Stefanie Wulff

Using corpora in SLA research: potential and limitations


12:20 – 13:10 

Harry Diakoff

The use of corpora and annotated corpora in computer assisted language learning

13:10 – 14:30



Julia Richling & Anke Lüdeling

Researching and teaching register differences

15:20 – 15:50



Stefanie Dipper

Corpus-based ways to introduce syntax




January 07, 2011



David Smith

Efficient Inference for Declarative Approaches to Language




Heike Zinsmeister

Exploiting the 'Annotation Cycle' for Teaching Linguistics


John Lee

Introducing an online language learning environment and its corpus of tertiary student writing




David Bamman

Tracking Linguistic Variation in Historical Corpora




Kim Gerdes

Collaborative Dependency Annotation in the Classroom

15:40 – 16:30



Topics and Abstracts

  • David Bamman, Boston
    Tracking Linguistic Variation in Historical Corpora

    For cultural heritage languages with no living native speakers, textual corpora are not only a resource for helping learn a language; they are the only means by which we know them. These corpora are especially valuable for languages that show variation over a deep historical lifetime or wide geographical space (such as Greek and Latin), since they provide the raw material for detecting that variance empirically. I will describe in this talk three different strands of research for quantifying those differences in Greek and Latin: 1.) developing dependency treebanks (with the help of students) to help measure varying syntactic phenomena; 2.) leveraging parallel texts in Latin, Greek and English to automatically construct “dynamic lexica” to report on sense variation in subcorpora (such as those defined by genre or author); and 3.) mining a deep historical Latin corpus to track rising and falling trends in usage over the span of two thousand years. These different strands of work all naturally dovetail in the domain of language learning, since they provide the foundation for cultivating a view of Latin (and Greek) less as monolithic languages defined by a canonical grammar and more as the sum of individual usage that varies widely across genre, time and space.
  • Harry Diakoff, Schenectady, NY
    The use of corpora and annotated corpora in computer assisted language learning

    While interfaces to enable language learners' direct querying of corpora continue to evolve, little consensus has emerged on their optimal pedagogical use, and many of the best tools currently available require a sophistication that many beginning and even intermediate students lack. Computer applications for second language learning can make systematic use of corpora in ways that provide many of the benefits of corpus-based language learning without requiring the student to master the mechanics of corpus querying, and can offer a stable platform for ICALL research, encouraging the collection of comparable data and collaborative development of such applications. The Alpheios Project provides a case history of an attempt to integrate corpus resources into a customizable and adaptive e-tutor for language learning and formative assessment.

  • Stefanie Dipper, Bochum
    Corpus-based ways to introduce syntax

    Introductory courses in formal syntax are usually based on made-up example sentences and introspective grammaticality judgments. Such approaches face well-known problems. For instance, made-up examples often do not represent the full range of syntactic variation exhibited by natural languages, and intuitions about grammaticality can be misled by the (missing) context.

    In this talk, I would like to present the use of corpora in an introductory syntax course. I will show how corpus searches and corpus frequencies can be exploited to lay the foundations of parts of speech and constituency. In particular, I will illustrate how to use corpus evidence to come up with distribution- and form-based criteria. Corpus evidence comes from the British National Corpus, which is accessed via the interface BNCweb.

  • Kim Gerdes, Paris, LPP, Sorbonne Nouvelle
    Collaborative Dependency Annotation in the Classroom

    Machine learning is beautiful if you have data to learn on. But how to do the first step of annotating enough data to get started? Whether or not you have a parser that does partially what you want, in any case you have to manually create or correct and adapt the analysis. You can pay students, train them, and make them annotate. But then again, you can also use the learning process itself. I will describe how to create well annotated corpora from a very small gold standard: In an introductory syntax class, the students learn categories and syntactic functions on real world examples of French and are asked to practice on the internet platform Vakyartha (http://arborator.ilpga.fr/vakyartha/). The distribution of unannotated examples and gold-standard sentences makes it possible at the same time to evaluate the student and to compute the best analyses of yet unannotated sentences using a rover that gives different weight to the students’ annotation.

  • John S.Y. Lee, Hong Kong
    Introducing an online language learning environment and its corpus of tertiary student writing

    Research has suggested that students learn better in an e-learning environment, and e-learning has been widely implemented in the teaching of English for academic purposes. In this talk, we present a web-based e-learning environment which provides subject teachers and language tutors with a platform of collaboration to improve students’ English writing ability by providing human feedback on the language of the assignments set by their subject teacher.

    The system provides a large number of texts produced by student writers. We introduce a learner corpus developed from the written work collected from the system and report some observations from preliminary studies based on the corpus.

  • Detmar Meurers, Tübingen
    Enhancing Authentic Texts for Language Learners

    Second language acquisition research since the 80s has established that awareness of language categories and forms is important for an adult learner to successfully acquire a foreign language (Lightbown and Spada, 1999). Addressing that need, Sharwood Smith (1993) argued for the use of consciousness raising strategies drawing the learner's attention to specific language properties. He coined the term input enhancement to refer to strategies highlighting the salience of language categories and forms.

    In this talk, we discuss the use of natural language processing (NLP) to provide automatic input enhancement of web pages. The pages can be freely selected by the learners based on their interests and using a regular web browser. Based on a Firefox add-on, the browser can automatically enhance language patterns which are known to be difficult for learners of English, such as determiners and prepositions, phrasal verbs, the distinction between gerunds and to-infinitives, and wh-question formation. The current prototype focuses on learners of English, but the underlying architecture can be used for other languages and we make it freely available.

    One can view such automatic visual input enhancement as an enrichment of Data-Driven Learning (DDL). Where DDL has been characterized by Tim Johns as an "attempt to cut out the middleman [the teacher] as far as possible and to give the learner direct access to the data", in our automatic input enhancement approach the learner stays in control, but the NLP uses 'teacher knowledge' about relevant and difficult language properties to make those more prominent and noticeable for the learner, and to support interaction with the language material.


    Patsy M. Lightbown and Nina Spada. 1999. How languages are learned. Oxford University Press, Oxford.

    Michael Sharwood Smith. 1993. Input enhancement in instructed SLA: Theoretical bases. Studies in Second Language Acquisition, 15:165–179.

  • Julia Richling & Anke Lüdeling, Berlin
    Researching and teaching register differences

    Our talk deals with a method of teaching and researching language variation. Speakers have the choice to express ‘the same thing’ (the variable) in one of many different ways (the variants). It has often been shown that the choice is not random but triggered by social factors (Labov 1966, 2008 etc.), local factors (Anderwald & Szmrecsanyi 2009), text type etc. Language can vary on all linguistic levels; and typically several variables co-vary in a given variety. Douglas Biber and colleagues have shown in many papers (Biber 1995, 2006 etc.) that multi-dimensional analyses are a good method to understand register, where register is defined as functionally motivated variation.

    We use Biber’s methods in teaching register variation. Corpus compilation, definition of variables and variants, annotation and evaluation are performed by the students in a seminar; all students are working on the same experiment. Our research question is: Can we detect register variation even among very similar texts such as newspaper texts from different sections?

    Using Functional Analysis as well as Principal Components Analysis we see that the different newspaper sections show interesting differences.


    Anderwald, Lieselotte & Szmrecsanyi, Benedict (2009) Corpus Linguistics and Dialectology. In: Lüdeling, Anke & Kytö, Merja (eds) Corpus Linguistics. An International Handbook. Mouton de Gruyter, Berlin, 1126-1140.

    Biber, Douglas (1995), Dimensions of Register Variation: A Cross-linguistic Comparison. Cambridge University Press, Cambridge.

    Biber, Douglas (2006), University Language: A Corpus-based Study of Spoken and Written Registers. John Benjamins, Amsterdam.

    Labov, William (1966) The Social Stratification of English in New York City. The Center for Applied Linguistics, Washington (2nd edition 2006, Cambridge University Press, Cambridge).

  • David Smith, Amherst
    Efficient Inference for Declarative Approaches to Language

    Much recent work in natural language processing treats linguistic analysis as an inference problem over graphs. This development opens up useful connections between machine learning, graph theory, and linguistics. In particular, we will see how linguists can declaratively specify linguistic inference problems, in terms of hard and soft constraints on grammatical structures. The first part of the talk formulates syntactic parsing as a graphical model with the novel ingredient of global constraints. Global constraints are propagated by combinatorial optimization algorithms, which greatly improve on collections of local constraints. The second part extends these models for efficient learning of transformations between non-isomorphic structures. These noisy (quasi-synchronous) mappings have applications to adapting parsers across domains, projecting parsers to new languages, learning features of the syntax-semantics interface, and reranking passages for information retrieval.

  • Stefanie Wulff, Denton (University of North Texas)
    Using corpora in SLA research: potential and limitations

    This paper seeks to illustrate the potential of applying corpus-linguistic data and methodology to questions of second language acquisition (L2 acquisition), and to initiate discussion about limitations of corpus linguistics in this field of research.

    Firstly, I will discuss three corpus-based case studies of English L2 acquisition:
    • tense-aspect (Wulff et al. 2009): revisiting the Aspect Hypothesis, this study used corpus data to examine how various features of the input affect tense-aspect morphology. Matched against acquisition data for different TA patterns by adult learners of English, the results suggest that frequency, distinctiveness, and prototypicality jointly drive acquisition.
    • argument-structure constructions (Gries & Wulff 2005): this study combined corpus-linguistic and experimental evidence in favor of the hypothesis that learners of English save argument structure constructions alongside words in their mental lexicon.
    • the genitive alternation (Wulff & Gries 2010): this study presents the first multifactorial account of the genitive alternation in learner English, taking into account factors such as rhythmic alternation, syntactic weight, and activation status.

    These case studies hopefully illustrate both advantages and limitations of corpus data: on the one hand, corpus data are maximally compatible with usage-based frameworks of L2 acquisition in that they provide access to comparatively large amounts of rich, dense, and varied learner data; they facilitate the investigation of complex and multifactorially determined phenomena; and they enable the researcher to uncover trends and patterns in learner production that could otherwise escape notice. On the other hand, these case studies are not without limitations: for instance, no direct evidence can be furnished from corpus data that would speak to how all these different target structures are processed online; also, I will point to more practical limitations of this research in various places that mostly have to do with the limited availability of sufficiently large and ideally annotated corpora. Taking all methodological implications into consideration, I would like to argue in this paper that the observational data available through corpus data are a very valuable research tool that ideally should be employed in combination with experimental research in order to address questions regarding online processing, and to evaluate scenarios for which corpus data make no or conflicting predictions.


    Gries, Stefan Th. and Stefanie Wulff. 2005. Do foreign language learners also have constructions? Evidence from priming, sorting, and corpora. Annual Review of Cognitive Linguistics 3:182-200.
    Wulff, Stefanie, Nick C. Ellis, Ute Römer, Kathleen Bardovi-Harlig and Chelsea LeBlanc. 2009. The acquisition of tense-aspect: Converging evidence from corpora, cognition, and learner constructions. Modern Language Journal 93.3:354-369.
    Wulff, Stefanie and Stefan Th. Gries. 2010. Second language acquisition alternations: the genitive alternation in German ESL. Invited presentation, 19 July 2010, English Language Institute, University of Michigan.

  • Heike Zinsmeister, Konstanz
    Exploiting the 'Annotation Cycle' for Teaching Linguistics

    An annotation process consists of three main parts: (i) the creation of annotation guidelines, (ii) the annotation process itself, and (iii) the evaluation of annotation quality by measuring inter-annotator agreement and creating confusion matrices. It is a cycle because guidelines are to be refined both after encountering data and after the evaluation – and then the annotation process starts all over again. 
    In this talk I will outline how to make use of annotation in teaching linguistics: Creating their own guidelines requires the students to study grammars and linguistic articles in a way that will enable them to put the linguistic definitions into use. The annotation process sharpens their understanding of the phenomenon by having them deal with authentic data that will question the applicability of the analyses. Furthermore, the evaluation will point them to difficulties and divergent interpretations. 

    I will present one showcase of how to employ the annotation cycle in teaching linguistics: different uses of German 'es' ('it'). This touches upon different areas of syntax and lexical semantics including the use of 'es' as referential pronoun, quasi argument and placeholder for extraposed clauses among others. By sketching the implementation in class, I will discuss two different settings: a paper and pencil setting versus using annotation and evaluation software.
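
The evaluation step of the annotation cycle described above (inter-annotator agreement and confusion matrices) can be illustrated with a minimal sketch. It assumes two annotators labelling the same items and uses Cohen's kappa as the agreement measure; the abstract does not say which measure is used in class, so this choice is an assumption. The function names, the label set for 'es' and the example annotations are hypothetical and purely illustrative.

```python
from collections import Counter


def confusion_matrix(labels_a, labels_b):
    """Count how often each label of annotator A is paired with each label of annotator B."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    return Counter(zip(labels_a, labels_b))


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    n = len(labels_a)
    if n == 0:
        raise ValueError("no items to compare")
    matrix = confusion_matrix(labels_a, labels_b)
    # Observed agreement: share of items on which both annotators chose the same label.
    observed = sum(count for (a, b), count in matrix.items() if a == b) / n
    # Expected agreement: chance overlap given each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of six occurrences of German 'es' (invented data).
annotator_a = ["referential", "referential", "quasi-argument",
               "placeholder", "placeholder", "referential"]
annotator_b = ["referential", "quasi-argument", "quasi-argument",
               "placeholder", "referential", "referential"]

if __name__ == "__main__":
    for (a, b), count in sorted(confusion_matrix(annotator_a, annotator_b).items()):
        print(f"A={a:15s} B={b:15s} {count}")
    print(f"kappa = {cohen_kappa(annotator_a, annotator_b):.3f}")
```

Off-diagonal cells of the confusion matrix (for example, items one annotator marked as placeholder and the other as referential) point to the categories whose guideline definitions may need refining before the next round of annotation.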

Venue Description

The Institute for German Language and Linguistics is located at Dorotheenstr. 24.

The building can be entered either from Hegelplatz on the East (main entrance) or through the yard entrance from Universitätsstraße on the West (closer if you're coming from the Friedrichstraße train station).


All talks will be held in one session (no parallel sessions) in room 3.246.
The room is equipped with a projector, whiteboards/markers, power points for laptops and video connections over a VGA cable. For special connectors (e.g. Macs) or foreign power plugs please be sure to bring an appropriate adapter (Germany uses 220V AC). Wireless LAN will be available for all participants. Both rooms and the building itself are accessible for disabled participants. For accommodations in the area see the next section: "Getting around Berlin". If you require additional information about the facilities or available equipment please contact Anke Lüdeling at: Anke.Luedeling@rz.hu-berlin.de

Getting around Berlin

Getting to Berlin

Berlin has two major airports: Tegel and Schönefeld. Both allow access into town, either by bus from Tegel (the TXL bus line goes into the center of town, and other buses are available for other destinations), or by S-Bahn (part of the local subway/rail system in Berlin) from Schönefeld.

It is also possible to arrive by train, in which case the central station (Hauptbahnhof) is only one station away from Friedrichstraße, which is the station closest to the university (see the map under Venue).

Transportation within Berlin

Berlin has an excellent public transport system consisting of under- and overground trains (U-Bahn and S-Bahn), trams and buses. A normal single ride costs € 2.10 for zones AB (not including Schönefeld, which is in zone C). A day ticket for zones AB costs € 6.10, while a 7-day ticket costs € 26.20. These tickets are valid in the U-Bahn, S-Bahn, tram and bus.

You can find more detailed, up to date information on travel in Berlin in English here: http://www.visitberlin.de/english/berlin-infos/e_bi_stadtinfos_nahverkehr.php, including the useful map of all lines. There is also an online trip planner with train and bus times here: http://www.vbb-fahrinfo.de/hafas/query.exe/en.


The university is located in the center of town (in the Mitte area), near Unter den Linden boulevard. There are very many hotels in the area, but here are a few suggestions we've had good experiences with:

  • The 3-star Dietrich-Bonhoeffer-Hotel (http://www.dietrich-bonhoeffer-hotel.de/), situated near the university, is probably the most convenient option.
  • Another 3-star hotel 2 metro stations away from the university is the Hotel Zarenhof (http://www.hotel-zarenhof.de/).
  • The somewhat more expensive 4-star ParkInn hotel at Alexanderplatz (http://parkinnalexanderplatz.berlinhotels.it/) is centrally located and also two metro stations away from the university.
  • The university guest house is very close to the conference venue and prices are very reasonable. However, it has only a few vacancies and serves no breakfast (a kitchenette is available for your own food, though).
  • If you are planning on staying longer in Berlin or generally prefer some space and getting your own breakfast, you might want to look at getting a small "holiday flat" (Ferienwohnung). A very reasonably priced option (ranging between € 50-100 a night) located 1 metro station away from the university is offered by Piepenburg (http://www.piepenburg-verwaltungen.de/seiten/a1000.php). The page is in German, so calling may be the best way to clarify details.