Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology

Dr. phil. Anna Shadrova

Postdoc, PI (corpus linguistics, quantitative analysis of small and deeply annotated corpora, graph-based modeling of linguistic data, linguistic path dependence, usage-based perspectives of the lexicon, the lexicon in multilingual and register-diverse settings, intra- and inter-speaker variability, structure of the mental lexicon, text as process, SLA/L2 acquisition, multilingualism, heritage languages, lexicosyntax, variation)

Research Interests

Together with several other researchers from our department I am currently working on a number of projects that have to do with a usage-based perspective of the lexicon in task-based, deeply annotated and small to mid-size corpora. 

We are especially interested in interfaces with morphology (productivity in word formation, lexicalization, German complex verbs and nouns in use), lexicosyntax (native-like selection/coselectional constraint), discourse structure and lexical semantics in the context of multilingualism (L1, L2, heritage languages) and diversity of task and register.

Tracing the lexicon in use is notoriously difficult due to its high inter- and intra-speaker variability and its extreme combinatorial potential. We therefore use approaches that allow for quantification while keeping the whole linguistic depth of the data intact and available, such as graph-based modeling and time-series modeling of text as process. Two prominent and underresearched concepts that we have recently been focusing on in our projects are path dependence (how is production shaped by what has been said or written before?) and the structure of the mental lexicon in L1 and L2 (what can the path taken through the system reveal about its structure?).

Our goal is to model language data as realistically as possible from a linguistic perspective before we proceed to quantification and comparison. This is to acknowledge that linguistic concepts are frequently fuzzy-edged, multi-layered and highly intertwined. On top of that, linguistic data is frequently highly ambiguous, task-specific and never truly random or neutral with respect to communicative intent or situation. We therefore place high value on epistemological clarity and sampling methods for the internal validation of our results, as well as best practices around open data/open science, reproducibility and replicability.



I am grateful and humbled to be the recipient of this year's Outstanding Doctoral Thesis Award by the German Society for Computational Linguistics (GSCL Promotionspreis) for my dissertation "Measuring coselectional constraint in learner corpora: A graph-based approach". Thanks again to the scientific committee of the GSCL for considering my contribution and for giving me the honor of presenting my work at KONVENS 2022 in Potsdam.

Current Projects


RUEG - Research Unit Emerging Grammars: Project Corpus-linguistic methods

In RUEG, we are interested in emerging patterns and variation in the multilingual setup of heritage speakers of Turkish, Russian, and Greek in Germany and German in the US, as well as their monolingual counterparts in Russia, Greece, Turkey, and the US. RUEG is concerned with various aspects of grammar and the lexicon.

The research in RUEG is based on multifaceted corpus data representing formal and informal settings in written and spoken mode, and the majority and heritage language of the bilingual speakers. Additionally, we are looking at potential differences by gender and age of participants, as well as variable degrees of heritage language vitality in the majority community of each speaker group.

This complexity requires adequate technical representation as well as methodological research and guidance, which is what we provide through Pc. Our research focus lies on the development of quantitative methods for small to medium-sized corpora, such as the employment of graph metrics and network analysis in core-linguistic research, Bayesian vs. complex frequentist statistics (mixed-effect modeling in particular), and the application of machine learning techniques for the advancement of knowledge and information retrieval through introspection, as well as the optimization of these methods for smaller data.


CRC Register (associated)

The CRC Register: Language Users’ Knowledge of Situational-Functional Variation investigates aspects of the register knowledge of the speakers of a language. Competent speakers can adapt their linguistic behavior on every level in response to the current situation: They know, for example, that the German word sauer ‘ticked off’ is appropriate in different situations than the word verärgert ‘angry’, that one uses less complex sentences when speaking with children than in an academic function, and that sometimes it matters whether one says around 8 o’clock or 7:49 am, and sometimes it doesn’t. We are thus concerned with intraindividual variation.

Project C04 (which I am associated with) investigates the acquisition of register competence in an L2 based on various spoken and written learner corpora. If register knowledge is largely acquired implicitly through language experience, we can assume that even advanced learners of a second language might possess lower register competence compared to L1 speakers. It follows that they would choose different expressions from a smaller set of alternatives and that they would overall show smaller differences between communicative situations. C04 views this through the lens of the phenomenon of modification.

Epistemology and research paradigms in corpus linguistics: Reading and discussion group

From an epistemological perspective, linguistics is a challenging field. Language is a multi-layered, highly situationally variable, extremely context-dependent and highly ambiguous phenomenon. As corpus linguists, we frequently need to classify every single instance of our language data, leading us to face this hyperspace of overlaying classification in particularly challenging ways. In our reading and discussion group, we aim to untangle some of those ways and relate them to epistemological debates and research paradigms that favor -- or enforce -- one interpretation over another. We have a mailing list for our monthly to quarterly meetings. Feel free to message me if you would like to sign up.


Contact Information

Dorotheenstraße 24
room 3.333
10117 Berlin - Mitte
Tel.: 030 2093 9774
anna [dot] shadrova [ät] hu-berlin [dot] de
mailing address:
c/o Institut für deutsche Sprache und Linguistik
Sprach- und Literaturwissenschaftliche Fakultät
Humboldt-Universität zu Berlin
Unter den Linden 6
D-10099 Berlin


Research output

Peer-reviewed journal papers

Shadrova, A. (2021): Topic models do not model topics: epistemological remarks and steps towards best practices. Journal of Data Mining and Digital Humanities 2021, https://doi.org/10.46298/jdmdh.7595, Source : oai:HAL:hal-03261599v3

Shadrova, A., Linscheid, P., Lukassek, J., Lüdeling, A., & Schneider, S. (2021). A Challenge for Contrastive L1/L2 Corpus Studies: Large Inter- and Intra-Individual Variation Across Morphological, but Not Global Syntactic Categories in Task-Based Corpus Data of a Homogeneous L1 German Group. Frontiers in Psychology, 12, 5267. doi:10.3389/fpsyg.2021.716485

Ighreiz, A., C. Möllers, L. Rolfes, A. Shadrova & A. Tischbirek (2020): Karlsruher Kanones: Selbst- und Fremdkanonisierung der Rechtsprechung des Bundesverfassungsgerichts. Archiv des öffentlichen Rechts Jahrgang 145 (2020) / Heft 4, S. 537-613 (77), https://doi.org/10.1628/aoer-2020-0026

Lüdeling, A.; Hirschmann, H. & Shadrova, A. (2017) Linguistic Models, Acquisition Theories, and Learner Corpora: Morphological Productivity in SLA Research Exemplified by Complex Verbs in German. Language Learning Special Issue on Language learning research at the intersection of experimental, corpus-based and computational methods: Evidence and interpretation 67 (S1), 96-129.


Invited chapters & peer-reviewed proceedings

Shadrova, A. (2022): It may be in the structure, not the combinations: Graph metrics as an alternative to statistical measures in corpus-linguistic research. In: Tara Andrews, Franziska Diehr, Thomas Efer, Andreas Kuczera and Joris van Zundert (eds.): Graph Technologies in the Humanities - Proceedings 2020, published at http://ceur-ws.org/Vol-3110, p. 245-278.

Lüdeling, A., Hirschmann, H., Shadrova, A. & Wan, S. (2021): Tiefe Analyse von Lernerkorpora. In H. Lobin, A. Witt & A. Wöllstein (Eds.), Deutsch in Europa (pp. 235-284). Berlin, Boston: De Gruyter. https://doi.org/10.1515/9783110731514-013

Thomas, E. M., Cantone, K. F., Davies, S., & Shadrova, A. (2014). Cross-linguistic influence and patterns of acquisition: The emergence of gender and word order in German-Welsh bilinguals. In: E. M. Thomas and I. Mennen (Eds.): Advances in the Study of Bilingualism, p. 41-62. Clevedon: Multilingual Matters.


Shadrova, A. (2020): Measuring coselectional constraint in learner corpora: A graph-based approach. Univ.-Dissertation: Humboldt-Universität zu Berlin, http://dx.doi.org/10.18452/21606.

Shadrova, A. (2013): Mehr Chunks! – Entwicklungsperspektiven für die Konstruktionsgrammatik unter Einbeziehung von Phraseologie, Psycholinguistik und L2-Erwerbsforschung. Masterarbeit, Humboldt-Universität zu Berlin, http://dx.doi.org/10.18452/14224


[Corpus and annotation guidelines] Shadrova, A. (2021). Kobalt: Extension Corpus and Annotation Guidelines for Verb Classification and Dependency Adjustments (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5730224

[Data from analysis] Shadrova, A., Linscheid, P., Lüdeling, A., Lukassek, J., & Schneider, S. (2021). Additional Data to "A Challenge for Contrastive L1/L2 Corpus Studies" [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4752308

[Corpus, scripts and data from analysis] Shadrova, A. (2020): Extended Kobalt-DaF corpus, scripts for pre-processing and analysis, extracted lexicosyntactic graphs (JSON), and R-plots from PhD thesis and beyond: https://doi.org/10.5281/zenodo.3584091

[Corpus] Möllers, C., A. Shadrova & L. Wendel (2021): BVerfGE-Korpus 1.0. Mit freundlicher Unterstützung des Mohr-Siebeck-Verlags. https://doi.org/10.5281/zenodo.4551408

[Data from analysis] Ighreiz, A., C. Möllers, L. Rolfes, A. Shadrova & A. Tischbirek (2021): Karlsruher Kanones? Netzwerke, Tabellen und Analyseplots. https://doi.org/10.5281/zenodo.4464810


Klotz, M., A. Lüdeling & A. Shadrova (2021): Contrastive Corpus Methodology for Language Modeling and Analysis. DGfS Kurz-AG, 43rd Annual Conference of the German Linguistic Society (DGfS): Modell und Evidenz / Model and Evidence, University of Freiburg, Germany, February 23-26, 2021.

Krause, T. & A. Shadrova (2016) Korpus III: Einführung in die Annis-API mit Python. Linguistischer Methodenworkshop 2016, Institut für deutsche Sprache und Linguistik. Humboldt-Universität zu Berlin, 23.02.2016.

Shadrova, A. & T. Krause (2016) Korpus II: Frequenzanalyse, Dependenzen, Metadatensuche mit Annis. Linguistischer Methodenworkshop 2016, Institut für deutsche Sprache und Linguistik. Humboldt-Universität zu Berlin, 23.02.2016.


[Conference talk] A. Shadrova (2022): Lexical similarity in L1 and L2 German as evidence for the structure and dynamics of the lexicon (work in progress). Learner Corpus Research, Padua. 23.09.2022.

[Keynote] A. Lüdeling, J. Lukassek, A. Shadrova (2022): Variability in Grammatical Categories and Structures: The Case of Word Formation. Grammar and Corpora, Ghent. 02.07.2022.

[Talk] Shadrova, A. (2022): Problemkind der Korpuslinguistik: Das Lexikon in Struktur, Gebrauch und Analyse. Korpuslinguistisches Kolloquium, HU Berlin, 25.05.2022. Slides.

[Conference talk] J. Lukassek, A. Lüdeling, A. Shadrova, S. Wan (2022): Complex nouns as markers of academic register in L1- and L2-authored essays. Workshop "Word Formation and Discourse Structure", Leipzig, 06.05.2022.

[Conference talk] Lüdeling, A., Lukassek, J., & Shadrova, A. (2022): Variation and productivity in German L1 and L2 nominal word-formation. 44th Conference of the German Linguistic Society (DGfS; AG8), 25.02.2022. [online]

[Conference talk] Shadrova, A., M. Klotz & A. Lüdeling (2021): Linguistic Modeling and Analysis. Opening talk for DGfS Kurz-AG Contrastive Corpus Methodology for Language Modeling and Analysis.

[Public defense] Shadrova, A. (2020): Interlanguage-Effekte in L1 und L2: Eine graphbasierte lexikosyntaktische Betrachtung anhand geschriebener Korpusdaten aus Falko und RUEG, HU Berlin, 10.07.2020.

[Talk] Shadrova, A. (2020): No free lunch: Ob und wie Topic Modeling und andere probabilistische Informationsextraktionsverfahren zum Erkenntnisgewinn genutzt werden können. Korpuslinguistisches Kolloquium, HU Berlin, 08.07.2020.

[Conference talk] Shadrova, A. (2020): Graph metrics as an alternative to statistical measures in linguistic research. Graph Technologies in the Digital Humanities 2020, Vienna, 21.02.2020.

[Talk] Shadrova, A. (2020): Korpuslinguistische Modellierung juristischer Fragen in einem Korpus von BVerfG-Entscheidungen. Korpuslinguistisches Kolloquium, HU Berlin, 22.01.2020.

[Talk] Shadrova, A. (2019): Individuelle Varianz und Textlängeneffekte: Wie geht Sampling in Lernerkorpora? Korpuslinguistisches Kolloquium, HU Berlin, 05.06.2019.

[Talk] Lüdeling, A. & A. Shadrova (2020): Forschungsfragen, Modelle, Auswertung. Möglichkeiten und Grenzen der korpusgestützten Textanalyse. Workshop "Methoden quantitativer Textanalyse", Berlin, 21.11.2019.

[Talk] Tischbirek, A. & A. Shadrova (2020): Karlsruher Kanones? Selbst- und Fremdkanonisierung der Rechtsprechung des BVerfG. Workshop "Methoden quantitativer Textanalyse", Berlin, 21.11.2019.

[Conference talk] Shadrova, A. (2019): U-shaped learning of verb argument coselection in learners of German. Learner Corpus Research 2019, Warsaw, 13.09.2019.

[Talk] Shadrova, A. (2018): Lernerkorpora: Mehrebenenannotation und Zielhypothesen als Such- und Analysewerkzeug. Workshop "Von Lernerdaten zu Lernerkorpora", Schloss Rauischholzhausen, 12.07.2018.

[Talk] Shadrova, A. (2017): Korpuslinguistische Kollokationsanalyse als Trendscout-Analyse zum Förderprogramm „Industrielle Gemeinschaftsforschung – IGF“. Talk at the IGF working meeting at BWMI, 04.10.17.

[Talk] Shadrova, A. (2017): Lexikalische Assoziationsmaße und Idiomatizität: Eine Problemskizze anhand von Lernerdaten aus dem Kobalt-Korpus. Korpuslinguistisches Kolloquium, HU Berlin, 24.05.2017.

[Conference talk] Shadrova, A. (2015): Learners know their German: Statistical similarities of surface features in German L1 and L2 essays. International Symposium on Bilingualism 10, 24.05.2015.

[Talk] Shadrova, A. & A. Lüdeling (2015): Individuelle Differenzen in Lernerdaten. INDUS network meeting, University of Duisburg-Essen.

[Talk] Shadrova, A. (2014): Kobalt-E: Erste Ergebnisse. Netzwerk Kobalt-DaF, working meeting in Tübingen, 04.11.14.




winter 17/18

Models of Grammatical Description
Erasmus students and students from similar programs

(Seminar: Modelle grammatischer Beschreibung)

Methods in Linguistics
Erasmus students and students from similar programs

(Übung: Methoden der Linguistik)

Learner German and Hood German
B.A. German Studies/Germanic Linguistics

(Seminar: Lernerdeutsch und Kiezdeutsch)

summer 17

Grammar of German
B.A. German Studies/Germanic Linguistics/Historical Linguistics

(Übung Deutsche Grammatik)

summer 16

Intro to Natural Language Processing with Python
B.A. German Studies/Germanic Linguistics/Historical Linguistics; M.A. Linguistics

winter 15/16

Intro to Linguistics
B.A. German Studies/Germanic Linguistics/Historical Linguistics

winter 14/15

Grammatical and Textual Regularities of Internet Language
B.A. German Studies/Germanic Linguistics

(Grammatische und textbezogene Regularitäten der Internetsprache, Modul "Text und Diskurs I")

summer 14

Grammar of German
B.A. German Studies/Germanic Linguistics/Historical Linguistics

(Übung Deutsche Grammatik)

winter 13/14

Grammar of German
B.A. German Studies/Germanic Linguistics/Historical Linguistics

(Übung Deutsche Grammatik)


Previous projects

Contrastive corpus methodology and language modeling and analysis

Workshop at the 43rd annual meeting of the German Linguistic Society in Freiburg, February 24-26, 2021. With Martin Klotz and Anke Lüdeling. Details and presentations: https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/events/kurz-ag-msc

Leibniz project on linguistic developments in German Federal Constitutional Court decisions

From 2018 to 2021 I was part of Prof. Dr. Christoph Möllers' Leibniz project at the Humboldt University Faculty of Law, analyzing linguistic developments in German Federal Constitutional Court decisions based on a longitudinal corpus reaching back to the beginning of the GFCC in 1951.

Current work includes the modeling of complex corpus data in a graph-based corpus architecture (text as graph); the development of an epistemologically well-rooted employment of topic modeling in text-based research; an analysis of thematic distributions by types of proceeding in the jurisdiction of the Court; canonization and citation practice of the Court.

Other relevant topics include NLP, data modeling, quantitative linguistics, stylometry, pattern recognition, network analysis, information retrieval in formalized language, formalization as a linguistic property, and linguistic formalization of formalized language on syntactic and semantic levels.


Wendel, L., Shadrova, A., & Tischbirek, A. (2022). From Modeled Topics to Areas of Law: A Comparative Analysis of Types of Proceedings in the German Federal Constitutional Court. German Law Journal, 23(4), 493-531. doi:10.1017/glj.2022.39


Shadrova, A. (2021): Topic models do not model topics: epistemological remarks and steps towards best practices. Journal of Data Mining and Digital Humanities 2021, https://doi.org/10.46298/jdmdh.7595, Source : oai:HAL:hal-03261599v3

Ighreiz, A., C. Möllers, L. Rolfes, A. Shadrova & A. Tischbirek (2020): Karlsruher Kanones: Selbst- und Fremdkanonisierung der Rechtsprechung des Bundesverfassungsgerichts. Archiv des öffentlichen Rechts Jahrgang 145 (2020) / Heft 4, S. 537-613 (77), https://doi.org/10.1628/aoer-2020-0026



In my dissertation "Measuring coselectional constraint in learner corpora: A graph-based approach" (http://edoc.hu-berlin.de/18452/22356) I investigated the structural development of coselectional constraint (~collocation, idiomaticity, the idiom principle) in the use of verb-argument structures in learners at different stages of acquisition. The study is based on essays written by L1-Chinese and L1-Belarusian/Russian learners of German collected by Netzwerk Kobalt-DaF.

The research question was whether it is possible to measure the nativelikeness of coselectional constraint in small to medium-sized corpora and whether there is a process of restructuring with an increase in coselectional constraint with increasing target language ability; and whether there is an intermittent decrease of coselectional constraint at intermediate stages, i.e. a u-shaped learning development.

I analyzed the data in a graph-based approach making use of Louvain modularity (Blondel et al. 2008). An increase in modularity is observable in both learner groups, but a u-shaped development was only found in Belarusian learners. This is discussed from typological, cultural, and cognitive perspectives.
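As an illustration of the measure behind this analysis, the following minimal sketch computes the modularity score Q that the Louvain method tries to maximize. It only scores a partition that is given by hand; it does not perform the Louvain optimization itself, and it is not the code used in the thesis. The verb-noun pairs, the variable names and the function `modularity` are hypothetical examples, not data from the Kobalt corpus.

```python
from collections import defaultdict

def modularity(edges, communities):
    """Newman modularity Q of an undirected, unweighted graph for a given partition."""
    m = len(edges)
    community_of = {node: i for i, nodes in enumerate(communities) for node in nodes}
    internal = defaultdict(int)    # edges with both endpoints in the same community
    degree_sum = defaultdict(int)  # summed node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over communities of (share of internal edges - expected share)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Hypothetical toy graph: verbs linked to the nouns they co-occur with as objects.
EDGES = [
    ("treffen", "Entscheidung"), ("treffen", "Urteil"),
    ("fällen", "Entscheidung"), ("fällen", "Urteil"),
    ("spielen", "Musik"), ("spielen", "Lied"),
    ("hören", "Musik"), ("hören", "Lied"),
    ("hören", "Urteil"),  # single edge bridging the two clusters
]
TWO_CLUSTERS = [
    {"treffen", "fällen", "Entscheidung", "Urteil"},
    {"spielen", "hören", "Musik", "Lied"},
]
ONE_CLUSTER = [set().union(*TWO_CLUSTERS)]

print(round(modularity(EDGES, TWO_CLUSTERS), 3))  # 0.389 (exactly 7/18)
print(round(modularity(EDGES, ONE_CLUSTER), 3))   # 0.0
```

In this toy setting, a higher Q for the two-cluster partition reflects that verbs and nouns combine mostly within their own group, which is the kind of structure a more constrained coselection would be expected to produce.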

The thesis further discusses the lack of theoretical embedding of coselection in usage-based linguistics, the low explanatory power of the much-presumed "phraseological continuum", and the inadequacy of statistical measures of lexical association for the evaluation of coselectional constraint in corpora from a linguistic and a more mathematical perspective. It also makes suggestions for the incorporation of graph-based methods in lexical and lexicosyntactic research.

Supervised by Prof. Dr. Anke Lüdeling and Prof. Dr. Amir Zeldes (Georgetown University, Washington, D.C.), defended summa cum laude (10.7.20). Graciously funded through a BMBF scholarship granted by the Hans Böckler Foundation (2014-2018) and a Research Track scholarship granted by the Humboldt Graduate School (2013).

Keywords: Corpus linguistics, second language acquisition, formalization of usage-based linguistics, methodology of quantitative linguistics in small and medium-sized corpora, graph-based corpus methods, validation


Blondel, Vincent D; Guillaume, Jean-Loup; Lambiotte, Renaud; Lefebvre, Etienne (9 October 2008). "Fast unfolding of communities in large networks". Journal of Statistical Mechanics: Theory and Experiment. 2008 (10): P10008. arXiv:0803.0476.

DALeKo - Dokumentation und Analyse von Lernersprache

In this project run by the Arbeitskreis Fremdsprachendidaktik (working group on foreign language teaching) of English, Romance and Slavic Studies at the Humboldt University, a corpus of student-written essays in four school-taught languages (English, French, Russian and Spanish) was compiled. At this point, Russian texts elicited in school and university contexts are available with POS and lemma annotations through the Annis³ search engine and interface. Due to legal restrictions, the data is only available after registration. Please contact Prof. Dr. Anka Bergmann for further information.

INDUS Research Group on Individualized Language Learning and Approaches from Language Technology


Learn more here.