Pc Corpus Linguistic Methods

Anke Lüdeling, Berlin | Anna Shadrova, Berlin

Project Pc is both an infrastructure and a research project within RUEG2. It is the successor to project Pd in RUEG1. On the infrastructure and support side, it will continuously integrate new or corrected annotations, handle data curation and sustainability, and provide technical support and research engineering, i.e. the improvement of automatic and semi-automatic annotation and, more generally, the development of tools and pipelines for information retrieval/text mining and quantitative analysis. It will also support and advise projects P8-P11 in RUEG2 in the choice and application of quantitative research methods.

On the research side, it aims to advance the field of corpus linguistics in two ways:

(1) through an evaluation of advanced machine learning techniques and of the feasibility and usefulness of applying them to automatic and semi-automatic annotation and information retrieval in non-standard corpora of limited size;

(2) through a focus on the development, validation, evaluation, and epistemological embedding of methods for the RUEG corpus specifically, as well as for small and mid-sized corpora in general.

While machine learning is a well-researched area of computer science and computational linguistics, its application to core linguistic research questions is still a young field that requires more experimental and exploratory work. The same is true of the systematisation of quantitative methods. This part of Pc's research agenda is basic research, with the potential and the uncertainties that this kind of venture tends to hold.


Cooperation Partners

Aurelie Herbelot (U Trento), Georg Rehm (DFKI Berlin)

PhD Student

Martin Klotz


Student Assistants

Gaja Hartz, Esra Uyanık