Documentation Version 1.0

Corpus Pipeline

Corpora are collected in several stages:

Obtain facsimile, usually from Google Books
Correct OCR or transcribe text, marking up structure with TEI
Tokenize, part-of-speech tag and lemmatize with TreeTagger
Add corpus-specific manual annotations using MS Excel
Export the merged corpus to persistent formats and the ANNIS search and visualization tool
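
As a rough illustration of the tagging stage, the sketch below reads tagger output into per-token records. It assumes the tagger was run so that each line holds a token, its part-of-speech tag and its lemma separated by tabs; the names TaggedToken and parse_tagger_output and the two sample lines are invented for this example and are not part of the RIDGES tooling.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    """One token with the two automatically assigned layers."""
    tok: str
    pos: str
    lemma: str


def parse_tagger_output(text: str) -> List[TaggedToken]:
    """Parse 'token<TAB>pos<TAB>lemma' lines into TaggedToken records."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(f"expected 3 tab-separated fields, got: {line!r}")
        tokens.append(TaggedToken(*parts))
    return tokens


# Invented sample lines (assumption: STTS tags ART and NN).
sample = "Die\tART\tdie\nWurzel\tNN\tWurzel\n"
tokens = parse_tagger_output(sample)
```

Records of this kind can then be written to a spreadsheet for manual correction and for the later annotation steps.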

Corpus Design

For purposes of comparability, we try to select texts from a single scientific discipline that is ideally represented in a similar fashion throughout the early modern era. For the first RIDGES corpus we selected the domain of herbology (Kräuterkunde). The timespan of interest is divided into 30-year periods, currently with a minimum sample of one text per period. Texts vary somewhat in length, since older texts are more difficult to annotate. Each document is typically between 4,000 and 10,000 tokens long.
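
The division into 30-year periods can be sketched as a simple binning function. The start year 1500 used below is a placeholder assumption and not the documented boundary of the corpus; period_for is a hypothetical helper name.

```python
from typing import Tuple


def period_for(year: int, start: int = 1500, width: int = 30) -> Tuple[int, int]:
    """Return (first_year, last_year) of the period that contains `year`.

    `start` is an assumed first year of the timespan, `width` the period length.
    """
    if year < start:
        raise ValueError(f"{year} lies before the assumed start year {start}")
    index = (year - start) // width
    first = start + index * width
    return first, first + width - 1


print(period_for(1722))  # (1710, 1739) under the assumed start year
```

With one text per period as the minimum sample, a check for uncovered periods amounts to comparing the set of such tuples against the publication years of the selected texts.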

Annotation Layers

The RIDGES corpora follow a multi-layer design. Annotation layers can be roughly divided into four kinds:

Token annotations
TEI metadata
Structural TEI annotations
Corpus-specific annotations

Token Annotations

These annotations always apply to exactly one token. Part-of-speech annotation and lemmatization were carried out with TreeTagger and corrected manually.

PAULA/relANNIS	TEI XML	Description
tok	(plain text)	The diplomatic transcription of the word form as found in the manuscript. Line breaks are marked as in the text, usually with '='.
norm	N/A	A normalized word form based on Modern German orthography. For words not found in Modern German, a modern orthography is assumed (e.g. beſchicht is normalized as beschieht, by analogy with geschieht).
lemma	N/A	The normalized uninflected lexicon entry for each word form, using modern orthography (again, obsolete words are also modernized, e.g. beſchicht has the lemma beschehen, by analogy with geschehen).
pos	N/A	Part-of-speech annotation using the STTS tagset for German.
clean	N/A	Some texts may also have a partially normalized layer with consistent orthography from the relevant period, but not modernized to Modern German orthography. For example, a form with a line break like wor=den will be cleaned to worden but not normalized to modern geworden where this would now be the appropriate form.
hyperlemma	N/A	In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent is given as a hyperlemma (e.g. Heümonat is hyperlemmatized as Juli or ráß as beißend).
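
One way to picture the token layers is as a record with one field per layer. The sketch below is only an illustration: the Token class and the join_linebreak helper are hypothetical, and the tag VVFIN for beſchicht is an assumption rather than a value taken from the corpus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    tok: str                          # diplomatic transcription
    norm: str                         # Modern German orthography
    lemma: str                        # modernized lexicon entry
    pos: str                          # STTS tag
    clean: Optional[str] = None       # only present in some texts
    hyperlemma: Optional[str] = None  # only where modernization is misleading


def join_linebreak(tok: str, marker: str = "=") -> str:
    """Remove the line-break marker from a diplomatic form (simplified)."""
    return tok.replace(marker, "")


# norm and lemma follow the example in the table; pos is an assumed value.
example = Token(tok="beſchicht", norm="beschieht", lemma="beschehen", pos="VVFIN")

print(join_linebreak("wor=den"))  # worden
```

The helper only covers the line-break case of the clean layer; making period orthography consistent would need further rules that are not described here.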

TEI Metadata

These annotations follow the TEI P5 guidelines.

PAULA/relANNIS	TEI XML	Description
meta::author	author	Name of the author (if known).
meta::bibl	bibl	Full bibliographical entry for the source including the page numbers annotated in the corpus.
meta::date	date	Date of publication, usually just the year (e.g. "1722").
meta::publisher	publisher	Publisher of the document (if known).
meta::pubPlace	pubPlace	Publication place of the document.
meta::title	title	Title of the work the document was extracted from.
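
The mapping from TEI header elements to meta:: annotations could be sketched as follows. This is a minimal illustration using Python's standard XML parser; the sample header, its placeholder values and the function name extract_meta are assumptions, and the actual conversion tooling may work differently.

```python
import xml.etree.ElementTree as ET
from typing import Dict

TEI_NS = "http://www.tei-c.org/ns/1.0"
META_ELEMENTS = ("author", "bibl", "date", "publisher", "pubPlace", "title")


def extract_meta(tei_xml: str) -> Dict[str, str]:
    """Collect the first occurrence of each metadata element as meta::<name>."""
    root = ET.fromstring(tei_xml)
    meta = {}
    for name in META_ELEMENTS:
        element = root.find(f".//{{{TEI_NS}}}{name}")
        if element is not None:
            text = "".join(element.itertext()).strip()
            if text:
                meta[f"meta::{name}"] = text
    return meta


# Hypothetical header with placeholder values.
SAMPLE = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt>
      <title>Example Herbal</title>
      <author>Example Author</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Publisher</publisher>
      <pubPlace>Example Place</pubPlace>
      <date>1722</date>
    </publicationStmt>
  </fileDesc>
</teiHeader>"""

meta = extract_meta(SAMPLE)
```

Elements that are missing from the header are simply left out of the result, which matches the "if known" wording in the table.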

TEI Structural Annotations

These annotations follow the TEI P5 guidelines.

PAULA/relANNIS	TEI XML	Description
del	del	Area deleted in original text
unclear	unclear	Unreadable or otherwise unclear text
atLeast	unclear@atLeast	Minimum presumed length of unclear text in characters
atMost	unclear@atMost	Maximum presumed length of unclear text in characters
div1 - div5	div	A subsection of the document. Nesting depth is made explicit by the number after div in the PAULA/relANNIS version
div1_n - div5_n	div@n	A numbered subsection (the n annotation has the section number as a value, though this may also be a letter such as A or a subsection such as 1.1)
div1_type - div5_type	div@type	The type of section or subsection. A section can correspond to the entire "book", a "chapter" or a smaller unit, including systematic types specific to the genre such as "place" (where a certain herb grows), "form" (descriptions of a herb's form) etc.
figure	figure	A graphic embedded in the original document.
figure_rend	figure@rend	Description of the rendering of the figure.
foreign	foreign	A foreign language area.
foreign_rend	foreign@rend	Description of the rendering of the foreign language area (e.g. fonts like Antiqua, italics)
lang	foreign@xml:lang	The language a foreign area is written in (three-letter ISO 639 language codes).
head	head	A heading.
head_n	head@n	The number of a heading.
head_rend	head@rend	Description of the rendering of the heading.
head_type	head@type	Type of heading used, e.g. "margin" for a marginal heading.
hi	hi	Highlighted area.
hi_rend	hi@rend	Description of the rendering of the highlighted area.
lb	lb	Linebreak.
list	list	A list of items.
list_type	list@type	The type of list used.
item	item	Item in a list.
name	name	A proper name (annotated only in some documents).
name_type	name@type	The type of proper name (e.g. "person", "herb").
note	note	A note in the original document (e.g. footnotes).
p	p	A paragraph.
p_n	p@n	The number of a numbered paragraph (this may also be a letter such as A).
p_rend	p@rend	Description of the rendering of the paragraph.
pb	pb	Pagebreak.
pb_ana	pb@ana	Analysis of the pagebreak (e.g. in case of apparently incorrect page numbers).
pb_n	pb@n	The number of the page (if marked explicitly).
pb_rend	pb@rend	Description of the rendering of the page (repeated parts of book or chapter titles, redundant confidence texts).
quote	quote	A quotation (in some documents only).
reason	unclear@reason	Reason for annotation of the current area (usually describes form of unclear areas).
ref	ref	Reference to a footnote.
ref_target	ref@target	ID of the footnote being referred to.
ref_type	ref@type	Type of reference (e.g. a TEI "noteAnchor").
w	w	A word annotated with additional attributes.
xml_id	fZ (Z is a number)	ID given to a footnote.
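
The naming pattern in the table above can be summarized in a small function: elements keep their name, attributes are appended with an underscore, div carries its nesting depth, and a few attributes of unclear and foreign get short names of their own. The sketch below is derived from the table only; layer_name and SPECIAL are hypothetical names, and the xml_id row is not covered.

```python
from typing import Optional

# Attribute layers whose name does not follow the element_attribute pattern.
SPECIAL = {
    ("unclear", "atLeast"): "atLeast",
    ("unclear", "atMost"): "atMost",
    ("unclear", "reason"): "reason",
    ("foreign", "xml:lang"): "lang",
}


def layer_name(element: str, attribute: Optional[str] = None,
               div_depth: Optional[int] = None) -> str:
    """Return the PAULA/relANNIS layer name for a TEI element or attribute."""
    if (element, attribute) in SPECIAL:
        return SPECIAL[(element, attribute)]
    base = element
    if element == "div":
        # Nesting depth is made explicit in the layer name (div1 to div5).
        if div_depth is None or not 1 <= div_depth <= 5:
            raise ValueError("div needs a nesting depth between 1 and 5")
        base = f"div{div_depth}"
    return base if attribute is None else f"{base}_{attribute}"


print(layer_name("div", "type", div_depth=2))  # div2_type
print(layer_name("foreign", "xml:lang"))       # lang
print(layer_name("pb", "n"))                   # pb_n
```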

Corpus-Specific Annotations

These annotations were developed by our students to annotate spans of tokens with properties of special interest.

PAULA/relANNIS	TEI XML	Description
definition	N/A	A definition.
term	N/A	A technical term.
property	N/A	Describes a reference to properties of a herb such as effect, smell etc.
reader_ref	N/A	References made by authors to the reader. Values indicate the grammatical type of the reference, e.g. "pron2sg" for second person singular pronoun.
author_ref	N/A	References made by authors to themselves. Values indicate the grammatical type of the reference, e.g. "pron1pl" for first person plural pronoun.
uncertain	N/A	Annotator uncertain of lemma and/or normalization since no equivalent could be established.
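
Because these annotations cover spans of tokens rather than single tokens, they can be thought of as labeled token ranges. The following sketch assumes a simple in-memory representation; Span and count_values are hypothetical names, the token indices are invented, and only the values "pron2sg" and "pron1pl" come from the table above.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Span(NamedTuple):
    layer: str   # e.g. "reader_ref", "author_ref", "term"
    start: int   # index of the first token covered
    end: int     # index of the last token covered (inclusive)
    value: str   # annotation value, e.g. the grammatical type


def count_values(spans: Iterable[Span], layer: str) -> Counter:
    """Count how often each value occurs on the given layer."""
    return Counter(span.value for span in spans if span.layer == layer)


# Invented spans for illustration.
spans = [
    Span("reader_ref", 3, 3, "pron2sg"),
    Span("reader_ref", 17, 17, "pron2sg"),
    Span("author_ref", 8, 8, "pron1pl"),
]

reader_counts = count_values(spans, "reader_ref")
```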

Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology