Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology

Documentation Version 1.0

Corpus Pipeline

Corpora are collected in several stages:

  1. Obtain facsimile, usually from Google Books
  2. Correct OCR or transcribe text, marking up structure with TEI
  3. Tokenize, part-of-speech tag and lemmatize with TreeTagger
  4. Add corpus-specific manual annotations using MS Excel
  5. Export the merged corpus to persistent formats and to the ANNIS search and visualization tool
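
As an illustration of how the output of step 3 might be consumed before manual correction, the following minimal sketch reads tagger output into token records. It assumes the three-column, tab-separated output (token, tag, lemma) that TreeTagger produces when token and lemma output are enabled. The names `TaggedToken` and `parse_treetagger` and the sample lines are invented for this example and are not part of the RIDGES tooling.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    tok: str    # word form as passed to the tagger
    pos: str    # part-of-speech tag (STTS for German)
    lemma: str  # lemma proposed by the tagger


def parse_treetagger(output: str) -> List[TaggedToken]:
    """Parse tab-separated token/tag/lemma lines into records."""
    tokens = []
    for line in output.splitlines():
        parts = line.split("\t")
        # Skip blank lines and anything that is not a three-column row
        # (for example markup lines passed through unchanged).
        if len(parts) != 3:
            continue
        tokens.append(TaggedToken(*parts))
    return tokens


# Invented sample, not taken from the corpus.
sample = "Die\tART\tdie\nWurzel\tNN\tWurzel\n<lb/>\nist\tVAFIN\tsein\n"
tagged = parse_treetagger(sample)
```

Records of this kind could then be laid out one token per row for the manual correction and annotation in step 4.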

Corpus Design

For purposes of comparability, we try to select texts from a single scientific discipline that is ideally represented in a similar fashion throughout the early modern era. For the first RIDGES corpus we have selected the domain of herbology (Kräuterkunde). The timespan of interest has been divided into 30-year periods, currently with a minimum sample of one text per period. Texts vary somewhat in length, since older texts are more difficult to annotate. Each document is typically between 4000 and 10000 tokens long.
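
The period scheme can be sketched as follows. The start and end years of the timespan are not stated here, so they are explicit parameters, and the values used in the example call are hypothetical. The function names `period_for` and `uncovered_periods` are likewise invented for illustration.

```python
from typing import Iterable, List, Tuple

Period = Tuple[int, int]  # (first year, last year), both inclusive


def period_for(year: int, start: int, span: int = 30) -> Period:
    """Return the period of `span` years that contains `year`."""
    if year < start:
        raise ValueError("year lies before the start of the timespan")
    first = start + ((year - start) // span) * span
    return (first, first + span - 1)


def uncovered_periods(text_years: Iterable[int], start: int, end: int,
                      span: int = 30) -> List[Period]:
    """List the periods between `start` and `end` that have no text yet."""
    covered = {period_for(y, start, span) for y in text_years}
    all_periods = [(s, s + span - 1) for s in range(start, end + 1, span)]
    return [p for p in all_periods if p not in covered]


# Hypothetical timespan starting in 1500; 1722 is the example year
# used for the date metadata elsewhere in this documentation.
example = period_for(1722, start=1500)
```

A check like `uncovered_periods` is one way to see which periods still lack the minimum of one text.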

Annotation Layers

The RIDGES corpora follow a multi-layer design. Annotation layers can be roughly divided into four kinds:

  1. Token Annotations
  2. TEI metadata
  3. Structural TEI annotations
  4. Corpus-specific annotations

Token Annotations

These annotations always apply to exactly one token. Part-of-speech annotation and lemmatization were carried out with TreeTagger and corrected manually.

PAULA/relANNIS | TEI XML | Description
tok | (plain text) | The diplomatic transcription of the word form as found in the manuscript. Line breaks are marked as in the text, usually as '='.
norm | N/A | A normalized word form based on Modern German orthography. For words not found in Modern German, a modern orthography is assumed (e.g. beſchicht is normalized as beschieht, by analogy with geschieht).
lemma | N/A | The normalized uninflected lexicon entry for each word form, using modern orthography (again, obsolete words are also modernized, e.g. beſchicht has the lemma beschehen, by analogy with geschehen).
pos | N/A | Part-of-speech annotation using the STTS tagset for German.
clean | N/A | Some texts may also have a partially normalized layer with consistent orthography from the relevant period, but not modernized to Modern German orthography. For example, a form with a line break like wor=den will be cleaned to worden but not normalized to modern geworden where this would now be the appropriate form.
hyperlemma | N/A | In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent is given as a hyperlemma (e.g. Heümonat is hyperlemmatized as Juli, or ráß as beißend).
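
To make the relationship between the token layers concrete, here is a minimal sketch of a token record that carries all layers side by side. The `Token` class and the `join_line_break` helper are illustrative names, not the corpus tooling. The helper covers only the line-break part of the clean layer, and the tag VVFIN in the example is an assumed value rather than one quoted from the corpus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    tok: str                          # diplomatic transcription
    norm: str                         # Modern German orthography
    lemma: str                        # modernized lexicon entry
    pos: str                          # STTS tag
    clean: Optional[str] = None       # only present in some texts
    hyperlemma: Optional[str] = None  # only where norm would mislead


def join_line_break(tok: str, marker: str = "=") -> str:
    """Remove the line-break marker from a diplomatic word form."""
    return tok.replace(marker, "")


# Forms taken from the table above; the POS tag is an assumption.
example = Token(tok="beſchicht", norm="beschieht",
                lemma="beschehen", pos="VVFIN")
cleaned = join_line_break("wor=den")
```

Keeping the optional layers as `None` by default mirrors the fact that clean and hyperlemma are not annotated for every token.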

TEI Metadata

These annotations follow the TEI P5 guidelines.

PAULA/relANNIS | TEI XML | Description
meta::author | author | Name of the author (if known).
meta::bibl | bibl | Full bibliographical entry for the source including the page numbers annotated in the corpus.
meta::date | date | Date of publication, usually just the year (e.g. "1722").
meta::publisher | publisher | Publisher of the document (if known).
meta::pubPlace | pubPlace | Publication place of the document.
meta::title | title | Title of the work the document was extracted from.
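
The correspondence between TEI header elements and the meta:: names can be sketched with the Python standard library. This is only an illustration under stated assumptions: the placement of the elements inside a single bibl within sourceDesc, the sample values, and the function name `extract_meta` are all hypothetical and do not describe the actual RIDGES converter.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
FIELDS = ("author", "bibl", "date", "publisher", "pubPlace", "title")


def extract_meta(tei_xml: str) -> dict:
    """Map the first matching TEI header element to each meta:: name."""
    root = ET.fromstring(tei_xml)
    header = root.find(f".//{{{TEI}}}teiHeader")
    if header is None:
        return {}
    meta = {}
    for field in FIELDS:
        el = next(header.iter(f"{{{TEI}}}{field}"), None)
        if el is None:
            continue  # optional fields such as publisher may be absent
        # Collapse whitespace across nested text.
        text = " ".join("".join(el.itertext()).split())
        if text:
            meta[f"meta::{field}"] = text
    return meta


# Hypothetical header with placeholder values.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc>
    <bibl>
      <author>Example Author</author>
      <title>Example Herbal</title>
      <pubPlace>Exampletown</pubPlace>
      <date>1722</date>
    </bibl>
  </sourceDesc></fileDesc></teiHeader>
</TEI>"""

meta = extract_meta(SAMPLE)
```

Fields that are missing from the header are simply left out of the result, which matches the "(if known)" wording for author and publisher.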

TEI Structural Annotations

These annotations follow the TEI P5 guidelines.

PAULA/relANNIS | TEI XML | Description
del | del | Area deleted in original text.
unclear | unclear | Unreadable or otherwise unclear text.
atLeast | unclear@atLeast | Minimum presumed length of unclear text in characters.
atMost | unclear@atMost | Maximum presumed length of unclear text in characters.
div1 - div5 | div | A subsection of the document. Nesting depth is made explicit by the number after div in the PAULA/relANNIS version.
div1_n - div5_n | div@n | A numbered subsection (the n annotation has the section number as a value, though this may also be a letter such as A or a subsection such as 1.1).
div1_type - div5_type | div@type | The type of section or subsection. A section can correspond to the entire "book", a "chapter" or smaller sections, including systematic types specific to the genre such as "place" (where a certain herb grows), "form" (descriptions of a herb's form) etc.
figure | figure | A graphic embedded in the original document.
figure_rend | figure@rend | Description of the rendering of the figure.
foreign | foreign | A foreign language area.
foreign_rend | foreign@rend | Description of the rendering of the foreign language area (e.g. fonts like Antiqua, italics).
lang | foreign@xml:lang | The language a foreign area is written in (three-letter ISO 639 language codes).
head | head | A heading.
head_n | head@n | The number of a heading.
head_rend | head@rend | Description of the rendering of the heading.
head_type | head@type | Type of heading used, e.g. "margin" for a marginal heading.
hi | hi | Highlighted area.
hi_rend | hi@rend | Description of the rendering of the highlighted area.
lb | lb | Line break.
list | list | A list of items.
list_type | list@type | The type of list used.
item | item | Item in a list.
name | name | A proper name (annotated only in some documents).
name_type | name@type | The type of proper name (e.g. "person", "herb").
note | note | A note in the original document (e.g. footnotes).
p | p | A paragraph.
p_n | p@n | The number of a numbered paragraph (this may also be a letter such as A).
p_rend | p@rend | Description of the rendering of the paragraph.
pb | pb | Page break.
pb_ana | pb@ana | Analysis of the page break (e.g. in case of apparently incorrect page numbers).
pb_n | pb@n | The number of the page (if marked explicitly).
pb_rend | pb@rend | Description of the rendering of the page (repeated parts of book or chapter titles, redundant confidence texts).
quote | quote | A quotation (in some documents only).
reason | unclear@reason | Reason for annotation of the current area (usually describes form of unclear areas).
ref | ref | Reference to a footnote.
ref_target | ref@target | ID of the footnote being referred to.
ref_type | ref@type | Type of reference (e.g. a TEI "noteAnchor").
w | w | A word annotated with additional attributes.
xml_id | fZ (Z is a number) | ID given to a footnote.
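
The naming pattern in the table above (element name, element_attribute, and a depth number for nested div elements) can be sketched as a small tree walk. This is an illustration only: the function name `layer_names`, the exception table, and the sample fragment are assumptions, the sample omits the TEI namespace for brevity, and the real conversion to PAULA/relANNIS may work differently.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Attributes whose layer name does not follow element_attribute.
SPECIAL = {
    ("unclear", "atLeast"): "atLeast",
    ("unclear", "atMost"): "atMost",
    ("unclear", "reason"): "reason",
    ("foreign", XML_LANG): "lang",
}


def layer_names(xml_text: str) -> list:
    """List layer names for every element and attribute, in document order."""
    names = []

    def walk(el, depth):
        if el.tag == "div":
            depth += 1  # nesting depth becomes part of the name
            base = f"div{depth}"
        else:
            base = el.tag
        names.append(base)
        for attr in el.attrib:
            names.append(SPECIAL.get((el.tag, attr), f"{base}_{attr}"))
        for child in el:
            walk(child, depth)

    walk(ET.fromstring(xml_text), 0)
    return names


# Hypothetical fragment, not taken from the corpus.
SAMPLE = (
    '<div type="book"><div type="chapter" n="1">'
    '<head rend="bold">Von der Wurzel</head>'
    '<p><foreign xml:lang="lat" rend="antiqua">radix</foreign> '
    '<unclear reason="illegible" atLeast="2" atMost="4"/></p>'
    '</div></div>'
)

names = layer_names(SAMPLE)
```

For this fragment the outer div yields div1 and div1_type, while the nested chapter yields div2, div2_type and div2_n.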

Corpus-Specific Annotations

These annotations were developed by our students to annotate spans of tokens with properties of special interest.

PAULA/relANNIS | TEI XML | Description
definition | N/A | A definition.
term | N/A | A technical term.
property | N/A | Describes a reference to properties of a herb such as effect, smell etc.
reader_ref | N/A | References made by authors to the reader. Values indicate the grammatical type of the reference, e.g. "pron2sg" for second person singular pronoun.
author_ref | N/A | References made by authors to themselves. Values indicate the grammatical type of the reference, e.g. "pron1pl" for first person plural pronoun.
uncertain | N/A | Annotator uncertain of lemma and/or normalization since no equivalent could be established.
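
Values such as "pron2sg" and "pron1pl" can be split into their parts for counting or filtering. The sketch below assumes that values consist of a lowercase type label, a person digit and a number suffix (sg or pl). Only the two values quoted above are documented, so the general pattern and the function name `parse_ref_value` are assumptions.

```python
import re
from typing import Optional

# Assumed pattern: type label, person (1-3), number (sg or pl).
REF_VALUE = re.compile(r"^([a-z]+)([123])(sg|pl)$")


def parse_ref_value(value: str) -> Optional[dict]:
    """Split a reader_ref or author_ref value; None if it does not fit."""
    match = REF_VALUE.match(value)
    if match is None:
        return None
    kind, person, number = match.groups()
    return {"type": kind, "person": int(person), "number": number}


reader = parse_ref_value("pron2sg")
author = parse_ref_value("pron1pl")
```

Returning `None` for values outside the assumed pattern keeps unexpected annotation values visible instead of silently misreading them.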