Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology

Documentation Version 2.0

Corpus pipeline

  1. Constitution: RIDGES v1 without "Flora Francica Rediviva"
    You can find a complete list of all documents of this version in the download section.
  2. Correction of transcription, <clean>-layer and normalisation
  3. Part-of-speech tagging and lemmatization with TreeTagger
  4. Manual correction of structural and content annotations
  5. Export of the merged corpus to persistent formats and to the ANNIS search and visualization tool

 

Corpus design

For purposes of comparability, we try to select texts from one scientific discipline which is ideally represented in a similar fashion throughout the early modern era. For the first RIDGES corpus we have selected the domain of herbology (Kräuterkunde). The timespan of interest has been divided into 30-year periods, currently with a minimum sample of one text per period. Texts vary somewhat in length, since older texts are more difficult to annotate. Each document is typically between 4000 and 10000 tokens long.
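
The division into 30-year periods can be sketched as follows. This is only an illustration: the function name `period_of` is hypothetical, and the start year 1500 used in the example is an assumption, since the documentation does not state where the first period begins.

```python
from typing import Tuple


def period_of(year: int, start: int, width: int = 30) -> Tuple[int, int]:
    """Return (first_year, last_year) of the period that contains `year`."""
    if year < start:
        raise ValueError("year lies before the first period")
    index = (year - start) // width      # zero-based number of the period
    first = start + index * width
    return first, first + width - 1


# Assumed start year 1500, chosen only for illustration
print(period_of(1722, start=1500))  # (1710, 1739)
```

With one text per period as the current minimum, each such interval would correspond to at least one document in the corpus.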

Annotation layers

The RIDGES corpora follow a multi-layer design. Annotation layers can be roughly divided into five kinds:

  1. Transcription/normalisation
  2. Linguistic annotations
  3. Structural annotations
  4. Content annotation
  5. Metadata

 

Transcription/normalisation

These annotations always apply to exactly one token. Part-of-speech annotation and lemmatization were carried out with TreeTagger and corrected manually.

Annotation layer and value(s) Description
dipl
annotation value(s):
  • Text
The diplomatic transcription of the word form as found in the manuscript.
clean
annotation value(s):
  • Text
Normalizations regarding graphical structures and special characters (e.g. "ſ" to "s"), but not modernized to Modern German orthography. For example, a form with a line break like wor=den will be cleaned to worden but not normalized to modern geworden where this would now be the appropriate form.
norm
annotation value(s):
  • Text
A normalized word form based on Modern German orthography. Inflection is not modernized.
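
The step from dipl to clean can be approximated for the two examples given in the table ("ſ" to "s", and wor=den to worden). The function below is a minimal sketch under the assumption that every "=" marks a word split at a line break; the actual clean layer was corrected manually and may treat other cases differently. The name `clean` simply mirrors the layer name and is not taken from any RIDGES tool.

```python
def clean(dipl: str) -> str:
    """Sketch of dipl -> clean for the two documented cases only."""
    # Replace the long s with a round s
    text = dipl.replace("ſ", "s")
    # Rejoin a word split at a line break (assumption: "=" always marks such a split)
    text = text.replace("=", "")
    return text


print(clean("wor=den"))    # worden
print(clean("beſchicht"))  # beschicht
```

Note that the result is deliberately not modernized: worden stays worden and is not turned into geworden, which belongs to the norm layer.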

 

Linguistic annotations

Annotation layer and value(s) Description
pos
annotation value(s):
  • STTS
Part-of-speech annotation using the STTS tagset for German.
lemma
annotation value(s):
  • Text (Type)
The normalized uninflected lexicon entry for each word form, using modern orthography (obsolete words are also modernized, e.g. beſchicht has the lemma beschehen, analogous to geschehen).
hyperlemma
annotation value(s):
  • Text
In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent is given as a hyperlemma (e.g. Heümonat is hyperlemmatized as Juli or ráß as beißend).
foreign
annotation value(s):
  • foreign
Non-German text.
foreign_trans
annotation value(s):
  • trans_to_german
  • trans_from_german
  • trans_from_german_extended
  • trans_to_german_extended
Translation from and to German.
lang
annotation value(s):
Description of the target language and of the source language of a translation.
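
To illustrate how the token-level layers relate to each other, the following sketch models one token with its parallel annotations. The class `Token` and the helper `modern_key` are hypothetical names used for illustration; they do not reflect the PAULA/relANNIS storage format. Only values mentioned in the tables above are filled in, and all other layers are left empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    dipl: str                          # diplomatic transcription
    clean: Optional[str] = None        # graphically cleaned form
    norm: Optional[str] = None         # Modern German orthography
    pos: Optional[str] = None          # STTS tag
    lemma: Optional[str] = None
    hyperlemma: Optional[str] = None   # only set where a lemma would mislead


def modern_key(token: Token) -> Optional[str]:
    """Prefer the hyperlemma where one exists, otherwise fall back to the lemma."""
    return token.hyperlemma or token.lemma


tokens = [
    Token(dipl="beſchicht", lemma="beschehen"),
    Token(dipl="Heümonat", hyperlemma="Juli"),
]
print([modern_key(t) for t in tokens])  # ['beschehen', 'Juli']
```

Falling back from hyperlemma to lemma is one possible way to obtain a modern lookup key per token; it is an assumption of this sketch, not a rule stated in the guidelines.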

 

Structural annotations

Annotation layer and value(s) Description
lb
annotation value(s):
  • lb
Line break.
brace
annotation value(s):
  • brLeft
  • brRight
A left or right brace grouping text over multiple lines.
brace_dir
annotation value(s):
  • Text
Direction of the brace.
p
annotation value(s):
  • p
A paragraph.
p_n
annotation value(s):
  • Number or letter
The number of a numbered paragraph (this may also be a letter such as A).
p_rend
annotation value(s):
  • initial capital
  • big bold type
Description of the rendering of the paragraph.
pb
annotation value(s):
  • pb
Page break.
pb_n
annotation value(s):
  • Number or Letter
The number of the page (if marked explicitly).
pb_rend
annotation value(s):
  • in header: Von Haſelwurtz. Cap. III.
  • in header: Vorred
  • in header: Von Chamillen. Cap. VIII.
  • in header: Vorrede.
  • in header Vorred, signature ´A io`at bottom of page
  • in header: Von Staubwurtz. Cap. II
  • in header: Von Eibisch. Cap. V.
  • in header: Vorred, signature 'A ' at bottom of page
  • in header Vorred, signature'A iiij' at bottom of page
  • in header: Von Wermůt. Cap. I.
  • in header: Vorred, signature 'A iij' at bottom of page
  • in header: Von Drachenwurtz. Cap. IIII.
  • in header: Vorred, signature 'A ij' at bottom of page
  • Ohl zu machen.
  • Zum beſten zu Diſtilliren.
  • Waſſer auß Kräutern vnd dergleichen
  • Auffs beſt zu Diſtilliren.
  • Auß Kräutern vnd dergleichen
  • signature 'A ' at bottom of page
  • Am beſten zu Diſtilliren.
Description of the rendering of the page (repeated parts of book or chapter titles, redundant confidence texts).
pb_ana
annotation value(s):
  • page number should be 7
Analysis of the pagebreak (e.g. in case of apparently incorrect page numbers).
div1 - div5
annotation value(s):
  • div
A subsection of the document. Nesting depth is made explicit by the number after div in the PAULA/relANNIS version.
div1_type - div5_type
annotation value(s):
  • appendix
  • book
  • chapter
  • description
  • form
  • herb
  • names
  • name
  • nature
  • parts_preparation_and_usus
  • places
  • place
  • preface
  • postscript
  • power
  • reproduction
  • season
  • section
  • species
  • title
  • time
  • utensils
The type of section or subsection. A section can correspond to the entire "book", a "chapter" or a smaller section, including systematic types specific to the genre such as "place" (where a certain herb grows), "form" (descriptions of a herb's form) etc.
div1_n - div5_n
annotation value(s):
  • Number
A numbered subsection (the n annotation has the section number as a value, though this may also be a letter such as A or a subsection such as 1.1)
unclear
annotation value(s):
  • unclear
Unreadable or otherwise unclear text
atLeast
annotation value(s):
  • Number
Minimum presumed length of unclear text in characters
atMost
annotation value(s):
  • Number
Maximum presumed length of unclear text in characters
interpretation
annotation value(s):
  • Text
Suggestions for unreadable or unclear text
figure
annotation value(s):
  • figure
  • table
A graphic embedded in the original document.
figure_rend
annotation value(s):
  • Drawing of two jars
  • Drawing of three jars
  • Drawing of two glasses
  • Drawing of three glasses
  • Drawing of two alembics
  • Drawing of an instrument
  • Drawing of an EIBISCH.
  • Drawing of a STAUBWURTZ.
  • Drawing of a KAMILLE.
  • Drawing of a HÜHNERDARM.
Description of the rendering of the figure.
hi
annotation value(s):
  • hi
Highlighted area.
hi_rend
annotation value(s):
  • antiqua
  • italics
  • fracture
  • bold
  • underlined
  • red
  • inicap
  • letter-spacing:1em
Description of the rendering of the highlighted area.
head
annotation value(s):
  • head
A heading.
head_n
annotation value(s):
  • Number
The number of a heading.
head_rend
annotation value(s):
  • red and black
  • red
  • brown
Description of the rendering of the heading.
note
annotation value(s):
  • note
  • margin
A note in the original document (e.g. footnotes, margins).
ref
annotation value(s):
  • ref
Reference to a footnote.
ref_target
annotation value(s):
  • #fZ (Z is a number)
ID of the footnote being referred to.
ref_type
annotation value(s):
  • noteAnchor
Type of reference (e.g. a TEI "noteAnchor").
quote
annotation value(s):
  • quote
A quotation (in some documents only).
list
annotation value(s):
  • list
A list of items.
list_type
annotation value(s):
  • simple
The type of list used.
item
annotation value(s):
  • item
Item in a list.
xml_id
annotation value(s):
  • fZ (Z is a number)
ID given to a footnote.
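
The relation between ref_target ("#fZ") and xml_id ("fZ") can be sketched as a simple lookup: the target is the footnote ID prefixed with "#". The function name `resolve_ref` and the footnote texts below are placeholders invented for this illustration.

```python
from typing import Dict


def resolve_ref(ref_target: str, notes: Dict[str, str]) -> str:
    """Return the footnote whose xml_id matches a ref_target such as '#f2'."""
    if not ref_target.startswith("#"):
        raise ValueError(f"expected a target of the form '#fZ', got {ref_target!r}")
    # Drop the leading '#' to obtain the xml_id
    return notes[ref_target[1:]]


# Hypothetical footnotes keyed by xml_id; the texts are placeholders
notes = {"f1": "first footnote text", "f2": "second footnote text"}
print(resolve_ref("#f2", notes))  # second footnote text
```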

 

Content annotations

Annotation layer and value(s) Description
definition
annotation value(s):
  • fig
  • expl
A definition of a figure.
term
annotation value(s):
  • t
  • h
  • d
A technical term, the name of a herb or plant, or the name of a disease.
property
annotation value(s):
  • appearance
  • effect
  • smell
  • preparation
  • taste
  • cultivation
Describes a reference to properties of a herb such as effect, smell etc.
reader_ref
annotation value(s):
  • pron1pl
  • pron2sg
  • pron3sg
  • pron2pl
  • address
References made by authors to the reader. Values indicate the grammatical type of the reference, e.g. "pron2sg" for second person singular pronoun.
author_ref
annotation value(s):
  • pron1pl
  • pron1sg
  • pron2sg
  • pron3sg
  • author
References made by authors to themselves. Values indicate the grammatical type of the reference, e.g. "pron1pl" for first person plural pronoun.
name
annotation value(s):
  • name
A proper name (annotated only in some documents).
name_type
annotation value(s):
  • herb
  • scholar
  • plant
  • person
  • flower
  • tree
  • gardener
  • publisher
The type of proper name (e.g. "person", "herb").
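
The pronoun codes used by reader_ref and author_ref follow a regular pattern (person, then number), as in "pron2sg" for a second person singular pronoun. A minimal sketch that spells out such values is given below; the function name `describe_ref` is hypothetical, and values that are not pronoun codes (e.g. "address", "author") are returned unchanged.

```python
import re

PERSON = {"1": "first", "2": "second", "3": "third"}
NUMBER = {"sg": "singular", "pl": "plural"}


def describe_ref(value: str) -> str:
    """Spell out a reader_ref/author_ref value such as 'pron2sg'."""
    match = re.fullmatch(r"pron([123])(sg|pl)", value)
    if match is None:
        # Values such as 'address' or 'author' are not pronoun codes
        return value
    person, number = match.groups()
    return f"{PERSON[person]} person {NUMBER[number]} pronoun"


print(describe_ref("pron2sg"))  # second person singular pronoun
print(describe_ref("pron1pl"))  # first person plural pronoun
print(describe_ref("address"))  # address
```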

 

Metadata

These annotations follow the TEI P5 guidelines.

Annotation layer and value(s) Description
meta::author
annotation value(s):
  • author
Name of the author (if known).
meta::bibl
annotation value(s):
  • bibl
Full bibliographical entry for the source including the page numbers annotated in the corpus.
meta::date
annotation value(s):
  • date
Date of publication, usually just the year (e.g. "1722").
meta::publisher
annotation value(s):
  • publisher
Publisher of the document (if known).
meta::pubPlace
annotation value(s):
  • pubPlace
Publication place of the document.
meta::title
annotation value(s):
  • title
Title of the work the document was extracted from.