Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology

Documentation Version 6.0

Corpus pipeline

  1. 14 additional texts were added:
    GartDerGesundheit-VR_1487_vonCuba
    ArtzneyBuchleinDerKreutter-VFR_1532_Tallat
    ContrafaytKreuterbuch-VR_1532_Brunfels
    ContrafaytKreuterbuch-CCXXXVII-CCXLVIII_1532_Brunfels
    NewKreuetterBuch-VR_1539_Bock
    WieSichMeniglich-VR_1557_vonBodenstein 
    AlchymistischePractic-VR_1603_Libavius
    BlackwellischesKraeuterbuch_1750_Blackwell
    Apothekerlexikon_1793_Hahnemann
    GetreueDarstellungUndBeschreibung_1809_Hayne
    EigenschaftenAllerHeilpflanzen-Wermut_1828_Anonymous
    GrossesIllustriertesKraeuterbuch_1860_Mueller
    GemeinnuetzigesKraeuterbuch_1874_Siegmund
    NatürlichePflanzenfamilien_1887_Engler
     
    You can find a complete list of all documents of this version in the download section.
  2. Transcription and tokenization of the 14 new texts with TreeTagger.
  3. Normalization of the 14 new texts (<norm> layer). Correction of the <norm> layer of all other texts.
  4. Deletion of the following annotation layers: <p>, <p_rend>, <p_n>, <brace>, <brace_dir>, <pb_rend>, <div1-div5>, <div1_type-div5_type>, <div1_n-div5_n>, <xml:id>, <list>, <list_type>, <pos_klein>, <head>, <head_n>, <dialekt>, <diachronie>, <citation>, <term> (the value "h" from <term> is now included as "pl" in layer <plant> and the value "d"  as "di" in layer <disease>).
  5. All annotation layers and metadata with German names changed to English names.
  6. Part-of-speech tagging and lemmatization with TreeTagger-Batch and TreeTagger for all documents. Please note: quotation marks can cause errors and need to be masked. Furthermore, TreeTagger deletes empty lines. Fill those lines with an arbitrary tag (e.g. <9>) and use the option -sgml while tagging. Lines that contain tags are not tagged and can be deleted afterwards. After the TreeTagger output has been merged with the MS Excel file, the MS Excel macro SearchAndMerge (Readme) reconstructs the segmentation.
  7. Extensive manual creation and correction of structural and content annotations in MS Excel.
  8. Automatic creation of <clean> for all documents (Python-Script and Readme).
  9. Conversion from MS Excel to ANNIS format and PAULA format via Pepper.
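
The empty-line handling described in step 6 can be sketched in Python as follows. This is only an illustration, not the project's actual tooling: the function names, the one-token-per-line input and the sample tagger output are assumptions, and the masking of quotation marks is not covered.

```python
PLACEHOLDER = "<9>"  # arbitrary tag from step 6; any SGML-style tag would do

def mask_empty_lines(lines, placeholder=PLACEHOLDER):
    """Fill empty lines with a placeholder tag so TreeTagger does not drop them."""
    return [line if line.strip() else placeholder for line in lines]

def restore_empty_lines(tagged_lines, placeholder=PLACEHOLDER):
    """Turn untouched placeholder lines in the tagger output back into empty rows."""
    return ["" if line.strip() == placeholder else line for line in tagged_lines]

tokens = ["Der", "Saft", "", "hilft"]   # one token per line, one empty row
masked = mask_empty_lines(tokens)       # input for TreeTagger (run with -sgml)

# Hypothetical tagger output (token, tag, lemma); with -sgml the tag line is passed through.
tagged = ["Der\tART\tdie", "Saft\tNN\tSaft", "<9>", "hilft\tVVFIN\thelfen"]
restored = restore_empty_lines(tagged)

print(masked)
print(len(restored) == len(tokens))
```

Keeping the placeholder rows until after tagging means the output still has one row per input row, which is what a later merge with the spreadsheet would rely on.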

 

ridges-version6-diagramm.png

 

Corpus design

 

In order to study the development of scientific language throughout the period of interest, we require a subject domain that is sufficiently well represented in all subperiods. That is why we have selected the domain of herbology (Kräuterkunde). Texts vary somewhat in length, since older texts are more difficult to annotate.

Rplot_v6.png

 

Annotation layers

The RIDGES corpus is designed as a multi-layer architecture. Annotation layers can be roughly divided into five kinds:

  1. Transcription/normalisation
  2. Linguistic annotations
  3. Structural annotations
  4. Content annotation
  5. Metadata

 

Transcription/normalisation

Annotation layer and value(s) Description
dipl
independent segmentation

annotation value(s):
  • Text
The diplomatic transcription of the word form as found in the manuscript. A Unicode table of special characters is used.
clean
independent segmentation

annotation value(s):
  • Text
Automatic normalization by a Python script that affects graphical structures and special characters only (e.g. "ſ" to "s"). For example, a form with a line break such as wor=den is cleaned to worden, but it is not normalized to modern geworden even where that would now be the appropriate form. For words split across a line break, note that if the second part begins with a capital letter, this letter is changed to lower case in the clean layer (e.g. "Gelb- Sucht" to "Gelbsucht"). If all letters of the second part are capitals, they remain unchanged (e.g. "MON- TANUM" to "MONTANUM"). Dipl units containing vowels with macrons are replaced by every potential form of that token, separated by '|' (for example, 'auſzwēdig' becomes 'auszwemdig|auszwendig'). For a full overview of the replacements in the clean layer, see the Readme.
norm
independent segmentation

annotation value(s):
  • Text
In this layer, segmentation, graphemics, inflectional forms and lexemes are normalized. Graphemics: orthographic normalization according to Duden (e.g. kreutter -> Kräuter); phonology: please note the sound changes of the Early New High German period, such as diphthongization, monophthongization, syncope, apocope, etc. (e.g. lehret -> lehrt); morphology: in die Nasen -> in die Nase; lexicology: extinct lexical material is normalized according to modern orthography and, where appropriate, explained in the "comment" layer (formerly "erlaeuterung") (e.g. Vergeſz -> Vergess); word formation: extinct morphemes are normalized, if possible, according to modern orthography (e.g. halben -> halber or stachelecht -> stachelig). Currently, case has been normalized in only some documents.
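
The rules given for the clean layer can be sketched in Python as follows. This is not the project's script: the function names are hypothetical, only the replacements named in the description are covered (the Readme lists the full set), the two halves of a word split at a line break are assumed to arrive as separate strings, and reading every macron vowel as "vowel + m" or "vowel + n" generalizes from the single documented example.

```python
import itertools
import unicodedata

SIMPLE_REPLACEMENTS = {"ſ": "s"}  # only the replacement named in the description

# Precomposed macron vowels mapped to their base vowel (assumption: macron = omitted m or n).
MACRON_VOWELS = {"\u0101": "a", "\u0113": "e", "\u012b": "i", "\u014d": "o", "\u016b": "u"}

def join_line_break(first, second):
    """Join two halves of a word that was split at a line break."""
    first = first.rstrip("-=⸗")
    # A capitalised second half is lowered, unless it is written entirely in capitals.
    if second[:1].isupper() and not second.isupper():
        second = second[:1].lower() + second[1:]
    return first + second

def expand_macrons(token):
    """Return all potential readings of macron vowels, joined by '|'."""
    token = unicodedata.normalize("NFC", token)
    options = [
        [MACRON_VOWELS[ch] + "m", MACRON_VOWELS[ch] + "n"] if ch in MACRON_VOWELS else [ch]
        for ch in token
    ]
    return "|".join("".join(parts) for parts in itertools.product(*options))

def clean_token(token):
    """Apply the simple character replacements, then expand macron vowels."""
    for old, new in SIMPLE_REPLACEMENTS.items():
        token = token.replace(old, new)
    return expand_macrons(token)

print(join_line_break("wor=", "den"))
print(join_line_break("Gelb-", "Sucht"))
print(join_line_break("MON-", "TANUM"))
print(clean_token("auſzwēdig"))
```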

 

Linguistic annotations

Annotation layer and value(s) Description
pos
segmentation based on 'norm'

annotation value(s):
  • STTS
Automatic part-of-speech annotation using the STTS tagset for German.
lemma
segmentation based on 'norm'

annotation value(s):
  • String
Automatic lemmatization by TreeTagger.
comment
segmentation based on 'dipl'

annotation value(s):
  • String
This is an unsystematic layer. In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent (e.g. Heümonat -> Juli) or an explanation can be given. This layer was originally called "hyperlemma", was renamed "erlaeuterung" in RIDGES version 5, and is called "comment" in this version.
foreign
segmentation based on 'dipl'

annotation value(s):
  • foreign
Non-German text.
foreign_trans
segmentation based on 'dipl'

annotation value(s):
  • trans_to_german
  • trans_from_german
  • trans_from_german_extended
  • trans_to_german_extended
Translation from and to German.
lang

segmentation based on 'dipl'

annotation value(s):

The language in which a foreign passage is written (three-letter codes according to ISO 639-2).
comp
segmentation based on 'dipl'

annotation value(s):
  • k
Nominal compound (with nominal head).
comp_orth
segmentation based on 'dipl'

annotation value(s):
  • zs
  • gtr
  • bs
  • lb1
  • lb2
Annotation of the specific spelling of the annotated compounds (comp): zs: written together; gtr: written separately; bs: hyphenated (one line); lb1: separated by line break (no hyphen); lb2: separated by line break (hyphenated).
prot
segmentation based on 'dipl'

annotation value(s):
  • prot1
  • prot2
  • prot3
Assigns a prototype to each compound of the comp layer: prot1: reliably identifiable as a compound; prot2: quite likely a compound; prot3: case of doubt (not assigned in comp).

attr_gen
segmentation based on 'dipl'

annotation value(s):

  • gprä
  • gpost
Annotation of noun phrases with a prenominal or postnominal genitive attribute. gprä = prenominal genitive, gpost = postnominal genitive.

morph_ellipsis
segmentation based on 'dipl'

annotation value(s):

  • strD
Coordination of compounds and parts of compounds (truncated morphemes and compounds such as gelb⸗ und Waſſerſucht).

persname
segmentation based on 'dipl'

annotation value(s):

  • String
Every name of a person to which the author of a particular document refers is annotated. For every instance the name of the person is given in the nominative form.

title
segmentation based on 'dipl'

annotation value(s):

  • String
Every book title to which the author of a particular document refers is annotated. For every instance the title of the book is given in the nominative form.

form_disease
segmentation based on 'dipl'

annotation value(s):

  • deriv
  • derivat
  • kompNN
  • kompNNgetrennt
  • lat
  • phrase
  • Phrase
  • phraseDasIst
  • phraseGen
  • phraseGEN
  • phraseGenannt
  • phraseHS
  • phraseRS
  • phraseSubj
  • phraseV1
  • phraseVP
  • simplex
  • wort
     
NA

problem
segmentation based on 'dipl'

annotation value(s):

  • String
NA

herbname_norm
segmentation based on 'dipl'

annotation value(s):

  • String
In this layer a systematic herbal name is given. Sometimes it is ambiguous; in this case you can find additional information in the "comment" layer (formerly "erlaeuterung") or in the "comment_lex" layer (formerly "bemerkung_lexik").

herbprep
segmentation based on 'dipl'

annotation value(s):

  • String
This layer was made for the identification of preparations of herbs. Only those instances are included which are NPs or modifiers with a herb as head. The name is given in the nominative singular form and normalized according to modern orthography. Whitespace is replaced by underscores. Compounds are always written together, regardless of their compound spelling in the facsimile. Everything is written in lower case letters (e.g. safft des weremuts -> saft_des_wermuts).

form_prep
segmentation based on 'dipl'

annotation value(s):

  • kompNN
  • kompNNgetrennt
  • phraseVon
  • phraseGen
In this layer preparations with herbs are described syntactically or morphologically. kompNN = NN compounds which are written together or hyphenated; kompNNgetrennt = nouns following each other which could be a compound (written separately); phraseVon = preparations with herbs containing a von-PP (e.g. safft von weremut); phraseGen = preparations with herbs containing a genitive attribute (e.g. safft des weremuts).

noun_nom
segmentation based on 'dipl'

annotation value(s):

  • String
In this layer all nouns that occur in the text are given in the nominative singular form, using the spelling of their first occurrence. If the first occurring form of "Saft" is safft, all further occurrences of "Saft" are given as safft. Everything is written in lower case letters. The purpose of this layer is to investigate the variation of noun spelling within one text.

form_noun
segmentation based on 'dipl'

annotation value(s):

  • simplex
  • kompNN
  • kompNNgetrennt
  • kompNEN
  • kompNENgetrennt
  • kompNNNgetrennt
  • kompAN
  • kompVN
  • derivat
  • nom
  • gri
  • lat
  • lex
In this layer all nouns are morphologically annotated. kompNN = NN compound, written together or hyphenated; kompNNgetrennt = all sequences of two nouns which could be compounds but are written separately; kompNEN = NE-N compound, written together or hyphenated; kompNENgetrennt = all sequences of NE and N which could be compounds but are written separately; kompNNNgetrennt = all sequences of three nouns which could be compounds but are written separately; kompAN = AN compounds; kompVN = VN compounds; derivat = derivatives; nom = implicit nominalization (conversion, ablaut, syntactic nominalization); gri/lat/ara = clearly Greek/Latin/Arabic nouns (foreign material that is already integrated into the German language is treated like native words); lex = lexicalized herb names which were originally morphologically complex, e.g. Beifuß, Wermut, Stabwurz, and tausend guldin for "Tausendguldenkraut".

comment_lex
segmentation based on 'dipl'

annotation value(s):

  • String
This is an unsystematic layer for comments and questions about lexis.

clause_type
segmentation based on 'dipl'

annotation value(s):

  • rs
  • padv
  • rsx
  • rsdem
  • padvpart
  • dem
  • part
Annotation of clause types. The annotation is not hierarchical: for nested sentences only the highest clause is annotated. In the layer "bemerkungen_syntax" you can find notes about the nestings. rs = clear relative clauses, both "w-relative clauses" and "d-relative clauses"; padv = clauses which are introduced by a pronominal adverb; rsx = relative clauses without a main clause (this often occurs in headlines); rsdem = ambiguous cases: relative clause or demonstrative clause; padvpart = clauses with pronominal adverb and participle; dem = demonstrative clauses (all clauses with a demonstrative pronoun as subject); part = participles that behave like relative clauses.

position_rel
segmentation based on 'dipl'

annotation value(s):

  • vor
  • nach
  • int
Position of the relative clause within the main clause. vor = preposed; nach = postposed; int = embedded.

position_referent
segmentation based on 'dipl'

annotation value(s):

  • adja-v
  • adja-n
  • dist
  • na
Position of the relative clause relative to the reference category. adja-v = adjacently preposed; adja-n = adjacently postposed; dist = distant; na = not applicable.

form_referent
segmentation based on 'dipl'

annotation value(s):

  • np
  • d-pron
  • p-pron
  • null
Form of the reference category of the relative clause. np = non-pronominal NP; d-pron = der, die, das, dieser, etc.; p-pron = personal pronoun; null = for free relative or asyndetic relative clauses with a covert correlate in the main clause.

position_verb_rel
segmentation based on 'dipl'

annotation value(s):

  • v2
  • ve
  • venf
Verb position within the relative clause. v2 = verb second; ve = verb end; venf = verb end with occupied postfield.

form_relpron
segmentation based on 'dipl'

annotation value(s):

  • d-pron
  • w-pron
  • w-phras
Form of the category which introduces the relative clause. d-pron = all d-pronouns; w-pron = wer, welch-; w-phras = e.g. welch frau.

mod_referent
segmentation based on 'dipl'

annotation value(s):

  • relsatz
  • d-pron
  • m-padv
  • m-part
  • np
relsatz = annotated on pronouns, NPs or clauses that are modified by a relative clause. Not applicable for free relative clauses. The whole reference category is annotated as a span. d-pron, m-padv, m-part, np = NA.

position_verb
segmentation based on 'norm'

annotation value(s):

  • V2
  • Vletzt
  • V?
  • V1
Verb position in subordinate clauses introduced by a subordinating conjunction. Each value is annotated as a token feature on tokens with pos=KOUS. V2: verb-second position; Vletzt: verb-final position; V?: unclear verb position; V1: verb-first position.

subclause_type
segmentation based on 'norm'

annotation value(s):

  • Adverbial
  • Attribut
  • Komplement
Type of subordinate clause. Each value describes the function of a subordinate clause introduced by a subordinating conjunction and is annotated as a token feature on tokens with pos=KOUS. Adverbial: adverbial function; Attribut: attributive function; Komplement: complement function.

KOUS_sem
segmentation based on 'norm'

annotation value(s):

  • additiv
  • final
  • k.a.
  • kausal
  • konditional
  • konsekutiv
  • konzessiv
  • modal
  • temporal
  • 0
Semantics of the subordinating conjunction (KOUS_Semantik), annotated on tokens with pos=KOUS. additiv: additive; final: final; k.a.: not analyzable, because the subordinate clause is a complement; kausal: causal; konditional: conditional; konsekutiv: consecutive; konzessiv: concessive; modal: modal; temporal: temporal; 0: NA.

ppk_e1 - ppk_e3
segmentation based on 'dipl'

annotation value(s):

  • ppk
  • ppk_e2
  • ppk_e3
  • zwf
  • ppk_rek
  • BSP
  • BSP+
  • BSPBuchtitel

Prepositional constructions (prepositional attributive constructions or attributive adverbial phrases) are annotated.

ppk: normal prepositional construction

ppk_e2: normal ppk inside a ppk in the layer <ppk_e1>

ppk_e3: normal ppk inside a ppk in the layer <ppk_e2>

zwf: case of doubt

ppk_rek: recursive (nested) ppk

attr_X: attributes that refer to an element inside a ppk without being linked to it within a syntactic sequence. X is a placeholder for the respective referent.

BSP/BSP+: special examples (personal tag for annotator)

BSPBuchtitel: special examples regarding book titles (personal tag for annotator)

sentence_end
segmentation based on 'dipl'

annotation value(s):

  • S
Sentence endings are annotated. You can find the detailed annotation guidelines here.
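
The convention described above for the noun_nom layer (the first occurring spelling of a noun is reused for all later occurrences) can be illustrated with a short Python sketch. The corpus description does not say how these values were produced; the function name and the input format, pairs of a modern lexeme and the nominative singular spelling found in the text, are assumptions made for this example.

```python
def noun_nom_values(nouns):
    """Assign each noun the lower-cased spelling of its first occurrence.

    nouns: (lexeme, spelling) pairs in text order; the spelling is assumed
    to be given in the nominative singular already.
    """
    first_spelling = {}
    values = []
    for lexeme, spelling in nouns:
        key = lexeme.lower()
        # Remember only the first spelling seen for this lexeme.
        first_spelling.setdefault(key, spelling.lower())
        values.append(first_spelling[key])
    return values

occurrences = [("Saft", "safft"), ("Wermut", "weremut"), ("Saft", "saft")]
print(noun_nom_values(occurrences))
```

Comparing these values with the actual spelling of each token then shows where a text deviates from its own first spelling.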

 

Structural annotations

Annotation layer and value(s) Description

lb
segmentation based on 'dipl'

annotation value(s):

  • lb
Line break.

pb
segmentation based on 'dipl'

annotation value(s):

  • pb
Page break.

pb_n
segmentation based on 'dipl'

annotation value(s):

  • Integer or Letter
The number of the page (if marked explicitly).

pb_ana
segmentation based on 'dipl'

annotation value(s):

  • Integer
Revision of the pagebreak (e.g. in case of apparently incorrect page numbers).

unclear
segmentation based on 'dipl'

annotation value(s):

  • unclear
Unreadable or otherwise unclear text.

atLeast
segmentation based on 'dipl'

annotation value(s):

  • Integer
Minimum presumed length of unclear text in characters.

atMost
segmentation based on 'dipl'

annotation value(s):

  • Integer
Maximum presumed length of unclear text in characters.

interpretation
segmentation based on 'dipl'

annotation value(s):

  • String
Suggestions for unreadable or unclear text.

figure
segmentation based on 'dipl'

annotation value(s):

  • figure
  • table
A graphic or table embedded in the original document.

figure_rend
segmentation based on 'dipl'

annotation value(s):

  • drawingOfTwoJars
  • drawingOfThreeJars
  • drawingOfTwoGlasses
  • drawingOfThreeGlasses
  • drawingOfTwoAlembics
  • drawingOfAnInstrument
  • drawingOfAnEibisch
  • drawingOfAStaubwurtz
  • drawingOfAKamille
  • drawingOfAHühnerdarm
  • drawingOfAHelmet
  • drawingOfAFilter
  • drawingOfAWaldenburgischerKolben
  • drawingOfAHaselwurtz
  • drawingOfADrachenwurtz
  • drawingOfAGauchheyl
  • drawingOfADill
  • drawingOfAHauswurz
Description of the rendering of a figure.

hi
segmentation based on 'dipl'

annotation value(s):

  • hi
Highlighted area.

script
segmentation based on 'dipl'

annotation value(s):

  • blackletter
  • roman
  • mixed
Annotation of change of font.

hi_rend
segmentation based on 'dipl'

annotation value(s):

  • antiqua
  • bold
  • end
  • iniCap
  • italics
  • letter-spacing:1em
  • red
Description of the rendering of the highlighted area.

head
segmentation based on 'dipl'

annotation value(s):

  • head
A heading.

note
segmentation based on 'dipl'

annotation value(s):

  • note
  • margin
  • end
A note in the original document (e.g. footnotes, margins).

ref
segmentation based on 'dipl'

annotation value(s):

  • ref
Reference to a footnote.

ref_target
segmentation based on 'dipl'

annotation value(s):

  • #fINT
ID of the footnote being referred to.

ref_type
segmentation based on 'dipl'

annotation value(s):

  • noteAnchor
Type of reference (e.g. a TEI "noteAnchor").

quote
segmentation based on 'dipl'

annotation value(s):

  • yes
  • no
dipl-tokens that are part of a quote are annotated with "yes". The default value is "no".

item
segmentation based on 'dipl'

annotation value(s):

  • item
Item in a list.

 

Content annotations

These annotations were developed by our students to annotate spans of tokens with properties of special interest.

Annotation layer and value(s) Description

definition
segmentation based on 'norm'

annotation value(s):

  • fig
  • expl

author_ref
segmentation based on 'norm'

annotation value(s):

  • author
  • pron1sg
  • pron1pl
  • pron2pl
  • pron3sg
References made by authors to themselves. Values indicate the grammatical type of the reference, e.g. "pron1pl" for first person plural pronoun.

reader_ref
segmentation based on 'norm'

annotation value(s):

  • pron1pl
  • pron2pl
  • pron2sg
  • pron3sg
  • reader
  • author
References made by authors to the reader. Values indicate the grammatical type of the reference, e.g. "pron2sg" for second person singular pronoun.

plant
segmentation based on 'norm'

annotation value(s):

  • pl
Naming of a plant

property
segmentation based on 'norm'

annotation value(s):

  • appearance
  • cultivation
  • effect
  • preparation
  • smell
  • taste
Description of properties like appearance, smell, etc.

name
segmentation based on 'norm'

annotation value(s):

  • name
A proper name (annotated only in some documents).

name_type
segmentation based on 'norm'

annotation value(s):

  • flower
  • gardener
  • herb
  • person
  • plant
  • publisher
  • scholar
  • tree
The type of proper name (e.g. "person", "herb").

reference

annotation value(s):

  • String
This unsystematic layer is for referencing interpretations of all kinds.

 

Metadata

These annotations are loosely based on the TEI P5 guidelines. Furthermore, you can find the complete corpus metadata in TEI P5 here: HANDLE ID. All metadata are annotated for each document.

Annotation layer and value(s) Description

author

annotation value(s):

  • String
  • NA
Name of the author (if known).

bibl

annotation value(s):

  • String
Full bibliographical entry for the source including the page numbers annotated in the corpus.

date

annotation value(s):

  • Integer
Date of publication, usually just the year (e.g. "1722").

publisher

annotation value(s):

  • String
  • NA
Publisher of the document (if known).

place

annotation value(s):

  • String
  • NA
Publication place of the document.

title

annotation value(s):

  • String
Title of the work the document was extracted from.

translator

annotation value(s):

  • String
  • NA
Translator of the text, if existing.

trans_from

annotation value(s):

  • it
  • lat
  • NA
Language from which the text was translated.

editor

annotation value(s):

  • String
  • NA
Editor of the text, if known.

version

annotation value(s):

  • String
Version of the corpus.

edition_first

annotation value(s):

  • yes
  • no
yes: first edition of the text; no: not the first edition of the text.

issue

annotation value(s):

  • Integer
  • NA
Volume of the text, if known.

maintopic

annotation value(s):

  • science
  • non-science
science: the text is about scientific topics; non-science: the text is about everyday topics.

topic

annotation value(s):

  • Al
  • As
  • B
  • G
  • K
  • L
  • M
  • R
One or more topics per text are given. The value is additive, with the abbreviations in alphabetical order. Al: alchemy, As: astronomy, B: botany, G: gardening, K: kitchen, L: linguistics, M: medicine, R: religion. Example values: "B", "BM" or "BKM".

register

annotation value(s):

  • herbology
Register of the text: Herbology.

lingualism

annotation value(s):

  • monoling
  • multiling
multiling: the text is multilingual, i.e. whole paragraphs are written in a language other than German (single translations of specialist terms do not count); monoling: the text is monolingual.

orig_date

annotation value(s):

  • Integer
  • NA
If "edition_first" is "no" for a text, the original date of publication is given here (if known).

orig_place

annotation value(s):

  • String
  • NA
If "edition_first" is "no" for a text, the original place of publication is given here (if known).

repository

annotation value(s):

  • URL
URL to the repository where you can find the facsimile of the text.

lang_type

annotation value(s):

  • fnhd
  • nhd
The language type is given. fnhd: Early New High German; nhd: New High German.

lang_area

annotation value(s):

  • md
  • obd
  • NA
The language area is given. md: Central German (Mitteldeutsch); obd: Upper German (Oberdeutsch). If a text is a later and more standardised one, the value "NA" is given.

text_type

annotation value(s):

  • prose
  • lyric
  • mixed
Describes the general composition of the text. prose: the text is prose; lyric: the text is poetic; mixed: the text is partly poetic and partly prose.

lyric_type

annotation value(s):

  • end_rhyme
  • meter
  • rhyme_meter
  • NA
If "text_type" has the value "lyric" or "mixed", the specific poetic elements used are given here. end_rhyme: end rhyme; meter: meter.

preface

annotation value(s):

  • yes
  • no
yes: a preface is transcribed in the document; no: no preface is transcribed in the document.

wormwood

annotation value(s):

  • yes
  • no
yes: there is a paragraph about the topic "Wermut" in this document; no: there is no paragraph about the topic "Wermut" in this document.

herb_sorting

annotation value(s):

  • yes
  • no
yes: the document is a collection of herbal monographs, which means that different herbs are described in an ordered selection; no: the document is not a collection of herbal monographs.

korpusdokumentation

annotation value(s):

  • URL
URL to the corpus documentation.
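
The additive "topic" values described above (e.g. "BM", "BKM") can be split into their abbreviations and rebuilt with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the abbreviation table is copied from the description of the topic layer.

```python
import re

TOPICS = {
    "Al": "alchemy", "As": "astronomy", "B": "botany", "G": "gardening",
    "K": "kitchen", "L": "linguistics", "M": "medicine", "R": "religion",
}

# Two-letter abbreviations come first so that "Al" and "As" are matched as units.
_CODE = re.compile("|".join(sorted(TOPICS, key=len, reverse=True)))

def decode_topic(value):
    """Split a topic value such as 'BKM' into a list of topic names."""
    codes = _CODE.findall(value)
    if "".join(codes) != value:
        raise ValueError(f"unexpected topic value: {value!r}")
    return [TOPICS[code] for code in codes]

def encode_topic(codes):
    """Build a topic value with the abbreviations in alphabetical order."""
    return "".join(sorted(codes))

print(decode_topic("BKM"))
print(encode_topic(["M", "B", "K"]))
```

Rejecting values that do not decompose cleanly makes typing errors in the metadata visible instead of silently dropping letters.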