Faculty of Linguistics and Literary Studies - Corpus Linguistics and Morphology

Documentation Version 1.0

Documentation of the first version of RIDGES Herbology.

Corpus Pipeline

Corpora are compiled in several steps:

  1. Downloading a facsimile, usually from Google Books.
  2. Correcting the OCR output or transcribing the text manually, and encoding its structure in TEI XML.
  3. Tokenization, part-of-speech tagging and lemmatization with TreeTagger.
  4. Further manual annotation in MS Excel.
  5. Merging the annotations and exporting the corpus to persistent formats and to the search and visualization tool ANNIS.
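
Step 3 of the pipeline can be illustrated with a minimal sketch that reads tagger output into records. It assumes TreeTagger's usual output of one token per line with tab-separated word form, tag and lemma. The names TaggedToken and parse_treetagger_output and the sample sentence are hypothetical and not part of the RIDGES tooling.

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    tok: str    # word form as tokenized
    pos: str    # part-of-speech tag (STTS for German)
    lemma: str  # lemma proposed by the tagger


def parse_treetagger_output(text: str) -> list[TaggedToken]:
    """Parse tagger output: one token per line, three tab-separated fields."""
    tokens = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip empty lines and lines that are not token rows
        tokens.append(TaggedToken(*fields))
    return tokens


# Hypothetical sample output, not taken from the corpus.
SAMPLE = (
    "Die\tART\tdie\n"
    "Wurzel\tNN\tWurzel\n"
    "ist\tVAFIN\tsein\n"
    "bitter\tADJD\tbitter\n"
    ".\t$.\t.\n"
)

tagged = parse_treetagger_output(SAMPLE)
for token in tagged:
    print(token.tok, token.pos, token.lemma)
```

Records of this kind could then be handed to the manual correction and annotation steps that follow.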

Corpus Design

To ensure comparability, we select texts from a single scientific discipline that is ideally represented in a similar way throughout the period under investigation. For the first RIDGES corpus we chose the field of herbology. The period under investigation was divided into 30-year periods, currently with one sample per period. Because older texts take more effort to process, the length of the texts varies. Each document comprises approx. 4,000 to 10,000 word forms.
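
The 30-year sampling scheme can be made concrete with a small sketch that maps a publication year to its period. The start year of the period under investigation is not stated here, so the value 1480 below is purely an assumption for illustration, and period_for is a hypothetical helper.

```python
def period_for(year: int, start: int, width: int = 30) -> tuple[int, int]:
    """Return (first_year, last_year) of the sampling period containing `year`."""
    if year < start:
        raise ValueError(f"{year} lies before the start of the study period ({start})")
    index = (year - start) // width   # zero-based number of the period
    first = start + index * width
    return first, first + width - 1


# ASSUMPTION: 1480 is an illustrative start year, not documented for the corpus.
print(period_for(1722, start=1480))  # (1720, 1749)
```

With one sample per period, two documents that map to the same tuple would indicate a duplicate, and a tuple with no document a gap in the design.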

Annotation Layers

The annotation layers in the corpora are stored in a multi-layer architecture and fall into four groups.

  1. Token annotations
  2. TEI metadata
  3. Structural TEI annotations
  4. Corpus-specific annotations

Descriptions of the individual layers are given in the tables below.

Token Annotations

These annotations always correspond to exactly one token. Part-of-speech annotation and lemmatization were carried out with TreeTagger and corrected by hand.

PAULA/relANNIS | TEI XML | Description
tok | (plain text) | The diplomatic transcription of the word form as found in the original. Line breaks are marked as in the text, usually as '='.
norm | N/A | A normalized word form based on Modern German orthography. For words not found in Modern German, a modern orthography is assumed (e.g. beſchicht is normalized as beschieht, analogous to geschieht).
lemma | N/A | The normalized uninflected lexicon entry for each word form, using modern orthography (again, obsolete words are also modernized, e.g. beſchicht has the lemma beschehen, analogous to geschehen).
pos | N/A | Part-of-speech annotation using the STTS tagset for German.
clean | N/A | Some texts may also have a partially normalized layer with consistent orthography from the relevant period, but not modernized to Modern German orthography. For example, a form with a line break like wor=den will be cleaned to worden but not normalized to modern geworden where this would now be the appropriate form.
hyperlemma | N/A | In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent is given as a hyperlemma (e.g. Heümonat is hyperlemmatized as Juli, or ráß as beißend).
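
The token layers above can be pictured as one record per token. The following is a minimal sketch, not the corpus's actual storage format: the Token class and join_line_break are hypothetical names, and the tag VVFIN for the example is an assumption. Only the forms beſchicht, beschieht, beschehen and wor=den come from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    tok: str                          # diplomatic transcription
    norm: str                         # Modern German normalization
    lemma: str                        # modernized lemma
    pos: str                          # STTS tag
    clean: Optional[str] = None       # only present in some texts
    hyperlemma: Optional[str] = None  # only where norm/lemma would mislead


def join_line_break(tok: str, marker: str = "=") -> str:
    """Remove the line-break marker from a diplomatic form (wor=den -> worden)."""
    return tok.replace(marker, "")


# ASSUMPTION: VVFIN is an illustrative tag, not copied from the corpus.
example = Token(tok="beſchicht", norm="beschieht", lemma="beschehen", pos="VVFIN")
print(example.norm, example.lemma)
print(join_line_break("wor=den"))  # worden
```

Joining at the marker covers only the line-break part of the clean layer. A real cleaning step would need more care, since '=' may also occur as an ordinary hyphen in the source.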

TEI Metadata

These annotations follow the TEI P5 Guidelines.

PAULA/relANNIS | TEI XML | Description
meta::author | author | Name of the author (if known).
meta::bibl | bibl | Full bibliographical entry for the source including the page numbers annotated in the corpus.
meta::date | date | Date of publication, usually just the year (e.g. "1722").
meta::publisher | publisher | Publisher of the document (if known).
meta::pubPlace | pubPlace | Publication place of the document.
meta::title | title | Title of the work the document was extracted from.
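
The mapping from TEI header elements to meta:: keys can be sketched with the Python standard library. The header below is a hypothetical sample whose layout and values are assumptions, and extract_metadata is an illustrative helper rather than the converter actually used for the corpus.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
META_FIELDS = ["author", "bibl", "date", "publisher", "pubPlace", "title"]

# Hypothetical header; element placement and all values are assumptions.
SAMPLE_HEADER = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt><title>Example Herbal</title></titleStmt>
    <publicationStmt><p>Hypothetical sample</p></publicationStmt>
    <sourceDesc>
      <bibl>
        <author>Example Author</author>
        <title>Example Herbal</title>
        <pubPlace>Example Town</pubPlace>
        <publisher>Example Press</publisher>
        <date>1722</date>
      </bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>"""


def extract_metadata(header_xml: str) -> dict[str, str]:
    """Collect the first occurrence of each metadata element as a meta:: key."""
    root = ET.fromstring(header_xml)
    meta = {}
    for field in META_FIELDS:
        element = root.find(f".//tei:{field}", TEI_NS)
        if element is None:
            continue  # e.g. author or publisher unknown
        # Join nested text and collapse whitespace.
        text = " ".join("".join(element.itertext()).split())
        if text:
            meta[f"meta::{field}"] = text
    return meta


for key, value in extract_metadata(SAMPLE_HEADER).items():
    print(key, "=", value)
```

Fields marked "if known" in the table are simply absent from the result when the header does not contain them.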

Structural TEI Annotations

These annotations follow the TEI P5 Guidelines.

PAULA/relANNIS | TEI XML | Description
del | del | Area deleted in original text.
unclear | unclear | Unreadable or otherwise unclear text.
atLeast | unclear@atLeast | Minimum presumed length of unclear text in characters.
atMost | unclear@atMost | Maximum presumed length of unclear text in characters.
div1 - div5 | div | A subsection of the document. Nesting depth is made explicit by the number after div in the PAULA/relANNIS version.
div1_n - div5_n | div@n | A numbered subsection (the n annotation has the section number as a value, though this may also be a letter such as A or a subsection such as 1.1).
div1_type - div5_type | div@type | The type of section or subsection. A section can correspond to the entire "book", a "chapter" or smaller sections, including systematic types specific to the genre such as "place" (where a certain herb grows), "form" (descriptions of a herb's form) etc.
figure | figure | A graphic embedded in the original document.
figure_rend | figure@rend | Description of the rendering of the figure.
foreign | foreign | A foreign language area.
foreign_rend | foreign@rend | Description of the rendering of the foreign language area (e.g. fonts like Antiqua, italics).
lang | foreign@xml:lang | The language a foreign area is written in (three-letter ISO 639 language codes).
head | head | A heading.
head_n | head@n | The number of a heading.
head_rend | head@rend | Description of the rendering of the heading.
head_type | head@type | Type of heading used, e.g. "margin" for a marginal heading.
hi | hi | Highlighted area.
hi_rend | hi@rend | Description of the rendering of the highlighted area.
lb | lb | Line break.
list | list | A list of items.
list_type | list@type | The type of list used.
item | item | Item in a list.
name | name | A proper name (annotated only in some documents).
name_type | name@type | The type of proper name (e.g. "person", "herb").
note | note | A note in the original document (e.g. footnotes).
p | p | A paragraph.
p_n | p@n | The number of a numbered paragraph (this may also be a letter such as A).
p_rend | p@rend | Description of the rendering of the paragraph.
pb | pb | Page break.
pb_ana | pb@ana | Analysis of the page break (e.g. in case of apparently incorrect page numbers).
pb_n | pb@n | The number of the page (if marked explicitly).
pb_rend | pb@rend | Description of the rendering of the page (repeated parts of book or chapter titles, redundant confidence texts).
quote | quote | A quotation (in some documents only).
reason | unclear@reason | Reason for annotation of the current area (usually describes form of unclear areas).
ref | ref | Reference to a footnote.
ref_target | ref@target | ID of the footnote being referred to.
ref_type | ref@type | Type of reference (e.g. a TEI "noteAnchor").
w | w | A word annotated with additional attributes.
xml_id | fZ (Z is a number) | ID given to a footnote.
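
The naming scheme in this table (nesting depth appended to div, attributes appended as element_attribute, and a few attributes listed without prefix) can be sketched as a tree walk. This is an illustration of the naming convention only, assuming a namespace-free fragment for brevity. layer_names and the sample fragment are hypothetical and do not reproduce the actual conversion to PAULA/relANNIS.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Attributes that the table lists without an element prefix.
UNPREFIXED = {
    ("foreign", XML_LANG): "lang",
    ("unclear", "atLeast"): "atLeast",
    ("unclear", "atMost"): "atMost",
    ("unclear", "reason"): "reason",
}

# Hypothetical fragment, not taken from the corpus.
SAMPLE = (
    '<div type="book"><div type="chapter" n="1">'
    '<head rend="bold">Von der Wurzel</head>'
    '<p>Text <foreign xml:lang="lat" rend="antiqua">radix</foreign></p>'
    "</div></div>"
)


def layer_names(element, div_depth=0):
    """Yield (layer_name, value) pairs for an element and its descendants.

    Elements themselves are yielded with value None; attributes carry
    their attribute value.
    """
    tag = element.tag
    if tag == "div":
        div_depth += 1
        base = f"div{div_depth}"  # nesting depth becomes part of the name
    else:
        base = tag
    yield base, None
    for attr, value in element.attrib.items():
        yield UNPREFIXED.get((tag, attr), f"{base}_{attr}"), value
    for child in element:
        yield from layer_names(child, div_depth)


for name, value in layer_names(ET.fromstring(SAMPLE)):
    print(name, value)
```

The sketch does not cover the xml_id layer or enforce the five-level limit on div nesting.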

Corpus-Specific Annotations

These annotations were developed by our students to mark spans of tokens with particular properties.

PAULA/relANNIS | TEI XML | Description
definition | N/A | A definition.
term | N/A | A technical term.
property | N/A | Describes a reference to properties of a herb such as effect, smell etc.
reader_ref | N/A | References made by authors to the reader. Values indicate the grammatical type of the reference, e.g. "pron2sg" for second person singular pronoun.
author_ref | N/A | References made by authors to themselves. Values indicate the grammatical type of the reference, e.g. "pron1pl" for first person plural pronoun.
uncertain | N/A | Annotator uncertain of lemma and/or normalization since no equivalent could be established.
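
Because these layers mark spans of tokens rather than single tokens, they can be pictured as ranges over token positions. The sketch below is a simplified in-memory model under that assumption. The Span class, the helper functions and the sample sentence are hypothetical; only the layer names and the value "pron2sg" come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    layer: str       # e.g. "term", "reader_ref"
    start: int       # index of the first covered token
    end: int         # index of the last covered token (inclusive)
    value: str = ""  # e.g. "pron2sg" for reader_ref


def covered_text(tokens: list[str], span: Span) -> str:
    """Return the token sequence a span covers, joined by spaces."""
    return " ".join(tokens[span.start : span.end + 1])


def spans_on_layer(spans: list[Span], layer: str) -> list[Span]:
    """Select all spans that belong to one annotation layer."""
    return [span for span in spans if span.layer == layer]


# Hypothetical tokens and annotations, not taken from the corpus.
tokens = ["Du", "sollst", "die", "Wurzel", "sieden"]
spans = [
    Span("reader_ref", 0, 0, "pron2sg"),
    Span("term", 3, 3),
]

for span in spans_on_layer(spans, "reader_ref"):
    print(span.value, "->", covered_text(tokens, span))
```

Keeping spans separate from the token records means a span can cover several tokens and spans on different layers can overlap.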