Documentation Version 1.0
Corpus Pipeline
Corpora are compiled in several steps:
- Downloading a facsimile, usually from Google Books.
- Correcting the OCR output or transcribing manually, and encoding the structure in TEI XML.
- Tokenization, part-of-speech tagging, and lemmatization with TreeTagger.
- Further manual annotation in MS Excel.
- Merging the annotations and exporting the corpus to persistent formats and to the search and visualization tool ANNIS.
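The tagging step can be illustrated with a minimal sketch that reads TreeTagger output into the three token layers `tok`, `pos`, and `lemma`. It assumes the three-column, tab-separated output that TreeTagger produces when run with the `-token` and `-lemma` options. The names `TaggedToken` and `parse_treetagger` and the sample lines are made up for illustration and are not part of the project's actual tooling.

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    tok: str    # word form as passed to the tagger
    pos: str    # STTS part-of-speech tag
    lemma: str  # lemma proposed by the tagger


def parse_treetagger(output: str) -> list[TaggedToken]:
    """Parse tab-separated TreeTagger output (word, tag, lemma per line)."""
    tokens = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            # Skip blank lines and anything that lacks the tag and lemma
            # columns (for example markup passed through unchanged).
            continue
        tokens.append(TaggedToken(*parts))
    return tokens


# Hypothetical sample output, not taken from the corpus.
sample = "Die\tART\tdie\nKräuter\tNN\tKraut\nwachsen\tVVFIN\twachsen\n"
tokens = parse_treetagger(sample)
```

In a workflow like the one described above, the parsed tokens would then be corrected by hand before further annotation.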
Corpus Design
To ensure comparability, we select texts from a single scientific discipline that is ideally represented in a similar way throughout the period under study. For the first RIDGES corpus we chose the field of herbology (Kräuterkunde). The period under study was divided into 30-year segments, currently with one sample per segment. Because processing older texts is more laborious, the length of the texts varies. Each document comprises roughly 4,000 to 10,000 word forms.
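The division into 30-year segments can be sketched as a simple binning function. The start year of the first segment is not stated here, so the value `1500` used below is purely an assumption for illustration, as is the function name `period_for`.

```python
def period_for(year: int, start: int, length: int = 30) -> tuple[int, int]:
    """Return the (first_year, last_year) of the segment containing `year`.

    `start` is the first year of the earliest segment (an assumed value here),
    `length` is the segment length in years.
    """
    if year < start:
        raise ValueError(f"{year} lies before the first segment ({start})")
    index = (year - start) // length
    first = start + index * length
    return first, first + length - 1


# With an assumed start of 1500, a text dated 1722 falls into 1710-1739.
segment = period_for(1722, start=1500)
```

With one sample per segment, a mapping from segment to document is enough to check which segments are still empty.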
Annotation Layers
The annotation layers in the corpora are stored in a multi-layer architecture and fall into four groups.
Descriptions of the individual layers are given in the tables below.
Token Annotations
These annotations always correspond to exactly one token. Part-of-speech annotation and lemmatization were carried out with TreeTagger and corrected by hand.
PAULA/relANNIS | TEI XML | Description |
---|---|---|
tok | (plain text) | The diplomatic transcription of the word form as found in the source. Line breaks within a word are marked as in the text, usually with '='. |
norm | N/A | A normalized word form based on Modern German orthography. For words not found in Modern German, a modern orthography is assumed (e.g. beſchicht is normalized as beschieht, analogous to geschieht). |
lemma | N/A | The normalized uninflected lexicon entry for each word form, using modern orthography (again, obsolete words are also modernized, e.g. beſchicht has the lemma beschehen, analogous to geschehen). |
pos | N/A | Part-of-speech annotation using the STTS tagset for German. |
clean | N/A | Some texts may also have a partially normalized layer with consistent orthography from the relevant period, but not modernized to Modern German orthography. For example, a form with a line break like wor=den will be cleaned to worden but not normalized to modern geworden where this would now be the appropriate form. |
hyperlemma | N/A | In some cases where the use of modernized orthography is impossible or misleading, a modern semantic equivalent is given as a hyperlemma (e.g. Heümonat is hyperlemmatized as Juli, or ráß as beißend). |
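As a rough illustration of how the token layers relate to each other, the sketch below models one token as a record with one field per layer. The class `Token` and the helper `join_line_break` are hypothetical names. The `pos` value given for beſchicht is an assumption, since the table only documents its `norm` and `lemma` values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    tok: str                          # diplomatic transcription
    norm: str                         # Modern German normalization
    lemma: str                        # modernized lexicon entry
    pos: str                          # STTS tag
    clean: Optional[str] = None       # only present in some texts
    hyperlemma: Optional[str] = None  # only where norm would mislead


def join_line_break(tok: str, marker: str = "=") -> str:
    """Remove a word-internal line-break marker, as done on the clean layer."""
    return tok.replace(marker, "")


# norm and lemma follow the table above; the pos tag is assumed.
example = Token(tok="beſchicht", norm="beschieht", lemma="beschehen", pos="VVFIN")

# The clean layer joins the broken form but does not modernize it.
cleaned = join_line_break("wor=den")
```

Keeping `clean` and `hyperlemma` optional mirrors the fact that these layers exist only for some texts or tokens.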
TEI Metadata
These annotations follow the TEI P5 guidelines.
PAULA/relANNIS | TEI XML | Description |
---|---|---|
meta::author | author | Name of the author (if known). |
meta::bibl | bibl | Full bibliographical entry for the source including the page numbers annotated in the corpus. |
meta::date | date | Date of publication, usually just the year (e.g. "1722"). |
meta::publisher | publisher | Publisher of the document (if known). |
meta::pubPlace | pubPlace | Publication place of the document. |
meta::title | title | Title of the work the document was extracted from. |
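The mapping from TEI elements to `meta::` annotations in the table above could be sketched as follows, using Python's standard `xml.etree.ElementTree`. The sample header is invented, and its nesting (`titleStmt`, `publicationStmt`, `sourceDesc`) is an assumption about a typical TEI P5 header, not a description of the actual corpus files. The function name `extract_meta` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
META_FIELDS = ("author", "bibl", "date", "publisher", "pubPlace", "title")

# Invented sample header; real files may nest these elements differently.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Example Herbal</title><author>N. N.</author></titleStmt>
      <publicationStmt>
        <publisher>Example Press</publisher>
        <pubPlace>Leipzig</pubPlace>
        <date>1722</date>
      </publicationStmt>
      <sourceDesc><bibl>N. N., Example Herbal, 1722, pp. 1-10.</bibl></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>
"""


def extract_meta(tei_xml: str) -> dict[str, str]:
    """Collect the first occurrence of each metadata element in the header."""
    root = ET.fromstring(tei_xml.strip())
    header = root.find("tei:teiHeader", TEI_NS)
    meta: dict[str, str] = {}
    if header is None:
        return meta
    for name in META_FIELDS:
        element = header.find(f".//tei:{name}", TEI_NS)
        if element is not None:
            text = "".join(element.itertext()).strip()
            if text:  # unknown values (e.g. author) are simply left out
                meta[f"meta::{name}"] = text
    return meta


meta = extract_meta(SAMPLE)
```

Fields that are missing or empty in the header do not appear in the result, matching the "(if known)" wording in the table.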
Structural TEI Annotations
These annotations follow the TEI P5 guidelines.
PAULA/relANNIS | TEI XML | Description |
---|---|---|
del | del | Area deleted in original text |
unclear | unclear | Unreadable or otherwise unclear text |
atLeast | unclear@atLeast | Minimum presumed length of unclear text in characters |
atMost | unclear@atMost | Maximum presumed length of unclear text in characters |
div1 - div5 | div | A subsection of the document. Nesting depth is made explicit by the number after div in the PAULA/relANNIS version |
div1_n - div5_n | div@n | A numbered subsection (the n annotation has the section number as a value, though this may also be a letter such as A or a subsection such as 1.1) |
div1_type - div5_type | div@type | The type of section or subsection. A section can correspond to the entire "book", a "chapter", or a smaller section, including genre-specific types such as "place" (where a certain herb grows) or "form" (descriptions of a herb's form). |
figure | figure | A graphic embedded in the original document. |
figure_rend | figure@rend | Description of the rendering of the figure. |
foreign | foreign | A foreign language area. |
foreign_rend | foreign@rend | Description of the rendering of the foreign language area (e.g. fonts like Antiqua, italics) |
lang | foreign@xml:lang | The language a foreign area is written in (three-letter ISO 639 language codes). |
head | head | A heading. |
head_n | head@n | The number of a heading. |
head_rend | head@rend | Description of the rendering of the heading. |
head_type | head@type | Type of heading used, e.g. "margin" for a marginal heading. |
hi | hi | Highlighted area. |
hi_rend | hi@rend | Description of the rendering of the highlighted area. |
lb | lb | Linebreak. |
list | list | A list of items. |
list_type | list@type | The type of list used. |
item | item | Item in a list. |
name | name | A proper name (annotated only in some documents). |
name_type | name@type | The type of proper name (e.g. "person", "herb"). |
note | note | A note in the original document (e.g. footnotes). |
p | p | A paragraph. |
p_n | p@n | The number of a numbered paragraph (this may also be a letter such as A). |
p_rend | p@rend | Description of the rendering of the paragraph. |
pb | pb | Pagebreak. |
pb_ana | pb@ana | Analysis of the pagebreak (e.g. in case of apparently incorrect page numbers). |
pb_n | pb@n | The number of the page (if marked explicitly). |
pb_rend | pb@rend | Description of the rendering of the page (repeated parts of book or chapter titles, redundant confidence texts). |
quote | quote | A quotation (in some documents only). |
reason | unclear@reason | Reason for annotation of the current area (usually describes form of unclear areas). |
ref | ref | Reference to a footnote. |
ref_target | ref@target | ID of the footnote being referred to. |
ref_type | ref@type | Type of reference (e.g. a TEI "noteAnchor"). |
w | w | A word annotated with additional attributes. |
xml_id | fZ (Z is a number) | ID given to a footnote. |
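The way nested `div` elements become numbered layers (`div1` to `div5`, with `_n` and `_type` attributes) can be sketched as a recursive walk over the XML tree. This is only an illustration of the naming scheme in the table: the function name `flatten_divs` and the sample fragment are hypothetical, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Invented fragment without the TEI namespace, for brevity.
SAMPLE = """
<body>
  <div type="book">
    <div type="chapter" n="1">
      <div type="place"><p>text</p></div>
      <div type="form"/>
    </div>
  </div>
</body>
"""


def flatten_divs(elem: ET.Element, depth: int = 0) -> list[dict[str, str]]:
    """Return one record per div, with its nesting depth made explicit."""
    rows = []
    for child in elem:
        if child.tag == "div":
            level = depth + 1
            row = {"layer": f"div{level}"}
            for attr in ("n", "type"):
                if attr in child.attrib:
                    row[f"div{level}_{attr}"] = child.attrib[attr]
            rows.append(row)
            rows.extend(flatten_divs(child, level))
        else:
            # Other elements do not add a div level but may contain divs.
            rows.extend(flatten_divs(child, depth))
    return rows


rows = flatten_divs(ET.fromstring(SAMPLE.strip()))
```

For the sample, the outer "book" section becomes `div1`, the chapter `div2`, and the genre-specific sections "place" and "form" both become `div3`.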
Corpus-Specific Annotations
These annotations were developed by our students to mark spans of tokens with particular properties.
PAULA/relANNIS | TEI XML | Description |
---|---|---|
definition | N/A | A definition. |
term | N/A | A technical term. |
property | N/A | Describes a reference to properties of a herb such as effect, smell etc. |
reader_ref | N/A | References made by authors to the reader. Values indicate the grammatical type of the reference, e.g. "pron2sg" for second person singular pronoun. |
author_ref | N/A | References made by authors to themselves. Values indicate the grammatical type of the reference, e.g. "pron1pl" for first person plural pronoun. |
uncertain | N/A | Annotator uncertain of lemma and/or normalization since no equivalent could be established. |
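Because these layers mark spans of tokens rather than single tokens, one possible in-memory representation is a span with token offsets and an optional value. The sketch below is an assumption about how such data might be handled in a script, not the PAULA or relANNIS storage format. The class `Span`, the helper `covered_text`, and the sample tokens are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    layer: str                   # e.g. "term", "author_ref"
    start: int                   # index of the first covered token
    end: int                     # index after the last covered token
    value: Optional[str] = None  # e.g. "pron1pl" for author_ref


def covered_text(tokens: list[str], span: Span) -> str:
    """Return the token text covered by a span, joined with spaces."""
    return " ".join(tokens[span.start:span.end])


# Invented example sentence, not taken from the corpus.
tokens = ["wir", "nennen", "dieses", "Kraut", "Salbei"]
spans = [
    Span("author_ref", 0, 1, "pron1pl"),
    Span("term", 4, 5),
]
texts = {span.layer: covered_text(tokens, span) for span in spans}
```

Half-open offsets (`start` inclusive, `end` exclusive) keep multi-token spans such as definitions easy to slice.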