clean.py
Code to normalize special characters used in old German texts. Takes corpus files
in txt format and writes out those files with the following non-standard
characters replaced:
'⸗' to '-',
'ſ' to 's',
'å' to 'a',
'ů' to 'u',
'o̊' to 'o',
'æ' to 'ae',
'Æ' to 'AE',
'œ' to 'oe',
'Œ' to 'OE',
'aͤ' to 'ä',
'oͤ' to 'ö',
'uͤ' to 'ü',
'vͤ' to 'ü',
'Aͤ' to 'Ä',
'Oͤ' to 'Ö',
'Uͤ' to 'Ü',
'Vͤ' to 'Ü',
'ñ' to 'nn',
'n̄n' to 'nn',
'n̄' to 'nn',
'ñ' to 'nn',
'ñn' to 'nn',
'm̄m' to 'mm',
'm̄' to 'mm',
'm̃m' to 'mm',
'm̃' to 'mm',
'.*_.*' to 'unknown' ('_' represents an unreadable character),
'€' to 'der', ('€' represents 'digit' (der))
'$' to 'us', ('$' represents a Fraktur character that can be rendered as 'us') // since V2
'í' to 'i', // since V2
'˖' to ':', // since V2
'ʒ' to 'z', // since V2
'ȝ' to 'z', // since V2
'v̂' to 'ü', // since V2
'ű' to 'ü', // since V2
'ͤa' to 'ä', // since V2
'äͤ' to 'ä' // since V2
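
The replacement table above can be applied with plain string substitution. The following is a minimal sketch, not the code of the script itself: the names REPLACEMENTS and normalize are hypothetical, only an excerpt of the table is included, and the sketch is written for Python 3 although the script targets Python 2. Code points are spelled as escapes so the combining characters stay unambiguous.

```python
# Excerpt of the table above (assumed layout; the real script may differ).
REPLACEMENTS = {
    "\u2e17": "-",             # double oblique hyphen
    "\u017f": "s",             # long s
    "\u00e6": "ae",            # ae ligature
    "a\u0364": "\u00e4",       # a + combining small e -> a umlaut
    "o\u0364": "\u00f6",       # o + combining small e -> o umlaut
    "u\u0364": "\u00fc",       # u + combining small e -> u umlaut
    "n\u0304n": "nn",          # n + macron followed by n
    "n\u0304": "nn",           # n + macron on its own
}


def normalize(token):
    """Return the normalized form of a single token."""
    # Rule '.*_.*': any token containing '_' is unreadable as a whole.
    if "_" in token:
        return "unknown"
    # Apply longer source sequences first so that, for example,
    # n + macron + n becomes 'nn' and not 'nnn'.
    for source in sorted(REPLACEMENTS, key=len, reverse=True):
        token = token.replace(source, REPLACEMENTS[source])
    return token
```

Sorting by length is a design choice of this sketch: the table contains pairs where one source sequence is a prefix of another, so the order of application changes the result.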
The new documents in ridges-V4 place new demands on the clean tier with regard to vowels with macrons. Normalizing those characters became unpredictable, even when the context is taken into account. We therefore replace each token containing a vowel with a macron with every potential form of that token, separated by '|' (for example, 'auſzwēdig' becomes 'auszwemdig|auszwendig').
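
The macron expansion can be sketched as follows. This is an illustration under assumptions, not the script's implementation: the function name expand_macrons is hypothetical, the candidate letters 'm' and 'n' are inferred from the example above, and the sketch handles only the macron step (the other replacements, such as the long s, are assumed to be applied separately). It is written for Python 3.

```python
import itertools
import unicodedata

MACRON = "\u0304"          # combining macron
VOWELS = "aeiou"


def expand_macrons(token):
    """Return all candidate readings of a token, joined by '|'."""
    # NFD splits precomposed letters such as e-with-macron into
    # base letter + combining macron, so both encodings are handled.
    decomposed = unicodedata.normalize("NFD", token)
    slots = []
    for ch in decomposed:
        if ch == MACRON and slots and slots[-1][0].lower() in VOWELS:
            # A macron on a vowel may stand for a following 'm' or 'n'.
            slots.append(["m", "n"])
        else:
            slots.append([ch])
    variants = ("".join(combo) for combo in itertools.product(*slots))
    # Recompose so that umlauts etc. come back as single code points.
    return "|".join(unicodedata.normalize("NFC", v) for v in variants)
```

A token with k macron vowels yields 2**k variants in this sketch; a token without any macron vowel is returned unchanged.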
Furthermore, the script joins tokens that were separated by line breaks and marked with '-' or '⸗'.
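
Joining tokens across line breaks might look like the sketch below. It assumes, hypothetically, that the corpus has already been read into pairs of (starts_new_line, token), where starts_new_line is True for tokens carrying an 'lb' annotation. The name join_broken_tokens is not taken from the script, and how the script lays out the joined token in its output is not covered here. Written for Python 3.

```python
HYPHENS = ("-", "\u2e17")   # '-' and the double oblique hyphen


def join_broken_tokens(rows):
    """Merge a hyphen-final token with the first token of the next line."""
    result = []
    for starts_new_line, token in rows:
        previous = result[-1] if result else ""
        # Only join when the previous token ends in a hyphen mark and is
        # more than the mark itself, and the current token opens a new line.
        if starts_new_line and len(previous) > 1 and previous.endswith(HYPHENS):
            result[-1] = previous[:-1] + token
        else:
            result.append(token)
    return result
```

A trailing hyphen that is not followed by a line break is left untouched in this sketch, so hyphens inside a line are not removed.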
----------------------------------------------------------------------------------------------------
README Contents:
Usage
Required Software
Input/Output Format
----------------------------------------------------------------------------------------------------
Usage:
$ ./cleanV2.py /path/to/input_file.txt /path/to/output_file_stem.txt
----------------------------------------------------------------------------------------------------
Required Software:
+ Python >= 2.3, < 3.0
----------------------------------------------------------------------------------------------------
Input/Output Format:
The input corpus needs to be in txt format. Each line contains a line break span-annotation and one "dipl"
token (see the guidelines for version 3.0 of Ridges Herbology) separated by a single tab character. An
example:
lb	dipl	# tier term
lb	Dies
	ist
	ein
	Form-
lb	bsp
	.
Quotation marks require the escape character (\) or the predefined XML entity (&quot;); otherwise formatting issues may occur.
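
Reading the two-column input can be sketched as below. This is an assumption-laden illustration: the function name parse_rows is hypothetical, the line break annotation is assumed to be the literal string 'lb' in the first column, and header or comment lines are not handled. Written for Python 3.

```python
def parse_rows(lines):
    """Turn tab-separated input lines into (starts_new_line, token) pairs."""
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue                      # skip empty lines
        annotation, tab, token = line.partition("\t")
        if not tab:
            # No tab found: treat the whole line as a token without annotation.
            annotation, token = "", line
        rows.append((annotation == "lb", token))
    return rows
```

For example, the lines "lb<TAB>Dies" and "<TAB>ist" would yield (True, 'Dies') and (False, 'ist').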
One output file is created for each input file. The output file contains the annotation tiers of the input file
with the replacements outlined above.
----------------------------------------------------------------------------------------------------
Author: Vivian Voigt