Sprach- und literaturwissenschaftliche Fakultät - Korpuslinguistik und Morphologie

clean-readme.txt

clean.py

Script that replaces special characters used in old German texts. It reads corpus files
in plain-text (.txt) format and writes them out again with the following
non-standard characters replaced:

'ã' to 'an',
'ā' to 'an',
'ā' to 'an',
'ñ' to 'nn',
'n̄' to 'nn',
'Ũ' to 'Um',
'õ' to 'on',
'Õ' to 'On',
'ẽ' to 'en',
'ē' to 'en',
'ē' to 'en',
'Ẽ' to 'En',
'ĩ' to 'in',
'Ĩ' to 'In',
'ſ' to 's',
'ů' to 'u',
'€' to 'der', ('€' represents 'digit' (der))
'ů' to 'u',
'⸗' to '-',
'æ' to 'ae',
'Æ' to 'AE',
'œ' to 'oe',
'Œ' to 'OE',
'å' to 'a',
'aͤ' to 'ä',
'oͤ' to 'ö',
'uͤ' to 'ü',
'vͤ' to 'ü',
'Aͤ' to 'Ä',
'Oͤ' to 'Ö',
'ñ' to 'nn',
'Uͤ' to 'Ü',
'Vͤ' to 'Ü',
'ñ' to 'nn',
'ũg' to 'ung',
'ũ' to 'um',
'n̄n' to 'nn',
'n̄' to 'nn',
'm̄m' to 'mm',
'm̄' to 'mm',
'_' to 'unknown' ('_' represents an unreadable character)
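
The following is a minimal sketch of how such a replacement table can be applied. It is not
the code of clean.py. It is written for Python 3, covers only a handful of the entries above,
and the names REPLACEMENTS and replace_special are made up for illustration. The point it shows
is that longer sequences (e.g. 'ũg') have to be tried before their shorter prefixes (e.g. 'ũ').

```python
# Illustrative subset of the table above (hypothetical names, not taken from clean.py).
# Order matters: a longer sequence must come before any shorter prefix of it.
REPLACEMENTS = [
    ("\u0169g", "ung"),      # 'ũg' before 'ũ'
    ("\u0169", "um"),        # 'ũ'
    ("n\u0304n", "nn"),      # 'n̄n' (n + combining macron + n) before 'n̄'
    ("n\u0304", "nn"),       # 'n̄'
    ("\u017f", "s"),         # 'ſ' (long s)
    ("a\u0364", "\u00e4"),   # 'aͤ' (a + combining small e) to 'ä'
    ("\u2e17", "-"),         # '⸗' (double oblique hyphen)
]


def replace_special(text):
    """Apply every replacement in table order and return the cleaned text."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text
```

Because the pairs are applied in list order, 'Ordnũg' becomes 'Ordnung' rather than 'Ordnumg'.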

Furthermore, the script joins tokens that were split across line breaks and marked with '-' or '⸗'.
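
A minimal sketch of this joining step, again for Python 3 and not taken from clean.py. It
assumes the text is already available as a list of lines, each line being a list of tokens.
The function name join_broken_tokens and the decision to attach the joined token to the
following line are choices made for this sketch only.

```python
HYPHENS = ("-", "\u2e17")  # '-' and '⸗'


def join_broken_tokens(lines):
    """Join a line-final token ending in '-' or '⸗' with the first token of the next line.

    lines: list of lists of tokens, one inner list per line of text.
    """
    result = []
    carry = ""  # line-final token (marker included) waiting for its second half
    for tokens in lines:
        tokens = list(tokens)  # do not modify the caller's lists
        if carry and tokens:
            # Drop the one-character marker and glue the halves together.
            tokens[0] = carry[:-1] + tokens[0]
            carry = ""
        # A token consisting only of the marker is treated as a plain dash.
        if tokens and len(tokens[-1]) > 1 and tokens[-1].endswith(HYPHENS):
            carry = tokens.pop()
        result.append(tokens)
    if carry:
        # No following token was found: keep the token unchanged.
        result[-1].append(carry)
    return result
```

With the lines ['Dies', 'ist', 'ein', 'Form-'] and ['bsp', '.'], the sketch yields
['Dies', 'ist', 'ein'] and ['Formbsp', '.'].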

----------------------------------------------------------------------------------------------------

README Contents:

Usage
Required Software
Input/Output Format

----------------------------------------------------------------------------------------------------

Usage

$ ./clean.py /path/to/input_file.txt /path/to/output_file_stem.txt

----------------------------------------------------------------------------------------------------

Required Software

+ Python >= 2.3, < 3.0

----------------------------------------------------------------------------------------------------

Input/Output Format

The input corpus must be in plain-text (.txt) format. Each line contains a line-break span annotation and one
"dipl" token (see the guidelines for version 3.0 of Ridges Herbology), separated by a single tab character. An
example:

lb dipl ... # tier term
lb Dies ...
ist ...
ein ...
Form- ...
lb bsp ...
.

Quotation marks must be escaped with a backslash (\); otherwise formatting problems may occur.

One output file is created for each input file. The output file contains the annotation tiers of the input file
with the replacements outlined above.
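
A minimal sketch of how a tab-separated file of this kind can be processed so that only one
column is changed and all other tiers are written out untouched. It is written for Python 3
and is not the code of clean.py. The function name transform_column is hypothetical, and the
position of the "dipl" token in the second column (index 1) is an assumption based on the
example above.

```python
def transform_column(lines, func, column=1):
    """Apply func to one tab-separated column of every line.

    lines:  iterable of text lines (a trailing newline is stripped)
    func:   function that maps the old cell content to the new one
    column: zero-based index of the column to change (assumed: 1 for "dipl")
    """
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column:
            # Only the selected column is rewritten; other tiers stay as they are.
            fields[column] = func(fields[column])
        out.append("\t".join(fields))
    return out
```

Any replacement function can be passed as func, for example one that maps 'ſ' to 's'. A
header line, if present, is passed through func like every other line.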

----------------------------------------------------------------------------------------------------

Authors: Vivian Voigt, Benjamin Weißenfels