Sprach- und literaturwissenschaftliche Fakultät - Korpuslinguistik und Morphologie

cleanREADME.txt
clean-skript.py
version 3.0

Code to normalize special characters used in old German texts. Takes corpus files
in txt format and writes them out with the following non-standard characters
replaced:

'⸗' to '-',
'ſ' to 's',
'å' to 'a',
'ů' to 'u',
'o̊' to 'o',
'æ' to 'ae',
'Æ' to 'AE',
'œ' to 'oe',
'Œ' to 'OE',
'aͤ' to 'ä',
'oͤ' to 'ö',
'uͤ' to 'ü',
'vͤ' to 'ü',
'Aͤ' to 'Ä',
'Oͤ' to 'Ö',
'Uͤ' to 'Ü',
'Vͤ' to 'Ü',
'ñ' to 'nn',
'ñ' to 'nn',
'ñn' to 'nn',
'm̃m' to 'mm',
'm̃' to 'mm',
'.*_.*' to 'unknown' ('_' represents an unreadable character),
'˖' to ':', // since V2
'v̂' to 'ü', // since V2
'ͤa' to 'ä', // since V2
'äͤ' to 'ä', // since V2
'oͤ' to 'ö', // since V2.1.3
'uͤ' to 'ü', // since V2.1.3
'vͤ' to 'ü', // since V2.1.3
'Aͤ' to 'Ä', // since V2.1.3
'Oͤ' to 'Ö', // since V2.1.3
'Uͤ' to 'Ü', // since V2.1.3
'Vͤ' to 'Ü', // since V2.1.3
'˖' to ' to ', // since V2.1.3
'ʒ' to 'z', // since V2.1.3
'ȝ' to 'z', // since V2.1.3
'v̂' to 'ü', // since V2.1.3
'o̊' to 'o', // since V2.1.3
'oͦ' to 'o', // since V2.1.3
'ꝰ' to 'us', // since V2.1.3
'ꝝ' to 'rum', // since V2.1.3
'd̉' to 'der', // since V2.1.3
'v̉' to 'ü', // since V2.1.3
'℞' to 'recipe', // since V2.1.3
'℔' to 'libra', // since V2.1.3
'℥' to 'uncia', // since V2.1.3
'℈' to 'scrupel', // since V2.1.3
'ÿ' to 'y', // since V2.1.3
'ꝰ' to 'us', // since V2.2
'ꝝ' to 'rum' // since V2.2
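
The list above amounts to a lookup table that is applied to every token. The following is a minimal Python 3 sketch of that idea, not the actual code of clean-skript.py. The names REPLACEMENTS and normalize_token are hypothetical, only a few entries of the table are shown, the code points are one reading of the characters listed above, and trying longer sequences first is an assumption made here so that 'ñn' is handled before 'ñ'.

```python
# Hypothetical excerpt of the replacement table (assumed code points).
REPLACEMENTS = {
    "\u2e17": "-",        # '⸗' double oblique hyphen
    "\u017f": "s",        # 'ſ' long s
    "\u00e6": "ae",       # 'æ'
    "a\u0364": "\u00e4",  # a + combining small e -> 'ä'
    "o\u0364": "\u00f6",  # o + combining small e -> 'ö'
    "u\u0364": "\u00fc",  # u + combining small e -> 'ü'
    "\u00f1n": "nn",      # 'ñn'
    "\u00f1": "nn",       # 'ñ'
    "\ua770": "us",       # 'ꝰ'
    "\ua75d": "rum",      # 'ꝝ'
}


def normalize_token(token):
    """Return the token with the table's replacements applied."""
    # Rule '.*_.*' to 'unknown': '_' marks an unreadable character.
    if "_" in token:
        return "unknown"
    # Assumption: longer source sequences are tried first, so that
    # 'ñn' becomes 'nn' and not 'nnn'.
    for source in sorted(REPLACEMENTS, key=len, reverse=True):
        token = token.replace(source, REPLACEMENTS[source])
    return token
```

With this sketch, normalize_token("wa\u017f\u017fer") gives "wasser".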



The documents in ridges-V4/5/6 put new demands on the clean tier regarding vowels with tildes. Normalizing those characters became unpredictable, even when the context is taken into account. We therefore decided to replace each token containing vowels with macrons by all potential forms of that token, separated by '|' (for example, 'auſzwēdig' becomes 'auszwemdig|auszwendig').

Tokens containing more than two tildes must be normalized manually. For these tokens, the script prints the affected line and token to the terminal.
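
The expansion into alternative forms can be pictured as a Cartesian product over the two possible readings of each marked vowel. The sketch below is written in Python 3 and is not taken from clean-skript.py. The names MACRON_VOWELS and expand_macrons are hypothetical, the readings "vowel + m" and "vowel + n" are inferred from the single example given above, only macron vowels are covered, and returning None for tokens that need manual treatment is an assumption (the script itself prints them to the terminal).

```python
import unicodedata
from itertools import product

# Hypothetical mapping from macron vowels to their base vowels.
MACRON_VOWELS = {
    "\u0101": "a",  # 'ā'
    "\u0113": "e",  # 'ē'
    "\u012b": "i",  # 'ī'
    "\u014d": "o",  # 'ō'
    "\u016b": "u",  # 'ū'
}


def expand_macrons(token, max_marks=2):
    """Return all potential forms joined by '|', or None if the
    token has more than max_marks marked vowels."""
    # Compose vowel + combining macron into a single character first.
    token = unicodedata.normalize("NFC", token)
    marks = sum(1 for ch in token if ch in MACRON_VOWELS)
    if marks > max_marks:
        return None  # left for manual normalization
    parts = []
    for ch in token:
        if ch in MACRON_VOWELS:
            base = MACRON_VOWELS[ch]
            parts.append((base + "m", base + "n"))
        else:
            parts.append((ch,))
    return "|".join("".join(combo) for combo in product(*parts))
```

Applied to "auszwēdig" (after 'ſ' has been replaced by 's'), this sketch yields "auszwemdig|auszwendig". A token with two marked vowels yields four forms.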

Furthermore, the script joins tokens that were split across line breaks and marked with '-' or '⸗'.
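
As an illustration of this joining step, here is a minimal Python 3 sketch. It is not the script's own implementation: the names HYPHENS and join_split_tokens are hypothetical, the sketch works on a plain list of tokens, and how the actual script keeps the joined token aligned with the original rows is not covered here.

```python
# Assumed line-break markers: '-' and '⸗' (U+2E17).
HYPHENS = ("-", "\u2e17")


def join_split_tokens(tokens):
    """Merge each token ending in a line-break marker with its successor."""
    joined = []
    pending = ""
    last_index = len(tokens) - 1
    for index, token in enumerate(tokens):
        # A bare marker or a marker on the final token is left untouched.
        if len(token) > 1 and token.endswith(HYPHENS) and index != last_index:
            pending += token[:-1]
        else:
            joined.append(pending + token)
            pending = ""
    return joined
```

For the tokens "ein", "Form-", "bsp", "." this sketch returns "ein", "Formbsp", ".".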


----------------------------------------------------------------------------------------------------

README Contents:

Usage
Required Software
Input/Output Format

----------------------------------------------------------------------------------------------------


Usage:

$ ./clean-skript_V3.py /path/to/input_file.txt /path/to/output_file_stem.txt


----------------------------------------------------------------------------------------------------


Required Software:

+ python >= 2.3


----------------------------------------------------------------------------------------------------


Input/Output Format:


The input corpus needs to be in txt format. Each line contains a line break span-annotation and one "dipl"
token (see the guidelines for version 3.0 of Ridges Herbology) separated by a single tab character. An
example:

lb	dipl	# tier term
lb	Dies
	ist
	ein
	Form-
lb	bsp
	.
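
Assuming the two columns are separated by a single tab as described, and that lines without a line break annotation start with an empty first column, a line of the input can be split as in the following Python 3 sketch. The function name parse_line is hypothetical and the sketch is not part of clean-skript.py.

```python
def parse_line(line):
    """Split one input line into (lb, dipl) at the first tab."""
    lb, _, dipl = line.rstrip("\n").partition("\t")
    return lb, dipl
```

Here parse_line("lb\tDies\n") returns ("lb", "Dies"), and a line without the annotation, such as "\tist\n", returns ("", "ist").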

Please note: Quotation marks must be escaped with the escape character (\) or written as the predefined XML entity (&quot;); otherwise formatting problems may occur.

One output file is created for each input file. The output file contains the annotation tiers of the input file
with the replacements outlined above.



----------------------------------------------------------------------------------------------------



Main author: Vivian Voigt
Further developments: Laura Perlitz