Sprach- und literaturwissenschaftliche Fakultät - Korpuslinguistik und Morphologie

cleanV2README.txt

clean.py

Code to normalize special characters used in old German texts. The script takes corpus files in txt format and writes those files out again with the following non-standard characters replaced:

'⸗' to '-',
'ſ' to 's',
'å' to 'a',
'ů' to 'u',
'o̊' to 'o',
'æ' to 'ae',
'Æ' to 'AE',
'œ' to 'oe',
'Œ' to 'OE',
'aͤ' to 'ä',
'oͤ' to 'ö',
'uͤ' to 'ü',
'vͤ' to 'ü',
'Aͤ' to 'Ä',
'Oͤ' to 'Ö',
'Uͤ' to 'Ü',
'Vͤ' to 'Ü',
'ñ' to 'nn',
'n̄n' to 'nn',
'n̄' to 'nn',
'ñ' to 'nn',
'ñn' to 'nn',
'm̄m' to 'mm',
'm̄' to 'mm',
'm̃m' to 'mm',
'm̃' to 'mm',
'.*_.*' to 'unknown', ('_' represents a non-readable character)
'€' to 'der', ('€' represents 'digit' (der))
'$' to 'us', ('$' represents a Fraktur character translatable as 'us') // since V2
'í' to 'i', // since V2
'˖' to ':', // since V2
'ʒ' to 'z', // since V2
'ȝ' to 'z', // since V2
'v̂' to 'ü', // since V2
'ű' to 'ü', // since V2
'ͤa' to 'ä', // since V2
'äͤ' to 'ä' // since V2

The new documents in ridges-V4 place new demands on the clean tier with regard to vowels with macrons. The normalization of those characters became unpredictable, even when the context is taken into account. For that reason, each token containing a vowel with a macron is replaced by every potential form of that token, separated by '|' (for example: 'auſzwēdig' to 'auszwemdig|auszwendig').

Please note: If a token contains more than two macrons, a manual edit is needed. The line concerned (line number and token) will be printed to the terminal.

Furthermore, the script contracts tokens that were separated by line breaks and marked with '-' or '⸗'.
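The replacement and macron rules described above can be sketched roughly as follows. This is only an illustration written for Python 3, not the code of cleanV2.py itself (which targets Python 2). The names REPLACEMENTS, replace_characters, expand_macrons and normalize_token are hypothetical, the table contains only a small subset of the list, and raising a ValueError for more than two macrons is an assumption that stands in for the message the script prints to the terminal.

```python
import itertools
import unicodedata

MACRON = "\u0304"        # COMBINING MACRON
VOWELS = "aeiouAEIOU"

# Illustrative subset of the replacement table (NOT the complete list).
REPLACEMENTS = {
    "\u2e17": "-",        # double oblique hyphen
    "\u017f": "s",        # long s
    "\u00e6": "ae",       # ae ligature
    "a\u0364": "\u00e4",  # a + combining small e -> a-umlaut
    "o\u0364": "\u00f6",  # o + combining small e -> o-umlaut
    "u\u0364": "\u00fc",  # u + combining small e -> u-umlaut
    "n\u0304n": "nn",     # n + macron + n
    "n\u0304": "nn",      # n + macron
}
# Longest keys first, so that "n + macron + n" is matched before "n + macron".
ORDERED_KEYS = sorted(REPLACEMENTS, key=len, reverse=True)


def replace_characters(token):
    """Apply the plain character replacements to one token."""
    if "_" in token:      # '_' marks a non-readable character
        return "unknown"
    for key in ORDERED_KEYS:
        token = token.replace(key, REPLACEMENTS[key])
    return token


def expand_macrons(token):
    """Expand each vowel + macron to vowel + 'm' and vowel + 'n'.

    All resulting readings are joined with '|'. Tokens with more than
    two macrons raise ValueError (assumption: the real script reports
    such lines on the terminal for manual editing instead).
    """
    decomposed = unicodedata.normalize("NFD", token)
    parts = []            # one tuple of alternatives per segment
    macrons = 0
    i = 0
    while i < len(decomposed):
        ch = decomposed[i]
        if ch in VOWELS and decomposed[i + 1:i + 2] == MACRON:
            parts.append((ch + "m", ch + "n"))
            macrons += 1
            i += 2
        else:
            parts.append((ch,))
            i += 1
    if macrons > 2:
        raise ValueError("manual edit needed: %r" % token)
    forms = ["".join(combo) for combo in itertools.product(*parts)]
    return unicodedata.normalize("NFC", "|".join(forms))


def normalize_token(token):
    """Character replacements first, then macron expansion."""
    return expand_macrons(replace_characters(token))
```

With this sketch, normalize_token applied to 'auſzwēdig' yields 'auszwemdig|auszwendig', matching the example above. A token with two macrons yields four readings.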
----------------------------------------------------------------------------------------------------
README Contents:

Usage
Required Software
Input/Output Format
----------------------------------------------------------------------------------------------------
Usage:

$ ./cleanV2.py /path/to/input_file.txt /path/to/output_file_stem.txt
----------------------------------------------------------------------------------------------------
Required Software:

+ python >= 2.3 < 3.0
----------------------------------------------------------------------------------------------------
Input/Output Format:

The input corpus needs to be in txt format. Each line contains a line break span annotation and one "dipl" token (see the guidelines for version 3.0 of Ridges Herbology), separated by a single tab character. An example:

lb	dipl	# tier term
lb	Dies
	ist
	ein
	Form-
lb	bsp
	.

Quotation marks require the escape character (\) or the predefined XML masking (&quot;); otherwise format issues may appear.

One output file is created for each input file. The output file contains the annotation tiers of the input file with the replacements outlined above.
----------------------------------------------------------------------------------------------------
Author: Vivian Voigt