Sprach- und literaturwissenschaftliche Fakultät - Korpuslinguistik und Morphologie

cleanV2README.txt

clean.py

	Code to normalize special characters used in old German texts. Takes corpus files
	in txt format and writes out those files with the following non-standard
	characters replaced:

		'⸗' to '-',
		'ſ' to 's',
		'å' to 'a',		
		'ů' to 'u',
		'o̊' to 'o',
		'æ' to 'ae',
		'Æ' to 'AE',
		'œ' to 'oe',
		'Œ' to 'OE',
		'aͤ' to 'ä',
		'oͤ' to 'ö',
		'uͤ' to 'ü',
		'vͤ' to 'ü',
		'Aͤ' to 'Ä',
		'Oͤ' to 'Ö',
		'Uͤ' to 'Ü',
		'Vͤ' to 'Ü',
		'ñ' to 'nn',
		'n̄n' to 'nn',
		'n̄' to 'nn',
		'ñ' to 'nn',
		'ñn' to 'nn',
		'm̄m' to 'mm',
		'm̄' to 'mm',
		'm̃m' to 'mm',
		'm̃' to 'mm',
		'.*_.*' to 'unknown' ('_' represents an unreadable character),
		'€' to 'der', ('€' represents 'digit' (der))
		'$' to 'us', ('$' represents a Fraktur character that can be transcribed as 'us')	// since V2
		'í' to 'i',		// since V2
		'˖' to ':',		// since V2
		'ʒ' to 'z',		// since V2
		'ȝ' to 'z',		// since V2
		'v̂' to 'ü',	// since V2
		'ű' to 'ü',		// since V2
		'ͤa' to 'ä',		// since V2
		'äͤ' to 'ä'		// since V2
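
	The table above can be read as an ordered list of string replacements. The
	following is a minimal sketch of that idea, not the code of clean.py itself:
	it is written for Python 3 (the script targets Python 2), covers only a subset
	of the table, and the names REPLACEMENTS and clean_token are hypothetical. The
	code points are one reading of the characters listed above. Longer sequences
	are listed before their prefixes so that, for example, 'n̄n' is not turned
	into 'nnn'.

```python
# Illustrative subset of the replacement table (assumed code points).
# Order matters: longer sequences must come before their prefixes.
REPLACEMENTS = [
    ("n\u0304n", "nn"),      # n + combining macron + n
    ("n\u0304", "nn"),       # n + combining macron
    ("m\u0304m", "mm"),      # m + combining macron + m
    ("m\u0304", "mm"),       # m + combining macron
    ("a\u0364", "\u00e4"),   # a + combining small e -> a-umlaut
    ("o\u0364", "\u00f6"),   # o + combining small e -> o-umlaut
    ("u\u0364", "\u00fc"),   # u + combining small e -> u-umlaut
    ("\u2e17", "-"),         # double oblique hyphen
    ("\u017f", "s"),         # long s
    ("\u00e6", "ae"),        # ae ligature
    ("\u0153", "oe"),        # oe ligature
]


def clean_token(token):
    """Return the normalized form of a single token (hypothetical helper)."""
    # Rule '.*_.*' -> 'unknown': any token containing '_' is unreadable.
    if "_" in token:
        return "unknown"
    for old, new in REPLACEMENTS:
        token = token.replace(old, new)
    return token


print(clean_token("Wa\u017f\u017fer"))  # Wasser
print(clean_token("dan\u0304n"))        # dann
```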

		
	The new documents in ridges-V4 place new demands on the clean tier with regard to vowels with macrons. The normalization of those characters became unpredictable, even when the context is taken into account. We therefore decided to replace each token containing vowels with macrons with every potential form of that token, separated by '|' (for example, 'auſzwēdig' becomes 'auszwemdig|auszwendig').
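
	The expansion of macron vowels can be sketched as follows. This is an
	illustration under assumptions, not the script's actual code: it is written
	for Python 3, the function name expand_macrons is hypothetical, and it assumes
	(based only on the example above) that a macron over a vowel stands for a
	following 'm' or 'n'. Tokens with several macron vowels yield every
	combination.

```python
import itertools
import unicodedata

MACRON = "\u0304"      # combining macron
VOWELS = "aeiouAEIOU"


def expand_macrons(token):
    """Return all candidate readings of a token, joined by '|'."""
    # NFD splits precomposed letters such as U+0113 into vowel + macron,
    # so both encodings are handled the same way.
    decomposed = unicodedata.normalize("NFD", token)
    slots = []  # one list of alternatives per position in the token
    i = 0
    while i < len(decomposed):
        ch = decomposed[i]
        has_macron = i + 1 < len(decomposed) and decomposed[i + 1] == MACRON
        if ch in VOWELS and has_macron:
            # Assumption: the macron abbreviates a following m or n.
            slots.append([ch + "m", ch + "n"])
            i += 2
        else:
            slots.append([ch])
            i += 1
    variants = ("".join(parts) for parts in itertools.product(*slots))
    # Recompose (NFC) so untouched letters such as umlauts stay precomposed.
    return "|".join(unicodedata.normalize("NFC", v) for v in variants)


print(expand_macrons("auszw\u0113dig"))  # auszwemdig|auszwendig
```

	A token without macron vowels is returned unchanged, so the function can be
	applied to every token after the plain character replacements.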
		
	Furthermore, the script rejoins tokens that were separated by line breaks and marked with '-' or '⸗'.
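
	Rejoining tokens broken across lines might look like the sketch below. It is
	an assumption-laden illustration in Python 3, not the script's code: the
	function name join_broken_tokens and the (starts_line, token) row
	representation are hypothetical, and how the script records the joined token
	in its output is not covered here.

```python
HYPHENS = ("-", "\u2e17")  # '-' and the double oblique hyphen


def join_broken_tokens(rows):
    """Merge a token ending in a hyphen with the first token of the next line.

    rows is a list of (starts_line, token) pairs, where starts_line is True
    when the token carries a line-break annotation.
    """
    joined = []
    for starts_line, token in rows:
        previous = joined[-1] if joined else ""
        # Only merge when the previous token ends in a break marker, is more
        # than the marker itself, and the current token opens a new line.
        if starts_line and len(previous) > 1 and previous.endswith(HYPHENS):
            joined[-1] = previous[:-1] + token
        else:
            joined.append(token)
    return joined


rows = [(True, "Dies"), (False, "ist"), (False, "ein"),
        (False, "Form-"), (True, "bsp"), (False, ".")]
print(join_broken_tokens(rows))  # ['Dies', 'ist', 'ein', 'Formbsp', '.']
```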

	
----------------------------------------------------------------------------------------------------

README Contents:

	Usage
	Required Software
	Input/Output Format

----------------------------------------------------------------------------------------------------


Usage:

$ ./cleanV2.py 	/path/to/input_file.txt	/path/to/output_file_stem.txt


----------------------------------------------------------------------------------------------------


Required Software:

+ Python >= 2.3, < 3.0


----------------------------------------------------------------------------------------------------


Input/Output Format:


The input corpus needs to be in txt format. Each line contains a line-break span annotation (which may be empty)
and one "dipl" token (see the guidelines for version 3.0 of Ridges Herbology), separated by a single tab character.
An example:

		lb	dipl			# tier term
		lb	Dies
			ist	
			ein	
			Form-
		lb	bsp	
			.
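
Reading this format could be sketched as follows. This is a hypothetical Python 3 helper, not part of the script: the
name parse_rows is invented, and it assumes that the first non-empty line is the tier header, that the "# tier term"
note and the indentation in the example are not part of the actual file, and that a first column equal to "lb" marks
the start of a new line.

```python
def parse_rows(lines):
    """Turn tab-separated input lines into (starts_line, token) pairs."""
    body = [line.rstrip("\r\n") for line in lines if line.strip()]
    rows = []
    for line in body[1:]:            # skip the tier header (assumed first)
        columns = line.split("\t")
        starts_line = columns[0].strip() == "lb"
        # The second column holds the dipl token; trailing tabs are ignored.
        token = columns[1].strip() if len(columns) > 1 else ""
        rows.append((starts_line, token))
    return rows


sample = ["lb\tdipl\n", "lb\tDies\n", "\tist\t\n", "\tein\t\n",
          "\tForm-\n", "lb\tbsp\t\n", "\t.\n"]
print(parse_rows(sample))
```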

Quotation marks require the escape character (\) or the predefined XML entity (&quot;); otherwise formatting problems may occur.
			
One output file is created for each input file. The output file contains the annotation tiers of the input file
with the replacements outlined above.



----------------------------------------------------------------------------------------------------



Author: Vivian Voigt