Software

Corpus Linguistics and Morphology | Software

Online tools

VIEW (Variation in English Words and Phrases)
http://view.byu.edu/
Publicly accessible web interface by Mark Davies for searching the BNC. Various search options.
PIE (Phrases in English)
http://pie.usna.edu/
Publicly accessible web interface by William H. Fletcher for searching the BNC. Supports searching for word, part-of-speech, or character n-grams as well as phrase frames.
The Sketch Engine
http://www.sketchengine.co.uk/
The Sketch Engine by Adam Kilgarriff and Pavel Rychly is a corpus search engine incorporating word sketches, grammatical relations, and a distributional thesaurus. A word sketch is a one-page, automatic, corpus-derived summary of a word's grammatical and collocational behaviour. Free demo account after registration.
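To illustrate the n-gram and phrase-frame searches mentioned for PIE above, here is a minimal Python sketch. It is not PIE's implementation; it assumes that a phrase frame is a group of n-grams that are identical except for one word slot (shown as `*`), and the function names and the sample sentence are invented for this example.

```python
from collections import Counter, defaultdict


def word_ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_frames(ngram_counts):
    """Group n-grams that differ in exactly one slot, marked '*'.

    Maps each frame to a Counter of the words filling the open slot.
    (Assumed definition of a phrase frame, for illustration only.)
    """
    frames = defaultdict(Counter)
    for gram, count in ngram_counts.items():
        for slot in range(len(gram)):
            frame = gram[:slot] + ("*",) + gram[slot + 1:]
            frames[frame][gram[slot]] += count
    return frames


# Hypothetical sample text, whitespace-tokenised.
tokens = "the end of the day and the end of the week".split()
trigram_counts = Counter(word_ngrams(tokens, 3))
frames = phrase_frames(trigram_counts)

# Words observed in the open slot of the frame "of the *".
print(dict(frames[("of", "the", "*")]))
```

On a real corpus one would typically keep only frames with several distinct fillers, since every n-gram trivially belongs to n frames.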

APIs and frameworks

Annotation Graph Toolkit (AGTK)
http://agtk.sourceforge.net/
Free software library in C++ (Java port available) for the processing of annotation graphs. Annotation Graphs are a formal framework for representing linguistic annotations of time series data. Annotation graphs abstract away from file formats, coding schemes and user interfaces, providing a logical layer for annotation systems.
Atlas (Architecture and Tools for Linguistic Analysis Systems)
http://www.nist.gov/speech/atlas/
Software library in Java for the processing of annotation graphs. Atlas provides a data model, a storage format, and an API.
LT XML
http://www.ltg.ed.ac.uk/software/xml/
Free software library in C for the processing of XML documents, including searching and extracting, down-translation (e.g. report generation, formatting), tokenising and sorting.
NITE XML Toolkit (NXT)
http://www.ltg.ed.ac.uk/NITE/
Software library in Java for developing tailored end user corpus tools, especially for highly structured and/or cross-annotated multimodal corpora. NXT provides a data model, a storage format, and API support for handling data, querying it, and building graphical user interfaces.
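The annotation graphs handled by AGTK and Atlas above can be pictured as a small data structure: nodes that may carry a time offset into the signal, and labelled arcs between them. The Python sketch below is only an illustration of that idea. It does not reproduce the AGTK or Atlas API, and all class, field, and method names are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    node_id: str
    offset: Optional[float] = None  # seconds into the signal; None if unanchored


@dataclass(frozen=True)
class Arc:
    start: Node
    end: Node
    layer: str  # kind of annotation, e.g. "word" or "speaker"
    label: str  # the annotation value itself


@dataclass
class AnnotationGraph:
    arcs: List[Arc] = field(default_factory=list)

    def annotate(self, start, end, layer, label):
        """Add a labelled arc between two nodes and return it."""
        arc = Arc(start, end, layer, label)
        self.arcs.append(arc)
        return arc

    def arcs_in_layer(self, name):
        """All arcs belonging to one annotation layer."""
        return [a for a in self.arcs if a.layer == name]

    def spanning(self, time):
        """Arcs whose time-anchored endpoints enclose the given time."""
        return [a for a in self.arcs
                if a.start.offset is not None
                and a.end.offset is not None
                and a.start.offset <= time < a.end.offset]


# Hypothetical two-word utterance with a speaker layer on top.
n0, n1, n2 = Node("n0", 0.0), Node("n1", 0.42), Node("n2", 0.90)
graph = AnnotationGraph()
graph.annotate(n0, n1, "word", "hello")
graph.annotate(n1, n2, "word", "world")
graph.annotate(n0, n2, "speaker", "A")

print([a.label for a in graph.spanning(0.5)])
```

Because several layers share the same nodes, annotations of different kinds stay aligned without depending on any particular file format.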

Corpus creation tools

CLaRK
http://www.bultreebank.org/clark/
An XML-based system for corpus development.
GATE - General Architecture for Text Engineering
http://gate.ac.uk/
GATE is a modular system for the linguistic processing of texts. It comprises an architecture, a library framework, and a graphical development environment. Plugins can be used to build an application for a particular annotation task. GATE is freely available under the GNU Library General Public License (LGPL 2.0) and can be downloaded after registration. It is implemented in Java and is thus available for all major platforms.
SPre - configurable pre-processor
http://www.spinfo.phil-fak.uni-koeln.de/spinfo-forschung-spre.html
SPre is a program for segmenting and annotating texts in arbitrary formats. The segmentation algorithms can be configured fairly freely via an XML file. Other annotators can be integrated. SPre is published as a plugin for GATE. It is implemented in Java and is thus available on all major platforms.
jTokeniser
http://www.andy-roberts.net/software/jTokeniser/
Program and API for tokenising natural language text strings. Various tokenisers are provided for the segmentation of sentences into words and texts into sentences. Written in Java, hence available on all major platforms. Free Software (LGPL).
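As a rough illustration of the two segmentation steps described for jTokeniser above (texts into sentences, sentences into words), here is a minimal regex-based sketch in Python. It is not jTokeniser's algorithm or API; the patterns and function names are assumptions chosen for brevity.

```python
import re

# Split after ., ! or ? when followed by whitespace (naive: this also
# splits after abbreviations such as "e.g.").
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

# A token is either a word (optionally with internal apostrophes or
# hyphens) or a single punctuation character.
TOKEN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def split_sentences(text):
    """Segment a text into sentences."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenise(sentence):
    """Segment a sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


# Hypothetical sample text.
text = "Corpora aren't small. They grow quickly!"
sentences = split_sentences(text)
tokens = [tokenise(s) for s in sentences]
print(tokens)
```

Practical tokenisers add abbreviation lists or trained models precisely because a punctuation-only rule like the one above over-splits.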

Annotation tools

Alembic Workbench Project
http://www.mitre.org/tech/alembic-workbench/
Tool for manual and automatic annotation of text corpora. Automatic annotation is achieved by a mixed approach: heuristics for information extraction can be manually composed or automatically induced. Available free of charge.
PALinkA: A Discourse Annotation Tool
http://clg.wlv.ac.uk/projects/PALinkA/
An annotation program which allows a wide range of annotations. At present it has been used to annotate texts for anaphora resolution, centering, summarisation and marking certain features in texts.
TASX (Time Aligned Signal data eXchange), currently down
http://medien.informatik.fh-fulda.de/tasxforce
TASX provides an XML based annotation format, an annotation tool and a web based query system for multimodal corpora.
Annotate
http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/annotate.html
Annotate is a tool for efficient semi-automatic annotation of corpus data. It facilitates the generation of context-free structures and additionally allows crossing edges.
EXMARaLDA
http://www.exmaralda.org/
EXMARaLDA (EXtensible MARkup Language for Discourse Annotation) provides an XML-based format and a variety of tools for discourse transcription and annotation. It's written in Java, and thus available for all major computer platforms.
Transcriber
http://www.etca.fr/CTA/gip/Projets/Transcriber/
Transcriber is a tool for assisting the manual annotation of speech signals. It provides a user-friendly graphical user interface for segmenting long duration speech recordings, transcribing them, and labeling speech turns, topic changes and acoustic conditions. It is more specifically designed for the annotation of broadcast news recordings, for creating corpora used in the development of automatic broadcast news transcription systems, but its features might be found useful in other areas of speech research.
Anvil
http://www.dfki.de/~kipp/anvil/
Anvil is a free video annotation tool.
MMAX
http://mmax.eml-research.de
A tool for multi-modal annotation in XML.

Taggers

TreeTagger
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
Trainable tagger based on decision trees. Part-of-speech parameter files are available for English, German, French, and Italian.
HMM-based tagger MBT
http://ilk.kub.nl/
Memory-based tagger, available for download. Dutch, English, Spanish, Swedish, and German.
AMALGAM Tagger
http://www.comp.leeds.ac.uk/amalgam/amalgam/amalghome.htm
AMALGAM Tagger is based on Brill's tagger and tags English text with the part-of-speech tagging schemes of the Brown Corpus (Brown), International Corpus of English (ICE), London-Lund Corpus (LLC), Lancaster-Oslo/Bergen Corpus (LOB), UNIX parts (Parts), Polytechnic of Wales Corpus (POW), Spoken English Corpus (SEC), and University of Pennsylvania Corpus (UPenn). AMALGAM Tagger can only be used via email.
Monty Tagger
http://web.media.mit.edu/~hugo/montytagger/
MontyTagger is part of the MontyLingua tools. Quoting the website: "Part-of-speech tagging based on Brill94, enriched with common sense."
CLAWS
http://www.comp.lancs.ac.uk/ucrel/claws/
CoCab
http://chasen.aist-nara.ac.jp/~kaoru-ya/cocab/
Biomedical vocabulary.
connexor-tagger
http://www.connexor.com/
Uses a small tagset.
EngCG-tagger
http://www.ling.helsinki.fi/~avoutila/cg/
QTag
http://phrasys.net/uob/om/software
Trainable, probabilistic part-of-speech tagger. Parameter files for English are available.
LT POS-Tagger
http://www.ltg.ed.ac.uk/software/pos/index.html
Uses the Penn Treebank tagset; accepts plain text and SGML as input.
ISSCO TaggerTool
http://www.issco.unige.ch/staff/robert/tatoo/tatoo.html
Brill Tagger
http://www.cs.jhu.edu/~brill/
Transformation-based tagger.
mtag Multext-Tagger
http://www.issco.unige.ch/projects/MULTEXT.html
Re-implementation of the Xerox tagger in C.
TnT
http://www.coli.uni-sb.de/~thorsten/tnt/
Statistical tagger by Thorsten Brants; for Windows. German and English.
AUTASYS
http://www.phon.ucl.ac.uk/home/alex/project/tagging/tagging.htm
A Fully Automatic English Wordclass Analysis System
GALENA
http://www.dc.fi.udc.es/lfcia/Proyectos/Galena/
Tagger and parser for Spanish. Platform-independent. 7000 lemmas.

Corpus analysis tools

IMS Open Corpus Workbench (CWB)
http://cwb.sourceforge.net/
The IMS Open Corpus Workbench (formerly IMS Corpus Workbench) is a set of tools for full-text retrieval from text corpora. The Corpus Query Processor (CQP) is a powerful corpus search tool supporting regular expressions, match conditions on all annotation levels, and collocation analysis. Research and evaluation licences are available free of charge.
WordSmith Tools
http://www.lexically.net/wordsmith/
Commercial set of tools to explore the behaviour of words in texts. It provides a tool for generating lists of all words or word-clusters in a text, a concordancer to see a word in its context, and a tool for identifying key words in a text. Demo mode available (restricted functional range).
AntConc
http://www.antlab.sci.waseda.ac.jp/software.html
Freeware concordance software; produces, among other things, KWIC (key word in context) concordances, word clusters, n-grams, and word frequencies.
TextSTAT - Simple Text Analysis Tool
http://neon.niederlandistik.fu-berlin.de/en/textstat/
Open-source concordance software; produces, among other things, KWIC (key word in context) concordances, word clusters, n-grams, word frequencies, and retrograde (reverse) sorting.
QLDB - Querying Linguistic Databases
http://www.ldc.upenn.edu/Projects/QLDB/
Project about data models and query languages for linguistic databases.
An On-Line Repository of Association Measures
http://www.collocations.de/AM/
Statistical association measures, applied to cooccurrence frequency data collected in a contingency table, are the most widely used tool for the analysis of word combinations and the extraction of collocations from text corpora.
The UCS Toolkit (version 0.3)
http://www.collocations.de/
The UCS toolkit is a collection of libraries and scripts for the statistical analysis of cooccurrence data.
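The association measures described above are computed from a 2×2 contingency table of cooccurrence frequencies. The following Python sketch shows two commonly used measures, pointwise mutual information and the log-likelihood ratio, as examples. It is not taken from the UCS toolkit; the function names and the frequencies in the usage example are hypothetical.

```python
import math


def contingency(o11, f1, f2, n):
    """Observed 2x2 table for a word pair (w1, w2).

    o11: joint frequency of the pair, f1 and f2: marginal frequencies
    of w1 and w2, n: total number of pairs in the sample.
    """
    return [[o11, f1 - o11],
            [f2 - o11, n - f1 - f2 + o11]]


def expected(table):
    """Expected frequencies under independence: row sum * column sum / n."""
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]


def pmi(table):
    """Pointwise mutual information: log2 of observed over expected."""
    return math.log2(table[0][0] / expected(table)[0][0])


def log_likelihood(table):
    """Log-likelihood ratio G2 = 2 * sum(O * ln(O / E)) over all cells."""
    exp = expected(table)
    return 2 * sum(o * math.log(o / e)
                   for obs_row, exp_row in zip(table, exp)
                   for o, e in zip(obs_row, exp_row)
                   if o > 0)  # empty cells contribute nothing


# Hypothetical counts: the pair occurs 30 times, w1 100 times,
# w2 60 times, in a sample of 10,000 pairs.
table = contingency(30, 100, 60, 10_000)
print(round(pmi(table), 2), round(log_likelihood(table), 2))
```

Both measures are zero when the observed table matches the expected one exactly. Pointwise mutual information tends to favour rare pairs, which is one reason several measures are usually compared.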

Other

STTS (Stuttgart-Tübingen-TagSet)
http://www.ims.uni-stuttgart.de/projekte/corplex/TagSets/stts-table.html
Part-of-speech tag set for German. Links to other tagsets.

Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology