Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology

Software

Online tools

  • VIEW (Variation in English Words and Phrases)
    http://view.byu.edu/

    Publicly accessible web interface by Mark Davies for searching the British National Corpus (BNC). Offers various search options.

  • PIE (Phrases in English)
    http://pie.usna.edu/

    Publicly accessible web interface by William H. Fletcher for searching the BNC. Supports searches for word, part-of-speech, or character n-grams as well as phrase frames.

  • The Sketch Engine
    http://www.sketchengine.co.uk/

    The Sketch Engine by Adam Kilgarriff and Pavel Rychly is a corpus search engine incorporating word sketches, grammatical relations, and a distributional thesaurus. A word sketch is a one-page, automatic, corpus-derived summary of a word's grammatical and collocational behaviour. Free demo account after registration.
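
The n-gram and phrase-frame searches mentioned for PIE can be illustrated with a minimal sketch. It assumes that a phrase frame is a word n-gram with one position left open; the function names, the `*` wildcard and the sample sentence are hypothetical and are not taken from PIE itself.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def phrase_frames(tokens, n):
    """Count n-grams in which one slot is replaced by the wildcard '*'."""
    frames = Counter()
    for gram in word_ngrams(tokens, n):
        for slot in range(n):
            frame = gram[:slot] + ("*",) + gram[slot + 1:]
            frames[frame] += 1
    return frames

# Hypothetical sample text, split on whitespace only
tokens = "at the end of the day at the top of the hill".split()
bigrams = Counter(word_ngrams(tokens, 2))
frames = phrase_frames(tokens, 4)
```

Here `frames[("at", "the", "*", "of")]` counts both "at the end of" and "at the top of", which is the kind of variation a phrase-frame search is meant to surface.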

APIs and frameworks

  • Annotation Graph Toolkit (AGTK)
    http://agtk.sourceforge.net/

    Free software library in C++ (Java port available) for the processing of annotation graphs. Annotation graphs are a formal framework for representing linguistic annotations of time-series data. They abstract away from file formats, coding schemes and user interfaces, providing a logical layer for annotation systems.

  • Atlas (Architecture and Tools for Linguistic Analysis Systems)
    http://www.nist.gov/speech/atlas/

    Software library in Java for the processing of annotation graphs. Atlas provides a data model, a storage format, and an API.

  • LT XML
    http://www.ltg.ed.ac.uk/software/xml/

    Free software library in C for the processing of XML documents, including searching and extracting, down-translation (e.g. report generation, formatting), tokenising and sorting.

  • NITE XML Toolkit (NXT)
    http://www.ltg.ed.ac.uk/NITE/

    Software library in Java for developing tailored end-user corpus tools, especially for highly structured and/or cross-annotated multimodal corpora. NXT provides a data model, a storage format, and API support for handling data, querying it, and building graphical user interfaces.
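
The annotation graph model mentioned for AGTK and Atlas can be sketched as a small data structure: nodes that may carry a time offset, and labelled arcs between them. This is only an illustration of the idea, not the API of either library; all class, field and method names below are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class Node:
    id: str
    offset: Optional[float] = None  # time in seconds; None if not anchored

@dataclass(frozen=True)
class Arc:
    start: Node
    end: Node
    type: str   # annotation layer, e.g. "word" or "speaker"
    label: str  # annotation value

@dataclass
class AnnotationGraph:
    arcs: List[Arc] = field(default_factory=list)

    def add(self, start, end, type_, label):
        """Add a labelled arc between two nodes and return it."""
        arc = Arc(start, end, type_, label)
        self.arcs.append(arc)
        return arc

    def by_type(self, type_):
        """All arcs of one annotation layer, in insertion order."""
        return [a for a in self.arcs if a.type == type_]

    def spanning(self, t):
        """All time-anchored arcs whose interval contains time t."""
        return [a for a in self.arcs
                if a.start.offset is not None and a.end.offset is not None
                and a.start.offset <= t < a.end.offset]

# Hypothetical example: two words and one speaker turn over the same signal
n0, n1, n2 = Node("n0", 0.0), Node("n1", 0.32), Node("n2", 0.71)
g = AnnotationGraph()
g.add(n0, n1, "word", "she")
g.add(n1, n2, "word", "had")
g.add(n0, n2, "speaker", "A")
```

Because layers share nodes rather than file positions, a query such as `g.spanning(0.5)` returns arcs from every layer that covers that point in time, independent of any storage format.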

Corpus creation tools

  • CLaRK
    http://www.bultreebank.org/clark/

    An XML-based system for corpus development.

  • GATE - General Architecture for Text Engineering
    http://gate.ac.uk/

    GATE is a modular system for the linguistic processing of texts. It comprises an architecture, a library framework and a graphical development environment. Plugins can be used to build an application for a particular annotation task. GATE is freely available under the GNU Library General Public License (LGPL 2.0) and can be downloaded after registration. It is implemented in Java, and thus available for all major platforms.

  • SPre - configurable pre-processor
    http://www.spinfo.phil-fak.uni-koeln.de/spinfo-forschung-spre.html

    SPre is a program for segmenting and annotating texts in arbitrary formats. The segmentation algorithms can be configured fairly freely via an XML file, and other annotators can be integrated. SPre is published as a plugin for GATE. It is implemented in Java, and thus available on all major platforms.

  • jTokeniser
    http://www.andy-roberts.net/software/jTokeniser/

    Program and API for tokenising natural language text strings. Various tokenisers are provided for the segmentation of sentences into words and texts into sentences. Written in Java, hence available on all major platforms. Free Software (LGPL).
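
The two segmentation steps described for jTokeniser (texts into sentences, sentences into words) can be sketched with regular expressions. This is a deliberately naive illustration, not the method of any tool listed here; the function names and rules are assumptions, and abbreviations such as "e.g." would be split incorrectly.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def tokenise(sentence):
    """Return words (keeping internal hyphens and apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

# Hypothetical sample text
text = "Corpora are large. They need tools! Don't they?"
sentences = split_sentences(text)
tokens = [tokenise(s) for s in sentences]
```

Rule-based splitting like this is easy to adapt per corpus, which is one reason configurable tokenisers are common in corpus creation pipelines.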

Annotation tools

  • Alembic Workbench Project
    http://www.mitre.org/tech/alembic-workbench/

    Tool for manual and automatic annotation of text corpora. Automatic annotation is achieved by a mixed approach: heuristics for information extraction can be composed manually or induced automatically. Available free of charge.

  • PALinkA: A Discourse Annotation Tool
    http://clg.wlv.ac.uk/projects/PALinkA/

    An annotation program that supports a wide range of annotation types. So far it has been used to annotate texts for anaphora resolution, centering and summarisation, and to mark certain features in texts.

  • TASX (Time Aligned Signal data eXchange) (site currently down)
    http://medien.informatik.fh-fulda.de/tasxforce

    TASX provides an XML-based annotation format, an annotation tool and a web-based query system for multimodal corpora.

  • Annotate
    http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/annotate.html

    Annotate is a tool for efficient semi-automatic annotation of corpus data. It facilitates the generation of context-free structures and additionally allows crossing edges.

  • EXMARaLDA
    http://www.exmaralda.org/

    EXMARaLDA (EXtensible MARkup Language for Discourse Annotation) provides an XML-based format and a variety of tools for discourse transcription and annotation. It is written in Java, and thus available for all major computer platforms.

  • Transcriber
    http://www.etca.fr/CTA/gip/Projets/Transcriber/

    Transcriber is a tool for assisting the manual annotation of speech signals. It provides a user-friendly graphical interface for segmenting long-duration speech recordings, transcribing them, and labelling speech turns, topic changes and acoustic conditions. It is designed specifically for annotating broadcast news recordings in order to create corpora for the development of automatic broadcast news transcription systems, but its features may also be useful in other areas of speech research.

  • Anvil
    http://www.dfki.de/~kipp/anvil/

    Anvil is a free video annotation tool.

  • MMAX
    http://mmax.eml-research.de

    A tool for multi-modal annotation in XML.

Tagger

Corpus analysis tools

  • IMS Open Corpus Workbench (CWB)
    http://cwb.sourceforge.net/

    The IMS Open Corpus Workbench (formerly the IMS Corpus Workbench) is a set of tools for full-text retrieval from text corpora. The Corpus Query Processor (CQP) is a powerful corpus search tool supporting regular expressions, match conditions on all annotation levels and collocation analysis. Research and evaluation licences are available free of charge.

  • WordSmith Tools
    http://www.lexically.net/wordsmith/

    Commercial set of tools for exploring the behaviour of words in texts. It provides a tool for generating lists of all words or word clusters in a text, a concordancer for viewing a word in its context, and a tool for identifying key words in a text. Demo mode available (with restricted functionality).

  • AntConc
    http://www.antlab.sci.waseda.ac.jp/software.html

    Freeware concordance software. Among other things, it produces KWIC (key word in context) concordances, word clusters, n-grams and word frequencies.

  • TextSTAT - Simple Text Analysis Tool
    http://neon.niederlandistik.fu-berlin.de/en/textstat/

    Open-source concordance software. Among other things, it produces KWIC (key word in context) concordances, word clusters, n-grams and word frequencies, and supports retrograde (reverse) sorting.

  • QLDB - Querying Linguistic Databases
    http://www.ldc.upenn.edu/Projects/QLDB/

    Project on data models and query languages for linguistic databases.

  • An On-Line Repository of Association Measures
    http://www.collocations.de/AM/

    Statistical association measures, applied to cooccurrence frequency data collected in a contingency table, are the most widely used tool for the analysis of word combinations and the extraction of collocations from text corpora.

  • The UCS Toolkit (version 0.3)
    http://www.collocations.de/

    The UCS toolkit is a collection of libraries and scripts for the statistical analysis of cooccurrence data.
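
The association measures described above can be illustrated on a 2×2 contingency table of observed cooccurrence frequencies. The sketch below computes expected frequencies under independence and two widely used measures, pointwise mutual information and the log-likelihood ratio. The cell names `o11` to `o22`, the function names and the sample counts are assumptions for illustration; this is not code from the UCS toolkit.

```python
import math

def expected(o11, o12, o21, o22):
    """Expected cell frequencies under independence, from row and column totals."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)

def pmi(o11, o12, o21, o22):
    """Pointwise mutual information: log2 of observed over expected cooccurrences.
    Requires o11 > 0."""
    e11 = expected(o11, o12, o21, o22)[0]
    return math.log2(o11 / e11)

def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood ratio G2 = 2 * sum(O * ln(O / E)); empty cells contribute 0."""
    obs = (o11, o12, o21, o22)
    exp = expected(*obs)
    return 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)

# Hypothetical counts: o11 = pair occurs together, o12/o21 = only one word occurs,
# o22 = neither word occurs
independent = (10, 90, 90, 810)   # observed equals expected in every cell
associated = (30, 70, 70, 830)    # the pair cooccurs three times more often than expected
```

For the `independent` table both measures are zero, while the `associated` table gives a positive score on both; ranking word pairs by such scores is the usual way to extract collocation candidates.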

Other