Faculty of Language, Literature and Humanities - Corpus Linguistics and Morphology

ESSLLI 2008: Bioinformatics methods in calculating language relationships

Course taught at ESSLLI 2008 in Hamburg.

Teachers: Ulf Leser, Bioinformatics, and Anke Lüdeling, Corpus Linguistics, Humboldt-Universität zu Berlin.

We will update this homepage with information concerning the course, references, slides etc. In case you have any questions, don't hesitate to contact us.


This course deals with the computation of language trees and networks using methods from bioinformatics. Since the first language tree published by Schleicher in 1853, relationships between languages have been viewed as similar to relationships between species. Although there have been many debates about the adequacy of "genetic" language trees, the genetic metaphor is on the rise in theoretical historical linguistics. Bioinformatics methods, originally designed for the comparison of DNA and genomes, are nowadays often used to construct language trees. In this course, we start with a short introduction to DNA sequence analysis and bioinformatics in general. We highlight the differences and similarities between analyzing biological sequences and analyzing human language. Important algorithms for computing similarity (of sequences, words, sentences, languages, etc.) are explained. We then turn to phylogenetic algorithms for trees, such as hierarchical clustering and maximum parsimony. Finally, we give an outlook on algorithms for inferring phylogenies that are not trees. These are particularly important for historical linguistics, given the large degree to which languages influence each other beyond their genetic relationships. For each problem, we introduce the required data, the algorithms, and different methods to assess the quality of the results.
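To give a flavour of the similarity algorithms mentioned above, here is a minimal sketch of the classic edit distance (Levenshtein distance) between two words. It is not taken from the course slides. It assumes unit costs for insertion, deletion and substitution and compares plain orthographic strings, whereas work in historical linguistics usually compares phonetic transcriptions with graded costs. The function names `edit_distance` and `normalized_distance` are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (an assumption of this sketch)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    # Orthographic forms only; a real study would use phonetic transcriptions.
    print(edit_distance("night", "nacht"))
    print(normalized_distance("night", "nacht"))
```

Dividing by the length of the longer word is one common way to make distances comparable across word pairs of different lengths; other normalizations are possible.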

Schedule & Slides

Aug 04-08, 9:15-10:45 am. Hall K. For a map and organizational details, see the ESSLLI page.

  • Monday: Introduction & background: Genetic relationships between languages. Short overview of traditional methods, problems, goals, etc., slides
  • Tuesday: Bioinformatics primer, string similarity (edit distance), slides
  • Wednesday: Tree-construction methods: distance-based methods and character-based methods (parsimony, perfect phylogeny), slides
  • Thursday: Bioinformatics methods in historical linguistics (case studies): distance-based approaches and character-based approaches, slides
  • Friday: Language contact - networks, simulation, slides
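As an illustration of the distance-based tree construction listed for Wednesday, the following is a minimal sketch of average-linkage hierarchical clustering (UPGMA). It is not the course's own implementation. The four labels and all pairwise distances are invented toy values, and the function name `upgma` and the nested-tuple tree representation are assumptions made for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Average-linkage clustering (UPGMA) over a pairwise distance table.

    dist maps frozenset({x, y}) to the distance between x and y.
    Returns the tree as nested tuples; branch lengths are not computed.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters (first pair wins ties).
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


if __name__ == "__main__":
    # Invented toy distances between four hypothetical languages A-D.
    toy = {
        frozenset(("A", "B")): 2, frozenset(("A", "C")): 6,
        frozenset(("A", "D")): 10, frozenset(("B", "C")): 6,
        frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
    }
    print(upgma(["A", "B", "C", "D"], toy))
```

In practice the distance table could be filled with, for example, average normalized edit distances over a word list. UPGMA assumes a constant rate of change along all branches, which is a strong assumption for languages.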

Course Materials

Background Reading

  • Baldauf, Sandra L. (2003) Phylogeny for the faint of heart: a tutorial. In: Trends in Genetics 19(6), 345-351
  • Bandelt, Hans-Jürgen/Dress, Andreas W. M. (1993) A relational approach to split decomposition. In: Opitz, O./Lausen, B./Klar, R. (eds) Information and Classification. Springer, Berlin, 123-131
  • Bergsma, Shane/Kondrak, Grzegorz (2007) Multilingual Cognate Identification using Integer Linear Programming. In: Proceedings of the International Workshop on Acquisition and Management of Multilingual Lexicons, Borovets, Bulgaria, September 2007, 11-18. Online at http://www.cs.ualberta.ca/~kondrak/publications.html#CL
  • Bryant, David (2006) Radiation and Network Breaking in Polynesian Language Evolution. In: Forster, Peter/Renfrew, Colin (eds) Phylogenetic Methods and the Prehistory of Languages. McDonald Institute Press, University of Cambridge. Online at http://www.math.auckland.ac.nz/~bryant/Papers/05PolyNetwork.pdf
  • Croft, William (2000) Explaining language change: an evolutionary approach. Harlow, Essex: Longman.
  • Covington, Michael (1996) An algorithm to align words for historical comparison. In: Computational Linguistics 22, 481-496
  • Dixon, Robert M. W. (1997) The Rise and Fall of Languages. Cambridge University Press, Cambridge
  • Dress, Andreas W. M./Huson, Daniel H. (2004) Constructing splits graphs. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics 1(3), 109-115
  • Durie, Mark/Ross, Malcolm (eds) (1996) The comparative method reviewed: Regularity and irregularity in language change. Oxford: Oxford University Press.
  • Felsenstein, Joseph (2004) Inferring Phylogenies. Sunderland, Massachusetts, Palgrave Macmillan
  • Forster, Peter/Toth, Alfred/Bandelt, Hans-Jürgen (1998) Evolutionary Network Analysis of Word Lists: Visualising the Relationship between Alpine Romance Languages. In: Journal of Quantitative Linguistics 5(3), 174-187
  • Gould, Stephen Jay (2002) The Structure of Evolutionary Theory. The Belknap Press of Harvard University Press, Cambridge, MA & London
  • Haspelmath, Martin (2004) How hopeless is genealogical linguistics and how advanced is areal linguistics? A review article of Aikhenvald and Dixon 2001. In: Studies in Language 28(1), 209-223
  • Hock, Hans Henrich/Joseph, Brian D. (1996) Language History, Language Change, and Language Relationship. An Introduction to Historical and Comparative Linguistics. Mouton de Gruyter, Berlin
  • Jin, Guohua/Nakhleh, Luay/Snir, Sagi/Tuller, Tamir (2006) Efficient parsimony-based methods for phylogenetic network reconstruction. In: Bioinformatics 23, e123-e128. Online at http://www.cs.rice.edu/~nakhleh/pub.html
  • Kessler, Brett (2001) The Significance of Word Lists. CSLI Publications, Stanford
  • Kondrak, Grzegorz (2003) Phonetic alignment and similarity. In: Computers and the Humanities 37(3), 273-291
  • Kondrak, Grzegorz/Sherif, Tarek (2006) Evaluation of Several Phonetic Similarity Algorithms on the Task of Cognate Identification. In: Proceedings of the COLING-ACL Workshop on Linguistic Distances, Sydney, Australia, July 2006, 43-50. Online at http://www.cs.ualberta.ca/~kondrak/publications.html#CL
  • Lass, Roger (1997) Historical Linguistics and Language Change. Cambridge: Cambridge University Press.
  • Lieberman, Erez/Michel, Jean-Baptiste/Jackson, Joe/Tang, Tina/Nowak, Martin A. (2007) Quantifying the evolutionary dynamics of language. In: Nature 449(7163), 713-716. Online at http://www3.isrl.uiuc.edu/~junwang4/langev/localcopy/pdf/lieberman07verbFrequencyNATURE.pdf
  • Linder, C. Randal/Moret, Bernard M. E./Nakhleh, Luay/Warnow, Tandy (2004). Network (Reticulate) Evolution: Biology, Models, and Algorithms. In: Pacific Symposium on Bioinformatics (PSB), Hawaii. Online at http://www.cs.rice.edu/~nakhleh/Papers/psb04
  • Lüdeling, Anke (2006) Using corpora in the classification of language relationships. In: Zeitschrift für Anglistik und Amerikanistik. Special Issue on 'The Scope and Limits of Corpus Linguistics' (guest editor: Volker Gast), 217-227.
  • McMahon, April/McMahon, Robert (2005) Language Classification by Numbers. Oxford University Press, Oxford
  • Moret, Bernard M. E./Nakhleh, Luay/Warnow, Tandy/Linder, C. Randal/Tholse, Anna/Padolina, Anneke/Sun, Jerry/Timme, Ruth (2004) Phylogenetic networks: modeling, reconstructibility, and accuracy. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics 1(1), 13-23. Online at http://www.cs.rice.edu/~nakhleh/Papers/tcbb04.pdf
  • Morrison, David A. (1996) Phylogenetic Tree-Building. In: International Journal for Parasitology 26(6), 589-617
  • Nakhleh, Luay/Jin, Guohua/Zhao, Fengmei/Mellor-Crummey, John (2005) Reconstructing Phylogenetic Networks Using Maximum Parsimony. In: Computational Systems Bioinformatics Conference (CSB), Stanford, USA, 93-102. Online at http://www.cs.rice.edu/~nakhleh/Papers/CSB05.pdf
  • Nakhleh, Luay/Sun, Jerry/Warnow, Tandy/Linder, C. Randal/Moret, Bernard M. E./Tholse, Anna (2003) Towards the Development of Computational Tools for Evaluating Phylogenetic Network Reconstruction Methods. In: 8th Pacific Symposium on Biocomputing (PSB 03), Hawaii, 315-326. Online at http://www.cs.rice.edu/~nakhleh/Papers/psb03.pdf
  • Nakhleh, Luay/Warnow, Tandy/Ringe, Don/Evans, Steven N. (2005) A Comparison of Phylogenetic Reconstruction Methods on an IE Dataset. In: Transactions of the Philological Society 103(2), 171-192
  • Nakhleh, Luay/Ringe, Don/Warnow, Tandy (2005) Perfect Phylogenetic Networks: A New Methodology for Reconstructing the Evolutionary History of Natural Languages. In: Language 81(2), 382-420
  • Nerbonne, John/Heeringa, Wilbert (1997) Measuring Dialect Distance Phonetically. In: John Coleman (ed.) Workshop on Computational Phonology, Special Interest Group of the Association for Computational Linguistics, Madrid, 1997, 11-18. (online at http://www.let.rug.nl/~heeringa/dialectology/papers/)
  • Nerbonne, John/Heeringa, Wilbert/Kleiweg, Peter (1999). Edit Distance and Dialect Proximity. In: Sankoff, David/Kruskal, Joseph (eds.) Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. CSLI Publications, Stanford (online at http://www.let.rug.nl/~kleiweg/papers/)
  • Nerbonne, John/Kleiweg, Peter/Heeringa, Wilbert/Manni, Franz (2007) Projecting Dialect Differences to Geography: Bootstrap Clustering vs. Noisy Clustering. In: Preisach, Christine/Schmidt-Thieme, Lars/Burkhardt, Hans/Decker, Reinhold (eds.) Data Analysis, Machine Learning, and Applications. Proc. of the 31st Annual Meeting of the German Classification Society. Berlin, Springer. (online at http://www.let.rug.nl/~kleiweg/papers/)
  • Pagel, Mark/Atkinson, Quentin D./Meade, Andrew (2007) Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. In: Nature 449(7163), 717-720. Online at http://www.isrl.uiuc.edu/~amag/langev/paper/pagel07wordFrequencyNATURE.html
  • Posada, David/Crandall, Keith A. (2001) Intraspecific gene genealogies: trees grafting into networks. In: Trends in Ecology and Evolution 16(1), 37-45
  • Ringe, Don/Warnow, Tandy/Taylor, Ann (2002) Indo-European and Computational Cladistics. In: Transactions of the Philological Society 100, 59-129
  • Ritt, Nikolaus (2004) Selfish Sounds and Linguistic Evolution: A Darwinian Approach to Language Change. Cambridge, Cambridge University Press
  • Sims-Williams, Patrick (1998) Genetics, linguistics, and prehistory: thinking big and thinking straight. In: Antiquity 72, 502-527
  • Steel, Mike/Penny, David (2000) Parsimony, likelihood, and the role of models in molecular phylogenetics. In: Molecular Biology and Evolution 17(6), 839-50
  • Swadesh, Morris (1955) Towards Greater Accuracy in Lexicostatistic Dating. In: International Journal of American Linguistics 21, 121-137
  • Whelan, Simon/Lio, Pietro/Goldman, Nick (2001) Molecular phylogenetics: state-of-the-art methods for looking into the past. In: Trends in Genetics 17(5), 262-72. Online at http://dx.doi.org/10.1016/S0168-9525(01)02272-7