Trinity College Dublin, The University of Dublin

Module Descriptor
School of Computer Science and Statistics

Module Code: CS4LL5
Module Name: Advanced Computational Linguistics: Machine Learning Techniques in Machine Translation, Speech Recognition and Topic Modelling
Module Short Title: Advanced Computational Linguistics
Semester Taught: Michaelmas
Contact Hours:
Lecture hours: 22
Lab hours: 6
Tutorial hours: 5
Module Personnel: Dr Martin Emms
Learning Outcomes
  • understand in general what a probabilistic model is, and the distinction between so-called visible and hidden variables
  • understand the general idea of unsupervised training as a way to set model parameters concerning hidden variables from evidence on visible variables only
  • understand Expectation Maximisation (EM) as a general unsupervised technique, including proofs of its convergence and of its property of increasing data likelihood
  • understand specific instances of EM in Machine Translation and Speech Recognition, and how seemingly infeasibly costly calculations can in fact be done feasibly
  • consider the further case of models for the hidden 'topics' in a document collection, and the further modifications to EM needed to handle this
Learning Aims

The aim is to give a grounding in so-called unsupervised machine learning techniques, which are vital to many language-processing technologies including Machine Translation, Speech Recognition and Topic Modelling. Although studied here in these contexts, the techniques themselves are used much more widely, for example in data mining and machine vision.

Module Content

Probability basics on collections of variables with discrete outcomes (what word, what topic, etc.), in particular joint, marginal, and conditional probabilities; the chain rule; relative frequencies as maximum likelihood estimators.
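
As an illustration of these basics, the following minimal sketch estimates a joint distribution by relative frequency and derives a marginal and a conditional from it. The (topic, word) observations and all names are hypothetical toy choices, not course material.

```python
from collections import Counter

# Hypothetical toy data: (topic, word) pairs observed in a tiny corpus.
observations = [
    ("sport", "goal"), ("sport", "goal"), ("sport", "match"),
    ("politics", "vote"), ("politics", "goal"),
]

n = len(observations)
joint_counts = Counter(observations)

# Relative frequencies are the maximum likelihood estimates of P(topic, word).
joint = {pair: c / n for pair, c in joint_counts.items()}

# Marginal P(topic): sum the joint over all words.
marginal_topic = Counter()
for (topic, _word), p in joint.items():
    marginal_topic[topic] += p

def conditional(word, topic):
    """Conditional P(word | topic) = P(topic, word) / P(topic)."""
    return joint.get((topic, word), 0.0) / marginal_topic[topic]

# Chain rule: P(topic, word) = P(topic) * P(word | topic).
chain = marginal_topic["sport"] * conditional("goal", "sport")
print(joint[("sport", "goal")], chain)
```

The two printed values agree up to floating-point rounding, which is the chain rule applied to a single outcome pair.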

Statistical Machine Translation: general P(source|target) x P(target) formulation and learning from a corpus of sentence pairs; idea of 'hidden' alignment variables between sentence pairs; the so-called IBM alignment models; brute-force EM for learning alignment models; efficient exact algorithms avoiding the exponential cost of brute-force EM.
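
To make the EM idea concrete, here is a minimal sketch of EM training for word-translation probabilities in the style of IBM Model 1. It makes simplifying assumptions: no NULL word, uniform initialisation, and a three-sentence toy corpus. The function name `train_ibm_model1` and the data are hypothetical. Because each source word's alignment is independent in this model, the E-step can normalise per source word instead of enumerating every alignment of the sentence pair, which is what keeps the computation exact but cheap.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=20):
    """EM for P(s | t) from (source_words, target_words) sentence pairs."""
    source_vocab = {s for src, _ in pairs for s in src}
    # Uniform start for every (source word, target word) pair that co-occurs.
    t_prob = {(s, t): 1.0 / len(source_vocab)
              for src, tgt in pairs for s in src for t in tgt}
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of s-to-t links
        total = defaultdict(float)  # expected number of links into t
        for src, tgt in pairs:
            for s in src:
                # E-step: posterior over which target word s is aligned to.
                norm = sum(t_prob[(s, t)] for t in tgt)
                for t in tgt:
                    posterior = t_prob[(s, t)] / norm
                    count[(s, t)] += posterior
                    total[t] += posterior
        # M-step: renormalise the expected counts for each target word.
        for (s, t), c in count.items():
            t_prob[(s, t)] = c / total[t]
    return t_prob

# Hypothetical toy corpus (source = German, target = English).
pairs = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
t_prob = train_ibm_model1(pairs)
for (s, t), p in sorted(t_prob.items()):
    print(f"P({s} | {t}) = {p:.3f}")
```

No alignments are given in the data, yet the probabilities sharpen over iterations: 'das' co-occurs with 'the' in two sentences, so it comes to dominate P(· | the), which in turn pushes 'haus' towards 'house'.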

Speech Recognition: general Hidden Markov Model P(O|S) x P(S) formulation, where O is the observable speech and S is the hidden state sequence; brute-force EM for learning HMM parameters from a corpus of observed speech; the efficient Baum-Welch algorithm avoiding the exponential cost of brute-force EM.
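
The gap between brute-force and efficient computation can be seen in a small sketch. The code below computes the likelihood of an observation sequence under an HMM in two ways: by summing over every hidden state sequence, and by the forward recursion, one of the dynamic-programming passes that Baum-Welch builds on. It is not a full Baum-Welch implementation, and the two-state model parameters are hypothetical toy values.

```python
from itertools import product

def brute_force_likelihood(obs, init, trans, emit):
    """Sum P(O, S) over every state sequence S: cost grows as N**T."""
    n_states = len(init)
    total = 0.0
    for states in product(range(n_states), repeat=len(obs)):
        p = init[states[0]] * emit[states[0]][obs[0]]
        for prev, cur, o in zip(states, states[1:], obs[1:]):
            p *= trans[prev][cur] * emit[cur][o]
        total += p
    return total

def forward_likelihood(obs, init, trans, emit):
    """Forward recursion: the same quantity in O(T * N**2) time."""
    n_states = len(init)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[prev] * trans[prev][cur] for prev in range(n_states))
            * emit[cur][o]
            for cur in range(n_states)
        ]
    return sum(alpha)

# Hypothetical two-state HMM over three observation symbols (0, 1, 2).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 2, 1]

slow = brute_force_likelihood(obs, init, trans, emit)
fast = forward_likelihood(obs, init, trans, emit)
print(slow, fast)
```

Both functions return the same likelihood up to rounding, but the brute-force version visits N**T state sequences, while the forward version does work proportional to T * N**2.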

Topic Modelling: a technique for assisting the navigation of huge document collections by seeing them as involving hidden or latent 'topic' variables; how this can be used to recover hidden relationships between documents; techniques to learn the parameters of these models.

In each case, alongside the explanation of the algorithms, there will be practical work, either developing instances of them, or deploying existing implementations and running them on data sets to see their properties concretely.

Recommended Reading List

I will be providing notes, sometimes directing attention to particular chapters from the following books, as well as to possible online sources:

Kevin Murphy's book 'Machine Learning: A Probabilistic Perspective'
Russell and Norvig's book 'Artificial Intelligence: A Modern Approach'
Jurafsky and Martin's book 'Speech and Language Processing'
Philipp Koehn's book 'Statistical Machine Translation'
associated site:
Manning and Schütze's book 'Foundations of Statistical Natural Language Processing'
note by Michael Collins on IBM models

Module Prerequisites: No prerequisite. To implement and experiment with the tools, students will need to be able to program in C++.
Assessment Details

Examination: 70%

Course Work: 30%

Marks are simply combined: students are not required to pass each component separately.
The exam takes place in January.
