Data Selection for SMT with Latent Domain Variables
I will address the problem of selecting adequate training sentence pairs from a mix-of-domains parallel corpus for a translation task represented by a small in-domain parallel corpus. I will present a novel latent domain translation model which includes domain priors, domain-dependent translation models, and language models. The goal of learning is to estimate the probability that a sentence pair in the mix-domain corpus is in-domain or out-of-domain, using in-domain corpus statistics as a prior. An EM training algorithm is presented, together with solutions for estimating the out-domain models given only in-domain and mix-domain data. We report on experiments in data selection (intrinsic) and machine translation (extrinsic) on a large parallel corpus consisting of a mix of rather diverse domains. Our results show that our latent domain approach significantly outperforms existing baselines.
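To make the EM idea above concrete, the following is a minimal sketch of that style of procedure, not the actual model from the talk: it assumes simple smoothed unigram language models in place of the full domain-dependent translation and language models, fixes the in-domain model from in-domain statistics (the prior knowledge), and uses EM over the mix-domain corpus to jointly estimate the domain prior, the out-domain model, and the per-sentence posterior of being in-domain. All function names and hyperparameters here are illustrative assumptions.

```python
import math
from collections import Counter

def unigram_lm(corpus, vocab, alpha=0.1):
    """Add-alpha smoothed unigram model over a fixed vocabulary (illustrative)."""
    counts = Counter(w for sent in corpus for w in sent)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def log_prob(sent, lm):
    return sum(math.log(lm[w]) for w in sent)

def em_select(in_corpus, mix_corpus, iters=20, alpha=0.1):
    """Return per-sentence posteriors P(in-domain | sentence) for mix_corpus."""
    vocab = {w for s in in_corpus + mix_corpus for w in s}
    lm_in = unigram_lm(in_corpus, vocab)    # fixed: in-domain statistics as prior
    lm_out = unigram_lm(mix_corpus, vocab)  # initial out-domain model (re-estimated)
    prior_in = 0.5
    post = []
    for _ in range(iters):
        # E-step: posterior of each mix-domain sentence being in-domain
        post = []
        for s in mix_corpus:
            li = math.log(prior_in) + log_prob(s, lm_in)
            lo = math.log(1.0 - prior_in) + log_prob(s, lm_out)
            m = max(li, lo)  # log-sum-exp trick for numerical stability
            pi = math.exp(li - m) / (math.exp(li - m) + math.exp(lo - m))
            post.append(pi)
        # M-step: re-estimate domain prior and out-domain model from soft counts
        prior_in = sum(post) / len(post)
        counts = Counter()
        for s, pi in zip(mix_corpus, post):
            for w in s:
                counts[w] += 1.0 - pi
        total = sum(counts.values()) + alpha * len(vocab)
        lm_out = {w: (counts[w] + alpha) / total for w in vocab}
    return post
```

Sentences with a high posterior would then be selected as pseudo in-domain training data; the key design point, as in the abstract, is that only in-domain and mix-domain data are observed, so the out-domain model must be recovered indirectly via EM.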