ParaSense. Parallel corpora for Dutch word sense disambiguation

Start - End

2007 - 2012 (completed)

Type

PhD research

Research group(s)

LT3 - Language and Translation Technology Team

Research Focus

Language technology

Abstract

Ambiguity remains one of the major problems for current machine translation systems. For example, Babelfish (Babelfish.altavista.com) translates the sentence "Apple has doubled its profits in 2005" as "De appel heeft zijn winsten in 2005 verdubbeld". Although "appel" (fruit) is a correct translation of the word "Apple", it is the wrong translation in this context. Other language technology applications, such as Question Answering (QA) systems or information retrieval (IR) systems, also suffer from poor contextual Word Sense Disambiguation (WSD).

WSD is considered one of the most difficult problems within language technology today. It requires a form of artificial text understanding: the system should detect the correct word sense based on the context of the word. In this project we want to develop a generic automatic WSD system for Dutch. This system should detect words with more than one sense and assign the correct contextual sense. Current state-of-the-art WSD systems are mainly based on supervised learning algorithms that learn from labeled data, i.e. corpora annotated with manually assigned sense labels. Given that such corpora hardly exist for Dutch and that manual labeling is very time-consuming and expensive, we will start from parallel corpora.

The approach of deducing word senses automatically from parallel corpora is based on the observation that a word with more than one sense often has different translations for these different senses. Given that the Dutch word "blik" is translated into English as "glance" and as "tin", we can conclude that "blik" has at least two distinct senses. The use of parallel corpora for WSD has been investigated in several studies for, among others, English and Chinese, and appears to be a promising method (Ng et al. 2003, Shao and Ng 2004, etc.).
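
As a minimal sketch of this idea, assuming word alignments are already available as (Dutch word, English translation) pairs, the distinct translations of a word can be collected as a rough sense inventory. The function name and the toy data below are illustrative assumptions, not part of the ParaSense system.

```python
from collections import defaultdict


def translation_inventory(aligned_pairs):
    """Map each source word to the set of translations it is aligned with."""
    inventory = defaultdict(set)
    for source_word, translation in aligned_pairs:
        inventory[source_word.lower()].add(translation.lower())
    return dict(inventory)


# Toy word-aligned pairs (hypothetical data, not taken from a real corpus).
pairs = [
    ("blik", "glance"),
    ("blik", "tin"),
    ("blik", "glance"),
    ("appel", "apple"),
]

inventory = translation_inventory(pairs)

# Words with more than one distinct translation are candidates for polysemy.
ambiguous = sorted(word for word, trans in inventory.items() if len(trans) > 1)
print(ambiguous)
```

In this toy setting only "blik" is flagged, because it is the only word aligned with more than one distinct translation. Real alignments are noisy, and synonymous translations would also produce multiple labels, so this is only a starting point.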

Using parallel corpora solves a couple of other issues as well. Defining the possible senses of a polysemous word is rather subjective, and many words are assigned different senses in different dictionaries. In addition, there is a granularity problem: it is not clear how detailed sense distinctions must be in order to be useful in concrete applications, and not all sense distinctions are lexicalised in all languages. Take the English word "head" as an example: this word is always translated as "hoofd" in Dutch (both for "chief" and for "body part").
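
The granularity point can be illustrated with a small sketch, assuming each occurrence of an English word comes with its aligned translations in several languages. If the tuple of translations is used as a sense label, a distinction that one language leaves unlexicalised can still surface through another. The French translations, the data layout and the function name below are illustrative assumptions.

```python
def sense_labels(occurrences, languages):
    """Return the distinct translation tuples over the chosen languages."""
    return {tuple(occ[lang] for lang in languages) for occ in occurrences}


# Toy aligned translations of English "head" (hypothetical data).
occurrences = [
    {"nl": "hoofd", "fr": "tête"},  # "head" as body part
    {"nl": "hoofd", "fr": "chef"},  # "head" as chief
]

dutch_only = sense_labels(occurrences, ["nl"])
dutch_french = sense_labels(occurrences, ["nl", "fr"])

print(len(dutch_only))    # number of labels when only Dutch is used
print(len(dutch_french))  # number of labels when Dutch and French are combined
```

With Dutch alone the two occurrences collapse into a single label, while adding French separates them. Which languages to combine, and how many, is exactly the kind of choice that determines the granularity of the resulting sense inventory.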

In this project we examined the following research topics:

To what extent can we detect word senses automatically, based on parallel corpora, without using any information from dictionaries or other lexical sources?
How high is the error rate of automatic word alignment of parallel corpora?
How much syntactic knowledge is needed for good detection of word senses/translations?
What is the optimal variation in contrasting languages for establishing an efficient sense inventory?
What is the optimal sense granularity for reaching good performance?
Which improvements in terms of precision and recall can we obtain by integrating our automatic WSD system into a practical application?

People

Supervisor(s)

Veronique Hoste

Department of Translation, Interpreting and Communication

PhD Student(s)

Els Lefever

Department of Translation, Interpreting and Communication

Publications

An evaluation and possible improvement path for current SMT behavior on ambiguous nouns (2011)
- Els Lefever
- Veronique Hoste
Construction of a benchmark data set for cross-lingual word sense disambiguation (2010)
- Els Lefever
- Veronique Hoste
Discovering missing Wikipedia inter-language links by means of cross-lingual word sense disambiguation (2012)
- Els Lefever
- Veronique Hoste
- Martine De Cock
Examining the validity of cross-lingual word sense disambiguation (2011)
- Els Lefever
- Veronique Hoste
Five languages are better than one: an attempt to bypass the data acquisition bottleneck for WSD (2013)
- Els Lefever
- Veronique Hoste
- Martine De Cock
Language-independent bilingual terminology extraction from a multilingual parallel corpus (2009)
- Els Lefever
- Lieve Macken
- Veronique Hoste
ParaSense or how to use parallel corpora for word sense disambiguation (2011)
- Els Lefever
- Veronique Hoste
- Martine De Cock
ParaSense: parallel corpora for word sense disambiguation (2012)
- Els Lefever
SBFC: an efficient feature frequency-based approach to tackle cross-lingual word sense disambiguation (2012)
- Dieter Mourisse
- Els Lefever
- Nele Verbiest
- Yvan Saeys
- Martine De Cock
- Chris Cornelis
SemEval-2010 Task 3: cross-lingual word sense disambiguation (2010)
- Els Lefever
- Veronique Hoste
SemEval-2010 Task 3: cross-lingual word sense disambiguation (2009)
- Els Lefever
- Veronique Hoste
SemEval-2013 task 10: cross-lingual word sense disambiguation (2013)
- Els Lefever
- Veronique Hoste
Using parallel corpora for word sense disambiguation (2011)
- Els Lefever
- Veronique Hoste
- Martine De Cock