MuST. Multilingual corpora for the automatic structuring of terms

Start - End

2013 - 2014 (completed)

Type

Postdoc research

Research group(s)

LT3 - Language and Translation Technology Team

Research Focus

Language technology

Abstract

The MuST project aims to extract all domain-specific terms from a multilingual technical corpus, together with the semantic relationships between these terms. Automatically detecting synonymy and hyponymy links is an important step from a flat term list towards a structured concept list. For both the terminology extraction and the detection of semantic relations between the terms, we use only the available parallel corpora, without adding any external lexical resources. This makes the approach generic and language-independent, so that it can be deployed dynamically to new domains or documents.

The bilingual terminology extraction is carried out with a previously developed tool that generates bilingual term pairs from a parallel corpus (Lefever et al. 2009). For the automatic detection of synonyms, a distributional approach is combined with a multilingual method. The distributional approach starts from the hypothesis that semantically related words occur in similar contexts: by comparing the contextual and syntactic information of terms, we can identify semantically related terms in the term list. For the multilingual method, we apply a previously developed approach to word sense disambiguation using parallel corpora (Lefever et al. 2011). To extract hyponymy relations automatically, we develop an algorithm that adapts an existing pattern-based approach (Hearst 1992) to a multilingual context and further optimizes the results by means of comparable corpora.
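As an illustration of the pattern-based idea, the following minimal sketch applies a single English Hearst-style pattern ("X such as Y, Z and W") to a sentence, restricted to a known term list. It is not the project's algorithm: the function name `hearst_such_as`, the term-list-driven matching, the naive plural handling, and the sample terms are all assumptions made for the example, and the multilingual adaptation and the use of comparable corpora are not shown.

```python
import re


def hearst_such_as(sentence, terms):
    """Return (hyponym, hypernym) pairs found via 'X such as Y, Z and W'.

    Hypothetical helper: only candidates from `terms` are considered,
    and plurals are handled naively with an optional trailing 's'.
    """
    # Longest terms first so multiword terms win over their substrings.
    alt = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    term = rf"(?:{alt})"
    sep = r"(?:, and |, or |, | and | or )"
    pattern = re.compile(
        rf"\b(?P<hyper>{term})s?,? such as "
        rf"(?P<hypos>{term}s?(?:{sep}{term}s?)*)",
        re.IGNORECASE,
    )
    pairs = []
    for match in pattern.finditer(sentence):
        hypernym = match.group("hyper").lower()
        # Every term in the enumeration is a hyponym candidate.
        for hyponym in re.findall(term, match.group("hypos"), re.IGNORECASE):
            pairs.append((hyponym.lower(), hypernym))
    return pairs


# Invented sample data, for illustration only.
terms = ["sensor", "thermocouple", "pressure transducer", "strain gauge"]
sentence = ("The unit supports sensors such as thermocouples, "
            "pressure transducers and strain gauges.")
pairs = hearst_such_as(sentence, terms)
```

Restricting matches to the extracted term list keeps the sketch short; a fuller system would more likely match noun phrases identified by a chunker and cover several patterns per language.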

References

Lefever, E., Macken, L., and Hoste, V. (2009). Language-independent bilingual terminology extraction from a multilingual parallel corpus. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. The Association for Computational Linguistics, Athens, Greece.

Lefever, E., Hoste, V., and De Cock, M. (2011). ParaSense or how to use Parallel Corpora for Word Sense Disambiguation. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA.

Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora. Proceedings of the International Conference on Computational Linguistics (COLING-1992), 539-545. Nantes, France.

People

Researcher(s)

Els Lefever

Department of Translation, Interpreting and Communication

Publications

A combined pattern-based and distributional approach for automatic hypernym detection in Dutch (2013)
- Gwendolijn Schropp
- Els Lefever
- Veronique Hoste
Applying hybrid terminology extraction to aspect-based sentiment analysis (2015)
- Orphée De Clercq
- Marjan Van de Kauter
- Els Lefever
- Veronique Hoste
Evaluation of automatic hypernym extraction from technical corpora in English and Dutch (2014)
- Els Lefever
- Marjan Van de Kauter
- Veronique Hoste
HypoTerm: detection of hypernym relations between domain-specific terms in Dutch and English (2014)
- Els Lefever
- Marjan Van de Kauter
- Veronique Hoste
LT3: a multi-modular approach to automatic taxonomy construction (2015)
- Els Lefever