A respeaking and collaborative game-based approach to building a parsed corpus of European Spanish dialects

Start - End

2018 - 2023 (ongoing)

Type

Multiresearcher project

Department(s)

Department of Linguistics

Department of Translation, Interpreting and Communication

Research group(s)

DiaLing - Diachronic and Diatopic Linguistics

Research Focus

Communication

Language technology

Linguistics

Abstract

The study of dialectal microvariation in the Spanish spoken in Spain has until recently focused mainly on lexical and phonetic features. The morphosyntax of these dialects, by contrast, remains largely unexplored, despite the recent surge of interest in dialect grammars. This is due to the lack of large annotated dialectal corpora. This project aims to fill that gap by creating the first morphosyntactically annotated and parsed corpus of the European Spanish dialects. The corpus will be geographically balanced, and its material will be drawn from the COSER corpus (Corpus Oral y Sonoro del Español Rural, 'Audible Corpus of Spoken Rural Spanish'), which is the largest collection of oral data in the Spanish-speaking world but remains largely untranscribed. Because transcribing and annotating are expensive and labor-intensive, the project takes a respeaking and collaborative game-based approach to building the parsed corpus. In other words, we intend to obtain automatic transcriptions using a speech recognizer. These transcriptions will then be processed with Natural Language Processing tools and used to create a crowdsourced game in which members of the public contribute annotations, and thereby help co-create the parsed corpus.

People