Conversational Speech Translation
The recent focus on deep neural models has enabled new end-to-end approaches for many traditional NLP tasks. One in particular is spoken language translation (SLT), which was traditionally performed using a cascade of separately trained automatic speech recognition (ASR) and machine translation (MT) models. End-to-end models have many potential benefits, but also raise questions about how to best address tasks previously accomplished with separate processing steps.
This challenge focuses on one such question: what is the best way to produce fluent translations from disfluent speech? Further information about disfluent conversational speech can be found below.
This task uses a smaller dataset than other tasks, which we hope some groups may find more approachable. To enable wide participation, we have multiple participation options:
- We ask for submissions which translate either from speech or from text only, using the provided ASR output
- We encourage systems with both constrained (Fisher only) and unconstrained (open) data conditions
Submitted systems will be ranked in terms of multiple automatic metrics, including BLEU and METEOR.
Disfluent Conversational Speech
Conversational speech has many artifacts not present in written text, including disfluencies (hesitations, repetitions, self-corrections, etc.) and differences in grammar:
|Disfluent English||Fluent English|
|uh, uh, uh, um, i think it’s like that||i think it’s like that|
|i also have um eh i’m taking a marketing class ..||i’m also taking a marketing class|
|because what is, mhm do you recall now that ..||do you recall now that ..|
|and so am and so the university where i am it’s the university of pennsylvania||i am at the university of pennsylvania|
Preserving such disfluencies in translation output can reduce the readability and usability of generated translations. Further, they create a domain mismatch with typical MT training data. In cascaded models, these problems were often addressed with a separate model to remove disfluencies between ASR and MT, or as a post-processing step. We note that such a model requires annotated disfluencies, which will not be available for all domains and languages. The rise of end-to-end SLT models raises the question: should disfluency removal be a separate step, or can it be incorporated into our translation models?
The task addresses the translation of conversational speech from disfluent Spanish into fluent English. We encourage creative ways to remove disfluencies and produce fluent output which will be broadly applicable. To do so, we will provide two sets of reference translations for the primary training data, disfluent and fluent, and encourage creative submissions which do not use the fluent references in training (and so may better generalize to situations where this data is unavailable).
This task uses the LDC Fisher Spanish speech (disfluent) with new target translations (fluent). This dataset has 160 hours of speech (138k utterances), making it smaller than the datasets used in other tasks, which we hope some groups may find more approachable.
We provide multi-way parallel data for experimentation:
- disfluent speech
- disfluent transcripts (gold)
- disfluent transcripts (ASR output)
- disfluent translations
- fluent translations
Each of these is parallel at the level of the training data, such that the disfluent and fluent translation references have the same number of utterances. A more detailed description of the fluent translation data can be found here: (Salesky et al. 2018).
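As a quick sanity check on the parallel data, the utterance counts of the different views can be compared. The following is a minimal sketch, assuming each view is stored as a plain text file with one utterance per line; the function names and any file names passed to them are hypothetical and are not taken from the data package.

```python
def count_utterances(path):
    """Count the lines (assumed: one utterance per line) in a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(paths):
    """Return the shared utterance count, or raise if the files disagree.

    `paths` is any iterable of file paths (hypothetical names, e.g. the
    disfluent and fluent translation files for one data split).
    """
    counts = {str(path): count_utterances(path) for path in paths}
    if len(set(counts.values())) > 1:
        raise ValueError(f"Utterance counts differ: {counts}")
    # All counts are equal here; an empty input list yields 0.
    return next(iter(counts.values()), 0)
```

A mismatch most likely indicates that the mapping step described in the data package README has not yet been applied to the original LDC files.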
We have arranged an evaluation license agreement with the LDC under which all participants may receive this data without cost for the purposes of this task. License agreement: iwslt_2020_ldc_evaluation_agreement.pdf
Participants should sign the license agreement and follow the directions in the pdf to return a signed copy to LDC (by email or fax). Once received, LDC will provide a download link for the data package within 1-2 days. Participants who do not already have an LDC account will need to create one to download the data; the LDC membership office will assist with any questions. Test data will be automatically distributed to the LDC accounts of participants who have registered for the training data.
To enable immediate participation, we provide preprocessed speech features, with mapped (parallel) speech and text (transcript and translation) utterances in the IWSLT data package. For those who instead wish to extract their own features etc., we additionally provide the original LDC data packages. We note that the original speech and translations require a mapping step to be made parallel, and we provide code to do so within the data package (further details in the data package README). This is only necessary if you wish to extract your own features.
We strongly encourage participants who wish to use additional data beyond what is provided (unconstrained) to also submit systems which use only the Fisher data provided (constrained); constrained and unconstrained systems will be scored separately. We will also note which systems did not use the fluent references for training.
DATA RESTRICTION NOTE: Data from Fisher dev, dev2, and test splits and the Spanish Callhome dataset are not permitted for model training.
Participant submissions will be scored against the fluent translation references for the challenge test sets (to be released separately during the evaluation test period), using the automatic metrics BLEU and METEOR. By convention, to allow comparison with previously published work on the Fisher translation datasets, we will score using lowercased, detokenized output with all punctuation except apostrophes removed.
At test time, speech systems must be given only speech input, and text-only systems must be given only the provided ASR output.
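The scoring convention (lowercased, all punctuation except apostrophes removed) can be approximated as follows. This is a minimal sketch and not the official scoring script: the function name is hypothetical, "punctuation" is assumed to mean any character in a Unicode punctuation category, both the ASCII and the typographic apostrophe are assumed to be kept, and punctuation is replaced by a space rather than deleted. Detokenization is not handled here.

```python
import unicodedata

# Assumption: both the ASCII and the typographic apostrophe are preserved.
APOSTROPHES = {"'", "\u2019"}


def normalize_for_scoring(line):
    """Lowercase a line and strip punctuation other than apostrophes."""
    kept = []
    for ch in line.lower():
        if ch in APOSTROPHES:
            kept.append(ch)
        elif unicodedata.category(ch).startswith("P"):
            # Assumption: punctuation becomes a space so that the words on
            # either side of a hyphen or dash are not merged together.
            kept.append(" ")
        else:
            kept.append(ch)
    # Collapse the runs of whitespace left behind by removed punctuation.
    return " ".join("".join(kept).split())


print(normalize_for_scoring("Uh, I think it's like that."))
# prints: uh i think it's like that
```

Applying the same normalization to both system output and references before computing BLEU or METEOR keeps the comparison consistent.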
We will compare to the baseline models described in (Salesky et al. 2019).
All IWSLT 2020 tasks will follow the same dates:
|January 2020: release of train and dev data|
|March 17 2020: release of test data|
|May 18th 2020: camera-ready paper due|
We will provide test and challenge test input, and we would like to see outputs for all test sets. Submissions should be plain text with one utterance per line, pre-formatted for scoring (lowercased, detokenized output with all punctuation except apostrophes removed).
- Participants must specify if their systems translate from speech, or text-only
- Participants must specify if their submission is unconstrained (uses additional data beyond what is provided) or constrained (uses only the Fisher data provided); constrained and unconstrained systems will be scored separately.
- Participants should also note if they did not use the fluent references to train.
Submissions should be compressed in a single .tar.gz file and sent to email@example.com.
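Packaging the output files into a single .tar.gz archive could be done along these lines. This is a minimal sketch using Python's standard `tarfile` module; the function name and any file or archive names are hypothetical, and no particular layout inside the archive is prescribed by the task.

```python
import tarfile
from pathlib import Path


def package_submission(output_files, archive_path):
    """Write the given output files into one gzip-compressed tar archive."""
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for output_file in output_files:
            output_file = Path(output_file)
            # Store each file under its base name only (assumption: a flat
            # archive layout is acceptable).
            tar.add(output_file, arcname=output_file.name)
    return archive_path
```

For example, `package_submission(["test.en", "challenge_test.en"], "submission.tar.gz")` would bundle two hypothetical output files into one archive.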