Video Speech Translation
We live in a multimodal world in which we see objects, hear sounds, feel textures, smell odors, and so on. The purpose of this shared task is to open up possibilities for multimodal machine translation. It examines methods for combining video and audio sources as input to translation models. In addition to generally advancing the state of the art, our specific goals are:
- to thoroughly investigate and understand the challenges in translating videos and identify promising applications
- to create public benchmarks for the video translation task
- to study the translation errors and new approaches for evaluating video translation outputs
- to study the performance of video translation in realistic scenarios
As in the WMT evaluations, there are two evaluation tracks, both focusing on the Chinese-English and English-Russian directions.
- Constrained submission: You may use only the datasets provided in the Data section.
- Unconstrained submission: We also welcome unconstrained submissions, i.e., you may use additional datasets. If you do so, please flag all additional data sources used in your system.
All data sets from the OPUS project and the WMT evaluations are eligible. Additionally, participants can use the following data.
Currently, we do not have publicly available video corpora for the focus language directions, Chinese-English and English-Russian. However, we think that the video information in the following corpora might be helpful for multimodal MT.
Dev & Test
We will provide dev and test sets drawn from e-commerce live shows. In particular, we will provide Chinese video clips to be translated into English, and English video clips to be translated into Russian. These dev and test sets contain video, manual transcriptions, and human translations. The unseen test set will be released when the evaluation is due.
- dev set:
- test set:
- dev set:
- test set:
- Constrained track
- ASR: The IWSLT organizers provide an English engine, and we can provide a Kaldi-based Chinese system.
- Unconstrained track: Participants are encouraged to use any resources to build video translation systems. We will provide ASR and MT outputs from online systems as a baseline.
Multiple run submissions are allowed, but participants must explicitly indicate one PRIMARY run for each track. All other run submissions are treated as CONTRASTIVE runs. In the case that none of the runs is marked as PRIMARY, the latest submission (according to the file time-stamp) for the respective track will be used as the PRIMARY run.
Submissions must be packaged as a gzipped TAR archive (see format below) and sent as an email attachment to email@example.com, firstname.lastname@example.org, email@example.com, and firstname.lastname@example.org.
The file name of the TAR archive should include the type of system (cascade/end-to-end) used to generate the submission.
Each run has to be stored in a plain text file with one sentence per line. Each line has 2 columns separated by a tab. The first column is the wav id, and the second column is the translation output.
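The run-file layout described above (one sentence per line, wav id and translation separated by a tab) and the gzipped TAR packaging can be sketched in Python as follows. This is only an illustrative sketch: the function names, the file names, and the choice to collapse stray tabs and newlines inside a translation are assumptions, not requirements stated by the organizers.

```python
import os
import tarfile


def write_run_file(path, translations):
    """Write one run as UTF-8 text: '<wav id><TAB><translation>' per line.

    `translations` is an iterable of (wav_id, text) pairs (hypothetical input).
    """
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for wav_id, text in translations:
            # Assumption: collapse internal tabs/newlines so that each
            # line keeps exactly two tab-separated columns.
            clean = " ".join(text.split())
            f.write(f"{wav_id}\t{clean}\n")


def pack_runs(archive_path, run_paths):
    """Bundle the run files into a single gzipped TAR archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for p in run_paths:
            # Store files flat, without local directory prefixes.
            tar.add(p, arcname=os.path.basename(p))
```

For example, `write_run_file("primary.txt", [("clip_0001", "Hello everyone")])` followed by `pack_runs("myteam_cascade.tar.gz", ["primary.txt"])` would produce an archive whose name carries the system type; both file names here are hypothetical.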
The email should include the following information:
- Contact Person:
- Constrained/Unconstrained data condition:
- Brief abstract about the system:
Evaluation will be carried out both automatically and manually. Automatic evaluation will use standard machine translation metrics, such as BLEU. Native speakers of each language will manually check the translation quality for a small sample of the submissions. We also expect participants to support us in the manual evaluation (in proportion to their number of submissions).
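To make the automatic metric concrete, here is a minimal, simplified sketch of corpus-level BLEU in Python. It assumes whitespace tokenization, a single reference per sentence, n-grams up to order 4, and no smoothing. The official scoring tool and its tokenization are not specified here, so scores from this sketch should not be expected to match the organizers' numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale (one reference each)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()  # assumption: whitespace tokens
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngr, r_ngr = ngrams(h, n), ngrams(r, n)
            # Clipped matches: each n-gram counts at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((h_ngr & r_ngr).values())
            totals[n - 1] += sum(h_ngr.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output shorter than the reference.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)
```

Calling `corpus_bleu(system_lines, reference_lines)` on two equally long lists of sentences returns a single corpus score; an established implementation is preferable for any reported results.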