Simultaneous Speech Translation

Task Description

Simultaneous machine translation has become an increasingly popular topic in recent years. In particular, simultaneous speech translation (SST) enables interesting applications such as subtitle translation for a live event or real-time video-call translation. The goal of this challenge is to examine systems for translating audio in a source language into text in a target language with consideration of both translation quality and latency.

We encourage participants to submit systems based on either cascaded (ASR + MT) or end-to-end approaches. This year, participants will be evaluated on translating TED talks from English into German. There are two parallel tracks to enter:

  • Text-to-Text: translating ground-truth transcripts in real-time.
  • Speech-to-Text: directly translating speech into text in real-time.

We encourage participants to enter both tracks when possible.

Evaluating a simultaneous system is not trivial: unlike offline translation tasks, we cannot simply release the test data. Instead, participants are required to implement a provided API to read the input and write the translation, and to upload their system as a Docker file so that it can be evaluated under controlled conditions. We provide an example implementation and a baseline system.
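
To make this concrete, the sketch below shows the general shape of such a client: an agent that repeatedly decides whether to read more source input or to write a target token. All names in it (SkeletonAgent, policy, predict, the READ/WRITE actions, the toy driver) are illustrative assumptions, not the official task API; the provided example implementation is the authoritative reference.

    # Illustrative skeleton of a simultaneous-translation client.
    # Everything here (class names, method names, READ/WRITE actions) is a
    # hypothetical sketch of the general structure, NOT the official API.
    READ, WRITE = "READ", "WRITE"

    class SkeletonAgent:
        """Decides, after each step, whether to wait for more source context
        (READ) or to commit a target token (WRITE)."""

        def __init__(self):
            self.source_segments = []   # transcript words or audio chunks read so far
            self.target_tokens = []     # target tokens already committed

        def policy(self):
            # Plug in your own decision rule here (e.g. a wait-k schedule).
            return READ if len(self.source_segments) <= len(self.target_tokens) else WRITE

        def predict(self):
            # Plug in your own translation model here; we emit a dummy token.
            token = "tok%d" % len(self.target_tokens)
            self.target_tokens.append(token)
            return token

    def run(agent, source_stream):
        """Toy driver standing in for the evaluation server."""
        stream = iter(source_stream)
        hypothesis = []
        while True:
            if agent.policy() == READ:
                segment = next(stream, None)
                if segment is None:
                    break           # source exhausted; a real system would finish decoding
                agent.source_segments.append(segment)
            else:
                hypothesis.append(agent.predict())
        return hypothesis

    print(run(SkeletonAgent(), "an example source sentence".split()))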

The system's performance will be evaluated in two ways:

  • Translation quality: we will use multiple standard metrics: BLEU, TER, and ChrF.
  • Translation latency: we will use recently developed metrics for simultaneous machine translation, including average proportion (AP), average lagging (AL) and differentiable average lagging (DAL); see the sketch below.

In addition, we will report timestamps for informational purposes.
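
For reference, AP and AL can be computed from the per-token delays g(i), i.e. how much of the source (words for text, milliseconds of audio for speech) had been read when target token i was emitted. The sketch below follows the commonly used definitions and omits DAL, which additionally enforces a minimum gap between consecutive emissions; the official scoring script remains the reference implementation.

    # Sketch of AP / AL computed from per-token delays g(i); the official
    # scoring script is authoritative, this only illustrates the definitions.

    def average_proportion(delays, src_len):
        """AP = (1 / (|x| * |y|)) * sum_i g(i)."""
        return sum(delays) / (src_len * len(delays))

    def average_lagging(delays, src_len):
        """AL = (1 / tau) * sum_{i <= tau} [ g(i) - (i - 1) / gamma ],
        with gamma = |y| / |x| and tau the first index where g(i) = |x|."""
        gamma = len(delays) / src_len
        total, tau = 0.0, 0
        for i, g in enumerate(delays, start=1):
            total += g - (i - 1) / gamma
            tau = i
            if g >= src_len:    # first token emitted after reading the full source
                break
        return total / tau

    # Example: a wait-3 schedule on a 6-word source with a 6-word target.
    delays = [3, 4, 5, 6, 6, 6]
    print(average_proportion(delays, src_len=6))   # ≈ 0.833
    print(average_lagging(delays, src_len=6))      # 3.0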

Training and Development Data

You may use the same training and development data available for the Offline Speech Translation task. Specifically, please refer to the “Allowed Training Data” and the “Past Editions Development Data” sections.

Evaluation Server

In this shared task, we provide an evaluation server that reads the raw data and sends out the source sentence step by step. Participants are required to adapt their model to work as a client, using our provided API, in order to receive inputs from the server and return simultaneous translations.
An evaluation script running on the server will automatically calculate the participant's system performance for both quality and latency. You can refer to this directory for the client/server implementation.

Here, we provide an example of using the evaluation script, as well as a simple baseline implemented in Fairseq.
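
As background, a simple and widely used policy for simultaneous translation is wait-k: read k source tokens first, then alternate between writing one target token and reading one more source token. The provided baseline may differ in its details; the toy helper below only illustrates how much source context each target position sees under a wait-k schedule.

    # How many source tokens target position i (1-based) can attend to under wait-k.
    def waitk_context(i, k, src_len):
        return min(k + i - 1, src_len)

    print([waitk_context(i, k=3, src_len=6) for i in range(1, 7)])
    # -> [3, 4, 5, 6, 6, 6], the delay schedule used in the AL example above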

Evaluation

We will evaluate translation quality with detokenized BLEU and latency with AP, AL and DAL. Systems will be ranked by translation quality within each of three latency regimes: low, medium and high. Each regime is determined by a maximum latency threshold. The thresholds are defined in terms of AL, which measures the delay relative to an ideal real-time system (in milliseconds for speech and in number of words for text); all three latency metrics, AL, DAL and AP, will be reported. Based on an analysis of the quality-latency tradeoffs of the baseline systems, the thresholds are set as follows:

Speech Translation:

  • Low latency: AL ≤ 1000
  • Medium latency: AL ≤ 2000
  • High latency: AL ≤ 4000

Text Translation:

  • Low latency: AL ≤ 3
  • Medium latency: AL ≤ 6
  • High latency: AL ≤ 15

The submitted systems will be categorized into latency regimes based on the AL calculated on the MuST-C English-German test set, while translation quality will be calculated on the blind test set. We require participants to submit at least one system for each latency regime. Participants are encouraged to submit multiple systems per regime in order to provide more data points for latency-quality tradeoff analyses. If multiple systems are submitted, we will keep the one with the best translation quality for ranking. Besides the three latency metrics, we will also calculate the total decoding time of each system under the server-client evaluation scheme.
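
As an illustration of this regime assignment (thresholds as listed above; AL is measured in milliseconds for speech and in words for text):

    # Map a measured AL value to a latency regime using the task thresholds.
    SPEECH_THRESHOLDS = [("low", 1000), ("medium", 2000), ("high", 4000)]   # AL in ms
    TEXT_THRESHOLDS = [("low", 3), ("medium", 6), ("high", 15)]             # AL in words

    def regime(al, thresholds):
        """Return the tightest regime whose threshold the measured AL satisfies."""
        for name, limit in thresholds:
            if al <= limit:
                return name
        return "over the high-latency threshold"

    print(regime(1800, SPEECH_THRESHOLDS))   # -> medium
    print(regime(5, TEXT_THRESHOLDS))        # -> medium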

Submission Guidelines

Participants need to submit their systems as a Docker file along with the necessary model files. We provide here an example Dockerfile together with a baseline.

Please pack and upload your Docker file and model files through this link. Please prefix your file names with a meaningful institution name.

Cloud Credits Application

Update: Applications are closed as of January 31, 2020. Participants in this task may have access to a limited amount of cloud credits to train their systems. Please apply by filling out this very short form.

Contacts

Chair: Jiatao Gu (Facebook, USA)
Discussion: iwslt-evaluation-campaign@googlegroups.com

Organizers

  • Jiatao Gu (Facebook)
  • Juan Pino (Facebook)
  • Changhan Wang (Facebook)
  • Xutai Ma (JHU)
  • Fahim Dalvi (QCRI)
  • Nadir Durrani (QCRI)