
Non-Native Speech Translation

Task Description

Speech recognition and translation have improved enormously over the last few years, as reported in numerous scientific papers. Yet taking current models out of the box and applying them in an arbitrary practical situation often quickly leads to disillusionment. The models perform well in laboratory conditions, with studio-quality recordings. Run them on the speech of high school students at a fair and you will see error rates well above 40% or complete failures.

The goal of the Non-Native Speech Translation Task is to examine the quality of English-to-Czech and English-to-German SLT in a realistic setting of non-native spontaneous speech, in somewhat noisy conditions. The task seeks submissions that proceed along the standard two-stage pipeline (ASR+MT) as well as end-to-end solutions, ideally recovering from disfluencies of all kinds: pronunciation, vocabulary choice as well as grammar.

The automatic evaluation of the task will be carried out in multiple tracks. The two primary criteria are:

  1. Raw ASR quality in terms of WER against the reference transcript
  2. Raw translation quality, comparing the final candidate translation with one or more references.
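
For orientation, criterion (1) can be sketched as a word-level edit distance divided by the number of reference words. This is only a minimal illustration: the function name `wer` is our own, and the sketch applies no normalization (casing, punctuation), since this page does not say how the official scoring treats those.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word deleted
                            curr[j - 1] + 1,      # hypothesis word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.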

In connection with the simultaneous speech translation challenge, we will also be able to assess:

  1. SLT delay, based on timestamps of words appearing after MT and automatic word alignment with the reference transcript, and
  2. flicker, reflecting the effort wasted in reading intermediate and later edited outputs.

Depending on the number of submissions, we may also be able to add a manual evaluation of translation quality, i.e. the human standard for criterion (2).

Participants can provide complete or partial solutions, starting at the non-segmented audio, timestamped gold transcript, or our baseline ASR output. The ideal expected submissions will include SLT output with updates, timestamped at the point when the output was emitted by MT.

Evaluation Details

As inputs, we will release unsegmented audio files. Their durations will range from 1 minute to a couple of dozen minutes. For each such file, we expect plaintext outputs of ASR and/or MT, in the formats described below. Some parts of these outputs are mandatory; others are optional and only help to provide a more fine-grained analysis.

You can take part in ASR only, ASR followed by MT (in which case we are also interested in your ASR outputs) or joint SLT (spoken language translation, where the ASR outputs are not available at all). MT-only submissions are also possible; please contact Ondřej Bojar directly to obtain baseline ASR outputs.

You can make as many submissions as you like, but you must indicate one of the submissions as PRIMARY.

Allowed Training Data

The non-native task distinguishes between constrained and non-constrained submissions.

Constrained submissions can use only the following datasets (resources listed several times contain data relevant for multiple stages of processing):

Non-constrained submissions are very welcome and can use any additional data.

Development Set

Unfortunately, the dev set only illustrates file formats, including expected output formats.

We will still try to extend the size of the dev set so that you can better assess your system quality during the test period.

File Format of ASR Candidates

For ASR-only submissions or for ASR+MT submissions, we expect an ASR candidate file in the following format. The format is sentence-oriented, based on your custom segmentation, case-sensitive and punctuated, i.e. you should provide correct casing and typesetting of your output. (If you cannot provide segmentation, it is better to submit just one huge lowercased sentence than to give up altogether.)

Each line of the ASR file shows the output of your ASR system, which gradually grows in subsequent lines until a sentence is completed. (We use the term sentence for whatever unit most closely resembles a sentence. Usually, a sentence ends with a punctuation mark, but this is not a formal requirement.) A completed sentence needs to come as a separate line, again followed by growing partial outputs, for instance:

P 60 0 5 Good
P 80 0 65 Good mor
P 113 0 102 Good morning
P 130 0 119 Good morning how
P 148 0 140 Good morning. How are
P 201 0 195 Good morning. How are you?
C 201 0 102 Good morning.
P 220 102 218 How are you? I
C 220 102 195 How are you?
P 245 195 239 I am

The partial (“P”) candidates allow your system to extend or revise its outputs, trading precision for lower latency and higher flicker. The P segments are not considered in the evaluation of accuracy. Only the complete (“C”) segments are required. For SLT-style submissions (end-to-end speech recognition and translation), this file is not required. Please provide it if you can, because it will allow for a more fine-grained evaluation.

There are three numbers (time stamps) in each line: display time, start time and end time. All times are measured in centiseconds from the start of the sound file.

Display time is the time when the given line/sentence was recognized, i.e. produced by the ASR system. If your system is not “on-line” in any sense, you can report 0 on all lines. The start and end times indicate the span in which the respective words were uttered in the recording. If your system does not provide timestamps, again report zeros.

The minimal acceptable submission would thus contain only full sentences, preceded by “C 0 0 0 ” on each line.
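
As a rough illustration of such a minimal submission, the sketch below turns a list of finished sentences into candidate lines. The helper name `minimal_submission` is hypothetical, and the whitespace collapsing is our own precaution to keep each sentence on a single line.

```python
def minimal_submission(sentences):
    """Return candidate-file lines for a system with no partial outputs or timing."""
    lines = []
    for sentence in sentences:
        # Collapse internal whitespace and newlines so that each
        # sentence occupies exactly one line of the file.
        text = " ".join(sentence.split())
        if text:  # skip empty sentences
            lines.append(f"C 0 0 0 {text}")
    return lines


if __name__ == "__main__":
    print("\n".join(minimal_submission(["Good morning.", "How are you?"])))
```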

The time stamps obey these rules:

  • end time >= start time; the difference is the duration of the segment
  • display time >= end time; the difference is the processing time of the ASR
  • considering only “C” lines, the end time of the previous one generally matches the start time of the next one
  • a row of “P” segments usually has the same start time, until a “C” segment with (the same) start time comes.
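
The rules above lend themselves to a quick self-check before submitting. The following is a sketch only, not an official validator: the names `Candidate`, `parse_line` and `check_timestamps` are our own, and fields are assumed to be separated by whitespace as in the example. Since consecutive “C” segments only "generally" touch, a mismatch there is reported as a note and not as an error.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Candidate:
    kind: str      # "P" (partial) or "C" (complete)
    display: int   # display time, centiseconds
    start: int     # start time, centiseconds
    end: int       # end time, centiseconds
    text: str


def parse_line(line: str) -> Candidate:
    """Split one candidate line into its marker, three timestamps and text."""
    fields = line.rstrip("\n").split(None, 4)
    if len(fields) < 4 or fields[0] not in ("P", "C"):
        raise ValueError(f"malformed candidate line: {line!r}")
    text = fields[4] if len(fields) == 5 else ""
    return Candidate(fields[0], int(fields[1]), int(fields[2]), int(fields[3]), text)


def check_timestamps(lines: Iterable[str]) -> List[str]:
    """Return human-readable messages for lines that break the timestamp rules."""
    problems = []
    previous_c = None
    for number, line in enumerate(lines, start=1):
        cand = parse_line(line)
        if cand.end < cand.start:
            problems.append(
                f"line {number}: end time {cand.end} < start time {cand.start}")
        if cand.display < cand.end:
            problems.append(
                f"line {number}: display time {cand.display} < end time {cand.end}")
        if cand.kind == "C":
            # Soft rule: the previous "C" end usually equals this "C" start.
            if previous_c is not None and previous_c.end != cand.start:
                problems.append(
                    f"line {number}: note: start time {cand.start} does not "
                    f"match previous C end time {previous_c.end}")
            previous_c = cand
    return problems
```

A file with all-zero timestamps passes trivially, which matches the minimal submission described earlier.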

File Format of Machine Translation Candidates

The format of the MT output file is formally identical to the ASR output file:

P 60 0 50 Gut
P 80 0 65 Guten Morgen!
P 113 0 102 Guten Morgen!
P 130 0 119 Guten wie morgen
P 148 0 140 Guten Morgen! Wie geht es?
P 201 0 195 Guten Morgen! Wie geht es dir?
C 201 0 102 Guten Morgen!
P 220 102 218 Wie geht es dir? Ich
C 220 102 195 Wie geht es dir?
P 245 195 239 Ich bin

Again, “P”artial candidates allow you to revise your output so far, reducing latency, and are fully optional. The “C”omplete candidates are required, and the concatenation of all the “C” candidates corresponds exactly to the translation of the whole test document.
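
To illustrate how the “C” candidates relate to the final document, the sketch below collects them from a candidate file and joins them in file order. The function name `final_text` is hypothetical, and joining with a single space is an assumption; the page does not specify a separator.

```python
def final_text(lines):
    """Join the text of all "C" lines, in file order, into one document string."""
    parts = []
    for line in lines:
        fields = line.rstrip("\n").split(None, 4)
        # Keep only complete candidates that carry some text;
        # partial ("P") lines are ignored.
        if len(fields) == 5 and fields[0] == "C":
            parts.append(fields[4])
    return " ".join(parts)
```

Applied to the German example above, this yields "Guten Morgen! Wie geht es dir?"; the trailing partial "Ich bin" is dropped because it was never completed.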

Timestamps have the same roles: display time, start time, end time. Display time is the time when the translation was produced by the MT system. Start time and end time indicate the span in the source language speech when the source of this segment was uttered.

If your translation system were truly instant, you could keep the P/C marks and timestamps exactly as in the ASR output file. Because it is not instant, the display times will be higher in the MT output than in the ASR output, but the start and end times will very likely be identical.

If your system does not support any live processing, you can set the timestamps to zero. Again, “C 0 0 0 ” is the minimal acceptable prefix for every line, indicating that you do not provide any partial outputs and timing information.


Chair: Ondrej Bojar (Charles University, Czech Republic)
Ebrahim Ansari (Charles University, Czech Republic)
Sebastian Stüker (KIT, Germany)


The non-native speech translation task is receiving support from the EU project ELITR (H2020-ICT-2018-2-825460).