Ircam Anasynth

A linear memory CTC-based algorithm for text-to-voice alignment of very long audio recordings

Guillaume Doras, Yann Teytaut, Axel Roebel

Welcome to the companion website of our paper “A linear memory CTC-based algorithm for text-to-voice alignment of very long audio recordings”.

Datasets

We release here the evaluation datasets used to compute the alignment performance reported in our paper.

Chapter 10 (speech)

For speech, we manually annotated the audio recording of a whole chapter of a publicly available Librivox audiobook. We chose Chapter 10 of “The problems of philosophy”, by B. Russell. This audiobook is not in the Librispeech list of books, and its reader is not in the Librispeech list of readers either.

Chapter 10 contains exactly 100 sentences and 2672 words, and the corresponding audio has a duration of 21:05 (MM:SS). The other chapters were kept for our experiments (see paper).

To manually annotate Chapter 10, we first removed the Librivox preamble (title and copyright information) from the audio. We then ran our algorithm to obtain a first alignment of the audio with the available text, and used it as a starting point to manually and precisely align each word to the audio. During the alignment process, we corrected a few inconsistencies between the recording and the transcription, so that the text and the audio match exactly.

Alignment was done by adjusting the markers for the start of each word. We used the spectrogram representation to precisely align the start of the words with the corresponding onsets.

The Chapter 10 audio, text and timestamps are available here.

Playlist50 (singing voice)

For singing, we leveraged the available annotations of the DALI dataset, in which all alignments have already been annotated at word level. We selected 150 songs that were not used during our model training, and we gathered a playlist containing 50 of these songs. We simply concatenated both the audio and the corresponding annotated lyrics available in DALI. The playlist contains 18094 words, and the corresponding audio has a duration of 2:50:24 (HH:MM:SS). The other 100 songs were kept for our experiments (see paper).
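As an illustration of the concatenation step, here is a minimal Python sketch of how per-song word annotations could be merged into a single playlist timeline by shifting each song's timestamps by the cumulative duration of the songs before it. This is not the script we used: the function name `concatenate_annotations`, the `(word, start, end)` tuple layout, and the use of seconds are assumptions made for the example and do not reflect the actual DALI file format.

```python
from typing import List, Tuple

# Hypothetical annotation layout: (word, start_seconds, end_seconds),
# with times relative to the start of the word's own song.
Word = Tuple[str, float, float]


def concatenate_annotations(songs: List[Tuple[float, List[Word]]]) -> List[Word]:
    """Merge per-song word annotations into one playlist-level timeline.

    `songs` is a list of (song_duration_seconds, words) pairs in playlist order.
    """
    merged: List[Word] = []
    offset = 0.0  # cumulative duration of all songs already processed
    for duration, words in songs:
        for text, start, end in words:
            # Shift song-relative times to playlist-relative times.
            merged.append((text, start + offset, end + offset))
        offset += duration
    return merged


# Two toy "songs" of 10 s and 20 s, each with one annotated word.
playlist = concatenate_annotations([
    (10.0, [("hello", 1.0, 1.5)]),
    (20.0, [("world", 2.0, 2.5)]),
])
```

In this toy case, the word from the second song starts at 12.0 s in the playlist, that is, 2.0 s after the 10 s of audio that precede it.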

The Playlist50 audio, annotations and DALI ids are available here.

Results

Speech

We provide below a visualisation of the alignment of more than two hours of audio with the corresponding text (Chapters 7 to 13) of the Librivox audiobook “The problems of philosophy”, by B. Russell. The alignment clearly remains accurate even after more than two hours of audio (use full screen for better rendering).

Singing voice

We provide below a visualisation of the alignment of more than two hours of audio with the corresponding text of Playlist50. The alignment clearly remains accurate even after more than two hours of audio (use full screen for better rendering).

Other languages

We provide below a visualisation of the alignment of speech in languages other than English. These audio recordings are excerpts of open source Librivox audiobooks in different languages. The system succeeds in aligning text to voice even though it has been trained on English only (use full screen for better rendering).

Arabic

An extract of “كليلة ودمنة (Kalila wa dimna)”, by Abdullah Ibn al-Muqaffaʿ. The original text was manually transliterated (we tried to automate the transliteration with tools such as Google Translate or polyglot, but their output was not accurate enough).