Information

Type
Thesis/HDR defense
Performance location
Ircam, Salle Igor-Stravinsky (Paris)
Duration
42 min
Date
July 7, 2023

PhD thesis defense of Yann Teytaut

Yann Teytaut completed his thesis, "On Temporal Constraints for Deep Neural Voice Alignment," in the Sound Analysis and Synthesis team at the STMS Laboratory (Ircam-Sorbonne University-CNRS-Ministry of Culture) and the École doctorale Informatique, télécommunications et électronique de Paris. His research was funded by the ANR ARS project ( http://ars.ircam.fr/ ).

The jury is composed of:

Prof. Gaël Richard, Professor, Télécom Paris (Reviewer)
Dr. Emmanouil Benetos, Reader, Queen Mary University of London (QMUL) (Reviewer)
Prof. Jean-Pierre Briot, Research Director, LIP6 (CNRS/SU) (Examiner)
Dr. Emmanuel Vincent, Research Director, Inria Nancy-Grand Est (Examiner)
Dr. Rachel Bittner, Research Manager, Spotify Inc. (Examiner)
Dr. Romain Hennequin, Head of Research, Deezer (Examiner)
Dr. Chitralekha Gupta, Research Fellow, National University of Singapore (NUS) (Examiner)
Dr. Axel Roebel, Research Director, Ircam (PhD Supervisor)

Abstract:

To listen, to respond, to make coincide, to coordinate, to adjust, to follow, to adapt, to be in unison, to synchronize, to align... The rich vocabulary dedicated to the correspondence of human activities shows the importance of their temporal organization. Human communication, multi-modal by nature, is directly concerned by this issue, since a semantic gap separates oral utterances from their symbolic transcriptions: how can a written message be interpreted without vocal intonation? What performative style lies beyond a fixed musical score? This thesis proposes to uncover the complex underlying relationships between the audio and symbolic domains, and thereby reduce this gap, through a fine-grained study of the temporality inherent in voice recordings.

The voice alignment task lies at the core of this objective, as it aims to determine the temporal occurrence of symbols that are assumed to be present in a voice signal. This work focuses in particular on the development of an acoustic model, ADAGIO, capable of estimating such time-symbol links. Recent progress in deep learning has made it possible to implement ADAGIO as a deep neural network within a powerful generic formalism: the "Connectionist Temporal Classification" (CTC). However, the great flexibility offered by CTC is undermined by its intrinsic lack of guarantees for temporally accurate predictions. The key contributions of this research therefore consist in reinforcing CTC with additional temporal constraints that improve the quality of the inferred alignments. To this end, three ancillary tasks are introduced: (1) spectral content reconstruction; (2) audio structure propagation; and (3) guided monotony. These constraints have a positive impact on the alignment between voices, texts, and notes. Finally, through collaborations, ADAGIO contributes to several practical applications, such as concatenative speech synthesis and the study of the expressive production strategies at play in both social attitudes in speech and singing styles in musical performances.
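For readers unfamiliar with the CTC formalism mentioned in the abstract, the sketch below illustrates how a CTC alignment objective can be combined with an auxiliary spectral-reconstruction loss, in the spirit of the first of the three ancillary tasks. This is a minimal PyTorch sketch, not the thesis's ADAGIO implementation; the names AlignerSketch, training_loss, and lambda_rec, and the architecture and loss weighting, are hypothetical.

import torch.nn as nn
import torch.nn.functional as F

class AlignerSketch(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_symbols=40):
        super().__init__()
        # Shared encoder over mel-spectrogram frames.
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        # CTC head: per-frame symbol posteriors (+1 class for the CTC blank).
        self.ctc_head = nn.Linear(2 * hidden, n_symbols + 1)
        # Auxiliary head: reconstruct the input frames, so the latent
        # representation stays anchored to the acoustic content.
        self.rec_head = nn.Linear(2 * hidden, n_mels)

    def forward(self, mels):                       # mels: (B, T, n_mels)
        h, _ = self.encoder(mels)                  # h: (B, T, 2*hidden)
        return self.ctc_head(h).log_softmax(-1), self.rec_head(h)

def training_loss(model, mels, mel_lens, targets, target_lens, lambda_rec=0.5):
    log_probs, recon = model(mels)
    blank = log_probs.size(-1) - 1                 # blank is the last class
    ctc = nn.CTCLoss(blank=blank, zero_infinity=True)(
        log_probs.transpose(0, 1),                 # CTCLoss expects (T, B, C)
        targets, mel_lens, target_lens)
    rec = F.mse_loss(recon, mels)                  # spectral reconstruction term
    return ctc + lambda_rec * rec

At inference time, frame-level alignments would then be obtained by forced (Viterbi-style) decoding of the CTC posteriors against the known symbol sequence; the reconstruction term shown here is one plausible way to temper CTC's lack of guarantees of temporal accuracy, which the abstract identifies as the motivation for the ancillary tasks.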

