information

Type
Thesis/HDR defense
performance location
Ircam, Salle Igor-Stravinsky (Paris)
duration
01 h 06 min
date
February 21, 2023

Antoine Caillon's thesis defense

Antoine Caillon, a doctoral student at Sorbonne Université, defends his thesis "Apprentissage temporel hiérarchique pour la synthèse audio neuronale de la musique" (Hierarchical temporal learning for neural audio synthesis of music), carried out in the Musical Representations team of the STMS laboratory (Ircam) under the supervision of Jean Bresson and Philippe Esling.

The jury is composed of:

Simon Colton, Reviewer - Queen Mary University of London (United Kingdom)
Bob Sturm, Reviewer - KTH Royal Institute of Technology (Sweden)
Michèle Sebag, Examiner - Université Paris-Saclay
Patrick Gallinari, Examiner - Sorbonne Université
Mark Sandler, Examiner - Queen Mary University of London (United Kingdom)
Jean Bresson, Thesis supervisor - Sorbonne Université
Philippe Esling, Thesis co-supervisor and advisor - Sorbonne Université

Abstract

Recent advances in deep learning have offered new ways to build models addressing a wide variety of tasks through the optimization of a set of parameters that minimize a cost function. Among these techniques, probabilistic generative models have yielded impressive advances in text, image, and sound generation. However, musical audio signal generation remains a challenging problem.

In this thesis, we study how a hierarchical approach to audio modeling can address the musical signal modeling task while offering the user different levels of control. Our main hypothesis is that extracting different representation levels of an audio signal allows each modeling stage to abstract away the complexity of the levels below it. This in turn allows the use of lightweight architectures, each modeling a single audio scale.

We start by addressing raw audio modeling, proposing an audio model that combines Variational Autoencoders and Generative Adversarial Networks to yield high-quality 48 kHz neural audio synthesis while running 20 times faster than real time on CPU. We then study how autoregressive models can capture the temporal behavior of the representation produced by this low-level audio model, using optional conditioning signals such as acoustic descriptors or tempo. Finally, we propose a method for applying all of these models directly to audio streams, allowing their use in the real-time applications developed during this thesis.
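To make the first contribution concrete, here is a minimal sketch, assuming PyTorch, of a variational autoencoder on raw waveforms trained with an additional adversarial critic, in the spirit of the VAE/GAN combination described in the abstract. Every module size, name, and downsampling ratio below is an illustrative assumption, not the thesis implementation.

```python
import torch
import torch.nn as nn

class WaveformVAE(nn.Module):
    """Encode raw audio into a compact latent sequence and decode it back."""
    def __init__(self, latent_dim=16):
        super().__init__()
        # Strided convolutions downsample the waveform 512x in total
        # (hypothetical ratios: 4 * 4 * 4 * 8).
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, 9, stride=4, padding=4), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, 9, stride=4, padding=4), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, 9, stride=4, padding=4), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 2 * latent_dim, 9, stride=8, padding=4),
        )
        # Transposed convolutions mirror the encoder back up to audio rate.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 128, 16, stride=8, padding=4), nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(128, 64, 8, stride=4, padding=2), nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(64, 32, 8, stride=4, padding=2), nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(32, 1, 8, stride=4, padding=2), nn.Tanh(),
        )

    def forward(self, x):
        mean, logvar = self.encoder(x).chunk(2, dim=1)
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(z), mean, logvar

# A small convolutional critic scoring the realism of each audio frame.
critic = nn.Sequential(
    nn.Conv1d(1, 32, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
    nn.Conv1d(32, 64, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
    nn.Conv1d(64, 1, 15, stride=4, padding=7),
)

x = torch.randn(1, 1, 2 ** 14)          # a batch of raw audio samples
x_hat, mean, logvar = WaveformVAE()(x)  # reconstruction + posterior stats
kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
adversarial = -critic(x_hat).mean()     # generator side of the GAN objective
```

A full training loss would combine a reconstruction term with the KL and adversarial terms above; the 48 kHz quality and faster-than-real-time figures come from the thesis, not from this sketch.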
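The second stage can be illustrated in the same hedged way: an autoregressive prior over the latent sequence produced by the low-level model, with optional per-frame conditioning such as acoustic descriptors or tempo. The GRU-based architecture below is an assumption chosen for brevity, not the model studied in the thesis.

```python
import torch
import torch.nn as nn

class LatentPrior(nn.Module):
    """Predict the next latent frame from past frames and conditioning."""
    def __init__(self, latent_dim=16, cond_dim=2, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(latent_dim + cond_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2 * latent_dim)  # mean and log-variance

    def forward(self, z, cond):
        # z:    (batch, time, latent_dim) latent frames from the audio model
        # cond: (batch, time, cond_dim), e.g. a loudness or tempo curve
        h, _ = self.rnn(torch.cat([z, cond], dim=-1))
        mean, logvar = self.out(h).chunk(2, dim=-1)
        return mean, logvar

prior = LatentPrior()
z = torch.randn(1, 32, 16)    # a latent sequence of 32 frames
cond = torch.randn(1, 32, 2)  # hypothetical conditioning signal
mean, logvar = prior(z, cond)
# Training maximizes the likelihood of z[:, 1:] under the predictions
# made from z[:, :-1]; generation feeds sampled frames back as input.
```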
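Finally, the stream-processing contribution rests on making models causal and stateful so they can consume small successive buffers, the way an audio callback does. One standard way to achieve this, sketched below without any claim that it matches the thesis implementation, is a convolution that caches its left context between calls.

```python
import torch
import torch.nn as nn

class CachedConv1d(nn.Module):
    """A causal convolution whose left context survives across buffers."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)
        self.pad = kernel_size - 1
        self.cache = None  # left context carried from the previous call

    def forward(self, x):
        if self.cache is None:  # first buffer: zero left context
            self.cache = x.new_zeros(x.shape[0], x.shape[1], self.pad)
        x = torch.cat([self.cache, x], dim=-1)
        self.cache = x[..., -self.pad:]  # keep the tail for the next buffer
        return self.conv(x)

conv = CachedConv1d(1, 1, kernel_size=5)
stream = torch.randn(1, 1, 4096)
# Feeding four 1024-sample buffers gives the same result as one
# offline pass, which is what makes real-time use possible.
buffered = torch.cat([conv(b) for b in stream.split(1024, dim=-1)], dim=-1)
conv.cache = None  # reset the state before the offline comparison
offline = conv(stream)
assert torch.allclose(buffered, offline, atol=1e-6)
```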
