Antoine Caillon, PhD candidate at Sorbonne Université, defends his thesis "Hierarchical temporal learning for neural audio synthesis of music", carried out in the Musical Representations team of the STMS laboratory (Ircam) under the supervision of Jean Bresson and Philippe Esling.
The jury is composed of:
Simon Colton, Reviewer - Queen Mary University of London (United Kingdom)
Bob Sturm, Reviewer - KTH Royal Institute of Technology (Sweden)
Michèle Sebag, Examiner - Université Paris-Saclay
Patrick Gallinari, Examiner - Sorbonne Université
Mark Sandler, Examiner - Queen Mary University of London (United Kingdom)
Jean Bresson, Thesis supervisor - Sorbonne Université
Philippe Esling, Thesis co-supervisor and advisor - Sorbonne Université
Abstract
Recent advances in deep learning have offered new ways to build models addressing a wide variety of tasks by optimizing a set of parameters to minimize a cost function. Among these techniques, probabilistic generative models have yielded impressive advances in text, image, and sound generation. However, generating musical audio signals remains a challenging problem. In this thesis, we study how a hierarchical approach to audio modeling can address the musical signal modeling task while offering different levels of control to the user. Our main hypothesis is that extracting different representation levels of an audio signal allows the complexity of lower levels to be abstracted away at each modeling stage. This would eventually allow the use of lightweight architectures, each modeling a single audio scale. We first address raw audio modeling by proposing a model that combines Variational Autoencoders and Generative Adversarial Networks, yielding high-quality 48 kHz neural audio synthesis while running 20 times faster than real time on a CPU. We then study how autoregressive models can capture the temporal behavior of the representation produced by this low-level audio model, optionally conditioned on additional signals such as acoustic descriptors or tempo. Finally, we propose a method for applying all of these models directly to audio streams, enabling their use in the real-time applications developed during this thesis.
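To make the two-stage hierarchy described in the abstract more concrete, the following PyTorch sketch shows one possible shape of such a system: a variational autoencoder compresses the waveform into a compact latent sequence, and a small autoregressive prior then models that sequence, optionally conditioned on an extra control signal (e.g. a descriptor or tempo curve). All class names, layer sizes, and the GRU-based prior are illustrative assumptions, not the thesis implementation; the adversarial fine-tuning stage and the training losses are omitted.

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Low-level stage: compress raw audio into a compact latent sequence."""
    def __init__(self, latent_dim=16):
        super().__init__()
        # Strided 1-D convolutions downsample the waveform into latent frames.
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=4, padding=4), nn.ReLU(),
        )
        self.mean = nn.Conv1d(64, latent_dim, kernel_size=1)
        self.logvar = nn.Conv1d(64, latent_dim, kernel_size=1)

    def forward(self, audio):
        h = self.net(audio)
        mean, logvar = self.mean(h), self.logvar(h)
        # Reparameterization trick: sample latents during training.
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)
        return z, mean, logvar

class LatentDecoder(nn.Module):
    """Low-level stage: reconstruct the waveform from the latent sequence."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 64, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(64, 32, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=9, padding=4), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

class LatentPrior(nn.Module):
    """High-level stage: autoregressive model of the latent sequence,
    optionally conditioned on a control signal (hypothetical cond_dim)."""
    def __init__(self, latent_dim=16, cond_dim=1, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(latent_dim + cond_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, latent_dim)

    def forward(self, z, cond):
        # Predict latent frame t+1 from frames <= t plus the conditioning signal.
        x = torch.cat([z, cond], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)

if __name__ == "__main__":
    audio = torch.randn(1, 1, 4096)          # one batch of raw audio samples
    enc, dec = LatentEncoder(), LatentDecoder()
    z, mean, logvar = enc(audio)              # latent frames, shape (1, 16, 256)
    recon = dec(z)                            # reconstructed waveform, (1, 1, 4096)
    prior = LatentPrior()
    cond = torch.zeros(1, z.shape[-1], 1)     # dummy conditioning (e.g. tempo)
    pred = prior(z.transpose(1, 2), cond)     # next-frame latent predictions
    print(recon.shape, pred.shape)
```

The point of the split is that the autoregressive prior only has to model a short, low-rate latent sequence rather than the raw waveform, which is what keeps each stage lightweight.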