• 2019-2020 Season > PhD Defense of Gabriel Meseguer Brocal
  • July 9, 2020
  • Ircam
Participants
  • Gabriel Meseguer Brocal (speaker)
  • Laurent Girin (reviewer)
  • Gaël Richard (reviewer)
  • Rachel Bittner (examiner)
  • Elena Cabrio (examiner)
  • Bruno Gas (examiner)
  • Perfecto Herrera Boyer (examiner)
  • Antoine Liutkus (examiner)
  • Geoffroy Peeters (thesis advisor)

PhD Defense of Gabriel Meseguer Brocal: MULTIMODAL ANALYSIS: Informed Content Estimation and Audio Source Separation

The jury

Reviewers:
Dr. Laurent GIRIN, Grenoble-INP - Institut Polytechnique de Grenoble
Dr. Gaël RICHARD, LTCI - Télécom Paris - Institut Polytechnique de Paris

Examiners:
Dr. Rachel BITTNER, Spotify New York
Dr. Elena CABRIO, Université Côte d'Azur - Inria - CNRS - I3S
Dr. Bruno GAS, ISIR - UMR7222 - Sorbonne Université Paris
Dr. Perfecto HERRERA BOYER, MTG - Universitat Pompeu Fabra Barcelona
Dr. Antoine LIUTKUS, Centre Inria Nancy - Grand Est

Advisor:
Dr. Geoffroy PEETERS, LTCI - Télécom Paris - Institut Polytechnique de Paris

Real-world stimuli are produced by complex phenomena and their constant interaction in various domains. Human understanding builds useful abstractions that fuse different modalities into a joint representation.
Multimodal learning describes methods that analyse phenomena from different modalities and their interaction in order to tackle complex tasks. This results in better and richer representations that improve the performance of current machine learning methods.

This dissertation studies multimodal learning in the context of musical signals. Throughout, we focus on the interaction between audio signals and text information.
Among the many text sources related to music that could be used (e.g. reviews, metadata, or social-network feedback), we concentrate on lyrics.
The singing voice directly connects the audio signal and the text information in a unique way, combining melody and lyrics, where a linguistic dimension complements the abstraction of musical instruments.

The first obstacle we address is the lack of data containing singing voice with aligned lyrics. This data is indispensable for developing our ideas. Therefore, we investigate how to create such a dataset automatically, leveraging resources from the World Wide Web. Creating this type of dataset is a challenge in itself that raises many research questions. We are constantly facing the classic "chicken or egg" problem: acquiring and cleaning this data requires accurate models, but it is difficult to train models without data. We develop a method in which dataset creation and model learning are treated not as independent tasks but as complementary efforts. We progressively improve the model using the collected data, and every time we obtain an improved version, we can in turn correct and enhance the data. Finally, we propose a method to automatically locate any errors that still remain, allowing us to estimate the overall accuracy of the dataset, select the points that are correct, and possibly fix erroneous data.
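The alternation between model learning and data cleaning described above can be sketched as a simple loop. This is a toy illustration with stand-in `train` and `confidence` functions, not the pipeline from the thesis: each round fits a model on the current data, scores every item with it, and keeps only the reliable items for the next round.

```python
def train(dataset):
    # Stand-in "model": here simply the mean of the kept items.
    return sum(dataset) / len(dataset)

def confidence(model, x):
    # Stand-in confidence score: closeness of an item to the model's estimate.
    return 1.0 / (1.0 + abs(x - model))

def bootstrap(raw, n_rounds=3, threshold=0.25):
    """Alternate between training on the current data and re-cleaning it."""
    dataset = list(raw)
    for _ in range(n_rounds):
        model = train(dataset)  # fit on the current version of the data
        dataset = [x for x in dataset if confidence(model, x) >= threshold]
    return dataset, model

# A clear outlier (10.0) is filtered out once the model improves.
cleaned, model = bootstrap([1.0, 1.1, 0.9, 10.0])
```

The retained confidence scores also give an estimate of the overall accuracy of the resulting dataset, mirroring the error-localization step mentioned above.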

After developing the dataset, we center our efforts on exploring the interaction between lyrics and audio in two different tasks. First, we improve lyric segmentation by combining text and audio; we show that each domain captures complementary structures that benefit the overall performance. Second, we explore vocal source separation, hypothesizing that knowledge of the aligned phoneme information is beneficial for this task.
We investigate how to integrate conditioning mechanisms into source separation in a multitask learning setting. Since the multitask scenario comes with a well-known dataset, it helps us validate the use of conditioning mechanisms. We then adapt these mechanisms to improve vocal source separation once the aligned phonemes are known.
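One widely used conditioning mechanism of this kind is feature-wise linear modulation (FiLM), in which the conditioning input (here, an aligned-phoneme embedding) predicts a per-channel scale and shift applied to the separator's intermediate features. The sketch below is a minimal pure-Python illustration of that idea; the names and shapes are assumptions made for the example, not the thesis implementation.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: per-channel scale and shift.

    features: list of time frames, each a list of channel values.
    gamma, beta: per-channel parameters, predicted in practice by a small
    network from the conditioning input (e.g. a phoneme embedding).
    """
    return [[g * v + b for v, g, b in zip(frame, gamma, beta)]
            for frame in features]

# Toy example: the conditioning emphasizes the second channel.
feats = [[1.0, 1.0, 1.0]] * 4      # 4 frames, 3 channels
gamma = [0.0, 2.0, 1.0]            # per-channel scale
beta  = [0.0, 0.5, 0.0]            # per-channel shift
out = film(feats, gamma, beta)
```

Because the scale and shift depend on the conditioning input, the same separation network can modulate its internal features differently for each phoneme.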

Finally, we summarize our contributions, highlighting the main research questions we address and our proposed answers.
We discuss potential future work in detail, addressing each task individually. We first propose new use cases for our dataset as well as ways of improving its reliability.
We also analyze the conditional approach we developed and different strategies for improving it.
