information

Type
Thesis/HDR defense
performance location
Ircam, Salle Igor-Stravinsky (Paris)
duration
01 h 30 min
date
July 9, 2020

PhD defense of Gabriel Meseguer Brocal: MULTIMODAL ANALYSIS: Informed Content Estimation and Audio Source Separation

The jury

Reviewers:
Dr. Laurent GIRIN, Grenoble-INP - Institut Polytechnique de Grenoble
Dr. Gael RICHARD, LTCI - Télécom Paris - Institut Polytechnique de Paris

Examiners:
Dr. Rachel BITTNER, Spotify New York
Dr. Elena CABRIO, Université Côte d'Azur - Inria - CNRS - I3S
Dr. Bruno GAS, ISIR - UMR7222 - Sorbonne Université Paris
Dr. Perfecto HERRERA BOYER, MTG - Universitat Pompeu Fabra Barcelona
Dr. Antoine LIUTKUS, Centre Inria Nancy - Grand Est

Supervisor:
Dr. Geoffroy PEETERS, LTCI - Télécom Paris - Institut Polytechnique de Paris

Abstract

Real-world stimuli are produced by complex phenomena and their constant interaction across different domains. Our understanding of these stimuli relies on useful abstractions that fuse different modalities into a joint representation. Multimodal learning describes methods that analyse phenomena from different modalities and their interaction in order to tackle complex tasks. This results in better and richer representations that improve the performance of current machine learning methods.

This dissertation studies multimodal learning in the context of musical signals. Throughout, we focus on the interaction between audio signals and text information. Among the many text sources related to music that could be used (e.g. reviews, metadata, or social-network feedback), we concentrate on lyrics. The singing voice directly connects the audio signal and the text information in a unique way, combining melody and lyrics, where a linguistic dimension complements the abstraction of musical instruments.

The first obstacle we address is the lack of data containing singing voice with aligned lyrics, which is indispensable for developing our ideas. We therefore investigate how to create such a dataset automatically by leveraging resources from the World Wide Web. Creating this type of dataset is a challenge in itself that raises many research questions, and we constantly face the classic "chicken or the egg" problem: acquiring and cleaning the data requires accurate models, but it is difficult to train models without data. We develop a method in which dataset creation and model learning are not treated as independent tasks but as complementary efforts: we progressively improve the model using the collected data, and each improved version in turn lets us correct and enhance the data. Finally, we propose a method to automatically locate any remaining errors, which allows us to estimate the overall accuracy of the dataset, select the points that are correct, and eventually improve the erroneous data.

With the dataset in place, we center our efforts on exploring the interaction between lyrics and audio in two tasks. First, we improve lyric segmentation by combining text and audio, showing that each domain captures complementary structures that benefit the overall performance. Second, we explore vocal source separation, hypothesizing that knowing the aligned phoneme information is beneficial for this task. We investigate how to integrate conditioning mechanisms into source separation in a multitask learning scenario; since this scenario is accompanied by a well-known dataset, it helps us validate the use of conditioning mechanisms.
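The announcement does not detail the conditioning layer itself. As an illustration only, a feature-wise (FiLM-style) conditioning block, one common way of injecting control information such as a source label or a phoneme embedding into a separation network, could be sketched as follows; all names, shapes, and dimensions below are hypothetical and not taken from the thesis:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: scales and shifts each channel of an
    intermediate feature map as a function of a conditioning vector."""
    def __init__(self, cond_dim: int, n_channels: int):
        super().__init__()
        self.scale = nn.Linear(cond_dim, n_channels)
        self.shift = nn.Linear(cond_dim, n_channels)

    def forward(self, features: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # features: (batch, channels, freq, time) spectrogram-like activations
        # condition: (batch, cond_dim), e.g. a one-hot source label or a
        # pooled embedding of the aligned phoneme sequence (hypothetical)
        gamma = self.scale(condition)[:, :, None, None]
        beta = self.shift(condition)[:, :, None, None]
        return gamma * features + beta

# toy usage: condition a feature map on a 4-dimensional control vector
film = FiLM(cond_dim=4, n_channels=16)
x = torch.randn(2, 16, 128, 64)       # batch of intermediate activations
z = torch.tensor([[1., 0., 0., 0.],   # e.g. "separate the vocals"
                  [0., 0., 1., 0.]])  # e.g. "separate the bass"
y = film(x, z)
print(y.shape)                        # torch.Size([2, 16, 128, 64])
```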
We then adapt these mechanisms to improve vocal source separation once the aligned phonemes are known. Finally, we summarize our contributions, highlighting the main research questions we address and our proposed answers. We discuss potential future work in detail, addressing each task individually: we propose new use cases for our dataset as well as ways of improving its reliability, and we analyze the conditioning approach we developed together with different strategies to improve it.
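For completeness, the dataset-creation strategy mentioned earlier (the "chicken or the egg" loop) can be summarized as a simple alternation: score the raw examples with the current model, keep only the ones it is confident about, retrain on that cleaned subset, and repeat. The sketch below is a schematic toy, not the thesis code; every callable, threshold, and data structure in it is hypothetical:

```python
import random

def bootstrap_dataset(raw_examples, score_fn, train_fn, n_rounds=3, keep=0.8):
    """Alternate between scoring the data with the current model, keeping only
    the examples it is confident about, and retraining on that cleaned subset.
    All callables and thresholds here are hypothetical placeholders."""
    model = None                 # no model yet: the first round scores with a "blank" model
    kept = list(raw_examples)
    for _ in range(n_rounds):
        scored = [(ex, score_fn(model, ex)) for ex in kept]  # annotate and score
        kept = [ex for ex, s in scored if s >= keep]         # drop unreliable examples
        model = train_fn(model, kept)                        # retrain on the cleaned data
    return kept, model

# toy run with random confidence scores and a dummy trainer
random.seed(0)
examples = list(range(100))
kept, model = bootstrap_dataset(
    examples,
    score_fn=lambda m, ex: random.random(),
    train_fn=lambda m, data: ("dummy model trained on", len(data), "examples"),
)
print(len(kept), model)  # number of surviving examples and the dummy "model"
```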
