Jordi Bonada, of the Music Technology Group at Universitat Pompeu Fabra, Barcelona, invited by the Sound Analysis-Synthesis team (STMS - CNRS/IRCAM/UPMC) to serve on the jury for Luc Ardaillon's thesis, presents:
"A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs"
ABSTRACT:
We recently presented a new model for singing synthesis based on a modified version of the WaveNet architecture. Instead of modeling the raw waveform, we model features produced by a parametric vocoder that separates the influence of pitch and timbre. This allows the pitch to be conveniently modified to match any target melody, facilitates training on more modest dataset sizes, and significantly reduces training and generation times. However, compared to modeling the waveform directly, our approach places greater importance on effectively handling higher-dimensional outputs, multiple feature streams, and regularization. In this work, we extend our proposed system with additional components that predict F0 and phonetic timings from a musical score with lyrics. These expression-related features are learned together with timbral features from a single set of natural songs. We compare our method to existing statistical parametric, concatenative, and neural network-based approaches using quantitative metrics as well as listening tests.
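The abstract describes an autoregressive, WaveNet-style network that predicts frames of parametric vocoder features conditioned on pitch (F0) and phonetic context, rather than raw audio samples. The sketch below is a minimal illustration of that idea, not the authors' implementation: the class names, layer sizes, conditioning layout, and the plain regression output are all hypothetical simplifications (the published system instead predicts probability distributions over each frame's feature values).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalBlock(nn.Module):
    """Gated dilated causal convolution with conditioning (WaveNet-style).
    Skip connections are omitted for brevity; only residuals are kept."""
    def __init__(self, channels, cond_channels, dilation):
        super().__init__()
        self.pad = dilation                      # left-pad to stay causal
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=2,
                              dilation=dilation)
        self.cond = nn.Conv1d(cond_channels, 2 * channels, kernel_size=1)
        self.res = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x, c):
        h = self.conv(F.pad(x, (self.pad, 0))) + self.cond(c)
        filt, gate = h.chunk(2, dim=1)
        h = torch.tanh(filt) * torch.sigmoid(gate)   # gated activation unit
        return x + self.res(h)                       # residual connection

class TimbreModelSketch(nn.Module):
    """Predicts the next vocoder feature frame from past frames plus
    frame-aligned F0 and phoneme conditioning (hypothetical layout)."""
    def __init__(self, feat_dim=60, channels=64, n_phonemes=50, n_layers=6):
        super().__init__()
        self.phon_emb = nn.Embedding(n_phonemes, 15)
        cond_channels = 15 + 1                   # phoneme embedding + log-F0
        self.inp = nn.Conv1d(feat_dim, channels, kernel_size=1)
        self.blocks = nn.ModuleList(
            DilatedCausalBlock(channels, cond_channels, 2 ** i)
            for i in range(n_layers))
        self.out = nn.Conv1d(channels, feat_dim, kernel_size=1)

    def forward(self, frames, log_f0, phonemes):
        # frames:   (batch, feat_dim, time)  past vocoder frames
        # log_f0:   (batch, 1, time)         frame-level pitch contour
        # phonemes: (batch, time)            frame-level phoneme ids
        c = torch.cat([self.phon_emb(phonemes).transpose(1, 2), log_f0], dim=1)
        x = self.inp(frames)
        for block in self.blocks:
            x = block(x, c)
        return self.out(x)   # one predicted next frame per input timestep

# Teacher-forced training step on a batch of frame-aligned dummy features.
model = TimbreModelSketch()
frames = torch.randn(8, 60, 200)                 # 200 vocoder frames
log_f0 = torch.randn(8, 1, 199)
phonemes = torch.randint(0, 50, (8, 199))
pred = model(frames[:, :, :-1], log_f0, phonemes)
loss = F.mse_loss(pred, frames[:, :, 1:])        # simplified regression loss
```

Modeling frame-level vocoder features rather than raw samples is what keeps such a network small and fast: the sequence is hundreds of frames per second instead of tens of thousands of samples, and pitch remains an explicit, editable conditioning input.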