Audio Language Models

pôle documentaire

Vous constatez une erreur ?

informations

évènements: Réunion du Groupement de Recherche "Information Signal Image viSion" : Synthèse en traitement du signal audionumérique
Type: Séminaire / Conférence
Lieu de représentation: Ircam, Salle Igor-Stravinsky (Paris)
date: 7 novembre 2024

Audio analysis and audio synthesis require modeling long-term, complex phenomena and have historically been tackled in an asymmetric fashion, with specific analysis models that differ from their synthesis counterpart. In this presentation, we will introduce the concept of audio language models, a recent innovation aimed at overcoming these limitations. By discretizing audio signals using a neural audio codec, we can frame both audio generation and audio comprehension as similar autoregressive sequence-to-sequence tasks, capitalizing on the well-established Transformer architecture commonly used in language modeling. This approach unlocks novel capabilities in areas such as textless speech modeling, zero-shot voice conversion, text-to-music generation and even real-time spoken dialogue. Furthermore, we will illustrate how the integration of analysis and synthesis within a single model enables the creation of versatile audio models capable of handling a wide range of tasks involving audio as inputs or outputs. We will conclude by highlighting the promising prospects offered by these models and discussing the key challenges that lie ahead in their development.