| Time | 10/12 Sun | 10/13 Mon | 10/14 Tue | 10/15 Wed |
|---|---|---|---|---|
| 7:00 – 8:00 |  | Breakfast | Breakfast | Breakfast |
| 8:00 – 8:55 |  | Keynote by Karen Livescu Sponsored by Cisco | Keynote by Alexandre Défossez Sponsored by Treble | Keynote by Juan Pablo Bello Sponsored by Adobe |
| 8:55 – 10:10 |  | Oral Session 1 | Oral Session 3 | Oral Session 4 |
| 10:10 – 10:30 |  | Coffee Break Sponsored by ByteDance & Dolby | Coffee Break Sponsored by Apple, Yamaha, & Microsoft | Coffee Break Sponsored by Google & Starkey |
| 10:30 – 12:30 |  | Poster Session 1 | Poster Session 2 | Poster Session 4 |
| 12:30 – 14:00 |  | Lunch | Lunch | Lunch / Closing |
| 14:00 – 16:00 |  | Afternoon Break | Afternoon Break / AASP-TC Meeting |  |
| 16:00 – 16:30 | Registration/check-in | Afternoon Break (Cont.) | Poster Session 3 |  |
| 16:30 – 18:00 | Registration/check-in (Cont.) | Oral Session 2 | Poster Session 3 (Cont.) |  |
| 18:00 – 18:15 | Break | Break | Break |  |
| 18:15 – 20:30 | Opening Remarks & Dinner Sponsored by Cisco | Dinner Sponsored by Treble | Award Ceremony & Dinner Sponsored by Amazon |  |
| 20:30 – 22:30 | Welcome Reception Sponsored by MERL | Cocktails & Demo Session Sponsored by Reality Labs (Meta) | Cocktails & Town Hall Sponsored by Adobe |  |
Monday 10/13
Keynote Speech 1 (8:00 am — 8:55 am)
Speaker: Karen Livescu
Title: “On the (co-)evolution of universal written, spoken, and signed language processing”
Abstract: Natural language processing research has evolved over the past few years from mainly task-specific models, to task-independent representation models fine-tuned for specific tasks, and finally to fully task-independent language models. This progression addresses a desire for universality in the sense of handling arbitrary tasks in the same model. Another dimension of universality is the ability to serve arbitrary types of language users, regardless of their choice of language, dialect, or other individual properties. Progress toward universality has historically been addressed largely independently in separate research communities focusing on written, spoken, and signed language, although they share many similarities. This talk will trace the recent progress toward universality in these three language modalities, while highlighting a few pieces of recent work.
Oral Session 1 (8:55 am — 10:10 am)
Session Chairs: Shoichi Koyama, Enzo de Sena
- O1-1 Kernel Ridge Regression Based Sound Field Estimation using a Rigid Spherical Microphone Array
Ryo Matsuda, Juliano G. C. Ribeiro, Hitoshi Akiyama, Jorge Trevino
Abstract
We propose a sound field estimation method based on kernel ridge regression using a rigid spherical microphone array. Kernel ridge regression with physically constrained kernel functions, and further with kernel functions adapted to observed sound fields, has proven to be a powerful tool. However, such methods generally assume an open-sphere microphone array configuration, i.e., no scatterers exist within the observation or estimation region. Alternatively, some approaches assume the presence of scatterers and attempt to eliminate their influence through a least-squares formulation. Even then, these methods typically do not incorporate the boundary conditions of the scatterers, which are not presumed to be known. In contrast, we exploit the fact that the scatterer here is a rigid sphere, meaning that both the virtual scattering source locations and the boundary conditions are well defined. Based on this, we formulate the scattered sound field within the kernel ridge regression framework and propose a novel sound field representation incorporating a boundary constraint. The effectiveness of the proposed method is demonstrated through numerical simulations and real-world experiments using a newly developed spherical microphone array.
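For readers who want a concrete picture of the kernel ridge regression step, here is a minimal single-frequency sketch that uses the generic free-field kernel j0(k·||x − x′||); the rigid-sphere scattering terms and boundary constraint contributed by the paper are not modeled, and the function and variable names are illustrative.

```python
import numpy as np
from scipy.special import spherical_jn


def krr_sound_field(mic_pos, pressures, eval_pos, k, lam=1e-3):
    """Estimate a single-frequency sound field at eval_pos from complex
    pressures observed at mic_pos, via kernel ridge regression with the
    free-field kernel j0(k * distance). This omits the paper's rigid-sphere
    scattering model and boundary constraint."""
    def kernel(a, b):
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return spherical_jn(0, k * d)

    K = kernel(mic_pos, mic_pos)                        # Gram matrix (M x M)
    alpha = np.linalg.solve(K + lam * np.eye(len(mic_pos)), pressures)
    return kernel(eval_pos, mic_pos) @ alpha            # estimates at eval_pos
```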
- O1-2 Optimal Region-of-Interest Beamforming for Audio Conferencing with Dual Perpendicular Sparse Circular Sectors
Gal Itzhak, Simon Doclo, Israel Cohen
Abstract
We introduce a region-of-interest beamforming approach for audio conferencing that addresses dynamic acoustics and multiple-speaker scenarios. The approach employs a two-stage sparse optimization to select a subset of microphones from dual circular sector arrays: first on the xz plane and then on the xy plane, balancing spatial resolution and efficiency. Using the dual circular layout, we are able to reduce response variability across azimuth and elevation angles. The proposed approach maximizes broadband directivity while ensuring a controlled level of distortion and minimal white noise gain. Compared to existing methods, the mainlobe attained by the resulting beamformer is more accurately aligned with the region of interest. It also achieves a preferable sidelobe and backlobe suppression. Finally, the proposed approach is shown to be superior considering the directivity factor and white noise gain, in particular at medium and high frequencies.
- O1-3 Physics-Informed Direction-Aware Neural Acoustic Fields
Yoshiki Masuyama, François G. Germain, Gordon Wichern, Christopher Ick, Jonathan Le Roux
Abstract
This paper presents a physics-informed neural network (PINN) for modeling first-order Ambisonic (FOA) room impulse responses (RIRs). PINNs have demonstrated promising performance in sound field interpolation by combining the powerful modeling capability of neural networks and the physical principles of sound propagation. In room acoustics, PINNs have typically been trained to represent the sound pressure measured by omnidirectional microphones where the wave equation or its frequency-domain counterpart, i.e., the Helmholtz equation, is leveraged. Meanwhile, FOA RIRs additionally provide spatial characteristics and are useful for immersive audio generation with a wide range of applications. In this paper, we extend the PINN framework to model FOA RIRs. We derive two physics-informed priors for FOA RIRs based on the correspondence between the particle velocity and the (X, Y, Z)-channels of FOA. These priors associate the predicted W-channel and other channels through their partial derivatives and impose the physically feasible relationship on the four channels. Our experiments confirm the effectiveness of the proposed method compared with a neural network without the physics-informed prior.
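The physical relationship the abstract alludes to is, up to the Ambisonic normalization constants handled in the paper, the linearized Euler equation: the (X, Y, Z) channels follow the particle velocity, whose time derivative is tied to the spatial gradient of the pressure carried by the W channel, which itself obeys the wave equation.

```latex
\rho_0 \frac{\partial \mathbf{v}}{\partial t} = -\nabla p
\quad\Longrightarrow\quad
\frac{\partial}{\partial t}
\begin{pmatrix} X \\ Y \\ Z \end{pmatrix}
\propto
-\begin{pmatrix}
\partial W / \partial x \\
\partial W / \partial y \\
\partial W / \partial z
\end{pmatrix},
\qquad
\frac{\partial^2 W}{\partial t^2} = c^2 \nabla^2 W .
```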
- O1-4 Source and Sensor Placement for Sound Field Control Based on Mean Square Error with Prior Spatial Covariance
Shihori Kozuka, Shoichi Koyama, Hiroaki Itou, Noriyoshi Kamado
Abstract
A method for source (secondary loudspeaker) and sensor (control point) placement in sound field control is proposed. Since the placement of secondary loudspeakers and control points has a significant effect on the performance of sound field control, their optimization is of great importance in the development of practical systems. However, most current methods address either source placement or sensor placement. We propose a joint source and sensor placement method based on mean square error, which is applicable to (weighted) pressure matching. Our proposed method can be applied when the area where the sensors can be placed in the target control region is limited. Furthermore, prior information about the desired sound field can be incorporated. Numerical evaluation showed that efficient source and sensor placement can be achieved by the proposed method.
- O1-5 Towards Perception-Informed Latent HRTF Representations
You Zhang, Andrew Francl, Ruohan Gao, Paul Calamia, Zhiyao Duan, Ishwarya Ananthabhotla
Abstract
Personalized head-related transfer functions (HRTFs) are essential for ensuring a realistic auditory experience over headphones, because they take into account individual anatomical differences that affect listening. Most machine learning approaches to HRTF personalization rely on a learned low-dimensional latent space to generate or select custom HRTFs for a listener. However, these latent representations are typically learned in a manner that optimizes for spectral reconstruction but not for perceptual compatibility, meaning they may not necessarily align with perceptual distance. In this work, we first study whether traditionally learned HRTF representations are well correlated with perceptual relations using auditory-based objective perceptual metrics; we then propose a method for explicitly embedding HRTFs into a perception-informed latent space, leveraging a metric-based loss function and supervision via Metric Multidimensional Scaling (MMDS). Finally, we demonstrate the applicability of these learned representations to the task of HRTF personalization. We suggest that our method has the potential to render personalized spatial audio, leading to an improved listening experience.
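As a rough sketch of what a perception-informed latent space means in practice, the loss below pushes pairwise latent distances toward precomputed perceptual distances; the paper's MMDS-based supervision is related but not identical, and the names here are illustrative.

```python
import torch
import torch.nn.functional as F


def perceptual_metric_loss(latents, perceptual_dist):
    """latents: (N, D) HRTF embeddings; perceptual_dist: (N, N) distances
    from an auditory-model-based metric. Encourages the latent geometry to
    mirror perceptual distances (generic distance matching, not the paper's
    exact formulation)."""
    latent_dist = torch.cdist(latents, latents)   # (N, N) pairwise L2
    return F.mse_loss(latent_dist, perceptual_dist)
```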
Poster Session 1 (10:30 am — 12:30 pm)
Session Chair: Robin Scheibler
- P1-1 Kernel Ridge Regression Based Sound Field Estimation using a Rigid Spherical Microphone Array
Ryo Matsuda, Juliano G. C. Ribeiro, Hitoshi Akiyama, Jorge Trevino
Abstract
We propose a sound field estimation method based on kernel ridge regression using a rigid spherical microphone array. Kernel ridge regression with physically constrained kernel functions, and further with kernel functions adapted to observed sound fields, has proven to be a powerful tool. However, such methods generally assume an open-sphere microphone array configuration, i.e., no scatterers exist within the observation or estimation region. Alternatively, some approaches assume the presence of scatterers and attempt to eliminate their influence through a least-squares formulation. Even then, these methods typically do not incorporate the boundary conditions of the scatterers, which are not presumed to be known. In contrast, we exploit the fact that the scatterer here is a rigid sphere, meaning that both the virtual scattering source locations and the boundary conditions are well defined. Based on this, we formulate the scattered sound field within the kernel ridge regression framework and propose a novel sound field representation incorporating a boundary constraint. The effectiveness of the proposed method is demonstrated through numerical simulations and real-world experiments using a newly developed spherical microphone array.
- P1-2 Optimal Region-of-Interest Beamforming for Audio Conferencing with Dual Perpendicular Sparse Circular Sectors
Gal Itzhak, Simon Doclo, Israel Cohen
Abstract
We introduce a region-of-interest beamforming approach for audio conferencing that addresses dynamic acoustics and multiple-speaker scenarios. The approach employs a two-stage sparse optimization to select a subset of microphones from dual circular sector arrays: first on the xz plane and then on the xy plane, balancing spatial resolution and efficiency. Using the dual circular layout, we are able to reduce response variability across azimuth and elevation angles. The proposed approach maximizes broadband directivity while ensuring a controlled level of distortion and minimal white noise gain. Compared to existing methods, the mainlobe attained by the resulting beamformer is more accurately aligned with the region of interest. It also achieves a preferable sidelobe and backlobe suppression. Finally, the proposed approach is shown to be superior considering the directivity factor and white noise gain, in particular at medium and high frequencies.
- P1-3 Physics-Informed Direction-Aware Neural Acoustic Fields
Yoshiki Masuyama, François G. Germain, Gordon Wichern, Christopher Ick, Jonathan Le Roux
Abstract
This paper presents a physics-informed neural network (PINN) for modeling first-order Ambisonic (FOA) room impulse responses (RIRs). PINNs have demonstrated promising performance in sound field interpolation by combining the powerful modeling capability of neural networks and the physical principles of sound propagation. In room acoustics, PINNs have typically been trained to represent the sound pressure measured by omnidirectional microphones where the wave equation or its frequency-domain counterpart, i.e., the Helmholtz equation, is leveraged. Meanwhile, FOA RIRs additionally provide spatial characteristics and are useful for immersive audio generation with a wide range of applications. In this paper, we extend the PINN framework to model FOA RIRs. We derive two physics-informed priors for FOA RIRs based on the correspondence between the particle velocity and the (X, Y, Z)-channels of FOA. These priors associate the predicted W-channel and other channels through their partial derivatives and impose the physically feasible relationship on the four channels. Our experiments confirm the effectiveness of the proposed method compared with a neural network without the physics-informed prior.
- P1-4 Room Impulse Response Generation Conditioned on Acoustic Parameters
Silvia Arellano, Chunghsin Yeh, Gautam Bhattacharya, Daniel Arteaga
Abstract
The generation of room impulse responses (RIRs) using deep neural networks has attracted growing research interest due to its applications in virtual and augmented reality, audio postproduction, and related fields. Most existing approaches condition generative models on physical descriptions of a room, such as its size, shape, and surface materials. However, this reliance on geometric information limits their usability in scenarios where the room layout is unknown or when perceptual realism (how a space sounds to a listener) is more important than strict physical accuracy. In this study, we propose an alternative strategy: conditioning RIR generation directly on a set of RIR acoustic parameters. These parameters include various measures of reverberation time and direct sound to reverberation ratio, both broadband and bandwise. By specifying how the space should sound instead of how it should look, our method enables more flexible and perceptually driven RIR generation. We explore both autoregressive and non-autoregressive generative models operating in the Descript Audio Codec domain, using either discrete token sequences or continuous embeddings. Specifically, we have selected four models to evaluate: an autoregressive transformer, the MaskGIT model, a flow matching model, and a classifier-based approach. Objective and subjective evaluations are performed to compare these methods with state-of-the-art alternatives. Results show that the proposed models match or outperform state-of-the-art alternatives, with the MaskGIT model achieving the best performance. Listening examples are available at https://silviaarellanogarcia.github.io/rir-acoustic.
- P1-5 Source and Sensor Placement for Sound Field Control Based on Mean Square Error with Prior Spatial Covariance
Shihori Kozuka, Shoichi Koyama, Hiroaki Itou, Noriyoshi Kamado
Abstract
A method for source (secondary loudspeaker) and sensor (control point) placement in sound field control is proposed. Since the placement of secondary loudspeakers and control points has a significant effect on the performance of sound field control, their optimization is of great importance in the development of practical systems. However, most current methods address either source placement or sensor placement. We propose a joint source and sensor placement method based on mean square error, which is applicable to (weighted) pressure matching. Our proposed method can be applied when the area where the sensors can be placed in the target control region is limited. Furthermore, prior information about the desired sound field can be incorporated. Numerical evaluation showed that efficient source and sensor placement can be achieved by the proposed method.
- P1-6 Towards Perception-Informed Latent HRTF Representations
You Zhang, Andrew Francl, Ruohan Gao, Paul Calamia, Zhiyao Duan, Ishwarya Ananthabhotla
Abstract
Personalized head-related transfer functions (HRTFs) are essential for ensuring a realistic auditory experience over headphones, because they take into account individual anatomical differences that affect listening. Most machine learning approaches to HRTF personalization rely on a learned low-dimensional latent space to generate or select custom HRTFs for a listener. However, these latent representations are typically learned in a manner that optimizes for spectral reconstruction but not for perceptual compatibility, meaning they may not necessarily align with perceptual distance. In this work, we first study whether traditionally learned HRTF representations are well correlated with perceptual relations using auditory-based objective perceptual metrics; we then propose a method for explicitly embedding HRTFs into a perception-informed latent space, leveraging a metric-based loss function and supervision via Metric Multidimensional Scaling (MMDS). Finally, we demonstrate the applicability of these learned representations to the task of HRTF personalization. We suggest that our method has the potential to render personalized spatial audio, leading to an improved listening experience.
- P1-7 Online Incremental Learning for Audio Classification Using a Pretrained Audio Model
Manjunath Mulimani, Annamaria Mesaros
Abstract
Incremental learning aims to learn new tasks sequentially without forgetting the previously learned ones. Most of the existing incremental learning methods for audio focus on training the model from scratch on the initial task, and the same model is used to learn upcoming incremental tasks. The model is trained for several iterations to adapt to each new task, using some specific approaches to reduce the forgetting of old tasks. In this work, we propose a method for using generalizable audio embeddings produced by a pre-trained model to develop an online incremental learner that solves sequential audio classification tasks over time. Specifically, we inject a layer with a nonlinear activation function between the pre-trained model’s audio embeddings and the classifier; this layer expands the dimensionality of the embeddings and effectively captures the distinct characteristics of sound classes. Our method adapts the model in a single forward pass (online) through the training samples of any task, with minimal forgetting of old tasks. We demonstrate the performance of the proposed method in two incremental learning setups: one class-incremental learning using ESC-50 and one domain-incremental learning of different cities from the TAU Urban Acoustic Scenes 2019 dataset; for both cases, the proposed approach outperforms other methods.
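To make the single-pass idea concrete, here is a hypothetical learner built on frozen pretrained embeddings: a fixed nonlinear expansion layer followed by a ridge classifier updated from running statistics, so every training sample is seen exactly once. The paper's expansion layer and update rule may differ; all names are illustrative.

```python
import numpy as np


class OnlineLinearLearner:
    """Single-pass learner on top of frozen pretrained audio embeddings."""

    def __init__(self, emb_dim, exp_dim, num_classes, lam=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random projection + ReLU expands the embedding dimensionality.
        self.P = rng.standard_normal((emb_dim, exp_dim)) / np.sqrt(emb_dim)
        self.A = lam * np.eye(exp_dim)            # accumulated Gram matrix
        self.b = np.zeros((exp_dim, num_classes))

    def _expand(self, e):
        return np.maximum(e @ self.P, 0.0)        # nonlinear expansion

    def partial_fit(self, emb, label):            # one sample, seen once
        phi = self._expand(emb)
        self.A += np.outer(phi, phi)
        self.b[:, label] += phi

    def predict(self, emb):
        w = np.linalg.solve(self.A, self.b)       # ridge solution from stats
        return int(np.argmax(self._expand(emb) @ w))
```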
- P1-8 Sound Event Detection with Audio-Text Models and Heterogeneous Temporal Annotations
Manu Harju, Annamaria Mesaros
Abstract
Recent advances in generating synthetic captions based on audio and related metadata allow using the information contained in natural language as input for other audio tasks. In this paper, we propose a novel method to guide a sound event detection system with free-form text. We use machine-generated captions as complementary information to the strong labels for training, and evaluate the systems using different types of textual inputs. In addition, we study a scenario where only part of the training data has strong labels, and the rest of it only has temporally weak labels. Our findings show that synthetic captions improve the performance in both cases compared to the CRNN architecture typically used for sound event detection. On a dataset of 50 highly unbalanced classes, the PSDS-1 score increases from 0.223 to 0.277 when trained with strong labels, and from 0.166 to 0.218 when half of the training data has only weak labels.
- P1-9 Latent Acoustic Mapping for Direction of Arrival Estimation: A Self-Supervised Approach
Adrian S. Roman, Iran R. Roman, Juan Pablo Bello
Abstract
Acoustic mapping techniques have long been used in spatial audio processing for direction of arrival estimation (DoAE). Traditional beamforming methods for acoustic mapping, while interpretable, often rely on iterative solvers that can be computationally intensive and sensitive to acoustic variability. On the other hand, recent supervised deep learning approaches offer feedforward speed and robustness but require large labeled datasets and lack interpretability. Despite their strengths, both methods struggle to consistently generalize across diverse acoustic setups and array configurations, limiting their broader applicability. We introduce the Latent Acoustic Mapping (LAM) model, a self-supervised framework that bridges the interpretability of traditional methods with the adaptability and efficiency of deep learning methods. LAM generates high-resolution acoustic maps, adapts to varying acoustic conditions, and operates efficiently across different microphone arrays. We assess its robustness on DoAE using the LOCATA and STARSS benchmarks. LAM achieves comparable or superior localization performance to existing supervised methods. Additionally, we show that LAM’s acoustic maps can serve as effective features for supervised models, further enhancing DoAE accuracy and underscoring its potential to advance adaptive, high-performance sound localization systems.
- P1-10 Distributed Asynchronous Device Speech Enhancement via Windowed Cross-Attention
Gene-Ping Yang, Sebastian Braun
Abstract
The increasing number of microphone-equipped personal devices offers great flexibility and potential using them as ad-hoc microphone arrays in dynamic meeting environments. However, most existing approaches are designed for time-synchronized microphone setups, a condition that may not hold in real-world meeting scenarios, where time latency and clock drift vary across devices. Under such conditions, we found transform-average-concatenate (TAC), a popular module for neural multi-microphone processing, insufficient in handling time-asynchronous microphones. In response, we propose a windowed cross-attention module capable of dynamically aligning features between all microphones. This module is invariant to both the permutation and the number of microphones and can be easily integrated into existing models. Furthermore, we propose an optimal training target for multi-talker environments. We evaluated our approach in a multi-microphone noisy reverberant setup with unknown time latency and clock drift of each microphone. Experimental results show that our method outperforms TAC on both iFaSNet and CRUSE models, offering faster convergence and improved learning, demonstrating the efficacy of the windowed cross-attention module for asynchronous microphone setups.
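A minimal sketch of the windowed cross-attention idea, assuming two feature streams at the same frame rate: queries come from the reference microphone, while keys and values come from another microphone but only within a local window of frames, so modest latency and clock drift can be absorbed. The paper's module additionally handles an arbitrary number of microphones; this is not its exact implementation.

```python
import torch
import torch.nn as nn


class WindowedCrossAttention(nn.Module):
    """Align an auxiliary microphone's features to a reference microphone."""

    def __init__(self, dim, num_heads=4, window=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window

    def forward(self, ref, aux):                 # both: (batch, frames, dim)
        T, w = ref.shape[1], self.window
        chunks = []
        for start in range(0, T, w):
            end = min(start + w, T)
            q = ref[:, start:end]                            # reference window
            kv = aux[:, max(0, start - w):min(T, end + w)]   # wider aux window
            attn_out, _ = self.attn(q, kv, kv)
            chunks.append(attn_out)
        return torch.cat(chunks, dim=1)          # aux features aligned to ref
```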
- P1-11 The Test of Auditory-Vocal Affect (TAVA) dataset
Juan S. Gómez-Cañón, Camille Noufi, Jonathan Berger, Karen J. Parker, Daniel L. Bowling
Abstract
In this paper we present the TAVA, a novel dataset for advancing research in Speech Emotion Recognition (SER) by disentangling paralinguistic and linguistic information in affective speech signals. The dataset includes 352 audio recordings of emotionally expressive English speech, each paired with a corresponding transformed electroglottographic (tEGG) version, a signal designed to preserve affective cues while systematically suppressing phonetic content. In addition, we provide over 120,000 crowd-sourced ratings of valence, arousal, and dominance for both the original and transformed signals. These ratings support fine-grained comparisons of affect perception across modalities. Building on prior work showing that tEGG signals can effectively isolate vocal affect, this dataset offers a unique resource for evaluating sensitivity to vocal affect in clinical populations with language or communication difficulties, as well as in studies aimed at dissociating linguistic and affective processing in individuals without these impairments. By contributing a phoneme-reduced, affect-rich signal representation to the SER community, we aim to enable more robust modelling of vocal affect and broaden the applicability of SER systems to diverse user populations.
- P1-12 Unsupervised Multi-channel Speech Dereverberation via Diffusion
Yulun Wu, Zhongweiyang Xu, Jianchong Chen, Zhong-Qiu Wang, Romit Roy Choudhury
Abstract
We consider the problem of multi-channel single-speaker blind dereverberation, where multi-channel mixtures are used to recover the clean anechoic speech. To solve this problem, we propose USD-DPS, Unsupervised Speech Dereverberation via Diffusion Posterior Sampling. USD-DPS uses an unconditional clean speech diffusion model as a strong prior to solve the problem by posterior sampling. At each diffusion sampling step, we estimate all microphone channels’ room impulse responses (RIRs), which are further used to enforce a multi-channel mixture consistency constraint for diffusion guidance. For multi-channel RIR estimation, we estimate reference-channel RIR by optimizing RIR parameters of a sub-band RIR signal model, with the Adam optimizer. We estimate non-reference channels’ RIRs analytically using forward convolutive prediction (FCP). We found that this combination provides a good balance between sampling efficiency and RIR prior modeling, which shows superior performance among unsupervised dereverberation approaches.
- P1-13 Microphone Occlusion Mitigation for Own-Voice Enhancement in Head-Worn Microphone Arrays Using Switching-Adaptive Beamforming
Wiebke Middelberg, Jung-Suk Lee, Saeed Bagheri Sereshki, Ali Aroudi, Vladimir Tourbabin, Daniel D. E. Wong
Abstract
Enhancing the user’s own-voice for head-worn microphone arrays is an important task in noisy environments to allow for easier speech communication and user-device interaction. However, a rarely addressed challenge is the change of the microphones’ transfer functions when one or more of the microphones gets occluded by skin, clothes or hair. The underlying problem for beamforming-based speech enhancement is the (potentially rapidly) changing transfer functions of both the own-voice and the noise component that have to be accounted for to achieve optimal performance. In this paper, we address the problem of an occluded microphone in a head-worn microphone array. We investigate three alternative mitigation approaches by means of (i) conventional adaptive beamforming, (ii) switching between a-priori estimates of the beamformer coefficients for the occluded and unoccluded state, and (iii) a hybrid approach using a switching-adaptive beamformer. In an evaluation with real-world recordings and simulated occlusion, we demonstrate the advantages of the different approaches in terms of noise reduction, own-voice distortion and robustness against voice activity detection errors.
- P1-14 Multi-Utterance Speech Separation and Association Trained on Short Segments
Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen
Abstract
Current deep neural network (DNN) based speech separation faces a fundamental challenge: while models must be trained on short segments due to computational constraints, real-world applications typically require processing recordings that are significantly longer than those seen during training and that contain multiple utterances per speaker. In this paper, we investigate how existing approaches perform in this challenging scenario and propose a frequency-temporal recurrent neural network (FTRNN) that effectively bridges this gap. Our FTRNN employs a full-band module to model frequency dependencies within each time frame and a sub-band module that models temporal patterns in each frequency band. Despite being trained on short fixed-length segments of 10 s, our model demonstrates robust separation when processing signals significantly longer than training segments (21-121 s) and preserves speaker association across utterance gaps exceeding those seen during training. Unlike the conventional segment-separation-stitch paradigm, our lightweight approach (0.9 M parameters) performs inference on long audio without segmentation, eliminating segment boundary distortions while simplifying deployment. Experimental results demonstrate the generalization ability of FTRNN for multi-utterance speech separation and speaker association.
- P1-15 Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine
Anastasia Kuznetsova, Inseon Jang, Wootaek Lim, Minje Kim
Abstract
Neural audio codecs, leveraging quantization algorithms, have significantly impacted various speech/audio tasks. While high-fidelity reconstruction is paramount for human perception, audio coding for machines (ACoM) prioritizes efficient compression and downstream task performance, disregarding perceptual nuances. This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model. Our approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low bitrates (i.e., less than 200 bps) with a minimal loss of the downstream model performance. The resulting tokenizer is adaptable to various bitrates and model sizes for flexible deployment. Evaluated on automatic speech recognition and audio classification, our method demonstrates its efficacy and potential for broader task and architectural applicability through appropriate regularization.
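For reference, residual vector quantization itself is simple; the sketch below encodes a single feature vector with a list of codebooks, where each stage quantizes the residual left by the previous stage. The paper's task-specific loss guidance and bitrate adaptation are not shown, and the names are illustrative.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """x: (D,) feature vector; codebooks: list of (K, D) arrays.
    Returns one token index per codebook stage."""
    residual, tokens = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]        # next stage sees the residual
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct the vector by summing the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, tokens))
```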
- P1-16 VoxATtack: A Multimodal Attack on Voice Anonymization Systems
Ahmad Aloradi, Ünal Ege Gaznepoglu, Emanuël A. P. Habets, Daniel Tenbrinck
Abstract
Voice anonymization systems aim to protect speaker privacy by obscuring vocal traits while preserving the linguistic content relevant for downstream applications. However, because these linguistic cues remain intact, they can be exploited to identify semantic speech patterns associated with specific speakers. In this work, we present VoxATtack, a novel multimodal de-anonymization model that incorporates both acoustic and textual information to attack anonymization systems. While previous research has focused on refining speaker representations extracted from speech, we show that incorporating textual information with a standard ECAPA-TDNN improves the attacker’s performance. Our proposed VoxATtack model employs a dual-branch architecture, with an ECAPA-TDNN processing anonymized speech and a pretrained BERT encoding the transcriptions. Both outputs are projected into embeddings of equal dimensionality and then fused based on confidence weights computed on a per-utterance basis. When evaluating our approach on the VoicePrivacy Attacker Challenge (VPAC) dataset, it outperforms the top-ranking attackers on five out of seven benchmarks, namely B3, B4, B5, T8-5, and T12-5. To further boost performance, we leverage anonymized speech and SpecAugment as augmentation techniques. This enhancement enables VoxATtack to achieve state-of-the-art results on all VPAC benchmarks, after scoring 20.6% and 27.2% average equal error rate on T10-2 and T25-1, respectively. Our results demonstrate that incorporating textual information and selective data augmentation reveals critical vulnerabilities in current voice anonymization methods and exposes potential weaknesses in the datasets used to evaluate them.
- P1-17 Learning Perceptually Relevant Temporal Envelope Morphing
Satvik Dixit, Sungjoon Park, Chris Donahue, Laurie M. Heller
Abstract
Temporal envelope morphing, the process of interpolating between the amplitude dynamics of two audio signals, is an emerging problem in generative audio systems that lacks sufficient perceptual grounding. Morphing of temporal envelopes in a perceptually intuitive manner should enable new methods for sound blending in creative media and for probing perceptual organization in psychoacoustics. However, existing audio morphing techniques often fail to produce intermediate temporal envelopes when input sounds have distinct temporal structures; many morphers effectively overlay both temporal structures, leading to perceptually unnatural results. In this paper, we introduce a novel workflow for learning envelope morphing with perceptual guidance: we first derive perceptually grounded morphing principles through human listening studies, then synthesize large-scale datasets encoding these principles, and finally train machine learning models to create perceptually intermediate morphs. Specifically, we present: (1) perceptual principles that guide envelope morphing, derived from our listening studies, (2) a supervised framework to learn these principles, (3) an autoencoder that learns to compress temporal envelope structures into latent representations, and (4) benchmarks for evaluating audio envelope morphs, using both synthetic and naturalistic data, and show that our approach outperforms existing methods in producing temporally intermediate morphs. All code, models, and checkpoints are available at https://github.com/TemporalMorphing/EnvelopeMorphing.
- P1-18 Multi-Class-Token Transformer for Multitask Self-supervised Music Information Retrieval
Yuexuan Kong, Vincent Lostanlen, Romain Hennequin, Mathieu Lagrange, Gabriel Meseguer-Brocal
Abstract
Contrastive learning and equivariant learning are effective methods for self-supervised learning (SSL) for audio content analysis. Yet, their application to music information retrieval (MIR) faces a dilemma: the former is more effective on tagging (e.g., instrument recognition) but less effective on structured prediction (e.g., tonality estimation); the latter can match supervised methods on the specific task it is designed for, but it does not generalize well to other tasks. In this article, we adopt a best-of-both-worlds approach by training a deep neural network on both kinds of pretext tasks at once. The proposed new architecture is a Vision Transformer with 1-D spectrogram patches (ViT-1D), equipped with two class tokens, which are specialized to different self-supervised pretext tasks but optimized through the same model: hence the qualification of self-supervised multi-class-token multitask (MT2). The former class token optimizes cross-power spectral density (CPSD) for equivariant learning over the circle of fifths, while the latter optimizes normalized temperature-scaled cross-entropy (NT-Xent) for contrastive learning. MT2 combines the strengths of both pretext tasks and consistently outperforms both single-class-token ViT-1D models trained with either contrastive or equivariant learning. Averaging the two class tokens further improves performance on several tasks, highlighting the complementary nature of the representations learned by each class token. Furthermore, using the same single-linear-layer probing method on the features of the last layer, MT2 outperforms MERT on all tasks except for beat tracking, achieving this with 18x fewer parameters thanks to its multitasking capabilities. Our SSL benchmark demonstrates the versatility of our multi-class-token multitask learning approach for MIR applications.
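The NT-Xent objective mentioned above is the standard SimCLR-style contrastive loss; a minimal sketch follows (the CPSD/equivariant branch and the two-class-token architecture are not shown here).

```python
import torch
import torch.nn.functional as F


def nt_xent(z1, z2, tau=0.1):
    """z1, z2: (N, D) embeddings of two views; (z1[i], z2[i]) are positives."""
    z = torch.cat([F.normalize(z1, dim=1), F.normalize(z2, dim=1)], dim=0)
    sim = z @ z.t() / tau                                   # (2N, 2N) logits
    n = z1.shape[0]
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device),
                     float("-inf"))                         # drop self-pairs
    targets = torch.cat([torch.arange(n, 2 * n),
                         torch.arange(0, n)]).to(z.device)  # index of positive
    return F.cross_entropy(sim, targets)
```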
- P1-19 Conditional Wave-U-Net for Acoustic Matching in Shared XR Environments
Joanna Luberadzka, Enric Gusó, Umut Sayin
Abstract
Mismatch in acoustics between users is an important challenge for interaction in shared XR environments. It can be mitigated through acoustic matching, which traditionally involves dereverberation followed by convolution with a room impulse response (RIR) of the target space. However, the target RIR in such settings is usually unavailable. We propose to tackle this problem in an end-to-end manner using a Wave-U-Net encoder-decoder network with potential for real-time operation. We use FiLM layers to condition this network on the embeddings extracted by a separate reverb encoder to match the acoustic properties between two arbitrarily chosen signals. We demonstrate that this approach outperforms two baseline methods and provides the flexibility to both dereverberate and rereverberate audio signals.
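FiLM conditioning, as used here, is a small and generic mechanism; below is a sketch assuming channel-wise modulation of 1-D convolutional features by a conditioning (reverb) embedding. The placement of such layers inside the Wave-U-Net follows the paper and is not reproduced here.

```python
import torch
import torch.nn as nn


class FiLM(nn.Module):
    """Feature-wise Linear Modulation of conv features by a conditioning vector."""

    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, num_channels)
        self.to_beta = nn.Linear(cond_dim, num_channels)

    def forward(self, feats, cond):
        # feats: (batch, channels, time), cond: (batch, cond_dim)
        gamma = self.to_gamma(cond).unsqueeze(-1)   # (batch, channels, 1)
        beta = self.to_beta(cond).unsqueeze(-1)
        return gamma * feats + beta                 # scale and shift per channel
```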
- P1-20 Fast Text-to-Audio Generation with Adversarial Post-Training
Zachary Novack, Zach Evans, Zack Zukowski, Josiah Taylor, CJ Carr, Julian Parker, Adnan Al-Sinan, Gian Marco Iodice, Julian McAuley, Taylor Berg-Kirkpatrick, Jordi Pons
Abstract
Text-to-audio systems, while increasingly performant, are slow at inference time, thus making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation. While past adversarial post-training methods have struggled to compare against their expensive distillation counterparts, ARC post-training is a simple procedure that (1) extends a recent relativistic adversarial formulation to diffusion/flow post-training and (2) combines it with a novel contrastive discriminator objective to encourage better prompt adherence. We pair ARC post-training with a number of optimizations to Stable Audio Open and build a model capable of generating ≈12s of 44.1kHz stereo audio in ≈75ms on an H100, and ≈7s on a mobile edge-device, the fastest text-to-audio model to our knowledge.
- P1-21 Post-Training Quantization for Audio Diffusion Transformers
Tanmay Khandelwal, Magdalena Fuentes
Abstract
Diffusion Transformers (DiTs) enable high-quality audio synthesis but are often computationally intensive and require substantial storage, which limits their practical deployment. In this paper, we present a comprehensive evaluation of post-training quantization (PTQ) techniques for audio DiTs, analyzing the trade-offs between static and dynamic quantization schemes. We explore two practical extensions (1) a denoising-timestep-aware smoothing method that adapts quantization scales per-input-channel and timestep to mitigate activation outliers, and (2) a lightweight low-rank adapter (LoRA)-based branch derived from singular value decomposition (SVD) to compensate for residual weight errors. Using Stable Audio Open we benchmark W8A8 and W4A8 configurations across objective metrics and human perceptual ratings. Our results show that dynamic quantization preserves fidelity even at lower precision, while static methods remain competitive with lower latency. Overall, our findings show that low-precision DiTs can retain high-fidelity generation while reducing memory usage by up to 79%.
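The static-versus-dynamic trade-off discussed above comes down to when the activation scale is chosen. Below is a deliberately simplified symmetric int8 sketch; the paper's per-channel, denoising-timestep-aware smoothing and LoRA-based error compensation are not shown, and the helper names are illustrative.

```python
import numpy as np


def quantize_int8(x, scale):
    """Symmetric per-tensor int8 quantization."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)


def dequantize(q, scale):
    return q.astype(np.float32) * scale


def static_scale(calibration_batches):
    """Static quantization: the scale is calibrated offline on sample data,
    so inference pays no extra cost but outliers may clip."""
    return max(np.abs(b).max() for b in calibration_batches) / 127.0


def dynamic_scale(x):
    """Dynamic quantization: the scale is recomputed per input at runtime,
    tracking activation outliers better at a small latency cost."""
    return np.abs(x).max() / 127.0
```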
- P1-22 Diffused Responsibility: Analyzing the Energy Consumption of Generative Text-to-Audio Diffusion Models
Riccardo Passoni, Francesca Ronchini, Luca Comanducci, Romain Serizel, Fabio Antonacci
Abstract
Text-to-audio models have recently emerged as a powerful technology for generating sound from textual descriptions. However, their high computational demands raise concerns about energy consumption and environmental impact. In this paper, we conduct an analysis of the energy usage of 7 state-of-the-art text-to-audio diffusion-based generative models, evaluating to what extent variations in generation parameters affect energy consumption at inference time. We also aim to identify an optimal balance between audio quality and energy consumption by considering Pareto-optimal solutions across all selected models. Our findings provide insights into the trade-offs between performance and environmental impact, contributing to the development of more efficient generative audio models.
- P1-23 Benchmarking Sub-Genre Classification for Mainstage Dance Music
Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li
Abstract
Music classification, a cornerstone of music information retrieval, supports a wide array of applications. To address the lack of comprehensive datasets and effective methods for sub-genre classification in mainstage dance music, we introduce a novel benchmark featuring a new dataset and baseline. Our dataset expands the scope of sub-genres to reflect the diversity of recent mainstage live sets performed by leading DJs at global music festivals, capturing the vibrant and rapidly evolving electronic dance music (EDM) scene that engages millions of fans worldwide. We employ a continuous soft labeling approach to accommodate tracks blending multiple sub-genres, preserving their inherent complexity. Experiments demonstrate that even state-of-the-art multimodal large language models (MLLMs) struggle with this task, while our specialized baseline models achieve high accuracy. This benchmark supports applications such as music recommendation, DJ set curation, and interactive multimedia systems, with video demos provided. Our code and data are all open-sourced at https://github.com/Gariscat/housex-v2.git.
- P1-24 RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection
Sungkyun Chang, Simon Dixon, Emmanouil Benetos
Abstract
This study introduces RUMAA, a transformer-based framework for music performance analysis that unifies score-to-performance alignment, score-informed transcription, and mistake detection in a near end-to-end manner. Unlike prior methods addressing these tasks separately, RUMAA integrates them using pre-trained score and audio encoders and a novel tri-stream decoder capturing task interdependencies through proxy tasks. It aligns human-readable MusicXML scores with repeat symbols to full-length performance audio, overcoming traditional MIDI-based methods that rely on manually unfolded score-MIDI data with pre-specified repeat structures. RUMAA matches state-of-the-art alignment methods on non-repeated scores and outperforms them on scores with repeats in a public piano music dataset, while also delivering promising transcription and mistake detection results.
Oral Session 2 (4:30 pm — 6:00 pm)
Session Chairs: Keisuke Imoto, Justin Salamon
- O2-1 FlexSED: Towards Open-Vocabulary Sound Event Detection
Jiarui Hai, Helin Wang, Weizhe Guo, Mounya Elhilali
Abstract
Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-friendly interaction, and they lack zero-shot capabilities and offer poor few-shot adaptability. Although text-query-based separation methods have been explored, they primarily focus on source separation and are ill-suited for SED tasks that require precise temporal localization and efficient detection across large and diverse sound vocabularies. In this paper, we propose FlexSED, an open-vocabulary sound event detection system. FlexSED builds on a pretrained audio SSL model and the CLAP text encoder, introducing an encoder-decoder composition and an adaptive fusion strategy to enable effective continuous training from pretrained weights. To ensure robust supervision, it also employs large language models (LLMs) to assist in event query selection during training, addressing challenges related to missing labels. As a result, FlexSED achieves superior performance compared to vanilla SED models on AudioSet-Strong, while demonstrating strong zero-shot and few-shot capabilities. We release the code and pretrained models to support future research and applications based on FlexSED.
- O2-2 TACOS: Temporally-Aligned Audio Captions for Language-Audio Pretraining
Paul Primus, Florian Schmid, Gerhard Widmer
Abstract
Learning to associate audio with textual descriptions is valuable for a range of tasks, including pretraining, zero-shot classification, audio retrieval, audio captioning, and text-conditioned audio generation. Existing contrastive language-audio pretrained models are typically trained using global, clip-level descriptions, which provide only weak temporal supervision. We hypothesize that CLAP-like language-audio models – particularly, if they are expected to produce frame-level embeddings – can benefit from a stronger temporal supervision. To confirm our hypothesis, we curate a novel dataset of approximately 12,000 audio recordings from Freesound, each annotated with single-sentence free-text descriptions linked to a specific temporal segment in an audio recording. We use large language models to clean these annotations by removing references to non-audible events, transcribed speech, typos, and annotator language bias. We further propose a frame-wise contrastive training strategy that learns to align text descriptions with temporal regions in an audio recording and demonstrate that our model has better temporal text-audio alignment abilities compared to models trained only on global captions when evaluated on the AudioSet Strong benchmark. The dataset and our source code are available on Zenodo and GitHub, respectively.
- O2-3 Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation
Xilin Jiang, Junkai Wu, Vishal Choudhari, Nima Mesgarani
Abstract
Audio large language models (LLMs) are considered experts at recognizing sound objects, yet their performance relative to LLMs in other sensory modalities, such as visual or audio-visual LLMs, and to humans using their ears, eyes, or both remains unexplored. To investigate this, we systematically evaluate audio, visual, and audio-visual LLMs, specifically Qwen2-Audio, Qwen2-VL, and Qwen2.5-Omni, against humans in recognizing sound objects of different classes from audio-only, silent video, or sounded video inputs. We uncover a performance gap between Qwen2-Audio and Qwen2-VL that parallels the sensory discrepancy between human ears and eyes. To reduce this gap, we introduce a cross-modal distillation framework, where an LLM in one modality serves as the teacher and another as the student, with knowledge transfer in sound classes predicted as more challenging to the student by a heuristic model. Distillation in both directions, from Qwen2-VL to Qwen2-Audio and vice versa, leads to notable improvements, particularly in challenging classes. This work highlights the sensory gap in LLMs from a human-aligned perspective and proposes a principled approach to enhancing modality-specific perception in multimodal LLMs.
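Setting the transfer direction aside, the distillation step itself can be pictured as a standard soft-target objective over sound classes, sketched below; the paper's class-selective, LLM-to-LLM setup is more involved, and the names here are illustrative.

```python
import torch
import torch.nn.functional as F


def distill_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
    """Blend a temperature-scaled soft-target KL term (knowledge from the
    teacher modality) with the usual hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard
```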
- O2-4 OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder
Shikhar Bharadwaj, Samuele Cornell, Kwanghee Choi, Satoru Fukayama, Hye-jin Shim, Soham Deshmukh, Shinji Watanabe
Abstract
Masked token prediction has emerged as a powerful pre-training objective across language, vision, and speech, offering the potential to unify these diverse modalities through a single pre-training task. However, its application in audio remains underexplored — BEATs is the only notable example, but it has seen limited modifications due to the absence of open-source pre-training code. Furthermore, BEATs was trained only on AudioSet, restricting its broader downstream applicability. To address these gaps, we present OpenBEATs, an open-source framework that extends BEATs via multi-domain audio pre-training. We conduct comprehensive evaluations across six types of tasks, twenty five datasets, and three audio domains, including audio reasoning tasks such as audio question answering, entailment, and captioning. OpenBEATs achieves state-of-the-art performance on six bioacoustics datasets, two environmental sound datasets and five reasoning datasets, performing better than models exceeding a billion parameters at one-fourth their parameter size. These results demonstrate the effectiveness of multi-domain datasets and masked token prediction task to learn general-purpose audio representations. To promote further research and reproducibility, we release all pre-training and evaluation code, pretrained and fine-tuned checkpoints, and training logs.
- O2-5 Modulation Discovery with Differentiable Digital Signal Processing
Christopher Mitcheltree, Hao Hao Tan, Joshua D. Reiss
Abstract
Modulations are a critical part of sound design and music production, enabling the creation of complex and evolving audio. Modern synthesizers provide envelopes, low frequency oscillators (LFOs), and more parameter automation tools that allow users to modulate the output with ease. However, determining the modulation signals used to create a sound is difficult, and existing sound-matching / parameter estimation systems are often uninterpretable black boxes or predict high-dimensional framewise parameter values without considering the shape, structure, and routing of the underlying modulation curves. We propose a neural sound-matching approach that leverages modulation extraction, constrained control signal parameterizations, and differentiable digital signal processing (DDSP) to discover the modulations present in a sound. We demonstrate the effectiveness of our approach on highly modulated synthetic and real audio samples, its applicability to different DDSP synth architectures, and investigate the trade-off it incurs between interpretability and sound-matching accuracy. We make our code and audio samples available and provide the trained DDSP synths in a VST plugin.
- O2-6 Transcribing Rhythmic Patterns of the Guitar Track in Polyphonic Music
Aleksandr Lukoianov, Anssi Klapuri
Abstract
Whereas chord transcription has received considerable attention during the past couple of decades, far less work has been devoted to transcribing and encoding the rhythmic patterns that occur in a song. The topic is especially relevant for instruments such as the rhythm guitar, which is typically played by strumming rhythmic patterns that repeat and vary over time. However, in many cases one cannot objectively define a single “right” rhythmic pattern for a given song section. To create a dataset with well-defined ground-truth labels, we asked expert musicians to transcribe the rhythmic patterns in 410 popular songs and record cover versions where the guitar tracks followed those transcriptions. To transcribe the strums and their corresponding rhythmic patterns, we propose a three-step framework. Firstly, we perform approximate stem separation to extract the guitar part from the polyphonic mixture. Secondly, we detect individual strums within the separated guitar audio, using a pre-trained foundation model (MERT) as a backbone. Finally, we carry out a pattern-decoding process in which the transcribed sequence of guitar strums is represented by patterns drawn from an expert-curated vocabulary. We show that it is possible to transcribe the rhythmic patterns of the guitar track in polyphonic music with quite high accuracy, producing a representation that is human-readable and includes automatically detected bar lines and time signature markers. We perform ablation studies and error analysis and propose a set of evaluation metrics to assess the accuracy and readability of the predicted rhythmic pattern sequence.
Demo Session (9:00 pm — 11:00 pm)
Session Chair: Johanna Devaney
- D-1 Neural Audio Synthesis for Non-Keyboard Instruments
Franco Caspe, Andrew McPherson, Mark Sandler
Abstract
Interaction with musical instruments can take as many shapes as there are musicians and instruments. However, in many production tasks such as music sketching, composition in Digital Audio Workstations, and sound design, instrumental interaction occurs mainly around keyboard interfaces and MIDI. Although this combination is highly versatile, when musicians who play other instruments, such as guitarists or singers, approach these tasks, they lose a dimension of expression as they cannot convey their usual articulations and their rhythmic sense on these interfaces. In this demonstration, we present a neural audio system for sound-based transformation of musical instruments that aims to provide capabilities similar to those of synthesizers for non-keyboard instruments, allowing musicians access to a wide range of high-quality sounds that can be used for sound production, sketching, and live playing. We achieve this in two ways: first, we employ a novel waveform autoencoder that learns control features that can convey the rich expression of musical instruments, as well as the ambiguities and uncertainties of their sound, in ways that respect the nature of the source instrument. Furthermore, we run this model using an efficient, custom C++ neural network engine that supports high frame rates, making it usable for live playing in current audio production workflows as an audio plugin. We will demonstrate our system by sketching rhythms and melodies of different instruments with a guitar and microphone, recording and playing them live.
- D-2 PCA-DiffVox: Augmenting Vocal Effects Tweakability With a Bijective Latent Space
Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, George Fazekas
Abstract
We present PCA-DiffVox, a system that enhances vocal effects controllability through data-driven dimensionality reduction. Using 365 professionally fitted vocal effect presets spanning 130 parameters across equaliser, dynamics, delay, reverb, and panning processors, we apply Principal Component Analysis to extract perceptually meaningful control dimensions. The linear transformation preserves bidirectional mapping between high-level semantic controls and individual parameters, enabling simultaneous manipulation through both interfaces. Our web application demonstrates synchronisation between PCA-based sliders and detailed parameter controls with live visualisation. This approach provides computationally efficient creative exploration while maintaining full parameter expressiveness through the bijective latent space representation.
- D-3 Real-Time Speech Enhancement in Noise for Throat Microphone Using Neural Audio Codec as Foundation Model
Julien Hauret, Thomas Joubaud, Éric Bavu
Abstract
We present a real-time speech enhancement demo using speech captured with a throat microphone. This demo aims to showcase the complete pipeline, from recording to deep learning-based post-processing, for speech captured in noisy environments with a body-conducted microphone. The throat microphone records skin vibrations, which naturally attenuate external noise, but this robustness comes at the cost of reduced audio bandwidth. To address this challenge, we fine-tune Kyutai’s Mimi—a neural audio codec supporting real-time inference—on Vibravox, a dataset containing paired air-conducted and throat microphone recordings. We compare this enhancement strategy against state-of-the-art models and demonstrate its superior performance. The inference runs in an interactive interface that allows users to toggle enhancement, visualize spectrograms, and monitor processing latency.
- D-4 Real-Time System for Audio-Visual Target Speech Enhancement
Teng Ma, Sile Yin, Li-Chia Yang, Shuo Zhang
Abstract
We present a live demonstration for RAVEN, a real-time audio-visual speech enhancement system designed to run entirely on a CPU. In single-channel, audio-only settings, speech enhancement is traditionally approached as the task of extracting clean speech from environmental noise. More recent work has explored the use of visual cues, such as lip movements, to improve robustness, particularly in the presence of interfering speakers. However, to our knowledge, no prior work has demonstrated an interactive system for real-time audio-visual speech enhancement operating on CPU hardware. RAVEN fills this gap by using pretrained visual embeddings from an audio-visual speech recognition model to encode lip movement information. The system generalizes across environmental noise, interfering speakers, transient sounds, and even singing voices. In this demonstration, attendees will be able to experience live audio-visual target speech enhancement using a microphone and webcam setup, with clean speech playback through headphones.
- D-5 Wireless Group Conversation Enhancement with the Tympan Open-Source Hearing Aid Platform
Ryan Corey
Abstract
Hearing aids alone perform poorly in noisy environments. Wireless remote microphones worn by talkers can transmit low-noise, low-reverberation audio directly from the source to the listener’s device. However, current wireless microphones work with only one conversation partner at a time, and they do not match the timing, spectrum, or spatial cues of the sound at the ears. This demonstration is a real-time implementation of an immersive multiple-talker wireless microphone system that uses adaptive filtering to align and mix the remote sources with the binaural sound at the hearing device earpieces, thereby preserving spatial awareness. This interactive demo, built on the Tympan open-source hearing-aid research platform, includes app-based controls to adjust the processing applied to each source.
- D-6 Next-Generation Synthetic Data Techniques for Training, Evaluation, and Prototyping in Audio Signal Processing
Finnur Pind, Georg Götz, Daniel Gert Nielsen
Abstract
The development, benchmarking, and optimization of modern audio signal processing algorithms – such as speech enhancement, source separation, echo cancellation, and blind room acoustics estimation – critically depend on access to high-quality acoustic data. While such data can be measured in real environments, simulation and synthetic data generation offer superior scalability, traceability, controllability, and coverage of diverse acoustic conditions. However, conventional simulation methods often fail to produce data accurate enough for algorithms to generalize reliably to real-world scenarios. Recent advances in acoustic simulation have enabled the generation of high-fidelity synthetic data that closely matches measured responses. This demonstration introduces a novel wave-based simulation engine, delivered as an accessible Python SDK, which allows researchers and practitioners to generate high-fidelity, multi-channel, optionally device-specific Room Impulse Responses (RIRs) and Spatial Room Impulse Responses (SRIRs) at scale. Unlike conventional geometric acoustics methods, our approach achieves a substantially closer match to real-world measurements. This fidelity, in turn, enables algorithms trained on simulated data to perform and generalize more effectively, as validated by both perceptual and algorithmic evaluation metrics. It also provides evaluation results comparable to those obtained with measured data, but with far greater scalability to test a broader range of scenarios and uncover performance limitations. In addition to algorithm development, these capabilities support fully virtual workflows for acoustic hardware prototyping. To support the community, we will release an open benchmark dataset of measured and simulated RIRs as part of the demonstration, facilitating reproducible experiments and validation.
- D-7 Speech Removal Framework for Privacy-preserving Audio Recordings
Gabriel Bibbó, Arshdeep Singh, Thomas Deacon, Mark D. Plumbley
Abstract
Public datasets such as “The Sounds of Home” are being recorded in people’s homes to capture everyday soundscapes. Such audio recordings from home environments provide valuable information for recognizing daily activities, monitoring health and wellbeing, and enabling smart home applications, and they support the development of robust sound event detection systems under real-world conditions. However, in-home recordings also contain personal information in the form of speech signals, which must be removed before such datasets are shared publicly. This demonstration showcases real-time identification of personal information, in our case speech, using various AI models, including convolutional neural networks (PANNs, E-PANNs), a Transformer model (AST), and voice activity detection (VAD) models (Silero, WebRTC). Our focus is twofold: (1) to design a speech removal system that identifies and removes speech from the recorded audio in real time, and (2) to examine how well AI models can distinguish speech from non-speech audio. The demonstration is simple, easy to use, and provided as a software-based GUI.
- D-8 Demonstration of LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models Running on a Portable Device
Danilo Oliveira, Julius Richter, Tal Peer, Timo Gerkmann
Abstract
We present a demonstration of LipDiffuser, a diffusion-based lip-to-speech model that generates high-quality speech samples using only silent video and an enrollment utterance, all recorded on-the-spot by users. The model runs locally on a laptop with a minimal setup and short processing times, enabling users to assess the quality of the generated speech, including preservation of speaker traits, and explore the limits of lip-reading.
Tuesday 10/14
Keynote Speech 2 (8:00 am—8:55 am)
Speaker: Alexandre Défossez
Title: “Text-Speech tasks as Delayed Stream Modeling“
Abstract: Speech and audio encompass a variety of tasks: source separation, diarization, transcription, TTS, translation, speech-to-speech etc. Each comes with its own training objective, specific architecture, or training dataset. In this talk, I will present how our team has been systematically using the same approach — delayed stream modeling — and training across a wide range of speech and speech-text tasks. The framework provides many benefits: efficient long-form streaming and batched inference using decoder-only Transformers, shared pre-training and hyper-parameters across applications, controllability, and more.
Oral Session 3 (8:55 am — 10:10 am)
Session Chairs: Ante Jukic, Richard Hendriks
- O3-1 Rethinking Non-Negative Matrix Factorization with Implicit Neural Representations
Krishna Subramani, Paris Smaragdis, Takuya Higuchi, Mehrez Souden
Abstract
Non-negative Matrix Factorization (NMF) is a powerful technique for analyzing regularly-sampled data, i.e., data that can be stored in a matrix. For audio, this has led to numerous applications using time-frequency (TF) representations like the Short-Time Fourier Transform. However, extending these applications to irregularly-spaced TF representations, like the Constant-Q transform, wavelets, or sinusoidal analysis models, has not been possible since these representations cannot be directly stored in matrix form. In this paper, we formulate NMF in terms of learnable functions (instead of vectors) and show that NMF can be extended to a wider variety of signal classes that need not be regularly sampled.
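As a rough illustration of the idea of replacing NMF's matrix factors with learnable functions, the sketch below parameterizes the basis and activations as small non-negative networks evaluated at arbitrary (frequency, time) coordinates. This is a generic PyTorch toy under assumed shapes, rank, and loss, not the authors' model.

```python
import torch
import torch.nn as nn

class PositiveMLP(nn.Module):
    # A small network that maps a 1-D coordinate to a non-negative vector.
    def __init__(self, rank):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                                 nn.Linear(64, rank), nn.Softplus())
    def forward(self, x):            # x: (N, 1) coordinates
        return self.net(x)           # (N, rank), non-negative

rank = 8
w_fn, h_fn = PositiveMLP(rank), PositiveMLP(rank)

# Irregularly spaced frequency/time sample points (hypothetical values)
freqs = torch.rand(500, 1)           # e.g. constant-Q center frequencies
times = torch.rand(200, 1)
target = torch.rand(500, 200)        # magnitudes observed at those points

opt = torch.optim.Adam(list(w_fn.parameters()) + list(h_fn.parameters()), lr=1e-3)
for _ in range(100):
    approx = w_fn(freqs) @ h_fn(times).T    # rank-limited non-negative model
    loss = ((approx - target) ** 2).mean()  # squared error as a stand-in divergence
    opt.zero_grad(); loss.backward(); opt.step()
```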
- O3-2 Self-Steering Deep Non-Linear Spatially Selective Filters for Efficient Extraction of Moving Speakers under Weak Guidance
Jakob Kienegger, Alina Mannanova, Huajian Fang, Timo Gerkmann
Abstract
Recent works on deep non-linear spatially selective filters demonstrate exceptional enhancement performance with computationally lightweight architectures for stationary speakers of known directions. However, to maintain this performance in dynamic scenarios, resource-intensive data-driven tracking algorithms become necessary to provide precise spatial guidance conditioned on the initial direction of a target speaker. As this additional computational overhead hinders application in resource-constrained scenarios such as real-time speech enhancement, we present a novel strategy utilizing a low-complexity tracking algorithm in the form of a particle filter instead. Assuming a causal, sequential processing style, we introduce temporal feedback to leverage the enhanced speech signal of the spatially selective filter to compensate for the limited modeling capabilities of the particle filter. Evaluation on a synthetic dataset illustrates how the autoregressive interplay between both algorithms drastically improves tracking accuracy and leads to strong enhancement performance. A listening test with real-world recordings complements these findings by indicating a clear trend towards our proposed self-steering pipeline as the preferred choice over comparable methods.
- O3-3 A Lightweight and Robust Method for Blind Wideband-to-Fullband Extension of Speech
Jan Büthe, Jean-Marc Valin
Abstract
Reducing the bandwidth of speech is common practice in resource-constrained environments like low-bandwidth speech transmission or low-complexity vocoding. We propose a lightweight and robust method for extending the bandwidth of wideband speech signals that is inspired by classical methods developed in the speech coding context. The resulting model has just ~370 K parameters and a complexity of ~140 MFLOPS (or ~70 MMACS). With a frame size of 10 ms and a lookahead of only 0.27 ms, the model is well-suited for use with common wideband speech codecs. We evaluate the model’s robustness by pairing it with the Opus SILK speech codec (1.5 release) and verify in a P.808 DCR listening test that it significantly improves quality from 6 to 12 kb/s. We also demonstrate that Opus 1.5 together with the proposed bandwidth extension at 9 kb/s meets the quality of 3GPP EVS at 9.6 kb/s and that of Opus 1.4 at 18 kb/s, showing that the blind bandwidth extension can meet the quality of classical guided bandwidth extensions, thus providing a way for backward-compatible quality improvement.
- O3-4 Listening Intention Estimation for Hearables with Natural Behavior Cues
Pascal Baumann, Seraina Glaus, Ludovic Amruthalingam, Fabian Gröger, Ruksana Giurda, Simone Lionetti
Abstract
Despite advances in hearing technologies, users still face challenges in acoustically complex environments. In particular, social situations with multiple speakers—where a given speaker may be attended to at one moment and become an interfering source at another—are consistently reported as especially difficult. We use a data-driven approach to estimate the engagement level of the user with each sound source in the acoustic foreground based on behavioral cues within dynamic, real-world acoustic scenes. We collect a novel dataset from two experiments that simultaneously record natural behavior and which sources are attended, covering both induced and free listening intentions. For each source, we engineer features expected to reflect the user’s engagement, without making further assumptions about underlying patterns, which are instead inferred with machine learning. Classic models such as logistic regression and tree-based methods achieve an average precision of 97% when the listening intention is induced and 76% when it is free, surpassing baselines that statically predict the most frequently attended sources. Our results suggest that it is possible to estimate user engagement with individual localized acoustic sources using machine learning, without invasive sensors and with sufficient accuracy for practical application. This paves the way for the development of adaptive hearing systems that can support users in everyday situations.
- O3-5 Unveiling the Best Practices for Applying Speech Foundation Models to Speech Intelligibility Prediction for Hearing-Impaired People
Haoshuai Zhou, Boxuan Cao, Changgeng Mo, Linkai Li, Shan Xiang Wang
Abstract
Speech foundation models (SFMs) have demonstrated strong performance across a variety of downstream tasks, including speech intelligibility prediction for hearing-impaired people (SIP-HI). However, optimizing SFMs for SIP-HI has been insufficiently explored. In this paper, we conduct a comprehensive study to identify key design factors affecting SIP-HI performance with 5 SFMs, focusing on encoder layer selection, prediction head architecture, and ensemble configurations. Our findings show that, contrary to traditional use-all-layers methods, selecting a single encoder layer yields better results. Additionally, temporal modeling is crucial for effective prediction heads. We also demonstrate that ensembling multiple SFMs improves performance, with stronger individual models providing greater benefit. Finally, we explore the relationship between key SFM attributes and their impact on SIP-HI performance. Our study offers practical insights into effectively adapting SFMs for speech intelligibility prediction for hearing-impaired populations.
Poster Session 2 (10:30 am — 12:30 pm)
Session Chair: Fabio Antonacci
- P2-1 FlexSED: Towards Open-Vocabulary Sound Event Detection
Jiarui Hai, Helin Wang, Weizhe Guo, Mounya Elhilali
Abstract
Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-friendly interaction, and they lack zero-shot capabilities and offer poor few-shot adaptability. Although text-query-based separation methods have been explored, they primarily focus on source separation and are ill-suited for SED tasks that require precise temporal localization and efficient detection across large and diverse sound vocabularies. In this paper, we propose FlexSED, an open-vocabulary sound event detection system. FlexSED builds on a pretrained audio SSL model and the CLAP text encoder, introducing an encoder-decoder composition and an adaptive fusion strategy to enable effective continuous training from pretrained weights. To ensure robust supervision, it also employs large language models (LLMs) to assist in event query selection during training, addressing challenges related to missing labels. As a result, FlexSED achieves superior performance compared to vanilla SED models on AudioSet-Strong, while demonstrating strong zero-shot and few-shot capabilities. We release the code and pretrained models to support future research and applications based on FlexSED.
- P2-2 TACOS: Temporally-Aligned Audio Captions for Language-Audio Pretraining
Paul Primus, Florian Schmid, Gerhard Widmer
Abstract
Learning to associate audio with textual descriptions is valuable for a range of tasks, including pretraining, zero-shot classification, audio retrieval, audio captioning, and text-conditioned audio generation. Existing contrastive language-audio pretrained models are typically trained using global, clip-level descriptions, which provide only weak temporal supervision. We hypothesize that CLAP-like language-audio models – particularly, if they are expected to produce frame-level embeddings – can benefit from a stronger temporal supervision. To confirm our hypothesis, we curate a novel dataset of approximately 12,000 audio recordings from Freesound, each annotated with single-sentence free-text descriptions linked to a specific temporal segment in an audio recording. We use large language models to clean these annotations by removing references to non-audible events, transcribed speech, typos, and annotator language bias. We further propose a frame-wise contrastive training strategy that learns to align text descriptions with temporal regions in an audio recording and demonstrate that our model has better temporal text-audio alignment abilities compared to models trained only on global captions when evaluated on the AudioSet Strong benchmark. The dataset and our source code are available on Zenodo and GitHub, respectively.
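One plausible form of the frame-wise contrastive objective described above is sketched below: audio-frame embeddings are matched against caption embeddings using an alignment mask derived from the annotated temporal segments. The shapes, temperature value, and soft-target formulation are assumptions for illustration, not the TACOS training code.

```python
import torch
import torch.nn.functional as F

T, C, D = 250, 4, 256                      # frames, captions, embedding dim
frame_emb = F.normalize(torch.randn(T, D), dim=-1)
text_emb  = F.normalize(torch.randn(C, D), dim=-1)

# align[t, c] = 1 if caption c covers the temporal segment containing frame t
align = torch.zeros(T, C)
align[:100, 0] = 1
align[100:, 1] = 1

logits = frame_emb @ text_emb.T / 0.07      # temperature-scaled similarities
targets = align / align.sum(dim=1, keepdim=True).clamp(min=1)
loss = F.cross_entropy(logits, targets)     # soft targets over captions, per frame
```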
- P2-3 Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation
Xilin Jiang, Junkai Wu, Vishal Choudhari, Nima Mesgarani
Abstract
Audio large language models (LLMs) are considered experts at recognizing sound objects, yet their performance relative to LLMs in other sensory modalities, such as visual or audio-visual LLMs, and to humans using their ears, eyes, or both remains unexplored. To investigate this, we systematically evaluate audio, visual, and audio-visual LLMs, specifically Qwen2-Audio, Qwen2-VL, and Qwen2.5-Omni, against humans in recognizing sound objects of different classes from audio-only, silent video, or sounded video inputs. We uncover a performance gap between Qwen2-Audio and Qwen2-VL that parallels the sensory discrepancy between human ears and eyes. To reduce this gap, we introduce a cross-modal distillation framework, where an LLM in one modality serves as the teacher and another as the student, with knowledge transfer in sound classes predicted as more challenging to the student by a heuristic model. Distillation in both directions, from Qwen2-VL to Qwen2-Audio and vice versa, leads to notable improvements, particularly in challenging classes. This work highlights the sensory gap in LLMs from a human-aligned perspective and proposes a principled approach to enhancing modality-specific perception in multimodal LLMs.
- P2-4 OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder
Shikhar Bharadwaj, Samuele Cornell, Kwanghee Choi, Satoru Fukayama, Hye-jin Shim, Soham Deshmukh, Shinji Watanabe
Abstract
Masked token prediction has emerged as a powerful pre-training objective across language, vision, and speech, offering the potential to unify these diverse modalities through a single pre-training task. However, its application in audio remains underexplored — BEATs is the only notable example, but it has seen limited modifications due to the absence of open-source pre-training code. Furthermore, BEATs was trained only on AudioSet, restricting its broader downstream applicability. To address these gaps, we present OpenBEATs, an open-source framework that extends BEATs via multi-domain audio pre-training. We conduct comprehensive evaluations across six types of tasks, twenty-five datasets, and three audio domains, including audio reasoning tasks such as audio question answering, entailment, and captioning. OpenBEATs achieves state-of-the-art performance on six bioacoustics datasets, two environmental sound datasets, and five reasoning datasets, performing better than models exceeding a billion parameters at one-fourth their parameter size. These results demonstrate the effectiveness of multi-domain datasets and the masked token prediction task for learning general-purpose audio representations. To promote further research and reproducibility, we release all pre-training and evaluation code, pretrained and fine-tuned checkpoints, and training logs.
- P2-5 Modulation Discovery with Differentiable Digital Signal Processing
Christopher Mitcheltree, Hao Hao Tan, Joshua D. Reiss
Abstract
Modulations are a critical part of sound design and music production, enabling the creation of complex and evolving audio. Modern synthesizers provide envelopes, low frequency oscillators (LFOs), and more parameter automation tools that allow users to modulate the output with ease. However, determining the modulation signals used to create a sound is difficult, and existing sound-matching / parameter estimation systems are often uninterpretable black boxes or predict high-dimensional framewise parameter values without considering the shape, structure, and routing of the underlying modulation curves. We propose a neural sound-matching approach that leverages modulation extraction, constrained control signal parameterizations, and differentiable digital signal processing (DDSP) to discover the modulations present in a sound. We demonstrate the effectiveness of our approach on highly modulated synthetic and real audio samples, its applicability to different DDSP synth architectures, and investigate the trade-off it incurs between interpretability and sound-matching accuracy. We make our code and audio samples available and provide the trained DDSP synths in a VST plugin.
- P2-6 Transcribing Rhythmic Patterns of the Guitar Track in Polyphonic Music
Aleksandr Lukoianov, Anssi Klapuri
Abstract
Whereas chord transcription has received considerable attention during the past couple of decades, far less work has been devoted to transcribing and encoding the rhythmic patterns that occur in a song. The topic is especially relevant for instruments such as the rhythm guitar, which is typically played by strumming rhythmic patterns that repeat and vary over time. However, in many cases one cannot objectively define a single “right” rhythmic pattern for a given song section. To create a dataset with well-defined ground-truth labels, we asked expert musicians to transcribe the rhythmic patterns in 410 popular songs and record cover versions where the guitar tracks followed those transcriptions. To transcribe the strums and their corresponding rhythmic patterns, we propose a three-step framework. Firstly, we perform approximate stem separation to extract the guitar part from the polyphonic mixture. Secondly, we detect individual strums within the separated guitar audio, using a pre-trained foundation model (MERT) as a backbone. Finally, we carry out a pattern-decoding process in which the transcribed sequence of guitar strums is represented by patterns drawn from an expert-curated vocabulary. We show that it is possible to transcribe the rhythmic patterns of the guitar track in polyphonic music with quite high accuracy, producing a representation that is human-readable and includes automatically detected bar lines and time signature markers. We perform ablation studies and error analysis and propose a set of evaluation metrics to assess the accuracy and readability of the predicted rhythmic pattern sequence.
- P2-7 Can Large Language Models Predict Audio Effects Parameters from Natural Language?
Seungheon Doh, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Juhan Nam, Yuki Mitsufuji
Abstract
In music production, manipulating audio effects (Fx) parameters through natural language has the potential to reduce technical barriers for non-experts. We present LLM2Fx, a framework leveraging Large Language Models (LLMs) to predict Fx parameters directly from textual descriptions without requiring task-specific training or fine-tuning. Our approach addresses the text-to-effect parameter prediction (Text2Fx) task by mapping natural language descriptions to the corresponding Fx parameters for equalization and reverberation. We demonstrate that LLMs can generate Fx parameters in a zero-shot manner that elucidates the relationship between timbre semantics and audio effects in music production. To enhance performance, we introduce three types of in-context examples: audio Digital Signal Processing (DSP) features, DSP function code, and few-shot examples. Our results demonstrate that LLM-based Fx parameter generation outperforms previous optimization approaches, offering competitive performance in translating natural language descriptions to appropriate Fx settings. Furthermore, LLMs can serve as text-driven interfaces for audio production, paving the way for more intuitive and accessible music production tools.
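To make the text-to-Fx-parameter idea concrete, here is a hypothetical zero-shot prompt-and-parse loop; the prompt wording, JSON schema, and the commented-out `call_your_llm` placeholder are invented for illustration and are not the LLM2Fx interface.

```python
import json

description = "warm, intimate vocal with a touch of small-room ambience"
prompt = (
    "You are an audio engineer. Given the timbre description below, "
    "return JSON with fields 'eq_gains_db' (8 bands) and 'reverb' "
    "(decay_s, wet_db).\n"
    f"Description: {description}\n"
    "JSON:"
)
# response = call_your_llm(prompt)   # any chat-completion API would do here
response = '{"eq_gains_db": [2, 1, 0, -1, 0, 1, 2, 3], "reverb": {"decay_s": 0.6, "wet_db": -18}}'
params = json.loads(response)        # parsed parameters drive the EQ/reverb plugins
print(params["reverb"]["decay_s"])
```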
- P2-8 Temporal Adaptation of Pre-trained Foundation Models for Music Structure Analysis
Yixiao Zhang, Haonan Chen, Ju-Chiang Wang, Jitong Chen
Abstract
Audio-based music structure analysis (MSA) is an essential task in Music Information Retrieval that remains challenging due to the complexity and variability of musical form. Recent advances highlight the potential of fine-tuning pre-trained music foundation models for MSA tasks. However, these models are typically trained with high temporal feature resolution and short audio windows, which limits their efficiency and introduces bias when applied to long-form audio. This paper presents a temporal adaptation approach for fine-tuning music foundation models tailored to MSA. Our method enables efficient analysis of full-length songs in a single forward pass by incorporating two key strategies: (1) audio window extension and (2) low-resolution adaptation. Experiments on the Harmonix Set and RWC-Pop datasets show that our method significantly improves both boundary detection and structural function prediction, while maintaining comparable memory usage and inference speed.
- P2-9 Learn from Virtual Guitar: A Comparative Analysis of Automatic Guitar Transcription using Synthetic and Real Audio
Yuta Kusaka, Akira Maezawa
Abstract
This paper investigates the effectiveness of using synthetic audio, generated from musical scores via virtual instrument software, for training automatic guitar transcription models. Collecting large annotated datasets from real performances is costly and labor-intensive. To overcome this, the use of synthetic data has been explored for developing automatic music transcription (AMT) models. We present a systematic comparison between high-resolution AMT models trained on synthetic guitar data (SynthTab) and models trained on real data (GAPS), clarifying the effectiveness and limitations of synthetic data in AMT. Our experiments yield four insights: 1) models trained on diverse and well-augmented synthetic guitar data generalize well to real recordings, 2) pre-training with the target instrument’s synthetic data is more effective than with a non-target instrument’s real data, 3) even small synthetic datasets are valuable for pre-training, and 4) transcription discrepancies arise between models trained on synthetic data and real data, which can be mitigated by fine-tuning. These results demonstrate the utility of synthetic data for training AMT models when real aligned data is scarce.
- P2-10 SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet
Zhi Zhong, Akira Takahashi, Shuyang Cui, Keisuke Toyama, Shusuke Takahashi, Yuki Mitsufuji
Abstract
Foley synthesis aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames. Given its broad application in creative industries, the task has gained increasing attention in the research community. To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synthesis presents an attractive direction. ControlNet, a method for adding fine-grained controls to pretrained generative models, has been applied to foley synthesis, but its use has been limited to handcrafted human-readable temporal conditions. In contrast, from-scratch models achieved success by leveraging high-dimensional deep features extracted using pretrained video encoders. We have observed a performance gap between ControlNet-based and from-scratch foley models. To narrow this gap, we propose SpecMaskFoley, a method that steers the pretrained SpecMaskGIT model toward video-synchronized foley synthesis via ControlNet. To unlock the potential of a single ControlNet branch, we resolve the discrepancy between the temporal video features and the time-frequency nature of the pretrained SpecMaskGIT via a frequency-aware temporal feature aligner, eliminating the need for complicated conditioning mechanisms widely used in prior art. Evaluations on a common foley synthesis benchmark demonstrate that SpecMaskFoley could even outperform strong from-scratch baselines, substantially advancing the development of ControlNet-based foley synthesis models. Demo page: https://zzaudio.github.io/SpecMaskFoley_Demo/
- P2-11 Neural-Network-Based Interpolation of Late Reverberation in Coupled Spaces Using the Common Slopes Model
Orchisama Das, Gloria Dal Santo, Sebastian J. Schlecht, Zoran Cvetkovic
Abstract
Six-degrees-of-freedom audio rendering for eXtended Reality (XR) from limited measurements remains a challenging problem, particularly in coupled spaces where anisotropic and multi-slope late reverberation occurs. In this work, we adopt the common slopes model to represent position-dependent, directional late reverberation in a coupled space. Fourier-encoded positional coordinates are used as input to a Multi-Layer Perceptron (MLP) to predict position-dependent common-slope amplitudes in the spherical harmonic (SH) domain. The MLP is trained using a directional energy decay curve (DEDC) loss. At inference, the MLP predicts directional amplitudes of the DEDCs at unseen positions. These amplitudes, combined with the pre-determined common decay times, can drive a modal or shaped white noise reverberator to synthesise directional late reverberation tails at new locations. We demonstrate binaural rendering for 6DoF navigation and evaluate our approach on a simulated three-room dataset. Results show that an SRIR grid spacing of 60cm is sufficient to obtain a mean EDC error below 1.5dB. We also compare our method to the Neural Acoustical Field approach and achieve faster inference with comparable EDC mismatch errors in the predicted BRIRs (mean error of 2.13dB for both methods), when both models are trained on data sampled at 60cm resolution.
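A minimal sketch of the positional-encoding-plus-MLP component described above is given below, assuming three common slopes and third-order spherical harmonics; the layer sizes, number of Fourier bands, and Softplus output are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

def fourier_encode(xyz, n_bands=8):
    # xyz: (B, 3) listener positions; returns (B, 3 * 2 * n_bands) features
    freqs = (2.0 ** torch.arange(n_bands, dtype=torch.float32)) * torch.pi
    ang = xyz[:, :, None] * freqs               # (B, 3, n_bands)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)

n_slopes, n_sh = 3, 16                           # 3 decay slopes, 3rd-order SH (16 channels)
mlp = nn.Sequential(nn.Linear(3 * 2 * 8, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, n_slopes * n_sh), nn.Softplus())

pos = torch.rand(32, 3)                          # positions in metres (dummy batch)
amps = mlp(fourier_encode(pos)).view(32, n_slopes, n_sh)  # common-slope amplitudes per SH channel
```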
- P2-12 Two Views, One Truth: Spectral and Self-Supervised Features Fusion for Robust Speech Deepfake Detection
Yassine El Kheir, Arnab Das, Enes Erdem Erdogan, Fabian Ritter-Gutierrez, Tim Polzehl, Sebastian Möller
Abstract
Recent advances in synthetic speech have made audio deepfakes increasingly realistic, posing significant security risks. Existing detection methods that rely on a single modality, either raw waveform embeddings or spectral-based features, are vulnerable to non-spoof disturbances and often overfit to known forgery algorithms, resulting in poor generalization to unseen attacks. To address these shortcomings, we investigate hybrid fusion frameworks that integrate self-supervised learning (SSL)-based representations with handcrafted spectral descriptors (e.g., MFCC, LFCC, CQCC). By aligning and combining complementary information across modalities, these fusion approaches capture subtle artifacts that single-feature approaches typically overlook. We explore several fusion strategies, including simple concatenation, cross-attention, mutual cross-attention, and a learnable gating mechanism, to optimally blend SSL features with fine-grained spectral cues. We evaluate our approach on four challenging public benchmarks (LA19, DF21, ITW, ASV5) and report generalization performance. All fusion variants consistently outperform an SSL-only baseline, with the cross-attention strategy achieving the best generalization with a 38% relative reduction in equal error rate (EER). These results confirm that joint modeling of waveform and spectral views produces robust, domain-agnostic representations for audio deepfake detection. Code and pretrained models will be released upon acceptance.
- P2-13 Schrödinger Bridge Consistency Trajectory Models for Speech Enhancement
Shuichiro Nishigori, Koichi Saito, Naoki Murata, Masato Hirano, Shusuke Takahashi, Yuki Mitsufuji
Abstract
Speech enhancement (SE) using diffusion models is a promising technology for improving speech quality in noisy speech data. Recently, the Schrödinger bridge (SB) has been utilized in diffusion-based SE to resolve the mismatch between the endpoint of the forward process and the starting point of the reverse process. However, the SB inference remains slow owing to the need for a large number of function evaluations to achieve high-quality results. Although consistency models have been proposed to accelerate inference, by employing consistency training via distillation from pretrained models in the field of image generation, they struggle to improve generation quality as the number of steps increases. Consistency trajectory models (CTMs) have been introduced to overcome this limitation, achieving faster inference while preserving a favorable trade-off between quality and speed. In particular, SoundCTM applies CTM techniques to sound generation. The SB addresses the aforementioned mismatch and CTM improves inference speed while maintaining a favorable trade-off between quality and speed. However, no existing method simultaneously resolves both issues to the best of our knowledge. Hence, in this study, we propose Schrödinger bridge consistency trajectory models (SBCTM), which apply CTM techniques to the SB for SE. In addition, we introduce a novel auxiliary loss, incorporating perceptual loss, into the original CTM training framework. Consequently, SBCTM achieves an approximately 16× improvement in real-time factor compared with the conventional SB. Furthermore, SBCTM enables more practical and efficient speech enhancement by providing a flexible mechanism for further quality improvement. Our codes, pretrained models, and audio samples are available at https://github.com/sony/sbctm/.
- P2-14 A Unified Framework for Evaluating DNN-Based Feedforward, Feedback, and Hybrid Active Noise Cancellation
Huajian Fang, Buye Xu, Jacob Donley, Ashutosh Pandey, Deliang Wang, Daniel D. E. Wong
Abstract
Deep neural network (DNN)-based acoustic noise cancellation excels at modeling complex non-linear relationships in signal patterns that are difficult for linear filters to handle. Recent work has studied DNN-based feedforward (FF) and feedback (FB) control structures. However, their inconsistent experimental settings prevent fair comparison, and limited evaluation in a narrow range of acoustic conditions hinders a comprehensive understanding of their effectiveness. In this work, we present a unified framework that realizes FF and FB control structures to better understand their cancellation mechanisms under various acoustic conditions. In addition, it enables a hybrid (HB) control structure that combines the FF and FB approaches, which has not been evaluated against standalone DNN-based FF and FB configurations. We perform systematic evaluations in various settings, including different disturbing signals, reverberation conditions, source positions, room sizes, and mismatched secondary paths. The evaluation results show that the effectiveness of FF depends strongly on the modeling of the primary path and the acoustic environment, the performance of FB varies less under different acoustic conditions, and HB integrates the advantages of FF and FB.
- P2-15 Robust Speech Activity Detection in the Presence of Singing Voice
Philipp Grundhuber, Mhd Modar Halimeh, Martin Strauß, Emanuël A. P. Habets
Abstract
Speech Activity Detection (SAD) systems often misclassify singing as speech, leading to degraded performance in applications such as dialogue enhancement and automatic speech recognition. We introduce Singing-Robust Speech Activity Detection (SR-SAD), a neural network designed to robustly detect speech in the presence of singing. Our key contributions are: i) a training strategy using controlled ratios of speech and singing samples to improve discrimination, ii) a computationally efficient model that maintains robust performance while reducing inference runtime, and iii) a new evaluation metric tailored to assess SAD robustness in mixed speech-singing scenarios. Experiments on a challenging dataset spanning multiple musical genres show that SR-SAD maintains high speech detection accuracy (AUC = 0.919) while rejecting singing. By explicitly learning to distinguish between speech and singing, SR-SAD enables more reliable SAD in mixed speech-singing scenarios.
- P2-16 Frequency-Domain Signal-to-Noise Ratios Illuminate the Effects of the Spectral Consistency Constraint and Griffin-Lim Algorithms
Stephen Voran, Jaden Pieper
Abstract
The restoration of degraded audio signals is often performed on complex-valued frequency-domain (FD) representations. This requires manipulation of either magnitudes and phases or real and imaginary parts. In general, these manipulations do not produce consistent representations. The consequence is that the magnitudes and phases (or real and imaginary parts) of the restored time-domain signal (which are always consistent) do not match the generally inconsistent values imposed during FD restoration. In colloquial terms, “What we get is not what we asked for.” The enforcement of consistency is always heard in the resulting audio and is known in principle, but it can be better understood. We present two-dimensional FD SNR frameworks (e.g., magnitude/phase or real/imaginary) that visually reveal how consistency enforcement changes the applied FD restorations to arrive at the achieved FD restorations. We also show how extended Griffin-Lim algorithms can reduce and direct, but not eliminate, the changes produced by consistency enforcement. We apply objective estimators to connect this work to estimated speech quality and intelligibility. This work can inform machine learning training and architecture choices that must balance restoration efforts across two dimensions (e.g., magnitude and phase) to arrive at the best possible speech quality.
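The consistency issue discussed above can be made concrete in a few lines of code: passing an arbitrary complex spectrogram through ISTFT and then STFT yields the consistent spectrogram that is actually realized, and the basic Griffin-Lim loop alternates that projection with re-imposing the target magnitude. The sketch below is the textbook algorithm in PyTorch, not the paper's extended variants; the FFT size, hop, and iteration count are arbitrary choices.

```python
import torch

n_fft, hop = 512, 128
win = torch.hann_window(n_fft)

def stft(x):
    return torch.stft(x, n_fft, hop, window=win, return_complex=True)

def istft(X, length):
    return torch.istft(X, n_fft, hop, window=win, length=length)

x = torch.randn(16000)                        # 1 s of dummy audio at 16 kHz
target_mag = stft(x).abs()                    # magnitudes we want to realize

# Start from random phase and alternate projections (basic Griffin-Lim).
X = target_mag * torch.exp(1j * 2 * torch.pi * torch.rand_like(target_mag))
for _ in range(32):
    X = stft(istft(X, length=x.numel()))      # project onto consistent spectrograms
    X = target_mag * torch.exp(1j * X.angle())  # re-impose the target magnitude
y = istft(X, length=x.numel())                # time-domain signal; its STFT is consistent
```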
- P2-17 Source Separation by Flow Matching
Robin Scheibler, John R. Hershey, Arnaud Doucet, Henry Li
Abstract
We consider the problem of single-channel audio source separation with the goal of reconstructing K sources from their mixture. We address this ill-posed problem with FLOSS (FLOw matching for Source Separation), a constrained generation method based on flow matching, ensuring strict mixture consistency. Flow matching is a general methodology that, when given samples from two probability distributions defined on the same space, learns an ordinary differential equation to output a sample from one of the distributions when provided with a sample from the other. In our context, we have access to samples from the joint distribution of K sources and so the corresponding samples from the lower-dimensional distribution of their mixture. To apply flow matching, we augment these mixture samples with artificial noise components to ensure the resulting “augmented” distribution matches the dimensionality of the K source distribution. Additionally, as any permutation of the sources yields the same mixture, we adopt an equivariant formulation of flow matching which relies on a suitable custom-designed neural network architecture. We demonstrate the performance of the method for the separation of overlapping speech.
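For orientation, one common instantiation of flow matching uses linear interpolation paths and regresses a velocity field onto the displacement between the paired samples; the exact parameterization, noise augmentation, and equivariance constraints used by FLOSS may differ:

$$
x_t = (1-t)\,x_0 + t\,x_1, \qquad
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\big\lVert v_\theta(x_t, t) - (x_1 - x_0)\big\rVert^2 ,
$$

where $x_0$ denotes a sample of the noise-augmented mixture and $x_1$ the stacked sources; integrating the learned velocity field $v_\theta$ from $t=0$ to $t=1$ maps a mixture to an estimate of its sources.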
- P2-18 Musical Source Separation Bake-Off: Comparing Objective Metrics with Human Perception
Noah Jaffe, John Ashley Burgoyne
Abstract
Music source separation aims to extract individual sound sources (e.g., vocals, drums, guitar) from a mixed music recording. However, evaluating the quality of separated audio remains challenging, as commonly used metrics like the source-to-distortion ratio (SDR) do not always align with human perception. In this study, we conducted a large-scale listener evaluation on the MUSDB18 test set, collecting approximately 30 ratings per track from seven distinct listener groups. We compared several objective energy-ratio metrics, including legacy measures (BSSEval v4, SI-SDR variants), and embedding-based alternatives (Frechet Audio Distance using CLAP-LAION-music, EnCodec, VGGish, Wave2Vec2, and HuBERT). While SDR remains the best-performing metric for vocal estimates, our results show that the scale-invariant signal-to-artifacts ratio (SI-SAR) better predicts listener ratings for drums and bass stems. Frechet Audio Distance (FAD) computed with the CLAP-LAION-music embedding also performs competitively—achieving Kendall’s tau values of 0.25 for drums and 0.19 for bass—matching or surpassing energy-based metrics for those stems. However, none of the embedding-based metrics, including CLAP, correlate positively with human perception for vocal estimates. These findings highlight the need for stem-specific evaluation strategies and suggest that no single metric reliably reflects perceptual quality across all source types. We release our raw listener ratings to support reproducibility and further research.
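Since several of the compared metrics are built on the scale-invariant SDR, the standard definition is reproduced below as a small NumPy function; it is a generic reference implementation with dummy signals, not the evaluation code used in the study.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    # Standard scale-invariant SDR: project the estimate onto the reference,
    # then compare target energy against the residual energy.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((target @ target + eps) / (noise @ noise + eps))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)              # dummy reference stem
est = ref + 0.1 * rng.standard_normal(16000)  # dummy estimate with artifacts
print(f"SI-SDR: {si_sdr(est, ref):.1f} dB")   # roughly 20 dB for this toy case
```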
- P2-19 TGIF: Talker Group-Informed Familiarization of Target Speaker Extraction
Tsun-An Hsieh, Minje Kim
Abstract
State-of-the-art target speaker extraction (TSE) systems are typically designed to generalize to any given mixing environment, necessitating a model with a large enough capacity as a generalist. Personalized speech enhancement could be a specialized solution that adapts to single-user scenarios, but it overlooks the practical need for customization in cases where only a small number of talkers are involved, e.g., TSE for a specific family. We address this gap with the proposed concept, talker group-informed familiarization (TGIF) of TSE, where the TSE system specializes in a particular group of users, which is challenging due to the inherent absence of a clean speech target. To this end, we employ a knowledge distillation approach, where a group-specific student model learns from the pseudo-clean targets generated by a large teacher model. This tailors the student model to effectively extract the target speaker from the particular talker group while maintaining computational efficiency. Experimental results demonstrate that our approach outperforms the baseline generic models by adapting to the unique speech characteristics of a given speaker group. Our newly proposed TGIF concept underscores the potential of developing specialized solutions for diverse and real-world applications, such as on-device TSE on a family-owned device.
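The knowledge-distillation setup can be summarized in a few lines: a frozen large teacher produces pseudo-clean targets on the talker group's own mixtures, and a small student regresses toward them. The layer sizes, spectral-frame inputs, and MSE loss below are placeholder assumptions, not the paper's models.

```python
import torch
import torch.nn as nn

# Toy stand-ins for a large teacher TSE model and a small group-specific student.
teacher = nn.Sequential(nn.Linear(257, 1024), nn.ReLU(), nn.Linear(1024, 257)).eval()
student = nn.Sequential(nn.Linear(257, 128), nn.ReLU(), nn.Linear(128, 257))
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

mixture_frames = torch.randn(64, 257)          # magnitude frames from the group's mixtures
with torch.no_grad():
    pseudo_clean = teacher(mixture_frames)     # teacher output used as the distillation target
loss = nn.functional.mse_loss(student(mixture_frames), pseudo_clean)
opt.zero_grad(); loss.backward(); opt.step()
```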
- P2-20 Rethinking Non-Negative Matrix Factorization with Implicit Neural Representations
Krishna Subramani, Paris Smaragdis, Takuya Higuchi, Mehrez Souden
Abstract
Non-negative Matrix Factorization (NMF) is a powerful technique for analyzing regularly-sampled data, i.e., data that can be stored in a matrix. For audio, this has led to numerous applications using time-frequency (TF) representations like the Short-Time Fourier Transform. However, extending these applications to irregularly-spaced TF representations, like the Constant-Q transform, wavelets, or sinusoidal analysis models, has not been possible since these representations cannot be directly stored in matrix form. In this paper, we formulate NMF in terms of learnable functions (instead of vectors) and show that NMF can be extended to a wider variety of signal classes that need not be regularly sampled.
- P2-21 Self-Steering Deep Non-Linear Spatially Selective Filters for Efficient Extraction of Moving Speakers under Weak Guidance
Jakob Kienegger, Alina Mannanova, Huajian Fang, Timo Gerkmann
Abstract
Recent works on deep non-linear spatially selective filters demonstrate exceptional enhancement performance with computationally lightweight architectures for stationary speakers of known directions. However, to maintain this performance in dynamic scenarios, resource-intensive data-driven tracking algorithms become necessary to provide precise spatial guidance conditioned on the initial direction of a target speaker. As this additional computational overhead hinders application in resource-constrained scenarios such as real-time speech enhancement, we present a novel strategy utilizing a low-complexity tracking algorithm in the form of a particle filter instead. Assuming a causal, sequential processing style, we introduce temporal feedback to leverage the enhanced speech signal of the spatially selective filter to compensate for the limited modeling capabilities of the particle filter. Evaluation on a synthetic dataset illustrates how the autoregressive interplay between both algorithms drastically improves tracking accuracy and leads to strong enhancement performance. A listening test with real-world recordings complements these findings by indicating a clear trend towards our proposed self-steering pipeline as the preferred choice over comparable methods.
- P2-22 A Lightweight and Robust Method for Blind Wideband-to-Fullband Extension of Speech
Jan Büthe, Jean-Marc Valin
Abstract
Reducing the bandwidth of speech is common practice in resource-constrained environments like low-bandwidth speech transmission or low-complexity vocoding. We propose a lightweight and robust method for extending the bandwidth of wideband speech signals that is inspired by classical methods developed in the speech coding context. The resulting model has just ~370 K parameters and a complexity of ~140 MFLOPS (or ~70 MMACS). With a frame size of 10 ms and a lookahead of only 0.27 ms, the model is well-suited for use with common wideband speech codecs. We evaluate the model’s robustness by pairing it with the Opus SILK speech codec (1.5 release) and verify in a P.808 DCR listening test that it significantly improves quality from 6 to 12 kb/s. We also demonstrate that Opus 1.5 together with the proposed bandwidth extension at 9 kb/s meets the quality of 3GPP EVS at 9.6 kb/s and that of Opus 1.4 at 18 kb/s, showing that the blind bandwidth extension can meet the quality of classical guided bandwidth extensions, thus providing a way for backward-compatible quality improvement.
- P2-23 Listening Intention Estimation for Hearables with Natural Behavior Cues
Pascal Baumann, Seraina Glaus, Ludovic Amruthalingam, Fabian Gröger, Ruksana Giurda, Simone Lionetti
Abstract
Despite advances in hearing technologies, users still face challenges in acoustically complex environments. In particular, social situations with multiple speakers—where a given speaker may be attended to at one moment and become an interfering source at another—are consistently reported as especially difficult. We use a data-driven approach to estimate the engagement level of the user with each sound source in the acoustic foreground based on behavioral cues within dynamic, real-world acoustic scenes. We collect a novel dataset from two experiments that simultaneously record natural behavior and which sources are attended, covering both induced and free listening intentions. For each source, we engineer features expected to reflect the user’s engagement, without making further assumptions about underlying patterns, which are instead inferred with machine learning. Classic models such as logistic regression and tree-based methods achieve an average precision of 97% when the listening intention is induced and 76% when it is free, surpassing baselines that statically predict the most frequently attended sources. Our results suggest that it is possible to estimate user engagement with individual localized acoustic sources using machine learning, without invasive sensors and with sufficient accuracy for practical application. This paves the way for the development of adaptive hearing systems that can support users in everyday situations.
- P2-24 Unveiling the Best Practices for Applying Speech Foundation Models to Speech Intelligibility Prediction for Hearing-Impaired People
Haoshuai Zhou, Boxuan Cao, Changgeng Mo, Linkai Li, Shan Xiang Wang
Abstract
Speech foundation models (SFMs) have demonstrated strong performance across a variety of downstream tasks, including speech intelligibility prediction for hearing-impaired people (SIP-HI). However, optimizing SFMs for SIP-HI has been insufficiently explored. In this paper, we conduct a comprehensive study to identify key design factors affecting SIP-HI performance with 5 SFMs, focusing on encoder layer selection, prediction head architecture, and ensemble configurations. Our findings show that, contrary to traditional use-all-layers methods, selecting a single encoder layer yields better results. Additionally, temporal modeling is crucial for effective prediction heads. We also demonstrate that ensembling multiple SFMs improves performance, with stronger individual models providing greater benefit. Finally, we explore the relationship between key SFM attributes and their impact on SIP-HI performance. Our study offers practical insights into effectively adapting SFMs for speech intelligibility prediction for hearing-impaired populations.
Poster Session 3 (4:00 pm — 6:00 pm)
Session Chair: Jiaqi Su
- P3-1 SynSonic: Augmenting Sound Event Detection through Text-to-Audio Diffusion ControlNet and Effective Sample Filtering
Jiarui Hai, Mounya Elhilali
Abstract
Data synthesis and augmentation are essential for Sound Event Detection (SED) due to the scarcity of temporally labeled data. While augmentation methods like SpecAugment and Mix-up can enhance model performance, they remain constrained by the diversity of existing samples. Recent generative models offer new opportunities, yet their direct application to SED is challenging due to the lack of precise temporal annotations and the risk of introducing noise through unreliable filtering. To address these challenges and enable generative-based augmentation for SED, we propose SynSonic, a data augmentation method tailored for this task. SynSonic leverages text-to-audio diffusion models guided by an energy-envelope ControlNet to generate temporally coherent sound events. A joint score filtering strategy with dual classifiers ensures sample quality, and we explore its practical integration into training pipelines. Experimental results show that SynSonic improves Polyphonic Sound Detection Scores (PSDS1 and PSDS2), enhancing both temporal localization and sound class discrimination.
- P3-2 Physically Informed Spatial Regularization for Sound Event Localization and Detection
Haocheng Liu, Diego Di Carlo, Aditya Arie Nugraha, Kazuyoshi Yoshii, Gaël Richard, Mathieu Fontaine
Abstract
Building Sound Event Localization and Detection (SELD) models that are robust to diverse acoustic environments remains one of the major challenges in multichannel signal processing, as reflections and reverberation can significantly confuse both the source direction and event detection. Introducing priors such as microphone geometry or room impulse response (RIR) into the model has proven effective in addressing this issue. Existing methods typically incorporate such priors in a deterministic way, often through data augmentation to enlarge data diversity. However, the uncertainty arising from the complex nature of audio acoustics remains largely underexplored in the SELD literature and naturally calls for stochastic modeling of the acoustic prior. In this paper, we propose regularizing deep learning based SELD models with a physically constructed spatial covariance matrix (SCM) based on the estimated direction of arrival (DOA) and sound event detection (SED).
- P3-3 Stereo Reproduction in the Presence of Sample Rate Offsets
Srikanth Korse, Andreas Walther, Emanuel A. P. Habets
Abstract
One of the main challenges in synchronizing wirelessly connected loudspeakers for spatial audio reproduction is clock skew. Clock skew arises from sample rate offsets (SROs) between the loudspeakers, caused by the use of independent device clocks. While network-based protocols like Precision Time Protocol (PTP) and Network Time Protocol (NTP) are explored, the impact of SROs on spatial audio reproduction and its perceptual consequences remains underexplored. We propose an audio-domain SRO compensation method using spatial filtering to isolate loudspeaker contributions. These filtered signals, along with the original playback signal, are used to estimate the SROs, and their influence is compensated for prior to spatial audio reproduction. We evaluate the effect of the compensation method in a subjective listening test. The results of these tests as well as objective metrics demonstrate that the proposed method mitigates the perceptual degradation introduced by SROs by preserving the spatial cues.
- P3-4 Physics-Informed Transfer Learning for Data-Driven Sound Source Reconstruction in Near-Field Acoustic Holography
Xinmeng Luan, Mirco Pezzoli, Fabio Antonacci, Augusto Sarti
Abstract
We propose a transfer learning framework for sound source reconstruction in Near-field Acoustic Holography (NAH), which adapts a well-trained data-driven model from one type of sound source to another using a physics-informed procedure. The framework comprises two stages: (1) supervised pre-training of a complex-valued convolutional neural network (CV-CNN) on a large dataset, and (2) purely physics-informed fine-tuning on a single data sample based on the Kirchhoff-Helmholtz integral. This method follows the principles of transfer learning by enabling generalization across different datasets through physics-informed adaptation. The effectiveness of the approach is validated by transferring a pre-trained model from a rectangular plate dataset to a violin top plate dataset, where it shows improved reconstruction accuracy compared to the pre-trained model and delivers performance comparable to that of Compressive-Equivalent Source Method (C-ESM). Furthermore, for successful modes, the fine-tuned model outperforms both the pre-trained model and C-ESM in accuracy.
- P3-5 Scene-wide Acoustic Parameter Estimation
Ricardo Falcon-Perez, Ruohan Gao, Gregor Mueckl, Sebastia V. Amengual Gari, Ishwarya Ananthabhotla
Abstract
For augmented (AR) and virtual reality (VR) applications, accurate estimates of the acoustic characteristics of a scene are critical for creating a sense of immersion. However, directly estimating Room-impulse Responses (RIRs) from scene geometry is often a challenging, data-expensive task. We propose a method to instead infer spatially-distributed acoustic parameters (such as C50, T60, etc) for an entire scene from lightweight information readily available in an AR/VR context. We consider an image-to-image translation task to transform a 2D floormap, conditioned on a calibration RIR measurement, into 2D heatmaps of acoustic parameters. Moreover, we show that the method also works for directionally-dependent (i.e. beamformed) parameter prediction. We introduce and release a 1000-room, complex-scene dataset to study the task, and demonstrate improvements over strong statistical baselines.
- P3-6 Cyclic Multichannel Wiener Filter for Acoustic Beamforming
Giovanni Bologni, Richard Heusdens, Richard C. Hendriks
Abstract
Acoustic beamforming models typically assume wide-sense stationarity of speech signals within short time frames. However, voiced speech is better modeled as a cyclostationary (CS) process, a random process whose mean and autocorrelation are $T_1$-periodic, where $\alpha_1=1/T_1$ corresponds to the fundamental frequency of vowels. Higher harmonic frequencies are found at integer multiples of the fundamental. This work introduces a cyclic multichannel Wiener filter (cMWF) for speech enhancement derived from a cyclostationary model. This beamformer exploits spectral correlation across the harmonic frequencies of the signal to further reduce the mean-squared error (MSE) between the target and the processed input. The proposed cMWF is optimal in the MSE sense and reduces to the MWF when the target is wide-sense stationary. Experiments on simulated data demonstrate considerable improvements in scale-invariant signal-to-distortion ratio (SI-SDR) on synthetic data but also indicate high sensitivity to the accuracy of the estimated fundamental frequency $\alpha_1$, which limits effectiveness on real data.
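As a reminder of the cyclostationarity assumption underlying the cMWF, a random process $x(t)$ is wide-sense cyclostationary with period $T_1$ when its mean and autocorrelation are $T_1$-periodic:

$$
\mu_x(t) = \mu_x(t + T_1), \qquad
r_x(t,\tau) = \mathbb{E}\!\left[x(t)\,x^*(t-\tau)\right] = r_x(t + T_1, \tau),
$$

so that spectral components separated by multiples of $\alpha_1 = 1/T_1$ (the harmonics of voiced speech) are mutually correlated and can be exploited jointly by the filter. This is the standard definition restated for context, not an addition to the paper's model.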
- P3-7 EffDiffSE: Efficient Diffusion-Based Frequency-Domain Speech Enhancement with Hybrid Discriminative and Generative DNNs
Yihui Fu, Renzheng Shi, Marvin Sach, Wouter Tirry, Tim Fingscheidt
Abstract
Diffusion approaches to speech enhancement have gained immense attention due to their improved speech component reconstruction capability. However, they usually suffer from high inference computational complexity due to multiple iterations during the reverse process. In this paper, we propose EffDiffSE, an efficient diffusion-based frequency-domain speech enhancement model with a hybrid discriminative condition DNN and generative score DNN. Our contributions are three-fold. First, we formulate the powerful time-domain Universe++ model in the frequency domain with a combined psychoacoustic loss and score matching loss. Second, we achieve a single-step efficient reverse process both during training and inference with noise-consistent Langevin dynamics. Third, an auxiliary loss is applied to the single-step reverse process output to improve the diffusion performance further. Trained and evaluated on the URGENT 2024 Speech Enhancement Challenge data splits, the proposed EffDiffSE achieves an MOS comparable to the top reported time- and frequency-domain diffusion baseline methods, while exceeding them by >0.2 PESQ points and showing significantly less hallucination and inference computational complexity (3.9 GMAC/s vs. 58 … 7600 GMAC/s).
- P3-8 Adaptive Slimming for Scalable and Efficient Speech Enhancement
Riccardo Miccini, Minje Kim, Clement Laroche, Luca Pezzarossa, Paris Smaragdis
Abstract
Speech enhancement (SE) enables robust speech recognition, real-time communication, hearing aids, and other applications where speech quality is crucial. However, deploying such systems on resource-constrained devices involves choosing a static trade-off between performance and computational efficiency. In this paper, we introduce dynamic slimming to DEMUCS, a popular SE architecture, making it scalable and input-adaptive. Slimming lets the model operate at different utilization factors (UF), each corresponding to a different performance/efficiency trade-off, effectively mimicking multiple model sizes without the extra storage costs. In addition, a router subnet, trained end-to-end with the backbone, determines the optimal UF for the current input. Thus, the system saves resources by adaptively selecting smaller UFs when additional complexity is unnecessary. We show that our solution is Pareto-optimal against individual UFs, confirming the benefits of dynamic routing. When training the proposed dynamically-slimmable model to use 10% of its capacity on average, we obtain the same or better speech quality as the equivalent static 25% utilization while reducing MACs by 29%.
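The slimming mechanism can be pictured as layers that use only the first fraction of their channels, with a tiny router choosing that fraction per input. The sketch below is a generic toy with invented shapes and a four-way UF grid; it is not the modified DEMUCS architecture.

```python
import torch
import torch.nn as nn

class SlimmableLinear(nn.Module):
    # A linear layer whose active output width is chosen at run time.
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.bias = nn.Parameter(torch.zeros(d_out))
    def forward(self, x, uf):
        k = max(1, int(uf * self.weight.shape[0]))   # number of active output channels
        return nn.functional.linear(x, self.weight[:k], self.bias[:k])

router = nn.Sequential(nn.Linear(257, 32), nn.ReLU(), nn.Linear(32, 4))
layer = SlimmableLinear(257, 512)
ufs = torch.tensor([0.1, 0.25, 0.5, 1.0])            # candidate utilization factors

x = torch.randn(1, 257)                              # one noisy feature frame (dummy)
uf = ufs[router(x).argmax(dim=-1)]                   # router picks a UF for this input
y = layer(x, uf.item())                              # narrower output when the input is easy
```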
- P3-9 ReverbMiipher: Generative Speech Restoration meets Reverberation Characteristics Controllability
Wataru Nakata, Koizumi Yuma, Shigeki Karita, Robin Scheibler, Haruko Ishikawa, Adriana Guevara-Rukoz, Heiga Zen, Michiel Bacchiani
Abstract
Reverberation encodes spatial information regarding the acoustic source environment, yet traditional Speech Restoration (SR) usually completely removes reverberation. We propose ReverbMiipher, an SR model extending parametric resynthesis framework, designed to denoise speech while preserving and enabling control over reverberation. ReverbMiipher incorporates a dedicated ReverbEncoder to extract a reverb feature vector from noisy input. This feature conditions a vocoder to reconstruct the speech signal, removing noise while retaining the original reverberation characteristics. A stochastic zero-vector replacement strategy during training ensures the feature specifically encodes reverberation, disentangling it from other speech attributes. This learned representation facilitates reverberation control via techniques such as interpolation between features, replacement with features from other utterances, or sampling from a latent space. Objective and subjective evaluations confirm ReverbMiipher effectively preserves reverberation, removes other artifacts, and outperforms the conventional two-stage SR and convolving simulated room impulse response approach. We further demonstrate its ability to generate novel reverberation effects through feature manipulation.
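The stochastic zero-vector replacement can be illustrated in a couple of lines: during training, the reverb feature is zeroed with some probability, so the vocoder must then produce dry speech, which pushes reverberation information into that feature alone. The probability and tensor shapes below are assumptions, not the paper's values.

```python
import torch

def maybe_zero_reverb(reverb_feat, p=0.3):
    # With probability p per example, replace the reverb feature with a zero
    # vector so the vocoder has to output dry speech for that example.
    drop = (torch.rand(reverb_feat.shape[0], 1) < p).float()
    return reverb_feat * (1.0 - drop)

reverb_feat = torch.randn(8, 128)        # assumed (batch, dim) ReverbEncoder output
reverb_feat = maybe_zero_reverb(reverb_feat)
```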
- P3-10 IS³ : Generic Impulsive–Stationary Sound Separation in Acoustic Scenes using Deep Filtering
Clémentine Berger, Paraskevas Stamatiadis, Roland Badeau, Slim Essid
Abstract
We are interested in audio systems capable of performing a differentiated processing of stationary backgrounds and isolated acoustic events within an acoustic scene, whether for applying specific processing methods to each part or for focusing solely on one while ignoring the other. Such systems have applications in real-world scenarios, including robust adaptive audio rendering systems (e.g., EQ or compression), plosive attenuation in voice mixing, noise suppression or reduction, robust acoustic event classification or even bioacoustics. To this end, we introduce IS³, a neural network designed for Impulsive–Stationary Sound Separation, that isolates impulsive acoustic events from the stationary background using a deep filtering approach, which can act as a pre-processing stage for the above-mentioned tasks. To ensure optimal training, we propose a sophisticated data generation pipeline that curates and adapts existing datasets for this task. We demonstrate that a learning-based approach, built on a relatively lightweight neural architecture and trained with well-designed and varied data, is successful in this previously unaddressed task, outperforming the Harmonic–Percussive Sound Separation masking method, adapted from music signal processing research, and wavelet filtering on objective separation metrics.
- P3-11 Hybrid-Sep: Language-queried audio source separation via pre-trained Model Fusion and Adversarial Consistent Training
Jianyuan Feng, Guangzheng Li, Yangfei Xu
Abstract
Language-Queried Audio Separation (LASS) employs linguistic queries to isolate target sounds based on semantic descriptions. However, existing methods face challenges in aligning complex auditory features with linguistic context while preserving separation precision. Current research efforts focus primarily on text description augmentation and architectural innovations, yet the potential of integrating pre-trained self-supervised learning (SSL) audio models and Contrastive Language-Audio Pretraining (CLAP) frameworks, capable of extracting cross-modal audio-text relationships, remains underexplored. To address this, we present HybridSep, a two-stage LASS framework that synergizes SSL-based acoustic representations with CLAP-derived semantic embeddings. Our framework introduces Adversarial Consistent Training (ACT), a novel optimization strategy that treats diffusion as an auxiliary regularization loss while integrating adversarial training to enhance separation fidelity. Experiments demonstrate that HybridSep achieves significant performance improvements over state-of-the-art baselines (e.g., AudioSep, FlowSep) across multiple metrics, establishing new benchmarks for LASS tasks.
- P3-12 UBGAN: Enhancing Coded Speech with Blind and Guided Bandwidth Extension
Kishan Gupta, Srikanth Korse, Andreas Brendel, Nicola Pia, Guillaume Fuchs
Abstract
In practical applications of speech codecs, a multitude of factors, such as the quality of the radio connection, limiting hardware, or the required user experience, necessitate trade-offs between achievable perceptual quality, engendered bitrate, and computational complexity. Most conventional and neural speech codecs operate on wideband (WB) speech signals to achieve this compromise. To further enhance the perceptual quality of coded speech, bandwidth extension (BWE) of the transmitted speech is an attractive and popular technique in conventional speech coding. In contrast, neural speech codecs are typically trained end-to-end to a specific set of requirements and are often not easily adaptable. In particular, they are typically trained to operate at a single fixed sampling rate. With the Universal Bandwidth Extension Generative Adversarial Network (UBGAN), we propose a modular and lightweight GAN-based solution that increases the operational flexibility of a wide range of conventional and neural codecs. Our model operates in the subband domain and extends the bandwidth of WB signals from 8 kHz to 16 kHz, resulting in super-wideband (SWB) signals. We further introduce two variants, guided-UBGAN and blind-UBGAN, where the guided version transmits a quantized learned representation as side information at a very low bitrate in addition to the bitrate of the codec, while the blind version operates without such side information. Our subjective assessments demonstrate the advantage of UBGAN applied to WB codecs and highlight the generalization capacity of our proposed method across multiple codecs and bitrates.
- P3-13 Contrastive Representation Learning for Privacy-Preserving Fine-Tuning of Audio-Visual Speech Recognition
Luca Becker, Rainer Martin
Abstract
In this work, we propose a novel approach towards privacy-preserving audio-visual speech recognition (AV-ASR). We apply feature-wise additive and multiplicative masks to the latent embeddings of a pre-trained AV-ASR model and fine-tune subsequent layers of this model using AdaLoRA. The masks are trained to preserve linguistic content necessary for ASR while degrading speaker-discriminative cues. We derive the masking mechanism directly from the sequential input and optimize it via a contrastive representation learning (CRL) objective. The method closely aggregates slightly perturbed representations of the same utterance while simultaneously increasing the distance to representations of different utterances from the same speaker. Experiments on LRS3 and VoxCeleb2 show that our approach maintains competitive word error rates (WER) while significantly reducing speaker identification performance, as measured by Equal Error Rate (EER) under a strong audio-visual speaker verification attack. These results demonstrate the potential of contrastive fine-tuning for privacy-preserving AV-ASR for edge devices.
- P3-14 Low-Complexity Individualized Noise Reduction for Real-Time Processing
Chuan Wen, Sarah Verhulst
Abstract
Recent advancements in deep neural network (DNN)-based hearing aids (HAs) have significantly improved hearing-restoration performance. A recent study introduced a DNN-based framework for designing HA models tailored to the sensorineural hearing loss (SNHL) profile of individual users, improving treatment outcomes for diverse HA users. While these bio-inspired HA models perform well in clean speech conditions, their effectiveness diminishes in noisy environments. Moreover, their high computational complexity makes deployment on resource-constrained devices challenging, particularly when additional noise reduction modules are added. To address these limitations, we propose a low-complexity, real-time individualized noise reduction (INR) model that jointly performs noise suppression and hearing loss compensation. The model is optimized for deployment on embedded systems using quantization to reduce model size and computational load. To mitigate the impact of quantization error, we applied quantization-aware training (QAT), which simulates the quantization effect during the training phase. Experimental results show that the proposed INR model outperforms existing closed-loop bio-inspired HA models in noisy conditions, achieving superior objective measures of speech intelligibility and sound quality. The INT8 quantized version maintains performance comparable to the full-precision model. With a system latency of just 8 ms, the model meets the real-time requirements of hearing aid applications. These findings demonstrate the potential of the proposed INR model for real-time application in hearing aids and other low-power hearing devices.
- P3-15 JSQA: Speech Quality Assessment with Perceptually-Inspired Contrastive Pretraining Based on JND Audio Pairs
Junyi Fan, Donald Williamson
Abstract
Speech quality assessment (SQA) is often used to learn a mapping from a high-dimensional input space to a scalar that represents the mean opinion score (MOS) of the perceptual speech quality. Learning such a mapping is challenging for many reasons, but largely because MOS exhibits high levels of inherent variance due to perceptual and experimental-design differences. Many solutions have been proposed, but many approaches do not properly incorporate perceptual factors into their learning algorithms (beyond the MOS label), which could lead to unsatisfactory results. To this end, we propose JSQA, a two-stage framework that pretrains an audio encoder using perceptually-guided contrastive learning on just noticeable difference (JND) pairs, followed by fine-tuning for MOS prediction. We first generate pairs of audio data within JND levels, which are then used to pretrain an encoder to leverage perceptual quality similarity information and map it into an embedding space. The JND pairs come from clean LibriSpeech utterances that are mixed with background noise from CHiME-3, at different signal-to-noise ratios (SNRs). The encoder is later fine-tuned with audio samples from the NISQA dataset for MOS prediction. Experimental results suggest that perceptually-inspired contrastive pretraining significantly improves the model performance evaluated by various metrics when compared against the same network trained from scratch without pretraining. These findings suggest that incorporating perceptual factors into pretraining greatly contributes to the improvement in performance for SQA.
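The pretraining stage described above amounts to a contrastive objective over JND pairs. A minimal, generic InfoNCE-style sketch of such a loss is shown below; this is an illustration of the general technique, not the authors' exact objective, and the embedding dimensions are arbitrary.

```python
import torch
import torch.nn.functional as F

def jnd_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent-style loss: row i of emb_a and row i of emb_b come from the same JND pair
    (perceptually equivalent quality); all other rows in the batch act as negatives."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / tau                  # (B, B) cosine similarities
    targets = torch.arange(a.shape[0], device=a.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random "encoder outputs" for a batch of 4 JND pairs.
loss = jnd_contrastive_loss(torch.randn(4, 128), torch.randn(4, 128))
```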
- P3-16 Is MixIT Really Unsuitable for Correlated Sources? Exploring MixIT for Unsupervised Pre-training in Music Source Separation
Kohei Saijo, Yoshiaki Bando
Abstract
In music source separation (MSS), obtaining isolated sources or stems is highly costly, making pre-training on unlabeled data a promising approach. Although source-agnostic unsupervised learning such as mixture-invariant training (MixIT) has been explored in general sound separation, it has been largely overlooked in MSS due to its implicit assumption of source independence. We hypothesize, however, that the difficulty of applying MixIT to MSS arises from the ill-posed nature of MSS itself, where stem definitions are application-dependent and models lack explicit knowledge of what should or should not be separated, rather than from high inter-source correlation. While MixIT does not assume any source model and struggles with such ambiguities, our preliminary experiments show that it can still separate instruments to some extent, suggesting its potential for unsupervised pre-training. Motivated by these insights, this study investigates MixIT-based pre-training for MSS. We first pre-train a model on in-the-wild, unlabeled data from the Free Music Archive using MixIT, and then fine-tune it on MUSDB18 with supervision. Using the band-split TF-Locoformer, one of the state-of-the-art MSS models, we demonstrate that MixIT-based pre-training improves performance over training from scratch.
- P3-17 Beyond Architecture: The Critical Impact of Inference Overlap on Music Source Separation Benchmarks
Harnick Khera, Johan Pauwels, Alan W. Archer-Boyd, Mark B. Sandler
Abstract
This paper investigates how inference step size in sliding window approaches affects music source separation quality. Through systematic analysis of seven model configurations across five architectures, we demonstrate that increased segment overlap consistently improves separation quality by up to 0.37 dB SDR. We identify a universal pattern where performance improves logarithmically with overlap, with an “elbow point” at 4-8 overlaps per segment where diminishing returns begin. Our analysis reveals that: (1) state-of-the-art papers inconsistently report inference overlap settings, making fair comparisons difficult; (2) even modest overlap settings (25%) substantially improve quality through boundary artifact reduction; and (3) higher-performing models show proportionally greater improvements from increased overlap when accounting for the logarithmic nature of SDR. These findings suggest that standardized overlap reporting is essential for meaningful architectural comparisons and that differences attributed to architectural innovations may partly stem from undisclosed inference settings.
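The overlap mechanism under study is plain windowed overlap-add over sliding segments at inference time. The generic NumPy sketch below illustrates it, with `separate` standing in for any pretrained separation model; segment length, window, and overlap values are illustrative, not the paper's settings.

```python
import numpy as np

def overlapped_inference(x: np.ndarray, separate, seg_len: int, overlap: float = 0.5) -> np.ndarray:
    """Run a separation function on overlapping segments and overlap-add the results.
    `separate` maps a 1-D segment to a 1-D output of the same length."""
    hop = max(1, int(seg_len * (1.0 - overlap)))
    out = np.zeros_like(x, dtype=float)
    weight = np.zeros_like(x, dtype=float)
    window = np.hanning(seg_len)
    for start in range(0, max(1, len(x) - seg_len + 1), hop):
        seg = x[start:start + seg_len]
        y = separate(seg) * window[:len(seg)]
        out[start:start + len(seg)] += y
        weight[start:start + len(seg)] += window[:len(seg)]
    return out / np.maximum(weight, 1e-8)

# Toy usage: an "identity separator" evaluated at 75% segment overlap.
x = np.random.randn(48000)
y = overlapped_inference(x, lambda s: s, seg_len=8000, overlap=0.75)
```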
- P3-18 Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models
Paul A. Bereuter, Benjamin Stahl, Mark D. Plumbley, Alois Sontacchi
Abstract
Traditional Blind Source Separation Evaluation (BSS-Eval) metrics were originally designed to evaluate linear audio source separation models based on methods such as time-frequency masking. However, recent generative models may introduce nonlinear relationships between the separated and reference signals, limiting the reliability of these metrics for objective evaluation. To address this issue, we conduct a Degradation Category Rating listening test and analyze correlations between the obtained degradation mean opinion scores (DMOS) and a set of objective audio quality metrics for the task of singing voice separation. We evaluate three state-of-the-art discriminative models and two new, competitive generative models. For both discriminative and generative models, intrusive embedding-based metrics show higher correlations with DMOS than conventional intrusive metrics such as BSS-Eval metrics. For discriminative models, the highest correlation is achieved by the MSE computed on Music2Latent embeddings. When it comes to the evaluation of generative models, the strongest correlations are evident for the multi-resolution STFT loss and the MSE calculated on MERT-L12 embeddings, with the latter also providing the most balanced correlation across both model types. Our results highlight the limitations of BSS-Eval metrics for evaluating generative singing voice separation models and emphasize the need for careful selection and validation of alternative evaluation metrics for the task of singing voice separation.
- P3-19 Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures
Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira
Abstract
Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a mixture to extract the target sources, the flexibility and generalization capabilities of generative diffusion models are giving rise to a novel class of solutions for this complicated task. In this work, we explore singing voice separation from real music recordings using a diffusion model which is trained to generate the solo vocals conditioned on the corresponding mixture. Our approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data. The iterative nature of diffusion sampling enables the user to control the quality-efficiency trade-off, and also refine the output when needed. We present an ablation study of the sampling algorithm, highlighting the effects of the user-configurable parameters. Code and weights are released.
- P3-20 Balancing Information Preservation and Disentanglement in Self-Supervised Music Representation Learning
Julia Wilkins, Sivan Ding, Magdalena Fuentes, Juan Pablo Bello
Abstract
Recent advances in self-supervised learning (SSL) methods offer a range of strategies for capturing useful representations from music audio without the need for labeled data. While some techniques focus on preserving comprehensive details through reconstruction, others favor semantic structure via contrastive objectives. Few works examine the interaction between these paradigms in a unified SSL framework. In this work, we propose a multi-view SSL framework for disentangling music audio representations that combines contrastive and reconstructive objectives. The architecture is designed to promote both information fidelity and structured semantics of factors in disentangled subspaces. We perform an extensive evaluation on the design choices of contrastive strategies using music audio representations in a controlled setting. We find that while reconstruction and contrastive strategies exhibit consistent trade-offs, when combined effectively, they complement each other; this enables the disentanglement of music attributes without compromising information integrity.
- P3-21 SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation
Sonal Kumar, Prem Seetharaman, Justin Salamon, Dinesh Manocha, Oriol Nieto
Abstract
The field of text-to-audio generation has seen significant advancements, and yet the ability to finely control the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach to generate sound effects with control over key acoustic parameters such as loudness, pitch, reverb, fade, brightness, noise and duration, enabling creative applications in sound design and content creation. These parameters extend beyond traditional Digital Signal Processing (DSP) techniques, incorporating learned representations that capture the subtleties of how sound characteristics can be shaped in context, enabling a richer and more nuanced control over the generated audio. Our approach is model-agnostic and is based on learning the disentanglement between audio semantics and its acoustic features. Our approach not only enhances the versatility and expressiveness of text-to-audio generation but also opens new avenues for creative audio production and sound design. Our objective and subjective evaluation results demonstrate the effectiveness of our approach in producing high-quality, customizable audio outputs that align closely with user specifications.
- P3-22 Perceptually-Driven Panning for an Extended Listening Area
Pedro Lladó, Annika Neidhardt, Antoine R. Souchaud, Zoran Cvetkovic, Enzo De Sena
Abstract
In loudspeaker-based sound field reproduction, the perceived sound quality deteriorates significantly when listeners move outside of the sweet spot. Although a substantial increase in the number of loudspeakers enables rendering methods that can mitigate this issue, such a solution is not feasible for most real-life applications. This study aims to extend the listening area by finding a panning strategy that optimises an objective function reflecting the localisation and localisation uncertainty over a listening area. To that end we first introduce a psychoacoustic localisation model that outperforms existing models in the context of multichannel loudspeaker setups. Leveraging this model and an existing model of localisation uncertainty, we optimise inter-channel time and level differences for a stereophonic system. The outcome is a new panning approach that depends on the listening area and the most suitable trade-off between localisation and localisation uncertainty.
- P3-23 Theoretical Analysis of Recursive Implementations of Multi-Channel Cross-Talk Cancellation Systems
Filippo Maria Fazi, Francesco Veronesi, Marcos F. Simón Gálvez, Andreas Franck
Abstract
Cross-talk cancellation (CTC) is a well-established technique for delivering binaural audio over loudspeakers. Traditional two-channel CTC systems are commonly implemented using networks of Finite Impulse Response (FIR) filters, with causality ensured through the use of modeling delays and regularisation techniques. Recursive implementations, which rely on a feedback network, have been proposed as an alternative for two-channel systems, offering potential benefits in computational efficiency. This paper investigates whether a similar recursive architecture can be extended to multi-channel CTC systems, particularly those employing linear arrays of loudspeakers. Through theoretical analysis, we demonstrate that recursive multi-channel CTC implementations are intrinsically non-causal under general conditions, making a direct real-time realisation infeasible without significantly compromising the system performance.
- P3-24 SFC-L1: Sound Field Control With Least Absolute Deviation Regression
Takuma Okamoto
Abstract
Sound field control using loudspeaker arrays is an important acoustic and audio signal processing application. In sound field control, least squares (LS) regression based on pressure matching or mode matching is typically introduced to derive the driving signals of the loudspeakers as a closed-form solution. LS regression is a maximum-likelihood estimation in which the error is assumed to follow a Gaussian distribution. Compared with LS regression, least absolute deviation (LAD) regression, in which the error is assumed to follow a Laplace distribution, is robust against outliers. In pressure-matching-based sound field control methods, outliers appear at higher frequencies according to the spatial Nyquist frequency. To improve the control accuracy of pressure-matching-based methods at high frequencies, this paper proposes SFC-L1, a pressure-matching-based sound field control method with LAD regression instead of LS regression. In the proposed method, the LAD regression combined with L1 regularization is solved with a gradient method simply implemented in PyTorch. The results of computer simulations demonstrate that the proposed LAD-based methods can improve the sound field control accuracy at high frequencies compared with the conventional LS-based methods. Additionally, a PyTorch-based implementation, Torch-SFC, is open-sourced to accelerate sound field control research.
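The optimization the abstract describes, LAD pressure matching with L1 regularization solved by a gradient method in PyTorch, can be sketched as below. This is a minimal illustration under toy data, not the released Torch-SFC code; `G`, `p_des`, and the hyperparameters are placeholders.

```python
import torch

def lad_pressure_matching(G: torch.Tensor, p_des: torch.Tensor,
                          lam: float = 1e-3, n_iter: int = 2000, lr: float = 1e-2) -> torch.Tensor:
    """Solve min_d ||G d - p_des||_1 + lam * ||d||_1 by gradient descent.
    G: (M control points x L loudspeakers) complex transfer matrix at one frequency,
    p_des: (M,) desired complex pressures, d: (L,) complex loudspeaker driving signals."""
    d = (0.01 * torch.randn(G.shape[1], dtype=torch.cfloat)).requires_grad_(True)
    opt = torch.optim.Adam([d], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        residual = G @ d - p_des
        loss = residual.abs().sum() + lam * d.abs().sum()  # LAD data term + L1 regularizer
        loss.backward()
        opt.step()
    return d.detach()

# Toy example: 32 control points, 8 loudspeakers, random complex data.
torch.manual_seed(0)
G = torch.randn(32, 8, dtype=torch.cfloat)
p_des = torch.randn(32, dtype=torch.cfloat)
d = lad_pressure_matching(G, p_des)
```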
Wednesday 10/15
Keynote Speech 3 (8:00 am—8:55 am)
Speaker: Juan Pablo Bello
Title: “Reframing SELD: Learned Localization, Multichannel Processing, and the Beamforming Gap“
Abstract: This talk examines the evolution of Sound Event Localization and Detection (SELD) through the lens of multichannel audio processing. The discussion begins with the field’s early reliance on channel-independent models, whose inability to capture inter-channel structure limited localization performance. This is followed by the emergence of approaches that incorporate classical spatial features—such as generalized cross-correlation (GCC) and intensity vectors—to encode spatial relationships explicitly. Particular attention is given to two areas where advances in speech processing offer new directions for SELD. First, learned spatial modeling remains underexplored in SELD compared to speaker localization, where architectures based on cross-channel attention, graph neural networks, and geometry-aware representations are behind recent progress in the field. Second, beamforming-based methods—including powerful neural beamformers widely used in speech enhancement and recognition—have been largely overlooked in SELD. These techniques present significant potential for extending SELD research into open vocabularies, improving multichannel detection, and enabling spatial analysis of individual sources in complex environments. The talk concludes by reflecting on SELD’s current trajectory and outlining key opportunities and challenges for future research.
Oral Session 4 (8:55 am — 10:10 am)
Session Chairs: Shoko Araki, Gordon Wichern
- O4-1 Beamforming with Interaural Time-To-Level Difference Conversion for Hearing Loss Compensation
Johannes W. de Vries, Timm-Jonas Bäumer, Stephan Töpken, Richard Heusdens, Steven van de Par, Richard C. Hendriks
Abstract
A common hearing impairment is an insensitivity to interaural time differences (ITDs). This can strongly limit the binaural benefit in intelligibility, but could be overcome by converting the low-frequency ITDs into corresponding interaural level differences (ILDs). This approach has been validated to be effective, but with a procedure that cannot be implemented in real time. In this work, a beamformer is presented that is able to perform this binaural cue conversion based on noisy mixed data, optionally in conjunction with noise and interference suppression. Simulations show that this beamformer is able to perform the conversion well, provided accurate estimates of the acoustic transfer functions (ATFs) are available. However, the binaural cue conversion has a detrimental effect on the noise/interference reduction for higher numbers of interferers. For a low number of interferers, the proposed beamformer could be a promising alternative for reduced ITD sensitivity, which should be verified through listening tests.
- O4-2 Robust One-step Speech Enhancement via Consistency Distillation
Liang Xu, Longfei Felix Yan, W. Bastiaan Kleijn
Abstract
Diffusion models have shown strong performance in speech enhancement, but their real-time applicability has been limited by multi-step iterative sampling. Consistency distillation has recently emerged as a promising alternative by distilling a one-step consistency model from a multi-step diffusion-based teacher model. However, distilled consistency models are inherently biased towards the sampling trajectory of the teacher model, making them less robust to noise and prone to inheriting inaccuracies from the teacher model. To address this limitation, we propose ROSE-CD: Robust One-step Speech Enhancement via Consistency Distillation, a novel approach for distilling a one-step consistency model. Specifically, we introduce a randomized learning trajectory to improve the model’s robustness to noise. Furthermore, we jointly optimize the one-step model with two time-domain auxiliary losses, enabling it to recover from teacher-induced errors and surpass the teacher model in overall performance. This is the first pure one-step consistency distillation model for diffusion-based speech enhancement, achieving 54 times faster inference speed and superior performance compared to its 30-step teacher model. Experiments on the VoiceBank-DEMAND dataset demonstrate that the proposed model achieves state-of-the-art performance in terms of speech quality. Moreover, its generalization ability is validated on both an out-of-domain dataset and real-world noisy recordings.
- O4-3 Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration
Shigeki Karita, Yuma Koizumi, Heiga Zen, Haruko Ishikawa, Robin Scheibler, Michiel Bacchiani
Abstract
Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour-scale data, intended for cleaning the training data of large-scale generative models such as large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multi-lingual, studio-quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher-2’s superior or comparable performance to conventional SR models in word error rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.
- O4-4 Context-Aware Query Refinement for Target Sound Extraction: Handling Partially Matched Queries
Ryo Sato, Chiho Haruta, Nobuhiko Hiruma, Keisuke Imoto
Abstract
Target sound extraction (TSE) is the task of extracting a target sound specified by a query from an audio mixture. Much prior research has focused on the problem setting under the Fully Matched Query (FMQ) condition, where the query specifies only active sounds present in the mixture. However, in real-world scenarios, queries may include inactive sounds that are not present in the mixture. This leads to scenarios such as the Fully Unmatched Query (FUQ) condition, where only inactive sounds are specified in the query, and the Partially Matched Query (PMQ) condition, where both active and inactive sounds are specified. Among these conditions, the performance degradation under the PMQ condition has been largely overlooked. To achieve robust TSE under the PMQ condition, we propose context-aware query refinement. This method eliminates inactive classes from the query during inference based on the estimated sound class activity. Experimental results demonstrate that while conventional methods suffer from performance degradation under the PMQ condition, the proposed method effectively mitigates this degradation and achieves high robustness under diverse query conditions.
- O4-5 Combolutional Neural Networks
Cameron Churchwell, Minje Kim, Paris Smaragdis
Abstract
Selecting appropriate inductive biases is an essential step in the design of machine learning models, especially when working with audio, where even short clips may contain millions of samples. To this end, we propose the combolutional layer: a learned-delay IIR comb filter and fused envelope detector, which extracts harmonic features in the time domain. We demonstrate the efficacy of the combolutional layer on three information retrieval tasks, evaluate its computational cost relative to other audio frontends, and provide efficient implementations for training. We find that the combolutional layer is an effective replacement for convolutional layers in audio tasks where precise harmonic analysis is important, e.g., piano transcription, speaker classification, and key detection. Additionally, the combolutional layer has several other key benefits over existing frontends, namely: low parameter count, efficient CPU inference, strictly real-valued computations, and improved interpretability.
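As a rough illustration of the signal path a comb-plus-envelope channel computes, the sketch below applies a fixed-delay feedback (IIR) comb filter followed by a simple envelope detector. The actual combolutional layer learns its delays end-to-end, which is not reproduced here; the function name, gain, and window length are illustrative choices.

```python
import numpy as np
from scipy.signal import lfilter

def comb_envelope(x: np.ndarray, sr: int, f0: float, gain: float = 0.8, win_ms: float = 10.0) -> np.ndarray:
    """Feedback (IIR) comb filter tuned to f0, followed by a simple envelope detector.
    Uses a fixed integer delay purely for illustration."""
    delay = max(1, int(round(sr / f0)))          # comb period in samples
    a = np.zeros(delay + 1)
    a[0], a[delay] = 1.0, -gain                  # y[n] = x[n] + gain * y[n - delay]
    combed = lfilter([1.0], a, x)
    rectified = np.abs(combed)
    win_len = max(1, int(sr * win_ms / 1000.0))
    win = np.ones(win_len) / win_len
    return np.convolve(rectified, win, mode="same")  # smoothed envelope ~ harmonic energy near f0

# Toy usage: response of a 440 Hz comb channel to a 440 Hz tone vs. white noise.
sr = 16000
t = np.arange(sr) / sr
env_tone = comb_envelope(np.sin(2 * np.pi * 440 * t), sr, f0=440)
env_noise = comb_envelope(np.random.randn(sr), sr, f0=440)
```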
Poster Session 4 (10:30 am — 12:30 pm)
Session Chair: Ishwarya Ananthabhotla
- P4-1 Incremental Averaging Method to Improve Graph-Based Time-Difference-of-Arrival Estimation
Klaus Brümann, Kouei Yamaoka, Nobutaka Ono, Simon Doclo
Abstract
Estimating the position of a speech source based on time-differences-of-arrival (TDOAs) is often adversely affected by background noise and reverberation. A popular method to estimate the TDOA between a microphone pair involves maximizing a generalized cross-correlation with phase transform (GCC-PHAT) function. Since the TDOAs across different microphone pairs satisfy consistency relations, generally only a small subset of microphone pairs are used for source position estimation. Although the set of microphone pairs is often determined based on a reference microphone, recently a more robust method has been proposed to determine the set of microphone pairs by computing the minimum spanning tree (MST) of a signal graph of GCC-PHAT function reliabilities. To reduce the influence of noise and reverberation on the TDOA estimation accuracy, in this paper we propose to compute the GCC-PHAT functions of the MST based on an average of multiple cross-power spectral densities (CPSDs) using an incremental method. In each step of the method, we increase the number of CPSDs over which we average by considering CPSDs computed indirectly via other microphones from previous steps. Using signals recorded in a noisy and reverberant laboratory with an array of spatially distributed microphones, the performance of the proposed method is evaluated in terms of TDOA estimation error and 2D source position estimation error. Experimental results for different source and microphone configurations and three reverberation conditions show that the proposed method considering multiple CPSDs improves the TDOA estimation and source position estimation accuracy compared to the reference microphone- and MST-based methods that rely on a single CPSD as well as steered-response power-based source position estimation.
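The GCC-PHAT building block referenced above is standard. A minimal single-pair TDOA estimator is sketched below; the paper's contributions, incremental CPSD averaging and MST-based pair selection, are not shown, and the toy delay check is illustrative only.

```python
import numpy as np

def gcc_phat_tdoa(x1: np.ndarray, x2: np.ndarray, fs: float, max_shift=None) -> float:
    """Estimate the TDOA (in seconds) between two microphone signals by maximizing
    the GCC-PHAT function. Positive values mean x1 arrives later than x2."""
    n = len(x1) + len(x2)                                    # zero-pad to avoid circular wrap-around
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    cpsd = X1 * np.conj(X2)                                  # cross-power spectral density
    gcc = np.fft.irfft(cpsd / (np.abs(cpsd) + 1e-12), n=n)   # PHAT weighting
    if max_shift is None:
        max_shift = n // 2
    gcc = np.concatenate((gcc[-max_shift:], gcc[:max_shift + 1]))  # lags -max_shift..+max_shift
    return (np.argmax(np.abs(gcc)) - max_shift) / fs

# Toy check: x1 is the source delayed by 25 samples relative to x2.
fs = 16000
s = np.random.randn(fs)
x1 = np.concatenate((np.zeros(25), s))
x2 = np.concatenate((s, np.zeros(25)))
tau = gcc_phat_tdoa(x1, x2, fs)   # approximately 25 / fs
```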
- P4-2 Learning Robust Spatial Representations from Binaural Audio through Feature Distillation
Holger Severin Bovbjerg, Jan Østergaard, Jesper Jensen, Shinji Watanabe, Zheng-Hua Tan
Abstract
Recently, deep representation learning has shown strong performance in multiple audio tasks. However, its use for learning spatial representations from multichannel audio is underexplored. We investigate the use of a pretraining stage based on feature distillation to learn a robust spatial representation of binaural speech without the need for data labels. In this framework, spatial features are computed from clean binaural speech samples to form prediction labels. These clean features are then predicted from corresponding augmented speech using a neural network. After pretraining, we discard the spatial feature predictor and use the learned encoder weights to initialize a DoA estimation model, which we fine-tune for DoA estimation. Our experiments demonstrate that the pretrained models show improved performance in noisy and reverberant environments after fine-tuning for direction-of-arrival estimation, when compared to fully supervised models and classic signal processing methods.
- P4-3 Beamforming with Interaural Time-To-Level Difference Conversion for Hearing Loss Compensation
Johannes W. de Vries, Timm-Jonas Bäumer, Stephan Töpken, Richard Heusdens, Steven van de Par, Richard C. Hendriks
Abstract
A common hearing impairment is an insensitivity to interaural time differences (ITDs). This can strongly limit the binaural benefit in intelligibility, but could be overcome by converting the low-frequency ITDs into corresponding interaural level differences (ILDs). This approach has been validated to be effective, but with a procedure that cannot be implemented in real time. In this work, a beamformer is presented that is able to perform this binaural cue conversion based on noisy mixed data, optionally in conjunction with noise and interference suppression. Simulations show that this beamformer is able to perform the conversion well, provided accurate estimates of the acoustic transfer functions (ATFs) are available. However, the binaural cue conversion has a detrimental effect on the noise/interference reduction for higher numbers of interferers. For a low number of interferers, the proposed beamformer could be a promising alternative for reduced ITD sensitivity, which should be verified through listening tests.
- P4-4 Soft-Constrained Spatially Selective Active Noise Control for Open-fitting Hearables
Tong Xiao, Reinhild Roden, Matthias Blau, Simon Doclo
Abstract
Recent advances in spatially selective active noise control (SSANC) using multiple microphones have enabled hearables to suppress undesired noise while preserving desired speech from a specific direction. Aiming to achieve minimal speech distortion, a hard constraint has been used in previous work in the optimization problem to compute the control filter. In this work, we propose a soft-constrained SSANC system that uses a frequency-independent parameter to trade off between speech distortion and noise reduction. We derive both time- and frequency-domain formulations, and show that conventional active noise control and hard-constrained SSANC represent two limiting cases of the proposed design. We evaluate the system through simulations using a pair of open-fitting hearables in an anechoic environment with one speech source and two noise sources. The simulation results validate the theoretical derivations and demonstrate that for a broad range of the trade-off parameter, the signal-to-noise ratio and the speech quality and intelligibility in terms of PESQ and ESTOI can be substantially improved compared to the hard-constrained design.
- P4-5 Low-Rank Adaptation of Deep Prior Neural Networks For Room Impulse Response Reconstruction
Mirco Pezzoli, Federico Miotello, Shoichi Koyama, Fabio Antonacci
Abstract
The Deep Prior framework has emerged as a powerful generative tool which can be used for reconstructing sound fields in an environment from few sparse pressure measurements. It employs a neural network that is trained solely on a limited set of available data and acts as an implicit prior which guides the solution of the underlying optimization problem. However, a significant limitation of the Deep Prior approach is its inability to generalize to new acoustic configurations, such as changes in the position of a sound source. As a consequence, the network must be retrained from scratch for every new setup, which is both computationally intensive and time-consuming. To address this, we investigate transfer learning in Deep Prior via Low-Rank Adaptation (LoRA), which enables efficient fine-tuning of a pre-trained neural network by introducing a low-rank decomposition of trainable parameters, thus allowing the network to adapt to new measurement sets with minimal computational overhead. We embed LoRA into a MultiResUNet-based Deep Prior model and compare its adaptation performance against full fine-tuning of all parameters as well as classical retraining, particularly in scenarios where only a limited number of microphones are used. The results indicate that fine-tuning, whether done completely or via LoRA, is especially advantageous when the source location is the sole changing parameter, preserving high physical fidelity, and highlighting the value of transfer learning for acoustics applications.
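LoRA itself is the standard low-rank update sketched below for a linear layer; the paper applies the same idea inside a MultiResUNet-based Deep Prior model, which is not reproduced here. `LoRALinear`, the rank, and the scaling are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pre-trained linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A is (r x in) and B is (out x r)."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # keep the pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no update at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Toy usage: only A and B receive gradients when adapting to a new measurement set.
layer = LoRALinear(nn.Linear(256, 256), r=4)
y = layer(torch.randn(1, 256))
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]  # ['A', 'B']
```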
- P4-6 MB-RIRs: a Synthetic Room Impulse Response Dataset with Frequency-Dependent Absorption Coefficients
Enric Gusó, Joanna Luberadzka, Umut Sayin, Xavier Serra
Abstract
We investigate the effects of four strategies for improving the ecological validity of synthetic room impulse response (RIR) datasets for monaural Speech Enhancement (SE). We implement three features on top of the traditional image source method-based (ISM) shoebox RIRs: multiband absorption coefficients, source directivity and receiver directivity. We additionally consider mesh-based RIRs from the SoundSpaces dataset. We then train a DeepFilternet3 model for each RIR dataset and evaluate the performance on a test set of real RIRs both objectively and subjectively. We find that RIRs which use frequency-dependent acoustic absorption coefficients (MB-RIRs) can obtain +0.51 dB of SDR and a +8.9 MUSHRA score when evaluated on real RIRs. The MB-RIRs dataset is publicly available for free download.
- P4-7 Robust One-step Speech Enhancement via Consistency Distillation
Liang Xu, Longfei Felix Yan, W. Bastiaan Kleijn
Abstract
Diffusion models have shown strong performance in speech enhancement, but their real-time applicability has been limited by multi-step iterative sampling. Consistency distillation has recently emerged as a promising alternative by distilling a one-step consistency model from a multi-step diffusion-based teacher model. However, distilled consistency models are inherently biased towards the sampling trajectory of the teacher model, making them less robust to noise and prone to inheriting inaccuracies from the teacher model. To address this limitation, we propose ROSE-CD: Robust One-step Speech Enhancement via Consistency Distillation, a novel approach for distilling a one-step consistency model. Specifically, we introduce a randomized learning trajectory to improve the model’s robustness to noise. Furthermore, we jointly optimize the one-step model with two time-domain auxiliary losses, enabling it to recover from teacher-induced errors and surpass the teacher model in overall performance. This is the first pure one-step consistency distillation model for diffusion-based speech enhancement, achieving 54 times faster inference speed and superior performance compared to its 30-step teacher model. Experiments on the VoiceBank-DEMAND dataset demonstrate that the proposed model achieves state-of-the-art performance in terms of speech quality. Moreover, its generalization ability is validated on both an out-of-domain dataset and real-world noisy recordings.
- P4-8 Controlling the Parameterized Multi-channel Wiener Filter using a tiny neural network
Eric Grinstein, Ashutosh Pandey, Cole Li, Shanmukha Srinivas, Juan Azcarreta, Jacob Donley, Sanha Lee, Ali Aroudi, Cagdas Bilen
Abstract
Noise suppression and speech distortion are two important aspects to be balanced when designing multi-channel Speech Enhancement (SE) algorithms. Although neural network models have achieved state-of-the-art noise suppression, their non-linear operations often introduce high speech distortion. Conversely, classical signal processing algorithms such as the Parameterized Multi-channel Wiener Filter (PMWF) beamformer offer explicit mechanisms for controlling the suppression/distortion trade-off. In this work, we present NeuralPMWF, a system where the PMWF is entirely controlled using a low-latency, low-compute neural network, resulting in a low-complexity system offering high noise reduction and low speech distortion. Experimental results show that our proposed approach results in significant perceptual and objective speech enhancement in comparison to several competitive baselines using similar computational resources.
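For context, the classical PMWF that the network controls has a closed form per frequency bin, with a trade-off parameter balancing noise suppression against speech distortion. The NumPy sketch below shows that closed form on toy covariances; the tiny neural controller from the paper is not shown, and the variable names are assumptions.

```python
import numpy as np

def pmwf_weights(phi_x: np.ndarray, phi_n: np.ndarray, beta: float, ref: int = 0) -> np.ndarray:
    """Parameterized multi-channel Wiener filter for one frequency bin.
    phi_x: (M, M) speech covariance, phi_n: (M, M) noise covariance,
    beta: trade-off parameter (beta=0 -> MVDR-like, beta=1 -> MWF), ref: reference mic."""
    num = np.linalg.solve(phi_n, phi_x)                   # phi_n^{-1} phi_x
    return num[:, ref] / (beta + np.real(np.trace(num)))

# Toy usage at a single bin with 4 microphones.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
phi_n = A @ A.conj().T + 1e-3 * np.eye(4)                 # Hermitian PSD noise covariance
d = rng.standard_normal((4, 1)) + 1j * rng.standard_normal((4, 1))
phi_x = d @ d.conj().T                                    # rank-1 speech covariance
w = pmwf_weights(phi_x, phi_n, beta=1.0)                  # apply per bin as y = w.conj().T @ x
```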
- P4-9 Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration
Shigeki Karita, Yuma Koizumi, Heiga Zen, Haruko Ishikawa, Robin Scheibler, Michiel Bacchiani
Abstract
Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour-scale data, intended for cleaning the training data of large-scale generative models such as large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multi-lingual, studio-quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher-2’s superior or comparable performance to conventional SR models in word error rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.
- P4-10 Long-Context Modeling Networks for Monaural Speech Enhancement: A Comparative Study
Qiquan Zhang, Moran Chen, Zeyang Song, Xiangyu Zhang, Hexin Liu, Haizhou Li
Abstract
Advanced long-context modeling backbone networks, such as Transformer, Conformer, and Mamba, have demonstrated state-of-the-art performance in speech enhancement. However, a systematic and comprehensive comparative study of these backbones within a unified speech enhancement framework remains lacking. In addition, xLSTM, a more recent and efficient variant of LSTM, has shown promising results in language modeling and as a general-purpose vision backbone. In this paper, we investigate the capability of xLSTM in speech enhancement, and conduct a comprehensive comparison and analysis of the Transformer, Conformer, Mamba, and xLSTM backbones within a unified framework, considering both causal and noncausal configurations. Overall, xLSTM and Mamba achieve better performance than Transformer and Conformer. Mamba demonstrates significantly superior training and inference efficiency, particularly for long speech inputs, whereas xLSTM suffers from the slowest processing speed.
- P4-11 FasTUSS: Faster Task-Aware Unified Source Separation
Francesco Paissan, Gordon Wichern, Yoshiki Masuyama, Ryo Aihara, François G. Germain, Kohei Saijo, Jonathan Le Roux
Abstract
Time-Frequency (TF) dual-path models are currently among the best performing audio source separation network architectures, achieving state-of-the-art performance in speech enhancement, music source separation, and cinematic audio source separation. While they are characterized by a relatively low parameter count, they still require a considerable number of operations, implying a higher execution time. This problem is exacerbated by the trend towards bigger models trained on large amounts of data to solve more general tasks, such as the recently introduced task-aware unified source separation (TUSS) model. TUSS, which aims to solve audio source separation tasks using a single, conditional model, is built upon TF-Locoformer, a TF dual-path model combining convolution and attention layers. The task definition comes in the form of a sequence of prompts that specify the number and type of sources to be extracted. In this paper, we analyze the design choices of TUSS with the goal of optimizing its performance-complexity trade-off. We derive two more efficient models, FasTUSS-8.3G and FasTUSS-11.7G that reduce the original model’s operations by 81% and 73% with minor performance drops of 1.2 dB and 0.4 dB averaged over all benchmarks, respectively. Additionally, we investigate the impact of prompt conditioning to derive a causal TUSS model.
- P4-12 Context-Aware Query Refinement for Target Sound Extraction: Handling Partially Matched Queries
Ryo Sato, Chiho Haruta, Nobuhiko Hiruma, Keisuke Imoto
Abstract
Target sound extraction (TSE) is the task of extracting a target sound specified by a query from an audio mixture. Much prior research has focused on the problem setting under the Fully Matched Query (FMQ) condition, where the query specifies only active sounds present in the mixture. However, in real-world scenarios, queries may include inactive sounds that are not present in the mixture. This leads to scenarios such as the Fully Unmatched Query (FUQ) condition, where only inactive sounds are specified in the query, and the Partially Matched Query (PMQ) condition, where both active and inactive sounds are specified. Among these conditions, the performance degradation under the PMQ condition has been largely overlooked. To achieve robust TSE under the PMQ condition, we propose context-aware query refinement. This method eliminates inactive classes from the query during inference based on the estimated sound class activity. Experimental results demonstrate that while conventional methods suffer from performance degradation under the PMQ condition, the proposed method effectively mitigates this degradation and achieves high robustness under diverse query conditions.
- P4-13 RADE: A Neural Codec for Transmitting Speech over HF Radio Channels
David Rowe, Jean-Marc Valin
Abstract
Speech compression is commonly used to send voice over radio channels in applications such as mobile telephony and two-way push-to-talk (PTT) radio. In classical systems, the speech codec is combined with forward error correction, modulation and radio hardware. In this paper we describe an autoencoder that replaces many of the traditional signal processing elements with a neural network. The encoder takes a vocoder feature set (short-term spectrum, pitch, voicing) and produces discrete-time but continuously valued quadrature amplitude modulation (QAM) symbols. We use orthogonal frequency-division multiplexing (OFDM) to send and receive these symbols over high frequency (HF) radio channels. The decoder converts received QAM symbols to vocoder features suitable for synthesis. The autoencoder has been trained to be robust to additive Gaussian noise and multipath channel impairments while simultaneously maintaining a peak-to-average power ratio (PAPR) of less than 1 dB. Over simulated and real-world HF radio channels we have achieved output speech intelligibility that clearly surpasses existing analog and digital radio systems over a range of SNRs.
- P4-14 The Perception of Phase Intercept Distortion and its Application in Data Augmentation
Venkatakrishnan Vaidyanathapuram Krishnan, Nathaniel Condit-Schultz
Abstract
Phase distortion refers to the alteration of the phase relationships between frequencies in a signal, which can be perceptible. In this paper, we discuss a special case of phase distortion known as phase-intercept distortion, which is created by a frequency-independent phase shift. We hypothesize that, though this form of distortion changes a signal’s waveform significantly, the distortion is imperceptible. Human-subject experiment results are reported which are consistent with this hypothesis. Furthermore, we discuss how the imperceptibility of phase-intercept distortion can be useful for machine learning, specifically for data augmentation. We conducted multiple experiments using phase-intercept distortion as a novel approach to data augmentation, and obtained improved results for audio machine learning tasks.
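A frequency-independent phase shift can be applied via the analytic signal, which is presumably how such an augmentation would be realized in practice (the paper's exact code is not shown; the function name and the random range are illustrative).

```python
import numpy as np
from scipy.signal import hilbert

def phase_intercept_shift(x: np.ndarray, phi: float) -> np.ndarray:
    """Apply a frequency-independent phase shift of phi radians to a real signal.
    Implemented via the analytic signal: y = Re{ (x + j*H{x}) * exp(-j*phi) }."""
    return np.real(hilbert(x) * np.exp(-1j * phi))

# Data-augmentation style usage: a random phase intercept per training example.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
x_aug = phase_intercept_shift(x, rng.uniform(0.0, 2.0 * np.pi))
# phi = pi/2 yields the Hilbert transform of x; phi = pi flips the polarity.
```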
- P4-15 Device-Centric Room Impulse Response Augmentation Evaluated on Room Geometry Inference
Cagdas Tuna, Andreas Walther, Emanuël A. P. Habets
Abstract
Because suitable real-world room impulse responses (RIRs) are scarce for many use cases, synthetic RIRs are indispensable to data-driven audio applications; however, any mismatch between them and real-world RIRs can cause significant performance degradation due to domain shift. This study proposes a device-centric RIR augmentation method that reduces this mismatch by incorporating the acoustic characteristics of transducers from a specific audio device along with a reverberation model to generate synthetic RIRs. As an application example, we evaluate the effect of the proposed augmentation technique on the problem of room geometry inference (RGI) using a smart speaker. A neural network is trained with and without augmented RIRs and tested with measured RIRs. Comparing the performance of both training scenarios indicates a reduced domain shift between augmented synthetic RIRs and real-world measurements. The results show a considerable improvement in the estimation accuracy when augmented RIRs are used for training.
- P4-16 Modeling Multi-Level Hearing Loss for Speech Intelligibility Prediction
Xiajie Zhou, Candy Olivia Mawalim, Masashi Unoki
Abstract
The diverse perceptual consequences of hearing loss severely impede speech communication, but standard clinical audiometry, which is focused on threshold-based frequency sensitivity, does not adequately capture deficits in frequency and temporal resolution. To address this limitation, we propose a speech intelligibility prediction method that explicitly simulates auditory degradations according to hearing loss severity by broadening cochlear filters and applying low-pass modulation filtering to temporal envelopes. Speech signals are subsequently analyzed using the spectro-temporal modulation (STM) representations, which reflect how auditory resolution loss alters the underlying modulation structure. In addition, normalized cross-correlation (NCC) matrices quantify the similarity between the STM representations of clean speech and speech in noise. These auditory-informed features are utilized to train a Vision Transformer–based regression model that integrates the STM maps and NCC embeddings to estimate speech intelligibility scores. Evaluations on the Clarity Prediction Challenge corpus show that the proposed method outperforms the Hearing-Aid Speech Perception Index v2 (HASPI v2) in both mild and moderate-to-severe hearing loss groups, with a relative root mean squared error reduction of 16.5% for the mild group and a 6.1% reduction for the moderate-to-severe group. These results highlight the importance of explicitly modeling listener-specific frequency and temporal resolution degradations to improve speech intelligibility prediction and provide interpretability in auditory distortions.
- P4-17 Moises-Light: Resource-efficient Band-split U-Net For Music Source Separation
Yun-Ning Hung, Igor Pereira, Filip Korzeniowski
Abstract
In recent years, significant advances have been made in music source separation, with model architectures such as dual-path modeling, band-split modules, or transformer layers achieving comparably good results. However, these models often contain a significant number of parameters, posing challenges to devices with limited computational resources in terms of training and practical application. While some lightweight models have been introduced, they generally perform worse compared to their larger counterparts. In this paper, we take inspiration from these recent advances to improve a lightweight model. We demonstrate that with careful design, a lightweight model can achieve comparable SDRs to models with up to 13 times more parameters. Our proposed model, Moises-Light, achieves competitive results in separating four musical stems on the MUSDB-HQ benchmark dataset. The proposed model also demonstrates competitive scalability when using MoisesDB as additional training data.
- P4-18 Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior
Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, George Fazekas
Abstract
Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to an audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all possible configurations equally and relies solely on the embedding space, which can result in unrealistic configurations or biased outcomes. We address this pitfall by introducing a Gaussian prior derived from the DiffVox vocal preset dataset over the parameter space. The resulting optimisation is equivalent to maximum-a-posteriori estimation. Evaluations on vocal effects transfer on the MedleyDB dataset show significant improvements across metrics compared to baselines, including a blind audio effects estimator, nearest-neighbour approaches, and uncalibrated ST-ITO. The proposed calibration reduces the parameter mean squared error by up to 33% and more closely matches the reference style. Subjective evaluations with 16 participants confirm the superiority of our method in limited data regimes. This work demonstrates how incorporating prior knowledge in inference time enhances audio effects transfer, paving the way for more effective and realistic audio processing systems.
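Conceptually, adding the Gaussian prior turns the inference-time search into MAP estimation: the style-embedding distance is augmented with a quadratic penalty around the prior mean. The sketch below assumes a differentiable, user-supplied `style_distance`; the actual ST-ITO optimiser and the DiffVox-derived prior parameters may well differ, so treat this purely as a conceptual illustration.

```python
import torch

def st_ito_map(style_distance, mu: torch.Tensor, cov: torch.Tensor,
               weight: float = 1.0, n_steps: int = 200, lr: float = 1e-2) -> torch.Tensor:
    """Inference-time optimisation of effect parameters theta with a Gaussian prior.
    `style_distance(theta)` returns the embedding distance between the processed
    audio and the reference (a hypothetical, differentiable stand-in here)."""
    cov_inv = torch.linalg.inv(cov)
    theta = mu.clone().requires_grad_(True)          # start from the prior mean
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        diff = theta - mu
        loss = style_distance(theta) + weight * 0.5 * diff @ cov_inv @ diff  # neg. log-posterior
        loss.backward()
        opt.step()
    return theta.detach()

# Toy usage with a stand-in quadratic "style distance" over 8 effect parameters.
mu, cov = torch.zeros(8), torch.eye(8)
theta_hat = st_ito_map(lambda th: ((th - 1.0) ** 2).sum(), mu, cov)
```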
- P4-19 Combolutional Neural Networks
Cameron Churchwell, Minje Kim, Paris Smaragdis
Abstract
Selecting appropriate inductive biases is an essential step in the design of machine learning models, especially when working with audio, where even short clips may contain millions of samples. To this end, we propose the combolutional layer: a learned-delay IIR comb filter and fused envelope detector, which extracts harmonic features in the time domain. We demonstrate the efficacy of the combolutional layer on three information retrieval tasks, evaluate its computational cost relative to other audio frontends, and provide efficient implementations for training. We find that the combolutional layer is an effective replacement for convolutional layers in audio tasks where precise harmonic analysis is important, e.g., piano transcription, speaker classification, and key detection. Additionally, the combolutional layer has several other key benefits over existing frontends, namely: low parameter count, efficient CPU inference, strictly real-valued computations, and improved interpretability.
- P4-20 Self-Supervised Representation Learning with a JEPA Framework for Multi-instrument Music Transcription
Mary Pilataki, Matthias Mauch, Simon Dixon
Abstract
We demonstrate that the Joint-Embedding Predictive Architecture (JEPA) is effective for learning representations suitable for Music Information Retrieval tasks. Specifically, we explore its application to multi-instrument automatic music transcription, focusing on multipitch estimation and instrument recognition. We evaluate the learned representations across multiple settings: (1) finetuning a pretrained JEPA model with transcription supervision, (2) end-to-end training with transcription supervision, (3) training an instrument-aware transcriber on frozen JEPA embeddings, and (4) training an instrument-agnostic transcriber on frozen JEPA embeddings. To assess the structure of the learned representations, we compute Calinski-Harabasz clustering scores with respect to pitch index, pitch class, instrument, and octave. We find that the representations learned by JEPA and its modified version (setting 2) primarily capture instrument identity and pitch height information, rather than pitch class distinctions. Despite this, our results demonstrate promising transcription performance and highlight the potential of non-generative self-supervised learning for multi-instrument music transcription. Code and model configurations are available on GitHub.
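The clustering analysis relies on the standard Calinski-Harabasz index, which is available in scikit-learn. A toy example of scoring frozen embeddings against pitch-class labels is shown below; the random data and shapes are purely illustrative.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

# Hypothetical example: score embeddings (n_frames x dim) against pitch-class labels.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 64))
pitch_class = rng.integers(0, 12, size=1000)          # 12 pitch classes as cluster labels
score = calinski_harabasz_score(embeddings, pitch_class)
# Higher scores indicate tighter, better-separated clusters with respect to that label.
```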
- P4-21 DiTVC: One-Shot Voice Conversion via Diffusion Transformer with Environment and Speaking Rate Cloning
Yunyun Wang, Jiaqi Su, Adam Finkelstein, Rithesh Kumar, Ke Chen, Zeyu Jin
Abstract
Traditional zero-shot voice conversion methods typically extract a speaker embedding from a reference recording first and then generate the source speech content in the target speaker’s voice by conditioning on that embedding. However, this process often overlooks time-dependent speaker characteristics, such as voice dynamics and speaking rates, as well as environmental acoustic properties of the reference recording. To address these limitations, we propose a one-shot voice conversion framework capable of replicating not only voice timbre but also acoustic properties. Our model is built upon Diffusion Transformers (DiT) and conditioned on a designed content representation for acoustic cloning. In addition, we introduce specific augmentations during training to enable accurate speaking rate cloning. Both objective and subjective evaluations demonstrate that our method outperforms existing approaches in terms of audio quality, speaker similarity, and environmental acoustic similarity, while effectively capturing the speaking rate distribution of target speakers. Audio samples are available at ditvc.github.io.
- P4-22 Learning to Upsample and Upmix Audio in the Latent Domain
Dimitrios Bralios, Paris Smaragdis, Jonah Casebeer
Abstract
Neural audio autoencoders create compact latent representations that preserve perceptually important information, serving as the foundation for both modern audio compression systems and generation approaches like next-token prediction and latent diffusion. Despite their prevalence, most audio processing operations, such as spatial and spectral up-sampling, still inefficiently operate on raw waveforms or spectral representations rather than directly on these compressed representations. We propose a framework that performs audio processing operations entirely within an autoencoder’s latent space, eliminating the need to decode to raw audio formats. Our approach dramatically simplifies training by operating solely in the latent domain, with a latent L1 reconstruction term, augmented by a single latent adversarial discriminator. This contrasts sharply with raw-audio methods that typically require complex combinations of multi-scale losses and discriminators. Through experiments in bandwidth extension and mono-to-stereo up-mixing, we demonstrate computational efficiency gains of up to 100× while maintaining quality comparable to post-processing on raw audio. This work establishes a more efficient paradigm for audio processing pipelines that already incorporate autoencoders, enabling significantly faster and more resource-efficient workflows across various audio tasks.
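The training recipe described, a latent L1 reconstruction term plus a single latent-domain adversarial discriminator, can be sketched as one hinge-GAN training step. The module shapes and names below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

def latent_training_step(upmixer, disc, z_mono, z_stereo, opt_g, opt_d, adv_weight=0.1):
    """One training step using only latent-domain losses: a latent L1 reconstruction
    term plus a single adversarial (hinge) term from a latent discriminator."""
    # Discriminator step on real stereo latents vs. generated ones.
    opt_d.zero_grad()
    z_fake = upmixer(z_mono).detach()
    d_loss = torch.relu(1.0 - disc(z_stereo)).mean() + torch.relu(1.0 + disc(z_fake)).mean()
    d_loss.backward()
    opt_d.step()
    # Generator step: latent L1 + adversarial term.
    opt_g.zero_grad()
    z_hat = upmixer(z_mono)
    g_loss = (z_hat - z_stereo).abs().mean() - adv_weight * disc(z_hat).mean()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Toy modules and latents; shapes are made up (64-dim mono latents -> 128-dim "stereo" latents).
upmixer = nn.Conv1d(64, 128, kernel_size=3, padding=1)
disc = nn.Sequential(nn.Conv1d(128, 32, 4, 2, 1), nn.LeakyReLU(0.2),
                     nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(upmixer.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
z_mono, z_stereo = torch.randn(8, 64, 100), torch.randn(8, 128, 100)
d_loss, g_loss = latent_training_step(upmixer, disc, z_mono, z_stereo, opt_g, opt_d)
```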
- P4-23 Head-Related Transfer Function Individualization Using Anthropometric Features and Spatially Independent Latent Representation
Ryan Niu, Shoichi Koyama, Tomohiko Nakamura
Abstract
A method for head-related transfer function (HRTF) individualization from the subject’s anthropometric parameters is proposed. Due to the high cost of measurement, the number of subjects included in many HRTF datasets is limited, and the number of datasets that include anthropometric parameters is even smaller. Therefore, HRTF individualization based on deep neural networks (DNNs) is a challenging task. We propose an HRTF individualization method using the latent representation of HRTF magnitude obtained through an autoencoder conditioned on sound source positions, which makes it possible to combine multiple HRTF datasets with different measured source positions and makes the network training tractable by reducing the number of parameters to be estimated from anthropometric parameters. Experimental evaluation shows that higher estimation accuracy is achieved by the proposed method compared to current DNN-based methods.