🍮 FLAM: Frame-Wise Language-Audio Modeling

Yusong Wu1,2, Christos Tsirigotis2, Ke Chen1, Cheng-Zhi Anna Huang3, Aaron Courville2,4, Oriol Nieto1, Prem Seetharaman1, Justin Salamon1

1Adobe Research    2Mila - Quebec AI Institute, Université de Montréal    3Massachusetts Institute of Technology    4Canada CIFAR AI Chair

ICML 2025 · Paper · Dataset
@inproceedings{wu2025flam,
  title={{FLAM}: Frame-Wise Language-Audio Modeling},
  author={Yusong Wu and Christos Tsirigotis and Ke Chen and Cheng-Zhi Anna Huang and Aaron Courville and Oriol Nieto and Prem Seetharaman and Justin Salamon},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=7fQohcFrxG}
}

Abstract

FLAM Diagram

Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporally aware labels or unsupervised training to improve frame-wise capabilities, but they still lack the fine-grained labeling capability needed to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalance, during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions, and simulation. Experimental results and case studies demonstrate that FLAM significantly improves open-vocabulary localization while maintaining strong performance in global retrieval and downstream tasks.
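
To make the idea of a frame-wise objective with logit adjustment concrete, below is a minimal PyTorch sketch. It assumes a per-frame sigmoid (binary) formulation and a simple log-prior offset for label imbalance; all names here (frame_wise_logit_adjusted_loss, event_prior, tau) are illustrative and are not the paper's actual implementation or API.

```python
import torch
import torch.nn.functional as F


def frame_wise_logit_adjusted_loss(frame_embeds, text_embeds, targets,
                                   event_prior, temperature=0.07, tau=1.0):
    """Illustrative per-frame audio-text objective with logit adjustment.

    frame_embeds: (B, T, D) L2-normalized per-frame audio embeddings
    text_embeds:  (E, D)    L2-normalized text embeddings for E event queries
    targets:      (B, T, E) binary labels, 1 where an event is active in a frame
    event_prior:  (E,)      empirical per-event positive rate (label frequency)
    """
    # Frame-by-event similarity logits, shape (B, T, E).
    logits = torch.einsum('btd,ed->bte', frame_embeds, text_embeds) / temperature

    # Logit adjustment: add the log of the empirical event frequency to the
    # training logits so the loss accounts for label imbalance; at inference
    # the unadjusted logits would be used.
    adjusted = logits + tau * torch.log(event_prior.clamp_min(1e-8))

    # Per-frame binary objective: each (frame, event) pair is an independent
    # sigmoid classification rather than a softmax over all pairs.
    return F.binary_cross_entropy_with_logits(adjusted, targets.float())


if __name__ == "__main__":
    B, T, E, D = 2, 100, 8, 512
    frame_embeds = F.normalize(torch.randn(B, T, D), dim=-1)
    text_embeds = F.normalize(torch.randn(E, D), dim=-1)
    targets = (torch.rand(B, T, E) < 0.1).float()
    event_prior = targets.mean(dim=(0, 1))  # per-event positive rate
    print(frame_wise_logit_adjusted_loss(frame_embeds, text_embeds,
                                         targets, event_prior).item())
```

Treating each (frame, event) pair as an independent binary decision keeps memory linear in the number of frames and event queries, which is one way to realize a memory-efficient frame-wise objective; FLAM's exact formulation and calibration may differ from this sketch.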

On this page, we showcase audio samples and sound event detection results from the FLAM experiments discussed in the paper, as well as FLAM detection results on real-world audio examples not seen during training. Please allow a few seconds for the videos and audio to load.

FLAM Zero-Shot Detection Results on Real-world Examples

We present FLAM's zero-shot, open-vocabulary sound event detection on real-world audio examples that were not seen during training. Detection results for events that are not present in the audio are shown in red. The title of each video links to the corresponding YouTube video.

Sound Event Detection Results

ASFX-SED Dataset (zero-shot, open-vocabulary)

example 1

example 2

example 3

example 4

Synthetic Held-out Dataset (zero-shot, open-vocabulary)

example 1

example 2

example 3

example 4

AudioSet-Strong Dataset

example 1

example 2

example 3

example 4