For more detailed information about our ASFX-SED dataset and its evaluation, please visit our ASFX-SED Dataset page.
FLAM Zero-Shot Detection Results on Real-world Examples
We present FLAM's zero-shot, open-vocabulary sound event detection on real-world audio examples that were not seen during training. Detection results for events that are not present in the audio are shown in red. The title of each video links to the corresponding YouTube video.
Sound Event Detection Results
ASFX-SED Dataset (zero-shot, open-vocabulary)
example 1
example 2
example 3
example 4
Synthetic Held-out Dataset (zero-shot, open-vocabulary)
example 1
example 2
example 3
example 4
AudioSet-Strong Dataset
example 1
example 2
example 3
example 4
FLAM

(Left) Traditional ALMs derive global audio and text embeddings, treating each ground-truth audio-text pair as a positive sample and all other pairs in the batch as negatives. (Right) FLAM instead processes frame-level audio features and trains on temporally labeled sound events paired with text descriptions. FLAM enables finer-grained localization of events based on text queries, yielding accurate open-vocabulary sound event detection.
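
To make the frame-level idea concrete, here is a minimal PyTorch sketch of scoring per-frame audio embeddings against free-text query embeddings. This is an illustration only, not FLAM's actual training code: the tensor names (frame_emb, text_emb, frame_labels), the temperature value, and the per-frame binary cross-entropy objective are assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def frame_level_loss(frame_emb, text_emb, frame_labels, temperature=0.07):
    """Illustrative frame-level training objective (not FLAM's exact loss).

    frame_emb:    (batch, frames, dim)          frame-level audio embeddings
    text_emb:     (num_queries, dim)            text embeddings of event descriptions
    frame_labels: (batch, frames, num_queries)  1 where the described event is active
    """
    # Normalize so the dot product is a cosine similarity.
    frame_emb = F.normalize(frame_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity logits between every frame and every text query.
    logits = torch.einsum("bfd,qd->bfq", frame_emb, text_emb) / temperature

    # Each (frame, query) pair is scored independently against the
    # temporally labeled ground truth, which is what allows localization.
    return F.binary_cross_entropy_with_logits(logits, frame_labels.float())


def detect(frame_emb, text_emb, threshold=0.5, temperature=0.07):
    """Zero-shot detection: score every frame against free-text queries."""
    frame_emb = F.normalize(frame_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    scores = torch.sigmoid(
        torch.einsum("bfd,qd->bfq", frame_emb, text_emb) / temperature
    )
    return scores > threshold  # (batch, frames, num_queries) activity mask
```

At inference time, thresholding the per-frame scores for an arbitrary text query yields a frame-level activity curve, in contrast to a single clip-level similarity score from a global-embedding ALM.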