Response to Reviewer bqpP

Experiment Details

This response presents the results of additional ablation experiments on FLAM. Two ablated configurations of FLAM were used: one trained without the global loss, and one with higher temporal resolution (FLAM - L=128), described below.

The FLAM - L=128 model uses the same HTSAT architecture as FLAM, but with one swin-transformer layer removed, yielding 1/4 of the original downsampling rate and therefore 128 output frames (4x) instead of the 32 frames in FLAM. As with FLAM, we pretrain this new audio encoder on AudioSet and initialize FLAM training from the pretrained parameters. The new audio encoder has fewer parameters but requires more compute during training because of the larger feature size.
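As a rough sketch of the resulting temporal resolution (the spectrogram length below is an assumed, illustrative value; the real number depends on FLAM's STFT/mel settings):

```python
# Minimal sketch of the frame-count arithmetic.
spec_frames = 1024                      # hypothetical input spectrogram length

flam_downsample = spec_frames // 32     # FLAM outputs 32 frames
l128_downsample = flam_downsample // 4  # removing one stage -> 1/4 the downsampling rate

print(spec_frames // flam_downsample)   # 32  (FLAM)
print(spec_frames // l128_downsample)   # 128 (FLAM - L=128)
```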

Experiment Results

The results of the experiments are shown in the tables below. Removing the global loss marginally improves SED performance but significantly reduces retrieval and zero-shot classification performance. Using the audio encoder with the larger output resolution results in a similar trade-off: SED performance improves marginally while retrieval and zero-shot classification performance degrade. Overall, we believe the original FLAM model achieves a good trade-off between SED, retrieval, and zero-shot classification performance, with sufficient temporal resolution.

Tables

[Table: SED Results]
[Table: Retrieval Results]
[Table: Zero-shot Classification Results]

On the impact and intuition of the logit scale

Intuitively, a smaller logit scale requires a larger cosine distance between negative frame and text embeddings to achieve the same loss, which helps the model capture finer distinctions in cosine similarity. Experimentally, we clarify our findings in Figure 3: the F1 performance drops more when we remove the per-text bias (but retain the per-text scale) than when we remove the per-text scale (but retain the per-text bias). The per-text logit bias therefore plays a more significant role in performance than the per-text logit scale, although the latter is still beneficial.
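To make this concrete, a small numeric sketch (assuming a sigmoid applied to the scaled cosine similarity with the bias fixed to zero; the scale values are arbitrary and not FLAM's actual values):

```python
import math

# For a sigmoid on (alpha * cos_sim + bias), reaching the same probability for a
# negative frame-text pair requires a larger cosine margin when alpha is smaller.
target_neg_prob = 0.1  # desired probability assigned to a negative pair

for alpha in (10.0, 30.0):
    # cosine similarity needed so that sigmoid(alpha * cos) == target_neg_prob
    cos_needed = math.log(target_neg_prob / (1 - target_neg_prob)) / alpha
    print(f"alpha={alpha:>4}: negatives must reach cos <= {cos_needed:.3f}")

# alpha=10 requires cos <= -0.220, alpha=30 only cos <= -0.073: the smaller
# logit scale pushes negatives further apart in cosine space for the same loss.
```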

We train the text-dependent logit scale $\alpha^t$ in a similar manner, with another MLP appended to the text feature extractor, giving $\alpha^t(y) = \mathrm{MLP}^\alpha(E^t(y))$. Unlike the per-text bias, we update $\mathrm{MLP}^\alpha$ via $\mathcal{L}_{\mathrm{SED}}$ in Eq. 4.
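For concreteness, a minimal PyTorch sketch of this construction (the embedding size, MLP depth, and the softplus parameterization are illustrative assumptions, not FLAM's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerTextLogitScale(nn.Module):
    """MLP^alpha: maps a text embedding E^t(y) to a scalar logit scale alpha^t(y)."""

    def __init__(self, text_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.mlp_alpha = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Softplus keeps the scale positive; the actual parameterization may differ.
        return F.softplus(self.mlp_alpha(text_emb)).squeeze(-1)

# Usage sketch: the per-text scale multiplies frame-text cosine similarities, so
# gradients flow back into MLP^alpha through the SED loss (Eq. 4).
text_emb = F.normalize(torch.randn(8, 512), dim=-1)        # E^t(y), batch of 8 texts
frame_emb = F.normalize(torch.randn(8, 128, 512), dim=-1)  # per-frame audio embeddings
alpha = PerTextLogitScale()(text_emb)                       # shape (8,)
logits = alpha[:, None] * torch.einsum("btd,bd->bt", frame_emb, text_emb)
```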

Impact Statement

This work introduces FLAM, a model for frame-wise audio-language alignment to improve sound event detection using natural language queries. Our goal is to advance the field of multimodal learning by enabling fine-grained and interpretable audio understanding. FLAM may benefit applications such as content indexing, accessibility, and multimedia retrieval. While we do not foresee significant ethical risks, we encourage responsible use of the model in real-world scenarios.