An Automated Audio Captioning Survey

I give an overview of the task of audio captioning.

Simon Li

This blog provides a brief overview, based on the survey, of the task of Automated Audio Captioning (AAC).

What is Audio Captioning?

Automated Audio Captioning (AAC) is a cross-modal translation task that uses natural language to describe the events in an audio clip. Generally speaking, audio captioning generates a one-sentence description of the predominant audio events and scenes occurring in the clip. An example caption could be “A person was walking on a sidewalk adjacent to a school where children were playing on the playground”, describing both the scene and the sound events in the audio clip. More specifically, the elements that appear in captions range from low-level sound events to high-level acoustic scenes.

Why Audio Captioning?

Visual captioning, a similar task to audio captioning but for images and videos, has been investigated since the early days of deep learning. Compared with visual captioning, audio captioning exploits complementary yet rich information carried by audio, which is often unavailable to visual perception. For instance, if we want to identify a person at night and determine whether they are speaking, audio signals would be more reliable than visual signals, since they are less affected by the dark environment while carrying unique auditory information.

Moreover, compared to other audio understanding tasks, audio captioning is better suited for human-machine interaction because it generates unrestricted natural language that is easy for humans to understand.

What has been done in the past?

Unlike visual (i.e. image and video) captioning, which dates back to the early days of deep learning, audio captioning started relatively recently, around 2017. It was not until audio captioning became a task in the DCASE (Detection and Classification of Acoustic Scenes and Events) challenge from 2020 to 2022 that the topic saw an increase in interest. Resembling similar tasks such as Audio Tagging (AT) and Sound Event Detection (SED), the encoder-decoder framework has been adopted as the standard recipe for solving the audio captioning task. In this framework, the audio clip is passed into the encoder as a spectrogram, from which the audio features are extracted. The audio features are then passed to the decoder, which generates captions based on the given features.
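
To make the recipe concrete, here is a minimal PyTorch sketch of the encoder-decoder pipeline. The module names (AudioEncoder, CaptionDecoder), dimensions, and layer choices are illustrative assumptions, not the architecture of any specific published system.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Maps a (batch, time, n_mels) spectrogram to a sequence of audio features."""
    def __init__(self, n_mels=64, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)  # stand-in for a CNN/CRNN stack

    def forward(self, spectrogram):
        return self.proj(spectrogram)           # (batch, time, d_model)

class CaptionDecoder(nn.Module):
    """Predicts the next word given audio features and previously generated words."""
    def __init__(self, vocab_size=5000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, audio_feats, prev_tokens):
        # Condition the decoder on a global audio representation (mean over time).
        context = audio_feats.mean(dim=1, keepdim=True)   # (batch, 1, d_model)
        x = self.embed(prev_tokens) + context              # broadcast over word steps
        h, _ = self.rnn(x)
        return self.out(h)                                 # (batch, steps, vocab)

# Example: one forward pass with random data.
spec = torch.randn(2, 500, 64)            # 2 clips, 500 frames, 64 mel bins
tokens = torch.randint(0, 5000, (2, 12))  # previously generated word ids
feats = AudioEncoder()(spec)
logits = CaptionDecoder()(feats, tokens)  # (2, 12, 5000)
```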

Encoder

After the input audio is converted into a spectrogram, the result is typically passed into a neural network to learn the audio features; a minimal sketch of this front-end is shown below. The common neural network architectures used in past publications are described next.
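
A sketch of the usual front-end, waveform to log-mel spectrogram, using librosa. The filename and parameter values (sample rate, FFT size, 64 mel bins) are common choices used here for illustration, not requirements.

```python
import librosa

y, sr = librosa.load("clip.wav", sr=32000)           # "clip.wav" is a placeholder path
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=320, n_mels=64
)
log_mel = librosa.power_to_db(mel)                    # (n_mels, n_frames)
print(log_mel.shape)                                  # e.g. (64, ~1000) for a 10 s clip
```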

Recurrent Neural Networks (RNNs) are known for their ability to process sequential data and model temporal relationships between inputs. However, using an RNN by itself as an encoder does not give promising results. The reason is that the inputs are often long sequences, while RNNs lack the ability to model long-range time dependencies. In addition, RNNs often compress the information excessively, losing the ability to support fine-grained descriptions.

The Convolutional Neural Network (CNN) is an approach that has achieved great success in the field of Computer Vision. CNNs significantly outperform RNNs and are now the most common approach for audio encoding, owing to their time-invariant characteristics and ability to model local dependencies. However, CNNs still struggle to model long-range time dependencies in long signals.

A combination of CNN and RNN called CRNN was proposed to model both local and long-range time dependencies. A CRNN first uses a CNN to extract features, and then passes the extracted features to RNN layers that model the temporal relationships between them. In the end, CRNN proves to be slightly better than CNN; however, it requires much more computation.
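
Here is a sketch of what such a CRNN encoder could look like in PyTorch, assuming a log-mel spectrogram input of shape (batch, 1, time, n_mels). The layer sizes are illustrative, not taken from a specific paper.

```python
import torch
import torch.nn as nn

class CRNNEncoder(nn.Module):
    def __init__(self, n_mels=64, d_model=256):
        super().__init__()
        # CNN part: local time-frequency patterns, with pooling along both axes.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # RNN part: long-range temporal dependencies over the CNN feature sequence.
        self.rnn = nn.GRU(128 * (n_mels // 4), d_model,
                          batch_first=True, bidirectional=True)

    def forward(self, spec):                       # spec: (batch, 1, time, n_mels)
        x = self.cnn(spec)                         # (batch, 128, time/4, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)       # (batch, time/4, 128 * n_mels/4)
        out, _ = self.rnn(x)                       # (batch, time/4, 2 * d_model)
        return out

feats = CRNNEncoder()(torch.randn(2, 1, 500, 64))  # -> (2, 125, 512)
print(feats.shape)
```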

Decoder

In existing works, decoding, or text generation, adopts an auto-regressive manner: when generating a caption, each word is conditioned on the audio features extracted by the encoder and on the words previously predicted by the decoder.
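
A minimal sketch of auto-regressive (greedy) decoding. It assumes `decoder` is any module that maps (audio features, tokens so far) to next-word logits, such as the hypothetical CaptionDecoder sketched earlier; the BOS/EOS ids are placeholders for a real vocabulary.

```python
import torch

def greedy_decode(decoder, audio_feats, bos_id=1, eos_id=2, max_len=30):
    tokens = torch.full((audio_feats.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = decoder(audio_feats, tokens)            # (batch, steps, vocab)
        next_word = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_word], dim=1)   # condition on own output
        if (next_word == eos_id).all():
            break
    return tokens
```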

Since sentences are sequential data, the RNN is a popular choice for language decoding. With a simple mean-pooling method, an RNN decoder is able to obtain a good global audio representation and detect which audio events are present in the whole audio clip. However, this method does not consider relationships between audio features and is thus unable to capture fine-grained details. Attention has been introduced to address this issue, attending to the whole audio feature sequence and placing more weight on the more informative features. In summary, the RNN with attention is widely used as the decoder for text generation; however, it still struggles with long-range time dependencies.
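
To illustrate the difference, here is a small sketch contrasting plain mean pooling with an attention mechanism that re-weights the audio frames at every decoding step. The AdditiveAttention module and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.w = nn.Linear(2 * d_model, d_model)
        self.v = nn.Linear(d_model, 1)

    def forward(self, decoder_state, audio_feats):
        # decoder_state: (batch, d_model); audio_feats: (batch, time, d_model)
        state = decoder_state.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        scores = self.v(torch.tanh(self.w(torch.cat([state, audio_feats], dim=-1))))
        weights = scores.softmax(dim=1)                  # (batch, time, 1)
        return (weights * audio_feats).sum(dim=1)        # weighted context vector

audio_feats = torch.randn(2, 125, 256)
state = torch.randn(2, 256)
mean_context = audio_feats.mean(dim=1)                   # same vector at every step
attn_context = AdditiveAttention()(state, audio_feats)   # changes with the decoder state
```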

Transformer models have also been employed as decoders for audio captioning and have achieved state-of-the-art performance. Built on the self-attention mechanism, the Transformer learns where within a sequence to pay more attention, hence the name self-attention. To learn more about attention, check out my blog here. Attention in general can be formulated as:

$$\text{Attn}(Q,K,V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
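
The formula can be implemented directly. Below is a small single-head sketch; the tensor shapes (e.g., 12 decoder positions attending over 125 audio frames) are illustrative.

```python
import math
import torch

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, len_q, len_k)
    return scores.softmax(dim=-1) @ V                   # (batch, len_q, d_v)

Q = torch.randn(2, 12, 64)    # e.g. decoder word positions
K = torch.randn(2, 125, 64)   # e.g. encoder audio frames
V = torch.randn(2, 125, 64)
out = attention(Q, K, V)      # (2, 12, 64)
```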

Due to the lack of data for audio captioning, only shallow Transformers with around two blocks are used in recent works. In summary, the Transformer not only achieves state-of-the-art results but is also more computationally efficient than the RNN during training.
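
As an illustration of such a shallow decoder, here is a sketch that stacks two of PyTorch's built-in Transformer decoder blocks on top of hypothetical encoder features. The model size, four attention heads, and vocabulary size are assumptions, not values from a specific paper.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 256, 5000
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)     # only two blocks
out_proj = nn.Linear(d_model, vocab_size)

audio_feats = torch.randn(2, 125, d_model)               # encoder output (memory)
tokens = torch.randint(0, vocab_size, (2, 12))           # previously generated words
causal_mask = torch.triu(torch.ones(12, 12, dtype=torch.bool), diagonal=1)

hidden = decoder(embed(tokens), audio_feats, tgt_mask=causal_mask)
logits = out_proj(hidden)                                # (2, 12, vocab_size)
```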

Auxiliary Information

To help the decoder generate better captions, researchers have attempted to add auxiliary information such as keywords or sentences. Some have retrieved the tag of the most similar audio clip from AudioSet and aligned it with the audio features via an attention module. Others have trained a decoder that jointly estimates the keywords and the caption of an audio clip. However, the accuracy of the keywords can become a bottleneck, as wrong keywords may adversely affect the generation. Thus, most works still employ the standard decoder without the aid of auxiliary information.
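
As a rough illustration of how keywords might be fused with the audio features, here is a hypothetical sketch in which the audio frames attend to embedded keywords via a cross-attention module. The keyword vocabulary and the fusion scheme are assumptions for illustration only, not the method of any particular paper.

```python
import torch
import torch.nn as nn

d_model = 256
keyword_vocab = {"speech": 0, "children": 1, "footsteps": 2, "traffic": 3}
keyword_embed = nn.Embedding(len(keyword_vocab), d_model)
fuse = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

audio_feats = torch.randn(2, 125, d_model)
keywords = torch.tensor([[0, 1], [2, 3]])                # keyword ids per clip
kw = keyword_embed(keywords)                             # (2, 2, d_model)

# Audio frames query the keyword embeddings; the output is added as extra context.
aux_context, _ = fuse(query=audio_feats, key=kw, value=kw)
conditioned_feats = audio_feats + aux_context            # fed to the decoder as usual
```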

Next Steps for Audio Captioning

As audio captioning is a relatively new field, there is still a large gap between the performance of the proposed systems and that of humans. The following are some areas that I think could significantly improve model performance.

Since the sizes of most datasets are limited and do not cover all possible real-life scenarios, the resulting systems may be biased and unable to generalize well to unseen contexts. However, there is an abundance of weakly labelled data from online sources. Therefore, self-supervised audio captioning could help learn more robust audio-text representations.
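
One possible direction is a CLIP-style contrastive objective over paired audio and text embeddings; the sketch below shows such a loss with random placeholder embeddings standing in for real encoder outputs.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # Normalise and compute pairwise similarities between all audio/text pairs.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                    # (batch, batch)
    targets = torch.arange(a.size(0))                 # matching pairs lie on the diagonal
    # Symmetric cross-entropy: audio-to-text and text-to-audio.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```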

With the advent of Large Language Models (LLMs), incorporating large pre-trained language models could significantly improve these systems. One reason is that LLMs have shown great potential as text-generation decoders. In addition, LLMs could simply paraphrase generated captions to enhance their naturalness and diversity, ensuring that captions blend seamlessly with human-generated content and contain rich and varied expressions.

Auxiliary information could also be further studied, as current attempts have not brought significant improvements and may not work well for all datasets.

Exposure bias at test time, caused by conditioning on previously generated text, which can lead to error accumulation, is another potential area of improvement. There have been attempts to use reinforcement learning to address this issue, with success according to the evaluation metrics. Although the metrics are satisfactory, reinforcement learning was found to introduce repetitive words and incorrect syntax into the captions. Further studies on designing new objective functions or incorporating human feedback into reinforcement learning could help address these issues.
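
For reference, a minimal sketch of the self-critical, REINFORCE-style objective such approaches typically optimize, assuming the per-clip caption log-probabilities and CIDEr-style rewards have already been computed; the tensors below are random placeholders for real decoder outputs and metric scores.

```python
import torch

log_prob = torch.randn(4, requires_grad=True)   # summed log p(sampled caption) per clip
sampled_reward = torch.rand(4)                   # e.g. CIDEr of the sampled captions
baseline_reward = torch.rand(4)                  # e.g. CIDEr of the greedy captions

advantage = sampled_reward - baseline_reward     # positive if sampling beat the baseline
loss = -(advantage * log_prob).mean()            # minimising this raises expected reward
loss.backward()                                  # gradients flow only through log_prob
```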