EnCLAP - Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

EnCLAP is a novel framework for automated audio captioning that combines two pretrained acoustic representation models with a pretrained language model.

Jaeyeon Kim, Jaeyoon Jung, Jinjoo Lee, Sang Hoon Woo

This paper proposes a novel framework for automated audio captioning that combines two acoustic representation models, EnCodec and CLAP, with a pretrained language model, BART. Moreover, the proposed training objective, masked codec modeling, is shown to improve acoustic awareness, and the framework achieves state-of-the-art results on AudioCaps and Clotho.

Introduction

Automated audio captioning (AAC) is the task of generating a natural language description that captures the events in a given audio clip. Despite significant recent advances in neural networks, a substantial gap remains between model performance and human performance. One contributing factor is the scarcity of audio-text paired data. The most commonly used datasets, AudioCaps and Clotho, contain around 50k and 20k pairs respectively, while the image captioning dataset COCO contains more than 414k captions.

To address this data scarcity, past research has extensively utilized transfer learning, employing audio tagging or sound event detection (SED) models as pretraining for the audio encoder, while utilizing pretrained large language models such as GPT-2 and BART for decoding.

In this work, the authors propose EnCLAP, a novel framework that utilizes two pretrained acoustic feature encoders and one pretrained language model. Specifically, CLAP is chosen to generate sequence-level acoustic representations. In addition, EnCodec is employed to produce time step-level acoustic representations, as the authors hypothesize that a pretrained language model can leverage discrete inputs better than continuous ones. Lastly, a fine-tuned BART model decodes these representations into captions, aided by an auxiliary training task that promotes acoustic awareness.

The contributions of the paper are as follows: (1) EnCLAP, a framework that combines EnCodec, CLAP, and BART for automated audio captioning; (2) masked codec modeling, an auxiliary training task that improves the decoder's acoustic awareness; and (3) state-of-the-art results on AudioCaps and Clotho.

Proposed Method

EnCLAP

The proposed method generates a caption from a given audio file in two stages. First, the EnCodec and CLAP encoders map the audio to a discrete acoustic code matrix and a sequence-level representation, respectively. Then, the two outputs are concatenated and passed to a pretrained BART model for caption generation.
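As a bird's-eye view, here is a minimal, runnable sketch of the two-stage pipeline; every component is a dummy stand-in with assumed shapes, not the paper's actual models.

```python
import torch

# Illustrative stand-ins for the three pretrained components; a real pipeline
# would load EnCodec, CLAP, and a fine-tuned BART checkpoint instead.
N_CODEBOOKS, CODE_LEN, VOCAB_SIZE, D_CLAP = 4, 150, 1024, 512  # assumed sizes

def encodec_encode(wav: torch.Tensor) -> torch.Tensor:
    """Stand-in: returns a discrete acoustic code matrix C of shape [N, L]."""
    return torch.randint(0, VOCAB_SIZE, (N_CODEBOOKS, CODE_LEN))

def clap_embed(wav: torch.Tensor) -> torch.Tensor:
    """Stand-in: returns a sequence-level CLAP audio embedding."""
    return torch.randn(D_CLAP)

def bart_generate(codes: torch.Tensor, clap_emb: torch.Tensor) -> str:
    """Stand-in for the fine-tuned BART captioner."""
    return "a dog barks while birds chirp in the background"

wav = torch.randn(16_000 * 10)               # 10 s of dummy mono audio
codes = encodec_encode(wav)                  # stage 1a: discrete acoustic codes
clap_emb = clap_embed(wav)                   # stage 1b: sequence-level embedding
caption = bart_generate(codes, clap_emb)     # stage 2: caption generation
print(caption)
```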

EnCodec

Note: A codec is a device or program that encodes or decodes a signal.

EnCodec is a neural audio codec based on a convolutional auto-encoder. The paper uses the output of EnCodec's encoder as the discrete acoustic code matrix. Specifically, the encoder maps a waveform to a discrete acoustic code matrix $C \in V^{N \times L}$, where $N$ is the number of codebooks used for Residual Vector Quantization (RVQ), $L$ is the encoded length, and $V$ is the vocabulary of the codebooks.
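As a concrete sketch, a code matrix of this form can be obtained with Meta's `encodec` package roughly as below; the checkpoint, target bandwidth, and file name are assumptions rather than settings taken from the paper.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the 24 kHz EnCodec model; the target bandwidth (and thus the number of
# codebooks N) is an assumption, not a value taken from the paper.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("example.wav")                       # placeholder file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)                                         # [B, channels, T]

with torch.no_grad():
    encoded_frames = model.encode(wav)                         # list of (codes, scale)
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)

# codes[0] is the discrete acoustic code matrix C in V^{N x L}.
print(codes.shape)                                             # e.g. [1, N, L]
```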

CLAP

Contrastive Language-Audio Pretraining (CLAP) maps audio and text into a joint multimodal space through dual encoders trained with a contrastive objective. In this work, only the audio encoder and the projection layers that map to the shared embedding space are used. The output of the encoder and projection serves as the sequence-level semantic representation of the audio.
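For illustration, a sequence-level audio embedding can be extracted with the `laion_clap` package as sketched below; which CLAP checkpoint EnCLAP actually uses is not stated here, so treat the specific module and checkpoint as assumptions.

```python
import laion_clap

# Load a LAION-CLAP model; the checkpoint choice is an assumption.
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # downloads the default pretrained checkpoint

# The audio tower plus projection yields one embedding vector per clip,
# which serves as the sequence-level representation E_A.
audio_embed = model.get_audio_embedding_from_filelist(
    x=["example.wav"], use_tensor=True
)
print(audio_embed.shape)  # e.g. [1, 512]
```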

BART

BART is utilized in the paper as a caption decoder.

For the discrete acoustic code matrix $C$, each code sequence $c_n$ is processed through its corresponding embedding layer $W_n$. The resulting embeddings are summed across codebooks to form an input $e_{encodec} \in \R^{L \times D_b}$.
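A minimal PyTorch sketch of this embedding step, with assumed sizes for the number of codebooks, codebook vocabulary, and BART hidden dimension:

```python
import torch
import torch.nn as nn

N, L = 4, 150             # number of codebooks, encoded length (illustrative)
V_SIZE, D_B = 1024, 768   # codebook vocabulary size, BART hidden size (assumptions)

# One embedding table W_n per codebook.
codebook_embeddings = nn.ModuleList([nn.Embedding(V_SIZE, D_B) for _ in range(N)])

C = torch.randint(0, V_SIZE, (N, L))   # discrete acoustic code matrix C in V^{N x L}

# Embed each code sequence c_n with its own table, then sum over codebooks.
e_encodec = torch.stack(
    [codebook_embeddings[n](C[n]) for n in range(N)], dim=0
).sum(dim=0)                           # shape [L, D_B]

assert e_encodec.shape == (L, D_B)
```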

The CLAP audio embedding $E_A$ is projected with a linear layer to form the input $e_{clap} \in \R^{D_b}$.

Let $e_{bos}$ and $e_{eos}$ be the embeddings of the special beginning-of-sequence and end-of-sequence tokens, and let $e_{pos}$ be the positional embeddings. The input to the BART encoder, $I_b$, is then constructed as follows:

$$
\begin{aligned}
I_{concat} &= \mathrm{Concat}(e_{bos}, e_{encodec}, e_{eos}) \in \R^{(L+2)\times D_b} \\
I_{pos} &= I_{concat} + e_{pos} \in \R^{(L+2) \times D_b} \\
I_b &= \mathrm{Concat}(e_{clap}, I_{pos}) \in \R^{(L+3)\times D_b}
\end{aligned}
$$

BART autoregressively generates the caption, conditioned on $I_b$.
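Continuing the sketch, the encoder input $I_b$ can be assembled as follows; all tensors and dimensions are illustrative stand-ins rather than values from the paper.

```python
import torch
import torch.nn as nn

L, D_B, D_CLAP = 150, 768, 512       # illustrative sizes

e_encodec = torch.randn(L, D_B)      # summed codebook embeddings (previous step)
E_A = torch.randn(D_CLAP)            # CLAP audio embedding

clap_proj = nn.Linear(D_CLAP, D_B)   # maps the CLAP embedding into BART's space
e_clap = clap_proj(E_A)              # shape [D_B]

e_bos = torch.randn(D_B)             # stand-ins for the special-token embeddings
e_eos = torch.randn(D_B)
e_pos = torch.randn(L + 2, D_B)      # stand-in for the positional embeddings

I_concat = torch.cat([e_bos.unsqueeze(0), e_encodec, e_eos.unsqueeze(0)], dim=0)  # [(L+2), D_B]
I_pos = I_concat + e_pos                                                          # [(L+2), D_B]
I_b = torch.cat([e_clap.unsqueeze(0), I_pos], dim=0)                              # [(L+3), D_B]

assert I_b.shape == (L + 3, D_B)
```

Note that the CLAP embedding is prepended after the positional embeddings are added, matching the equation above.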

Training

The paper applies a cross-entropy loss between the ground-truth caption and the generated caption, optimized through the following objective function:

$$
\mathcal{L}_{caption} = -\frac{1}{L_T}\sum^{L_T}_{t=1}\log p(y_t \mid y_{1:t-1}, E_A, C)
$$

where $y_t$ is the $t$-th token of the ground-truth caption and $L_T$ is the caption length.
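This is standard teacher-forced cross-entropy over the decoder outputs; a minimal sketch with dummy tensors (the vocabulary size and caption length are illustrative):

```python
import torch
import torch.nn.functional as F

L_T, VOCAB = 20, 50265                      # caption length, vocabulary size (illustrative)

logits = torch.randn(L_T, VOCAB)            # decoder outputs for p(y_t | y_{1:t-1}, E_A, C)
targets = torch.randint(0, VOCAB, (L_T,))   # ground-truth caption token ids

# F.cross_entropy averages -log p(y_t) over the L_T positions,
# matching the 1/L_T normalization in the objective above.
loss_caption = F.cross_entropy(logits, targets)
```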

Masked Codec Modeling (MCM)

The paper introduces MCM as an auxiliary training task that improves BART's acoustic awareness, i.e., its grasp of the contextualized relationships among acoustic codes. To obtain the corrupted matrix $\tilde{C}$ from the discrete acoustic code matrix $C$, span masking with a ratio of 15% is applied to each code sequence $c_{n,:}$. Then, $N$ linear classifiers are added on top of the BART model to predict the ground-truth codes at the masked positions of each $\tilde{c}_{n,:}$. The loss is the cross-entropy between the ground-truth and predicted codes, and the objective is the sum of the losses over the code sequences. However, each sequence's loss is scaled differently, since earlier codebooks carry more significance in RVQ.

$$
\mathcal{L}_{mcm} = -\sum^N_{i=1}\left(\frac{1}{2}\right)^i \log p(c_{i,:} \mid \tilde{c}_{i,:})
$$
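A sketch of the MCM objective under the assumptions above: roughly 15% of positions in each code sequence are masked (plain random masking stands in for span masking), one linear classifier per codebook predicts the original codes at the masked positions, each codebook's loss is scaled by $(1/2)^i$, and the result is combined with the captioning loss using a weight $\lambda$ whose value here is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, L, V_SIZE, D_B = 4, 150, 1024, 768   # illustrative sizes
lam = 0.7                               # lambda for the auxiliary loss (assumed value)

C = torch.randint(0, V_SIZE, (N, L))    # ground-truth code matrix
mask = torch.rand(N, L) < 0.15          # ~15% masked positions per sequence

hidden = torch.randn(L, D_B)            # stand-in for BART's contextual states at code positions
classifiers = nn.ModuleList([nn.Linear(D_B, V_SIZE) for _ in range(N)])

loss_mcm = torch.tensor(0.0)
for i in range(N):
    if mask[i].any():
        logits = classifiers[i](hidden[mask[i]])      # predictions at masked positions
        ce = F.cross_entropy(logits, C[i][mask[i]])   # -log p(c_{i,:} | corrupted input)
        loss_mcm = loss_mcm + (0.5 ** (i + 1)) * ce   # earlier codebooks weighted more (i is 0-indexed)

loss_caption = torch.tensor(2.3)        # dummy value from the captioning objective
loss_total = loss_caption + lam * loss_mcm
```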

The combined training objective with the auxiliary task is then:

$$
\mathcal{L}_{total} = \mathcal{L}_{caption} + \lambda \times \mathcal{L}_{mcm}
$$

Results & Conclusion

EnCLAP achieves state-of-the-art results on automated audio captioning. Future work includes extending the framework to music captioning and audio generation.