A Keyword Spotting Survey

I give an overview of the task of Spoken Keyword Spotting

Simon Li

This blog post provides a brief overview of Deep Spoken Keyword Spotting, based on the survey.

Introduction

Spoken Keyword Spotting (KWS) can be defined as the task of identifying keywords in an audio stream comprised of speech. KWS has become ubiquitous in today's society. For example, Amazon’s Alexa, Apple’s Siri, and Microsoft’s Cortana all require a wake-up word in order to be used. This measure allows devices to run the computationally expensive Automatic Speech Recognition (ASR) only when it is actually required. Besides activating speech assistants, KWS is also involved in many other areas, such as phone call routing and speech data mining.

The Three Musketeers of Keyword Spotting

The general pipeline of a deep keyword spotting system usually contains three main blocks:

  1. Speech feature extractor
  2. Deep learning-based acoustic model
  3. Posterior handler

Speech Feature Extraction

Mel-scale-related features are by far the most widely used speech features for deep KWS. Mel-frequency cepstral coefficients (MFCCs), sometimes along with their first- and second-order derivatives, were used in earlier works. MFCCs are obtained by applying the discrete cosine transform to the log-Mel spectrogram; the resulting decorrelated coefficients are well-suited for GMMs, which, for computational reasons, use diagonal covariance matrices. Deep learning models, however, are able to exploit spectral-temporal relationships directly, so the log-Mel spectrogram itself is used for better performance. Recent works have also proposed the first derivative of the log-Mel spectrogram to improve robustness against signal gain changes.
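As a concrete illustration, here is a minimal sketch of both feature types using librosa; the frame sizes (25 ms window, 10 ms hop) and 40 Mel bands are typical choices in the KWS literature, not values prescribed by the survey.

```python
import librosa
import numpy as np

def kws_features(audio: np.ndarray, sr: int = 16000):
    # Power Mel spectrogram: 25 ms frames (n_fft=400) with a 10 ms hop.
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=40)
    log_mel = librosa.power_to_db(mel)  # log compression
    # MFCCs: DCT of the log-Mel spectrogram, which decorrelates the
    # bands (suiting diagonal-covariance GMMs).
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)
    # First- and second-order derivatives, as used in earlier works.
    mfcc = np.concatenate(
        [mfcc, librosa.feature.delta(mfcc), librosa.feature.delta(mfcc, order=2)])
    return log_mel, mfcc
```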

Recurrent Neural Network Features

Recurrent neural networks (RNNs) are known for their ability to summarize variable-length data sequences into fixed-length, compact feature vectors, or embeddings. RNNs are therefore well suited to generating embeddings for Query-by-Example (QbE) KWS, which detects keywords by measuring the similarity between feature vectors.
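A minimal PyTorch sketch of this idea: a GRU encoder compresses a variable-length feature sequence into its final hidden state, and QbE matching reduces to a cosine similarity between embeddings. All sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNEncoder(nn.Module):
    def __init__(self, n_feats: int = 40, emb_dim: int = 128):
        super().__init__()
        self.gru = nn.GRU(n_feats, emb_dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_feats); the final hidden state summarizes
        # the whole sequence into one fixed-length embedding.
        _, h_n = self.gru(x)
        return h_n[-1]  # (batch, emb_dim)

enc = RNNEncoder()
query = enc(torch.randn(1, 80, 40))  # enrolled keyword example
test = enc(torch.randn(1, 95, 40))   # incoming segment of different length
similarity = F.cosine_similarity(query, test)  # QbE matching score
```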

Low-Precision Features

Quantization is a way to diminish the energy consumption and memory footprint of deep KWS systems by reducing the precision of the model parameters. Past research demonstrates that it is possible to match the accuracy of the full-precision model with 4-bit quantization of the model’s weights. Two kinds of low-precision input representations have been studied in emerging research: the linearly-quantized log-Mel spectrogram and the power variation over time, which is derived from the log-Mel spectrogram. Both representations show insignificant performance degradation, indicating that much of the spectral information is superfluous.
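For intuition, here is a sketch of linear (uniform) quantization applied to a log-Mel spectrogram; the bit width and the per-utterance min/max scaling are assumptions for illustration.

```python
import numpy as np

def linear_quantize(log_mel: np.ndarray, n_bits: int = 4) -> np.ndarray:
    levels = 2 ** n_bits - 1
    lo, hi = log_mel.min(), log_mel.max()
    # Map values onto {0, ..., levels} integer codes.
    codes = np.round((log_mel - lo) / (hi - lo) * levels)
    # Dequantize back onto the original scale for the model input.
    return codes / levels * (hi - lo) + lo
```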

Learnable Filterbank Features

The end-to-end deep learning approach aims to be an alternative to well-established handcrafted features. Optimal filterbank learning is part of this end-to-end training strategy, where the filterbank parameters are tuned to optimize posterior generation. In past work, researchers have found no statistically significant KWS accuracy differences between employing a learnable filterbank and log-Mel features, again indicating information redundancy.
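A sketch of the idea, assuming the filterbank is modeled as a plain learnable linear map over power-spectrum bins, trained jointly with the rest of the network; real implementations may add constraints such as non-negativity or a Mel-style initialization, omitted here.

```python
import torch
import torch.nn as nn

class LearnableFilterbank(nn.Module):
    def __init__(self, n_fft_bins: int = 201, n_filters: int = 40):
        super().__init__()
        # One learnable weight per (filter, frequency bin).
        self.weights = nn.Parameter(torch.rand(n_filters, n_fft_bins))

    def forward(self, power_spec: torch.Tensor) -> torch.Tensor:
        # power_spec: (batch, time, n_fft_bins)
        energies = power_spec @ self.weights.t()  # (batch, time, n_filters)
        return torch.log(energies + 1e-6)         # log compression
```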

Acoustic Modeling

Fully-Connected Feedforward Neural Networks

A fully-connected feedforward neural network (FFNN) was the model used in the debut of deep KWS: a simple stack of three fully-connected hidden layers with 128 neurons and ReLU activations, followed by a softmax output layer. The model greatly outperformed the HMM-based system it was compared against, but with the advent of lighter, more accurate, and more robust models, it was soon relegated to a secondary role.
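A minimal PyTorch sketch of such a model; the input dimensionality (stacked context frames of 40 features each) and the number of output classes are assumptions for illustration.

```python
import torch.nn as nn

def build_ffnn(input_dim: int = 40 * 41, n_classes: int = 3) -> nn.Sequential:
    # Three fully-connected hidden layers of 128 ReLU units,
    # followed by a softmax output over the classes.
    return nn.Sequential(
        nn.Linear(input_dim, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, n_classes),
        nn.Softmax(dim=-1),
    )
```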

Convolutional Neural Networks

Thanks to their ability to exploit local time-frequency correlations in speech, CNNs easily outperform FFNNs with fewer parameters. Residual learning is widely used in state-of-the-art acoustic models for deep KWS. Residual learning, in short, creates shortcut connections linking non-consecutive layers, which helps to train very deep CNNs. Dilated convolutions are also integrated to help capture longer time-frequency patterns without needing more parameters.
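The following PyTorch sketch combines both ideas, a residual shortcut around a dilated convolution; the channel count and dilation factor are illustrative.

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    def __init__(self, channels: int = 64, dilation: int = 2):
        super().__init__()
        # Dilation widens the receptive field without extra parameters;
        # padding is chosen so the time length is preserved.
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shortcut connection: the input skips the convolution and is
        # added back, which eases the training of very deep stacks.
        return torch.relu(x + self.bn(self.conv(x)))
```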

TC-ResNet proposes one-dimensional convolutions along the time axis, overcoming the challenge of simultaneously capturing both high- and low-frequency features by means of not-very-deep networks. The result matched the performance of state-of-the-art CNNs while requiring drastically fewer floating-point operations and lower latency.
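The core trick can be shown in a couple of lines: the Mel bands are treated as input channels, so the kernels slide only along time. Shapes here are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 40, 101)  # (batch, mel_bands as channels, time frames)
temporal_conv = nn.Conv1d(40, 64, kernel_size=3, padding=1)
y = temporal_conv(x)         # (1, 64, 101): kernels convolve time only
```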

Depthwise separable convolution is a promising way to reduce the computation and size of CNNs. Depthwise separable means that the convolution is factorized into a depthwise convolution and a pointwise (1×1) convolution, where the latter combines the outputs of the former to generate new feature maps. This proves effective, reproducing the performance of TC-ResNet while using fewer parameters.
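A sketch of the factorization in PyTorch; `groups=in_ch` is what makes the first convolution depthwise (each channel convolved independently).

```python
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_ch: int = 64, out_ch: int = 64):
        super().__init__()
        # Depthwise: one kernel per input channel, no channel mixing.
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size=3,
                                   padding=1, groups=in_ch)
        # Pointwise (1x1): mixes channels to form new feature maps.
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```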

Posterior Handling

Non-streaming mode refers to standard multi-class classification of independent input segments, each containing a single word. Since non-streaming mode lacks realism from a practical point of view, it mainly serves as experimental grounds for deep KWS. Streaming mode, on the other hand, is the continuous processing of an input audio stream in which the keywords are not isolated; any segment may or may not contain a keyword. Because of this, the sequence of raw posteriors is often smoothed with a moving average before further processing.

To process the smoothed posteriors $\bar{y}^{\{i\}}$, most deep KWS systems deploy the following approach. Suppose that we have $N$ classes and that the first class is the non-keyword class. Then, a time sequence of confidence scores $S_c^{\{i\}}$ can be computed as

$$S^{\{i\}}_c = \sqrt[N-1]{\prod^N_{n=2} \max_{h_\text{max}(i) \leq k \leq i} \bar{y}^{\{k\}}_n}$$

where $h_\text{max}(i)$ indicates the onset of the time sliding window. When the confidence $S_c^{\{i\}}$ exceeds a certain sensitivity threshold, a keyword is detected.
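Putting the posterior handling together, here is a numpy sketch that smooths the raw posteriors with a moving average and then evaluates the confidence score above; the window lengths are illustrative assumptions.

```python
import numpy as np

def confidence_scores(posteriors: np.ndarray,
                      w_smooth: int = 30, w_max: int = 100) -> np.ndarray:
    """posteriors: (T, N) raw posteriors; column 0 is the non-keyword class."""
    T, N = posteriors.shape
    # Moving-average smoothing over the last w_smooth frames.
    y_bar = np.stack([posteriors[max(0, i - w_smooth + 1):i + 1].mean(axis=0)
                      for i in range(T)])
    scores = np.empty(T)
    for i in range(T):
        h = max(0, i - w_max + 1)                    # onset h_max(i)
        window_max = y_bar[h:i + 1, 1:].max(axis=0)  # max per keyword class
        # Geometric mean over the N-1 keyword classes.
        scores[i] = window_max.prod() ** (1.0 / (N - 1))
    return scores

# A keyword is detected whenever scores[i] exceeds a chosen threshold.
```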

Future Directions

Immediate future work could focus on developing novel and efficient convolutional blocks that improve KWS performance in real-life scenarios while reducing computational complexity. Moreover, based on initial results, it is reasonable to expect Neural Architecture Search to play a greater role in architecture design.

As for reducing computational complexity, compressing the acoustic model can play a crucial role by

  1. reducing the memory footprint
  2. decreasing inference latency
  3. lowering energy consumption

Although multi-channel KWS has only been marginally studied, it has shown great potential to be leveraged in multi-microphone systems, possibly leading to further improvements in real-life KWS performance.