Privacy-Sensitive Audio Features for Conversational Speech Processing
Supervisor(s) and Committee member(s): Advisor(s): Daniel Gatica-Perez - thesis supervisor, Herve Bourlard - thesis supervisor, Daniel P.W. Ellis - thesis reviewer, Simon King - thesis reviewer, Jean-Philippe Thiran - thesis reviewer, Jean-Marc Vesin - thesis jury president
The work described in this thesis takes place in the context of capturing real-life audio for the analysis of spontaneous social interactions. Towards this goal, we wish to capture conversational and ambient sounds using portable audio recorders. Analysis of conversations can then proceed by modeling the speaker turns and durations produced by speaker diarization. However, a key factor against the ubiquitous capture of real-life audio is privacy. Particularly, recording and storing raw audio would breach the privacy of people whose consent has not been explicitly obtained.
In this thesis, we study audio features instead — for recording and storage — that can respect privacy by minimizing the amount of linguistic information, while achieving state-of-the-art performance in conversational speech processing tasks. Indeed, the main contributions of this thesis are the achievement of state-of-the-art performances in speech/nonspeech detection and speaker diarization tasks using such features, which we refer to, as privacy-sensitive. Besides this, we provide a comprehensive analysis of these features for the two tasks in a variety of conditions, such as indoor (predominantly) and outdoor audio. To objectively evaluate the notion of privacy, we propose the use of human and automatic speech recognition tests, with higher accuracy in either being interpreted as yielding lower privacy.
For the speech/nonspeech detection (SND) task, this thesis investigates three different approaches to privacy-sensitive features. These approaches are based on simple, instantaneous, feature extraction methods, excitation source information based methods, and feature obfuscation methods. These approaches are benchmarked against Perceptual Linear Prediction (PLP) features under many conditions on a large meeting dataset of nearly 450 hours. Additionally, automatic speech (phoneme) recognition studies on TIMIT showed that the proposed features yield low phoneme recognition accuracies, implying higher privacy.
For the speaker diarization task, we interpret the extraction of privacy-sensitive features as an objective that maximizes the mutual information (MI) with speakers while minimizing the MI with phonemes. The source-filter model arises naturally out of this formulation. We then investigate two different approaches for extracting excitation source based features, namely Linear Prediction (LP) residual and deep neural networks. Diarization experiments on the single and multiple distant microphone scenarios from the NIST rich text evaluation datasets show that these features yield a performance close to the Mel Frequency Cepstral coefficients (MFCC) features. Furthermore, listening tests support the proposed approaches in terms of yielding low intelligibility in comparison with MFCC features.
The last part of the thesis studies the application of our methods to SND and diarization in outdoor settings. While our diarization study was more preliminary in nature, our study on SND brings about the conclusion that privacy-sensitive features trained on outdoor audio yield performance comparable to that of PLP features trained on outdoor audio. Lastly, we explored the suitability of using SND models trained on indoor conditions for the outdoor audio. Such an acoustic mismatch caused a large drop in performance, which could not be compensated even by combining indoor models.