Early models of the source signal represented voiced excitation as a simple impulse train. None of these models has been calibrated against direct observations of glottal area changes, which are the proximal cause of the air pressure changes that we hear as sound. The effective study of the voice source thus requires both more accurate source models and a comprehensive set of underlying observations on which to base them. The primary goal of the proposed research is to develop and evaluate a new, more powerful source model based on direct observations of vocal fold vibrations.
Besides the critical need to calibrate source models with underlying physiological data, we also need to better understand the link between model parameters and perceived voice quality. None of the previous source models has been systematically validated perceptually; that is, we cannot presently predict how a given change in a model parameter will affect what listeners hear.
The voice source carries important lexical and non-lexical information. Non-lexical information can convey, for example, prosodic events, emotional state, and cues to the uniqueness of a speaker's voice. Engineering applications need a more accurate source model capable of representing different voice qualities; such a model could improve the naturalness of text-to-speech (TTS) systems. In addition, understanding which aspects of the source signal, if any, are speaker-specific should aid in developing better speaker identification algorithms.
We propose to build on our preliminary work on a new source model by recording high-speed images of vocal fold vibrations with simultaneous audio recordings; analyzing the resulting corpus to better parameterize the new voice source model and to study speaker variability; performing perception experiments to determine which aspects of the glottal model are perceptually salient; and applying the model in TTS and speaker identification algorithms.
The project fosters interdisciplinary activities at:
This work is supported in part by NSF Grant No. IIS-1018863 and by NIH/NIDCD Grant Nos. DC01797 and DC011300.
Voice source, high-speed recording, vocal folds, speech synthesis, speech production model, perceptual validation.
Glottaltopograph (GTG) analysis tool: a toolkit for analyzing high-speed laryngeal videos.
Glottaltopography is a method for analyzing high-speed laryngeal videos. The method is described in: G. Chen, J. Kreiman, and A. Alwan, "The glottaltopogram: a method of analyzing high-speed images of the vocal folds," Computer Speech and Language, 2014, in press. Briefly, the "glottaltopogram" is based on principal component analysis (PCA) of the pixels' light-intensity time sequences across consecutive video images. The method reveals the overall synchronization of the vibrational patterns of the vocal folds over the entire laryngeal area, and it is effective in visualizing both pathological and normal vocal fold vibratory patterns. The GTG toolkit is available for download here.
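To make the idea concrete, here is a minimal sketch of a glottaltopogram-style analysis in Python. It is an illustration of the PCA step described above, not the released GTG toolkit: the function name, the layout of the `video` array (frames x height x width), and the synthetic test signal are all assumptions made for the example.

```python
# Illustrative sketch (not the GTG code): PCA over per-pixel light-intensity
# time sequences from a high-speed laryngeal video.
import numpy as np

def glottaltopogram(video, n_components=2):
    n_frames, h, w = video.shape
    # One row per pixel: its light-intensity sequence across consecutive frames.
    X = video.reshape(n_frames, h * w).T.astype(float)   # (pixels, frames)
    X -= X.mean(axis=1, keepdims=True)                   # remove each pixel's DC level
    # PCA via SVD: the leading right-singular vectors are the dominant
    # temporal vibration patterns shared across the image.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Score of each pixel on each component, mapped back to image coordinates.
    # Pixels vibrating in (or out of) phase with a component get large
    # positive (or negative) scores, visualizing synchronization over the
    # entire laryngeal area.
    maps = (U[:, :n_components] * S[:n_components]).T.reshape(n_components, h, w)
    return maps, Vt[:n_components]   # spatial maps and temporal basis patterns

# Example on a synthetic clip: a 1 kHz "vibration" in one patch of the image.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(500) / 10000.0                     # 500 frames at 10 kHz
    video = rng.normal(0.0, 0.1, (500, 32, 32))
    video[:, 10:20, 10:20] += np.sin(2 * np.pi * 1000 * t)[:, None, None]
    maps, patterns = glottaltopogram(video)
    print(maps.shape, patterns.shape)                # (2, 32, 32) (2, 500)
```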
VoiceSauce: A Program for Voice Analysis
VoiceSauce is an application, implemented in Matlab, which provides automated voice measurements over time from audio recordings. Inputs are standard wave (*.wav) files, and the measures currently computed are: F0, formants F1-F4, H1(*), H2(*), H4(*), A1(*), A2(*), A3(*), H1(*)-H2(*), H2(*)-H4(*), H1(*)-A1(*), H1(*)-A2(*), H1(*)-A3(*), Energy, and Cepstral Peak Prominence ... (details)
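As an illustration of one such measure, the sketch below estimates an uncorrected H1-H2 (the dB difference between the first two harmonic amplitudes) from a single voiced frame with a known F0. This is not VoiceSauce's implementation: VoiceSauce also computes formant-corrected variants (the starred measures), and the function names, parameters, and test signal here are hypothetical.

```python
# Hypothetical sketch: uncorrected H1-H2 from one voiced frame, given F0.
# VoiceSauce's real pipeline differs (e.g., it applies formant-based
# corrections to harmonic amplitudes; those corrected values carry the "*").
import numpy as np

def harmonic_amplitude_db(frame, fs, freq, search_hz=40.0):
    """Peak spectral magnitude (dB) within +/- search_hz of `freq`."""
    windowed = frame * np.hanning(len(frame))
    nfft = 4096
    spec = 20.0 * np.log10(np.abs(np.fft.rfft(windowed, nfft)) + 1e-12)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    band = (freqs > freq - search_hz) & (freqs < freq + search_hz)
    return spec[band].max()

def h1_minus_h2(frame, fs, f0):
    """Uncorrected H1-H2: amplitude of the 1st harmonic minus the 2nd, in dB."""
    h1 = harmonic_amplitude_db(frame, fs, f0)
    h2 = harmonic_amplitude_db(frame, fs, 2.0 * f0)
    return h1 - h2

# Example on a synthetic two-harmonic signal with a known answer (~6 dB,
# since the second harmonic has half the amplitude of the first).
if __name__ == "__main__":
    fs, f0 = 16000, 200.0
    t = np.arange(int(0.025 * fs)) / fs              # one 25 ms frame
    frame = np.sin(2*np.pi*f0*t) + 0.5*np.sin(2*np.pi*2*f0*t)
    print(round(h1_minus_h2(frame, fs, f0), 1))      # about 6.0 dB
```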
G. Chen, J. Kreiman, and A. Alwan, "The glottaltopogram: a method of analyzing high-speed images of the vocal folds," Computer Speech and Language, 2014, in press. [link to the journal article]
G. Chen, M. Garellek, J. Kreiman, B. R. Gerratt, and A. Alwan, "A perceptually and physiologically motivated voice source model," Interspeech 2013, pp. 2001-2005. [Best student paper award finalist] [slides and audio samples]
G. Chen, R. A. Samlan, J. Kreiman, and A. Alwan, "Investigating the relationship between glottal area waveform shape and harmonic magnitudes through computational modeling and laryngeal high-speed videoendoscopy," Interspeech 2013, pp. 3216-3220. [poster]
M. Garellek, P. Keating, and C. M. Esposito, "Relative importance of phonation cues in White Hmong tone perception," Proceedings of BLS 38 (Berkeley Linguistics Society), 2012.
G. Chen, J. Kreiman, B. R. Gerratt, J. Neubauer, Y.-L. Shue, and A. Alwan, "Development of a glottal area index that integrates glottal gap size and open quotient," Journal of the Acoustical Society of America, vol. 133, no. 3, pp. 1656-1666, March 2013. [link to the journal article]
J. Kreiman, Y.-L. Shue, G. Chen, M. Iseli, B. R. Gerratt, J. Neubauer, and A. Alwan, "Variability in the relationships among voice quality, harmonic amplitudes, open quotient, and glottal area waveform shape in sustained phonation," Journal of the Acoustical Society of America, vol. 132, no. 4, pp. 2625-2632, 2012. [link to the journal article]
G. Chen, Y.-L. Shue, J. Kreiman, and A. Alwan, "Estimating the voice source in noise," Interspeech 2012.
G. Chen, J. Kreiman, and A. Alwan, "The Glottaltopograph: A Method of Analyzing High-Speed Images of the Vocal Folds," ICASSP 2012, pp. 3985-3988.
G. Chen, J. Kreiman, Y.-L. Shue, and A. Alwan, "Acoustic Correlates of Glottal Gaps," Interspeech 2011, pp. 2673-2676.
Y.-L. Shue, G. Chen, and A. Alwan, "On the Interdependencies between Voice Quality, Glottal Gaps, and Voice-Source related Acoustic Measures," Interspeech 2010, pp. 34-37.
G. Chen, X. Feng, Y.-L. Shue, and A. Alwan, "On Using Voice Source Measures in Automatic Gender Classification of Children's Speech," Interspeech 2010, pp. 673-676.
Y.-L. Shue and A. Alwan, "A new voice source model based on high-speed imaging and its application to voice source estimation," ICASSP 2010, pp. 5134-5137.
Y.-L. Shue, J. Kreiman, and A. Alwan, "A Novel Codebook Search Technique for Estimating the Open Quotient," Interspeech 2009, pp. 2895-2898.
Abeer Alwan (alwan@ee.ucla.edu)