
Parametric Speech Analyzer Using a Joint Estimation Model of
Spectral Envelope and Fine Structure

by Hirokazu Kameoka (2005-)

PSOLA and waveform concatenation techniques are known to produce high-quality synthesized speech provided that a large enough variety of speech fragments is available, but their capacity to manipulate the characteristics of speech, for example to synthesize speech under conditions not covered by the database or to adapt to a speaker, is limited. Even if these problems could be addressed by adding speech data for each speaking style one wants to synthesize, collecting every possible fragment would be an unrealistic and painstaking process.

On the other hand, filter-type speech synthesizers, a representative example of parametric speech synthesis methods, deal with that problem by (approximately) separating the spectral envelope from the spectral fine structure. One can easily produce a new speech spectrum with a different vocal tract length or F0 by separately controlling the filter characteristics and the excitation signal through a small number of parameters. One can thus expect the processing capacity to be very high. Several such methods are widely known, such as LPC (Linear Predictive Coding) and the cepstrum. LPC estimates the vocal tract characteristics, modeled as an all-pole filter, by assuming the excitation source signal of the vocal cords to be a white process. MFCCs are another well-known and widely used feature representation of the vocal tract characteristics of speech. These methods enable a large variety of processing operations by working on only a small number of coefficients and parameters, and their ease of use has made them the mainstream analysis methods in recent filter-type Text-To-Speech synthesizers. Meanwhile, STRAIGHT is known to enable high-quality speech synthesis: it starts by estimating the F0 and then, using an analysis window that varies in time according to the F0 estimate, precisely estimates the spectral envelope in a non-parametric way. Unlike LPC, STRAIGHT makes explicit use of F0 estimates obtained from a pitch extractor, which is certainly one of the reasons it is such a high-quality analysis-synthesis system.
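As a rough illustration of the LPC idea described above (an all-pole filter driven by a white excitation), the following is a minimal sketch of the textbook autocorrelation method. It is not the model proposed on this page. The function names `lpc_autocorrelation` and `lpc_envelope`, the Hamming window, and the synthetic second-order test signal are all assumptions made for the example.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Estimate all-pole coefficients a_k such that
    x[n] ~ sum_k a_k * x[n-k] + e[n], with e[n] assumed white."""
    x = frame * np.hamming(len(frame))  # taper the frame (illustrative choice)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Normal (Yule-Walker) equations R a = r[1:], with R a Toeplitz matrix
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    gain = r[0] - np.dot(a, r[1:])  # energy of the prediction residual
    return a, gain

def lpc_envelope(a, gain, n_freq=512):
    """Power response gain / |1 - sum_k a_k exp(-j w k)|^2 on [0, pi]."""
    w = np.linspace(0.0, np.pi, n_freq)
    k = np.arange(1, len(a) + 1)
    A = 1.0 - np.exp(-1j * np.outer(w, k)) @ a
    return gain / np.abs(A) ** 2

# Synthetic check: white noise passed through a known stable all-pole filter.
rng = np.random.default_rng(0)
e = rng.standard_normal(4096)
true_a = np.array([1.3, -0.6])  # hypothetical coefficients for the demo
x = np.zeros(len(e))
for n in range(2, len(x)):
    x[n] = true_a[0] * x[n - 1] + true_a[1] * x[n - 2] + e[n]

a_hat, gain = lpc_autocorrelation(x, order=2)
env = lpc_envelope(a_hat, gain)
print("estimated coefficients:", a_hat)
```

Note that nothing in this estimate depends on F0: the excitation is treated as white, which is the assumption that a pulse-train excitation of voiced speech violates.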

We have thus been aiming to develop a new speech model that combines both of these advantages (i.e., it is defined in a parametric way and governed by the F0 parameter), with high-quality Text-To-Speech synthesis and analysis-synthesis systems always in mind.

In filter-type speech synthesis, the generation process of voiced speech is often assumed to be a linear system whose input is an excitation source signal consisting of a sequence of pulses at intervals of the pitch period. As the input spectrum is then a sequence of pulses at intervals of the pitch frequency F0, accurately obtaining the vocal tract characteristics requires extracting the power of each component of the harmonic structure separately, which in turn calls for a high-precision estimate of F0. Conversely, making a half-pitch error (taking half the true F0 as the estimate) amounts to supposing that the envelope is unnaturally jagged, with zero power at all the odd-order harmonics. Such an error could easily be corrected if we knew the speech spectral envelope in advance, or at least if we assume that spectral envelopes in the power domain are usually smooth. Therefore, estimation of F0 and estimation of the envelope, which have a chicken-and-egg relationship, should be done jointly rather than independently in successive steps. This is the standpoint from which we chose to formulate a joint estimation model of the spectral envelope and the fine structure.
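The half-pitch argument above can be checked numerically. The sketch below is only an illustration of that argument, not the joint estimation model itself. The sampling rate, the true F0 of 200 Hz, the exponential envelope, and the `roughness` measure are all hypothetical choices made for the example.

```python
import numpy as np

fs = 16000          # sampling rate in Hz (assumed)
n_samples = 1600    # 100 ms, an integer number of periods for 100 and 200 Hz
t = np.arange(n_samples) / fs
true_f0 = 200.0     # hypothetical true fundamental frequency

def envelope(f):
    """A smooth amplitude envelope, chosen arbitrarily for the demo."""
    return np.exp(-f / 2000.0)

# Voiced-like signal: harmonics of true_f0 weighted by the smooth envelope.
n_harm = 15
signal = sum(envelope(k * true_f0) * np.cos(2 * np.pi * k * true_f0 * t)
             for k in range(1, n_harm + 1))

def harmonic_powers(x, f0, n):
    """Power at k*f0 for k = 1..n, by projection onto complex exponentials."""
    k = np.arange(1, n + 1)
    basis = np.exp(-2j * np.pi * np.outer(k * f0, t))
    amps = 2.0 * np.abs(basis @ x) / len(x)
    return amps ** 2

def roughness(powers):
    """Mean jump between adjacent harmonic amplitudes, relative to their mean."""
    amps = np.sqrt(powers)
    return np.mean(np.abs(np.diff(amps))) / np.mean(amps)

powers_true = harmonic_powers(signal, true_f0, n_harm)          # correct F0
powers_half = harmonic_powers(signal, true_f0 / 2, 2 * n_harm)  # half-pitch error

print("roughness with correct F0:", roughness(powers_true))
print("roughness with half F0:   ", roughness(powers_half))
```

Under the half-pitch hypothesis the odd-order harmonics (`powers_half[0::2]`) carry essentially no power, so the implied envelope alternates between zero and the true values. A smoothness assumption on the envelope therefore favors the correct F0, which is the coupling that motivates estimating the two jointly.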