Saturday, January 16, 2010

Speech Synthesis!

I've never done much with speech signal processing - creating spectra and spectrograms, and calculating spectra from acoustic circuit models of the vocal tract, is about as far as I go. But recently I decided it was high time I tried my hand at a (formant) synthesizer. Fortunately, I knew just where to start looking for help: Dennis Klatt's 1980 paper in JASA. Aside from one typo I found in an equation (a missing division sign), the paper is most informative, and an appendix gives the complete code for the synthesizer described. ...written in FORTRAN. So I learned a few things about FORTRAN while trying to decipher how the synthesizer worked. When I noticed the typo, I had to hunt around the internet and through some of my books to figure out which form of the equation was correct, and whether there were any other typos (there weren't).

I started off, then, rewriting the Klatt synthesizer in MATLAB. I already have a version of it by MKT, but it's in mex format, which I know a bit less well than FORTRAN, and the Klatt synthesizer has some limitations that I want to overcome. So the plan was to rewrite it myself in MATLAB, thereby forcing myself to learn how the synthesizer worked; once that was done and working, I could focus on changing it to meet my particular needs. Well, after a while (and after I had solved the problem of the typo) I decided to take a different tack. It would be easier to write the synthesizer I wanted from the get-go, rather than using the Klatt version as a jumping-off point. So I worked on putting the principles learned from Klatt to work. The result: nothing like speech. I finally got around to some debugging tonight, and figured out what was wrong: a missing minus sign. :-} That's now fixed, and I can successfully produce three-formant vowels.
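For anyone curious what the guts of such a thing look like: the heart of a formant synthesizer is the second-order digital resonator from Klatt's 1980 paper. Here's a sketch in Python (my actual code is in MATLAB, and the formant frequencies and bandwidths below are just illustrative values, not ones from my synthesizer):

```python
import math

def klatt_coeffs(f, bw, fs):
    """Coefficients of Klatt's (1980) second-order digital resonator:
    y[n] = A*x[n] + B*y[n-1] + C*y[n-2], for a resonance at f Hz with
    bandwidth bw Hz, sampled at fs Hz."""
    C = -math.exp(-2.0 * math.pi * bw / fs)
    B = 2.0 * math.exp(-math.pi * bw / fs) * math.cos(2.0 * math.pi * f / fs)
    A = 1.0 - B - C  # normalizes the resonator to unity gain at DC
    return A, B, C

def resonate(x, f, bw, fs):
    """Run signal x through one formant resonator."""
    A, B, C = klatt_coeffs(f, bw, fs)
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = A * s + B * y1 + C * y2
        out.append(y)
        y2, y1 = y1, y
    return out

# A crude vowel: a 100 Hz impulse-train "source" pushed through a
# cascade of three formant resonators.
fs = 16000
source = [1.0 if n % (fs // 100) == 0 else 0.0 for n in range(fs // 2)]
vowel = source
for f, bw in [(700, 80), (1220, 90), (2600, 120)]:
    vowel = resonate(vowel, f, bw, fs)
```

Cascading the resonators like this (rather than running them in parallel) is what gives the output its formant structure; each stage rings at its own frequency in response to every glottal pulse.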
Actually, I should be able to produce much more than that - I just haven't tried the other possibilities yet. This synthesizer has some nice features. First, it has poles and zeros for each of the first six formants, for each of the first three nasal and subglottal resonances, and for each of the first two interdental resonances. If the frequency and bandwidth of a pole-zero pair are identical, the pair is simply not synthesized (the two would cancel out anyway); and if either the pole or the zero coefficients are set to zero, that pole or zero is not synthesized. Second, the synthesizer updates the resonator and anti-resonator coefficients between each pair of samples, so if the acoustic sampling rate is 16000 Hz, the parameter update interval is 1/16000 s. Third, the source is defined completely separately from the resonators and anti-resonators. Given a source, one can calculate the vocal tract filter and feed that into the synthesizer, and since the sampling rate is so high compared to the fundamental frequency, within-glottal-cycle changes in the filter should be no problem. The one thing this synthesizer cannot currently do is produce voiced obstruents. A later version will remedy that, and the source parameter will probably be broken down into phonation and frication/aspiration components. I plan to use this synthesizer (or its later versions) to create synthetic speech for speech perception experiments, to study subglottal coupling in natural and synthetic speech, and to investigate certain aspects of fricative acoustics that I'm rather interested in.
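To make the first two features concrete, here's a Python sketch (again, not my MATLAB code - the track-function interface is invented for illustration) of a pole-zero pair with per-sample coefficient updates. The anti-resonator is just the resonator's inverse, per Klatt 1980, which is also why an identical pole-zero pair can be skipped: the two sections cancel exactly.

```python
import math

def coeffs(f, bw, fs):
    """Klatt (1980) resonator coefficients: y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    C = -math.exp(-2.0 * math.pi * bw / fs)
    B = 2.0 * math.exp(-math.pi * bw / fs) * math.cos(2.0 * math.pi * f / fs)
    return 1.0 - B - C, B, C

class Resonator:
    """Two-pole section. Coefficients are recomputed at every sample, so the
    parameter update interval equals the sampling interval (1/fs s)."""
    def __init__(self):
        self.y1 = self.y2 = 0.0
    def step(self, x, f, bw, fs):
        A, B, C = coeffs(f, bw, fs)
        y = A * x + B * self.y1 + C * self.y2
        self.y2, self.y1 = self.y1, y
        return y

class AntiResonator:
    """Two-zero section: the inverse transfer function of the resonator,
    so a matched pole-zero pair multiplies out to unity."""
    def __init__(self):
        self.x1 = self.x2 = 0.0
    def step(self, x, f, bw, fs):
        A, B, C = coeffs(f, bw, fs)
        y = (x - B * self.x1 - C * self.x2) / A
        self.x2, self.x1 = self.x1, x
        return y

def run_cascade(source, pairs, fs):
    """pairs: list of (pole_fn, zero_fn); each fn maps a sample index n to
    (f, bw), or None to disable that section. A pair whose pole and zero
    parameters match at a given sample is skipped entirely, since the two
    sections would cancel anyway."""
    sections = [(Resonator(), AntiResonator()) for _ in pairs]
    out = []
    for n, s in enumerate(source):
        for (res, anti), (pole_fn, zero_fn) in zip(sections, pairs):
            p, z = pole_fn(n), zero_fn(n)
            if p == z:
                continue
            if z is not None:
                s = anti.step(s, z[0], z[1], fs)
            if p is not None:
                s = res.step(s, p[0], p[1], fs)
        out.append(s)
    return out
```

Because each track is a function of the sample index, a formant glide is just a track that returns a different frequency at every n - no frame-rate parameter interpolation needed.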

Among my many readers (that's right, I'm talking about all two of you) this post is probably interesting only to me. Sorry... :-}