[PD] speaker recognition with pd ?

Ed Kelly morph_2016 at yahoo.co.uk
Tue Sep 27 01:07:20 CEST 2011


I did research for a year on how to do this. I came to write externals for PD because of that project, but I never quite got to the point where I could do it. It's on my long to-do list, which means it will probably never be finished. Here are some ideas:

1. Calculate a Chebyshev polynomial from a Linear Predictive Coding (LPC) filter response. Track the peaks of the response (the formant peaks) and (maybe) find approximate matches in a database of material. A model of formant patterns can be built on the fly in a training mode, so you can make a database of formant-peak line sections and use it to check subsequent analyses. For example, a training session can be used to build a model of a particular speaker's formant patterns, and the live input can then be compared against each model.

I was trying to port the formant modelling tools from the Speech Filing System from UCL: http://www.phon.ucl.ac.uk/resource/sfs/ to PD in 2005-06, but didn't get much support from my superiors who were running this project. I never got it to work, but I'd only just begun proper C programming then. I'm sure I wasn't far off... I'd love to try again if I get time in my schedule (I now have 2 kids and 5 jobs). The advantage of this method is that, with careful measurement of the residual spectrum, it is possible to re-create the sound of a voice from a good formant/residual model. Thus, we can make a person's voice "speak" the words we want it to, or get a hundred people to sing in tune! It is a reversible algorithm, so the original sound can be re-created from the analysis.
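
As a very rough sketch (nothing to do with the SFS code; the LPC order, pre-emphasis coefficient and 90 Hz cutoff are just typical guesses), the formant-peak step could look something like this in numpy:

import numpy as np

def lpc(r, order):
    """Levinson-Durbin recursion: LPC coefficients a[0..order] from autocorrelation r."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def formants(frame, sr, order=12):
    """Rough formant frequency estimates (Hz) for one frame of speech."""
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis
    frame = frame * np.hamming(len(frame))                       # window the frame
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = lpc(r[:order + 1], order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]             # keep one of each conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2.0 * np.pi))
    return freqs[freqs > 90]                      # drop near-DC poles

Tracking those frequencies frame by frame, and storing them per speaker, gives the kind of formant-pattern database described above.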

2. The Mel-Frequency Cepstral Coefficients (MFCCs) of the FFT (Fast Fourier Transform) of a waveform are a good timbral identifier. William Brent's timbreID objects are good instantaneous timbre identifiers using this principle, but to build up a sophisticated model of a human voice (robust enough for speaker ID) you need to work out how to build a database. For an instantaneous MFCC identifier with an internal database, check out Michael Casey's "soundspotter" PD external. This is even more efficient, since each frame of MFCC analysis is reduced to a string of 40 ASCII characters. This means that standard MySQL search techniques can be used to search the database, which is a lot faster than comparing vectors of numbers one by one. The MFCC algorithm is non-reversible, meaning that the original waveform cannot be reconstructed from the analysis data.
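
The per-frame matching idea can be sketched roughly like this (this is not timbreID's or soundspotter's actual code, just an illustration of nearest-neighbour lookup over labelled MFCC frames; librosa stands in for the MFCC analysis):

import numpy as np
import librosa

def mfcc_frames(y, sr, n_mfcc=13):
    """One MFCC vector per analysis frame (rows)."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def build_db(training):
    """training maps a speaker name to a (waveform, sample rate) pair."""
    db = []
    for name, (y, sr) in training.items():
        db += [(name, f) for f in mfcc_frames(y, sr)]
    return db

def identify(frame, db):
    """Label an incoming MFCC frame with the nearest training frame's speaker."""
    name, _ = min(db, key=lambda item: np.linalg.norm(item[1] - frame))
    return name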


The biggest problem with all of this is that speech is identified not just by its instantaneous timbre, but also by the way the timbre and pitch change over time. So speech recognition technology uses a Markov model to map the likelihood of one timbre changing into another. For example, the likelihood of a "k" sound being followed by an "r" is quite high, since there are many words like "cracker" and "croak" that have this morphology, whereas "k" followed by "s" is much rarer in English, so its likelihood is much lower.
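
A toy example of that transition-probability idea (the numbers are made up purely for illustration, not measured from any corpus):

import math

transitions = {              # invented probabilities for a handful of phoneme pairs
    ('k', 'r'): 0.08,
    ('k', 'ae'): 0.12,
    ('k', 's'): 0.005,
}

def log_likelihood(seq, transitions, floor=1e-6):
    """Sum of log transition probabilities along a phoneme sequence."""
    return sum(math.log(transitions.get(pair, floor))
               for pair in zip(seq, seq[1:]))

print(log_likelihood(['k', 'r'], transitions) >
      log_likelihood(['k', 's'], transitions))    # True: "kr" scores higher than "ks"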

I...well there it is,
Ed

>> The task would be to identify from a live talk the voice of the current
>> speaker amongst several. Training beforehand is also possible .. I guess this
>> could be done for sure by utilizing a simple neural network trained on an
>> FFT decomposition of the voices..  so there must be some software out for
>> sure...
> 
> Something tells me a fft+neural network would be really bad at this.
> Seriously, that sounds like a doomed project if you tried.  These
> things would be huge:
> 1.  fft size (for resolution)
> 2.  network size (based on the fft size)
> 3.  training set (lots of variance in the speaker is possible)
> 
> How about autocovariance and dot-product?
> 
> Ahead of time, create an array containing normalized autocovariance
> (an autocorrelation) of the speaker's voice.
> 
> Compute a running autocovariance of the sound.  Decompose it into the
> portion of the sound matching the autocovariance of the speaker and
> compare it with the part not matching the speaker (via dot-product, or
> projection operators)
> 
> That would be ~less~ expensive and time-consuming than neural
> networks, but I wouldn't give it much chance of success either.  Probably
> it would match quite a few different people all the same.
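
A rough sketch of the autocovariance / dot-product comparison described above (the lag count and the cosine-style normalization are just assumptions, not from the suggestion itself):

import numpy as np

def norm_autocorr(x, maxlag=256):
    """Normalized autocorrelation of x for lags 0..maxlag-1."""
    x = x - np.mean(x)
    ac = np.correlate(x, x, mode='full')[len(x) - 1:len(x) - 1 + maxlag]
    return ac / ac[0]

def speaker_similarity(live, reference, maxlag=256):
    """Cosine similarity between the live and reference autocorrelation signatures."""
    a = norm_autocorr(live, maxlag)
    b = norm_autocorr(reference, maxlag)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))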


I think that getting some kind of basic recognition of who is speaking would not be super difficult, if you have a clean recording of the voices. You need to get the formants of the voice, then use those as the basis for comparison.  You could start with something like William Brent's timbreID library to isolate the different vowel sounds, then get a formant profile for each of the vowels, then use that data for the pattern matching.  It'll definitely take some research and a solid chunk of work to get it going.
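
That last matching step could be sketched like this (the speaker names and F1/F2 values are invented for illustration; real templates would come from the training recordings):

import numpy as np

templates = {                    # per-speaker vowel formant templates, F1/F2 in Hz
    'alice': {'a': (850, 1200), 'i': (300, 2300)},
    'bob':   {'a': (700, 1050), 'i': (270, 2100)},
}

def closest_speaker(vowel, f1, f2, templates):
    """Return the speaker whose stored formants for this vowel are nearest."""
    obs = np.array([f1, f2])
    return min(templates,
               key=lambda s: np.linalg.norm(np.array(templates[s][vowel]) - obs))

print(closest_speaker('a', 820, 1180, templates))   # -> 'alice'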

.hc

----------------------------------------------------------------------------

Access to computers should be unlimited and total.  - the hacker ethic



_______________________________________________
Pd-list at iem.at mailing list
UNSUBSCRIBE and account-management -> http://lists.puredata.info/listinfo/pd-list