Recognition of vowel sequence based on HMM - part I

Computation of emitted probability in a state of HMM

**Tasks to do:**

**Computation of GMM models of vowels and speech pause**

- Compute GMM model parameters for the 5 basic vowels and for the speech pause. Use a procedure analogous to the classification of isolated vowels based on GMM; see the task "GMM-based classification of vowels" (which took place in the 7th week of the semester).
- To train the vowel models, use the utterances of all isolated vowels, i.e. *A1, A2, A3, E1, ..., U1, U2, U3*; to train the model of the speech pause, use the non-speech parts of utterances *B0* and *B1*. In both cases use all speakers at least from block *BLOCK172* or *BLOCKEN* (later you can experiment with a larger amount of training data).
- You can use the data directly from the directory *"H:\VYUKA\ZRE\signaly\zreratdb"* when you are in the computer classroom at CTU FEE. For work at home you can download the archive of signals resampled to 16 kHz, zrerat_block172_101_cs0.zip or zrerat_blocken_t02_cs0.zip.
- Use **13 MFCC** cepstral coefficients as features for the recognition, **including the zeroth coefficient c[0]** for the purpose of better pause modelling. The setup of the MFCC computation for the given signals sampled at 16 kHz should be:

- frame length 25 ms, **frame shift 10 ms**, Hamming weighting,

- number of filter bank bands **M=30, fmin=100 Hz, fmax=7000 Hz**,

- for the computation, use the function vmfcc.m (with the sequentially called functions melbf.m, mel.m, melinv.m).

- during the training of vowels, use **VAD based on power in dB** with dynamic-based thresholding at a level of approx. **60%** of the dynamics,

- for the silence detection, again use **VAD based on power in dB** with dynamic-based thresholding at a level of approx. **30%** of the dynamics.

- for the application of VAD, use the functions speechpwr.m and thr_fixed.m, already used at previous seminars.
- 1st checked result (1 point):
Observe for the given training data:
- the cepstrum distribution for vowels as well as for speech pause, i.e. the dependencies **c[0]-c[1], c[2]-c[3], c[4]-c[5]** and **c[6]-c[7]** for the **MFCC cepstrum**.

- Use *mixnum = 8* as the number of GMM components during the training of these models (later you can also try a higher or lower number of mixtures). Also use a GMM model with a **diagonal covariance matrix**.
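
The framing and VAD settings above can be sketched as follows. This is only an illustration in Python, not the seminar's MATLAB code: it does not reproduce speechpwr.m or thr_fixed.m, and it assumes that "a level of X% of the dynamics" means a threshold placed at X% of the range between the minimum and maximum frame power in dB. The helper names and the synthetic test signal are hypothetical.

```python
import math

FS = 16000                        # sampling frequency in Hz (from the task)
FRAME_LEN = FS * 25 // 1000       # 25 ms -> 400 samples
FRAME_SHIFT = FS * 10 // 1000     # 10 ms -> 160 samples


def frame_power_db(signal, frame_len=FRAME_LEN, frame_shift=FRAME_SHIFT):
    """Short-time power in dB for every full frame of the signal."""
    n_frames = max(0, 1 + (len(signal) - frame_len) // frame_shift)
    powers = []
    for i in range(n_frames):
        frame = signal[i * frame_shift: i * frame_shift + frame_len]
        p = sum(x * x for x in frame) / frame_len
        powers.append(10.0 * math.log10(p + 1e-12))  # small floor avoids log(0)
    return powers


def vad_dynamic_threshold(power_db, level):
    """Assumed rule: a frame is active if its power exceeds min + level * (max - min)."""
    lo, hi = min(power_db), max(power_db)
    thr = lo + level * (hi - lo)
    return [p > thr for p in power_db], thr


def tone(amplitude, n, freq=200.0):
    """Synthetic sine used only as a stand-in for a recorded utterance."""
    return [amplitude * math.sin(2 * math.pi * freq * k / FS) for k in range(n)]


# 0.5 s quiet, 0.5 s loud "vowel", 0.5 s quiet
signal = tone(0.001, 8000) + tone(0.5, 8000) + tone(0.001, 8000)
power_db = frame_power_db(signal)

speech, _ = vad_dynamic_threshold(power_db, 0.60)   # frames kept for vowel training
above30, _ = vad_dynamic_threshold(power_db, 0.30)
pause = [not flag for flag in above30]              # frames kept for the pause model
```

With this setup, `speech` selects frames for the vowel GMMs and `pause` selects frames for the speech-pause GMM; frames between the two thresholds are used for neither.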

**Definition of HMM model for a sequence of 5 isolated vowels**

- Create a "structure/cell" variable as the definition of a left-right HMM model without skips for an utterance containing 5 isolated vowels separated by speech pauses, i.e. a structure variable as follows:

- *hmm.states* ... number of HMM states,

- *hmm.b* ... array of GMM models with the length related to the number of states (ATTENTION - it must be a cell array),

- *hmm.a* ... transition matrix related to the number of states.
- Compute the emitted probability for the particular states of the defined HMM model, using the parameters saved in the structure variable defining the HMM model, for an utterance containing a sequence of isolated vowels (*P0*, *P1* or *P2*). Save all computed probabilities into a matrix whose number of rows equals the number of feature vectors (i.e. the number of short-time frames) of the analyzed utterance and whose number of columns equals the number of states of the given HMM model.
- 2nd checked result (1 point):
Display for the available data:
- the time dependency of the emitted probability for the 1st, 2nd, 4th, 6th, 8th and 10th HMM state for the sequence AEIOU (utterance P0).
- also observe these time dependencies of the emitted probability for the same HMM model for the sequences UOIEA (utterance P1) and IUAOE (utterance P2), respectively.
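
One possible shape of the HMM definition and of the emitted-probability matrix is sketched below in Python (the seminar itself works with MATLAB structures and cell arrays; the dictionary keys here only mirror *hmm.states*, *hmm.b* and *hmm.a*). Several things are assumptions: the state order pause, A, pause, E, ..., U, pause (11 states, inferred from the states listed in the 2nd checked result), the self-loop probability of 0.5, and the toy 2-dimensional single-component models that stand in for trained 13-dimensional GMMs with 8 components. Probabilities are kept in the log domain.

```python
import math


def gmm_log_likelihood(x, gmm):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM."""
    logs = []
    for w, mu, var in zip(gmm["weights"], gmm["means"], gmm["variances"]):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        logs.append(ll)
    m = max(logs)                       # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(l - m) for l in logs))


def left_right_transitions(n_states, p_stay=0.5):
    """Left-right transition matrix without skips: stay, or move to the next state."""
    a = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        a[i][i] = p_stay
        a[i][i + 1] = 1.0 - p_stay
    a[-1][-1] = 1.0                     # the last state can only loop on itself
    return a


def emission_log_probs(features, hmm):
    """One row per frame, one column per HMM state."""
    return [[gmm_log_likelihood(x, b) for b in hmm["b"]] for x in features]


def toy_gmm(mean):
    """Hypothetical single-component model used instead of a trained GMM."""
    return {"weights": [1.0], "means": [mean], "variances": [[1.0, 1.0]]}


pause = toy_gmm([0.0, 0.0])
vowels = [toy_gmm([float(k), float(k)]) for k in (3, 6, 9, 12, 15)]  # A E I O U

# Assumed state order: pause, A, pause, E, pause, I, pause, O, pause, U, pause
models = [pause]
for v in vowels:
    models += [v, pause]

hmm = {"states": len(models), "b": models, "a": left_right_transitions(len(models))}

features = [[0.0, 0.0], [3.0, 3.0], [0.1, -0.1], [6.0, 6.0]]  # 4 made-up frames
B = emission_log_probs(features, hmm)   # 4 rows (frames) x 11 columns (states)
```

Plotting selected columns of `B` against the frame index gives the time dependencies asked for in the 2nd checked result; note that all pause states share one model, so their columns are identical.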

HOMEWORK NEEDED for the NEXT SEMINAR

**Boundaries of vowels in utterances *P0*, *P1* and *P2***

- Manually determine the vowel boundaries for the utterances *P0*, *P1* and *P2*.
- Store the determined boundaries (in samples, or in frames for the given short-time analysis setup, i.e. frame length of 25 ms and frame shift of 10 ms) in a separate vector for each utterance.
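
If the boundaries are marked in samples, they can be converted to frame indices for the 25 ms / 10 ms analysis. The Python sketch below assumes that a frame is represented by its centre sample and uses 0-based indices (add 1 for MATLAB indexing); the function names are hypothetical and the boundary values are placeholders, not measurements from the actual recordings.

```python
FS = 16000
FRAME_LEN = FS * 25 // 1000     # 400 samples
FRAME_SHIFT = FS * 10 // 1000   # 160 samples


def sample_to_frame(sample, n_frames):
    """0-based index of the frame whose centre is closest to the given sample."""
    idx = round((sample - FRAME_LEN / 2) / FRAME_SHIFT)
    return min(max(idx, 0), n_frames - 1)   # clamp to the valid frame range


def frame_to_sample(frame):
    """Centre sample of a 0-based frame index."""
    return frame * FRAME_SHIFT + FRAME_LEN // 2


# Placeholder boundaries in samples for one utterance with 148 frames
boundaries_samples = [8000, 16000]
boundaries_frames = [sample_to_frame(s, 148) for s in boundaries_samples]
```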

**Preparation for decoding**

- Study the principles of computing the likelihood of passing through a given HMM model based on the forward procedure, the backward procedure, and the Viterbi algorithm.
- Also study how the occupancy likelihood of the *j*-th state at time *t* can be computed.
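
As a starting point for this study, the three procedures and the state occupancy can be tried on a tiny example. The Python sketch below is an illustration under stated assumptions: a 2-state left-right model, a path that must start in the first state, no constraint on the final state, and made-up emitted probabilities `b[t][j]`. It works with plain probabilities for readability; for real utterances the products underflow, so log-domain or scaled versions are normally used.

```python
def forward(a, b):
    """alpha[t][j] = P(o_1..o_t, state j at time t); the path starts in state 0."""
    n = len(a)
    alpha = [[b[0][j] if j == 0 else 0.0 for j in range(n)]]
    for t in range(1, len(b)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * a[i][j] for i in range(n)) * b[t][j]
                      for j in range(n)])
    return alpha


def backward(a, b):
    """beta[t][i] = P(o_{t+1}..o_T | state i at time t)."""
    n = len(a)
    beta = [[1.0] * n]
    for t in range(len(b) - 2, -1, -1):
        nxt = beta[0]                   # beta at time t + 1
        beta.insert(0, [sum(a[i][j] * b[t + 1][j] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def viterbi(a, b):
    """Most likely state sequence and its probability; the path starts in state 0."""
    n = len(a)
    delta = [b[0][j] if j == 0 else 0.0 for j in range(n)]
    back = []
    for t in range(1, len(b)):
        ptr, new = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: delta[i] * a[i][j])
            ptr.append(best)
            new.append(delta[best] * a[best][j] * b[t][j])
        back.append(ptr)
        delta = new
    state = max(range(n), key=lambda j: delta[j])
    prob = delta[state]
    path = [state]
    for ptr in reversed(back):          # backtrack from the last frame
        state = ptr[state]
        path.insert(0, state)
    return path, prob


a = [[0.5, 0.5], [0.0, 1.0]]                 # left-right, no skips
b = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]]     # made-up emitted probabilities

alpha = forward(a, b)
beta = backward(a, b)
likelihood = sum(alpha[-1])                  # P(O | model), summed over all paths
# Occupancy of state j at time t: alpha * beta normalised by the total likelihood
gamma = [[alpha[t][j] * beta[t][j] / likelihood for j in range(2)]
         for t in range(len(b))]
path, best = viterbi(a, b)
```

The forward likelihood sums over all state sequences, while the Viterbi value belongs to the single best sequence, so the latter can never exceed the former.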