BE2M31ZRE seminar
Recognition of vowel sequence based on HMM - part I
Definition and construction of HMM model
Computation of emitted probability in a state of HMM
Tasks to do:
Computation of GMM models of vowels and speech pause
Compute GMM-model parameters for the 5 basic vowels and for the speech
pause. Use a procedure analogous to the classification of
isolated vowels based on GMM, see
Task
"GMM-based classification of vowels".
To train the vowel models use the utterances of all isolated vowels, i.e. A1, A2,
A3, E1, ...., U1, U2, U3; to train the model of the speech pause use the
non-speech parts of utterances B0 and B1, or P1, P2 and P3, in both cases using
all speakers at least from block BLOCKEN2
(later you can experiment with a larger amount of training data).
You can use the data directly from the
directory "H:\VYUKA\ZRE\signaly\zreratdb" when you are in the computer
classroom at CTU FEE.
For work at home you can download the archive of signals
resampled to 16 kHz, zrerat_blocken_2022_cs0.zip.
Use 13 MFCC cepstral coefficients as features for the
recognition; for better modelling of the pause, include the zeroth coefficient
c[0]. The setup of the MFCC computation for the given signals sampled at 16
kHz should be:
- apply pre-emphasis to each signal with the coefficient m = 0.97,
- frame length 25 ms, frame shift 10 ms, Hamming weighting,
- number of filter bank bands M=30, fmin=100 Hz, fmax=6500 Hz,
- for the computation, use the
function vmfcc.m
(which sequentially calls the functions melbf.m,
mel.m, melinv.m),
- during the training of the vowels use VAD based on power in
dB with dynamic-range thresholding, with the level at
approx. 60% of the dynamics,
- for the silence detection use again VAD based on power in
dB with dynamic-range thresholding, with the level at
approx. 30% of the dynamics,
- to apply the VAD, use the
functions speechpwr.m
and thr_fixed.m
already used in previous seminars.
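The dynamic-range thresholding used by the VAD can be summarized in a short sketch. This is an illustration only, written in Python/NumPy rather than the seminar's MATLAB tools; the function names frame_power_db and vad_dynamic are hypothetical stand-ins for speechpwr.m and thr_fixed.m, and the noise signal is a placeholder for a real utterance.

```python
import numpy as np

def frame_power_db(x, flen, fshift):
    """Short-time power in dB per frame (illustrative analogue of speechpwr.m)."""
    nframes = 1 + (len(x) - flen) // fshift
    p = np.empty(nframes)
    for i in range(nframes):
        seg = x[i * fshift : i * fshift + flen]
        p[i] = 10.0 * np.log10(np.mean(seg ** 2) + 1e-12)  # avoid log(0)
    return p

def vad_dynamic(p_db, level):
    """Voice activity decision: threshold placed at `level` (0..1) of the dB dynamics."""
    thr = p_db.min() + level * (p_db.max() - p_db.min())
    return p_db > thr

# usage: 60 % of the dynamics for vowel training, 30 % for silence detection
x = np.random.default_rng(0).standard_normal(16000)  # placeholder: 1 s at 16 kHz
p = frame_power_db(x, flen=400, fshift=160)          # 25 ms frames, 10 ms shift
voiced = vad_dynamic(p, 0.60)
```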
Result:
Observe for the given training data:
the cepstrum distributions for the vowels as well as for the speech pause,
i.e. the dependencies c[0]-c[1], c[2]-c[3], c[4]-c[5] and c[6]-c[7] of the MFCC cepstrum.
Use mixnum = 8 as the number of GMM components
during the training of these models
(later you can also try a higher/lower number of mixtures). Use a GMM
model with a diagonal covariance matrix.
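The training itself is done with the seminar's MATLAB routines; purely as an illustration, an analogous sketch using scikit-learn's GaussianMixture (an assumed substitute, not the seminar toolchain) with 8 diagonal-covariance components would look like this. The random features are placeholders for real MFCC vectors.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, mixnum=8):
    """Fit a diagonal-covariance GMM to an (N, 13) array of MFCC vectors."""
    gmm = GaussianMixture(n_components=mixnum, covariance_type="diag",
                          max_iter=200, random_state=0)
    gmm.fit(features)
    return gmm

# placeholder: random 13-dimensional "MFCC" features standing in for real data
rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 13))
gmm_a = train_gmm(feats)                 # e.g. a model of vowel 'A'
logp = gmm_a.score_samples(feats)        # per-frame log-likelihoods under the model
```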
In the first step, you can work just with your own recordings of the particular vowels and of their sequences that include speech pauses. To estimate the parameters of the GMM models it may be necessary to use a lower number of mixtures (even mixnum = 1). The resulting models will be less general: the emitted probabilities may be higher for your own data and lower for the data of other speakers.
To simplify your work at the beginning, you can use the pre-trained GMM models saved in the mat-file cv_09_vowel_sil_gmms.mat
Definition of HMM model for a sequence of 5 isolated vowels
Create a "structure/cell" variable as the definition of a left-right
HMM model without skips for an utterance containing
5 isolated vowels separated by speech pauses, i.e. a structure
variable as follows:
- hmm.states ... number of HMM states
(note that state indexing starts with, and includes, the first non-emitting state),
- hmm.b ... array of GMM models with the
length related to the number of states
(ATTENTION: 1. it must be a cell array, and 2. as the function b() is not defined for the first and the last non-emitting state, use '[]' for the first element of this array; the last element can be omitted),
- hmm.a ... transition matrix related to the
number of states.
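The structure above can be sketched as follows; a Python dict is used here purely as an illustrative stand-in for the MATLAB structure/cell variable, and the self-loop/forward transition probabilities are placeholder values, not trained ones.

```python
import numpy as np

def make_left_right_hmm(gmms):
    """Left-right HMM without skips.
    gmms : list of emitting-state models (e.g. trained GMMs), in order.
    The first and the last state are non-emitting, as in the seminar convention."""
    n = len(gmms) + 2                      # + entry and exit non-emitting states
    hmm = {"states": n,
           # b is undefined for the first (non-emitting) state -> None;
           # the entry for the exit state is omitted
           "b": [None] + list(gmms),
           "a": np.zeros((n, n))}
    hmm["a"][0, 1] = 1.0                   # entry state moves to first emitting state
    for i in range(1, n - 1):
        hmm["a"][i, i] = 0.8               # self-loop (placeholder probability)
        hmm["a"][i, i + 1] = 0.2           # forward transition (placeholder)
    return hmm

# 5 vowels separated by pauses: sil A sil E sil I sil O sil U sil
order = ["sil", "A", "sil", "E", "sil", "I", "sil", "O", "sil", "U", "sil"]
hmm = make_left_right_hmm(order)           # plain labels stand in for GMMs here
```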
Compute the emitted probability for the particular states of the defined
HMM model, using the parameters saved in the structure variable defining the HMM
model, when the utterance is a sequence of isolated vowels P1, P2
or P3. Save all computed probabilities into a matrix whose
number of rows equals the number of feature vectors (i.e. the number
of short-time frames) of the analyzed utterance and whose number of columns
equals the number of states of the given HMM model.
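Filling this frames-by-states matrix can be sketched as below. This is an illustration, assuming each emitting-state model exposes a score_samples-style per-frame log-likelihood method (as scikit-learn GMMs do); non-emitting states, for which b() is undefined, are filled with -inf.

```python
import numpy as np

def emission_logprobs(hmm, features):
    """log b_j(o_t) for every frame t and state j.
    Rows = short-time frames of the analyzed utterance, columns = HMM states.
    Non-emitting states get -inf, since b() is not defined for them."""
    T = features.shape[0]
    B = np.full((T, hmm["states"]), -np.inf)
    for j, model in enumerate(hmm["b"]):
        if model is not None:              # skip the non-emitting entry state
            B[:, j] = model.score_samples(features)
    return B
```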
Result:
Display for the available data:
the time dependency of the emitted probability for the
1st, 2nd, 4th, 6th, 8th and 10th HMM state and for the sequence AEIOU (utterance P0).
Observe also these time dependencies of the emitted probability
for the same HMM model and for the sequences UOIEA (utterance P1) and
IUAOE (utterance P2), respectively.
Main result: Observe for the created
HMM models:
the emitted logarithmic probabilities as 2D color graphs (analogous to a spectrogram), using the function 'pcolor', for each model and always for all 3 analyzed utterances P1, P2, P3.
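The display asked for above uses MATLAB's pcolor; a roughly equivalent matplotlib sketch for a frames-by-states matrix B of log-probabilities is shown here for illustration (random placeholder data instead of the real matrix, and the output filename is arbitrary).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                      # render off-screen, no display needed
import matplotlib.pyplot as plt

# B: frames x states matrix of emission log-probabilities (placeholder data here)
B = np.random.default_rng(0).standard_normal((200, 13))

fig, ax = plt.subplots()
pc = ax.pcolor(B.T)                        # states on the y-axis, frames on the x-axis
ax.set_xlabel("frame index")
ax.set_ylabel("HMM state")
fig.colorbar(pc, label="log b_j(o_t)")
fig.savefig("emission_logprobs.png")
```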
HOMEWORK NEEDED for the NEXT SEMINAR
Boundaries of vowels in utterances P1, P2 and P3
Determine manually the vowel boundaries for the utterances P1, P2 and P3.
Store the determined boundaries (in samples, or in frames for the given
short-time analysis setup, i.e. frame length of 25 ms and frame
shift of 10 ms) in a separate vector for each utterance.
Preparation for decoding
Study the principles of computing the likelihood of passing
through a given HMM model based on
the forward-backward procedure and the Viterbi algorithm.
Study also how the occupancy likelihood of the j-th state
at time t can be computed.
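For reference, a compact summary of the standard recursions to study (Rabiner-style notation is assumed here: a_ij are the transition probabilities, b_j(o_t) the emission likelihoods, and lambda denotes the model):

```latex
% Forward variable (likelihood of the partial observation sequence):
\alpha_t(j) = \Bigl[\sum_i \alpha_{t-1}(i)\, a_{ij}\Bigr]\, b_j(o_t)

% Viterbi (best-path) variant: the sum is replaced by a maximum:
\varphi_t(j) = \max_i \bigl[\varphi_{t-1}(i)\, a_{ij}\bigr]\, b_j(o_t)

% Backward variable:
\beta_t(i) = \sum_j a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)

% Occupancy likelihood of the j-th state at time t:
\gamma_t(j) = \frac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \lambda)}
```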