Speech Recognition under Real-World Conditions

Participants: Speech Signal Processing Laboratory, Laboratory of Computer Speech Processing, Department of Artificial Intelligence, Speech Processing Group

This project follows on from earlier grant-supported research, in which the project team developed and implemented basic speech recognition algorithms for the Czech language. To use these algorithms successfully in the most sought-after applications, such as the transcription of talks, recorded discussions, or court trials, research must continue on the analysis and modelling of everyday (slang) speech collected in real conditions (e.g. with various backgrounds, noise, or cross-talk).

The main goal of this four-year project is to design new speech parameterizations and methods for background and noise suppression, speaker change-point detection, and fast adaptation to the characteristics of a new speaker; to bring the lexical and phonetic inventory of the recognition system closer to colloquial speech; and to develop language models that better cover the inflective nature of Czech.

The main goals of this project reflect the need for deeper research and understanding in fields related to a substantial improvement of speech recognition accuracy under real conditions. They follow logically from the participants' results mentioned above. The actual research goals can be summarized as follows:

The research activities within this project were conducted in four thematic domains: feature extraction and signal processing, acoustic modeling, language modeling, and algorithms for search and decoding.

Within the field of feature extraction, the activities focused on the research and development of robust parameterization techniques for speech signals that are frequently non-standard or strongly distorted. The most important results include proposed combinations of parameterization and background noise suppression (especially using blind source separation techniques), general studies of an optimized hierarchy of artificial neural networks used in feature extraction, and analyses of emotional speech with applications in various recognition tasks.

Within the field of acoustic modeling, new methods for training and applying acoustic models were developed (including techniques for adaptation to speaker or channel), along with new models based on projection into a low-dimensional subspace: iVector-PLDA models for speaker recognition and Sub-Space Gaussian models, which are especially suitable for tasks with a small amount of training data. A new method of artificial neural network adaptation was also presented; it can be applied in both hybrid and classical HMM- and GMM-based speech recognition systems.

The research and development of language models focused on methods suitable for inflective languages. A dictionary with a hierarchical structure was designed, and on the basis of this structure a more compact dictionary and an N-gram language model were generated. Alternative, competitive language models based on recurrent neural networks were also developed.

Within the thematic domain of search algorithms, real-time decoders working with very large dictionaries (more than 500 000 words) were developed. Several decoders were created as part of the "KALDI toolkit", which is freely available to the wider research community.

The project results were published in 165 publications in total (see the Publications section); more than half of them are prestigious publications, i.e. articles in journals with an impact factor (9) and contributions to prestigious conferences indexed in the WoS database (82). The research topics of the project were linked to the PhD studies of the students involved; 12 PhD theses were successfully defended.

Finally, this project has advanced the state of the art in basic research in the field of speech recognition, and it has also led to closer integration of the participating teams into the international research community.