Journal: Modern High Technologies (Современные наукоемкие технологии)

1812-7320

Publishing House "Academy of Natural History" LLC

10.17513/snt.40733

ART-40733

APPLICABILITY OF DIFFUSION MODELS TO CLINICAL SPEECH DENOISING: EXPERIMENTAL ANALYSIS AND LIMITATIONS

https://orcid.org/0000-0002-5524-1325

Staroverova Natalya Aleksandrovna (Staroverova N.A.)

nata-staroverova@yandex.ru

Nuriev Aidar Razilevich (Nuriev A.R.)

tmpower@mail.ru

Nizhnekamsk Institute of Chemical Technology

Pest LLC (ООО «ПЭСТ»)

Kazan National Research Technological University (FGBOU VO KNITU)

07.04.2026

Is. 4. P. 92–97.

This is an open-access article distributed under the terms of the CC BY 4.0 license.

https://top-technologies.ru/ru/article/view?id=40733

This paper experimentally assesses the applicability of diffusion probabilistic models to denoising clinical speech. It is presented as an experience report and compares the diffusion approach with the classical OM-LSA algorithm under controlled, simulated noise conditions. The component structure of non-stationary clinical noise (stationary background, impulsive interference, reverberation) is analyzed and formalized. On the basis of this analysis, the paper states and tests its central claim: classical linear methods are fundamentally limited under these conditions. The experiment uses a synthetic corpus of medical speech totalling 4.2 h (2350 fragments, 18 speakers, 16 kHz), corrupted with stationary, narrowband, and impulsive interference at SNRs from +5 to -5 dB. Three independent noise realizations were generated for each SNR level. The methods are compared by PESQ, STOI, the relative distortion of formant frequencies (ΔF1, ΔF2), and the word error rate (WER) of a wav2vec 2.0-based speech recognition system. Under comparable conditions, the diffusion model preserves the formant structure better and reduces WER more than OM-LSA. The study also identifies computational and methodological limitations of the approach.
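The evaluation setup summarized in the abstract (noise mixed at a target SNR with several independent realizations, then scored by WER) can be illustrated with a minimal sketch. This is not the authors' pipeline. The function names `mix_at_snr` and `word_error_rate`, the SNR grid `(5, 0, -5)`, the seeds, and the placeholder tone and white noise standing in for speech and clinical noise are all assumptions made for illustration.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # From SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)), solve for gain.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)


# Placeholder signal: a 1 s tone at 16 kHz stands in for a speech fragment.
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 220.0 * t)

# Assumed SNR grid inside the +5 to -5 dB range, three realizations per level.
for snr_db in (5, 0, -5):
    for seed in range(3):
        noise = np.random.default_rng(seed).standard_normal(fs)
        noisy = mix_at_snr(speech, noise, snr_db)
        achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
        print(f"target {snr_db:+d} dB, seed {seed}: achieved {achieved:+.2f} dB")
```

In a real comparison, the noisy signal would be passed through each enhancement method and then through the recognizer, and `word_error_rate` would be applied to the reference and recognized transcripts. Averaging over the independent realizations gives a rough view of how sensitive each score is to the particular noise draw.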

diffusion probabilistic models; speech enhancement; clinical acoustic environment; medical speech; non-stationary noise; theoretical justification; formant analysis; robust automatic speech recognition

1. Croitoru F. A., Hondru V., Ionescu R. T., Shah M. Diffusion models in vision: A survey // IEEE Transactions on Pattern Analysis and Machine Intelligence. 2023. Vol. 45. Is. 9. P. 10850–10869. DOI: 10.1109/TPAMI.2023.3261988.

2. Lu Y. J., Wang Z. Q., Watanabe S., Richard A., Yu C., Tsao Y. Conditional diffusion probabilistic model for speech enhancement // 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2022. P. 7402–7406. DOI: 10.1109/ICASSP43922.2022.9746901.

3. Sarker A., Zhang R., Wang Y., Xiao Y., Das S., Schutte D., Oniani D., Xie Q., Xu H. Natural language processing for digital health in the era of large language models // Yearbook of Medical Informatics. 2024. Vol. 33. Is. 1. P. 229–240. DOI: 10.1055/s-0044-1800750.

4. Zhang L., Cheng W., Zhao M., Tang H. Effect of acoustic environment in wards on postoperative rehabilitation in patients with oral cancer: A retrospective study // Noise and Health. 2024. Vol. 26. Is. 121. P. 148–152. DOI: 10.4103/nah.nah_34_24.

5. Lebedev G. S., Shaderkin I. A., Lebedeva N. A. Modifiable indoor environmental factors: impact on human health and digital monitoring. Analytical review // Journal of Telemedicine and E-Health. 2023. Vol. 9. Is. 1. P. 21–48. DOI: 10.29188/2712-9217-2023-9-1-21-48. (In Russian).

6. Johnson A. E. W., Pollard T. J., Shen L. et al. MIMIC-III, a freely accessible critical care database // Scientific Data. 2016. Vol. 3. Is. 1. P. 1–9. DOI: 10.1038/sdata.2016.35.

7. Czyżewski A. et al. A comprehensive Polish medical speech dataset for enhancing automatic medical dictation // Scientific Data. 2025. Vol. 12. Is. 1. P. 1436. DOI: 10.1038/s41597-025-05776-1.

8. Cohen I., Berdugo B. Speech enhancement for non-stationary noise environments // Signal Processing. 2001. Vol. 81. Is. 11. P. 2403–2418. DOI: 10.1016/S0165-1684(01)00128-1.

9. Rix A. W., Beerends J. G., Hollier M. P., Hekstra A. P. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs // 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2001. Vol. 2. P. 749–752. DOI: 10.1109/ICASSP.2001.941023.

10. Taal C. H., Hendriks R. C., Heusdens R., Jensen J. An algorithm for intelligibility prediction of time–frequency weighted noisy speech // IEEE Transactions on Audio, Speech, and Language Processing. 2011. Vol. 19. Is. 7. P. 2125–2136. DOI: 10.1109/TASL.2011.2114881.

11. Buder E. H., Kent R. D., Kent J. F., Milenkovic P., Workinger M. S. FORMOFFA: An automated formant, moment, fundamental frequency, amplitude analysis of normal and disordered speech // Clinical Linguistics & Phonetics. 1996. Vol. 10. Is. 1. P. 31–54. DOI: 10.3109/02699209608985160.

12. Von Neumann T., Boeddeker C., Kinoshita K., Delcroix M., Haeb-Umbach R. On word error rate definitions and their efficient computation for multi-speaker speech recognition systems // 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2023. DOI: 10.1109/icassp49357.2023.10094784.

13. Baevski A., Zhou Y., Mohamed A., Auli M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations // Advances in Neural Information Processing Systems (NeurIPS). 2020. Vol. 33. P. 12449–12460. URL: https://proceedings.neurips.cc/paper/2020/file/92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf (accessed: 26.03.2026).

14. Morris T. P., White I. R., Crowther M. J. Using simulation studies to evaluate statistical methods // Statistics in Medicine. 2019. Vol. 38. Is. 11. P. 2074–2102. DOI: 10.1002/sim.8086.

15. Kewley-Port D., Watson C. S. Formant-frequency discrimination for isolated English vowels // The Journal of the Acoustical Society of America. 1994. Vol. 95. Is. 1. P. 485–496. DOI: 10.1121/1.410024.