Journal: Современные наукоемкие технологии (Modern High Technologies)

1812-7320

Limited Liability Company "Publishing House "Academy of Natural History"

10.17513/snt.40778

ART-40778

COMPARATIVE STUDY OF STRATEGIES FOR BUILDING HYBRID CORPORA OF EDUCATIONAL TEXTS FOR AUTOMATED ANALYSIS TASKS

Маслий

А. А.

Masliy

A. A.

Russian Federation

Староверова

Н. А.

Staroverova

N. A.

nata-staroverova@yandex.ru

Federal State Budgetary Educational Institution of Higher Education "Kazan National Research Technological University"

28.05.2026

No. 5. P. 73–83

This is an open-access article distributed under the terms of the CC BY 4.0 license.

https://top-technologies.ru/ru/article/view?id=40778

A shortage of high-quality labeled data holds back the development of systems for deep analysis of educational texts, which motivates this experimental study of hybrid corpora. The aim is to determine the ratio of real to synthetic texts in a hybrid corpus of educational texts that maximizes integral quality, using a mathematical model built from the data of a numerical experiment run with purpose-built software. Large language models (GPT-4, Grok, DeepSeek, Gemini, Cogito) were evaluated on a sample of essays; combinations of texts generated by different models were examined; and hybrid datasets with varying ratios of student-written and generated essays were tested. Three independent experts carried out the assessment, and Kendall's coefficient of concordance indicated high agreement among them. The best mix of AI-generated texts was obtained by combining GPT-4 and DeepSeek. A quadratic approximation was fitted to the experimental points by the least squares method; it showed high accuracy and outperformed the linear and cubic models. The abscissa of the parabola's vertex corresponds to the optimal configuration (40 % real / 60 % synthetic texts). The Python software suite developed for the study ensures reproducibility of the experiment. Replacing up to 60 % of real data with synthetic data does not reduce the quality of automated analysis.
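The vertex-based choice of the mixing ratio described in the abstract can be illustrated with a minimal Python sketch. It assumes NumPy is available. The function name `fit_quadratic_vertex` and the data points are hypothetical: the points are placed on an exact parabola purely for illustration and are not the study's measurements, and the code is not the authors' software.

```python
import numpy as np


def fit_quadratic_vertex(share, quality):
    """Least-squares fit of quality = a*share**2 + b*share + c.

    Returns the coefficients (a, b, c) and the abscissa of the vertex,
    which is the maximizer when the parabola opens downward (a < 0).
    """
    a, b, c = np.polyfit(share, quality, deg=2)
    if a >= 0:
        raise ValueError("fitted parabola has no maximum (a >= 0)")
    vertex = -b / (2.0 * a)
    return (a, b, c), vertex


# Hypothetical illustration data (NOT from the article): share of synthetic
# texts in the corpus and an integral quality score lying on a parabola.
synthetic_share = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
quality = np.array([0.54, 0.74, 0.86, 0.90, 0.86, 0.74])

coeffs, best_share = fit_quadratic_vertex(synthetic_share, quality)
print(f"estimated optimal synthetic share: {best_share:.2f}")
```

With real measurements the points would scatter around the curve, so the vertex is only an estimate and should be checked against the sampled range of ratios before it is read as an optimum.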

dataset; synthetic data; text annotation; representativeness; machine learning; language models; hybrid datasets
