Search results for 'text analysis':

Articles found: 22

Editor’s note
Computer Research and Modeling, 2024, v. 16, no. 7, pp. 1533-1538
Adekotujo A.S., Enikuomehin T., Aribisala B., Mazzara M., Zubair A.F.
Computational treatment of natural language text for intent detection
Computer Research and Modeling, 2024, v. 16, no. 7, pp. 1539-1554

Intent detection, the process of algorithmically determining user intent from a given statement, plays a crucial role in task-oriented conversational systems. To understand the user’s goal, the system relies on its intent detector to classify the user’s utterance, which may be expressed in different forms of natural language, into intent classes. However, the efficacy of intent detection systems has been hindered by a lack of data and by the fact that the user’s intent text is typically characterized by short, general sentences and colloquial expressions. The goal of this study is to develop an intent detection model that accurately classifies and detects user intent. The proposed model uses Contextual Semantic Search (CSS) capabilities for semantic search, Latent Dirichlet Allocation (LDA) for topic modeling, the Bidirectional Encoder Representations from Transformers (BERT) semantic matching technique, and the combination of LDA and BERT for text classification and detection. Similarity scores are calculated for the three models so that they can be compared. The dataset was taken from the Broad Twitter Corpus (BTC) and includes various metadata. A pre-processing step was applied to prepare the data for analysis. A sample of 1432 instances was selected out of the 5000 available, because manual annotation is required and can be time-consuming. To compare the performance of the model with that of the existing models, the similarity scores, precision, recall, F1 score, and accuracy were computed. The results show that LDA-BERT achieved an accuracy of 95.88% for intent detection, BERT 93.84%, and LDA 92.23%, so LDA-BERT outperforms the other models. It is hoped that the novel model will aid in ensuring information security and social media intelligence. In future work, an unsupervised LDA-BERT that uses no labeled data could be studied.

Keywords: hate speech, intent classification, Twitter posts, sentiment analysis, opinion mining, intent identification from Twitter posts.
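The abstract above does not say how the LDA and BERT representations are combined. One common option, assumed here purely for illustration, is to concatenate a scaled topic-probability vector with a sentence embedding and compare vectors by cosine similarity. The function names, the `gamma` weight, the intent labels, and all numbers below are hypothetical; real topic vectors and embeddings would come from trained LDA and BERT models.

```python
import math


def combine(topic_probs, embedding, gamma=1.0):
    """Concatenate a scaled topic vector with a sentence embedding.

    gamma is a hypothetical weight balancing the two parts.
    """
    return [gamma * p for p in topic_probs] + list(embedding)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(query_vec, prototypes):
    """Return the intent label whose prototype vector is most similar."""
    return max(prototypes, key=lambda label: cosine(query_vec, prototypes[label]))


# Made-up vectors standing in for real LDA topic mixtures and BERT embeddings.
prototypes = {
    "complaint": combine([0.9, 0.1], [0.2, 0.8, 0.1]),
    "question": combine([0.1, 0.9], [0.7, 0.1, 0.3]),
}
query = combine([0.8, 0.2], [0.3, 0.7, 0.2])
predicted = classify(query, prototypes)
```

Nearest-prototype matching is only one way to use the combined vector; a trained classifier on top of it would be an equally plausible reading of the abstract.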

Petrov M.N., Zimina S.V.
A surrogate neural network method for restoring the flow field from a homogeneous field by iterations in calculations of steady turbulent flows
Computer Research and Modeling, 2025, v. 17, no. 2, pp. 179-197

In recent years, neural network models have become widely used for solving aerodynamics problems. Such models, trained on a set of previously obtained solutions, predict solutions to new problems and are, in essence, interpolation algorithms. An alternative approach is to construct a neural network operator: a neural network model that reproduces the behavior of a numerical method for solving the problem. Such a model makes it possible to find the solution iteratively. The paper considers the construction of such an operator using a UNet-type neural network with a spatial attention mechanism for solving flow problems on a rectangular uniform grid shared by the streamlined body and the flow field. A correction mechanism is proposed and studied to refine the obtained solution. The stability of such an algorithm for solving a steady-state problem is analyzed, and a comparison is made with other variants of its construction, including the pushforward trick and positional encoding. The choice of the set of iterations used to form the training dataset is considered, and the behavior of the solution under repeated application of the neural network operator is assessed.

The method is demonstrated for turbulent air flow around a rounded plate with various rounding variants, at fixed free-stream parameters with Reynolds number $\text{Re} = 10^5$ and Mach number $M = 0.15$. Since flows with these free-stream parameters can be considered incompressible, only the velocity components are studied directly. The neural network model used to construct the operator has a common decoder for both velocity components. Flow fields and velocity profiles along the normal to the body and along its contour, obtained with the neural network operator and numerically, are compared. The analysis is performed both on the plate and on the rounding. The simulation results confirm that the neural network operator finds the solution with high accuracy and in a stable manner.

Keywords: aerodynamics, turbulence, neural operator, convolutional neural network, UNet, attention mechanism.
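The idea of recovering a steady field by repeatedly applying an operator to a uniform initial field can be sketched as a fixed-point iteration. In the sketch below the trained UNet operator is replaced by a hypothetical stand-in, one Jacobi relaxation sweep for a 1D Laplace problem, so that the example stays self-contained. The function names, the tolerance, and the grid size are assumptions and are not taken from the paper.

```python
def jacobi_step(u):
    """Stand-in operator: one Jacobi sweep for u'' = 0 on a 1D grid.

    The boundary values u[0] and u[-1] are held fixed.
    """
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def iterate_operator(step, u0, tol=1e-10, max_iter=100_000):
    """Apply `step` repeatedly until the largest pointwise change drops below tol."""
    u = list(u0)
    for n in range(1, max_iter + 1):
        u_next = step(u)
        change = max(abs(a - b) for a, b in zip(u_next, u))
        u = u_next
        if change < tol:
            return u, n
    return u, max_iter


# Uniform (all-zero) initial field with the boundary condition u = 1 on the right.
n_points = 11
u0 = [0.0] * n_points
u0[-1] = 1.0
solution, n_iter = iterate_operator(jacobi_step, u0)
```

For this stand-in the fixed point is the linear profile between the two boundary values. With a learned operator the same loop applies, but convergence is not guaranteed, which is why the abstract treats stability as a separate question.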
Vorontsov K.V., Potapenko A.A.
Regularization, robustness and sparsity of probabilistic topic models
Computer Research and Modeling, 2012, v. 4, no. 4, pp. 693-706

We propose a generalized family of probabilistic topic models of text corpora in which heuristics of Bayesian regularization, sampling, frequent parameter updates, and robustness to noise and background can be incorporated independently of one another in any combination. The well-known models PLSA, LDA, CVB0, and SWB, as well as new ones, are special cases of this broad family. We propose a robust PLSA-based model that separates terms into topical, noise, and background ones, and show that it needs no regularization, yields sparse distributions of topics in documents and of terms in topics, and performs better than regularized models like LDA.

Keywords: text analysis, topic modeling, probabilistic latent semantic analysis, EM-algorithm, latent Dirichlet allocation, Gibbs sampling, Bayesian regularization, perplexity, robustness.
Views in the past year: 25. Citations: 12 (RSCI).
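As background for the entry above, the plain PLSA model that the proposed family generalizes can be fitted with the EM algorithm. The following is a minimal sketch of textbook PLSA only. It does not implement the robust, regularized, or sampling variants from the paper, and the function names, toy counts, and iteration count are assumptions.

```python
import math
import random


def normalize(values):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(values)
    return [v / total for v in values]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit plain PLSA by EM. counts[d][w] is the count of word w in document d.

    Assumes every document contains at least one word.
    Returns phi[t][w] = p(w|t), theta[d][t] = p(t|d), and the log-likelihood
    evaluated at the start of each iteration.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    phi = [normalize([rng.random() + 0.1 for _ in range(n_words)])
           for _ in range(n_topics)]
    theta = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    log_likelihoods = []
    for _ in range(n_iter):
        n_wt = [[0.0] * n_words for _ in range(n_topics)]
        n_td = [[0.0] * n_topics for _ in range(n_docs)]
        log_likelihood = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: p(t | d, w) is proportional to p(w | t) * p(t | d).
                joint = [phi[t][w] * theta[d][t] for t in range(n_topics)]
                p_dw = sum(joint)
                log_likelihood += counts[d][w] * math.log(p_dw)
                for t in range(n_topics):
                    expected = counts[d][w] * joint[t] / p_dw
                    n_wt[t][w] += expected
                    n_td[d][t] += expected
        # M-step: re-estimate both distributions from the expected counts.
        phi = [normalize(row) for row in n_wt]
        theta = [normalize(row) for row in n_td]
        log_likelihoods.append(log_likelihood)
    return phi, theta, log_likelihoods


# Toy document-term counts (made up): two documents per latent theme.
counts = [[4, 3, 0, 0], [5, 2, 0, 1], [0, 0, 3, 4], [0, 1, 4, 3]]
phi, theta, log_likelihoods = plsa(counts, n_topics=2)
```

The recorded log-likelihood should not decrease from one iteration to the next, which is a convenient sanity check for any EM implementation.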
Kulikov Y.M., Son E.E.
CABARET scheme implementation for free shear layer modeling
Computer Research and Modeling, 2017, v. 9, no. 6, pp. 881-903

In the present paper we re-examine the properties of the CABARET numerical scheme, formulated for a weakly compressible fluid flow, based on the results of free shear layer modeling. The Kelvin–Helmholtz instability and the subsequent generation of two-dimensional turbulence provide a wide field for analyzing the scheme, including the temporal evolution of the integral kinetic energy and enstrophy curves, the vorticity patterns, the enstrophy and energy spectra, and the dispersion relation for the instability increment. Most of the calculations are performed for Reynolds number $\text{Re} = 4 \times 10^5$ on square grids sequentially refined in the range from $128^2$ to $2048^2$ nodes. Attention is paid to the problem of underresolved layers, which generate a spurious vortex during the roll-up of two vorticity layers. This phenomenon takes place only on the coarse grid with $128^2$ nodes, while a fully symmetric vorticity evolution pattern appears only on the $1024^2$-node grid. We also discuss the vorticity resolution properties of the grids with respect to dimensional estimates for the eddies at the boundaries of the inertial interval, showing that the available range of grids appears to be sufficient for a good resolution of small-scale vorticity patches. Nevertheless, we claim that convergence is achieved for the domains occupied by large-scale structures.

The evolution of the generated turbulence is consistent with theoretical concepts, which predict the emergence of large vortices that collect all the kinetic energy of motion, and of solitary small-scale eddies. The latter resemble coherent structures: they survive the filamentation process and hardly interact with other scales. The dissipative characteristics of the numerical method are discussed in terms of the kinetic energy dissipation rate, calculated both directly and from theoretical relations for the incompressible (via enstrophy curves) and compressible (via the strain rate tensor and dilatation effects) fluid models. The asymptotic behavior of the kinetic energy and enstrophy cascades complies with the two-dimensional turbulence laws $E(k) \propto k^{-3}$, $\omega^2(k) \propto k^{-1}$. The dependence of the instability increment on the dimensionless wave number agrees well with other papers; however, the commonly used method of calculating the instability growth rate is not always accurate, so a modification is proposed. Thus, the implemented CABARET scheme, with its remarkably small numerical dissipation and good vorticity resolution, is quite competitive with high-order accuracy methods.

Keywords: CABARET numerical scheme, weakly compressible fluid, Kelvin–Helmholtz instability, vorticity, enstrophy, instability increment, underresolved layers, spurious vortex, roll-up, inertial interval, coherent structures, filamentation, dissipation rate, dilatation.
Views in the past year: 17.
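The two spectral laws quoted in the entry above are mutually consistent, because the enstrophy spectrum equals $k^2 E(k)$ up to a constant factor. A simple way to check such a power law against computed spectra is a least-squares fit of the slope in log-log coordinates. The sketch below uses a synthetic spectrum rather than data from the paper, and the function name and the wavenumber range are assumptions.

```python
import math


def loglog_slope(k_values, spectrum):
    """Least-squares slope of log(spectrum) against log(k)."""
    xs = [math.log(k) for k in k_values]
    ys = [math.log(e) for e in spectrum]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic inertial-range spectrum E(k) = 2 k^-3 (made-up amplitude).
k = [float(i) for i in range(8, 65)]
energy = [2.0 * ki ** -3 for ki in k]
# Enstrophy spectrum, taken as k^2 E(k) (constant factor omitted).
enstrophy = [ki ** 2 * e for ki, e in zip(k, energy)]

energy_slope = loglog_slope(k, energy)
enstrophy_slope = loglog_slope(k, enstrophy)
```

With real simulation output the fit should be restricted to the inertial interval, since the forcing and dissipation ranges do not follow the same power law.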
Zabello K.K., Garbaruk A.V.
Investigation of the accuracy of the lattice Boltzmann method in calculating acoustic wave propagation
Computer Research and Modeling, 2025, v. 17, no. 6, pp. 1069-1081

The article presents a systematic investigation of the capabilities of the lattice Boltzmann method (LBM) for modeling the propagation of acoustic waves. The study considers wave propagation from a point harmonic source in an unbounded domain, both in a quiescent medium (Mach number $M=0$) and in the presence of a uniform mean flow ($M=0.2$). Both problems admit analytical solutions within linear acoustics, which allows a quantitative assessment of the accuracy of the numerical method.

The numerical implementation employs the two-dimensional D2Q9 velocity model and the Bhatnagar–Gross–Krook (BGK) collision operator. The oscillatory source is modeled using Guo’s scheme, while spurious noise generated by the source in the high-order moments is suppressed by a regularization procedure applied to the distribution functions. To minimize wave reflections from the boundaries of the computational domain, a hybrid approach is used that combines characteristic boundary conditions based on Riemann invariants with perfectly matched layers (PML) featuring a parabolic damping profile.

A detailed analysis is conducted of the influence of the computational parameters on the accuracy of the method. The dependence of the error on the PML thickness ($L_{\text{PML}}$), the maximum damping coefficient ($\sigma_{\max}$), the dimensionless source amplitude ($Q'_0$), and the grid step is examined. The results demonstrate that the LBM is suitable for simulating acoustic wave propagation and exhibits second-order accuracy. Achieving high accuracy (relative pressure error of at most $1\,\%$) requires a spatial resolution of $20$ grid points per wavelength ($\lambda$). The minimal effective PML parameters that eliminate reflections from the boundaries of the computational domain are $\sigma_{\max} \geqslant 0.02$ and $L_{\text{PML}} \geqslant 2\lambda$. It is also shown that for source amplitudes $Q'_0 \geqslant 0.1$, nonlinear effects become significant compared to other sources of error.

Keywords: lattice Boltzmann method (LBM), aeroacoustics, numerical simulation, regularization, PML layer, characteristic boundary conditions.
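Two ingredients named in the entry above, the D2Q9 BGK collision and a parabolic PML damping profile, can be sketched for a single lattice node. This is a minimal illustration using the standard D2Q9 weights and second-order equilibrium. It omits the source term, the regularization step, streaming, and the characteristic boundary conditions, and the way the damping coefficient enters the update is not specified in the abstract. The function names and the sample populations are assumptions.

```python
# D2Q9 lattice velocities and weights (standard values).
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def moments(f):
    """Density and velocity from the nine populations at one node."""
    rho = sum(f)
    ux = sum(fi * c[0] for fi, c in zip(f, C)) / rho
    uy = sum(fi * c[1] for fi, c in zip(f, C)) / rho
    return rho, ux, uy


def equilibrium(rho, ux, uy):
    """Second-order equilibrium distribution in lattice units (c_s^2 = 1/3)."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(C, W):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def bgk_collide(f, tau):
    """BGK collision: relax each population toward equilibrium with time tau."""
    rho, ux, uy = moments(f)
    feq = equilibrium(rho, ux, uy)
    return [fi - (fi - fe) / tau for fi, fe in zip(f, feq)]


def pml_sigma(depth, thickness, sigma_max):
    """Parabolic damping: 0 at the inner edge, sigma_max at the outer boundary."""
    if depth <= 0.0:
        return 0.0
    return sigma_max * min(depth / thickness, 1.0) ** 2


# Made-up non-equilibrium populations at one node.
f = [0.40, 0.12, 0.10, 0.11, 0.12, 0.03, 0.02, 0.03, 0.02]
f_post = bgk_collide(f, tau=0.8)

# With 20 nodes per wavelength, a layer of two wavelengths is 40 nodes thick.
sigma_mid = pml_sigma(depth=20.0, thickness=40.0, sigma_max=0.02)
```

The collision step conserves mass and momentum at the node, which is the property checked first when validating an LBM kernel.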
Antipova S.A., Zhurkin A.M.
Resource-adaptive approach to structured text data annotation using small language models
Computer Research and Modeling, 2026, v. 18, no. 1, pp. 41-59

This paper presents an experimental study of automatic annotation of text data in question-answer format (QA pairs) under limited computing resources and data protection requirements. Unlike traditional approaches based on rigid rules or external APIs, we propose using small language models with a small number of parameters that can run locally without a GPU on standard CPU systems. Two models were selected for testing, Gemma-3-4b and Qwen-2.5-3b (quantized 4-bit versions), and a corpus of documents with a clear structure and a formally rigorous style was used as source material. An automatic annotation system was developed that implements the full cycle of QA dataset generation: automatic division of the source document into logically coherent fragments, generation of question-answer pairs by the Gemma-3-4b model, preliminary verification of their correctness by Qwen-2.5-3b based on an evidence span from the context, and expert quality assessment. The results are exported in JSONL format. Performance evaluation covers the entire QA pair generation system, including fragment processing by the local language model and the text preprocessing and postprocessing modules. Performance is measured by the time needed to generate a single QA pair, the total throughput of the system, RAM usage, and CPU load, which allows an objective assessment of the computational efficiency of the proposed approach when running on a CPU. An experiment on an extended sample of 12 documents showed that automatic annotation performs stably across different types of documents, while manual annotation is characterized by significantly higher time costs and high variability. Depending on the type of document, automatic annotation is 8 to 14 times faster than the manual process.

Quality analysis showed that most of the generated QA pairs have high semantic consistency with the original context, with only a limited proportion of the data requiring expert correction or exclusion. Although full manual validation of the corpus (the “gold standard”) was not performed as part of this work, the combination of automatic evaluation and selective expert review allows the resulting quality level to be considered acceptable for preliminary automated annotation tasks. Overall, the results confirm the practical applicability of small language models for building autonomous and reproducible automatic text annotation systems under limited computational resources and provide a basis for further research on efficient preparation of training corpora for natural language processing tasks.

Keywords: language models, data annotation, question-answer, quality evaluation, local computation, limited computational resources.
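The annotation cycle described in the entry above (fragment splitting, QA generation, evidence-based verification, JSONL export) can be sketched with the two language models replaced by trivial stand-ins. Everything below is hypothetical: the function names, the record fields, the blank-line splitting rule, and the verbatim-match check are assumptions, and no real model is called.

```python
import json


def split_into_fragments(text):
    """Split on blank lines; a stand-in for logical segmentation."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def generate_qa(fragment):
    """Hypothetical stand-in for the generator model (Gemma-3-4b in the paper)."""
    first_sentence = fragment.split(".")[0].strip() + "."
    return {"question": "What does this fragment state first?",
            "answer": first_sentence}


def verify_qa(pair, fragment):
    """Hypothetical stand-in for the verifier model (Qwen-2.5-3b in the paper).

    Accepts a pair only if its answer occurs verbatim in the fragment and
    returns the character offsets of that evidence span.
    """
    start = fragment.find(pair["answer"])
    if start < 0:
        return None
    return {"start": start, "end": start + len(pair["answer"])}


def annotate(text):
    """Run the full cycle and return one record per accepted QA pair."""
    records = []
    for index, fragment in enumerate(split_into_fragments(text)):
        pair = generate_qa(fragment)
        evidence = verify_qa(pair, fragment)
        if evidence is not None:
            records.append({"fragment_id": index, "context": fragment,
                            **pair, "evidence": evidence})
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


document = ("The operator shall log every request. Logging starts at boot.\n\n"
            "Logs are kept for one year. Older logs are deleted.")
jsonl_text = to_jsonl(annotate(document))
```

Keeping the evidence span in each record lets a human reviewer check a pair against its context without re-reading the whole fragment, which fits the selective expert review mentioned in the abstract.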
Кочергин А.В., Холматова З.Ш.
Извлечение персонажей и событий из повествований
Компьютерные исследования и моделирование, 2024, т. 16, № 7, с. 1593-1600

text-align: justify;">Извлечение событий и персонажей из повествований является фундаментальной задачей при анализе и обработке текста на естественном языке. Методы извлечения событий применяются в самых разных областях — от обобщения различных документов до анализа медицинских записей. Мы определяли события на основе структуры под названием «четыре W» (кто, что, когда, где), чтобы охватить все основные компоненты событий, такие как действующие лица, действия, время и места. В этой статье мы рассмотрели два основных метода извлечения событий: статистический анализ синтаксических деревьев и семантическая маркировка ролей. Хотя эти методы были изучены разными исследователями по отдельности, мы напрямую сравнили эффективность двух подходов на собранном нами наборе данных, который мы разметили.

text-align: justify;">Наш анализ показал, что статистический анализ синтаксических деревьев превосходит семантическую маркировку ролей при выделении событий и символов, особенно при определении конкретных деталей. Тем не менее, семантическая маркировка ролей продемонстрировала хорошую эффективность при правильной идентификации действующих лиц. Мы оценили эффективность обоих подходов, сравнив различные показатели, такие как точность, отзывчивость и F1-баллы, продемонстрировав, таким образом, их соответствующие преимущества и ограничения.

text-align: justify;">Более того, в рамках нашей работы мы предложили различные варианты применения методов извлечения событий, которые мы планируем изучить в дальнейшем. Области, в которых мы хотим применить эти методы, включают анализ кода и установление авторства исходного кода. Мы рассматриваем возможность использования методов извлечения событий для определения ключевых элементов кода в виде назначений переменных и вызовов функций, что в дальнейшем может помочь ученым проанализировать поведение программ и определить участников проекта. Наша работа дает новое понимание эффективности статистического анализа и методов семантической маркировки ролей, предлагая исследователям новые направления для применения этих методов.

Ключевые слова: извлечение событий, обработка естественного языка, статистический анализ, семантическая маркировка ролей.

Kochergin A.V., Kholmatova Z.Sh.
Extraction of characters and events from narratives
Computer Research and Modeling, 2024, v. 16, no. 7, pp. 1593-1600

Event and character extraction from narratives is a fundamental task in text analysis. Applications of event extraction techniques range from summarizing documents to analyzing medical notes. We identify events using a framework called “four W” (Who, What, When, Where), which captures all the essential components of an event: actors, actions, times, and places. In this paper, we explore two prominent techniques for event extraction: statistical parsing of syntactic trees and semantic role labeling. While these techniques have been investigated by different researchers in isolation, we directly compare the performance of the two approaches on a custom dataset that we collected and annotated.

Our analysis shows that statistical parsing of syntactic trees outperforms semantic role labeling in event and character extraction, especially in identifying specific details. Nevertheless, semantic role labeling demonstrates good performance in correctly identifying actors. We evaluate the effectiveness of both approaches by comparing metrics such as precision, recall, and F1-score, thus demonstrating their respective advantages and limitations.
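The evaluation described above can be illustrated with a minimal sketch. It assumes that each event is stored as a “four W” tuple and that a predicted event counts as correct only when all four fields match an annotated one exactly; the `Event` type, the `prf1` helper, and the sample events are hypothetical and are not taken from the paper.

```python
from collections import namedtuple

# Hypothetical container for one "four W" event (not the authors' data format).
Event = namedtuple("Event", ["who", "what", "when", "where"])

def prf1(gold, predicted):
    """Return (precision, recall, F1) for predicted events against gold annotations.

    Assumption: exact match on all four fields counts as a true positive.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative, made-up events.
gold = [
    Event("Alice", "met", "Monday", "Paris"),
    Event("Bob", "left", "Tuesday", "Rome"),
]
predicted = [
    Event("Alice", "met", "Monday", "Paris"),
    Event("Bob", "arrived", "Tuesday", "Rome"),
]
precision, recall, f1 = prf1(gold, predicted)
```

A looser matching rule, for example scoring each of the four fields separately, would give per-field scores and could show where one technique is stronger, such as actor identification.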

Moreover, as part of our work, we propose several future applications of event extraction techniques that we plan to investigate. The areas where we want to apply these techniques include code analysis and source code authorship attribution. We consider using event extraction to retrieve key code elements such as variable assignments and function calls, which can further help to analyze the behavior of programs and identify a project’s contributors. Our work provides new insight into the performance of statistical parsing and semantic role labeling, offering researchers new directions for applying these techniques.

Keywords: event extraction, natural language processing, statistical parsing, semantic role labeling.
Орлова И.Н., Голубцова А.Н., Орлов В.А., Орлов Н.В.
Исследование достижимости цели в медицинском квесте
Компьютерные исследования и моделирование, 2025, т. 17, № 6, с. 1149-1179

В работе представлено экспериментальное исследование древовидной структуры, возникающей при медицинском обследовании. При каждой встрече с медицинским специалистом пациент получает некоторое количество направлений на консультации других специалистов или на анализы. Возникает дерево направлений, каждую ветвь которого должен пройти пациент. В зависимости от разветвленности дерева оно может быть как конечным (и в этом случае обследование может быть завершено), так и бесконечным, когда цель пациента не может быть достигнута. В работе как экспериментально, так и теоретически изучаются критические свойства перехода системы из леса конечных деревьев в лес бесконечных в зависимости от вероятностных характеристик дерева.

Для описания предлагается модель, в которой дискретная функция вероятности числа ветвей на узле повторяет динамику непрерывного гауссового распределения. Характеристики распределения Гаусса (математическое ожидание $x_0$, среднеквадратичное отклонение $\sigma$) являются параметрами модели. В выбранной постановке задача относится к проблематике ветвящихся случайных процессов (ВСП) в неоднородной модели Гальтона – Ватсона.

Экспериментальное изучение проводится путем численного моделирования на конечных решетках. Построена фазовая диаграмма, определены границы областей различных фаз. Проведено сравнение с фазовой диаграммой, полученной из теоретических критериев для макросистем, установлено адекватное соответствие. Показано, что на конечных решетках переход является размытым.

Описание размытого фазового перехода проведено с помощью двух подходов. В первом (стандартном) подходе переход описывается с помощью так называемой функции включения, имеющей смысл доли одной из фаз в общем множестве. Установлено, что такой подход в данной системе неэффективен, поскольку найденное положение условной границы размытого перехода определяется только размером выбранной экспериментальной решетки и не несет объективного смысла.

Предлагается второй (оригинальный) подход, основанный на введении в рассмотрение параметра порядка, равного обратной средней высоте дерева, и анализа его поведения. Установлено, что динамика такого параметра порядка в сечениях $\sigma = \text{const}$ с очень небольшими отличиями имеет вид распределения Ферми – Дирака ($\sigma$ выполняет ту же функцию, что и температура для распределения Ферми – Дирака, $x_0$ — функцию энергии). Для параметра порядка подобрано эмпирическое выражение, введен и рассчитан аналог химического потенциала, который и имеет смысл характерного масштаба параметра порядка, то есть тех значений $x_0$, при которых условно можно считать, что порядок сменяется беспорядком. Этот критерий положен в основу определения границы условного перехода в данном подходе. Установлено, что эта граница соответствует средней высоте дерева, равной двум поколениям. На основании обнаруженных свойств предложены рекомендации для медицинских учреждений, позволяющие контролировать обеспечение конечности траектории пациентов.

Рассмотренная модель и метод ее описания с помощью условно-бесконечных деревьев имеют приложение ко многим иерархическим системам. К таким системам можно отнести сети маршрутизации интернет-соединений, бюрократические сети, торговые, логистические сети, сети цитирования, игровые стратегии, задачи популяционной динамики и пр.

Ключевые слова: медицинское обследование, ветвящийся случайный процесс, модель Гальтона – Ватсона, размытые фазовые переходы, конечные системы, условно-бесконечные траектории, макросистема, функция включения, области почти чистых фаз, параметр порядка, химический потенциал, фазовая диаграмма, критическое поведение.

Orlova I.N., Golubtsova A.N., Orlov V.A., Orlov N.V.
Research on the achievability of a goal in a medical quest
Computer Research and Modeling, 2025, v. 17, no. 6, pp. 1149-1179

The paper presents an experimental study of the tree structure that arises during a medical examination. At each visit to a medical specialist, the patient receives a number of referrals to other specialists or for tests. A tree of referrals arises, every branch of which the patient must follow. Depending on how strongly the tree branches, it can be either finite (in which case the examination can be completed) or infinite, in which case the patient’s goal cannot be reached. The paper studies, both experimentally and theoretically, the critical properties of the system’s transition from a forest of finite trees to a forest of infinite trees, depending on the probabilistic characteristics of the tree.

To describe this structure, a model is proposed in which the discrete probability function for the number of branches at a node follows the shape of a continuous Gaussian distribution. The characteristics of the Gaussian distribution (the mean $x_0$ and the standard deviation $\sigma$) are the model parameters. In this formulation, the problem belongs to the field of branching random processes (BRP) in the inhomogeneous Galton – Watson model.
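A branching process of this kind can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the authors' model: the Gaussian is discretized by evaluating its density at the integers `0..k_max` and renormalizing, the same distribution is used at every node (the paper refers to an inhomogeneous model), and a tree that exceeds the hypothetical caps `max_generations` or `max_nodes` is treated as conditionally infinite.

```python
import math
import random

def offspring_pmf(x0, sigma, k_max=20):
    """Discretized Gaussian: P(k branches) is proportional to exp(-(k - x0)^2 / (2 sigma^2)).

    Assumption: support truncated to 0..k_max and renormalized.
    """
    weights = [math.exp(-((k - x0) ** 2) / (2 * sigma ** 2)) for k in range(k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def tree_height(pmf, rng, max_generations=50, max_nodes=10_000):
    """Simulate one referral tree and return its height in generations.

    Returns None when a cap is hit, which this sketch treats as
    a conditionally infinite tree.
    """
    branch_counts = range(len(pmf))
    alive = 1  # the root: the first visit to a specialist
    for generation in range(max_generations):
        if alive == 0:
            return generation
        if alive > max_nodes:
            return None
        # Every node of the current generation draws its number of branches.
        alive = sum(rng.choices(branch_counts, weights=pmf, k=alive))
    return max_generations if alive == 0 else None

rng = random.Random(0)
pmf = offspring_pmf(x0=0.5, sigma=0.5)
heights = [tree_height(pmf, rng) for _ in range(1000)]
finite_share = sum(h is not None for h in heights) / len(heights)
```

Repeating the run over a grid of `x0` and `sigma` values and recording `finite_share` is one way to approximate a phase diagram numerically.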

The experimental study is carried out by numerical simulation on finite lattices. A phase diagram is constructed and the boundaries of the regions of the different phases are determined. A comparison with the phase diagram obtained from theoretical criteria for macrosystems shows adequate agreement. It is shown that on finite lattices the transition is diffuse.

The diffuse phase transition is described using two approaches. In the first (standard) approach, the transition is described by the so-called inclusion function, which has the meaning of the fraction of one of the phases in the total set. This approach is found to be ineffective for this system, since the resulting position of the conditional boundary of the diffuse transition is determined only by the size of the chosen experimental lattice and carries no objective meaning.

A second (original) approach is proposed, based on introducing an order parameter equal to the inverse mean tree height and analyzing its behavior. It is found that, in sections $\sigma = \text{const}$, the behavior of this order parameter follows, with very small deviations, the form of the Fermi – Dirac distribution ($\sigma$ plays the same role as temperature in the Fermi – Dirac distribution, and $x_0$ plays the role of energy). An empirical expression is fitted to the order parameter, and an analogue of the chemical potential is introduced and calculated. It has the meaning of the characteristic scale of the order parameter, that is, the values of $x_0$ at which order can conditionally be considered to give way to disorder. This criterion is the basis for determining the boundary of the conditional transition in this approach. It is found that this boundary corresponds to a mean tree height of two generations. Based on these properties, recommendations for medical institutions are proposed that make it possible to keep patient trajectories finite.
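The analogy can be written out as a worked formula. The abstract does not give the fitted expression, so the form below is only the textbook Fermi – Dirac function with the stated substitutions ($x_0$ for energy, $\sigma$ for temperature); the symbols $\eta$ (order parameter), $\langle h \rangle$ (mean tree height), and $\mu$ (analogue of the chemical potential) are notation assumed here:

$$
\eta(x_0) = \frac{1}{\langle h \rangle} \approx \frac{1}{\exp\!\left(\dfrac{x_0 - \mu}{\sigma}\right) + 1}.
$$

Under this assumed form, $\eta \to 1$ for $x_0 \ll \mu$ (trees end after one generation) and $\eta \to 0$ for $x_0 \gg \mu$ (trees grow without bound). At $x_0 = \mu$ the expression gives $\eta = 1/2$, that is, $\langle h \rangle = 2$, which agrees with the stated boundary at a mean tree height of two generations.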

The model discussed and its description using conditionally-infinite trees have applications to many hierarchical systems. These systems include internet routing networks, bureaucratic networks, trade and logistics networks, citation networks, game strategies, population dynamics problems, and others.

Keywords: medical examination, branching random process, Galton – Watson model, diffuse phase transitions, finite systems, conditionally-infinite trajectories, macrosystem, inclusion function, regions of almost pure phases, order parameter, chemical potential, phase diagram, critical behavior.
Чувилин К.В.
Эффективный алгоритм сравнения документов в формате ${\mathrm{\LaTeX}}$
Компьютерные исследования и моделирование, 2015, т. 7, № 2, с. 329-345

Рассматривается задача построения различий, возникающих при редактировании документов в формате ${\mathrm{\LaTeX}}$. Каждый документ представляется в виде синтаксического дерева, узлы которого называются токенами. Строится минимально возможное текстовое представление документа, не меняющее синтаксическое дерево. Весь текст разбивается на фрагменты, границы которых соответствуют токенам. С помощью алгоритма Хиршберга строится отображение последовательности текстовых фрагментов изначального документа в аналогичную последовательность отредактированного документа, соответствующее минимальному редактирующему расстоянию. Строится отображение символов текстов, соответствующее отображению последовательностей текстовых фрагментов. В синтаксических деревьях выделяются токены такие, что символы соответствующих фрагментов текста при отображении либо все не меняются, либо все удаляются, либо все добавляются. Для деревьев, образованных остальными токенами, строится отображение с помощью алгоритма Zhang–Shasha.

Ключевые слова: автоматизация, анализ текста, лексема, машинное обучение, метрика, редактирующее расстояние, синтаксическое дерево, токен, ${\mathrm{\LaTeX}}$.

Chuvilin K.V.
An efficient algorithm for ${\mathrm{\LaTeX}}$ documents comparing
Computer Research and Modeling, 2015, v. 7, no. 2, pp. 329-345

The problem considered is the construction of the differences that arise when ${\mathrm{\LaTeX}}$ documents are edited. Each document is represented as a parse tree whose nodes are called tokens. The smallest possible text representation of the document that does not change the syntax tree is constructed. The whole text is split into fragments whose boundaries correspond to tokens. Using the Hirschberg algorithm, a mapping from the sequence of text fragments of the original document to the corresponding sequence of the edited document is built that corresponds to the minimum edit distance. A mapping of text characters corresponding to the mapping of the fragment sequences is then constructed. Tokens whose corresponding text characters are, under this mapping, either all unchanged, all deleted, or all inserted are selected in the parse trees. For the trees formed by the remaining tokens, a mapping is built using the Zhang–Shasha algorithm.
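The fragment-mapping step can be illustrated with a minimal sketch. It is not the paper's implementation: for brevity it uses the classic quadratic-space Wagner–Fischer dynamic program rather than the Hirschberg algorithm, which yields an alignment of the same minimum cost in linear space. Unit costs for insertion, deletion, and replacement are an assumption, and the function name `align_fragments` and the sample fragments are hypothetical.

```python
def align_fragments(old, new):
    """Return (edit_distance, ops) mapping the fragment sequence `old` onto `new`.

    ops is a list of (operation, old_index, new_index) tuples, where
    operation is "keep", "replace", "delete" or "insert".
    Assumption: every operation except "keep" costs 1.
    """
    n, m = len(old), len(new)
    # dist[i][j] = edit distance between old[:i] and new[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if old[i - 1] == new[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete old[i - 1]
                dist[i][j - 1] + 1,         # insert new[j - 1]
                dist[i - 1][j - 1] + cost,  # keep or replace
            )
    # Walk back from the bottom-right corner to recover one optimal mapping.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = old[i - 1] == new[j - 1]
            if dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
                ops.append(("keep" if same else "replace", i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, None))
            i -= 1
        else:
            ops.append(("insert", None, j - 1))
            j -= 1
    ops.reverse()
    return dist[n][m], ops

# Illustrative fragments of a LaTeX source before and after editing.
old = ["\\section", "{", "Intro", "}"]
new = ["\\section", "{", "Introduction", "}"]
distance, ops = align_fragments(old, new)
```

Fragments marked "keep", "delete", or "insert" correspond to the tokens whose characters are all unchanged, all deleted, or all inserted; only the remaining tokens would then need the more expensive tree-to-tree comparison.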

Keywords: automation, editing distance, text analysis, lexeme, machine learning, metric, parse tree, syntax tree, token, ${\mathrm{\LaTeX}}$.
Просмотров за год: 2. Цитирований: 2 (РИНЦ).


Журнал индексируется в Scopus

Полнотекстовая версия журнала доступна также на сайте научной электронной библиотеки eLIBRARY.RU

Журнал входит в систему Российского индекса научного цитирования.

Журнал включен в базу данных Russian Science Citation Index (RSCI) на платформе Web of Science

Международная Междисциплинарная Конференция "Математика. Компьютер. Образование"