Search results for 'large language model':

Articles found: 8

От редакции
Компьютерные исследования и моделирование, 2024, т. 16, № 7, с. 1533-1538

Editor’s note
Computer Research and Modeling, 2024, v. 16, no. 7, pp. 1533-1538
Ахмад У., Иванов В.
Автоматизация построения банков высококачественных концептов с использованием больших языковых моделей и мультимодальных метрик
Компьютерные исследования и моделирование, 2024, т. 16, № 7, с. 1555-1567

Интерпретируемость моделей глубокого обучения стала центром исследований, особенно в таких областях, как здравоохранение и финансы. Модели с «бутылочным горлышком», используемые для выявления концептов, стали перспективным подходом для достижения прозрачности и интерпретируемости за счет использования набора известных пользователю понятий в качестве промежуточного представления перед слоем предсказания. Однако ручное аннотирование понятий затруднено из-за больших затрат времени и сил. В нашей работе мы исследуем потенциал больших языковых моделей (LLM) для создания высококачественных банков концептов и предлагаем мультимодальную метрику для оценки качества генерируемых концептов. Мы изучили три ключевых вопроса: способность LLM генерировать банки концептов, сопоставимые с существующими базами знаний, такими как ConceptNet, достаточность унимодального семантического сходства на основе текста для оценки ассоциаций концептов с метками, а также эффективность мультимодальной информации для количественной оценки качества генерации концептов по сравнению с унимодальным семантическим сходством концепт-меток. Наши результаты показывают, что мультимодальные модели превосходят унимодальные подходы в оценке сходства между понятиями и метками. Более того, сгенерированные нами концепты для наборов данных CIFAR-10 и CIFAR-100 превосходят те, что были получены из ConceptNet и базовой модели, что демонстрирует способность LLM генерировать высококачественные концепты. Возможность автоматически генерировать и оценивать высококачественные концепты позволит исследователям работать с новыми наборами данных без дополнительных усилий.

Ключевые слова: интерпретируемость, большие языковые модели, нейросети с «бутылочным горлышком», машинное обучение.

Ahmad U., Ivanov V.
Automating high-quality concept banks: leveraging LLMs and multimodal evaluation metrics
Computer Research and Modeling, 2024, v. 16, no. 7, pp. 1555-1567

Interpretability in recent deep learning models has become an epicenter of research, particularly in sensitive domains such as healthcare and finance. Concept bottleneck models have emerged as a promising approach for achieving transparency and interpretability by leveraging a set of human-understandable concepts as an intermediate representation before the prediction layer. However, manual concept annotation is discouraged due to the time and effort involved. Our work explores the potential of large language models (LLMs) for generating high-quality concept banks and proposes a multimodal evaluation metric to assess the quality of generated concepts. We investigate three key research questions: the ability of LLMs to generate concept banks comparable to existing knowledge bases like ConceptNet, the sufficiency of unimodal text-based semantic similarity for evaluating concept-class label associations, and the effectiveness of multimodal information in quantifying concept generation quality compared to unimodal concept-label semantic similarity. Our findings reveal that multimodal models outperform unimodal approaches in capturing concept-class label similarity. Furthermore, our generated concepts for the CIFAR-10 and CIFAR-100 datasets surpass those obtained from ConceptNet and the baseline comparison, demonstrating the standalone capability of LLMs in generating high-quality concepts. Being able to automatically generate and evaluate high-quality concepts will enable researchers to quickly adapt to a new dataset, with little to no effort, before feeding the concepts into concept bottleneck models.

Keywords: interpretability, large language models, concept bottleneck models, machine learning.
Черепанов В.В.
Моделирование теплового поля неподвижных симметричных тел в разреженной низкотемпературной плазме
Компьютерные исследования и моделирование, 2025, т. 17, № 1, с. 73-91

В работе исследуется процесс самосогласованной релаксации области возмущений, созданных в разреженной бинарной низкотемпературной плазме неподвижным заряженным шаром или цилиндром с абсорбирующей поверхностью. Особенностью подобных задач является их самосогласованный кинетический характер, при котором нельзя отделить процессы переноса в фазовом пространстве и формирования электромагнитного поля. Представлена математическая модель, позволяющая описывать и анализировать состояние газа, электрическое и тепловое поле в окрестности тела. Многомерность кинетической формулировки создает определенные проблемы при численном решении, поэтому для задачи подобрана криволинейная система неголономных координат, которая минимизирует ее фазовое пространство, что способствует повышению эффективности численных методов. Для таких координат обоснована и проанализирована форма кинетического уравнения Власова. Для его решения использован вариант метода крупных частиц с постоянным форм-фактором. В расчетах применялась подвижная сетка, отслеживающая смещение в фазовом пространстве носителя функции распределения, что дополнительно уменьшило объем контролируемой области фазового пространства. Раскрыты ключевые детали модели и численного метода. Модель и метод реализованы в виде кода на языке Matlab. На примере решения задачи для шара показано наличие в возмущенной зоне существенного неравновесия и анизотропии в распределении частиц по скорости. По результатам расчетов представлены картины эволюции структуры функции распределения частиц, профилей основных макроскопических характеристик газа — концентрации, тока, температуры и теплового потока, характеристик электрического поля в возмущенной области. Установлен механизм разогрева притягивающихся частиц в возмущенной зоне и показаны некоторые важные особенности процесса формирования теплового потока. 
Получены результаты, хорошо объяснимые с физической точки зрения, что подтверждает адекватность модели и корректность работы программного инструмента. Отмечаются создание и апробация основы для разработки в перспективе инструментов решения и более сложных задач моделирования поведения ионизированных газов вблизи заряженных тел.

Работа будет полезной специалистам в области математического моделирования, процессов тепло- и массообмена, физики низкотемпературной плазмы, аспирантам и студентам старших курсов, специализирующимся в указанных направлениях.

Ключевые слова: математическое моделирование, разреженная плазма, абсорбирующий заряженный шар, возмущенная зона, фазовое пространство, неголономные координаты, функция распределения, самосогласованное поле, макропараметры, эволюция и стационарное состояние.

Cherepanov V.V.
Modeling the thermal field of stationary symmetric bodies in rarefied low-temperature plasma
Computer Research and Modeling, 2025, v. 17, no. 1, pp. 73-91

The work investigates the process of self-consistent relaxation of the region of disturbances created in a rarefied binary low-temperature plasma by a stationary charged ball or cylinder with an absorbing surface. A feature of such problems is their self-consistent kinetic nature, in which it is impossible to separate the processes of transfer in phase space and the formation of an electromagnetic field. A mathematical model is presented that makes it possible to describe and analyze the state of the gas, electric and thermal fields in the vicinity of the body. The multidimensionality of the kinetic formulation creates certain problems in the numerical solution, therefore a curvilinear system of nonholonomic coordinates was selected for the problem, which minimizes its phase space, which contributes to increasing the efficiency of numerical methods. For such coordinates, the form of the Vlasov kinetic equation has been justified and analyzed. To solve it, a variant of the large particle method with a constant form factor was used. The calculations used a moving grid that tracks the displacement of the distribution function carrier in the phase space, which further reduced the volume of the controlled region of the phase space. Key details of the model and numerical method are revealed. The model and the method are implemented as code in the Matlab language. Using the example of solving a problem for a ball, the presence of significant disequilibrium and anisotropy in the particle velocity distribution in the disturbed zone is shown. Based on the calculation results, pictures of the evolution of the structure of the particle distribution function, profiles of the main macroscopic characteristics of the gas — concentration, current, temperature and heat flow, and characteristics of the electric field in the disturbed region are presented. 
The mechanism of heating of attracted particles in the disturbed zone is established and some important features of the process of formation of heat flow are shown. The results obtained are readily explained from a physical point of view, which confirms the adequacy of the model and the correct operation of the software tool. The work thereby creates and tests a basis for the future development of tools for solving more complex problems of modeling the behavior of ionized gases near charged bodies.

The work will be useful to specialists in the field of mathematical modeling, heat and mass transfer processes, low-temperature plasma physics, and to postgraduate and senior students specializing in these areas.

Keywords: mathematical modeling, rarefied plasma, absorbing charged ball, disturbed zone, phase space, nonholonomic coordinates, distribution function, self-consistent field, macroparameters, evolution and steady state.
Антонов И.В., Бруттан Ю.В.
Применение больших языковых моделей для интеллектуального поиска и извлечения информации в корпоративных информационных системах
Компьютерные исследования и моделирование, 2025, т. 17, № 5, с. 871-888

В данной статье исследуется эффективность применения технологии Retrieval-Augmented Generation (RAG) в сочетании с различными большими языковыми моделями (LLM) для поиска документов и получения информации в корпоративных информационных системах. Рассматриваются варианты использования LLM в корпоративных системах, архитектура RAG, характерные проблемы интеграции LLM в RAG-систему. Предлагается архитектура системы, включающая в себя векторный энкодер текстов и LLM. Энкодер используется для создания векторной базы данных, индексирующей библиотеку корпоративных документов. Запрос, передаваемый LLM, дополняется релевантным ему контекстом из библиотеки корпоративных документов, извлекаемым с использованием векторной базы данных и библиотеки FAISS. Большая языковая модель принимает запрос пользователя и формирует ответ на основе переданных в контексте запроса данных. Рассматриваются общая структура и алгоритм функционирования предлагаемого решения, реализующего архитектуру RAG. Обосновывается выбор LLM для исследования и проводится анализ результативности использования популярных LLM (ChatGPT, GigaChat, YandexGPT, Llama, Mistral, Qwen и др.) в качестве компонента для генерации ответов. На основе тестового набора вопросов методом экспертных оценок оцениваются точность, полнота, грамотность и лаконичность ответов, предоставляемых рассматриваемыми моделями. Анализируются характеристики отдельных моделей, полученные в результате исследования. Приводится информация о средней скорости отклика моделей. Отмечается существенное влияние объема доступной памяти графического адаптера на производительность локальных LLM. На основе интегрального показателя качества формируется общий рейтинг LLM. Полученные результаты подтверждают эффективность предложенной архитектуры RAG для поиска документов и получения информации в корпоративных информационных системах. 
Были определены возможные направления дальнейших исследований в этой области: дополнение контекста, передаваемого LLM, и переход к архитектуре на базе LLM-агентов. В заключении представлены рекомендации по выбору оптимальной конфигурации RAG и LLM для построения решений, обеспечивающих быстрый и точный доступ к информации в рамках корпоративных информационных систем.

Ключевые слова: искусственный интеллект, информационные системы, семантический поиск, обработка естественного языка, векторизация документов, RAG, LLM.

Antonov I.V., Bruttan I.V.
Using RAG technology and large language models to search for documents and obtain information in corporate information systems
Computer Research and Modeling, 2025, v. 17, no. 5, pp. 871-888

This paper investigates the effectiveness of Retrieval-Augmented Generation (RAG) combined with various Large Language Models (LLMs) for document retrieval and information access in corporate information systems. We survey typical use-cases of LLMs in enterprise environments, outline the RAG architecture, and discuss the major challenges that arise when integrating LLMs into a RAG pipeline. A system architecture is proposed that couples a text-vector encoder with an LLM. The encoder builds a vector database that indexes a library of corporate documents. For every user query, relevant contextual fragments are retrieved from this library via the FAISS engine and appended to the prompt given to the LLM. The LLM then generates an answer grounded in the supplied context. The overall structure and workflow of the proposed RAG solution are described in detail. To justify the choice of the generative component, we benchmark a set of widely used LLMs — ChatGPT, GigaChat, YandexGPT, Llama, Mistral, Qwen, and others — when employed as the answer-generation module. Using an expert-annotated test set of queries, we evaluate the accuracy, completeness, linguistic quality, and conciseness of the responses. Model-specific characteristics and average response latencies are analysed; the study highlights the significant influence of available GPU memory on the throughput of local LLM deployments. An overall ranking of the models is derived from an aggregated quality metric. The results confirm that the proposed RAG architecture provides efficient document retrieval and information delivery in corporate environments. Future research directions include richer context augmentation techniques and a transition toward agent-based LLM architectures. The paper concludes with practical recommendations on selecting an optimal RAG–LLM configuration to ensure fast and precise access to enterprise knowledge assets.

Keywords: artificial intelligence, information systems, semantic search, natural language processing, document vectorization, RAG, LLM.
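
The retrieve-then-augment flow described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function stands in for the text-vector encoder, a brute-force cosine ranking stands in for the FAISS index, and the names `retrieve`, `build_prompt`, and `DOCS` are hypothetical. The LLM call itself is omitted; the sketch stops at the augmented prompt.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a text encoder (assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Brute-force nearest-neighbour search; a vector index would replace this."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Append the retrieved fragments to the user query before it goes to the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


# Hypothetical corporate document library, for illustration only.
DOCS = [
    "Vacation requests are submitted through the HR portal.",
    "The VPN client must be updated every quarter.",
    "Expense reports are approved by the department head.",
]

if __name__ == "__main__":
    print(build_prompt("how are vacation requests submitted", DOCS, k=1))
```

In a real deployment the ranking step would be delegated to a vector index and the prompt passed to whichever LLM is being evaluated; only the order of operations is meant to carry over.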
Адамовский Е.Р., Чертков В.М., Богуш Р.П.
Модель формирования карты радиосреды для когнитивной системы связи на базе сотовой сети LTE
Компьютерные исследования и моделирование, 2022, т. 14, № 1, с. 127-146

Статья посвящена вторичному использованию спектра в телекоммуникационных сетях. Акцентируется внимание, что одним из решений данной проблемы является применение технологий когнитивного радио и динамического доступа к спектру, для успешного функционирования которых необходим большой объем информации, включающий параметры базовых станций и абонентов сети. Хранение и обработка информации должны осуществляться при помощи карты радиосреды, которая представляет собой пространственно-временную базу данных всех активностей в сети и позволяет определять доступные для использования в заданное время частоты. В работе представлена двухуровневая модель для формирования карты радиосреды системы сотовой связи LTE, в которой выделены локальный и глобальный уровни, описываемая следующими параметрами: набор частот, ослабление сигнала, карта распространения сигналов, шаг сетки, текущий временной отсчет. Ключевыми объектами модели являются базовая станция и абонентское устройство. К основным параметрам базовой станции отнесены: наименование, идентификатор, координаты ячейки, номер, диапазон, мощность излучения, номера подключенных абонентских устройств, выделенные им ресурсные блоки. Для абонентских устройств в качестве параметров используются: наименование, идентификатор, местоположение, текущие координаты ячейки устройства, идентификатор рабочей базовой станции, частотный диапазон, номера ресурсных блоков для связи со станцией, мощность излучения, статус передачи данных, список номеров ближайших станций, расписания перемещения и сеансов связи устройств. Представлен алгоритм для реализации модели с учетом сценариев перемещения и сеансов связи абонентских устройств. Приводится методика расчета карты радиосреды в точке координатной сетки с учетом потерь при распространении радиосигналов от излучающих устройств. Программная реализация модели выполнена с использованием пакета MatLab. Описаны подходы, позволяющие повысить быстродействие ее работы. 
При моделировании выбор параметров осуществлялся с учетом данных действующих систем связи и экономии вычислительных ресурсов. Продемонстрированы результаты исследований программной реализации алгоритма формирования карты радиосреды, подтверждающие корректность разработанной модели.

Ключевые слова: карта радиосреды, когнитивное радио, LTE, динамический доступ к спектру.

Adamovskiy Y.R., Chertkov V.M., Bohush R.P.
Model for building of the radio environment map for cognitive communication system based on LTE
Computer Research and Modeling, 2022, v. 14, no. 1, pp. 127-146

The paper is devoted to the secondary use of spectrum in telecommunication networks. It is emphasized that one solution to this problem is the use of cognitive radio technologies and dynamic spectrum access, whose successful operation requires a large amount of information, including the parameters of base stations and network subscribers. Storage and processing of this information should be carried out using a radio environment map, which is a spatio-temporal database of all activity in the network and makes it possible to determine the frequencies available for use at a given time. The paper presents a two-level model, with local and global levels, for forming the radio environment map of an LTE cellular communication system. The model is described by the following parameters: a set of frequencies, signal attenuation, signal propagation map, grid step, and current time count. The key objects of the model are the base station and the subscriber device. The main parameters of the base station include: name, identifier, cell coordinates, number, range, radiation power, numbers of connected subscriber devices, and the resource blocks allocated to them. For subscriber devices, the following parameters are used: name, identifier, location, current coordinates of the device cell, identifier of the serving base station, frequency range, numbers of resource blocks for communication with the station, radiation power, data transmission status, list of numbers of the nearest stations, and schedules of device movement and communication sessions. An algorithm for implementing the model is presented, taking into account the scenarios of movement and communication sessions of subscriber devices. A method is given for calculating the radio environment map at a point of the coordinate grid, taking into account losses during the propagation of radio signals from emitting devices. The software implementation of the model uses the MatLab package. 
Approaches that increase its speed of operation are described. In the simulation, parameters were chosen with regard to data from existing communication systems and the economy of computing resources. The results of testing the software implementation of the radio environment map formation algorithm are demonstrated, confirming the correctness of the developed model.

Keywords: cognitive radio, radio environment map, LTE, dynamic spectrum access.
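
The per-point calculation mentioned in this abstract (signal level at a grid node, accounting for propagation losses from each emitter) can be sketched as follows. The abstract does not state which propagation model the authors use, so the standard free-space path loss formula is assumed here purely for illustration; the function names, the minimum-distance clamp, and the grid layout are likewise hypothetical. The paper's implementation is in MatLab; Python is used only for this sketch.

```python
import math


def fspl_db(distance_km: float, freq_mhz: float) -> float:
    """Free-space path loss in dB (distance in km, frequency in MHz)."""
    return 20 * math.log10(distance_km) + 20 * math.log10(freq_mhz) + 32.44


def received_power_dbm(tx_power_dbm, tx_xy, point_xy, freq_mhz, min_distance_km=0.001):
    """Power received at a grid point from one emitter.

    The distance is clamped to min_distance_km (an assumption) so that a point
    coinciding with the emitter does not produce log10(0).
    """
    d = max(math.dist(tx_xy, point_xy), min_distance_km)
    return tx_power_dbm - fspl_db(d, freq_mhz)


def radio_map(transmitters, grid_size, step_km, freq_mhz):
    """Total received power (dBm) at each node of a square grid.

    transmitters is a non-empty list of (tx_power_dbm, (x_km, y_km)) pairs.
    Contributions are summed in milliwatts, then converted back to dBm.
    """
    rows = []
    for j in range(grid_size):
        row = []
        for i in range(grid_size):
            point = (i * step_km, j * step_km)
            total_mw = sum(
                10 ** (received_power_dbm(p, xy, point, freq_mhz) / 10)
                for p, xy in transmitters
            )
            row.append(10 * math.log10(total_mw))
        rows.append(row)
    return rows


if __name__ == "__main__":
    # One hypothetical 43 dBm emitter in the grid corner, 1800 MHz carrier.
    for row in radio_map([(43.0, (0.0, 0.0))], grid_size=3, step_km=1.0, freq_mhz=1800.0):
        print([round(v, 1) for v in row])
```

A time-dependent map would recompute this grid at each time count as devices move and sessions start or stop; that scheduling logic is left out here.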
Salem N., Al-Tarawneh K., Hudaib A., Salem H., Tareef A., Salloum H., Mazzara M.
Generating database schema from requirement specification based on natural language processing and large language model
Computer Research and Modeling, 2024, v. 16, no. 7, pp. 1703-1713

A Large Language Model (LLM) is an advanced artificial intelligence algorithm that utilizes deep learning methodologies and extensive datasets to process, understand, and generate human-like text. These models are capable of performing various tasks, such as summarization, content creation, translation, and predictive text generation, making them highly versatile in applications involving natural language understanding. Generative AI, often associated with LLMs, specifically focuses on creating new content, particularly text, by leveraging the capabilities of these models. Developers can harness LLMs to automate complex processes, such as extracting relevant information from system requirement documents and translating it into a structured database schema. This capability has the potential to streamline the database design phase, saving significant time and effort while ensuring that the resulting schema aligns closely with the given requirements. By integrating LLM technology with Natural Language Processing (NLP) techniques, the efficiency and accuracy of generating database schemas based on textual requirement specifications can be significantly enhanced. The proposed tool will utilize these capabilities to read system requirement specifications, which may be provided as text descriptions or as Entity-Relationship Diagrams (ERDs). It will then analyze the input and automatically generate a relational database schema in the form of SQL commands. This innovation eliminates much of the manual effort involved in database design, reduces human errors, and accelerates development timelines. The aim of this work is to provide a tool that can be invaluable for software developers, database architects, and organizations aiming to optimize their workflow and align technical deliverables with business requirements seamlessly.

Keywords: large language model, natural language processing, entity-relationship diagrams, SQL.

Salem N., Hudaib A., Al-Tarawneh K., Salem H., Tareef A., Salloum H., Mazzara M.
A survey on the application of large language models in software engineering
Computer Research and Modeling, 2024, v. 16, no. 7, pp. 1715-1726

Large Language Models (LLMs) are transforming software engineering by bridging the gap between natural language and programming languages. These models have revolutionized communication within development teams and the Software Development Life Cycle (SDLC) by enabling developers to interact with code using natural language, thereby improving workflow efficiency. This survey examines the impact of LLMs across various stages of the SDLC, including requirement gathering, system design, coding, debugging, testing, and documentation. LLMs have proven to be particularly useful in automating repetitive tasks such as code generation, refactoring, and bug detection, thus reducing manual effort and accelerating the development process. The integration of LLMs into the development process offers several advantages, including the automation of error correction, enhanced collaboration, and the ability to generate high-quality, functional code based on natural language input. Additionally, LLMs assist developers in understanding and implementing complex software requirements and design patterns. This paper also discusses the evolution of LLMs from simple code completion tools to sophisticated models capable of performing high-level software engineering tasks. However, despite their benefits, there are challenges associated with LLM adoption, such as issues related to model accuracy, interpretability, and potential biases. These limitations must be addressed to ensure the reliable deployment of LLMs in production environments. The paper concludes by identifying key areas for future research, including improving the adaptability of LLMs to specific software domains, enhancing their contextual understanding, and refining their capabilities to generate semantically accurate and efficient code. This survey provides valuable insights into the evolving role of LLMs in software engineering, offering a foundation for further exploration and practical implementation.

Keywords: large language model, natural language processing, software development life cycle.

Ирхин И.А., Булатов В.Г., Воронцов К.В.
Аддитивная регуляризация тематических моделей с быстрой векторизацией текста
Компьютерные исследования и моделирование, 2020, т. 12, № 6, с. 1515-1528

Задача вероятностного тематического моделирования заключается в том, чтобы по заданной коллекции текстовых документов найти две матрицы: матрицу условных вероятностей тем в документах и матрицу условных вероятностей слов в темах. Каждый документ представляется в виде мультимножества слов, то есть предполагается, что для выявления тематики документа не важен порядок слов в нем, а важна только их частота. При таком предположении задача сводится к вычислению низкорангового неотрицательного матричного разложения, наилучшего по критерию максимума правдоподобия. Данная задача имеет в общем случае бесконечное множество решений, то есть является некорректно поставленной. Для регуляризации ее решения к логарифму правдоподобия добавляется взвешенная сумма оптимизационных критериев, с помощью которых формализуются дополнительные требования к модели. При моделировании больших текстовых коллекций хранение первой матрицы представляется нецелесообразным, поскольку ее размер пропорционален числу документов в коллекции. В то же время тематические векторные представления документов необходимы для решения многих задач текстовой аналитики — информационного поиска, кластеризации, классификации, суммаризации текстов. На практике тематический вектор вычисляется для каждого документа по необходимости, что может потребовать десятков итераций по всем словам документа. В данной работе предлагается способ быстрого вычисления тематического вектора для произвольного текста, требующий лишь одной итерации, то есть однократного прохода по всем словам документа. Для этого в модель вводится дополнительное ограничение в виде уравнения, позволяющего вычислять первую матрицу через вторую за линейное время. Хотя формально данное ограничение не является оптимизационным критерием, фактически оно выполняет роль регуляризатора и может применяться в сочетании с другими критериями в рамках теории аддитивной регуляризации тематических моделей ARTM. 
Эксперименты на трех свободно доступных текстовых коллекциях показали, что предложенный метод улучшает качество модели по пяти оценкам качества, характеризующим разреженность, различность, информативность и когерентность тем. Для проведения экспериментов использовались библиотеки с открытым кодом BigARTM и TopicNet.

Ключевые слова: автоматическая обработка текстов, обучение без учителя, тематическое моделирование, аддитивная регуляризация тематических моделей, EM-алгоритм, PLSA, LDA, ARTM, BigARTM, TopicNet.

Irkhin I.A., Bulatov V.G., Vorontsov K.V.
Additive regularization of topic models with fast text vectorization
Computer Research and Modeling, 2020, v. 12, no. 6, pp. 1515-1528

The probabilistic topic model of a text document collection finds two matrices: a matrix of conditional probabilities of topics in documents and a matrix of conditional probabilities of words in topics. Each document is represented by a multiset of words also called the “bag of words”, thus assuming that the order of words is not important for revealing the latent topics of the document. Under this assumption, the problem is reduced to a low-rank non-negative matrix factorization governed by likelihood maximization. In general, this problem is ill-posed having an infinite set of solutions. In order to regularize the solution, a weighted sum of optimization criteria is added to the log-likelihood. When modeling large text collections, storing the first matrix seems to be impractical, since its size is proportional to the number of documents in the collection. At the same time, the topical vector representation (embedding) of documents is necessary for solving many text analysis tasks, such as information retrieval, clustering, classification, and summarization of texts. In practice, the topical embedding is calculated for a document “on-the-fly”, which may require dozens of iterations over all the words of the document. In this paper, we propose a way to calculate a topical embedding quickly, by one pass over document words. For this, an additional constraint is introduced into the model in the form of an equation, which calculates the first matrix from the second one in linear time. Although formally this constraint is not an optimization criterion, in fact it plays the role of a regularizer and can be used in combination with other regularizers within the additive regularization framework ARTM. Experiments on three text collections have shown that the proposed method improves the model in terms of sparseness, difference, logLift and coherence measures of topic quality. The open source libraries BigARTM and TopicNet were used for the experiments.

Keywords: natural language processing, unsupervised learning, topic modeling, additive regularization of topic model, EM-algorithm, PLSA, LDA, ARTM, BigARTM, TopicNet.
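
The idea of computing a document's topic vector in a single pass from the word-topic matrix alone can be sketched as below. The abstract does not give the constraint equation, so the form used here is an assumption: each word's topic posterior p(t|w) is obtained from p(w|t) and a topic prior p(t) by Bayes' rule, and the document vector is the frequency-weighted average of these posteriors. The names `phi`, `topic_prior`, and both functions are hypothetical and are not the BigARTM or TopicNet API.

```python
from collections import Counter


def topic_given_word(phi_w, topic_prior):
    """Bayes' rule: p(t|w) is proportional to p(w|t) * p(t).

    Assumes the word has nonzero probability in at least one topic.
    """
    joint = [p_wt * p_t for p_wt, p_t in zip(phi_w, topic_prior)]
    total = sum(joint)
    return [x / total for x in joint]


def one_pass_topic_vector(tokens, phi, topic_prior):
    """Topic vector of a document from one pass over its words.

    phi maps each vocabulary word to its list of p(w|t) over topics.
    Out-of-vocabulary tokens are ignored. Assumed form (not taken from the paper):
    theta_t = sum over w of (n_dw / n_d) * p(t|w).
    """
    counts = Counter(t for t in tokens if t in phi)
    n_d = sum(counts.values())
    theta = [0.0] * len(topic_prior)
    if n_d == 0:
        return theta
    for word, n_dw in counts.items():
        for t, p_tw in enumerate(topic_given_word(phi[word], topic_prior)):
            theta[t] += n_dw / n_d * p_tw
    return theta


if __name__ == "__main__":
    # Toy two-topic model over a three-word vocabulary (illustrative numbers).
    phi = {"plasma": [0.6, 0.1], "model": [0.3, 0.3], "topic": [0.1, 0.6]}
    print(one_pass_topic_vector(["plasma", "plasma", "model", "unknown"], phi, [0.5, 0.5]))
```

The cost is linear in document length and needs no stored document-topic matrix, which matches the motivation given in the abstract; the iterative EM-style inference it replaces is not shown.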

The journal is indexed in Scopus

The full-text version of the journal is also available on the website of the scientific electronic library eLIBRARY.RU

The journal is part of the Russian Index of Science Citation system.

The journal is included in the Russian Science Citation Index (RSCI) database on the Web of Science platform
