Thematic Classification of Multilingual Texts: A Case Study in the Sports Domain (original title: Classification thématique des textes multilingue Etude de cas dans le domaine de sport)
Date
2025
Publisher
University of Bordj Bou Arreridj
Abstract
With the growth of digital publishing and the increasing volume of textual content
published daily, particularly in the sports domain, organizing this content has
become increasingly important. This study processes multilingual sports texts
using natural language processing and machine learning techniques in order to classify
them according to the topics they address. To standardize the linguistic processing of
multilingual texts, the NLLB machine translation model was used to translate the
content into English, which helped improve the thematic segmentation of the
texts.
Several supervised algorithms were applied, including Naive Bayes, Support Vector
Machine (SVM), and Multilayer Perceptron (MLP), on a sports dataset collected from
the Kaggle platform. After data cleaning and converting the texts into numerical
representations using the TF-IDF algorithm, the models were trained and compared.
Results showed that SVM and MLP achieved the best performance in terms of accuracy,
while the Naive Bayes model stood out for its execution speed. This study
demonstrates the effectiveness of multilingual thematic classification in the sports domain
and paves the way for future improvements using more advanced language models.
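
The classification step described above (TF-IDF features fed to a supervised classifier) can be illustrated with a minimal, standard-library-only sketch. It is not the study's implementation: the abstract does not state which library or settings were used, the four training sentences are invented toy data rather than the Kaggle dataset, and the names `TRAIN`, `fit_idf`, `tfidf`, `train_nb`, and `predict` are hypothetical. Only Naive Bayes is shown; SVM and MLP would consume the same TF-IDF vectors.

```python
import math
from collections import Counter, defaultdict

# Toy, invented examples (NOT the Kaggle dataset used in the study).
TRAIN = [
    ("the striker scored a late goal in the football match", "football"),
    ("the goalkeeper saved a penalty in the football final", "football"),
    ("she won the tennis match in straight sets", "tennis"),
    ("a powerful serve and an ace decided the tennis final", "tennis"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real data cleaning)."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(doc, idf):
    """TF-IDF weights for one tokenized document; unseen terms are ignored."""
    tf = Counter(term for term in doc if term in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def train_nb(samples, idf, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""
    class_docs = Counter()
    weights = defaultdict(Counter)
    for text, label in samples:
        class_docs[label] += 1
        for term, w in tfidf(tokenize(text), idf).items():
            weights[label][term] += w
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(weights[label].values())
        denom = total + alpha * len(idf)
        log_prior = math.log(n_docs / len(samples))
        log_like = {t: math.log((weights[label][t] + alpha) / denom) for t in idf}
        model[label] = (log_prior, log_like)
    return model


def predict(text, idf, model):
    """Return the label with the highest log-posterior score."""
    vec = tfidf(tokenize(text), idf)
    scores = {
        label: log_prior + sum(w * log_like[t] for t, w in vec.items())
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)


idf = fit_idf([tokenize(text) for text, _ in TRAIN])
model = train_nb(TRAIN, idf)
print(predict("the goalkeeper scored a goal", idf, model))
```

In a multilingual setting such as the one described, non-English texts would first be translated into English (the study uses NLLB for this) before being passed to `tokenize`.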
Keywords
natural language processing, text classification, sport, TF-IDF, SVM, Naive Bayes, MLP, mBERT, NLLB.