The Effect of Document Length on Machine Learning Success in Text-Based Data
Abstract
Natural Language Processing (NLP) is an important research area in artificial intelligence. When working with textual data, feature extraction and the construction of the word-document vector are essential steps. For machine learning algorithms in particular, these numerical vectors play a critical role in building the model. Textual data must be preprocessed before the vectors can be generated, and common methods include removing stopwords, converting text to lowercase, and stripping punctuation marks. The effects of these preprocessing methods on the resulting model have been investigated in the literature. However, how the length of a text affects the resulting model has not been investigated. How does a document with fewer than 10 or 20 characters affect the machine learning model? This study was carried out to address this question and fill the gap in the literature. The effect of text length on text classification models was tested with different feature extraction methods. © 2023 IEEE.
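
The preprocessing and length-filtering steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the stopword list, the helper names (`preprocess`, `filter_by_length`, `bag_of_words`), the sample documents, and the choice of plain term counts as the feature extraction method are all assumptions made for illustration, using only the Python standard library.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def preprocess(text):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    lowered = text.lower()
    cleaned = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


def filter_by_length(docs, min_chars):
    """Keep documents whose raw text has at least min_chars characters."""
    return [d for d in docs if len(d) >= min_chars]


def bag_of_words(docs):
    """Build a sorted vocabulary and one term-count vector per document."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


# Made-up sample documents; two of them are shorter than 10 characters.
docs = ["Great!", "The delivery was late and the box was damaged.", "ok"]
kept = filter_by_length(docs, min_chars=10)
vocab, vectors = bag_of_words(kept)
```

Varying `min_chars` (for example 10 versus 20) and retraining a classifier on the resulting vectors is one way such a length threshold could be compared; the threshold values here simply mirror the ones mentioned in the abstract.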