Efficient TF-IDF method for alignment-free DNA sequence similarity analysis

Date

2025

Publisher

Elsevier Science Inc

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

This study proposes a novel alignment-free approach to DNA sequence similarity analysis. DNA sequences are represented as n-grams, and the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is adapted to genomic data. By identifying the most informative n-grams, the approach aims to improve accuracy while reducing computational cost. It circumvents limitations of both traditional alignment-based and alignment-free methods. The proposed method was tested on three different datasets and achieved high agreement with reference phylogenetic trees in the AFProject benchmark system. The results indicate that TF-IDF-based similarity matrices effectively capture phylogenetic relationships and significantly reduce processing time, which suggests that the method offers a scalable and robust alternative for large genomic datasets.
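The general idea described in the abstract, weighting DNA n-grams with TF-IDF and comparing the resulting vectors, can be illustrated with a minimal sketch. This is not the authors' implementation: the n-gram length (`n=3`), the smoothed IDF formula, the use of cosine similarity, and all function names are assumptions chosen for illustration, and the selection of the most informative n-grams is not covered.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Return the overlapping n-grams (k-mers) of a DNA string."""
    seq = seq.upper()
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def tfidf_vectors(seqs, n=3):
    """Map each sequence to a sparse vector {n-gram: TF-IDF weight}.

    Assumption: TF is the relative n-gram frequency and IDF is the
    smoothed form log((1 + N) / (1 + df)) + 1; the paper may differ.
    """
    counts = [Counter(ngrams(s, n)) for s in seqs]
    # Document frequency: number of sequences containing each n-gram.
    doc_freq = Counter(g for c in counts for g in c)
    total = len(seqs)
    vectors = []
    for c in counts:
        length = sum(c.values())
        vectors.append({
            g: (k / length) * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
            for g, k in c.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(seqs, n=3):
    """Pairwise cosine similarities between TF-IDF vectors."""
    vecs = tfidf_vectors(seqs, n)
    return [[cosine(a, b) for b in vecs] for a in vecs]


# Made-up toy sequences, not data from the study.
toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
for row in similarity_matrix(toy):
    print(" ".join(f"{x:.3f}" for x in row))
```

A similarity matrix of this kind could be converted to distances (for example, one minus the similarity) before tree construction; how the study does this step is not stated in the abstract.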

Description

Keywords

DNA sequence analysis, TF-IDF, Alignment-free method, Genomic data, Phylogenetic analysis

Source

Journal of Molecular Graphics & Modelling

WoS Q Value

Q1

Scopus Q Value

Q2

Volume

137

Issue

Citation