Best Practices of Feature Selection in Multi-Omics Data
Abstract
The volume of data is growing in many fields, and with it the number and variety of variables
that must be evaluated. Managing this growth in data and variables has become a problem in its
own right. Moreover, despite the perception that science now has abundant data, having more
data does not guarantee having more information, correct information, or sufficient information.
Even so, large data sets do contain valuable information, and a large amount of data can be an
advantage when the goal is to extract it. The analyses needed to obtain and process this
information, however, can be difficult.
A further difficulty is the problem known as the curse of dimensionality (Verleysen M. and François
D., 2005). High-dimensional data sets, where this problem is most common, are used successfully
in many fields, such as genetics, pharmacology, toxicology, and nutrition. These
high-dimensional data allow biological systems, cellular metabolism, and disease etiologies to be
examined in more detail. However, the number of samples (n) in these data is considerably lower than
the number of variables (p), and this, together with the heterogeneity of the data and the missing
observations that result from high-throughput technologies, limits the use of traditional methods in this field.
Therefore, a research-based clinical understanding of the biological system is needed, together with
machine learning and statistical learning methods to analyze this clinical information (Hastie
et al., 2009). Several studies show that machine learning methods have been applied successfully
in this field. Some of these studies are listed in Table 1.