Machine Learning Methods in Bioinformatics Applications
Proteomics is an important domain in which machine learning techniques are
applied in bioinformatics. In proteomics, the two main applications of
computational methods are protein structure prediction and protein function
prediction.
Generally, the first is an optimization problem and the second is a
classification problem. Evolutionary algorithm (EA) based methods, such as the
genetic algorithm (GA) and the estimation of distribution algorithm (EDA), are
the main optimization techniques for protein structure prediction. Supervised
and unsupervised learning methods, such as clustering, support vector machines
(SVM), and neural networks (NN), are often used to predict protein function.
Extracting useful information from biological data is one of the main
challenges in computational biology. For complicated problems in proteomics,
such as structure prediction for long protein sequences and multi-label
protein function classification, applying machine learning methods naively
usually fails to produce the expected results. In our research, improved machine
learning methods are explored for solving complicated proteomic applications.
Information extracted from the biological data is used to guide the training
procedure of the machine learning methods, and specialized techniques are
incorporated according to the characteristics of the application problem.
Protein structure prediction is the task of predicting a protein's
three-dimensional structure (tertiary structure) from its amino acid sequence
(primary structure). EDA-based methods are explored to solve the protein HP
model structure prediction and side-chain placement problems. A hybrid EDA with
a composite fitness function and local search is proposed to solve the protein
lattice HP model folding problem. A niching EDA method based on clustering
analysis and balance searching is also explored to solve another important
structure prediction problem, protein side-chain prediction.
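To make the lattice HP folding problem concrete, the following is a minimal sketch of the standard HP energy function paired with a plain univariate EDA (UMDA-style). It is not the hybrid EDA described above: the composite fitness function and the local search are omitted. The 2D square lattice, the absolute-move encoding, the penalty value for self-intersecting walks, and all function names and parameter defaults are assumptions made for illustration.

```python
import random

# Absolute moves on a 2D square lattice (an assumed encoding).
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def fold(moves):
    """Return lattice coordinates for a move string, or None if the walk self-intersects."""
    pos = (0, 0)
    coords = [pos]
    seen = {pos}
    for m in moves:
        dx, dy = MOVES[m]
        pos = (pos[0] + dx, pos[1] + dy)
        if pos in seen:
            return None
        seen.add(pos)
        coords.append(pos)
    return coords

def hp_energy(sequence, moves):
    """HP energy: -1 for each pair of H residues adjacent on the lattice but not in the chain."""
    coords = fold(moves)
    if coords is None:
        return 1  # assumed penalty; every valid fold has energy <= 0
    index = {c: i for i, c in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in ((1, 0), (0, 1)):  # look right and up so each pair is counted once
            j = index.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                energy -= 1
    return energy

def umda(sequence, pop_size=200, elite=40, generations=50, seed=0):
    """Univariate EDA: sample folds, keep the best, re-estimate per-position move probabilities."""
    rng = random.Random(seed)
    n = len(sequence) - 1
    letters = list(MOVES)
    # Probabilistic model: an independent distribution over moves at each position.
    probs = [[1.0 / len(letters)] * len(letters) for _ in range(n)]
    best_moves, best_energy = None, 1
    for _ in range(generations):
        population = ["".join(rng.choices(letters, weights=p)[0] for p in probs)
                      for _ in range(pop_size)]
        population.sort(key=lambda m: hp_energy(sequence, m))
        top_energy = hp_energy(sequence, population[0])
        if top_energy < best_energy:
            best_moves, best_energy = population[0], top_energy
        selected = population[:elite]
        # Re-estimate the marginals from the selected folds, with add-one smoothing.
        for k in range(n):
            counts = [1.0] * len(letters)
            for m in selected:
                counts[letters.index(m[k])] += 1.0
            total = sum(counts)
            probs[k] = [c / total for c in counts]
    return best_moves, best_energy
```

As a small check of the energy function, `hp_energy("HPPH", "RUL")` evaluates to -1, because the U-shaped fold brings the two H residues into lattice contact. Calling `umda(sequence)` returns the best move string found and its energy; a univariate model ignores dependencies between positions, which is one reason more elaborate EDAs are of interest for longer sequences.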
Protein function classification is the task of classifying a protein's
functions according to its sequence data. Hierarchical multi-label protein
function classification is a very difficult task in this domain. An improved
multi-label classification method based on SVM with a refined decision boundary
is proposed to solve multi-label protein function classification.
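One way to see what the hierarchical setting adds to multi-label classification is the consistency constraint: a function label should be predicted only if all of its ancestors in the function hierarchy are predicted too. The sketch below applies that constraint to per-label decision scores, such as the signed outputs of one binary SVM per label. It is a generic illustration, not the improved method proposed above; the function names, the parent map, the threshold of 0, and the example labels are all hypothetical.

```python
def ancestors(label, parent):
    """Yield every ancestor of `label` by walking up the (hypothetical) parent map."""
    while label in parent:
        label = parent[label]
        yield label

def consistent_labels(scores, parent, threshold=0.0):
    """Keep a label only if it and all of its ancestors score above the threshold."""
    predicted = set()
    for label in scores:
        chain = [label, *ancestors(label, parent)]
        # A label with no score is treated as rejected.
        if all(scores.get(a, float("-inf")) > threshold for a in chain):
            predicted.add(label)
    return predicted

# Hypothetical hierarchy and per-label decision scores for one protein.
parent = {"kinase": "transferase", "transferase": "enzyme", "hydrolase": "enzyme"}
scores = {"enzyme": 1.2, "transferase": 0.4, "kinase": -0.3, "hydrolase": 0.8}
labels = consistent_labels(scores, parent)
```

Here `labels` contains `enzyme`, `transferase`, and `hydrolase`: `kinase` is rejected by its own negative score, and any label whose ancestor scored below the threshold would be rejected as well, even with a positive score of its own.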