Bioinformatics Applications

Machine Learning Methods in Bioinformatics Applications


Proteomics is an important domain of bioinformatics in which machine learning techniques are applied. Two main applications of computational methods in proteomics are protein structure prediction and protein function prediction. Generally, the first is treated as an optimization problem and the second as a classification problem. Evolutionary algorithm (EA) based methods, such as the genetic algorithm (GA) and the estimation of distribution algorithm (EDA), are the main optimization techniques for protein structure prediction. Supervised and unsupervised learning methods, such as clustering, support vector machines (SVM), and neural networks (NN), are often used to predict protein function.


Extracting useful information from biological data is one of the main challenges in computational biology. For complicated problems in proteomics, such as structure prediction for long protein sequences and multi-label protein function classification, applying machine learning methods directly usually does not produce satisfactory results. Our research explores improved machine learning methods for these complicated proteomic applications. Information extracted from the biological data is used to guide the training procedure of the machine learning methods, and additional techniques are combined according to the characteristics of the application problem.
Protein structure prediction aims to predict a protein's three-dimensional structure (tertiary structure) from its amino acid sequence (primary structure). EDA-based methods are explored for structure prediction in the HP (hydrophobic-polar) model and for side-chain placement. A hybrid EDA with a composite fitness function and local search is proposed for the protein folding problem in the lattice HP model. A niching EDA based on clustering analysis and balance searching is also explored for another important structure prediction problem, protein side-chain prediction.
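
To make the lattice HP folding problem concrete, the following is a minimal sketch and not the hybrid EDA described above. It assumes a 2D square lattice, absolute moves (U, D, L, R), and an energy of minus one per contact between non-consecutive H residues. The optimizer is a plain univariate EDA (UMDA-style) with no composite fitness function and no local search; the names fold, hp_energy, and umda, and all parameter defaults, are hypothetical choices made for illustration.

```python
import random

# Unit moves on the 2D square lattice (absolute directions; an assumption).
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold(moves):
    """Return the chain's lattice coordinates, or None if it self-intersects."""
    x, y = 0, 0
    coords = [(x, y)]
    seen = {(x, y)}
    for m in moves:
        dx, dy = MOVES[m]
        x, y = x + dx, y + dy
        if (x, y) in seen:
            return None
        seen.add((x, y))
        coords.append((x, y))
    return coords


def hp_energy(sequence, moves):
    """Energy = -(number of H-H lattice contacts between non-consecutive residues)."""
    coords = fold(moves)
    if coords is None:
        return None
    index = {c: i for i, c in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each neighbouring pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                contacts += 1
    return -contacts


def umda(sequence, pop_size=200, elite=40, generations=50, seed=0):
    """Univariate EDA: sample folds, keep the best, re-estimate per-position move probabilities."""
    rng = random.Random(seed)
    symbols = list(MOVES)
    n = len(sequence) - 1
    # One independent categorical distribution over moves per chain position.
    probs = [[1.0 / len(symbols)] * len(symbols) for _ in range(n)]
    best_energy, best_moves = None, None
    for _ in range(generations):
        population = []
        for _ in range(pop_size):
            moves = "".join(rng.choices(symbols, weights=p)[0] for p in probs)
            energy = hp_energy(sequence, moves)
            if energy is not None:  # discard self-intersecting conformations
                population.append((energy, moves))
        population.sort()  # lowest energy first
        selected = population[:elite]
        if not selected:
            continue
        if best_energy is None or selected[0][0] < best_energy:
            best_energy, best_moves = selected[0]
        # Re-estimate the model from the selected folds, with add-one smoothing.
        for pos in range(n):
            counts = [1.0] * len(symbols)
            for _, mv in selected:
                counts[symbols.index(mv[pos])] += 1
            total = sum(counts)
            probs[pos] = [c / total for c in counts]
    return best_energy, best_moves
```

For example, hp_energy("HHHH", "URD") folds four hydrophobic residues into a unit square and returns -1, because the first and last residues form one non-consecutive contact. Calling umda("HPHPPHHPHH") returns the lowest energy found and the corresponding move string. A univariate model ignores dependencies between positions, which is one reason practical methods add components such as local search.
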
Protein function classification assigns functions to a protein according to its sequence data. Hierarchical multi-label protein function classification is a particularly difficult task in this domain. An improved SVM-based multi-label classification method with a carefully tuned decision boundary is proposed for this problem.
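
The decision step of hierarchical multi-label classification can be sketched as follows. This is an illustration under stated assumptions, not the proposed method: it assumes one binary classifier per function class has already produced a signed decision score (for example an SVM margin), uses an optional per-class threshold as a stand-in for an adjusted decision boundary, and enforces the hierarchy by adding every ancestor of a predicted class. The function name predict_labels, the class identifiers, and the scores are hypothetical.

```python
def predict_labels(scores, parent, thresholds=None, default_threshold=0.0):
    """Turn per-class decision scores into a hierarchy-consistent label set.

    scores:     dict mapping class label -> decision score for one protein
    parent:     dict mapping class label -> parent label (None for top-level classes)
    thresholds: optional dict of per-class decision thresholds (an assumption)
    """
    thresholds = thresholds or {}
    predicted = {
        label
        for label, score in scores.items()
        if score >= thresholds.get(label, default_threshold)
    }
    # Hierarchy constraint: a protein in a class also belongs to all its ancestors.
    for label in list(predicted):
        node = parent.get(label)
        while node is not None:
            predicted.add(node)
            node = parent.get(node)
    return predicted


# Hypothetical three-class hierarchy: "01.01" is a subclass of "01".
parent = {"01": None, "01.01": "01", "02": None}
scores = {"01": -0.2, "01.01": 0.4, "02": -1.0}
labels = predict_labels(scores, parent)
```

Here labels contains "01.01" because its score clears the default threshold, and "01" because it is the parent of a predicted class even though its own score is negative. Raising the threshold for "01.01" above 0.4 would leave the label set empty.
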