Bioinformatics is a major beneficiary of the recent advances in Artificial Intelligence (AI). As an interdisciplinary field of science and technology, bioinformatics aims to develop methods, tools and software that improve our understanding of biological data. Machine learning (ML), a subfield of AI, has become a powerful tool for many bioinformatics applications, and it is especially useful for prediction and pattern detection on large datasets. A number of ML applications are emerging within the bioinformatics space. In this post we look at how ML is being used in DNA sequencing, protein classification, and the analysis of gene expression on DNA microarrays.

DNA Sequencing

The DNA molecule is a twisted double helix whose two strands are held together by pairs of four chemical building blocks called bases: Adenine “A”, Thymine “T”, Cytosine “C” and Guanine “G”. The bases pair together, and a specific sequence of these pairs in a DNA segment, called a gene, encodes a functional molecule. The functional molecules encoded by genes make up the chemical basis of the hereditary traits embodied in our physiology, such as hair color.

In bioinformatics, the process of locating the regions of DNA that encode genes is often referred to as gene prediction or gene finding. Gene finding combines extrinsic and intrinsic searches. In the extrinsic search, “the target genome is searched for sequences that are similar to extrinsic evidence” in the form of known gene-encoding sequences that have previously been discovered and labelled. However, given the “inherent expense and difficulty in obtaining extrinsic evidence for many genes”, an intrinsic search is also performed, in which gene prediction algorithms attempt to identify segments of the DNA that could host gene-encoding sequences. These predictive models systematically search the genomic DNA for protein-coding genes. To make predictions, they combine signals (specific short sequences) with content (statistical properties of the sequence). Various machine learning and deep learning models are currently deployed in this area, including Random Forest, K-Nearest Neighbour, Support Vector Machines and Multilayer Perceptron. The following two publications closely examine the implementation of these methods for gene prediction and the results obtained with each: “A comparison of classification methods for gene prediction in metagenomics” and “Gene prediction using Deep Learning”.
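To make the idea of signals and content more concrete, here is a minimal Python sketch of the kind of intrinsic features a gene prediction model might be fed. It is not the method used in either publication above. The function names (find_orfs, gc_content, kmer_frequencies), the min_codons parameter and the example sequence are our own illustrative assumptions. The "signal" is an ATG start codon followed in the same reading frame by a stop codon, and the "content" is summarized as GC content and k-mer frequencies. Only the forward strand is scanned.

```python
from collections import Counter

START = "ATG"                   # common start codon (assumed signal)
STOPS = {"TAA", "TAG", "TGA"}   # the three standard stop codons


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand open reading frames.

    An ORF here is an ATG followed in-frame by the first stop codon.
    `end` is exclusive and includes the stop codon. `min_codons` counts the
    codons before the stop (ATG included) and is a hypothetical filter.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == START:
                j = i + 3
                # Walk codon by codon until a stop codon or the end of the sequence.
                while j + 3 <= len(seq) and seq[j:j + 3] not in STOPS:
                    j += 3
                if j + 3 > len(seq):
                    break  # no in-frame stop remains, so this frame is done
                if (j - i) // 3 >= min_codons:
                    orfs.append((i, j + 3, frame))
                i = j + 3
                continue
            i += 3
    return orfs


def gc_content(seq):
    """Fraction of bases that are G or C (a simple content statistic)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def kmer_frequencies(seq, k=3):
    """Relative frequency of each overlapping k-mer in the sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


# Made-up example sequence, for illustration only.
example = "CCATGAAACCCGGGTAACC"
for start, end, frame in find_orfs(example):
    candidate = example[start:end]
    print(start, end, frame, round(gc_content(candidate), 2))
```

In a real pipeline, feature vectors like these would be computed for many candidate regions and passed to a classifier such as a Random Forest or a Support Vector Machine, trained on regions already labelled as coding or non-coding.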

Protein Classification

Proteins are the “doers” of our cells, executing many of the functions that ultimately enable life. They are responsible for a vast array of tasks within organisms, such as driving metabolism, responding to stimuli, giving cells their structure, transporting molecules and many more. Proteins are coded by our genes and form the basis of living tissues. Classifying protein patterns across human cells is an essential step towards fully understanding the complexity of the human body. With recent advances in high-throughput microscopy, cellular images are being generated faster than any team of humans can analyze and classify them. As a result, machine learning models are breaking new ground by classifying protein patterns in human cells using computer vision techniques such as Deep Convolutional Neural Fields (DCNF). At Optima AI we embarked on a journey to develop a protein pattern classifier that assigns multi-label, multi-class classifications to cellular images produced through high-throughput microscopy. Read our case study here.
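For readers unfamiliar with multi-label classification, the short sketch below shows how it differs from picking a single class: each label gets its own independent score, and every label whose score clears a threshold is assigned, so one image can carry several protein patterns at once. This is a generic illustration and not the Optima AI model. The label names, the raw scores and the 0.5 threshold are all hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function mapping a raw score to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, label_names, threshold=0.5):
    """Return every label whose independent probability meets the threshold.

    Unlike a softmax over classes, the per-label probabilities are not forced
    to sum to one, so zero, one or several labels can be assigned.
    """
    return [name for name, z in zip(label_names, logits) if sigmoid(z) >= threshold]


# Hypothetical label set and raw model outputs for a single image.
labels = ["nucleoplasm", "cytosol", "mitochondria"]
raw_scores = [2.0, -1.0, 0.3]
print(predict_labels(raw_scores, labels))
```

In practice the raw scores would come from the final layer of an image model, and the threshold is often tuned per label on a validation set rather than fixed at 0.5.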

Analysis of Gene Expressions

A DNA microarray, a type of lab-on-a-chip, “is a collection of DNA spots attached to a solid surface”. Microarrays are used to automatically collect and measure gene expression levels within organisms. Gene expression “is the process by which information from a gene is used in the synthesis of a functional gene product”. A functional gene product is often a protein or, for genes that do not encode proteins, a functional RNA. Machine learning is often used in the analysis, pattern identification and classification of gene expression data. In cancer research, microarrays and RNA sequencing, coupled with state-of-the-art machine learning techniques, have demonstrated the potential to detect and classify tumors at a molecular level. This could enable physicians to offer cancer patients personalized treatment based on the genetic makeup of a specific tumor. The following study presents the machine learning techniques used for such classification problems and their results: “Machine Learning Methods Applied to DNA Microarray Data Can Improve the Diagnosis of Cancer”.
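As a rough illustration of what classifying tumors from expression data looks like in code, the sketch below implements a nearest-centroid classifier in plain Python. We chose this technique for its simplicity; it is not the method evaluated in the study above. The expression values, the three-gene profiles and the class names "tumor_A" and "tumor_B" are invented for the example. Each sample is a list of expression levels, one per gene, and a new sample is assigned to the class whose average profile is closest.

```python
import math


def centroid(rows):
    """Average expression profile: the mean of each gene across the samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_centroids(samples, labels):
    """Group training samples by class label and compute one centroid per class."""
    by_class = {}
    for profile, label in zip(samples, labels):
        by_class.setdefault(label, []).append(profile)
    return {label: centroid(rows) for label, rows in by_class.items()}


def predict(centroids, profile):
    """Return the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], profile))


# Invented expression levels for three genes in four training samples.
train_samples = [
    [8.1, 2.0, 5.5],
    [7.9, 2.2, 5.1],
    [3.0, 6.8, 5.3],
    [2.7, 7.1, 5.6],
]
train_labels = ["tumor_A", "tumor_A", "tumor_B", "tumor_B"]

model = fit_centroids(train_samples, train_labels)
print(predict(model, [7.5, 2.5, 5.0]))
```

Real microarray datasets contain thousands of genes and comparatively few samples, so feature selection and cross-validation usually matter more than the choice of classifier.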

Recent advances in bioinformatics and machine learning are opening up many opportunities in genetics, cellular biology, cancer research and personalized medicine. At Optima AI we are excited to be at the forefront of this technology, and we are looking to partner with organizations that want to use machine learning to deliver the next big breakthroughs in medical research and medicine.