Protein Classification
Name: Human Protein Classification
Challenge: Classify mixed protein patterns in cellular images
Solution: Develop a Convolutional Neural Network (CNN) to label and classify protein patterns present in cellular images
Technologies Used: Python, fastai, TensorFlow, Keras, scikit-learn, CNN models
Competition: Kaggle – Protein Atlas
Case Study: Label and Classify proteins present in cellular images
Classifying and visualizing protein patterns in cellular images is a common task in biomedical research. With recent advances in high-throughput microscopy, it has become even more important to develop accurate models that label and classify proteins at the cellular level efficiently, as this sort of research may hold the key to the next big breakthrough in medicine.
As part of this project, we were given a training set of 31,000 labelled images covering 28 protein types. Each sample carried one or more target protein labels, which made this a multiclass, multilabel classification task.
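To make the multilabel setup concrete, here is a minimal sketch of how one sample's labels could be turned into a multi-hot target vector. The function name `encode_labels` and the example label ids are illustrative assumptions, not taken from the project code; only the class count of 28 comes from the text.

```python
import numpy as np

NUM_CLASSES = 28  # number of protein types stated in the text


def encode_labels(label_ids, num_classes=NUM_CLASSES):
    """Return a multi-hot vector with 1.0 at every label index present."""
    target = np.zeros(num_classes, dtype=np.float32)
    target[list(label_ids)] = 1.0
    return target


# Hypothetical sample carrying three protein labels.
example = encode_labels([0, 5, 25])
```

Each of the 28 positions is then treated as its own yes/no prediction, which is why a per-class sigmoid output is the usual choice for this kind of target rather than a softmax.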
Another challenge was the uneven distribution of target proteins in the training data. Some protein patterns appeared frequently in the training set while others were rarely observed, which resulted in a strong class imbalance.
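One simple way to quantify this kind of imbalance is to count how many samples carry each label. The sketch below uses made-up toy label lists; the helper name `label_frequencies` is an assumption for illustration, and real counts would come from the training set labels.

```python
from collections import Counter


def label_frequencies(samples):
    """Count how many samples carry each label id."""
    counts = Counter()
    for label_ids in samples:
        counts.update(set(label_ids))  # count each label once per sample
    return counts


# Hypothetical toy label lists, not real competition data.
toy_samples = [[0], [0, 2], [0, 5], [2], [0, 27]]
counts = label_frequencies(toy_samples)

# Ratio between the most and least frequent labels that occur at all.
imbalance_ratio = max(counts.values()) / min(counts.values())
```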
Lastly, each training sample was represented by four filter images: red, green, blue and yellow (RGBY). This made it harder to use a pre-trained CNN model, since most pre-trained models expect a three-channel RGB input.
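As a rough illustration of the four-filter input, the sketch below stacks four single-channel arrays into one four-channel image. The channel order (red, green, blue, yellow), the 8x8 size, and the function name `stack_rgby` are assumptions made for the example.

```python
import numpy as np


def stack_rgby(red, green, blue, yellow):
    """Stack four single-channel (H, W) arrays into one (H, W, 4) array."""
    channels = [red, green, blue, yellow]
    if len({c.shape for c in channels}) != 1:
        raise ValueError("all four filter images must share the same shape")
    return np.stack(channels, axis=-1)


# Hypothetical 8x8 filter images filled with constant values.
h, w = 8, 8
filters = [np.full((h, w), v, dtype=np.uint8) for v in (10, 20, 30, 40)]
image = stack_rgby(*filters)
```

The resulting array has one more channel than a standard RGB image, which is exactly the mismatch with three-channel pre-trained models described above.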
Solution
We began by taking a three-channel CNN model pre-trained on ImageNet as our baseline. We added a fourth input channel for the yellow filter and initialized the weights for this new channel by copying the values from one of the pre-trained channels.
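The channel-copy initialization can be sketched on a plain weight array. This is a framework-agnostic illustration, not the project's actual code: the weight layout `(out_channels, in_channels, kH, kW)`, the choice of channel 0 as the source, and the random stand-in weights are all assumptions.

```python
import numpy as np


def expand_first_conv(weights_rgb, source_channel=0):
    """Expand (out, 3, kH, kW) conv weights to (out, 4, kH, kW).

    The new fourth input channel is a copy of `source_channel`.
    """
    if weights_rgb.shape[1] != 3:
        raise ValueError("expected three input channels")
    # Slice with a range so the channel axis is kept for concatenation.
    extra = weights_rgb[:, source_channel:source_channel + 1]
    return np.concatenate([weights_rgb, extra], axis=1)


# Random stand-in for real pre-trained first-layer weights.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(64, 3, 7, 7)).astype(np.float32)
expanded = expand_first_conv(pretrained, source_channel=0)
```

Copying an existing channel, rather than starting the new one from random values, means the fourth channel begins with filters that already respond to edges and textures, so only the first layer's input size changes while the rest of the pre-trained network is reused as is.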
We then used transfer learning to retrain the model on the training set, with a focal loss function to deal with the strong data imbalance, as proposed in the following paper: https://arxiv.org/pdf/1708.02002.pdf
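For reference, the focal loss from the cited paper scales the cross-entropy term by a factor that shrinks for well-classified examples:

$$\mathrm{FL}(p_t) = -\alpha_t \,(1 - p_t)^{\gamma} \log(p_t)$$

Below is a minimal NumPy sketch of that formula applied per label to sigmoid probabilities. It is not the implementation used in the project; the defaults `gamma=2.0` and `alpha=0.25` are the values the paper reports as working well and are assumptions here.

```python
import numpy as np


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over all labels.

    probs   -- predicted probabilities in [0, 1]
    targets -- 0/1 ground-truth labels, same shape as probs
    """
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    # p_t is the probability assigned to the true class of each label.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
    return loss.mean()


# A confident correct prediction versus a confident wrong one.
easy = focal_loss(np.array([0.9]), np.array([1.0]))
hard = focal_loss(np.array([0.1]), np.array([1.0]))
```

With `gamma=0` the modulating factor disappears and the expression reduces to an alpha-weighted binary cross-entropy, which is a convenient sanity check.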
We then manually put together a balanced cross-validation split so that rarely occurring protein patterns were represented in both the training and validation sets. In a couple of these cases, we duplicated training data points to further optimize the model for the rare classes.
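The duplication step can be sketched as a simple oversampling pass over the training split. The threshold `min_count`, the helper name `oversample_rare`, and the toy samples are all hypothetical; the original split was assembled manually, so this only illustrates the general idea.

```python
from collections import Counter


def oversample_rare(samples, min_count=3):
    """Duplicate samples that carry a label seen fewer than `min_count` times.

    samples -- list of (sample_id, label_ids) pairs from the training split
    Returns the augmented list and the set of labels treated as rare.
    """
    counts = Counter(label for _, labels in samples for label in set(labels))
    rare = {label for label, n in counts.items() if n < min_count}
    extra = [s for s in samples if rare & set(s[1])]
    return samples + extra, rare


# Hypothetical training split: labels 1 and 27 are under-represented.
toy = [("a", [0]), ("b", [0, 1]), ("c", [0]), ("d", [27]), ("e", [0, 1])]
augmented, rare_labels = oversample_rare(toy, min_count=3)
```

Applying the duplication only after the split is made keeps copies of the same image from appearing in both the training and validation sets.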