Using Machine Learning for Biological Sequence Generation

Author: Noah Ballinger    Major: Computer Science 

Coding Networks from Home

After completing my previous projects in the spring semester as part of the Data Science Lab at the University of Arkansas, I began doing research for the U.S. Department of Defense over the summer. More specifically, I was working with a biology research team attached to the Air Force in Ohio. I started this research over the summer and, upon reaching the fall 2020 semester, was able to continue it with the Department of Defense on behalf of my research group, led by Dr. Justin Zhan. Throughout this project, I worked alongside fellow undergraduate student Leo Fuentes and had the opportunity to work with several mentors, including Dr. Zhan and a team of biological researchers for the Air Force. Each of my mentors gave unique perspective and guidance on our project, whether by providing background for the biological terms or concepts I didn’t know, by suggesting certain types of algorithmic approaches or specific papers to read, or simply by keeping us on track, determining what steps should be accomplished next from week to week. All of this helped Leo and me focus on our one task: finding the best way to generate better sequences of amino acids.

While a “better sequence of amino acids” may sound subjective, the Air Force lab we worked with used biological tools known as biosensors to produce a list of sequences along with numeric values measuring how well each sequence performed for the desired results. This numeric value is the brightness each sequence produces when attached to a biosensor that aligns with the desired output. To solve this problem, Leo and I developed a four-step method to create new, potentially better, sequences.

We knew that we wanted to design a method based on neural networks. Neural networks are a form of AI that mimics the architecture of the human brain, using layers of connected numeric units to form a type of function. These networks have become quite prevalent recently due to their ability to produce solutions for real-world problems. Alongside this, we wanted to increase our likelihood of generating sequences that would actually form biologically, rather than sequences that were only hypothetically high-performing based on the data we were given. To address this, we first selected a set of important amino-acid triplets and built partially filled-in “skeleton” sequences, further improving these chances by promoting the placement of several important amino acids at specific positions in the sequence. We used both polarity and position to find a group of the best triplets. From these triplets (we had 40 for our initial dataset), we created a randomly generated sequence with many unfilled positions. After this, I implemented a character-based RNN to fill in the partial sequences. A character-based RNN is a type of recurrent neural network that generates words or sequences letter by letter. We trained it on our dataset of sequences, fed it the skeleton sequence, and had it generate the amino-acid symbol for each empty position.
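To make the skeleton idea concrete, here is a minimal sketch of the two steps: placing a few triplets at random non-overlapping positions, then filling the remaining gaps left to right. The triplet list, the sequence length, and the `uniform` sampler are all hypothetical illustrations; in the actual project the sampler would be the trained character-based RNN conditioning on the prefix generated so far.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP = "_"

def make_skeleton(length, triplets, n_triplets=3, seed=0):
    """Build a partially filled 'skeleton' sequence: place a few
    selected triplets at random non-overlapping positions and leave
    every other position unfilled."""
    rng = random.Random(seed)
    skeleton = [GAP] * length
    used = set()
    for triplet in rng.sample(triplets, n_triplets):
        # Retry until the triplet fits without overlapping a placed one.
        for _ in range(100):
            start = rng.randrange(length - 2)
            span = {start, start + 1, start + 2}
            if span.isdisjoint(used):
                for offset, aa in enumerate(triplet):
                    skeleton[start + offset] = aa
                used |= span
                break
    return "".join(skeleton)

def fill_skeleton(skeleton, sampler, seed=0):
    """Fill each unfilled position left to right. `sampler` stands in
    for the trained character-based RNN: given the sequence generated
    so far, it returns the next amino acid."""
    rng = random.Random(seed)
    filled = []
    for ch in skeleton:
        filled.append(ch if ch != GAP else sampler("".join(filled), rng))
    return "".join(filled)

# Toy stand-in sampler: uniform over amino acids (the real model
# would condition on the prefix it has generated so far).
uniform = lambda prefix, rng: rng.choice(AMINO_ACIDS)

triplets = ["WYF", "KRH", "DEN", "STC"]  # hypothetical top triplets
skeleton = make_skeleton(20, triplets)
sequence = fill_skeleton(skeleton, uniform)
```

The key design point is that the triplets are fixed before generation, so the learned model only ever fills the gaps around them, which biases every generated sequence toward the positions and residues the biology suggests matter most.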

After this, we had a set of generated sequences that should have a high likelihood of binding, with results similar to the dataset we were given. However, we wanted to find sequences in the upper margin of, or better than, that dataset. To achieve this, I implemented another neural network, this time a regression network that predicts the brightness value of a sequence. We trained it on our dataset of given sequences and their corresponding values. Then, we could use this network to predict the values of the sequences we had generated. After generating around a thousand sequences, we could predict the brightness value for each one and keep only a top percentile of the best.
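The filtering step above can be sketched as follows. This is a simplified illustration: `predict_brightness` here is just a linear score over one-hot features standing in for the trained regression network, and the generated sequences, weights, and the 5% cutoff are hypothetical values chosen only to make the example runnable.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Flattened one-hot encoding: one 20-dim block per position."""
    vec = []
    for aa in seq:
        block = [0.0] * len(AMINO_ACIDS)
        block[AMINO_ACIDS.index(aa)] = 1.0
        vec.extend(block)
    return vec

def predict_brightness(seq, weights):
    """Stand-in for the trained regression network: a linear score
    over the one-hot features. The real model is a neural network
    trained on measured sequence/brightness pairs."""
    return sum(w * x for w, x in zip(weights, one_hot(seq)))

def top_percentile(seqs, weights, keep=0.05):
    """Score every generated sequence and keep the top fraction."""
    ranked = sorted(seqs, key=lambda s: predict_brightness(s, weights),
                    reverse=True)
    return ranked[:max(1, int(len(ranked) * keep))]

rng = random.Random(0)
length = 12
generated = ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
             for _ in range(1000)]
weights = [rng.uniform(-1, 1) for _ in range(length * len(AMINO_ACIDS))]
best = top_percentile(generated, weights, keep=0.05)
```

Because the predictor is cheap to evaluate compared with laboratory measurement, generating a thousand candidates and discarding all but the predicted top few percent is a practical way to shortlist sequences for experimental testing.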

We are currently finalizing aspects of our paper and results and hope to publish our research soon. Following this project, I plan to continue researching machine learning in the lab under Dr. Zhan.