DNA Sequence Classification Based on Milvus

Introduction

DNA sequencing is widely used in both academic research and practical applications such as gene traceability, species identification, and disease diagnosis. As every industry looks for more intelligent and efficient research methods, artificial intelligence has attracted much attention, especially in the biological and medical domains. More and more scientists and researchers are applying machine learning and deep learning to bioinformatics. A common way to make experimental results more convincing is to increase the sample size, and combining genomics with big data opens up more real-world use cases. However, traditional sequence alignment has limitations that make it unsuitable for large datasets. For a large dataset of DNA sequences, vectorization is a good alternative because it requires fewer trade-offs in practice.

The open-source vector database Milvus is well suited to massive data. It can store vectors of nucleic acid sequences and retrieve them efficiently, which can also help reduce the cost of production or research. The DNA sequence classification system based on Milvus takes only milliseconds to classify a gene. Moreover, in this demo it shows higher accuracy than other common machine learning classifiers.

Data Processing

A gene, which encodes genetic information, is a small section of a DNA sequence built from 4 nucleotide bases [A, C, G, T]. The human genome contains about 30,000 genes and nearly 3 billion DNA base pairs, each of which consists of 2 complementary bases. To support diverse uses, DNA sequences can be classified into various categories. To reduce cost and make long DNA sequences easier to work with, k-mers are introduced in data preprocessing. This also makes DNA sequence data resemble plain text. Furthermore, vectorized data can speed up computation in data analysis and machine learning.

DNA Sequence Processing With k-mers

k-mer

The k-mer method is commonly used in DNA sequence preprocessing. It slides a window of length k along the original sequence one base at a time, thereby converting a long sequence of length s into (s-k+1) overlapping short sequences of length k. The value of k affects model performance, so it is worth tuning. Lists of short sequences are easier to read, to extract features from, and to vectorize.
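
As an illustration of the sliding window described above, here is a minimal Python sketch. The function name `kmers` and the example sequence are made up for this illustration and are not taken from the demo code.

```python
def kmers(sequence, k):
    """Return every overlapping substring of length k, one per start position."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # A sequence of length s yields s - k + 1 windows (none if s < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


example = "ATGCATGC"          # hypothetical sequence, length s = 8
words = kmers(example, k=4)   # 8 - 4 + 1 = 5 short sequences
print(words)                  # ['ATGC', 'TGCA', 'GCAT', 'CATG', 'ATGC']
print(" ".join(words))        # joined with spaces, the list reads like a sentence
```

Joining the k-mers with spaces is what lets text-oriented tools treat each short sequence as a word.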

Vectorization

DNA sequences are vectorized as if they were text. A sequence transformed by k-mer becomes a list of short sequences, which looks like a list of individual words in a sentence. Therefore, most natural language processing models should work for DNA sequence data as well, and similar methodologies can be applied to model training, feature extraction, and encoding. Since each model has its own advantages and drawbacks, the choice of model depends on the characteristics of the data and the purpose of the research. For example, CountVectorizer, a bag-of-words model, implements feature extraction through straightforward tokenization. It sets no limit on data length, but the resulting vectors make similarity differences less pronounced.
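
To make the bag-of-words idea concrete, the sketch below counts k-mer occurrences using only the Python standard library. It is a simplified stand-in for what a model such as CountVectorizer does, not scikit-learn's implementation, and the helper names `build_vocabulary` and `count_vector` as well as the toy documents are assumptions made for this example.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct k-mer to a column index, in sorted order."""
    vocab = sorted({word for doc in documents for word in doc})
    return {word: idx for idx, word in enumerate(vocab)}


def count_vector(doc, vocabulary):
    """Count how often each vocabulary k-mer occurs in one document."""
    counts = Counter(doc)
    # Iterating a dict yields keys in insertion order, i.e. column order here.
    # K-mers that are not in the vocabulary are simply ignored.
    return [counts.get(word, 0) for word in vocabulary]


# Two hypothetical sequences, already split into 3-mers.
docs = [["ATG", "TGC", "GCA"], ["ATG", "ATG", "TGA"]]
vocabulary = build_vocabulary(docs)
vectors = [count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # {'ATG': 0, 'GCA': 1, 'TGA': 2, 'TGC': 3}
print(vectors)     # [[1, 1, 0, 1], [2, 0, 1, 0]]
```

Every sequence maps to a vector with one dimension per vocabulary entry regardless of its length, which is why this encoding places no limit on sequence length.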

Milvus Demo

Milvus can easily manage unstructured data and recall the most similar results among trillions of vectors with an average latency of milliseconds. Its similarity search is based on Approximate Nearest Neighbor (ANN) search algorithms. These strengths make Milvus a great option for managing vectors of DNA sequences, and thereby for promoting the development and application of bioinformatics.

Here is a demo showing how to build a DNA sequence classification system with Milvus. The experimental dataset includes 3 organisms and 7 gene families. All data are converted to lists of short sequences by k-mers. With a pre-trained CountVectorizer model, the system then encodes sequence data into vectors. The flow chart below depicts the system structure and the processes of inserting and searching.

System Structure and Processes Flow

Try out this demo at Milvus Bootcamp.

In Milvus, the system creates a collection and inserts the corresponding vectors of DNA sequences into the collection (or into a partition, if enabled). When it receives a query request, Milvus returns the distances between the vector of the input DNA sequence and the most similar results in the database. The class of the input sequence and the similarity between DNA sequences can then be determined from the vector distances in the results.

# Insert vectors to Milvus collection (partition "human")
DNA_human = collection.insert([human_ids, human_vectors], partition_name="human")
# Search topK results (in partition "human") for test vectors
res = collection.search(test_vectors, "vector_field", search_params, limit=topK, partition_names=['human'])
for results in res:
    res_ids = results.ids # primary keys of topK results
    res_distances = results.distances # distances between topK results & search input

DNA Sequence Classification

Searching Milvus for the DNA sequences most similar to an unknown sample can suggest the sample's gene family and therefore its possible function. For example, if a sequence is classified as a GPCR (G protein-coupled receptor), it probably influences body functions. In this demo, Milvus enabled the system to identify the gene families of the human DNA sequences used as queries.
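
One simple way to turn search results into a predicted gene family is a majority vote over the labels of the nearest neighbors. The sketch below assumes that approach; the source does not state the exact rule the demo uses, and the function `classify_by_vote`, the ID-to-label mapping, and the label names are all hypothetical.

```python
from collections import Counter


def classify_by_vote(neighbor_ids, id_to_label):
    """Return the most common label among the neighbors and its vote share."""
    votes = Counter(id_to_label[i] for i in neighbor_ids)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbor_ids)


# Hypothetical mapping from primary keys stored in Milvus to gene families.
id_to_label = {1: "GPCRs", 2: "GPCRs", 3: "other_family", 4: "other_family"}

# Hypothetical primary keys of the top 3 results for one query sequence.
res_ids = [2, 3, 1]
predicted, share = classify_by_vote(res_ids, id_to_label)
print(predicted)  # GPCRs
```

A distance-weighted vote is a natural refinement when the nearest results differ widely in similarity.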

DNA Sequence Classification Example
DNA Sequence Classification Results

Genetic Similarity

The average DNA sequence similarity between organisms indicates how close their genomes are. The demo searches the human data for the DNA sequences most similar to those of chimpanzees and dogs, respectively. It then calculates and compares the average inner product (IP) values, where a larger value means more similar vectors: 0.97 for chimpanzee and 0.70 for dog. This suggests that chimpanzees share more similar genes with humans than dogs do. With more complex data and system design, Milvus can support genetic research at an even higher level.

search_params = {"metric_type": "IP", "params": {"nprobe": 20}}
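
The comparison of average inner products can be sketched without Milvus as a brute-force calculation. This example assumes the vectors are L2-normalized (so the inner product equals cosine similarity) and that only the single best match per query is averaged; the source states neither detail, and all function names and toy vectors here are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def average_top1_similarity(queries, references):
    """Mean of each query's best inner product against the reference set."""
    best = [max(inner_product(q, r) for r in references) for q in queries]
    return sum(best) / len(best)


# Toy 2-dimensional vectors standing in for encoded DNA sequences.
references = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
close_species = [normalize([1.0, 0.1])]  # nearly parallel to a reference
far_species = [normalize([1.0, 1.0])]    # halfway between the references

close_score = average_top1_similarity(close_species, references)
far_score = average_top1_similarity(far_species, references)
print(close_score > far_score)  # True
```

With unit-length vectors the score is at most 1, and a value closer to 1 indicates more similar sequences.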

Performance

The demo trains the classification model on 80% of the human sample data (3629 in total) and uses the remaining 20% as test data. It compares the performance of the DNA sequence classification model that uses Milvus with one powered by MySQL and with 5 popular machine learning classifiers. The Milvus-based model outperforms its counterparts in accuracy.

from sklearn.model_selection import train_test_split
X, y = human_sequence_kmers, human_labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
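
Accuracy on the held-out test data is the fraction of test sequences whose predicted gene family matches the true one. A minimal sketch of that calculation follows; the function name and the toy labels are hypothetical and are not results from the demo.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels for five test sequences (made up for illustration).
true_labels = [0, 1, 2, 1, 0]
predicted_labels = [0, 1, 2, 0, 0]
score = accuracy(true_labels, predicted_labels)
print(score)  # 0.8
```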

Milvus Performance vs. 5 Popular Machine Learning Classifiers

Further Exploration

With the development of big data technology, the vectorization of DNA sequences will play a more important role in genetic research and practice. Combined with professional knowledge in bioinformatics, related studies can benefit further from DNA sequence vectorization, and Milvus can in turn deliver better results in practice. Depending on the scenario and user needs, Milvus-powered similarity search and distance calculation show great potential.

  • Study unknown sequences: According to some researchers, vectorization can compress DNA sequence data while reducing the effort needed to study the structure, function, and evolution of unknown DNA sequences. Milvus can store and retrieve a huge number of DNA sequence vectors without losing accuracy.
  • Adapt devices: Because of the limits of traditional sequence alignment algorithms, similarity search can barely benefit from improvements in hardware (CPU/GPU). Milvus, which supports both regular CPU computation and GPU acceleration, addresses this problem with approximate nearest neighbor algorithms.
  • Detect viruses and trace origins: Scientists have compared genome sequences and reported that the virus behind COVID-19, probably of bat origin, is a SARS-related coronavirus. Based on this conclusion, researchers can expand the sample size to look for more evidence and patterns.
  • Diagnose diseases: Clinically, doctors could compare DNA sequences between patients and healthy groups to identify variant genes that cause diseases. These data can be encoded into feature vectors with suitable algorithms, and Milvus can return distances between the vectors, which can then be related to disease data. In addition to assisting diagnosis, this application can also help inspire the study of targeted therapy.

Learn More About Milvus

Milvus is a powerful tool capable of powering a vast array of artificial intelligence and vector similarity search applications. To learn more about the project, check out the following resources:

  • Read our blog.
  • Interact with our open-source community on Slack.
  • Use or contribute to the world’s most popular vector database on GitHub.
