Weighted pairwise genetic distance between samples from fasta file and count table of sequences?
I have a series of samples, all containing multiple DNA sequences. I’m looking to calculate the mean pairwise genetic distance between samples. I’ve figured out a way of how to do this in R using dist.dna in ape and dist_groups in usedist in a dummy dataset as follows:
library(adegenet)
library(ape)
library(usedist)
library(tidyverse)
dna <- fasta2DNAbin(file=”adegenet.r-forge.r-project.org/files/usflu.fasta“)
D <- dist.dna(dna, model = “N”)
samples <- c(repsamples <- c(rep(“sample.a”, 20), rep(“sample.b”, 20), rep(“sample.c”, 20), rep(“sample.d”, 20)))
D <- dist_setNames(D, samples)
table.paint(as.matrix(D), cleg=0, clabel.row=.5, clabel.col=.5)
D2 <- dist_groups(D, samples)
D2.grouped <- D2 %>% group_by(Label) %>% summarise(mean.distance=mean(Distance))
The problem I have is that in the actual dataset I have ~400 samples, and each sample contains upwards of 20,000 sequences. Combining these together would make a (very) large distance matrix.
As there are only ~1500 unique sequences in total, I can get around this problem by calculating a distance matrix on JUST the unique samples. What I’d then like to do is calculate mean pairwise distance BETWEEN samples (as above with dist_groups from usedist) except somehow weight each sample by the number of times the sequence occurs in the sample?
Specific example using dummy data below (10 samples, 3 sequences per sample, lots of sequence counts):
dna <- fasta2DNAbin(file=”adegenet.r-forge.r-project.org/files/usflu.fasta“)
sequencenames <- dimnames(dna)[[1]]
sequences <- sample(sequencenames,30,replace=TRUE)
count <- sample(1000:5000,30)
sample <- rep(LETTERS[seq( from = 1, to = 10 )],3)
counttable <- cbind(sample, sequences, count) %>% as.data.frame() %>% arrange(sample)
One way I could see this working would be:
1) Take the pairwise distances among sequences from dist.dna(dna, model = “N”)
2) Expand the fasta file using the count table so that every row of dist.dna would be an individual sequence within a sample
While this could work, the fasta file would be… sum(as.numeric(counttable$count)) = 101430 sequences long, and the dist matrix would be 101430* 101430 – which would rapidly become computationally intractable as the actual dataset has… 60 million sequences.
My ideal workflow would be this:
1) Take the pairwise distances among unique sequences from dist.dna(dna, model = “N”)
2) Use dist_groups as above to calculate within and between sample pairwise distance, but someway make this a weighted distance by incorporating the replicated sequences within each sample.
Any ideas how to do this in either R or Python? At this stage i’m up for anything!
Read more here: Source link