# weighted mean pairwise distance among samples containing multiple sequences?

Weighted pairwise genetic distance between samples from fasta file and count table of sequences?

I have a series of samples, all containing multiple DNA sequences. I’m looking to calculate the mean pairwise genetic distance between samples. I’ve figured out a way of how to do this in R using dist.dna in ape and dist_groups in usedist in a dummy dataset as follows:

library(ape)

library(usedist)

library(tidyverse)

D <- dist.dna(dna, model = “N”)

samples <- c(repsamples <- c(rep(“sample.a”, 20), rep(“sample.b”, 20), rep(“sample.c”, 20), rep(“sample.d”, 20)))

D <- dist_setNames(D, samples)

table.paint(as.matrix(D), cleg=0, clabel.row=.5, clabel.col=.5)

D2 <- dist_groups(D, samples)
D2.grouped <- D2 %>% group_by(Label) %>% summarise(mean.distance=mean(Distance))

The problem I have is that in the actual dataset I have ~400 samples, and each sample contains upwards of 20,000 sequences. Combining these together would make a (very) large distance matrix.

As there are only ~1500 unique sequences in total, I can get around this problem by calculating a distance matrix on JUST the unique samples. What I’d then like to do is calculate mean pairwise distance BETWEEN samples (as above with dist_groups from usedist) except somehow weight each sample by the number of times the sequence occurs in the sample?

Specific example using dummy data below (10 samples, 3 sequences per sample, lots of sequence counts):

sequencenames <- dimnames(dna)[[1]]

sequences <- sample(sequencenames,30,replace=TRUE)

count <- sample(1000:5000,30)

sample <- rep(LETTERS[seq( from = 1, to = 10 )],3)

counttable <- cbind(sample, sequences, count) %>% as.data.frame() %>% arrange(sample)

One way I could see this working would be:

1) Take the pairwise distances among sequences from dist.dna(dna, model = “N”)

2) Expand the fasta file using the count table so that every row of dist.dna would be an individual sequence within a sample

While this could work, the fasta file would be… sum(as.numeric(counttable\$count)) = 101430 sequences long, and the dist matrix would be 101430* 101430 – which would rapidly become computationally intractable as the actual dataset has… 60 million sequences.

My ideal workflow would be this:

1) Take the pairwise distances among unique sequences from dist.dna(dna, model = “N”)

2) Use dist_groups as above to calculate within and between sample pairwise distance, but someway make this a weighted distance by incorporating the replicated sequences within each sample.

Any ideas how to do this in either R or Python? At this stage i’m up for anything!