combine OTU and tax table and replace actual sequences with OTU ids (Phyloseq/dada2)

This is to the part of the question “replace actual sequences with OTU ids (Phyloseq/dada2)?”

I contacted the phyloseq/dada2 developers and based on Susan Holmes’ reply (github.com/joey711/phyloseq/issues/1030) I came up with this piece of code to replace the amplicon sequences with a numbered OTU header.

Further discussion can be found here: github.com/joey711/phyloseq/issues/213

# this changes the header from the actual sequence to Seq_001, Seq_002 etc
taxa_names(ps)
n_seqs <- seq(ntaxa(ps))
len_n_seqs <- nchar(max(n_seqs))
taxa_names(ps) <- paste("Seq", formatC(n_seqs, 
                                            width = len_n_seqs, 
                                            flag = "0"), sep = "_")
taxa_names(ps)

A possible way to get taxonomy included into the header is the following (continuing from above):

# generate a vector containing the full taxonomy path for all OTUs
wholetax <- do.call(paste, c(as.data.frame(tax_table(ps))
                  [c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")], 
                  sep = "__"))  # to distinguish from "_" within tax ranks

# turn the otu_table into a data.frame
otu_export <- as.data.frame(otu_table(ps))
tmp <- names(otu_export)

# paste wholetax and OTU_ids together
for(i in 1:length(tmp)){
names(tmp)[i] = paste(wholetax[i], tmp[i], sep = "__")
}

# overwrite old names
names(otu_export) <- names(tmp)

> head(otu_export)[5]

# output:  
     Bacteria__Bacteroidetes__Bacteroidia__Bacteroidales__Bacteroidaceae__Bacteroides__Seq_005
F3D0                                                                                         146
F3D1                                                                                         126
F3D11                                                                                        496
F3D125                                                                                       440
F3D13                                                                                       1183
F3D141                                                                                       184

This yet does not include a test of correct sorting between the tables! So make sure the pasting and overwriting is correct.

This way you have a data.frame containing taxonomy “splittable” for each taxonomic rank, the OTU id, the sample name and the counts in one file. But apart from the export file, you still maintain the phyloseq-structure, where the OTU_ids link the different tables such as otu_table() and tax_table(). Another way would be to provide the wholetax vector to the taxa_names() command, I haven’t tested that.

Suggestions for improvements are highly welcomed!

Read more here: Source link