For dataframe manipulation, in general, you should look into the dplyr
and tidyr
packages, they offer endless possibilities if you learn to manipulate them (lots of practice will help). A good and concise cheatsheet is available here.
Regarding this problem in particular, something like this should work:
library(dplyr)
library(tidyr)
dfA <- data.frame(gene = c("a", "b", "c", "d", "e"),
mus_id = c("mus001", "mus002", "mus003", "mus004", "mus005"))
dfB <- data.frame(mus_id = c("mus001", "mus002", "mus003,mus002", "mus004"),
zeb_id = c("dar001", "dar002,dar004", "dar003,dar002", "dar004"))
dfC <- dfB %>%
tidyr::separate_rows(zeb_id, sep = ",") %>%
tidyr::separate_rows(mus_id, sep = ",") %>%
dplyr::group_by(mus_id) %>%
dplyr::summarise(zeb_id = paste0(sort(unique(na.omit(zeb_id))), collapse=","))
dfD <- dfA %>%
dplyr::left_join(dfC)
Read more here: Source link