Hi all, I would like to download the bulk RNA-seq data for all patients in the TCGA-LUAD cohort using TCGAbiolinks. Does this exist as a single matrix?
I have read the package vignette and can download individual cases. However, does TCGAbiolinks support downloading a single matrix covering all patients?
I ask because the equivalent data from the Xena browser can be downloaded as a single 585-column matrix.
I tried this with TCGAbiolinks:
library(TCGAbiolinks)
library(dplyr) # for %>%, arrange(), filter() and select() used below
test <- GDCquery(project = "TCGA-LUAD", data.category = "Gene expression", data.type = "Gene expression quantification", platform = "Illumina HiSeq", file.type = "results", legacy = TRUE)
dim(getResults(test))
The query returns 600 files.
I ran the code below to check whether one file was much larger than the others (which would suggest a combined matrix), but none is, so the 600 files appear to be separate per-sample files:
getResults(test) %>% arrange(desc(file_size)) %>% head(10)
Finally, I looked at the duplicated cases. Some cases have one file for cancer tissue and one for normal tissue, which is fine, but other patients have 2 or 3 files that are all for cancer tissue. Which file should I choose?
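res <- getResults(test) # fetch the results table once instead of on every iteration
# Submitter IDs that occur more than once; unique() avoids printing a case twice when it has 3 files
dups <- unique(res$cases.submitter_id[duplicated(res$cases.submitter_id)])
for (i in seq_along(dups)) {
  print(i)
  # Sample types of every file belonging to this duplicated case
  print(res %>% filter(cases.submitter_id == dups[i]) %>% select(sample_type))
}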
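To separate the acceptable tumour/normal pairs from the cases that really are ambiguous, I think a count per patient and sample type would be more readable than the loop. Below is only a sketch on made-up data: `res` here is a hypothetical stand-in for getResults(test) that borrows its two column names, and `counts` and `ambiguous` are names I invented for illustration. It uses base R only.

```r
# Toy stand-in for getResults(test); the IDs and rows are made up for illustration.
res <- data.frame(
  cases.submitter_id = c("P1", "P1", "P2", "P2", "P2", "P3"),
  sample_type = c("Primary Tumor", "Solid Tissue Normal",
                  "Primary Tumor", "Primary Tumor", "Primary Tumor",
                  "Primary Tumor"),
  stringsAsFactors = FALSE
)

# Count files for every combination of patient and sample type.
counts <- as.data.frame(table(res$cases.submitter_id, res$sample_type),
                        stringsAsFactors = FALSE)
names(counts) <- c("cases.submitter_id", "sample_type", "n_files")

# Keep only patient/sample-type pairs with more than one file.
# A tumour/normal pair gives one file per sample type, so it is not flagged.
ambiguous <- counts[counts$n_files > 1, ]
print(ambiguous)
```

On the real table this would list only the patients with several files for the same tissue type, which are the ones I do not know how to resolve.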
Any help appreciated, thanks in advance