Hi all,
I have raw counts for my samples in a data frame. The row names are Ensembl IDs and I want to convert them to gene symbols, so I ran the code below.
library(TCGAbiolinks)
# Query the HTSeq raw counts for TCGA-COAD tumor and normal samples
query <- GDCquery(project = "TCGA-COAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  sample.type = c("Primary Tumor", "Solid Tissue Normal"),
                  experimental.strategy = "RNA-Seq")
GDCdownload(query)
query.counts.colon <- GDCprepare(query)
ColonMatrix <- as.data.frame(SummarizedExperiment::assay(query.counts.colon))
ens <- row.names(ColonMatrix)
> length(ens)
[1] 56602
# Convert Ensembl IDs to gene symbols with org.Hs.eg.db
library(org.Hs.eg.db)
ens_to_symbol <- mapIds(
  org.Hs.eg.db,
  keys = ens,
  column = "SYMBOL",
  keytype = "ENSEMBL")
# Same conversion with biomaRt
library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
ens_to_symbol_biomart <- getBM(
  filters = "ensembl_gene_id",
  attributes = c("ensembl_gene_id", "hgnc_symbol"),
  values = ens,
  mart = mart)
# Left join so that every ID in ens is kept, even if biomaRt returned nothing for it
ens_to_symbol_biomart <- merge(
  x = as.data.frame(ens),
  y = ens_to_symbol_biomart,
  by.x = "ens",
  by.y = "ensembl_gene_id",
  all.x = TRUE)
head(ens_to_symbol_biomart)
              ens hgnc_symbol
1 ENSG00000000003      TSPAN6
2 ENSG00000000005        TNMD
3 ENSG00000000419        DPM1
4 ENSG00000000457       SCYL3
5 ENSG00000000460    C1orf112
6 ENSG00000000938         FGR
But when I checked for duplicated gene symbols, I found this:
> table(duplicated(ens_to_symbol_biomart$hgnc_symbol))
FALSE  TRUE
38446 18156
I don't know what causes these duplicates. Should I remove the duplicated rows?
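One thing that may matter here: duplicated() flags every repeat of NA and of the empty string as a duplicate, so Ensembl IDs with no symbol could make up a large part of that count. Below is a minimal sketch of how the count could be broken down. It uses a made-up toy table in place of ens_to_symbol_biomart (the IDs and symbols in it are hypothetical, not real query results), and it assumes that unmatched IDs show up as NA after the merge and that IDs without an HGNC symbol may come back as "".

```r
# Toy stand-in for ens_to_symbol_biomart (hypothetical values, not real output)
toy <- data.frame(
  ens = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D", "ENSG_E", "ENSG_F"),
  hgnc_symbol = c("GENE1", NA, NA, "", "", "GENE1"),
  stringsAsFactors = FALSE
)

sym <- toy$hgnc_symbol

# Classify each row by the kind of symbol it carries
is_missing <- is.na(sym)                # no match at all (NA from merge with all.x = TRUE)
is_empty   <- !is_missing & sym == ""   # matched, but no symbol assigned
has_symbol <- !is_missing & !is_empty   # a real, non-empty symbol

# duplicated() marks the second and later occurrences of any value,
# including repeated NA and repeated ""
dup_all <- duplicated(sym)

breakdown <- c(
  total_duplicated   = sum(dup_all),
  from_missing       = sum(dup_all & is_missing),
  from_empty         = sum(dup_all & is_empty),
  from_shared_symbol = sum(dup_all & has_symbol)
)
print(breakdown)
```

Running the same steps with toy replaced by ens_to_symbol_biomart should show how many of the 18156 come from missing or empty symbols rather than from several Ensembl IDs sharing one real symbol.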
Thanks for any help