Why are so many gene symbols duplicated when I convert Ensembl IDs to gene symbols?

Hi all,
I have raw counts for my samples in a data frame. The row names are Ensembl IDs, and I want to convert them to gene symbols, so I ran the code below.

library(TCGAbiolinks)

# Query TCGA-COAD HTSeq raw counts for tumor and normal samples
query <- GDCquery(project = "TCGA-COAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  sample.type = c("Primary Tumor", "Solid Tissue Normal"),
                  experimental.strategy = "RNA-Seq")

GDCdownload(query)

query.counts.colon <- GDCprepare(query)

# Extract the count matrix; row names are Ensembl gene IDs
ColonMatrix <- as.data.frame(SummarizedExperiment::assay(query.counts.colon))

ens <- row.names(ColonMatrix)


> length(ens)
[1] 56602


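# Convert Ensembl IDs to gene symbols

library(org.Hs.eg.db)
library(biomaRt)  # needed for useMart(), useDataset() and getBM()

# Mapping with org.Hs.eg.db
ens_to_symbol <- mapIds(
  org.Hs.eg.db,
  keys = ens,
  column = 'SYMBOL',
  keytype = 'ENSEMBL')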


mart <- useDataset('hsapiens_gene_ensembl', useMart('ensembl'))
ens_to_symbol_biomart <- getBM(
  filters="ensembl_gene_id",
  attributes = c('ensembl_gene_id', 'hgnc_symbol'),
  values = ens,
  mart = mart)


ens_to_symbol_biomart <- merge(
  x = as.data.frame(ens),
  y = ens_to_symbol_biomart,
  by.x = 'ens',
  by.y = 'ensembl_gene_id',
  all.x = TRUE)
head(ens_to_symbol_biomart)


              ens hgnc_symbol
1 ENSG00000000003      TSPAN6
2 ENSG00000000005        TNMD
3 ENSG00000000419        DPM1
4 ENSG00000000457       SCYL3
5 ENSG00000000460    C1orf112
6 ENSG00000000938         FGR

But when I check for duplicated gene symbols, I find this:

> table(duplicated(ens_to_symbol_biomart$hgnc_symbol))
FALSE  TRUE 
38446 18156
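
One thing that may be worth checking before reading too much into this number: `duplicated()` also flags repeated `NA` values and repeated empty strings, so IDs without any symbol are counted as duplicates of each other. The sketch below uses a made-up toy table (the data frame `toy`, its IDs and its symbols are assumptions for illustration, not real annotation) to count missing symbols separately from symbols that are actually repeated.

```r
# Toy mapping table; IDs and symbols are made up for illustration only.
toy <- data.frame(
  ens = c("ENSG01", "ENSG02", "ENSG03", "ENSG04", "ENSG05", "ENSG06"),
  hgnc_symbol = c("GENEA", "", "", NA, NA, "GENEA"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second "", the second NA and the second "GENEA".
n_flagged <- sum(duplicated(toy$hgnc_symbol))

# Treat NA and "" as "no symbol" and count those rows separately.
missing_symbol <- is.na(toy$hgnc_symbol) | toy$hgnc_symbol == ""
n_missing <- sum(missing_symbol)

# Duplicates among the rows that do have a symbol.
n_true_dup <- sum(duplicated(toy$hgnc_symbol[!missing_symbol]))
```

The same three lines could be run on `ens_to_symbol_biomart$hgnc_symbol` to see how many of the flagged rows simply have no symbol.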

I don’t know the reason for these duplicates. Should I remove the duplicated rows?
Thanks for any help
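
For reference, if the rows do end up being collapsed rather than removed, one possible approach is to drop rows without a symbol and sum the counts of rows that share a symbol with base R's `rowsum()`. This is only a sketch on made-up data (the matrix `counts` and the vector `symbols` are hypothetical). Whether summing is appropriate depends on the downstream analysis, and keeping the Ensembl IDs as row names avoids the problem altogether.

```r
# Toy count matrix; row names stand in for Ensembl IDs (made up).
counts <- matrix(
  c(10, 0, 5, 3,
    20, 1, 7, 4),
  nrow = 4,
  dimnames = list(c("ENSG01", "ENSG02", "ENSG03", "ENSG04"), c("S1", "S2"))
)

# Hypothetical symbols, in the same order as the rows of counts.
symbols <- c("GENEA", NA, "GENEA", "GENEB")

# Keep only rows that have a non-empty symbol.
keep <- !is.na(symbols) & symbols != ""

# Sum the counts of rows sharing a symbol; row names become the symbols.
collapsed <- rowsum(counts[keep, , drop = FALSE], group = symbols[keep])
```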
