TxDB.Hsapiens.UCSC.hg38.knownGene with locateVariants() identifying SNPs from various chromosome being part of the same gene

I am trying to annotate a list of SNPs using the hg38 genome (knownGene) and locateVariants().
The program is able to successfully run and provide “GeneIDs” for several of the loci.
However, some GeneIDs are applied to SNPs in completely different regions and on completely different chromosomes.
When I cross reference these GeneIDs (I like to use ABO, geneID=28, as an example) with the Homo_sapiens.gene_info file from NCBI and check the original knownGene.txt.gz file on the UCSC website, I don’t see that any of the reading frames associated with these genes as being in the ranges that locateVariant seems to indicate.
I don’t know if the problem is with the knownGene.sqlite file, the locateVariants() function, or my R code.
I have updated my R software and all used packages within the past two weeks.

The vcf file that I am using looks like:
##fileformat=VCFv4.0    
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
1   248883847
1   248883842
1   248883838
1   248883825

The gene_info file looks like:
tax_id,GeneID,Symbol,LocusTag,Synonyms,dbXrefs,chromosome,map_location,description,type_of_gene,Symbol_from_nomenclature_authority,Full_name_from_nomenclature_authority,Nomenclature_status,Other_designations,Modification_date,Feature_type
9606,1,A1BG,-,A1B|ABG|GAB|HYST2477,MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410|AllianceGenome:HGNC:5,19,19q13.43,alpha-1-B glycoprotein,protein-coding,A1BG,alpha-1-B glycoprotein,O,alpha-1B-glycoprotein|HEL-S-163pA|epididymis secretory sperm binding protein Li 163pA,20221209,-
9606,2,A2M,-,A2MD|CPAMD5|FWP007|S863-7,MIM:103950|HGNC:HGNC:7|Ensembl:ENSG00000175899|AllianceGenome:HGNC:7,12,12p13.31,alpha-2-macroglobulin,protein-coding,A2M,alpha-2-macroglobulin,O,alpha-2-macroglobulin|C3 and PZP-like alpha-2-macroglobulin domain-containing protein 5|alpha-2-M,20230312,-

When the code is done test contains variants associated with gene #28 (ABO), with an intron on chr1 (wrong), a promoter on chr9 (correct), and an intron on chr11 (wrong). Subset:

       seqnames     start       end width strand LOCATION GENEID
157        chr1 246336469 246336469     1      -   intron     28
158        chr1 246336469 246336469     1      -   intron     28
159        chr1 246336469 246336469     1      -   intron     28
227014     chr9 133277952 133277952     1      - promoter     28
227016     chr9 133277979 133277979     1      - promoter     28
227018     chr9 133277980 133277980     1      - promoter     28
254294    chr11  69305850  69305850     1      -   intron     28
254295    chr11  69305850  69305850     1      -   intron     28
254296    chr11  69305850  69305850     1      -   intron     28
254297    chr11  69305850  69305850     1      -   intron     28

library("GenomicFeatures")
library("VariantAnnotation")
library("TxDb.Hsapiens.UCSC.hg38.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

setwd("C:\\Users\\david\\Desktop\\Annotations")
filename <- "CP_E92_vs_E76"

dir <- "C:\\Users\\david\\Desktop\\Annotations\\D_Human_knownGene\\CP_E92_vs_E76"

setwd(dir)
vcf <- readVcf(paste(filename, "_DP4_13min_nona_rheMac10_liftover.vcf.txt", sep = ""), genome = "hg38", verbose = TRUE)
seqlevels(vcf) = paste0("chr", seqlevels(vcf))
rd <- rowRanges(vcf)
variants <- locateVariants(rd, txdb, AllVariants()) 

write.csv(variants[,c(1,7)], paste(filename, "_annotations.csv", sep = ""), row.names = FALSE) 
n <- max(count.fields(paste(filename, "_annotations.csv", sep = ""), sep = ','))
SNPs1 <- read.table(paste(filename, "_annotations.csv", sep = ""), sep=",", skip=1, header = FALSE, col.names = paste0("V",seq_len(n)), fill = TRUE)
SNPs2 <- SNPs1[SNPs1[,4]==1,]
colnames(SNPs2) <- c("seqnames", "start", "end", "width", "strand", "LOCATION", "GENEID")

SNPs3 <- SNPs2[!SNPs2[,6]=="intergenic",]

symbol <- read.csv("C:\\Users\\david\\Desktop\\Annotations\\Z_Alias\\human\\Homo_sapiens.gene_info_031523.csv", header=TRUE)
symbol <- symbol[,2:3]
colnames(symbol) <- c("GENEID", "Symbol")
snps_symbol <- merge(SNPs3, as.data.frame(symbol), by = "GENEID")

test <- SNPs2[which(SNPs2[,7]==28),]

sessionInfo()

Warning: The working directory was changed to C:/Users/david/Desktop/Annotations inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 0 ‘info’ fields:  
found header lines for 0 ‘geno’ fields:  
Warning: GRanges object contains 37482 out-of-bound ranges located on sequences 22281, 22282, 22155, 22156, 22157, 22159,...
'select()' returned many:1 mapping between keys and columns
'select()' returned many:1 mapping between keys and columns
Warning: GRanges object contains 222 out-of-bound ranges located on sequences chr1_GL383518v1_alt, chr1_KI270762v1_alt, chr1_KQ458384v1_alt,...

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.16.0 VariantAnnotation_1.44.1                 Rsamtools_2.14.0                         Biostrings_2.66.0                       
 [5] XVector_0.38.0                           SummarizedExperiment_1.28.0              MatrixGenerics_1.10.0                    matrixStats_0.63.0                      
 [9] GenomicFeatures_1.50.4                   AnnotationDbi_1.60.2                     Biobase_2.58.0                           GenomicRanges_1.50.2                    
[13] GenomeInfoDb_1.34.9                      IRanges_2.32.0                           S4Vectors_0.36.2                         BiocGenerics_0.44.0                     

loaded via a namespace (and not attached):
 [1] httr_1.4.5               bit64_4.0.5              BiocManager_1.30.20      BiocFileCache_2.6.1      blob_1.2.3               BSgenome_1.66.3          GenomeInfoDbData_1.2.9   yaml_2.3.7              
 [9] progress_1.2.2           pillar_1.8.1             RSQLite_2.3.0            lattice_0.20-45          glue_1.6.2               digest_0.6.31            htmltools_0.5.4          Matrix_1.5-1            
[17] XML_3.99-0.13            pkgconfig_2.0.3          biomaRt_2.54.0           zlibbioc_1.44.0          BiocParallel_1.32.5      tibble_3.2.0             KEGGREST_1.38.0          generics_0.1.3          
[25] ellipsis_0.3.2           cachem_1.0.7             cli_3.6.0                magrittr_2.0.3           crayon_1.5.2             memoise_2.0.1            evaluate_0.20            fansi_1.0.4             
[33] xml2_1.3.3               tools_4.2.2              prettyunits_1.1.1        hms_1.1.2                BiocIO_1.8.0             lifecycle_1.0.3          stringr_1.5.0            DelayedArray_0.23.2     
[41] compiler_4.2.2           rlang_1.0.6              grid_4.2.2               RCurl_1.98-1.10          rjson_0.2.21             rappdirs_0.3.3           bitops_1.0-7             rmarkdown_2.20          
[49] restfulr_0.0.15          codetools_0.2-18         DBI_1.1.3                curl_5.0.0               R6_2.5.1                 GenomicAlignments_1.34.1 knitr_1.42               dplyr_1.1.0             
[57] rtracklayer_1.58.0       fastmap_1.1.1            bit_4.0.5                utf8_1.2.3               filelock_1.0.2           stringi_1.7.12           parallel_4.2.2           Rcpp_1.0.10             
[65] vctrs_0.5.2              png_0.1-8                dbplyr_2.3.1             tidyselect_1.2.0         xfun_0.37

Thank you very much for any help you can give me in identifying what is going wrong here.

Read more here: Source link