I am trying to annotate a list of SNPs using the hg38 genome (knownGene) and locateVariants().
The program is able to successfully run and provide “GeneIDs” for several of the loci.
However, some GeneIDs are applied to SNPs in completely different regions and on completely different chromosomes.
When I cross reference these GeneIDs (I like to use ABO, geneID=28, as an example) with the Homo_sapiens.gene_info file from NCBI and check the original knownGene.txt.gz file on the UCSC website, I don’t see that any of the reading frames associated with these genes as being in the ranges that locateVariant seems to indicate.
I don’t know if the problem is with the knownGene.sqlite file, the locateVariants() function, or my R code.
I have updated my R software and all used packages within the past two weeks.
The vcf file that I am using looks like:
##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
1 248883847
1 248883842
1 248883838
1 248883825
The gene_info file looks like:
tax_id,GeneID,Symbol,LocusTag,Synonyms,dbXrefs,chromosome,map_location,description,type_of_gene,Symbol_from_nomenclature_authority,Full_name_from_nomenclature_authority,Nomenclature_status,Other_designations,Modification_date,Feature_type
9606,1,A1BG,-,A1B|ABG|GAB|HYST2477,MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410|AllianceGenome:HGNC:5,19,19q13.43,alpha-1-B glycoprotein,protein-coding,A1BG,alpha-1-B glycoprotein,O,alpha-1B-glycoprotein|HEL-S-163pA|epididymis secretory sperm binding protein Li 163pA,20221209,-
9606,2,A2M,-,A2MD|CPAMD5|FWP007|S863-7,MIM:103950|HGNC:HGNC:7|Ensembl:ENSG00000175899|AllianceGenome:HGNC:7,12,12p13.31,alpha-2-macroglobulin,protein-coding,A2M,alpha-2-macroglobulin,O,alpha-2-macroglobulin|C3 and PZP-like alpha-2-macroglobulin domain-containing protein 5|alpha-2-M,20230312,-
When the code is done test contains variants associated with gene #28 (ABO), with an intron on chr1 (wrong), a promoter on chr9 (correct), and an intron on chr11 (wrong). Subset:
seqnames start end width strand LOCATION GENEID
157 chr1 246336469 246336469 1 - intron 28
158 chr1 246336469 246336469 1 - intron 28
159 chr1 246336469 246336469 1 - intron 28
227014 chr9 133277952 133277952 1 - promoter 28
227016 chr9 133277979 133277979 1 - promoter 28
227018 chr9 133277980 133277980 1 - promoter 28
254294 chr11 69305850 69305850 1 - intron 28
254295 chr11 69305850 69305850 1 - intron 28
254296 chr11 69305850 69305850 1 - intron 28
254297 chr11 69305850 69305850 1 - intron 28
library("GenomicFeatures")
library("VariantAnnotation")
library("TxDb.Hsapiens.UCSC.hg38.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
setwd("C:\\Users\\david\\Desktop\\Annotations")
filename <- "CP_E92_vs_E76"
dir <- "C:\\Users\\david\\Desktop\\Annotations\\D_Human_knownGene\\CP_E92_vs_E76"
setwd(dir)
vcf <- readVcf(paste(filename, "_DP4_13min_nona_rheMac10_liftover.vcf.txt", sep = ""), genome = "hg38", verbose = TRUE)
seqlevels(vcf) = paste0("chr", seqlevels(vcf))
rd <- rowRanges(vcf)
variants <- locateVariants(rd, txdb, AllVariants())
write.csv(variants[,c(1,7)], paste(filename, "_annotations.csv", sep = ""), row.names = FALSE)
n <- max(count.fields(paste(filename, "_annotations.csv", sep = ""), sep = ','))
SNPs1 <- read.table(paste(filename, "_annotations.csv", sep = ""), sep=",", skip=1, header = FALSE, col.names = paste0("V",seq_len(n)), fill = TRUE)
SNPs2 <- SNPs1[SNPs1[,4]==1,]
colnames(SNPs2) <- c("seqnames", "start", "end", "width", "strand", "LOCATION", "GENEID")
SNPs3 <- SNPs2[!SNPs2[,6]=="intergenic",]
symbol <- read.csv("C:\\Users\\david\\Desktop\\Annotations\\Z_Alias\\human\\Homo_sapiens.gene_info_031523.csv", header=TRUE)
symbol <- symbol[,2:3]
colnames(symbol) <- c("GENEID", "Symbol")
snps_symbol <- merge(SNPs3, as.data.frame(symbol), by = "GENEID")
test <- SNPs2[which(SNPs2[,7]==28),]
sessionInfo()
Warning: The working directory was changed to C:/Users/david/Desktop/Annotations inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER
found header lines for 0 ‘info’ fields:
found header lines for 0 ‘geno’ fields:
Warning: GRanges object contains 37482 out-of-bound ranges located on sequences 22281, 22282, 22155, 22156, 22157, 22159,...
'select()' returned many:1 mapping between keys and columns
'select()' returned many:1 mapping between keys and columns
Warning: GRanges object contains 222 out-of-bound ranges located on sequences chr1_GL383518v1_alt, chr1_KI270762v1_alt, chr1_KQ458384v1_alt,...
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8 LC_NUMERIC=C LC_TIME=English_United States.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.16.0 VariantAnnotation_1.44.1 Rsamtools_2.14.0 Biostrings_2.66.0
[5] XVector_0.38.0 SummarizedExperiment_1.28.0 MatrixGenerics_1.10.0 matrixStats_0.63.0
[9] GenomicFeatures_1.50.4 AnnotationDbi_1.60.2 Biobase_2.58.0 GenomicRanges_1.50.2
[13] GenomeInfoDb_1.34.9 IRanges_2.32.0 S4Vectors_0.36.2 BiocGenerics_0.44.0
loaded via a namespace (and not attached):
[1] httr_1.4.5 bit64_4.0.5 BiocManager_1.30.20 BiocFileCache_2.6.1 blob_1.2.3 BSgenome_1.66.3 GenomeInfoDbData_1.2.9 yaml_2.3.7
[9] progress_1.2.2 pillar_1.8.1 RSQLite_2.3.0 lattice_0.20-45 glue_1.6.2 digest_0.6.31 htmltools_0.5.4 Matrix_1.5-1
[17] XML_3.99-0.13 pkgconfig_2.0.3 biomaRt_2.54.0 zlibbioc_1.44.0 BiocParallel_1.32.5 tibble_3.2.0 KEGGREST_1.38.0 generics_0.1.3
[25] ellipsis_0.3.2 cachem_1.0.7 cli_3.6.0 magrittr_2.0.3 crayon_1.5.2 memoise_2.0.1 evaluate_0.20 fansi_1.0.4
[33] xml2_1.3.3 tools_4.2.2 prettyunits_1.1.1 hms_1.1.2 BiocIO_1.8.0 lifecycle_1.0.3 stringr_1.5.0 DelayedArray_0.23.2
[41] compiler_4.2.2 rlang_1.0.6 grid_4.2.2 RCurl_1.98-1.10 rjson_0.2.21 rappdirs_0.3.3 bitops_1.0-7 rmarkdown_2.20
[49] restfulr_0.0.15 codetools_0.2-18 DBI_1.1.3 curl_5.0.0 R6_2.5.1 GenomicAlignments_1.34.1 knitr_1.42 dplyr_1.1.0
[57] rtracklayer_1.58.0 fastmap_1.1.1 bit_4.0.5 utf8_1.2.3 filelock_1.0.2 stringi_1.7.12 parallel_4.2.2 Rcpp_1.0.10
[65] vctrs_0.5.2 png_0.1-8 dbplyr_2.3.1 tidyselect_1.2.0 xfun_0.37
Thank you very much for any help you can give me in identifying what is going wrong here.
Read more here: Source link