Get rsID for a list of SNPs in an entire GWAS sumstats file

Here is a fairly efficient way to do this; assuming hg38 and BEDOPS and standard Unix tools installed.

$ bedmap --echo --echo-map-id --delim 't' <(awk '{n=split($0,a,/[:_]/); print "chr"a[1]"t"a[2]"t"a[2]+1"t"a[3]"https://www.biostars.org/"a[4];}' sumstats.txt | sort-bed -) <(wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp150.txt.gz | gunzip -c | cut -f2-5 | sort-bed -) > answer.bed

This gets around making and storing intermediate files. The output BED file will use the UCSC-favored chromosome naming scheme.

If you instead want to set up and investigate intermediate files to see how this works:

$ awk '{n=split($0,a,/[:_]/); print "chr"a[1]"t"a[2]"t"a[2]+1"t"a[3]"https://www.biostars.org/"a[4];}' sumstats.txt | sort-bed - > sumstats.bed
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp150.txt.gz | gunzip -c | cut -f2-5 | sort-bed - > snp150.bed

The sumstats.bed file will look something like:

$ echo "1:100000012_G_T" | awk '{n=split($0,a,/[:_]/); print "chr"a[1]"t"a[2]"t"a[2]+1"t"a[3]"https://www.biostars.org/"a[4];}'
chr1    100000012   100000013   G/T

We add chr as a prefix to the chromosome name, so that mapping can be done on the UCSC dbSNP data.

Finally:

$ bedmap --echo --echo-map-id --delim 't' sumstats.bed snp150.bed > answer.bed

The wget statement could take a while; it is 7GB compressed. Grab some coffee!


Update

If your reference genome is hg19, and if your sumstats.txt file has a header and is 1-based, you may need to make some changes.

sumstats header and 1-based indexing

If the coordinates in your sumstats.txt file use 1-based indexing, and if the sumstats.txt file contains a header, then you will need to make a change to the awk statement I provide above:

$ tail -n+2 sumstats.txt | awk '{n=split($0,a,/[:_]/); print "chr"a[1]"t"a[2]-1"t"a[2]"t"a[3]"https://www.biostars.org/"a[4];}' - | sort-bed - > sumstats.bed

dbSNP v150 for hg19

You could download the following VCF file of SNPs and convert it to BED via vcf2bed:

$ wget -qO- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz | vcf2bed --sort-tmpdir=${PWD} --max-mem=2G - > hg19.dbSNP150.bed

It may be preferable to download the dbSNP file directly from NCBI as a VCF file. This will eliminate uncertainty about indexing of coordinates in UCSC’s table.

Mapping

Then run the bedmap command to map SNPs to sumstat elements:

$ bedmap --echo --echo-map-id --delim 't' sumstats.bed hg19.dbSNP150.bed > answer.bed

Read more here: Source link