Here is a fairly efficient way to do this; assuming hg38
and BEDOPS and standard Unix tools installed.
$ bedmap --echo --echo-map-id --delim 't' <(awk '{n=split($0,a,/[:_]/); print "chr"a[1]"t"a[2]"t"a[2]+1"t"a[3]"https://www.biostars.org/"a[4];}' sumstats.txt | sort-bed -) <(wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp150.txt.gz | gunzip -c | cut -f2-5 | sort-bed -) > answer.bed
This gets around making and storing intermediate files. The output BED file will use the UCSC-favored chromosome naming scheme.
If you instead want to set up and investigate intermediate files to see how this works:
$ awk '{n=split($0,a,/[:_]/); print "chr"a[1]"t"a[2]"t"a[2]+1"t"a[3]"https://www.biostars.org/"a[4];}' sumstats.txt | sort-bed - > sumstats.bed
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp150.txt.gz | gunzip -c | cut -f2-5 | sort-bed - > snp150.bed
The sumstats.bed
file will look something like:
$ echo "1:100000012_G_T" | awk '{n=split($0,a,/[:_]/); print "chr"a[1]"t"a[2]"t"a[2]+1"t"a[3]"https://www.biostars.org/"a[4];}'
chr1 100000012 100000013 G/T
We add chr
as a prefix to the chromosome name, so that mapping can be done on the UCSC dbSNP data.
Finally:
$ bedmap --echo --echo-map-id --delim 't' sumstats.bed snp150.bed > answer.bed
The wget
statement could take a while; it is 7GB compressed. Grab some coffee!
Update
If your reference genome is hg19
, and if your sumstats.txt
file has a header and is 1-based, you may need to make some changes.
sumstats header and 1-based indexing
If the coordinates in your sumstats.txt
file use 1-based indexing, and if the sumstats.txt
file contains a header, then you will need to make a change to the awk
statement I provide above:
$ tail -n+2 sumstats.txt | awk '{n=split($0,a,/[:_]/); print "chr"a[1]"t"a[2]-1"t"a[2]"t"a[3]"https://www.biostars.org/"a[4];}' - | sort-bed - > sumstats.bed
dbSNP v150 for hg19
You could download the following VCF file of SNPs and convert it to BED via vcf2bed
:
$ wget -qO- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz | vcf2bed --sort-tmpdir=${PWD} --max-mem=2G - > hg19.dbSNP150.bed
It may be preferable to download the dbSNP file directly from NCBI as a VCF file. This will eliminate uncertainty about indexing of coordinates in UCSC’s table.
Mapping
Then run the bedmap
command to map SNPs to sumstat elements:
$ bedmap --echo --echo-map-id --delim 't' sumstats.bed hg19.dbSNP150.bed > answer.bed
Read more here: Source link