tabix for ID column
Hello,
I’m looking for something similar to tabix. But instead of looking for informations within a given region, I would like to use the values in the ID column for quickly lookup.
So for example I would like to take the compressed dbSNP file, index it by the ID column and than search quickly for the line with a given rs number.
Do you know any program or way that can do this?
fin swimmer
• 2.4k views
create a text file filled with RSID CHROM POS
and sort on RSID.
awk '/^#/ {next;} ($3==".") {next;} {OFS="t";print $3,$1,$2;}' input.vcf | sort -k1,1 > index.txt
when querying , use join build an interval and query with tabix
join -t $'t' -1 1 -2 1 <(echo "rs12345") index.txt |
awk '{printf("%s:%s-%sn",$2,$3,$3);}' |
while read P;
do tabix input.vcf "$P"
done
another solution: put those ID/CHROM/POS in a sqlite3 database and query it the same way.
sqlite3 index.db 'select chrom,position from myindex where rsID="rs17625"' | awk -F '|' '{printf("%s:%s-%sn",$1,$1,$1);}' |
while read P;
do tabix input.vcf "$P"
done
Hello,
long time ago I started this topic. Unfortunately I cannot comment on the solutions mentioned here because I dropped my idea for what I need it and so never tested it.
But today I found a nice trick on github of htslib. The person who posted their uses “rs” as contig name and the number as start and end position. This results in a file like this :
rs 76264143 76264143 chr1 982843 982844 G
You can than index by the first 3 columns and query like before. Nice!
fin swimmer
Another option is a fast binary search on sorted data. No indexing needed, because the sort order is the index.
1) Get SNPs and write them into a text file sorted by SNP ID. For hg19
, for instance:
$ LC_ALL=C
$ wget -qO- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz
| gunzip -c
| vcf2bed --do-not-sort
| awk -v OFS="t" '{ print $4, $2, $3, ($3+1), $6"https://www.biostars.org/"$7; }'
| sort -k1,1
> hg19.snp150.sortedByName.txt
2) Install pts-line-bisect:
$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..
3) Run a query:
$ rs_of_interest=rs12345
$ ./pts-line-bisect/pts_lbsearch -p hg19.snp150.sortedByName.txt ${rs_of_interest} | head -1 | cut -f2-
Traffic: 1050 users visited in the last hour
Read more here: Source link