tabix for ID column

tabix for ID column

4

Hello,

I’m looking for something similar to tabix. But instead of looking for informations within a given region, I would like to use the values in the ID column for quickly lookup.

So for example I would like to take the compressed dbSNP file, index it by the ID column and than search quickly for the line with a given rs number.

Do you know any program or way that can do this?

fin swimmer


snp


tabix

• 2.4k views

create a text file filled with RSID CHROM POS and sort on RSID.

awk '/^#/ {next;} ($3==".") {next;} {OFS="t";print $3,$1,$2;}' input.vcf | sort -k1,1 > index.txt

when querying , use join build an interval and query with tabix

join -t $'t' -1 1 -2 1 <(echo "rs12345") index.txt  |
awk '{printf("%s:%s-%sn",$2,$3,$3);}' |
while read P;
  do tabix input.vcf "$P" 
done

another solution: put those ID/CHROM/POS in a sqlite3 database and query it the same way.

sqlite3 index.db 'select chrom,position from myindex where rsID="rs17625"' | awk -F '|' '{printf("%s:%s-%sn",$1,$1,$1);}' |
    while read P;
      do tabix input.vcf "$P" 
    done

You could put your table in a sqlite database and index the ID column. Then retrieving records by ID will be very fast. However, this would require working with a sqlite file and associated SQL syntax rather than with the usual unix/R/python tools.

Hello,

long time ago I started this topic. Unfortunately I cannot comment on the solutions mentioned here because I dropped my idea for what I need it and so never tested it.

But today I found a nice trick on github of htslib. The person who posted their uses “rs” as contig name and the number as start and end position. This results in a file like this :

rs 76264143 76264143 chr1 982843 982844 G

You can than index by the first 3 columns and query like before. Nice!

fin swimmer

Another option is a fast binary search on sorted data. No indexing needed, because the sort order is the index.

1) Get SNPs and write them into a text file sorted by SNP ID. For hg19, for instance:

$ LC_ALL=C
$ wget -qO- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz 
   | gunzip -c 
   | vcf2bed --do-not-sort 
   | awk -v OFS="t" '{ print $4, $2, $3, ($3+1), $6"https://www.biostars.org/"$7; }' 
   | sort -k1,1 
   > hg19.snp150.sortedByName.txt

2) Install pts-line-bisect:

$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..

3) Run a query:

$ rs_of_interest=rs12345
$ ./pts-line-bisect/pts_lbsearch -p hg19.snp150.sortedByName.txt ${rs_of_interest} | head -1 | cut -f2-


Login
before adding your answer.

Traffic: 1050 users visited in the last hour

Read more here: Source link