How to filter a VCF file with a list of CHR or contig IDs?

I need to subset/filter a SNP VCF file by a long list of non-sequential contig IDs, which appear in the CHROM column. My VCF file currently contains 13,971 contigs, and I want to retain a specific set of 7,748 contigs and everything associated with them (headers, all variants, genotype information, etc.).

My contig list looks like:

dDocent_Contig_1

dDocent_Contig_100

dDocent_Contig_10000 etc.

I am considering the following command:

vcftools --vcf TotalRawSNPs.vcf --chr dDocent_Contig_1 --chr dDocent_Contig_100 (etc...) --recode --recode-INFO-all --out FinalRawSNPs

where I list every contig ID individually, each preceded by a --chr flag. I cannot feed the --chr flag a text file of contig IDs to keep, which would be ideal. If I list all the contigs individually, the command becomes enormous.

I’ve seen options for filtering by a list of individuals, but not any clear option for filtering by CHR/contig IDs only. Is there a more efficient way to filter my vcf file by CHR/contig?

The solution does not have to be strictly vcftools related. I’m open to exploring suitable awk/mawk/grep (etc.) options as well.
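
To show the kind of thing I have in mind, here is a rough awk sketch. It assumes the contig IDs are stored one per line in a file I am calling contigs.txt, and the function name filter_vcf_by_contigs is made up for illustration. It keeps every header line (starting with #) plus any record whose first column matches an ID in the list. I have not checked it against the full data set, and it does not remove the ##contig header lines of the contigs that are dropped.

```shell
# Sketch only: contigs.txt and filter_vcf_by_contigs are hypothetical names.
# $1 = file with one contig ID per line, $2 = input VCF.
# The filtered VCF is written to standard output.
filter_vcf_by_contigs() {
    awk '
        # First file (the contig list): remember each ID as an array key.
        NR == FNR { keep[$1]; next }
        # Second file (the VCF): print header lines and records whose
        # first column (CHROM) is one of the remembered IDs.
        /^#/ || ($1 in keep)
    ' "$1" "$2"
}

# Intended usage, with the file names from the vcftools command above:
# filter_vcf_by_contigs contigs.txt TotalRawSNPs.vcf > FinalRawSNPs.vcf
```

Because the match is an exact lookup on the whole first column, dDocent_Contig_1 should not also pull in dDocent_Contig_10 or dDocent_Contig_100, which is my worry with a plain grep -f approach.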
