Filtering long indels from a VCF



To create a multi-sample VCF from a large cohort of WES samples of very different quality, I have to select only high-quality variants genotyped in as many samples as possible.

I figured out that

  1. long indels have low quality
  2. substitutions alone do not provide enough variants for my analysis.

I know how to filter out all indels using bcftools. Is there a command that filters out only long indels but keeps 1-2 bp insertions/deletions? I feel some AWK command should be very fast, but I don't know how to count the number of characters in the REF and ALT columns of the VCF and print only variants where both the REF and ALT alleles are shorter than 3 characters.
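
To make the intended behaviour concrete, here is a minimal Python sketch of the filter I have in mind (not the fast AWK or bcftools one-liner I am hoping for). It assumes a plain-text, tab-separated VCF with explicit allele sequences in REF and ALT; the names `keep_record`, `filter_vcf` and `MAX_INDEL_LEN` are made up for illustration. One caveat: VCF indel alleles include an anchor base, so a 2 bp insertion has an ALT of 3 characters. The sketch therefore compares the REF/ALT length difference rather than the raw allele length.

```python
MAX_INDEL_LEN = 2  # assumption: keep indels of at most 2 bp


def keep_record(line, max_indel_len=MAX_INDEL_LEN):
    """Return True for header lines and for records whose ALT alleles
    all differ from REF in length by at most max_indel_len."""
    if line.startswith("#"):
        return True  # always keep meta-information and the column header
    fields = line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]  # REF is column 4, ALT is column 5
    # Multi-allelic sites list ALT alleles separated by commas;
    # the record is dropped if any of them is a long indel.
    return all(abs(len(a) - len(ref)) <= max_indel_len for a in alt.split(","))


def filter_vcf(lines, max_indel_len=MAX_INDEL_LEN):
    """Return only the lines that pass keep_record, preserving order."""
    return [line for line in lines if keep_record(line, max_indel_len)]
```

Passing the lines of an uncompressed VCF to `filter_vcf` should return the header plus SNVs and short indels. Symbolic alleles such as `<DEL>` are not treated specially here and are simply compared by string length.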

I would appreciate any help; quick googling did not solve the problem.



