Splitting A Vcf File


Hi, I downloaded a VCF file that contains data for multiple genomes (multiple samples). I want to split it into one VCF file per genome (one sample per file). I didn't find any script for this. If you have one, please share it with me.



I know this is a very old question, but there's a very efficient way of doing this that hasn't been reported yet. Hope it helps:

for file in *.vcf.gz; do
  # Sample names are columns 10 onward of the #CHROM header line
  for sample in $(bcftools view -h "$file" | grep "^#CHROM" | cut -f10-); do
    bcftools view -c1 -Oz -s "$sample" -o "${file/.vcf*/.$sample.vcf.gz}" "$file"
  done
done

EDIT: bcftools query -l lists all samples, so the fastest loop would be the following:

for file in *.vcf*; do
  for sample in $(bcftools query -l "$file"); do
    bcftools view -c1 -Oz -s "$sample" -o "${file/.vcf*/.$sample.vcf.gz}" "$file"
  done
done
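
The output name in the loops above comes from the bash substitution ${file/.vcf*/.$sample.vcf.gz}, which replaces everything from the first ".vcf" onward with the sample name plus ".vcf.gz". Here is a minimal sketch of the same naming rule, written with the POSIX %% expansion so it can be checked without running bcftools. The function name split_name and the file and sample names are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: build the per-sample output name used by the loops above.
# "${file%%.vcf*}" strips the longest suffix that starts with ".vcf".
split_name() {
  file=$1
  sample=$2
  printf '%s.%s.vcf.gz\n' "${file%%.vcf*}" "$sample"
}

split_name cohort.vcf.gz NA12878   # prints cohort.NA12878.vcf.gz
split_name cohort.vcf HG00096      # prints cohort.HG00096.vcf.gz
```

Printing the names first is a cheap way to confirm where the files will land before starting a long run.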

I am assuming that you mean that you have multiple samples represented in your VCF file and that you want to get one file per sample. See the vcftools package for some possibilities. If my assumption was incorrect, please edit your question with more details.

I know this is an old post, but this method, modified from the one above (Jorge), is much faster.

Get list of sample names:

  for sample in `bcftools view -h MyData.vcf.gz | grep "^#CHROM" | cut -f10-`; do echo $sample; done > sampleNames.txt

Split the VCF file faster:

  parallel -a sampleNames.txt  bcftools view -c1 -s {} -Oz --threads 8 -o {}.vcf.gz MyData.vcf.gz

GNU parallel runs one job per CPU core by default, so this will use all available cores on the system.

cut -f1-9,n file.vcf

Where n is the column number of the sample you want (sample columns start at column 10).
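
If you know the sample name but not its column number, n can be looked up from the #CHROM header line. The following is only a sketch of that idea. The function names sample_column and extract_sample are hypothetical, and it assumes an uncompressed, tab-delimited VCF. Unlike bcftools view -c1, it keeps every site, including those where the sample carries no alternate allele.

```shell
# Hypothetical helpers (illustrative names, not from the answer above).

# Print the 1-based column number of a sample, read from the #CHROM header line.
# Exits with status 1 if the sample (or the #CHROM line) is not found.
sample_column() {
  awk -F'\t' -v s="$2" '
    /^#CHROM/ {
      for (i = 10; i <= NF; i++) if ($i == s) { print i; found = 1 }
      exit
    }
    END { if (!found) exit 1 }
  ' "$1"
}

# Write a single-sample VCF to stdout: the 9 fixed columns plus the sample column.
# cut passes the "##" meta lines through unchanged because they contain no tabs.
extract_sample() {
  n=$(sample_column "$1" "$2") || { echo "sample not found: $2" >&2; return 1; }
  cut -f1-9,"$n" "$1"
}
```

Usage would look like: extract_sample file.vcf SAMPLE_NAME > SAMPLE_NAME.vcf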


Have a look at our Differ app. It’s free and allows you to split VCF files using a GUI on OS X.

Differ is available from www.diploid.com/differ

As of now, bcftools 1.12 has a plugin named split. To split the vcf file so that each sample has its own vcf file, just use:

bcftools +split input.vcf.gz -Oz -o vcf_per_sample

All split vcf files will be in the vcf_per_sample folder.

I had a similar problem and I had to use Windows :-(.

If you are working with small-ish VCF files, you can use R to work with the data (e.g., to split it).

To load the file use:

file="e:/d/genome/t300.txt"
v <- read.table(file,sep='\t',header = T,fileEncoding="utf-16")
str(v)

The UTF-16 encoding was particularly hard to troubleshoot. Eventually Notepad++ helped me detect this encoding problem.
In my case, read.table correctly ignored the header lines and detected the column headers as well.

To remove all the columns except one genome, use this command:

v[11:ncol(v)]<-list(NULL)

In my case the file had 9 initial columns, and column 10 held the first genome.

You can modify this to filter genomes 12,13, etc…
