Splitting a VCF File
Hi, I downloaded a VCF file that contains data for multiple genomes (multiple samples). I want to split it into one VCF file per genome (one sample per file). I didn't find any script for this. If you have one, please share it with me.
I know this is a very old question, but there’s a very efficient way of doing this that hasn’t been reported yet. Hope it helps:
for file in *.vcf.gz; do
  # sample names are columns 10 onward of the #CHROM header line
  for sample in $(bcftools view -h "$file" | grep "^#CHROM" | cut -f10-); do
    bcftools view -c1 -Oz -s "$sample" -o "${file/.vcf*/.$sample.vcf.gz}" "$file"
  done
done
EDIT: bcftools query -l lists all samples, so the fastest loop would be the following:
for file in *.vcf*; do
  for sample in $(bcftools query -l "$file"); do
    bcftools view -c1 -Oz -s "$sample" -o "${file/.vcf*/.$sample.vcf.gz}" "$file"
  done
done
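For anyone adapting the loop above, here is a minimal sketch that wraps it in two functions. The names out_name and split_by_sample are hypothetical, not part of bcftools. It uses the POSIX expansion ${1%%.vcf*} instead of the bash-only ${file/.vcf*/...} form, and reads sample names line by line so that names with unusual characters are not word-split. It assumes bcftools is on the PATH.

```shell
# Hypothetical helper: derive the per-sample output name from the input name.
# out_name cohort.vcf.gz NA12878  prints  cohort.NA12878.vcf.gz
out_name() {
    # ${1%%.vcf*} strips the first ".vcf" and everything after it
    printf '%s.%s.vcf.gz\n' "${1%%.vcf*}" "$2"
}

# Hypothetical helper: write one bgzipped VCF per sample of the input file.
# $1 = multi-sample VCF (plain or bgzipped)
split_by_sample() {
    bcftools query -l "$1" | while IFS= read -r sample; do
        # -c1 keeps only sites where this sample carries at least one
        # non-reference allele; drop it to keep every site
        bcftools view -c1 -Oz -s "$sample" -o "$(out_name "$1" "$sample")" "$1"
    done
}
```

Calling split_by_sample MyData.vcf.gz would then produce MyData.SAMPLE.vcf.gz for each sample listed in the header.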
I am assuming that you mean that you have multiple samples represented in your VCF file and that you want to get one file per sample. See the vcftools package for some possibilities. If my assumption was incorrect, please edit your question with more details.
I know this is an old post, but this method, modified from the one above (by Jorge), is much faster.
Get list of sample names:
for sample in `bcftools view -h MyData.vcf.gz | grep "^#CHROM" | cut -f10-`; do echo $sample; done > sampleNames.txt
Split the VCF file faster, with one job per sample:
parallel -a sampleNames.txt bcftools view -c1 -s {} -Oz --threads 8 -o {}.vcf.gz MyData.vcf.gz
By default, GNU parallel runs one job per CPU core, so this will use all available cores on the system.
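If GNU parallel is not installed, or you want to cap the number of concurrent jobs, the same idea can be sketched with xargs. This is only a sketch: split_parallel is a hypothetical helper name, and it assumes an xargs that supports -P (GNU and BSD xargs do). It prints the commands unless the fourth argument is "run", so you can check them first. The --threads option is left out here, since several jobs each starting extra compression threads can oversubscribe the machine.

```shell
# Hypothetical helper: print (dry run) or execute one bcftools job per sample.
# $1 = file with one sample name per line
# $2 = multi-sample VCF
# $3 = number of concurrent jobs
# $4 = "run" to execute; anything else only prints the commands
split_parallel() (
    runner=echo
    [ "${4:-}" = "run" ] && runner=
    # -I{} substitutes each input line for {}; -P sets the job count
    xargs -P "$3" -I{} $runner bcftools view -c1 -s {} -Oz -o {}.vcf.gz "$2" < "$1"
)
```

For example, split_parallel sampleNames.txt MyData.vcf.gz 4 shows the commands, and adding run as a fourth argument executes them four at a time.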
Have a look at our Differ app. It’s free and allows you to split VCF files using a GUI on OS X.
Differ is available from www.diploid.com/differ
I had a similar problem and I had to use Windows :-(.
If you are working with small-ish VCF files, you can use R to work with the data (e.g., split it).
To load the file use:
file="e:/d/genome/t300.txt"
v <- read.table(file, sep='\t', header = TRUE, fileEncoding="utf-16")
str(v)
The UTF-16 encoding was particularly hard to troubleshoot. Eventually Notepad++ helped me to detect this encoding problem.
It correctly ignores the header lines and detects column headers as well.
To remove all genome columns except the first one, use this command:
v[11:ncol(v)]<-list(NULL)
In my case the file had 9 initial columns, and column 10 held the first genome.
You can modify the column range to filter the genomes in columns 12, 13, etc.
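The same column selection (9 fixed columns plus one genome column) can also be sketched with awk for an uncompressed, tab-separated VCF. The function name extract_sample_column is hypothetical. Unlike bcftools view -c1, this keeps every site, including those where the chosen sample has no alternate allele, and it does not recalculate any INFO fields, so treat it as a quick illustration rather than a replacement for bcftools.

```shell
# Hypothetical helper: keep the 9 fixed VCF columns plus one sample column.
# Usage: extract_sample_column input.vcf SAMPLE_NAME > SAMPLE_NAME.vcf
extract_sample_column() {
    awk -F'\t' -v OFS='\t' -v want="$2" '
        /^##/ { print; next }      # meta-information lines pass through unchanged
        /^#CHROM/ {                # header row: find the column of the wanted sample
            for (i = 10; i <= NF; i++) if ($i == want) col = i
            if (!col) { print "sample not found: " want > "/dev/stderr"; exit 1 }
        }
        # header row and data rows: fixed columns 1-9 plus the sample column
        { print $1, $2, $3, $4, $5, $6, $7, $8, $9, $col }
    ' "$1"
}
```

The sample is looked up by name in the #CHROM line, so you do not need to know its column number in advance.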