print only columns with data from every line
Hi, I have a VCF file with about 60,000 columns. Here is an example of the first three lines:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 10022-20416-17 10024-34469-18A 10025-34469-18B 10034-31625-18A 10035-31625-18B 10036-31625-18C 10042-29083-18 10044-34485-18A 10045-34485-18B 10046-34485-18C 10069-33802-18 10070-20895-17 10072-20901-17 10074-20904-17 10080-20908-17 10109-34224-18 1011-22957-18 10118
2 179391728 . C T 1109.77 PASS BaseQRankSum=-2.601;ClippingRankSum=0;ExcessHet=3.0103;FS=0;MQ=60;MQRankSum=0;QD=11.81;ReadPosRankSum=0.626;SOR=0.76;DP=95;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL ./.:.:.:.:. ./.:.:.:.:. 0/1:44,47:91:99:1053,0,1069 ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:.
2 179391738 . C G 2090.77 PASS BaseQRankSum=0.25;ClippingRankSum=0;ExcessHet=3.0103;FS=2.282;MQ=60;MQRankSum=0;QD=14.32;ReadPosRankSum=0.857;SOR=0.953;DP=370;AF=0.5;MLEAC=1;MLEAF=0.5;AN=6;AC=3 GT:AD:DP:GQ:PL ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. 0/1:88,68:156:99:2586,0,4687 ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:.
So there are many different sample numbers as columns, and each sample column has information only at some variants. I would like the output to show only the columns that contain information for each line, like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 10025-34469-18B
2 179391728 . C T 1109.77 PASS BaseQRankSum=-2.601;ClippingRankSum=0;ExcessHet=3.0103;FS=0;MQ=60;MQRankSum=0;QD=11.81;ReadPosRankSum=0.626;SOR=0.76;DP=95;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL 0/1:44,47:91:99:1053,0,1069
It is also important that the header shows the sample number of the column that contains this GT:AD:DP:GQ:PL information.
I think this should be possible with awk, but I just don’t know how. It would be really good if this could be done with Unix tools.
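For illustration, the filtering described above could be sketched in Python roughly as follows. This is only a sketch under several assumptions: the real file is tab-separated, the first nine columns (#CHROM through FORMAT) are always kept, a sample counts as "empty" when its genotype has no called allele (for example `./.`), and a small header line is printed before each variant because the samples with data differ from line to line. The names `FIXED_COLS`, `is_missing`, and `filter_vcf` are made up for this example.

```python
FIXED_COLS = 9  # assumption: #CHROM..FORMAT come before the per-sample columns


def is_missing(sample_field):
    """Return True when the genotype (first ':'-separated subfield) has no called allele."""
    genotype = sample_field.split(":", 1)[0]
    # "./.", ".|." and "." all reduce to an empty string here
    return genotype.replace("/", "").replace("|", "").strip(".") == ""


def filter_vcf(lines):
    """Yield, for each variant, a header line and a data line limited to samples with data."""
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # pass meta-information lines through unchanged
        elif line.startswith("#CHROM"):
            header = line.split("\t")
        elif line:
            if header is None:
                raise ValueError("data line found before the #CHROM header line")
            fields = line.split("\t")
            last = min(len(fields), len(header))
            keep = [i for i in range(FIXED_COLS, last) if not is_missing(fields[i])]
            yield "\t".join(header[:FIXED_COLS] + [header[i] for i in keep])
            yield "\t".join(fields[:FIXED_COLS] + [fields[i] for i in keep])
```

Passing an open file handle (or `sys.stdin`) to `filter_vcf` and printing each yielded line would give one header plus one data line per variant. If a single shared header is wanted instead, the per-variant header line would have to be replaced by some other way of labelling the samples.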