print only columns with data from every line

Hi, I have a VCF file with about 60,000 columns. Here is an example of the first three lines:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  10022-20416-17  10024-34469-18A 10025-34469-18B 10034-31625-18A 10035-31625-18B 10036-31625-18C 10042-29083-18  10044-34485-18A 10045-34485-18B 10046-34485-18C 10069-33802-18  10070-20895-17  10072-20901-17  10074-20904-17  10080-20908-17  10109-34224-18  1011-22957-18   10118
2       179391728       .       C       T       1109.77 PASS    BaseQRankSum=-2.601;ClippingRankSum=0;ExcessHet=3.0103;FS=0;MQ=60;MQRankSum=0;QD=11.81;ReadPosRankSum=0.626;SOR=0.76;DP=95;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1   GT:AD:DP:GQ:PL  ./.:.:.:.:.     ./.:.:.:.:.     0/1:44,47:91:99:1053,0,1069     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.
2       179391738       .       C       G       2090.77 PASS    BaseQRankSum=0.25;ClippingRankSum=0;ExcessHet=3.0103;FS=2.282;MQ=60;MQRankSum=0;QD=14.32;ReadPosRankSum=0.857;SOR=0.953;DP=370;AF=0.5;MLEAC=1;MLEAF=0.5;AN=6;AC=3       GT:AD:DP:GQ:PL  ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     0/1:88,68:156:99:2586,0,4687     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.

So the columns are many different sample IDs, and each sample column contains data only at some variants. I would like the output to show, for every line, only the column(s) that actually contain data, like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  10025-34469-18B
2       179391728       .       C       T       1109.77 PASS    BaseQRankSum=-2.601;ClippingRankSum=0;ExcessHet=3.0103;FS=0;MQ=60;MQRankSum=0;QD=11.81;ReadPosRankSum=0.626;SOR=0.76;DP=95;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1   GT:AD:DP:GQ:PL    0/1:44,47:91:99:1053,0,1069

It is also important that the header shows the sample ID of the column that contains this GT:AD:DP:GQ:PL data.
I think this should be possible with awk, but I don't know how. A solution using standard Unix tools would be ideal.
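
For illustration, here is a minimal sketch of one way to read the request, written in Python rather than awk. It assumes the file is tab-separated, that the first nine columns (#CHROM through FORMAT) are fixed, and that a sample cell counts as "no data" when its genotype field is `.`, `./.` or `.|.`. The names `filter_vcf` and `is_missing` are hypothetical helpers, not part of any VCF tool. For each variant line it prints a header row and a data row restricted to the sample columns that carry data on that line.

```python
import sys

FIXED = 9  # assumption: #CHROM..FORMAT are the nine columns before the samples


def is_missing(cell):
    """Assumed rule: a cell has no data if its GT field is '.', './.' or '.|.'."""
    genotype = cell.split(":", 1)[0]
    return genotype in (".", "./.", ".|.")


def filter_vcf(lines):
    """For each variant, yield a header row and a data row that keep only
    the sample columns with a genotype call on that line."""
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and '##' meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            header = fields  # remember the sample IDs
            continue
        if header is None:
            raise ValueError("data line seen before the #CHROM header")
        last = min(len(fields), len(header))
        keep = [i for i in range(FIXED, last) if not is_missing(fields[i])]
        if not keep:
            continue  # no sample has data at this variant
        yield "\t".join(header[:FIXED] + [header[i] for i in keep])
        yield "\t".join(fields[:FIXED] + [fields[i] for i in keep])


# Run as: python filter_vcf.py input.vcf > output.tsv  (file names are examples)
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as handle:
        for row in filter_vcf(handle):
            print(row)
```

Because the set of kept columns differs from variant to variant, the sketch repeats the header before every data row so that each sample ID stays next to its GT:AD:DP:GQ:PL value. The output is therefore a per-variant report, not a valid VCF.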






