bcftools merge

As an alternative to bcftools merge, check out the vcf_merge command I wrote for the fuc package:

$ fuc vcf_merge -h
usage: fuc vcf_merge [-h] [--how TEXT] [--format TEXT] [--sort] [--collapse]
                     vcf_files [vcf_files ...]

This command will merge multiple VCF files (both zipped and unzipped). It
essentially wraps the 'pyvcf.merge' method from the fuc API.

By default, only the GT subfield of the FORMAT field will be included in the
merged VCF. Use '--format' to include additional FORMAT subfields such as AD
and DP.

usage examples:
  $ fuc vcf_merge 1.vcf 2.vcf 3.vcf > merged.vcf

positional arguments:
  vcf_files      VCF files

optional arguments:
  -h, --help     show this help message and exit
  --how TEXT     type of merge as defined in `pandas.DataFrame.merge`
                 (default: 'inner')
  --format TEXT  FORMAT subfields to be retained (e.g. 'GT:AD:DP') (default:
                 'GT')
  --sort         use this flag to turn off sorting of records (default: True)
  --collapse     use this flag to collapse duplicate records (default: False)

If you are familiar with Python and plan to run additional analyses on the merged VCF (e.g. filtering), you can also use the pyvcf.merge method I wrote.

Assume we have the following data:

>>> from fuc import pyvcf
>>> data1 = {
...     'CHROM': ['chr1', 'chr1'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT:DP', 'GT:DP'],
...     'Steven': ['0/0:32', '0/1:29'],
...     'Sara': ['0/1:24', '1/1:30'],
... }
>>> data2 = {
...     'CHROM': ['chr1', 'chr1', 'chr2'],
...     'POS': [100, 101, 200],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'A'],
...     'ALT': ['A', 'C', 'T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT:DP', 'GT:DP', 'GT:DP'],
...     'Dona': ['./.:.', '0/0:24', '0/0:26'],
...     'Michel': ['0/1:24', '0/1:31', '0/1:26'],
... }
>>> vf1 = pyvcf.VcfFrame.from_dict([], data1)
>>> vf2 = pyvcf.VcfFrame.from_dict([], data2)
>>> vf1.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT  Steven    Sara
0  chr1  100  .   G   A    .      .    .  GT:DP  0/0:32  0/1:24
1  chr1  101  .   T   C    .      .    .  GT:DP  0/1:29  1/1:30
>>> vf2.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    Dona  Michel
0  chr1  100  .   G   A    .      .    .  GT:DP   ./.:.  0/1:24
1  chr1  101  .   T   C    .      .    .  GT:DP  0/0:24  0/1:31
2  chr2  200  .   A   T    .      .    .  GT:DP  0/0:26  0/1:26

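We can merge the two VcfFrames with how='inner' (default), which keeps only the records present in both. Note that only the GT subfield is retained unless you ask for more: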

>>> pyvcf.merge([vf1, vf2]).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara Dona Michel
0  chr1  100  .   G   A    .      .    .     GT    0/0  0/1  ./.    0/1
1  chr1  101  .   T   C    .      .    .     GT    0/1  1/1  0/0    0/1

We can also merge with how='outer', which keeps records found in either VcfFrame and reports samples lacking a record as './.':

>>> pyvcf.merge([vf1, vf2], how='outer').df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara Dona Michel
0  chr1  100  .   G   A    .      .    .     GT    0/0  0/1  ./.    0/1
1  chr1  101  .   T   C    .      .    .     GT    0/1  1/1  0/0    0/1
2  chr2  200  .   A   T    .      .    .     GT    ./.  ./.  0/0    0/1
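
If the difference between the two modes is not obvious from the tables, here is a rough sketch of the idea in plain Python. It is not how fuc implements the merge; the dict layout, the merge_genotypes and sample_names helpers, and the choice of (CHROM, POS, REF, ALT) as the record key are all assumptions made for illustration.

```python
# Illustrative only: plain dicts stand in for VcfFrames.
# Each record is keyed by (CHROM, POS, REF, ALT); this key is an assumption.
MISSING = './.'

def sample_names(table):
    """Collect sample names in first-seen order."""
    names = []
    for row in table.values():
        for name in row:
            if name not in names:
                names.append(name)
    return names

def merge_genotypes(left, right, how='inner'):
    """Merge two {record_key: {sample: genotype}} mappings."""
    if how == 'inner':
        # Keep only records found in both inputs.
        keys = [k for k in left if k in right]
    elif how == 'outer':
        # Keep records found in either input.
        keys = list(left) + [k for k in right if k not in left]
    else:
        raise ValueError(f"unsupported merge type: {how!r}")
    samples = sample_names(left) + sample_names(right)
    merged = {}
    for key in keys:
        row = {**left.get(key, {}), **right.get(key, {})}
        # Samples with no record at this site get a missing genotype.
        merged[key] = {s: row.get(s, MISSING) for s in samples}
    return merged

vf1 = {
    ('chr1', 100, 'G', 'A'): {'Steven': '0/0', 'Sara': '0/1'},
    ('chr1', 101, 'T', 'C'): {'Steven': '0/1', 'Sara': '1/1'},
}
vf2 = {
    ('chr1', 100, 'G', 'A'): {'Dona': './.', 'Michel': '0/1'},
    ('chr1', 101, 'T', 'C'): {'Dona': '0/0', 'Michel': '0/1'},
    ('chr2', 200, 'A', 'T'): {'Dona': '0/0', 'Michel': '0/1'},
}

inner = merge_genotypes(vf1, vf2)               # two shared chr1 records
outer = merge_genotypes(vf1, vf2, how='outer')  # also includes chr2:200
```

The chr2 record exists only in the second input, so it is dropped by the inner merge and kept by the outer merge with './.' for Steven and Sara, which matches the tables above.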

Since both VcfFrames have the DP subfield, we can use format="GT:DP":

>>> pyvcf.merge([vf1, vf2], how='outer', format="GT:DP").df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT  Steven    Sara    Dona  Michel
0  chr1  100  .   G   A    .      .    .  GT:DP  0/0:32  0/1:24   ./.:.  0/1:24
1  chr1  101  .   T   C    .      .    .  GT:DP  0/1:29  1/1:30  0/0:24  0/1:31
2  chr2  200  .   A   T    .      .    .  GT:DP   ./.:.   ./.:.  0/0:26  0/1:26
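
To make clearer what the format option does to each sample column, here is a small sketch in plain Python. Again, this is not fuc's code; subset_format is a hypothetical helper, and using '.' for a subfield that is absent from the record is my assumption.

```python
def subset_format(format_field, sample_value, keep='GT'):
    """Restrict one sample value to the FORMAT subfields named in keep.

    Hypothetical helper for illustration; not part of the fuc API.
    """
    names = format_field.split(':')    # e.g. ['GT', 'DP']
    values = sample_value.split(':')   # e.g. ['0/1', '29']
    lookup = dict(zip(names, values))
    # Assumption: a requested subfield missing from the record becomes '.'.
    return ':'.join(lookup.get(name, '.') for name in keep.split(':'))

gt_only = subset_format('GT:DP', '0/1:29')           # default keeps GT only
gt_dp = subset_format('GT:DP', '0/1:29', 'GT:DP')    # keeps both subfields
```

With the default, '0/1:29' is reduced to '0/1', which is why the first two merged tables show bare genotypes; asking for 'GT:DP' leaves the value unchanged.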
