HOGs without numbers in the ancestral genome reconstructed with pyHam

Dear colleagues,

I recently completed an OMA standalone run and am now trying to analyze the results with the pyHam Python package. My main goal is to reconstruct the ancestral genome of the analyzed species.
After executing the following commands, I find that the reconstructed ancestral genome contains 10372 genes:

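import pyham

# nwk_file: species tree in Newick format; orthoxml_file: HOGs from OMA standalone
ham_analysis = pyham.Ham(nwk_file, orthoxml_file, use_internal_name=True)
ancestral_genome = ham_analysis.get_ancestral_genome_by_name(ancestral_genome_name)
ancestral_genes = ancestral_genome.genes  # HOGs placed at this ancestral level
print(len(ancestral_genes))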

Two things concern me:
1) In ancestral_genes, many elements (HOGs) are displayed as <HOG ()>, with no number, unlike, for example, <HOG (17899)> or <HOG (17900)>. Is this an error, given that the sequences assigned to <HOG ()> were previously included in numbered orthogroups? Or am I missing something?

2) When I extract the descendant genes of these <HOG ()> entries, I see that some of these orthologous groups contain sequences from only 1 or 2 extant species. Please correct me if I am mistaken, but how can the ancestor be considered to have had a gene if that gene is present in only 1 or 2 of, for example, 11 analyzed species? Should I set a threshold on the number of extant species in which a gene must be present for it to be included in the reconstructed ancestral genome? And what is the most efficient and correct way to do this?
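
To make the threshold idea concrete, here is a minimal sketch of the kind of filter I have in mind. It does not call pyHam: `filter_by_species_support`, `species_of`, and the toy data are hypothetical names introduced only for illustration. With real data, I assume `species_of` could be built from something like `hog.get_all_descendant_genes()` and each gene's `genome.name`, but I have not verified that this is the recommended approach.

```python
from typing import Callable, Hashable, Iterable, List, Set, TypeVar

H = TypeVar("H")


def filter_by_species_support(
    hogs: Iterable[H],
    species_of: Callable[[H], Set[Hashable]],
    min_species: int,
) -> List[H]:
    """Keep HOGs whose descendant genes come from at least min_species distinct species."""
    return [hog for hog in hogs if len(species_of(hog)) >= min_species]


# Toy stand-in for real HOGs: label -> species of each descendant gene.
# With pyHam (assumption, not tested here), species_of might be:
#   lambda hog: {g.genome.name for g in hog.get_all_descendant_genes()}
toy_hogs = {
    "hog_a": ["sp1", "sp2", "sp3", "sp3"],  # 3 distinct species
    "hog_b": ["sp1"],                       # 1 species
    "hog_c": ["sp2", "sp4"],                # 2 distinct species
}

kept = filter_by_species_support(
    toy_hogs, lambda h: set(toy_hogs[h]), min_species=3
)
print(kept)  # ['hog_a']
```

A simple count like this ignores where the species sit in the tree, so I am not sure it is a biologically sound criterion, which is part of what I am asking.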

I would be grateful for any help!
