How to extract summary statistics from a GFF3/GTF file?

Hi!

You could try using the gffutils Python library as an alternative to the AGAT toolkit for extracting summary statistics from GFF3/GTF files. gffutils imports a GFF or GTF file into a SQLite database, which you can then query for features and their parent-child relationships.

Here’s an example of how to use gffutils to extract summary statistics from a GFF3 file:

import gffutils

# Create a SQLite database from the GFF3 file.
# Note: force=True overwrites an existing input.db.
db = gffutils.create_db("input.gff3", dbfn="input.db", force=True, keep_order=True, merge_strategy="merge", sort_attribute_values=True)

# Extract summary statistics
gene_count = db.count_features_of_type("gene")
exon_count = db.count_features_of_type("exon")
mRNA_count = db.count_features_of_type("mRNA")

print("Number of genes:", gene_count)
print("Number of exons:", exon_count)
print("Number of mRNAs:", mRNA_count)

This code snippet creates a database from the input GFF3 file, counts the number of genes, exons, and mRNAs, and prints the results. You can modify this example to extract other summary statistics as needed.
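If you want to tally every feature type without naming each one, or just cross-check the gffutils counts, a minimal standard-library sketch might look like the following. It assumes a standard nine-column, tab-separated GFF3/GTF file, and the function name feature_type_counts is hypothetical (it is not part of gffutils).

```python
from collections import Counter


def feature_type_counts(lines):
    """Count features per type (column 3) in an iterable of GFF3/GTF lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # GFF3: embedded sequence follows, no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # assumption: silently skip malformed lines
        counts[fields[2]] += 1
    return counts
```

You can pass it an open file handle, for example feature_type_counts(open("input.gff3")), and then loop over counts.most_common() to print one line per feature type.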

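To calculate gene lengths, you could try the following. Note that it sums the exon lengths under each gene, so exons that are shared or overlap between transcripts are counted more than once. If you want the genomic span instead, use gene.end - gene.start + 1.

import gffutils

# Create a database from the GFF3/GTF file (force=True overwrites an existing input.db)
db = gffutils.create_db("input.gff3", dbfn="input.db", force=True, keep_order=True, merge_strategy="merge", sort_attribute_values=True)

# Sum exon lengths per gene (overlapping exons from different transcripts are counted each time)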
gene_lengths = {}
for gene in db.features_of_type("gene"):
    gene_length = 0
    for exon in db.children(gene, featuretype="exon"):
        exon_length = exon.end - exon.start + 1
        gene_length += exon_length
    gene_lengths[gene.id] = gene_length

# Print gene lengths
for gene_id, length in gene_lengths.items():
    print(f"{gene_id}: {length}")
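Because the loop above adds up every exon found under a gene, genes with several isoforms can come out longer than the sequence they actually cover. One way around that is to merge overlapping intervals before summing. Here is a minimal sketch in plain Python, assuming 1-based inclusive coordinates as in GFF3/GTF; the helper name merged_length is hypothetical and not part of gffutils.

```python
def merged_length(intervals):
    """Total bases covered by 1-based, inclusive (start, end) intervals,
    counting overlapping regions only once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # No overlap with the current run: close it and start a new one
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlaps the current run: extend it if this interval reaches further
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total
```

Inside the gene loop you could then write gene_lengths[gene.id] = merged_length((exon.start, exon.end) for exon in db.children(gene, featuretype="exon")), which gives the non-redundant exonic length per gene.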

DISCLAIMER: I’m using my chatbot here (tinybio.cloud) to help generate this answer. This answer has not been tested and may be incorrect. You can download it from the website.
