Get chromosome sizes from fasta file

Get chromosome sizes from fasta file

4

Hello,

I’m wondering whether there is a program that could calculate chromosome sizes from any fasta file? The idea is to generate a tab file like the one expected in bedtools genomecov for example.

I know there’s the fetchChromSize program from UCSC, but not all genomes are available over there (I need TAIR10 for instance). I’ve read this topic already.

I would like a tool that can deal with any genome regardless of the database. If it doesn’t exist I guess it’s possible to just parse fasta files, but I’d be surprised if no one else had done it before!

Cheers


genome


ucsc

• 23k views

pip install pyfaidx
faidx input.fasta -i chromsizes > sizes.genome

That’s what you’re looking for if you want to use bedtools genomecov, but you can transform to a BED file as well using -i bed.

Just found out faidx is available with samtools so another possibility is:

samtools faidx input.fa
cut -f1,2 input.fa.fai > sizes.genome

updated 2.3 years ago by

34k

written 5.6 years ago by

▴

500

  • Another good option is to use the faSize command from UCSC tools. It works on any fasta file, not just UCSC genomes.

  • You run it like this: faSize -detailed -tab file.fasta

  • Default behavior is to print to STDOUT, so you can redirect the output to a file like this:
    faSize -detailed -tab file.fasta > output.txt

  • The output will be a tab delimited file with sequence name on the first column and the length on the second.

  • You can easily install faSize and other tools from UCSC using Bioconda anaconda.org/bioconda/ucsc-fasize

samtools was crashing because I included the HLA’s,

HLA-A*01:01:38L 3374

 

Traceback (most recent call last):
  File "/usr/local/bin/faidx", line 11, in <module>
    load_entry_point('pyfaidx==0.5.2', 'console_scripts', 'faidx')()
  File "/usr/local/lib/python2.7/dist-packages/pyfaidx/cli.py", line 197, in main
    write_sequence(args)
  File "/usr/local/lib/python2.7/dist-packages/pyfaidx/cli.py", line 50, in write_sequence
    outfile.write(transform_sequence(args, fasta, name, start, end))
  File "/usr/local/lib/python2.7/dist-packages/pyfaidx/cli.py", line 120, in transform_sequence
    line_len = fasta.faidx.index[name].lenc
KeyError: 'HLA-A*01'

Finally just did:

from Bio import SeqIO
for rec in SeqIO.parse("hg38-Mix.fa","fasta"):
     print rec.id+"t"+str(len(rec.seq))

updated 2.3 years ago by

34k

written 2.9 years ago by

0


Login
before adding your answer.

Traffic: 1103 users visited in the last hour

Read more here: Source link