PacBio Bioinformatics Focusing on Pipelines to Support Long-Read Sequencing Data

NEW YORK – When Pacific Biosciences brought in Michael Eberle in February 2022 to be VP of computational biology, the company needed to help its customers catch up.

Eberle said that nearly all existing variant-calling pipelines were built for short-read data, and a priority for Menlo Park, California-based PacBio was adapting software for long reads to provide insight into parts of the genome that short reads miss.

Eberle, who came to PacBio after 15 years at Illumina, said he is “driven” by bringing molecular insights on rare diseases into clinical practice. “We need to have things that solve the most important problems for these rare disease cases and, in general, the complex parts of the genome,” he said.

Such problems include tandem repeats, segmental duplications, and frequency annotation of structural variants. The firm is addressing the last of these through the federally funded, PacBio-led Consortium for Long-Read Sequencing (CoLoRS) database.

PacBio does not have a formal bioinformatics strategy, Eberle said, but is working toward identifying pain points that customers see and developing informatics software to solve those problems.

One of the major differences between short and long reads is that short-read analysis relies on read counting to detect copy number variants, while each long read essentially produces a “mini haplotype,” according to Eberle. Counting applications are not an efficient use of long reads, he added.

Eberle said that the landscape for long-read analysis now resembles that of short-read analysis 10 to 15 years ago, when next-generation sequencing was new.

With its forthcoming Revio high-throughput, long-read sequencer, PacBio needs to develop better bioinformatics resources, he said, in particular for resequencing “GWAS-type” studies, which he believes will add to existing de novo assembly applications.

Recently, the PacBio bioinformatics team developed a tool for use in analyzing spinal muscular atrophy (SMA), a neurodegenerative genetic disease caused by bi-allelic mutations in the SMN1 gene.

In January, the company described Paraphase, an open-source informatics method that genotypes paralogs and pseudogenes from long-read sequences through variant calling, copy number analysis, and phasing, in a study in the American Journal of Human Genetics. Principal bioinformatics scientist Xiao Chen, who also joined PacBio in February 2022, led its development in collaboration with several institutions, including the Genomics England Research Consortium.

The AJHG paper looked at how Paraphase was applied to SMA. The computational method identified full-length haplotypes of SMN1 and SMN2, two highly similar genes that are distinguished by a splice variant. The prevailing methods for testing for mutations in SMN1 are PCR-based and identify the number of exon 7 copies to detect carriers.

Because SMA, a major hereditary cause of infant mortality, has a high carrier frequency, testing is recommended for all pregnancies. However, Eberle said that long-read sequencing has not been “necessarily the best way” to detect SMA risk because detection usually depends on read counting, and long-read sequencing produces fewer reads than short-read platforms.

“We turn this problem into a completely separate problem to detect copy number,” Eberle explained. PacBio’s Paraphase phases the variants in the appropriate region and identifies all of the haplotypes it sees related to SMN1 and SMN2, which he said are effectively the same as copies.

“This allows us to detect everything in the gene and in the pseudogene,” Eberle said. This is particularly important when aligning to a reference genome because the genes in question are highly homologous and subject to structural variation, including recombination, deletion, duplication, and gene conversion.
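The idea of treating haplotypes as copies can be illustrated with a toy sketch. This is not Paraphase's actual algorithm: the input format, the `count_copies` function, and the `min_support` threshold are all hypothetical, and each read is assumed to be already reduced to the alleles it carries at a few sites that differ between gene copies. The last site stands in for the splice variant that separates SMN1 from SMN2.

```python
from collections import Counter

# Hypothetical toy input: one string per long read, one character per
# variant site. The final character plays the role of the splice variant
# ("C" for SMN1, "T" for SMN2 in this sketch).
reads = [
    "ACGC", "ACGC", "ACGC",
    "ATGC", "ATGC",
    "GCAT", "GCAT", "GCAT",
    "GCGT", "GCGT",
    "ACGT",  # seen only once, so treated as noise below
]

def count_copies(reads, min_support=2, splice_site=-1):
    """Group identical allele strings into haplotypes, then count per gene."""
    support = Counter(reads)
    # Keep only haplotypes backed by at least `min_support` reads.
    haplotypes = [h for h, n in support.items() if n >= min_support]
    # Each distinct haplotype is counted as one gene copy.
    copies = Counter(
        "SMN1" if h[splice_site] == "C" else "SMN2" for h in haplotypes
    )
    return haplotypes, dict(copies)

haplotypes, copies = count_copies(reads)
print(haplotypes)
print(copies)
```

In this sketch, copy number comes from the number of distinct haplotypes rather than from read depth. Two copies with identical sequence would collapse into one haplotype here, a case a real caller would need to handle separately.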

Eberle said that PacBio is now looking across the entire genome for genes that pose a problem similar to that of SMN1 and SMN2. He and his team have modified the caller for about 10 other genes in hopes of demonstrating that Paraphase can work as a general phased variant caller for long-read sequences.

The company will eventually integrate Paraphase into a whole-genome sequencing pipeline and into a targeted sequencing panel, developed in partnership with Twist Bioscience and Baylor College of Medicine, that also covers SMN1 and SMN2.

PacBio and its partners are particularly interested in targeted sequencing and analysis of full haplotypes on a population level, he said.

Another PacBio-developed informatics tool is TRGT — for tandem repeat genotyping tool — a computational analysis method that provides full characterization of genome-wide tandem repeats, including composition, structure, repeat unit length, and CpG methylation for each repeat allele and flanking sequence. Introduced last September, TRGT can also identify sequence changes that are potentially associated with pathogenic expansions in diseases such as cerebellar ataxia with neuropathy and bilateral vestibular areflexia syndrome.

According to Eberle, PacBio plans to describe TRGT in a peer-reviewed journal. The tool comes with a companion, TRVZ, that visualizes read pileups and methylation data for each repeat allele and flanking sequence. Both are available on GitHub.

The company is also working on a genome-wide caller to look at the more than 1 million tandem repeats in the human genome at once to support discovery of novel repeat expansions, according to Eberle. This kind of technology could also be useful for genome-wide association studies.

Other informatics projects at PacBio include software to make phasing more complete by including structural variants, as well as a tool for copy number variant calling that is already available on GitHub.

PacBio has phased out support for older bioinformatics software, including Basic Local Alignment with Successive Refinement, or BLASR. Eberle said that BLASR supports continuous long-read (CLR) data, while PacBio has moved to higher-quality circular consensus sequences (CCS).

The company’s flagship HiFi reads are produced with CCS technology. “We think that people should be moving on to HiFi data, which is easily much better,” he said.

In 2021, PacBio President and CEO Christian Henry hinted that the company would be creating new bioinformatics tools to support then-recent acquisitions, including sample preparation firm Circulomics and short-read sequencing technology developer Omniome. PacBio said it plans to start shipping the Onso short-read sequencing platform, which resulted from the Omniome technology, later this year.

Eberle said there is a great need to develop bioinformatics support for Onso, especially for somatic variant calling, because of the platform’s high accuracy, for which current software tools have not been trained.
