PacBio sequencing output increased through uniform and directional fivefold concatenation

Strategy and design of the method

We sought to develop a simple method to increase the sequencing capability of PacBio CCS to sequence several diverse DNA libraries ~ 870 bp in length that encoded protein variants originating from a directed evolution campaign. To achieve an increase in the throughput of a PacBio sequencing run, we assembled individual genes by concatenation, producing larger DNA amplicons (Fig. 1). We limited the assembly of amplicons to contain five genes each to ensure that the polymerase will make at least three passes of each of the resulting ~ 5,000 bp amplicons during the CCS sequencing run. The consensus of these three subreads corrects random sequencing errors, thereby providing a sequencing accuracy of at least 99%. If a gene of a different length was to be concatenated, the number of genes per amplicon would have to be adjusted to yield a similar final amplicon length. The concatenation of the genes was carried out using Golden Gate assembly, which utilises Type IIS restriction enzymes42,43. Type IIS enzymes cleave DNA distal from its recognition site, subsequently removing the recognition site from the sequence44.

Figure 1
figure1

Overview of the concatenation procedure. The DNA pool to be sequenced was split into five PCR reactions with distinct primer sets, producing library variants 1–5. The PCR reactions also appended the variants with complementary BsaI restriction sites, which contained the four unique scar sequences (AD) (yellow, purple, green, grey). The complementary overhangs are shown as black rectangles. Sample-specific barcodes were attached to the 5′-end of library variant 1 and the 3′-end of variant 5. The design enabled controlled and directional assembly of the five library variants upon Golden Gate assembly. The different barcoded concatemers were all combined and SMRTbell hairpin adapters were ligated to the barcodes of each amplicon. PacBio CCS sequencing and the SMRTbell hairpin adapters allowed continuous reads of the large concatemer amplicons producing several subreads. The consensus of the subreads produced high fidelity sequence reads.

Before assembling the five genes, the DNA libraries were first purified by agarose gel electrophoresis to remove any species of incorrect length. We then divided each purified DNA library into five aliquots. We amplified each aliquot by PCR using five different pairs of primers yielding the library variants 1–5 (Supplementary Table S1). The primers were designed to overlap with the terminal constant regions of the encoded protein variants resulting in a slightly reduced size of the sequenced protein variants of 806–848 bp. The primers also introduced flanking BsaI restriction sites with the appropriate overhangs. A 16 bp sample-specific barcode was added during PCR to the 5′ end of library variant 1, and the 3′ end of library variant 5, providing a unique identifier to differentiate sample populations (see Supplementary Table S2 for barcode specific primers). All PCR primers were designed to have similar melting temperatures and GC-content to prevent primer dimers and mispriming events. The resulting PCR products (library variants 1 to 5) represented the genes one to five of the final assembled concatemer amplicon (Fig. 1).

We designed the primers such that four unique 4 bp “scar” sequences A to D (see Supplementary Table S1) were located at the connection between each of the five library variants after the Golden Gate assembly. Each scar was chosen based on previous studies that demonstrated increased assembly efficiency for particular sequences45. These scars enabled us to easily identify the individual genes (library variants 1 to 5) in the full-length amplicon.

To assemble the full-length amplicon, we mixed equimolar amounts of the amplified library variants 1 to 5. The Golden Gate reaction mix contained the BsaI restriction enzyme to produce overlapping flanking regions and T4 DNA ligase to ligate each of the variants in a one-pot reaction, producing the final concatenated amplicon ‘12345’. Facilitated by the unique pair of barcodes that flank each final amplicon, several DNA libraries were pooled and sequenced together on a single PacBio run.

Our initial attempts, following standard protocols to assemble the full-length 5 × amplicon, yielded mostly partially assembled products, which made it impractical to obtain the 1 µg of the final amplicon as required for PacBio sequencing (Supplementary Fig. S1). Most published protocols for Golden Gate assembly are designed for the assembly of fragments from circular vectors, but not from linear dsDNA like in our case46. We therefore optimised each stage of the concatenation process to ensure sufficient yield of the fully assembled 5 × amplicon.

Optimization of concatenation

To reduce the impact of potential PCR bias in sequencing, it was important to optimise the amplification of the library variants 1 to 5 to avoid mispriming events, non-target amplification, overamplification, and to minimise any impact of the original PCR template upon assembly. We first identified an annealing temperature that produced a similar amount of the target amplified product for each of the primers sets without amplifying non-target PCR species. This was important to minimise any potential bias between library variants 1–5. A temperature gradient established that an annealing temperature of 54 °C was optimal for all primer sets. Amounts of > 10% of the original template DNA in the final amplified DNA were found to be detrimental to assembly efficiency. The template appeared to function as a non-competitive inhibitor towards the restriction and/or ligase enzymes, leading to partial or improper assembly of the full-length amplicon and producing a smear rather than discrete bands for the assembled products (Supplementary Fig. S1). The PCR amplification was therefore optimised to produce sufficient product with the least number of PCR cycles to reduce potential bias, while keeping the amount of original template below 10%. The PCR product was purified by AMPure XP beads to remove any residual primers. About 2 µg of each of the purified library variants (1 to 5) was needed for the subsequent procedures.

Assembly of the library variants to produce the 5 × concatenated amplicon was carried out in a one-pot reaction with the BsaI enzyme, T4 DNA ligase and an equimolar mix of all five library variants 1 to 5. The enzyme BsaI was chosen because it produces a small overhang of only 4 bp that is still sufficient for efficient ligation47,48,49. However, alternative type IIS restrictions enzymes could be used, especially in case the gene of interest already contained a BsaI recognition site. Previous research has shown varying assembly yields using different kits or different enzyme vendors. Therefore, we compared several commercially available enzymes including T4 DNA ligase, T7 DNA ligase, BsaI, BsaI v2, and Eco31I. For our assembly, we found the NEB Golden Gate assembly kit to be most efficient, whereas enzymes from Promega and Thermofisher showed slightly lower efficiency (see Supplementary Fig. S2). However, all enzyme combinations showed partial assembly. The yield of the 5 × amplicon was further improved by optimizing protocol details. The reaction is often cycled between 37 °C (optimal temperature for digestion enzyme) and 16 °C (optimal for ligase reaction) for a total of 30 cycles. When the time at each temperature was increased from 1 to 5 min and the number of cycles was increased to 50, the production of the fully assembled amplicons increased. (Supplementary Fig. S2).

The combined optimization of the PCR and assembly reaction conditions improved the yield of the fully assembled 5 × amplicon from initially 7% to 40% but did not completely remove the partially assembled products of 1 × to 4x (Fig. 2). This is not surprising since the Golden Gate protocol usually assembles genes from vectors, which has been shown to have higher efficiency than the assembly of linear dsDNA. Furthermore, the efficiency in many of those studies was assessed indirectly through transformation efficiency43,50.

Figure 2
figure2

An example agarose gel showing the efficiency of concatenation for library variants 1–5 using enzymes from different commercial sources. Performance was quantified for each reaction by the percentage of fully assembled 5 × amplicon (marked by blue triangles). The NEB Golden Gate assembly kit was shown to concatenate the library variants more efficiently, increasing the amount of fully assembled product to 40%. A more detailed comparison can be seen in the Supplementary Fig. S2.

To enable efficient sequencing, the fully assembled 5 × amplicon was purified from the shorter concatemers using size separation by agarose gel electrophoresis. The DNA recovery from gel extraction is usually below 50% and the recovered DNA is often dilute and contaminated with residual EDTA or guanidine salts. However, the efficiency of PacBio sequencing is highly sensitive to the purity and quality of the DNA sample. Therefore, we included additional washing steps to reduce potential contaminants. The DNA samples were further purified and concentrated using AMPure XP beads (Beckman Coulter). The assembly reaction was scaled up to attain enough fully assembled 5 × amplicon for each gel extracted DNA library. We quantified the sequencing samples with PicoGreen and performed a size analysis via Agilent Bioanalyzer.

We applied this strategy to nine separate DNA library samples S1-S9, which were the output of different directed evolution experiments performed with the same starting library. The goal was to compare the protein sequence populations across the different experiments to understand the dependence of specific mutations on changing experimental conditions. After purification and quantification, the nine concatenated DNA samples were combined to a single pool at a ratio that reflected the relative sequencing depth we desired for each sample (Table 1). Finally, we submitted the combined, yet individually barcoded, sequencing samples for PacBio sequencing (Supplementary Fig. S3).

Table 1 Recovery of sequences during the post-sequencing data processing. All nine different DNA library samples, each containing a unique sample specific pair of barcodes, were combined and sequenced in a single flowcell.

Sequencing and raw data processing

Sequencing was carried out on a PacBio Sequel system, using one PacBio flowcell and a collection time of ten hours. Circular consensus sequencing (CCS) was performed on the pooled samples using the SMRT Link software v8.0, producing a consensus sequence for each well of the flowcell. We filtered for sequences that had at least 3 passes and a sequencing accuracy of at least 99% (Q20), which yielded a total of 124,715 sequences.

Data processing and data analysis

A number of existing bioinformatic tools can perform generic processing of high-throughput DNA sequencing data51,52,53, however, no single tool can demultiplex, deconcatenate and translate DNA sequences to amino acid sequence. Therefore, to process the sequencing data from CCS, we developed DeCatCounter, an all-in-one bioinformatic pipeline for demultiplexing, deconcatenating, filtering by length and (optionally) translating DNA sequencing data (Fig. 3). DeCatCounter takes as input raw, CCS sequencing read files (FASTA or FASTQ) and outputs a dereplicated list of unique nucleic acid sequences and their read counts, together with the associated list of peptides sequences and their read counts, if desired (Fig. 3).

Figure 3
figure3

Overview of the post-sequencing data processing. The CCS function in the SMRT Link software v8.0 assimilated the consensus sequences from each well and removed the SMRTbell hairpin adapters. We developed a custom script to sort (demultiplex) and deconcatenate the sequences. Firstly, the sample-specific barcodes were used to separate each selection output from the mixed pool of sequences. Next, we extracted the five original library variants from the concatenated amplicon using the forward and reverse constant region at the 5’ and 3’ termini of each variant (not shown in the figure). A length filter was then applied to ensure the recovery of full-length library variants. See Supplementary Fig. S4 for further detail.

During demultiplexing, DeCatCounter searches each sequence for the sample-specific pair of barcodes and assigns it to the corresponding sample population (see Supplementary Table S2). We allowed for a potential sequencing error of up to two mismatches for each barcode sequence (16 bp) yielding a recovery of 61% of the sequences. Increasing the number of allowed errors to three or four resulted in a similar recovery of 62%. However, decreasing the number of allowed errors to only one mismatch greatly reduced the number of sequences recovered to 50%. We therefore chose to use two allowed mismatches. The retrieval of only 61% of the sequences after the demultiplexing was lower than anticipated. This percentage varied between different samples (Table 1). Demultiplexing retrieved 76–81% of sequences for DNA samples S6-S9; but only 43–48% for S3 and S4 and 12–13% for S1, S2 and S5. The reduced recovery of some samples may be attributable to differences in handling and storage of the fully assembled 5 × amplicons. Samples that were stored for longer time periods and underwent several freeze/thaw events showed the lowest recovery in the demultiplexing step. These factors likely led to damaged flanking regions of the assembled amplicons, which was indicated in the QC assessment of the sequencing run.

To deconcatenate the fully assembled 5 × amplicons into the individual five genes (library variants 1 to 5), DeCatCounter searched for and trimmed the constant regions of 40 bp and 22 bp that flanked each gene on its 5′ and 3′ terminus, respectively (see Supplementary Fig. S4 for details). Certain constant terminal regions are common to gene libraries. Here, these constant termini were part of the original gene library design and had been used to PCR-amplify library variants after each round of selection and directed evolution. For this search, we allowed four potential sequencing mismatches for the 40 bp long 5’-terminal constant region and two mismatches for the 22 bp long 3’-terminal constant region, which also accounts for possible DNA synthesis errors incurred from primer template mismatches and PCR amplification following successive rounds of selection. Applying these constraints resulted in an average of 99% recovery. Increasing or decreasing the number of allowed mismatches by one did not substantially change the recovery rate (1–2% change). Ideally, the deconcatenation step should yield five times as many individual genes compared to the fully sequenced concatemers. Our experimental results demonstrated that this step was highly efficient, yielding 99% of the expected number of genes for all nine sample pools. To ensure that each retrieved sequence represented a gene from the sample populations and not some spurious sequences in the assembly mix, the sequences were filtered for the expected length range between 707–825 bp. This range represents a 5% length variation over the original protein library after the removal of the constant regions used in the demultiplexing process. Variation in length was allowed due to the multiple rounds of directed evolution and homologous recombination experienced by each sample. This filtering for length retained 96–99% of sequences, confirming that our purifications throughout the sample preparation prevented the assembly of unintended sequences (e.g. primers or products of PCR mispriming). The number and percentage of sequences retrieved after each data processing step are detailed in Table 1.

Analysis of protein populations from directed evolution experiment facilitated by high sequencing depth and accuracy

The nine DNA samples S1-S9 that we sequenced in a single PacBio flowcell were populations of protein variants from a directed evolution experiment. This laboratory selection campaign started from a library based on a single protein containing several randomised regions. Each DNA sample encoded protein variants from different stages or generations of this campaign. The goal was to identify changes in the sequence populations throughout the generations of evolution in response to specific selection pressures.

For each sample population, the demultiplexed, deconcatenated and length-filtered sequences were clustered based on sequence similarity. The clusters (sequence families) were defined by distinct amino acid motifs in the regions of the starting library that had been randomised prior to the directed evolution campaign. Sequence families were translated to their amino acid sequences and ranked based on their relative frequency in their sample population. We then compared the frequencies of protein sequence families within each sample population and across multiple samples. The stringency of the selection conditions had been increased stepwise four times, such that the enriched populations from each selection condition are represented by five samples S1-S5 (Fig. 4a). The population of sequences that resulted from the lowest selection pressure (sample S1) remained highly diverse, with a large number of different sequence families identified. As expected, this diversity was reduced with an increase of selection stringency, as demonstrated by the reduction in the number of sequence families in sample S2. Upon further increase in selection pressure, only a single sequence family appeared to dominate the selection for populations S3-S5. In addition, comparison among samples S1-S3 illustrates how the dominant sequence family in the population changed throughout the course of the evolutionary experiment (Fig. 4a). For example, sequence families 1 and 2 dominated sample S1, accounting for ~ 85% of the population, while family 3 made up only 5% of the population. In contrast, sequence family 3 dominated the population of sample S2, at 95% of the population, while the fraction of sequences belonging to sequence families 1 and 2 were greatly reduced. Starting with sample S2, a new sequence family 14 started growing in frequency. This sequence family 14 became the dominant family for the samples S3-S5.

Figure 4
figure4

Analysis of nine sequence populations from a protein directed evolution campaign. (a) The samples S1-S5 represent protein populations from progressive generations of laboratory evolution performed with stepwise increases in selection pressure. The pie charts for each sample illustrate the distribution of different sequence families within that population. Each family is shown in a different colour and the size of the segment represents the abundance of that family in the sample population. The black, green and purple segments in samples S1 and S2 represent sequence families 1, 2 and 3, respectively. Sequence family 14 is depicted in orange and dominates the sample S3 and all following generations. (b) Analysis of sequence distribution at subfamily level for samples S4, S6, and S7. One round of selection was carried out on the same population of protein variants, yet using three different selection conditions resulting in the three sample populations S4, S6 and S7. While all sequences in those samples belonged to sequence family 14, subfamilies were discerned within each of the populations as shown by different shades of orange. These subfamilies constitute either a specific point mutation or a cluster of mutations. The abundance of different subfamilies in a population is represented in the graph. Samples S8 and S9 were experimental control populations and are not shown here.

Finer differences were expected among samples S4, S6, and S7. In contrast to the evolutionary progression of samples S1-S5, these three samples comprised a separate comparative study carried out as three parallel selection branches, all starting from aliquots of sample S3 (Fig. 4b). Each branch was a single round of selection under different selection conditions, yielding samples S4, S6 and S7. Samples S8 and S9 (not shown) were generated as reference samples to control for potential unintended technical bias. As expected, we found the populations of S4, S6 and S7 to also be dominated by sequence family 14, as with sample S3 (Fig. 4). Changes across the three sample populations were therefore more subtle, often differing by only a handful of specific point mutations (see subfamilies in Fig. 4b). These results demonstrated that the sequencing method could resolve unique sequence subfamilies that emerged in the sample population as different selection conditions were applied for just one round of selection. Slight differences between subfamilies could be discerned due to accurate sequencing of each population in sufficient depth, despite using only a single PacBio flowcell for all nine samples.

Read more here: Source link