Phylogeographic reconstruction of the marbled crayfish origin

Procambarus fallax collections and PCR genotyping

Animals were collected from various wild populations (Table S1) in compliance with state and local regulations (Georgia department of natural resources scientific collection permit 115621108, state of Florida collection permits S-19-10 and S-20-04). DNA was isolated from abdominal muscle tissue using SDS-based extraction and precipitation with isopropanol. PCR genotyping of 92 specimens was performed using the following primer pairs: Cytb (FWD CAGGACGTGCTCCGATTCATG and REV GACCCAGATAACTTCATCCCAG), Dnmt1 (FWD GCTTTCTGGTCTCGTATGGTG and REV CTGCACACAGCCTAAGATGC), Cox1 (FWD CTGCTATTGCTCATGCAGGT and REV TGCCCGAGTATCTACATCCA). Amplicons were verified by agarose gel electrophoresis and cloned using the TOPO TA Cloning Kit (Invitrogen) according to the manufacturer’s instructions. The purified plasmids were sequenced by eurofins genomics and the sequences were aligned using SnapGene software.

Procambarus virginalis PacBio genome sequencing and assembly

Genomic DNA was isolated from three independent animals using SDS-based extraction and precipitation with isopropanol. PacBio large-insert library preparation was done following the recommended protocols by Pacific Biosciences. For each animal, a library was generated from 1 to 5 µg of sheared and concentrated DNA with an insert size target of ~10 kb. After library preparation, sequencing was performed on a PacBio SEQUEL platform according to the manufacturer’s instructions. Movie times were 600 for most SMRT cells with some being 240, 360, and 900. Sequencing results for animals were pooled resulting in a total of 37 SMRT cells comprising 69,074,290 reads and 242,146,920,794 bases.

Long subreads generated from the PacBio SEQUEL platform were assembled into contigs using the Canu assembler25 version 1.7. After self-error correction and trimming for reading quality estimates and SMRT cell adapters, the remaining 31,652,687 reads, comprising 91,697,556,862 bp, were used in the assembly phase. Computations were done on a high-performance cluster running the Slurm Workload Manager using up to 56 CPUs and 450 GB memory. To achieve an improved gene representation, contigs were connected using transcriptome information. Briefly, all transcripts were mapped onto the contig assembly and linkage information was extracted to connect contigs using L_RNA_SCAFFOLDER26 version 1.0. Contiguity of sequences was improved using proximity ligation based on Chicago and Hi-C methods (provided by Dovetail Genomics, Chicago, USA) using fresh frozen tissue of one additional animal. After DNA extraction and library preparation, libraries were on an Illumina HiSeq X platform using a PCR-free paired-end 150 bp sequencing protocol. Reads were processed for scaffolding and error-correcting using Dovetail’s HiRise scaffolding pipeline (Dovetail Genomics, Chicago, USA).

Automatic annotation was performed as previously described8. Briefly, sequences larger or equal 10 kb were extracted from the assembly and annotated using the MAKER pipeline27 version 3.00. Annotation data were provided by the manually curated Uniprot/Swiss-Prot database28 and the annotated marbled crayfish transcriptome8. Functional domains were predicted using InterproScan29 version 5.39–77.0.

In order to assess the assembly quality, a defined set of single-copy ortholog arthropod genes were searched within all sequences using BUSCO18 version 4.1.4. BUSCO was run using default parameters in genome mode with the supplied arthropod sequence database, containing a total of 1066 orthologs from the arthropoda_odb10 database. Taxonomic interrogation was performed by using blobtools30 version 1.1.1, the P. virginalis Petshop 1 WGS dataset8, and the blast nucleotide database as a hit database. Blobtools was run according to the provided workflow by first creating a BlobDB database using standard parameters, followed by visualization.

Procambarus fallax whole-genome sequencing

Illumina libraries for 28 specimens were prepared by the DKFZ Genomics and Proteomics Core Facility using standard procedures and sequenced on an Illumina HiSeq X platform (PE150 protocol). Raw reads were trimmed and filtered using Trimmomatic31 version 0.32.

Whole-genome sequencing data analysis

Trimmed and processed reads were mapped on the P. virginalis mitochondrial genome sequence (Genbank KT074364.1) or on the V1.0 genome assembly using bowtie232 version 2.1.0. Alignment files were converted to BAM format and duplicates were removed with SAMtools33, version 1.9. Furthermore, alignment files were filtered for reference sequences longer than 10 kbp. Multi-sample SNV profiles were calculated for batches of P. fallax animals using freebayes (arxiv.org/abs/1207.3907, version v0.9.21-7-g7dd41db) with a ploidy setting for diploid organisms. The obtained variant sites from vcf files were subjected to linkage pruning using PLINK34 version 1.9 and specific parameter settings for windows (50 Kb), window step sizes (10 bp), and R2 threshold (>0.1). The resulting set of variants was used for a Principal Component Analysis (PCA) using default parameter settings in PLINK and plotted using the ggplot2 package in R (version 3.2.1). Alignment files for the mitochondrial genomes were used for consensus calling by the bcftools package from SAMtools. The obtained consensus sequences of P. fallax mitochondrial genomes together with the P. virginalis reference mitochondrial genome were aligned using Clustal Omega with default parameters. For the resulting multiple sequence alignment, a phylogenetic tree was constructed by the maximum likelihood method implemented in PhyML35 (v3.1/3.0). The tree in Newick format was visualised via the Interactive Tree of Life online tool36, with statistical branch support assessed by the bootstrap method37.

To detect parental nuclear alleles of P. virginalis, SNV profiles from all P. fallax animals were compared with the set of marbled crayfish-specific heterozygous positions. A matrix was built for each marbled crayfish allele (majority and minority allele) containing either 0 or 1, representing either the presence or absence of the respective alleles in P. fallax animals. A neighbor-joining distance matrix was calculated for each allele and an unrooted tree was plotted using APE38 version 5.3. To identify triploid genomes, biallelic heterozygous variants were extracted from SNV profiles. Focusing only on reference alleles, homozygous single nucleotide polymorphisms in P. fallax were discarded. For each variant position, alternative allele frequencies were calculated as the number of alternative allele read observations divided by the total read depth. Finally, frequencies were plotted using the histogram routine in R (version 3.6.1).

Statistics and reproducibility

All statistical methods and visualizations were performed using publicly available tools and packages from the statistical computing framework R. Individual packages, parameter settings, and version numbers are mentioned in the respective sections. Linkage pruning was performed using PLINK34 (v1.9) and an internal r2 statistic threshold of >0.1 for squared correlation of raw inter-variant allele counts. Principal component analysis was performed using the default internal R function with standard parameters. Phylogenetic tree reconstruction was performed by the maximum likelihood method implemented in PhyML35 (v3.1/3.0) with default parameters. Here, statistical support for each branch was assessed using the bootstrap method37. Histogram plots for heterozygous allele frequencies were generated using the default internal R (v3.6.1) histogram function.

This study includes data for PCR genotyping of N = 92 P. fallax specimens. For each animal, the total number of genetic polymorphisms on 3 gene fragments was extracted and plotted on a scale from 0 (minimum) to 21 (maximum) using the internal R (v3.5.0) heatmap function with default parameters. Additionally, data from N = 4 specimens of P. alleni were sequenced and integrated. Replicates were defined as a group of individuals with distinct animals from one species (i.e., N = 92 for P. fallax and N = 4 for P. alleni). Analysis of whole-genome sequencing data was performed on N = 28 P. fallax animals collected from 23 independent collection sites in Florida and southern Georgia. Additionally, N = 2 publicly available WGS datasets of P. virginalis samples were considered for this study. Here, replicates were also defined as the number of individual animals from one species.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Read more here: Source link