Whole-genome sequencing and analysis of Apocynum cannabinum

Sequencing data statistics

The PacBio platform were used for whole genome sequencing the A. cannabinum genome. Table 1 represents total sequencing volume was 117.13 G, and the coverage was 490.04 X based on calculations of the genome size (239.02 M) estimated from a survey [41]. In addition, the Illumina platform was used for the sequencing of a constructed second-generation small fragment library.

Table 1

Apocynum cannabinum genome sequencing data statistics

Pair-end libraries

Insert size

Total data (G)

Read length (bp)

Sequence coverage (X)

llumina reads

350 bp

31.94

150

133.63

PacBio reads

29.8 kb

8.76

11734

162.16

10X_Genomics

500–700 bp

46.43

150

194.25

Total

117.13

490.04

Assembly results and statistics

The genome size was 260 Mb, the contig N50 was 3.11 Mb, and the scaffold N50 was 4.19 Mb (scaffolds of 100 bp and above were selected for assembly results). The GC analysis showed that the GC content of A. cannabinum was 33.26%. Through Hi-C-assisted assembly, the final genome sequence was anchored to 11 chromosome. The draft genome scaffold chromosome anchoring results showed that the constructed super-scaffold N50 was 21.16 Mb and the number of super-scaffolds was six, which was significantly decreased. The number of scattered scaffolds was low, showing good assembly results. Among the scaffolds, N90 was 11, indicating that 90% of sequences were on the 11 chromosome. A comparison of N50, N60, N70, N80, and N90 showed that the Hi-C assembled A. cannabinum genome result was significantly improved from the original draft genome (Table 2).

Table 2

Apocynum cannabinum genome assembly results

Sample ID

Length

Number

Contig (bp)

Scaffold (bp)

Contig (bp)

Scaffold (bp)

10X_Genomics

Total

259,856,155

260,145,681

479

439

Max

9,567,299

15,154,524

Number > = 2000

479

439

N50

3,107,135

4,189,930

26

19

N60

2,442,000

3,120,689

35

27

N70

1,722,976

2,442,000

48

36

N80

1,116,808

1,547,882

66

49

N90

257,161

339,954

109

80

Hi-C

Total

259,856,155

260,165,581

569

330

Max

9,567,299

26,508,341

Number > = 2000

567

330

N50

2,442,000

21,162,547

33

6

N60

1,938,819

20,272,879

45

7

N70

1,256,579

19,617,059

61

9

N80

786,196

19,407,958

87

10

N90

207,482

17,325,238

152

11

A heatmap was used to validate the Hi-C assembly results (Fig. 1). From the heatmap, we can see that the interaction relationships near the diagonal were far stronger than those far away from the diagonal. The overall results showed that there was no significant clustering error between the A. cannabinum chromosomes.

Evaluation of assembly results

BUSCO and CEGMA software were used to evaluate the assembly results for A. cannabinum. The results showed that of the 1440 orthologous single-copy genes in the plant database, 1348 genes (C = 93.6%) were completely covered, of which 1258 single copies were covered (S = 87.4%), 89 multiple copies were covered (D = 6.2%), 30 genes were not completely covered (F = 2.1%), and 62 genes were missing (M = 4.3%). This suggests that the vast majority of single-copy genes was completely assembled, that there was no over-assembly or re-assembly, and that the assembly accuracy and integrity were good. At the same time, 227 genes were assembled from 248 core eukaryotic genes (CEGs), indicating that the assembly results were relatively complete. Alignment using BWA software showed that the alignment rate of all small fragment reads to the genome was 94.80% and the coverage was 96.95%, demonstrating that the reads and the assembled genome possess good consistency.

Repeat sequence annotation

Repeat sequences can be classified as tandem repeats and interspersed repeats. Tandem repeats include microsatellite sequences and minisatellite sequences. Interspersed repeats are also known as transposon elements and include DNA transposons that use DNA-DNA transposition and retrotransposons. Commonly seen retrotransposons include long terminal repeats (LTRs), long interspersed elements (LINEs), and short interspersed elements (SINEs). Around 36.62% of sequences in the A. cannabinum genome were identified as repeat sequences, of which DNA transposons, LTRs, LINEs, and SINEs accounted for 2.22%, 28.45%, 1.01%, and 0%, respectively(Table 3).

Table 3

Statistics of repeat sequence classification results for A. cannabinum

 

Denovo + Repbase length (bp)

% in genome

TE protein length (bp)

% in genome

Combined TE length (bp)

Type

DNA

5,282,438

2.03

765,646

0.29

5,763,303

2.22

LINE

1,744,782

0.67

1,157,517

0.44

2,630,319

1.01

SINE

1,597

0.00

0

0

1,597

0.00

LTR

72,896,704

28.02

16,110,716

6.19

74,004,538

28.45

Unknown

14,826,765

5.70

0

0

14,826,765

5.70

Total

92,189,164

35.44

18,033,358

6.93

93,170,210

35.81

Note: Denovo + Repbase are transposon elements obtained after the databases predicted by RepeatModeler, RepeatScout, Piler, and LTR_FINDER were combined with the RepBase nucleic acid database. This was followed by integration using Uclust software according to the 80-80-80 principle, following which RepeatMasker was used for genome annotation. The TE proteins are transposon elements that were obtained when the RepeatProteinMask software was used to annotate genomes in the RepBase protein database. Combined TEs is the result obtained by combining the results obtained from these two methods and following the removal of redundant genes. Unknown indicates that this repeat sequence could not be classified by RepeatMasker.

Table 3

Statistics of repeat sequence classification results for Apocynum cannabinum

 

Denovo + Repbase length (bp)

% in genome

TE protein length (bp)

% in genome

Combined TE length (bp)

Type

DNA

5,282,438

2.03

765,646

0.29

5,763,303

2.22

LINE

1,744,782

0.67

1,157,517

0.44

2,630,319

1.01

SINE

1,597

0.00

0

0

1,597

0.00

LTR

72,896,704

28.02

16,110,716

6.19

74,004,538

28.45

Unknown

14,826,765

5.70

0

0

14,826,765

5.70

Total

92,189,164

35.44

18,033,358

6.93

93,170,210

35.81

Note: Denovo + Repbase are transposon elements obtained after the databases predicted by RepeatModeler, RepeatScout, Piler, and LTR_FINDER were combined with the RepBase nucleic acid database. This was followed by integration using Uclust software according to the 80-80-80 principle, following which RepeatMasker was used for genome annotation. The TE proteins are transposon elements that were obtained when the RepeatProteinMask software was used to annotate genomes in the RepBase protein database. Combined TEs is the result obtained by combining the results obtained from these two methods and following the removal of redundant genes. Unknown indicates that this repeat sequence could not be classified by RepeatMasker.

Gene structure annotation

Through the annotation of the A. cannabinum genome, a total of 22,793 non-redundant protein coding genes were predicted. The mean gene sequence length of the gene was 3532 bp, the mean length of coding sequence was 1206 bp, and the exons and introns were 231 bp and 551 bp, respectively. The mean number of exons in every gene was 5.22. Comparison of the lengths of genes, coding sequences, introns, exons, and the number of exons in A. cannabinum with C. gigantea, Catharanthus roseus, C. arabica, and L. usitatissimum showed that the lengths of the coding sequences and exons and the number of exons in every gene in A. cannabinum were closest to C. gigantea and L. usitatissimum, while C. roseus had lower numbers than these three species and C. arabica had higher numbers than these species. The lengths of the genes and introns of A. cannabinum were most similar to C. arabica, while the lengths of the genes and introns of L. usitatissimum were significantly lower than the other four plant species. The lengths of almost all introns in the five plant species were within 1000 bp, and introns with lengths greater than 2000 bp were extremely rare or absent (Fig. 2).

Annotation of gene function and non-coding RNAs

A total of 78.70% of genes could be aligned to known metabolic pathways; 12,823 functional genes could be annotated by GO function; and 4.4% of genes had unknown functions. Among the annotated protein-coding genes, 21,780 genes had homologous sequences in the protein database, accounting for 95.6% of annotated genes. Furthermore, 95.2% of genes were aligned to non-redundant protein databases, of which 92.30% contained conserved protein domains, showing that the functions of most A. cannabinum genes were relatively conserved. In addition, 689 tRNAs, 3934 rRNAs, 592 snRNAs, and 159 miRNAs were annotated in the A. cannabinum genome (Fig. 3).

Comparative genomic analysis of A. cannabinum

Clustering analysis of the gene families indicated that 43,837 gene families were clustered in 17 species, of which 3084 gene families were common to all these species, while there were 55 common single-copy gene families in the different species. The clustering results of A. cannabinum, A. venetum, C. roseus, C. gigantea, and C. capsularis were used to plot a Venn diagram (Fig. 4).

From the Venn diagram, we can see that A. cannabinum contains 283 species-specific gene families and 449 genes compared with other species. The gene family expansion and contraction analysis results showed that the most recent common ancestor (MRCA) contained 43,823 gene families, of which 399 gene families and 166 genes were expanded in A. cannabinum, while 492 gene families and 231 genes were contracted (Fig. 5).

The gene family clustering analysis results showed that C. roseus, A. venetum, A. cannabinum, and C. gigantea clustered in the same taxon. The Markov chain Monte Carlo (MCMC) program in PAML software indicated that the speciation time for A. cannabinum was around 35.9 (26.7–46.8) million years ago.

Read more here: Source link