Single-cell DNA and RNA sequencing reveals the dynamics of intra-tumor heterogeneity in a colorectal cancer model | BMC Biology

Organoid culture of small intestinal cells and lentiviral transduction

C57BL/6J mice and BALB/cAnu/nu immune-deficient nude mice were purchased from CLEA Japan (Tokyo, Japan). The small intestine was harvested from wild-type male C57BL/6J mice at 3–5 weeks of age (Additional file 1: Figure S9A). Crypts were purified and dissociated into single cells, which were then put into serum-free Matrigel-based organoid culture as previously described [22, 43]. Five days later in the first passage, organoids were lentivirally transduced with shRNA against APC, where the efficiency of introducing the shRNA was around 90% [22, 43]. To select for APC-repressed cells, transduced organoids were thereafter maintained in culture medium lacking R-spondin 1, which activates Wnt pathway and is thereby indispensable for propagation of normal intestinal cells. To obtain a single-cell clone, shAPC-transduced organoids were dissociated and plated at the concentration of 0.5 cell/well in a 96-well plate. Immediately after plating, 50 wells containing only a single cell were identified under microscope; the remaining 46 wells were occupied by no cells or 2 or more cells. Then, we selected 24 out of the 50 single-celled wells and placed the organoids from each well into a separate well of a 24-well plate. Fresh culture media at the passage were necessary for removing wastes and multiplying organoids. We repeated these procedures for 5 wells selected from among the 24 wells, and finally specified 1 well for later use as organoids originating from a single cell.

We then grew organoids from a single cell in an individual well, transferring the fastest growing organoids from 1 to 2 to 4 to 8 wells as they multiplied (Additional file 1: Figure S9B). On the way to the 8-well stage, as is usually done in this type of culture, we collected organoids into a single tube at every passage, resulting in the intermixing of organoids across different wells. At the 8-well stage, we collected organoids from 4 out of the 8 wells into a single tube for storage and also sampled the DNA for sequencing, which was used as the T0.5 sample. Two of the remaining wells were used for re-culture, and two were used for the mouse injection test; the same procedure was used for T1, T2, and T3 (Additional file 1: Figure S9B). Organoids composed of 5 × 105 cells were mixed with 200 μl of Matrigel and injected into one side of the dorsal skin of nude mice, while uninjected cells were maintained in 3D cultures for later use.

Analysis of subcutaneous tumors in nude mice

Palpable tumors from the injection sites were harvested for histological examination or cell culture. Half of the subcutaneous tumors were formalin-fixed, paraffin-embedded, and sectioned at a thickness of 5 μm, followed by HE staining to assess histological features. The other half of the tumors were minced and digested to recover cells as described previously [22], then seeded onto solidified Matrigel to resume organoid culture.

Single-cell transcriptome and exome sequencing

Cultured mouse organoids derived from a single cell were harvested and treated with Accumax (Innovative Cell Technologies, AM105) to generate a single-cell suspension. To capture cells and extract RNA or DNA from a single cell, the cell suspensions (4.4 × 105 cells/ml) were loaded on a C1 Single Cell Auto Prep System (Fluidigm, C1) with medium-sized (10–17 μm) microfluidic chips for 96 cells. Approximately 1300 cells were applied to each chip. Captured cells were imaged on a BZ-9000 digital microscope (Keyence, BZ-9000) and a visual inspection was performed to determine whether a single cell was captured in each well of the chip. Capture efficiency for a single cell was determined as 71–82%.

For single-cell whole transcriptome (RNA) sequencing, captured cells were lysed on the chip using a C1 Single-Cell Auto Prep Reagent Kit for mRNA Seq (Fluidigm, 100-6201). Full-length cDNA fragments were transcribed and amplified from poly-A RNA in each single cell using the SMARTer Ultra Low RNA kit (Takara Bio, 634832). Tagmentation of cDNA was performed and sequencing libraries were prepared using the Nextera XT DNA sample preparation kit (Illumina, FC-131-1096) according to the manufacturer’s protocol. Up to 52 independent single-cell RNA-seq libraries were prepared for sequencing.

For single-cell DNA sequencing, genomic DNA was prepared from single cells using the C1 Single-cell Auto Prep Reagent Kit for DNA Seq (Fluidigm, 100-7357) and whole-genome amplification was achieved by multiple displacement amplification with Phi29 DNA polymerase and the Illustra GenomiPhi v.2 kit (GE Healthcare, 25660032). Amplified genomic DNA (70 ng) was used to generate exome sequence libraries using the Hyper Prep kit (Kapa Biosystems, KK8504), SureSelect Target Enrichment kit (Agilent Technologies, 931171), and SureSelect XT Mouse All Exon v.1 probe (Agilent Technologies, 5190-4642).

Bulk-cell transcriptome and exome sequencing

Among the cells that were not used for single-cell capture with the C1 system, suspensions of about 200 cells were subjected to whole transcriptome (RNA) sequencing for bulk-cell RNA-seq (T1, T2, and T3 samples). The sequencing libraries were prepared using the same reagents as the single cell RNA-seq. As control bulk cells, normal intestinal crypt epithelial cells from two wild-type mice of the same strain were grown in the 3D culture system for 7 days, then harvested and lysed for total RNA preparation using the miRNAeasy Mini kit (Qiagen, 217004). RNA-seq libraries for control bulk RNA were generated using the SureSelect Strand Specific kit (Agilent Technologies, G9691B). Bulk DNA from over 1 × 105 cells was obtained from the cell culture (T0.5 sample, 1.5 months after culture initiation) and the remaining cells in single-cell capture (T1, T2, and T3 samples) using the QIAamp DNA Mini kit (Qiagen, 51304), and 500 ng of DNA were used to construct exome sequencing libraries with the same reagents as the single cell DNA-seq.


All the sequencing libraries were subjected to paired-end sequencing of 101-bp fragments on the HiSeq2500 DNA sequencer (Illumina, SY–401–2501).

Transcripts per kilobase million (TPM) calculation for single and bulk cells

The procedure for calculating TPM is summarized in Additional file 1: Figure S10. Specifically, sequence reads were quality-filtered and trimmed using PrinSeq [44], and then used as input for quality-check by FastQC ( We used the following parameters: –min_len 30 (removing reads ≤ 30 bases); –min_qual_mean 20 (average read quality ≤ 20); –trim_tail_right 5, –trim_tail_left 5 (trim bases if the 3′ and 5′ end poly A/Ts are ≥ 5 bases); and –trim_qual_right 20, –trim_qual_left 20 (trim 3′ or 5′ end for read quality ≤ 20). Paired-end reads were mapped to the University of California Santa Cruz mouse genome sequence (mm10) [45] using Bowtie2 [46] built in RSEM [47]. Expression levels (in TPM) were estimated by RSEM using the command rsem-calculate-expression with the parameters –estimate-rspd, –paired-end, –bowtie2, -p 30, and –ci-memory 10192. We removed eight T1 cell samples due to an excessive number of genes (≥ 5200) with TPM ≥ 10 (with reference to results in [48]) or with too few unique mapping reads (< 2.2 × 106). We also removed two samples with unique mapping rates that were too low (< 20%) and discarded genes with low expression levels (≤ 10 TPM) across all cell samples, leaving 14,258 out of 32,732 genes for analysis.

Estimation of replicate variabilities of gene expression levels

To estimate replicate variabilities of gene expression levels for a single cell, we first randomly selected three cells at each time point and then employed a bootstrapping approach where we re-sampled sequencing reads in the FASTQ file for each cell. We repeated this re-sampling three times for each cell to make three replicate sets of sequence reads. For each replicate set, we used the same method as for the observed data to obtain bootstrapped gene expression levels in log2 [TPM + 1]. We calculated the relative error of the expression level for each gene, setting the observed value in the denominator, and then averaged the relative errors across genes.

Detection of highly variable genes

To detect genes with variable expression levels across cells, we defined highly variable genes according to the CV, corrected in the locally weighted scatterplot smoothing (LOWESS) method using the “lowess” function in R. To fit a single LOWESS curve across all ranges, we divided average expression level data into three ranges: < 4, 4–8.5, and > 8.5. cCV values were yielded by dividing CV values by the value of the upper variability band (± 1.96 times the standard deviation) of smoothed curve estimated using “” in the “msir” package. Because of the large bias in original CV values against low average expression levels, only those with cCV values > 1.3 and high average expression levels (log2 [TPM + 1] ≥ 4) were defined as highly variable genes.

PCA of RNA data

PCA was carried out for gene expression levels (log2 [TPM + 1]) without scaling.

Hierarchal clustering, correlation plot, and heatmap analysis

For hierarchal clustering, we used the “hclust” function in the “stats” package of R software, where we calculated the Euclidean distance of expression levels (log2 [TPM + 1]) of all highly variable genes between cells and used the Ward method for agglomeration. We generated correlation plots of highly variable genes using the “corrplot” function in the R “corrplot” package, where we used the Ward method for agglomeration. We divided genes into three clusters based on these hierarchical clustering results using the “addrect = 3” option. A heatmap was generated using the “heatmap.2” function in the “ggplot2” package. In the heatmap, cells were arranged according to their order in the dendrogram described above and genes were arranged according to their order in the correlation plot of highly variable genes.

Gene set enrichment analysis

DAVID [49] was used to identify gene ontologies (biological processes) in which genes of an identified group were enriched (P < 0.01).

SNV detection for single and bulk cells

For bulk-cell data, we used a previously described method for SNV/indel calling [50] by cisCall with cell-line/frozen parameters [51], mapping reads to the mouse genome (mm9) [45] by BWA [52]. We filtered out PCR-duplicated reads as well as reads and bases with low mapping and base qualities. The remaining variants were further filtered statistically using Fisher’s exact test to compare fore- and background samples, which came from two different individuals of the same pure C57BL/6J strain. We verified the negligible effects of using a different individual for the background sample. A series of filters was used to remove suspicious variant calls, such as those related to misalignments. Variants for which allele frequencies were significantly greater than 1% in the binomial test were retained. The procedure is summarized in Additional file 1: Figure S10.

For single-cell sequencing data, we called SNVs only at SNV sites called in bulk-cell sequencing data. Specifically, we counted nucleotide bases with high qualities (mapQ ≥ 20, BaseQ ≥ 10) in single-cell sequencing data as well as in background data (same as those used in bulk-cell SNV calling) with the Samtools mpileup function [52, 53]. We then selected variants with the largest χ2 test statistic (with Yates’s correction) among all possible variants at each position to identify those that were most likely to differ between single-cell and background data. We required a variant count ≥ 2 and VAF ≥ 2% for such variants in single-cell data. We then selected variants that overlapped with SNV sites called in bulk-cell data.

Detecting mutation in cancer-related genes

We investigated nonsynonymous mutations in cancer-related genes contained in Tier1 in COSMIC Cancer Gene Census v87 [25, 54].

MDS of DNA data and the diversity index

We performed MDS from the cell × site matrix composed of zero and one, which respectively represent the absence and presence of SNVs (both synonymous and nonsynonymous SNVs) and NA, which represents an undetermined call due to shallow depth. We assigned zero to non-called sites as the true negative when those sites had depths ≥ 30 and assigned NA to non-called sites when the depth was < 30. We only used SNV sites where a variant was called in at least one cell and the VAFs at the same sites in bulk data ranged from 10 to 35% (polymorphic) for at least one time point. We removed cells and sites (two each) with too few or too many NAs, yielding 104 sites and 69 cells. Using this 0/1/NA matrix, we calculated the p-distance (proportion of different sites) used in molecular evolution without using NA, and then performed MDS.

The diversity index was calculated as the average Euclidian distance from the centroid over cells in the MDS space, where we used up to the sixth dimension because of a sudden drop in the eigenvalues over this dimension. To calculate the statistical significance of differences between cell groups, we used a bootstrapping approach in which we randomly re-sampled cells’ sequences from the 0/1/NA matrix of each cell group 10,000 times and performed the same MDS as in the observed data for each replicate. We then calculated the diversity index for each replicate to determine the 95% confidence interval and standard deviation for each cell group.

Lorenz curve and Gini coefficients

A Lorenz curve was generated with read depth at each site using the “Lc” function in the “ineq” package of R software. To quantify uniformity, the Gini coefficient was calculated using the “Gini” function in the “ineq” package.

ADO rate

The ADO rate was defined as the rate of homozygous sites in single-cell samples where the bulk sample was heterozygous (defined as sites where VAFs were 45–55%) at the same nucleotide site. We removed outlier cells with high ADO rates at each time point (one cell each with an ADO rate > 80% at T2 and T3).

Average copy number

The average copy number, ACN, was calculated as follows:

$$ mathrm{ACN}=2times left{left({2}^{frac{sum left({log}_2{R}_itimes {L}_iright)}{sum {L}_i}}right)times left(frac{sum {L}_i}{GL}right)+left(1-frac{sum {L}_i}{GL}right)right}, $$


where log2Ri, Li, and GL represent the log-ratio of CNA segment i, length of CNA segment i, and genome length (50 Gb for mouse, 40 Gb for human), respectively. CNAs of mouse bulk data were detected as previously described [50]. Briefly, segments were called for the same fore- and background BAM files as those used in SNV with Exome CNV [55] and Varscan2 [56]. Overlapping segments called by both tools were used as CNA segments.

Random forest

Random forest was used to generate the classifier for the histological type and MSI status of human cancer. We used gene expression levels, number of SNVs in each gene, total mutation (SNV/indel) number, and mutation density (total number of SNVs/indels divided by target region size) as explanatory variables. Using TCGA data [24, 35], we first filtered out unimportant explanatory variables using the two-sided Kruskal-Wallis test with P values of 5.00 × 10−5 and 1.00 × 10− 9 yielding 171 and 78 variables for histological type and MSI status, respectively. These were used to train a random forest classifier with the “randomForest” function in the “randomForest” package of R software, with the options ntree = 10000 (setting the number of trees to grow to 1000) and mtry = 5 (setting the number of variables randomly sampled to 5). Using the created classifier, the same explanatory variables for mouse data were used to classify each feature in the mouse model.

MDS of mouse cell and TCGA samples

We first identified TCGA samples with gene expression patterns similar to the mouse single-cell groups. For that purpose, we calculated a normalized 1 − r distance as follows:

$$ {d}_{h,G}=frac{1-{r}_{h,{m}^G}}{mathrm{MADN}left(1-{r}_{m_i^G,{m}^G}right)}, $$


where ri,j is a Pearson correlation coefficient between vectors i and j of expression levels in log across highly variable genes, h represents a human TCGA sample, G represents a mouse single-cell group, ( {m}_i^G ) represents mouse single cell i in group G, mG represents the centroid of ( {m}_i^G ) that was calculated by the median, and MADN represents the median absolute deviation adjusted by a factor for asymptotically normal consistency. We calculated this distance from a TCGA sample to every mouse group and selected a TCGA sample for those whose minimum distance across the groups was less than 4.05 and the difference between the first and second minimum distances was larger than 0.31. For selected TCGA and all mouse single-cell samples, MDS was performed based on the distance of 1 − r.


We called SNVs in single-cell RNA-seq using SCmut [37], which is specifically designed to identify SNVs (heterozygous in a tumor sample but homozygous in the normal sample) in single-cell RNA-seq data with the aid of bulk-cell DNA-seq data. In their study [37], SCmut was applied to Illumina sequencing reads obtained using the Fluidigm C1 system, which is the same system that we used.

Statistical analysis

All statistical analyses were performed using R ( Symbols “*” and “**” indicate p < 0.05 and 0.01, respectively, unless otherwise noted. The two-sided Wilcoxon rank sum test was used in the analysis shown in Fig. 2B. A bootstrapping test was used in the analysis shown in Fig. 4D. The details of this bootstrapping test are described above in the subsection “MDS of DNA data and the diversity index.” The number of samples (n) used in statistical analyses is indicated in each figure or figure legend as appropriate.

Read more here: Source link