BAMboozle removes genetic variation from human sequence data for open data sharing

Strategy for stripping human sequence data of genetic information

To lower the barriers in sharing sequence data, we propose, like others recently17, to remove information on genetic variation that could be used to infer the identity from aligned reads and compromises the privacy of the donor (Fig. 1a). Genetic variation, including SNPs, indels, and short-tandem repeats, is the main compromising information to remove in such data. Provided that count data does not carry sufficient information to allow the identification of individual patients, data processed to accurately remove all genetic variant information can—in principle—be shared openly according to the European General Data Protection Regulation (GDPR), fitting the criteria for “information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”18. If, however, it would turn out that count data does carry considerable amounts of information about individuals, a procedure that eliminates all genetic variant information might only be useful for pseudonymization, which describes methods that reversibly reduce identifying information (GDPR recital 26)19.

Fig. 1: Schematic overview of the BAMboozle procedure.
figure1

a Sequenced reads typically harbor identifying genetic variants that are deviating from the reference genome. Data sanitation removes such variants by replacement with the reference sequence. b Schematic representation of an aligned 100 bp read (red) with an alternative allele at a single-nucleotide polymorphism (SNP) position is corrected to the reference base in output read (blue). Raw and corrected CIGAR (Compact Idiosyncratic Gapped Alignment Report) strings (with operator descriptions below) are shown over reads, and raw and corrected MD tags are shown to the right. c Aligned 100 bp read (red) with donor-specific insertion (relative to reference), with information as in b. d Aligned 100 bp read (red) with donor-specific deletion (relative to reference), with information as in b. e Aligned 100 bp read pair sequence (paired-end data) that was clipped at the 5’ end relative to reference. Note the correction in alignment end position as a result of the removal of clipped positions. f Spliced alignment that additionally contains an insertion not present in reference. The spliced alignment information is preserved during the sanitation of genetic variation. g Spliced alignment that additionally contains a deletion not present in reference. The spliced alignment information is preserved during the sanitation of genetic variation. CIGAR string operators: M, alignment match; I, insertion; D, deletion; N, skipped in reference; S, soft clipping; H, hard clipping; P, padding; =, sequence match; X, sequence mismatch.

The removal of genetic variant information in typical sequence data needs to correctly handle multiple types of genetic variation (SNPs, insertions, deletions, short tandem repeats, immune haplotypes, etc.), present in different sequencing-based assays (e.g., RNA-seq, scRNA-seq, DNA-seq, ATAC-seq) from various sequencing platforms (Illumina, MGI, Pacific Biosciences, Oxford Nanopore Technologies) and strategies (e.g., single- or paired-end sequencing). At the same time, the remaining information such as alignment positions, custom Sequence Alignment Map (SAM) tags should be preserved as close to the original as possible, and the strategy should be agnostic to the sequence alignment tool used, as, for example, STAR20 and BWA21 report alignments differently that may affect the data processing. In contrast to a recent study17, we reasoned it would be possible to write a single, efficient, versatile, and intuitive tool for the removal of genetic information from genomics data that can easily be integrated into existing infrastructures.

In brief, removal of genetic information from data can be implemented for sequenced and aligned reads stored in the ubiquitous BAM format by replacing donor genetic variation with the sequence of the reference genome (Fig. 1a). This procedure cannot be applied to the unmapped reads in the BAM file since donor-specific genetic information is harder to identify with no matching reference sequence. Unmapped reads are therefore discarded. In the simplest case, a fully aligned sequence read containing a SNP is replaced with the reference base (Fig. 1b). In the case of insertions in the donor sequence, the non-reference portion of the sequenced read is discarded and the remaining sequence is extended by the length equal to the insertion while keeping the 5ˈ mapping position intact (Fig. 1c). Conversely, deletions are resolved by inserting the missing reference sequence and removing the equal numbers of bases from the 3’ end of the altered sequence read (Fig. 1d). Portions of the read that may be soft- or hard-clipped are replaced by reference sequence, as the non-matching sequence could stem not only from technical artifacts such as adapters but also from genetic variation. In the case of reads starting with clipping, for single-end sequencing data the reference position of the read start is adjusted; however, this is not possible for paired-end reads because it would invalidate the mate-pair information (TLEN and PNEXT fields). Instead for paired-end reads, the clipped sequence portion is added to the end of the read (Fig. 1e). In RNA-seq data, spliced alignments are common and reads can span several exon–exon junctions. For these reads, both the 5’ mapping position and the location of the splice junctions are preserved when scrubbing the sequence data of insertion and deletion events (Fig. 1f, g). In addition to the actual sequence field itself within the BAM file, information on the presence of genetic variation in the donor individual is stored in the CIGAR string and accessory fields. Therefore, the CIGAR field and MD tag are consequently corrected while removing each type of genetic variation (Fig. 1). Genetic information could be inferred from additional accessory fields, such as alignment score, mapping quality, and other alignment information (e.g., number of hits NH). To solve this, we update or remove these fields while leaving other auxiliary tags (e.g., sample information, cell barcode, gene identity, or custom flags) in place. While aiming to report data as close to the input sequence data as possible, resolving indel variation results in out-of-phase quality scores with respect to the base calls. The resulting privacy preserving BAM file is fully compliant with the SAM specifications22 (as confirmed by picard-tools validation; see “Methods”) and thus smoothly compatible with existing bioinformatics tools and pipelines. We implemented our strategy for removing genetic variation in an open-source stand-alone Python script, called BAMboozle, that only requires the input BAM file and genome or transcriptome reference sequence used for the alignment.

Importantly, the removal of SNPs and indels with BAMboozle is not dependent upon user-supplied lists or databases of SNPs and indels, rather all single-nucleotide or indel differences relative to the reference sequence are replaced by the annotated reference bases. Thus, complete removal of SNPs and indels with BAMboozle is possible even though current databases of SNPs and indels are incomplete. This procedure effectively removes all types of genetic variant information, as sequenced reads aligning to the genome are all reverted to reference sequence, and the remaining sequenced reads that fail to align are discarded.

Validating the absence of genetic information in BAMboozled sequence data

To illustrate the effectiveness of BAMboozle, we analyzed a recently published 10× Genomics23 dataset of scRNA-seq data generated from five equally abundant cell lines24 derived from five different donors. After preprocessing the raw data with zUMIs25, we summarized SNP coverage and assigned the 2937 high-quality (≥66% exonic, ≥25,000 reads) cells to their donor of origin using cellsnp-lite26 and vireo27. Cellsnp-lite and vireo can identify sample-related genetic variant information with and without predefined list of SNPs or indels. Here these tools were used to quantify donor-related genetic variation present in the raw sequence data and after BAMboozle processing. We projected the scRNA-seq data in two dimensions using UMAP to visualize the overall structure of the cellular transcriptomic data (Fig. 2a). Clearly, cells from each cell line (and type) were distinctly grouped in the UMAP visualization indicating cell line-specific gene expression patterns, and importantly, each of the cells from each main cluster could be assigned to a single donor (colored in Fig. 2a) corresponding to the individual from which the respective cell line was derived. Next, we processed the BAM file with BAMboozle and repeated the donor assignment using cellsnp-lite and vireo and the same settings. Importantly, analysis of processed human sequence data failed to assign any cell to a donor (i.e., the data lacks a donor structure) (Fig. 2b). The transcriptome information, however, was completely intact enabling a meaningful analysis of the gene expression changes between the five studied cell lines. Finally, we investigated the number of reads in the data that had donor-specific information, as quantified by cellsnp-lite26 and 23% of raw reads contained donor-specific information, whereas not a single read with donor-specific information remained after BAMboozle processing (Fig. 2c). Therefore, the complete lack of reads containing donor-related genetic variant information validated the removal of genetic information from human sequence data achieved with BAMboozle. As the cellsnp-lite analysis is currently not computationally feasible in unguided (whole-genome) mode for each individual cell, it was based on approximately 7.4 million most common human SNPs (see “Methods”). To confirm the absence of donor variation outside these previously known positions, we tabulated the coverage on every position of the genome over all cells using bcftools mpileup28 and could confirm that the stored reads contained only reference sequence. We note that the results presented here present one of several possible state-of-the-art analyses and could be repeated with other workflows for SNP and indel detection.

Fig. 2: Processing of human scRNA-seq and scATAC-seq data with BAMboozle.
figure2

a, b Uniform Manifold Approximation and Projection (UMAP) plot of scRNA-seq data (10× Genomics) from five different cell lines (H2228, H1975, A549, H838 and HCC827; in total 2937 cells), colored according to donor identity inferred from the a raw and b processed sequencing data. c Percentage of 10× Genomics scRNA-seq reads containing alternate allele information at interrogated variant positions, for raw and processed data, respectively. All cells were combined and variants with sufficient coverage (n = 75,461) were used. d, e UMAP plot of scRNA-seq data (Smart-seq3) generated from PBMCs (4 donors) and HEK293T cells (in total 3129 cells), colored according to donor identity inferred from the d raw and e processed sequencing data. f Percentage of Smart-seq3 scRNA-seq reads containing alternate allele information at interrogated variant positions, for raw and processed data, respectively. All cells were combined and variants with sufficient coverage (n = 1,503,063) were used. g, h UMAP plot of scATAC-seq data (10× Genomics) generated from one PBMC donor and GM12787 cells (in total 1497 cells), colored according to donor identity inferred from the d raw and e processed sequencing data. i Percentage of 10× Genomics scATAC-seq reads containing alternate allele information at interrogated variant positions, for raw and processed data, respectively. All cells were combined and variants with sufficient coverage (n = 95,127) were used.

Similar results were achieved when analyzing scRNA-seq data from human peripheral mononuclear cells (PBMCs) of four donors mixed together with a HEK293T cell line (derived from a separate donor) generated using the Smart-seq3 protocol15 (Fig. 2d–f). To demonstrate that BAMboozle can also correctly remove genetic information from other data types, we next analyzed single-cell ATAC-seq (scATAC-seq) data that was derived from a single donor of PBMCs together with a lymphoblastoid cell line of a second donor. While the donor structure was clearly present in the raw BAM file (Fig. 2g), the BAMboozle procedure removed the ability to identify donor structures (using cellsnp-lite), while maintaining the cell-type separation (Fig. 2h). Reassuringly, processed sequence data did not contain a single read with genetic variation (Fig. 2i), again demonstrating the complete removal of SNPs and indels with BAMboozle.

scRNA-seq data retain key utilities after removal of genetic information

To demonstrate the utility of scRNA-seq data processed with BAMboozle, we applied several typical downstream analyses to the data. First, we demonstrate that the numbers of genes detected at different sequence read depths are not affected by the removal procedure (Fig. S1a), since this benchmarking is an important standard for method comparisons. Thereafter, we demonstrate that the cellular expression levels were accurately computed, as exemplified by the two distinct clusters of B cells observed in the PBMC data (used above). The average gene expression levels (mean read counts) were virtually unchanged after removing genetic information (Fig. 3a, b). Unsurprisingly, testing for differential gene expression levels (see “Methods”) yielded identical results (Fig. 3c, d), providing confidence that the BAMboozle procedure did not introduce unwanted noise. Next, we performed RNA velocity, an important analysis strategy that can capture and quantify the dynamic of cellular gene expression changes29. Of note, in cases where human scRNA-seq data are only available as count tables, RNA velocity analysis is not possible, as it requires the summarization of spliced, unspliced, and junction-spanning read counts from BAM files. We show that data processed with BAMboozle retains the same velocity information as the original data (Fig. 4), validating the scar-free removal of genetic information around and on splice junctions.

Fig. 3: Accurate cellular transcriptomes in single-cell RNA-seq data after removal of genetic variants.
figure3

a Scatter plot showing accurate gene expression patterns in raw (y-axis) and BAMboozle-processed (x-axis) single-cell RNA-sequencing data. Data from 366 cells and 19,833 detected genes. b UMAP generated from processed sequence data maintains cell-type and state granularity, highlighting the two B cell clusters. c Volcano plots showing adjusted p values and log-fold changes (computed with limma-trend) in a comparison of B cell clusters 1 and 2, based on raw (left) and processed (right) sequence data, respectively. Genes are colored according to significance. d Scatter plots showing correlation in gene-level fold-changes estimated with limma-trend between B cell cluster 1 and 2.

Fig. 4: Genotype-free RNA-seq data maintains splicing kinetics information.
figure4

a RNA velocity inferred from raw single-cell RNA-sequence data, with estimated cellular flows overlaid on top of the UMAP visualization (b). b RNA velocity analysis of single-cell RNA-sequencing data after removal of genetic variation with BAMboozle show similar cell-state flows as in the analysis run on raw sequence data (a).

Naturally, analysis tools that are used for inferring clonal structures among cells are impossible to run on data lacking genetic variation, since not a single informative read was present in the data after applying BAMboozle. Finally, we analyze the scRNA-seq dataset consisting of five distinct cell lines for the presence of copy number variations (CNVs). As the inference of CNVs using the inferCNV package (see “Methods”) does not require additional sequence-derived information above the already widely shared gene expression count tables, we could infer CNVs even after applying BAMboozle (Fig. S1b).

Comparison of BAMboozle and ptools performance

Our implementation of BAMboozle differs in several key aspects from an alternative approach (called ptools) described by Gürsoy et al.17. First, BAMboozle does not restrict the removal of genetic information to the main chromosomes but rather takes all contigs in the user-provided genome reference. Alternative chromosome contigs and patches often contain highly polymorphic sites in the human genome, such as additional haplotypes for immunology-related genes such as major histocompatibility complex30. Second, whereas ptools retains original reads that are unmapped or aligned on contigs without a reference sequence, we discard this information in BAMboozle to prevent leakage of clearly identifiable genetic sequence. Furthermore, we aim to minimize the information loss during the data processing by keeping splice sites and 5’ mapping positions as intact as possible. We note that ptools does not retain information accurately in reads containing two or more spliced alignments (Fig. 5a), deletions in spliced alignments (Fig. 5b), and insertions in spliced alignments (Fig. 5c). The disruption of splice sites could potentially be used for identifying reads where the removal procedure was applied, and the amount of information leakage in such cases has not been systematically investigated. Furthermore, we aimed to provide an efficient and versatile implementation. To benchmark the strategies further, we processed a dataset consisting of 2.4 billion PE150 reads (the data used for Fig. 2d–f) with ptools and compared the performance with the processing with BAMboozle. The run-time of BAMboozle was 16.5-fold faster than ptools (5 h with BAMboozle vs. 80 h with ptools, Fig. 5d) while using 9.9-fold less storage space on disk at peak usage (Fig. 5e). Finally, since ptools did not remove genetic variation present in alternative chromosome contigs, we note that 81,326 genomic positions containing donor-specific genetic variation remained after ptools processing (as compared to 0 such positions with BAMboozle).

Fig. 5: Comparison between BAMboozle and ptools.
figure5

ac Spliced alignments with avoidable information loss during processing in ptools. Yellow boxes denote exons for gene models. Exemplary reads (original and processed by ptools or BAMboozle) are shown as gray boxes. Reference skip (N) for spliced alignments is denoted by gray lines. a Alignment covering several splice junctions is correctly reported by BAMboozle while ptools reports only the first splice site. b Spliced alignment with deletion. Information loss in BAMboozle is avoided by keeping 5’ mapping position and splice junction intact. c Insertion in a spliced alignment is resolved while keeping splice junction intact in BAMboozle. d, e Benchmarking of ptools and BAMboozle on 2.42 billion PE150 reads shows that run time for BAMboozle is d 16.5-fold faster (4 h 57 min vs 81 h 31 min) and e consumes 9.9-fold less storage space (378.5 GB peak vs 3741.1 GB peak).

Read more here: Source link