scrnaseq – Integrating scRNA-seq data using raw data

I believe that when you say alignment, you mean aligning reads to a genome (or sometimes to a transcriptome) and counting them to get count matrices. In the aforementioned paper, however, what is meant is “bringing different data sets to a level where they can be compared/integrated/…”. Basically, scRNA-seq data are heavily prone to batch effects: if you try to integrate different data sets (let’s say data generated with 10x 3′ and 10x 5′ chemistries), there is a good chance that most of the signal will come from technical differences rather than biology, for example resulting in cells clustering by scRNA-seq technology and not by cell type/state. Hence there are many batch-effect-removal/data-integration/data-alignment tools for scRNA-seq.

Seurat has a nice CCA tutorial. You would not need the raw data; count data would be fine, but you will need to be able to specify which data come from which source. The authors must have provided this information.
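To make the "count data plus source labels" point concrete, here is a minimal sketch in plain Python. It is not Seurat's API: the function name `combine_with_labels`, the toy barcodes and the source labels are all made up for illustration, and keeping only the genes shared by every source is just one possible choice.

```python
def combine_with_labels(datasets):
    """Merge per-source count tables and remember each cell's source.

    datasets maps a source label to {cell_barcode: {gene: count}}.
    (Hypothetical layout, chosen only for this illustration.)
    """
    # Keep only genes present in every source so the columns line up.
    gene_sets = [
        {gene for cell in cells.values() for gene in cell}
        for cells in datasets.values()
    ]
    shared_genes = sorted(set.intersection(*gene_sets))

    cell_ids, sources, matrix = [], [], []
    for source, cells in datasets.items():
        for barcode, counts in cells.items():
            # Prefix the barcode: the same barcode can occur in two runs.
            cell_ids.append(f"{source}_{barcode}")
            sources.append(source)
            matrix.append([counts.get(gene, 0) for gene in shared_genes])
    return shared_genes, cell_ids, sources, matrix


# Toy data: two cells from one chemistry, one cell from the other.
toy = {
    "10x_3prime": {
        "AAAC": {"CD3E": 4, "MS4A1": 0},
        "AAAG": {"CD3E": 0, "MS4A1": 7},
    },
    "10x_5prime": {
        "AAAC": {"CD3E": 2, "MS4A1": 1, "LYZ": 3},
    },
}

genes, cell_ids, sources, matrix = combine_with_labels(toy)
```

The `sources` list is the piece of information an integration tool needs: it says, for every row of the combined matrix, which data set that cell came from.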

As for what an “anchor” is, you would need to read about the CCA technique, but in a rather naïve way you can think of anchors as commonalities between data sets, or as some kind of reference, that are used to “align” the different data sets.
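One way to picture anchors: as far as I understand the Seurat v3 approach, they are roughly pairs of cells, one from each data set, that are each other's nearest neighbours in a shared low-dimensional space. The sketch below shows only that mutual-nearest-neighbour idea on made-up 2D points. It skips the CCA step and any anchor scoring or filtering, and the function names are my own, not Seurat's.

```python
import math


def nearest(query, reference):
    """Index of the reference point closest to query (Euclidean distance)."""
    return min(range(len(reference)), key=lambda j: math.dist(query, reference[j]))


def mutual_nearest_pairs(batch_a, batch_b):
    """Return (i, j) pairs where a[i] and b[j] are each other's nearest neighbour."""
    a_to_b = [nearest(point, batch_b) for point in batch_a]
    b_to_a = [nearest(point, batch_a) for point in batch_b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


# Toy coordinates standing in for cells in a shared reduced space.
batch_a = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
batch_b = [(0.5, 0.2), (5.2, 4.9), (5.4, 5.3)]

anchors = mutual_nearest_pairs(batch_a, batch_b)
print(anchors)  # [(0, 0), (1, 1)]
```

The third cell of `batch_a` has no counterpart in `batch_b`, so it does not end up in any pair. That is the intuition: only cells that look like "the same thing" in both data sets serve as reference points for the alignment.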

Moreover, reducing the number of genes from tens of thousands to a few thousand is just fine and is not really specific to CCA. Not all genes are expected to be expressed at all times in all cell types, so many of them are not that informative. Also, scRNA-seq often fails to pick up lowly expressed genes, which then show up with 0 counts in the count matrix, so again they are not that informative/interesting for the downstream analysis steps.
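As a rough illustration of that gene reduction, the sketch below drops genes that are never detected and then keeps the genes with the highest variance across cells. This is a deliberately simplified stand-in: real tools (for example Seurat's `FindVariableFeatures`) account for the mean-variance relationship rather than ranking by raw variance, and the function name, gene names and thresholds here are assumptions for the example.

```python
from statistics import pvariance


def select_variable_genes(counts, gene_names, n_top=2, min_cells=1):
    """Pick the n_top most variable genes among those detected in >= min_cells.

    counts holds one row per gene, each row a list of per-cell counts.
    """
    scored = []
    for name, row in zip(gene_names, counts):
        detected = sum(1 for count in row if count > 0)
        if detected < min_cells:
            continue  # all-zero (or rarely detected) genes carry no signal
        scored.append((pvariance(row), name))
    # Highest variance first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:n_top]]


toy_genes = ["GeneA", "GeneB", "GeneC", "GeneD"]
toy_counts = [
    [0, 0, 0, 0],     # GeneA: never detected
    [5, 5, 5, 5],     # GeneB: detected, but identical in every cell
    [0, 10, 0, 12],   # GeneC: clearly differs between cells
    [1, 0, 2, 1],     # GeneD: mild variation
]

selected = select_variable_genes(toy_counts, toy_genes, n_top=2)
print(selected)  # ['GeneC', 'GeneD']
```

GeneA is removed because it is all zeros, and GeneB ranks last because a gene with the same count in every cell cannot help separate cell types.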
