I’ve been trying to set up an analysis pipeline for RNA velocity on AWS EC2. I used one of the 10x datasets, 10k Peripheral blood mononuclear cells (PBMCs) from a healthy donor, Single Indexed, as a test case for setting up the pipeline. To save time and cost, I first used samtools to sort the 10k PBMC BAM file from 10x by cell barcode, using the following command:
samtools sort -l 7 -m 2048M -t CB -O BAM -@100 -o /temp/home/cellsorted_PBMC.bam /temp/home/PBMC_10K.bam
and then ran velocyto:
velocyto run -b filtered_feature_bc_matrix/barcodes.tsv -o /temp/home -m GRCh38_rmsk.gtf cellsorted_PBMC.bam.bam refdata-gex-GRCh38-2020-A/genes/genes.gtf
velocyto complained that there is no CB tag in the 10k PBMC BAM. When I examined the sorted BAM file, I saw no CB tag at all, as in this record:
A00519:643:HCMYWDSXY:4:2172:22525:26287 16 chr1 148893 255 91M * 0 0 ACATGGCAAGATCCCGTCTCTATGATAAAAAATTAGCTGGACATGGTGGCACATGTCTGTAGTCCCAGCTACTTGGGAGACTGAAGTGAGA FFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:6 HI:i:2 AS:i:89 nM:i:0 RG:Z:SC3_v3_NextGem_SI_PBMC_10K:0:1:HCMYWDSXY:4 TX:Z:ENST00000484859,+724,91M GX:Z:ENSG00000241860 GN:Z:AL627309.5 fx:Z:ENSG00000241860 RE:A:E MM:i:1 xf:i:17 CR:Z:GCAGCTCTGTGAATAT CY:Z:FFFFFFFFFFFFFFFF UR:Z:TCTAAAACCTAC UY:Z:FFFFFF:FFFFF UB:Z:TCTAAAACCTAC
The original BAM, before this sort, does have the CB tag:
A00519:643:HCMYWDSXY:3:2144:3649:12790 16 chr1 498309 1 65M26S * 0 0 GGCCAAAATATGTAAGCACATTTGCATTTATTAGGCACTTTATTTCCATTATTACACTGTGATATCCCATGTACTCTGCGTTGATACCACT F,,,FF:F,FFFFF:FFFF:FFFFFFFFFFFFF:FFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFF:FFF:FFFFFFFFFFF NH:i:4 HI:i:1 AS:i:61 nM:i:1 ts:i:26 RG:Z:SC3_v3_NextGem_SI_PBMC_10K:0:1:HCMYWDSXY:3 RE:A:I xf:i:0 CR:Z:TCATTGTAGTATAGAC CY:Z:FFFFFFFFFFFFFFFF CB:Z:TCATTGTAGTATAGAC-1 UR:Z:ACTCTAATCTGC UY:Z:FFFFF:FFFFFF UB:Z:ACTCTAATCTGC
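Comparing single records only shows what sits at the top of each file. A minimal sketch for tallying records with and without a CB tag across a whole file is below. The helper name `count_cb` is hypothetical, and it assumes only that SAM text arrives on standard input (for example from `samtools view`).

```shell
# Hypothetical helper: count SAM records with and without a CB:Z: tag.
# Reads SAM text from stdin; header lines (starting with @) are skipped.
count_cb() {
  awk -F'\t' '
    /^@/ { next }                      # skip header lines
    {
      found = 0
      # optional tags start at field 12 in a SAM record
      for (i = 12; i <= NF; i++) {
        if ($i ~ /^CB:Z:/) { found = 1; break }
      }
      if (found) tagged++; else untagged++
    }
    END { printf "with_CB=%d without_CB=%d\n", tagged, untagged }
  '
}
```

For example, `samtools view /temp/home/cellsorted_PBMC.bam | count_cb` would report both counts for the sorted file, and the same pipe on the original BAM gives the numbers to compare against.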
Interestingly, when I sorted a smaller, truncated version of PBMC_10K.bam (created with samtools view -h Parent_NGSC3_DI_PBMC_possorted_genome_bam.bam | head -n 10000 | samtools view -bS > test.bam) using exactly the same samtools command, the CB tag was preserved in the sorted BAM.
Does anybody have any idea why sorting the entire PBMC_10K.bam by CB removes the CB tag from the sorted BAM, while the CB tags are preserved when sorting the smaller version of the same BAM? I’d appreciate any pointers. Thanks.
I am using:
samtools --version
samtools 1.11
Using htslib 1.11
Copyright (C) 2020 Genome Research Ltd