rna seq – How will Seurat handle pre-normalized and pre-scaled data?

I don’t do transcriptome analysis – it ain’t my thing – however I do understand statistical analysis, as well as the underlying issue regarding the public availability of molecular data. I agree with the OP that it’s not ideal. However, yes, the OP can continue with ‘clustering’; personally I would prefer the transformation be removed, or the analysis done from the raw data.

Summary: The authors of the data set are obliged to provide the raw data, presuming it’s published. Why? Because whilst the OP can perform the analysis, it is very likely the OP will be asked to repeat it on the untransformed data (again, this ain’t my area, so ‘very likely’ might be ‘possibly’ in this field). At that point the original authors would have to release their untransformed data, if a formal request followed from the OP’s analysis.

Please read the caveats and discussion.

  1. Firstly, could the OP reverse the normalisation? It’s just a transformation applied to all the data which, in their analysis, resulted in a normal distribution. I don’t know TPM, but if it was a log transform you would simply ‘un-log’ it (exponentiate); a square root, square it; a reciprocal, take the reciprocal again – that will ‘de-transform’ it. Can TPM be undone? I agree it ain’t ideal, from several perspectives, but it gets things moving quickly.

  2. Ask the authors for the raw data – this is what I would do. They are obliged to provide it if the data is published. Virtually all journals have this policy in place, although I agree that ASM remains an outlier in this respect (and they are a significant publisher). Any other journal will return to the authors and point out the ‘check-boxes’ they ticked when they submitted the manuscript. It is etiquette to go to the authors first and give them sufficient chance to respond, prior to step 3 (below) – giving them the absolute maximum benefit of the doubt.

  3. What do you do if the authors don’t respond? The OP would then contact the journal, explaining: a) why they need the data, i.e. the analysis performed; b) the analytical significance of the raw data, i.e. the ‘not ideal’ arguments the OP has raised and the general concerns raised here. I do have experience in this regard, and the authors will oblige if asked by the journal.

  4. Seurat’s cluster analysis certainly used tSNE … I believe it uses dimensionality reduction such as principal component analysis (PCA) – I don’t know exactly which they use – and then tSNE analysis (which I think you are calling ‘clustering’*). This is a very robust ‘clustering’* (please see footnote); there is no question this is as good as it gets. If that is solely your goal, I agree that dimensionality reduction calculations don’t need transformations; transformations are needed for parametric statistics and other areas of model fitting.
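The reverse-transformation idea in point 1 can be sketched in code. This is a minimal, hypothetical Python/NumPy sketch – the actual transform the authors applied is unknown, so the transforms and values below are assumptions purely for illustration:

```python
import numpy as np

values = np.array([0.0, 9.0, 99.0])  # hypothetical expression values

# log-transform and its inverse ('un-log' = exponentiate)
logged = np.log1p(values)            # log(1 + x), common for count data
assert np.allclose(np.expm1(logged), values)

# square-root transform and its inverse (square it)
rooted = np.sqrt(values)
assert np.allclose(rooted ** 2, values)

# reciprocal transform and its inverse (another reciprocal)
recip = 1.0 / (values + 1.0)         # offset avoids division by zero
assert np.allclose(1.0 / recip - 1.0, values)
```

Whether TPM can be undone is a different matter: it divides by gene length and a per-sample scaling factor, so exact inversion needs information (gene lengths, library sizes) that may not accompany the processed matrix.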
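The two-stage pipeline described in point 4 (dimensionality reduction, then tSNE) can be sketched with scikit-learn. Seurat itself is an R package, so this Python sketch on synthetic data only illustrates the idea – the group structure and matrix dimensions are invented:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# synthetic 'expression matrix': 300 cells x 500 genes, three fake groups
X = np.vstack([rng.normal(loc=m, size=(100, 500)) for m in (0.0, 1.0, 2.0)])

pcs = PCA(n_components=30).fit_transform(X)                    # stage 1: PCA
emb = TSNE(n_components=2, random_state=0).fit_transform(pcs)  # stage 2: tSNE

print(emb.shape)  # one 2-D point per cell -> the bivariate plot
```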

The problems with transformation

I agree the authors should not have done this because:

  1. Transformations are not perfect; if the choice of transformation is wrong – you’re stuck.
  2. Not only do you not need to transform the data, it’s better if you don’t – if the untransformed data results in nice ‘clusters’, that’s the preferred result. Dimensionality reduction + tSNE gives great results; that’s well known.

Corroborating the OP’s concern
Dimensionality reduction calculations do not need a normal distribution because they section up the observable variance to maximise the classification of the data. My opinion – in statistical analysis – is that if you need to transform, THEN perform dimensionality reduction (PCA-type calculations), THEN tSNE, the data has been heavily manipulated. That doesn’t mean the conclusions are not valid, but the data has been put through a lot, and it would STRONGLY SUGGEST THE RAW DATA HAS NOT GIVEN GOOD RESULTS. Thus, personally, I would be cautious – and why should the OP be put under unnecessary caution?
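The claim that dimensionality reduction partitions variance rather than assuming normality can be checked directly. In this synthetic sketch, PCA separates two groups even though every variable is heavily skewed (exponential, nothing like a normal distribution); the data are invented purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# two groups of heavily skewed (non-normal) data; group b shifted by +2
a = rng.exponential(scale=1.0, size=(100, 50))
b = rng.exponential(scale=1.0, size=(100, 50)) + 2.0
X = np.vstack([a, b])

pc1 = PCA(n_components=1).fit_transform(X).ravel()
# PC1 captures the between-group shift: group means land on opposite sides
print(pc1[:100].mean() * pc1[100:].mean() < 0)  # -> True
```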

It is important to note this is not my field and the field will have its own standards. If transformation is routine for all statistical calculations (except non-parametric ones, where it may not be valid), then the authors of the data will argue they are correct and no further data needs to be released. Personally, I really doubt it, but we’ll see what others say.

Footnote * Clustering is a statistical method usually resulting in a tree. It is generally avoided because there are so many different clustering methods giving different results and no agreement on which is the ‘optimal’ method. I think when the OP says ‘clustering’ they refer to the clusters seen following a tSNE analysis, which will be a bivariate plot.


From the comments: Failure to declare a reference genome is a formal omission that has to be rectified, and the journal is obliged to do this. Thus the OP would simply contact the journal; they may tell the OP to contact the authors, and when the authors respond, the response is passed back to the journal. If it’s an online journal you can simply leave a comment in the ‘comments’ section: the authors will answer. In this instance, officiating via the journal is preferred to ensure the information is verified by all parties – I presume the OP does not know the authors. From the comments I get the gist the OP is working under supervision, and would therefore just notify their supervisor, who should officiate on the OP’s behalf (they might then delegate it back to the OP – but at least the supervisor is formally aware of it).

In context, stuff happens, and there is a formal process to ensure parity is achieved; it simply takes patience to work through it.
