Mostly because it’s typically unnecessary: reference-based classification should yield a similar result without being exposed to the biases that integration can introduce.
SingleR (and presumably Seurat, though I don’t use it, so I can’t say for sure) takes a reference dataset and asks, for each cell, “Which reference sample’s expression profile are this cell’s counts most similar to?” It then assigns the corresponding label, provided the answer is reasonably clear; SingleR will refuse to label cells whose scores are very ambiguous. In most cases with SingleR you can ignore potential batch effects, since the comparison is entirely relative, as long as you’re willing to assume that any such effect applies to all cell types roughly equally. For Seurat’s method, it’s not clear to me whether the adjusted, integrated counts are what get used, but I’m assuming so.
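To make the "most similar reference profile" idea concrete, here is a minimal Python sketch of rank-correlation labeling with an ambiguity cutoff. It is a simplified illustration of the general idea, not SingleR's actual implementation: the function names, the toy expression values, and the `min_margin` cutoff are all assumptions made up for this example.

```python
import numpy as np

def rank_rows(x):
    # Rank values within each row (0 = smallest). Ties are not averaged,
    # which is fine for this toy data because it contains no ties.
    return np.argsort(np.argsort(x, axis=1), axis=1).astype(float)

def spearman_scores(cells, reference):
    # cells: (n_cells, n_genes), reference: (n_labels, n_genes).
    # Spearman correlation is Pearson correlation computed on ranks.
    rc, rr = rank_rows(cells), rank_rows(reference)
    rc -= rc.mean(axis=1, keepdims=True)
    rr -= rr.mean(axis=1, keepdims=True)
    rc /= np.linalg.norm(rc, axis=1, keepdims=True)
    rr /= np.linalg.norm(rr, axis=1, keepdims=True)
    return rc @ rr.T  # (n_cells, n_labels)

def assign_labels(scores, label_names, min_margin=0.1):
    # Leave a cell unlabeled (None) when the best score does not beat the
    # runner-up by at least min_margin (a hypothetical cutoff).
    ordered = np.sort(scores, axis=1)
    margin = ordered[:, -1] - ordered[:, -2]
    best = scores.argmax(axis=1)
    return [label_names[b] if m >= min_margin else None
            for b, m in zip(best, margin)]

label_names = ["T", "B"]
reference = np.array([[9, 8, 7, 1, 2, 3],    # toy "T" profile
                      [1, 2, 3, 9, 8, 7]])   # toy "B" profile
cells = np.array([
    [32, 29, 26, 8, 11, 14],  # the T profile scaled and shifted (3x + 5)
    [5, 1, 4, 2, 6, 3],       # resembles neither profile
])

scores = spearman_scores(cells, reference)
labels = assign_labels(scores, label_names)
print(labels)  # ['T', None]
```

The first cell is the "T" profile after a uniform scale and shift, a crude stand-in for a batch effect that hits every gene the same way. Its ranks are unchanged, so it still scores 1.0 against "T", which is the sense in which the comparison is relative. The second cell scores almost identically against both profiles and is left unlabeled.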
Your method does much the same thing in a more manual way, but it assumes that every cluster is a homogeneous cell type, which may or may not be true. Additionally, integration methods have the unfortunate side effect of cramming populations together whether or not they’re actually biologically similar; in my experience they often need parameter tweaks to preserve unique populations. Reciprocal PCA methods (which Seurat supports) are generally more conservative and have a softer touch, so you could consider trying that if you suspect this is happening.
I can’t speak to why your dataset isn’t performing well with Seurat’s methods, though the main concern off the top of my head is that your query dataset may contain cell types not found in the reference dataset. In such cases, those cells may still be labeled, just incorrectly. I don’t know whether you can get the full score matrix (each cell against each potential label) out of Seurat, but if so, a closer look at it could indicate which cell types are really causing issues in your data.
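If you can get such a matrix exported, the "closer look" could be as simple as the Python sketch below. It assumes a plain cells-by-labels score matrix already loaded as a NumPy array; the function name, the toy scores, and the `min_score` cutoff are hypothetical, and nothing here is a Seurat or SingleR API.

```python
import numpy as np

def summarize_by_label(scores, label_names, min_score=0.5):
    # For each assigned label, report how many cells received it and what
    # fraction of them had a top score below min_score (hypothetical cutoff).
    best = scores.argmax(axis=1)
    top = scores.max(axis=1)
    summary = {}
    for j, name in enumerate(label_names):
        mask = best == j
        n_cells = int(mask.sum())
        if n_cells == 0:
            continue  # no cell was assigned this label
        summary[name] = {
            "n_cells": n_cells,
            "frac_low": float((top[mask] < min_score).mean()),
        }
    return summary

label_names = ["T", "B"]
scores = np.array([
    [0.90, 0.20],
    [0.80, 0.30],
    [0.35, 0.30],  # weak match to everything
    [0.20, 0.85],
    [0.40, 0.30],  # weak match to everything
])

summary = summarize_by_label(scores, label_names)
for name, stats in summary.items():
    print(name, stats)
```

A label where a large fraction of its cells only barely won is a reasonable place to start looking for populations that are missing from the reference and are being forced into the nearest available label.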