Let me start by thanking everyone who has helped me lately.
I almost feel bad about how many questions I have asked in the last few weeks, but the answers were always a great help, so thanks for that!
And yet I have another question. As described in my previous questions, I am running a QuantSeq experiment with 600 patients divided into 2 groups, in which we try to gain insight into the pathophysiological differences between the groups. The plan is to run a DE analysis (edgeR), GSEA and GO enrichment, and to illustrate the results with a network (Cytoscape).
Lately I have been discussing the importance of pseudogenes and lncRNAs with colleagues. Their opinion is that these should be removed and that we should only look at protein-coding RNAs. One reason is that including genes that are unlikely to matter increases the number of tests, which makes the FDR correction stricter and so reduces statistical power (the same argument as for filtering out lowly expressed genes).
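To make that argument concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. The p-values are made up for illustration and do not come from my data; the point is only that the same small p-values get larger adjusted values when more null-like tests are included.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values: three genes with signal, plus genes that look null.
signal = [0.0005, 0.004, 0.01]
null_like = [0.2 + 0.08 * k for k in range(10)]

adj_small = benjamini_hochberg(signal + null_like[:2])  # 5 tests in total
adj_large = benjamini_hochberg(signal + null_like)      # 13 tests in total

for i in range(len(signal)):
    print(signal[i], round(adj_small[i], 4), round(adj_large[i], 4))
```

Note that this only helps if the removed genes really are mostly null. If some lncRNAs or pseudogenes carry true signal, removing them trades those findings for power elsewhere.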
I agree that excluding unimportant genes increases statistical power, but having read some articles, especially about lncRNAs, I am not so sure that those genes are unimportant. lncRNAs are thought to have regulatory functions in the cell, and some even seem to encode small peptides/proteins. Pseudogenes, on the other hand, are mostly nonfunctional, but even some pseudogenes seem to have regulatory functions.
So my question is whether anyone has experience with this kind of experiment, and what your approach was. Is it reasonable to include only protein_coding genes? Or would this lead to a significant loss of information? All opinions and experiences are greatly appreciated!
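For clarity, this is roughly what I mean by the protein_coding-only option: a small sketch that splits a gene list by biotype and counts what would be dropped. The gene IDs and the `biotype` mapping are hypothetical placeholders; in practice the biotype would come from the annotation used for quantification.

```python
from collections import Counter

# Hypothetical annotation: gene ID -> biotype (placeholder IDs, not real genes).
biotype = {
    "gene_A": "protein_coding",
    "gene_B": "lncRNA",
    "gene_C": "protein_coding",
    "gene_D": "processed_pseudogene",
    "gene_E": "lncRNA",
}


def split_by_biotype(gene_ids, biotype, keep=("protein_coding",)):
    """Split gene IDs into those whose biotype is in `keep` and the rest."""
    kept = [g for g in gene_ids if biotype.get(g) in keep]
    dropped = [g for g in gene_ids if biotype.get(g) not in keep]
    return kept, dropped


kept, dropped = split_by_biotype(list(biotype), biotype)
dropped_counts = Counter(biotype[g] for g in dropped)

print("kept:", kept)
print("dropped per biotype:", dict(dropped_counts))
```

Genes without any biotype entry would also end up in `dropped` here, which may or may not be what one wants.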
Edit: one important note. A pragmatic argument could also be made for excluding non-coding genes, because it seems to me that research on pseudogenes and lncRNAs is still at an early stage. If the currently available databases (for example GO) do not annotate the function of these genes, then an enrichment analysis cannot tell you anything about them, even if they are differentially expressed. (I hope this makes sense.)