PCA from plink2 for SGDP using a pangenome and DeepVariant

Hi there,

I’m doing my first experiments with PCA and UMAP as dimensionality reductions to visualize a dataset I’ve been working on. Basically, I took the samples from the SGDP, mapped them to the human pangenome, and then called small variants with DeepVariant.

I moved on to some PopGen analyses and, as a preliminary inspection of the groups in this panel, I’m running a PCA with Plink2. Starting from the joint callset for these ~300 samples, I removed genomic regions that could be troublesome, e.g. repeats, cent&sat, low-mappability regions and SDs. I then attempted my first PCA but, for some reason, the samples are smeared all over the plot… (see figure below)

Looking around, I found this old but very useful post on how things should be done: I should have removed INDELs and kept only bi-allelic SNPs. So my next step was to run the following on my VCF file:

bcftools norm -m+ $VCF | bcftools view -m2 -M2 -v snps -Oz -o $new_file_name
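As a quick sanity check on this filtering step, something like the sketch below could confirm that only bi-allelic SNPs are left. The function name `count_non_biallelic_snps` is made up for illustration; it reads VCF records on stdin and counts every record whose REF or ALT is not a single base.

```shell
# Hypothetical helper (not part of bcftools or plink2): count VCF records
# that are NOT simple bi-allelic SNPs. Reads VCF text on stdin.
count_non_biallelic_snps() {
    awk -F '\t' '
        # Skip header lines; flag records where REF (col 4) or ALT (col 5)
        # is anything other than exactly one base (indels, multi-allelic
        # ALT such as "G,T", "*" alleles, symbolic alleles).
        !/^#/ && !($4 ~ /^[ACGTNacgtn]$/ && $5 ~ /^[ACGTacgt]$/) { bad++ }
        END { print bad + 0 }
    '
}
```

It could be run as `bcftools view -H "$new_file_name" | count_non_biallelic_snps`, and a result of 0 would suggest the filter did what was intended.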

However, the result didn’t change significantly. The smearing issue persists and there are no defined clusters/groups in the plot…

For reference, these are the Plink2 commands I’m using to generate the eigenvec and eigenval files for plotting:

./plink2 --vcf $VCF --set-missing-var-ids @:#:\$r:\$a --rm-dup --indep-pairwise 200kb 0.5 --not-chr X,Y,MT --vcf-half-call m --out SGDP_snps_bi_norm

./plink2 --vcf $VCF --set-missing-var-ids @:#:\$r:\$a --not-chr X,Y,MT --vcf-half-call m --maf 0.05 --extract SGDP_snps_bi_norm.prune.in --make-pgen --pca --out SGDP_snps_bi_norm

I double-checked these commands with the author of the tool. I’m kind of lost on what’s going wrong. If anyone has more experience with this type of analysis, any help is much appreciated. Thanks in advance!
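For anyone looking at the eigenval file written by the second Plink2 command above, a minimal sketch like the following could show how much each PC contributes relative to the others. The function name `pc_share` is made up for illustration, and it assumes the file holds one eigenvalue per line. Since only the top PCs are written, the numbers are shares among the reported PCs, not of the total variance.

```shell
# Hypothetical helper: print each eigenvalue's share of the sum of all
# eigenvalues listed in the file (one eigenvalue per line is assumed).
pc_share() {
    awk '
        NF { v[++n] = $1; s += $1 }   # store non-empty lines, keep a running sum
        END {
            for (i = 1; i <= n; i++)
                printf "PC%d\t%.4f\n", i, v[i] / s
        }
    ' "$1"
}
```

It could be called as `pc_share SGDP_snps_bi_norm.eigenval`. If the first few PCs have nearly equal shares, that would be consistent with the lack of clear structure in the plot.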
