How much can I rely on DNA segments less than 8cM?

Part 2

A brief summary of the data is at this Google Sheet. I’ve been (casually, not earnestly) thinking about trying to locate a child and both parents who have done a 30X WGS or better, and at least a couple of their cousins (preferably 2nd through 4th) who also have done WGS and who are all willing to share their BAM files with me. It would take some computing horsepower and some work in Python or R, but it would be interesting to deconstruct their reported matches and compare the WGS data to the microarray results.

My informal GEDmatch experiment didn’t yield quite what I’d expected. Using the GEDmatch default settings as-is, one-to-many runs at ≥ 30.5cM were essentially identical across the board. Deviations of up to 8.5% appeared at ≥ 20cM. At the next selectable threshold, ≥ 10cM, the disparities were astonishing. The microarray tests differed markedly at that level, with the current ones using the Illumina GSA chipset being the worst performers, meaning they showed the greatest numbers of matches that stood a good chance of being false positives. At ≥ 10cM, the worst performer indicated that as few as 1 in every 3.1 reported matches was likely to be valid.

To look at that particular microarray test's data further, I started with the first reported one-to-many match at 10cM and ran one-to-one comparisons against the next 50 matches (the second tab on that Google Sheet, labeled “10cM-Sampling”). Those 50 reported matches yielded a total of 69 segments. Of those, only 14 also appeared as matches to the WGS superkit: a potential 79.7% segment error rate, as opposed to the aggregate summary rate of 68%.
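For anyone who wants to check the arithmetic, here is a minimal Python sketch of how those two percentages fall out of the counts quoted above. The function name `error_rate` is mine, purely for illustration; the inputs are the 14-of-69 sampling result and the "1 in every 3.1" aggregate figure.

```python
def error_rate(confirmed: float, reported: float) -> float:
    """Fraction of reported items that were NOT confirmed by the WGS kit."""
    if reported <= 0:
        raise ValueError("reported must be positive")
    return 1 - confirmed / reported


# Sampling tab: 69 segments from 50 matches, 14 also matched the WGS superkit.
sample_rate = error_rate(14, 69)

# Aggregate summary: roughly 1 valid match in every 3.1 reported.
aggregate_rate = error_rate(1, 3.1)

print(f"sampled segment error rate: {sample_rate:.1%}")   # 79.7%
print(f"aggregate match error rate: {aggregate_rate:.1%}")  # 67.7%
```

Note that the first figure counts segments and the second counts matches, so the two are not strictly like-for-like.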

I hadn’t expected potential error rates that high that quickly. My guess had been that at ≥ 20cM I’d see matching rates nearly identical with the superkit, and that the discrepancies at ≥ 10cM would be somewhere around 15-20%…not 70%.

The data implied that, using GEDmatch’s default settings, a segment reported at 20cM would be real roughly 92% of the time. That's still not good enough to call precise, but it's a fair trade-off as a minimum threshold for genealogical purposes. That the accuracy improved dramatically as we approached 30cM would imply that a sweet spot probably lives somewhere in the low 20s. Conversely, the drop-off from roughly 92% at 20cM to roughly 30% at 10cM would imply that we need to be well above 10cM to infer an actual match.

There are thresholds we can manipulate at GEDmatch to help with accuracy. One is to never leave the “overlap cutoff” setting at its default of 45,000. That allows far too few of the same markers to be in the comparison. If our microarray tests average around 650,000 markers and the lowest overlap is 17%, that equates to over 110,000 markers. GEDmatch works with “slimmed” versions of the uploaded data, which lowers that figure, but 17% is already the worst-case scenario. I’d advise using 90,000 as the minimum overlap cutoff and dropping to 72,000 only with caution.
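A rough Python sketch of that overlap arithmetic, using only the numbers quoted above. The names are mine, and this ignores GEDmatch's "slimming" entirely (I don't know how many markers survive it), so treat the result as an upper bound for the worst case, not as what GEDmatch actually compares.

```python
AVG_MARKERS_PER_TEST = 650_000   # rough microarray average quoted above
WORST_CASE_OVERLAP = 0.17        # lowest overlap between two test versions


def shared_markers(markers_per_test: int, overlap_fraction: float) -> int:
    """Approximate number of markers two tests have in common."""
    return round(markers_per_test * overlap_fraction)


worst_case = shared_markers(AVG_MARKERS_PER_TEST, WORST_CASE_OVERLAP)

# How each overlap cutoff compares with that worst-case shared-marker count.
cutoff_share = {c: c / worst_case for c in (45_000, 72_000, 90_000)}

print(f"worst-case shared markers: {worst_case:,}")
for cutoff, share in cutoff_share.items():
    print(f"cutoff {cutoff:>6,} = {share:.0%} of worst-case overlap")
```

The point is just that the 45,000 default sits well under even the worst-case overlap, while 90,000 is much closer to it.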

The SNP count definitely matters. Prior to GEDmatch Genesis the minimum was 700 SNPs; when Genesis went into production that became a dynamic range between 200 and 400; now it’s “about 2/3 of segments will have between 185 and 214 SNPs.” I consider the original 700 somewhat reasonable, but the modifications were made to accommodate tests that overlap on only a minority of the same SNPs. In order to keep reported match numbers high, GEDmatch decreased the SNP density requirements.

Painting with a very broad brush, 1cM approximates about 1 million base pairs, with a lot of variation depending on chromosomal location. Our microarray tests look at only about one marker in every 4,800 base pairs, on average. At that relative density, there should be just over 200 SNPs per centiMorgan. Anything much lower than that means fewer SNPs have been examined in a comparison between two tests than the approximate physical average across the genome that was tested by the microarray. Caution should be applied. If you see, for example, a 7cM segment reporting 300 matching SNPs when the genomic average should be closer to 1,400, it can be an indication that the comparison is flawed.
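Here is that density estimate as a small Python sketch. Both constants are the broad-brush averages from the paragraph above (real cM-to-base-pair ratios vary a lot by location), and the function name is just illustrative.

```python
BP_PER_CM = 1_000_000   # very rough genome-wide average, varies by region
BP_PER_MARKER = 4_800   # about one microarray marker per 4,800 base pairs


def expected_snps(cm: float) -> float:
    """Approximate number of tested SNPs expected in a segment of `cm` centiMorgans."""
    return cm * BP_PER_CM / BP_PER_MARKER


per_cm = expected_snps(1)         # a bit over 200
seven_cm = expected_snps(7)       # in the neighborhood of 1,400
coverage = 300 / seven_cm         # the 7cM / 300-SNP example from the text

print(f"expected SNPs per cM: {per_cm:.0f}")
print(f"expected SNPs in 7cM: {seven_cm:.0f}")
print(f"a 300-SNP 7cM segment covers about {coverage:.0%} of that")
```

So the example 7cM segment is being judged on roughly a fifth of the markers you'd expect the chip to have tested across that stretch.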

Super-lacking in precision, but if you take that 200 SNPs per cM and halve it to 100 per cM, I think you end up with a reasonable threshold that’s simple to apply, e.g., a 10cM segment should be comparing, give or take, around 1,000 SNPs. Much lower than that, be skeptical.
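If you keep your segment data in a spreadsheet or script, that rule of thumb is easy to apply in bulk. A minimal Python sketch, assuming you've copied each segment's cM and SNP count out of the one-to-one results by hand; the `Segment` class, the `is_suspect` helper, and the sample rows are all hypothetical, and 100 SNPs per cM is my rough threshold, not a GEDmatch setting.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    chromosome: str
    cm: float
    snps: int


def is_suspect(seg: Segment, min_snps_per_cm: float = 100.0) -> bool:
    """True when the segment compared fewer SNPs than the rule of thumb asks for."""
    return seg.snps < seg.cm * min_snps_per_cm


# Hypothetical rows for illustration only, not real match data.
segments = [
    Segment("1", 7.0, 300),     # far too sparse for 7cM
    Segment("2", 10.0, 1000),   # right at the threshold, passes
    Segment("5", 12.5, 900),    # below 1,250, be skeptical
]

flagged = [s for s in segments if is_suspect(s)]
for s in flagged:
    print(f"chr{s.chromosome}: {s.cm}cM with {s.snps} SNPs looks thin")
```

Passing the threshold doesn't make a segment real; it only means the comparison had a reasonable number of markers to work with.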

At the bottom of the GEDmatch free one-to-one autosomal comparison tool is a little checkbox to “prevent hard breaks.” I recommend that be kept unchecked. The distance between matching SNPs of up to a half million base pairs is already arguably excessive. Using that checkbox to allow even larger gaps does nothing to improve accuracy of the comparison.

I won’t dig into a fifth point, match pile-up areas. These typically originate via something called linkage disequilibrium, and they mean that many small segments cannot be traced to specific ancestors because they are too old and spread too pervasively throughout similar regional, clan/tribe, and even familial populations. Without this biological foible, the testing companies would have no way to even start trying to provide the “ethnicity estimates” that they do.

A simple way to illustrate the effect this can have is to take two GEDmatch kits from people who share a regional “founder population,” like you and I do via Great Britain, and do a one-to-one match at the default settings but with the centiMorgan threshold dropped down to the floor-minimum of 3cM. For instance, you can run your kit against my 23andMe v5 test, ZL4037910. Even though most of my great-grandparental lines were in America before the late 1700s (and at a cursory glance you and I have no surnames in common), you’ll find that, with its free autosomal comparison tool, GEDmatch shows us as sharing 19.6cM over 4 segments. With my AncestryDNA v2 test (CS3670291) it’s even crazier: 61.4cM over 17 segments.

I’d like to claim cousinship, but it’s improbable that any of those segment “matches” are valid.

