"Gary might have achieved the seemingly impossible task of writing a post longer than Edison's."
Kerry, I get the vague feeling that could be an implication I'm overly wordy in my posts here. Surely you jest...
"I do think your proposals throw the baby out with the bathwater."
They well might, Gary, but I believe it comes down to the purpose of the "Confirmed with DNA" status. That word "confirmed" has been debated extensively before, however, and it's been made clear that it will remain as-is for WikiTree.
From a standpoint of genetic genealogy, not of WikiTree policy, my personal opinion is that the propagation in the "DNA Connections" panel to ancestor profiles as potentially sharing autosomal DNA as far back as 6g-grandparents is sufficient to serve as "FYI" information and/or a research hint.
If the goal is accuracy in the use of DNA as evidence, I think we need to be informed and careful enough to discern, if there are many babies in the same bathwater, which baby is really ours and to let the rest go. And now that has to rate as my worst follow-on metaphor ever.
I won't take this too far off-topic; this is about Greg's super-handy app and not a general genetics discussion. But it's still germane to whatever guidelines evaluation might be done by the WikiTree DNA Project, so I'm going to take up appreciable screen space. Surprise.
Three weeks ago here on G2G we had a conversation about some of the difficulties with using autosomal DNA for distant relationships. I deferred responding to some of Frank's specific issues because I'm currently preparing a presentation on the subject.
My own arbitrary and personal definition of what determines whether a given segment of "matching" autosomal or xDNA is suitable for use in genealogy has two parts:
1. Is the segment very likely to be a valid segment, or is there sufficient possibility that it is a false-positive?
2. With a high degree of confidence, can the segment be identified as having originated with a single, specific ancestor?
Item one is about the physical, chromosomal segment. Every testing and reporting company uses its own methodology for this evaluation; the same DNA "match" will be somewhat different as reported by each company; and none of the companies disclose the actual base-pair data. A matching segment should be a continuous set of half-identical nucleotide values, unbroken by mismatched loci and having no excessively long region of no matches. Our common microarray tests look only at about 1 in every 4,800 base pairs, so long regions of no half-identical matches might imply that two distinct segments have been conflated in the reporting. Compounding this is the routine practice of assuming any no-calls (loci where one or the other set of test data was unable to determine a value at either base in the pair) are matches.
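As a rough illustration of what "half-identical" means at a single locus, here is a minimal Python sketch. The function name, the two-letter genotype strings, and the use of "-" to mark a no-call are all assumptions made for the example; no company's actual comparison code is being reproduced.

```python
def half_identical(g1: str, g2: str, no_call_matches: bool = True) -> bool:
    """Return True if two unphased genotypes share at least one allele.

    Genotypes are two-letter strings such as "AG". A "-" in either
    genotype is treated as a no-call (an assumed convention here).
    """
    if "-" in g1 or "-" in g2:
        # The routine practice described above: count a no-call as a match.
        return no_call_matches
    # Unphased data: we only ask whether any allele is in common.
    return bool(set(g1) & set(g2))


print(half_identical("AG", "GG"))   # share G
print(half_identical("AA", "GG"))   # nothing in common
print(half_identical("--", "GG"))   # no-call counted as a match
```

Setting `no_call_matches=False` shows how many loci in a reported segment are "matching" only because one kit had no value there.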
The centiMorgan itself--by definition and computational estimation--becomes increasingly imprecise as an evaluation tool the smaller segments become. Add to that, our microarray tests simply cannot tell us the actual start and stop points of a purported segment.
How the 7cM value became so commonly cited in genealogy as some de facto threshold in determining segment validity I've never quite figured out. My best guess is that's simply what GEDmatch first decided to use as its default minimum. But almost 10 years ago Dr. Tim Janzen reported on the results of comparing traditionally phased trios using what reporting detail we have available, and he found that fully 58% of all 7cM segments were false, this based solely on the phasing results, not a physical examination of the base-pair detail.
The most common problem in the interpretation of the actual base-pair data deals with the fact that no DNA test can differentiate between which nucleotide values come from the maternal chromosome in a pair, and which from the paternal. This is commonly called "haplotype switching": as ISOGG explains it, "...matching alleles zig-zagging backwards and forwards between the maternal side and the paternal side." In a paper in Molecular Biology and Evolution (Durand, et al., 2014), researchers found that this was the reason for false-positive matching between small segments in up to 67% of the instances.
GEDmatch uses no additional comparison refinement tools like computational phasing or genotype imputation; their matching is arithmetic only, calculating the number of continuously matching SNPs, an allowance for a small number of mismatches, and the distance in base pairs between the matching SNPs. Given that, the greater the number of SNPs in a kit's data, hypothetically the more accurate the matching results should be.
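To make "arithmetic only" concrete, here is a hedged Python sketch of that style of matching: walk the SNPs in order, count consecutive half-identical loci, tolerate a small number of mismatches, and break the run if the distance between matching SNPs gets too large. This is not GEDmatch's code. The names `find_segments`, `min_snps`, `max_mismatches`, and `max_gap_bp`, the default values, and the choice to treat base-pair distance as a gap limit are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start_bp: int   # position of the first matching SNP
    end_bp: int     # position of the last matching SNP
    snp_count: int  # number of matching SNPs in the run


def find_segments(positions: List[int], matches: List[bool],
                  min_snps: int = 3, max_mismatches: int = 1,
                  max_gap_bp: int = 1_000_000) -> List[Segment]:
    """Scan one chromosome for runs of half-identical SNPs."""
    segments: List[Segment] = []
    start = last = None      # indexes of first / most recent matching SNP
    count = mismatches = 0

    def close() -> None:
        nonlocal start, last, count, mismatches
        if start is not None and count >= min_snps:
            segments.append(Segment(positions[start], positions[last], count))
        start, last, count, mismatches = None, None, 0, 0

    for i, (pos, ok) in enumerate(zip(positions, matches)):
        # A long stretch with no matching SNP ends the current run.
        if last is not None and pos - positions[last] > max_gap_bp:
            close()
        if ok:
            if start is None:
                start = i
            last = i
            count += 1
        elif start is not None:
            mismatches += 1
            if mismatches > max_mismatches:
                close()
    close()
    return segments


positions = [100, 200, 300, 400, 500, 5_000_000, 5_000_100]
matches = [True, True, False, True, True, True, True]
print(find_segments(positions, matches))
```

With the toy data above, one mismatch is bridged and the large gap splits the run. Note that nothing in this kind of scan knows which parental chromosome an allele came from, which is why more SNPs in a kit should, hypothetically, leave less room for a run to survive by chance.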
In an informal check using a baseline kit composed of over 2.08 million SNPs extracted from whole genome sequencing data, and employing GEDmatch's Tier 1 one-to-many tool using its default settings, I compared the results from that superkit to 11 different tests/versions of our common microarray results using the same DNA sampled at the same time. In the worst performing instance, that of a 23andMe v5 test (and mind you this has nothing to do with the accuracy of that test, only with the way that GEDmatch uses its data for comparisons), at a threshold of ≥ 10cM for every kit in the database shown as a match to the superkit, in round numbers 3 times as many kits showed as matching the 23andMe test (28,846 versus 9,242 for the superkit), presenting the real possibility that over 19,600 of the matches attributed to the 23andMe kit were false. Interestingly--though purely coincidentally--that would be a 67.96% false-positive rate compared to the 67% reported in the Durand study. It wasn't possible to go lower than the 10cM threshold: the GEDmatch server would consistently deliver a "memory exceeded" error and, since that time, they have modified the available search parameters to eliminate the possibility of that volume of matches; 7,500 is the new maximum. No, I did not break GEDmatch...at least I don't think I did.
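For anyone who wants to check the arithmetic, a short Python sketch follows. It rests on an assumption the informal comparison could not verify: that every superkit match is genuine and also appears in the 23andMe kit's list, so that all of the excess is false. The function name is made up for the example.

```python
def excess_match_stats(baseline_matches: int, test_matches: int):
    """Return (excess matches, ratio to baseline, excess as share of test matches)."""
    excess = test_matches - baseline_matches
    ratio = test_matches / baseline_matches
    share = excess / test_matches
    return excess, ratio, share


# Figures from the comparison above: superkit vs. 23andMe v5, both at >= 10cM.
excess, ratio, share = excess_match_stats(9_242, 28_846)
print(excess)             # 19604
print(round(ratio, 2))    # 3.12
print(f"{share:.2%}")     # 67.96%
```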
Item two--with a high degree of confidence, can the segment be identified as having originated with a single, specific ancestor--is an even stickier and more difficult criterion to evaluate.
Contrary to popular opinion, recombination (crossing over)--the process that occurs during Prophase I of meiosis and the function that creates our DNA segments--is not a random operation. There are several biological mechanisms in play that make that so, from genetic linkage to crossover interference to the centromere effect to linkage disequilibrium to crossover hotspots and their shifting due to something called deamination as males age.
One result of these various mechanisms is that, as we step back in time generation by generation, we reach a point where we will never be able to accurately attribute a given segment to a given ancestor. Just as we Europeans carry about 2% Neanderthal DNA, we carry DNA segments from our various founder populations and even from entire populations at a broad, continental level. If that weren't true, none of the "ethnicity estimates" we see would be possible.
This is one of many reasons that numerous voices have been urging the scientific community to do away with the single-genome-as-reference concept and move to a pangenomic model that better considers attributes specific to diverse, global populations. The National Institutes of Health have, for now, agreed and put the release of GRCh39 on indefinite hold until a determination is made how to proceed. (A reminder that all our genetic genealogy comparisons of autosomal DNA are still being done against GRCh37, a reference assembly first published in 2009 and retired in 2013; considering that this is the basis for both base pair numbering and calculating centiMorgans, genealogy is already a decade behind the curve.)
In doing research for the mentioned presentation, I came across what I believe is a simple and important summary of the situation (Mathieson and Scally, PLOS Genetics, March 2020):
"Another source of confusion is that three distinct concepts--genealogical ancestry, genetic ancestry, and genetic similarity--are frequently conflated. We discuss them in turn, but note that only the first two are explicitly forms of ancestry, and that genetic data are surprisingly uninformative about either of them. Consequently, most statements about ancestry are really statements about genetic similarity, which has a complex relationship with ancestry, and can only be related to it by making assumptions about human demography whose validity is uncertain and difficult to test."
Gary, you encapsulated part of that very problem when you wrote: "This MCRA couple is Cornish, and I've noticed consistently higher cM results on my Cornish branch." We all have so-called pile-up regions of DNA in our genomes. These are chunks of autosomal or xDNA that display far greater frequencies of sharing than should be expected, and the causal elements here can range from occurrence at the level of continental populations, to regional and founder populations, to haplotypic pile-ups that can be associated with tribal/clan groups and even individual families.
The social practice of endogamy isn't necessary to create this. Every population bottleneck in each of our long genetic histories has resulted in a compression of the mating population, whether from disaster, pandemic, migration, or geography. At each of those intervals our genetic history has undergone a narrowing, with the resultant downstream funnel a commingling of DNA that becomes difficult to evaluate, requiring knowledge, effort, and detailed investigation to accurately attribute to a specific ancestor at several generations, and downright unlikely much further. For example, at 5th cousins we're already looking at 12 meiosis events between two test-takers; as many as 24 among three test-takers; 36 among four test-takers, and so on. The odds of finding a segment shared by two 5th cousins aren't great even at a 4g-grandparent MRCA--about 1 in 6--and they drop precipitously from there. Very roughly (and admittedly inaccurately; since triangulation has never been scientifically studied no one has published the computed probabilities), the odds of finding three 7th cousins who share the same segment of DNA from the same ancestor would be on the order of 1 in 166.
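The meiosis counts above can be reproduced with a small Python sketch. The pairwise count is standard: nth cousins each sit n + 1 generations below the shared ancestral couple. The multi-tester figure simply follows the pattern of the numbers quoted above (one more pairwise path per additional test-taker, as an upper bound); that reading, and both function names, are assumptions for the example, and no probabilities are computed here.

```python
def meioses_between_cousins(degree: int) -> int:
    """Meiosis events separating two nth cousins through their MRCA.

    Each cousin is (degree + 1) generations below the common ancestor,
    so the path between them crosses 2 * (degree + 1) meioses.
    """
    return 2 * (degree + 1)


def max_meioses_among(testers: int, degree: int) -> int:
    """Upper bound matching the figures in the text: 12, 24, 36 for 5th cousins."""
    return (testers - 1) * meioses_between_cousins(degree)


print(meioses_between_cousins(5))                   # 12
print([max_meioses_among(k, 5) for k in (2, 3, 4)]) # [12, 24, 36]
print(meioses_between_cousins(7))                   # 16
```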