I told myself I would sit this one out, but...
I wholeheartedly agree with Andreas and Derrick that the term "industry standard" has no place in this landscape, and is a misconception when applied to genetic genealogy in general. I'm going to try to convince myself not to write more about it. Meanwhile, here is a synopsis of how some of the testing companies state they perform matching, if they state it at all.
23andMe: 7cM with a minimum 700 SNPs for the first half-identical region (HIR); 5cM and a minimum 700 SNPs for each additional HIR. The error rate allowance seems to be pegged at roughly 1%: 1 opposite homozygote per 300 SNPs, and each opposite homozygote in an HIR must be separated by roughly 300 SNPs. For fully-identical regions (FIR), the threshold is 5cM and 500 SNPs.
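For anyone who thinks better in code, here is a minimal sketch of the autosomal HIR thresholds as I've summarized them above. The `Segment` class and the function name are my own inventions for illustration, not anything 23andMe publishes, and the error-rate allowance is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    cm: float   # genetic length in centiMorgans
    snps: int   # number of SNPs in the segment


def counted_hirs_23andme(segments):
    """Return the HIRs that would count under the thresholds described above.

    Assumption: the longest qualifying segment plays the role of "first HIR".
    """
    ordered = sorted(segments, key=lambda s: s.cm, reverse=True)
    # The first HIR must reach 7 cM and 700 SNPs, or nothing counts at all.
    first = next((s for s in ordered if s.cm >= 7.0 and s.snps >= 700), None)
    if first is None:
        return []
    # Each additional HIR needs 5 cM and 700 SNPs.
    rest = [s for s in ordered
            if s is not first and s.cm >= 5.0 and s.snps >= 700]
    return [first] + rest


if __name__ == "__main__":
    demo = [Segment(7.5, 800), Segment(5.2, 750), Segment(4.0, 900)]
    print([s.cm for s in counted_hirs_23andme(demo)])
```

Note that a 6.9cM segment on its own returns nothing, no matter how many SNPs it carries.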
Note also that 23andMe adds X-chromosome segment matching into the autosomal total count, and no other company does that. They do not presume a match if only the X-chromosome and none of the autosomes match, but if there is an autosomal match the criteria are:
- Male-to-male: 1cM, minimum 200 SNPs
- Male-to-female: 6cM, minimum 600 SNPs
- Female-to-female: 6cM, minimum 1,200 SNPs
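The X-chromosome criteria above boil down to a small lookup table. This is only a sketch of my reading of them; the names `X_THRESHOLDS` and `x_segment_counts` are hypothetical, and sexes are passed as "M" or "F".

```python
# (minimum cM, minimum SNPs) keyed by the pair of sexes being compared
X_THRESHOLDS = {
    frozenset(["M"]): (1.0, 200),        # male-to-male
    frozenset(["M", "F"]): (6.0, 600),   # male-to-female
    frozenset(["F"]): (6.0, 1200),       # female-to-female
}


def x_segment_counts(sex_a, sex_b, cm, snps, has_autosomal_match):
    """True if an X segment would be added to the total under the list above."""
    # An X-only match is not presumed: an autosomal match must exist first.
    if not has_autosomal_match:
        return False
    min_cm, min_snps = X_THRESHOLDS[frozenset([sex_a, sex_b])]
    return cm >= min_cm and snps >= min_snps


if __name__ == "__main__":
    print(x_segment_counts("M", "M", 1.5, 250, True))
    print(x_segment_counts("F", "F", 6.5, 1000, True))
```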
FTDNA: Their formerly useful "Learning Center" sitemap (https://www.familytreedna.com/learn/sitemap/) was razed sometime in the past few days. One can only hope it's because FTDNA is correcting and expanding it, not abandoning it. That said, the previous information that I'd found shows a multiple-choice sort of criteria:
- 1) 9cM with a minimum 500 SNPs for a single HIR, regardless of the total amount shared.
- 2) If there is no segment of at least 9cM with a minimum 500 SNPs, the single-segment threshold is reduced to 7.69cM provided there is a combined-segment total of at least 20cM; that total then includes very small HIRs between 1cM and 7cM (which kinda drives me nuts).
- 3) Only for specific but undefined non-European populations, 5.5cM with a minimum 500 SNPs for the first HIR. This seems to be applicable to only 1% or less of the customer base.
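The three-way criteria above can be sketched as a single function. Treat this as my interpretation of the formerly published rules, not FTDNA's actual code: the function name and the `reduced_population` flag are hypothetical, and since I never saw a SNP minimum stated for the 7.69cM case, the sketch does not apply one there.

```python
def ftdna_is_match(segments, reduced_population=False):
    """segments is a list of (cm, snps) tuples for one pair of testers."""
    # Rules 1 and 3: one long HIR with at least 500 SNPs is enough by itself.
    first_min = 5.5 if reduced_population else 9.0
    if any(cm >= first_min and snps >= 500 for cm, snps in segments):
        return True
    # Rule 2: a longest segment of 7.69 cM plus a 20 cM combined total,
    # where the total also counts the very small HIRs down to 1 cM.
    longest = max((cm for cm, _ in segments), default=0.0)
    total = sum(cm for cm, _ in segments if cm >= 1.0)
    return longest >= 7.69 and total >= 20.0


if __name__ == "__main__":
    print(ftdna_is_match([(8.0, 600), (5.0, 400), (4.0, 300), (3.5, 300)]))
```

The demo pair has no 9cM segment but still matches, because the small segments push the combined total to 20.5cM.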
Unlike 23andMe, no known error rate estimation is reported. Also note that their Family Finder matching routines exclude certain regions of certain chromosomes (specifically some small areas at telomeres and centromeres) and as a result there may not be a one-to-one correlation when looking at the data in GEDmatch.
AncestryDNA: Here we can see the least of what actually goes on under the hood. The simple statement is that the minimum threshold to be considered a match is 6cM, but they do count 3cM segments. Now, that said, the count is done after the application of their proprietary phasing algorithm called Underdog, and a second proprietary algorithm called Timber that uses genotyped modeling to attempt to filter out small regions of excess IBD sharing which are not useful for genealogy, i.e., not indicative of recent ancestry.
One outcome here is precisely what Derrick reported: you'll sometimes see a smaller total matching amount reported by AncestryDNA (which, alas, is all we can get from them) than GEDmatch will show shared in a single segment. The good news is that Ancestry is designed to computationally phase the data (as opposed to traditional trio phasing where the parents' data are known) and "condition" it via population genotyping before reporting matches, so the matches should theoretically be more accurate...less chaff in the wheat, in other words. But Ancestry provides us no segment-level detail, so we really can't tell what exactly is taking place or how the numbers directly compare to what we see in third-party reporting using the Ancestry raw data.
MyHeritage: is almost as mysterious as Ancestry. From what I can tell, their threshold is 8cM for a single HIR, and then at least 6cM for each additional HIR segment. They do not report what minimum SNP threshold is in use. However, they also indicate (as of February this year) that they perform some type of computational phasing; but I've never found any detail about what the phasing entails or how it affects results. They also indicate that they use something that might be called "anti-Timber." Where Ancestry's Timber looks to filter out small regions of excess IBD sharing, MyHeritage seems to use an imputation algorithm that they refer to as a "stitching" process. It evidently looks for very small HIR segments that are very close to each other and may have missed the 6cM threshold only because of a few mismatched SNPs separating them. If the algorithm--using whatever proprietary criteria are programmed--determines that the two very small segments really look like they should be a single segment, the matching routine will report it as such. This will also explain some differences you may see when running MyHeritage raw data through the unfiltered GEDmatch.
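To make the "stitching" idea concrete, here is a toy sketch of merging neighboring segments that are separated by only a few SNPs. To be clear, MyHeritage does not publish its criteria: the function name, the tuple layout, and the `max_gap_snps` value are all assumptions of mine, and only the 6cM figure comes from the description above.

```python
def stitch_segments(segments, max_gap_snps=3, min_cm=6.0):
    """Merge nearby segments on one chromosome, then apply the cM threshold.

    Each segment is a tuple (start_snp, end_snp, start_cm, end_cm), where the
    SNP values are positions in the ordered list of tested SNPs.
    """
    merged = []
    for seg in sorted(segments):
        # Number of SNPs lying between the previous segment and this one.
        if merged and seg[0] - merged[-1][1] - 1 <= max_gap_snps:
            prev = merged[-1]
            merged[-1] = (prev[0], max(prev[1], seg[1]),
                          prev[2], max(prev[3], seg[3]))
        else:
            merged.append(seg)
    # Only segments that reach the threshold after merging get reported.
    return [s for s in merged if s[3] - s[2] >= min_cm]


if __name__ == "__main__":
    demo = [(0, 400, 10.0, 14.0), (403, 900, 14.1, 18.5),
            (2000, 2300, 40.0, 43.0)]
    print(stitch_segments(demo))
```

In the demo, two segments of about 4cM each sit two SNPs apart and get reported as one 8.5cM segment. Without stitching, none of the three would clear 6cM, which is the sort of difference you'd then see against an unfiltered comparison.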
MyHeritage, similar to FTDNA, also makes some adjustments for certain populations. If the test-taker's ancestry is at least 50% Ashkenazi, 12cM is required for the first HIR.
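Putting the MyHeritage numbers above together, a rough sketch of the thresholds might look like this. The function and parameter names are hypothetical, and because no SNP minimum is reported, none is modeled.

```python
def counted_segments_myheritage(segment_cms, ashkenazi_fraction=0.0):
    """Return the segment lengths (in cM) that would count toward a match."""
    # The first HIR needs 8 cM, or 12 cM at 50% or more Ashkenazi ancestry.
    first_min = 12.0 if ashkenazi_fraction >= 0.5 else 8.0
    ordered = sorted(segment_cms, reverse=True)
    if not ordered or ordered[0] < first_min:
        return []
    # Each additional HIR needs at least 6 cM.
    return [ordered[0]] + [cm for cm in ordered[1:] if cm >= 6.0]


if __name__ == "__main__":
    print(counted_segments_myheritage([9.0, 6.5, 4.0]))
    print(counted_segments_myheritage([9.0, 6.5, 4.0], ashkenazi_fraction=0.6))
```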
GEDmatch: This is where I hope relatively newish WikiTreer Aaron Wells will chime in. Maybe we could offer kolaches and coffee to the room?
To my knowledge, GEDmatch does zero "pre-treatment" of any of the raw data. I believe it's what-you-see-is-what-you-get. Which is a great thing for us DNA nerds, but removes all of the let-us-help-you-with-that services from the testing companies. No Underdog; no Timber; no imputation; no genotyping; no phasing; no "stitching"; no exclusion of SNP-poor areas at centromeres and telomeres; no exclusion of pile-up regions of excess IBD sharing. It's up to the genealogist to understand how to evaluate the raw data. You can set your own search criteria at the centiMorgan, SNP count, and mismatch bunching limit levels, but there are no other filters or conditions applied (generally speaking; some utilities have less granular control--like the test for runs of homozygosity--but there's still no hidden math going on).
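To show what those three user-set knobs do, here is a simplified sketch of a half-identical scan over two aligned kits. This is not GEDmatch's code. In particular, my handling of the mismatch bunching limit (tolerate a mismatch only if more than `bunch_limit` SNPs have passed since the previous one) is an assumption about how such a setting could work, and all the names are made up.

```python
def find_hirs(geno_a, geno_b, cm_pos, min_cm, min_snps, bunch_limit):
    """Return (start, end, cm, snps) runs where the kits share an allele.

    geno_a and geno_b are aligned lists of genotypes such as "AG";
    cm_pos gives the centiMorgan position of each SNP.
    """
    segments = []
    start = None        # index where the current run began
    last_break = None   # index of the last tolerated mismatch (or start - 1)
    last_match = None   # index of the last matching SNP in the run

    def close():
        nonlocal start
        snps = last_match - start + 1
        cm = cm_pos[last_match] - cm_pos[start]
        if cm >= min_cm and snps >= min_snps:
            segments.append((start, last_match, round(cm, 2), snps))
        start = None

    for i, (a, b) in enumerate(zip(geno_a, geno_b)):
        if set(a) & set(b):                 # half-identical at this SNP
            if start is None:
                start, last_break = i, i - 1
            last_match = i
        elif start is not None:             # no shared allele: a mismatch
            if i - last_break > bunch_limit:
                last_break = i              # isolated mismatch, keep going
            else:
                close()                     # bunched mismatches end the run
    if start is not None:
        close()
    return segments


if __name__ == "__main__":
    a = ["AA"] * 20
    b = ["AA"] * 20
    b[10] = "GG"                            # a single opposite homozygote
    cm = [i * 0.5 for i in range(20)]
    print(find_hirs(a, b, cm, min_cm=7.0, min_snps=15, bunch_limit=5))
```

With one isolated mismatch, the toy data comes back as a single 9.5cM segment. Put a second mismatch two SNPs later and the run is broken into pieces that no longer meet the thresholds.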
I've had reason the last couple of weeks to work almost exclusively with GSA chip results in Genesis, and I've come away believing that GEDmatch has decided not to apply any behind-the-scenes imputation in trying to arrive at better OmniExpress-to-GSA chip comparisons (since the SNP overlap is only 23% or less). Instead, they report the unfiltered results while giving us a clear indication of the number of SNPs compared, plus a sliding scale towards a red flag (well, literally a red background on a field) if we need to start being concerned that we may be working with too few in-common SNPs for the comparison at hand. I'd really like Aaron's input on that.
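As a rough illustration of that kind of overlap check, this sketch counts the SNPs two kits have in common and assigns a warning level. The cutoffs and the choice of denominator (the smaller kit) are placeholders I made up; I don't know what values Genesis actually uses.

```python
def overlap_report(rsids_a, rsids_b, caution_below=0.5, warn_below=0.25):
    """Return (common_count, fraction_of_smaller_kit, level) for two kits."""
    set_a, set_b = set(rsids_a), set(rsids_b)
    common = set_a & set_b
    smaller = min(len(set_a), len(set_b))
    fraction = len(common) / smaller if smaller else 0.0
    if fraction < warn_below:
        level = "red"       # too few in-common SNPs to trust the comparison
    elif fraction < caution_below:
        level = "yellow"
    else:
        level = "ok"
    return len(common), fraction, level


if __name__ == "__main__":
    kit_a = {f"rs{i}" for i in range(1, 101)}     # toy kit with 100 SNPs
    kit_b = {f"rs{i}" for i in range(80, 180)}    # toy kit sharing 21 of them
    print(overlap_report(kit_a, kit_b))
```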
Because of the extended Genesis beta, I'd thought the direction would be an imputational model with best-guesswork employed by GEDmatch to get the two disparate chip types to compare. But if the end result will continue to be a stance of "just the facts" and not to mess with manipulating the data behind the scenes, I'm way good with that. But it means that, as it always should be, it's incumbent on us as genealogists to learn enough to make our own well-informed evaluations, analyses, and decisions, and to weigh what we consider to be an acceptable level of accuracy versus quickly adding an icon somewhere that indicates a degree of verification that might be totally false.