Are the averages reported by “The Shared cM Project” misleading?

A popular tool in genetic genealogy is “The Shared cM Project”, a crowd-sourced database of shared DNA between various relationships. It’s very easy to use and the main interface gives the average and range of cM’s shared by relationship. More detailed information is documented at https://thegeneticgenealogist.com/wp-content/uploads/2017/08/Shared_cM_Project_2017.pdf. While the predictions closely track calculated values, these diverge with more distant relationships.

A graph of predicted cM values should approximate a normal distribution. Here is a histogram from The Shared cM Project for a 1C match which does appear to follow that pattern:

However, as the relationship distance increases, the left side of the histogram gets chopped off. Here is the 4C histogram:

What’s happening here seems to be an example of (unavoidable) reporting bias. DNA testing companies do not report “matches” when the shared DNA is zero or very low. Therefore, testers reporting results for use in the tool do not report anything for distant cousins who don’t match at the testing site. Since these very low values are not added to the database, it causes the reported mean, median, and range of cM values to be skewed to the right, to a higher number. The tool is reporting the values correctly from the data, but it isn’t the actual expected values for more distant matches. This divergence can be seen in the following table, derived from a table created by Roberta Estes at DNAeXplained:
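The effect of dropping low values can be illustrated with a small simulation. This is only a sketch under stated assumptions: the exponential distribution is a stand-in (real shared-cM distributions are not exponential), the 13.28 cM mean is the calculated 4C figure, and the 7 cM reporting cutoff is a hypothetical threshold, not any company's documented rule.

```python
import random
import statistics

def simulate_reported_mean(true_mean_cm, threshold_cm, n=200_000, seed=1):
    """Compare the mean of all simulated pairs with the mean of 'reported' pairs.

    Assumption: shared cM is drawn from an exponential distribution, which is
    an illustrative stand-in only.
    """
    rng = random.Random(seed)
    all_pairs = [rng.expovariate(1.0 / true_mean_cm) for _ in range(n)]
    # Pairs below the threshold never appear as matches, so nobody submits them.
    reported = [cm for cm in all_pairs if cm >= threshold_cm]
    return statistics.mean(all_pairs), statistics.mean(reported), len(reported) / n

true_mean, reported_mean, share_reported = simulate_reported_mean(13.28, 7.0)
print(f"mean of all pairs:      {true_mean:.1f} cM")
print(f"mean of reported pairs: {reported_mean:.1f} cM")
print(f"fraction reported:      {share_reported:.0%}")
```

Under these assumptions the reported mean lands near the true mean plus the threshold, so the average of submitted matches overstates the average for all cousins of that degree even though every submitted value is correct.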

Expected cM’s shared, by relationship
Relationship	Calculated (cM)	Shared cM Tool average (cM)	Ratio (tool / calculated)
1C	850	874	1.03
2C	212.5	233	1.10
3C	53.13	74	1.39
4C	13.28	35	2.64
5C	3.32	25	7.53
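The calculated column and the ratios in the table above can be reproduced with a few lines. This is a sketch, not the method used to build the original table: the 6800 cM total is an assumption inferred from the table's own numbers (850 cM for 1C at 12.5%), and the function name is made up for illustration.

```python
TOTAL_CM = 6800.0  # assumption: total implied by the table (850 cM / 12.5%)

# Shared cM Tool averages quoted in the table above.
tool_average = {"1C": 874, "2C": 233, "3C": 74, "4C": 35, "5C": 25}

def calculated_cm(cousin_degree):
    """Theoretical average for full n-th cousins: two paths of 2*(n+1) meioses."""
    meioses = 2 * (cousin_degree + 1)
    return TOTAL_CM * 2 * 0.5 ** meioses

for label, observed in tool_average.items():
    degree = int(label[0])
    expected = calculated_cm(degree)
    print(f"{label}\t{expected:.2f}\t{observed}\t{observed / expected:.2f}")
```

Each step out halves the expected sharing twice (one extra meiosis on each side), which is why the calculated value falls by a factor of four per cousin degree while the tool's averages fall far more slowly.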

Do you agree that The Shared cM Project tool is misleading for the expected cM’s shared at more distant relationships? I look forward to learned replies, as well as from some dude named Edison Williams who fancies himself knowledgeable in this area. I assume he'll tell me that everyone already knows this and provide a link to his prior post on this topic.

asked Nov 6, 2018 in The Tree House by Kerry Larson G2G6 Pilot (234k points)

Thank you, Edie. I've been so distraught over the comments in this topic that I've stayed away from G2G for several days.

Psych! I've been away, but only because of 18-hour work days in four cities, a statewide genealogy conference (Edie, were you in San Antonio for TxSGS?) that I had to leave early, an annual charity event I help run...oh, and a deep trench through all our backyards on the block where Comcast is sinking new fiber-optic cable and inadvertently cut the AT&T cable along the way (i.e., no landline, no internet for a while). But other than that...

No, unfortunately for him, Kerry seems to share a tiny bit of my "strange sense of humor" DNA. We've come up with some odd stuff in email. But back on topic.

Kerry, everyone already knows this, and here's a link...

Seriously, though, most points have been covered well. Blaine never implies that data from the Shared cM Project is anything other than what it is: crowd-sourced information. As Barry and Pam pointed out, there are several vulnerabilities inherent in the model that Blaine really has no good way to account for, among them:

Under-reporting of actual relationships that share no or little DNA
Over-reporting of relationships that share little DNA: relationships that are likely false in terms of shared DNA measured (e.g., 7th and 8th cousins)
Inaccurate genealogical relationships (inaccurate family trees)
Broad spectrum of genetic genealogy knowledge of the people submitting data (data may or may not be researched and analyzed)
Variability of matching criteria among testing companies
Self-reporting of endogamy/pedigree collapse (I don't believe Blaine factors this into the averages or histograms, but does chart it in the PDF)
No allowance for population-level or haplotypical pile-up regions

My strong suspicion is that the quality of the data in the Shared cM Project mirrors the levels of complexity of working with atDNA at different levels: up close and personal (through 2nd cousins) it ain't rocket surgery and the amounts of shared DNA are very indicative of the relationships; beyond 3rd cousins and through 5th, it requires a solid, fundamental understanding of the science, excellent paper-trails, and the possible correlations are pocked with pitfalls; beyond 5th cousins it sorta is rocket surgery and the majority of atDNA-based evidence I've seen presented at and beyond 6th cousins is unsupportable and likely false.

An aside: I learned last month that a piece I wrote about this particular soapbox issue--that genetic genealogy is more complex than some assume (the assumption based largely on the online tools available that lean toward oversimplification), and that many claiming to be knowledgeable in the field really aren't--is being reprinted as part of a curriculum handout at 2019 training sessions at some U.S. institutions and national genealogical conferences. And let me hasten to add that I'm a hobbyist; I ain't no expert. I do know a few, though...point being that there aren't all that many and we should give an eyebrow scrunch and head-tilt to anyone professing themselves to be expert or professional genetic genealogists.

So there's only a small set of the population qualified to provide authoritative data to Blaine's project about experiential atDNA and distant cousinships. But I believe the hope for such a project--I know it's mine because I do regularly contribute data (haven't submitted anything beyond 3C1R though)--is that with enough sheer volume of input the distributions should even out and become more reasonable...or at least we can decide where, based on the histograms, we want to clamp for ourselves what we consider outliers.

More in a bit. I have a conference call in a few...

commented Nov 14, 2018 by Edison Williams G2G6 Pilot (439k points)

I'm back. Aren't ya thrilled?

Funny you should mention ERSA, Kerry...and thanks for reminding me of it. I'd seen that paper you linked, and another from 2014 (in PLOS Genetics) that was even more interesting to me. The lead author of that one was Hong Li, and any of you who have looked at the ISOGG article on excess IBD sharing will recognize the name because he's one of the very few researchers who has delved into that issue. In that 2014 paper, it indicates ERSA was updated to v2.0 (now v2.1, but I gather that was issued a few years ago) in order to account for and mask known pile-up regions. In the abstract, it notes: "We identified several genomic regions with excess pairwise IBD in both the pedigree and control datasets using three established IBD methods: GERMLINE, fastIBD, and ISCA. These spurious IBD segments produced a 10-fold increase in the rate of detected false-positive relationships among controls compared to high-density microarray datasets."

With a little digging I learned that Chad Huff, principal author of the first paper, was almost a neighbor (well, same major metropolitan area). The ERSA modeling software--no idea of OS or platform requirements--is available for download but registering to receive the download link requires a questionnaire that specifies institutional affiliation and contact information for the principal investigator. But I did try registering a little over a year ago; got no response. May try again now; would love to experiment. Google-Fu can find it, but I'm not linking to it openly since it's clear they don't mean it to be for general public use.

And really quickly, a "coefficient of relationship," a CoR, really is a thing. I know this isn't what you asked, but I personally feel there is a lot of value in knowing the theoretical average sharing amount regardless of any experiential data we can get. If nothing else it provides a consistent, justifiable baseline against which we can benchmark results and additional data.

It's basically what RJ described, though deals exclusively in percentages. Here's the formula (it's what I used to build this table):

"R" is the coefficient, resulting in the theoretical amount of shared DNA expressed as a decimal. "X" and "Y" represent the two individuals involved, and "n" is the number of direct links--meiosis events--between the individuals, counted only in one direction, X to Y. The summation operation is necessary in order to calculate the correlation of two living individuals.

Simple example: your 2g-grandmother to you. Starting with her, "X," there are four meiosis events to get to you. So (1/2)^4 = 0.0625; shared percentage is 6.25%; shared centiMorgans (using ISOGG Method II, and what we see at GEDmatch, FTDNA, etc.) would then be 425 cM.

When you're comparing two individuals to a common ancestor, a collateral relationship, you add up the number of discrete meiosis events along both legs of the inheritance chain to each common ancestor by starting with X, counting back to the CA, then counting forward again to Y. For example, for 2nd cousins sharing a g-grandparental couple, you've got three meiosis events back from X and three forward to Y, for n = 6, and (1/2)^6 = 0.015625. The trail through the other g-grandparent is the same length, because the MRCA couple represents two ancestors in common, so you sum the two results for 0.03125, 3.125% shared DNA, or 212.5 cM.

A half 2nd cousin would be exactly half that, or back to the 1.5625% figure because only one g-grandparent is shared: you can count the meiosis trail only one time; in other words, you count the meiosis trail passing through no individual which is not a common ancestor more than once.
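The three worked examples above can be checked with a short sketch. It assumes the formula is the usual sum of (1/2)^n over each X-to-Y trail, one trail per common ancestor, and it assumes a 6800 cM total (the figure consistent with 6.25% giving 425 cM). The function and variable names are made up for illustration.

```python
TOTAL_CM = 6800.0  # assumption: total consistent with 6.25% -> 425 cM above

def coefficient_of_relationship(path_lengths):
    """Sum (1/2)**n over each path, one path per common ancestor.

    path_lengths: number of meiosis events along each X-to-Y path.
    """
    return sum(0.5 ** n for n in path_lengths)

examples = {
    "2g-grandmother (direct line)": [4],
    "2nd cousins (shared couple)": [6, 6],
    "half 2nd cousins (one shared ancestor)": [6],
}

for name, paths in examples.items():
    r = coefficient_of_relationship(paths)
    print(f"{name}: R = {r}, {r:.4%}, {r * TOTAL_CM} cM")
```

Passing the trails as a list keeps the pedigree-collapse case simple: each additional route through a distinct common ancestor is just another entry in the list.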

"Coefficient of inbreeding" is also a thing, but it's more complex and relies on a known result of the coefficient of relationship. A few months ago I said I might write something up about CoI for genealogy, but moved it off the radar. May still be worthwhile, but the problem is that in order to calculate the CoI, the actual ancestry both to and prior to an MRCA has to be known. For genealogy, we're typically looking to investigate hypotheses about that ancestry...we don't already know it, or we wouldn't be looking for a CoI result as guidance. Catch 22.

Dunno. I may still do it. But the CoR can be used to evaluate theoretical sharing for any possible relationship if all parentage is known along both legs of the inheritance chain of two people. If anyone gets bored and wants a pedigree collapse problem to use the CoR on, here's a nifty one from a friend, Yvette Hoitink: https://www.dutchgenealogy.nl/worst-case-of-pedigree-collapse-ever/. "She [Engelberta Harmina Roerdink] descends from her great-great-grandparents Jan Roerdink and Clasina Rengerdink four different ways, as all four of her grandparents are grandchildren of this couple. Following the family tradition, she herself married her first cousin, another great-great-grandson of Jan Roerdink and Clasina Rengerdink."

commented Nov 14, 2018 by Edison Williams G2G6 Pilot (439k points)

Well look who finally waltzed in and decided to entertain us with his witty repartee. ... (just trying to make up for my earlier missing emoji). I was about ready to send out a search party. I wondered if you wandered off to the MyHeritage conference and got lost in Norway.

Nothing has changed from your prior experience with ERSA. I registered, got the email that a download link was forthcoming, then nothing. I like the table of relationships you created; it's an easy quick reference.

Regarding the CoR, I wish there was a known standard deviation, standard error of the mean, or something that gave a clue to help in estimating possibilities and probabilities. I wonder if it could be extrapolated from close relationships in The Shared cM Project which track closely with theoretical, to more distant relationships? I'm still working on fully understanding The DNA Geek's "What are the Odds?" and other articles. She seems to be the best explainer of this stuff.

"Coefficient of inbreeding" just sounds icky.

Thanks for weighing in with your expertise, and you are an expert whether you claim it or not. However, most people don't know that expert opinion is the lowest level of evidence, but they continue to rely on it above all else, except those on the leftward spike of your favorite Dunning-Kruger Effect who keep their own counsel. BTW, what the heck is 'rocket surgery'?

commented Nov 15, 2018 by Kerry Larson G2G6 Pilot (234k points)

4 Answers

Best answer

I think it is not misleading to people in practice because of the way they use the tables. These are called "conditional distributions", where instead of the distribution of shared cMs among all fourth cousins, say, the histogram is the "histogram of the distribution among fourth cousins of the shared cM conditioned on them having a detectable amount of DNA".

People tend only to go to these histograms when they have a detectable match, and when this is the practice the conditional histograms are the correct ones to be using.

However: I have commented elsewhere that there are many other problems with this self-reported data. People use different thresholds for their matches, whether because FTDNA uses the 5cM threshold, Ancestry "timbers" all of their data while computing matches, MyHeritage imputes data, or because GEDmatch users can set the threshold to be whatever they want. There is no standardization or verification of what people are doing.

There is also no checking that the relationships people are reporting are accurate. I believe Bettinger throws out clear outliers when, say, someone mistypes the relationship level or the shared cM total. But there is no accounting for misreporting of a match of 5th cousins due to shoddy genealogy work. Even with solid genealogy work, there is usually no way to prove that endogamy or other similar issues aren't contributing to the match.

(Uncertain quality of the genealogy work is also a big problem with the "statistical studies" performed by combining Ancestry matching data with family trees kept on Ancestry.)

The problems above are exacerbated with the distant matches, where the thresholds and quality of genealogy work make a much bigger difference.

Garbage in, garbage out.

answered Nov 6, 2018 by Barry Smith G2G6 Pilot (291k points)
selected Nov 8, 2018 by Andreas West

Thanks very much for your feedback Barry. You’re obviously smarter than the average bear about this stuff. I started investigating when I received some unexpected results from the What Are The Odds? (WATO) tool. For some matches around 50 cM, WATO was reporting that a 5C relationship was as likely or more likely than a 4C relationship. This seemed counter-intuitive to me. The culprit is how The Shared cM tool groups its data and reports probabilities. For a 50 cM match, it lumps 4C and 5C as follows:

24.48% 5C2R † 5C3R † 6C 6C1R 5C 6C2R 4C1R 5C1R 7C Half 3C2R 4C2R 7C1R 3C3R 4C3R 8C or more distant

21.90% 4C Half 3C1R 3C2R

Part of the reason for this is the skewed distribution of the Shared cM Project where it says that the average for a 4C match is 35 cM and the average for a 5C match is 25 cM. In reality, the predicted cM’s should be much lower for both, with the 4C value about 4x the 5C value. So, even though a 50 cM match is much more likely to represent 4C or closer, because 5C is lumped into a larger group of relationships with a slightly higher overall probability for that group, 24.48% vs. 21.90%, WATO predicts that the match is more likely a 5C than a 4C.

In fairness, I read somewhere that the WATO tool is better for closer relationships. But I’m looking for a realistic method of calculating some of these relationship probabilities.

commented Nov 6, 2018 by Kerry Larson G2G6 Pilot (234k points)
edited Nov 6, 2018 by Kerry Larson

I see your point here, but the issue becomes devising a better tool. Since there is little confidence in segments below 7 cM, the problem of how to treat that end of the distribution is a bit of a philosophical question. If I lower the threshold on GEDmatch, for example, on a one-to-one comparison with a paper 4th cousin who doesn't meet normal standards for a match, I might well discover multiple 2-6 cM "fragments" if you will. How do I interpret those bits? Do I report the match as 0 cM because none of the segments meet the threshold? Do I say, well golly, I know there is a common ancestor, so they are likely IBD? What if I have deep Colonial roots? Might they just as likely be common in that population? How would you have individuals report the state of non match for "known" relatives? How could you guarantee that there was not a misattributed parentage or unknown adoption somewhere between the two relationships? Pragmatically, I cannot see a reliable means of gathering that end of the spectrum from observational data. Do you use a statistical adjustment based on hypothetical expectations that cannot be verified?
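The reporting question raised above, whether a pair of small fragments counts as 0 cM or as a match, comes down to where the threshold sits. A minimal sketch, with hypothetical segment sizes and a made-up function name; the 7 cM default reflects the confidence level mentioned above, not any particular site's rule:

```python
def total_shared_cm(segments_cm, threshold_cm=7.0):
    """Sum only the segments at or above the matching threshold."""
    return sum(cm for cm in segments_cm if cm >= threshold_cm)

# Hypothetical one-to-one comparison with a paper 4th cousin.
segments = [5.8, 4.1, 3.3, 2.6]

print(total_shared_cm(segments))                    # counted at a 7 cM threshold
print(total_shared_cm(segments, threshold_cm=2.0))  # counted with the threshold lowered
```

The same pair of testers yields either nothing or a double-digit total depending on the setting, which is why submissions made under different thresholds are hard to pool.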

Perhaps the best approach is to devise confidence bands for the Shared cM Project and WATO tools for the different relationships. I believe they have informally done so when they state that it is more accurate on closer relationships. As someone involved in adoptee birth family searches, I find the tools very helpful on a pragmatic level for establishing possible relationships for matches to individuals of unknown parentage. I only use the tools on closer relationships (3C and closer).

commented Nov 7, 2018 by Pam Tabor

Thanks very much for your excellent comments Pam. They all seem spot-on to me. I actually like WATO but was mystified by this recent scenario. The math that WATO uses is probably perfectly valid, but it is relying on those Shared cM Project values that may not be. As it stands, it doesn't appear that WATO should be used for any relationship beyond a 4C and that it's probably best to use, as per your experience, for closer relationships than that.

I'm searching for another method to calculate odds. My intuition is that a 50 cM match is tremendously more likely to represent a 4C than a 5C relationship, but intuition can be wrong, especially when it comes to probabilities. I am confident that 5C is not more likely than 4C by a 1.12 odds ratio as calculated by WATO.
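The 1.12 figure is simply the ratio of the two group probabilities quoted earlier for a 50 cM match. A quick check, with the group sizes counted from the labels as quoted (variable names are illustrative):

```python
# Group probabilities quoted earlier in the thread for a 50 cM match.
p_group_with_5c = 0.2448  # large group that includes 5C
p_group_with_4c = 0.2190  # 4C, Half 3C1R, 3C2R

ratio = p_group_with_5c / p_group_with_4c
print(f"group ratio: {ratio:.2f}")

# Number of relationship labels in each group, as quoted.
labels_in_5c_group = 15
labels_in_4c_group = 3
print(f"labels per group: {labels_in_5c_group} vs {labels_in_4c_group}")
```

The ratio compares one group of relationships against another, not 5C alone against 4C alone, so it says little about the relative odds of those two specific relationships.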

commented Nov 7, 2018 by Kerry Larson G2G6 Pilot (234k points)

The problem with Barry's answer is that while he makes a really good argument, it might sound like he's coming to one conclusion, when he's really coming to the opposite conclusion (unless I'm missing what his conclusion really is!)

He uses the phrase "not misleading" near the beginning, which might give the casual reader the idea that he's answering the title question, but what he's actually doing is REFUTING the latter part of the argument that the author seems to be making (and the author seems to be defending the "not misleading" side).

He astutely points out that these are CONDITIONAL distributions, and that that's a GOOD thing. BUT there's an ADDITIONAL condition at play - one that is even more subtle, but just as important.

This is a distribution of cM values for KNOWN 4th cousins! In other words, the 4Cs within the matches of the people reporting the data THAT THEY WERE ABLE TO IDENTIFY AS 4C. Undoubtedly, there were MANY more 4Cs within their matches, but these were not reported because they had not yet been determined to BE 4Cs! Naturally, the 4C matches that are LEAST likely to be discovered to be such are going to be the ones with very low cM values. So there's an inherent BIAS towards higher values in this data. This should start to be a real problem around the 4C level, and can only get exponentially worse as you go to 5C and beyond.

Barry further points out - but (again) doesn't explicitly say - that another major factor in why the chart/distributions are misleading is that the distribution is DIFFERENT, depending on who's doing the calculating! The most important example of this is AncestryDNA (the biggest source of data, no doubt) where they throw away some common DNA sequences that the others don't - resulting in SIGNIFICANTLY lower cM values. It's folly not to have a separate distribution/chart for at LEAST AncestryDNA.

He (correctly) talks about "outliers", but fails to point out that while the chart says 4C goes out to 127cM, this histogram shows that it's below 105cM, 99% of the time. Further, anything above that might not only be data entry errors, and research errors, but endogamy. The chart's author INCLUDES endogamy data, even though it poisons the well.

So the bulk of his response explains pretty well how "Heck, yeah, it's misleading!" but it never explicitly SAYS that, and in fact seems to start off saying the exact opposite. Even THAT is despite actually missing a few things!

commented Jan 6, 2019 by Living Stanley G2G6 Mach 9 (91.1k points)

Kerry, I have been quietly skeptical whenever I hear people talk about "WATO", and what you're telling us means I've been wrong - I shouldn't be "quiet"! :)

In general, my skepticism is reflected in the existence of this thread itself - I just don't think that the REAL underlying distributions are well-known enough to be making those calculations, especially if they're using the flawed (but not completely useless) Shared_cM_Project data.

Plus, probability is pretty tricky! It's SUPER easy to screw up even some of the simplest of probability problems! It's like a lot of things - people look to the "experts", and when software pops out a number, they simply choose to believe it. To question it is to dive into the deep end of the pool, when they know only too well that they don't know how to swim.

What you're telling us is horrifying, from a mathematical/probability point of view - gross incompetence! For a START, they ought to be dividing it into what I think of as "half cousin" categories. What I mean by that is that - for simplicity - I think of a 3C1R, for example, as a "3.5C". A 3C1R has half the shared DNA, on average, as a 3C, and a 4C, in turn, has half the DNA of a 3C1R. At least one source refers to this as there being "classes" or "groups". Within a given group, the relations are usually indistinguishable.
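The "classes" idea can be sketched by computing the theoretical average sharing for each relationship and grouping the ones that come out equal. This is an illustration only: the function name and the short relationship list are made up, and it uses theoretical averages, not anything taken from the Shared cM Project or WATO.

```python
from collections import defaultdict
from fractions import Fraction

def shared_fraction(cousin_degree, removed=0, half=False):
    """Theoretical average shared fraction for an n-th cousin, m times removed."""
    meioses = 2 * (cousin_degree + 1) + removed
    paths = 1 if half else 2  # half cousins share one common ancestor, not a couple
    return paths * Fraction(1, 2) ** meioses

relationships = {
    "3C": (3, 0, False),
    "3C1R": (3, 1, False),
    "Half 3C": (3, 0, True),
    "4C": (4, 0, False),
    "Half 3C1R": (3, 1, True),
    "3C2R": (3, 2, False),
}

# Relationships with the same theoretical fraction fall into the same class.
classes = defaultdict(list)
for name, args in relationships.items():
    classes[shared_fraction(*args)].append(name)

for fraction in sorted(classes, reverse=True):
    print(f"1/{fraction.denominator}: {', '.join(classes[fraction])}")
```

Each class shares half as much as the one before it, and the third class (4C, Half 3C1R, 3C2R) is the same grouping that appears in the 21.90% line quoted earlier in the thread.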

They seem to do that right for the 4C category (although the list might be longer), but it makes me wonder what they do with 4C1R? Do they even recognize it exists?

More importantly, though - and this is what you're pointing out - is that to lump everything at 5C and beyond into one group, when that group turns out to be BIGGER than the previous group is just nuts! Obviously, you have to go out to a point where lumping it all together doesn't screw everything else up, and they clearly didn't do that. It's rank incompetence - nobody should have much confidence in their calculations.

The funny thing is, people seem to UNDERSTAND that - at least experienced people do - but feel unqualified to call this sort of thing out (maybe because they ARE unqualified, in a way). They know to take all this stuff with a grain of salt, and that's the right thing to do, but a sad statement regarding the state of the art in this area.

commented Jan 7, 2019 by Living Stanley G2G6 Mach 9 (91.1k points)

Frank, welcome back!

Did you read all of the posts, or just pick and choose? Seems like you picked and chose what you wanted to respond to.

This subject obviously has meaning to you because after almost two months' absence, you've dived back in with a vengeance. And then, as now, you seem furiously opinionated.

Perhaps from your absence, you've lost my direction. We've already said--as has Blaine Bettinger himself--that these data are crowd-sourced and, deferring accuracy to sample size, are what they are.

I've already stated the following sample weaknesses in the Shared cM Project:

Under-reporting of actual relationships that share no or little DNA
Over-reporting of relationships that share little DNA: relationships that are likely false in terms of shared DNA measured (e.g., 7th and 8th cousins)
Inaccurate genealogical relationships (inaccurate family trees)
Broad spectrum of genetic genealogy knowledge of the people submitting data (data may or may not be researched and analyzed)
Variability of matching criteria among testing companies
Self-reporting of endogamy/pedigree collapse (I don't believe Blaine factors this into the averages or histograms, but does chart it in the PDF)
No allowance for population-level or haplotypical pile-up regions

I personally look to the mathematical autosomal DNA sharing as a finite baseline before I consider benchmark variances. The Shared cM Project, as more data is amassed, will help build a solid reference. But I don't think we're there yet...the new 2018 data are still pending.

Welcome back, Frank.

commented Jan 7, 2019 by Edison Williams G2G6 Pilot (439k points)

Well, thanks, Edison! I guess... I haven't actually commented on this thread before today, so I don't know that it makes sense to say I'm "back".

I'm a mathematical guy as well as a genealogy enthusiast, so yes, this is a special interest of mine. Actually, it should be of interest to anybody trying to make sense of their DNA matches. Yes, I have written on it before (as you certainly have written on many things before). Yes, I do have some strong "opinions" on some aspects - these are probably sometimes better characterized as "observations" and/or "arguments" based on my own experience and analysis of what I have seen, and of what others have said.

Do you have something to say about these "opinions", or are these 7 points you have supposed to be considered some sort of "7 Commandments" thing, with anything else being some sort of apostasy? I certainly don't have a problem with that list, aside from perhaps some apparent redundancy.

If you must know, I looked at the shorter parts of this thread first, saving the one you commented on at length for last, but didn't wait until going through all that before commenting as I saw fit. I didn't really see a concrete conclusion in there, just general opining about how complicated it all is, followed by an extended discussion of about the most oversimplified model one could imagine (which completely ignores the "conditional probability" aspect of it that has been talked about here). Correct me if I'm wrong, but I think I'm at least seeing in there that you're agreeing "Yes, the averages can be misleading", so thank you for that!

For AncestryDNA, I don't even use Blaine's chart any more (except maybe to get an idea of what the craziest outliers might be for something). I have enough matches, from enough test results, that I have a decent idea about what relationships result in what cM levels, after having assembled my own statistics (which I did because of the problems with Blaine's). If you know anything about statistics, you know that you don't even need a HUGE data set to get a pretty decent idea about a distribution.

commented Jan 7, 2019 by Living Stanley G2G6 Mach 9 (91.1k points)

Answer 1 · 2018-11-07T04:16:41+0000

While I see your point about the distribution in the histogram for 4th cousins, I can't say that I have seen any study that reports vastly different results for 4th cousins - which is around a 35 cM average. A 4th cousin match is 10 meioses. It's going to typically be a small amount shared - if at all. 35 cM as an expected amount fits, and we know that the range will be 0 cM to a few outliers that go past 100 cM shared.

Answer 2 · 2018-11-07T15:11:09+0000

I am not a mathematician nor am I a probabilities wizard. What I do is work a lot with adoptees looking to link up with bio families. For me the rule of thumb is this...

The closer the relationship the more accurate the data and the less likely there are false matches.

With each generation back a degree of uncertainty creeps in because of how DNA can recombine and start dropping data from a branch in the overall tree.

So for me Blaine's tool is just fine because I see it as a guide not hard and fast numbers. Guides have deviations so if you think of it that way and you have a result that does not quite fit... I look at how close the relationship is and just sort of make adjustments based on how accurate I view the info based on the closer relationships being more accurate than farther ones.

Charts and graphs can't really factor in all the possible issues that can creep in the farther you move away from a close relationship. In fact, for anything past 4th cousins I give more credence to the paper trail than the DNA because of all the possible things that can affect the DNA.

answered Nov 7, 2018 by Laura Bozzay G2G6 Pilot (830k points)

So...

Basically, what you're saying is:

It's "just fine" that it's misleading because I KNOW it can be misleading, so I pretty much just use my own personal experience and common sense instead, even though I use it. I don't even believe that stuff like this CAN be useful, beyond a certain point.

So if I'm reading you correctly, you're in the "Yes, it can be somewhat misleading" category. But you're OK with it, because it's better than nothing.

Really, the closest relationships are the easy ones but Blaine's tool makes even those harder than they need to be - by lumping the different companies in together, and by including overly-wild outliers.

When you get to the level where the matches are only marginally useful for helping present-day adoptees, it can give a pretty bad impression of what sorts of values you should expect at various levels, although, admittedly, matches of a given relationship generally DO fall within the stated range (because the range given is so overly big).

commented Jan 7, 2019 by Living Stanley G2G6 Mach 9 (91.1k points)

The tool's limitations are based on the data submitted. If more data is submitted for siblings and 1st cousins, it will have a more complete average for them than for, say, 2nd cousins twice removed, where the average is less of a true average if it is based on 1 or 2 pieces of submitted data. I think Blaine has tried to alert users to shortcomings in the tool so I do not think it is misleading. The word misleading carries with it an intent to deceive and I do not see that here. I think many tools have limitations; that is why we have various options of things like screwdrivers, computer programs and even eating utensils. I think that is why Blaine continues to refine the tool. I agree the tool has limitations but I do not agree it intends to deceive its users. To me just because something has limitations does not inherently mean it is misleading.

I have faith that most users are smart enough to realize 1. It is only as good as the data submitted 2. It does not profess to be 100% accurate 3. There are variables it can't calculate for. So a user like me finds it useful as a guide that in my experience works more often than not.

I use it along with paper trails and cM and SNP values. I don't know anyone who relies only on the tool, and in fact Blaine strongly suggests in his writing that you need to look at these other things. If he was misleading he would not do that.

I understand you have an issue with how he has done his math. I don't see where what he has made available is a bad thing because of the math. It is not purported to be anything but a collection of user donated data put into a reference table.

commented Jan 8, 2019 by Laura Bozzay G2G6 Pilot (830k points)

"Well, the tool only tells you two things: (1) the average and (2) the 99% range of values."

Frank, quick question. We're still waiting for the 2018 update of the Shared cM Project data, but when I refer to it I mean the published results, https://thegeneticgenealogist.com/wp-content/uploads/2017/08/Shared_cM_Project_2017.pdf, not the simplified algorithms used at the DNA Painter tool or WATO. I have always pretty much ignored the 99th percentile numbers; simply too broad a swath to be of much use.

In the 2017 publication there wasn't enough information to create histograms for all relationship categories, but when I look at the data I go straight to the histograms; the opening question talks about the histogram information. Still, I consider the theoretical averages first as a simple, unbiased baseline.
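As a rough sketch of that baseline: the "Calculated" column in the question's table follows a simple pattern, with each further cousin degree sharing a quarter of the previous one, starting from 850 cM for 1C. The snippet below assumes that pattern and uses the tool averages quoted in the same table; the function and variable names are only illustrative.

```python
# Shared cM Tool averages as quoted in the table in the question.
tool_avg = {"1C": 874, "2C": 233, "3C": 74, "4C": 35, "5C": 25}

FIRST_COUSIN_CM = 850  # calculated 1C value from the same table


def theoretical_cm(degree):
    """Expected shared cM for an n-th cousin, assuming a quarter per added degree."""
    return FIRST_COUSIN_CM / 4 ** (degree - 1)


rows = []
for label, reported in tool_avg.items():
    degree = int(label[0])  # "3C" -> 3
    expected = theoretical_cm(degree)
    rows.append((label, expected, reported, reported / expected))

for label, expected, reported, ratio in rows:
    print(f"{label}: calculated {expected:.2f} cM, tool {reported} cM, ratio {ratio:.2f}")
```

The ratio stays near 1 for 1C and 2C and climbs steeply from 3C onward, which is the divergence the question describes.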

Are you referring to some implementation of the data as a computational utility? Because a single average and the 99th percentile numbers certainly are not the only things Blaine presents.

commented Jan 8, 2019 by Edison Williams G2G6 Pilot (439k points)

Laura,

You remind me of a woman who objected to my saying that AncestryDNA is misleading its customers in reporting that a match in the "3RD COUSIN" category has a "Possible range : 3rd-4th cousins". After some discussion, it came out that she had known 2Cs and 2C1Rs in her "3RD COUSIN" category. (Actually, I think most, if not all, were.) I'm not sure if she ever recanted her "I don't think it's misleading" assertion.

Maybe you've stumbled upon a reason for this irrational behavior - your claim that "The word 'misleading' carries with it an intent to deceive...".

That's not true at all, but maybe it's a popular misconception. One can be misled by all kinds of things, whether they're even created by a human or not, and when it is created by a human it can be intentionally misleading or unintentionally misleading.

From the context, it's crystal clear that "intentionally misleading" isn't even anything that anybody should remotely think is "on the table" here. One clue would be that the question is "are they misleading?", not "is he deceiving us?" There are all kinds of phrases that can be called upon when deceit is being alleged, like "con man" or "rip off". You won't see any such thing in this discussion, because that would be completely ridiculous. Nobody is envisioning Blaine as some sort of Snidely Whiplash character, hanging out in his lair, twirling his handlebar moustache, and letting out a maniacal laugh as he watches others suffer. There is just no conceivable reason why anybody would intentionally try to trick people about this.

Your assertion that "I have faith that most users are smart enough to realize [the 'limitations']" strikes me as shockingly naïve, especially for someone who helps adoptees. This is a non-trivial subject, and most people are new and overwhelmed and intimidated by it. They see Blaine's chart and they cling to it like a life preserver. I highly doubt very many read the fine print, or make anything of it when they do.

I wouldn't call my issues "how he has done the math", but maybe that's how a non-expert might look at it. Blaine has plenty enough data - that's not the problem. And he can refine it until the end of time, but if he doesn't change how he did things it won't make any difference.

There are inherent limitations from the crowd sourcing that he can't do much about, but he could:

(1) Separate out the testing companies. AncestryDNA in particular is known for throwing away cMs, so the results there come out lower. They started doing that at a certain point in time, so any older data should be tossed.

(2) Remove the endogamy-tainted results from the table. There must be plenty of proper things to do with that data, but he's literally asking contributors "Is this data point no good", and if they say "Yes" he just uses it anyway. That mistake is practically malpractice.

(3) Throw away more of the tails. Or maybe give a 90th or 95th percentile number, but also a maximum. Just look at the 4C data on the chart, and compare to the histogram above. Really, once you get to 3C, you don't have a tail at the left, so no "chopping off" should be done on that side for those cases.

(4) Publish medians instead of means. The asymmetry of the distributions makes the mean unrepresentative.

(5) "Clustering" would also help. He identifies these clusters in his report, but then does this 2D thing for his chart (it isn't even arranged very well with all the "halfs" thrown over on one side). There is no discernible difference between the relations within the clusters - it just makes it harder for the user to separate them out, and many don't really have enough data to make much sense.

So we'd be better off with a column (for the "clusters"), instead of the big matrix. Then you could easily fit several such columns on there, one for each organization (AncestryDNA, GEDmatch, etc.)

Again, these mistakes are not the work of someone trying to trick anybody - they're just the result of not being all that great with analyzing and presenting data (which I assume is not really his area of expertise). Even the AncestryDNA blunders may not be deceitful. It could be that it's left up to marketing people, or software people who don't know much. The quality of their software in general seems to be pretty poor. Just watch one of their commercials, and ask yourself if these guys care about technical accuracy. It's a joke.

commented Jan 10, 2019 by Living Stanley G2G6 Mach 9 (91.1k points)

Frank, some things you do not know about me.

1. I do my homework. If you look up the definition of the word misleading this is what you will find:

https://www.merriam-webster.com/dictionary/mislead You will notice the first definition states it has intent to deceive; this is why I objected to your posts... I do not believe, as you have stated in your own response, that there is intent "to lead in a wrong direction or into a mistaken action or belief often by deliberate deceit" (quoting from Merriam-Webster).

2. I tested out of my college math requirement. So while I am not a mathematician that does not mean I do not understand the subject.

3. I worked with analysis for years. I was considered a subject matter expert in multiple high level computer systems like Oracle Business Suites and a number of proprietary software programs that you need clearance to even touch.

4. I test at a level that only 2% of the nation tests at in terms of dual brain functionality and am considered to be adept at both linear and lateral thinking.

I stand by my statements. Just thought you might want to know that I am basing my ideas on both dictionary definitions and the etymology of the word mislead.

https://www.etymonline.com/search?q=Mislead

commented Jan 11, 2019 by Laura Bozzay G2G6 Pilot (830k points)

Answer 3 · 2019-01-07T07:03:47+0000

Yes, they're misleading, but sometimes it really doesn't matter very much, and sometimes it's more about the range of values that are given being overly broad.

The two histograms given are helpful in visualizing the important aspects of what's going on.

(1) In the 1C histogram, it shows that it's about the reduction in common DNA, and the "coin toss" of how much of what is passed on to one cousin happens to be the SAME bits as that to the other cousin.

But what it DOESN'T show you is that if you saw a histogram that was just for AncestryDNA, you'd see that it is shifted to the left, and doesn't have as big a "tail" on the right side. The average would be about 30cM lower, I think, and the main part of the distribution would lie between the low 600cMs to about 1100cM.

But it doesn't really matter, that the mean is off by 30cM (for AncestryDNA users). There's little (if any) overlap between what you'll see for this level of relationship, and others. Even if you're off 30cM, you're still going to be able to EASILY tell that this is a 1C (or half-aunt/uncle/niece/nephew, or great-aunt/uncle/niece/nephew, or great-grandparent/child). It's Blaine's over-exaggerated range of values that can cause some grief - even the average value given for 1C almost falls within the range given for the next relationship class (1C1R, H1C, etc.).

So from Blaine's chart, you might think it's HARD to distinguish between a 1C and a 1C1R. But if you look at his data sheets, and just consider the data for AncestryDNA, you'll see that 95% of the 1Cs reported are 636cM or higher, while 95% of the 1C1Rs reported are 635cM or below. So it isn't very hard to tell them apart at all, usually, at least on AncestryDNA. In my own experience (with an admittedly limited amount of data), there's no overlap AT ALL between the 1Cs and the 1C1Rs. There's actually 100cM between the two groups!

(2) The 4C histogram shows how it gets once a different mechanism comes into play - where it's as much about WHETHER there's a match between relatives as it is about how MUCH matches. The most likely case is that there's no discernable match at all, and so there's no central "hump", and there's a downward slope right at the start.

In THIS kind of distribution, the mean is automatically going to be atypical, and the median should be used. In the distribution given, over half the values are below 29.9cM, but the reported average is 35cM (it gets worse with more distant cousins).
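To make the mean-versus-median point concrete, here is a minimal sketch using a small set of invented values with the same general shape (many small matches, a few large ones). These numbers are hypothetical and are not Shared cM Project data.

```python
from statistics import mean, median

# Hypothetical 4C-style shared cM values, invented for illustration only:
# most matches are small, and a few large ones form a long right tail.
shared_cm = [8, 10, 12, 15, 18, 20, 22, 25, 28, 30, 35, 45, 60, 90, 127]

avg = mean(shared_cm)
mid = median(shared_cm)
below_avg = sum(1 for cm in shared_cm if cm < avg)

print(f"mean   = {avg:.1f} cM")
print(f"median = {mid} cM")
print(f"{below_avg} of {len(shared_cm)} values fall below the mean")
```

In this made-up sample the handful of large values pulls the mean well above the median, so most matches sit below the "average".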

But if I'm looking at 4Cs in AncestryDNA, there are two additional problems, vs this histogram: (1) The "crowd" reporting the data probably aren't FINDING - and therefore are NOT REPORTING - 4Cs with especially low cM values and (2) AncestryDNA calculates lower cM values.

The 47 cases of 4C matches that I have handy average 25cM - a far cry from the 35cM on Blaine's table. Only 4 of them are even above 41cM, the maximum value being 69cM (WAY short of the 127cM on the table). I don't have very many "mystery matches" left that are above 40cM, so as I discover more 4Cs on my list of matches, my top value is unlikely to change, and my average may even fall somewhat.
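The reporting-bias effect can be sketched the same way. The values and the cut-off below are assumptions made up for illustration (real thresholds differ by company); the point is only that dropping the unreported low values raises the average of what is left.

```python
from statistics import mean

# Hypothetical "true" shared cM for 12 fourth-cousin pairs (invented numbers).
# Zeros stand for pairs who share no detectable DNA.
true_cm = [0, 0, 0, 0, 0, 4, 6, 9, 15, 22, 31, 53]

# Assumed cut-off below which no match is listed; real thresholds vary.
REPORT_THRESHOLD_CM = 7

# Only pairs at or above the cut-off ever show up as matches,
# so only these can be submitted to a crowd-sourced table.
reported = [cm for cm in true_cm if cm >= REPORT_THRESHOLD_CM]

true_mean = mean(true_cm)
reported_mean = mean(reported)

print(f"mean over all pairs:      {true_mean:.2f} cM")
print(f"mean over reported pairs: {reported_mean:.2f} cM")
```

With these invented numbers the reported mean comes out more than double the mean over all pairs, even though every individual value was recorded correctly.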

So are they misleading? "Yes, obviously." Sometimes it might not matter, but still, "Yes".
