Using Small Segments in DNA Confirmation

+14 votes
202 views

I have had a few inquiries from some of you WikiPeeps and have noticed a few posts here in G2G about people using small segments to confirm DNA. Someone even told me recently that 2 cM is an acceptable standard within a specific ethnic group because DNA is so hard for this ethnic group.

To answer this I grabbed a post by the genetic genealogist and WikiTreer Blaine Bettinger that I hope will help those of you wanting to squeeze matches from small segments...

Small Matching Segments – Friend or Foe? , Blaine Bettinger, December 2, 2014

asked Jul 27 in The Tree House by Mags Gaulden G2G6 Pilot (393,890 points)

4 Answers

+7 votes
 
Best answer

Great discussion. As you can tell, small segments are an area of interest for me!

I have two other blog posts that emphasize the prevalence of false matches below 10 cM: "The Danger of Distant Matches" and "The Effect of Phasing on Reducing False Distant Matches (Or, Phasing a Parent Using GEDmatch)." 

I consider small segments to be "poison," in that too many of them are false matches and we can't tell the difference between the false segments and the real segments. I use "Poison M&Ms" as an example. If I handed someone a bowl of M&Ms and told them that 30% are poisoned and there's no visual difference (similar to 30% of small segments of 5 cM or smaller), no one would eat the M&Ms. Similarly, we can't use small segments because they poison our genealogical conclusions. 
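To put rough numbers on the M&M analogy, here is a minimal sketch. It assumes a flat 30% false rate and that each segment is false independently of the others; both are simplifications for illustration only, and the function name is made up.

```python
def chance_of_at_least_one_false(n_segments: int, false_rate: float = 0.30) -> float:
    """Probability that at least one of n small segments is a false match.

    Assumption (hypothetical): every segment is false independently
    with the same probability `false_rate`.
    """
    return 1.0 - (1.0 - false_rate) ** n_segments


# How quickly the risk grows as more small segments are relied on.
for n in (1, 3, 5):
    print(n, round(chance_of_at_least_one_false(n), 3))
```

Under those assumptions, the chance that a conclusion rests on at least one false segment climbs quickly with every extra small segment used.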

Unfortunately, there's no evidence to support the hypothesis (which I often see touted) that sharing a large segment with a match increases the probability that the small segments are real. There's also no evidence to support the hypothesis that triangulating a small segment increases the probability that it is real. They're both decent hypotheses, but the only hard data we have is the study of false segments (from the 23andMe 2014 paper referenced in my posts).

(This discussion also doesn't address the issue that small segments are almost always very, very old rather than genealogically relevant anyway. For more on this, you can see the Speed and Balding paper in the ISOGG Wiki IBD page).

 

answered Jul 31 by Blaine Bettinger G2G1 (1,050 points)
selected Jul 31 by Maggie N.
I sincerely appreciate the time you spend on this subject.  It certainly makes it easier to discuss. There are a few areas where I am interested in getting your input: our terminology, your data set, and how WikiTree uses or should use DNA.

My background is with 23andme, and I think that might explain why the terminology you use is different than what I am used to.  All DNA is considered Inherited. There is an “Adam and Eve” type of understanding. A segment refers to a section of DNA.

A segment that descends from parent to child from a common ancestor and is shared by 2 DNA samples is Identical by Descent (IBD). A segment that is shared by 2 DNA samples but does not descend from a common ancestor is Identical by Chance (IBC).

Genealogy distinguishes between IBD segments that fall within a genealogical time frame and those that do not; the latter are Identical by State (IBS).  There is no attempt to identify the common ancestors of IBS segments, but they are presumed to exist.

Shared Segments are either IBD, IBS or IBC. Some people treat IBS as a subgroup within IBD but not in our case.

Putting aside what sample sizes should be used for each segment type and using the terms as I have described, is there any agreement that...

(1) There is no match unless 2 DNA Samples share at least 1 IBD segment?

(2) For each match,  a reasonable prediction can be made using the IBD and IBS segments?
Ken, if your question is not related to small segments, it might be helpful to start a new G2G thread for the topic. Thanks!
Ken - if you start a new thread, could you please direct my attention to it? Thanks!

Let me see if I can tie some of this together first.

The original post sets the topic to be “Using Small Segments in DNA Confirmation” and to respond to this is the reference to an article on “Using Small Segments in Matching”.  The target audience is “those of you wanting to squeeze matches from small segments...”

Are we talking about the same thing?

In keeping with this topic, I would like to address your interpretation of some of the 23andMe data.

For example: “The researchers found that more than 67% of all reported segments shorter than 4 cM are false-positive segments.”

This is from the abstract of the paper.

“We then used GERMLINE, a widely used IBD detection method, to detect IBD segments within this cohort. Exploiting known familial relationships, we identified a false-positive rate over 67% for 2–4 centiMorgan (cM) segments, in sharp contrast with accuracies reported in simulated data at these sizes. Nearly all false positives arose from the allowance of haplotype switch errors when detecting IBD, a necessity for retrieving long (>6 cM) segments in the presence of imperfect phasing. We introduce HaploScore, a novel, computationally efficient metric that scores IBD segments proportional to the number of switch errors they contain.” [This fixes the problem]

I am actually not sure about the details of the process but here is my general understanding.

Individuals are phased and then compared with other phased individuals.  The shared segments are stored as haplotypes (one strand) in a central database. Then the program GERMLINE uses the database to determine whether a pair of DNA samples contains an inheritable segment.  The samples are not compared with each other directly. Each DNA sample is compared with existing haplotypes, and if both share the same haplotype, the segment is reported as shared.

This process does cause many false positives.  The problem was that the phasing process caused “switch errors.”  This is why 23andMe wrote a post-processing program, HaploScore, to compensate for them. A switch error is when the phasing program confuses the child with the parent, and HaploScore fixes that.

It appears that this dramatically improves processing time and accuracy.  After all, the segments are phased.  This is very similar to what AncestryDNA does now.

If this is wrong, please let me know.

We're definitely talking about the same thing. My point to the audience of “those of you wanting to squeeze matches from small segments...” is that they can't. Small segments are poison and there is *currently* no way to use them with any confidence.

HaploScore appears to improve IBD detection, but I don't think they made the case that it resolves it (I note FIG. 4 of the paper, for example). Additionally, I don't know how HaploScore affects the smaller segment data that 23andMe provides to users via their chromosome browser.

Notably, at AncestryDNA, we are clearly getting a significant number of switch errors, as shown by the unusually high number of segments shared by close relatives. For example, I share 49 segments of DNA with my mother at AncestryDNA, when it should be 23. If phasing worked at AncestryDNA, my parents would match all of my matches rather than missing a third of them.
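The parent check described above can be sketched roughly as follows. This is only an illustration: the match names are invented, the function name is hypothetical, and it assumes both parents have tested and that each person's match list has been exported as a set of names or kit IDs.

```python
def matches_missing_from_parents(child: set, mother: set, father: set) -> set:
    """Return the child's matches that neither parent shares.

    With working phasing, this set should be empty or close to it.
    """
    return child - (mother | father)


# Invented example data: six matches for the child, four covered by a parent.
child = {"A", "B", "C", "D", "E", "F"}
mother = {"A", "B"}
father = {"C", "D"}

missing = matches_missing_from_parents(child, mother, father)
fraction_missing = len(missing) / len(child)
print(sorted(missing), round(fraction_missing, 2))
```

Matches that land in the missing set are the ones to be suspicious of, since a real match to the child should also be a match to one of the parents.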

And, of course, most people are getting their small segments from GEDmatch or FTDNA where there is no phasing.  

 

Blaine, Our experiences are very different.  I am not attempting to use the 23andMe study to support or refute "Small" segments.  My main point is that this is a study/investigation into a possible method beneficial to 23andMe and not in use anywhere, as far as I know.

I do believe understanding why we reach such opposite conclusions about small segments is worth exploring.

Your comfort level that 67% of 2-4 cM segments are false positives is “in sharp contrast” with my own.  I am not so concerned about them being used in matching, but I am interested in how these segments are useful in genealogy. I will provide examples soon.

AncestryDNA and 23andMe both use 6 cM as an important threshold, so I suppose segments smaller than 6 cM would be considered “small segments.”  The importance of these small segments is vastly different for the two companies.  The AncestryDNA objective is to attract subscribers and then sell subscriptions.  The 23andMe objective is to collect samples to be sold to the research and consumer health and trait markets.

While small segments are 23andMe’s gold, AncestryDNA considers them not worth much. Both AncestryDNA and 23andMe use or will use GERMLINE to identify IBD segments. The type of dictionary they choose will be vastly different because their goals are different.

This might also explain why we have such different experiences.  

+6 votes
Excellent article! Thanks for posting this, Mags!
answered Jul 27 by Kay Wilson G2G6 Pilot (114,380 points)
Thanks for these posts, Mags. The amount of "cousins" with super small segments writing me is mind-boggling so I can now steer them towards this link.
+4 votes
To add to this conversation is an article, [https://www.google.com/amp/s/dna-explained.com/2015/01/21/a-study-utilizing-small-segment-matching/amp/ "A Study Utilizing Small Segment Matching"] from Roberta Estes that indicates that some small segment matches may hold.  This article appeared in her blog, DNAexplained about the same time as the Bettinger article.

She seems to indicate that there may be some value in small segment matching.  Here is a quote from the article : "As we move back in time, the DNA from more distant ancestors will be divided into smaller and smaller segments, so if we ever want the ability to identify and track those segments back in time to a specific ancestor, we have to learn how to utilize small segment data"

What I take from this personally is that not all small segment matches (<5 cM) are going to be IBS, but the level of analysis required to distinguish IBD from IBS in small segment matches may be beyond the neophyte user of DNA matching in genealogy (such as myself).   Should novice users of DNA even consider small segment matches when trying to confirm relationships?
answered Jul 28 by David Douglass G2G6 Mach 1 (12,500 points)
The only way to truly work with small segments with fewer opportunities for false positives is phasing. Blaine mentions this in the Blog post I posted above.

Small segment work is not an easy task because false positives are returned in very high percentages. Being able to do the work to verify matches and rule out false positives is a very important part of this.

Mags
+1 vote

I have some problems with this paper. I describe the terms I use here....Terms

1st. No DNA services use small segments (IBS) in deciding if 2 DNA samples are a match. They require at least 1 IBD segment.  The small segments that are the subject of this paper and this discussion are clearly IBS.  IBS segments are only used to predict a particular relationship between matches such as being 4th cousins. I believe that these predictions play no role in Wikitree DNA rules/guidelines.

In his FTDNA example, the matches between him and his father to his Distant Cousin is based on the 8.25 cM (IBD) segment and not the other IBS segments.

2nd. Even in a perfect world, it is possible for a child to report a longer segment than the parent, or report a segment that is not reported for a parent.  This would require a more detailed explanation but an explanation as to how this happens in an imperfect world would be a good start.

Small segments are reported when a comparison finds at least 500 consecutive valid SNPs. For example, a child will report a 1 cM segment because the 500 SNPs compared with the distant cousin are valid.  The same comparisons are performed on the parent and distant cousin, but one of the parent's SNPs is a no-call, which puts the parent below the 500-SNP threshold.

The segment is reported for the child and not the parent.
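The threshold example above can be sketched like this. It is a simplified illustration and not how any testing company actually implements matching: the 500-SNP threshold comes from the description above, while the labels `"ok"`, `"nc"` and `"bad"` and both function names are made up, and the sketch assumes a no-call is simply left out of the count.

```python
MIN_SNPS = 500  # threshold taken from the description above


def count_until_mismatch(calls):
    """Count compatible SNPs from the start until the first incompatible pair.

    Assumption: a no-call ("nc") is skipped. It neither adds to the count
    nor stops the counting. "bad" marks an incompatible pair.
    """
    count = 0
    for call in calls:
        if call == "bad":
            break
        if call == "ok":
            count += 1
    return count


def segment_reported(calls, min_snps=MIN_SNPS):
    """A segment is reported only if enough valid SNPs were counted."""
    return count_until_mismatch(calls) >= min_snps


# Same 500 positions compared against the distant cousin.
child_vs_cousin = ["ok"] * 500
parent_vs_cousin = ["ok"] * 250 + ["nc"] + ["ok"] * 249  # one no-call

print(segment_reported(child_vs_cousin))   # child reaches the threshold
print(segment_reported(parent_vs_cousin))  # parent falls one SNP short
```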

There may be a more predictable reason, especially if you use GEDmatch for the comparisons. I believe that FTDNA has used 3 different chips to produce its results. I am not sure whether the SNP chipset was the same, but I know that 23andMe had 4 chipsets and allowed its V2 customers to upgrade to V3.

I am sure similar results will be more noticeable when you compare a child and parent to Distant cousins when they are not using the same chipset.  This is one reason I upload my 3 DNA tests to gedmatch.

Here are the #SNP's for my father's 3 kits.

AncestryDNA: 
Number of regular SNPs = 680,968
Heterozygosity index = 0.148916 (fraction of total SNPs that are heterozygous)
No-calls = 10593 = 1.5317520797153 percent.

23andme: 
Number of regular SNPs = 943,313
Heterozygosity index = 0.206257 (fraction of total SNPs that are heterozygous)
No-calls = 4552 = 0.47338705735658 percent.

FTDNA
Number of regular SNPs = 686,078
Heterozygosity index = 0.277065 (fraction of total SNPs that are heterozygous)
No-calls = 24,323 = 3.4238409011249 percent.

FTDNA is clearly a problem for gathering statistics.
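For anyone who wants to check the percentages, the AncestryDNA and FTDNA figures above are consistent with dividing the no-calls by the regular SNPs plus the no-calls. That denominator is an inference from the numbers, not a documented GEDmatch formula, and the function name is made up.

```python
def no_call_percent(regular_snps: int, no_calls: int) -> float:
    """No-calls as a percentage of all tested positions.

    Assumption: total positions = regular SNPs + no-calls.
    """
    return 100.0 * no_calls / (regular_snps + no_calls)


# (regular SNPs, no-calls) copied from the kit summaries above.
kits = {
    "AncestryDNA": (680_968, 10_593),
    "FTDNA": (686_078, 24_323),
}

for name, (snps, no_calls) in kits.items():
    print(name, round(no_call_percent(snps, no_calls), 4))
```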

answered Jul 30 by Ken Sargent G2G6 Mach 3 (36,900 points)
edited Jul 30 by Ken Sargent

I am familiar with the conditions that result in the child's segment being longer than the parent or the child's segment being reported when the parent is not.  This was based on my experience using 23andme.

I don't believe the no-calls had much of an impact, at least during the time periods I was more active on that site.

Gedmatch actually warns me about the FTDNA sample.  

"This kit has an unusually high number of no-calls, which usually results in a larger number of false matches."

This made me think about it a bit more.  The counting of valid SNPs continues until it reaches a pair of SNPs that are not compatible.  The bigger problem is that when a no-call falls on what would have been an invalid SNP, the mismatch is missed and the counting continues.  This would have the opposite effect and report a segment longer than it should have been.
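A rough sketch of that effect, under the assumption that counting skips over no-calls and only stops at an incompatible pair. The labels `"ok"`, `"nc"` and `"bad"` and the function name are invented for illustration, and this is not any company's actual matching code.

```python
def count_until_mismatch(calls):
    """Count compatible SNPs from the start until the first incompatible pair.

    Assumption: a no-call ("nc") is skipped, so it can never stop the
    counting, even where a real read would have shown a mismatch.
    """
    count = 0
    for call in calls:
        if call == "bad":
            break
        if call == "ok":
            count += 1
    return count


# What a clean read would show: a mismatch after 300 compatible SNPs.
true_comparison = ["ok"] * 300 + ["bad"] + ["ok"] * 300

# Same stretch, but the mismatching position came back as a no-call.
observed_comparison = ["ok"] * 300 + ["nc"] + ["ok"] * 300

print(count_until_mismatch(true_comparison))      # stops at the mismatch
print(count_until_mismatch(observed_comparison))  # runs straight through
```

In this made-up case the no-call hides the break point, so two shorter runs are counted as one longer one.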
