Using Small Segments in DNA Confirmation

4 Answers

Best answer

Great discussion. As you can tell, small segments are an area of interest for me!

I have two other blog posts that emphasize the prevalence of false matches below 10 cM: "The Danger of Distant Matches" and "The Effect of Phasing on Reducing False Distant Matches (Or, Phasing a Parent Using GEDmatch)."

I consider small segments to be "poison," in that too many of them are false matches and we can't tell the false segments from the real ones. I use "Poison M&Ms" as an example. If I handed someone a bowl of M&Ms and told them that 30% are poisoned and that there's no visual difference (much like the 30% of segments of 5 cM or smaller that are false), no one would eat the M&Ms. Similarly, we can't use small segments because they poison our genealogical conclusions.
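To put rough numbers on the analogy, here is a minimal sketch. It assumes each small segment is false independently with probability 0.30; the independence is an assumption made only for illustration, and `prob_all_real` is a hypothetical helper name.

```python
def prob_all_real(n_segments: int, false_rate: float = 0.30) -> float:
    """Probability that every one of n segments is real.

    Assumes each segment is false independently with probability
    `false_rate` (an illustrative assumption, not a measured property).
    """
    return (1.0 - false_rate) ** n_segments


# How quickly confidence erodes as more small segments are relied upon.
for n in (1, 3, 5, 10):
    print(f"{n:>2} small segments: P(all real) = {prob_all_real(n):.3f}")
```

Under these assumptions, a conclusion that rests on several small segments at once is more likely than not to include at least one false segment.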

Unfortunately, there's no evidence to support the hypothesis (which I often see touted) that sharing a large segment with a match increases the probability that the small segments are real. There's also no evidence to support the hypothesis that triangulating a small segment increases the probability that it is real. They're both decent hypotheses, but the only hard data we have is the study of false segments (from the 23andMe 2014 paper referenced in my posts).

(This discussion also doesn't address the issue that small segments are almost always very, very old rather than genealogically relevant anyway. For more on this, see the Speed and Balding paper on the ISOGG Wiki IBD page.)

answered Jul 31, 2017 by Blaine Bettinger G2G1 (1.8k points)
selected Jul 31, 2017 by Maggie N.

I sincerely appreciate the time you spend on this subject. It certainly makes it easier to discuss. There are a few areas where I would like your input: our terminology, your data set, and how WikiTree uses or should use DNA.

My background is with 23andMe, and I think that might explain why the terminology you use is different from what I am used to. All DNA is considered inherited. There is an “Adam and Eve” type of understanding. A segment refers to a section of DNA.

A segment that descends from parent to child from a common ancestor and is shared by 2 DNA samples is Identical by Descent (IBD). A segment that is shared by 2 DNA samples but does not descend from a common ancestor is Identical by Chance (IBC).

Genealogy distinguishes between IBD segments that fall within a genealogical time frame and those that do not; the latter are Identical by State (IBS). There is no attempt to identify the common ancestors of IBS segments, but they are presumed to exist.

Shared segments are either IBD, IBS or IBC. Some people treat IBS as a subgroup of IBD, but that is not how the terms are used here.

Putting aside what sample sizes should be used for each segment type and using the terms as I have described, is there any agreement that...

(1) There is no match unless 2 DNA Samples share at least 1 IBD segment?

(2) For each match, a reasonable prediction can be made using the IBD and IBS segments?

commented Jul 31, 2017 by Ken Sargent G2G6 Mach 6 (62.0k points)

Let me see if I can tie some of this together first.

The original post sets the topic as “Using Small Segments in DNA Confirmation”, and the response refers to an article on “Using Small Segments in Matching”. The target audience of that article is “those of you wanting to squeeze matches from small segments...”

Are we talking about the same thing?

In keeping with this topic, I would like to address your interpretation of some of the 23andMe data.

For example: “The researchers found that more than 67% of all reported segments shorter than 4 cM are false-positive segments.”

This is from the abstract of the paper.

“We then used GERMLINE, a widely used IBD detection method, to detect IBD segments within this cohort. Exploiting known familial relationships, we identified a false-positive rate over 67% for 2–4 centiMorgan (cM) segments, in sharp contrast with accuracies reported in simulated data at these sizes. Nearly all false positives arose from the allowance of haplotype switch errors when detecting IBD, a necessity for retrieving long (>6 cM) segments in the presence of imperfect phasing. We introduce HaploScore, a novel, computationally efficient metric that scores IBD segments proportional to the number of switch errors they contain.” [This fixes the problem]

I am actually not sure about the details of the process but here is my general understanding.

Individuals are phased and then compared with other phased individuals. The shared segments are stored as haplotypes (one strand) in a central database. Then the program GERMLINE uses the database to determine whether a pair of DNA samples contains an inheritable segment. The samples are not compared with each other directly. Each DNA sample is compared with existing haplotypes, and if both share the same haplotype, the segment is reported as shared.

This process does cause many false positives. The problem was that the phasing process caused “switch errors”. This is why 23andMe wrote a post-processing program, HaploScore, to compensate for them. As I understand it, “switch errors” occur when the phasing program confuses the child with the parent, and HaploScore fixes this.
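For readers unfamiliar with the term: in the phasing literature, a switch error is usually described as a point where an inferred haplotype jumps from one parental strand to the other. The toy sketch below counts such jumps. The sequences and the function name `count_switch_errors` are made up for illustration, and this is not how HaploScore or GERMLINE is implemented.

```python
def count_switch_errors(maternal: str, paternal: str, inferred: str) -> int:
    """Count points where an inferred haplotype flips between parental strands.

    Only heterozygous sites (maternal != paternal) are informative.
    """
    switches = 0
    previous_source = None
    for m, p, allele in zip(maternal, paternal, inferred):
        if m == p:
            continue  # homozygous site: the strands cannot be told apart
        if allele == m:
            source = "maternal"
        elif allele == p:
            source = "paternal"
        else:
            continue  # matches neither strand; ignored in this toy example
        if previous_source is not None and source != previous_source:
            switches += 1
        previous_source = source
    return switches


# Hypothetical true strands and one inferred haplotype that starts on the
# maternal strand and ends on the paternal strand.
maternal = "AACGTTAG"
paternal = "GACCTAAC"
inferred = "AACGTAAC"

print(count_switch_errors(maternal, paternal, inferred))
```

A detector that tolerates such flips can stitch together stretches from both strands, which is one way a short false segment can appear to match.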

It appears that this dramatically improves processing time and accuracy. After all, the segments are phased. This is very similar to what AncestryDNA does now.

If this is wrong, please let me know.

commented Aug 1, 2017 by Ken Sargent G2G6 Mach 6 (62.0k points)

We're definitely talking about the same thing. My point to the audience of “those of you wanting to squeeze matches from small segments..” is that they can't. Small segments are poison and there is *currently* no way to use them with any confidence.

HaploScore appears to improve IBD detection, but I don't think they made the case that it resolves it (I note FIG. 4 of the paper, for example). Additionally, I don't know how HaploScore affects the smaller segment data that 23andMe provides to users via their chromosome browser.

Notably, at AncestryDNA, we are clearly getting a significant number of switch errors, as shown by the unusually high number of segments shared by close relatives. For example, I share 49 segments of DNA with my mother at 23andMe, when it should be 23. If phasing worked at AncestryDNA, my parents would match all of my matches rather than missing a third of them.

And, of course, most people are getting their small segments from GEDmatch or FTDNA where there is no phasing.

commented Aug 1, 2017 by Blaine Bettinger G2G1 (1.8k points)

Blaine, our experiences are very different. I am not attempting to use the 23andMe study to support or refute "small" segments. My main point is that this is a study/investigation into a possible method beneficial to 23andMe and not in use anywhere, as far as I know.

I do believe understanding why we reach such opposite conclusions about small segments is worth exploring.

Your comfort level that 67% of 2-4 cM segments are false positives is “in sharp contrast” with my own. I am not so concerned about them being used in matching, but I am interested in how these segments are useful in genealogy. I will provide examples soon.

AncestryDNA and 23andMe both use 6 cM as an important threshold, so I suppose segments shorter than 6 cM would be considered “small segments”. The importance of these small segments differs vastly between the two companies. The AncestryDNA objective is to attract subscribers and then sell subscriptions. The 23andMe objective is to collect samples that can then be sold to the research and consumer health and trait markets.

While small segments are 23andMe’s gold, AncestryDNA considers them not worth much. Both AncestryDNA and 23andMe use or will use GERMLINE to identify IBD segments. The type of dictionary they choose will be vastly different because their goals are different.

This might also explain why we have such different experiences.

commented Aug 2, 2017 by Ken Sargent G2G6 Mach 6 (62.0k points)

Answer 1 · 2017-07-28T17:15:36+0000

To add to this conversation, here is an article, [https://www.google.com/amp/s/dna-explained.com/2015/01/21/a-study-utilizing-small-segment-matching/amp/ "A Study Utilizing Small Segment Matching"] from Roberta Estes, that indicates that some small segment matches may hold. This article appeared in her blog, DNAexplained, at about the same time as the Bettinger article.

She seems to indicate that there may be some value in small segment matching. Here is a quote from the article: "As we move back in time, the DNA from more distant ancestors will be divided into smaller and smaller segments, so if we ever want the ability to identify and track those segments back in time to a specific ancestor, we have to learn how to utilize small segment data."

What I take from this personally is that not all small segment matches (<5 cM) are going to be IBS, but the level of analysis required to distinguish IBD from IBS in small segment matches may be beyond the neophyte user of DNA matching in genealogy (such as myself). Should novice users of DNA even consider small segment matches when trying to confirm relationships?

Answer 2 · 2017-07-30T14:58:10+0000

I have some problems with this paper. The terms I use are described here: Terms

1st. No DNA service uses small segments (IBS) in deciding whether 2 DNA samples are a match. They require at least 1 IBD segment. The small segments that are the subject of this paper and this discussion are clearly IBS. IBS segments are only used to predict a particular relationship between matches, such as being 4th cousins. I believe that these predictions play no role in WikiTree DNA rules/guidelines.

In his FTDNA example, the matches that he and his father have with his distant cousin are based on the 8.25 cM (IBD) segment and not on the other IBS segments.

2nd. Even in a perfect world, it is possible for a child to report a longer segment than the parent, or report a segment that is not reported for a parent. This would require a more detailed explanation but an explanation as to how this happens in an imperfect world would be a good start.

A small segment is reported when a comparison finds at least 500 consecutive valid SNPs. For example, a child will report a 1 cM segment because all 500 SNPs compared with the distant cousin are valid. The same comparison is performed on the parent and the distant cousin, but one of the parent's SNPs is a no-call, which puts the parent below the 500-SNP threshold.

The segment is reported for the child and not the parent.
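The scenario just described can be sketched as follows. This is a toy model only: it follows the description above in treating a no-call as breaking the run of valid SNPs, assumes every called genotype is compatible, and uses hypothetical names (`longest_valid_run`, `segment_reported`). How a given vendor or GEDmatch actually treats no-calls is not established here.

```python
from typing import Optional, Sequence

NO_CALL = None   # marker for a SNP the chip failed to read
MIN_SNPS = 500   # threshold quoted above; real tools may use other values


def longest_valid_run(kit_a: Sequence[Optional[str]],
                      kit_b: Sequence[Optional[str]]) -> int:
    """Longest stretch of consecutive SNPs with a call in both kits.

    Assumption for this sketch: every called genotype is compatible, so
    only no-calls can break a run.
    """
    best = current = 0
    for a, b in zip(kit_a, kit_b):
        if a is NO_CALL or b is NO_CALL:
            current = 0  # a no-call ends the current run
        else:
            current += 1
            best = max(best, current)
    return best


def segment_reported(kit_a: Sequence[Optional[str]],
                     kit_b: Sequence[Optional[str]],
                     min_snps: int = MIN_SNPS) -> bool:
    """True if the two kits share a run of at least `min_snps` valid SNPs."""
    return longest_valid_run(kit_a, kit_b) >= min_snps


cousin = ["AG"] * 500
child = ["AG"] * 500
parent = ["AG"] * 500
parent[250] = NO_CALL  # a single no-call in the middle of the region

print("child vs cousin reported: ", segment_reported(child, cousin))
print("parent vs cousin reported:", segment_reported(parent, cousin))
```

With exactly 500 SNPs in the region, one no-call in the parent's kit splits the run into two shorter stretches, so the child reports the segment and the parent does not.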

There may be a more predictable reason, especially if you use GEDmatch for the comparisons. I believe that FTDNA has used 3 different chips to produce its results. I am not sure if the SNP chipset was the same, but I know that 23andMe had 4 chipsets and allowed their V2 chipset customers to upgrade to V3.

I am sure similar results will be more noticeable when you compare a child and parent to distant cousins who are not using the same chipset. This is one reason I upload my 3 DNA tests to GEDmatch.

Here are the #SNP's for my father's 3 kits.

AncestryDNA:
Number of regular SNPs = 680,968
Heterozygosity index = 0.148916 (fraction of total SNPs that are heterozygous)
No-calls = 10593 = 1.5317520797153 percent.

23andMe:
Number of regular SNPs = 943,313
Heterozygosity index = 0.206257 (fraction of total SNPs that are heterozygous)
No-calls = 4552 = 0.47338705735658 percent.

FTDNA:
Number of regular SNPs = 686,078
Heterozygosity index = 0.277065 (fraction of total SNPs that are heterozygous)
No-calls = 24,323 = 3.4238409011249 percent.

FTDNA is clearly a problem for gathering statistics.
