Segment differences between 23andme and GEDMatch

Question

We have several people with considerable experience with 23&Me, and I've been hoping they would show up here. I'm not one of them, never used 23&Me. I can make some general comments, but you really need a true 23&Me expert!

Most of the testing companies have created their own proprietary algorithms for processing and matching - for what they filter, what they specifically look for, how they determine the quality of a matching segment. GEDmatch does not do any of that; it simply shows you what it sees, a straightforward comparison of the raw data. My guess is that 23&Me had some reason why that segment match did not reach an internal threshold for acceptance. But there could be other reasons. Because mutations are always happening, GEDmatch allows a mismatch every so many SNPs in its comparisons. I'm sure the other companies do too, but if their parameters are slightly different, or they start the segment at a slightly different point, then one might see 2 mismatches in a single frame where the other saw 1 each in 2 adjacent frames. So one company might throw out a piece in the middle of a segment where the other company would not. Losing a small piece in the middle could leave 2 small segments on either side, each too small to qualify.
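To make the "frames" idea concrete, here is a toy sketch in Python. It is not GEDmatch's or 23&Me's actual algorithm (those details are proprietary or unknown to me); the function name, the frame size, the one-mismatch allowance and the minimum length are all made-up values for illustration. It only shows how the same raw data can give one long segment or nothing at all, depending on where the frames happen to start.

```python
def passing_segments(matches, frame_size, offset=0, max_mismatches=1, min_snps=0):
    """Toy frame-based matcher (hypothetical, for illustration only).

    matches: list of bool, one entry per SNP in chromosome order,
             True where the two kits agree at that SNP.
    A frame passes if it holds at most `max_mismatches` disagreements.
    Consecutive passing frames are joined into one segment, and segments
    shorter than `min_snps` SNPs are discarded.
    SNPs before `offset` are simply not examined.
    """
    segments = []
    current = None  # (start, end) of the segment being built, end exclusive
    for start in range(offset, len(matches), frame_size):
        frame = matches[start:start + frame_size]
        end = start + len(frame)
        if frame.count(False) <= max_mismatches:
            current = (current[0], end) if current else (start, end)
        elif current:
            segments.append(current)
            current = None
    if current:
        segments.append(current)
    return [(s, e) for s, e in segments if e - s >= min_snps]


# 40 SNPs that agree everywhere except at positions 18 and 21.
snps = [True] * 40
snps[18] = False
snps[21] = False

# Frames starting at 0: the two mismatches land in different frames.
company_a = passing_segments(snps, frame_size=10, offset=0, min_snps=20)

# Frames starting at 5: both mismatches land in the same frame, which fails,
# and the two pieces left on either side are each under 20 SNPs.
company_b = passing_segments(snps, frame_size=10, offset=5, min_snps=20)

print(company_a)  # [(0, 40)]
print(company_b)  # []
```

Same two kits, same raw data, and one set of rules reports a match while the other reports nothing.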

If this were AncestryDNA, then I would blame it on Timber, an AncestryDNA filter that removes pileups and known endogamous segment matches. But I haven't heard of 23&Me doing that. Need to hear from a true 23&Me expert though!

commented Nov 20, 2018 by Rob Jacobson G2G6 Pilot (137k points)

2 Answers

Answer 1 · 2018-11-18T18:29:59+0000

Glad it's interesting!

What I was thinking ... which I haven't tested out myself yet ... is if the region that you're looking at is indeed in one of these "pile up" areas, it may be that 23andMe has chosen simply not to report data in their summaries to you about these regions. Then when you download from 23andMe and then upload again to GEDmatch, since all of your data is in the download, GEDmatch likely reports everything to you, pile up area or not. So, in a sense, you may be getting a more complete view through GEDmatch, though 23andMe is not being inaccurate. They're perhaps trying to help you avoid going on a wild goose chase in a pile up area ...
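A rough sketch of what that kind of suppression could look like, purely as a guess: nobody in this thread knows whether 23andMe does this, and the pile-up coordinates, the 50% cutoff and the function names below are all invented for the example.

```python
def overlap_fraction(segment, region):
    """Fraction of `segment` that lies inside `region`.

    Both are (chromosome, start_bp, end_bp) tuples.
    """
    if segment[0] != region[0]:
        return 0.0
    start = max(segment[1], region[1])
    end = min(segment[2], region[2])
    if end <= start:
        return 0.0
    return (end - start) / (segment[2] - segment[1])


def hide_pileup_matches(segments, pileups, max_fraction=0.5):
    """Keep only segments that are not mostly inside a pile-up region."""
    return [
        seg for seg in segments
        if all(overlap_fraction(seg, p) <= max_fraction for p in pileups)
    ]


# Hypothetical pile-up region on chr 22 (coordinates are made up).
pileups = [(22, 20_000_000, 25_000_000)]

matches = [
    (22, 21_000_000, 24_000_000),  # entirely inside the pile-up
    (22, 18_000_000, 30_000_000),  # extends well beyond it on both sides
    (5, 21_000_000, 24_000_000),   # different chromosome
]

reported = hide_pileup_matches(matches, pileups)
print(reported)
```

Under these made-up rules the first match is hidden and the other two are reported, while a site that applies no such filter would show all three.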

I know at least for myself, the pile up area appears to demonstrate itself on chr 22. I have a bazillion matches that fall in, overlap, or extend beyond the very common area on chr 22. All of them certainly cannot be attributed to just an area of very high IBD (identical by descent) though, since they extend on one side or the other.

I have a clearer view of my Dad's DNA in this region (as I was not able to ask my mom to test before she died), and I very much want to sort out which matches in this region are from his father's side and which from his mother's ... but I have not been able to do that yet though I've done a fair amount of triangulation, etc. (Basically, my family is not related to many people who have tested ... yet ... Virtually all matches are 4th or more distant cousins.) Also, most of these matches on chr 22 appear to be Jewish ... so, simplistically, I'm trying to first figure out if the Jewish connection comes through my paternal grandmother or paternal grandfather. Depending on the week, my theory tends to bounce from one to the other ... I digress.

Learning about genetic genealogy is one of the more interesting and satisfying things I've done in a while. Hope you continue to enjoy it as well.

commented Nov 18, 2018 by Susan Keil G2G6 Mach 6 (67.5k points)

Answer 2 · 2018-11-20T17:24:42+0000

Hiya, Nancy! This is a faux pas because Rob Jacobson already answered your question. He wrote: "Most of the testing companies have created their own proprietary algorithms for processing and matching - for what they filter, what they specifically look for, how they determine the quality of a matching segment."

Rob can accurately answer stuff in 30 words. Takes me...er...just a bit more. Ahem.

From some of your comments I'm gathering this wild DNA-for-genealogy ride is somewhat new to you. So I'll apologize up-front if it ever sounds like I'm talking down to you. I'm not. I'm keenly interested in all this stuff, and sometimes go on a tear where it seems my mission is to get others brand new to it excited as well. Or...extremely bored, glassy-eyed, and turned off. Depends upon your perspective.

And mind you, I'm addressing only the most common, inexpensive over-the-counter, autosomal DNA tests we get today from 23andMe, AncestryDNA, Family Tree DNA, MyHeritage, Living DNA, and others.

What those tests examine are SNPs, single nucleotide polymorphisms. Simpler than it sounds. We have four DNA "letters": A, C, G, and T; adenine, cytosine, guanine, and thymine. Those are DNA's bases (the building blocks of deoxyribonucleic acid), and there are only the four of them. To make it even easier, they pair in what's called "complementary bases," what you'll often see referred to as "base pairs," and A and T always pair with one another, as do C and G...they never cross-pollinate.

A SNP is a precise spot along a chromosome, a locus, that's determined by the folks with the really big brains. A SNP is going to be one of those four letters, one of those bases.

You have around 3.2 billion base pairs making up the DNA in your cells. That's a boatload of letters. For population studies and genealogy, there are a lot of those base pairs we don't really care much about. The reason is they are within a protein-encoding gene, or in a regulatory region near a gene. In other words, they're important stuff to the very survival of a newborn and play a direct role in disease or physical traits. If an appreciable number of letters mutate in the HLA region of chromosome 6, as an example, it could put a severe kibosh on a baby's immune system...not good for viability of the species.

What population geneticists are most interested in are points along chromosomes that are not involved in producing proteins, loci that act as biological markers. Most of our 3.2 billion base pairs are junk DNA (yep; an actual term) as far as we know, meaning that it can freely mutate--change or even delete a letter--and it won't affect the survivability of the organism. They have no known effect on health or development, but we're learning things all the time.

Ta dah! Those are the SNPs and, depending who you talk to, somewhere around 10 to 12 million of them have been identified and cataloged. Mind you, some SNPs may be contained within a gene, and some act as guideposts, helping scientists locate genes that are associated with disease; some of these SNPs can themselves be indicators or predictors of health factors such as an individual's risk of developing certain diseases or response to certain drugs. Some companies like 23andMe are very interested in these medical/health indicators, not just the population/ancestral indicators.

Okay. Whew. Let's go back to the concept of "biological markers." You have around 10 million SNPs and 3.2 billion base pairs, so SNPs make up only about 0.313% of your whole genome. Not much. Further, our consumer tests don't look at all 10 million, only about 650,000, give or take. So the tests are skipping a whopping 99.98% of your genome.
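For anyone who wants to check that arithmetic, here it is as a few lines of Python. The three inputs are just the round figures quoted in this post, not precise counts.

```python
GENOME_BASE_PAIRS = 3_200_000_000  # "around 3.2 billion base pairs"
CATALOGED_SNPS = 10_000_000        # "around 10 million SNPs"
TESTED_SNPS = 650_000              # "about 650,000, give or take"

# Share of the genome made up by cataloged SNPs, as a percentage.
cataloged_pct = 100 * CATALOGED_SNPS / GENOME_BASE_PAIRS

# Share of the genome a typical consumer test actually reads.
tested_pct = 100 * TESTED_SNPS / GENOME_BASE_PAIRS

# Everything the test never looks at.
untested_pct = 100 - tested_pct

print(f"cataloged SNPs: {cataloged_pct:.4f}% of the genome")
print(f"tested SNPs:    {tested_pct:.4f}% of the genome")
print(f"not examined:   {untested_pct:.2f}% of the genome")
```

The cataloged share comes out at roughly 0.31%, the tested share at roughly 0.02%, and the part not examined at 99.98%.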

So how the heck are we getting segment lengths that sound awfully precise--like 12.63cM--out of that?

Smoke and mirrors. No, not really. But now we start to see just how much assumptive math, guesswork, probabilities, and modeling goes into what we get out of our tests.

Time for one of my patented TASAs: Terrible and Stupid Analogies. Let's turn those biological markers, the SNPs, into highway mile-marker signs (I know this won't translate to many countries, but hey, I'm from Texas). Let's say you want to drive from Austin to El Paso, a trip that's going to take you about 600 miles and only a small portion of that through what you'd call urbanization. West Texas can be impressive country, but it can also be awfully desolate. You can travel I-10 for many miles without seeing an abode or place of business.

So that 600-mile drive is your genome. Doesn't work out proportionately, but let's say that, along the highway, there is a mile-marker sign every mile to tell you where you are. Our SNPs. If you stand at mile-marker 482, you know exactly where you are (the slight differences in the human genome reference models notwithstanding). You're at location 482 and there's a lovely bluebonnet growing right there at the base of the sign. You just drove that last mile yourself, so you know a bit about what the country looked like.

But how many consecutive bluebonnets were there between mile-marker 482 and 481 before they were interrupted by a patch of different wildflowers? In our genome and using the typical, commercial DNA tests, there are 5,000 wildflowers between each mile-marker, and each can be one of four different kinds of flowers.
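The 5,000 figure is just the genome size divided by the number of tested SNPs, using the same round numbers as before. It is an average only; on a real chip the SNPs are not evenly spaced.

```python
GENOME_BASE_PAIRS = 3_200_000_000  # round figure from this post
TESTED_SNPS = 650_000              # round figure from this post

# Average number of untested base pairs ("wildflowers") per tested SNP
# ("mile-marker"), assuming perfectly even spacing.
average_gap = GENOME_BASE_PAIRS / TESTED_SNPS

print(f"average base pairs per tested SNP: {average_gap:,.0f}")
```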

The DNA tests can examine only the SNPs, only the mile-markers. The tests can't see the individual wildflowers in between; there isn't even a car window to look out as you drive past.

So what's happening here is that analysis of the test results is making some assumptions. For example, there are thousands of fully-sequenced and examined whole genomes about which we have data (unfortunately, more data for those of Western Eurasian origins than any other, but that looks like it's slowly improving as technologies become less expensive, more available, and more accepted). If a lot of your sampled SNPs closely correlate with some of those reference genomes, we can make some guesstimates about what's in the 99.98% of your genome we didn't look at. Broadly, this kind of guesswork is called imputation (genotyping, strictly speaking, is the reading of the SNPs themselves, though the term has other connotations, too).
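Here is a deliberately oversimplified sketch of that guesstimating step. Real tools use statistical models over large reference panels and work with both copies of each chromosome; this toy just picks the single reference sequence that agrees best with the tested positions and copies the rest from it. The sequences, positions and function name are all invented.

```python
def fill_in_from_reference(tested, reference_panel):
    """Toy gap-filling: copy untested letters from the closest reference.

    tested: dict mapping position -> letter for the SNPs the test read.
    reference_panel: list of equal-length strings, each a fully known
                     sequence from some reference genome.
    """
    def agreement(reference):
        # Number of tested positions where this reference has the same letter.
        return sum(reference[pos] == letter for pos, letter in tested.items())

    best = max(reference_panel, key=agreement)
    # Keep the letters we actually measured; borrow everything else.
    return "".join(tested.get(i, best[i]) for i in range(len(best)))


# Three made-up reference sequences, ten letters each.
panel = ["ACGTACGTAC", "ACGTTCGAAC", "TCGAACGTTC"]

# The "test" only read positions 0, 4 and 9 (the mile-markers).
my_snps = {0: "A", 4: "T", 9: "C"}

guess = fill_in_from_reference(my_snps, panel)
print(guess)  # ACGTTCGAAC
```

Only three of the ten letters were measured; the other seven are an educated guess, which is why the results deserve a grain of salt.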

On top of that, we're going to throw some math into the picture. Hey; I never promised there would be no math homework. The math almost always takes the form of probabilities when dealing with DNA...and how we use it is constantly changing--hopefully improving--the more we learn. In other words, if we aren't actually, experimentally evaluating Widget XYZ, how can we know, or at least reliably predict, what Widget XYZ really is?

I don't want to confuse things by diving into what a centiMorgan actually is and how it's calculated. Suffice to note for now that it is not a physical measurement; it isn't a ruler looking at the length of anything; it's based on a linear equation first developed several decades ago by an Indian mathematician; and it's an estimate of genetic relatedness general enough that, when you see it expressed in a few significant digits, you should probably just round to the nearest whole number. It isn't a terribly precise value, and it even differs considerably depending upon whether you are looking at a female or male genome: what we see reported from our tests are what are called sex-averaged values. All these are reasons for not messing with very small cM numbers, but that's a different can of Annelida (phylum; segmented worms; biology nerd humor...gack; no wonder all I get are groans when I pun).

And...the 12,000 character post limit hits. Who can write anything in fewer than 12,000 characters? Cough. Cough, cough.
