Segment differences between 23andme and GEDMatch

+7 votes
137 views
One of my 2nd cousins, whom I had connected with through 23andMe, has now posted DNA results on GEDMatch Genesis. The results are fairly consistent; however, one segment (11 cM) shows up on GEDMatch but not in the 23andMe results. Is this unusual?

The problem segment is on chromosome 22, at the very end, from 48,567,467 to 51,162,059.

I presume that is due to different tolerance and sampling protocols? Or is it due to 'pile-up', as someone suggested?
asked in The Tree House by Nancy Reid G2G3 (3.8k points)
retagged by Nancy Reid

Does anyone else have any explanation for this? Where did this mysterious 11 cM segment come from? Why wasn't it on 23andme, but is there on GEDMatch?

I know I'm not someone else ... Do you mind sharing which chromosome this is happening on? How about the start/stop positions too? 11 cM is not really all that long of a segment, especially depending on where in the genome it is.

I'd up vote your question in hopes others would have a better chance of seeing/opening your post and commenting ... but I already did the other day when you first posted! You could try editing your original post and adding some additional tags. See what's available and use as many as it'll let you ... say, in addition to dna,

23_and_me gedmatch gedmatch_genesis dna_confirmation cousins

Just a thought.
Thanks for your suggestions Susan! I've implemented a few of them already. I was surprised that not many people responded, but I have to admit that my tags were very weak!

Although 11 cM isn't very long, it's also not super short. Why would 23 and me have ignored that segment? Does anyone have any additional ideas?
I'll see if I can recruit a couple of gentlemen who I notice usually responding to dna-related questions ...
We have several people with considerable experience with 23&Me, and I've been hoping they would show up here.  I'm not one of them, never used 23&Me.  I can make some general comments, but you really need a true 23&Me expert!

Most of the testing companies have created their own proprietary algorithms for processing and matching - for what they filter, what they specifically look for, how they determine the quality of a matching segment.  GEDmatch does not do any of that; they simply show you what they see, straightforward comparisons of raw data.  My guess is that 23&Me has some reason that that segment match did not reach some internal threshold for acceptance.  But there could be other reasons.  Because mutations are always happening, GEDmatch allows a mistake every so many SNPs, in comparisons.  I'm sure the other companies do too, but if their parameters are slightly different, or they start the segment at a slightly different point, then one might see 2 changes in a frame where the other saw 1 each in 2 adjacent frames.  So one company might throw out a segment in the middle, where the other company might not.  Losing a small segment in the middle could leave 2 small segments, each too small to qualify.

If this were AncestryDNA, then I would blame it on Timber, an AncestryDNA filter that removes pileups and known endogamous segment matches.  But I haven't heard of 23&Me doing that.  Need to hear from a true 23&Me expert though!
Thanks for stopping by, Rob. I knew I could count on you and Edison. WikiTree generally has an exceptional community.

Yes! Thanks Rob. I think the 'missing small segment in middle' is a missing link to this problem.

2 Answers

+5 votes
Just an idea: Take a look at the following image and see if the segment that doesn't match between the 2 is in a known "pile up" area.

https://isogg.org/wiki/Identical_by_descent#/media/File:Table_3_start_and_stop_positions.jpg

Also there's this prior discussion that may help: https://www.wikitree.com/g2g/474682/results-differ-between-23andme-myheritage-gedmatch-correct
answered by Susan Keil G2G6 Mach 2 (21.1k points)
That's interesting Susan. I hadn't read anything yet about certain segments of certain chromosomes that have higher than expected representation among DNA results. So much more for me to learn!

I'm not sure that explains the fact that one common segment between my 2C and me shows up on GEDMatch, but not 23andme.

Thanks again for the tips.
Glad it's interesting!

What I was thinking ... which I haven't tested out myself yet ... is if the region that you're looking at is indeed in one of these "pile up" areas, it may be that 23andMe has chosen simply not to report data in their summaries to you about these regions. Then when you download from 23andMe and then upload to GEDmatch, since all of your data is in the download, GEDmatch likely reports everything to you, pile up area or not. So, in a sense, you may be getting a more complete view through GEDmatch though 23andMe is not being inaccurate. They're perhaps trying to help avoid your going on a wild goose chase in a pile up area ...

I know at least for myself, the pile up area appears to demonstrate itself on chr 22. There are a bizillion number of matches I have in and overlapping/extending with the very common area on chr 22. All of them certainly cannot be attributed to just an area of very high IBD (identical by descent) though since they extend on one side or the other.

 I have a clearer view of my Dad's DNA in this region (as I was not able to ask my mom to test before she died), and I very much want to sort out which matches in this region are from his father's side and which from his mother's ... but I have not been able to do that yet though I've done a fair amount of triangulation, etc. (Basically, my family is not related to many people who have tested ... yet ... Virtually all matches are 4th or more distant cousins.) Also, most of these matches on chr 22 appear to be Jewish ... so, simplistically, I'm trying to first figure out if the Jewish connection comes through my paternal grandmother or paternal grandfather. Depending on the week, my theory tends to bounce from one to the other ... I digress.

Learning about genetic genealogy is one of the more interesting and satisfying things I've done in a while. Hope you continue to enjoy it as well.

Sounds like it's the infamous chromo 22 pile-up on the highway!

I'm enjoying learning about this DNA stuff! And I love how helpful (and patient) everyone is here! 

+8 votes

Hiya, Nancy! This is a faux pas because Rob Jacobson already answered your question. He wrote: "Most of the testing companies have created their own proprietary algorithms for processing and matching - for what they filter, what they specifically look for, how they determine the quality of a matching segment."

Rob can accurately answer stuff in 30 words. Takes me...er...just a bit more. Ahem.

From some of your comments I'm gathering this wild DNA-for-genealogy ride is somewhat new to you. So I'll apologize up-front if it ever sounds like I'm talking down to you. I'm not. I'm keenly interested in all this stuff, and sometimes go on a tear where it seems my mission is to get others brand new to it excited as well. Or...extremely bored, glassy-eyed, and turned off. Depends upon your perspective.

And mind you, I'm addressing only the most common, inexpensive over-the-counter, autosomal DNA tests we get today from 23andMe, AncestryDNA, Family Tree DNA, MyHeritage, Living DNA, and others.

What those tests examine are SNPs, single nucleotide polymorphisms. Simpler than it sounds. We have four DNA "letters": A, C, G, and T; adenine, cytosine, guanine, and thymine. Those are DNA's nucleobases (the building blocks of deoxyribonucleic acid and all), and there are only the four of them. To make it even easier, they pair in what's called "complementary bases," what you'll often see referred to as "base pairs," and A and T always pair with one another, as do C and G...they never cross-pollinate.

A SNP is a precise spot along a chromosome, a locus, that's determined by the folks with the really big brains. A SNP is going to be one of those four letters, one of those nucleobases.

You have around 3.2 billion base pairs making up the DNA in your cells. That's a boatload of letters. For population studies and genealogy, there are a lot of those base pairs we don't really care much about. The reason is they are within a protein-encoding gene, or in a regulatory region near a gene. In other words, they're important stuff to the very survival of a newborn and play a direct role in disease or physical traits. If an appreciable number of letters mutate in the HLA region of chromosome 6, as an example, it could put a severe kibosh on a baby's immune system...not good for viability of the species.

What population geneticists are most interested in are points along chromosomes that are not involved in producing proteins, loci that act as biological markers. Most of our 3.2 billion base pairs is junk DNA (yep; an actual term) as far as we know, meaning that it can freely mutate--change or even delete its nucleobase--and it won't affect the survivability of the organism. They have no known effect on health or development, but we're learning things all the time.

Ta dah! Those are the SNPs and, depending who you talk to, somewhere around 10 to 12 million of them have been identified and cataloged. Mind you, some SNPs may be contained within a gene, and some act as guideposts, helping scientists locate genes that are associated with disease; some of these SNPs can themselves be indicators or predictors of health factors such as an individual's risk of developing certain diseases or response to certain drugs. Some companies like 23andMe are very interested in these medical/health indicators, not just the population/ancestral indicators.

Okay. Whew. Let's go back to the concept of "biological markers." You have around 10 million SNPs and 3.2 billion base pairs, so SNPs make up only about 0.313% of your whole genome. Not much. Further, our consumer tests don't look at all 10 million, only about 650,000, give or take. So the tests are skipping a whopping 99.98% of your genome.
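For anyone who wants to check that arithmetic, here is a minimal Python sketch. The constant names are made up for illustration, and the inputs are just the round figures quoted above, not exact counts from any particular test chip.

```python
# Round figures quoted in the post (approximations, not exact counts).
GENOME_BP = 3_200_000_000     # base pairs in the genome
KNOWN_SNPS = 10_000_000       # cataloged SNPs, low end of the quoted range
TESTED_SNPS = 650_000         # SNPs on a typical consumer test

# Share of the genome that is a cataloged SNP position, in percent.
snp_share = KNOWN_SNPS / GENOME_BP * 100

# Share of the genome a consumer test actually reads, and what it skips.
tested_share = TESTED_SNPS / GENOME_BP * 100
untested_share = 100 - tested_share

# Average number of base pairs between two tested SNPs.
avg_gap_bp = GENOME_BP / TESTED_SNPS

print(f"cataloged SNPs: about {snp_share:.2f}% of the genome")
print(f"untested:       about {untested_share:.2f}% of the genome")
print(f"average gap:    about {avg_gap_bp:,.0f} base pairs between tested SNPs")
```

The average gap works out to a little under 5,000 base pairs, which is the spacing the highway analogy later in this answer relies on.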

So how the heck are we getting segment lengths that sound awfully precise--like 12.63cM--out of that?

Smoke and mirrors. No, not really. But now we start to see just how much assumptive math, guesswork, probabilities, and modeling goes into what we get out of our tests.

Time for one of my patented TASAs: Terrible and Stupid Analogies. Let's turn those biological markers, the SNPs, into highway mile-marker signs (I know this won't translate to many countries, but hey, I'm from Texas). Let's say you want to drive from Austin to El Paso, a trip that's going to take you about 600 miles and only a small portion of that through what you'd call urbanization. West Texas can be impressive country, but it can also be awfully desolate. You can travel I-10 for many miles without seeing an abode or place of business.

So that 600-mile drive is your genome. Doesn't work out proportionately, but let's say that, along the highway, there is a mile-marker sign every mile to tell you where you are. Our SNPs. If you stand at mile-marker 482, you know exactly where you are (the slight differences in the human genome reference models notwithstanding). You're at location 482 and there's a lovely bluebonnet growing right there at the base of the sign. You just drove that last mile yourself, so you know a bit about what the country looked like.

But how many consecutive bluebonnets were there between mile-marker 482 and 481 before they were interrupted by a patch of different wildflowers? In our genome and using the typical, commercial DNA tests, there are 5,000 wildflowers between each mile-marker, and each can be one of four different kinds of flowers.

The DNA tests can examine only the SNPs, only the mile-markers. The tests can't see the individual wildflowers in between; there isn't even a car window to look out as you drive past.

So what's happening here is that analysis of the test results is making some assumptions. For example, there are thousands of fully-sequenced and examined whole genomes about which we have data (unfortunately, more data for those of Western Eurasian origins than any other, but that looks like it's slowly improving as technologies become less expensive, more available, and more accepted). If a lot of your sampled SNPs closely correlate with some of those reference genomes, we can make some guesstimates about what's in the 99.98% of your genome we didn't look at. Broadly, this kind of filling-in is called imputation; genotyping proper is just reading the SNPs the chip actually tests.

On top of that, we're going to throw some math into the picture. Hey; I never promised there would be no math homework. The math almost always takes the form of probabilities when dealing with DNA...and how we use it is constantly changing--hopefully improving--the more we learn. In other words, if we aren't actually, experimentally evaluating Widget XYZ, how can we know, or at least reliably predict, what Widget XYZ really is?

I don't want to confuse things by diving into what a centiMorgan actually is and how it's calculated. Suffice to note for now that it is not a physical measurement; it isn't a ruler looking at the length of anything; it's based on a mapping function first developed several decades ago by an Indian mathematician; and it's an estimate of genetic relatedness general enough that, when you see it expressed in a few significant digits, you should probably just round to the nearest whole number. It isn't a terribly precise value, and it even differs considerably depending upon whether you are looking at a female or male genome: what we see reported from our tests are what are called sex-averaged values. All these are reasons for not messing with very small cM numbers, but that's a different can of Annelida (phylum; segmented worms; biology nerd humor...gack; no wonder all I get are groans when I pun).

And...the 12,000 character post limit hits. Who can write anything in fewer than 12,000 characters? Cough. Cough, cough.

answered by Edison Williams G2G6 Pilot (177k points)

Ahem. So your DNA test has examined 650,000 mile-markers, SNPs, and you want to know if you and I might be related. All we can compare are mile-markers, not the 5,000 points (average) in between each mile marker. Our first assumption is that if you and I share a whole bunch of mile-markers in a consecutive line, then that stretch of highway very probably matches exactly, that all the other thousands or millions of points along the way also match.

Our tests can't tell us precisely when the matching begins and ends because we can only see the mile-markers. There might be many thousands--or no--matching points before and after that particular stretch of road. There could be patches of pink evening primroses interrupting the shared sequence of bluebonnets, but for now we're going to assume there aren't.

But how many consecutive matching mile-marker signs are enough to assume a match? GEDmatch defaults to 500 SNPs. Is that enough? Are fewer still strongly indicative of a match? Is 700 stronger evidence? There are SNP-dense areas of chromosomes, and other areas that are virtual SNP deserts. Those stretches are still chock full of DNA base pairs, but there are stretches where there may be too little SNP density to determine reliable results.

As an example of that, the new chromosome browser at FTDNA clearly shows chromosomal areas where they simply don't report SNPs. These include a sizable length at the centromere of chromosome 1, at or near the centromeres of Chr 5, 9, and 16, and the beginning of Chr 13-15 and 21 and 22 (an example is the gray-hatched areas in the image). This sort of testing analysis will vary from testing company to testing company.

Is it no longer a match if we're good on mile-markers 218 through 397 except that 304 is different? What about if both 304 and 305 are different, but all the others are the same?

In GEDmatch, now we're talking about the settings for the mismatch evaluation window and the mismatch-bunching limit. Basically an error allowance. What proportion of mismatched SNPs are we willing to accept and still call it a match? In GEDmatch, you can set the minimum SNP count to be considered as an evaluation threshold (the "how many are enough" question); the number of SNPs in what they call a mismatch evaluation window: two sequential non-matching SNPs have to be farther apart from one another than this limit (by default, the same as the count threshold); and then the mismatch-bunching limit: how close single mismatching SNPs may be to each other (by default this is 50% of the value of the mismatch evaluation window).
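To make the "error allowance" idea concrete, here is a toy Python sketch of a half-identical scan with a minimum SNP count and a mismatch-bunching limit. It is not GEDmatch's actual algorithm, which isn't published in this thread; the function names, the two-letter genotype strings, and the tiny default-free parameters are all assumptions chosen for illustration.

```python
def half_identical(g1: str, g2: str) -> bool:
    """True if two genotypes share at least one allele.

    For a two-allele SNP the only failure is opposite homozygotes,
    e.g. "AA" vs "GG". A no-call ("--") is treated as compatible.
    """
    if "-" in g1 or "-" in g2:
        return True
    return bool(set(g1) & set(g2))


def find_segments(a, b, min_snps, bunching_limit):
    """Return (start, end) SNP-index pairs, end exclusive, of matching runs.

    A lone mismatch is tolerated. Two mismatches closer together than
    bunching_limit SNPs end the current run at the second mismatch.
    Runs shorter than min_snps are discarded.
    """
    segments = []
    start = 0
    last_mismatch = None          # tolerated mismatch inside the current run
    n = min(len(a), len(b))
    for i in range(n):
        if half_identical(a[i], b[i]):
            continue
        if last_mismatch is not None and i - last_mismatch < bunching_limit:
            if i - start >= min_snps:
                segments.append((start, i))
            start = i + 1         # simplification: restart after the mismatch
            last_mismatch = None
        else:
            last_mismatch = i
    if n - start >= min_snps:
        segments.append((start, n))
    return segments


# Twenty "mile-markers" that all match, then one and two bad markers.
mine = ["AG"] * 20
yours = ["AG"] * 20
mine[8], yours[8] = "AA", "GG"                 # one opposite homozygote
one_bad = find_segments(mine, yours, min_snps=10, bunching_limit=5)

mine[9], yours[9] = "AA", "GG"                 # a second one right next to it
two_bad = find_segments(mine, yours, min_snps=10, bunching_limit=5)
two_bad_strict = find_segments(mine, yours, min_snps=11, bunching_limit=5)

print(one_bad, two_bad, two_bad_strict)
```

With these toy settings, one bad marker leaves the whole stretch intact, two adjacent bad markers cut it down to the second half, and raising the minimum SNP count by one makes the match disappear entirely. That is the mechanism Rob described: slightly different parameters can turn one reported segment into none.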

We have some granular level of control over what we get from GEDmatch. Many people simply accept the defaults--in fact, some even interpret those defaults as being some sort of industry standard, which they aren't--but it's good to know what the options are and what they mean.

The commercial testing companies set their own criteria and parameters and, as Rob noted, these can be closely-held intellectual properties that will never be revealed to us. Bidness and all that. Prime examples here are the proprietary algorithms Underdog (genotype computational phasing) and Timber (match filtering) used by AncestryDNA, and the MyHeritage algorithm that considers two shorter segments that they determine should be "stitched" together to represent one meaningful segment.

To the best of my knowledge, this is how 23andMe currently evaluates autosomal segments. For half-identical regions (HIR; what all the other testing companies and GEDmatch report) at least one detected segment must be a minimum of 7cM and contain a minimum of 700 tested SNPs. One or more SNPs may be a no-call (the test couldn't return a definitive allele) and not affect the calculation, but 700 SNPs have to have been targeted in the micro-array. If one segment meets that criterion, then subsequent HIR segments will be included if they are at least 5cM and 700 SNPs. Their error rate allowance seems to be fixed at one opposite homozygote per 300 SNPs, and each opposite homozygote in an HIR must be separated by 300 SNPs. For fully-identical regions (FIR) only, the threshold is consistent at 5cM and 500 SNPs. Again, recognize that GEDmatch doesn't and can't distinguish between HIR and FIR from the uploaded raw results.
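Those half-identical thresholds can be written down as a small filter. This is only a sketch of the rules as described in the paragraph above, not 23andMe's code; the `Segment` class, the function name, and every SNP count in the examples are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    cm: float     # genetic length in centiMorgans
    snps: int     # number of tested SNPs in the segment


def reported_hir_segments(segments, first_cm=7.0, next_cm=5.0, min_snps=700):
    """Apply the half-identical thresholds as described in the post.

    At least one segment must reach first_cm and min_snps. If one does,
    every segment reaching next_cm and min_snps is reported.
    """
    has_anchor = any(s.cm >= first_cm and s.snps >= min_snps for s in segments)
    if not has_anchor:
        return []
    return [s for s in segments if s.cm >= next_cm and s.snps >= min_snps]


# An 11 cM segment in a SNP-poor region (650 SNPs is an invented figure).
sparse = reported_hir_segments([Segment(cm=11.0, snps=650)])

# A large anchor segment lets a 5.5 cM segment through, but not one
# that falls short on SNP count.
mixed = reported_hir_segments(
    [Segment(25.0, 3000), Segment(5.5, 800), Segment(6.0, 500)]
)

print(sparse)
print(mixed)
```

The first case shows how a segment that looks comfortably long in centiMorgans can still go unreported if the region carries too few tested SNPs, which is one plausible reading of what happened at the end of Chr 22.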

So there you have it. That 11cM segment on Chr 22 from 48,567,467 to 51,162,059 is right at the very end of the chromosome, at the telomere or "end cap" of the chromosome. In fact, using the GRCh38 human genome reference model, Chr 22 is only 50,818,468 base pairs long. I did mention differences between reference models, right?
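The reference-model point is easy to check with a few lines of Python. The GRCh38 length is the one quoted above; the GRCh37 length of 51,304,566 bp is not from this thread and is included as an assumption (it is the commonly published figure for the older build).

```python
# Chromosome 22 length in base pairs under two reference builds.
# GRCh38 is the figure quoted in the post; GRCh37 is an added assumption.
CHR22_LENGTH = {"GRCh37": 51_304_566, "GRCh38": 50_818_468}

SEGMENT_START = 48_567_467
SEGMENT_END = 51_162_059

segment_bp = SEGMENT_END - SEGMENT_START

# Does the reported end position exist on the chromosome in each build?
fits = {build: SEGMENT_END <= length for build, length in CHR22_LENGTH.items()}

print(f"segment spans {segment_bp:,} base pairs")
for build, ok in fits.items():
    print(f"{build}: end position {'fits' if ok else 'runs past the end'}")
```

The end position only fits on the older build, which suggests the start/stop numbers were reported against older coordinates and that the segment really does run almost to the tip of the chromosome.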

Telomere areas, like centromeres, are often iffy for SNP evaluation. It wouldn't surprise me one bit if 23andMe "capped" (again with the bad puns) their consideration of Chr 22 well before position 50M, maybe even by 48M or 49M, which could render that segment as one not reported.

And I did mention differences between female and male genomes, right? That what we see reported is a sex-averaged centiMorgan value? Well, that particular portion of Chr 22 is, if you're female, 4.5cM; if you're male, it's 18.3cM (this per the genome map interpolator at Rutgers University).

So what you see when it comes to centiMorgans isn't always exactly what you get. There will seldom be a one-to-one correlation between results as reported by different companies, and you can play around with the settings in GEDmatch and get differing results...some of no use if the parameters are set too loosely. Pile-up areas were mentioned, and these can result in "sticky" segments persisting throughout certain populations, and even within certain groups of autosomal DNA haplotypes. Odds are, if you were to chart all the one-to-many matches you see in GEDmatch, you'd find spike-points that represent haplotypical pile-up regions that should be considered skeptically when it comes to use as genealogical evidence (Debbie Kennett wrote a great piece this year about this phenomenon with nifty charting examples).

This is absolutely great stuff for genealogy, but it isn't exactly as simple, as cut-and-dried, as some try to make it out to be. You can't accurately turn it into simplistic formulae like "three cousins and a shared 7cM segment and you're golden." Takes a little study and work.

Thanks for stopping by, Edison!!! I'm going to re-read this with coffee tomorrow morning!

Wow! You are a great story teller Edison! I love your "TASA" TM about the highway markers. I'll be dreaming about bluebonnets tonight. 

There is no shortage of stuff there for me to digest so I will definitely read this several times. I took a degree in human biology back in the dark ages. I vaguely remember my genetics course. (I loved it.) But they still had lots to discover back then and one undergrad course wasn't exactly diving deep. 

I definitely don't understand your comments about the difference between the female genome and the male genome....specifically what those implications are. I'm female and the cousin I'm comparing to is male. What would the significance of this be out there near the end of poor chromosome 22? 

More questions to follow for sure. 

Thanks again for this incredible lesson. It has now whetted my appetite to understand all of this better.

You're very welcome, Susan! If you want the unabridged version (not that super-compressed, Reader's Digest short summary), just let me know!  <snarf>

Wow. I think Nancy read that.

"I definitely don't understand your comments about the difference between the female genome and the male genome....specifically what those implications are."

For your Chr 22 match with a male cousin, it has no effect. You just have to use the sex-averaged value as reported by GEDmatch. BTW, uber-complex math problem to explain sex-averaged cM values. Remember the 4.5cM female and 18.3cM male values I mentioned? (4.5+18.3)/2 = 11.4cM. Et voila!
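The same averaging, plus a rough recombination rate for that stretch, as a short Python sketch. The variable names are illustrative; the cM values and the start/stop positions are the ones already given in this thread.

```python
# Female and male map lengths quoted for this stretch of Chr 22.
female_cm = 4.5
male_cm = 18.3

# Sex-averaged value: the plain mean of the two.
sex_averaged_cm = (female_cm + male_cm) / 2

# Physical length of the segment, from the start/stop positions in the question.
segment_mb = (51_162_059 - 48_567_467) / 1_000_000

# The same physical stretch, expressed as cM per megabase for each sex.
female_rate = female_cm / segment_mb
male_rate = male_cm / segment_mb

print(f"sex-averaged: {sex_averaged_cm:.1f} cM over {segment_mb:.2f} Mb")
print(f"female: {female_rate:.2f} cM/Mb, male: {male_rate:.2f} cM/Mb")
```

One physical stretch, two quite different genetic lengths: that is the sense in which a centiMorgan figure is an estimate rather than a ruler measurement.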

Mentioning the gender difference was really just to illustrate that centiMorgans are shifty lil' creatures; no absolute measurements about 'em. A centiMorgan is determined by a chunk of a chromosome, from an arbitrary point A to point B, for which a crossover during a single meiosis, a single generation, is expected to occur 1% of the time. There's no consistency along the length of a chromosome, or among different chromosomes.

You and I have about the same number of DNA base pairs, about 3.2 billion. But on average, during oogenesis, egg formation, your genome will see about 45 crossovers, or recombinations. My genome, being male and not as complex and sophisticated <wink> during spermatogenesis will only undergo crossover about 26 times. Ergo, the significant difference in centiMorgan calculations between the male and female genomes.
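Those crossover counts translate directly into total map lengths, because one expected crossover per meiosis is, by definition, one Morgan, or 100 centiMorgans. A minimal sketch using the round numbers above (the names are illustrative):

```python
CM_PER_CROSSOVER = 100        # one expected crossover = 1 Morgan = 100 cM

female_crossovers = 45        # average per oogenesis, as quoted
male_crossovers = 26          # average per spermatogenesis, as quoted

female_map_cm = female_crossovers * CM_PER_CROSSOVER
male_map_cm = male_crossovers * CM_PER_CROSSOVER
sex_averaged_map_cm = (female_map_cm + male_map_cm) / 2

print(f"female genome:  about {female_map_cm} cM")
print(f"male genome:    about {male_map_cm} cM")
print(f"sex-averaged:   about {sex_averaged_map_cm:.0f} cM")
```

So the same 3.2 billion base pairs measure roughly 4,500 cM on the female map and 2,600 cM on the male map, with the sex-averaged figure falling in between.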

BTW, all the talk about DNA segments? Those 45 and 26 crossovers? Et voila again! That's how segmenting happens. The Waring Blender of the remarkable process of meiosis. Fun stuff.


...