Ann, Sorry! My time on G2G has been very sparse and hit-or-miss the past couple of months.
And, yeah; I just kind of danced over "imputation" with no explanation, didn't I? Then again, nobody really wants more word-count from me. <cough>
But what I was thinking of specifically is AncestryDNA. They don't give us a chromosome browser or chromosomal detail, but they do give us more insight than most into the process they use to arrive at DNA matches. You know this better than I do, but for everyone else's entertainment, here's a super-quick summary...
Ancestry starts with a form of computational phasing based on genotyping, i.e., how your DNA compares to reference models. To help compensate for population genetic differences, they subdivide each chromosome into tiny "microsegments" of 96 SNPs each. They use a type of "hashing" function, a form of imputation, to compare those microsegments more efficiently across their entire database. Matching starts at the microsegment level; from there, they step outward in both directions along the chromosome, one SNP at a time, checking whether the match still holds. When they bump into a pair of alleles that mismatch regardless of the computational phasing, that's the segment boundary (unlike GEDmatch, whose defaults allow a very lenient number of mismatches).
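If it helps to picture it, here's a toy Python sketch of that seed-and-extend idea. Only the 96-SNP window size comes from Ancestry's own description; the function names, data shapes, and use of Python's built-in hash() are my inventions, not their pipeline.

```python
WINDOW = 96  # SNPs per "microsegment" (the one number Ancestry publishes)

def microsegment_hashes(haplotype):
    """Hash each fixed 96-SNP window of one phased haplotype so windows
    can be compared across a whole database without SNP-by-SNP scans."""
    return {i: hash(tuple(haplotype[i:i + WINDOW]))
            for i in range(0, len(haplotype) - WINDOW + 1, WINDOW)}

def extend_match(hap_a, hap_b, start):
    """From a seeded window, walk outward one SNP at a time in both
    directions; the first mismatching allele pair ends the segment."""
    left = start
    while left > 0 and hap_a[left - 1] == hap_b[left - 1]:
        left -= 1
    right = start + WINDOW
    while right < len(hap_a) and hap_a[right] == hap_b[right]:
        right += 1
    return left, right  # half-open SNP range of the matching segment

def find_segments(hap_a, hap_b):
    """Seed on identical microsegment hashes, then extend each seed."""
    a_hashes = microsegment_hashes(hap_a)
    b_hashes = microsegment_hashes(hap_b)
    segments = {extend_match(hap_a, hap_b, i)
                for i, h in a_hashes.items() if b_hashes.get(i) == h}
    return sorted(segments)
```

The payoff of the hashing step is that two windows either hash identically or they don't, so the expensive one-SNP-at-a-time walk only happens around real seeds instead of across the whole database.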
Then comes the proprietary Timber algorithm which, I think, has a bad rep. Its job is to de-emphasize, or down-weight, the matching information that is less likely to be informative of closer relationships. This takes into account not just documented pile-up regions, but how an individual's results compare to the entire dataset (wrongly or not, I refer to these as "haplotypic pile-up regions"). Ancestry describes it this way:
"The strategy is to analyze matching results accumulated over a large number of genotype samples, then identify, separately for each individual, regions of the genome with unusually high rates of matching. Once we have identified these regions, we reduce the genetic distance of detected IBD segments overlapping these regions. We call these adjusted distances 'Timber scores.'"
By definition, yep, imputation is augmentative. It's all about guessing the value of something that wasn't actually tested. But I think it can be used as a constraint, as well, to better qualify the genealogical/ancestral value of a segment by some fancy extrapolation of how it compares to a large dataset of multiple populations.
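For anyone who's never seen imputation in the flesh, a stripped-down toy version, with a made-up panel and a simple consensus vote, looks something like this:

```python
def impute_snp(observed, missing_idx, panel):
    """observed: alleles in SNP order, with None at the untested position.
    panel: complete reference haplotypes in the same SNP order.
    Returns the consensus allele among panel haplotypes that agree with
    every allele that actually was tested."""
    votes = {}
    for hap in panel:
        if all(o is None or o == h for o, h in zip(observed, hap)):
            votes[hap[missing_idx]] = votes.get(hap[missing_idx], 0) + 1
    return max(votes, key=votes.get) if votes else None

# e.g., impute_snp(["A", None, "G"], 1,
#                  [["A","C","G"], ["A","C","G"], ["T","T","G"]]) -> "C"
```

The "constraint" use I'm imagining just runs the same comparison in reverse: if a reported segment disagrees badly with every plausible panel haplotype, that's a reason to doubt the segment.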
Ultimately the point, though, was that GEDmatch does nothing of the kind, as far as I know. Discounting the "slimming" procedure, which I've never seen a detailed explanation of (at least not one detailed enough to tell us what, exactly, they're throwing away), everything at GEDmatch is simple arithmetic.
And, unfortunately I believe, since Genesis went into production and brought with it the need to compare DNA tests that may share as few as 23% of the same tested SNPs, they've loosened their default matching criteria at least twice. For the free one-to-one matching tool, the minimum SNP count used to be 700. Now it's a dynamically adjusted threshold under which two-thirds of the segments deemed valid will have considered only 185 to 214 of the same SNPs. That's a huge difference. And the "mismatch-bunching limit," GEDmatch's allowance for alleles that don't actually match, is that SNP window divided by 2. They also introduced the option to "prevent hard breaks"; it isn't on by default, but checking it lets a gap of over a half-million base pairs sit inside a segment that is still considered continuous and unbroken. For a small chromosome like Chr 21, a gap that size is over 1% of the entire chromosome; the quick arithmetic is below.
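Here's the back-of-the-envelope math behind those last two claims, with Chr 21's GRCh38 length plugged in and 200 SNPs picked by me as a representative value from that 185-214 range:

```python
snp_window = 200                   # my pick from the 185-214 range above
mismatch_bunching_limit = snp_window // 2
print(mismatch_bunching_limit)     # 100 non-matching alleles can still "match"

chr21_bp = 46_709_983              # GRCh38 length of chromosome 21
gap_bp = 500_000                   # "prevent hard breaks" gap allowance
print(f"{gap_bp / chr21_bp:.1%}")  # ~1.1% of the whole chromosome
```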
I'm rambling. Not unusual. ;-) But unless I know what parameters someone has applied to their match reporting at GEDmatch, I've come to basically discount any segment that isn't at least in the high teens in centiMorgan calculation.
P.S. When did you upload that whopping 39-million-SNP file to GEDmatch? I've never been able to get a complete answer about how large their catalog is--in other words, the maximum number of markers they will accept and where we can download a list of them--and since I began trying to use WGS data there in 2019, I've never been able to get anything larger than a file just under 4 million SNPs to upload. That's been all trial and error beyond the collection of SNP "templates" that WGSExtract uses. And interestingly, the 3.9-million-SNP file ended up being "slimmed" to very nearly the same 1.1-1.2 million count as the WGSExtract "combined" kits.