Are there new explanations of pile up regions?

+8 votes
745 views
The cause of chromosome pile-up regions was discussed in several questions during 2018. I have not noticed any fresh insights into the causes since then. My question today comes from a new small-segment match on Chromosome 22 at MyHeritage that shows a great many matches in the chromosome browser. The matches fall within the marked pile-up area on Chr 22 in DNA Painter.
in The Tree House by Douglas Rutherford G2G2 (3.0k points)
retagged by Ellen Smith

1 Answer

+17 votes
 
Best answer

Hey, Douglas. We've pretty much understood for a while the various mechanisms, at least the basics, that arrive at our seeing autosomal DNA "pile-up" regions or, more technically, areas of excess IBD sharing (a listed summary at the bottom). I put together a little "cheat-sheet" (it's in PDF format) last year of some particular areas on different chromosomes to be extra cautious about when it comes to evaluating matching segments. But these relate to the human population as a whole, and pile-up regions come in different flavors.

Chr 22 can sometimes be a problem. Under the outdated GRCh37 genome reference assembly we still use for genealogy (released June 2013, one major and 14 patch release versions ago) it's 51,304,566 base pairs long; in the newer GRCh38 assembly it's a bit shorter, 50,818,468 base pairs long. Chr 22 is an acrocentric chromosome (explained in the cheat-sheet) so the first ~12,200,000 positions can't be used accurately for genealogy. The centromeric and pericentromeric regions account for positions ~12,200,001 to 17,900,000. Global population levels of excess IBD sharing have been reported at positions 16,051,881 through 25,095,451. And there are 969 known protein-coding genes on the chromosome starting after the 16 million bp mark and including some large ones like IGL, MYO18B, TTC28, the aptly-named LARGE1, and others. The latter can be an important consideration due to something called genetic linkage, which makes segments less likely to break within a gene or its closest flanking neighbor alleles...keeps us from getting the equivalent of a Star Trek transporter scramble where pieces end up where they're not supposed to be and cause deleterious effects to the organism.

Basically, the first half of Chr 22 isn't of much use for genealogy. And we have to keep in mind that our typical autosomal microarray tests target around 15% of the tested SNPs in exonic regions, those areas involved in coding genes, because the companies are interested in clinical and pharmacological applications. Some of these are population/genealogy relevant, but the majority are not.
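The Chr 22 coordinates quoted above can be turned into a quick overlap check for any reported segment. This is only a minimal sketch: the region labels, the function names, and the idea of scoring a segment by the fraction of its base pairs inside each region are my own assumptions for illustration, and the positions are the GRCh37 values given in this answer.

```python
# GRCh37 positions on Chr 22 as quoted in the answer; labels are illustrative shorthand.
CHR22_REGIONS = {
    "acrocentric short arm": (1, 12_200_000),
    "centromeric/pericentromeric": (12_200_001, 17_900_000),
    "excess IBD (pile-up)": (16_051_881, 25_095_451),
}


def overlap_fraction(seg_start, seg_end, region_start, region_end):
    """Return the fraction of the segment's base pairs inside the region.

    All coordinates are treated as inclusive base-pair positions.
    """
    if seg_end < seg_start:
        raise ValueError("segment end precedes segment start")
    shared = min(seg_end, region_end) - max(seg_start, region_start) + 1
    return max(shared, 0) / (seg_end - seg_start + 1)


def flag_segment(seg_start, seg_end):
    """Report, per region, how much of the segment falls inside it."""
    return {
        name: round(overlap_fraction(seg_start, seg_end, lo, hi), 3)
        for name, (lo, hi) in CHR22_REGIONS.items()
    }


if __name__ == "__main__":
    # Hypothetical match segment, not taken from any real kit.
    print(flag_segment(20_000_000, 30_000_000))
```

A segment that sits mostly inside the pile-up range deserves more skepticism than one that only clips its edge; where to draw that line is a judgment call, not something this sketch decides.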

Too, MyHeritage has earned a reputation for being a bit...aggressive with matching. They employ what they call a "stitching" algorithm that tries to infer when two smaller segments are showing as distinct and separate, and if parameters are met MyHeritage will synthetically "stitch" them together and report it as being a single segment. This is essentially the opposite of what AncestryDNA's Timber algorithm does in its attempt to remove potential false-positive matches. My vote is that the AncestryDNA approach is the more accurate of the two.

I know that no one wants me to take a deep dive (and a few thousand very dry words) into an explanation of factors that can lead to excess IBD sharing, though I do believe it's one of the elements at the heart of many incorrect DNA matching representations, particularly triangulations. So I'll just do a quick bullet-summary for now. Basic explanations can be found with a little Google-fu, but it will take reading some journal articles to get a good grasp of them; Google Scholar can help with that (and I just now added five recent papers to the ISOGG "Identical by descent" page). Most of these factors deal with biological functions that take place during the two stages and multiple phases of meiosis; one deals with still-maturing genomic information:

The centromere effect: Centromere-proximal crossovers are suppressed; crossovers do not occur close to the centromere in highly repetitive (HR) heterochromatic areas. As the density gradually becomes less repetitive (LR) toward the adjacent euchromatin the suppression becomes less strong.

Non-pericentromeric heterochromatic regions: Other areas of HR heterochromatin also do not participate in meiotic recombination. Similar to the pericentromeric areas, the degree of crossover suppression weakens as the heterochromatin becomes less densely repetitive along the chromosome. 

Crossover interference: The non-random placement of crossovers with respect to each other during meiosis. This is an apparent regulatory function to ensure that crossovers on the same chromosome are distributed well apart from one another. Ergo the double-strand break(s) that occur first on a particular chromosome may help dictate what other regions along the chromosome might or might not be subsequent candidates for additional breaks and crossovers.

Genetic linkage: The nearer two genes and any associated, flanking exons are on a chromosome, the lower the chance of a double-strand break during crossover separating them, and the more likely they are to be inherited together. There are areas of some chromosomes that contain greater gene density than others; you can browse around the NIH Genome Data Viewer to get an idea visually of where protein coding genes reside.

Linkage disequilibrium: At first glance it seems it should be the opposite of genetic linkage, but it's really a sibling, and an important one for pile-ups. From Wikipedia: "Linkage disequilibrium (LD) is the non-random association of alleles at different loci in a given population. Loci are said to be in linkage disequilibrium when the frequency of association of their different alleles is higher or lower than what would be expected if the loci were independent and associated randomly. Linkage disequilibrium is influenced by many factors, including selection, the rate of genetic recombination, mutation rate, genetic drift, the system of mating, population structure, and genetic linkage. As a result, the pattern of linkage disequilibrium in a genome is a powerful signal of the population genetic processes that are structuring it. In spite of its name, linkage disequilibrium may exist between alleles at different loci without any genetic linkage between them and independently of whether or not allele frequencies are in equilibrium (not changing with time)."

Crossover hotspots: Our understanding here has been increasing and there are as many as 50,000 hotspots identified across different populations. These are areas of the genome that show empirically higher rates of double-strand breaks and crossovers than would occur from a baseline expectation. Some of these sites are "fragile," meaning that they have a greater tendency toward double-strand breaks and crossover, and are identifiable by certain trinucleotide repeats. It's worth another reminder here that all our genealogy testing companies continue to use an outdated version of genome reference assembly which doesn't incorporate newer understanding of crossover hotspots...and crossovers are what form the segments we use for atDNA and upon which we base the rather imprecise calculation of centiMorgans.

Imprecise understanding of recombination rates across the genome: Related to the above, the advent of fiscally accessible whole genome sequencing starting circa 2013 has shown that our ability to estimate recombination rates is still somewhat rudimentary and is limited by insufficient amounts of informative genetic data and/or by high computational costs; this area of genomic information is still developing (Zhou, Browning, and Browning; 2020).

We need a stronger grasp of this, and then we need to factor it properly into centiMorgan calculation in an updated genome reference assembly or, perhaps, move to a pangenomic approach and do away with the notion of a single reference map. A distressing factoid (at least to me): of the 20 donor genomes the current reference was meant to draw information from, about 70% of the reference sequence was obtained from a single individual (Ballouz, et al. 2019).

We often hear quite a lot about DNA recombination being random. In truth, it's pretty far from being a purely random process.

The short message is that pile-up regions occur in a spectrum, ranging from genetic linkage preserving significant blocks of protein-coding and related exonic DNA across global populations, to pile-ups that are continental-level indicative of broad-scale population bottleneck events, to regional- and even familial-level haplotypic pile-ups that express as blocks of DNA carried forward for many generations.

by Edison Williams G2G6 Pilot (439k points)
selected by Valorie Zimmerman
Hi Edison,   I’m delighted to find your response.  The WikiTree community can be continually grateful for your depth of understanding and ability to share your knowledge in an understandable way.

Many thanks,  Douglas

Whenever I see a DNA related answer that is so long that I have to scroll down on a 4k screen to see the name of the person who posted it, I assume it was Edison. . . . and I mean that in the very best way!  I echo what Douglas said.

Thanks for the best answer star, Valorie. 

Douglas, I started out making that a comment rather than an answer because I knew it would be a 500-foot-view summary and not really an answer to your question. But--surprise!--it grew overlong so I copied it into an answer instead. It's a complex and shifting subject. Too, some purists don't like the term pile-up region, and some influencers in genetic genealogy ignore their existence, or at least never mention them.

One thing that I ignored was the last sentence of your question. In DNA Painter you can click on the chromosome number to expand the display, and then click on any of the pile-up regions indicated by the slanted gray bars. That bit of Chr 22 is the 16,051,881 to 25,095,451 range I mentioned, one of the regions identified by Li, et al. in 2014. To my knowledge, all the pile-up regions Jonny Perl shows at DNA Painter come from that 2014 study.

DNA for genealogy is often treated as evidence that's either true or false...because it seems, well, all sciencey. But like other forms of evidence--even especially so in the case of genetics--DNA evidence exists on a scale from "can't take it to mean much at all" to "take it to the bank and cash it."

Pile-up regions, in my opinion, aren't anathema. Whether they're regions that may apply to several global populations as in the Hong Li study, or they're a pile-up map you've made for yourself based on your own haplotype and areas where you show small matches to a gazillion people (see a good article from 2018 by Debbie Kennett about how this works and what it looks like), it's all about evaluating the evidence.

The matches have to be carefully and skeptically analyzed on a case-by-case basis. For example: Is it a very small segment which is, de facto, unreliable to begin with considering we still have yet to move to whole genome sequencing for genealogy? Is the number of SNPs used in the comparison adequate (I personally consider anything less than 700 unreliable and less than 900 suspect)? Are you working with phased segments on one parental side or on both, and for both of the cousins who match? Is it a singleton match, meaning only a single segment shared between the two of you? Is the segment entirely or mostly encompassed by an identified pile-up, or does only a minority portion of it reside in the pile-up? If the latter, is the portion that's outside the pile-up of a probably reliable size and SNP density? If a small segment, to what extent is it comprised of protein coding genes? Does either the start or stop position of the segment fall inside a pile-up region, a heterochromatic region, or a coding gene or exonic area? Even the same set of raw DNA results will display varying segment start and stop positions and centiMorgan calculations after uploading to different services: have you viewed the same match at multiple companies, confirmed they all show the match, and then used the highest value starting position and the lowest value ending position as the lowest common denominator defining the segment? Did you use a non-vendor tool like the Williams Lab at Cornell or the Matise Lab at Rutgers to calculate the centiMorgan value for that defined segment?
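Two of those checks are mechanical enough to sketch in code: taking the highest start and lowest end across companies, and applying my personal 700/900 SNP thresholds. Everything named here (the `Report` record, the function names, the "adequate" label, the sample numbers) is hypothetical and only meant to show the arithmetic; it is not any vendor's method.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """One company's report of the same match (hypothetical record layout)."""
    company: str
    start: int  # segment start position, base pairs
    end: int    # segment end position, base pairs
    snps: int   # number of SNPs used in the comparison


def conservative_segment(reports):
    """Highest start and lowest end across all reports.

    Returns None if there are no reports or the reported segments
    do not share any common stretch.
    """
    if not reports:
        return None
    start = max(r.start for r in reports)
    end = min(r.end for r in reports)
    return (start, end) if start <= end else None


def snp_rating(snps):
    """Apply the personal rule of thumb from the text: <700 unreliable, <900 suspect."""
    if snps < 700:
        return "unreliable"
    if snps < 900:
        return "suspect"
    return "adequate"  # label assumed here; the text only names the two lower tiers


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    reports = [
        Report("Company A", 20_100_000, 31_500_000, 1200),
        Report("Company B", 20_450_000, 31_200_000, 850),
    ]
    print(conservative_segment(reports))
    for r in reports:
        print(r.company, snp_rating(r.snps))
```

The positions that come out of this are what you would then feed to a non-vendor centiMorgan calculator.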

I could go on, but the point is that the smaller the segment and the more distant the relationship between the test-takers, the more rigorously the evidence needs to be analyzed. AncestryDNA shows you have a 1700cM match to an aunt, good to go. A singleton reported match of 14cM to a 4th cousin once removed needs to be run through the gauntlet.

Leah Larkin had a good article with a handy illustration that gives an idea of why paying attention to pile-up regions is important: https://thednageek.com/the-small-segment-debate-is-over/. The whole thing is worth a read, but you can scroll down to the section titled "The Population Problem" for the bit I'm referring to.

We see references to the probable age of small segments and I think that often fails to resonate with us...in large part because, as genealogists, we typically go about the use of DNA in exactly the reverse way a scientist would: we start out with the intent of verifying an end result, of verifying that two sets of data correlate to a particular ancestor, to an already-stated conclusion. Which is, really, pretty much the definition of confirmation bias. Some small segments will be false, as in they weren't inherited as contiguous segments from either the mothers or the fathers involved. See for example the table of data compiled by John Walden and Tim Janzen at the "Identical by descent" ISOGG Wiki page. In analyzing over 9,000 shared segments, they found that 38% of the time 8cM segments couldn't be found in either parent, so were by definition a false-positive; that false-positive rate jumped up to 58% for 7cM segments, and 74% for 6cM segments. But there obviously is a subset of small segments that are validly identical by descent.
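To make those percentages concrete, here is a small sketch that applies them to a list of matches. The rates are the three quoted above; treating each one as an independent per-segment probability, and the function name, are simplifying assumptions of mine rather than anything from the Walden and Janzen data.

```python
# False-positive rates by segment size (cM) as quoted in the text above.
FALSE_POSITIVE_RATE = {8: 0.38, 7: 0.58, 6: 0.74}


def expected_valid(segment_cm, n_matches):
    """Expected number of matches of this size that are not false positives.

    Assumes each match is independently false with the quoted rate,
    which is a simplification for illustration.
    """
    if segment_cm not in FALSE_POSITIVE_RATE:
        raise KeyError(f"no quoted rate for {segment_cm} cM segments")
    return n_matches * (1 - FALSE_POSITIVE_RATE[segment_cm])


if __name__ == "__main__":
    # Hypothetical match list: 100 singleton matches at each size.
    for cm in (8, 7, 6):
        print(cm, "cM:", round(expected_valid(cm, 100)), "of 100 expected to be real")
```

Even the segments that survive this cut are only shown to be real, not shown to come from the ancestor you have in mind.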

The trick then becomes determining whether a valid small segment came from where you think it did. If it's in a pile-up region--whether one that's been published or your own haplotypic pile-ups that you've mapped--there's a good chance it actually has more than one inheritance pathway and the one you're trying to use to verify that most recent common ancestor may not be the origin of the segment. That's why estimates of segment age can be important. Peter Ralph and Graham Coop did some work indicating that only segments longer than 10cM date from within the last 500 years for those of European descent (Ralph and Coop; 2013). Doug Speed and David Balding did an evaluation of using SNPs as a measure of relatedness and found that about 40% of matches on segments of 20 million base pairs (roughly 20cM) can date back beyond 10 generations (Speed and Balding; 2014).

Leah's handy illustration (it's in the blog post linked above) gives an example of why an old, ancestral segment--often the result of linkage disequilibrium at a regional or familial level--makes a difference and can confuse or invalidate an attempt at autosomal triangulation.

To quote the text Leah supplies: "Consider this scenario: Person A and Person B share the 'Blue' DNA segment (solid path), and both also have 'Purple' in their trees (dashed path). While it might seem reasonable to say the Blue segment is proof that A and B are both descended from 'Purple,' that would be wrong. The shared DNA came to both A and B through a much more distant ancestor via a population that had lots of Blue descendants."

She points out that, in this instance, three statements are true:

  1. A and B might both be descended from Purple.
  2. A and B share a "Blue" segment of DNA.
  3. The "Blue" segment is not evidence that A and B are descended from Purple.

Now consider complicating the illustration much more with 1,024 8g-grandparents and all their descendants.
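The 1,024 figure is just doubling per generation, as this short sketch shows. The function name is made up, and it counts slots in the pedigree chart; with pedigree collapse the number of distinct people can be smaller.

```python
def ancestor_slots(great_count):
    """Number of pedigree slots for Ng-grandparents.

    Parents are generation 1 and grandparents generation 2, so
    Ng-grandparents sit at generation N + 2, with 2 ** (N + 2) slots.
    """
    if great_count < 0:
        raise ValueError("great_count must be zero or more")
    return 2 ** (great_count + 2)


if __name__ == "__main__":
    print(ancestor_slots(8))  # 8g-grandparents, ten generations back
```

Every one of those slots is a potential alternate pathway for an old segment, which is the point of Leah's illustration.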

I should add that Leah is one of the few blogging genealogists who have a PhD in the biological sciences...and she's a fellow Texan...even if her degree is from UT Austin. 

But that's the autosomal DNA trap in genealogy. If we're working with a smaller segment we have to analyze it and first try to determine if it's a valid, IBD segment at all. If it seems to be, then we need to be especially careful not to attribute causation via a specific ancestor if it's even somewhat possible that isn't the case; correlation doesn't equal causation.

Douglas (and Thom), thank you for saying that the WikiTree community can be grateful to me for sharing knowledge. What I've just proven--yet again--however, is that my greatest contribution to WikiTree is only thousands upon thousands of words that the search engines can index. I'm just a search-fodder content creator. The last time I went on a vacation where I didn't have a keyboard with me it almost drove me... Well, let's just say it was more stressful than not going on vacation at all. Sigh.
