A Brief History of the yDNA Haplotree

+27 votes
444 views

I just added this to the DNA-Newbie forum at Groups.io, and thought it might be useful for WikiTreers. It isn't my usual squirrel-chasing, way-too-wordy, deep dive down a DNA rabbit hole, and instead is intended to provide a snapshot of how we arrived at the expansive yDNA haplotree that we have today.

Even though yDNA testing for consumers started in 2001 at Family Tree DNA, a decade before the first autosomal tests, many of us have a greater familiarity with autosomal testing and centiMorgans than we do with yDNA SNPs and STRs. It's a little different mindset, one where "genetic distance" doesn't mean "number of generations separated." But one of the huge advantages with yDNA is that it doesn't undergo recombination at meiosis and that it changes only via mutation...which means it can reach many centuries into the past.

Those time periods are defined by specific SNPs, single nucleotide polymorphisms, and while we really can't match an exact date to a mutation event, what we do have is a large tree, a haplotree, of discrete and hierarchical branches, or subclades. A test-taker may be 20 branches deep under a given top-level haplogroup--for example: R, I, G, or E--and for each branch there will be some coalescence point in time, some point when a mutation occurred and one subbranch split into two or more additional branches.

We've seen questions about what seem to be two different naming conventions for haplogroups, and occasionally some general confusion that arises from having so many branches in the tree, with new branches being added almost weekly. By contrast, the mtDNA phylotree hasn't been altered since February 2016, and is only 7% the size of the yDNA haplotree.

This is just a quick chronological rundown of where we started and how we got to where we are today.

1 February 2002: The Y Chromosome Consortium, now defunct, published in Genome Research an article titled, "A Nomenclature System for the Tree of Human Y-Chromosomal Binary Haplogroups." This was the proposal that launched the way we began naming haplogroups. It showed two haplogroup naming systems. The first, described as "by lineage," later came to be called the "YCC long-form" in some circles, and the International Society of Genetic Genealogy (ISOGG) continued to use it while it maintained its own haplotree through 2020. It takes the form of the top-level, or basal, haplogroup letter followed by a series of numbers and letters that distinguish the branches, or subclades. For example, R1b1a1b and I1a1a1a1a.

The second nomenclature, called "by mutation" in the paper, came to be what we would shift to when the explosive growth of the haplotree meant that unwieldy strings of characters were no longer practical in defining branches of the tree. R1b1a1b became R-M269 and I1a1a1a1a became I-M227. Much more practical.

When the YCC paper was published, there were a total of 243 Y-SNPs known (single nucleotide polymorphisms, or mutations among the four DNA "letters" at a single position on the chromosome), and the haplotree had a proposed 153 branches.

2008: The haplotree, over the course of six years, had grown to 311 cataloged branches and approximately 600 identified SNPs.

2010: The tree now had 440 branches and included 800 SNPs.

2014: Family Tree DNA and the Genographic Project jointly released their new, combined yDNA haplotree. This impressive collaboration, housed at the FTDNA website, included at its launch over 1,200 branches and more than 6,200 SNPs. This represented the start of the yDNA haplogroup knowledge explosion because it was in 2014 that FTDNA introduced its first version of the Big Y sequencing test.

September 2018: It took a while for the more expensive Big Y test to be well understood and see greater sales volume, but in the intervening four years the new data had grown the haplotree over 13-fold, to 16,361 branches.

January 2019: FTDNA introduced the Big Y-700 test which, on average, provided 50% more SNP coverage and tested up to 838 STRs (short tandem repeats, the type of markers examined in the original Y-12, Y-25, Y-37, Y-67, and Y-111 tests; SNPs define haplogroups, not STRs, but sets of STR values can reliably predict a high-level haplogroup).

March 2020: A little over one year after the Big Y-700 was first available for sale, the yDNA haplotree had grown to 26,862 branches.

9 February 2022: In a bit less than two years, the haplotree had doubled in size again and now counted 52,394 branches.

22 February 2023: A year later and we had added over 12,000 branches to the tree, which now numbered 64,638 distinct branches.

18 February 2024: As of this morning, the haplotree at FTDNA describes 76,626 branches.

Since the advent of FTDNA's Big Y test, the hierarchical "yDNA Tree of Humankind" has grown to identify almost 64 times as many discrete, chronological branches as we knew about as recently as 2014, just 10 years ago. And the authors of the original Y Chromosome Consortium paper in 2002 (for whom Michael Hammer of the University of Arizona, a foundational scientist for Family Tree DNA, was the corresponding author) likely never quite imagined that the 153 branches known at that time--and in need of a consistent and formal taxonomy with which to describe them--would be nearing 80,000 branches just 22 years later.
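For anyone who wants to check the arithmetic behind the "13-fold" and "64 times" figures, here is a minimal Python sketch using only the branch counts quoted in the timeline above. One assumption to flag: the 2014 launch figure is given as "over 1,200" branches, so the sketch treats it as exactly 1,200, which makes the fold changes slightly generous. The names `BRANCHES`, `fold_change`, and `growth_steps` are purely illustrative.

```python
# Branch counts quoted in the timeline (label, number of branches).
# Assumption: "over 1,200" in 2014 is treated as exactly 1,200.
BRANCHES = [
    ("2002", 153),
    ("2008", 311),
    ("2010", 440),
    ("2014", 1_200),
    ("Sep 2018", 16_361),
    ("Mar 2020", 26_862),
    ("Feb 2022", 52_394),
    ("Feb 2023", 64_638),
    ("Feb 2024", 76_626),
]


def fold_change(start, end):
    """How many times larger `end` is than `start`."""
    return end / start


def growth_steps(rows):
    """For each consecutive pair of snapshots, return
    (from_label, to_label, branches_added, fold_change)."""
    steps = []
    for (label0, n0), (label1, n1) in zip(rows, rows[1:]):
        steps.append((label0, label1, n1 - n0, fold_change(n0, n1)))
    return steps


if __name__ == "__main__":
    for label0, label1, added, fold in growth_steps(BRANCHES):
        print(f"{label0:>8} -> {label1:<8} +{added:>6,} branches ({fold:.2f}x)")
    counts = dict(BRANCHES)
    print(f"2014 -> Feb 2024: {fold_change(counts['2014'], counts['Feb 2024']):.1f}x")
```

Running it prints one line per step of the timeline plus the overall 2014-to-2024 multiple, so the numbers in the post can be compared against the raw counts directly.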

Edited: A particularly astute reader broke down one of my sentences in diagrammatic fashion and noted that, as described in Bill Bryson's book, Bryson's Dictionary of Troublesome Words (2002, page 114), I had incorrectly used the word "than" in a comparative construction when I should have used "as" because, strictly speaking, it was not a comparison of different things (e.g., X is larger than Y), but instead used a qualifier ("64 times as many") as a distinguishing factor between two of the same things that were simply in a different quantitative state of equality. Ahem. But the grammatical correction is proof that I may have as many as 14 people who actually read the stuff I write!

in The Tree House by Edison Williams G2G6 Pilot (449k points)
edited by Edison Williams
I find this interesting and want to look at a visual of the variations of y-DNA. Thank you for the time line.

My husband, our son (adopted), and I decided to do the Genographic study when National Geographic first published about it. At the time, it was an either/or for them, so they decided to do the mtDNA for the first Eve study. When the project merged with FTDNA, I updated my information, but my family members did not, and their info is now lost to them (except whatever they saved personally). Recently I needed to do a new test (2005-2023) for a more current, in-depth mtDNA result, so I’m assuming those who joined the study early on would have to do the same.

 I’ve talked to my husband about doing one of the levels of y-DNA testing.  I find DNA more fascinating than I anticipated.
Another fascinating exploration of DNA!

Sadly my digest email arrives between 12 and 1am so I delayed reading your post until this morning. If I had realised this was such a short post I would have read it earlier.

Would it be an imposition to ask if you could append LV (ie Limited verbosity) in the header for these more bite sized posts?

As pointed out by Ken, reading one of your posts is an absolute highlight of WikiTree.

Simon, thank you very much! But what is this "limited verbosity" of which you speak? I'm afraid I am unfamiliar with that phrase.

Still and all, the "Question," sans the added editorial about grammar, was 44% of the way to the G2G maximum character limit. The fake-out was that I put the word "brief" in the title.

Melissa: You're correct about the Genographic Project. I also was one of the early adopters in phase one. I captured my reports when released but, since I was already testing with FTDNA for yDNA and mtDNA, I didn't follow the project closely. Sometime after it moved into phase two (circa 2012?) I tried logging back in only to find that my previous account was no longer valid.

Since then--the various limitations currently in effect at the major testing companies in the wake of the 23andMe data incident notwithstanding--I have made it a practice to always obtain what raw test data I'm able to, to carefully back those up in multiple places, and to capture report updates (e.g., match lists, "ethnicity" estimates, the Big Y Block Tree, and more) at least once per quarter.

If I get hit by a bus tomorrow, there's a cousin who has all that data, just so it isn't lost. Including all 97 gigabytes of my whole genome sequencing info.

5 Answers

+7 votes

Thanks for creating a reference list for me!

We might want to include:

+10 votes
Thanks for posting this chronological history and progress of Y-DNA. It will be interesting to see where the progress leads in another year and then 2 years, etc. I wonder when the Big Y-700 will be obsolete and what will take its place!
by Virginia Fields G2G Astronaut (1.2m points)
+6 votes
Once again, I feel a whole lot smarter than I did when I woke up this morning. And it wasn't because I stayed in a Holiday Inn.

Thank you once again for enlightening those of us who practice a little genealogy on the side. I was aware of about 10% of what you wrote before and now I feel dangerously filled with knowledge (Probably due to a lack of coffee on my part).

As Virginia said, it will be very interesting to see where we're at in another few years ... perhaps a Big-Y 1000? Maybe we should get a pool going on what the date will be when we break the 100,000 branch mark! I think it will come sooner than we anticipate!
by Ken Parman G2G6 Pilot (122k points)

Ken, I feel dangerously filled with vacuous gaps of knowledge every day...despite being heavily caffeinated. Oops. I lost another brain cell just in the time it took to write that...

Similarly, my record prognosticating the future of DNA testing for genealogy has proven to be almost always wrong. Heck, back on Groundhog Day 2019 I predicted that the preponderance of genetic genealogy testing would, by 2023, have shifted away from microarray tests to whole genome sequencing. Yeah; not even close on that one.

But I think the Big Y-700 results will stay viable for yDNA for a very long time, even when the Next Big Thing comes along to supersede it. To our understanding as of a couple of years ago, a large portion of the Y chromosome on its long arm was useless for genealogy and population genetics purposes. That's the reason the Big Y test examines only about 41% of the chromosome. Here's an illustration from FTDNA:

At each end of the chromosome are the pseudoautosomal regions (PAR). Their main function is to allow the Y chromosome to join with the X from the mother in order to form the chromosomal pair. The PAR actually is subject to recombination, unlike the rest of the Y, and because there is medical/clinical value in knowing about some of the SNPs contained in that region, our basic autosomal microarray tests do look in there. Because I happen to have the numbers handy, by way of example, the AncestryDNA v1 test looked at as many as 440 SNPs in the PAR, and AncestryDNA v2 as many as 525, depending upon the iteration of the individual test. There are 16 protein-coding genes in PAR #1 and 3 in PAR #2.

Both the X and the Y have pseudoautosomal regions. They are essentially useless for genealogy, but because some SNPs are examined in our inexpensive autosomal tests the PAR can cause confusion. For one thing, since the X chromosome contributed by the father to his daughter contains his pseudoautosomal regions, in the way some reports are generated it can make it look as if the daughter has some yDNA.

Second, under the GRCh37 genome reference we use for all our autosomal tests, PAR #1 starts at the beginning of the X chromosome and continues through position 2,699,520. PAR #2 begins at position 154,931,044 and continues to the end of the chromosome at 155,260,560. Any reported xDNA matches in those areas need to be completely excluded from genealogical evaluation.
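If you keep your xDNA match segments in a spreadsheet or file, a check along these lines can flag the ones that touch a pseudoautosomal region. This is only a sketch: it assumes GRCh37 coordinates, it uses the boundaries exactly as quoted above (with PAR #1 taken to start at position 1, per the description "the beginning of the X chromosome"), and the function names are made up for illustration rather than taken from any testing company's tools.

```python
# X-chromosome pseudoautosomal boundaries under GRCh37, as quoted in the post.
# Assumption: PAR #1 is taken to begin at position 1.
X_PAR1 = (1, 2_699_520)
X_PAR2 = (154_931_044, 155_260_560)


def overlap_length(seg_start, seg_end, region):
    """Number of positions shared by a segment and a region (inclusive coordinates)."""
    lo = max(seg_start, region[0])
    hi = min(seg_end, region[1])
    return max(0, hi - lo + 1)


def par_hits(seg_start, seg_end):
    """Names of the pseudoautosomal regions a reported X segment touches."""
    hits = []
    if overlap_length(seg_start, seg_end, X_PAR1):
        hits.append("PAR1")
    if overlap_length(seg_start, seg_end, X_PAR2):
        hits.append("PAR2")
    return hits


def non_par_length(seg_start, seg_end):
    """Length of the segment, in base pairs, that lies outside both PARs."""
    total = seg_end - seg_start + 1
    inside = (overlap_length(seg_start, seg_end, X_PAR1)
              + overlap_length(seg_start, seg_end, X_PAR2))
    return total - inside


if __name__ == "__main__":
    # A hypothetical reported match running from 1,000,000 to 5,000,000.
    print(par_hits(1_000_000, 5_000_000), non_par_length(1_000_000, 5_000_000))
```

A segment that returns an empty list from `par_hits` lies wholly outside both regions. For anything else, `non_par_length` shows how much of the match is left once the pseudoautosomal portion is set aside.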

Next is that large area labeled "inaccessible" in the diagram. This is a heterochromatic region, specifically one of constitutive heterochromatin. You can get more detail at Wikipedia. Essentially, though, this big region is filled with tightly packed, highly repetitive DNA. None of our conventional sequencing technologies can read it with any accuracy. This kind of heterochromatic DNA is in other places in our genomes, as well, and comprises around 6.5% of all our nuclear DNA.

It wasn't until the Telomere to Telomere Consortium (T2T) successfully used a hybridization of cutting-edge sequencing technologies in 2022 that we were able to sequence effectively 100% of the genome for the first time. See Nurk, et al. "The Complete Sequence of a Human Genome." Science 376:6588 (April 2022): 44–53. https://doi.org/10.1126/science.abj6987.

Oversimplifying, that sizable chunk of our DNA went so long without being sequenced because conventional Next Generation Sequencing has to break up DNA into relatively tiny bits in order to read it. Multiple "read" passes are then performed, typically 30 (called 30X sequencing), up to 50 in the case of the Big Y, and you can buy sequencing that does 100 read passes. Finally, all those reads are computationally processed in order to accurately determine which chunks of DNA align with which other chunks.

In keeping with my trademark on World's Worst Metaphors™, pretend the task looks something like this...only multiplied by approximately 6.16 billion individual nucleotides:

M__y __d _ ____le l___
___y ___ _ l_____ _a__
____ h__ a ______ __m_
_a__ _a_ _ _i____ ___b
__r_ ___ _ __ttl_ l__b

Ah hah! "Mary had a little lamb." But if the sequences are really repetitive, then it becomes impossible to accurately reconstruct the correct nucleotide sequence from the small bits into which the chromosomes have been broken. We're familiar with the term Short Tandem Repeat (STR) from our yDNA testing. Most STRs consist of 6 or fewer DNA "letters," e.g., DYS448 which is identified by the repetition of AGAGAT multiple times. Most of these are also referred to as microsatellites.
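To make the "Mary had a little lamb" puzzle concrete, here is a toy Python sketch of the idea. It is emphatically not how a real sequence assembler works: it assumes the five partial reads are already lined up position by position, which is the hard part that real software has to solve. The second function, `placements`, is a made-up helper that simply counts where a small fragment could fit, to show why repetitive text is a problem.

```python
# The five partial "reads" from the puzzle; "_" marks a letter that read missed.
READS = [
    "M__y __d _ ____le l___",
    "___y ___ _ l_____ _a__",
    "____ h__ a ______ __m_",
    "_a__ _a_ _ _i____ ___b",
    "__r_ ___ _ __ttl_ l__b",
]


def consensus(reads, gap="_"):
    """Merge pre-aligned reads column by column.

    A column keeps its letter when every read that saw it agrees,
    stays a gap when no read saw it, and becomes "?" on a conflict.
    """
    merged = []
    for column in zip(*reads):
        letters = {ch for ch in column if ch != gap}
        if len(letters) == 1:
            merged.append(letters.pop())
        elif not letters:
            merged.append(gap)
        else:
            merged.append("?")
    return "".join(merged)


def placements(fragment, text):
    """Every start position at which `fragment` occurs in `text`."""
    last_start = len(text) - len(fragment)
    return [i for i in range(last_start + 1) if text.startswith(fragment, i)]


if __name__ == "__main__":
    sentence = consensus(READS)
    print(sentence)
    # In ordinary text a small fragment has only one home...
    print(placements("ttl", sentence))
    # ...but in a repetitive stretch the same fragment fits in many places.
    print(placements("GATAGA", "AGAGAT" * 5))
```

The fragment "ttl" can only go in one spot in the sentence, while the six-letter fragment fits at four different positions in just five copies of AGAGAT. Multiply that ambiguity by thousands of repeats and short reads can no longer be placed with any confidence.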

When the repeating unit is longer, roughly 10 to 60 base pairs, the repeats are called minisatellites. And then there are even macrosatellites, whose repeating units run to hundreds or thousands of base pairs. One of the larger known examples is RS447 on Chromosome 4 whose repeat pattern is 4,700 base pairs long and repeats from 20 to 103 times.
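Since an STR marker value is, at bottom, a count of how many times a motif repeats back to back, a small Python sketch may help. The function name and the example sequence are invented for illustration; the flanking letters and the repeat count of 19 are arbitrary, and real STR calling from sequencing data involves considerably more than this.

```python
def longest_tandem_run(sequence, motif):
    """Largest number of back-to-back copies of `motif` found anywhere in `sequence`."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Keep stepping forward one motif length while the motif repeats.
        while sequence.startswith(motif, pos):
            run += 1
            pos += len(motif)
        best = max(best, run)
    return best


if __name__ == "__main__":
    # Hypothetical stretch of DNA: arbitrary flanking letters around
    # 19 copies of the AGAGAT motif mentioned for DYS448.
    example = "TTC" + "AGAGAT" * 19 + "GGA"
    print(longest_tandem_run(example, "AGAGAT"))
```

The loop checks every starting position rather than jumping ahead, which is slow but keeps the logic easy to follow for motifs of any length.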

Mind you, this is based on conventional knowledge, not new information we yet may be able to learn from the 100% sequencing done by the T2T work. That big stretch of the Y chromosome labeled as "inaccessible" may prove to contain even more lengthy repeat patterns. And we simply don't know, as yet, if there's much in there that's valuable to genealogy or population genetics. But the chance is definitely not zero.

The T2T work required a hybrid of long-read and nanopore sequencing technologies. I won't dive down that rabbit hole either, but suffice it to say that what was achieved was the accurate reading of continuous DNA bits that were orders of magnitude longer than what conventional, so-called shotgun sequencing works with. World's Worst Metaphors™: it's being able to read one complete Faulknerian paragraph at a time rather than a few letters from the sentence, "Mary had a little lamb."

So... If we see a meaningful next-evolution of Y chromosome sequencing, I think it will be in the direction of long-read technology. The Big Y-700 does just about as much as I believe we can get from the 41% of the chromosome it tests. There could be possible refinements to come, but if so I think they will be fairly small and incremental.

But over 50% of the Y chromosome is heterochromatic. We don't even know if the information in there will be useful for genealogy. But that's where I think the next horizon of testing will take us.

Fairly certain that cryptic section in the World's Worst Metaphor comes from the Silmarillion. Can't wait to find out what the Valar have inscribed in my DNA.
Hm. I did note in that little editorial amendment about my failing grammar that I thought I was up to 14 readers. Fourteen. Valar. Coincidence?
+3 votes

Thanks for the latest in genealogy news 

https://www.wikitree.com/g2g/user/Williams-49144

Cheers and good day to you!!

by William Maher G2G6 Pilot (624k points)

Thanks, William. And evidently a few people even peruse my WikiTree profile. I'm surprised at the number of views, and I may need to make some updates to it.

Especially the part about my genetic similarity to a banana. Given that two days ago in Nature it was announced that, for the first time, the regulatory go-ahead has been given to market a genetically engineered banana, called QCAV-4, I may no longer be as close a cousin. This strain is designed to be resistant to Panama Tropical Race 4, a fungal disease that's spread worldwide and that, at present, has no treatment or cure.

Just kidding. Not about the genetically engineered QCAV-4. But I'll be just as closely related genetically to the new designer banana as I am to a plain-old banana currently waiting patiently in your grocery store aisle. Will still share somewhere around 41% of my coding DNA.

+4 votes
I always read everything you write and I save it in my WTDNA file. I learn more from your posts than from most of the DNA books I bought.
by Laura Bozzay G2G6 Pilot (840k points)

Thank you, Laura! I already had you counted among the readership of 14.

Just a small update: Debbie Kennett and I have been in touch, and I'll be modifying and expanding this yDNA timeline notion a bit and placing a starter article for it on the ISOGG Wiki within the next few days. I'll link back here when the article is published.

Thanks for the heads up and for all the great information.


...