Has the Human Genome Been Completely Sequenced?

+26 votes
559 views

No near-term impact on genealogy, but I thought you'd all like to be informed:

On 16 February 2001, 20 years ago this year, the Human Genome Project announced the first publication of the human genome, which appeared simultaneously in the journals Science and Nature. That it has been only two decades since that announcement brings home just how rapid and recent our advances in genetics have been.

But the genome was never fully sequenced, and hasn't been since. Until, possibly, a few weeks ago. Pending peer review--and with the caveat that the Y chromosome was not included in scope--a cooperative collection of labs known as the Telomere-to-Telomere Consortium, or T2T Consortium, has published a preprint in bioRxiv claiming to have achieved just that (DOI: https://doi.org/10.1101/2021.05.26.445798; you can go straight to the full PDF at https://www.biorxiv.org/content/10.1101/2021.05.26.445798v1.full.pdf). In the paper's abstract, the authors write:

"While these drafts [of the human genome] and the updates that followed effectively covered the euchromatic fraction of the genome, the heterochromatin and many other complex regions were left unfinished or erroneous. Addressing this remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium has finished the first truly complete 3.055 billion base pair (bp) sequence of a human genome, representing the largest improvement to the human reference genome since its initial release."

The technologies used to achieve this are two competitors in the marketplace: the long-read sequencing platforms offered by Pacific Biosciences and Oxford Nanopore. To put that in perspective, our common microarray autosomal tests look at only about 0.02% of the genome.
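For anyone who wants to see the arithmetic behind that 0.02% figure, here's a rough back-of-the-envelope sketch in Python. The probe count is my assumption of a typical consumer microarray size, not a number from the paper:

```python
# Rough sanity check of the ~0.02% figure. The probe count is an assumed,
# typical microarray size, not a figure from the T2T preprint.
genome_bp = 3_055_000_000   # ~3.055 billion bp, per the T2T preprint
array_snps = 650_000        # assumed typical microarray probe count

print(f"{array_snps / genome_bp:.4%}")   # prints roughly 0.02%
```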

In its introduction, the paper provides good insight for us into why this may be important to genealogy and population genetics:

"The latest major update to the human reference genome was released by the Genome Reference Consortium (GRC) in 2013 and most recently patched in 2019 (.p13) [our autosomal DNA testing companies all continue to use GRCh37, which was superseded December 2013]. This assembly traces its origin to the publicly funded Human Genome Project and has been continually improved over the past two decades. However, reliance on these technologies limited the assembly to only the euchromatic regions of the genome... As such, the current GRC assembly contains several unsolvable gaps, where a correct genomic reconstruction is impossible due to incompatible structural polymorphisms associated with segmental duplications on either side of the gap. As a result of these shortcomings, many repetitive and polymorphic regions of the genome have been left unfinished or incorrectly assembled for over 20 years."

To underscore that our most-used reporting element, the centiMorgan, is just an approximation: it is wholly based upon the genome reference map, which is used to estimate where chromosomes are most likely to cross over, or recombine, during meiosis. GRCh38 made multiple corrections to GRCh37. And many argue that we shouldn't be attempting to use a single reference genome at all--that we should be moving toward something called pangenomics: working with sets of genomes that, taken together, more accurately describe the whole of a species.

In genealogy and population genetics, for example, we know that approximately 4 to 5 million SNPs will significantly distinguish individuals who descend from a single, continental-level population, e.g., East Asian or Western European. But there may be as many as 25 million SNPs that distinguish individuals of different continental-level populations. The current and recent human genome reference maps are derived from the sequencing of only 13 individuals. One result for genealogy is that centiMorgan calculations may be quite different when based upon different genome maps.
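To make that last point concrete, here's a minimal, hypothetical sketch of how a centiMorgan position gets looked up: linear interpolation between anchor points in a recombination map. The two toy maps are invented purely for illustration, but they show how the same physical base-pair position can land on different cM values depending on which map (and therefore which reference assembly) you use:

```python
import bisect

# Minimal sketch: estimate a genetic position (cM) for a physical position (bp)
# by linear interpolation within a recombination map. The two toy "maps" below
# are invented; real maps tied to different reference assemblies differ in both
# their anchor positions and the cM values assigned to them.

def interpolate_cm(bp, map_points):
    """map_points: sorted list of (physical_bp, genetic_cM) anchor points."""
    positions = [p for p, _ in map_points]
    i = bisect.bisect_left(positions, bp)
    if i == 0:
        return map_points[0][1]
    if i == len(map_points):
        return map_points[-1][1]
    (bp0, cm0), (bp1, cm1) = map_points[i - 1], map_points[i]
    return cm0 + (cm1 - cm0) * (bp - bp0) / (bp1 - bp0)

map_a = [(0, 0.0), (1_000_000, 1.2), (2_000_000, 2.9)]   # hypothetical map A
map_b = [(0, 0.0), (1_000_000, 0.9), (2_000_000, 3.4)]   # hypothetical map B

bp = 1_500_000
print(interpolate_cm(bp, map_a), interpolate_cm(bp, map_b))   # same bp, different cM
```

Swap in a map built against a different reference assembly and every downstream segment total shifts with it.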

Though not one of the authors of the study, Paul Flicek, in speaking of it, summed up the state of genetics very nicely: "The entirety of genomics as a field is a constant cycle between pushing the technological envelope and using these technologies in new and exciting ways."

in The Tree House by Edison Williams G2G6 Pilot (443k points)
So does that mean that those people who have bought the full genome sequencing in the last year or two aren't really getting what they expected?

No; they are for all practical purposes getting what they expected. Especially considering the gross limitations of our microarray tests. For example, the FTDNA Big Y-700 includes only about 41% of the Y chromosome because the vast majority of the remainder is considered too highly repetitive to be of any genealogical value (that or it's part of the very small PAR; different story). The study cited didn't test the Y chromosome, but for a different reason. My 30X whole genome sequencing includes data within the area excluded by the Big Y. Because the data is there. It's just that, for now at least, it looks like it's uninformative and can't be called as accurately as can the euchromatic regions.

And to be clear, it isn't as if what was used in the lab-work the paper describes is commercially available to consumers today. Portions of it are, but that's a melding of competing technologies. My bet is that it will be another year or two before we see any direct-to-consumer application.

Are claims by current tests that they will "decode 100% of your DNA" incorrect? Yes, they are. But they do sample pretty close to that. It isn't that they don't get calls from within the roughly 7-8% of the genome that's either heterochromatic or highly repetitive. It's that those calls typically come back with lower quality scores than calls elsewhere, and tests like my 30X WGS don't have sufficient coverage to accurately piece the data back together for a good picture. That's why regions like most of the p arms of the acrocentric autosomes (13, 14, 15, 21, 22) and the centromeric regions of all chromosomes aren't included in the reference genome.
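If it helps, here's a minimal, hypothetical sketch of what that filtering looks like in practice: each variant call carries a quality score and a read depth, and calls from the messy regions tend to fall below whatever thresholds a pipeline applies. The records and thresholds below are invented for illustration and only mimic standard VCF-style QUAL and DP fields:

```python
# Invented example records mimicking VCF-style QUAL and DP fields; the
# thresholds are arbitrary. The point is only that calls from repetitive or
# heterochromatic regions tend to be the ones that get filtered out.
calls = [
    {"chrom": "chr1",  "pos": 1_234_567,   "qual": 60.0, "depth": 31},  # euchromatic, well covered
    {"chrom": "chr13", "pos": 11_000,      "qual": 12.5, "depth": 6},   # acrocentric p arm, poorly resolved
    {"chrom": "chr1",  "pos": 123_400_000, "qual": 18.0, "depth": 9},   # near the centromere
]

MIN_QUAL = 30.0    # arbitrary thresholds for the sketch
MIN_DEPTH = 15

kept = [c for c in calls if c["qual"] >= MIN_QUAL and c["depth"] >= MIN_DEPTH]
print(f"kept {len(kept)} of {len(calls)} calls")   # only the euchromatic call survives
```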

The WGS test I took and these PacBio and Oxford Nanopore technologies are looking at the same thing. It isn't like two microarray tests examining completely different SNPs. But we don't know enough about some of those regions to accurately construct a reference map for them. The result is regions that are misassembled in the reference, and therefore no way to use imputation to help determine which allele call is likely correct at a given locus when the test data contains competing results.

More than 100 gaps were filled or reduced when we went from GRCh37 to GRCh38...well, not "we," because genetic genealogy is still using the old 2013 model. But accurate telomere-to-telomere mapping holds the potential to make order-of-magnitude changes to the reference model compared to those made between builds 37 and 38.

If the current genotyping used by 23andMe, Ancestry, etc. reports only 0.02% of the genome, I'm surprised that none of the companies are more aggressively pushing to upgrade their tests--at least to exome sequencing (as previously suggested by Edison). When Anne Wojcicki is asked in interviews whether her company plans to offer WGS, she has answered that the price is prohibitive for a direct-to-consumer company. They seem content to continue building their genotyping database to partner with drug companies. Where is the vision? More data is good, but how about more and BETTER data?

Regardless of whether it's a health company or a genealogy company, having a DNA test "upgrade" to exome or whole genome level would be a way for that company to distinguish themselves from the competition.  Start building the next-gen pangenomic database now.  The existence of such a database would open up yet-unrealized new methodologies for both genealogy and healthcare.
Michael, my prediction is that 23andMe will move to WGS within 2022 (or 2023 at the latest). Anne has clearly stated that she's betting on health (and thus genealogy takes an even further backseat), and the only way to get paid well by big pharma is by delivering whole genomes, not microarray data.

As Ed mentioned, the industry is evolving quickly, and it's become clear that microarrays have some serious limitations. They can identify only a small percentage of health risk (e.g., through the BRCA genes), while INDELs (insertions and deletions) have turned out to be very important, as has the whole world of epigenetics (which is another story, as Ed would say, haha).

Prices for WGS have come down a lot; 23andMe could use its 12-million-customer base to push targeted emails and get a good pickup, especially with its marketing machine running at full speed (which is what 23andMe is really good at--just see their Covid predictor, LOL).
I agree with Andreas and Michael. If I were to step back to a 500-foot view and look at consumer genetics as someplace I wanted to evaluate starting a business, or investing in one, the first thing that would catch my eye doing a PESTLE chart would be the "T," technological. Illumina doesn't want to lose any of its market share and at one point recently was trying to acquire PacBio. I attend a PacBio webinar at least once a month; there's a lot happening there. And Oxford Nanopore certainly hasn't been standing still.

I might look at outsourcing my DNA testing, as AncestryDNA does. But the big changes coming down the pike certainly look to be in lab equipment, processes, and procedures. The secondary infrastructure needed for that is in information technology: the ability to store and manipulate massive amounts of data. There are ways with existing algorithms to make the 3 billion base pairs in a genome workable for an online environment in sort of an onion-like structure: the most frequently needed data is the outer layer, and each progressive layer can afford to be slightly slower on access time...much like Amazon Web Services structures its S3 scalable storage.
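Here's a minimal sketch of that onion idea, with invented layer names and data, just to show the lookup pattern (check the fast layer first, then fall through to slower ones):

```python
# Minimal sketch of tiered, "onion-like" storage: frequently requested data
# sits in a fast layer, with progressively slower (and cheaper) layers behind
# it. Layer names, contents, and the class itself are invented for illustration;
# the pattern is loosely analogous to tiered object-storage classes.

class TieredStore:
    def __init__(self, layers):
        # layers: list of (name, dict) ordered fastest to slowest
        self.layers = layers

    def get(self, key):
        for name, store in self.layers:
            if key in store:
                return name, store[key]   # report which layer answered
        return None, None

hot = {"rs12345": "A/G"}                  # frequently requested genotype calls
warm = {"rs99999": "C/C"}                 # less common lookups
cold = {"chr13:11000": "raw read data"}   # rarely touched raw data

store = TieredStore([("hot", hot), ("warm", warm), ("cold", cold)])
print(store.get("rs99999"))   # -> ('warm', 'C/C')
```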

If I really wanted to grab a sizable and profitable chunk of the market, I might position myself to have solid funding, maybe already be public on the NASDAQ, and then stay very quiet about specific plans until I decided to make the move and buy/build my own lab with truly state-of-the-art equipment so that I could rapidly, cheaply, and accurately do next-gen sequencing at levels competitors couldn't rival without big, expensive upgrades. Then leverage millions of existing customers and ad campaigns to take hold of and expand that market share. Alas, genealogy won't be a priority. But maybe it can go along for the ride!
And the current share price increase of ME has certainly set an expectation for some good things to happen with 23andMe in the future.

I also agree with you that genealogy will be kept along for the ride, as it's a good user base for upsells to their health product.

I mean, for all the flak that 23andMe is getting, they still have the second-largest user database by far; they offer unique features that no one else does, like ethnicity information broken down by chromosome with start/end positions and a DNA triangulation tool that every Tom, Dick, and Joe can use; they have a family tree tool (which should be pushed more, since we all know that family trees are the most important thing in genetic genealogy); a very good ethnicity prediction tool that goes down to small regions; and they offer high-level mtDNA and Y-DNA prediction as well.

That's a pretty good package IMO.

On the downside, their 1,500-match limit is meant to push users into buying their subscription (which isn't really necessary; I now have over 3,000 DNA matches), and their marketing could focus a lot more on genealogy so as not to give the impression that users with that interest are something they don't care about.

But that's 23andMe; they have always shown questionable behavior when interacting with other parties, be it the FDA ("Do we really have to answer these letters? Nah"), third-party developers like me ("Let's do a marketplace. Let's assign a summer intern to build it up," "Let's change our API again"), or their customer base ("We concentrate on health").
Part of the problem, I think, is that as far as the technology has come on both the biotech side and the hardware/software side, the genetic genealogy market is just so much smaller economically than the other (primarily health-related) uses of this technology. There are also huge network effects. This really limits investment and disruption in the genetic genealogy arena.

3 Answers

+7 votes
Fascinating information, thanks for sharing, Edison! It seems there are so many ways in which we still don't understand the details of our DNA--epigenetics, for example--but I thought the one thing we did have was a complete genome!
by Shawn Ligocki G2G6 Mach 2 (29.2k points)
Thanks, Shawn. Yep; it's pretty staggering the pace that genetics has seen just in the last 20-25 years. We've learned an amazing amount, but there's still a lot more to learn, a lot more to discover. Not a week goes by without literally dozens of new articles published.

In many ways it's also a cautionary tale for genealogists. There are some strong beliefs that we fully understand genetic genealogy and that the methods we use are known to be valid and firmly grounded in science...when the truth is that some of those methods aren't all that stable, that the scientific ground is constantly shifting a bit underneath them.
+4 votes
I have a related question. Beyond whole-genome sequencing, another technology that is evolving rapidly is high(er)-throughput karyotyping. For example, the Saphyr machine sold by Bionano Genomics dramatically cuts the time, cost, and labor required to obtain a karyotype. This has obvious value in the medical realm; would knowing one's karyotype be useful information for genealogy?
by Michael Schell G2G6 Mach 4 (49.6k points)

Michael, I have very little experience with what, arguably, could be called next-gen karyotyping, or "optical genome mapping" (OGM). It's another area of genetics technology that has experienced massive leaps in just a few years.

Very briefly, because this probably won't be on many genealogists' radars, karyotyping is basically a structural analysis of the chromosomes, technically one of a few types of cytogenetic methods. It traditionally looks for defects or abnormalities on, loosely speaking, kind of the civil engineering front rather than the chemical side of genetics.

Structural chromosomal variations can occur at a (relatively speaking) large scale, like the existence of an extra Y chromosome or trisomy 21, or at smaller scales, like chromosomal deletions, duplications, translocations, or inversions. Those "smaller" variations--again, traditionally--have been detectable when on the order of at least a few million base pairs.

I haven't researched it, but it's possible that Bionano and others are stepping into a new direction with a hybridization of array technologies combined with OGM. This ties into the longwinded post about what in the purported complete genome sequencing was still left out of the reference genome...because karyotyping (hybrid or not) is still structural in nature.

The core reason those highly repetitive, seemingly cluttered areas of the chromosomes weren't really accessible to us in WGS was the mechanism by which the typical lab methods operated. It conjures maybe the wrong impression, but the standard was/is "shotgun sequencing." The chromosomes would be broken up into tiny segments only a few hundred base pairs long. One pass, or read, would return data for (theoretically) all the nucleic acid values. Then it was done all over again; ergo the term "coverage" as in 15X, 30X, 60X: the number of passes performed. So what you ended up with was one extraordinarily complex jigsaw puzzle. The computational power came in after the fact to try to take all that data from those roughly 10 million segments of chromosomes sequenced at each pass and then reassemble it so that all the (hopefully) overlapping values from each of the, say, 30 passes could be reconciled and each base pair could be put in its proper place and order.
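If you want to see the idea in miniature, here's a toy sketch of shotgun coverage: sample random short reads from an invented "genome" until the total bases read equal coverage times genome length. Everything here is made up for illustration; real reads carry errors, quality scores, and paired-end structure:

```python
import random

# Toy sketch of shotgun sequencing: sample random, fixed-length reads from an
# invented genome until total bases read = coverage x genome length. The hard
# part--reassembling the original sequence from the reads alone--is left to
# the assembler, and it gets hard precisely where the sequence is repetitive.

def shotgun_reads(genome, read_len=300, coverage=30):
    reads = []
    total_bases = 0
    while total_bases < coverage * len(genome):
        start = random.randrange(0, len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
        total_bases += read_len
    return reads

toy_genome = "".join(random.choice("ACGT") for _ in range(10_000))
reads = shotgun_reads(toy_genome)
print(len(reads), "reads of 300 bp, roughly 30X coverage of a 10 kb toy genome")
```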

A lot of us are familiar with short tandem repeats (STRs) as the basic tests for the Y chromosome, and autosomal STRs are what's used in standard forensics. These are structural differences, not chemical ones. For example, DYS448 is defined by--clustered in exactly this sequence--the DNA letters AGAGAT. The nucleic acids don't change, but the number of times that same sequence is repeated on the chromosome can differ quite a bit; repetitions up to 20 times, in fact, for DYS448. So we would say that an instance of DYS448 has a value of 12 if the sequence repeats 12 times.
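As a toy illustration of what "calling an STR value" means, here's a short sketch that counts tandem copies of a motif in a stretch of sequence. The sequence is invented; AGAGAT is the DYS448 motif mentioned above:

```python
import re

# Count the longest run of back-to-back copies of a repeat motif. The sequence
# below is invented; a real read would come from fragment analysis or sequencing.

def str_value(sequence, motif):
    """Return the largest number of consecutive copies of `motif` in `sequence`."""
    runs = re.findall(f"(?:{motif})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)

sequence = "TTC" + "AGAGAT" * 12 + "GGA"    # a toy read containing 12 copies
print(str_value(sequence, "AGAGAT"))        # -> 12, i.e., a DYS448-style value of 12
```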

The SNPs (single nucleotide polymorphisms) of our autosomal tests are, by comparison, chemical in nature. They aren't about repetition, but about a single nucleic acid at a single, precise location on a chromosome mutating from, for example, an A to a G or a C to a T.

Okay. So. Everybody still with me? The vast majority of that 7%-8% of the genome that the Telomere-to-Telomere Consortium says it has successfully sequenced hasn't actually been inaccessible to us. It's been there all along; we've known it was there; we could get allele values from it.

But... We couldn't accurately put the 10 million jigsaw puzzle pieces together in order to understand it or effectively map it. That's because of the tremendous amount of repetition in those heterochromatic, or palindromic, or centromeric regions.

Imagine for example that a sequence is AGAGAGAG and that it repeats itself 150 times, for a total of 1,200 sequential nucleic acids. And we have chunks of DNA to analyze that are 300 nucleic acids long. You see the problem. How can you possibly know which tiny chunk of the chromosome in your 30 randomly overlapping passes comes first, second, and third?

That's an extreme oversimplification, but that's been pretty much the dilemma.
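You can see that dilemma in just a few lines. Using the numbers from the example above, every 300-bp read drawn from inside that 1,200-bp stretch of pure repeat is one of only two possible strings, so there's no way to tell which copy of the repeat a read came from or what order the reads belong in:

```python
# The dilemma from the example above: a 1,200 bp region of pure repeat, read in
# 300 bp chunks. Out of 901 possible start positions there are only two distinct
# read sequences, so the reads carry essentially no ordering information.
repeat_region = "AGAGAGAG" * 150   # 1,200 bp
read_len = 300

reads = {repeat_region[i:i + read_len]
         for i in range(len(repeat_region) - read_len + 1)}
print(len(reads))   # -> 2 ("AGAG..." and "GAGA...")
```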

The state of the lab for STR testing has been variations on classical Sanger sequencing, which is one reason some people have trouble with the consumer cost difference between an autosomal microarray test and a Y-chromosome STR test: the latter involves more manual human interaction and judgment. Fluorescence detection is used to view the structure of the repetitions, which appear as banded lines.

I can't comment authoritatively (despite the word salad above), but I imagine that next-gen optical genome mapping--or hybrids like Bionano's Saphyr System--could get to the point where analysis at the level of individual STRs is possible. I don't think it's quite there yet, though.

I rather doubt that a karyotype would have any utility for genealogy. It's a picture viewable under a microscope, where special staining techniques can highlight certain features. You can detect large changes, such as a deletion or a section of one chromosome that has been translocated to another chromosome.
I probably should have used the term "structural variants" rather than the old-school karyotype.  The "next-gen" machines mentioned by Ed do more than classic microscopic karyotyping.  I don't even think they require a microscope.
+5 votes

HEAD EXPLODES... AGAIN...

But seriously, after reading this and all of the responses, I realize that we have come a very long way but still have so much further to go. I am continually amazed that such a small percentage of our DNA makes me different from anyone else, and that we are using only a small percentage to determine that difference to begin with. I wonder where we will be in another 20 years?

... Now, to clean the gray matter off my computer monitor... or is it grey?

by Ken Parman G2G6 Pilot (121k points)
