Part 2
Moving into the future, I believe we need to do two important things, the first one critical. That's expanding the number of markers tested. With some of our current and historic tests looking at as few as 577,000 SNPs, and with some pairs of tests sharing fewer than 20% of the same SNPs, our ability to compare one set of test results to another comes with a whole boatload of assumptions and guesstimates. Too, up to 18% of the SNPs in any given test are there because they provide some reference to our protein-coding genes. Expending almost one-fifth of an already low marker count to test items of almost solely medical/pharmacological interest doesn't help us much for genealogy.
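To make that overlap problem concrete, here's a minimal sketch, using made-up rsID sets rather than any company's real chip manifest, of how you'd measure how many markers two tests actually share:

```python
# Sketch: measure marker overlap between two hypothetical chip manifests.
# The rsIDs below are placeholders, not real product content.

chip_a = {"rs0000001", "rs0000002", "rs0000003", "rs0000004", "rs0000005"}
chip_b = {"rs0000003", "rs0000004", "rs0000006", "rs0000007", "rs0000008"}

shared = chip_a & chip_b
print(f"SNPs shared: {len(shared)}")
print(f"Overlap relative to chip A: {len(shared) / len(chip_a):.0%}")

# Real chips carry roughly 600,000-700,000 markers each; when the shared set
# is under 20% of that, everything else must be imputed or simply ignored.
```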
Illumina, the largest microarray manufacturer, has had higher-density chip options available for several years. For example, there's the Infinium Omni5-4, which targets over 4.28 million SNPs selected from the International HapMap and 1000 Genomes Projects. It's the same microarray technology, run on the same Illumina iScan systems that many, if not most, of the testing companies already use.
The deterrent? Cost. The chips are more expensive; each chip can handle fewer samples at one time, which decreases throughput; and in terms of genealogy matching against all the existing kits that have already been tested, it creates a conundrum, because it would be more like an apples to, well, pears comparison (at least those are both in the Order Rosales, so not in a different Order like oranges in Sapindales).
And the new tests with 4.28 million SNPs would display far, far fewer total matches than the low-density tests. That's for a simple reason: the greater the number of appropriate data points examined, the fewer the false positive matches. At GEDmatch--and the fact that they don't use imputation affects this--I've informally compared results derived from whole genome sequencing, using almost 2.1 million unique SNPs (about half of what the newer Illumina chips would provide), against 11 different sets of results from individual major-company microarray tests. At segments of greater than or equal to 20cM, around 14% of the reported matches are probable false positives. At greater than or equal to 10cM, that number jumps dramatically to around 50% probable false positives...meaning that about one in every two small segments shown at GEDmatch for a single microarray test's results is likely wrong.
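For what it's worth, here's a toy sketch of the kind of informal cross-check I described, not GEDmatch's actual algorithm: take the segments reported from a low-density microarray comparison and see which ones are corroborated by a comparison built from much denser, WGS-derived data. All segment values here are invented.

```python
# Toy sketch: flag microarray-reported segments that a denser comparison
# does not corroborate. Segment tuples are (chromosome, start_cM, end_cM).

def overlaps(seg, other, min_fraction=0.8):
    """True if `other` covers at least `min_fraction` of `seg` on the same chromosome."""
    if seg[0] != other[0]:
        return False
    lo = max(seg[1], other[1])
    hi = min(seg[2], other[2])
    return (hi - lo) >= min_fraction * (seg[2] - seg[1])

def probable_false_positive_rate(array_segments, dense_segments, min_cm):
    tested = [s for s in array_segments if (s[2] - s[1]) >= min_cm]
    unconfirmed = [s for s in tested
                   if not any(overlaps(s, d) for d in dense_segments)]
    return len(unconfirmed) / len(tested) if tested else 0.0

array_segments = [(1, 10.0, 32.0), (2, 50.0, 61.0), (3, 5.0, 16.0)]
dense_segments = [(1, 9.5, 31.0)]   # only the chromosome 1 segment is corroborated

print(probable_false_positive_rate(array_segments, dense_segments, min_cm=10))
```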
The combination of higher consumer pricing and a significant decrease in the number of reported matches would make for a very tough sell. To those who don't understand the details, it would look like they're paying significantly more to get significantly less. Not a great marketing strategy from a revenue perspective.
That decrease in the number of matches, though, is precisely what the casual genetic genealogist needs. For segments at or above approximately 30cM, everything would be almost identical with the results we see today. The low-density microarrays are pretty solid with individual segments of that size. But an explosion of false positives can happen--depending upon how the testing/reporting company handles the data--below 20cM.
Back in 2012, the International HapMap Project (now discontinued) had cataloged approximately 10 million unique human SNPs. SNP, of course, isn't synonymous with base pair, nucleotide, or allele. To be a SNP (or SNV, for that matter), the polymorphism, the mutation, needs to be found in the global population, not just in a couple of individuals.
Today, there are 957.2 million human polymorphic variants (SNPs and SNVs) classified in the NIH's dbSNP database, representing over 192,000 tested individuals. On top of those, another 107.2 million cataloged entries cover microsatellites and small-scale insertions and deletions. We've cataloged almost 100 times the number of SNPs we knew about just a decade ago. Our current tests look at only about 0.02% of our genomes, and only about 0.07% of the currently cataloged SNPs.
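The coverage arithmetic is easy to sanity-check, assuming a round 3.1-billion-base-pair genome, roughly 650,000 tested markers, and the dbSNP count cited above:

```python
# Back-of-the-envelope coverage math for a typical consumer microarray test.
tested_snps = 650_000            # approximate markers on a consumer chip
genome_size = 3_100_000_000      # approximate base pairs in the human genome
cataloged_snps = 957_200_000     # SNP/SNV count cited above

print(f"Fraction of genome examined: {tested_snps / genome_size:.2%}")      # ~0.02%
print(f"Fraction of cataloged SNPs:  {tested_snps / cataloged_snps:.2%}")   # ~0.07%
print(f"Average marker spacing: 1 in every {genome_size // tested_snps:,} bp")
```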
What we really should do is move toward whole genome sequencing so we can eliminate all the estimations, assumptions, imputation, and inference about whether a given segment is really one continuous segment or is broken up into smaller segments by mismatching base pairs. That wouldn't eliminate all possible interpretation errors, specifically a common one resulting from what's called "haplotype switching," but it would minimize them and also give us accurate segment start and end points; today, those values as reported are, perforce, inaccurate. They seem quite precise when we see reports that offer exact numbers, but since our tests look at an average of only one nucleotide base pair out of every 4,700, the possibility for precision simply isn't there.
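To see why those reported start and end points can't really be precise, here's a tiny illustration with invented positions: a match boundary can only be observed at a tested marker, so the reported value effectively snaps to the nearest tested SNP, and the true boundary may sit thousands of base pairs away.

```python
import bisect

# Sketch: a segment boundary can only be reported at a *tested* marker position.
# Positions below are invented; real spacing averages ~4,700 bp but varies widely.
tested_positions = [1_000_000, 1_004_700, 1_009_400, 1_014_100, 1_018_800]

def reported_boundary(true_boundary_bp):
    """Return the last tested position at or before the true boundary."""
    i = bisect.bisect_right(tested_positions, true_boundary_bp) - 1
    return tested_positions[max(i, 0)]

true_end = 1_012_345                  # where the shared segment actually ends
print(reported_boundary(true_end))    # 1009400 -- off by roughly 3,000 bp
```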
Moving to use of whole genome sequencing (WGS) would entail more expensive testing (though not nearly as expensive as it was just five years ago) plus an IT infrastructure that could handle such a massive amount of data. When a WGS test states that it offers 30X coverage, a typical depth, it means that each base pair is read an average of about 30 times in order to more accurately piece together the actual chromosomal sequences. Each of these scans is called a "read," and all of them are recorded. The data points alone, then, would number about 9.2x10^10, or 92 billion per genome sequenced.
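That 92-billion figure is just genome size multiplied by coverage depth; a quick check, again using a round 3.1 billion base pairs:

```python
# Rough data-point count for a 30X whole genome sequence.
genome_size = 3_100_000_000   # ~3.1 billion base pairs
coverage_depth = 30           # average number of reads covering each position

base_observations = genome_size * coverage_depth
print(f"{base_observations:.1e} base observations")   # roughly 9e+10, i.e. over 90 billion
```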
However, there are ways to manage that--perhaps like a Burrows-Wheeler transform--where the initial comparison could be done with a much smaller dataset, and then the nucleotide-by-nucleotide comparison performed only on demand and only for a single, defined segment. We aren't there yet...but I digress.
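For the curious, here's a minimal, purely illustrative Burrows-Wheeler transform, nothing like the optimized FM-index machinery real sequence aligners use, but it shows the core trick: the transform is reversible and tends to group identical characters into runs, which is what makes compressed, on-demand lookups possible in real tools.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(transformed, terminator="$"):
    """Rebuild the original string by repeatedly prepending and sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original.rstrip(terminator)

sequence = "GATTACAGATTACA"
encoded = bwt(sequence)
print(encoded)                 # identical bases tend to cluster into runs
print(inverse_bwt(encoded))    # GATTACAGATTACA
```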
The other major sea change I believe we need is an adoption of a pangenomic approach to our genomic references. Some of the same team members who accomplished the first full sequencing of a genome last year, the Telomere-to-Telomere Consortium, are also spearheading the call to move toward pangenomics. I won't bore your other sock off; it's easy to Google. But suffice it to say that the vast majority of the results from the Human Genome Project used just one person's DNA when the project was declared completed in 2003. Our current GRCh38 reference genome is derived from just 19 people. Kind of shocking, really.
As we saw with the rapid increase in the number of cataloged SNPs, humans have more variability than we thought...and that hasn't yet extended to the heritable chromatin and epigenetic material that surrounds the DNA strands and exerts significant control over what is activated and what is suppressed. We're coming to think of the DNA itself less as a blueprint, as did Watson and Crick and company, and more as an amalgam of structured raw materials. Some of it works just as it is, and some needs the detailed construction plans of epigenetics to shape and control how it works.
Many are beginning to feel that the concept of a single reference genome is approaching obsolescence. Genealogy stands to be affected. For example, we already know that certain pile-up regions--fairly short chunks of DNA where far more people than statistically expected display as identical--differ among continental-level populations. A pile-up region that shows up in Western Europeans may not be there in Southeast Asians.
Likewise, our rough calculation of genetic distance, the centiMorgan, is based solely on that genomic reference map. It tries to estimate the likelihood, given two locations on the same chromosome, that a crossover, or recombination, will happen between those locations the next time meiosis creates gametes. But it's doing it over a single map that's trying to account for the variances among a population of 8 billion individuals.
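For a feel of what that estimation involves, here's a small sketch using two classic textbook map functions (Haldane's and Kosambi's, my choice of illustration, not necessarily what any testing company uses) that convert an observed recombination fraction between two loci into a distance in centiMorgans; Kosambi's differs from Haldane's precisely because it tries to account for the crossover interference mentioned in the next paragraph.

```python
import math

# Convert a recombination fraction r (the observed probability of a crossover
# between two loci in a single meiosis) into a map distance in centiMorgans.
# Haldane assumes crossovers occur independently; Kosambi adjusts for interference.

def haldane_cm(r):
    return -50.0 * math.log(1.0 - 2.0 * r)

def kosambi_cm(r):
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

for r in (0.01, 0.10, 0.20):
    print(f"r={r:.2f}: Haldane {haldane_cm(r):5.2f} cM, Kosambi {kosambi_cm(r):5.2f} cM")
```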
We already know there are around 50,000 recombination "hotspots" that aren't taken into account in GRCh37, and that the cM calculation also doesn't do a stellar job of accounting for something during meiosis called "crossover interference." We also know that, in males, positions of likely crossover change as they age due to a process related to DNA methylation. Technically, centiMorgans should be calculated a bit differently for a male 22 years old than for the same male at 52. By the way, this phenomenon doesn't impact females because the oocytes, the forerunners of the egg cells, form and undergo most of the first of two stages of meiosis while the female herself is still a fetus. When she is born, all the egg cells she will ever produce have already undergone recombination, so the age at which she gives birth isn't affected the same way by DNA methylation.
I digressed way further.
Bottom line is that we genealogists often view and use the data and tools we have today as if we're working with solid, highly specific, long established and thoroughly tested science. We aren't. It's evolving rapidly. Very rapidly. The hybrid techniques that allowed for the first full genome sequencing in May 2021 were not available in 2020. Those specific techniques are not available for direct-to-consumer purchase yet, at any price.
To put it in terms of another technology, today we have home computers that would exceed the capabilities of what were mainframes not all that long ago. Comparatively, our use of genetics for genealogy is about at the point where home computers first came with a small, preinstalled hard drive and we no longer had to use a floppy disk to boot them. Our average microarray test totals around 650,000 points of data (complete coincidence: the IBM PC was introduced with 640K of RAM, random access memory). A 30X whole genome sequence provides over 90 billion. We're using a single reference map of the human genome that was derived from only 19 individuals out of 8 billion living today.
Much is left to be done, and in the meantime we genealogists need to be critical of results that go too deep, that have us working beyond the cutting edge of the actual science.