Well, it was quick for me. All I did was paste in the links. This one won't be as quick...
(Note: I'll split the issue of maternal/paternal identification off into a second message. Am I the only one who has to watch out for the G2G per-post word limit? Ahem.)
Yep; still waiting on my WGS results. Eight months and 12 days since the testing company received my sample. But hey; who's counting?
The really short answer is that, to my knowledge, there are no commercial services yet providing what we would call comparison or matching on whole genome results. And I believe it remains to be seen how those services will develop and function; I have a feeling there will be more than one methodology. I'm betting one method will be to use VCF (variant call format) files only--in other words, comparing only each of our differences from the then-current human genome reference model--and another may be similar to GEDmatch today, comparing actual SNPs but with a data library that would likely have to hold somewhere around 15 to 20 million relevant SNPs.
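To make the VCF-only idea concrete, here's a toy sketch (my own illustration, not any company's method): if each genome is represented solely by its differences from the reference, two genomes can be compared by intersecting their variant sets. The variants below are made-up examples; a real VCF also carries genotypes, phasing, and quality fields that would matter for matching.

```python
# Each variant represented as (chromosome, position, ref allele, alt allele).
# Hypothetical data for two test-takers.
genome_a = {
    ("chr1", 752721, "A", "G"),
    ("chr1", 776546, "A", "G"),
    ("chr2", 136608646, "G", "A"),
}
genome_b = {
    ("chr1", 752721, "A", "G"),
    ("chr2", 136608646, "G", "A"),
    ("chr7", 117559590, "G", "T"),
}

shared = genome_a & genome_b   # variants present in both genomes
only_a = genome_a - genome_b   # differences unique to A

print(f"shared variants: {len(shared)}")   # 2
print(f"unique to A:     {len(only_a)}")   # 1
```

The appeal of this approach is size: a VCF holds only the few million positions where you differ from the reference, not all 3.2 billion base pairs.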
I don't know that I can envision an IT infrastructure, at least not in the near future, that could do an online comparison and data presentation of whole genomes. The BAM (binary alignment/map) file for a 30X coverage run of our 3.2 billion base pairs weighs in at about 80 to 100 gigabytes. (Andreas, did you get your BAM? If so, how large is it?) The human-readable text version, the SAM (sequence alignment map), is even larger. Even a one-to-one comparison between two BAM/SAM files would take a lot of computing power and a lot of working storage. Standalone, batched requests I could see; real-time comparisons like we're used to today...not so much.
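The back-of-envelope arithmetic behind those file sizes looks like this. The one-byte-per-base figure is a rough rule of thumb I'm assuming here, not a spec: compressed BAMs for 30X short-read WGS tend to land near one byte per sequenced base once you account for sequence, quality scores, and compression.

```python
# Rough estimate of 30X whole-genome BAM size (rule-of-thumb assumptions).
base_pairs = 3.2e9                 # human genome length
coverage = 30                      # a 30X run
raw_bases = base_pairs * coverage  # total sequenced bases: 96 billion

bytes_per_base = 1.0               # assumed effective size after compression
bam_gb = raw_bases * bytes_per_base / 1e9

print(f"sequenced bases: {raw_bases / 1e9:.0f} billion")
print(f"estimated BAM size: ~{bam_gb:.0f} GB")   # in the 80-100 GB ballpark
```

Run that for two people and a single one-to-one comparison already means shuffling close to 200 GB of working data, which is why I see batch jobs, not real-time matching.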
I think that, from the standpoint of direct-to-consumer packaging, IT and data communications will be the bottleneck, not what goes on with the sequencing itself. Lab techniques like nanopore and long-read sequencing are advancing all the time, continuing to drive lab costs down and speed processing up while maintaining or even increasing the kind of accuracy we see today in 30X coverage tests.
I really don't know much about GoNL (Genome of the Netherlands) other than that it is a very mature and well-funded study, and it offers a possible glimpse of where we'll be in a couple of years with WGS for genealogy. A collaboration of multiple organizations including government entities and universities, GoNL kicked off in 2009, and the baseline samples are from 250 trio-phased sets: two parents and a child, so 750 individuals. Those--I believe--have all been whole genome sequenced, plus there are a number of volunteer participants who were tested on genotyping microarrays.
To draw an imperfect analogy that's at work today, in a yDNA study it's a big deal to have multiple men take the BigY-700 full-sequence test because then, for other men who match them closely at 111 Y-STR markers, the men not full-sequenced can purchase individual SNP tests at a low cost and still arrive at good evidence they and the BigY testers align down to at least that individually tested SNP. Not everyone has to pay the big bucks for all to be able to receive some evidentiary benefit.
Told you it was a bad analogy. But the similarity is that--as we move forward with relatively inexpensive but high resolution direct-to-consumer whole genome sequencing--those who have been full-sequenced can provide detailed data down to individual loci that can support autosomal triangulation results for those who have taken coarser, less accurate microarray tests. So some good news there is that even if WGS supplants genotyping microarrays in the near future--which I think it will--the data from our microarray tests will still be valuable and tests by deceased family members will never be obsolete.
That's basically what the GoNL project has done. Trio-phased whole genome sequencing is informing the less data-rich microarray tests.
The issue we have with microarray testing is that only a (relatively) few markers, or SNPs, are examined. To put it into perspective, you're probably familiar with GEDmatch's previous default threshold that said a segment wasn't a real segment unless it contained 700 tested SNPs; that threshold has since been reduced to a floating figure that can go as low as 200 SNPs.
A centiMorgan is not a physical measurement; it's calculated from the expected probability of crossover (recombination) at a given point on a chromosome, and the number of cMs calculated for the same physical length--in base pairs--can differ significantly not only from chromosome to chromosome but from region to region along a chromosome. That said, very roughly speaking, one centiMorgan equates to a stretch of chromosome somewhere around one million base pairs long. Ergo, a small 7cM segment will comprise, give or take, about 7 million base pairs.
Back to the SNP thresholds. If a 7cM segment can be considered valid when only, say, 400 of the same SNPs were examined in both tests being compared, that means we can positively match only one position in every 17,500. So what we end up doing is assuming that the other 6,999,600 positions that were not tested are identical.
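The arithmetic behind that 7cM example, spelled out (using the rough 1 cM ≈ 1 Mb rule of thumb from above):

```python
# How sparse a "valid" microarray segment really is.
bp_per_cm = 1_000_000          # rough rule of thumb: 1 cM ~ 1 million bp
segment_cm = 7                 # a small 7 cM segment
tested_snps = 400              # example SNP-count threshold

segment_bp = segment_cm * bp_per_cm       # 7,000,000 bp
density = segment_bp // tested_snps       # one tested position per 17,500
untested = segment_bp - tested_snps       # 6,999,600 positions assumed equal

print(f"segment length: {segment_bp:,} bp")
print(f"one tested SNP per {density:,} base pairs")
print(f"positions assumed identical: {untested:,}")
```

Those 6,999,600 assumed-identical positions are exactly the gap that imputation tries to fill.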
Without diving into the exome and protein coding genes and the approximately 3 to 5 million SNPs that really differentiate your particular genome from someone else's, you see where I'm goin' with this. WGS can serve as something like a Rosetta stone. As I'm sure the GoNL study has done, if you get several microarray test results that match nicely with a set of trio-phased WGS results, you can do a bit better than simply guess at what lies within those empty stretches of 17,500 untested positions.
If this sounds like imputation, you're correct. In a nutshell, imputation infers the allele values of untested SNPs based on the linkage disequilibrium patterns derived from directly tested markers. When it comes to estimating the missing SNPs from a set of results, there are two different types of reference panels, or database libraries, used: haplotype panels and genotype panels. It's important to note that both require whole genome sequencing as a starting point in order to have that Rosetta stone. If you're trying to decipher a message from words in certain positions separated by thousands of unknown words, you need to have that (at least nearly) complete reference to compare against.
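Here's a deliberately oversimplified sketch of the idea (my own illustration with made-up five-locus haplotypes; real tools like Beagle use probabilistic hidden Markov models over panels of thousands of haplotypes, not a best-match lookup):

```python
# Toy haplotype-panel imputation: fill untested positions ('?') from the
# reference haplotype that best agrees at the positions we DID test.
reference_panel = [
    "AGTCA",   # hypothetical whole-genome-derived haplotypes
    "AGTGA",
    "CTTGA",
]

def impute(observed: str) -> str:
    """Return the observed string with '?' positions filled in from the
    best-matching reference haplotype."""
    def score(hap: str) -> int:
        # Count agreements at tested (non-'?') positions only.
        return sum(o == h for o, h in zip(observed, hap) if o != "?")
    best = max(reference_panel, key=score)
    return "".join(h if o == "?" else o for o, h in zip(observed, best))

# A microarray-style result: only the 1st and 4th loci were actually tested.
print(impute("A??G?"))   # untested loci borrowed from the best match: AGTGA
```

The key point the toy version does capture: the reference panel has to come from full sequencing. If the panel itself had the same blank stretches as the microarray results, there would be nothing to borrow from.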
I suspect some mistakenly believe that AncestryDNA, for example, uses its huge database to keep feeding newer and more refined information into its Beagle imputation tool. They aren't. They can't, because within that huge database all the tests looked at (more or less) exactly the same few genetic markers: having 10 million copies of the same book, each with the same word in the same place every ten otherwise-blank pages, isn't going to help you figure out what the book says.
That's where we are with microarray testing today and why imputation is currently a less than stellar alternative for comparing results between chips that have very little overlap in markers tested. None of us have had whole genome sequencing performed by Ancestry.com or any of the other major DTC testing companies. They don't have haplotype reference panels gleaned from their 15 million tests on file; they have to rely on external sources for those data, and that can only start with the much broader genotype panels.
When we talk about genotype reference libraries, the 1000 Genomes Project comes first to mind; its samples are grouped into continental panels or "super-populations": African (AFR), Admixed American (AMR), East Asian (EAS), European (EUR), and South Asian (SAS). Under those are more granular sub-population panels, but they can't really achieve what GoNL has done with its highly specific testing strategy. GoNL has created haplotype references for its defined study population against which microarray results can be accurately imputed.
Now, Ancestry or someone else can go to GoNL and ask for or acquire permission to use the data, and then for a select number of tests whose haplotypes align with those within GoNL, the predictive matching can be quite accurate. But without definitive haplotype WGS libraries like that, the big testing companies are working with population-level data.
Imputation against genotype reference panels isn't outlandishly inaccurate, but for genealogy we have to contend with trying to be precise about, say, 5th cousins. Of the roughly 4,700 of them each of us has, only about 15% are likely to share any chip-detectable DNA with us at all; without recent pedigree collapse we'd need to test 5.6 actual 5th cousins to find one who is a DNA match. That means working with small segments and very few tested SNPs in triangulation groups with multiple members. A very recent study of imputation in a five-way admixed AFR population found, at best, a genome-wide error rate across the overlap of the autosomes of 11.98%. We consistently do better with the EUR population panel, but is even a 5% or 8% error rate acceptable?
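A quick check of what that detection rate means in practice, using the ~15% figure above (detection rates vary by chip and company; this is just the arithmetic, not a claim about any one vendor):

```python
# Chance of finding at least one DNA match among n tested 5th cousins,
# assuming each independently has a ~15% chance of sharing detectable DNA.
p = 0.15
for n in (1, 5, 10, 20):
    p_any = 1 - (1 - p) ** n
    print(f"test {n:>2} fifth cousins -> {p_any:.0%} chance of at least one match")
```

Even testing five of them leaves you with only slightly better than coin-flip odds of a single match, and when a match does appear, it's usually a small segment, which is precisely where imputation error hurts the most.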
I think that's a fence we'll be able to climb once DTC WGS becomes popular.