Woo hoo! You can't see it--and you should be happy about that--but I'm jumping up and down about having Ann at the party! I would have sent her a gilded invitation had I known she was interested. And I'm in complete agreement about how we may see WGS data actually used in genealogy.
(To Gaile: I was being facetious with the earlier start-a-business comment; I have about three new ideas for a business every year, and only one was ever successful enough to go public in a sizable way; so the track record ain't, uh, exactly what you might loosely term "good"...)
I'd seen some general, very basic (and probably dated) info from Strand Life Sciences regarding WGS BAM sizes and the computational requirements for working with gzipped FASTQ files. At about 30X, the BAM should come in at around 80-90GB. They state, "...assuming whole genome samples are done at read lengths of 75 or above, the size of each whole genome sample [compressed FASTQ] can be rounded off to about 150 GB," that figure meant to accommodate up to 40X. For a dedicated machine running 16 cores at 2.7GHz with 32GB RAM, they estimate that generating aligned reads from a FASTQ file takes about 6.5 hours per sample.
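Purely as a sanity check on those numbers, here's some back-of-envelope Python; the bytes-per-base constants are simply back-derived from Strand's own figures, so treat the outputs as ballpark only:

    # Rough scaling of WGS file sizes with coverage (my arithmetic, not Strand's).
    GENOME_BP = 3.1e9                              # approx. haploid human genome length

    def sequenced_bases(coverage):
        # Total bases sequenced at a given average depth of coverage.
        return GENOME_BP * coverage

    # Strand's ~150GB compressed-FASTQ figure at up to 40X implies roughly:
    bytes_per_base_fastq_gz = 150e9 / sequenced_bases(40)    # ~1.2 bytes/base
    # and their ~85GB BAM at 30X implies roughly:
    bytes_per_base_bam = 85e9 / sequenced_bases(30)          # ~0.9 bytes/base

    print(f"~{sequenced_bases(30) * bytes_per_base_fastq_gz / 1e9:.0f} GB FASTQ.gz at 30X")
    print(f"~{sequenced_bases(40) * bytes_per_base_bam / 1e9:.0f} GB BAM at 40X")

In other words, the sizes scale pretty much linearly with coverage, which is why the jump from a 30X test to a 40X test matters for storage planning.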
Per the SAM format spec, every read from every pass gets its own record, along with a per-base quality string, read ID, flag, optional tags, and whatnot. On a bioinformatics board, one person wrote that at their institution, the SAM files output after aligning 30X tests to GRCh38 usually ran 250-350GB.
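For anyone who hasn't peeked inside one, each alignment in a SAM file is a single tab-delimited line with eleven mandatory columns followed by optional tags. A toy illustration (the read and all of its values are invented):

    # One made-up SAM alignment record, split into its mandatory columns and tags.
    FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

    seq = "CCCTAA" * 12 + "CCC"        # 75bp of telomeric repeat, purely illustrative
    qual = "F" * len(seq)              # per-base quality string, same length as SEQ
    sam_line = "\t".join(["read_001", "99", "chr1", "10468", "60", "75M",
                          "=", "10642", "249", seq, qual,
                          "NM:i:0", "AS:i:75"])          # last two are optional tags

    cols = sam_line.split("\t")
    record = dict(zip(FIELDS, cols[:11]))                # the 11 mandatory columns
    tags = cols[11:]                                     # optional TAG:TYPE:VALUE fields
    print(record["QNAME"], record["FLAG"], record["POS"], record["CIGAR"], tags)

Multiply a record like that by a billion-plus reads and the 250-350GB figure stops sounding crazy.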
I've seen some amazing advances in computing and communications tech in my lifetime, but I'll never see a GEDmatch-style operation for complete WGS data. I'm also dog-paddling way over my pay-grade here. I'm eager to get my hands on the data and start learning, but on this subject I know not whereof I speak.
If I had to guess, though, I'd say Ann is spot-on. 'Cause she does know what she's talking about. What will interest genealogists are essentially the same data that interest population geneticists. They (we) pretty much don't care about exomic data (well, mostly); that's for medical researchers. But there are at least 10 million SNPs identified and cataloged. Current genotyping microarrays (some of them customized), including Living DNA's new Thermo Fisher Scientific chip, cover less than 10% of those: about 900K SNPs.
I'll betcha some big-brained population geneticists have already prioritized many more of those 10 million SNPs. (In fact, I know one I think I'll ask about that.) In other words, Goldilocks theory: for genealogy, 10 million SNPs may be overkill, but 900K may actually be too few. With greater SNP density should come less imputation/inference about segment sizes: greater accuracy and less guessing. Maybe there's a sweet-spot in there of, say, 5 or 6 million SNPs that are both stable and ancestral/population indicative. Dunno; I'm clueless.
If we could compare 6x the SNP coverage we have access to today, we might not need to do much at all with the other 99.8% of the base pairs. Meaning real-time, GEDmatch-style compare-and-report. Extracted databases of that size would probably be manageable.
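To make concrete what I mean by compare-and-report, here's a minimal sketch of a half-identical segment scan over a shared SNP panel; all of the names and thresholds are mine and purely illustrative, and real matchers would also weigh segment length in cM, tolerate no-calls and genotyping errors, and so on:

    # Minimal sketch of a GEDmatch-style half-identical segment scan.
    def half_identical(g1, g2):
        # Unphased genotypes (e.g. 'AG' vs 'GG') are half-identical if they share an allele.
        return bool(set(g1) & set(g2))

    def matching_segments(panel, kit_a, kit_b, min_snps=500):
        # panel: list of (chrom, pos, snp_id) sorted by chromosome, then position.
        # kit_a, kit_b: dicts of snp_id -> genotype string, e.g. {'rs123': 'AG'}.
        segments, run, run_chrom = [], [], None

        def flush():
            if len(run) >= min_snps:
                segments.append((run_chrom, run[0], run[-1], len(run)))

        for chrom, pos, snp in panel:
            ok = snp in kit_a and snp in kit_b and half_identical(kit_a[snp], kit_b[snp])
            if ok and chrom == run_chrom:
                run.append(pos)
            else:
                flush()
                run, run_chrom = ([pos], chrom) if ok else ([], None)
        flush()
        return segments    # (chrom, start_pos, end_pos, snp_count) per candidate run

The core loop really is about that simple, and a 5-6 million SNP extract per kit is small enough that this kind of scan stays fast; it's the full 3-billion-base comparisons that would be the problem.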
Novel and unique autosomal SNPs may have less genealogical relevance than the ones we're seeing almost weekly on the Y-chromosome, but maybe they could be handled much the way Alex Williamson handles them for The Big Tree, only automated and batch-compared. And I imagine that yDNA and mtDNA WGS data would always be split off into their own databases. Oh, and speaking of Alex, I learned that the Y-DNA Data Warehouse can already accept VCFs from Dante Labs, though of course that endeavor only covers yDNA in Haplogroup R.
And as Ann noted, I'll bet we start to see patterns where we've been incorrectly inferring unbroken segments from matching SNPs, probably in SNP-poor chromosomal regions (some centromeres, telomeres, etc.). Not real-time matching stuff, but researchers with access to volumes of WGS data could start to tell us far more about how accurately we're actually working with SNPs (and drive big-time refinements to imputation and matching), as well as help us do much more to positively identify pile-up regions.
Regardless of what happens with the tech and our ability to use the data for genealogy, if a $200 WGS is becoming an actual thing, it's going to be time to start figuring out how to store and preserve--and grant research access to--those vast amounts of data. Just in the 15 years I've been messing with DNA for genealogy, I've seen test-takers pass away and their family members left with no access to--or interest in--the DNA tests or any communication about them. Blaine Bettinger has written about and discussed the situation, and maybe we need to make it a public-service planning priority for 2019.
When drafting a will, almost no one thinks to include password information or a directive about what to do with a DNA test. But I'd be hard-pressed to think of anything as unique and irreplaceable as my DNA information. We're each a one-off. My beneficiaries can use or sell property, and can scan and archive photos and documents. But once they lose access to my DNA data (or interest in managing it), it may be irretrievable. We might be able to sample artifacts like stamps or envelopes, but there will never be a way to get back entire WGS results.
Hm. Off-topic, but food for thought...