If you haven't tested with AncestryDNA, should you do it right now?

+18 votes

Becky started a topic about AncestryDNA's summer sale, currently in effect (US$59 through 26 August; the sale may already have ended in the UK), but I thought I'd caused enough topic drift there and decided to start a new question.

To answer my own question, if you've tested only on the Illumina GSA or the Thermo Fisher Scientific Affymetrix chips, then yes, you may want to get an AncestryDNA test soon...like, now, during this August sale. I may have been a bit premature with this blog post from last June (which should explain most of the reasons why you'd want an AncestryDNA test), but we now have additional corroboration that AncestryDNA will, in fact, soon be leaving behind the OmniExpress microarray chip.

Business Insider Prime reported this morning that "Genetic testing magnate Ancestry CEO Margo Georgiadis said that the startup plans to branch out from genealogy testing and expand into individualized medicine." They simply cannot do that with the OmniExpress chip, and I believe my prediction of a move to the GSA array no later than end-of-year is on point.

You can view an abridged, publicly-available version of the Business Insider article here. It goes on to say: "The company has largely refrained from stepping into the broader healthcare space...despite the success its rivals have achieved by leveraging genetic insights for pharmaceutical research and precision medicine. But now, Ancestry is building out a full health team, with open roles in marketing, engineering, communications, and senior management." 

I could argue with Business Insider terming AncestryDNA a "startup," though. wink Their first sales began November 2011, and they began to dominate the market in 2015. They've processed more direct-to-consumer DNA tests than any other company. As of January 2019 we believe Ancestry had sold roughly 14 million tests, maybe a bit more.

This is a big deal because the change in chipsets means those 14 million people will have only about 20% of their tested SNPs in common with subsequent test takers. Mind you, for genealogy neither set of results is really better than the other...but the GSA chip is definitely better for health/medical analysis.

What does it mean for genealogists and adoptees? Short of waiting for whole genome sequencing to become routine and affordable and take over the market--which I think it will by 2022 or 2023--if you haven't tested on the OmniExpress chip, you may want to before it goes away. Take a look at this graph from MIT Technology Review:

All of the tests by Ancestry have been on the OmniExpress chip. July 2017 and earlier, 23andMe used the OmniExpress chip. March 2019 and earlier, both FTDNA and MyHeritage used the OmniExpress chip. Living DNA never used the chip. That can be your benchmark to help determine whether you or a relative tested on the OmniExpress chip. All the microarray chips in use can be customized--modified to include extra SNPs--and the companies did that. But the customization for the OmniExpress affects up to 30,000 SNPs out of a possible total of 714,000...the foundational ~680,000 SNPs are the same.

To date, my very rough estimate is that around 29 million people have purchased DTC DNA tests from the major companies, and of those somewhere around 19 or 20 million tests have been performed on the OmniExpress chip. And every company, including GEDmatch, has been struggling with how best to reconcile the fact that the OmniExpress and GSA chips examine only about 20% of the same genetic markers:

As a result, as the segment sizes grow smaller, I'm convinced we're seeing either--or both--more false-positive matches than ever, or valid segments not being evaluated as such due to too few in-common SNPs (the latter applies mostly to GEDmatch).

A fairly new Tier 1 tool at GEDmatch allows you to upload results from multiple testing companies and combine them into one "super kit." If results exist from both an OmniExpress chip and a GSA chip, that resulting "super kit" would contain somewhere around 1 to 1.2 million SNPs. It may or may not result in a greater number of matches showing up, but it will certainly improve the accuracy.
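GEDmatch hasn't published the internals of its Tier 1 merge tool, but conceptually a "super kit" is just the union of the two chips' marker sets, with a real genotype call preferred over a no-call. Here's a minimal Python sketch under that assumption; the rsIDs and genotypes below are invented for illustration:

```python
# Hypothetical "super kit" merge: combine an OmniExpress-based kit and a
# GSA-based kit, each reduced to a dict of {rsID: genotype}. A called
# genotype wins over a no-call ("--"). This is a sketch of the concept,
# not GEDmatch's actual algorithm.

def merge_kits(omni_kit, gsa_kit):
    """Union of both kits' markers, preferring a real call over '--'."""
    merged = dict(omni_kit)
    for rsid, genotype in gsa_kit.items():
        if rsid not in merged or merged[rsid] == "--":
            merged[rsid] = genotype
    return merged

# Invented example data: two markers overlap, one of them a no-call.
omni = {"rs123": "AG", "rs456": "--", "rs789": "CC"}
gsa = {"rs456": "TT", "rs999": "GA"}
superkit = merge_kits(omni, gsa)
print(len(superkit))  # 4 markers: the union of both chips
```

The payoff is exactly the numbers quoted above: two ~700K kits with ~20% overlap merge into roughly 1 to 1.2 million usable SNPs.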

Perhaps most important, it keeps those 19 million sets of OmniExpress results pertinent. People have passed away, or may never take another DNA test. Once we transition to whole genome sequencing, this won't be an issue...providing someone builds us tools that allow us to do "historical test comparisons," which I'm sure will happen. But until then, I see this most affecting the oldest of our lines, and adoptees.

In September 2018, Louis Kessler wrote about "The Benefits of Combining Your DNA Raw Data." He took the results from his tests at all five of the major testing companies and manually merged them into a single "super kit" that resulted in just under 1.4 million unique SNPs.

After GEDmatch introduced their new tool, he followed up last April with "Combine Kits into One Superkit on GEDmatch Genesis." That process gave him a new kit number that contained 1.1 million SNPs.

Bottom line, if you think you might want or need results from the OmniExpress chip, you may have only a few months left to obtain them, and AncestryDNA is the only remaining source. At $59, it might be well worth another test kit or two during the current sale. If you'd like to help WikiTree a tiny bit at no additional cost to you, you can use this link to order.

in The Tree House by Edison Williams G2G6 Pilot (366k points)
So would a kit bought now but submitted after they make the transition be run on the new chip? I assume it's not really about the test kit itself but about the chip (unless the chips need different test kits for some reason, though I don't see why they would). So buying kits now and sitting on them isn't a good idea.
Nope; you're correct: the sample is the same; it's the actual microarray chip used--the microscope slide looking thingy--that makes the difference. Through the magic of digital-meets-biological, those super-tiny, ~680,000 points on that chip attract specific markers to attach to them when the prepped DNA solution is washed over it. Ergo, they can figure out which nucleotide is which.
Yeah, that's what I figured.  I'd hate for people to stockpile tests thinking it meant it would be on the old chip regardless of when it was submitted.

OK, so followup question... my mother, brother, paternal uncle and I have tested.  I'm hoping to get at least one or two of my father's other three siblings and my mother's first cousin to test.  Obviously the more the merrier in general, but now I'm wondering if I should wait and do them on the new chip to have better coverage of matches there as that dataset grows, or rush so I get them done on the old chip.  Correct answer, of course, is to do everyone on both chips and cover all my bases!  But if I had to prioritize which approach would make more sense?
To the inverse of what you’re suggesting: I asked FTDNA if they would retest a Family Finder on their new chip with a stored sample of a deceased individual. They said they would have to remove the old test from the system and replace it with the new one.  Seems kind of silly to me that they can’t have just two different logins and allow both tests, but that’s what they said.  Maybe I’ll end up using all of the sample on the Big Y-700 anyway.
Or run the new test and merge them to make a superkit like GEDMatch can do.

Barry, I agree: seems like a strange response with no hint of compromise. Out of curiosity, did you telephone FTDNA and speak to a human to whom you could fully explain the situation (and maybe get transferred to a supervisor), or was it via email or chat? In general I find FTDNA's support staff to be pretty good. They don't outsource to a boiler-room call center; the agents are right there in the same building.

If there's enough material in the stored sample, you might prefer the BigY-700 anyway. But otherwise I have to think there would be some way they could accommodate an additional Family Finder test for a deceased individual they already have on file. I mean, let the current information stand and, if it's lab procedure that's the issue, assign a new kit number and web account to the sample, pretend it's a new order, and just...do it. Shouldn't cause them any problems, and could make a big difference to your family's research.

"Or run the new test and merge them to make a superkit like GEDMatch can do."

Nah. Way too logical, Lisa. Thing is, we know FTDNA has to have a SNP catalog, or database, that includes all of their OmniExpress result markers, all of their new GSA chip markers and, since they accept 23andMe uploads, at least almost all of the additional markers 23andMe customized their v5 GSA chip with. So I can't imagine what technical limitation would keep them from doing just that...even make it a selling point because, other than GEDmatch, no one else is doing multiple result aggregation yet.

But I've heard no hint that something like that's in the works. Doesn't mean it isn't. Just that I haven't heard any chatter yet. Maybe Max or Bennett will read this and decide it's a good idea.  laugh

If You Would Like to Help with a Tiny Bit of Research...

I'm looking for a few sets of very recent AncestryDNA results. I don't need or want any DNA information, just which SNPs Ancestry tested. Posted results will be completely anonymous.

If you've tested in 2019--meaning the test was purchased in 2019, not just the results received--and if you're even a little bit comfortable using a spreadsheet program like Excel, please consider sending me a private message from my profile.

I can explain what's needed step-by-step in a reply email, but essentially you will have already downloaded your raw data file from Ancestry.com, and you'll be opening the file in a spreadsheet program, deleting all the information that pertains to you individually (chromosome numbers, marker positions, and allele information), saving the file (to a different filename so that you don't overwrite your actual results), ZIP compressing it, and emailing it to me. The objective will be to analyze which SNPs AncestryDNA is currently testing and compare those to the Illumina standard offerings in their OmniExpress v1.3 chip and newer GSA v2.0 chip.
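The spreadsheet steps above can also be done in a few lines of Python. This sketch assumes the usual AncestryDNA raw-data layout--comment lines beginning with `#`, then tab-separated rows of rsid / chromosome / position / allele1 / allele2--and keeps only the rsID column; the sample rows are illustrative:

```python
import csv

def extract_rsids(lines):
    """Return only the rsID column, dropping '#' comments and the header row."""
    rsids = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#") or row[0] == "rsid":
            continue
        rsids.append(row[0])  # discard chromosome, position, and alleles
    return rsids

# Illustrative snippet in the AncestryDNA raw-data format.
sample = [
    "#AncestryDNA raw data download",
    "rsid\tchromosome\tposition\tallele1\tallele2",
    "rs3131972\t1\t752721\tA\tG",
    "rs12562034\t1\t768448\tG\tG",
]
print(extract_rsids(sample))  # ['rs3131972', 'rs12562034']
```

To produce the file to send, you'd read your raw-data download into `lines`, write the returned rsIDs one per line to a new filename, then ZIP that file--your personal allele data never leaves your machine.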


A Follow-up

It seems nothing drives views of blog posts like controversy. Sigh. This one hit 6,000 views, and growing, this morning. Based on the communication with Dr. Turner, below, and conversations with a few others, I'm expanding this little SNP analysis to include all Ancestry v2 tests. I've received only a few sets of results as yet, but there are discrepancies among them that merit obtaining more samples.

So the timeframe now would be any AncestryDNA test purchased from June 2016 to present. I've gone public by posting the call for samples to the blog. You can read the revised description and contact information there. Would love to get a few more WikiTreers involved!

5 Answers

+5 votes

As a result, as the segment sizes grow smaller, I'm convinced we're seeing either--or both--more false-positive matches than ever, or valid segments not being evaluated as such due to too few in-common SNPs (the latter applies mostly to GEDmatch).

This is an issue; however, I'm convinced there is a viable workaround: imputation.

In short, imputation uses a combination of population genotype statistics and/or segment data from other customers and individuals to create a more complete composite picture for each individual's data set. 

The reliability is reasonable, if imperfect, and it should supply sufficient compatibility for much of our matching needs. Moreover, this is not something we consumers need to perform; it's a back-end process handled by the company.

For a very good run-down of what imputation is about, I recommend reading DNAeXplained's article on imputation.

The real benefit of getting tested sooner rather than later is the opportunity to get in touch with others while they are still around. Lots of people only check messages within the first month or so after getting tested. And others, many of them older genealogists, may pass away rather suddenly.

Last September, I was going through my matches and had a rather poignant experience of looking up a match who, it turned out, had logged in to Ancestry DNA for the very last time exactly one year ago from today. I'm not sure if this audience would be amenable to the explanation that I offered to some online chums, but since it's the 1 year anniversary of her passing, I'll share it here in honour of Mrs. Unique Name ⁠— may she rest in peace.

It's always been a bit of a challenge with genealogy: Your best potential sources are going to die before you get a chance to ask. So, despite what advances tech changes may bring, it's better to test sooner than later.

by anonymous G2G6 Pilot (130k points)

Oh, I'm fairly familiar with imputation. In fact, I posted about the Roberta Estes blog you referenced here on G2G the day Roberta wrote it in 2017. However, none of the companies currently providing matches using imputation disclose what their criteria or algorithms are, and we can be certain that when Ancestry.com begins use of the GSA chip they won't tell us how they're running the comparisons, either. In reference to the Ancestry situation, I described imputation a bit yesterday...and the length--and divergence from the topic--of that post is what caused me to decide to start a new one.

We don't know what minimum thresholds are established by undisclosed imputation methods, either by directly observed alleles or through linkage disequilibrium with a genotyped variant. That pairwise correlation, referred to as r2, is usually considered acceptable if r2 >= 0.8. We have no way of knowing what companies are using, but an 80% threshold ain't nuthin' to write home about if you're dealing with what may or may not be a shared 10cM segment.

And in truth, what we're doing with genealogy is a sort of double-up imputation. Strictly speaking, with imputation you're taking a limited set of identified variants and extrapolating an inferred, genome-wide representation based upon whatever database you're using. But for genealogical matching, that has to happen twice, once for each set of results, then the two different sets of genomically imputed results are compared. More margin for error.
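For the curious, the r² figure mentioned above is the standard linkage-disequilibrium statistic, computable from haplotype and allele frequencies as D² / (pA(1−pA)·pB(1−pB)), with D = pAB − pA·pB. A small Python illustration; the frequencies are made up:

```python
# Standard pairwise linkage-disequilibrium r-squared between two loci,
# from the haplotype frequency pAB and the allele frequencies pA and pB.
# Values >= 0.8 are conventionally treated as acceptable imputation
# quality, per the threshold discussed above.

def ld_r_squared(p_ab, p_a, p_b):
    """r^2 = D^2 / (pA*(1-pA) * pB*(1-pB)), where D = pAB - pA*pB."""
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Invented example: two loci whose minor alleles nearly always travel
# together (pAB is close to both pA and pB), giving strong LD.
print(round(ld_r_squared(p_ab=0.29, p_a=0.30, p_b=0.30), 3))  # 0.907
```

Even at that "acceptable" 0.8 level, one imputed allele in five can disagree with the truth, which is why stacking two imputations on top of each other for a match comparison compounds the error.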

I have minimal confidence in our current states of imputation when dealing with smaller segments. I'm still awaiting my own whole genome sequencing data, but I participate in a couple of groups with knowledgeable folks who already began working with theirs months ago. Nothing scientific, but anecdotally the information seems to be that--when we have complete BAM or VCF files to look at--we're seeing far more false-positive small segment matches from the testing companies--even Ancestry with their Timber algorithm--than we might expect.

To compound this, the GSA chip is more specifically medical/health focused than the OmniExpress chip. The GSA chip devotes about 18% of its standard coverage to clinically-relevant markers, and these exomic data will almost always underperform in imputation...we are, after all, all of us about 99.8% identical, and the exomic regions, the protein expressing genes, are where we're most likely to be the same.

For the time being at least, GEDmatch has chosen not to perform any type of imputation. So if you want to use their tools and have taken a 23andMe v5 test, and I have an Ancestry v2 test, we may be using as few as 90,000 markers for a comparison (I have one real-world example to a known second cousin where only slightly over 89,000 could be compared). In chromosomal areas of low SNP density, having only that sort of number to compare can be like driving from New York to L.A. without ever seeing a mile marker or freeway sign.

But my point was simply that we still have a window of time where, if we choose, we can get a set of data from an OmniExpress chip test.

Edited to add: Originally I said that the "GSA chip devotes about 10% of its standard coverage to clinically-relevant markers." With more research while writing today's blog post, Illumina's numbers show that 18% of the chip's standard markers are clinical/pharma focused.

+3 votes
It's certainly possible that AncestryDNA will move to a new chip at some point in the future, but the current chip is a custom design and has more SNPs annotated in SNPedia than the GSA chip.

Edison said in a different post that Illumina is going to stop manufacturing the chip Ancestry is using now.  Do you disagree with that?
I have no insider knowledge, but Ancestry has its own custom design for a chip.  Ancestry is almost a mass market all by itself, and Illumina would probably be willing to continue to manufacture it.


They may well be thinking about a new chip design after three years, but I rather doubt they would move to the core GSA chip. It would have a very low SNP overlap with its existing database and fewer SNPs annotated by SNPedia.

Time will tell.

Ann beat me back here. Yesterday was non-stop, and I got up at 4:00 this morning to write this lengthy follow-up that I felt was necessary, given recent events, to my June 24 blog post. Today's is "The Pending Reinvention of Ancestry.com." It may be rambling, but hopefully it pulls together more information than we've really been able to detail here. (If you read it, be gentle: I've tabled a final proofread until I can walk away from it for a while.)

Ancestry's chip is not unique. The "custom" part comes in because it's a customized OmniExpress chip...v1.3 of which, the latest, can sample a total of 714,238 markers which includes up to 30,000 user-defined, custom markers. The GSA v2.0 chip handles a total of 665,608 markers, which includes up to 50,000 user-defined ones.

I can't comment about SNPedia's evaluation, but Illumina certainly feels the GSA chip is far, far more relevant for clinical research than is/was the OmniExpress. In fact, by their own summary fully 18% of the base markers were chosen specifically because they address a clinical/pharma focus. And GlaxoSmithKline also thought so when they partnered up last year with 23andMe with $300 million.

Illumina writes that "The clinical research content of the Infinium GSA-24 v2.0 BeadChip was designed through collaboration with medical genomics experts using multiple annotation databases to create an informative, cost-effective panel for clinical research applications.... Variants included on the array consist of markers with known disease association based on ClinVar, the Pharmacogenomics Knowledgebase (PharmGKB), and the National Human Genome Research Institute (NHGRI)-EBI database."

Ancestry's purchases from Illumina really can't carry enough financial clout to merit a completely custom chip, or IMHO to continue manufacturing the OmniExpress. Sales growth of DTC DNA tests has been slowing across the board during the past year or so, and to have Illumina continue making the OmniExpress, Ancestry would have to pay so much per chip (because they'd be the only ones using it) that it would prohibit them from being price-competitive in a field where they're the Southwest Airlines: the ones who drove down baseline pricing to begin with.

In fact, about two hours after I posted that blog entry, a guy from Orlando emailed me and said he believed Illumina had already ceased general support for the OmniExpress chip. He pointed me to Illumina's support website, and I just now included that information at the bottom of my blog post. Sure enough, Illumina's submenu shows support options for 10 different human genotyping microarray chips, and the OmniExpress is no longer among them.

I'll bet ya a Starbucks' latte and a donut that we see AncestryDNA move to the GSA chip this year. laugh

You raise lots of good points. The almost continuous specials could even be a sign that Ancestry is reducing its inventory.

But the current version is more customized than just adding a few tens of thousands of SNPs to the core selection of the old Illumina chips. This is evident in the SNP overlap table at the ISOGG Wiki.


Note the slight differences between the companies using the GSA chip each with their own custom SNP selection. Compare that to Ancestry v1 and v2, where the SNP overlap is only 426,760. V2 dropped a large percentage of SNPs and replaced them with others, so I think it's fair to call it a fully customized chip.

May very well be the case, Ann. A Twitter mini-war is in agreement with you about a completely customized chip. I'm never afraid to have a hypothesis disproven.  laugh  That's why I only bet you a coffee and donut.

I don't have any insight as to how Louis, if he's still the source, is running those comparisons. But we know not even all Ancestry v2 chips are identical. They made a big announcement in May 2016 with the first v2--the announcement was only two months before the press release that a multi-year deal had been inked with Quest Diagnostics to do the lab work--but we know they tweaked the chip a little in at least two other iterations, April 2018 and December 2018.

These low-cost microarray chips are only a small source of revenue for Illumina, and an even smaller source of relative profit. They booked $3.33 billion in revenue in 2018, and $826 million in net attributable profit. It's hard for me to get my curmudgeonly and calcified business brain around how it would be economically feasible for either Illumina or Ancestry to manufacture a completely unique microarray chip for Ancestry and everything stay affordable enough for a $59 retail test. I know customization certainly can be done with Illumina designing and synthesizing a special oligonucleotide pool based on a customer's specifications, and then tuning manufacture of the chips, but I gotta think this would have significant up-front fees plus continuing custom-manufacturing fees each time a product run is made.

Add to that a completely custom chip may cause variance in the workflow at Quest Diagnostics that might cost Ancestry additional money. I'd assume everything would still run on Illumina iScan or HiScan equipment--no problem for Quest--but a completely custom chip might mean different assay chemistry, maybe at hybridization.

I honestly don't know. But it just seems logical that Ancestry would see, with a truly custom chip, cost mark-ups at both manufacture and lab testing that its competitors wouldn't face. Production mark-ups like that just don't seem like they'd be tenable. But we'll see in the next several months.

(BTW, I shared the blog post on Twitter so that an Ancestry insider might want to speak up and set the record straight. But they always hold everything close to the vest, and I doubt that will happen.)

However this pans out, I can't dispute your bottom line advice. With the current sale prices, now is a good time to take an Ancestry test.

Ann, I just did a completely inconclusive--but perhaps interesting--comparison. The only Ancestry raw data I have at my disposal is my own, from a test in December 2016 but which still shows in the header "AncestryDNA array version: V1.0" and "AncestryDNA converter version: V1.0."

I obtained from Illumina the listing of all standard rs IDs in OmniExpress-24 v1.2 and v1.3. There were a total of 6,876 SNPs in OmniExpress v1.2 that were removed in v1.3, and a total of 7,515 different ones added. I also have listings of the removed and added rs IDs from v1.1 to v1.2, but the actual rs ID listing for v1.1 seems to no longer be available.

Since we can't with certainty correlate the date/version of a given Ancestry test with any possibly-correspondent OmniExpress version, I chose to use the v1.3 listing and merge in the retired SNPs from v1.2. Then I made certain there were no duplicates in that merged data (there were two).

Next, I took all rs IDs shown in my Ancestry raw data--excluding all chromosome, position, and allele information; so all rs IDs were included regardless of possible no-calls: a total of 701,479 SNPs.

Finally, I ran a comparison to determine whether a unique rs ID appeared in both lists of data (a returned value of "1"), or whether an rs ID appeared in the Ancestry raw data but not in the default list of OmniExpress SNPs (a returned value of "0").

The result was not what I was expecting to see. Exactly 6,093 rs IDs were unique to the Ancestry data; all others had a correspondent in the OmniExpress default data. That means 99.13% of the SNPs Ancestry tested are in the OmniExpress v1.2/v1.3 data sets. I was fully expecting at least a 20K gap even if the chip had been customized only within the limits of the standard-production chip, and perhaps well over 50K if it was a completely customized chip, as suspected since the May 2016 introduction of Ancestry v2.
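For anyone who'd like to replicate this, the comparison itself boils down to set arithmetic once each manifest has been reduced to a plain list of rs IDs. A minimal Python sketch, with invented rsIDs standing in for the real manifests:

```python
# Sketch of the rsID cross-check described above: how many of one
# vendor's markers appear in a reference chip manifest. The "1"/"0"
# flags in the spreadsheet version correspond to membership in the
# intersection vs. the difference of the two sets.

def compare_manifests(test_rsids, reference_rsids):
    """Return (shared_count, unique_to_test_count)."""
    test = set(test_rsids)
    reference = set(reference_rsids)
    shared = test & reference        # the spreadsheet's "1" rows
    unique = test - reference        # the spreadsheet's "0" rows
    return len(shared), len(unique)

# Invented toy manifests for illustration.
shared, unique = compare_manifests(
    ["rs1", "rs2", "rs3", "rs4"],
    ["rs2", "rs3", "rs4", "rs5"],
)
print(shared, unique)  # 3 shared, 1 unique to the test list
```

With the real lists (701,479 Ancestry rsIDs against the merged, de-duplicated OmniExpress v1.2/v1.3 set), the same operation yields the 6,093 unique rsIDs reported above.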

Since my data header shows v1.0, without more information I have to assume that they processed my kit, ordered approximately five months after the introduction of the first iteration of v2, on existing stock of the v1 chip. Seems a very long time to still be holding old stock, but it does look that way.

Still, only 6,093 rs IDs differentiating my Ancestry data from the OmniExpress default data set (albeit a set combining both v1.2 and v1.3) was...surprising, to say the least. What I'd like to do is get my hands on some test results processed by Ancestry after January 1 this year. We still can't be certain those were tested against what SNPedia calls "v2d," updated last December, but they would almost certainly have been tested on nothing earlier than "v2c," released April 2018. If Ancestry is using a completely custom chip, we should be able to tell. Plus I'd like to run the same comparison against the GSA v2 chip so we can see the actual potential overlap. I'll put out a "call for assistance" here and on my blog to see if I can get a few sets of results...rs ID column only, of course; no actual information about the test results.

BTW, if you'd like to have a look at the result of this little comparison, you can get the .CSV file here (79K) that contains all the Ancestry rs IDs that did not also appear in the OmniExpress v1.2/v1.3 data. I didn't bother to run them against GRCh37; seems like extra effort that wouldn't tell us anything further about whether or not Ancestry's chip is truly custom. But if I can get a solid, consistent set of results from a few recent Ancestry tests, it might be interesting to at least see which differences are autosomal, which X/Y/PAR, and which are mtDNA.

Edited to Add: Changing one incorrect snippet to read: (a returned value of "0"). Having only two possible values and they both be "1" ain't very helpful...

Here's a link to the template files I provided Louis Kessler for his cross-tabulation. Many are my own and the remainder are single examples from files I solicited, so they could well be from different minor revisions. LivingDNA v2 is the exception: they do not report no-calls, so I merged several examples to get a somewhat better sampling of SNPs.


Ouch; thought I'd already replied to this because I've already crunched numbers in some of the files. So, belatedly...Thanks, Ann!

This is turning into a full-blown mini-project time-sink.  wink  I have only a couple of sets of results back so far--I've responded to a half-dozen interested parties just since the notice went up on the blog a few hours ago--but I already see total rsID count discrepancies in the 15,000 range on recent v2 tests. That just doesn't seem explicable from, say, a few damaged probes on a given chip, or failures during fluorescence or imaging. We'll see what we get. Ann, I'll make sure you have access to detailed data once done.

+3 votes
I plan to order several test kits.  When I click on the link, I see no indication that WikiTree will get anything.  The link, to my surprise, took me to my own Ancestry account.  Is there any way I can be sure WikiTree is getting a benefit?
by Living Kelts G2G6 Pilot (517k points)

Hey, Julie. I snagged that link from here: https://www.wikitree.com/wiki/Help:DNA_Tests#Types_of_Tests.

So I honestly can't comment about its use. Maybe someone from the WikiTree DNA Project can? For reference, the affiliate link on that Help page is https://prf.hn/click/camref:1011l4xx5/creativeref:1011l28282.

And, yep, just like you, because Ancestry.com remembers my log-in since I'm there frequently, when I try the link it takes me straight to my own AncestryDNA page, too. Dunno.

+2 votes
Yes, do it. Well worth the 59 bucks! I had already done the Y test with FTDNA and their family search, so I was reluctant to do yet another test. Let me just say: Ancestry's database is huge! Plus their ethnicity estimates are more thorough and accurate. Good features too, like ThruLines and a common-ancestor feature with applicable matches. Better research options for trees than MyHeritage, also.
by Jesse Elliott G2G6 (6.3k points)
It comes out to more than $59 once you add the shipping.  And guess what Ancestry did just after I ordered my tests?  They sent me an e-mail telling me I could buy tests for $74!
If you have Amazon Prime you can get free shipping. Today it's showing the $59 price (or $69 if you include traits).
+2 votes
I'm hoping that one day DNA testing will be refined further, so that a person can go back not just five generations but perhaps 10 or 12. That would open a lot of new doors for me and many others who wish to do deeper genealogy. With our present knowledge it doesn't seem possible, but who knows what the future will hold, given advancements we couldn't have foreseen 10 or 15 years ago.
by James Stratman G2G6 Mach 8 (89.2k points)
We can go back more than 10 generations already. Check out the "Genome of the Netherlands" work; they went back to 2200 BCE, which is what you get when you analyze IBD data in the 1-2 cM range (as per their presentation, before I get flamed with "it's all IBS").
Andreas, interesting. I always thought anything below 7 cM couldn't be trusted, as it was considered "noise" and therefore unreliable. Why isn't the 1-2 cM range used or talked about? I would like to investigate this further. Thank you for your feedback.

"It's all IBS."

Glad I didn't have a mouthful of coffee when I read that. smiley

For quick reference: http://www.nlgenome.nl/; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3895638/; and https://www.nature.com/articles/ng.3021.

I'm not sure I'd call those quick, Ed, especially the first. smiley Would you please clarify:  They used whole-genome sequencing for their analysis, didn't they?  And while I've heard of a few people who've had their whole genomes analyzed (I think you said somewhere you were waiting on your own results), is there any commercially available service that will do the comparisons?

P.S. (added later)  For those of us who are ignorant, would you please also clarify:  Does whole-genome sequencing distinguish between paternal and maternal DNA?  

Well, it was quick for me. All I did was paste in the links.  laugh  This one won't be as quick...

(Note: I'll split the issue of maternal/paternal identification off into a second message. Am I the only one who has to watch out for the G2G per-post word limit? Ahem.)

Yep; still waiting on my WGS results. Eight months and 12 days since the testing company received my sample. But hey; who's counting?

The really short answer is that, to my knowledge, there are no commercial services providing what we would call comparison or matching services on whole genome results. Yet. And I believe it remains to be seen how those services will develop and function; I have a feeling there will be more than one methodology. I'm betting one method will be to use VCF (variant call format) files only--in other words, comparing only each of our differences from the then-current human genome reference model--and another may be similar to GEDmatch today, comparing actual SNPs but with a data library that would likely have to hold somewhere around 15 to 20 million relevant SNPs.
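At its simplest, a VCF-only matcher of the sort I'm imagining might treat each person's variants as a set of (chromosome, position, alt-allele) keys and intersect them. This is purely a sketch of that idea, not any existing service's method, and the coordinates and alleles below are invented:

```python
# Hypothetical VCF-style comparison: each person is represented only by
# their differences from the reference genome, as (chrom, pos, ref, alt)
# tuples. Shared variants are simply the intersection of the keyed sets.

def shared_variants(variants_a, variants_b):
    """Return the set of (chrom, pos, alt) keys present in both people."""
    keys_a = {(chrom, pos, alt) for chrom, pos, ref, alt in variants_a}
    keys_b = {(chrom, pos, alt) for chrom, pos, ref, alt in variants_b}
    return keys_a & keys_b

# Invented toy variant lists for two people.
person_a = [("1", 1000, "A", "G"), ("2", 2000, "C", "T")]
person_b = [("1", 1000, "A", "G"), ("3", 3000, "G", "A")]
print(len(shared_variants(person_a, person_b)))  # 1
```

Real matching would of course then have to cluster those shared variants into contiguous segments and estimate cM lengths, but the appeal of the VCF approach is clear: a few million variant records per person instead of an 80-100 GB BAM.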

I don't know that I can envision an IT infrastructure, at least not in the near future, that could do an online comparison and data presentation of whole genomes. The BAM (binary alignment/map) file for a 30X coverage run of our 3.2 billion base pairs weighs in at about 80 to 100 gigabytes in size. (Andreas, did you get your BAM? If so, how large is it?) The human-readable text version of that is called a SAM (sequence alignment map) and it's even larger. Doing even a one-to-one comparison between two BAM/SAM files would take a lot of computing power and a lot of working storage space. Standalone, batched requests I could see; real-time comparisons like we're used to today...not so much.
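The back-of-envelope arithmetic behind those file sizes looks like this; the one-byte-per-sequenced-base figure is my own rough compression assumption, not a vendor spec, but it lands in the 80-to-100 GB range quoted above.

```python
# Back-of-envelope numbers behind the "IT is the bottleneck" point.
# All figures are rough approximations, not vendor specifications.
genome_bp = 3.2e9       # haploid human genome length in base pairs
coverage = 30           # 30X sequencing depth
bytes_per_base = 1      # crude assumption: ~1 byte/base after BAM compression
                        # of reads plus quality and alignment data

raw_bases = genome_bp * coverage              # total sequenced bases
bam_gb = raw_bases * bytes_per_base / 1e9     # crude BAM size estimate

print(f"~{raw_bases/1e9:.0f} billion sequenced bases -> on the order of {bam_gb:.0f} GB")
```

At roughly 96 GB per person, even a one-to-one online comparison means shuttling nearly 200 GB of working data, which is why batched, offline processing seems the more plausible near-term model.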

From the standpoint of direct-to-consumer packaging, I think IT and data communications will be the bottleneck, not what goes on with the sequencing itself. Technical improvements in the lab--like nanopore and long-read sequencing--keep driving costs down and speeding processing up while maintaining or even increasing the kind of accuracy we see today in 30X coverage tests.

I really don't know much about GoNL (Genome of the Netherlands) other than that it's a mature, well-funded study and offers a possible glimpse of where we'll be in a couple of years with WGS for genealogy. A collaboration of multiple organizations including government entities and universities, GoNL kicked off in 2009, and the baseline samples come from 250 trio-phased sets--two parents and a child--so 750 individuals. Those, I believe, have all been whole-genome sequenced, and a number of additional volunteer participants were tested on genotyping microarrays.

To draw an imperfect analogy that's at work today, in a yDNA study it's a big deal to have multiple men take the BigY-700 full-sequence test because then, for other men who match them closely at 111 Y-STR markers, the men not full-sequenced can purchase individual SNP tests at a low cost and still arrive at good evidence they and the BigY testers align down to at least that individually tested SNP. Not everyone has to pay the big bucks for all to be able to receive some evidentiary benefit.

Told you it was a bad analogy. But the similarity is that--as we move forward with relatively inexpensive but high resolution direct-to-consumer whole genome sequencing--those who have been full-sequenced can provide detailed data down to individual loci that can support autosomal triangulation results for those who have taken coarser, less accurate microarray tests. So some good news there is that even if WGS supplants genotyping microarrays in the near future--which I think it will--the data from our microarray tests will still be valuable and tests by deceased family members will never be obsolete.

That's basically what the GoNL project has done. Trio-phased whole genome sequencing is informing the less data-rich microarray tests.

The issue we have with microarray testing is that only a (relatively) few markers, or SNPs, are examined. To put it into perspective, you're probably familiar with GEDmatch's previous default threshold that said a segment wasn't a real segment unless it contained 700 tested SNPs; that threshold has since been reduced to a floating figure that can go as low as 200 SNPs.

A centiMorgan is not a physical measurement; it's calculated from the expected probability of crossover (recombination) at a given point on a chromosome. The number of cM calculated for the same physical length--in base pairs--can differ significantly not only from chromosome to chromosome but between regions of the same chromosome. That said, very roughly speaking, one centiMorgan equates to a stretch of chromosome about one million base pairs long. Ergo, a small 7cM segment will comprise, give or take, about 7 million base pairs.

Back to the SNP thresholds. If a 7cM segment can be considered valid when only, say, 400 of the same SNPs were examined in both tests being compared, that means we can positively match only one position in every 17,500. So what we end up doing is assuming that the other 6,999,600 positions that were not tested are identical.
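The arithmetic in the last two paragraphs, spelled out, using the rough one-cM-per-million-base-pairs rule of thumb (real genetic maps vary widely by chromosome and region, so treat this as intuition only):

```python
# SNP-density arithmetic for a "valid" small segment, using the coarse
# rule of thumb that 1 cM spans roughly 1 million base pairs.
segment_cm = 7
bp_per_cm = 1_000_000          # rough average, varies by chromosomal region
snps_tested = 400              # a low per-segment SNP threshold

segment_bp = segment_cm * bp_per_cm        # ~7,000,000 bp in the segment
spacing = segment_bp // snps_tested        # one tested SNP per this many bases
untested = segment_bp - snps_tested        # positions we simply assume match

print(f"1 tested SNP per {spacing:,} bp; {untested:,} positions assumed identical")
```

That one-in-17,500 sampling rate is the crux: everything between tested markers is an assumption, which is exactly the gap imputation tries to fill.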

Without diving into the exome and protein-coding genes and the approximately 3 to 5 million SNPs that really differentiate your particular genome from someone else's, you see where I'm goin' with this. WGS can serve as something like a Rosetta stone. As I'm sure the GoNL study has done, if you get several microarray test results that match nicely with a set of trio-phased WGS results, you can do a bit better than simply guess at what lies within those empty stretches of 17,500 untested base pairs.

If this sounds like imputation, you're correct. In a nutshell, imputation infers the allele values of untested SNPs based on the linkage disequilibrium patterns derived from directly tested markers. When it comes to estimating the missing SNPs from a set of results, there are two different types of reference panels, or database libraries, used: haplotype panels and genotype panels. It's important to note that both require whole genome sequencing as a starting point in order to have that Rosetta stone. If you're trying to decipher a message from words in certain positions separated by thousands of unknown words, you need to have that (at least nearly) complete reference to compare against.
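A vastly simplified toy of the haplotype-panel idea: find the reference haplotype that best matches the markers we *did* test, then copy its alleles into the untested positions. Real tools like Beagle work probabilistically across many haplotypes at once rather than picking a single best match, and the eight-position "panel" below is fabricated purely for illustration.

```python
# Toy haplotype-panel imputation: pick the reference haplotype that best
# matches the tested markers, then borrow its alleles for untested sites.
# The panel is fabricated; real panels come from whole-genome-sequenced trios.

REFERENCE_PANEL = [
    "AACGTTAG",
    "AACGATAG",
    "GTCGTTCC",
]

def impute(observed):
    """observed: dict position -> allele for the few microarray-tested sites."""
    def mismatches(hap):
        return sum(hap[pos] != allele for pos, allele in observed.items())
    best = min(REFERENCE_PANEL, key=mismatches)
    # Keep the tested alleles; fill every untested position from the
    # best-matching reference haplotype.
    return "".join(observed.get(i, best[i]) for i in range(len(best)))

# Only positions 0, 3, and 7 were "tested"; the other five are inferred.
print(impute({0: "A", 3: "G", 7: "G"}))
```

The quality of the fill-in depends entirely on how well the reference panel represents the test-taker's population, which is the point the following paragraphs make about AncestryDNA's database.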

I think some mistakenly believe that AncestryDNA, for example, uses its huge database to keep feeding newer and more refined information into its Beagle imputation tool. It doesn't. It can't, because within that huge database all tests (more or less) looked at exactly the same few genetic markers: having 10 million copies of the same book, each with only the same word in the same place every ten otherwise blank pages, isn't going to help you figure out what the book says.

That's where we are with microarray testing today and why imputation is currently a less than stellar alternative for comparing results between chips that have very little overlap in markers tested. None of us have had whole genome sequencing performed by Ancestry.com or any of the other major DTC testing companies. They don't have haplotype reference panels gleaned from their 15 million tests on file; they have to rely on external sources for those data, and that can only start with the much broader genotype panels.

When we talk about genotype reference libraries, the 1000 Genomes Project comes first to mind. Its samples are grouped into continental panels or "super-populations": African (AFR), Admixed American (AMR), East Asian (EAS), European (EUR), and South Asian (SAS). Under those are more granular sub-population panels, but they can't really achieve what GoNL has done with its highly specific testing strategy: GoNL has created haplotype references for its defined study population against which microarray results can be accurately imputed.

Now, Ancestry or someone else can go to GoNL and ask for or acquire permission to use the data, and then for a select number of tests whose haplotypes align with those within GoNL, the predictive matching can be quite accurate. But without definitive haplotype WGS libraries like that, the big testing companies are working with population-level data.

Imputation against genotype reference panels isn't outlandishly inaccurate, but genealogy demands precision with, say, 5th cousins. Of the roughly 4,700 of them we each have, only about 15% are likely to share any chip-detectable DNA with us at all; without recent pedigree collapse we'd need to test six or seven actual 5th cousins to find one DNA match. That means working with small segments and very few tested SNPs in triangulation groups with multiple members. A recent study of imputation in a five-way admixed African population found that the best-performing method still yielded a genome-wide autosomal error rate of 11.98%. We consistently do better with the EUR population panel, but is even a 5% or 8% error rate acceptable?
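Taking the ~15% share rate for 5th cousins at face value (published estimates vary by company and model), the testing odds work out like this:

```python
# How many actual 5th cousins must test before we expect one DNA match?
# The 15% share rate is an approximate published figure; estimates vary.
p_share = 0.15                  # chance an actual 5th cousin shares detectable DNA
cousins_needed = 1 / p_share    # expected tests per match (geometric mean)

print(f"Expect to test about {cousins_needed:.1f} actual 5th cousins per DNA match")
```

And because the matches we do find at that distance rest on one or two small segments, even a few percentage points of imputation error can make or break a triangulation group.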

I think that's a fence we'll be able to climb once DTC WGS becomes popular.
