If you haven't tested with AncestryDNA, should you do it right now?

Becky started a topic about AncestryDNA's summer sale, currently in effect (US$59 through 26 August; the sale may already have ended in the UK), but I thought I'd caused enough topic drift there and decided to start a new question.

To answer my own question, if you've tested only on the Illumina GSA or the Thermo Fisher Scientific Affymetrix chips, then yes, you may want to get an AncestryDNA test soon...like, now, during this August sale. I may have been a bit premature with this blog post from last June (which should explain most of the reasons why you'd want an AncestryDNA test), but we now have additional corroboration that AncestryDNA will, in fact, soon be leaving behind the OmniExpress microarray chip.

Business Insider Prime reported this morning that "Genetic testing magnate Ancestry CEO Margo Georgiadis said that the startup plans to branch out from genealogy testing and expand into individualized medicine." They simply cannot do that with the OmniExpress chip, and I believe my prediction of a move to the GSA array no later than end-of-year is on point.

You can view an abridged, publicly-available version of the Business Insider article here. It goes on to say: "The company has largely refrained from stepping into the broader healthcare space...despite the success its rivals have achieved by leveraging genetic insights for pharmaceutical research and precision medicine. But now, Ancestry is building out a full health team, with open roles in marketing, engineering, communications, and senior management."

I could argue with Business Insider terming AncestryDNA a "startup," though. Their first sales began November 2011, and they began to dominate the market in 2015. They've processed more direct-to-consumer DNA tests than any other company. As of January 2019 we believe Ancestry had sold roughly 14 million tests, maybe a bit more.

This is a big deal because the change in chipsets will mean that those 14 million people will have only 20% of the same SNPs tested as will the subsequent test takers. Mind you, for genealogy neither set of results is really better than the other...but the GSA chip is definitely better for health/medical analysis.

What's it mean for genealogists and adoptees? Short of waiting for whole genome sequencing to become routine, affordable, and to take over the market--which I think it will by 2022 or 2023--if you haven't tested on the OmniExpress chip, you may want to before it goes away. Take a look at this graph from MIT Technology Review:

All of the tests by Ancestry have been on the OmniExpress chip. July 2017 and earlier, 23andMe used the OmniExpress chip. March 2019 and earlier, both FTDNA and MyHeritage used the OmniExpress chip. Living DNA never used the chip. That can be your benchmark to help determine whether you or a relative tested on the OmniExpress chip. All the microarray chips in use can be customized--modified to include extra SNPs--and the companies did that. But the customization for the OmniExpress affects up to 30,000 SNPs out of a possible total of 714,000...the foundational ~680,000 SNPs are the same.

To date, my very rough estimate is that around 29 million people have purchased DTC DNA tests from the major companies, and of those somewhere around 19 or 20 million have been performed on the OmniExpress chip. And every company, including GEDmatch, has been struggling with how best to reconcile the fact that the OmniExpress and GSA chips examine only about 20% of the same genetic markers:

As a result, as the segment sizes grow smaller, I'm convinced we're seeing either--or both--more false-positive matches than ever, or valid segments not being evaluated as such due to too few in-common SNPs (the latter applies mostly to GEDmatch).

A fairly new Tier 1 tool at GEDmatch allows you to upload results from multiple testing companies and combine them into one "super kit." If results exist from both an OmniExpress chip and a GSA chip, that resulting "super kit" would contain somewhere around 1 to 1.2 million SNPs. It may or may not result in a greater number of matches showing up, but it will certainly improve the accuracy.
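To make the "super kit" idea concrete, here is a minimal Python sketch of merging two sets of results keyed by rs ID. It is only an illustration: the function name `merge_kits`, the `{rsid: genotype}` layout, the sample rs IDs, and the rule of turning disagreeing calls into no-calls are all assumptions of mine, not a description of how GEDmatch actually builds its combined kits.

```python
def merge_kits(kit_a, kit_b):
    """Merge two {rsid: genotype} dicts into one combined kit.

    Assumed rule (hypothetical): if both kits tested the same SNP and
    the calls disagree, store a no-call ("--") instead of picking one.
    """
    merged = dict(kit_a)
    conflicts = []
    for rsid, genotype in kit_b.items():
        if rsid not in merged:
            # SNP tested only on the second chip: just add it.
            merged[rsid] = genotype
        elif sorted(merged[rsid]) != sorted(genotype):
            # Same SNP, different call ("AG" and "GA" count as equal).
            conflicts.append(rsid)
            merged[rsid] = "--"
    return merged, conflicts


# Toy data with made-up rs IDs; real kits hold hundreds of thousands of SNPs.
omni = {"rs1": "AG", "rs2": "CC", "rs3": "TT"}
gsa = {"rs3": "TT", "rs4": "GA", "rs2": "CT"}

superkit, conflicts = merge_kits(omni, gsa)
print(len(superkit), "SNPs in the merged kit; conflicts:", conflicts)
```

The point of the sketch is simply that the merged kit is the union of both marker sets, which is why two chips with little overlap produce a kit much larger than either one alone.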

Perhaps most important, it keeps those 19 million sets of OmniExpress results pertinent. People have passed away, or may never take another DNA test. Once we transition to whole genome sequencing, this won't be an issue...providing someone builds us tools that allow us to do "historical test comparisons," which I'm sure will happen. But until then, I see this most affecting the oldest of our lines, and adoptees.

In September 2018, Louis Kessler wrote about "The Benefits of Combining Your DNA Raw Data." He took the results from his tests at all five of the major testing companies and manually merged them into a single "super kit" that resulted in just under 1.4 million unique SNPs.

After GEDmatch introduced their new tool, he followed up last April with "Combine Kits into One Superkit on GEDmatch Genesis." That process gave him a new kit number that contained 1.1 million SNPs.

Bottom line, if you think you might want or need results from the OmniExpress chip, you may have only a few months left to obtain them, and AncestryDNA is the only remaining source. At $59, it might be well worth another test kit or two during the current sale. If you'd like to help WikiTree a tiny bit at no additional cost to you, you can use this link to order.

asked Aug 15, 2019 in The Tree House by Edison Williams G2G6 Pilot (439k points)

Yeah, that's what I figured. I'd hate for people to stockpile tests thinking it meant it would be on the old chip regardless of when it was submitted.

OK, so followup question... my mother, brother, paternal uncle and I have tested. I'm hoping to get at least one or two of my father's other three siblings and my mother's first cousin to test. Obviously the more the merrier in general, but now I'm wondering if I should wait and do them on the new chip to have better coverage of matches there as that dataset grows, or rush so I get them done on the old chip. Correct answer, of course, is to do everyone on both chips and cover all my bases! But if I had to prioritize which approach would make more sense?

commented Aug 16, 2019 by Lisa Hazard G2G6 Pilot (264k points)

Barry, I agree: seems like a strange response with no hint of compromise. Out of curiosity, did you telephone FTDNA and speak to a human to whom you could fully explain the situation (and maybe get transferred to a supervisor), or was it via email or chat? In general I find FTDNA's support staff to be pretty good. They don't outsource to a boiler-room call center; the agents are right there in the same building.

If there's enough material in the stored sample, you might prefer the BigY-700 anyway. But otherwise I have to think there would be some way they could accommodate an additional Family Finder test for a deceased individual they already have on file. I mean, let the current information stand and, if it's lab procedure that's the issue, assign a new kit number and web account to the sample, pretend it's a new order, and just...do it. Shouldn't cause them any problems, and could make a big difference to your family's research.

commented Aug 16, 2019 by Edison Williams G2G6 Pilot (439k points)

"Or run the new test and merge them to make a superkit like GEDMatch can do."

Nah. Way too logical, Lisa. Thing is, we know FTDNA has to have a SNP catalog, or database, that includes all of their OmniExpress result markers, all of their new GSA chip markers and, since they accept 23andMe uploads, at least almost all of the additional markers 23andMe customized their v5 GSA chip with. So I can't imagine what technical limitation would keep them from doing just that...even make it a selling point because, other than GEDmatch, no one else is doing multiple result aggregation yet.

But I've heard no hint that something like that's in the works. Doesn't mean it isn't. Just that I haven't heard any chatter yet. Maybe Max or Bennett will read this and decide it's a good idea.

commented Aug 16, 2019 by Edison Williams G2G6 Pilot (439k points)

If You Would Like to Help with a Tiny Bit of Research...

I'm looking for a few sets of very recent AncestryDNA results. I don't need or want any DNA information, just which SNPs Ancestry tested. Posted results will be completely anonymous.

If you've tested in 2019--meaning the test was purchased in 2019, not just the results received--and if you're even a little bit comfortable using a spreadsheet program like Excel, please consider sending me a private message from my profile.

I can explain what's needed step-by-step in a reply email, but essentially you will have already downloaded your raw data file from Ancestry.com, and you'll be opening the file in a spreadsheet program, deleting all the information that pertains to you individually (chromosome numbers, marker positions, and allele information), saving the file (to a different filename so that you don't overwrite your actual results), ZIP compressing it, and emailing it to me. The objective will be to analyze which SNPs AncestryDNA is currently testing and compare those to the Illumina standard offerings in their OmniExpress v1.3 chip and newer GSA v2.0 chip.
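For anyone who would rather script this than use a spreadsheet, the stripping step could look something like the Python sketch below. It assumes the raw data file is tab-separated, has comment lines beginning with "#", and has a header row containing a column named "rsid"; check your own file before relying on that. The function name `extract_rsids` and the sample rs IDs are made up for illustration.

```python
import csv
import io


def extract_rsids(lines):
    """Return only the rs ID column from raw-data lines.

    Assumes (not verified against every file version): '#' comment
    lines, then a tab-separated header row that includes 'rsid'.
    """
    data_lines = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t")
    header = next(reader)
    col = header.index("rsid")
    # Keep the marker name only; chromosome, position, and alleles are dropped.
    return [row[col] for row in reader]


# Tiny stand-in for a real download; in practice pass an open file object.
sample = io.StringIO(
    "#AncestryDNA raw data download\n"
    "rsid\tchromosome\tposition\tallele1\tallele2\n"
    "rs0000001\t1\t1000\tA\tG\n"
    "rs0000002\t1\t2000\tC\tC\n"
)

rsids = extract_rsids(sample)
print(rsids)
```

Writing the resulting list to a new file, one rs ID per line, gives a file with no personal genetic information in it.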

Thanks!

commented Aug 19, 2019 by Edison Williams G2G6 Pilot (439k points)

5 Answers

Answer 1 · 2019-08-15T20:58:37+0000

As a result, as the segment sizes grow smaller, I'm convinced we're seeing either--or both--more false-positive matches than ever, or valid segments not being evaluated as such due to too few in-common SNPs (the latter applies mostly to GEDmatch).

This is an issue, however I'm convinced that there is a viable workaround: imputation.

In short, imputation uses a combination of population genotype statistics and/or segment data from other customers and individuals to create a more complete composite picture for each individual's data set.

The reliability is reasonable, however imperfect, but it should supply sufficient compatibility for much of our matching needs. Moreover, this is not something that we consumers need to perform; it's a back-end process performed by the company.

For a very good run-down of what imputation is about, I would recommend reading DNAeXplained's article on imputation.

The real benefit of getting tested sooner than later is the opportunity to get in touch with others while they are still around. Lots of people really only check messages within the first month or so of getting tested. And others, often who are genealogists, have strong odds of passing away rather suddenly.

Last September, I was going through my matches and had a rather poignant experience of looking up a match who, it turned out, had logged in to Ancestry DNA for the very last time exactly one year ago from today. I'm not sure if this audience would be amenable to the explanation that I offered to some online chums, but since it's the 1 year anniversary of her passing, I'll share it here in honour of Mrs. Unique Name--may she rest in peace.

It's always been a bit of a challenge with genealogy: Your best potential sources are going to die before you get a chance to ask. So, despite what advances tech changes may bring, it's better to test sooner than later.

Ann Turner · Answer 2 · 2019-08-17T13:11:33+0000

Ann beat me back here. Yesterday was non-stop, and I got up at 4:00 this morning to write this lengthy follow-up that I felt was necessary, given recent events, to my June 24 blog post. Today's is "The Pending Reinvention of Ancestry.com." It may be rambling, but hopefully it pulls together more information than we've really been able to detail here. (If you read it, be gentle: I've tabled a final proofread until I can walk away from it for a while.)

Ancestry's chip is not unique. The "custom" part comes in because it's a customized OmniExpress chip...v1.3 of which, the latest, can sample a total of 714,238 markers which includes up to 30,000 user-defined, custom markers. The GSA v2.0 chip handles a total of 665,608 markers, which includes up to 50,000 user-defined ones.

I can't comment about SNPedia's evaluation, but Illumina certainly feels the GSA chip is far, far more relevant for clinical research than is/was the OmniExpress. In fact, by their own summary fully 18% of the base markers were chosen specifically because they address a clinical/pharma focus. And GlaxoSmithKline also thought so when they partnered up last year with 23andMe with $300 million.

Illumina writes that "The clinical research content of the Infinium GSA-24 v2.0 BeadChip was designed through collaboration with medical genomics experts using multiple annotation databases to create an informative, cost-effective panel for clinical research applications.... Variants included on the array consist of markers with known disease association based on ClinVar, the Pharmacogenomics Knowledgebase (PharmGKB), and the National Human Genome Research Institute (NHGRI)-EBI database."

Ancestry's purchases from Illumina really can't carry enough financial clout to merit a completely custom chip, or IMHO to continue manufacturing the OmniExpress. Sales growth of DTC DNA tests has been shrinking across the board during the past year or so, and in order to have Illumina continue to make the OmniExpress Ancestry would have to pay so much per chip (because they'd be the only ones using it) that it would prohibit them from being price-competitive in a field where they're the Southwest Airlines: the ones who drove down baseline pricing to begin with.

In fact, about two hours after I posted that blog entry, a guy from Orlando emailed me and said he believed Illumina had already ceased general support for the OmniExpress chip. He pointed me to Illumina's support website, and I just now included that information at the bottom of my blog post. Sure enough, Illumina's submenu shows support options for 10 different human genotyping microarray chips, and the OmniExpress is no longer among them.

I'll bet ya a Starbucks' latte and a donut that we see AncestryDNA move to the GSA chip this year.

commented Aug 18, 2019 by Edison Williams G2G6 Pilot (439k points)

May very well be the case, Ann. A Twitter mini-war is in agreement with you about a completely customized chip. I'm never afraid to have a hypothesis disproven. That's why I only bet you a coffee and donut.

I don't have any insight as to how Louis, if he's still the source, is running those comparisons. But we know not even all Ancestry v2 chips are identical. They made a big announcement May 2016 with the first v2--the announcement was only two months before the press release that a multi-year deal had been inked with Quest Diagnostics to do the lab work--but we know they tweaked with the chip a little in at least two other iterations, April 2018 and December 2018.

These low-cost microarray chips are only a small source of revenue for Illumina, and an even smaller source of relative profit. They booked $3.33 billion in revenue in 2018, and $826 million in net attributable profit. It's hard for me to get my curmudgeonly and calcified business brain around how it would be economically feasible for either Illumina or Ancestry to manufacture a completely unique microarray chip for Ancestry and everything stay affordable enough for a $59 retail test. I know customization certainly can be done with Illumina designing and synthesizing a special oligonucleotide pool based on a customer's specifications, and then tuning manufacture of the chips, but I gotta think this would have significant up-front fees plus continuing custom-manufacturing fees each time a product run is made.

Add to that, a completely custom chip may cause variance in the workflow at Quest Diagnostics that might cost Ancestry additional money. I'd assume everything would still run on Illumina iScan or HiScan equipment--no problem for Quest--but a completely custom chip might mean different assay chemistry, maybe at hybridization.

I honestly don't know. But it just seems logical that Ancestry would see, with a truly custom chip, cost mark-ups at both manufacture and lab testing that its competitors wouldn't face. Production mark-ups like that just don't seem like they'd be tenable. But we'll see in the next several months.

(BTW, I exposed the blog post on Twitter so that an Ancestry insider might want to speak up and set the record straight. But they always hold everything close to the vest, and I doubt that will happen.)

commented Aug 18, 2019 by Edison Williams G2G6 Pilot (439k points)

Ann, I just did a completely inconclusive--but perhaps interesting--comparison. The only Ancestry raw data I have at my disposal is my own, from a December 2016 test but which still shows in the header "AncestryDNA array version: V1.0, AncestryDNA converter version: V1.0."

I obtained from Illumina the listing of all standard rs IDs in OmniExpress-24 v1.2 and v1.3. There were a total of 6,876 SNPs in OmniExpress v1.2 that were removed in v1.3, and a total of 7,515 different ones added. I also have listings of the removed and added rs IDs from v1.1 to v1.2, but the actual rs ID listing for v1.1 seems to no longer be available.

Since we can't with certainty correlate the date/version of a given Ancestry test with any possibly-correspondent OmniExpress version, I chose to use the v1.3 listing and merge in the retired SNPs from v1.2. Then I made certain there were no duplicates in that merged data (there were two).

Next, I took all rs IDs shown in my Ancestry raw data--excluding all chromosome, position, and allele information; so all rs IDs were included regardless of possible no-calls: a total of 701,479 SNPs.

Finally, I ran a comparison to determine whether a unique rs ID appeared in both lists of data (a returned value of "1"), or whether an rs ID appeared in the Ancestry raw data but not in the default list of OmniExpress SNPs (a returned value of "0").

The result was not what I was expecting to see. Exactly 6,093 rs IDs were unique in the Ancestry data; all others had a correspondent in the OmniExpress default data. That means 99.13% of the SNPs Ancestry tested are in the OmniExpress v1.2/1.3 data sets. I was fully expecting at least a 20K gap even if the chip had been customized as allowed in the standard-production chip, and perhaps well over 50K or more if it was a completely customized chip, as suspected by the May 2016 introduction of Ancestry v2.
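For anyone who wants to repeat the comparison without a spreadsheet, the steps described above can be sketched in a few lines of Python. The function name `compare_rsids` and the toy rs IDs are made up for illustration; the 1/0 flag mirrors the returned values described above, and the two totals at the end are the figures from this comment.

```python
def compare_rsids(ancestry_rsids, reference_rsids):
    """Flag each Ancestry rs ID: 1 if also in the reference list, else 0."""
    reference = set(reference_rsids)  # a set also removes any duplicates
    flags = {rsid: (1 if rsid in reference else 0) for rsid in ancestry_rsids}
    unique_to_ancestry = [rsid for rsid, flag in flags.items() if flag == 0]
    return flags, unique_to_ancestry


# Toy example: merge the v1.3 list with the SNPs retired from v1.2,
# then check the Ancestry rs IDs against the merged reference.
v13 = ["rs1", "rs2", "rs3"]
retired_from_v12 = ["rs3", "rs4"]
merged_reference = set(v13) | set(retired_from_v12)

flags, unique_to_ancestry = compare_rsids(["rs1", "rs4", "rs9"], merged_reference)
print(flags, unique_to_ancestry)

# The percentage reported above, recomputed from the two totals.
total_ancestry = 701_479
unique_count = 6_093
shared_pct = round(100 * (total_ancestry - unique_count) / total_ancestry, 2)
print(shared_pct)
```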

Since my data header shows v1.0, without more information I have to assume that they processed my kit, ordered approximately five months after the introduction of the first iteration of v2, on existing stock of the v1 chip. Seems a very long time to still be holding old stock, but it does look that way.

Still, only 6,093 rs IDs differentiating my Ancestry data from the OmniExpress default data set (albeit a set combining both v1.2 and v1.3) was...surprising, to say the least. What I'd like to do is get my hands on some test results processed by Ancestry after January 1 this year. We still can't be certain those were tested against what SNPedia calls "v2d," updated last December, but they would almost certainly have been tested on nothing earlier than "v2c," released April 2018. If Ancestry is using a completely custom chip we should be able to tell. Plus I'd like to run the same comparison against the GSA v2 chip so we can see the actual potential overlap. I'll put out a "call for assistance" here and on my blog to see if I can get a few sets of results...rs ID column only, of course; no actual information about the test results.

BTW, if you'd like to have a look at the result of this little comparison, you can get the .CSV file here (79K) that contains all the Ancestry rs IDs that did not also appear in the OmniExpress v1.2/v1.3 data. I didn't bother to run them against GRCh37; seems like extra effort that wouldn't tell us anything further about whether or not Ancestry's chip is truly custom. But if I can get a solid, consistent set of results from a few recent Ancestry tests, it might be interesting to at least see which differences are autosomal, which X/Y/PAR, and which are mtDNA.

Edited to Add: Changing one incorrect snippet to read: (a returned value of "0"). Having only two possible values and they both be "1" ain't very helpful...

commented Aug 19, 2019 by Edison Williams G2G6 Pilot (439k points)
edited Aug 20, 2019 by Edison Williams

Ann Turner · Answer 3 · 2019-08-18T16:51:13+0000

Yes, do it. Well worth the 59 bucks! I had already done the Y test with FTDNA and their family search, so I was reluctant to do yet another test. Let me just say: Ancestry's database is huge! Plus their ethnicity estimates are more thorough and accurate. Good features too, like ThruLines and a common ancestor feature with applicable matches. Better research options for trees than MyHeritage also.

Andreas West · Answer 4 · 2019-08-18T18:06:10+0000

Well, it was quick for me. All I did was paste in the links. This one won't be as quick...

(Note: I'll split the issue of maternal/paternal identification off into a second message. Am I the only one who has to watch out for the G2G per-post word limit? Ahem.)

Yep; still waiting on my WGS results. Eight months and 12 days since the testing company received my sample. But hey; who's counting?

The really short answer is that, to my knowledge, there are no commercial services providing what we would call comparison or matching services on whole genome results. Yet. And I believe it remains to be seen how those services will develop and function; I have a feeling there will be more than one methodology. I'm betting one method will be to use VCF (variant call format) files only--in other words, comparing only each of our differences from the then-current human genome reference model--and another may be similar to GEDmatch today, comparing actual SNPs but with a data library that would likely have to hold somewhere around 15 to 20 million relevant SNPs.

I don't know that I can envision an IT infrastructure, at least not in the near future, that could do an online comparison and data presentation of whole genomes. The BAM (binary alignment/map) file for a 30X coverage run of our 3.2 billion base pairs weighs in at about 80 to 100 gigabytes in size. (Andreas, did you get your BAM? If so, how large is it?) The human-readable text version of that is called a SAM (sequence alignment map) and it's even larger. Doing even a one-to-one comparison between two BAM/SAM files would take a lot of computing power and a lot of working storage space. Standalone, batched requests I could see; real-time comparisons like we're used to today...not so much.

I think the bottleneck, from the standpoint of direct-to-consumer packaging, will be IT and data communications, not what goes on with the sequencing itself. Technical improvements in the lab are advancing all the time, like nanopore and long-read sequencing, that continue to drive lab costs down and speed processing up while maintaining or even increasing the kind of accuracy we see today in 30X coverage tests.

I really don't know much about GoNL (Genome of the Netherlands) other than it is a very mature and well-funded study, and offers a possible glimpse of where we'll be in a couple of years with WGS for genealogy. A collaboration of multiple organizations including government entities and universities, GoNL kicked-off in 2009 and the baseline samples are from 250 trio-phased sets: two parents and a child, so 750 individuals. Those--I believe--have all been whole genome sequenced, plus there are a number of volunteer participants who were tested on genotyping microarrays.

To draw an imperfect analogy that's at work today, in a yDNA study it's a big deal to have multiple men take the BigY-700 full-sequence test because then, for other men who match them closely at 111 Y-STR markers, the men not full-sequenced can purchase individual SNP tests at a low cost and still arrive at good evidence they and the BigY testers align down to at least that individually tested SNP. Not everyone has to pay the big bucks for all to be able to receive some evidentiary benefit.

Told you it was a bad analogy. But the similarity is that--as we move forward with relatively inexpensive but high resolution direct-to-consumer whole genome sequencing--those who have been full-sequenced can provide detailed data down to individual loci that can support autosomal triangulation results for those who have taken coarser, less accurate microarray tests. So some good news there is that even if WGS supplants genotyping microarrays in the near future--which I think it will--the data from our microarray tests will still be valuable and tests by deceased family members will never be obsolete.

That's basically what the GoNL project has done. Trio-phased whole genome sequencing is informing the less data-rich microarray tests.

The issue we have with microarray testing is that only a (relatively) few markers, or SNPs, are examined. To put it into perspective, you're probably familiar with GEDmatch's previous default threshold that said a segment wasn't a real segment unless it contained 700 tested SNPs; which threshold has now been reduced to a floating figure that can go as low as 200 SNPs.

A centiMorgan is not a physical measurement, but is calculated based upon expected probability of crossover (recombination) at a given point on a chromosome, and the amount of cMs calculated for the same physical length--in number of base pairs--along a segment of a chromosome can differ significantly not only from chromosome to chromosome, but from different regions along the chromosome. That said, very roughly speaking one centiMorgan will equate to a stretch of a chromosome that's somewhere around one million base pairs in length. Ergo, a small 7cM segment will be, give or take, comprised of about 7 million base pairs.

Back to the SNP thresholds. If a 7cM segment can be considered valid if only, say, 400 of the same SNPs were examined in both tests being compared, that would mean we can positively match only one in 17,500 SNPs. So what we end up doing is assuming that those other 6,999,600 SNPs that were not tested will be identical.
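The arithmetic behind those figures, as a quick Python check. The one-million-base-pairs-per-centiMorgan conversion is only the very rough rule of thumb mentioned above, and the 400-SNP threshold is the example value, not a fixed GEDmatch setting; the variable names are my own.

```python
BASE_PAIRS_PER_CM = 1_000_000  # very rough rule of thumb, varies by region

segment_cm = 7           # a small 7cM segment
snps_in_common = 400     # example threshold for SNPs tested in both kits

segment_bp = segment_cm * BASE_PAIRS_PER_CM
positions_per_tested_snp = segment_bp // snps_in_common
untested_positions = segment_bp - snps_in_common

print(f"Segment length: about {segment_bp:,} base pairs")
print(f"One tested marker per {positions_per_tested_snp:,} positions")
print(f"Positions assumed identical without testing: {untested_positions:,}")
```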

Without diving into the exome and protein coding genes and the approximately 3 to 5 million SNPs that really differentiate your particular genome from someone else's, you see where I'm goin' with this. WGS can serve as something like a Rosetta stone. As I'm sure the GoNL study has done, if you get several microarray test results that match nicely with a set of trio-phased WGS results, you can do a bit better than simply guess at what lies within those empty stretches of 17,500 untested SNPs.

If this sounds like imputation, you're correct. In a nutshell, imputation infers the allele values of untested SNPs based on the linkage disequilibrium patterns derived from directly tested markers. When it comes to estimating the missing SNPs from a set of results, there are two different types of reference panels, or database libraries, used: haplotype panels and genotype panels. It's important to note that both require whole genome sequencing as a starting point in order to have that Rosetta stone. If you're trying to decipher a message from words in certain positions separated by thousands of unknown words, you need to have that (at least nearly) complete reference to compare against.

I think some people mistakenly believe that AncestryDNA, for example, uses their huge database to keep adding newer and more refined information to feed into their Beagle imputation tool. They aren't. They can't, because within that huge database all tests (more or less) looked at exactly the same, few genetic markers: having 10 million copies of the same book that has only the same word in the same place every ten otherwise blank pages isn't going to help you figure out what the book says.

That's where we are with microarray testing today and why imputation is currently a less than stellar alternative for comparing results between chips that have very little overlap in markers tested. None of us have had whole genome sequencing performed by Ancestry.com or any of the other major DTC testing companies. They don't have haplotype reference panels gleaned from their 15 million tests on file; they have to rely on external sources for those data, and that can only start with the much broader genotype panels.

When we talk about genotype reference libraries, the 1000 Genomes Project comes first to mind, and samples are grouped into continental panels or "super-populations": African (AFR), Ad-mixed Americas (AMR), East Asian (EAS), European (EUR), and South Asian (SAS). Under those are more granular sub-population panels, but they can't really achieve what GoNL has done with its highly specific testing strategy. GoNL has created haplotype references for their defined population study against which microarray results can have their results accurately imputed.

Now, Ancestry or someone else can go to GoNL and ask for or acquire permission to use the data, and then for a select number of tests whose haplotypes align with those within GoNL, the predictive matching can be quite accurate. But without definitive haplotype WGS libraries like that, the big testing companies are working with population-level data.

Imputation against genotype reference panels isn't outlandishly inaccurate, but for genealogy we have to contend with trying to be precise with, say, 5th cousins--among the roughly 4,700 of whom only about 15% are likely to share any current-chip detectable DNA at all with us; without recent pedigree collapse we'll need to test 5.6 actual 5th cousins in order to find one that's a DNA match--which means working with small segments and very few tested SNPs in triangulation groups with multiple members. A very recent study of imputation in a five-way admixed AFR population had the best results yield a genome-wide error rate in overlap of the autosomes of 11.98%. We currently do perennially better with the EUR population panel, but is even a 5% or 8% error rate acceptable?

I think that's a fence we'll be able to climb once DTC WGS becomes popular.

commented Aug 22, 2019 by Edison Williams G2G6 Pilot (439k points)
