New Research Paper Looks at Common Genealogy DNA Tests and Chip Versions


There's some jargon in this relatively short paper, but overall it's an accessible read. Of particular interest might be its findings on SNP overlap from an analysis of 1,292 raw data files, which show the lowest in-common SNP percentage between any two tests to be just 17%.
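For anyone curious what "in-common SNPs" means in practice: each raw data file is essentially a list of rsIDs, so the overlap between two tests boils down to set intersection. A minimal sketch in Python (the function name and the choice of the smaller panel as the denominator are my assumptions, not necessarily the paper's exact metric):

```python
def in_common_pct(snps_a, snps_b):
    """Percentage of the smaller SNP panel that also appears in the other.

    snps_a / snps_b: sets of rsID strings taken from two raw data files.
    """
    if not snps_a or not snps_b:
        return 0.0
    smaller = min(len(snps_a), len(snps_b))
    return 100.0 * len(snps_a & snps_b) / smaller

# Toy panels standing in for two chip versions.
chip_v1 = {"rs1", "rs2", "rs3", "rs4"}
chip_v2 = {"rs3", "rs4", "rs5", "rs6", "rs7"}
print(f"{in_common_pct(chip_v1, chip_v2):.0f}% in common")  # 2 of 4 -> 50%
```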

An excerpt from the "Discussion" section:

Consumer genome sequencing products are developed not only by technological advancement, but also under the economics of market pressure to be financially beneficial to the company. We have observed companies developing different genotyping arrays for different purposes (Table 1), and the differences create a significant barrier to cross-provider analyses....

Most genetic analysis requires a cohort of data to establish variant statistics. So what one consumer can do with the raw genome data is very limited. The power of consumer genomes comes when there is a large amount of data, which is currently accessible as a homogeneous dataset by only the companies that initially conducted the sequencing. Increasingly however, as these companies move to monetise their access to this resource, savvy consumers who control their own data will look outside the original provider for ways to share and use their genome.

Many thanks to WikiTreer and genetic genealogist Debbie Kennett for drawing my attention to this Open Access article. Full citation and link:

Lu, Chang, Bastian Greshake Tzovaras, and Julian Gough. "A Survey of Direct-to-Consumer Genotype Data, and Quality Control Tool (GenomePrep) for Research." Computational and Structural Biotechnology Journal, Vol 19 (July 2021): 3747–54.

in The Tree House by Edison Williams G2G6 Pilot (366k points)
This has been a concern to me ever since the companies all ended up with new test versions and imputation became a practice. About the only practical way to counteract this is to create a "superkit" at either GEDmatch or Borland Genetics that combines results from tests at different companies, making imputation less necessary, at least in theory. Is there any evidence that this actually helps?

Hey, Deborah! I have personal experiential information--but no experimental evidence--that more SNPs equals better accuracy, at least at GEDmatch. From my 30X whole genome sequencing I extracted, into GEDmatch-acceptable format, a "superkit" containing almost 2.081 million SNPs identified as existing in various versions of the microarray tests. Once uploaded, GEDmatch deemed 1.377 million of that original count usable. And, nope, I have no answer as to why ~700,000 SNPs were judged unworthy, since the catalog of SNPs used came from the other microarray test templates.
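For anyone wanting to try the same thing, the superkit idea can be sketched in a few lines. This assumes the tab-separated raw layout (rsid, chromosome, position, genotype) that most providers export; the function names and the conflict rule (drop any SNP where two kits give contradictory calls, rather than guess) are my own choices, not GEDmatch's or Borland Genetics' actual merge logic:

```python
def parse_raw(lines):
    """Parse a 23andMe-style raw export: rsid<TAB>chrom<TAB>pos<TAB>genotype."""
    calls = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
        calls[rsid] = (chrom, pos, genotype)
    return calls

def merge_kits(*kits):
    """Union of calls across kits. Real calls replace no-calls ('--');
    SNPs with contradictory genotypes are dropped rather than guessed."""
    merged = {}
    conflicts = set()
    for kit in kits:
        for rsid, (chrom, pos, gt) in kit.items():
            if rsid not in merged or merged[rsid][2] == "--":
                merged[rsid] = (chrom, pos, gt)
            elif gt not in ("--", merged[rsid][2]):
                conflicts.add(rsid)
    for rsid in conflicts:
        del merged[rsid]
    return merged
```

Dropping conflicting calls is the conservative choice here; keeping more SNPs overall is what reduces the need for imputation downstream.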

To condense the story, you can use the GEDmatch kit diagnostic tool to get information about any given kit, including total matches in the database. Now, these aren't total people matches; far from it. But they do provide a like-to-like comparison of the number of aggregate segments matching in the database.

With a little over twice the number of usable SNPs in my superkit compared to any single microarray test, I'd expect the superkit to show fewer segment matches. In other words, the additional SNPs should cause segment "breaks" in places where the microarray data would otherwise skip over unknown mismatches and assume a contiguous segment. The results (last checked March 2021):

  • WGS superkit; 1,377,182 usable SNPs; total matches in database: 56,802
  • 23andMe v5; 496,179 usable SNPs; total matches in database: 161,680
  • AncestryDNA v1; 636,709 usable SNPs; total matches in database: 65,845
  • FTDNA v3; 487,649 usable SNPs; total matches in database: 109,359
  • MyHeritage v2; 494,251 usable SNPs; total matches in database: 117,136

There seems to be an inverse correlation between the number of usable SNPs and the total number of matches. My AncestryDNA v1 test wasn't as far off the superkit's mark as the others, but the 23andMe v5 result was kinda staggering: the superkit returned only 35% as many matches, which suggests the false-positive rate for very small segments on the microarray tests might be quite high.
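For the record, the percentages are just ratios of the match counts in the list above; a quick check:

```python
# Total matches in the GEDmatch database, per the list above (March 2021).
matches = {
    "23andMe v5": 161_680,
    "AncestryDNA v1": 65_845,
    "FTDNA v3": 109_359,
    "MyHeritage v2": 117_136,
}
superkit = 56_802  # WGS superkit total matches

for kit, count in matches.items():
    print(f"superkit has {100 * superkit / count:.0f}% of {kit}'s matches")
```

This reproduces the 35% figure for 23andMe v5 and puts AncestryDNA v1 at roughly 86%.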

Of course, GEDmatch doesn't use imputation, per se. I would think that more data in would equal better data out from genotype imputation, but there are a lot of variables, from the quality and size and diversity of the genotyping reference cohorts in use to the specific algorithms employed to "guess" the missing puzzle pieces.

I also have to comment, from my own highly informal research, that I was surprised the authors of the paper identified as few "clusters" of microarray datasets as they did. I found nearly that much variety just among iterations of AncestryDNA tests that all purported to be "v2."


