Hey, Deborah! I have personal experiential information--but no experimental evidence--that more SNPs equals better accuracy, at least at GEDmatch. From my 30X whole genome sequencing I extracted into GEDmatch-acceptable format a "superkit" containing almost 2.081 million SNPs identified as existing in various versions of the microarray tests. Once it was uploaded, GEDmatch decided 1.377 million of the original count were usable. And, nope, I have no answer as to why ~700,000 SNPs were deemed unworthy since the catalog of SNPs used came from other microarray test templates.
To condense the story, you can use the GEDmatch kit diagnostic tool to get information about any given kit, including total matches in the database. Now, these aren't total people matches; far from it. But they do provide a like-to-like comparison of the number of aggregate segments matching in the database.
With more than twice the number of usable SNPs in my superkit compared to any single microarray test, I'd expect the superkit to show fewer segment matches...in other words, the additional SNPs should cause segment "breaks" in places where the microarray data would otherwise skip over unknown mismatches and assume a contiguous segment. The results (last checked March 2021):
- WGS superkit; 1,377,182 usable SNPs; total matches in database: 56,802
- 23andMe v5; 496,179 usable SNPs; total matches in database: 161,680
- AncestryDNA v1; 636,709 usable SNPs; total matches in database: 65,845
- FTDNA v3; 487,649 usable SNPs; total matches in database: 109,359
- MyHeritage v2; 494,251 usable SNPs; total matches in database: 117,136
Seems like an inverse correlation between number of usable SNPs and total number of matches: the more usable SNPs, the fewer matches, though with only five kits it isn't a perfectly tidy trend. My Ancestry v1 test wasn't as far off the mark as the others, but the 23andMe v5 was kinda staggering: the superkit showed only 35% as many matches, indicating the false positive rate for very small segments on the microarray tests might be quite high.
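For anyone who wants to check the arithmetic, here is a rough Python sketch that recomputes the ratios from the numbers listed above. The dictionary and the function name `compare_to_superkit` are just my own illustrative choices, not anything GEDmatch provides, and five kits from one person is nowhere near enough for real statistics.

```python
# Figures copied from the list above (GEDmatch kit diagnostics, last checked March 2021).
# Each entry is (usable SNPs, total matches in database).
kits = {
    "WGS superkit": (1_377_182, 56_802),
    "23andMe v5": (496_179, 161_680),
    "AncestryDNA v1": (636_709, 65_845),
    "FTDNA v3": (487_649, 109_359),
    "MyHeritage v2": (494_251, 117_136),
}


def compare_to_superkit(kits, baseline="WGS superkit"):
    """Return {kit: (SNP ratio, match ratio)} of the baseline kit relative to each other kit."""
    base_snps, base_matches = kits[baseline]
    rows = {}
    for name, (snps, matches) in kits.items():
        if name == baseline:
            continue
        rows[name] = (base_snps / snps, base_matches / matches)
    return rows


rows = compare_to_superkit(kits)
for name, (snp_ratio, match_ratio) in rows.items():
    print(f"{name:15s} superkit has {snp_ratio:.2f}x the SNPs "
          f"and {match_ratio:.0%} as many matches")
```

The 23andMe v5 row is where the 35% figure comes from (56,802 divided by 161,680).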
Of course, GEDmatch doesn't use imputation, per se. I would think that more data in would equal better data out from genotype imputation, but there are a lot of variables, from the quality and size and diversity of the genotyping reference cohorts in use to the specific algorithms employed to "guess" the missing puzzle pieces.
I also have to comment from my own highly informal research that I was surprised the authors of the paper identified as few "clusters" of microarray datasets as they did. I found about that much variety in iterations of AncestryDNA tests all purported to be "v2."