Seeking help/advice on tracing Y-DNA from Francis Smith (c1614 ENG – 1679 MA)

+6 votes

In 2009 I tested with FTDNA and found about 10 other Smiths who are close Y-DNA matches. Five of them had good paper trails back to Francis Smith. As of now, there are 27 pretty close matches and still only five paper trails. (I’m not sure that any of them are on Wikitree.)

My thought was that trying to trace the mutations in the Y-DNA, and grouping together the branches that seem most closely related, might help all of us direct our efforts where they may be most promising. I’ve put together a chart (PDF). The raw data is available via the Smiths Worldwide project and the SmithConnections Northeastern project (Group 3).

I started from the assumption that all 27 kits are genetic matches, i.e., they have a common ancestor within the last several hundred years. I also assumed that five lines are descended from Francis, as indicated by the paper trails, and furthermore that two other lines that were in England until the 1800s – kits #149599 (that’s me) and #***001 – are collateral to Francis. I then grouped the other lines to put the closest matches together and minimize (as far as possible) the number of distinct mutations for each marker.
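To make the "closest matches" idea concrete, here is a minimal Python sketch of a step-wise genetic distance between two kits. The kit names and marker values are made up for illustration, and the optional `single_step_markers` argument (which counts any change at a listed marker as one step) is my own assumption, not necessarily how FTDNA or anyone else computes it.

```python
def genetic_distance(hap_a, hap_b, single_step_markers=frozenset()):
    """Sum of repeat-count differences over the markers both kits share.

    Markers listed in single_step_markers contribute at most 1,
    however many repeats apart the two values are.
    """
    total = 0
    for marker in hap_a.keys() & hap_b.keys():
        diff = abs(hap_a[marker] - hap_b[marker])
        if marker in single_step_markers:
            diff = min(diff, 1)
        total += diff
    return total


# Hypothetical values, not taken from any real kit.
kit_a = {"DYS576": 18, "DYS455": 11, "CDYa": 36}
kit_b = {"DYS576": 19, "DYS455": 11, "CDYa": 38}

print(genetic_distance(kit_a, kit_b))                                # 3
print(genetic_distance(kit_a, kit_b, single_step_markers={"CDYa"}))  # 2
```

Running this over every pair of kits gives a distance table, which is roughly what I was eyeballing when grouping the lines by hand.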

Some of my tentative conclusions: Values for Francis Smith’s set of 67 markers can be inferred, since we have test results from descendants of his two sons and from collateral lines. Twelve lines out of the 27 seem likely to be descended from Francis, while the other 15 seem more likely to be collateral.
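The inference for a single marker can be sketched roughly as below. This is only an illustration of the reasoning, under the assumption that each tested line is summarized by its observed repeat count; the function name and all values are hypothetical. If the lines from both sons share exactly one value, that value is taken as Francis's. If they disagree, the collateral lines act as a tie-breaker.

```python
def infer_ancestral(son_a, son_b, collateral):
    """Guess the ancestor's repeat count for one marker.

    son_a, son_b: values seen in lines descending from each son.
    collateral:   values seen in lines collateral to the ancestor.
    Returns the inferred value, or None if the data do not decide it.
    """
    shared = set(son_a) & set(son_b)
    if len(shared) == 1:
        # Both sons' lines carry the same value: simplest explanation
        # is that the father had it too.
        return next(iter(shared))
    # Otherwise see which candidate value the collateral lines support.
    candidates = shared or (set(son_a) | set(son_b))
    supported = candidates & set(collateral)
    if len(supported) == 1:
        return next(iter(supported))
    return None


# Hypothetical repeat counts for one marker.
print(infer_ancestral([17, 17], [17, 18], [17]))  # 17
print(infer_ancestral([17], [18], [18, 18]))      # 18
print(infer_ancestral([17], [18], [16]))          # None
```

Repeating this for each of the 67 markers gives a reconstructed haplotype, with gaps wherever the result is None.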

So, my plea for help is:

  1. Does anyone with a better understanding of this stuff want to check my work? I realize that that’s more work than anyone may want to take on, especially if you’re not related.
  2. So, alternatively, can anyone point me to expert discussion of how to do this sort of reconstruction? Or to examples of other people doing this within a surname project and explaining how?

I have lots of little questions, such as:

  1. Is minimizing the number of mutations in the reconstruction the correct theoretical approach?
  2. In the scenario I’ve charted, is the total number of mutations for the group significantly lower than expected? (That is my subjective impression.) If so, does that suggest that other scenarios are more likely?
  3. What is the effect of different mutation rates for different markers? How much more likely is a scenario with two mutations at CDY (a fast-moving marker) and one at DYS576 versus one at CDY and two at DYS576? Should I present both as possibilities?
  4. For lines without a paper trail to Francis Smith, but with genetic distance of 0/37, 0/67 or 1/67 from a line that does have a paper trail, how confident are we that they descend from Francis? How confident that they all descend from Francis’ son John? I’ve tried to make some calculations based on FTDNA’s “TiP report,” but I’m not sure I’m doing it correctly.
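On question 3, one rough way to frame the comparison: if both scenarios use the same tree and each mutation is treated as an independent event with a small per-generation rate, the weight of a scenario is approximately the product of the rates of its mutations. The sketch below assumes exactly that and ignores branch lengths, multi-step changes, and the multi-copy nature of CDY. The CDY rate is the one quoted elsewhere in this thread; the DYS576 rate is a placeholder I made up, so substitute a published figure before drawing conclusions.

```python
# Per-generation mutation rates.
RATES = {
    "CDY": 0.03531,   # figure quoted in this thread
    "DYS576": 0.01,   # PLACEHOLDER, not a published value
}


def scenario_weight(mutation_counts, rates=RATES):
    """Approximate relative weight of a scenario: product of rate**count."""
    weight = 1.0
    for marker, count in mutation_counts.items():
        weight *= rates[marker] ** count
    return weight


scenario_1 = {"CDY": 2, "DYS576": 1}  # two CDY mutations, one DYS576
scenario_2 = {"CDY": 1, "DYS576": 2}  # one CDY mutation, two DYS576

ratio = scenario_weight(scenario_1) / scenario_weight(scenario_2)
print(f"Scenario 1 is about {ratio:.2f} times as likely as scenario 2")
```

Under these assumptions the ratio reduces to the CDY rate divided by the DYS576 rate, so unless the ratio is very large it seems reasonable to present both scenarios as possibilities.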

Thanks in advance,

Peter Newbatt Smith

WikiTree profile: Francis Smith
in Genealogy Help by Peter Smith G2G Crew (950 points)
I've done a bit more Googling. Apparently, the name of the thing I'm trying to create is a "mutation history tree".

Or a "cladogram".

A worthy effort!  With emphasis on the 'effort' part, obviously going to be time-consuming!  And you may never be finished!

My one suggestion is to make every effort to encourage further testing by everyone involved.  I was happy to see that Edison, amid all the useful advice, included an implied recommendation for more and deeper testing, especially SNP testing.  More data is always more important than more guesses, and a 37-marker test is a pretty coarse tool: 37 knowns and 74 unknowns when compared with a 111-marker test.  That's twice as many unknowns as knowns.  (Because 111 is 37 times 3, I'm curious why the 67-marker test wasn't a 74-marker test?!)  In just a couple of weeks or less, we may see the next round of sales and discounts on DNA tests and upgrades, a good time to encourage your fellow testers to consider upgrades.  It's expensive, but the goal for each should be a Y-chromosome sequencing test like the Big Y, though more STR markers will be helpful too.  One advantage of a test like the Big Y is that quite a bit of phylogenetic tree building has been and will be done for you (although deeper than your immediate interest), and it could provide a foundation to attach the branches you detect in your project.

1 Answer

+3 votes
Best answer

Peter, I know this has languished here with no responses, and I'm sorry for that. I think the reason is that the source data is the cornerstone for everything, and those data seem to be in two or three different places...and the genealogy in a fourth. That makes it extremely difficult for anyone not intimately involved with the effort to understand what's what and chime in. That said, here are a few random--and quite possibly useless; but that's never stopped me before--thoughts.  :-)

First, there are a whole boatload of free utilities, with varying degrees of detail and user-friendliness, available to generate the type of phylogenetic tree I believe you're looking at. You may have already chosen and employed one or more; if so, my apology for bringing it up. The good ones will almost always be far superior to working with a manual rendering simply because they're generated programmatically, so that new data--and there will always be new data--can be included and the output updated without extreme pain and suffering.

One place to look at some of the better known apps is on this page at the ISOGG Wiki. ISOGG titles the page "Phylogeny Programs" and includes a few conversion or data preparation apps that work with what can be pulled from FTDNA projects. While you're there, you may want to check the section "External Links." Some are quite dated, but some of the phylotree apps have been around awhile, too. Might be worth exploring.

Speaking of dated, something that's developed into a vital consideration in just the past few years is the ability to incorporate Y-SNPs from NextGen testing (a la FTDNA's BigY). For the purposes of a deep dive toward an MRCA, STR testing has its foibles. The data are constantly (well, semi-regularly) adjusting our understanding of STR mutation rates, and for any given per-STR, per-generation mutation rate estimate I've seen, the accuracy range is only about +/- 15-20%. Not terrible, but it still means our figure for any given marker could be off by as much as a fifth. Too, I've seen no published mutation rates for about 60 commonly tested Y-STRs (published is key, because I'm certain companies like FTDNA track their own databases to help refine their matching estimates), and of those about a dozen are trinucleotide markers which can be very slow to mutate in some haplogroups, much faster in others; estimates usually aren't published for those because of the disparities.

Of the Y-STR mutation rates I've seen, two of the slowest are DYS455 and DYS454 at a 0.00016 chance per generation (DYS454 also has the distinction of being the most stable: 96% of the records in Ybase show 11 repeats). The quickest mutator is DYS724, more commonly known as CDY, a palindromic marker that changes at the comparatively astronomical rate of 0.03531 per generation. The result is that you can have one STR marker mutating at a rate over 220 times faster than another marker...and the whole estimate of genetic distance from STRs is as much art as science. In fact, it's only been about 24 months since FTDNA switched to the "infinite allele" model, significantly liberalizing GD estimates of palindromic, multi-copy markers.
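To put numbers on that spread, here's a quick Python sketch using the two rates quoted above. It assumes a constant per-generation rate and independence between generations, which is a simplification, and the 12-generation span is just a hypothetical figure for illustration.

```python
SLOW_RATE = 0.00016  # DYS455 / DYS454, per generation (quoted above)
FAST_RATE = 0.03531  # CDY, per generation (quoted above)


def prob_at_least_one_mutation(rate, generations):
    """Chance that a marker changes at least once along one line of descent."""
    return 1.0 - (1.0 - rate) ** generations


ratio = FAST_RATE / SLOW_RATE
print(f"CDY mutates about {ratio:.1f} times as fast as DYS455")

GENERATIONS = 12  # hypothetical number of generations back to the ancestor
for name, rate in (("DYS455", SLOW_RATE), ("CDY", FAST_RATE)):
    p = prob_at_least_one_mutation(rate, GENERATIONS)
    print(f"{name}: {p:.2%} chance of at least one change in {GENERATIONS} generations")
```

The takeaway is that a difference at a slow marker is much stronger evidence of a real branch than a difference at CDY.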

A roundabout way of saying that I think NextGen testing is the future of phylotrees for DNA projects. The data is still coming in at a frenetic pace as more and more men take these tests, but SNP testing has the decided benefits of A) slower mutation rates, and B) a (relatively consistently) hierarchical structure; e.g., if you're R-BY22194 positive, then you'll also be positive for the older BY3332 and the still older ZZ12. This can make hypotheses of MRCA branching more accurate, and can even allow estimates to reach back prior to what we normally consider the genealogy timeframe into clan phylogeography to the period just before surname adoption. Working with the Williams surname ain't the farthest thing removed from Smith, so I can understand how valuable it would be to have a method that can sort back to, say, the Battle of Hastings.  ;-)
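The hierarchical point can be shown with a tiny Python sketch. The parent links below are a simplification: I've only listed the three SNPs named above and treated each as the direct parent of the next, when in reality there may well be intermediate branches between them. The function name is mine, for illustration only.

```python
# Simplified child -> parent links; intermediate branches are omitted.
PARENT = {
    "BY22194": "BY3332",
    "BY3332": "ZZ12",
}


def implied_positives(snp, parent=PARENT):
    """Return the SNP plus every older SNP a positive result implies."""
    chain = [snp]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


print(implied_positives("BY22194"))  # ['BY22194', 'BY3332', 'ZZ12']
print(implied_positives("ZZ12"))     # ['ZZ12']
```

Two men whose chains share a recent SNP branched apart after it arose, which is the property that makes SNPs such a tidy backbone for attaching STR-based branches.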

Last up, I remembered that Dr. Maurice Gleeson was on the schedule at the FTDNA annual conference, which I had made reservations to attend but at the last minute could not. But, yep: his presentation is on YouTube! Might be worth the 52 minutes to give it a look. (His YouTube channel may be worth a look as well, for other stuff; he has a number of vids available.) The audio is somewhat off in spots, so you may need to crank the volume a bit.

Good luck with the Smiths!

by Edison Williams G2G6 Pilot (273k points)
selected by Peter Smith
Many thanks Edison!

I did realize that this was a big ask, for anyone to review the data in any detail. But what you've written here, and the info that Dr. Gleeson has put on the Web, are very helpful. Before I posted my question, I did search the Web looking for stuff like that, but I didn't know the specific terms -- "Y-DNA" and "surname study" didn't pull up what I was looking for.

At some point I may try the Fluxus software, or something similar. For now, I think my "hand-drawn" cladogram is good enough, especially since the paper records create some constraints that limit the possibilities.

I will probably recommend that members of this Smith group consider upgrading to 111 markers and the "Big Y". That would be especially useful for several lines that are believed to descend from Francis Smith but that appear genetically identical, especially at the 37- or 67-marker level.

Best wishes!

Thanks for the best answer star, Peter...undeserved, but thanks!

Turns out that Maurice Gleeson's name didn't randomly pop into my head after all. See this G2G post about his upcoming webinar next Tuesday. I'd seen his name on the schedule from the Guild of One-Name Studies earlier this year, but hadn't remembered the subject and the date until reminded this morning. I very much doubt he will be getting into advanced topics like phylogenetic and mutation history charting in this presentation, but thought it still might interest you.
