# WikiTree and Network Theory


Three years ago, I asked a question here about the mathematical structure of the global tree. Since then, I've dug into data dumps and taught myself some network theory to figure out what we can say about the mathematical structure of WikiTree. Inspired by the 100 Circles project and private conversations with Bernard Vatant, I've decided to write up some of my discoveries on a blog. The first post, which describes the basic idea of representing the WikiTree dataset as a mathematical network, is now available: https://www.sligocki.com/2021/06/23/wikitree-and-network-theory.html

If you find this interesting, let me know in the comments! What questions would you like answered by analyzing the WikiTree Network?


Since I got back to my ancestors at "the record horizon" and started working forwards from some early ancestors living in the same area, I have often been thinking of it as "my kinship network", rather than as "my tree", because of the frequent intermarriages between branches, making a tangle. Now, I'm not a mathematician, so my use of the network word will be a distant cousin (perhaps several times removed) to the mathematical concept.

Since I greatly enjoy the WikiTree connection finder, and since I'm located in Sweden, I'm very familiar with the phenomenon that the connection to the featured profiles of the week, from most profiles I manage, will go through a comparatively small number of "bridges". Mostly emigrants, but sometimes nobles.

I get a strong impression that the structure of the Wiki-Tree is "lumpy" with clusters that are internally very interconnected, but more thinly connected to each other. So what I'm wondering is if there is a way to describe this in more formal terms.

A very long time ago I learned lacemaking, but I haven't practiced it for ages. I don't have the tools.

by Eva Ekeblad G2G6 Pilot (502k points)

Hi Eva,

Though you say you aren’t a mathematician, you used the mathematical word for the concept you describe: a bridge in a network is a link which, if deleted, would leave those on the “other side” of it disconnected from the main tree. It would be a useful analysis to identify the bridges on WikiTree, to find profiles which are in danger of being disconnected if research indicates that the bridge is based on incorrect information.
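
For readers who want to try this, here is a minimal sketch of how bridges in the strict graph-theory sense could be found. It assumes the network is available as a plain list of undirected links between profile IDs (the IDs below are made up for illustration) and uses a standard depth-first search with "low-link" values. It is not how the Connection Finder or any WikiTree tool works.

```python
from collections import defaultdict

def find_bridges(edges):
    """Return the bridges of an undirected graph as a set of frozenset pairs."""
    graph = defaultdict(list)
    for idx, (a, b) in enumerate(edges):
        graph[a].append((b, idx))
        graph[b].append((a, idx))

    disc = {}        # order in which each node was first visited
    low = {}         # earliest visited node reachable without reusing the tree edge
    bridges = set()
    timer = 0

    for root in list(graph):
        if root in disc:
            continue
        disc[root] = low[root] = timer
        timer += 1
        # Explicit stack instead of recursion, so long chains do not overflow.
        stack = [(root, -1, iter(graph[root]))]
        while stack:
            node, parent_edge, neighbours = stack[-1]
            advanced = False
            for nxt, idx in neighbours:
                if idx == parent_edge:
                    continue  # do not walk straight back along the edge we came in on
                if nxt in disc:
                    low[node] = min(low[node], disc[nxt])
                else:
                    disc[nxt] = low[nxt] = timer
                    timer += 1
                    stack.append((nxt, idx, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                if stack:
                    parent = stack[-1][0]
                    low[parent] = min(low[parent], low[node])
                    if low[node] > disc[parent]:
                        # Nothing below `node` reaches back past `parent`.
                        bridges.add(frozenset((parent, node)))
    return bridges

# Two tightly knit clusters (triangles) joined by a single link C-D.
example_edges = [("A", "B"), ("B", "C"), ("C", "A"),
                 ("C", "D"),
                 ("D", "E"), ("E", "F"), ("F", "D")]
example_bridges = find_bridges(example_edges)
```

In the toy example the only bridge is the C-D link: removing any link inside either triangle still leaves everyone connected.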

I have a selfish interest in this of course: my own connection to the main tree is through a single bridge of around 6 people.

A bridge in graph theory sounds more strict than what I have come to think of as bridges in the Wiki-Tree.

There are, of course, clusters that have one single path out - a single profile or a sequence of profiles, where the disconnection of one mistaken parent, or similar, will disconnect the whole cluster.

What I have come to think of as "bridges" is less vulnerable - there will usually be more than one path out from the clusters I'm working with. But between the Swedish profiles I'm working with and the international profiles featured in a week I have come to recognize a few profiles that turn up over and over. Why I recognize them may also be because they are the last Swedish profile before the path goes through America or another English-speaking country. I guess I'm not as attentive to profiles recurring frequently in the paths beyond that point :-)
Your question sounds like this to me.

How many profiles exist with only one parent, but with descendants?
Yes Eva! I'm very interested in this type of question. I don't have any big answers yet, but I have found some tools which might help us to identify and map out this sort of "lumpiness" eventually.
Eva/Shawn/Chase/Anonymous Jones - Identification of the bridges, and then a pseudo-standardized qualitative assessment of the sources (akin to the Grindle scale here: https://www.wikitree.com/g2g/1220648/do-we-need-a-classification-of-sources) would show the weaker points in the tree that could use focus to shore up.

That may only be fruitful if they are within the timeframes and types of profiles that lend themselves to documentation, as Chase mentioned above, but it would indicate a fruitful way to spend our time.

@Eva I can also see this network lumpiness around my Breton focus, of course. My hunch is that those lumps aggregate in bigger clusters in a fractal way, with scale invariance.

Shawn shared a while ago (in the circlers' private conversation) some wonderful normalized distance distributions showing amazing properties which might or might not support this idea. I suppose he will publish them on his new blog.

What are some questions that can't be answered by analyzing the WikiTree Network?
by Tommy Buch G2G Astronaut (1.2m points)
Well, you can't answer questions like: is a profile well sourced? (We are ignoring the bios when building a network.) However, I agree that many, many things I care about are network-y questions :)
I find this interesting, I am looking forward to reading your blog! I find how removing a person degrades the network a fascinating idea.
by Kylie Haese G2G6 Mach 7 (79.0k points)

I don't have a question, but I'm looking forward to reading what you have to say.

by Ian Beacall G2G6 Pilot (189k points)
Very good introduction and teaser, Shawn!

I suppose you will share some of the fascinating invariants we have begun to conjecture under current distribution of distances, WikiTree vs the "Real Human Network" or whatever we call it.

Looking forward to further developments!
by Bernard Vatant G2G6 Pilot (113k points)

Which profile has the most brick walls?

edited for clarification of what I meant by brick wall.

I meant which direct ancestor is missing a parent or both.

I am not using brick wall in the sense that one cannot determine who the parent or parents may be of a direct ancestor.

by Tommy Buch G2G Astronaut (1.2m points)
What is the mathematical definition of a brick wall?
Tommy, do you mean which profile that is a brick wall has the most descendants on wikitree?
I am rewording my above question as follows:

1) Which profile has the most direct line ancestors?

2) Which profile has the longest direct line ancestors?

3) Which profile has the most direct line leaves?
Chase, that wasn't my question, but that is one I would like to know too.
@Bernard - From a genealogical standpoint, a brick wall is an ancestor whose parentage has not been determined despite much effort. However, from a wikitree standpoint, a profile is a brick wall if it does not have other profiles connected to it as its parents. A profile could be deemed a brick wall if it is missing both parents or just missing one parent.
@Tommy - I think I understand now. So you are looking for the profile with the most direct ancestors whose profiles are missing one or more parent connections. (I think you would want to count an ancestor missing one parent connection as a brick wall, but it's your question, so I defer to you.)
Bernard: a brick wall would be a person whose parents are unknown. This would require a different analysis to raw network analysis, which (I suspect) doesn’t distinguish between parental links, spousal links, or sibling links.
Chase, I am using brick wall from the WikiTree standpoint.
I can't help but think that the profile with the largest number of known unique ancestors also must be the profile with the largest number of brick walls.
Leif - I think that's true, at least if you count an ancestor with neither parent connected as two brick walls and an ancestor with one parent connected as a single brick wall.
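
A small sketch of that counting convention, assuming a hypothetical `parents` mapping from a profile ID to a `(father, mother)` pair with `None` for an unconnected parent. It counts unique ancestors and missing parent slots over the starting profile and all of its ancestors. When no ancestor appears twice (no pedigree collapse), each ancestor fills exactly one slot, so the missing slots come to the number of ancestors plus two; with pedigree collapse the count can only be smaller.

```python
def count_brick_walls(parents, start):
    """Return (unique ancestors, missing parent slots) for `start`.

    `parents` maps a profile ID to a (father, mother) tuple; None marks a
    parent that is not connected. Profiles absent from the mapping are
    treated as having neither parent connected.
    """
    seen = {start}
    stack = [start]
    missing = 0
    while stack:
        person = stack.pop()
        for parent in parents.get(person, (None, None)):
            if parent is None:
                missing += 1          # one brick wall per empty parent slot
            elif parent not in seen:
                seen.add(parent)      # count each ancestor once
                stack.append(parent)
    return len(seen) - 1, missing

# Made-up example: one grandparent known on the father's side only.
example_parents = {
    "me": ("dad", "mom"),
    "dad": ("grandpa", None),
    "mom": (None, None),
    "grandpa": (None, None),
}
ancestor_count, brick_walls = count_brick_walls(example_parents, "me")
```
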
I can't help but think that the profile with the longest line of direct ancestors is the most likely to have bogus ancestry attached. (yes, I'm the Disproven Existence Project leader).
Yes. Unfortunately, many of these "extreme" questions (who has the most ...) have turned out to be very challenging to analyze because they expose so many mistakes in the network! For example, a while ago I read about John Tyler still having two living grandsons (175 years after his presidency) and wondered what was the longest time between a grandparent's birth and a grandchild's death. Sadly if you sort the data this way you get many, many obviously wrong or completely unsourced profiles. The most extreme ones have people living 300 years or having children born a century after they die. In the end I gave up trying to verify or discard enough entries to find the real maximum. But perhaps, if there was widespread interest in these sorts of questions, we could crowd-source cleanup around these sorts of questions :)
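
As an illustration of the kind of cleanup filter this would need, here is a rough sketch that flags the two problems mentioned above. The record layout (`birth`, `death`, `children` keyed by profile ID, with years as integers) and both thresholds are assumptions made for this example, not the format of the actual data dumps.

```python
def find_implausible(profiles, max_lifespan=120, max_years_after_death=1):
    """Return (profile_id, reason) pairs for records failing basic date checks."""
    problems = []
    for pid, record in profiles.items():
        birth = record.get("birth")
        death = record.get("death")
        if birth is not None and death is not None and death - birth > max_lifespan:
            problems.append((pid, "lifespan"))
        for child_id in record.get("children", []):
            child_birth = profiles.get(child_id, {}).get("birth")
            if (death is not None and child_birth is not None
                    and child_birth - death > max_years_after_death):
                problems.append((pid, "child born after death"))
    return problems

# Made-up records: p1 lives 300 years and has a child a century after dying.
example_profiles = {
    "p1": {"birth": 1500, "death": 1800, "children": ["p2"]},
    "p2": {"birth": 1900, "death": 1950, "children": []},
    "p3": {"birth": 1800, "death": 1870, "children": []},
}
flagged = find_implausible(example_profiles)
```

Records flagged this way would still need a human to decide whether the dates or the relationship are wrong.
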
Logically, every line on WT would have to end in a brick wall, since none of them go on forever.  Thus, Leif is exactly right.

Maybe someone could make a list of those so Isabelle could shorten their lines.

And yes, Shawn, I think it would be fun to see the answers to such questions.  A lot of those you mention would have generated suggestions, I think, so maybe are already getting Data Doctors' attention.

Edit:  Anyone living 300 years or having children after they are dead surely has outstanding suggestions.
Well I have a grandparent born in 1873 and all eight of his grandchildren are still living.
Shawn - I think a fundamental, and distinguishing, feature of the Wikitree network is that, while it is based on a biological/genetic tree where every person has two parents, Wikitree network connections are dependent on (1) someone on wikitree bothering to establish the connection (which generally means a direct ancestor or someone closely related to a direct ancestor) and (2) the existence of available documentary evidence to establish those connections. Re (1) - The network thus reflects and is limited by the interests/ancestry of the people active on wikitree and thus is, to some extent, a composite portrait of the active wikitree members. Re (2) - Going back in time, the loss of expected connectivity should relate to loss of documentary evidence to establish connections (e.g., lack of birth/death/marriage records) and thus reflects the history of vital record recordkeeping and preservation. At some point in the past, the network becomes just a small slice of the biological/genetic network and only reflects connections with and between nobility/gentry, since those are the only people for whom records were kept/survive. I think exploring the holes in the wikitree network and areas of missing connectivity and comparing them with areas of relatively complete connectivity might be interesting and revealing.
by Chase Ashley G2G6 Pilot (257k points)
So, a question that comes to mind is how much more connectivity can be done on WikiTree via DNA where no physical records exist?

@Tommy - I think there are real limits on how much additional wikitree connectivity DNA can provide. In the absence of documentary evidence, all DNA is going to be able to show is that there was some common ancestor about x hundred or thousand years ago, but you can't (or at least, I think, shouldn't) create a profile and connections for that kind of theoretical ancestor. DNA ancestry/connectedness makes more sense for a phylogenetic tree, such as The Big Tree, which doesn't try to show individual connecting ancestors.

Yes Chase. I'm very interested in trying to understand in what ways WikiTree is a subset of the "Real Human Network" and why. As you note, there are sort of two reasons that WikiTree is incomplete: records have been lost to time or nobody has put in the effort to find records yet. It would be very interesting to me to be able to model those phenomena in some way. Perhaps understanding this better would help us in the actual process of making a more complete network!
That sounds like a driver for policy changes maybe? You're missing a scenario, which is that records don't exist, because they aren't tracked like they are in the Western world. (oral histories, etc.). If we can identify density gaps in the tree due to that reason, we could create policies defining when that is commonly considered to be acceptable and how to document it on wikitree? Would have to include definitions of when it is to be considered "authoritative" versus "family legend" and what to do if the documents that do exist conflict with the alternative sources.
Tommy and Chase - I can imagine a time when sophisticated chromosome mapping would extend our connections back significantly farther than we've done in many cases through traditional genealogy.  Of course some records must exist if we are to put names to people and not just "unnamed 15th great grandfather," etc. (Impediments to that happening are the relatively limited number of people who've been DNA-tested, limitations of current testing methods and incompatibilities between them, etc.  I'm not a scientist.  That's just my somewhat uninformed opinion.)

Shawn, I think from the beginning of the project we realized that one limit to how well the WT shared tree could represent the real human tree was that eventually we'd run out of records.  But we have a long way to go before we reach that end.

@Chase. When looking at WikiTree as a network, at least in a "circles" perspective, or from the Connection Finder perspective if you prefer, all four relation types : parent, child, sibling and spouse are considered without distinction.

If you take two random post-1800 profiles (they certainly represent more than half of WikiTree profiles) in the Main Tree (reaching 23 million soon), and look at the shortest path between them, it generally does not go back much before 1700, but rather takes shortcuts through siblings and spouses. If they have some common ancestor before 1400, say, the back-and-forth path through there would take at least 15 generations upwards and 15 downwards, which makes a path of length 30 or more, whereas a traversal path would typically take between 20 and 25 steps.
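
The arithmetic above can be checked on a toy network. The sketch below treats every relation (parent, child, sibling, spouse) as the same kind of undirected link and runs a plain breadth-first search; all names are invented, and the single in-law shortcut is an assumption chosen to land in the 20 to 25 range.

```python
from collections import deque

def shortest_path_length(edges, source, target):
    """Number of steps on a shortest path, or None if the two are unconnected."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

def descent_line(prefix, ancestor, generations):
    """Parent-child links from `ancestor` down `generations` steps."""
    names = [ancestor] + [f"{prefix}{i}" for i in range(1, generations + 1)]
    return list(zip(names, names[1:]))

# Two lines of 15 generations each, descending from one common ancestor.
lines = descent_line("a", "ancestor", 15) + descent_line("b", "ancestor", 15)
via_ancestor = shortest_path_length(lines, "a15", "b15")

# Add one shortcut: a5's spouse is a sibling of b5.
shortcut = [("a5", "spouse_of_a5"), ("spouse_of_a5", "b5")]
with_shortcut = shortest_path_length(lines + shortcut, "a15", "b15")
```

Without the shortcut the path runs 15 steps up and 15 down; with it the path skips the oldest generations entirely, which is the pattern described for the Connection Finder results.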

Even in my "kinship network" (as Eva calls it) of Brittany, which looks more like a tangled knot of yarns than any kind of tree I know of, the shortest path to my removed cousins most of the time does not go through a common ancestor, but through multiple interbreeding.

My conjecture is that what I see there locally, and Eva sees the same at home in Sweden, is just a prefiguration of the structure of the global network. We have a hard time figuring out how it scales, and graph analysis tools are difficult to put into action because of the sheer size of the graph. So the few we have, such as the Connection Finder, are precious.

[edited] I just looked at my shortest path to one of the profiles of the week, Betty Haig : it seems a typical illustration of the traversals I am speaking about above : 24 steps, going through Cosima Liszt, Eva Wagner, three generations of Chamberlain, the oldest date of birth being 1767.

@Julie Just an aside that, discounting uniparental DNA and assuming no known pedigree collapse (a big assumption), an individual's number of genetic ancestors begins to be, at a point somewhere around 8 or 9 generations, exponentially outpaced by the number of "paper trail" ancestors. At the level of 7th great-grandparents, about 31% of them will have contributed no autosomal DNA. At 8g-grandparents that no-contribution number climbs to 53%, and at 9g-grandparents, 71%. That would also assume we can do much more precise and granular matching via next-generation sequencing than we can today using our microarray tests.

I don't know where the theoretical limit would be, i.e., the "all Europeans are descended from Charlemagne" trope. It's logical to assume those limitations would be constrained by the continental-level populations involved (some have had more severe genetic bottlenecks than others), the population admixtures involved, and of course the degree of endogamy in one or more of the populations. But on average we'll see around 75% of the ancestors 11 or 12 generations back contributing no autosomal DNA. Using a 26-year median generational interval, that places us at 286 years before "individual zero," our DNA test-taker. For practical purposes, my absolutely unscientific guess would be that 15 generations will prove to be a reasonable threshold beyond which autosomal DNA can seldom be genealogically useful...so call that 390 ybp (years before present signifying January 1950), or around the early 17th century. Even at that I may be being overly optimistic.
Edison, I do understand that there will be ancestors from whom I get no DNA.  For that matter, having made myself a map of my own chromosomes, I can see that on certain chromosomes where I had only one or zero crossovers, I have already lost a grandparent or two (on that chromosome).

What I assume, though, is that for that ninth great grandparent from whom my chance of getting his DNA is not good, that someone else did get some.  I can see (as I think while I write) that a person who didn't reproduce would have had his DNA wiped out, or if his children had no children, it would be gone in one generation.  But what are the chances that a person who has a ninth great grandchild would only have one?  In some branches of my family, I'm guessing I'd more likely be one in a million.

Julie, you kinda lost me. Sorry.

Meiosis becomes, in aggregate, a numbers game with more randomization than not. Even if we could get to the point where we'd be accurate enough to evaluate to a 1cM level (not a given at all, if for no other reason than the assumptions and estimations involved in calculating centiMorgans), you would have ancestors with multiple living descendants, none of whom will carry any detectable DNA from that ancestor. A person can have 20 children and never pass down her entire genome; there is genetic truncation, deprecation at every generation. And without exhuming the ancestor, we would still have to rely on comparing people who have actually tested.

If you flip the numbers and say, hypothetically, that of the 9g-grandparent only 25% of her living descendants still carry any of her DNA, then we're at one-in-four to start. With the intervening 22 birth events you're actually down to a 0.002% chance you and any given 10th cousin carry any matching DNA. But in order to determine that two 10th cousins align back to a specific 9g-grandmother as an MRCA, they both have to share a segment of the same DNA that can be traced back to the ancestor.

To arrive at that theoretical number, take the expected sharing of any one descendant from the ancestor, and then divide it by the number of ancestors at that generation (described in a limited fashion here). For example, we have (again without pedigree collapse) 8 great-grandparents. Any given descendant of one of the great-grandparents would be expected to have around 12.5% of her DNA. Second cousins whose MRCA couple is one set of great-grandparents would have an expected total DNA sharing of about 3.125%. So to get an expected amount of DNA the two cousins would share that came from the same great-grandparent, divide 12.5% by 8 = 1.56%.

Now let's put the numbers on steroids. With no pedigree collapse you would have 2,048 9g-grandparents. Theoretical, even-distribution sharing would mean you would have about 0.048828% of her DNA. If we assume that an entire genome comprises 7,000cM (which splits the difference between the traditional FTDNA 6,800cM model and 23andMe's ~7,200cM), that's roughly 3.4cM.

You and any given 10th cousin descended from the 9g-grandmother would be expected to share about 0.00005% of your DNA...or 0.0035cM. And if we use the formula to see how much the two of you might share from the same 9g-grandmother: 0.048828% ÷ 2,048 = 0.00002384%, which would be equivalent to 0.001669cM. Call it 1,600 total base pairs. With gene linkage in meiosis and linkage disequilibrium, there are very few places on the genome where that might be potentially genealogically meaningful, even with 100X coverage whole genome sequencing.
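
For anyone who wants to rerun this arithmetic for other generations, here is a small sketch of the rule of thumb described above: even-distribution sharing with no pedigree collapse, and a 7,000cM genome as assumed in the comment. The function names are made up for this example, and real inheritance is far lumpier than these averages suggest.

```python
GENOME_CM = 7000  # total genome length in cM, the midpoint assumed above

def ancestors_at(generation):
    """Ancestor slots `generation` steps back, assuming no pedigree collapse."""
    return 2 ** generation

def expected_share(generation):
    """Even-distribution fraction of one ancestor's DNA in a descendant."""
    return 1 / ancestors_at(generation)

def share_via_same_ancestor(generation):
    """The rule of thumb above: expected share divided by the ancestor count."""
    return expected_share(generation) / ancestors_at(generation)

# Great-grandparents are 3 generations back; 9g-grandparents are 11 back.
for label, generation in [("great-grandparent", 3), ("9g-grandparent", 11)]:
    share = expected_share(generation)
    same = share_via_same_ancestor(generation)
    print(f"{label}: {ancestors_at(generation)} ancestors, "
          f"{share:.6%} expected, {same:.8%} via the same ancestor, "
          f"{same * GENOME_CM:.6f} cM")
```
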

And, of course, I said my guess was 15 generations as an eventual autosomal threshold. Now I'm thinking I was too optimistic.

Edison, I will try to briefly explain what I meant, and I admit I have only a foggy conception of how it would all work.

In traditional genealogy, we don't jump in and, first thing, announce we are descended from a particular tenth great grandfather.  We build the chain of evidence step by step, generation by generation.

I had imagined a similar process for building a genetic one world tree.  That is, start with people who have tested.  Next, reconstruct their parents' chromosome maps.  I believe GEDmatch is already attempting something like that.  Then, do the next generation, etc.

I understand what you're saying about missing information.  I don't know if that would make the whole idea impossible.

Yeah; and I didn't intend to thoroughly derail the topic, either. Sorry 'bout that, Shawn and moderators. I'll wrap it up and quit interfering.

I'm still hoping for computational biology simulation runs from Cornell on the matter, but my comment about the top-down gaps goes along with my continued and oft-stated belief that the mechanisms of genetics disagree with the notion that autosomal triangulation is possible beyond a few generations. I would be absolutely floored if, for example, a group of 4th cousins and other biological relatives could reconstruct via their own tests a set of chromosome maps that accurately represent the shared 3g-grandmother's genome. Even with allele by allele comparisons in next-gen sequencing if all the people involved took that level of testing. And my personal belief is that, ultimately, we'll find that autosomal triangulation is always unreliable and inherently inaccurate beyond 5th cousins.

So top-down, oldest generation to youngest, we lose genetic data and the genetic pool is far shallower than the genealogical one. And going bottom-up can, I believe, only be accurate for a few generations at most. In truth, the start and end loci as reported with our segment details are only approximations to begin with because we're testing only about 0.02% of the genome. You'll get different values when comparing microarray tests from different companies and different test versions. So even reconstructing a genome one generation back by chromosome mapping is unlikely to be fully accurate.

But you get my point. Genealogists have been rapidly adopting and propounding some presumed-valid practices that the science can't quite substantiate or underpin. In fact, see my G2G post from just a few minutes ago. The first complete sequencing of the human genome may have been accomplished just a few weeks ago.

My question is:  What does it look like?  (Bernard will need a glass of wine when he reads that.)

I think visual representations can add a lot to a discussion, including making things easier to grasp for us non-mathematicians.  Even if you can't represent the entire network visually, I know you've produced some interesting graphs of some of its aspects.  It would be interesting to see those in your blog.
by Living Kelts G2G6 Pilot (516k points)
For the record, my usual answer to Julie on this is : graph visualizations do not scale. They are OK up to a few hundred. We are speaking about tens of millions, and counting.
And my usual answer to Bernard is:  It's still fun to think about.
Yes Julie, I am super-interested in ideas around visualization! As Bernard says, directly viewing a 27 million node network using simple graph visualizations leads to a jumbled mess that is completely inscrutable. However, I think there are many opportunities for creative visualization. Some ideas I have are: local neighborhood visualizations; cluster-level visualizations (a la Eva's comment about Swedish clusters, etc.); visualizations that allow seeing the "outer rim" and rough edges of the network; visualizations of smaller networks that are like our giant network in some way. I am currently working on a new blog post which will have some local visualizations to start us off in this direction.

Bernard's idea of clusters grouping in a fractal way is intriguing, as fractal display of information is possible in a repetitive way that may work to explore without having to use all data points within the visible graph. See onezoom.org or more specifically zoompast.com (same tech, different use).

also, fun stuff to think about:

https://www.cs.umd.edu/~ben//papers/Shneiderman2008Extreme.pdf

Phylogeny is not genealogy, but still interesting to think about. Might have to identify the clusters somehow so you can group as you go up, and store that somehow.

Jonathan: Wow, onezoom.org is super-cool! Thanks for sharing. These are awesome visualizations!
Wow. Looking at your links, Jonathan, I figure I've missed a lot of things lately. Won't write "graph visualizations don't scale" any more.
uh-oh, I broke him. I think that's still a valid statement Bernard, you have to cheat somewhat to make it work. And mentally visualizing 27 million anything is going to be akin to the "hairball" scenario where it's just meaningless unless you have a useful way to dissect and group things, or browse among the data, which would need to be less dynamic and more of a static calculation that just moves the frame around.
For me, visualizing a hairball or a dense inscrutable mess is more satisfying than telling myself I can't possibly imagine what the global tree looks like.  I think it's a big lumpy ball, thicker in some places than others, with some hairs or string or tails hanging off.

Edited for brevity.
