Why aren't genealogical collections using contemporary data formats?


The GEDCOM format looks like a database format from the 1980s. Why hasn't a more modern format emerged?

The problems with the GEDCOM format, and with the attempts at its extension, are myriad.

The GEDCOM format assumes all people will have a surname or a Europeanized name.

The GEDCOM format assumes that the general structure of a family is a linear tree, when family trees are in fact family graphs, with some degree of intermarriage, illegitimacy, and consanguinity.

The GEDCOM format is difficult to verify and lacks simple consistency-checking mechanisms; as such it allows the formation and propagation of computationally ambiguous loops, such as a person appearing among their own ancestors (a grandfather paradox).
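Catching those loops does not require anything exotic. As a minimal sketch in Python, assuming person records have been reduced to a map from a child's ID to its parents' IDs (the IDs and layout here are made up for illustration, not GEDCOM syntax), a consistency check is just a graph walk:

    def has_ancestry_cycle(parents, start):
        """Return True if `start` appears among its own ancestors."""
        stack = list(parents.get(start, []))
        seen = set()
        while stack:
            person = stack.pop()
            if person == start:
                return True
            if person in seen:
                continue
            seen.add(person)
            stack.extend(parents.get(person, []))
        return False

    family = {
        "I1": ["I2", "I3"],   # I1's parents are I2 and I3
        "I2": ["I4"],
        "I4": ["I1"],         # data-entry error: I1 listed as I4's parent
    }
    print(has_ancestry_cycle(family, "I1"))   # True, so the file is inconsistent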

GEDCOM does not distinguish between conflicting degrees of information, such as sourced entries vs. non-sourced entries. The format does not lend itself to the amendment of properties or to the extension of data by linking further sources.
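What I have in mind is closer to every asserted property carrying its own provenance, so sourced claims can be weighed against unsourced ones. A rough sketch, with field names that are purely illustrative and not part of any existing standard:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Citation:
        source: str       # e.g. an archive reference or a URL
        confidence: str   # e.g. "primary", "secondary", "speculative"

    @dataclass
    class Assertion:
        prop: str                                   # e.g. "birth_date"
        value: str                                  # e.g. "1750-03-14"
        citations: List[Citation] = field(default_factory=list)

        @property
        def sourced(self) -> bool:
            return len(self.citations) > 0

    birth = Assertion("birth_date", "1750-03-14",
                      [Citation("Parish register, Aberdeen", "primary")])
    print(birth.sourced)   # True: this claim can outrank an unsourced one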

GEDCOM location data does not capture the changes in the names of locations throughout history; place names are regularly given in strictly modern terms that don't make sense within the context of the historic areas that people originated in. Example: ancestors listed as living in Connecticut or Vermont, USA in 1750. GPS coordinates are generally better for the purpose of dereferencing ahistorical location data to digital maps, and GEDCOM does not properly correlate the location data of sourced events to geographical areas on maps.
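One way to handle this is to anchor a place to coordinates and let the name be a function of the year. A small sketch, where the year ranges and field names are illustrative only:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class PlaceName:
        name: str
        valid_from: int   # first year this name/jurisdiction applies
        valid_to: int     # last year it applies

    @dataclass
    class Place:
        coords: Tuple[float, float]   # latitude, longitude: the stable anchor
        names: List[PlaceName]

        def name_in(self, year: int) -> str:
            for n in self.names:
                if n.valid_from <= year <= n.valid_to:
                    return n.name
            return "unknown"

    hartford = Place((41.76, -72.67), [
        PlaceName("Connecticut Colony, British America", 1636, 1776),
        PlaceName("Connecticut, United States", 1777, 9999),
    ])
    print(hartford.name_in(1750))   # the historically sensible label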

Frequently, webpages and sites dealing with large family graphs do not deal well with the importation of 5000+ entries, and fragmenting the import data is difficult because of the linear, single-file nature of the GEDCOM format. The unit document of genealogical data is the collectively documented properties of a person throughout their life; uploading needs to be constructed from the premise of importing and updating members of a family graph rather than in terms of uploading entire family trees and verifying each and every entry by hand. In other words, the better format is a distributed source versioning repository.

 

All of these problems are solved in the contemporary programming domains of package management and source code development software.

Open document containment and management formats that would be appropriate building blocks for a new genealogy format, or what I call a family-graphing format, include zip, 7z, and tar.gz.
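As a minimal sketch of what such a container could look like, here is one person record plus a manifest bundled into an ordinary zip using Python's standard zipfile module. The manifest layout and file names are my own invention, not an existing specification:

    import json
    import zipfile

    person = {
        "id": "I1",
        "names": [{"given": "Pablo", "surname": "Picasso"}],
        "assertions": [{"prop": "birth_date", "value": "1881-10-25",
                        "citations": ["http://dbpedia.org/resource/Pablo_Picasso"]}],
    }
    manifest = {"format": "familygraph-draft", "version": "0.1",
                "persons": ["persons/I1.json"]}

    # One file per person plus a manifest, all inside a plain zip archive.
    with zipfile.ZipFile("example.fgraph.zip", "w") as archive:
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        archive.writestr("persons/I1.json", json.dumps(person, indent=2))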

GitHub is functional as a distributed source-version repository, and that model could be extended into a genealogy-specific distributed development environment. Say, GenGraph.

Semantic web technologies would be really useful for automatically finding and retrieving source documents; I don't see any technological reason why pages on a modern wiki can't be partially constructed from existing, semantically marked-up sources like Wikimedia Commons or DBpedia, so biographies might be imported at least for notable individuals. These same technologies would be useful for linking data to systems like GeoHack, so events could be displayed in a map frame on their page. And RDF equivalents would be ideal for linking together collections of genealogical documents, resolving their dependencies, and interfacing them with automated reasoners and compilers to provide data validation and consistency-checking systems.
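To make that concrete, here is a rough sketch with Python's rdflib: a local profile gets typed as a person and given coordinates that a GeoHack-style service could drop onto a map frame. The namespace, identifier, and name below are hypothetical, purely for illustration:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import FOAF, RDF, XSD

    EX = Namespace("http://example.org/familygraph/")             # made-up namespace
    GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")   # W3C basic geo vocabulary

    g = Graph()
    person = EX["I42"]
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal("John Mclean")))            # hypothetical ancestor
    # Coordinates (roughly Aberdeen) that a map service could render on the profile page.
    g.add((person, GEO.lat, Literal("57.15", datatype=XSD.decimal)))
    g.add((person, GEO.long, Literal("-2.09", datatype=XSD.decimal)))
    print(g.serialize(format="turtle"))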

in WikiTree Tech by Ian Mclean G2G6 Mach 1 (13.6k points)
edited by Ian Mclean
Good for you Dale. Please do move on to bigger and better things.

Salgo, the internet, generally speaking, is a precision instrument by comparison. I can reliably communicate across the entire global internet more than 90% of the time. It is a lot better in some areas where people are more reasonable about what systems and networks they set up. In this part of the thread, I am illustrating the size and scope of the problem, which is genealogy based on simplistic models, modified by specific points of contention like the actual number of humans who have ever lived and the maximum potential number of human beings who could have ever lived. My point is that the genealogy project is progressing at a snail's pace and REALLY doesn't have to. We could be applying deep learning techniques to this problem so that the majority of the tedious record finding and comparison work is done for us by machine verification and reasoners doing routine document processing.

The practical problem of genealogy is less complex than a generally uncomputable problem, provided we operate consistently and efficiently. If we are "greedy" about the complexity by neglecting to set up appropriate data structures to archive, search, sort, prune, merge, and retrieve relevant data, then the problem is at least exponentially complex, different by 19 orders of magnitude in fact. If we're totally inconsistent about it then we can end up like Sisyphus and his boulder. It doesn't have to be that way. The basic data structures of genealogy are finite fields of information; genetics are the benchmark for progress here. Our genetic sequencing technologies are beating Moore's law in terms of performance rate, accuracy, and precision. That is what we can expect at best, based on state-of-the-art bioinformatics in anthropology.

At the far end from here are Watson and Google's deep learning algorithms. We could have a Genealogy@Home project going on worldwide. We could supercompute this problem, and we could generate the global family graph back a couple of thousand years in a couple of decades, tops. In a couple of generations, we could trace back a couple of million years and start to link up with the tree of biological evolution. A couple of generations after that, we would have the global family graph back a couple of billion years to the earliest known organisms.

First we need to get our business in order. Use sensible archival systems and take advantage of state-of-the-art library and information retrieval science. WikiTree has a whole bunch of data, but I haven't seen much analysis. I want statistics; I want white papers. I want open source science and publications. I want to know what percentage of the global graph is complete and checked to high-precision consistency.

>> the internet generally speaking is a precision instrument by comparison.

Hm, the Internet is not a top-down system. It's not based on the idea that everyone should agree.....

It's a place where people can disagree. Listen to Tim Berners-Lee at 8:40 on how he designed the semantic web so we don't need to agree: https://youtu.be/ga1aSJXCFe0?t=8m30s

What can be done is that we start linking good data that we trust and stop copy/pasting between resources. DNA genealogy would benefit from open data so that we could more easily combine data from more systems...

Sourcing all profiles on WikiTree is, for me, like Sisyphus and his boulder.

Salgo, people are allowed to disagree. We're doing that in this thread quite a bit. I am reporting my experience; I am not speaking prohibitions. I am simply stating propositions and then inferring consequences from them. People who like the 40-year-old software are more than welcome to keep using it. I am not saying GEDCOM has to be entirely abandoned or anything like that. I am saying that no archival format presently exists outside of proprietary data packaging standards specifically for genealogy data, model, and theory. I am reporting contradictions that I have discovered in the models, which indicate inconsistencies produced within the theoretical system.

And it was said in the thread that WikiTree is a volunteer force. My point here is that there exists a huge volunteer force which is dedicated to file management, data validation, databasing, query software, deep learning and data mining, and high quality archival technology. GEDCOM X is a step in the right direction, but the fundamental problem of grouping the data into a model for communication and distribution has many known solutions in the package management world of *nix operating systems. Free open source software, the Creative Commons, and the GNU communities have the technologies that can be bundled together to define a specification of what amounts to a fancy zip format; a direct extension of that technology, actually, supplemental to the GEDCOM and GEDCOM X semantic variants. A way to more effectively share our genealogy models that allows us to use the command line tools and scripts of the *nix environment to bulk process the global family graph using a distributed processing network like that of BOINC.

It isn't asking much, and it only needs a couple of interested individuals with the technical know-how. Chances are the free/libre open source community is already working on parts of it. WikiTree can be an active part and supporter of that distributed project.

>>I am saying that no archival format presently exists outside of proprietary data packaging standards specifically for genealogy data, model, and theory.

And maybe the semantic web is the solution...
You define Picasso as
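something like the following sketch, say with Python's rdflib (the local identifier and namespace are made up; the point is to link to descriptions that already exist rather than copy them):

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import FOAF, OWL, RDF

    EX = Namespace("http://example.org/familygraph/")   # hypothetical namespace

    g = Graph()
    picasso = EX["picasso-1"]
    g.add((picasso, RDF.type, FOAF.Person))
    g.add((picasso, FOAF.name, Literal("Pablo Picasso")))
    # "Link, don't copy": point at the authoritative descriptions elsewhere.
    g.add((picasso, OWL.sameAs, URIRef("http://dbpedia.org/resource/Pablo_Picasso")))
    g.add((picasso, OWL.sameAs, URIRef("http://www.wikidata.org/entity/Q5593")))
    print(g.serialize(format="turtle"))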

 

Very yes. That is definitely what I am thinking about. So I could use the format to do something like

$gengraph install picasso-1

And download the entire collection of works recorded in the public domain by Picasso or about Picasso.

And best of all, I could do the same thing with somebody "non-notable" to get a better idea of who they were, and where they went, and what they accomplished. I've already discovered some really neat facts about my family; I've learned stories that I have never been told by my living relatives. Marriage licenses and census records can tell interesting stories; I learned that five generations ago my family had estates with servants. I learned that my family can be found in Aberdeenshire, Scotland. My grandpa once told me that our family had a castle, and from what I've learned from reading about the Mcleans, we actually have several. I probably have some title. It would be neat to be able to build libraries of my ancestors' works from my genealogy.
That is exceedingly cool, Magnus. Those make nice proof of concept pieces. I am going to do some more research into what kinds of technologies are out there already for genealogy research and representation.

For general reader reference: https://en.m.wikipedia.org/wiki/Archive_file#Archive_formats
I'll note that GRAMPS does some of the databasing improvements, but it is unwieldy and is designed for legacy support of genealogy data standards based on GEDCOM and on the historical-authorial method rather than a scientific-anthropological one.

One of the things that occurs to me is that whatever data standard emerges, it needs to be designed with a unit-testing experimental model in mind. Basically, people each create their own version of a genealogy with wildly varying spellings and choices of location names, so the further back you go the more divergent variations of the same genealogies you have. I would like the ability to take census data, generate trees based off of it, and compare those to available genealogies; there needs to be a method of eliminating genealogies based on counter-evidence. The problem is already huge, and we need methods of reducing it down to less incorrect models.

Interesting session about making data available on the web.

WikiTree is just a one-star....
https://www.futurelearn.com/courses/linked-data/1/steps/75390

  • One-star (*): The data is available on the web with an open license. YES

  • Two-star (**): The data is structured and machine-readable. YES / NO: the bio section is unstructured data, with events, citations, and DNA matches as free text if we don't use templates.

  • Three-star (***): The data does not use a proprietary format.

  • Four-star (****): The data uses only open standards from W3C (RDF, SPARQL).

  • Five-star (*****): The data is linked to that of other data providers.

 

Another sexy Wikidata solution:
http://histropedia.com/timeline/f2t8yh5d1w/Paintings-by-Pablo-Picasso

Creates timelines from Wikidata and Categories....

1 Answer

First, yes, GEDCOM standards use a database format from the 1980s, but if you deviate too much then most programs will not work with the new standard.

Second, I believe that the transferring of sources is a problem because the receiving program wants that information its own way, and that is not compatible with the way other programs display things.

Every program and website has its own conventions about name display and location naming, so until there is uniformity among ALL programs and websites, the name and location problem will exist.

I have been working with genealogy programs since the 1980s and have used a very large number of them, and I can say that WikiTree is the most unique in its formatting of data, including names and location names. That is not bad, but because of it, if a GEDCOM is going to function for most programs, there will always be problems with a few.

I have found that it is actually less work and takes less time to create profiles on WikiTree manually than it is to upload a GEDCOM and do the resulting cleanup later. The real problem is that most people do not see that until they upload a GEDCOM, and then a lot of those people just do not bother to fix their uploads.
by Dale Byers G2G Astronaut (1.7m points)
So transfer the whole source as it is. Define formats for reinterpreting or mapping the original source to a digital representation appropriate to application-specific domains. Parsers, interpreters, and transcription technologies are OLD at this point, much older than 1980s tech. There is no excuse other than trying to prop up antiquated technologies, which should be a legacy concern for contemporary software systems.

The way you get uniformity is not by avoiding defining and developing specific technologies but by making something so useful that everyone uses it. GEDCOM is mostly the de facto standard because most people are conforming to the early tech of the LDS church. After having reviewed the LDS church's FamilySearch and Roots software, I can say with confidence that there are far better ways to do all of this. Cheaper. More accessible. More usable. Friendlier.

Any solution which is "manually enter the data by hand" in this domain is stupid. Supremely stupid. For me alone, the maximum number of ancestors in generation n is 2^n; at 100 generations (about the span of agrarian human history) there are more entries to be parsed, transcribed, translated, annotated, linked, verified, and compared than there is storage in all the world's servers presently. When you take into account the 7 billion people on the planet, the sum of them is Sigma(2^n) maximum entries. Even with relatively high consanguinity (i.e. redundancy in the family graph), we're talking about a number of entries (multiplied by the number of sources and citations) approaching the Sigma(2^n) maximum. You have to understand that kind of complexity is already not effectively computable.

We can do better, and we should, I think. WikiTree isn't the end of genealogy; the data we enter here is not definitive and conclusive, and the tree we construct is an exceedingly small fragment of the whole. At some point, the data will be migrated to a new system, a new WikiTree version, and when that happens, you are effectively advocating that we simply re-enter the majority of the data by hand.
Ian, your numbers, while impressive, are not accurate for this reason alone: on WikiTree you would have to share most of those ancestors, because we share one tree. As an example, a couple of weeks ago I started out to help a granddaughter with a project and started adding her father's line. After adding about 6 profiles I found that most of the others were already on here, added in 2010, and the line then reached back to the 1300s. That is the beauty of WikiTree; I do not have to enter that many profiles, I can and do connect to those on here that have been added by others.

WikiTree is by no means comprehensive. It presently skews very heavily towards US genealogies and genealogies completed in English. You have to understand that the number of humans living today is far less than the number of all humans who have ever lived. If WikiTree is to be truly global in its extent then it will eventually have to accommodate the genealogies of Chinese, Indian, and Native American people. We're presently talking about 7 billion profiles for only the living people right now; WikiTree has barely ten million profiles of both living and dead people. (The bound is 7 billion * 2^n.) If we go 2 generations out (only the grandparents of everyone presently living) then it is (7 billion * 2^2), or a maximum of 28 billion profiles. Even supposing that there are half as many people due to redundancies, we're still talking about 14 billion people, which is still well over a thousand times the size of the entire WikiTree project from 2008 to 2016. If it takes 8 years to produce 10 million profiles and the rate is constant, then it will take on the order of 8,000 years to get almost two generations of today's population.

That is both a Sisyphean task and an infeasible way to process all the information available. And to be clear, I am not talking about the number of connections in the family graph. I am only talking about the number of profiles; the number of links between the profiles will generally vastly exceed the number of total profiles. It is a simple combinatorial argument.

I don't think it's correct to say "the number of links between the profiles will generally vastly exceed the number of total profiles." Sure, if talking about the number of "roads" through WikiTree it's vast, but what goes on here is simply to create a few links for each profile. You can create a much larger number of links via a person's biography, but they're not going to be accessed very often. Say you put all the people in your high school graduating class on your bio. A few people might use your link to find someone they knew who happens to be on WikiTree, but it's not going to be part of the search engine here, so who cares?

Ian, you yourself said that the required number of profiles would exceed the current server space in the world. Yes, WikiTree is skewed toward the US; that is because it is based in the US and most of the members are from the US, but we do not limit ourselves to only profiles of US people. In fact I have created a profile, manually, for a Japanese citizen and even added his name in their language. Is it perfect? No. But it is evolving. The root of the problem is something that you and all of your math are ignoring: you want to change something that is used by hundreds of programs and has been in use for almost 40 years, and those changes are way beyond the current abilities, because the GEDCOM output from every program is slightly different and you would need to account for all of those variables to have it work smoothly. There have been a large number of improvements to the way GEDCOMs are imported and how profiles are created from them, but you need to understand that WikiTree does not have a large staff; it is mostly volunteers, which enables them to keep the cost to the users FREE.

You don't understand the basic mathematics of the genealogical problem.

When I say links, I mean between individuals: child to parent, spouse to spouse, child to child. This is what is called graphing, and it is governed by the mathematical theory of graphs and by network theory.

So my assertions about links between profiles are without taking into consideration links between documents and profiles. That is another term in the equation which generally increases the value that I was asserting.

I have one ancestor who had eighteen children. That means 19 profiles, 18 links between each child and that one parent, and C(18, 2) = 153 links from child to child; with both parents included there are 36 links between children and parents. I am only concerned with necessary links in a family graph, not with optional links like Friend of a Friend.
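The counts are easy to reproduce; a couple of lines of Python, just to show the arithmetic:

    from math import comb

    children = 18
    profiles = children + 1              # 19: the one ancestor plus 18 children
    parent_links_one = children          # 18 child-to-parent links
    parent_links_two = children * 2      # 36 once the second parent is added
    sibling_links = comb(children, 2)    # C(18, 2) = 153 child-to-child links
    print(profiles, parent_links_one, parent_links_two, sibling_links)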

Your theoretical maximum of 2^n at 100 generations, or roughly (rounded down) 1.26x10^30, is entirely flawed. The actual estimate of the total number of human beings that have ever lived on the planet is closer to 1.08x10^11; a quick division of these two numbers shows you have overestimated the size of 'the problem' by a factor of about 1.17x10^19. The 28 billion profiles (7 billion x 2^2) that you suggest are just 2 generations actually represent about 1/4 of 'humanity'. Then there is of course the fact that we do not have records going back to the "start of agrarian human society", nor for a vast number of those people who existed. With enough samples we may be able to use DNA to 'rebuild' the people who must have existed that connect us, but there are many who left no descendants and no DNA trail for us to follow, so we will never be making profiles for the entirety of agrarian human society.
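Those figures can be checked in a few lines of Python:

    theoretical_max = 2 ** 100        # about 1.27e30 ancestors at generation 100
    ever_lived = 1.08e11              # common estimate of all humans ever born
    print(f"{theoretical_max:.2e}")                 # 1.27e+30
    print(f"{theoretical_max / ever_lived:.2e}")    # ~1.17e+19 overestimate factor
    print(f"{28e9 / ever_lived:.2f}")               # ~0.26, roughly a quarter of 'humanity'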

When you state the problem in remotely realistic terms, you might find people give it more consideration; and possibly using "conversational" language rather than jargon might help people better understand what you are suggesting - one of the most important things in communicating is considering your audience.

The maximum is the maximum. I know the actual number is significantly lower than the maximum, because humanity is not a family tree but a family graph, or, as I like to refer to it, a family shrub. The consanguinity (inbreeding) is in general relatively high, especially among royals. But just theoretically: I have two parents. They have two parents each. Their parents have two parents each. And so on. That progression is 2^n ancestors at generation n. That's the maximum number of people for a bi-sexed reproductive species based on a progression of generations.

The actual number will be 2^n - c, where c is the consanguinity factor based on how many ancestors appear multiple times in a given family graph. However, we don't need to worry about going back a hundred generations, or even about the difference being theoretically 19 orders of magnitude smaller. 10^11 is still a huge number compared to 10^6. At the rate WikiTree has been filling out family histories, we're still looking at several thousand years' worth of data entry by hand to get as much of the human family graph filled out as is possible using historical records, anthropological artifacts, and biophysical evidence like genetics. I'll leave aside the whole theoretical claim that data in our universe can't actually be erased (the 1st and 2nd laws of thermodynamics), such that we could eventually recover at least almost all lost data using forensic physics.

For my purposes, it suffices to show that if we want to recreate the family graph back to, say, 30-50 generations, the problem is still prohibitively expensive in terms of time if we're going to enter everything by hand and migrate the data from software revision to software revision by hand each and every time. We spend all our time updating a past that gets further and further from us each and every time we have to migrate the data to the new updated system. And if human generations keep growing in size, then we will never actually get any of the near generations done, because their size becomes comparable to the size of all previous human generations.
In simple terms, for others to understand: Ian wants to make a change because he cannot recreate every person who ever lived, based on flawed math, but using assumptions again based on the premise that even with the changes we will never be able to store this information. So he wants everyone to change the way we do things to enable something that cannot be done even with the changes.

He is also expecting that WikiTree and all other software platforms will not evolve but rather be completely replaced, and the history of these programs proves that this assumption is not always the case. By his own statements, at least one other platform has been around for over 30 years and shows no sign of going away, and that is familysearch.org.

The LDS church effectively ceased using (deprecated) a database of 40 million individuals when they switched to the GEDCOM format. They've been mining that archive, but they simply moved to a completely new data model rather than attempt to simply and exclusively transform the old data by extension.

And it has created a lot of cruft. I've been spending the past few days going through and consolidating the sources and profiles in my family tree. I've found more than a few instances where there were in excess of three duplicate profiles to be merged, not including the merger of duplicate spousal families and duplicate children. I've already verified that Charlemagne (crowned Emperor of the Romans) has on the order of hundreds of duplicate profiles and hundreds of duplicate trees, and not all the trees and duplicates agree on the specific properties of Charlemagne or the members of his family. Wikipedia and DBpedia are not standardly indexed sources. I fully expect that the tendency towards redundancy increases with the age of the generation. Data becomes almost totally unreliable within 10 generations. Data beyond about 10 generations from me ceases to meet my criterion for retention; the amount of speculation and unsourced entries is staggering. It wasn't a surprise to me to find that one of the Viscounts of the Franks (Thouars) was mistaken for his descendant's descendant and for an ancestor.

A sample of 13 generations showed that a pull of more than 15 generations was not viable on a common computer with a fiber optic connection and the percentage of irreconcilable database errors grew to an impractical level. The highest number of profiles successfully extracted from the archive was 10,000-20,000 profiles with an increasing number of duplicates.

From the perspective of operating system archival retrieval and standard business databasing benchmarks this is catastrophic. Windows 3.x works slightly better than GEDCOM systems. We can do better, and even the people who insist on hand entry will find great benefits from overhauling the data standards for genealogy.

