Why aren't genealogical collections using contemporary data formats?

Question

Why aren't genealogical collections using contemporary data formats?

493 views

The GEDCOM format looks like a databasing format from the 1980s. Why hasn't a more updated format emerged?

The problems with the GEDCOM formats and attempts at its extension are myriad.

The GEDCOM format assumes all people will have a surname or a Europeanized name.

The GEDCOM format assumes that the general structure of a family is linear and a tree when family trees are generally actually family graphs with a degree of intermarriage, bastardy, and interbreeding.

The GEDCOM format is difficult to verify and lacks simple consistency checking mechanisms and as such allows the formation and propagation of computationally ambiguous loops like grandfather paradoxes.

The GEDCOM does not distinguish between conflicting degrees of information such as sourced entries vs non-sourced enteries. The format does not lend itself to amendment of properties or the extension of data by the linking of further sources.

The GEDCOM location data does not capture the changes of the name of locations throughout history; place names are regularly given in strictly modern terms that don't make sense within the context of the historic areas that people originate. Example: ancestors being listed as living in Connecticut or Vermont, USA in 1750. GPS coordinates are generally better for the purposes of dereferencing ahistorical location data to digital maps. GEDCOM then does not properly correlate location data of sourced events to graphical areas on maps.

Frequently webpages and sites dealing with large family graphs do not deal well with the importation of 5000+ entries and the fragmentation of the import data is difficult because the linear and singule file nature of the GEDCOM format. The unit document of genealogical data is the collectively documented properties of a person through out their life; uploading needs to be constructed from the premise of importing and updating members of a family graph rather than in terms of uploading entire family trees and verifying each and every entry by hand. In otherwords, the better format is a distributed source versioning repository.

All of these problems are solved in the contemporary programming and developing domain of package management and source code development software.

Open document containment and management formats that would be appropriate for incorporation to construct a new genealogy format or what I call a family graphing format include zip, 7z, or tar.gz.

Github is functional for a distributed source version repository, and the format of GitHub can be extended to a genealogy specific distributed development environment. Say GenGraph.

Semantic web technologies would be really useful for automatically finding and retrieving source documents; I don't see technologically why pages on a modern wiki can't be partially constructed from existing documented and semantically marked up sources like the Wikimedia commons or DBPedia, so biographies might be imported at least for notable individuals. These same technologies would be useful for linking data to systems like Geohack, so events could be displayed in a map frame on their page. And RDF-equivalents would be ideal for linking together collections of genealogical documents, resolving their dependencies, and interfacing them with automated reasoners and compilers to provide data validation and consistency checking systems.

asked Apr 18, 2016 in WikiTree Tech by Ian Mclean G2G6 Mach 1 (13.6k points)
edited Apr 19, 2016 by Ian Mclean

Some of the things I have been looking at which are relevant to this. I think they miss the obvious steps of using packaging and source versioning solutions in the new standard of genealogy tech. As the above linked thread shows, there is a tremendous need and growing demand for a technology which bundles or links genetic data together with the family graph.

Semantic web technologies are a part of the solution, but they are good for describing the document model for machine readers; this is analogous to the file structure and version history of a zip and of a git repository. You still need to encapsulate the documents into a logical structure which can be reduced to a directly human readable document IE a web page or a searchable database of documents. And based on what I have seen and done in modding communities for video games, I think that the obvious solution is to make packages which can be decomposed into smaller groups of members. Sometimes, I want to see only the members of my genealogy out to 6 generations. Somtimes, I want to browse through every member back to the furthest documented. Bundling the documents of a member together and then bundling the members of a family graph together makes sense from a modularity perspective as compared to the unifile format that is barely computable.

http://jay.askren.net/Projects/SemWeb/

http://www.zandhuis.nl/sw/genealogy/

https://github.com/blokhin/genealogical-trees

https://blog.tilde.pro/semantic-web-technologies-on-an-example-of-family-trees-7518f3f835a9#.2b09rho78

http://genealogy.stackexchange.com/questions/4087/application-to-create-a-google-map-of-ancestors

commented Apr 18, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Thanks for the links

In the Wikiapps Google group yesterday Robert Warthen who works with gedmatch.com (video Finding Sue) said....

"I've been interested in working on open standards for communicating DNA based information for a bit, so your info may fit into that area also. "

Which is also an area where genealogy needs to rethink and open up to do some progress faster...

DNA research would benefit from open data:

Family trees - Wikitree
In Common With FTDNA
Ancestry DNA Circles
Time period and location of people in the family tree (maybe Wikitree if we start timelines using templates)
Sharing segment match information Gedmatch?

Example of the possibilities using just FTDNA ICW (video) and maps done at DNAgen.net (swe)

commented Apr 18, 2016 by Living Sälgö G2G6 Pilot (297k points)
edited Apr 18, 2016 by Living Sälgö

Getting Started

The GEDCOM X project is developed and maintained at Github, allowing us to leverage the powerful collaboration features that Github provides. Github has accumulated a lot of great resources on how to get started at github:help.

If you haven't worked with Git yet, now's the time to start! (We promise, you'll never go back.) Github has accumulated a lot of great resources on how to get started at github:help. Get familiar with the Github M.O., and take particular note of how to fork a repo, how to send a pull request, and how to be social.

To get started with Git, we'd recommend Pro Git.

From - http://www.gedcomx.org/Community.html

Maybe you could collaborate with them through their community.

commented Apr 18, 2016 by Kathleen Heath G2G6 Mach 2 (22.0k points)

FamilySearch acts as a proprietary repository for GEDCOM X data. Keeping in mind that GEDCOM X is only a semantic model describing a genealogical collection and doesn't constitute a proper genealogical archive without document management software which isn't included explicitly in the GEDCOM X standard. The repository is not open and free, so it can not be generally cloned or forked without filtering through their proprietary software or approved third party alternatives.

The FamilySearch familytree is not portable, and it imposes a modern western model of nuclear family relationships on the family graph which are not supported by the messiness of real world family graphs and relations. This makes it difficult to query the database and get sensible results. I have spent weeks combing through simple records the system is unable to correlate to my family relations. And ultimately, the FamilySearch site is a centralized repository controlled exclusively by the LDS church; Git's major claim to fame is being a free, open, and distributed alternative to traditional SVRs/SVNs.

So, yes and no. GEDCOM X and the Familysearch systems have strengths, but they're strongly biased towards the codified Mormon normative worldview.

commented Apr 18, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Well, you don't have to sign up for a FamilySearch account and feed data into the Mormon Church's PR/Marketing department for one thing. You can download git archives from a public interface without downloading fee-for-service or shareware/adware.

Then there's the aspect that the FamilySearch system doesn't actually perform consistency checks on the over 4 billion profiles they have, so they have something like at least a couple hundred "Emperor Charlemagne" profiles. I have found one "grandfather paradox" in my family tree already that is where an ancestor was also listed as a descendant of a descendant such that a never ending loop was created between the segment of ancestors and descendants; a computer attempting to download that segment of generations would never halt and would cause the growth of the GEDCOM without bound depending only on processing time. A couple of my ancestors have several sets of parents listed.

When searching their database, you have few of the standard options for a functional search engine such as concatenation of Logical-OR terms or Logical-AND terms. Search and sort by regular expressions. When searching through records, the system makes a false dichotomy between "other people" and "spouses" and "Children" and "Parents"; the system doesn't recognize cultures where they don't have surnames or middle names. It is not easy to search through records where you know the result is not male but do not necessarily know if the record will be female or unknown.

The system lacks a lot for merging data from sources into filled attributes on profiles like when a birth date is listed in a profile as an estimate but a source that has been linked has an exact date. Regenerating profiles based on increased sources is difficult to impossible and not a standard feature of either FamilySearch or of Wikitree, and Wikitree has noted in several places policies based around the static structure which makes duplicates a huge pain to match and merge. Pruning data is apparently not a standard method among genealogists and isn't built into their tools.

commented Apr 18, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

I am conversant with the match and merge mechanism of FamilySearch and of WIkitree. The way in which both sites create and manage profiles however does not lend itself in the long term to the notion that much if not most genealogical data is actually incorrect or redundant.

For example, here at Wikitree the profiles are created with a URI based off the best-guess-last-name-at-birth; when a profile is created with an name--say Unknown--then merged into a another profile--say Test--the original Unknown-X profile is redirected to the newly created profile Test-Y; Unknown-X as a URI is forever used up by that redirection. This is fine if it happens occasionally, but long chains of redirections of this kind are computationally costly a risk unpredictable errors in the long term maintanence of the system. Keep in mind that the problem of genealogy at its maximum size is effectively non-computable (See power sets, computational complexity, and non-computability for details), so computational costs to our resources are a real risk to the ability of the system to actually get things done in the long run.

The point being that the system isn't designed to have a dynamic fundamentally uncertain structure; the system is designed around a certain static structure. Where each and every profile is ideally created once and refers exactly and only to profile representing the person with nth last name of Z where n is a integer and Z is some string representing a surname. A system designed to be pruned and merged would not have the URI be generated from the last name of the entries in the URI; it does not lend itself to frequent modification of what amounts to a tentative fact because if Unkown-X becomes Test-Y and Test-Y turns out to be actually named Mistake-ZED which turns out to be only one of nine different equivalent spellings of the person then we will have created either three profiles and three links or around 12 profiles and 12 redirecting URIs.

That assumes that we don't have hundreds of people editing the profile and disagreeing on what the last name of the person in question is; if the hundred people each think that the person has one and only one true last name at birth when the person actually has nine then they could end up rotating edits generating an unending number of profiles and redirects. And if you think that no person is that stupid then you don't understand what computers can end up doing to databases in the long run.

Doesn't particularly matter what you and the arborists or FamilySearch users think or do. The flaw--along with flaws that I haven't explicitly described--exists in the design and programming of the system itself. To me, the existence of the arborists or groups of people going along and trying to cleanup duplicates doesn't matter because my concern is first with whether or not a computer can be made to automate the process. Under the current design, a computer can't be automated or can't easily be made to do the bulk of the work. The system has flaws which would cause gross system-wide corruption, and we are currently depending on people to not maliciously attack the system by exploiting these vulnerabilities and design flaws.

And my survey of what genealogists actually use in terms of tools is not limited to Wikitree or FamilySearch. I am analyzing and learning to use GRAMPS, Ancestor Quest, Roots Magic, and various semantic web tools; pruning tools are optional or afterthoughts in the design rather than a coretool. AncestorQuest for example almost totally lacks a pruning function; the best method I have read so far has been to split the trees using the import/export functionality. Which amounts to a hack rather than a properly designed feature.

Wikitree is more amenable to pruning in the form of simply deleting URI links between profiles, but it is not amenable to pruning in the sense of obliterating bad information; the current paradigm amounts to trying to not let bad information contaminate the global family graph in the first place by imposing human gatekeepers to physically review and reject the data at the level of GEDCOM imports. This is presently somewhat functional, but it adds a sizable overhead to the entry of data and the linking of Wikitree with the data produced by all the previous generations; in the information tech field, this is referred to as "reinventing the wheel", and it is a red flag for a bad design.

commented Apr 19, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Good for you Dale. Please do move on to bigger and better things.

Salgo, the internet generally speaking is a precision instrument by comparison. I can reliably communicate across the entire global internet more than 90% of the time. It is a lot better in some areas where people are more reasonable about what systems and networks they setup. In this part of the thread, I am illustrating the size and scope of the problem which is genealogy based on simplistic models and modified by specific points of contention like the actual number of humans who have ever and the maximum potential number of human beings who could have ever lived. My point is that the genealogy project is progressing at a snails pace and REALLY doesn't have to. We could be applying deep learning techniques to this problem so that the majority of the tedious record finding and comparison work is done for us by machine verification and reasoners doing routine document processing.

The practical problem of genealogy is less complex than a generally uncomputable problem. Provided we operate consistently and efficiently. If we are "greedy" about the complexity by neglecting to setup appropriate data structures to archive, search, sort, prune, merge, and retrieve relevant data then the problem is at least exponentially complex. Different by 19 orders of magnitude in fact. If we're totally inconsistent about it then we can end up like Sisyphus and his bolder. It doesn't have to be. The basic datastructures of genealogy are finite fields of information; genetics are the benchmark for progress here. Our genetic sequencing technologies are beating Moore's law in terms of performance rate, accuracy, and precision. That is what we can expect at best based on state of the art bio-informatics in anthropology.

At the far end from here is Watson and the Google Deep learning algorithms. We could have a Genealogy@Home project going on worldwide. We could supercompute this problem, and we could generate the Global Family Graph back a couple thousand years in a couple of decades tops. In a couple generations, we could trace back a couple million years and start to link up with the tree of biological evolution. Couple generations after that we would have the global family graph back a couple billion years to the earliest known organisms.

First we need to get our busy-ness in order. Use sensible archival systems and take advantage of the state of the art library and information retrieval science. Wikitree has a whole bunch of data, but I haven't seen much analysis. I want statistics; I want white papers. I want open source science and publications. I want to know what percentage of the global graph is complete and checked to high precision consistency.

commented Apr 19, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Salgo, people are allowed to disagree. We're doing that in this thread quite a bit. I am reporting my experience; I am not speaking prohibitions. I am simply stating propositions then inferring consequences from them. People who like the 40 year old software are more than welcome to keep using it. I am not saying the GEDCOM has to be entirely abandoned or anything like that. I am saying that no archival format presently exists outside of proprietary data packaging standards specifically for genealogy data, model, and theory. I am reporting contradictions that I have discovered in the models which indicate inconsistencies produced within the theoretical system.

And it was said in the thread that Wikitree is a volunteer force. My point here is that there exists a huge volunteer force which is dedicated to file management, data validation, databasing, query software, deep learning and data mining, and high quality archival technology. GEDCOM X is a step in the right direction, but the fundamental problem of grouping the data into a model for communication and distribution has many known solutions in the package management world of *nix operating systems. Free open source software, the creative commons, and the GNU communities have the technologies that can be bundled together to define a specification of what amounts to a fancy zip format. A direct extension of that technology actually. Supplemental to the GEDCOM and GEDCOM X semantic variants. A way to more effectively share our genealogy models that allows us to use the command line tools and scripts of the *nix environment to bulk process the global family graph using distributed processing network like that of the BOINC.

It isn't asking much, and it only needs a couple of interested individuals with the technical know how. Chances are the Free/libre open source community is are already working on parts of it. Wikitree can be an active part and supporter of that distributed project.

commented Apr 19, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Very yes. That is definitely what I am thinking about. So I could use the format to do something like

$gengraph install picasso-1

And download the entire collection of works recorded in the public domain by Picasso or about Picasso.

And best yet, I could do the same thing with somebody "non-notable" to get a better idea of who they were, and where they went, and what they accomplished. I've already discovered some really neat facts about my family; I've learned stories that I have never been told by my living relatives. Marriage licenses and census records can tell interesting stories; I learned that five generations ago my family had estates with servants. I learned that my family can be found in Aberdeenshire, Scotland. My grandpa once told me that our family had a castle, and from what I've learned from reading about the Mcleans, we actually have several. I probably have some title. It would be neat to be able to build libraries of my ancestor's works from my genealogy.

commented Apr 19, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

I'll note that GRAMPS does some of the databasing improvements, but it is unweildy and is designed for legacy support of genealogy data standards based on GEDCOM and the historical authorial method rather than scientific-anthropological method.

One of the things that occurs to me is that whatever the data standard is that emerges, it needs to be design with a unit testing experimental model in mind. Basically, people each create their own version of a genealogy with wildly varying spelling and choice of location names, so the further back you go the more divergent variations of the same genealogies you have. I would like the ability to take the census data and generate trees based off of it and compare those to available genealogies; there needs to be a method of eliminating genealogies based on counter evidence. The problem is already huge, and we need methods of reducing it down to less incorrect models.

commented Apr 20, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

1 Answer

Answer 1 · 2016-04-18T10:48:11+0000

First yes GEDCOM standards are using a database format from the 1980's,but if you deviate too much then most programs will not work with the new standards.

Second I believe that the transferring of sources is a problem because the receiving program wants that information their own way and that is not compatible with the way other programs display things.

Every program and web site has their own conventions about name display and location naming so until there is uniformity among ALL programs and Web sites the name and location problem will exist.

I have been working on genealogy programs since the 1980's and have used a very large number of them and I can say that WikiTree is the most unique in the formatting of data including names and location names, that is not bad, but because of that if a GEDCOM is going to function for most programs there will always be problems with a few.

I have found that it is actually less work and takes less time to create profiles on WikiTree manually than it is to upload a GEDCOM and do the resulting clean up later. The real problem is that most do not see that until they upload a GEDCOM and then a lot of those people just do not bother to fix their uploads.

answered Apr 18, 2016 by Dale Byers G2G Astronaut (1.7m points)

So transfer the whole source as it is. Define formats for reinterpreting or mapping the original source to a digital representation appropriate to application specific domains. Parsers, Interpreters, and transcription technologies are OLD at this point. Much older than 1980s tech. No excuse other than trying to prop up antiquated technologies which should be a legacy concern for contemporary software systems.

The way you get uniformity is not by avoiding defining and developing specific technologies but by making something so useful that everyone uses it. GEDCOM is mostly the defacto standard because most people are conforming to the early tech of the LDS church. After having reviewed the LDS church's Familysearch and Roots software, I can say with confidence that there are far better ways to do all of this.Cheaper. More accessible. More usable. Friendlier.

Any solution which is "manually enter the data by hand" in this domain is stupid. Supremely stupid. For me alone, the maximum number of ancestors is 2^n; at 100 generations (about the span of agrarian human history) there are more entries to be parsed, transcribed, translated, annotated, linked, verified, and compared than there is storage in all the world's servers presently. When you take into account the 7 billion people on the planet then the sum of them is Sigma(2^n) max entries. Even with relatively high consanguinity (IE redundancy in the family graph), we're taking about a number of entries (multiplied by the number of sources and citations) approaching the Sigma(2^n) maximum. You have to understand that kind of complexity is already not effectively computable.

We can do better, and we should, I think. Wikitree isn't the end of genealogy; the data we enter here is not definitive and conclusive, and the tree we construct is a exceedingly small fragment of the whole. At some point, the data will be migrated to a new system, a new Wikitree version, and when that happens, you are effectively advocating that we simply re-enter the majority of the data by hand.

commented Apr 18, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Wikitree is by no means comprehensive. It presently skews very heavily towards US genealogies and genealogies completed in English. You have to understand that the number of humans living today are far less than the number of all humans who have ever lived. If Wikitree is to be truly global in its extent then it will eventually have to accommodate the genealogies of Chinese, Indian, and Native American people. We're presently talking about 7 billion profiles for only the living people right now; Wikitree is barely a ten million profiles of both living and dead people. (7 billion * 2^n). If we go 2 generations out (only the grandparents of everyone presently living) then it is (7 billion * 2^2) or a maximum of 28 billion profiles. Even supposing that there are half as many people due to redundancies, we're still talking about 14 billion people which is still a thousand times the size of the entire Wikitree project from 2008 to 2016. If it takes 8 years to produce 10 million profiles and the rate is constant then it will take 8000 years to get almost two generations of today's population.

That is both a Sisyphean task and an infeasible way to process all the information available. And to be clear, I am not talking about the number of connections in the family graph. I am only talking about the number of profiles; the number of links between the profiles will generally vastly exceed the number of total profiles. It is simple combinatoric arguments.

commented Apr 18, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)
edited Apr 18, 2016 by Ian Mclean

Ian, You yourself said that the required number of profiles would exceed the current server space in the world. Yes WikiTree is skewed toward the US, that is because it is based in the US and most of the members are from the US, but we do not limit ourselves to only profiles of US people, in fact I have created a profile, manually, for a Japanese citizen and even added his name in their language. Is it perfect? No. But it is evolving. The root of the problem is something that you and all of your math are ignoring, you want to change something that is used by hundreds of programs and has been in use for almost 40 years, and those changes are way beyond the current abilities because the GEDCOM output form every program is slightly different and you would need to account for all of those variables to have it work smoothly. There have been a large number of improvements to the way GEDCOMS are imported and how profiles are created form them but you need to understand that WikiTree does not have a large staff, it is mostly volunteers to enable them to keep the cost to the users FREE.

commented Apr 18, 2016 by Dale Byers G2G Astronaut (1.7m points)

You don't understand the basic mathematics of the genealogical problem.

When I say links, I mean between individuals. Child to parent. spouse to spouse. Child to child. What is called graphing and is governed by the mathematical theory of graphs. Also ruled by network theory.

So my assertions about links between profiles is without taking into consideration links between documents and profiles. That is another term in the equation which generally increases the value that I was asserting.

I have one ancestor who has eighteen children. That means 19 profiles and 18 links between child and one parent and C(18, 2) links from child to child. 36 links between children and two parents. I am only concerned with necessary links in a family graph not with optional links like Friend of a Friend.

commented Apr 18, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Your theoretical maximum of 2^n at 100 generations or roughly (rounded down) 1.26x10^{^30} is entirely flawed. The actual estimates of the total number of human beings that have ever lived on the planet is closer to 1.08x10^¹¹ - a quick division of these two numbers shows you have overestimated the size of 'the problem' by a factor of about 1.17x10^^19.; The 28 billion profiles (7 billion x 2^2) that you suggest are just 2 generations actually represents about 1/4 of 'humanity'. Then there is of course the fact that we do not have records going back to the "start of agrarian human society", nor for a vast number of those people who existed - with enough samples we may be able to use DNA to 'rebuild' the people who must have existed that connect us, but there are many who left no descendants and no DNA trail for us to follow so we will never be making profiles for the entirety of agrarian human society.

When you state the problem in remotely realistic terms, you might find people give it more consideration; and possibly using "conversational" language rather than jargon might help people better understand what you are suggesting - one of the most important things in communicating is considering your audience.

commented Apr 19, 2016 by Rob Ton G2G6 Pilot (291k points)

The maximum is the maximum. I know the actual number is significantly lower than the maximum because humanity is not a family tree but a family graph or as I like to refer to it, a family shrub. The consanguinity (inbreeding) is in general relatively high especially among royals. But just theoretically, I have two parents. They have two parents each. Their parents have two parents each. And so on. That progression is 2^n. That's the maximum number of people for a bi-sexed reproductive species based on a progression of generations.

The actual number will be 2^n - c where c is the consanguinity factor based on how many ancestors appear multiple times in a given family graph. However, We don't need to worry about going back a hundred generations or even about the difference being theoretically 19 orders of magnitude smaller. 10^11 is still a huge number compared to 10^6. At the rate Wikitree has been filling out family histories, we're still looking at several thousand years worth of data entry by hand to get as much of the human family graph filled out as is possible using historical records, anthropological artifacts, and biophysical evidence like genetics. I'll actually leave aside the whole theoretical reality that data in our universe can't actually be erased (1st and 2nd thermodynamic laws), so we can eventually recover at least almost all lost data using forensic physics.

For my purposes, it suffices to show that if we want to recreate the family graph back to say 30-50 generations, the problem is still prohibitively expensive in terms of time if we're going to enter everything by hand and migrate the data from software revision to software revision by hand each and every time. We spend all our time updating a past that gets further and further from us each, and every time we have to migrate the data to the new updated system. And if human generations keep growing in size then we will never actually get any of the near generations done because their size becomes comparable to the size of all previous human generations.

commented Apr 19, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

In simple terms for others to understand Ian wants to make a change because he can not recreate every person who ever lived based on flawed math, but using assumptions again based on the premise that even with the changes we will never be able to store this information. So he wants everyone to change the way we do things to enable something that can not be done even with the changes.

He is also expecting that WikiTree and all other software platforms will not evolve but rather be completely replaced and the history of these programs proves that this assumption is not always the case. By his own statements at least one other platform has been around for over 30 years and shows no sign of going away and that is familysearch.org.

commented Apr 19, 2016 by Dale Byers G2G Astronaut (1.7m points)
edited Apr 19, 2016 by Dale Byers

The LDS ceased effectively using (depreciated) a database of 40 million individuals when they switched to the GEDCOM format. They've been mining that archive, but they simply moved to a completely new data model rather than attempt to simply and exclusively transform the old data by extension.

And it has created a lot of cruft. I've been spending the past few days going through and consolidating the sources and profiles in my family tree. I've found more than a few instances where there was in excess of three duplicate profiles to be merged not including the merger of duplicate spousal families and duplicate children. I've already verified that Charlemagne (First Emperor of Rome) has on the order of hundreds of duplicate profiles and hundreds of duplicate trees, not all the trees and duplicates agree on the specific properties of Charlemagne or the members of his family. Wikipedia and Dbpedia is not a standardly indexed source. I fully expect that the tendency towards redundancy increases with the age of the generation. Data becomes almost totally unreliable within 10 generations. Data beyond about 10 generations from me ceases to meet my criterion for retention; the amount of speculation and unsourced entries is staggering. It wasn't a surprise to me to find that one of the Viscounts of the Franks (Thouars) was mistaken for his descendant's descendant and for an ancestor.

A sample of 13 generations showed that a pull of more than 15 generations was not viable on a common computer with a fiber optic connection and the percentage of irreconcilable database errors grew to an impractical level. The highest number of profiles successfully extracted from the archive was 10,000-20,000 profiles with an increasing number of duplicates.

From the perspective of operating system archival retrieval and standard business databasing benchmarks this is catastrophic. Windows 3.x works slightly better than GEDCOM systems. We can do better, and even the people who insist on hand entry will find great benefits from overhauling the data standards for genealogy.

commented Apr 19, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Categories

Why aren't genealogical collections using contemporary data formats?

Getting Started

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Related questions