What Would It Take to Get Genealogists to Use Version Controls?

Question

What Would It Take to Get Genealogists to Use Version Controls?

3 Answers

Best answer

Hi Ian,

Welcome to WikiTree. Most people here are not versed in computer science stuff. I am, and so is Magnus Sälgö, and Aleš Trtnik too. There may be a few others.

The first thing to do, is to state the problem that you're trying to solve. Your post comes across like you're trying to teach graphing theory to genealogists, without any reason to do so. You say that merging has a problem, but you don't state exactly what the "merging problem" is, or why it is a problem here on WikiTree. Only after that happens can you hope to propose a solution that might work here on WikiTree. Realize that it might not work either.

As to the overall issue of "version control" within genealogy, this is why I like WikiTree. Because it is based on an underlying wiki, there already is version control. You can go to the Changes tab of any profile, see exactly what changed, and roll back to any previous version. The only exception to this is with merging. However, in practicality, there haven't been too many issues with merging, and not having enough version control, here on WikiTree.

There are other, bigger, problems to solve. And most of them usually center around teaching people about sourcing profiles, how to do it, and getting people to learn how to collaborate in the first place.

If you want to solve computer science problems around genealogical data structures, you're probably better off to experiment with your own software and post it on GitHub to see if it solves whatever the problem might be.

answered Jul 31, 2017 by Eric Weddington G2G6 Pilot (520k points)
selected Aug 1, 2017 by Julie Campbell

I work in the Open Source software field.

Talk is just that, talk. GitHub is nothing more than a hosting site for repositories of source code. I'm not sure what a "GitHub for genealogy" would actually mean.

One of the fastest ways to improve the state of the art, any art, is to build off of what exists today. Don't start something completely new, unless there is really no other choice.

If there are opportunities where others can help contribute to the underlying technology of WikiTree, that would be great. It would have to start first with a list of bugs and feature requests available to the public. It's always better to point out actual, tangible problems that need solving, rather than problems that are largely academic or theoretical.

commented Aug 2, 2017 by Eric Weddington G2G6 Pilot (520k points)

@Eric agree...

I think the choice of using the wikimedia engine as base for WikiTree was great.... a wiki sharing knowledge is what also genealogy has in common....

I am not sure if the WIkiTree community is willing to do it more advanced.... but I feel Aleš has proved that WikiTree can be developed in away that benefits everyone....

Lesson learned is that genealogy on the web is getting crazy and less and less "old fashion" genealogy and much more "click and forget" found today a person with 33 siblings and no one question that.... nearly no sources added see Profitt-104

My feeling
Wikitree should move in direction structured data. See and learn from the Wikipedia community..... that is nearly the same and has the same problem as Wikitree with trust....

The problem right now with Wikitree is that all the structured data and knowledge about Wikitree is in Aleš software and nothing we can use/update/query in an efficient way....

The roadmap of WIkipedia

5 years ago they added better support for structured data = Wikidata
Today they feel there is a problem with citation so next step is building something called WikiCite to support that better references

= the same problem WikiTree and genealogy has

commented Aug 2, 2017 by Living Sälgö G2G6 Pilot (297k points)
edited Aug 2, 2017 by Living Sälgö

LoL. I have been on WikiTree for about a year now. #451 as of 05-08-2017 on G2G forums according to the statistics. I have talked about the Merging Problem, and I am not the only one. It has been part of the digital genealogy discussion for probably a few decades; on here, it is a persistent subject of discussion going back to the origin of the forums and WikiTree.

The basic merging problem is that merges are irreversible & do not take into account matches/mismatches between family structures as well as matches/mismatches between singular profiles in comparison.
Discussed at least in these threads:

https://www.wikitree.com/g2g/383565/proposals-concerning-some-bigger-issues
https://www.wikitree.com/g2g/251692/merges-often-between-entire-branches-family-trees-comparisons

The problems of merging are not merely academic or theoretical. Like the problems of the consistency of the data in genealogies both on WikiTree and on other digital genealogies. Issues which I discussed at length and which Aleš has turned into the Data Doctors and DB_errors project because they had practical consequences. Notably in both cases database corruption and data loss.

WikiTree as an entire format for a digital genealogy is not lossless. It is not archival quality, and in time, it will break down. That isn't theoretical or academic. That is a function of its current state and typical operation. Like a JPEG being copied and pasted a million times. Artifacts develop & multiply.

Controls are being put into place, and the situation has improved immensely. But the way merges are handled is a major root to the problem. Over a long enough period of time, the merges will cause significant drift resulting in significant corruption of the WikiTree program. The issues with data loss already impacts the community in an almost invisible but significant way in that it drives people interested in archival quality genealogy away. Several major family genealogists that I have sought to collaborate with on WikiTree have turned me down because they either knew someone who had lost a significant amount of work on WikiTree or who had themselves lost a significant amount of work on WikiTree.

The wiki format is a partial versioning system, but it does not operate as a versioning system for all the WikiTree graphs. You get my meaning? There is versioning control for individual profiles which is where WikiTree is strongest. Then there is the graph of the family in which a profile resides; there is an implicit versioning system built into that which extends from the versioning of the profiles and requires a human per profile review of the versioning history with comparison to all other profiles in the graph, but it does not preserve data about combinations or permutations of the family graph as a whole. Then there is the collection of all graphs considered together; there is only recently started to be the means for pulling the WikiTree global graph(s) as a whole for analysis, and there is no method for pushing, branching, or merging versions of WikiTree as a whole.

This is a "all eggs in one basket" situation. Where if WikiTree as a monolith fails in some unexpected way due to uncontrolled & unanticipated programmatic conditions then we lose the whole thing. There is also the problem of what happens if a significant contingent of editors ever become hostile or hostile invasions occur due to political, economic, or corporate motivations. The library of Alexandria represents a significant loss of many versions of human history.

So on a whole, there is not version controls for WikiTree. There is a singular version of WikiTree which is mutating lossly every day with an as yet uncertain degree of accuracy & precision. And it is influenced pretty much at all times by what is going on out in the wilderness beyond the WikiTree domain proper; the fact that almost no digital genealogy sites use proper versioning controls or data verification and validation presents an ongoing issue or threat to the WikiTree project.

From the perspective of user convenience & collaboration, there is the issue that the problems with GEDCOMs & versioning was controlled by bottlenecking the GEDCOMs. This sidestepped a real problem in that every addition to the WikiTree global graphs is provisional. Provisional, but WikiTree doesn't have a means of rejecting or refuting or really handling wholly incorrect, contradictory, or fictional data with little to no way to safely expunge such things. As a result, they often get "merged away" which then becomes tangled webs of concentrated error, fiction, & eventually some amount of real and valid family data in some graph we're going to actually want to extract someday.

It wasn't until we started really getting into the meat and bones of error classification that we even recognized that there were graphs of Arthur Pendragon and other problematic profiles and family graphs. It wasn't until earlier this year that we had quantification on the magnitude of errors, consistency, and the fact that most errors are traceable to a few prolific editors. This can be neatly solved by implementing at least a two layer system. One in which data is disseminated as a hypothetical version of relations and events (proliferation), and the other is in which data is locked into a theoretical framework by available source and evidence (validation, formal genealogical proof, and refutation).

Basically an alpha, beta, final type structure for individual digital genealogies which are operated on concurrently with a place for shear proliferation of speculative genealogies which can be honed down into more reliable versions supported by passing quality assurance tests and being certified in quantifiable terms.

commented Aug 5, 2017 by Ian Mclean G2G6 Mach 1 (13.6k points)
edited Aug 5, 2017 by Ian Mclean

Julie, that is blatantly not true. A versioning system can be setup that keeps track of changes and dependencies in the forms of graphs such that a family graph can be zipped/composed and unzipped/decomposed without loss of data and with continuity of history. Merges can be reversible both for individual profiles and for entire family graphs. WikiTree merges are not totally reversible as various kinds of information are lost practically irretrievably in merges. Every merge like every GEDCOM upload is a risky operation.

If you read the links I provided at the top of my previous comment then you would see that I have spoken about GEDCOMs and the general quality of digital genealogies. The fact that GEDCOMs are junk and have view to no quality assurance properties is closely linked to why many digital genealogies including WikiTree are plagued by bad data. It is in the design of the format and the absence of usage of modern data standards.

The downside of GEDCOMs is their huge problematicness. But we're not going to build a relatively complete digital genealogy completely by hand spanning from today to 1CE. And eventually all the data on WikiTree will need to be migrated to whatever the next systems are; it is a mistake to believe that the current version of WikiTree is the be all, end all of digital genealogy. This is a prototype and proof of concept for what comes next.

Magnus Sälgö and I are on the same page when it comes to where this is all headed. Eventually, the digital genealogies are going to be underwritten by what are called ontologies. Languages describing the semantic markup of open linked data.

It is not at all useful to anyone to respond to "what is the future of the data standards for our field, and how do we implement them?" with "it is useless to do anything about it and we should just keep doing what we've always done."

My background is in theory of computation, programming, and physics. I recognize the distinction between data and program, and I also have a background in information security and archival methods. Data is not actually at the level of hardware distinctly different from the hardware and the configurations of state that is represented as software or programs on hardware. WikiTree experiences demonstrable underflow and overflow errors; there are a variety of places in which problems have been patch fixed, and the system as a whole has not been code reviewed or proven code complete; this is to say that WikiTree is not provably error-free at the level of its software or hardware operation. It isn't held to the best engineering standards and it is cobbled as needed and practical by a small dedicated volunteer force with limited time and interest in extensive testing (thank you everyone who holds the ship together with baling twine and duct tape!).

As such, the data is not always handled in strictly safe ways by the underlying system. This results in erratic behaviors of profiles and graphs from time to time which are usually resolved by patch fixes and administrative intervention. This is actually really typical of any fundamentally open generally recursive functional system and is related to the independence of some propositions or data and the general undecidability of Turing machine equivalent systems.

Over a long enough time line the survival of all things tends to 0. Right now, WikiTree is sparse with respect to the total available world historic global graph. There are 7 billion people on Earth almost 8 billion now. Between 1 AD and now there is a comparable number of people. Say there has been 20 billion people in total and my profile is repesentative of the size of each profile, my profile is 1.5MB in size so a 20 billion profile database will be a minimum of around 30 petabytes. I don't actually believe for a moment that my profile is representative or that 30 petabytes will be sufficient for the historical global graph. But it serves as a model for discussion purposes. 30 petabytes has 2^(30E15) possible states which is a number too big for Google's common calculator; this is to say that the space in which errors can occur in the system over the lifetime of WikiTree is huge. In the long term, a small scrappy team of dedicated devs and volunteer testers will never root out all the critical errors by bruteforce, trial-n-error methods.

By similar argument, the database currently would be around 2E12 Bytes or 2TBs of information. Roughly speaking, WikiTree grew about 33% in a year in terms of total profiles. The content data itself has a consistency error rate of about 1% to 1.5%; I am not sure I've ever seen tracking on the error rate of the WikiTree server or software themselves, but I can tell from G2G posts that it is significantly higher than 0%.

This is all to say that 1) we can't assure that our data is in general safe for all time and 2) we have a limited window within which to backup that data to multiply redundant sources for future iterations of the global graph(s) and 3) we can't assure that correct information and graphs will continue to be correct as WikiTree grows and mutates.

WikiTree does an okay job of preserving consistency of the graph for practical operation in the immediately forseeable future. But it is notably limited, and it is notably limited by its operational data formats & standards. But fundamentally this isn't just about WikiTree, this is about the proliferation and preservation of the data and hard work that WikiTree and other genealogies represent. Eventually, the global graphs are going to merge whether the corporations controlling them want it to happen or not. That merger can be totally unsafe and ad hoc resulting a lot of pain, data loss, and unnecessary obstacles or we can work on the merging problem to prepare our data for the future and posterity.

commented Aug 6, 2017 by Ian Mclean G2G6 Mach 1 (13.6k points)

That was all just the relatively low-level stuff.

Here's a question to think about: how many generations would you attest to being 100% accurate in a court of law in your direct ancestor graph?

Like what percentage of your direct ancestor graph are you almost absolutely certain has all the correct relations?

I've seen several ancestors in different versions of my direct ancestor graphs where they are listed with several parents. I've corrected more than a few such profiles after refuting a couple of possibilities. It is a relatively common problem for analog and digital genealogies to contain multiple possible versions of relations and events. Right now in the majority of cases and systems, those multiples are collided into a model which assumes singularity of relations and events. With FamilySearch data, this results in the most aweful headache of a tangle. The software is in many cases unable to merge thousands of copies of ancestors (my go to is Charlemagne I) because from the perspective of the software there should only be 1 copy of that ancestor and anyone with different family graphs is a different person. 1) people none the less have created many many many copies of the ancestor and versions of their family graph 2) you have to merge duplicates from the latest generation (roughly 20-30 generations) back to all the Charlemagne I before you can start merging away the Charlemagne I duplicates.

How confident are you that you're going to get all those merges correct and end up with exactly and only the correct version of his genealogy?

Now how confident are you that everyone else is going to manage the same or similar feats with the genealogies of interest to them?

How confident are you that the other users will find all their ancestors that already exist in the data and properly link them into newly generated graphs to form exactly and only correct genealogical graphs?

How confident are you that they will not fail to find all such ancestors and will not create duplications or will find all such ancestors and not create different versions of their genealogical graph?

What do you suppose the rate of agreement and disagreement is among genealogists past, present, and future?

From this view of multiple genealogies in singular graphs like FamilySearch or WikiTree and the view that there are multiple "global graphs" like Geni, Ancestry.com, WikiTree, MyHeritage, FamilySearch, Wikipedia/Wikidata, and the others, how many versions of the global graph do you suppose there are right now? How many for a single distant ancestor like Charlemagne I?

commented Aug 6, 2017 by Ian Mclean G2G6 Mach 1 (13.6k points)
edited Aug 6, 2017 by Ian Mclean

Answer 1 · 2017-07-31T22:00:24+0000

One other thing which isn't used as often as it should be is using ~~~~ to provide a date time stamp. I use in on research notes, but I'm sure other uses are valuable. The only problem is that it's a one time use. That is once you add it it replaces the 4tildes with their date-time equilivent and if you want to change it you need to erase or overwrite the date-time... Oh, it also provides your WikiTree ID along with the date-time.

Categories

What Would It Take to Get Genealogists to Use Version Controls?

Please log in or register to add a comment.

Please log in or register to answer this question.

3 Answers

Please log in or register to add a comment.

Please log in or register to add a comment.

Please log in or register to add a comment.

Related questions