What Would It Take to Get Genealogists to Use Version Controls?

+2 votes
325 views
The basics of a common method of version control can be found at the following link: http://ericsink.com/vcbe/html/directed_acyclic_graphs.html

A directed acyclic graph (DAG) is a data structure useful for producing multiple versions of a history that is otherwise actually linear in structure. GitHub and ScienceHub use DAGs.

I will note that one of the most vexing problems in genealogy is neatly solved by DAG-based version control: the problem of merging.

I will also note that family ancestry graphs are generally going to be directed acyclic graphs rather than trees.
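To make the tree-versus-DAG point concrete, here is a minimal Python sketch. The family, the names, and the helper functions `count_paths` and `is_acyclic` are all made up for illustration; it assumes Python 3.9+ for the standard-library `graphlib` module. When cousins have a child, a shared ancestor is reached by more than one path, which a tree cannot represent but a DAG can.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical family, child -> list of parents.
# Bo and Cy are siblings; their children Xan and Yva (first cousins) have a child, Zed.
parents = {
    "Zed": ["Xan", "Yva"],
    "Xan": ["Bo"],
    "Yva": ["Cy"],
    "Bo": ["Al", "Di"],
    "Cy": ["Al", "Di"],
}


def count_paths(person, ancestor, graph):
    """Count the distinct lines of descent from `ancestor` down to `person`."""
    if person == ancestor:
        return 1
    return sum(count_paths(p, ancestor, graph) for p in graph.get(person, []))


def is_acyclic(graph):
    """True if nobody is their own ancestor, i.e. the graph has no cycle."""
    try:
        tuple(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


# In a tree there would be exactly one path to any ancestor; here there are two.
paths_to_al = count_paths("Zed", "Al", parents)
acyclic = is_acyclic(parents)
```

Here `paths_to_al` comes out as 2 (one line through Xan and Bo, one through Yva and Cy), while `acyclic` is still true: the structure is a DAG but not a tree.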
in WikiTree Tech by Ian Mclean G2G6 Mach 1 (12.4k points)

3 Answers

+18 votes
 
Best answer

Hi Ian,

Welcome to WikiTree. Most people here are not versed in computer science stuff. I am, and so is Magnus Sälgö, and Aleš Trtnik too. There may be a few others.

The first thing to do is to state the problem that you're trying to solve. Your post comes across like you're trying to teach graph theory to genealogists, without any reason to do so. You say that merging has a problem, but you don't state exactly what the "merging problem" is, or why it is a problem here on WikiTree. Only after that happens can you hope to propose a solution that might work here on WikiTree. Realize that it might not work either.

As to the overall issue of "version control" within genealogy, this is why I like WikiTree. Because it is based on an underlying wiki, there already is version control. You can go to the Changes tab of any profile, see exactly what changed, and roll back to any previous version. The only exception to this is with merging. However, in practice, there haven't been too many issues here on WikiTree with merging or with insufficient version control.

There are other, bigger, problems to solve. And most of them usually center around teaching people about sourcing profiles, how to do it, and getting people to learn how to collaborate in the first place.

If you want to solve computer science problems around genealogical data structures, you're probably better off to experiment with your own software and post it on GitHub to see if it solves whatever the problem might be.

by Eric Weddington G2G6 Pilot (229k points)
selected by Julie Campbell
Now THAT is an answer I can understand!!  LOL  Thanks Eric!!
I am a software developer, too, and I've used all sorts of version control software. I set up an intranet MediaWiki instance for our engineering group about 10 years ago. The beauty of MediaWiki (on which WikiTree seems to be based as well) is the built-in version control. When I started my engineering wiki, there was a lot of belly-aching about "someone is gonna change my stuff and I won't be able to get it back". It hasn't happened, and the then-naysayers have been pretty happy with it. If WikiTree gets a malicious user, their changes can be reverted without too much effort.
Thanks for the comment, Julie! Good to see others here with a SW engineering background.
Really great answer, Eric. Like many of your answers.

Matt Misbach, who works at FamilySearch, has been trying to get independent developers to sign on to the idea of creating a GitHub for genealogy. My response was great, go for it. It might be better than WikiTree. But WikiTree is where I'm focused.
I work in the Open Source software field.

Talk is just that, talk. GitHub is nothing more than a hosting site for repositories of source code. I'm not sure what a "GitHub for genealogy" would actually mean.

One of the fastest ways to improve the state of the art, any art, is to build off of what exists today. Don't start something completely new, unless there is really no other choice.

If there are opportunities where others can help contribute to the underlying technology of WikiTree, that would be great. It would have to start first with a list of bugs and feature requests available to the public. It's always better to point out actual, tangible problems that need solving, rather than problems that are largely academic or theoretical.

@Eric agree...

I think the choice of the MediaWiki engine as the base for WikiTree was great. Sharing knowledge is what a wiki and genealogy have in common.

I am not sure if the WikiTree community is willing to make it more advanced, but I feel Aleš has proved that WikiTree can be developed in a way that benefits everyone.

The lesson learned is that genealogy on the web is getting crazy: less and less "old-fashioned" genealogy and much more "click and forget". Today I found a person with 33 siblings, and no one questioned it. Nearly no sources were added; see Profitt-104.

My feeling: WikiTree should move in the direction of structured data. See and learn from the Wikipedia community, which is nearly the same and has the same problem with trust as WikiTree.

The problem right now with WikiTree is that all the structured data and knowledge about WikiTree is in Aleš's software, and it is nothing we can use, update, or query in an efficient way.

The roadmap of Wikipedia:

  1. 5 years ago they added better support for structured data = Wikidata
  2. Today they feel there is a problem with citations, so the next step is building something called WikiCite to better support references

    = the same problem WikiTree and genealogy have

LoL. I have been on WikiTree for about a year now. #451 as of 05-08-2017 on G2G forums according to the statistics. I have talked about the Merging Problem, and I am not the only one. It has been part of the digital genealogy discussion for probably a few decades; on here, it is a persistent subject of discussion going back to the origin of the forums and WikiTree.

The basic merging problem is that merges are irreversible & do not take into account matches/mismatches between family structures as well as matches/mismatches between singular profiles in comparison.
Discussed at least in these threads:

https://www.wikitree.com/g2g/383565/proposals-concerning-some-bigger-issues
https://www.wikitree.com/g2g/251692/merges-often-between-entire-branches-family-trees-comparisons

The problems of merging are not merely academic or theoretical, and neither are the problems of data consistency in genealogies on WikiTree and elsewhere. I discussed those issues at length, and Aleš has turned them into the Data Doctors and DB_errors project because they had practical consequences: notably, in both cases, database corruption and data loss.

WikiTree as an entire format for a digital genealogy is not lossless. It is not archival quality, and in time, it will break down. That isn't theoretical or academic. That is a function of its current state and typical operation. Like a JPEG being copied and pasted a million times. Artifacts develop & multiply.

Controls are being put into place, and the situation has improved immensely. But the way merges are handled is a major root of the problem. Over a long enough period of time, the merges will cause significant drift resulting in significant corruption of the WikiTree program. The issues with data loss already impact the community in an almost invisible but significant way in that they drive people interested in archival quality genealogy away. Several major family genealogists that I have sought to collaborate with on WikiTree have turned me down because they either knew someone who had lost a significant amount of work on WikiTree or had themselves lost a significant amount of work on WikiTree.

The wiki format is a partial versioning system, but it does not operate as a versioning system for all the WikiTree graphs. You get my meaning? There is version control for individual profiles, which is where WikiTree is strongest. Then there is the graph of the family in which a profile resides; there is an implicit versioning system built into that, which extends from the versioning of the profiles and requires a human, per-profile review of the versioning history with comparison to all other profiles in the graph, but it does not preserve data about combinations or permutations of the family graph as a whole. Then there is the collection of all graphs considered together; only recently has there started to be a means of pulling the WikiTree global graph(s) as a whole for analysis, and there is no method for pushing, branching, or merging versions of WikiTree as a whole.

This is an "all eggs in one basket" situation: if WikiTree as a monolith fails in some unexpected way due to uncontrolled & unanticipated programmatic conditions, then we lose the whole thing. There is also the problem of what happens if a significant contingent of editors ever becomes hostile, or if hostile invasions occur due to political, economic, or corporate motivations. The loss of the Library of Alexandria represents a significant loss of many versions of human history.

So on the whole, there is no version control for WikiTree. There is a singular version of WikiTree which is mutating lossily every day with an as yet uncertain degree of accuracy & precision. And it is influenced pretty much at all times by what is going on out in the wilderness beyond the WikiTree domain proper; the fact that almost no digital genealogy sites use proper version control or data verification and validation presents an ongoing issue or threat to the WikiTree project.

From the perspective of user convenience & collaboration, there is the issue that the problems with GEDCOMs & versioning were controlled by bottlenecking the GEDCOMs. This sidestepped a real problem in that every addition to the WikiTree global graphs is provisional. Provisional, but WikiTree doesn't have a means of rejecting, refuting, or really handling wholly incorrect, contradictory, or fictional data, and it has little to no way to safely expunge such things. As a result, they often get "merged away", which then produces tangled webs of concentrated error, fiction, & eventually some amount of real and valid family data in some graph we're going to actually want to extract someday.

It wasn't until we started really getting into the meat and bones of error classification that we even recognized that there were graphs of Arthur Pendragon and other problematic profiles and family graphs. It wasn't until earlier this year that we had quantification on the magnitude of errors, consistency, and the fact that most errors are traceable to a few prolific editors. This can be neatly solved by implementing at least a two layer system. One in which data is disseminated as a hypothetical version of relations and events (proliferation), and the other is in which data is locked into a theoretical framework by available source and evidence (validation, formal genealogical proof, and refutation).

Basically an alpha, beta, final type structure for individual digital genealogies which are operated on concurrently, with a place for sheer proliferation of speculative genealogies which can be honed down into more reliable versions supported by passing quality assurance tests and being certified in quantifiable terms.
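As a rough illustration of the two-layer idea, here is a minimal Python sketch. The `Stage`, `Claim`, and `promote` names are hypothetical, and "has at least one source" is only a stand-in for whatever validation or proof standard would really be required; nothing here reflects how WikiTree actually stores data.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    HYPOTHETICAL = "hypothetical"  # proliferation layer: speculative, unverified
    VALIDATED = "validated"        # validation layer: backed by evidence


@dataclass
class Claim:
    statement: str
    sources: list = field(default_factory=list)
    stage: Stage = Stage.HYPOTHETICAL


def promote(claim):
    """Move a claim into the validated layer, but only if it cites a source."""
    if not claim.sources:
        raise ValueError("cannot validate a claim with no sources")
    claim.stage = Stage.VALIDATED
    return claim


claim = Claim("Person A is the father of Person B")
try:
    promote(claim)            # rejected: nothing supports it yet
    rejected = False
except ValueError:
    rejected = True

claim.sources.append("hypothetical parish register entry")
promote(claim)                # accepted now that a source is attached
```

The design point is only that unsourced data can exist and spread in one layer without ever being mistaken for the checked layer.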

The same merging issues are completely possible with all version control systems, and would still happen under the tightest version control systems. I think GEDCOM importation, particularly given the "quality" (the complete lack thereof) of some of the trees floating out there in internet land, is a much greater issue. Garbage in, garbage out.

The data is not going to corrupt the program. The data could be corrupt, but it isn't going to break WikiTree itself.

I think the merges that have been done are in the WikiTree database, and it's just a bad user interface that doesn't display the whole history.

This is not rocket science; it's just handling simple text.

Julie, that is blatantly not true. A versioning system can be set up that keeps track of changes and dependencies in the form of graphs such that a family graph can be zipped/composed and unzipped/decomposed without loss of data and with continuity of history. Merges can be reversible both for individual profiles and for entire family graphs. WikiTree merges are not totally reversible, as various kinds of information are lost practically irretrievably in merges. Every merge, like every GEDCOM upload, is a risky operation.
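A minimal Python sketch of what a reversible merge could look like for individual profiles. The `Profile` and `Store` classes and the profile IDs are invented for illustration and are not WikiTree's data model; the only idea shown is that a merge keeps both originals in an append-only log so it can be undone.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Profile:
    pid: str
    name: str
    parents: frozenset = frozenset()


@dataclass
class Store:
    profiles: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # append-only history of merges

    def add(self, profile):
        self.profiles[profile.pid] = profile

    def merge(self, keep_id, drop_id, merged):
        """Replace two profiles with `merged`, remembering both originals."""
        originals = (self.profiles[keep_id], self.profiles.pop(drop_id))
        self.log.append(originals)
        self.profiles[keep_id] = merged

    def undo_last_merge(self):
        """Restore both profiles exactly as they were before the last merge."""
        kept, dropped = self.log.pop()
        self.profiles[kept.pid] = kept
        self.profiles[dropped.pid] = dropped


store = Store()
store.add(Profile("Smith-1", "John Smith", frozenset({"Smith-9"})))
store.add(Profile("Smith-2", "Jon Smyth", frozenset({"Smyth-4"})))
snapshot = dict(store.profiles)

# The merge throws away the "Smyth-4" parent link in the visible profile...
store.merge("Smith-1", "Smith-2",
            Profile("Smith-1", "John Smith", frozenset({"Smith-9"})))
merged_count = len(store.profiles)

# ...but nothing is lost, because the originals are still in the log.
store.undo_last_merge()
```

This sketch ignores links from other profiles that point at the dropped ID; a real system would have to version those relationships as well, which is the family-graph part of the problem.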

If you read the links I provided at the top of my previous comment, then you would see that I have spoken about GEDCOMs and the general quality of digital genealogies. The fact that GEDCOMs are junk and have few to no quality assurance properties is closely linked to why many digital genealogies, including WikiTree, are plagued by bad data. It is in the design of the format and the absence of modern data standards.

The downside of GEDCOMs is how hugely problematic they are. But we're not going to build a relatively complete digital genealogy completely by hand spanning from today to 1 CE. And eventually all the data on WikiTree will need to be migrated to whatever the next systems are; it is a mistake to believe that the current version of WikiTree is the be-all and end-all of digital genealogy. This is a prototype and proof of concept for what comes next.

Magnus Sälgö and I are on the same page when it comes to where this is all headed. Eventually, the digital genealogies are going to be underwritten by what are called ontologies. Languages describing the semantic markup of open linked data.

It is not at all useful to anyone to respond to "what is the future of the data standards for our field, and how do we implement them?" with "it is useless to do anything about it and we should just keep doing what we've always done."

My background is in theory of computation, programming, and physics. I recognize the distinction between data and program, and I also have a background in information security and archival methods. At the level of hardware, data is not actually distinct from the hardware and the configurations of state that are represented as software or programs on it. WikiTree experiences demonstrable underflow and overflow errors; there are a variety of places in which problems have been patch fixed, and the system as a whole has not been code reviewed or proven code complete; this is to say that WikiTree is not provably error-free at the level of its software or hardware operation. It isn't held to the best engineering standards, and it is cobbled together as needed and practical by a small dedicated volunteer force with limited time and interest in extensive testing (thank you everyone who holds the ship together with baling twine and duct tape!).

As such, the data is not always handled in strictly safe ways by the underlying system. This results in erratic behaviors of profiles and graphs from time to time which are usually resolved by patch fixes and administrative intervention. This is actually really typical of any fundamentally open generally recursive functional system and is related to the independence of some propositions or data and the general undecidability of Turing machine equivalent systems.

Over a long enough timeline, the survival of all things tends to 0. Right now, WikiTree is sparse with respect to the total available world historic global graph. There are 7 billion people on Earth, almost 8 billion now. Between 1 AD and now there has been a comparable number of people. Say there have been 20 billion people in total and my profile is representative of the size of each profile; my profile is 1.5 MB in size, so a 20 billion profile database will be a minimum of around 30 petabytes. I don't actually believe for a moment that my profile is representative or that 30 petabytes will be sufficient for the historical global graph. But it serves as a model for discussion purposes. 30 petabytes is 30E15 bytes, or 2.4E17 bits, so it has 2^(2.4E17) possible states, which is a number too big for Google's common calculator; this is to say that the space in which errors can occur in the system over the lifetime of WikiTree is huge. In the long term, a small scrappy team of dedicated devs and volunteer testers will never root out all the critical errors by brute-force, trial-and-error methods.
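The back-of-envelope numbers above can be checked with a few lines of Python. This is only a sketch of the arithmetic under the post's own assumptions (20 billion profiles at 1.5 MB each, decimal units); the variable names are made up for illustration. States are counted per bit, so the exponent is eight times the byte count.

```python
import math

PROFILE_BYTES = 1_500_000          # assumed size of one profile: 1.5 MB
TOTAL_PEOPLE = 20_000_000_000      # assumed number of people to record: 20 billion

total_bytes = PROFILE_BYTES * TOTAL_PEOPLE   # 3 * 10**16 bytes
petabytes = total_bytes / 10**15             # size in decimal petabytes
total_bits = 8 * total_bytes                 # each byte holds 8 bits

# The number of distinct states is 2**total_bits, far too large to print,
# so report how many decimal digits that number would have instead.
decimal_digits = total_bits * math.log10(2)

print(f"{petabytes:.0f} PB, {total_bits:.1e} bits, "
      f"state count has about {decimal_digits:.1e} decimal digits")
```

So the state count is a number with roughly 7 × 10^16 decimal digits, which is why no ordinary calculator will display it.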

By a similar argument, the database currently would be around 2E12 bytes, or 2 TB, of information. Roughly speaking, WikiTree grew about 33% in a year in terms of total profiles. The content data itself has a consistency error rate of about 1% to 1.5%; I am not sure I've ever seen tracking on the error rate of the WikiTree server or software themselves, but I can tell from G2G posts that it is significantly higher than 0%.

This is all to say that 1) we can't assure that our data is in general safe for all time, 2) we have a limited window within which to back up that data to multiply redundant sources for future iterations of the global graph(s), and 3) we can't assure that correct information and graphs will continue to be correct as WikiTree grows and mutates.

WikiTree does an okay job of preserving consistency of the graph for practical operation in the immediately foreseeable future. But it is notably limited, and it is notably limited by its operational data formats & standards. But fundamentally this isn't just about WikiTree; this is about the proliferation and preservation of the data and hard work that WikiTree and other genealogies represent. Eventually, the global graphs are going to merge whether the corporations controlling them want it to happen or not. That merger can be totally unsafe and ad hoc, resulting in a lot of pain, data loss, and unnecessary obstacles, or we can work on the merging problem to prepare our data for the future and posterity.

@Ian you are walking down the wrong path....

The problem with genealogy on WikiTree is not disk head misalignment.

The problem is that not everyone with a family tree has a Master's in genealogy ==> you can't trust what you see....

My background is a Master's in applied physics....

That was all just the relatively low-level stuff.

Here's a question to think about: how many generations would you attest to being 100% accurate in a court of law in your direct ancestor graph?

Like what percentage of your direct ancestor graph are you almost absolutely certain has all the correct relations?

I've seen several ancestors in different versions of my direct ancestor graphs where they are listed with several parents. I've corrected more than a few such profiles after refuting a couple of possibilities. It is a relatively common problem for analog and digital genealogies to contain multiple possible versions of relations and events. Right now, in the majority of cases and systems, those multiples are collided into a model which assumes singularity of relations and events. With FamilySearch data, this results in the most awful headache of a tangle. The software is in many cases unable to merge thousands of copies of ancestors (my go-to is Charlemagne I) because, from the perspective of the software, there should only be 1 copy of that ancestor and anyone with different family graphs is a different person. Meanwhile, 1) people have nonetheless created many, many copies of the ancestor and versions of their family graph, and 2) you have to merge duplicates from the latest generation (roughly 20-30 generations) back to all the Charlemagne I copies before you can start merging away the Charlemagne I duplicates.

How confident are you that you're going to get all those merges correct and end up with exactly and only the correct version of his genealogy?

Now how confident are you that everyone else is going to manage the same or similar feats with the genealogies of interest to them?

How confident are you that the other users will find all their ancestors that already exist in the data and properly link them into newly generated graphs to form exactly and only correct genealogical graphs?

How confident are you that they will not fail to find all such ancestors and will not create duplications or will find all such ancestors and not create different versions of their genealogical graph?

What do you suppose the rate of agreement and disagreement is among genealogists past, present, and future?

From this view of multiple genealogies in singular graphs like FamilySearch or WikiTree and the view that there are multiple "global graphs" like Geni, Ancestry.com, WikiTree, MyHeritage, FamilySearch, Wikipedia/Wikidata, and the others, how many versions of the global graph do you suppose there are right now? How many for a single distant ancestor like Charlemagne I?

**** in **** out

>> How confident are you that you're going to get all those merges correct and end up with exactly and only the correct version of his genealogy? 

This is a loosely coupled system and you need to select what you trust.... I guess you don't trust everything on the internet.....

Good lecture by Tim Berners-Lee on linked data and loosely coupled systems...

 

  • Just as the internet took some time, creating a semantic web will also take some time; see the 5-star model
  • My trust is 
    • maybe research by professionals, e.g. Swedish SBL
    • I don't trust community research like 
      • Wikipedia
      • Wikidata
      • WikiTree
      • Find A Grave
Right now I connect SBL and Wikidata to add more trust to Wikidata
+4 votes
aarrgghh  too much maths - not me LOL
by Robynne Lozier G2G6 Pilot (887k points)
+5 votes
One other thing which isn't used as often as it should be is ~~~~ to provide a date-time stamp. I use it on research notes, but I'm sure other uses are valuable. The only problem is that it's a one-time use. That is, once you add it, it replaces the 4 tildes with their date-time equivalent, and if you want to change it you need to erase or overwrite the date-time... Oh, it also provides your WikiTree ID along with the date-time.
by Dave Dardinger G2G6 Pilot (407k points)

Good suggestion

Note: it's a bug that ~~~~ doesn't work inside ref tags ==> below is my workaround when creating a Find A Grave citation

  1. I first create {{FindAGrave|123456|~~~~|John Lawrence Westlund}}
  2. Do a save of the profile
  3. Then add <ref> tags
    1. <ref>{{FindAGrave|123456|Sälgö-1 12:36, 30 July 2017 (EDT)|John Lawrence Westlund}}</ref>

 

==> I get  (from profile Westlund-121)


1.   John Lawrence Westlund Find A Grave Memorial #30011165. Retrieved Sälgö-1 12:36, 30 July 2017 (EDT).
