Merges can often be between entire branches of family trees; why isn't there a way to do three-way or n-way comparisons?

+12 votes
386 views

Merges between two profiles of the same person can be rather complex.

In the simplest case, all the locations, dates, relations, and names are identical or effectively equivalent. When the merge is unambiguous like this, it should be almost entirely automatic.

In the more common cases, each profile is missing data that the other has, but they match on other points. It is ambiguous whether the profiles match in the strict sense of the simplest case, but if they don't contradict each other and are complementary with intersections, then the merge should again be almost entirely automatic, though with more human supervision in the decision-making process. These are nice cases because they allow a profile to be constructed from incomplete pieces.

Also somewhat common are imperfect matches. There's complementary information and equivalent information, but some entries contradict each other. One profile says born 1800 and another says born 1778, but they agree on everything else or almost everything else. These cases probably should be merged, but the contradicting information should probably be purged from both profiles; it can be positively flagged as something in need of further research. If deleting the contradicting information from both profiles renders the merge ambiguous, then we enter an unhappy case where merging should not proceed (see below). However, most of the time this case results in a sound merger, provided that the majority of the information is in agreement.

The unhappy cases are the ones in which it is ambiguous what the case even is, such as merges between profiles which are totally complementary. What I mean by complementary is that each is missing data that the other has; complementary with intersections, as in the case above, means they're missing parts but also have parts (plural!) which are equivalent, so the match is semi- or quasi-definite. Purely complementary profiles are wholly ambiguous, and even if one or two properties are shared, it is probably not enough to disambiguate them. Consider Sarah Smith (Smith-1) and Sarah Smith (Smith-2), where Smith-1 has parents but no spouse or children and Smith-2 has a spouse and children but no parents. It is possible that they're the same person, but it is more likely that they are not the same Sarah Smith; either way, more information is needed. The profiles should not be merged in this condition.

Then there is the dreaded case of the unconnected, anonymous, unknown-last-name-at-birth profiles. These are by themselves more trouble than they are worth. Sometimes they can be matched up in some systems with named and related profiles to fill in details, but the less information a profile defines, the less likely that is to happen. A floating date of birth with nothing else is pretty much wasted resources.
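For programmers, the pairwise cases above can be sketched roughly as follows. This is only a minimal illustration in Python, not anything WikiTree actually runs: the function name `classify_merge`, the dictionary layout of a profile, the sample people, and the `min_shared` threshold of three agreeing fields are all assumptions made up for the example.

```python
from enum import Enum


class MergeCase(Enum):
    UNAMBIGUOUS = "identical or equivalent"
    COMPLEMENTARY = "complementary with intersections"
    CONTRADICTORY = "imperfect match, purge the conflicts"
    AMBIGUOUS = "do not merge, more research needed"


def classify_merge(a, b, min_shared=3):
    """Classify a two-profile merge; profiles are dicts, None means missing."""
    known_a = {k for k, v in a.items() if v is not None}
    known_b = {k for k, v in b.items() if v is not None}
    shared = known_a & known_b
    agree = {k for k in shared if a[k] == b[k]}
    conflict = shared - agree
    # Too little agreement left once conflicts are set aside: unhappy case.
    if len(agree) < min_shared:
        return MergeCase.AMBIGUOUS, conflict
    if conflict:
        return MergeCase.CONTRADICTORY, conflict
    if known_a == known_b:
        return MergeCase.UNAMBIGUOUS, conflict
    return MergeCase.COMPLEMENTARY, conflict


# Hypothetical data modelled on the Sarah Smith example.
smith_1 = {"name": "Sarah Smith", "father": "John Smith", "mother": "Ann Smith"}
smith_2 = {"name": "Sarah Smith", "spouse": "Tom Jones", "father": None}

# Hypothetical pair that agrees on everything except the birth year.
born_1800 = {"name": "Sarah Smith", "birth_year": 1800, "birth_place": "Ohio",
             "father": "John Smith"}
born_1778 = {"name": "Sarah Smith", "birth_year": 1778, "birth_place": "Ohio",
             "father": "John Smith"}

print(classify_merge(smith_1, smith_2))
print(classify_merge(born_1800, born_1778))
```

The threshold is the judgement call here: set it too low and purely complementary profiles slip through, set it too high and nothing merges automatically.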

Okay, all that is rough enough, but we're only talking about comparing two profiles to be merged. What happens when you have four or more profiles which have parts suggesting they are all the same profile, but their merger depends on the merger of their relations? Example and case study: Motheral-2, Motheral-17, Motheral-25, Motheral-42, and Motheral-45.

At present, there isn't a way to set up what is functionally equivalent to a system of equations to check whether a branching merger will result in a consistent, inconsistent, or undetermined genealogy. There isn't a system in place to warn users that they may be merging from ambiguous conditions or that they might be creating ambiguous conditions by the merger. And eyeballing it takes some effort when you're comparing up to n profiles and their relations; the commenting system would be how I would keep track of the complexities of such mergers, but I am finding that the spam protection locks me out (don't remove this) pretty quickly.
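One rough way to picture such a check: treat every proposed merge as an equation "profile A = profile B", group the profiles that would end up as one person, and then see whether each group agrees with itself. The Python sketch below does that with a simple union-find. Everything in it is an assumption for illustration: the function names, the profile IDs `P-1` to `P-3`, the sample facts, and the rule that a group counts as determined once at least `min_confirmed` fields are backed by two or more profiles. It also ignores the merging of relations (parents, spouses, children), which a real tool would have to follow as well.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x's group, compressing the path."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def check_merge_set(profiles, proposed_merges, min_confirmed=2):
    """Report each merged group as consistent, inconsistent, or undetermined."""
    parent = {pid: pid for pid in profiles}
    for a, b in proposed_merges:
        parent[find(parent, a)] = find(parent, b)

    groups = defaultdict(list)
    for pid in profiles:
        groups[find(parent, pid)].append(pid)

    report = {}
    for members in groups.values():
        if len(members) < 2:
            continue  # profile is not part of any proposed merge
        values = defaultdict(set)    # field -> distinct values seen
        support = defaultdict(int)   # field -> number of profiles stating it
        for pid in members:
            for field, value in profiles[pid].items():
                if value is not None:
                    values[field].add(value)
                    support[field] += 1
        conflicts = sorted(f for f, vals in values.items() if len(vals) > 1)
        confirmed = [f for f, vals in values.items()
                     if len(vals) == 1 and support[f] >= 2]
        if conflicts:
            status = "inconsistent"
        elif len(confirmed) < min_confirmed:
            status = "undetermined"
        else:
            status = "consistent"
        report[tuple(sorted(members))] = (status, conflicts)
    return report


# Hypothetical profiles: each pairwise merge looks harmless on its own.
profiles = {
    "P-1": {"name": "J. Doe", "birth_year": 1800},
    "P-2": {"name": "J. Doe", "spouse": "M. Roe"},
    "P-3": {"name": "J. Doe", "spouse": "M. Roe", "birth_year": 1778},
}
print(check_merge_set(profiles, [("P-1", "P-2"), ("P-2", "P-3")]))
```

In the sample data, P-1 and P-2 never contradict each other, and neither do P-2 and P-3, yet merging all three puts two different birth years on one person. That is the kind of warning a pairwise comparison cannot give.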

There's got to be a better way to deal with snarls like this, especially if we are to get a handle on the accuracy and precision of WikiTree's profiles. Done right, the merger system could be a powerful tool for determining which profiles have incorrect information or need further research to complete the tree. At present, I would hazard a guess that the merger system is similar to the GEDCOM import system in that it creates more liabilities than it resolves.

in WikiTree Tech by Ian Mclean G2G6 Mach 1 (12k points)
retagged by Dorothy Barry
Also, I think there should be a distinction between profiles. In physics and mathematics, you usually work with some kind of systematic invariant which you use to break ties and resolve contradictions in the data. The laws of thermodynamics are an example. In social sciences, we have control groups which are compared to test groups.

On WikiTree, not all profiles are constructed equally. Some profiles are undersourced and constructed of what amounts to hearsay; some of the hearsay can be substantiated with proper investigation and sourcing. A lot of it is cruft that makes it harder than it needs to be to find proper matches. Red herrings. Some profiles are oversourced; oversourcing is where too much information is included and you likely have multiple people slammed into one profile. Some profiles are properly or strictly sourced; they have exactly enough sources to substantiate their unique identity in the family graph and differentiate their life from birth to death.

I think properly sourced profiles should be treated somewhat differently than other profiles. They should be recognized as complete at some level, and they should be treated as fixed points; the protected profile status is analogous to what I am thinking, but at present, that status is applied arbitrarily on an as-needed basis. Profiles which are not in danger of being merged into an oversourced state, and not subject to the gradual threat of being edited away from a relatively complete or finished state, can be used for powerful error-correcting and statistical techniques.

When you have fixed points of reference, you can compare new or uncertain data to the fixed points and flush out the uncertainty algorithmically. Even when you can't determine a system directly by such comparisons, you can usually figure out what you need to do in order to have a determined system. You can count the gaps and map the disagreements. This can be very useful for determining where research efforts need to go.
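Counting the gaps and mapping the disagreements against a fixed point could look something like this small Python sketch. The function name `compare_to_fixed_point`, the field names, and the sample profiles are assumptions for illustration only.

```python
def compare_to_fixed_point(fixed, candidate):
    """Compare an uncertain profile against a trusted, fixed one.

    Returns (gaps, disagreements): fields the candidate is missing, and
    fields where both have a value but the values differ.
    """
    gaps = sorted(
        field for field, value in fixed.items()
        if value is not None and candidate.get(field) is None
    )
    disagreements = {
        field: (value, candidate[field])
        for field, value in fixed.items()
        if value is not None
        and candidate.get(field) is not None
        and candidate[field] != value
    }
    return gaps, disagreements


# Hypothetical data: a well-sourced profile and an uncertain one.
fixed = {"name": "J. Doe", "birth_year": 1800, "birth_place": "Ohio",
         "father": "A. Doe"}
uncertain = {"name": "J. Doe", "birth_year": 1778, "father": None}

gaps, disagreements = compare_to_fixed_point(fixed, uncertain)
print("needs research:", gaps)
print("conflicts with fixed point:", disagreements)
```

The gaps tell you what to go looking for, and the disagreements tell you what to double-check, which is exactly the list a research effort needs.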

2 Answers

+8 votes
 
Best answer
Hi Ian,

Sounds like you have an engineering background.

You're going to lose a lot of people in your long description, with a lot of technical terminology. While I can understand it, because I have a software engineering background, there's a lot of people here who don't.

It's easy to propose that something *should* work a certain way. And I agree with you in that merging capabilities could definitely be improved. But it's also a long way to get there from here.

I work a lot in Open Source software. It's my day job. It's all about collaboration. And it's also about "scratching your own itch". Which means that if you really want something done, then you're going to need to volunteer your time and effort to see that it gets done. How important is this feature to you?

I'm sure it's not like this hasn't been thought of before. I even requested a way to merge whole or partial trees myself.

But it would probably go a long way if you're willing to jump in to help develop the tools that you would like to see here.
by Eric Weddington G2G6 Pilot (166k points)
selected by Rosemary Jones
My background is actually in physical theory, computer theory, and the philosophy of scientific method. Unfortunately, I am not a particularly savvy coder when it comes to using and writing other people's programs; I never got that far in my practical college courses; I learned some Java and some C++ and fragments of HLA. My specialty is abstract and rough numerical analysis and research.

Anyway, I don't expect giant leaps and bounds. But the low hanging fruits are pretty significant here. First step is identifying, classifying, and categorizing the issues. My contribution is in outlining that for programmers. For those savvy at retrieving and processing the data through web programming, there's a bunch of things that can be done using read only access as Aleš has shown.

Everything that the error checking project is doing right now can also be used to check what class or category a given potential merge falls into. The project already identifies some merges between unconnected profiles and the big family graph. I was about to go and merge some of those when I discovered that there was a snarled mess of merges and rejected matches that probably should be matches. So I know this isn't just my problem, and I know that as hard as it is for me to do that, it is harder still for others who don't have my background.

The least I can do is make the problem and some of the potential solutions visible and make some recommendations for people better qualified to handle.

The first problem is in your assumption that this is somehow "low-hanging fruit".

Conceptually, it's low-hanging fruit, right? But coding it may be a different matter. This is from my own experience. 

Your best approach is to first ask where is the TODO list for the WikiTree back-end. Maybe it's already there on the list. I don't really know. 

Then the next step would be to ask if someone else has already thought of this.

Then the next step would be to ask what the issues are in getting it implemented. Is it a resource issue? Are there hidden complications in proposed implementations?

My point is that you'll get farther by asking questions. Don't assume that you're the only one with answers.

As the quote says: "In theory, theory and practice are the same, but in practice, they're different."

+4 votes
The PROFILE DEFINING FACT suggestion I made would be a good start (IMO) toward this, and simpler to implement than programmatic analysis of whole profiles and their interrelationships. My concept barely required any programming, just a way to set a "lock" on one fact for each profile. We basically have birth, marriage and death, as well as parent links. Let the PM (but not a GEDCOM importer) choose the one fact that he considers the definitive fact for the profile. In the case where all the data is correct, the birth record (if it has a specific date and place) should be the default choice.

Once you have the profile locked/anchored to a fact that can't be edited out or merged away, it simplifies the process of seeing if 2 profiles are a good match.  If both are locked to different birth dates for example, then they can't be merged.
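As a rough picture of how little code that rule needs, here is a minimal Python sketch. The names `merge_blocked_by_lock` and `locked_fact`, the profile layout, and the sample dates are all made up for illustration; nothing like this exists on WikiTree as far as I know.

```python
def merge_blocked_by_lock(a, b):
    """Return True when both profiles are locked to the same kind of fact
    but hold different values for it."""
    lock_a = a.get("locked_fact")
    lock_b = b.get("locked_fact")
    if lock_a is None or lock_b is None:
        return False  # at least one profile has nothing nailed down
    if lock_a != lock_b:
        return False  # locked to different kinds of fact; rule says nothing
    return a["facts"].get(lock_a) != b["facts"].get(lock_a)


# Hypothetical profiles, both locked to their birth fact.
first = {"locked_fact": "birth",
         "facts": {"birth": ("1800-03-04", "Ohio")}}
second = {"locked_fact": "birth",
          "facts": {"birth": ("1778-06-01", "Ohio")}}
unlocked = {"locked_fact": None,
            "facts": {"birth": ("1778-06-01", "Ohio")}}

print(merge_blocked_by_lock(first, second))    # locks disagree
print(merge_blocked_by_lock(first, unlocked))  # no lock on one side
```

The check only ever refuses a merge; it never says two profiles are the same person, so the usual comparison would still be needed when the locks agree or are missing.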

Just having one fact may not sound like it would be enough, but consider that now we are getting by with really nothing nailed down.  I'm trying to do merges with no real sources, hardly any good data, and unresponsive (or even hostile) profile managers.

So, maybe take one aspect like that to start with, get it working first, and use that to build on later.  And for my part, the biggest problem we have to deal with is how poorly done so many of the profiles and links between them are.
by Mikey Anonymous G2G6 Mach 4 (45.1k points)
That in general isn't going to create precision profiles. It is a form of bad scientific method. I mean I get the idea, but it relies on the controller of the profile to be competent in the judgement of fact vs fiction, and I was reading about people complaining that there isn't a Jesus Son of God profile on WikiTree.

I trust aggregation of data more than I trust individual facts. Systems can be checked and cross-checked against each other to decide whether or not they are consistent with respect to first-order predicate logic. I don't necessarily trust the links between profiles, but I do trust a well-sourced profile. I have a couple of profiles in the early portion of my family tree which have government records from birth to death and burial; it is rare for all the records to agree on every detail, but I trust some records more than others, for example US Census records over FindAGrave.

If I were to create an ancestor and lock their birth date to, say, the 21st of May 1XXX, but a record shows that same ancestor with a birth date of the 1st of May 1XXX, then there's a high probability that two profiles representing the same person would get created and locked from merging. If there were a third record which showed that same ancestor with a birth date of the 2nd of May 1XXX, then we're likely to get a third profile locked from merging. That's not an ideal condition for the system, I think.
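To put numbers on that worry, here is a tiny Python sketch that groups records by an exactly locked birth date. The function name, the record labels, and the year 1850 are placeholders chosen for the example, since the real year is not given above.

```python
from datetime import date


def split_by_locked_birth(records):
    """Group sources by exact birth date; each group would become its own
    profile, locked against merging with the others."""
    groups = {}
    for source, born in records:
        groups.setdefault(born, []).append(source)
    return groups


# Three hypothetical records for one ancestor (placeholder year).
records = [
    ("record A", date(1850, 5, 21)),
    ("record B", date(1850, 5, 1)),
    ("record C", date(1850, 5, 2)),
]

locked_profiles = split_by_locked_birth(records)
print(len(locked_profiles), "profiles that can never be merged")
```

One person, three records, three locked profiles: an exact-match lock turns every transcription slip into a permanent duplicate.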
