Why can't you improve the matching functions for the GED import?

+7 votes
131 views
The matching function for the GED import suggests way too many names.

I am attempting to import 1500 names. I am not sure how many suggestions I have had. For example, I just finished reviewing suggested matches for Elizabeth Marler. I had several suggestions. The only thing I could see in common among most of them was that they were named Elizabeth. This is just one example out of 1500 names. Could you limit the suggestions better? Couldn't you match last names as well as first names? What about other data: birth date, place of birth, etc.? At least don't suggest a match unless more than a first name (or last name) matches! I submit that as the tree grows this could become an even worse problem, and that many people would either give up or simply reject matches as rapidly as possible (which wouldn't be good).

Jim Vincent
in WikiTree Tech by Jim Vincent G2G4 (4.9k points)
edited by Jim Vincent
Good luck, Jim. You've stumbled upon the one area of WikiTree where you really do 'get what you pay for'. Our treatment of GEDCOMs is so very poor that many members have given up completely and, ignoring their vital importance, want to see them banned from the site. You will find innumerable excuses for our failure and very little in the way of solutions.

2 Answers

+5 votes
Jim my friend, you are not the first to voice such a complaint about the search function in GEDCOMpare....nor shall you be the last.

Unfortunately, search functions work best on specific data sets. WikiTree does not force users to enter data in a specific manner, and many a G2G Forum question has been raised about that. WikiTree is international, and there are different norms for how first and last names are used and therefore entered. Date conventions also differ from country to country: DD-MM-YYYY or MM-DD-YYYY or YYYY-DD-MM. Now the system tries to keep a standard, but some slip through. From what I have observed, the system uses YYYY-MM-DD for its own internal functions and shows DD-MMM-YYYY when a date is viewed.

Names seem to be searched using the Soundex or Metaphone systems, which are phonetic algorithms that match names by their sound. So this can produce some, shall we say, odd results. And numerous ones at that.
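For anyone curious what a phonetic code looks like, here is a minimal sketch of classic American Soundex in Python. It is only an illustration: nothing in this thread confirms which algorithm, or which variant of it, WikiTree actually runs, and the function name `soundex` is my own.

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(ch, "")  # vowels (and unmapped letters) have no code
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]  # pad or truncate to four characters
```

With this sketch, "Elizabeth" and the misspelling "Elziabeth" both come out as E421, and "Marler" and "Marlar" both come out as M646, which shows how a sound-based search groups spelling variants together.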

As to places of birth or death, if I had a nickel for every Profile I have seen that has specific dates of birth or death or both and no location, I would have a new computer on my desk. LOL. I mean come on, if you know they were born on a specific date, you can't tell me you don't have some idea of where they were born....even the country!!!! I mean geez Louise.

So the search engine has to be very open to allow for those Profiles that may not have been entered properly, or were entered differently due to different norms, and still be on the lookout for possible duplicates, which are the bane of a system trying to build a single Global Family Tree.

One thing I have learned after loading over 25 GEDCOM files to WikiTree: keep the file size under 400 names. I try to average about 300. It makes it easier to manage and helps me keep what little sanity I have left.

Not the answer I know you were looking for, but that's about it in a nutshell.  And one other thing, this is a site basically operated by volunteers.  Not a lot of money there to make A.I. type of search engines. It may not be perfect, but little in this world is.  Well, except for my homemade brownies....
by LJ Russell G2G6 Pilot (178k points)
edited by LJ Russell
LJ -

Thanks for your comments.  Allow me to address some of your points:

There is a GEDCOM format. Data is read into the proper field in the GEDCOM match process, so I fail to see how that is a problem. If the data is read and put into the proper fields, can't it be matched? I don't envy the programmers trying to deal with the international issues, but if they can't deal with them, then maybe WikiTree shouldn't be international.

As to dates: again, the dates in my GEDCOM appear to be formatted OK. Further, if dates are formatted differently, can't a simple parser convert them to the WikiTree standard? If no one has done that, I would be willing to try!
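To make the idea of a "simple parse" concrete, here is a rough Python sketch that converts a couple of unambiguous date spellings to YYYY-MM-DD. The function name `normalize_date` and the set of accepted formats are assumptions for illustration, not anything WikiTree is known to use. Purely numeric dates such as 01/02/1930 are deliberately rejected, because nothing in the string says whether the day or the month comes first.

```python
import re
from datetime import date
from typing import Optional

# English three-letter month abbreviations, as commonly seen in GEDCOM files.
MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}


def normalize_date(text: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if it cannot be read safely."""
    s = text.strip().upper()

    iso = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", s)
    if iso:
        year, month, day = (int(g) for g in iso.groups())
    else:
        # Day, month abbreviation, year: "13 FEB 1902" or "13-Feb-1902".
        named = re.fullmatch(r"(\d{1,2})[ -]([A-Z]{3})[ -](\d{4})", s)
        if not named or named.group(2) not in MONTHS:
            return None  # includes ambiguous all-numeric forms like 01/02/1930
        day = int(named.group(1))
        month = MONTHS[named.group(2)]
        year = int(named.group(3))

    try:
        return date(year, month, day).isoformat()  # also rejects e.g. 31 FEB
    except ValueError:
        return None
```

So `normalize_date("13 Feb 1902")` gives `"1902-02-13"`, while `normalize_date("01/02/1930")` gives `None` rather than a guess.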

As to names, I understand that in some cases phonetic variations need to be considered, and names can vary, no doubt about it. (My name could be Jim, Jimmy, Jimmie, James, James Raymond.) But there should be more than a first name (or last name) match before a suggestion is made! And I think there is; I just think it needs to be improved!

You say that if a birth date is known, how can we not know the birth place? Birth dates are often calculated based on census data, which almost never tells more than the state of birth, if at all. But is that to the point? If we have John Doe born 1/1/1930, is there any reason to suggest a match to Jane Doe born 1/1/1935? (I am making up this example, but the point is: should a match be suggested if there isn't some correlation between the name and other data present in the GEDCOM file?)

Your suggestion about limiting the file size of a GEDCOM import is good, but of course it doesn't solve the problems. Nevertheless, could a warning be posted on the GEDCOM import page to keep a GEDCOM file size small until the matching process has been improved? (If there is one, I certainly didn't see it before importing my file.) Family Tree Maker has often cited trees with many thousands of individuals. How could these people ever manage to import their data into WikiTree?

Jim Vincent

Hi Jim,

Sorry, I didn't mean to imply your GEDCOM file data was incorrectly entered. Rule #1 was broken: Don't try to answer a Post when you are tired. LOL. I meant the data that is already in the WikiTree system that is being compared to your data. As this is an Open System with few or no real controls on how data is entered manually, there is a plethora of improperly created Profiles out there, but the search engine must still take them into consideration. Also, in previous years GEDCOM files were almost a load and dump operation. While there was a simplified method for comparing each name in a file to those in WikiTree, it was easy to bypass and caused many duplicate Profiles, and sadly a lot of poorly created Profiles as well.

When I mentioned birth/death dates with no location, I was talking about a specific date including day, month and year, not just year. If someone is calculating a birth date as 13 Feb 1902 from three or four Census records where only the calculated year is present, we have a real problem there. As I said, even the country would help. If you have the family's Census data from before a person is born and the next after they are born, and they lived in the same state, it is not hard to infer they were born in that state. You can always mark it as Uncertain to let future viewers know you don't have a specific record to say where they were born. And really, the location is just for when you are trying to Match or Reject a Suggested Match. I was just using the location field as one that is improperly filled or left unfilled too many times. I do not believe the location field is used at all in the search engine due to the variations encountered. Just an example.
Ellen's answer regarding data fields is spot on.  Nice, neat and precise.  Thank you Ellen.

And no Jim, you didn't miss anything about file size before uploading your file. I wish there were such a warning. Not for your reason on improvement of the search engine, but just because for the first-time user it can be daunting to see pages of Suggested Matches. With around 300 to 400 names per file, I usually have only 2 pages in my GEDCOMpare Report List.

I personally feel using GEDCOMpare should require the same certification process as being able to work on Pre-1500 Profiles. Members should have been on WikiTree for a while to get used to its nuances, such as adding proper sources, by creating Profiles manually. Then they should load a small GEDCOM file of around 50 names to get used to the GEDCOMpare process. Then it's Katy bar the door because here I come on file size. LOL

I had over 12,000 names across various large trees. Even then I quit adding to those trees because I decided to use WikiTree as my archive for all of my work. Then I loaded my first file and saw the complexity of a large file. I now create smaller trees strictly to load to WikiTree and add any new ancestors to those. A little time consuming, yes. But it also gives me the chance to review some files I may not have worked on for over a year, and permits me to update each person with new data or fix those I may have been less judicious in adding data to in the first place.

I have loaded over 5,000 names to WikiTree. I hit the 5,000 name limit on my Watch List and am presently going through my list to Orphan those who are related to me by distant marriage only. I may not be the Profile Manager anymore, but that does not break the connection to me. I did just load a 389 name file to WikiTree to break the monotony of reviewing my old Profiles, and it has taken me over a month to add those folks to WikiTree. Profile cleanup after an import can take time, depending on the style and format you like to present your Profiles in.

Like the tortoise and the hare, slow but sure wins the race.  Except in this case, there appears to be no finish line as I keep growing my trees.  LOL

GEDCOM File Usage Primer is a good place to review the GEDCOMpare process, and it includes a few tips as well. It doesn't answer your base question, but it is a good page to peruse.

I have recently made the mistake of importing a GEDCOM with over 1500 people in it. BIG mistake, but I'm now stuck with it after already spending 3 days looking at suggestions and rejecting 90% of them. 

I do think the matching algorithm could be improved, maybe by giving the user some control over how restrictive it is. For example, don't suggest persons where the place of birth or death (if specified) is on a different continent! Require further fields to match if they exist in both the GEDCOM and the profile (e.g. data for parents and siblings).

Possibly a weighted points scoring system for each positive (and negative) match with a user set threshold for acceptance.
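As a rough illustration of what such a points system might look like, here is a small Python sketch. Everything in it is hypothetical: the field names, the weights in `WEIGHTS`, and the functions `match_score` and `suggest` are made up for the example and do not describe how GEDCOMpare works.

```python
from typing import Dict, List, Optional, Union

Value = Optional[Union[str, int]]
Record = Dict[str, Value]

# Hypothetical weights: a field that agrees adds its weight, a field that
# disagrees subtracts it, and a field missing on either side counts as zero.
WEIGHTS = {
    "last_name": 3.0,
    "first_name": 2.0,
    "birth_year": 2.0,
    "birth_place": 1.5,
}


def _normalize(value: Value) -> Value:
    """Compare strings case-insensitively, ignoring surrounding spaces."""
    return value.strip().casefold() if isinstance(value, str) else value


def field_score(a: Value, b: Value, weight: float) -> float:
    if a in (None, "") or b in (None, ""):
        return 0.0  # missing data neither helps nor hurts
    return weight if _normalize(a) == _normalize(b) else -weight


def match_score(gedcom_person: Record, profile: Record) -> float:
    return sum(field_score(gedcom_person.get(name), profile.get(name), weight)
               for name, weight in WEIGHTS.items())


def suggest(gedcom_person: Record, profiles: List[Record],
            threshold: float) -> List[Record]:
    """Keep only profiles whose score reaches the user-set threshold."""
    return [p for p in profiles if match_score(gedcom_person, p) >= threshold]
```

With these made-up weights, John Doe born 1930 scores -1.0 against Jane Doe born 1935 (surname agrees, first name and year disagree), so a threshold of, say, 4.0 would hide that suggestion while still showing a John Doe born 1930 whose birth place is blank in the GEDCOM.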

Failing this, the presentation could be improved. Correct me if I'm wrong, but my understanding is that the GEDCOM data won't move on to being converted to profiles until EVERY suggestion has been processed. Even if one suggestion has been matched for a person, the rest have to be eliminated in order to pick up potential duplicates. Therefore, could there at least be a button that shows only those people who have outstanding suggestions or other reasons preventing them from becoming profiles? Presumably "no match found" is not such a reason, since it's a new person.

I realise it's a fine line between missing a positive match and swamping the user with excessive human-obvious mismatches and WikiTree does not have unlimited funds for software development, but just, maybe, my last suggestion could be considered for implementation?

+5 votes

From my observations, the matching algorithm in Gedcompare looks at all name fields, birth dates, and death dates and it only returns results that are possible matches in all available data fields.

  • For last names, it looks at possible variant spellings. Apparently somebody has identified the common name Carter as a possible variant of Marler, which may give you extra matches for Elizabeth Marler.
  • All last-name fields in the Gedcom are compared with all last-name fields in WikiTree profiles. This means that one woman's married name can be matched to a different woman's maiden name. (Matching the different name fields is not unreasonable, considering that it's fairly common for last names to end up in the wrong data fields, in both Gedcoms and WikiTree.)
  • If the LNAB is "Unknown," that often greatly increases the number of matches that are returned.
  • For first names, I think all first-name and middle-name fields in the Gedcom are compared with all first- and middle-name fields in WikiTree, so a woman named Elizabeth could be matched to a woman named Barbara Elizabeth.
  • Dates are treated as possible matches if they are within +/-2 years.
  • If one date field is filled in and the other one is empty, the empty date field will be treated as a possible match to all date values.
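
The rules listed above can be sketched in a few lines of Python. This is only a reading of the observed behavior, not WikiTree's code: the `Person` class and function names are invented for the example, and a plain case-insensitive comparison stands in for the variant-spelling lookup, which is not described in enough detail here to reproduce.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Person:
    given_names: Set[str]   # every first and middle name on record
    last_names: Set[str]    # every last-name field (birth, married, other)
    birth_year: Optional[int] = None
    death_year: Optional[int] = None


def names_overlap(a: Set[str], b: Set[str]) -> bool:
    """True if any name on one side equals any name on the other."""
    return bool({n.casefold() for n in a} & {n.casefold() for n in b})


def years_compatible(a: Optional[int], b: Optional[int],
                     tolerance: int = 2) -> bool:
    """An empty date matches anything; otherwise allow +/- tolerance years."""
    if a is None or b is None:
        return True
    return abs(a - b) <= tolerance


def is_possible_match(gedcom_person: Person, profile: Person) -> bool:
    """A suggestion is made only if no available field rules it out."""
    return (names_overlap(gedcom_person.given_names, profile.given_names)
            and names_overlap(gedcom_person.last_names, profile.last_names)
            and years_compatible(gedcom_person.birth_year, profile.birth_year)
            and years_compatible(gedcom_person.death_year, profile.death_year))
```

Under this sketch, an Elizabeth Marler born 1850 is a possible match for a profile of Barbara Elizabeth Smith, married name Marler, with no dates entered, because every field either overlaps or is empty. That is the kind of loose suggestion described in the question.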

Note that hitting the Refresh button at the bottom of the Gedcompare screen will increase the number of suggested matches needing to be reviewed. It is not necessary to hit that button.

by Ellen Smith G2G Astronaut (1.2m points)
Thanks to LJ for his response, and to Ellen for spelling out exactly how matches are done.
