Did you see that we changed our GEDCOM import system?

+37 votes
495 views

Hi WikiTreers,

We just released a round of improvements to our GEDCOM import system.

This has been requested by many members and on the to-do list forever. To be honest, I find working with GEDCOMs depressing so there was a little procrastination involved. :-)

To rationalize my procrastination a bit: I kept thinking that the old GEDCOM standard was on its way out. I thought we should focus our very limited resources on the standards and systems of the future instead. But I was wrong. GEDCOMs aren't going away any time soon. They remain important for genealogy, and important for WikiTree. Not important for all members, but for many members, especially new members.

Anyway.

Profiles created through GEDCOM imports will never be beautiful. We need to balance a lot of different considerations. But they can and should be better than they've been in the past.

In the past, we operated with these principles:

  1. We never want to lose any information that's in a GEDCOM.
  2. We never want to misinterpret or mislabel information.
  3. More information is always better than less.

I'm sure many of you will agree that these principles sound correct. :-) But in practice they've created horrible, junky profiles that need extensive editing.

Now we are skipping a lot of information.

For example, many GEDCOMs contain ID numbers for each individual. They're unique identifiers for the exporting system. Often they're unique to the one user of the system and they're meaningless to everyone else. And 99% of the time they're meaningless to the one user too, because they're only used by their software in the background. But, according to our old thinking, they could theoretically be useful in some cases that could be helpful even in a collaborative environment like ours. And they could. But that's likely to be such a small fraction of the cases that they're not worth what it costs the community to include them.

Now we're just skipping these ID numbers. And a whole bunch of other stuff. Here's the complete list: http://www.wikitree.com/wiki/Skipped_Tags_in_GEDCOMs
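For anyone curious what "skipping" a tag could look like, here is a minimal sketch. It is not WikiTree's actual import code, and the three tags in `SKIPPED_TAGS` are only an illustrative subset (the real list is on the page linked above). The function names are made up for the example. The one idea it shows: when a tag is skipped, the lines nested beneath it (those with a higher level number) are dropped along with it.

```python
# Illustrative subset only; the real list lives on the Skipped_Tags_in_GEDCOMs page.
SKIPPED_TAGS = {"RIN", "_UID", "BAPL"}


def parse_line(line):
    """Return (level, tag) for one GEDCOM line, stepping over an xref id like @I1@."""
    parts = line.split(" ", 2)
    level = int(parts[0])
    if parts[1].startswith("@") and len(parts) > 2:
        tag = parts[2].split(" ", 1)[0]
    else:
        tag = parts[1]
    return level, tag


def drop_skipped(lines):
    """Remove skipped tags together with any lines nested beneath them."""
    kept = []
    skip_below = None  # level of the skipped tag whose children are being dropped
    for line in lines:
        level, tag = parse_line(line)
        if skip_below is not None:
            if level > skip_below:
                continue  # still inside the skipped structure
            skip_below = None
        if tag in SKIPPED_TAGS:
            skip_below = level
            continue
        kept.append(line)
    return kept


record = [
    "0 @I1@ INDI",
    "1 NAME John /Syme/",
    "1 RIN 12345",
    "1 BAPL",
    "2 DATE 1 JAN 1900",
    "1 BAPM",
    "2 DATE 2 FEB 1850",
]
cleaned = drop_skipped(record)
```

In this made-up record the `RIN` line and the `BAPL` block (including its `DATE`) disappear, while `NAME` and the `BAPM` block survive.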

We're also now leaving a lot of information unlabeled. Rather than putting "Address:", "City:", and "Country:" in an address, for example, we just print the address.

See http://www.wikitree.com/wiki/Skipped_Tags_in_GEDCOMs#Translated_Tags for the exact details. (Not that it'll be easy to understand from that, because what the translations mean is complicated by a lot of other things in the code.)

We've also changed how we format the information we do print in the text. We no longer make lots of subheaders. We do something closer to a plain-language paragraph structure.

We did a variety of other little changes, but most aren't worth mentioning. One that's important to our Dutch community: we better preserve the capitalization in a Last Name at Birth like van der Beek. And we also do a better job of guessing at proper capitalization in a name like McClellan when it appears in a GEDCOM as MCCLELLAN.
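A rough sketch of the capitalization idea, under stated assumptions: this is not the real import logic, the function name is hypothetical, and the only special case handled is the "Mc" prefix. Names that already contain lowercase letters are assumed to be deliberate and are left alone.

```python
import re


def guess_capitalization(name):
    """Guess mixed case for an ALL-CAPS name; leave mixed-case input untouched."""
    # Input such as "van der Beek" already has lowercase letters: keep it as typed.
    if not name.isupper():
        return name

    def fix(match):
        word = match.group(0).capitalize()
        # "Mcclellan" -> "McClellan" (the only prefix rule in this sketch)
        if word.startswith("Mc") and len(word) > 2:
            word = "Mc" + word[2:].capitalize()
        return word

    # Treat each run of letters as a word, so hyphens and apostrophes split words.
    return re.sub(r"[^\W\d_]+", fix, name)


print(guess_capitalization("MCCLELLAN"))     # McClellan
print(guess_capitalization("van der Beek"))  # van der Beek
```

Note the limit of guessing: an all-caps "VAN DER BEEK" would come out as "Van Der Beek" here, which is exactly the kind of case where preserving the original capitalization, when there is any, works better than a rule.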

Here's an example of a profile created today under the new system: http://www.wikitree.com/wiki/Syme-151

Feel free to post here with additional suggestions, comments, questions, etc.

I have to warn you, though, that a suggestion that would make one GEDCOM import better might at the same time make other GEDCOMs worse. In fact, that's almost a guarantee. We have to balance the good that a change does for some imports with the harm it does to others. Judging this is incredibly difficult and time-consuming. So, now that I've made that excuse for why suggestions might not be implemented, feel free to fire away. :-)

Merry Christmas and happy holidays everybody!

Chris

P.S. I don't want my complicated explanations and excuses above to discourage posting suggestions. We do plan on continuing to make improvements. It's already on the to-do list for early next year to completely rewrite our code so that we have a cleaner foundation.

asked in The Tree House by Chris Whitten G2G Astronaut (1.1m points)
edited by Chris Whitten
More improvements.  Thanks.  Yay!
Chris, Thanks for these improvements. Is there a feed that shows Open profiles that have been created with a GEDCOM import?
Thanks, Vincent and Rick. Sorry, no feed of new GEDCOM-created profiles.

One thing you could do: tomorrow or any day in the future, look at the top 25 surnames at the bottom of the home page. They're often dominated by GEDCOM imports.
Looks nice!

One minor suggestion - I try to rearrange standard headings in 'chronological order' - birth, occupation, residence, marriage, children, death, burial. Makes more sense when reading a biography from top to bottom. At least we should put the will/death/burial sections last. Do others do the same thing during cleanup?

Also - children display - could you put each on separate lines with '#' in front, instead of all on a single line/paragraph? This provides a birth order and nicer display.

When cleaning up imported profiles I've typically converted Birth, Marriage, Death tags to headers, but I like the bolded display better.

Should we update http://www.wikitree.com/wiki/GEDCOM-created_biographies to be consistent with the new import format?
SO much better!! Thanks, Chris, et al for all the hard work!
That does look much better.
Beautiful job! Thanks
Wonderful!! I think I can finally stop bit@&ing about gedcom imports! And Eowyn can retire her protective armor! Awesome, Chris! You're sure giving us a lot of great Christmas presents this year!
Very impressed. Thanks so much, Chris.
Dear Chris W and Gedcom people,

Are baptisms imported or not please? The BAPL tag is in both lists - ignored and imported!

Chris L, a newbie.
Thanks for pointing that out, Chris! It's been corrected.

All the LDS-specific tags are skipped, except the one indicating a person is LDS, if that's used.

To specifically answer your question: BAPL (LDS baptism) is skipped, but BAPM (baptism) is kept.
Chris,

I am trying to wrap my brain around what type of entity GEDCOM is and why they are giving us profiles?

Are they like Family Search?

Thanks

Taylor

Hi, Taylor!

I can answer that for you. :)

A GEDCOM is a standard file format that is used to move family trees from one piece of software (or website) to another. 

For example: You may have built your family tree on Ancestry.com. If you wanted to bring your information from Ancestry to WikiTree, you would export all of your information (or a subset) from Ancestry in the form of a GEDCOM file.

WikiTree knows how to read GEDCOM files, so you would bring your file to WikiTree and import it here. Profiles can then be automatically created from the information in the GEDCOM.

You can read more about them on this help page: GEDCOMs.
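If it helps to see one, a GEDCOM is just a plain text file where each line starts with a level number, followed by a tag and a value. Below is a tiny made-up file and a minimal sketch of reading the people out of it. The sample data and the function name are invented for illustration; real files carry many more tags, and real importers do much more than this.

```python
SAMPLE = """\
0 HEAD
1 GEDC
2 VERS 5.5.1
0 @I1@ INDI
1 NAME Jan /van der Beek/
1 BIRT
2 DATE 12 MAR 1850
0 @I2@ INDI
1 NAME Mary /McClellan/
0 TRLR
"""


def read_individuals(text):
    """Map each individual's id (e.g. '@I1@') to a dict holding their name."""
    people = {}
    current = None
    for line in text.splitlines():
        parts = line.split(" ", 2)
        if parts[0] == "0":
            # Every level-0 line starts a new record; only INDI records are people.
            if len(parts) == 3 and parts[2] == "INDI":
                current = parts[1]
                people[current] = {}
            else:
                current = None
        elif current and parts[0] == "1" and len(parts) == 3 and parts[1] == "NAME":
            # Surnames are wrapped in slashes in GEDCOM; strip them for display.
            people[current]["name"] = parts[2].replace("/", "")
    return people


people = read_individuals(SAMPLE)
# people["@I1@"]["name"] is "Jan van der Beek"
```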

Julie,

Thank you very much.  I now understand the process.

Taylor

8 Answers

+6 votes
Chris, weren't you once on a committee to look at updating GEDCOMs?  Is there a place/site we can read about this?  What are the stumbling blocks to progress?
answered by Michael Stills G2G6 Pilot (361k points)
Having a committee is a major stumbling block to progress on anything. :-)

If you're interested, try searching for "better GEDCOM" and "genealogy standards" on Google. There have been valiant attempts. Honestly, I think FamilySearch would have to be behind any real changes. They put their energy into something called GEDCOMX.
+6 votes

I imported a gedcom this morning (40 people). My profiles all look excellent. I wondered what had happened to the ID # gobbledygook and I love the three apostrophe headings. Here's one. It still needs some help, it looks a little run on. But it is a major improvement.

Thank you.

answered by Anne B G2G6 Pilot (990k points)
That looks very good compared to what it was.  :)
I can hardly wait to try a gedcom! The example one looks too good to be true. I am so glad those non-essential tags will not be imported. I can not imagine the work fine-tuning this aspect so thank you for working on it.
What Maggie said.
+4 votes
Chris, I started to clean up my gedcom import from yesterday and noticed an anomaly. www.wikitree.com/wiki/Crofutt-8 and www.wikitree.com/wiki/Hinkley-184. Wife and husband.

Crofutt-8 lists footnotes/references for S201, S352, S256, S45, S8, S19, S18, S263, but only S201, S256, S45 and S263 are in the sources list. A Source S24 is in the sources but has no footnote. It is associated with the whole person, but not one fact, so I guess that's actually correct.

Hinkley-184 has footnotes/references for S201 and S352, but only a source listing for S352.
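This kind of mismatch can be checked mechanically. A minimal sketch, assuming the cited IDs and the listed source IDs have already been pulled out of a profile (how to extract them is not shown, and the function name is made up): footnotes with no matching source are the problem case, while sources with no footnote may be fine, as with S24.

```python
def check_sources(cited, listed):
    """Return (cited but not listed, listed but never cited) as two sets."""
    cited, listed = set(cited), set(listed)
    return cited - listed, listed - cited


# IDs as described above for Crofutt-8
missing, unused = check_sources(
    cited=["S201", "S352", "S256", "S45", "S8", "S19", "S18", "S263"],
    listed=["S201", "S256", "S45", "S263", "S24"],
)
# missing holds S352, S8, S19 and S18; unused holds S24

# IDs as described above for Hinkley-184
missing_h, unused_h = check_sources(cited=["S201", "S352"], listed=["S352"])
# missing_h holds S201
```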
answered by Anne B G2G6 Pilot (990k points)
I see what you mean Anne. S201 is only used for the marriage. It was included for one spouse in the marriage, but not both. Looks like the same thing happened where Crofutt-8 is the child in another family record. We'll look into it. Thanks!
+3 votes

Thanks, Chris!

Getting rid of the extraneous labels is great. The references make much more sense now.

Eliminating the ID numbers is excellent - also the part about relationships being "natural" on the lists of children. Would it be possible to leave names (without links, of course) in the profiles, or would that be impossible because of the names having been skipped?

I'm one of the people who actually liked the old headings, but this looks good, too. 

Is it on the radar to combine repeated references to the same source? That would make editing much more straightforward, but I'm assuming that it's not an easy fix.

Overall, a vast improvement!

Carole

answered by Carole Partridge G2G6 Mach 6 (69.3k points)
+4 votes
A suggestion for GEDCOM imports if not too complicated: Would it be possible to place a pointer on the profiles of connected people if a person is skipped due to an already existing profile? Every now and then I come across broken off twigs of one parent and children that might have gotten lost when the other parent got skipped on import. Some way of flagging the existence of missing links might make it easier to find and connect these profiles.
answered by Helmut Jungschaffer G2G6 Pilot (441k points)
Hi Helmut.

This is something we'd like to look at when we do the complete rewrite of the GEDCOM import code that we're planning.

Do you have any specific ideas for how this could be done? If a person is skipped, obviously, they're not imported, so we have to put these indicators somewhere else. We could report them to the user who did the import but that wouldn't enable other WikiTreers to help rebuild the links.

I'm not sure how we could do this in as simple a way as possible. Ideas?

Chris
Could the pointer to a parent/child/spouse in the original GEDCOM be easily replaced with an entry in the biography section? Something along the lines of "Child/parent/ Spouse omitted due to match with xyz"?
Would you put it in all nuclear relatives? For example, if a father is skipped, would you put the note in the profiles of his parents, siblings, spouse(s), and children? Probably not siblings.

One thing I'd worry about is clutter.

It also seems like we should be able to do better. Theoretically we could actually make the relationship connection to the existing profile, but that would depend on editing rights and might be really complicated to implement.
It seems to me that the parent-child connection would be the most useful information, the spouse would probably be connected to any children (unless they were childless), siblings would not need the pointer since under normal circumstances they would only be connected through a parent.

Instead of the direct connection how about an automatic notice to the PM of the existing profile (if there is one) that so-and-so wants to connect additional family to that profile?
It seems to me that the direct connection is very important to make when a profile is skipped.  Ideally, the actual connection should be made if permissions permit and pointers added ONLY when direct connections cannot be made.  In these cases, the pointers need to be added to ALL profiles that would otherwise show the connection.

I realize that this adds another order of magnitude to the complexity, but my opinion is that it is very important.

I once built a tool that could take a spreadsheet file and convert it to a gedcom that would import cleanly to WikiTree.  I managed to get a fairly clean import but became discouraged when WikiTree's import superimposed headings and boilerplate on the gedcom that was output from my tool.  As a result, I gave up on the next step I had planned, which was to have my tool accept import of gedcoms from non-WikiTree sources and clean them up before import to WikiTree.  I also planned to give my tool to WikiTree when it was complete.  I repeated that offer several times but it was never even so much as acknowledged and the answers I needed about the WikiTree import processing that I never received also made the task of optimizing the output of my tool much more difficult than it otherwise would have been.  I don't discourage very easily, however, and am still ready and willing to VOLUNTEER to help with the subroutine logic for anything you want to do when making major changes to the import process, however I will no longer be surprised if this offer is not only not accepted, but even unacknowledged.

I agree with Gaile here. I have opened a G2G-question about the need for names and dates - not all GEDCOM data comes from Ancestry.com. The GEDCOM-sources of this profile are as meaningless as they come. I understand, as Chris says above, that this is still a work in progress, but these three principles that were operated by in the past:

  1. We never want to lose any information that's in a GEDCOM.
  2. We never want to misinterpret or mislabel information.
  3. More information is always better than less.

are even more being compromised now. In order to uphold principle 1 and 2 it is not true that more information is at all times better than less. What is important is that the correct information (or data) is relevantly and appropriately cited and sourced in a way that specificity and pertinence of information facilitates the validation of info. So this will mean sometimes less info and sometimes more info. The example used does not provide validation - translating it to South African sources it will (I mean say those fields for example 

  1. Source: #S332
  2. Source: #S67 Birth date: abt 1854 Birth place: Markinch, Fifeshire Residence date: 1861 Residence place: Auchterderran, Fife, Scotland Link: http://search.ancestry.com/cgi-bin/sse.dll?db=1861scotland&h=1683585&ti=0&indiv=try&gss=pt
  3. Source: #S525
  4. Source: #S524
  5. Source: #S69 Birth date: abt 1854 Birth place: Residence date: Residence place: New South Wales, Australia Arrival date: 15 Apr 1875 Arrival place: Link: http://search.Ancestry.com.au/cgi-bin/sse.dll?db=nswassisted&h=200835&ti=5544&indiv=try&gss=pt

to South African sources also supplied by Ancestry.com) not only make the existing profiles difficult to integrate, it will merely be more dated secondary sources cited in a haphazard fashion that conflate even more, not less (quite a lot of the SA secondary sources, though authoritative, still contain many mistakes, some of which have been recently rectified but not in a way that it will necessarily be included in any GEDCOM).

The number of GEDCOMs and the cut-off dates for GEDCOMs need addressing. This has always been the major obstacle - I still come across GEDCOMs containing clear duplicates - still too many, and unnecessary. And the new format, which differs from that of the existing profiles they are being merged into, creates even more stumbling blocks.

As soon as I have the time I will integrate the two examples and see what I can provide as a solution if any.

To add to the comment made by Philip

My personal view:

I am not a member of Ancestry

Sorry to be a wet blanket, but the information as listed below has about as much meaning to me as the quotation: “If it is in Ancestry it must be true” ....

  1.  Source: #S69 Birth date: abt 1854 Birth place: Residence date: Residence place: New South Wales, Australia Arrival date: 15 Apr 1875 Arrival place: Link:http://search.Ancestry.com.au/cgi-bin/sse.dll?db=nswassisted&h=200835&ti=5544&indiv=try&gss=pt

 

Correct me if I am wrong

To me a source link should take me to a place where I can see that the person who entered the data had done some research and looked at primary sources, making the information trustworthy.

(The following is not relevant to the discussion but I feel I should still mention it)

My concern and the implication of links that lead nowhere:

WikiTree is a free and open site: Information can be gathered by anyone with internet.

There is no way to verify the information (for non-members of Ancestry) and, should it be incorrect, it will be used by the younger generations as is, repeating the same incorrect data over and over.

Regards

Ronel Olivier

Welcome back in Holland, Philip, hope you had a good flight?

First, it is of course assumed that people know, understand and follow all guidelines as requested and don't upload/import any duplicates, skipping them all as requested: the ideal WT world ;) (seen from that point of view we would not have to merge any of the newly imported ones).

But there will of course always be some members who, despite all the explanations and requests about not creating or importing duplicate profiles, do so anyway. We assume it is unintentional (these Pre-1700 profiles have a lot of variations of last names and patronymics, or perhaps people just don't understand things because their English is not very good). Some new members almost immediately take the Pre-1700 self-certification test and then import, without skipping any, these same old Pre-1700 duplicate profiles/branches or parts of them, over and over and over again. And indeed, with the new-style gedcom, it is sometimes (if there's a lot of info) very hard and a lot of work to integrate everything that's in it, given how it is presented now.

And I was wondering how many of the new small gedcoms people are allowed to import/upload, and how fast, because 10 or more small gedcoms within a week will of course be just the same as a large one. (If about 2000 profiles are added in a few days, almost all duplicates from different Pre-1700 branches, you can imagine the amount of work it takes to get things merged, corrected, integrated and sourced again!)

Also, Helmut has a point: the members who do understand the why and how, and who skip duplicates (the ideal), importing only the profiles not already present, are having a hard time. Sometimes children/siblings will indeed all be ''disconnected'' and floating around because the duplicate parents were skipped as requested. Maybe it's possible to add name tags of the skipped duplicate parents to these children, because the same names were often repeated for many generations. If these parents are not clearly mentioned in the Bio and there are no sources to verify, it's very hard to find the parents they belong to, which means a lot of research before you know which child belongs to which parent(s) and before the whole family is reunited again.

Not complaining here, but just some points that maybe need some attention and could somehow be prevented or improved. Other than that, love the way you guys listen to all members and keep trying to improve things.

Like the gedcoms where the Dutch names were changed: they are all imported fine now, so a big thank you for that!

Indeed Ronel, thanks (and Bea for the additional comments and welcome too).

I did the integration of the two biographies (example 1 and example 2) and have a few observations to make, the most important of all being that it took me at least twice as long as before to integrate the bio, [also] because of the following reasons:

  1. I also wanted to include the name of the GEDCOM and person and date ...
  2. The text(s) were punctuated where there is no need for punctuation - I had to do a lot more editing than would be usual for a GENI-import
  3. The actual sources (links to primary documents in GENI) had not been migrated. Knowing a bit how GENI-GEDCOM usually works (in the case of South African profiles) I missed many links, even though some of the links are often duplicates ...
  4. I have an idea of the actual secondary sources (which may also have mistakes in) but no surety. One also has to have some knowledge of the specific De Villiers / Pama  genealogical numbering system (see also this link on the project page) to edit this bio properly.

Though there is nearly always some new information to be gleaned from a GEDCOM, in both these instances (and as Bea points out in most other cases of duplicate GEDCOM), the amount of energy going into cleaning up after a duplicate GEDCOM is way more than a simple collaborative adding in scholarly fashion by WikiTreers (also keeping in mind that the previous ''boilerplate'' will have to be integrated as well in the case of a duplicated profile, from === Birth === for example to '''Birth:'''). If the duplicates aren't checked in time and the cut-off dates not changed (1800) this issue will not go away, with all consequences of re-directs and data-load etc.

There is one solution [in the case of the South-African profiles within this project period from around 1652-1806] - to create in time a separate database ''within WikiTree'' with all the name variations where people need to go and check first, before being allowed to create any file, in GEDCOM-form or just plain manually. Example of such a surname index ...

I think the name index is a great idea, Philip. I don't know if it's possible, but it could surely prevent a lot of duplicates if all last names, patronymics and variations could be found in indexes like this :)
Am again rolling like crazy now: One more example of a recent unchecked GEDCOM'd duplicate with new Boilerplate but absolutely no sources: http://www.wikitree.com/wiki/Janse_van_Rensburg-1074 ...
+5 votes

I have a potential addition to your GEDCOM Import To Do List. :-)

What do you think about adding a section to the resulting profiles for Research Notes / To Do List to encourage people to add research notes and To Do items?

Example:

=== Research Notes / To Do List ===

  • Place notes about your research and tasks to be completed for this profile here. This comment can be deleted after import.

As it is right now, I don't think many new members realize that they can add sections and information like this to a profile unless they happen to run across a profile where someone else is doing so. It could serve as a good teaching tool.

 

 

answered by Julie Ricketts G2G6 Pilot (258k points)
+3 votes
Thanks Chris, this does look much better although I haven't tried it out in practice yet with all the recent activities!

One improvement I have mentioned before that I would like is for the profiles 'skipped' because they are the same person, if there are relationships being created in the new gedcom, somehow be captured as a message to the existing profile manager and the new importer similar to the current form for a merge, that says 'do you want to attach these relationships' to 'WIKI ID XXX'?  This report would probably have to generate after the new gedcom was imported, to be able to capture the new WIKI ID's.  Not sure if that would work from the systems point of view? Once approved by both they would get linked up.  Perhaps they could also form part of an arborists feed?
answered by Veronica Williams G2G6 Pilot (108k points)
+3 votes

"We're also now leaving a lot of information unlabeled. Rather than putting "Address:", "City:", and "Country:" in an address, for example, we just print the address."

Could I ask for a measure of consistency with the Tag headings, please? Some have the : before the ''' while others omit it. I would prefer to have the : (colon) every time. i.e. '''Tag:''' I guess this would be an easy change?
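As a rough illustration of what a consistent '''Tag:''' style would mean in practice, here is a small sketch that rewrites a bold label at the start of a line so the colon always sits inside the apostrophes. It is only an assumption about how such a cleanup could be done, not the importer's code. The function name is made up, and labels that themselves contain an apostrophe or colon are deliberately left alone.

```python
import re

# A bold label at the start of a line, with an optional colon inside or outside.
_BOLD_LABEL = re.compile(r"^'''([^':]+):?''':?", re.MULTILINE)


def normalize_labels(wikitext):
    """Rewrite '''Tag''' and '''Tag''': at line start to '''Tag:'''."""
    return _BOLD_LABEL.sub(r"'''\1:'''", wikitext)


before = "'''Birth''' 1850\n'''Death:''' 1900\n'''Name''': John"
after = normalize_labels(before)
# Each line of `after` now starts with '''Birth:''', '''Death:''' or '''Name:'''
```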

BTW, I much prefer the new results 

answered by John Slee G2G2 (2.3k points)


...