If you had a tool that could scan all the profiles you managed for areas needing attention, what would it look for?

+9 votes
396 views
Being a programmer, I wanted to play around with the gedcom data downloaded from WikiTree. To get things started, I wrote a Python module to parse gedcom files and, using that module, a tool that looks for issues in profiles. It is inspired by the lint utility used by C programmers.

Currently, the wt_lint tool finds profiles with an unknown gender or with the word 'todo' in the bio or messages. What else would you like to find as potential problems in your profiles? One idea I plan to add is a check for repeated sections, such as multiple "Sources" sections.
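To make the two current checks concrete, here is a minimal sketch of how they could look. It is not the actual wt_lint code: the function names `parse_individuals` and `lint` are hypothetical, and it assumes plain GEDCOM lines of the form `level tag value` with the gender stored in a level-1 `SEX` line.

```python
import re


def parse_individuals(lines):
    """Group GEDCOM lines into {xref_id: [(level, tag, value), ...]} for INDI records."""
    individuals = {}
    current = None
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        level = int(parts[0])
        if level == 0:
            # Level-0 individual records look like "0 @I1@ INDI"
            if len(parts) == 3 and parts[2] == "INDI":
                current = individuals.setdefault(parts[1], [])
            else:
                current = None
        elif current is not None:
            value = parts[2] if len(parts) == 3 else ""
            current.append((level, parts[1], value))
    return individuals


def lint(individuals):
    """Return (xref_id, message) pairs for the two checks described above."""
    issues = []
    for xref, fields in individuals.items():
        sexes = [v for lvl, tag, v in fields if lvl == 1 and tag == "SEX"]
        if not sexes or sexes[0].strip().upper() not in ("M", "F"):
            issues.append((xref, "unknown gender"))
        # For simplicity this searches every field value, not only the bio.
        text = " ".join(v for _, _, v in fields)
        if re.search(r"\btodo\b", text, re.IGNORECASE):
            issues.append((xref, "contains 'todo'"))
    return issues
```

Calling `lint(parse_individuals(open("my_tree.ged")))` (file name made up) would give a list of flagged profile ids to review by hand.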

For those comfortable with running Python scripts at the command line and using Mercurial to download software, feel free to try out the tools and give me your feedback. http://www.wikitree.com/wiki/Project:WikiTree_Tools
in Genealogy Help by Roland Arsenault G2G6 Mach 5 (54.1k points)
recategorized by Chris Whitten
Yay, Roland!
Just added a wt_progress script that lists how many ancestors per generation you have. Here's what the output looks like for me:

./wt_progress.py ARSENAULT-64.ged "Roland Arsenault"
generation 0 - 1 of 1 ( 100.0 %)
generation 1 - 2 of 2 ( 100.0 %)
generation 2 - 4 of 4 ( 100.0 %)
generation 3 - 8 of 8 ( 100.0 %)
generation 4 - 16 of 16 ( 100.0 %)
generation 5 - 30 of 32 ( 93.75 %)
generation 6 - 42 of 64 ( 65.625 %)
generation 7 - 63 of 128 ( 49.21875 %)
generation 8 - 86 of 256 ( 33.59375 %)
generation 9 - 103 of 512 ( 20.1171875 %)
generation 10 - 98 of 1024 ( 9.5703125 %)
generation 11 - 68 of 2048 ( 3.3203125 %)
generation 12 - 30 of 4096 ( 0.732421875 %)
generation 13 - 12 of 8192 ( 0.146484375 %)
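For readers curious how numbers like these can be computed, here is a minimal sketch. It is not the actual wt_progress.py: the `parents` dictionary (person id mapped to a father/mother pair) and both function names are assumptions made for the example. Generation n has 2**n possible ancestor slots, and the sketch counts how many of them are filled.

```python
def ancestors_per_generation(parents, root, max_generations=20):
    """Return (generation, found, possible) tuples, starting at the root person.

    `parents` maps a person id to a (father_id, mother_id) tuple; either
    entry may be None when that parent is unknown.
    """
    counts = []
    current = [root]
    generation = 0
    while current and generation <= max_generations:
        counts.append((generation, len(current), 2 ** generation))
        next_generation = []
        for person in current:
            for parent in parents.get(person, (None, None)):
                if parent is not None:
                    # An ancestor reached through two lines fills two slots.
                    next_generation.append(parent)
        current = next_generation
        generation += 1
    return counts


def format_progress(counts):
    """Format the counts in the same style as the output shown above."""
    return [
        "generation %d - %d of %d ( %s %%)" % (g, found, total, 100.0 * found / total)
        for g, found, total in counts
    ]
```

The same walk could also collect the names in each generation instead of only counting them.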
Great work, Roland... ! Can't wait to see what you come up with...

Scripts have finally been updated to reflect changes in the GEDCOM export implemented a couple of months ago.

I also added a wikitree specific version of the script that creates a graph of a gedcom. It uses the privacy level information to create a tree that can be publicly shared. Here's my tree for example: http://wikitree-tools.googlecode.com/hg/example_tree.png

That's awesome, Roland. Nice work!
What it is missing is what I find most frustrating. Missing birth or death dates are what I'd like to see flagged. Maybe some kind of scoring system. We all have missing data, but for a large tree with no dates I'd like to know that before I even hit the compare button.

More improvements to the tools!

  • missing birth and/or death date checks added to wt_lint
  • max age check added to wt_lint
  • Graphical User Interface (wikitree-tools.py) added for wt_lint
  • optional html output added to wt_lint (with links to flagged profiles)
I found a prog called GenSmarts. It takes a GEDCOM and runs through pointing you to areas needing work but is very intelligent and helps tie things together. E.g. You haven't got someone's birth details. But GenSmarts picks up brother born in X, parents married in X so probably missing details will be found in X. It then provides prefilled links to numerous sites to hunt down the missing info.

If this tech could be stuck into wiki it would be amazing :-)

If nothing else, links to say Ancestry.com prefilled with the wiki persons data would be a great start :-)
I agree, David. I love GenSmarts and use it quite often. If something similar could be implemented in WikiTree it would be awesome.
Ok wow awesome!  I have been doing exactly that by hand!
Can you get it to print out the names of each ancestor of the nth generation, or a link to each? That would be a great tool.

5 Answers

0 votes
 
Best answer
Hi - I'm out of my league here (I think my eyes glazed over about halfway through the answers!), but thought I'd offer some suggestions without solutions :)  Anyway, I keep checking to make sure that I've

* done something with the death date (either entered what I have, checked living, or checked the definitely not living) and

* done something with the generated "firsthand knowledge" line (equivalent to the gedcom needs editing line).

I also

* look at the bottom of the profile and re-check unmerged matches and

* periodically go through my watch list to check that none of the profiles I manage have ( - ) for the (birth - death) data.  I adopt a lot of profiles, looking for a couple of my missing links, and I've found that having a ballpark date (either birth or death) makes looking for matches a lot less painful.

Cheers,
Liz
by Liz Shifflett G2G6 Pilot (347k points)
selected by Gloria Lange
I am also out of my league on this subject but I vaguely see some of the solutions to some of the problems I have been having with WikiTree. I loved your answer because it contains the steps I have been taking to deal ... I so hate having to change each and every profile from firsthand knowledge to compiled data ... wish I knew how to do it autogenerated or something... I am so lame... anyways thanks
Thank you! Cheers, Liz
+6 votes

Hi Roland, This sounds like it could be a cool utility.

1) Have you considered scanning for date anomalies (even with the greatest care, shared or individually managed profiles can end up with logic errors):

Marriage date after death; Marriage date before birth; Birth date after death; Child's birth less than X years after own birth; Child's birth after own death;

2) Have you considered scanning for duplicate relations - often profile merges are done and some relations or offspring don't get merged or resolved at the same time (they might be managed by other or unresponsive managers):

Two spouses with the same surname and/or forenames (this might be bona fide); Two children with the same forename and birthdates within X years (this might be bona fide); Two siblings should be covered by the above but may be needed if parents are not on your watchlist;

3) Have you considered scanning for profiles that have remained unchanged for a long time - often you need to revisit profiles after a while just to reassess whether more research tools or records are available now than when the profile was last edited:

Profile last modified more than X years ago (from current date/time).

4) Have you considered scanning for biographies that have remained untouched after an import:

Profiles where the biography contains the text... "This biography is a rough draft. It was auto-generated by a GEDCOM import and needs to be edited."

Profiles where the biography contains gedcom object references that could indicate that the profile manager has associated 'media' (pictures, etc) that could be uploaded to enhance the biography...  an AT symbol followed by a string of numbers and another AT symbol (without spaces) E.G. @O63@
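A few of the checks suggested in items 1 and 4 can be sketched as follows. This is only an illustration under stated assumptions: the function names and the default minimum parent age of 12 are made up, dates are assumed to be already parsed into `datetime.date` values, and the object reference pattern allows optional letters before the digits so that the example @O63@ matches.

```python
import re
from datetime import date

ROUGH_DRAFT = ("This biography is a rough draft. It was auto-generated "
               "by a GEDCOM import and needs to be edited.")
# An @ sign, optional letters, digits, and a closing @, e.g. @O63@
OBJECT_REF = re.compile(r"@[A-Za-z]*\d+@")


def date_anomalies(birth=None, death=None, marriage=None,
                   child_births=(), min_parent_age=12):
    """Return a list of logic errors found among one person's dates."""
    problems = []
    if birth and death and birth > death:
        problems.append("birth after death")
    if marriage and birth and marriage < birth:
        problems.append("marriage before birth")
    if marriage and death and marriage > death:
        problems.append("marriage after death")
    for child in child_births:
        # Year difference only, so this is approximate near the threshold.
        if birth and child.year - birth.year < min_parent_age:
            problems.append("child born less than %d years after own birth"
                            % min_parent_age)
        # Heuristic: a child can legitimately be born shortly after a
        # father's death, so treat this as something to review.
        if death and child > death:
            problems.append("child born after own death")
    return problems


def bio_flags(bio):
    """Flag untouched import text and leftover gedcom object references."""
    flags = []
    if ROUGH_DRAFT in bio:
        flags.append("untouched import biography")
    if OBJECT_REF.search(bio):
        flags.append("gedcom object reference")
    return flags
```

Every result is a hint to look at the profile, not proof of an error, since some of these cases can be bona fide.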

Just some ideas for you Roland, Cheers

by Wombat Allen G2G6 Mach 2 (22.1k points)
Thanks! All great suggestions. 1, 2 and 4 should be doable. 3 would need a "last modified" type of field, which is missing from the gedcom, so it might be more difficult to implement...
Nice, Wombat!

Note that #2 is part of what http://www.wikitree.com/wiki/Special:FindMatches is intended to do. You can use it to search WikiTree-wide, or just within your own Watchlist. Not that it couldn't be much better.

You can do #3 with your Watchlist! Sort by reverse edit date.
Good point Chris. I'm not trying to reinvent the wheel, so I'm not going to add checks that aren't trivial if they can already be done on the website.
+4 votes
After importing the gedcom export from WikiTree into a number of other Genealogy programs, I found that the Wiki tags (_BIO) from WikiTree were recognized by no other program, so I wrote a quick fixup to change it into general notes for the person (profile). Also, most other programs don't support first+middle name, so the same fixup appended middle name to given name, plus cleaning up some of the template 'noise' in the biographies. It's in Java though, probably not easily accessible to most users.
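The original fixup is in Java and has not been posted, so the following is only a rough Python sketch of the same idea. It assumes, without having checked an actual export, that the biography appears as a `_BIO` line, that the middle name appears as a `_MIDN` line somewhere after the person's `NAME` line, and that names are written as `Given /Surname/`. The function name `fix_gedcom` is made up.

```python
def fix_gedcom(lines):
    """Return a copy of the GEDCOM lines with _BIO renamed to NOTE and the
    _MIDN value appended to the given name on the preceding NAME line."""
    out = []
    name_index = None  # position in `out` of the current person's NAME line
    for raw in lines:
        line = raw.rstrip("\n")
        parts = line.split(" ", 2)
        level = parts[0]
        tag = parts[1] if len(parts) > 1 else ""
        value = parts[2] if len(parts) > 2 else ""
        if level == "0":
            name_index = None  # a new record starts here
        if tag == "NAME":
            name_index = len(out)
        elif tag == "_BIO":
            # Other programs understand NOTE but not the custom _BIO tag.
            line = " ".join([level, "NOTE", value]).rstrip()
        elif tag == "_MIDN" and name_index is not None and value:
            name_parts = out[name_index].split(" ", 2)
            name_value = name_parts[2] if len(name_parts) > 2 else ""
            given, slash, rest = name_value.partition("/")
            new_value = (given.strip() + " " + value).strip()
            if slash:
                new_value += " /" + rest
            out[name_index] = "%s NAME %s" % (name_parts[0], new_value)
        # The _MIDN line itself is kept; other programs simply ignore it.
        out.append(line)
    return out
```

Note that running this twice on the same file would append the middle name twice, so it should be applied only to a fresh export.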

The main reason I wanted to import into other programs is to take advantage of the data validations, though eventually I'd like to publish the same information to multiple online sites. Gramps and FamilyTree Builder were particularly good at this. I use these to check improbable ages and relationships, then fix the data in WikiTree (as system of record) before the next export. All of the improvements you mentioned are already available in other free genealogy programs. I wouldn't consider duplicating that same logic when you can simply import (plus navigation is much easier), but adding that logic to WikiTree itself would be extremely helpful.

If there's interest, I could post the program source/runtime in the Tools section, though I have to clean up some of the hard-coded settings first.
by Bob Fields G2G6 (9k points)
This is great stuff, Bob.

Have you considered adding yourself to this project?
http://www.wikitree.com/wiki/Project:WikiTree_Tools

I know you've suggested some fixes in the past and they're on the to-do list. Honestly. Things on the to-do list do get done, it can just take a long time.

Maybe the WikiTree Tools Project could evolve into a way to make our development process much quicker.
Good point Bob, Gramps and other tools can already do much of that. To be honest, my goal isn't to write this tool. This is just a step I'm taking in learning more about the gedcom format so I can eventually mess around with more advanced data visualization which I haven't seen yet in any software or platform. Of course, I have already put my wt_lint script to good use finding all the todo I had scattered in my tree and forgotten!

Chris, I agree that the WikiTree Tools Project could help in development. I see it as a place where people comfortable with programming can easily prototype ideas and see if it's worth implementing in the site itself.
The two WikiTree tags that aren't recognized by other programs are _BIO (Biography - the Wiki text section) and _MIDN (middle name). There were also some issues with living people being marked as dead in the export (adding an empty DEAT tag if nothing was marked in the profile, rather than estimating if dead based on birth date). See http://en.wikipedia.org/wiki/Gedcom + references for gedcom details.

Security levels are based on the birth date plus if/how long a person has been dead, so the appropriate privacy level can be derived from those values. Yes they are not part of the export, but the derived values work fine for me as long as birth and death are correct, I pretty much work exclusively with 200+ year old profiles anyway.
Having a privacy field in the exported gedcom would avoid having to parse the dates and derive it (falling into my not reinventing the wheel category), but more importantly it would allow me to respect the privacy levels for profiles which have been set to a level other than the default.
Putting the privacy level in gedcom exports would be easy enough.

Is this starting to beg the question of whether gedcom export isn't the right medium? We could be doing some other sort of data dump, or maybe some sort of API. That would add more complication, of course. Pros and cons?
So far, I'm not feeling limited by the gedcom standard itself, so I'm not a fan of switching to something else unless it solves a real problem. ( http://xkcd.com/927/ )

Now, if you want my vision of something that would need quite some work to implement but would be a cool long-term goal, it is the ability for a tool such as Gramps to "sync" with WikiTree. That would allow using some of the tools in Gramps which might not be practical or feasible to implement within WikiTree's web interface, without needing to maintain parallel trees. (I'm picking Gramps here since it's an existing open source tool, which can be modified without needing to go through a vendor.)

To make something like this work, I envision something much like source code version control where your local application wouldn't be allowed to upload back to WikiTree local changes to profiles which have been modified since the last "sync" unless those changes are first merged in locally. (This might not make much sense unless you have used version control tools such as subversion...)
Great cartoon, Roland.

OK, we will add Privacy Level to GEDCOM exports. I made a quick note on http://www.wikitree.com/wiki/Bugs -- want to add to that? You can put other items here too.

We need a way to track larger suggestions and new features. Bob mentioned this. Currently we use an internal bug tracking system. I'm not sure the best/easiest way to keep ideas like this one on sync'ing organized. Just with G2G? With the Projects page? With a new wiki page?
I love the idea of a tool that changes things in the GEDCOM to make it better to import to other programs. I recently started using RootsMagic, mostly so I could use some of the fancy tools and related software like Family Atlas, but it only has a given name field, so I ended up losing all the middle names on import.

I love your syncing idea, Roland! I've realised that I'll never get into the habit of updating two separate trees, so all I can do right now is repeatedly start over on RootsMagic by getting a fresh GEDCOM off WikiTree. Even if there was something that just read changes from WikiTree and updated a desktop app, that would be amazing.
Lianne, I just checked out that software and it looks like Family Atlas does something similar to what I want to do! What I want to do is create an animation of the migrations of my ancestors. Step 1, which I've done by writing these python scripts, is to make sure I can read my tree in a programming environment.
Next will be to figure out how to encode times and places in profiles, probably adding to timelines. Once I have that solved, I'll see if I can create a movie showing migrations.
Synching in one direction (your computer to internet) is easy. FamilyTreeBuilder does this today. Synching in two directions (your computer <-> collaborative internet app such as WikiTree) is much more difficult but not impossible. You would have to know the last changed date for each record on both sides, plus the time of last synch. If one side has been changed since the last synch, but not the other, then we synch in that direction. If both have been changed, then we will have to manually resolve the conflicting data, down to the specific field. It's less like source code control than like dealing with multiple database (Record of Origin / System of Record) issues.
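The decision rule described above fits in a few lines. This is a minimal sketch with made-up names (`sync_action` and the returned labels); it works per record and leaves the field-by-field conflict resolution to a person.

```python
from datetime import datetime


def sync_action(local_changed, remote_changed, last_sync):
    """Decide what to do with one record, given three timestamps.

    Returns "push" (local to WikiTree), "pull" (WikiTree to local),
    "conflict" (both changed, resolve by hand) or "none".
    """
    local_dirty = local_changed > last_sync
    remote_dirty = remote_changed > last_sync
    if local_dirty and remote_dirty:
        return "conflict"
    if local_dirty:
        return "push"
    if remote_dirty:
        return "pull"
    return "none"
```

A one-direction sync from WikiTree to the desktop is the special case where the local copy is never edited, so only "pull" and "none" can come back.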

Right now I just repeatedly start over in RootsMagic and Gramps and FamilyTreeBuilder, making note of the errors found and making the updates back on WikiTree. Manual export from WikiTree, automated gedcom fix, manual import into the local genealogy program.

If you are looking for an Atlas of locations, check out WebTrees. It had embedded Google Maps components within the page to provide markers (assuming the location can be recognized as valid by google).

Chris - wouldn't it also be easy to fix the GivenName problem that both Lianne and I mentioned (and I wrote a program to automatically fix)? The first thing I noticed with the basic import was that all the middle names were gone.
Roland: Yeah, I was so excited when I discovered Family Atlas! I haven't had a chance to play around with it much yet, but my main issue so far is that (while I think it's supposed to be able to deal with historical places) I can't seem to get it to recognise Acadia. lol, big issue for me. But I'll keep looking into that.

Bob: I'd be happy with just syncing one direction, but from WikiTree to desktop. It doesn't seem like it would be that hard. Any time a change is made to one of your WikiTree profiles, make it on the desktop. So you'd essentially need something watching your watchlist activity feed. And since it's just going from WikiTree to my own personal desktop, there won't be merge errors, because I only make changes on WikiTree.

In terms of fixing these problems with the GEDCOM exports, well, some people might want the middle names separate, if they're uploading somewhere else that also supports this. I guess I would like it best if there were options when exporting your GEDCOM. For example, select whether to have given names combined, or separate. Perhaps also select which people on your watchlist to export. (I didn't mind this until I started working on the royal family. I don't need them in my Family Atlas and whatnot! But also I invite a lot of people, so there are usually a few of them on my watchlist until I go through and remove them.)
Synching from html to html program is relatively easy, we could always parse the web page results into a usable object with the right attributes, by iterating through the watchlist changes between specific dates, but putting the info into a local program would be more difficult because they are all closed systems with no public interface other than through GEDCOM import, except for Gramps or WebTrees (open source) where something might be possible. I'd hate to try to decode the internal storage format, unless they used a local database. It's easiest now to simply overwrite each time through import.

I of course would love to have the middle name and married name and certainty available within other genealogy programs, but I am simply working within the limitations of the software programs that I use (and of the gedcom export itself). My program doesn't remove unused fields such as these, but they are ignored on import to other programs.
+3 votes
Scanning for possible living persons who should have a more secure privacy level.
by Debby Black G2G6 Mach 8 (80.6k points)
Or related: Scanning for profiles by death date so you can decide if you want to make profiles more open that may not necessarily be open by default. (I've been trying to make anyone dead for more than 100 years open, because it is unlikely that their direct descendants are still living, but WikiTree only makes people born 200 years ago open.)
Also related, the security level of a profile doesn't seem to be part of the gedcom. Chris, would it be possible to somehow add that to the exported gedcom? It could help detect the above mentioned issues but it could also help with the other script I wrote to create a complete graph of a tree. Currently, the graph I produce contains data that would violate the privacy settings so I can't share the results. I would like to be able to add the logic in my script to respect the privacy levels so that the resulting graph would hide the info which isn't meant to be public.
Debby and Erin:

I assume you're both familiar with the bulk privacy changes tool:
http://www.wikitree.com/wiki/Special:PrivacyChanges

What you can't do with this is sort by birth date. That would be handy. I think it would be doable.

In the meantime, you can sort your Watchlist by birthdate.

Chris
Thanks, Chris. I just sorted by birth using the Watchlist and found a few profiles that should have been marked with greater privacy. There's so much to learn at the site when one first becomes active. All the help from you, the WikiTeam in general, and other WikiTreers is very encouraging.

With all of our wishes and wants, I hope you and the WikiTree team don't become overwhelmed. This site is wonderful as is, and we can do without a lot of our wants and wishes to keep the site free and to keep all of you from exhausting yourselves. Thanks for offering such a great place where we can share and merge and find cousins and all without having to pay to do it!
Thank you, Debby. I very much appreciate the kind words.
+3 votes
I just thought of something that would be handy, if maybe a little harder to implement.

FamilySearch recently changed their URL structure, and it's no longer backwards compatible, so suddenly a tonne of my source links are broken. :( I'd love a way to check for those.

Actually, that might not be too hard... If we could figure out the structure of the URLs, we wouldn't need to actually check them. The broken ones all seem to end with /p2 or something like that. Some regular expressions could probably weed them out. :)

~Lianne
by Lianne Lavoie G2G6 Pilot (419k points)
I've noticed that many FamilySearch links end in # space #. I've replaced the space with a . (period, dot, etc) and they are no longer the same, but not broken.
Hmm, I haven't noticed that. As I'm replacing the broken links, it seems the new ones are all formatted like this: https://familysearch.org/pal:/MM9.1.1/F2JC-2SW

The broken ones I've seen so far all end with either /p2 or /p4.
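The regular expression idea could look something like this. It is a minimal sketch that relies only on the observation above (broken links end with /p2 or /p4); the function name is hypothetical and the old-style URL used to try it out would have to be a made-up placeholder, not a real FamilySearch address.

```python
import re

# Rough URL matcher: stops at whitespace, angle brackets, quotes and "]"
URL_RE = re.compile(r"https?://[^\s<>\"\]]+")


def suspect_familysearch_links(text):
    """Return FamilySearch URLs in `text` that end with /p2 or /p4."""
    suspects = []
    for url in URL_RE.findall(text):
        url = url.rstrip(".,;)")  # drop trailing sentence punctuation
        if "familysearch.org" in url.lower() and re.search(r"/p[24]$", url):
            suspects.append(url)
    return suspects
```

Since the pattern is based on a handful of observed links, it may miss broken links with other endings or flag ones that still work.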
Good idea. I think a generic URL checker is the way to go so the tool would be more broadly useful and not depend on knowing specific changes made on a specific host.
Good point. That way it'll still be useful if FamilySearch changes their URLs again. :P
I just committed a wt_url_check.py script to the repo. It can be used to list all urls found in bios without checking them, or checking them and only list the ones with problems.

On my data, I am getting a few false positives from some Wikipedia urls. I'm getting a "forbidden" answer from my script but they work when followed from the WikiTree page, so I'm assuming the server is treating the script as a robot. Not a big deal since this script is meant to help us, not do all the work for us!
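For anyone wanting to experiment with a generic checker, here is a minimal sketch using only the Python standard library. It is not the wt_url_check.py code: the function names and the User-Agent string are assumptions. Sending a browser-like User-Agent header may reduce "forbidden" answers from servers that reject the default one, but that is a guess, not something verified here.

```python
import urllib.error
import urllib.request

USER_AGENT = "Mozilla/5.0 (compatible; link-check sketch)"  # made-up value


def check_url(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for `url`, or None if it is unreachable."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 403 Forbidden or 404 Not Found
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, ...


def find_problem_urls(urls, opener=urllib.request.urlopen):
    """Return (url, status) pairs for urls that failed or answered >= 400."""
    problems = []
    for url in urls:
        status = check_url(url, opener=opener)
        if status is None or status >= 400:
            problems.append((url, status))
    return problems
```

The `opener` parameter exists so the functions can be tried without network access by passing in a stand-in. Flagged urls still need a manual look, since a server can refuse a script while serving the same page to a browser.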
FYI: I'm finding some FamilySearch links which are getting flagged by the script, but are actually still valid, so more false positives to be careful about!
