Database Management Error Correction

I do not understand why more of the errors are not being corrected the more efficient way, which is behind-the-scenes database management. If this was discussed, I guess I missed it.

We are loading pages one at a time when a database manager could do a global replace operation on every corresponding value in an entire field in one step, in the same amount of time it takes us to do one or two. I don't know who other than Chris has this kind of access, but recruiting help should be a possibility if needed.

Would it help if we identify data values that would be most appropriate for global replace?  Do you need help writing SQL query commands?
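
To make the idea concrete, here is a minimal sketch of the kind of one-step global replace I mean, run against a throwaway in-memory SQLite table from Python. The table name `profiles`, the column `death_location`, and the values being replaced are assumptions for illustration only; I don't know what WikiTree's real schema looks like.

```python
import sqlite3

def global_replace(conn, old_values, new_value=""):
    """Set death_location to new_value wherever it exactly matches one of old_values.

    Returns the number of rows changed. Table and column names are hypothetical.
    """
    placeholders = ",".join("?" for _ in old_values)
    cur = conn.execute(
        "UPDATE profiles SET death_location = ? "
        f"WHERE TRIM(death_location) IN ({placeholders})",
        [new_value, *old_values],
    )
    conn.commit()
    return cur.rowcount

# Throwaway in-memory data standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (id TEXT PRIMARY KEY, death_location TEXT)")
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?)",
    [("Example-1", "Y"), ("Example-2", "York, England"), ("Example-3", " Yes ")],
)
changed = global_replace(conn, ["Y", "Yes"])
rows = dict(conn.execute("SELECT id, death_location FROM profiles"))
```

The match is on the whole trimmed value, so a real place such as "York, England" is left alone even though it starts with a Y.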
in WikiTree Tech by M Anonymous G2G6 Mach 4 (47.1k points)
retagged by Dorothy Barry
I completely agree! Especially with things like capitalization and Y death location. I've mentioned it before in some other threads, but hopefully this will jump-start something.

I agree 200%

It takes 10 minutes to upload 2000 unsourced profiles and 300 man-hours (18,000 minutes) to correct them =>

1800 times more work to correct and be a good WikiTree citizen than to be on the dark unsourced side ;-)

I suggest we change that and start correcting 1800 times faster than people upload ;-) The Database Errors project has made clear that WikiTree is not 100% correct, and if we add structured information a computer can find the mistakes that humans make...

I think WikiTree should help the correcting force and make life tougher for the evil unsourced force.

A more efficient way of correcting is to be bold. If we look at how people think we should react to the Database Errors project errors, some say always be polite, but most people say be bold for most errors.

What about issuing a warning that in 6 months, unless an objection is recorded, profiles will be "corrected" automatically?
I am in favor of that model.
Many of us DO want to make the corrections ourselves, but we do not know how to do it, as so many of your questions have shown. You know how because you have been working on it, and we do appreciate all you do, so just help us to know how to go about it.

Thanks,

2 Answers

+7 votes
 
Best answer
Mikey, automating the correction of the errors in the database could not only be dangerous but could also add more errors than it fixes. Just this morning someone proposed a merge of two siblings on my watchlist who had very similar names, the exact same birth date, and the same parents, but it took looking further to find that they had differing death dates that were not sourced at the time of the proposal. If someone had not been paying attention and the merge had been completed, we would have had to create another profile for the one that was merged away. Giving the power to make global changes quickly to someone could result in changes being made that could have just as much potential for error as the situation I described above but on a much larger scale. I hope that we keep on the slow pace the database error project is moving at so that we do not make a large number of unresearched changes and unwittingly do even greater damage to the database.
by Dale Byers G2G Astronaut (1.3m points)
selected by Bea Wijma
True. Computer automation can quickly flag potential errors. Then it's best for human eyes to review and correct.
But surely there is an in-between? Are there any legit reasons the GEDCOM import shouldn't sanitize all-capital names, or Y/Yes death locations?

There are a lot of errors on there that need reasoning to correct, but some of them are pure housekeeping.

I think there is a lot of low-hanging fruit.

The discussion at WikiTree must start getting precise. Just saying No is not an answer ;-)

Some examples which I feel are non-issues and just a waste of our time. In the long run I think WikiTree is getting a bad reputation because we don't correct obvious things:

  1. Empty profiles (size=0) not marked unsourced?
  2. Having Y in the location field
  3. Having Unknown in the location field
  4. Having just ????? in the name fields
  5. Automatically mark links to Ancestry that need a login with a link template (==> subscription needed)
  6. Automatically mark links to Ancestry that are empty with a link template (==> non-working link warning)
  7. Automatically mark links that are not working with a template (==> non-working link warning); see the dead link template
  8. .....
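
The first four items above are simple enough to check mechanically. A minimal read-only sketch in Python, assuming a profile is a plain dict with the hypothetical keys `name`, `location`, `biography` and `unsourced` (this is not WikiTree's real data model, just an illustration):

```python
import re

# Placeholder values that are not real locations (items 2 and 3).
PLACEHOLDER_LOCATIONS = {"y", "unknown"}
# A name field made up only of question marks (item 4).
ONLY_QUESTION_MARKS = re.compile(r"^\?+$")

def housekeeping_flags(profile):
    """Return a list of flag names for one profile; the profile is never modified."""
    flags = []
    empty = not profile.get("biography", "").strip()
    if empty and not profile.get("unsourced", False):
        flags.append("empty_not_marked_unsourced")      # item 1
    if profile.get("location", "").strip().lower() in PLACEHOLDER_LOCATIONS:
        flags.append("placeholder_location")            # items 2 and 3
    if ONLY_QUESTION_MARKS.match(profile.get("name", "").strip()):
        flags.append("question_mark_name")              # item 4
    return flags
```

This only flags; what to do with a flagged profile (fix it, tag it, or leave it) is a separate decision.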

I tag this G2G entry bots bot_update_discussion

Example of a bot change I found that corrected characters: see the change log entry and a long G2G discussion.

It feels like the profile Westphal-71 still has an issue: when you edit it, there is a lot of text that is not displayed. Hm.

  • The issue is that some historical GED imports from 2010 (possibly other years) seem to include a special character that cannot be read. As such, the whole profile becomes unreadable and it appears to the regular user that "Nobody has entered anything".
     
Dale hits the nail on the head.

And there is another reason: limited resources. We have very few internal resources at WikiTree. I am very cautious about letting WikiTree's expenses grow too quickly. We're committed to always being free. This means that the team is very small. We strive to empower volunteers wherever possible. This is why Aleš is such a hero.

This goes back to what I was saying about genealogical data standards. There are some errors which are complex fixes and require human supervision; this is undeniable as far as I am concerned.

I have been using both the WikiTree merging tools and the various desktop merging tools, and I find them to be very crude and ineffective; they produce way too much cruft and lead to more errors than they fix. I have also talked about the range of merges, from simple unambiguous two-profile merges to complex ambiguous multifamily merges; we can distinguish between these cases and classify them with high precision; there is no reason that the kinds of merges that Dale is concerned with should be done by a machine without human supervision.

Corrections shouldn't be made until we know the full consequences of the correction by application of proposed rules, and in a professional software engineering environment, we'd test these things thoroughly on non-live branches of the WikiTree database before we ever even attempted the fixes on the live database. The gold standard of debugging is formal proof of code correctness: you prove that the program has no errors or results in no errors before you run it on live and risky systems; this standard was developed for the NASA Apollo program for their space capsules because errors cost lives.
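
One cheap way to approximate a non-live branch is to copy the database and run the proposed rule on the copy first. Here is a minimal sketch using Python's `sqlite3` module and its `Connection.backup` method; the `profiles` table, the `birth_location` column and the rule itself are assumptions for illustration, not WikiTree's real schema or tooling.

```python
import sqlite3

def dry_run(live, rule_sql):
    """Apply rule_sql to an in-memory copy of `live` and report what would change.

    Returns {id: (before, after)} for every row the rule touches.
    The live database is only read, never written.
    """
    copy = sqlite3.connect(":memory:")
    live.backup(copy)  # full copy of the live database
    query = "SELECT id, birth_location FROM profiles"
    before = dict(copy.execute(query))
    copy.execute(rule_sql)
    after = dict(copy.execute(query))
    copy.close()
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}

# Throwaway data standing in for the live database.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE profiles (id TEXT PRIMARY KEY, birth_location TEXT)")
live.executemany(
    "INSERT INTO profiles VALUES (?, ?)",
    [("Example-1", "Y"), ("Example-2", "Yorkshire")],
)
live.commit()

diff = dry_run(live, "UPDATE profiles SET birth_location = '' WHERE birth_location = 'Y'")
still_live = dict(live.execute("SELECT id, birth_location FROM profiles"))
```

The returned diff can be reviewed by a human before anyone decides to run the same rule for real.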

Beyond that there are also simple or trivial errors which can be corrected en masse, which would not cost much in terms of computing time or resources and would risk very little: nothing more than is risked with every read and write operation to the active database. For instance, making changes to the birth date field by removing non-date-format information would not damage the database or split profiles or alter relationships between profile entries; it would be relatively simple to identify what kinds of errors exist in the fields by doing machine-read-only surveys to develop samples and devise a parse-and-replace algorithm. As Magnus said, there is a lot of low-hanging fruit that we can deal with long before we have to confront the complex issues of mergers.
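
A read-only survey of that kind could start by reducing each field value to its "shape" and counting the shapes, so the common non-date patterns surface before anyone writes a replace rule. A minimal sketch in Python; the sample values are invented, and the shape encoding (digits become `9`, letters become `A`) is just one convenient choice, not an established standard.

```python
from collections import Counter
import re

def shape(value):
    """Map every digit to 9 and every letter to A, keeping punctuation and spaces."""
    value = re.sub(r"[0-9]", "9", value)
    # [^\W\d_] matches word characters that are neither digits nor underscore, i.e. letters.
    return re.sub(r"[^\W\d_]", "A", value)

def survey(values):
    """Count how often each shape occurs; reads the values only."""
    return Counter(shape(v) for v in values)

# Invented sample of birth date field contents.
sample = ["1850-03-12", "1851-11-02", "abt 1850", "Y"]
counts = survey(sample)
```

Shapes that are not date-like (here `AAA 9999` and `A`) are the candidates to sample and inspect by hand before devising any parse-and-replace rule.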

And again, I will emphasize that this comes back to the long-standing and problematic data standards of the genealogical community; the public portion of WikiTree should exist as a Git-like repository independent of the WikiTree project, so we can develop and test software solutions to the major genealogical problems.

"Giving the power to make global changes quickly to someone could result in changes being made that could have just as much potential for error as the situation I described above but on a much larger scale."

This is a problem of centralization vs distribution. The database error correction project should be distributed not centralized as Dale assumes in this assertion.

Finally, what Dale fails to acknowledge in this:

"I hope that we keep on the slow pace the database error project is moving at so that we do not make a large number of unresearched changes and unwittingly do even greater damage to the database."
is that there are a lot of errors and unresearched changes being made to the WikiTree database by typical users, some of which are likely creating unpredictable damage in the database or damaging the community in subtle ways. To me it seems like Dale is setting up a false dilemma.

"I hope that we keep on the slow pace the database error project is moving at so that we do not make a large number of unresearched changes and unwittingly do even greater damage to the database." 

We already have a faster way of creating damage, called GEDCOM import of unsourced profiles: it takes 6 minutes to upload 500 profiles without sources..... and as we don't use source templates, we have no understanding of how unhealthy the WikiTree family tree is ...

Of course something has to be done about allowing people to immediately upload and import their family tree. It is indeed the fastest way of causing a lot of damage and many hours of work for others, especially if the GEDCOM has (which is often the case) a lot of the same duplicate families/lineages again, and the member ignores all advice to skip these profiles, unchecks all the skip-profile boxes and just imports them all... Adding profiles manually, based on sources, evidence and facts, is of course much faster and probably the best way to prevent errors.

Re Bea
Stupid question. In Sweden we have nearly no duplicates and not so many users. I guess in the States WikiTree users get much much more duplicates.....

Isn't it time to stop GEDCOM imports to WikiTree in areas with a lot of profiles... or restrict them to 3 generations...

I of course know and understand that, Magnus. I am talking about allowing new members to immediately take the Pre-1700 test, which now allows them to upload and import their GEDCOM without skipping any of the profiles and with all Pre-1700 profiles/lineages included. Maybe it's better if people first explore and learn how WikiTree works a bit before being allowed to import GEDCOMs. And maybe it would be better if there were some communication about which families they are planning to import, so we could prevent the import of duplicate lineages/families.

So I don't know if it's possible to allow GEDCOM imports only for a specific country. I do know the uploaded GEDCOMs are very hard to check for duplicates; the Gedcompare isn't catching them all anymore. Many names have been corrected (modern names that people never used are now corrected to patronymics, for example), so even if it catches more variations now, a patronymic will in no way look like the modern name or a variation of that name.

Re-reading your answer @ Magnus: of course it would be great if it were somehow possible to stop GEDCOM imports to WikiTree in areas with a lot of profiles... or restrict them to 3 generations... :)

Ian and Magnus, Chris Whitten agrees with my answer. I will not defend it further.
+1 vote

Some new challenges are errors 107 and 108

107 Full name in UPPERCASE 2878 1807   38 53 369 594 17  
108 Full name in lowercase 3114 2841 1   6 41 218 7
 

 

To change Last Name at Birth I feel you need to be a profile manager or on the trusted list

A candidate for SQL scripting: sql-capitalize-first-letter-only or InitCap

by C S G2G6 Pilot (273k points)
Again, this could create more problems. There are many last names which quite correctly have internal capitalization. Two in my family are FitzWarin (EuroAristo naming convention) and deLisle (modern, appears in all the official records that way).

Let's face it. Some of these are cosmetic: they affect readability and consistency, but they do not affect the database relationship between the profiles. With limited resources, we should be focusing our resources on errors like duplicate siblings and spouses, and inconsistent dates, not capitalization. Yes, flag ALL CAPS as an error. But don't try to automate it.
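
To illustrate the point about internal capitals: a blanket capitalize-first-letter rule mangles exactly the two names mentioned above, whereas a check that only flags all-caps names leaves them alone. A minimal sketch in Python, using `str.title()` as a stand-in for an SQL InitCap-style function (an assumption; a real database function may differ in detail):

```python
def naive_initcap(name):
    """Capitalize the first letter of each word and lowercase the rest."""
    return name.title()

def is_all_caps(name):
    """True if the name has at least two letters and every letter is uppercase."""
    letters = [c for c in name if c.isalpha()]
    return len(letters) > 1 and all(c.isupper() for c in letters)

# The automated rewrite destroys correct internal capitalization...
mangled = [naive_initcap(n) for n in ("FitzWarin", "deLisle")]
# ...while the flag-only check picks out just the ALL CAPS entry.
flagged = [n for n in ("FitzWarin", "deLisle", "SMITH") if is_all_caps(n)]
```

Here `mangled` comes out as `Fitzwarin` and `Delisle`, and only `SMITH` is flagged for a human to look at.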

As Emerson said: "A foolish consistency is the hobgoblin of little minds"

Re Janet or think outside the box

  • Maybe as a profile manager you get a list and check the ones you approve.....
  • Or the ones you have marked false errors will not be changed...
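
Both ideas amount to a filter in front of any automated change. A minimal sketch in Python that combines them, assuming each proposed correction is a dict with the hypothetical keys `profile_id` and `error_code`, and that manager approvals and false-error marks are available as sets of `(profile_id, error_code)` pairs:

```python
def corrections_to_apply(proposed, approved, false_errors):
    """Keep only corrections the manager approved and nobody marked as a false error."""
    result = []
    for correction in proposed:
        key = (correction["profile_id"], correction["error_code"])
        if key in approved and key not in false_errors:
            result.append(correction)
    return result

# Invented example data; 107 and 108 are the uppercase/lowercase name errors.
proposed = [
    {"profile_id": "Example-1", "error_code": 107},
    {"profile_id": "Example-2", "error_code": 107},
    {"profile_id": "Example-3", "error_code": 108},
]
approved = {("Example-1", 107), ("Example-2", 107)}
false_errors = {("Example-2", 107)}
to_apply = corrections_to_apply(proposed, approved, false_errors)
```

Anything not explicitly approved is simply left untouched, which still leaves open what to do about managers who never respond.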
The problem, I feel, is people who are not active: see Jancsics-3, not changed since 22 Dec 2009, with a profile manager not active for the last 5 years.

I assume that all active profile managers have 0 errors in their Watchlist and family tree...

That would be an unwarranted assumption.  There are plenty of active profile managers that are not even aware of the db_errors existence.  And there are plenty of errors that, even once the manager is aware, take time and effort to fix.

@ Magnus Thinking outside the box, maybe it's an idea for early profiles (Pre-1800) to be added automatically to the projects they would fall under? This way no one is "owning" a profile, and all profiles can be worked on by project/WikiTree members interested in the profile/family, without having to wait for an unresponsive member to respond, as in your example.

I would suggest that gedcoms be imported with all dead people set to Open and orphaned.  A Gedcom Integrators Project would then go through them, fixing and merging as necessary.  The line would be that the site doesn't host your tree, you donate your tree to the site.

Like donating old stuff to a charity, which then has to chuck out the stuff it can't use.  Because it certainly can't afford to warehouse it.  In a single tree, there's an overhead cost in carrying all the junk.  It interferes with other people's use of the site.
Or maybe there could just be a WikiTree Profile created that is automatically added as manager when profiles are imported, working similarly to the project profiles but more general and less specific? All WikiTree members who would like to could then be invited to join the WikiTree Profile Google group and be able to keep track of these profiles, or maybe only members of the Gedcom Integrators Project, as you suggested, could take care of this and then sort them all out a bit?

If profiles are orphaned they are free for adoption and if they are adopted by someone who decides to maybe change the privacy settings again, it (orphaning them all) didn't really help of course.
You're right of course, the big snag with orphaning profiles is that they don't stay orphaned.
Comment is in the wrong place; it should have been under my answer. Sorry.

 

Ian and Magnus, Chris Whitten agrees with my answer. I will not defend it further.

Dale, you are a man of short sentences.... The problem is you need to understand the non-logical approach WikiTree has:

  1. We don't want to do any SQL cleaning of profiles
    1. instead we let people spend 1000s of hours taking away Y in death locations or changing CAPS in names...

  2. We have an honour code that profiles should be sourced, but it's OK to upload unsourced profiles and do nothing more....
    1. and we don't follow up on whether they are a good member or not

  3. The only people we allow to send in a script to make changes are "New users": they can upload a GEDCOM with 2500 profiles to add 2500 unsourced profiles, and do that multiple times....

Files uploaded in the last 2 hours today - Houston, Houston, we've got a problem

==> We let users 

  1. upload in 2 hours 3565 profiles 
  2. with 1442 sources
    1. 1148 of the sources are in one upload
    2. I guess 90% of those profiles added are unsourced.... and we don't mark them with the template unsourced

 

 

Magnus, I have nothing to do with GEDCOM uploads and in fact I discourage them. If you have a problem with them, take it up with Chris Whitten.
