DB_Error: is WikiTree moving in the right direction?

+9 votes
359 views

New year, and maybe time to reflect on the magic work of Aleš that changed the focus of WikiTree and also gave visibility to quality concerns; see "News on database errors project January 15th 2017".

Analysis no. 38 shows 1,551,285 errors in total. Analysis no. 1, on 2016-04-01, had 206,558 errors....

Are we moving in the right direction?

  1. Is GEDCOM import killing the quality of the family tree?
  2. Is there need for more education?
  3. More validations?
  4. Start building bots that correct things we know are wrong, like a death in the future?
  5. Or is everything OK, or is the Database Error project negative for WikiTree because more people are "touching my tree"?


Some data

in The Tree House by C S G2G6 Pilot (273k points)
retagged by Maggie N.

Sure. The word is: tålamodsprövande

But I run with what there is.

Yes, I like the warnings, even though sometimes it takes a while to find the problem.  I did like it yesterday when I was warned that I had a death before birth, which had happened when I typed 180 instead of 1808.  I'd hate to have to report one of those "need someone with a pre-1500 badge to help me."
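The kind of check described here (death before birth, or the death in the future mentioned in the question) can be sketched in a few lines. This is only an illustration, not WikiTree's actual validation code: the function name `check_dates`, the year-only comparison, and the birth year 1740 in the usage line are assumptions made for the example.

```python
from datetime import date
from typing import List, Optional


def check_dates(birth_year: Optional[int], death_year: Optional[int],
                today: Optional[date] = None) -> List[str]:
    """Return warnings for implausible birth/death years (hypothetical helper)."""
    current_year = (today or date.today()).year
    warnings = []
    if death_year is not None and death_year > current_year:
        warnings.append("death in the future")
    if birth_year is not None and birth_year > current_year:
        warnings.append("birth in the future")
    if (birth_year is not None and death_year is not None
            and death_year < birth_year):
        warnings.append("death before birth")
    return warnings


# The typo from the comment: 180 typed instead of 1808.
# The birth year 1740 is invented for the example.
print(check_dates(birth_year=1740, death_year=180))
```

A check like this only flags the record; deciding what the correct year should be still needs a person looking at the sources.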
Are all the earlier db errors still needing to be worked on or are they all included in the latest db errors list?
The last list has all the errors. I left older ones just for comparison.
That's what I thought. Thank you.
To make a meaningful comparison between list nr 1 and list nr 38, we need to know the total number of profiles in the database at each time.

Although the total number of errors may be up, are we making progress on the number of errors per profile?

@Daryl Use http://wikitree.sdms.si/

Then you get 

  • number of profiles processed
  • percent change per error
  • change in errors for protected and not protected profiles
  • changes in error for profiles from different centuries...

see video how to do it

Date          Number of profiles processed

2017-01-15    12897886
2017-01-08    12842637
2017-01-01    12778835     
2016-12-25    12729069     
2016-12-18    12689683     
2016-12-11    12640458     
2016-12-04    12593640     
2016-11-27    12546691     
2016-11-20    12499941     
2016-11-13    12455377     
2016-11-06    12410374     
2016-10-30    12361572     
2016-10-23    12308004     
2016-10-16    12257126     
2016-10-09    12215312     
2016-10-02    12170087     
2016-09-25    12126687     
2016-09-18    12079370     
2016-09-11    12021239     
2016-09-04    11978356     
2016-08-28    11940849     
2016-08-21    11894436     
2016-08-14    11850728     
2016-08-07    11806668     
2016-07-31    11758550     
2016-07-24    11711152     
2016-07-17    11660685     
2016-07-10    11614317     
2016-07-03    11563204     
2016-06-26    11514480     
2016-06-19    11467203     
2016-06-12    11420927     
2016-06-05    11361166     
2016-05-29    11311734     
2016-05-22    11262275     
2016-05-15    11206469     
2016-05-11    11184648     
2016-05-01    11107636     
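The normalisation Daryl asks about (errors per profile rather than raw totals) can be sketched with the figures quoted in this thread. Two caveats, both assumptions of this sketch: no profile count is given for analysis no. 1 (2016-04-01), so the earliest count in the table (2016-05-01) is used as a stand-in, and the later analyses include more types of checks, so the two rates are not a like-for-like comparison.

```python
def errors_per_thousand(errors: int, profiles: int) -> float:
    """Errors per 1000 processed profiles."""
    return 1000.0 * errors / profiles


# Analysis no. 38: error total from the question, profile count from the
# 2017-01-15 row of the table.
latest = errors_per_thousand(1_551_285, 12_897_886)

# Analysis no. 1: error total from the question.
# ASSUMPTION: the 2016-05-01 profile count stands in for the missing
# 2016-04-01 count.
first = errors_per_thousand(206_558, 11_107_636)

print(f"first: {first:.1f}, latest: {latest:.1f} errors per 1000 profiles")
```

A fairer comparison would restrict both totals to the error types that already existed in the first analysis.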

 

Thanks Magnus! That video helps me out a lot!
Thank Aleš, he has done all the magic.... It takes some time to understand all the charts he produces....
Thank you for sharing the video. I had no idea it was possible for us (the individual WikiTree'rs) to look at the data this way.

Absolutely rocks my world! Hats off to the geniuses that work so diligently behind the scenes!!!

4 Answers

+10 votes
Magnus, WikiTree is not getting worse. For the last several weeks I have been removing profiles from my watchlist and doing only very minor edits to the remaining profiles. I also try to fix the errors, or remove myself from the profiles, that show up on my error report, with very poor results, because most weeks the list of reported errors grows. The problem is that the project keeps adding things to fix faster than we can correct them, and most of the things it finds for profiles I manage are not errors in the first place, so they artificially inflate the number of errors. As an example, the only "error" on any profile I manage this week was caused by WikiData and the profile not being connected to the big tree. There is no error in the profile, and because it is in the Notables project as well as being Open privacy, and was both in the project and Open before I adopted it, I feel it will not be fixed soon. I tried to connect it last week but just could not find any connection.
by Dale Byers G2G Astronaut (1.3m points)
@Dale I think that with so much data available about WikiTree and its genealogy quality, maybe we can see what could be changed to make WikiTree better... what works and what doesn't.

Does it matter that WikiTree has identified 195 000 uncleaned profiles? Should the cleanup be done by a bot? In 3 months 10 000 profiles have been cleaned, and I guess that is by active members... what about the maybe 150 000 uncleaned profiles of inactive members...

How many people have found the option to run the report? Is it 1% of the users or 50%?

I feel that before Aleš started this project by himself, no one had thought about doing what he has done... I guess now, with more data and knowledge, maybe there are new possibilities...
I said nothing bad about the Database Error project, but as others have noted, the number of types of errors has grown since it started. So your statement that the total number of errors is growing is itself a false statement; it would only be valid if you took only the errors that were on the first report and compared the totals of only those errors then and now. I do not want to even try to do that, but I have a feeling it would show that the number of "errors" is going down, not up.

I did that, see my comparison. I feel "one problem" is protected profiles with inactive profile managers... that means some errors will always be part of the equation.

All the profiles with inactive profile managers and uncleaned GEDCOM imports: is that a problem? Should a bot be designed that cleans them in an hour... or should those profiles be hidden... or isn't it a problem....

I see you have done a comparison of some of the errors from the first report and the last in a comment above and in every case you cited the numbers decreased so it is moving in the right direction. I know you want everything fixed right now but that is not possible if you want to do accurate research.
Trying to change the subject to avoid seeing that your original "question" was based on false data is not the answer. I will no longer respond to your comments on this thread.

I am more interested in what others think, and I feel it's odd that we don't look at whether something should be changed to make WikiTree quality better.......

We still have uncleaned GEDCOM profiles that I guess Aleš doesn't address. Google, which normally gives a low estimate, says 737 000+ WikiTree profiles contain CONT; a year ago it said 259 000 profiles.

@Dale please don't go low and attack me. I have a Master's in Applied Physics and understand statistics better than most people you have met; see my profile on LinkedIn.

The mistake I made in my first topic was that I never thought someone would compare the raw error numbers and not realize Aleš is making magic all the time. I tried to get some creativity from the community, but it ends as always with you attacking me ;-) What would people like you and me do without the internet ;-)

Well, Google lies. Those search results are quite a bad estimate.

CONT appears in 323054 profiles. I didn't add it as an error, since it can be corrected by a bot, and that will be done at some point. Some other errors that run to around 100000 are also on hold for the same reason.

Most of the errors we have are not easily corrected by a bot and need a human touch.

@Aleš I remember you said that but couldn't find your post....

My feeling is that Chris cares that we merge correctly, i.e. into the lowest number, because if we don't it will hurt WikiTree's ranking on Google (SEO).

My wild guess is that "dirty" GEDCOM profiles also hurt the ranking, as Google thinks the pages are not unique/the same... maybe another argument in the pile of "no more GEDCOM import" votes.

GEDCOM import is important. If there were no GEDCOM import, I wouldn't be on wikitree and I think many other members first imported part of their family tree.

Even if it is without sources. For me, a sourced GEDCOM is not an option, since there are almost no online sources available in Slovenia. Also, many sources in GEDCOMs refer to other family trees, which are not really sources. So sourcing GEDCOMs is not really a good option.

@Aleš 

>> My error calculations can't be integrated into wikitree

I thought that when you save in WikiTree, you could check a table that says Sälgö-1, or a relation of Sälgö-1, has an error ==> the user who saves gets a warning: please also check the Error report....

Today the "normal" user doesn't get any information that the profile he/she is looking at has a Database Error project error...

>> I guess gedcom import dont have the same check as manual entered profiles... 

Isn't that a problem?!?!

>>  If there were no GEDCOM import

You have a point, but unsourced profiles from Sweden are just from someone who hasn't done the homework.... and maybe never will....

I checked the last million profiles. 60000 of them have errors, and 10-20% of them are from GEDCOM. So it really isn't that much.
Yes, I have all the errors on my error list fixed (out to 8 generations or so) except for a group which are pretty close relatives (descendents of my 2ggrandfather) which are blocked by privacy put on by a non-responsive PM.  One of these days I'll get busy and try getting this changed, but I have too much else to do to make it a high priority.
I have no problem with the finding and listing of new errors, I am simply pointing out that as long as new errors are added every week we can expect the total number to go up. As for the one that still shows on my current weeks report, that one was requested by the Connectors Project, again nothing wrong there but I think that should be in a different report because it is not really an error. As Aleš points out the use of Bots is and should be limited and GEDCOM files are a necessary evil. I try to discourage others from uploading a GEDCOM without entering at least a few profiles manually first, but some must learn the hard way.
+8 votes
I have been helping with 901 errors and find MOST of the ones I look at are false errors caused by privacy settings for people that are likely living.  

I think we can reduce the number of errors by better defining the criteria for what is an error.  Example:  901 looks for blank profiles.  If someone is set up as private, it is assumed they are living, because why would we mark profiles private for any other reason?  So maybe we have a toggle that looks at the date a private entry was set up.  The system then generates an email to the Profile Manager, maybe every 10 years (or less; we can discuss timing), to ask: is this still a living person?  If not, please change the privacy setting.  I see so many where the husband or children are marked as private, so the profile looks blank.  

Magnus I sent you a private email about the 901 a few days ago.  

So I think there are likely a lot of false errors in that count.

The unique name error.  I had several of those.  And yes, the names are correct and they are unique, so more false positives.

But it does catch typos, and the error report is useful for making sure you have not done something like connecting the wrong child to the wrong parents (I did just that last week, because the names in that family are so repetitive and similar that I accidentally grabbed the wrong parents).

 I do think basic spelling errors for well known locations like Michigna for Michigan could be an automatic fix.  

GEDCOMs often generate a lot of errors because their rules are not as stringent as the ones WikiTree has in its style pages, so unless you rework your GEDCOM to fit WikiTree you are going to generate errors.

Perhaps a mapping tool prior to upload would help with some upload errors?
by Laura Bozzay G2G6 Pilot (630k points)
@Laura, I like your idea of mapping GEDCOMs prior to upload. Or maybe run pending GEDCOMs through an error report and only allow error-free GEDCOM profiles in. This would eliminate some of the redundant errors that we see in the 901s, for sure.

I had also wondered about the use of an automatic spell checker to correct these errors, or at least most of them.

And you are correct, there are a lot of false errors in the 901s; most could be cleared by just opening the privacy on anyone who has been deceased over 50-100 years. Is there really a need for those profiles to be private?

Thanks for sharing your thoughts.
+9 votes

I see that Magnus is trying to facilitate an open new year review of the error checking project, which is to be applauded, but Magnus, please could you try to tone down some of your comments, as sometimes your posts and responses come across as overly negative, and that’s turning people off the db_error project and making them worried about posting to your threads. Comparisons of the raw numbers for the db project don’t really work; it should be done as a % of red, orange, and white/project group errors over all profiles (and it needs doing by profile, not only by error), as that would stop the flux in numbers from overly influencing things. But in answer to the topics you raised, in my opinion:

1. No, GEDCOM isn’t killing the tree in my opinion; if the tree didn’t have GEDCOM import it would turn a lot of people off even looking at bringing their records here. GEDCOM imports have advanced greatly in 5 years, so many of the bad imports seen are historic. Personally I would rather see the basic details with a profile manager to contact for a person I’m interested in than nothing at all. It’s understandable that when someone sees too many profiles below gold star standard they need to let off steam about it and start to feel it’s all profiles, but that’s exactly the time to work on something different; plus, the negativity just turns off the new people and breeds more negativity.

2. There is always a need for more education, but it’s hard to educate people who think they’re doing things correctly or who wish to use WikiTree differently from how it’s been developed. Many people just want the basic details here so people know who to contact; they don’t have an interest in writing biographies or sources. An idea here, though: would WikiTree be able to send a generic e-mail to all the profile managers who have a profile with a red error? Maybe something along the lines of “are you aware of WikiTree’s new functionality that has highlighted this ...” The resources that are available for training maybe need more G2G promotion as well as new-person promotion, as it’s easy to forget about them once you’re established here.

 

3. The create-a-new-profile box could do with having the drop-down places so that people don’t make typos at that stage. The marriage input window could do with having a leave-a-comment box.  The birth place box could do with being next to the birth date box on the input form.

 

4. Maybe on a case-by-case basis, but I disagree with death-in-the-future dates being edited by a bot, unless the bot can both take the date out and put a note, with the date it took out, into the body of the biography. I don’t believe a bot can do this, as it’s dynamic, which would mean it removes a date that could be partially right and usable for someone trying to correct the field. The errors I personally feel should go to a bot are the obvious “200 years but not open” errors (where the birth date is more than 200 years ago); there should be a monthly run to open them. And the empty biography one, where a bot should automatically put the unsourced template onto the profiles.

5. The error project is a massively advantageous add-on to WikiTree and to genealogy as a whole, and should be written about by genealogical journalists. Every experienced historian knows there are errors in family trees, which have happened for a number of reasons, but Aleš and WikiTree are the first family tree to officially acknowledge this and put something in place to try to fix it. The issue of editing other PMs’ profiles will, I feel, continue until WikiTree becomes the most widely used family tree in the world, as other sites that contain family trees build on the ethos of “mine”. Through quality research and collaboration we can start to reduce this issue, but we have to go gently with it so as not to lose genealogists and create poor feelings towards WikiTree.

And regarding what Dale has posted, I also think adding more and more errors is frustrating to the error correctors, and that Aleš should consider capping the number of errors at around 1.5m, by either swapping white errors out for red ones or holding a waiting list of error checks that are coming soon. Furthermore, I think the full error report table should be reformatted to show red errors, orange errors, and white/project group request errors separately, so that people may find it easier to target the worst errors and better understand that the white errors are not really errors; they are wiki volunteer project work to further improve the database.

by

1-a) >>  but Magnus please could you try to tone down some of your comments as sometimes your posts and responses come across as overly negative and that’s turning people 

OK I feel it's just Dale chasing me... I would wish he was more constructive...

1.b) >> Gedcoms imports have advanced greatly in 5 years so many of the bad imports seen are historic. 

Do you have numbers on that!??! I did a check, and only one out of 200 uploaded GEDCOMs had sources; see the blue dot in the upper left below. Even if a family tree has no Database Error project errors, it could be of no value if sources are missing.... 

I understand Aleš's point of view that he has no sources in his country, but then it's not genealogy in my cubic world... I also don't understand why getting a 12-generation family tree with nearly no sources is more important than creating good profiles about your grandparents....


see G2G: uploading 1500 profiles and 100 sources is not serious...

1.c) My personal feeling is that too many people upload GEDCOMs and do nothing more, which I feel is sad, but I have no numbers on that and have seen no polls on why people don't continue with WikiTree... I feel the people arguing on G2G are Dale and me and a few more, not everyone, so why people leave would be interesting to understand....

1-d) My intention with this post is just to see how all the magic Aleš has done can be used even better and maybe more efficiently....   

A) The normal WikiTree user: how often do they use the Database Error project?

I guess the tool is hidden from many.... is that good or bad? Maybe some people get upset to see that they have an error in the family tree...

2) Better training..... maybe use some polls and ask people what they miss; see the example poll "bold or polite". It takes 10 minutes to create a poll using Google Forms.... and you often get answers pointing to things you never thought of...

3) Change the user interface - I like the way locations have been implemented; even if the Family Search database could be better, it's 1000 times better than just a line, and you get the country for free.. ....  maybe more things like that

4-1) Bots: My bot candidate is things like reformatting GEDCOM garbage that has never been touched... delete all CONT

4-2) Empty profiles with a template: I guess that is just a database SQL query for Chris: if size < 50 chars then {{Unsourced}}; time to run, 1400 millisec 

5) The Database Error project: I think Aleš should sell it to Ancestry or Family Search, but I guess at least Ancestry has no interest; their business idea focuses on subscribing users and not genealogy quality... or?

6) More errors: I don't agree; what is the problem with more errors?!?!? Finding errors is not the problem?!?! When you start looking at a profile and realize one hour later that the whole family tree is unsourced and of no value, you are not happy. 

Much better to check with Database Error first and understand that this is a family tree you can't trust... 

901 db_e - most of the errors that I'm working on were generated in 2009-2012. Before this week most of the profiles I saw had a private person connected to them, but this week I would say about 50% of the errors I have encountered are completely empty profiles. Some even contain false info, which is disturbing but easily taken care of.

I have put my research on hold for the month of January and am concentrating on the 901s. Last week it was 34,040 errors; this week it is 33,514. I would say we are moving in the right direction!

Thank you to everyone who is working on the 901's, I certainly couldn't do it by myself. Thank you again!
Sorry, I am not Dale; he still speaks to you directly. Perhaps looking back at the number of people who used to reply to your messages compared to now could help.

Regarding your chart, some of those records will have sources; they just won't have the <ref> </ref> tags needed for an automated process to count them.

Regarding your bot - sometimes there is useful information in the GEDCOM "garbage", such as a reference number to a profile on Geni that gives all the info needed to fix the profile.

More errors - I didn't say to stop adding errors, just to be aware that the data doctors are a certain group of people who get disheartened when the number of new errors increases at a greater rate than they can manage, as also mentioned by 2 others on this thread.

Regarding Aleš saying Slovenian records can't be sourced: oral tradition can still be used as a source; it just needs doing carefully, saying who told you about that person and when, or whether you knew them, and then hopefully someone else will have also known them or been told that information independently.

>> Regarding your chart, some of those records will have sources they just wont have the <ref> </ref> needed for automated process to count them.

  1. No, these are imported GEDCOMs, and in GEDCOM a source is called a source; it has nothing to do with ref tags in WikiTree
  2. Sources can also end up in the Notes of a GEDCOM if the GEDCOM was generated incorrectly, but the raw GEDCOM I checked didn't have anything useful

My conclusion from this, and also from speaking with other members of my genealogy society in Sweden, is that WikiTree is a new concept and doesn't attract people doing good genealogy based on sources. 

If you have spent the last 30 years researching and have added sources, you hesitate to upload to a site where a million profiles have no sources... my guess is that a better quality focus inside WikiTree will attract people more interested in genealogy. The number of unsourced Ancestry profiles indicates we miss that target group...  

GEDCOM and bot: What we are speaking about is garbage like the tag CONT. CONT means continue and is added by GEDCOM to mark a new line ==> 

Regarding your chartCONT some of those records willCONT have sources they justCONT wont have the

Example: profile Bubb-48. It has no genealogy value; it is just a bad import..... and makes profiles unreadable.... you always have the old version in WikiTree if you would like to see how it looked before 
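To make the bot idea concrete, here is a minimal sketch of turning stray CONT tags back into line breaks. It is not an existing WikiTree bot: the function name `restore_line_breaks` and the matching rule are assumptions for illustration. It deliberately changes nothing except the CONT tags, since, as noted earlier in the thread, the rest of the imported text can still contain useful references.

```python
import re

# Match the literal tag CONT, plus one optional trailing space, unless it is
# immediately followed by another capital letter (so an all-caps word such
# as CONTACT is left alone). This heuristic is an assumption of the sketch.
CONT_TAG = re.compile(r"CONT(?![A-Z]) ?")


def restore_line_breaks(text: str) -> str:
    """Replace stray GEDCOM CONT tags with the line breaks they stood for."""
    return CONT_TAG.sub("\n", text)


sample = "Regarding your chartCONT some of those records willCONT have sources"
print(restore_line_breaks(sample))
```

A real cleanup would need to be tested against actual imported biographies first, because a pattern this simple could still hit text where "CONT" is legitimate.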

>> data doctors are a certain group of people who get disheartened when the number of new errors increase at a greater rate than they can manage, as also mentioned by 2 others on this thread.

Agree, we have a different focus. For me, fixing badly researched uploaded GEDCOMs feels like Sisyphean work, and if the logic in the dates is not OK then I guess the rest is also wrong and wrongly connected ==> why spend time on it? What we see in old unsourced WikiTree profiles from Sweden uploaded using GEDCOM is that they are just junk and better deleted.... there is some basic problem in the WikiTree concept when an unskilled person can upload 300 wrong profiles in 10 minutes and they will be there forever... what we can do is mark them {{Unsourced}}

I see the potential of WikiTree and get happy when I read research such as that done by Eva Ekblad. She is now adding taxation records from before 1700 for a small village in Sweden, Solmyra in Västmanland, which adds value and is interesting to read. In this case the church records start in 1700, so using the taxation records is one possibility...

My vision for WikiTree is to gather people doing good genealogy to learn from each other. The best we can do with all the profiles with no sources and with errors is to warn other people: don't trust this, as this profile/research shows indications of bad quality..... 

>> oral tradition as a source

Agree, and it's important that we write it down so it doesn't get lost. WikiTree is an excellent platform for that:

  • Free
  • Easy to access
  • Based on the Wikimedia platform, which scales well and is used by many
  • Plus it also has some quality checking tools, like the Database Error project, to avoid some mistakes...
+8 votes
When the Error Reports first started, Aleš would give us an estimate of the number of corrections made during the previous week.  As I remember, the number was usually in the 15,000 – 20,000 error corrections per week.  I do not think we have slowed down as I still see a consistent drop in the number of errors in certain categories from week to week.

I believe we have made over 500,000 error corrections in the last 36 weeks.  I wish I had a more accurate estimate – I don’t know if there is a way for Aleš to make this calculation accurately.

Aleš adds errors faster than we can correct them, so it may be difficult to see the improvement in the numbers.  But WikiTree is a vastly better website than it was 36 weeks ago, in my opinion.

 

OK I looked up some of the early numbers:

11 May 2016 – 9591 errors corrected since May 1.  First time we were given an error reduction estimate.

15 May 2016 – 6153 errors corrected.

22 May 2016 – 9988 errors corrected.

29 May 2016 – 14781 errors corrected.

5 June 2016 – 18209 errors corrected.

12 June 2016 – 20525 errors corrected.

19 June 2016 – 14926 errors corrected.

26 June 2016 – 12902 errors corrected.

3 July 2016 – 17192 errors corrected. This was the last week we were given an error reduction estimate.

So, that was an estimated 124,267 errors corrected in May and June.  It is obvious from the ramp up in numbers that the project was just catching on.  It is also well before the many additional error checks that Aleš has added.  I expect the weekly numbers from July 2016 to January 2017 continued in the 15,000-20,000 range.
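The total can be checked by adding up the nine estimates listed above. The snippet below is just that arithmetic; the variable names are made up for the example, and the per-report average treats the first entry (which covers 1 to 11 May) as one reporting period like the others, which is a simplification.

```python
# Estimated corrections per report, as listed above (report date -> count).
weekly_corrections = {
    "2016-05-11": 9591,
    "2016-05-15": 6153,
    "2016-05-22": 9988,
    "2016-05-29": 14781,
    "2016-06-05": 18209,
    "2016-06-12": 20525,
    "2016-06-19": 14926,
    "2016-06-26": 12902,
    "2016-07-03": 17192,
}

total = sum(weekly_corrections.values())
average = total / len(weekly_corrections)
peak_date = max(weekly_corrections, key=weekly_corrections.get)

print(f"total corrected: {total}")
print(f"average per report: {average:.0f}")
print(f"peak report: {peak_date} ({weekly_corrections[peak_date]})")
```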

Magnus, I think the answer is: yes, we are moving in the right direction.
by Joe Cochoit G2G6 Pilot (204k points)
I stopped making estimates of the number of errors corrected, since error checking on save changed the rate at which new errors are created. Also, after most errors of a specific type have been corrected, the assumption about the number of new errors is no longer correct.

Instead we have the last column in the summary. It is the exact number of new errors each week. The summary for last week shows 8235 new errors.

And the total number of errors was reduced by 3274.

That gives a total of about 11500 errors corrected.

After the maximum in June, the correction rate is between 10K and 15K per week.
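The calculation in this comment is: errors corrected = new errors found during the week + net drop in the total. A small sketch with the numbers quoted above follows; the function name is made up for the example, and it assumes the same set of checks ran in both weeks, otherwise newly added checks would show up as "new errors".

```python
def errors_corrected(new_errors: int, net_reduction: int) -> int:
    """Corrections needed to absorb the new errors and still lower the total."""
    return new_errors + net_reduction


# Last week's figures from the summary: 8235 new errors, total down by 3274.
corrected = errors_corrected(new_errors=8235, net_reduction=3274)
print(corrected)  # 11509, quoted above as roughly 11500
```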
