db errors: widely varying number of results using same location query

+6 votes
294 views

Members of the Dutch Roots project try to fix errors by province. A query for the province Noord-Holland returned almost 2000 errors last night, 531 errors this morning, and 1146 now. It looks like reporting of error #511 is being switched on and off (and on again). The same symptoms were reported when querying Netherlands.

Please explain.

in WikiTree Tech by Living Terink G2G6 Pilot (293k points)
retagged by Maggie N.
I added MaxErrors to the query. The new default is 1000, but you can put any number in and wait. If you enter 0 it will return all records.
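The MaxErrors behaviour described above can be sketched roughly as follows. This is only an illustration, not the actual query code: the function name `apply_max_errors` and the constant `DEFAULT_MAX_ERRORS` are hypothetical, and only the default of 1000 and the meaning of 0 come from the comment above.

```python
from itertools import islice

# Default taken from the comment above; the constant name is hypothetical.
DEFAULT_MAX_ERRORS = 1000


def apply_max_errors(errors, max_errors=DEFAULT_MAX_ERRORS):
    """Return at most max_errors records; 0 means 'return everything'."""
    if max_errors < 0:
        raise ValueError("max_errors must be 0 or a positive number")
    if max_errors == 0:
        # 0 disables the cap, as described in the comment above.
        return list(errors)
    # islice stops reading once the cap is reached.
    return list(islice(errors, max_errors))
```

With a cap like this, the number of rows shown for a large location reflects the cap rather than the true total, so the true total is only visible with the cap set to 0.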

I checked Netherlands and it has Total: 15242 errors. Today I added also error 512.
Question:
Which names should be checked?
1. Just the given names or also the family names?
2. Just the profiles on the left side of the error-list or also on the right side?
3. The names of the locations?

@Aleš

Here is an example of someone with 3 not too rare first names that gets error #511:

Arend Willem Maurits

Could you please explain?

Dhr is the problem here. Is that something like Mr? Only 6 names start with Dhr.

The dump contains only the joined first name, so I cannot tell whether there is a prefix or not.
@Pierre

1, 2: If you are talking about 511 errors, only the left profile is present. Only the first name is checked.

3: Errors 600-630 concern the birth location, 630-660 the death location, and 660-690 the marriage location.
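The number ranges above can be written down as a small lookup. This is a sketch under assumptions: the names `LOCATION_ERROR_RANGES` and `location_type` are hypothetical, and because the ranges as stated share their end points (630 and 660), the sketch assumes each range includes its start and excludes its end.

```python
# Ranges as given in the comment above. Treating them as half-open
# (start included, end excluded) is an assumption, since 630 and 660
# appear in two ranges each.
LOCATION_ERROR_RANGES = [
    (600, 630, "birth"),
    (630, 660, "death"),
    (660, 690, "marriage"),
]


def location_type(error_number):
    """Return which location an error number refers to, or None."""
    for start, end, kind in LOCATION_ERROR_RANGES:
        if start <= error_number < end:
            return kind
    return None  # not a location error, e.g. 511
```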
Yes, in this profile "Dhr" is recorded in the prefix field. And yes, "Dhr." is like "Mr". I think the prefix is only filled in for a small number of Dutch profiles, so there is no need to adjust the algorithm.

It's good to know about the joined first name; it helps us understand the error reporting better.

Thanks!
Of the 15,242 error records, over 10,000 have error #511.
I have looked for false errors and, for most of them, I understand why the algorithm produces an error. Many of them need an edit.

Some profile names that give a false error that I do not understand:

http://www.wikitree.com/wiki/Florack-20
http://www.wikitree.com/wiki/Draaisma-194 , but should give #512.
http://www.wikitree.com/wiki/Draaisma-564 , #511 ( and correct #512 present)

Maybe a frequency of 3 is already enough to eliminate spelling errors.

I found a bug in the calculation of error 511. There are now 60K fewer errors.

The first two are no longer present. The third one has KLAZES in the first name, which doesn't appear as a single name at all.

Also, new data arrived, so I will start the calculations from the beginning.

What else did you change?
The number of records for location Netherlands has changed from more than 15,000 to less than 8,000.
Should I be happy? :)

I removed case sensitivity from names, so the number decreased significantly (120K errors). Now 511 lists 346225 errors.

1 Answer

+6 votes
 
Best answer

Error 511 had too many results (1.2M), so today I changed the calculation and now it gives 534383 errors. I think there are far fewer false positives. This can happen again in the future, as I add new errors and correct algorithms.

511 now takes all names that appear only once. If a name has multiple words (names), it checks the frequency of each word, and if they are all >10 it is not a spelling error, just a unique combination of names. You have to give me feedback if some error has too many false positives. I can often adjust the conditions to get a better result.
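The rule described above can be sketched roughly as follows. This is only an illustration under assumptions, not the actual Database Errors code: the function name `find_511_candidates` and the constant `MIN_WORD_FREQUENCY` are hypothetical, splitting on whitespace is assumed, names are compared case-insensitively as mentioned elsewhere in this thread, and the frequency of a word is assumed to be counted across all first-name fields (the real calculation may count it differently, for example only as a standalone name).

```python
from collections import Counter

# A word must occur MORE than this many times to count as a known name.
MIN_WORD_FREQUENCY = 10


def find_511_candidates(first_names):
    """Return first names that look like possible spelling errors."""
    # Compare case-insensitively and ignore surrounding whitespace.
    normalized = [name.strip().lower() for name in first_names]
    full_counts = Counter(normalized)
    # Assumption: word frequency is counted over every first-name field.
    word_counts = Counter(
        word for name in normalized for word in name.split()
    )

    candidates = []
    for name, count in full_counts.items():
        if not name or count != 1:
            continue  # only names that appear exactly once are considered
        words = name.split()
        all_words_common = all(
            word_counts[word] > MIN_WORD_FREQUENCY for word in words
        )
        if len(words) > 1 and all_words_common:
            continue  # a unique combination of common names, not an error
        candidates.append(name)
    return candidates
```

Under these assumptions a one-off "Arend Willem Maurits" is skipped as long as each of its three words occurs more than 10 times, while a one-off name containing a word that is itself rare is still reported.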

by Aleš Trtnik G2G6 Pilot (804k points)
selected by Living Terink
For the moment, I can only tell that with location "Netherlands" yesterday, >24 hours ago, I had about 3.200 records. Tonight, about 20 hours ago, I had 9.500 records, today I had about 4.500 records and now I have exactly 10.000. (Is that a MAX perhaps?)
Thanks for explaining the algorithm!
10K is the limit I set; you cannot check 10K errors in one week. Next week some of them will be corrected.

I am also preparing time frames for country names, in case that is relevant for the Dutch community.

http://www.wikitree.com/wiki/Space:Database_Errors_Definition

You can add entries for Netherlands if you want.
Hi - Note that I have changed the date for United States formation to "1776 - 07 - 04" as our national day is JULY 4th, not JUNE 4th (-06-04 = previous notation).

Some historians date the United States from 1774 or 1775 when the first battles of the American Revolution were fought (in Massachusetts mainly) but I believe most agree that July 4, 1776, is the most-widely-used date.

I have a question - will an "error" still come up if "USA" or "United States" is put into PARENTHESES = (USA) after a colony designation for dates before 1776 ??  Say a colonial Virginia profile that is marked:  "Jamestown, Virginia colony (USA)"  ??   I would request that the (  ) be allowed to help others realize that this person lived in a territory that would become this State.  Can you do that ?

Chet Snow - A WikiTree Leader
This discussion should take place in this thread, so the answer is there: http://www.wikitree.com/g2g/249287/location-errors


...