db errors: widely varying number of results using same location query

+6 votes
299 views

Members of the Dutch Roots project try to fix errors by province. A query for the province noord-holland returned almost 2000 errors last night, 531 errors this morning, and 1146 now. It looks like reporting of error #511 is being switched on and off (and on again). The same symptoms were reported when querying netherlands.

Please explain.

in WikiTree Tech by Living Terink G2G6 Pilot (298k points)
retagged by Maggie N.

I noticed you have limited error checking to netherlands. I could make it possible to view only one error at a time for netherlands and all its different notations:

Country name   Country           Occurrences
Netherlands    Netherlands            121498
Netherlands    Nederland               27998
Netherlands    The Netherlands          7826
Netherlands    NL                       5177
Netherlands    Amsterdam                3793
Netherlands    Gouda                    3560
Netherlands    Holland                  3431
Netherlands    NLD                      2495
Netherlands    Groningen                2207
Netherlands    Dutch Republic            773

 

I suppose you mean that then entering netherlands as location in the query would actually result in a query something like

where lower(country) in ("netherlands","nederland",...,"dutch republic")

If that's what you mean, please do,

with the following exceptions: Amsterdam, Gouda, and Groningen are city names, so they could be reported as location errors: missing names of province and country
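The matching described above can be sketched in a few lines of Python. This is only an illustration, not how the error report is actually implemented: the notation set is copied from the table in this thread (with the three city names left out, as suggested), and the function names and the "country" record key are hypothetical.

```python
# Notations from the table in this thread that denote the country itself.
# Amsterdam, Gouda and Groningen are left out because they are city names.
NETHERLANDS_NOTATIONS = {
    "netherlands", "nederland", "the netherlands",
    "nl", "holland", "nld", "dutch republic",
}


def is_netherlands(country: str) -> bool:
    """Case-insensitive match, comparable to lower(country) in (...)."""
    return country.strip().lower() in NETHERLANDS_NOTATIONS


def filter_by_country(records):
    """Keep only records whose (hypothetical) 'country' field matches."""
    return [r for r in records if is_netherlands(r["country"])]


records = [
    {"country": "Nederland"},
    {"country": "Gouda"},
    {"country": "DUTCH REPUBLIC"},
]
matches = filter_by_country(records)
```

With this sample, "Nederland" and "DUTCH REPUBLIC" are kept, while "Gouda" is not and could instead be reported as an incomplete location.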

 

No, I meant browsing errors by any country with more than 10K profiles. Error pages could be precalculated by error ID.

Marking Gouda as an error (incomplete location) is a separate matter, but it is already on my list of things to do.

OK, I get the idea. Browsing one by one could significantly reduce page load times. Also, the layout of this one error record could then be vertical, so one does not have to scroll to the end of the line. Maybe implement it for any size of result set, so not limited to result sets > 10K? Maybe add a choice on the search page: browse the set or browse one by one?

No. I meant one error type, not one error. So you would have links 

Netherlands_103
Netherlands_104
Netherlands_511
...
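As a rough sketch of what precalculating pages per country and error ID could look like (the record fields, profile IDs, and function name are assumptions for illustration, not the actual implementation):

```python
from collections import defaultdict


def build_error_pages(errors):
    """Group error records into pages keyed like 'Netherlands_511'."""
    pages = defaultdict(list)
    for err in errors:
        key = f"{err['country']}_{err['error_id']}"
        pages[key].append(err["profile"])
    return dict(pages)


# Hypothetical sample data; the profile IDs are made up.
sample = [
    {"country": "Netherlands", "error_id": 511, "profile": "Example-1"},
    {"country": "Netherlands", "error_id": 103, "profile": "Example-2"},
    {"country": "Netherlands", "error_id": 511, "profile": "Example-3"},
]
pages = build_error_pages(sample)
```

Each key then corresponds to one link, and a page only needs to be rebuilt when the error data is recalculated.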

Thanks for being so patient with me, Aleš!

I can only judge the Dutch situation, where we are making a country-wide attempt to fix the errors. But currently we do this per province, not per error type. So I don't think it is really needed for us. Maybe other projects judge differently...

Thanks for the offer!

The list for location Netherlands with error #511, which is now longer than 10,000 records, will automatically shorten as errors are resolved.
Looking at the list Aleš made, I think there are 12,000 records in the list (not 121,000).
But we will miss errors with birth dates at the start or the end of the list.
I don't suppose that is a problem for now, but when many more errors are added, we may have to rethink this.

Maybe we should start checking names in this list, and I expect that most of them will get a "False Error"

Really? Why? I think those are mainly typos. If a name doesn't appear at least 5 times, then it is a very rare name. If there are a lot of false errors, maybe I can change the condition. Can you estimate the percentage of false errors and give me a few examples?

The table shows the number of profiles, not errors. There are far fewer errors.

Well, with error #511 I get 10,000 error records for location Netherlands. 10,000 is a limit, so the list is longer, but I do not know how long.
The last time error #511 was not in the list, there were about 4,500 records.
So the sum of all other errors together for this location is 4,500, and for #511 alone it is more than 5,500. Well, that strikes me. I simply cannot believe there are that many errors in the names. But you are right, it should be checked before I can complain. And I will.
I will let you know the result.

I added MaxErrors to the query. The new default is 1000, but you can put in any number and wait. If you enter 0, it will return all records.
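The behaviour of such a limit can be sketched as follows. Only the name MaxErrors, the default of 1000, and the meaning of 0 come from the comment above; the function and its signature are hypothetical.

```python
from itertools import islice

DEFAULT_MAX_ERRORS = 1000  # default mentioned in the comment above


def limit_errors(errors, max_errors=DEFAULT_MAX_ERRORS):
    """Return at most max_errors records; 0 means no limit."""
    if max_errors == 0:
        return list(errors)
    return list(islice(errors, max_errors))


first_page = limit_errors(range(5000))       # default limit
everything = limit_errors(range(5000), 0)    # 0 returns all records
```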

I checked Netherlands and it has a total of 15242 errors. Today I also added error 512.

Question:
Which names should be checked?
1. Just the given names or also the family names?
2. Just the profiles on the left side of the error-list or also on the right side?
3. The names of the locations?

@Aleš

Here is an example of someone with 3 not too rare first names that gets error #511:

Arend Willem Maurits

Could you please explain?

Dhr is the problem here. Is that something like Mr? Only 6 names start with Dhr.

The dump contains only the joined first name. I do not know whether there is a prefix or not.
@Pierre

1, 2: If you are talking about 511 errors, only the left profile is present. Only the first name is checked.

3: 600-630 is Birth, 630-660 is Death and 660-690 is Marriage Location.

Yes, in this profile "Dhr" is recorded in the prefix field. And yes, "Dhr." is like "Mr". I think the prefix is only filled in for a small number of Dutch profiles, so there is no need to adjust the algorithm.

It's good to know about the joined first name; it helps us understand the error reporting better.

Thanks!
Of the 15,242 error records, over 10,000 have error #511.
I have looked for false errors and agree that for most of them I understand why the algorithm reports an error. Many of them need an edit.

Some profile names that give a false error that I do not understand:

http://www.wikitree.com/wiki/Florack-20
http://www.wikitree.com/wiki/Draaisma-194 , but should give #512.
http://www.wikitree.com/wiki/Draaisma-564 , #511 ( and correct #512 present)

Maybe a frequency of 3 is already enough to filter out spelling errors.

I found a bug in the calculation of error 511. The result is 60K fewer errors.

The first two are no longer present. The third one has KLAZES in the first name, which doesn't appear as a single name at all.

Also, new data has arrived, so I will restart the calculations from the beginning.

What else did you change?
The number of records for location Netherlands has changed from more than 15,000 to less than 8,000.
Should I be happy? :)

I removed case sensitivity from the name comparison, so the number decreased significantly (120K errors). Error 511 now lists 346225 errors.

1 Answer

+7 votes
 
Best answer

Error 511 had too many results (1.2M), so today I changed the calculation and now it gives 534383 errors. I think there are far fewer false positives. This can happen again in the future, as I add new errors and correct the algorithms.

511 now takes all names that appear only once. If a name has multiple words (names), it checks the frequency of each word, and if they are all >10, it is not a spelling error, just a unique combination of names. You have to give me feedback if some error has too many false positives. I can often adjust the conditions to get a better result.
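To make the rule easier to follow, here is a minimal Python sketch of it. It is a reading of the description above, not the actual code behind error 511. Assumptions: names are compared case-insensitively (as mentioned earlier in the thread), and the frequency of a word is counted over every word in every first-name field; the real calculation may count differently. The function name and sample data are hypothetical.

```python
from collections import Counter

WORD_THRESHOLD = 10  # every word must occur more than 10 times


def find_unique_name_errors(first_names):
    """Return the first-name fields that would be flagged by the rule."""
    # Normalize case and whitespace (assumption: case-insensitive compare).
    normalized = [" ".join(n.lower().split()) for n in first_names]
    full_counts = Counter(normalized)
    # Assumption: word frequency is counted over all first-name fields.
    word_counts = Counter(w for n in normalized for w in n.split())

    flagged = []
    for original, name in zip(first_names, normalized):
        if not name or full_counts[name] != 1:
            continue  # empty, or the full name occurs more than once
        words = name.split()
        if len(words) > 1 and all(word_counts[w] > WORD_THRESHOLD for w in words):
            continue  # unique combination of common names, not a typo
        flagged.append(original)
    return flagged


# Hypothetical sample: three common names plus three unique fields.
names = (
    ["Arend"] * 11 + ["Willem"] * 11 + ["Maurits"] * 11
    + ["Arend Willem Maurits", "Arend Klazes", "Wilem"]
)
flagged = find_unique_name_errors(names)
```

In this sample "Arend Willem Maurits" is not flagged because each of its words is common, while "Arend Klazes" and "Wilem" are flagged because they contain a word that occurs only once.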

by Aleš Trtnik G2G6 Pilot (808k points)
selected by Living Terink
For the moment, I can only say that with location "Netherlands" yesterday, more than 24 hours ago, I had about 3,200 records. Last night, about 20 hours ago, I had 9,500 records, today I had about 4,500 records, and now I have exactly 10,000. (Is that a maximum, perhaps?)
Thanks for explaining the algorithm!

10K is a limit I set; you cannot check 10K errors in one week. By next week some of them will have been corrected.

I am also preparing time frames for country names, in case that is relevant for the Dutch community.

http://www.wikitree.com/wiki/Space:Database_Errors_Definition

You can add entries for Netherlands if you want.
Hi - Note that I have changed the date for United States formation to "1776 - 07 - 04" as our national day is JULY 4th, not JUNE 4th (-06-04 = previous notation).

Some historians date the United States from 1774 or 1775 when the first battles of the American Revolution were fought (in Massachusetts mainly) but I believe most agree that July 4, 1776, is the most-widely-used date.

I have a question: will an "error" still come up if "USA" or "United States" is put in parentheses, e.g. (USA), after a colony designation for dates before 1776? Say, a colonial Virginia profile that is marked "Jamestown, Virginia colony (USA)"? I would request that the parentheses be allowed, to help others realize that this person lived in a territory that would become this state. Can you do that?

Chet Snow - A WikiTree Leader
This discussion should take place in this thread, where the answer is given: http://www.wikitree.com/g2g/249287/location-errors


...