News on Data Doctors Report (March 22nd 2020) [closed]

+16 votes
485 views

News

  • I refined the 860 - Reference suggestions. Span and email tags are now excluded from checking, resulting in slightly fewer false suggestions.

Previous News

closed with the note: Outdated
in The Tree House by Aleš Trtnik G2G6 Pilot (813k points)
closed by Aleš Trtnik
Sorry for veering out of the Dutch conversation, but I want to emphasize that I'm not interested in being told that the names are unique. I already know it. Clerks used the most fancy spellings and each combination has a chance to be unique. There is no helping that. We have to integrate every possible interpretation of what we read (sometimes in hardly legible scripts) to reduce the possibility of duplicates being created.

Also, many of the bizarre names are the earliest-known-female ancestress of a line. There will never be a second profile by the same name because their only known relatives do not have the same name at birth. And there are other examples. Nobility who used different apanages as the name they were referred to. Very often these are not repeated. Etc, etc.

I checked a bit further, and quite a few also have only one instance in the records.

https://api.openarch.nl/1.0/records/search.json?name=Bakerin&lang=en&number_show=5

In such a case I would still consider it a typo.

I would need to remember the results, since I can't make 100,000 requests each week. I will explore it further.
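
One way to "remember the results" is a small on-disk cache keyed by name, so that each surname costs at most one request across weekly runs. This is only a sketch: `NameCountCache` and `fetch_count` are hypothetical names, and `fetch_count` stands in for whatever code would call the records search and return a hit count (the response format of that API is not shown in this thread).

```python
import json
from pathlib import Path
from typing import Callable


class NameCountCache:
    """Remember per-name record counts between runs (hypothetical helper)."""

    def __init__(self, path: Path, fetch_count: Callable[[str], int]):
        self.path = path
        # Assumption: fetch_count performs the real lookup and returns a hit count.
        self.fetch_count = fetch_count
        # Load earlier results if a cache file already exists.
        self.counts = json.loads(path.read_text()) if path.exists() else {}

    def count(self, name: str) -> int:
        key = name.casefold()
        if key not in self.counts:
            # Only names not seen before trigger a request.
            self.counts[key] = self.fetch_count(name)
        return self.counts[key]

    def save(self) -> None:
        # Persist the cache so next week's run can reuse it.
        self.path.write_text(json.dumps(self.counts))
```

With a cache like this, only names that are new since the previous run would need a lookup.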

I second what Isabelle and Mindy said but I extend it to all of these Unique Last Name errors.
And you are all so sure in your typing skills that it is not worth a second look at what you typed or even double checking source transcription?

This borders on a nightmare. Since I use all of the documented spellings from early records, so that anyone searching for the profile based on that spelling can find it, I had 28 errors. 2 were other than "false."

Why is that a nightmare? Because I also am trying to clear the suggestions on the Scotland Project. Gaelic words in records had no standard spelling whatsoever. There are 1,862 errors only in the unique spellings. How is one to verify each one without going back to the original documents?

And spending precious person-hours to mark 1,862 errors as "false" is a massive waste of time, and very frustrating.

Question Aleš: when you do something like this, can you find a way to carry over the "false suggestion" from the previous version to the new one? Meaning, the 511 suggestions that had been marked before, couldn't they be shown as false when you do the new categories? We're doing the same thing, repeatedly, and to my mind, to no good purpose. As you can tell, I'm not real excited about this one.

"And you are all so sure in your typing skills that it is not worth a second look at what you typed or even double checking source transcription?" Pretty sure on mine, (I had two of these in my error list, but they were false) but there are over 60,000 profiles in the unique LNAB ones...many of those have limited sourcing. How is anyone to check this?

As you can see in the hidden column, all suggestion statuses were replicated to the new suggestions. Are you missing some false suggestions? I can look into it, if you have an example.

The big difference is that now I also check the Last Names. 511 was checking only first names.

I anticipated initial complaints because of the many new suggestions on profiles, but these calm down afterwards, as there is only a small increase each week. You must understand that with each new suggestion you get a backlog of 10 years. So in your case that is about 180 per year, or 3-4 new suggestions each week, which is not much to double-check for typos.

I am not "that sure of my typing" but out of 200+ new suggestions on my watchlist alone (I dare not look at the whole of France) there was one real error. I know a loss of time when I see it. I used to check my suggestions every week, but with this new onslaught on top of the 220 Wikidata suggestions I can't fix (Unconnected people with a Wikidata page and "Clue for parent" suggestions), I will no longer bother.

Aleš,

Yes, apparently all 260 errors previously marked as false are still there, leaving 1,862 to be worked. Looking at the LNABs of approximately the first 200, I see 2 clear actual errors.

It seems to me our project time will be better spent on more important errors and this will wait.  

Bobbie

4 Answers

+2 votes
Aleš - The Weekly tracker appears to have reset to only count some cleared suggestions
by Graeme Olney G2G6 Pilot (144k points)
This might be connected to new suggestions. I had to copy 511 statuses to 7x7 suggestions. I will have a look.
Corrected.
+3 votes

Thanks so much for

I have extended the search capabilities on WikiTree+. Now you can define by which field you are searching for a word.

Maybe unintended side effect:

Looks like old searches such as

Netherlands Unsourced

do not return any profiles now. Instead,

Country=Netherlands CategoryFull=Unsourced_Profiles

must be specified.

by Living Terink G2G6 Pilot (300k points)
No, that is just a temporary failure. I am rebuilding indexes and standard search should function shortly.
Search is now fully operational.
Fantastic, Aleš!

Had been hoping for extensions like this for some time already.
+6 votes

After seeing Isabelle's comment about the Unique Last Names errors, I looked at my Suggestions Report and sighed with dismay when I saw these new errors. It looks like I am being punished for my diligence in deliberately recording unusual variant spellings found in records -- something that is very important to do so that the profiles will turn up in searches, and we will not get more duplicate profiles for those spellings. I wish that WikiTree+ had a bulk status-update feature, so I could mark False Suggestion on 50 profiles at one time.

Other wishes related to these new errors:

  1. DB 797 is called Unique name in Last Name Other. Could this please be corrected to 797 Unique name in Other Last Names?
  2. The current treatment of name particles is weird. Particles should be kept together with the rest of the name, but in this report they have been separated in a manner that can make it difficult to tell what names are recorded. For example, in WikiTree the Other Last Names field for Thijsz-3 has a comma-separated list of 7 names: Tys, Thijsen, Theijsen, Tyssen, van Heide, van der Heyden, Vanderheyden. This report, however, has converted that list to an out-of-sequence list that separates the name particles and scrambles the name van der Heyden:

der
Heide
Heyden
Theijsen
Thijsen
Tys
Tyssen
van
Vanderheyden

Apparently, WikiTree+ ignores the comma separators in the Other Last Names field, treats the name particles as distinct names, and alphabetizes the list. If this particular error is going to be retained, can it be revised to recognize the commas as separators between names?

by Ellen Smith G2G Astronaut (1.5m points)

I do checking based on each word. For performance reasons it is alphabetised and stripped of duplicates. I will see if I can list only the problem word.
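
To make the difference concrete, here is a minimal sketch of the two ways of splitting the field. It is an illustration only, not the actual WikiTree+ code: `split_words` imitates the per-word approach described above (unique words, alphabetised), and `split_names` is a hypothetical alternative that treats each comma-separated entry as one name.

```python
import re

# The Other Last Names value quoted above for Thijsz-3.
other_last_names = "Tys, Thijsen, Theijsen, Tyssen, van Heide, van der Heyden, Vanderheyden"


def split_words(field: str) -> list[str]:
    """Per-word view: every word on its own, de-duplicated and alphabetised."""
    words = re.findall(r"[^\s,]+", field)
    return sorted(set(words), key=str.casefold)


def split_names(field: str) -> list[str]:
    """Per-name view: one entry per comma-separated name, original order kept."""
    return [part.strip() for part in field.split(",") if part.strip()]


print(split_words(other_last_names))
print(split_names(other_last_names))
```

The first call reproduces the scrambled nine-word list from the report; the second keeps van der Heyden together as a single name.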

I renamed all 79x suggestions to Other Last Names.

I corrected the content of the Info column. Now it lists only the unique words.

Thanks. Those changes make these 797 suggestions a lot easier to review. However, I am puzzled to see that several of my 797 suggestions now have no content in the Info column (that is, no names are listed as problematic) -- and there are no recent edits to these profiles.

Note that WikiTree name search treats names like van der Heyden as equivalent to the concatenated form vanderheyden. Could WikiTree+ be trained to do the same thing?

Ellen, you found a bug. The problem was the different letter case of the word van. I corrected the problem and I will update the suggestions.

I could, but then there are immediately a lot of variations. I checked your list and you often use both forms in LNO, with and without spaces (van der Something and vanderSomething). I guess asking which one is correct is pointless.
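
For illustration, treating the spaced and concatenated forms as equivalent could be sketched with a comparison key that ignores case and spaces. This is an assumption about how such matching might be done, not how WikiTree search or WikiTree+ actually implements it; `name_key` and `group_variants` are hypothetical names.

```python
def name_key(name: str) -> str:
    """Comparison key: drop all whitespace, then ignore letter case."""
    return "".join(name.split()).casefold()


def group_variants(names: list[str]) -> dict[str, list[str]]:
    """Group the names that collapse to the same comparison key."""
    groups: dict[str, list[str]] = {}
    for name in names:
        groups.setdefault(name_key(name), []).append(name)
    return groups


print(group_variants(["van der Heyden", "Vanderheyden", "van Heide"]))
```

Under such a key, van der Heyden and Vanderheyden fall into one group, so a name would only count as unique if no spelling in its group occurs elsewhere.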
+8 votes
After further consideration, I must say that I do not like the idea of extending Data Doctors unique-name checking to Last Names, and particularly not to Other Last Names (suggestion 797).

Indeed, although I believe I am the member who first made the suggestion to check for misspelled first names, I propose that it is time to eliminate the Unique Names checking feature, except for newly created profiles.

Years ago when I proposed name checking, I was seeing profiles (mostly GEDCOM-created, back when GEDCOMs were imported in batch mode) with odd first-name spellings that I was rather sure were common names that were misspelled due to typing mistakes (consider a spelling like Behjamin or Benjamni). The Unique Names db error has been effective in eliminating a great many errors like those.  Where such spellings are seen now, it probably is because the profile creator says the spelling is correct. Now that GEDCOM profiles need to be individually reviewed, I think the probability of these types of errors in new profiles is lower than it was in the old days, but I do see benefit in retaining this kind of checking for profiles created within the previous year.

Last names are a different situation. Weird undetected typing mistakes are less likely in last names because people generally pay more attention to the spellings of their ancestors' last names than they do to their first names, and because last names typically get replicated to other family members, there is less chance for one-time typos. Furthermore, because the Other Last Names field is not auto-populated by GEDCOMs and is mostly used by experienced WikiTreers, there is a relatively low likelihood of simple typing errors and a very high likelihood that any unusual spellings were entered deliberately after checking and double-checking the source(s). Therefore, DB-797 can be expected to detect very, very few invalid spellings. It can, however, be expected to cause experienced members (such as Isabelle, Bobbie, and me) to take time away from work that most of us consider far more important for the quality of WikiTree.
by Ellen Smith G2G Astronaut (1.5m points)
In addition to putting an extra burden on members who are meticulous about recording name variants, this quantitative approach puts an extra burden on members working with parts of the world that are less densely populated in WikiTree.

If a typo is common enough, it can still get through undetected.
Thank you, Ellen and Eva.
I agree wholeheartedly with Ellen and Eva.
I agree with these comments, especially for Other Last Names - until WikiTree has many, many more Non-English or Non-American profiles (which I hope happens but will take time!), anyone working with such profiles is likely to waste a lot of time on these 797 and other similar "suggestions".

As a hobby I dabble in Hawaiian genealogy and believe me ALL those names are unique since they were descriptive and not patronymic or generalized.   I think many Native American names follow that pattern too.  I consider marking them all as False a big waste of time and hardly "genealogy".
