Could a filter be implemented to automatically remove the "Unknown" junk?

+4 votes
297 views

"A total of 4,888,457 person profiles remain unconnected. "

I don't think so. A million or so are obviously junk of the form "unknown first name, unknown last name, unknown everything..."

Surely the logic to decide "everything = unknown / then delete" could be implemented without losing any real identities?

in The Tree House by Patrick Chadwick G2G6 Mach 1 (13.8k points)
retagged by Ellen Smith

The comments show that I did not make my argument clear enough. It concerns unmanaged, unconnected profiles with zero information, probably generated by faulty imports. These are not incomplete profiles of real people. They are a kind of spam that enormously inflates the list of genuine profiles which we want to sort out.

For anyone who wishes to understand what is happening, please do the following:

Go to “Unconnected People”

Set “View All”

Set “Limit to unmanaged profiles” = Yes

The setting of “Limit to open privacy level” does not appear to matter.

CORRECTION: I forgot the last step!
Set "Most connections on top".

You will see an apparently endless list of profiles with no information. No names, no spouses, no parents, no children, no sources. In short: no information.

But marked as having more than 100 connections.

This looks like software-generated spam that has flooded the list with zero-info pseudo-profiles.

This junk could be removed by a filter along the lines of:

IF
First name = Unknown
AND Last name = Unknown
AND Father = Unknown
AND Mother = Unknown        [etc. etc. etc.]
AND Sources = missing
AND No profile manager
AND More than 100 connections
THEN
Delete profile

- because it does not refer to a real person!
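For illustration only, the rule above could be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `Profile` record and its field names are hypothetical and are not WikiTree's actual data model, the spouse and child checks are taken from the "no spouses, no children" description earlier in the post, and the function only lists candidates for review rather than deleting anything.

```python
from dataclasses import dataclass, field
from typing import List, Optional

UNKNOWN = "Unknown"


@dataclass
class Profile:
    # Hypothetical record layout for illustration; not WikiTree's real schema.
    wikitree_id: str
    first_name: str = UNKNOWN
    last_name: str = UNKNOWN
    father: Optional[str] = None
    mother: Optional[str] = None
    spouses: List[str] = field(default_factory=list)
    children: List[str] = field(default_factory=list)
    sources: List[str] = field(default_factory=list)
    manager: Optional[str] = None
    connections: int = 0


def is_empty_placeholder(p: Profile, min_connections: int = 100) -> bool:
    """True only if every condition of the proposed rule holds at once."""
    return (
        p.first_name == UNKNOWN
        and p.last_name == UNKNOWN
        and p.father is None
        and p.mother is None
        and not p.spouses
        and not p.children
        and not p.sources
        and p.manager is None
        and p.connections > min_connections
    )


def deletion_candidates(profiles: List[Profile]) -> List[Profile]:
    """Return matching profiles for human review; nothing is deleted here."""
    return [p for p in profiles if is_empty_placeholder(p)]


# Made-up example records (the IDs are placeholders, not real profiles).
junk = Profile("Unknown-0", connections=150)
named = Profile("Example-0", first_name="Jane", last_name="Example", connections=150)
print([p.wikitree_id for p in deletion_candidates([junk, named])])
```

Because all conditions are joined with AND, a profile with any real data at all (a name, a relative, a source, a manager) falls outside the rule.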

Sorry, Patrick, I don't understand. I went to what you describe and looked at 500 profiles:

https://www.wikitree.com/index.php?title=Special:Unconnected&limit=500&order=&viewAll=1&orphans=1

While there are a few with first name Unknown or last name Unknown, I didn't see any "Unknown Unknown" with both first and last name Unknown.

Sorry Jim, I forgot the last step:

Set "Most connections on top".

Then the junk appears - all these pseudo-profiles are marked as having more than 100 connections - although they contain no information.

Ok, thanks Patrick. So a direct link with 500 profiles is

https://www.wikitree.com/index.php?title=Special:Unconnected&limit=500&order=connectionsdn&viewAll=1&orphans=1

I do see about 20 "Unknown Unknowns" or similar near the top of this list. But the remaining 480 or so have names with Chinese characters, not entirely unknown. Most of them seem to be GEDCOM imports from a decade back. Although there is little detail, they probably represent real people.

Your criterion "if First name = Unknown" would exclude them from consideration by the software you are proposing.

It seems that the problem is not as bad as I first thought. I gave up after 1200 Unknown Unknowns.

Thanks to everybody who took the trouble to check this out!

There still remains my suspicion that a genealogy going back 10 or more generations without any sources is largely imaginary!

2 Answers

+3 votes
If you don't know something, leave it blank. Please!

(And also sanity check for things like census records and marriages after death, or births in the US and baptisms in the UK and so on.)
by P J Evans G2G6 Mach 1 (15.5k points)
Please see my additional explanation above.
+3 votes

I thought it might be useful to check how many of these "junk" profiles there actually are. According to the latest data in WikiTree+ there are currently 8,742 profiles with unknown first name and unknown last name (https://plus.wikitree.com/default.htm?report=srch1&Query=FirstName%3DUnknown+LastNameatBirth%3DUnknown+CurrentLastName%3DUnknown&MaxProfiles=50000&Format=&PageSize=100)

In one of your comments you suggested that the ones that could be considered for deletion were the ones that also have no mother, no father and no profile manager. That reduces the number to 1,804 (https://plus.wikitree.com/default.htm?report=srch1&Query=FirstName%3DUnknown+LastNameatBirth%3DUnknown+CurrentLastName%3DUnknown+NoFather+NoMother+Orphan&MaxProfiles=50000&Format=&PageSize=100)

Analysing those results in Bio Check to see if any have sources, of the 1,804 profiles it says that 476 are marked as unsourced and another 1,009 are possibly unsourced but not marked.

That suggests that there are actually only about 1,500 profiles that would meet the criteria for deletion. 

Edited to add: If in addition the profile should not have a spouse, that reduces the number to just 400 (https://plus.wikitree.com/default.htm?report=srch1&Query=FirstName%3DUnknown+LastNameatBirth%3DUnknown+CurrentLastName%3DUnknown+NoFather+NoMother+NoSpouses+Orphan&MaxProfiles=50000&Format=&PageSize=100)

by Paul Masini G2G6 Pilot (398k points)
edited by Paul Masini

In the above list there are profiles with children, e.g. Unknown-576352. These should not meet the criteria for deletion.

Adding no children to the criteria leaves just 142 profiles (https://plus.wikitree.com/default.htm?report=srch1&Query=FirstName%3DUnknown+LastNameatBirth%3DUnknown+CurrentLastName%3DUnknown+NoFather+NoMother+NoSpouses+NoChildren+Orphan&MaxProfiles=50000&Format=&PageSize=100)

Edited to add: Actually it looks like there are a few profiles in that list that do have a spouse/child but that spouse/child has an unlisted profile so isn't in WikiTree+. Since no parents, no spouse and no children is the same as unlinked, the actual number of profiles is 112 (https://plus.wikitree.com/default.htm?report=srch1&Query=FirstName%3DUnknown+LastNameatBirth%3DUnknown+CurrentLastName%3DUnknown+Unlinked+Orphan&MaxProfiles=50000&Format=&PageSize=100)
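For anyone who wants to repeat these searches, here is a small Python sketch that rebuilds the WikiTree+ links quoted in this thread from their search terms. The parameter names and terms are copied from those links; the helper `search_url` and the `STEPS` table are my own illustrative names, and the counts are simply the figures reported above, which will change as the data is updated.

```python
from urllib.parse import urlencode

BASE = "https://plus.wikitree.com/default.htm"

# Search terms shared by every link in this thread.
NAME_TERMS = ["FirstName=Unknown", "LastNameatBirth=Unknown", "CurrentLastName=Unknown"]


def search_url(extra_terms=()):
    """Build a search link in the same shape as the ones quoted above."""
    query = " ".join(NAME_TERMS + list(extra_terms))
    params = {
        "report": "srch1",
        "Query": query,  # urlencode turns spaces into '+' and '=' into '%3D'
        "MaxProfiles": 50000,
        "Format": "",
        "PageSize": 100,
    }
    return BASE + "?" + urlencode(params)


# (extra terms, count reported in this thread at the time of posting)
STEPS = [
    ((), 8742),
    (("NoFather", "NoMother", "Orphan"), 1804),
    (("NoFather", "NoMother", "NoSpouses", "Orphan"), 400),
    (("NoFather", "NoMother", "NoSpouses", "NoChildren", "Orphan"), 142),
    (("Unlinked", "Orphan"), 112),
]

for terms, reported in STEPS:
    print(reported, search_url(terms))
```

Each added term narrows the result set, which is why the reported counts fall from 8,742 to 112.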
