Should spaces in last names be ignored for most purposes on WikiTree?

+16 votes
381 views

Hi WikiTreers,

We are in the middle of a change that would improve how our search functions handle last names with spaces and special characters.

To explain it briefly: Right now our systems interpret, for example, St. Laurent and St Laurent as different names. Soon they will be interpreted as the same name.

Ellen Smith has suggested that StLaurent, with no space, also be interpreted as the same name.

In her words: "please teach the search function to overlook the underscore character _ in last names like Van_Dyke and Du_Bois and van_der_Walt when searching for possible duplicates. Currently, if I'm looking to see whether there's an existing profile for an Abraham Van Arnhem, I can enter Abra* and Vanarnhem in the search boxes, and I'll see results for Abram or Abraham with last names of Vanarnhem, Vanaernum, Vanarnum, VanAernam, VanArnem, and more. However, if I want to find variant spellings for the two-word version of the name (and this was properly a two-word name), I have to search separately for Abra* with each possible variant spelling (Van_Arnhem, Van_Aernum, van_Aernam, van_Arnheim, etc.). How hard would it be to teach the search function to treat Van_Arnhem as equivalent to VanArnhem, for the purpose of finding possible duplicates?"

This would be relatively easy to do and now would be the time to do it, before we complete the change we already have in progress.

I'm inclined to think that Ellen is right (she usually is). For search purposes, Van Arnhem and VanArnhem, and O'Reilly and OReilly, etc., should be treated the same.
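To make the idea concrete, here is a minimal sketch of that kind of equivalence. It is not WikiTree's actual code; the function name `surname_key` and the exact rule (keep letters only, then uppercase) are assumptions made for illustration.

```python
def surname_key(last_name: str) -> str:
    """Hypothetical comparison key: keep letters only, then uppercase.

    Spaces, periods, apostrophes, and underscores are all dropped, so
    spelling variants that differ only in those characters collapse together.
    """
    return "".join(ch for ch in last_name if ch.isalpha()).upper()


pairs = [
    ("Van Arnhem", "VanArnhem"),
    ("O'Reilly", "OReilly"),
    ("St. Laurent", "StLaurent"),
]
for a, b in pairs:
    # Each pair compares equal once both sides are reduced to their key.
    print(a, "/", b, "->", surname_key(a) == surname_key(b))
```

Note that under this rule O'Reilly and Reilly still produce different keys, because only punctuation and spacing are ignored, not letters.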

Do you agree? Can you think of cases where the spaces are very significant for matching?

I'd like to say that Van Arnhem and VanArnhem could be considered as similar to each other but not exactly the same, e.g. so that a search for Abraham Van Arnhem would weight another Abraham Van Arnhem more highly than an Abraham VanArnhem. It should be this way. But I'm sorry to say that we have to be fairly black and white here. Either Van Arnhem equals VanArnhem or it doesn't.

Also, if we do this, Van Arnhems and VanArnhems would be lumped together in other contexts, not just search and matching. The change would affect almost all contexts where we group people by surname.

For example, there would be one surname index page: the one without the spaces. All the O'Reillys, O' Reillys, O Reillys, and Oreillys would be on https://www.wikitree.com/genealogy/OREILLY

As another example, the van der Walts would no longer be on https://www.wikitree.com/genealogy/VAN_DER_WALT; they would be on https://www.wikitree.com/genealogy/VANDERWALT, even though nobody on WikiTree has the name Vanderwalt.
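As a rough illustration of how the variants would land on one index page, here is a sketch that builds the page address from a surname. The URL pattern is taken from the two examples above; the function name `surname_index_url` and the letters-only rule are assumptions, not the site's real implementation.

```python
BASE_URL = "https://www.wikitree.com/genealogy/"


def surname_index_url(last_name: str) -> str:
    """Hypothetical helper: map any spelling variant to a single index page."""
    # Assumed rule: drop everything that is not a letter, then uppercase.
    key = "".join(ch for ch in last_name if ch.isalpha()).upper()
    return BASE_URL + key


for variant in ["O'Reilly", "O' Reilly", "O Reilly", "Oreilly", "van der Walt"]:
    print(variant, "->", surname_index_url(variant))
```

All four O'Reilly spellings resolve to the same address, and van der Walt resolves to the VANDERWALT page.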

To find orphaned and unconnected profiles, van der Walts would need to look at Vanderwalt. And van der Walts should start following and using the G2G tag VANDERWALT instead of VAN_DER_WALT.

Some van der Walts might not appreciate all this.

Some of these things can be mitigated, but not easily, and it wouldn't be a priority. Van der Walts would have to get used to seeing and using Vanderwalt a lot.

I don't mean that Philip van der Walt's profile would say Philip Vanderwalt and that's how he'd appear in search results or family trees, etc. Individuals' names would be respected where they are displayed as individuals. It's just that in group contexts, they'd be Vanderwalts.

I don't know how much this would upset the van der Walts, et al., or how heavily this should be weighted in the decision.

Do you have any thoughts?

Thanks!

Chris

in The Tree House by Chris Whitten G2G Astronaut (1.5m points)
retagged by Ellen Smith

In some cases there is an advantage, e.g. van de Venter, which became van Deventer, would now end up in the same list. It would take some getting used to, but it's no biggie when weighed against the advantages.

It would be a great improvement for Belgian genealogies, because they have no fixed way of dealing with Van or De names. For example, van der Straeten can become Vander Straeten, or vanderstraeten, and then change again with the next generation. So ignoring spaces would definitely make those families much easier to deal with.

Depending on where the name is displayed without spaces and with only an initial cap, I think the benefit is not worth it.

Is the change of searching for space/no space something that could be implemented separately from looking for St. or St & surnames with more than one word? If it were separate, could it be easily implemented & reversed? In other words, would it be feasible to implement it temporarily?

With the past two changes in search parameters, there was a massive influx of MatchBot proposals (the first was surname variables, which lasted quite a while; the second just began - matching to Unknowns).

I think that the many complicated duplicates caused by the different space/no-space stylings will all be found within a month or two. After that, finding duplicates would not be so onerous, and the change could be reversed, if that is technically feasible.

Cheers, Liz
MatchBot MP
Join the MatchBot Monitors Project! See [this G2G post].

Hi Chris,

How would this work for the aka (Other Last Name(s)) surname field?

The changes are now being made live.

I don't think there will be many objections to how this is being implemented.

If you notice any problems, post here. However, note that over the weekend we will be clearing and updating various cached content. There will be some inconsistencies until this is completed.

2 Answers

+2 votes

I think this would be a very useful improvement, and I wonder if it can be expanded a bit further.

Can variations with a leading 'de' also be found? So for a name like de Normandie, would a search for Normandie or De Normandie return the same profiles? If so, I am sure we can come up with a handful more suggestions ('de la ', 'la ', 'fitz', 'of ', etc.).

by Joe Cochoit G2G6 Pilot (258k points)
Hi Joe,

I see your point on this, but I'm thinking it would be a bridge too far. Essentially we'd now be saying O'Reillys are Reillys, for example. I imagine that would drive people crazy.

All this, by the way, is because of the technical way we're solving the problem of spaces. It's not just for searches. We're changing how we "normalize" and save names in our database. If the solution being implemented were limited to searches, what you're saying would almost certainly be a good idea. And perhaps we can look at implementing it just for searches later.
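For what a search-only version might look like, here is a rough sketch. Nothing here reflects how WikiTree works: the names `PARTICLES`, `search_keys`, and `search_match` are hypothetical, and the particle list is just the one Joe suggested. The stored key is left untouched; only the search side tries extra candidates.

```python
# Leading particles from the suggestion above (an assumed, incomplete list).
PARTICLES = ("de la ", "de ", "la ", "fitz", "of ")


def letters_key(name: str) -> str:
    """Assumed base rule: letters only, uppercased."""
    return "".join(ch for ch in name if ch.isalpha()).upper()


def search_keys(last_name: str) -> set:
    """Return the full key plus keys with one leading particle removed."""
    keys = {letters_key(last_name)}
    lowered = last_name.strip().lower()
    for particle in PARTICLES:
        if lowered.startswith(particle) and len(lowered) > len(particle):
            keys.add(letters_key(lowered[len(particle):]))
    return keys


def search_match(query: str, stored: str) -> bool:
    """Two names match for search purposes if any candidate keys coincide."""
    return not search_keys(query).isdisjoint(search_keys(stored))


print(search_match("Normandie", "de Normandie"))
print(search_match("Reilly", "O'Reilly"))
```

Because the particle list does not include a bare O, the O'Reillys and Reillys stay separate in this sketch, which is the case I was worried about.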

Chris
+6 votes

Thanks for considering this, Chris. My hope is that it would be possible to change the way spaces in names are handled in WikiTree search functions, without affecting the way these names are displayed or spelled in data fields.

The phenomenon of surnames with spaces in them is a major reason why the New Netherland Settlers project has faced difficulties with managing duplicate profiles. Historically, both families and genealogists have changed names with spaces to the space-free versions of the same names, so we always have to be mindful of the possibility of variant spellings with or without the space.  WikiTree has lacked effective tools for helping deal with these variations. Surname lists for names like my family name of Van Aken don't identify any "related surnames" (see https://www.wikitree.com/genealogy/Van_Aken) and surname pages for the concatenated versions of the same names have "related surnames" lists that don't hint at the existence of the versions of the name that contain spaces (see https://www.wikitree.com/genealogy/Vanaken). And as I noted in my earlier comment, the name search function doesn't deal at all well with spaces in last names. If our search protocols could treat Van Aken and Vanaken as the same name, while retaining the distinction in the actual name fields, it would save a lot of time and energy -- and prevent some gnashing of teeth.

by Ellen Smith G2G Astronaut (1.5m points)
I agree with Ellen. Change the way searching and matching are handled behind the scenes, but leave the display of the names alone.

I'll also point out that, with a bit of work, people wouldn't need to change the tags they follow either. Just change what the tag matches to, just like you would do with searching and matching. That way someone following a tag of "van_der_walt" would match anything tagged "VANDERWALT".

It's really straightforward to do the searching and matching: remove all non-alphabetical characters, either lowercase or uppercase the entire string, and use the result to match. Do that on both the search input and the data being searched. The question is whether you would need another index field in the table, with the match string pre-populated, for performance reasons.
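Something like this is what I have in mind, as a sketch only. It uses an in-memory SQLite table as a stand-in; the table, the column `last_name_key`, and the helper names are all made up for the example and say nothing about WikiTree's real schema.

```python
import sqlite3


def match_key(name: str) -> str:
    """Letters only, uppercased; applied to stored data and search input alike."""
    return "".join(ch for ch in name if ch.isalpha()).upper()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profiles (first_name TEXT, last_name TEXT, last_name_key TEXT)"
)
# Pre-populated, indexed match column so lookups avoid per-row normalization.
conn.execute("CREATE INDEX idx_last_name_key ON profiles (last_name_key)")

people = [
    ("Abraham", "Van Arnhem"),
    ("Abram", "VanArnhem"),
    ("Abraham", "Vanarnhem"),
    ("Philip", "van der Walt"),
]
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?, ?)",
    [(first, last, match_key(last)) for first, last in people],
)


def search(first_prefix: str, last_name: str) -> list:
    """Match on the normalized key; return names exactly as they were entered."""
    cursor = conn.execute(
        "SELECT first_name, last_name FROM profiles "
        "WHERE last_name_key = ? AND first_name LIKE ? ORDER BY rowid",
        (match_key(last_name), first_prefix + "%"),
    )
    return cursor.fetchall()


print(search("Abra", "Van_Arnhem"))
```

The display fields are never rewritten; only the extra key column is used for matching, so a search for Van_Arnhem finds the spaced and unspaced spellings together.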
Ellen, we're rolling out changes right now. It should be a great improvement.
