upload image

WikiTree redirection efficiency

Privacy Level: Open (White)
Date: 18 Mar 2018 [unknown]
Location: [unknown]
Surname/tag: wikitree
Profile manager: Roger Shipman private message [send private message]
This page has been accessed 185 times.

Contents

Can merge redirects be made more efficient?

Whenever two WikiTree profiles are merged, the information from both is combined into one of the two profiles. The other is changed into a "redirect"; its genealogical information is removed and completely replaced with a reference to the newly modified one.

When multiple merges are performed, this can lead to successive redirects, where a redirect points to a redirect (and so on...) that points to the information. This state of affairs causes undue load on the server. Also, it can possibly prevent a Google search from successfully finding the information sought if the number of redirects is too high.

Our hope is that, after a merge where the LNABs[1] match, the profile with a higher number will point to the one with a lower number, preserving the chronological order of their creation, thereby giving credit to the researcher who first documented the person. (Ideally, no merges are needed; even one redirect burdens the server. Or, where merges are unavoidable, all redirects point directly to the first-created profile of the correct LNAB.

When merging, then, it is best, whenever possible, to attempt always to perform a merge into the first matching profile created. This is the best way to avoid these problems. It is, however, not always possible. Sometimes matches are not found because the LNABs do not match, a married name was used and is replaced with Unknown, and then the proper LNAB is found, and so forth. Successive redirects are created.

Redirect collapse — making indirect redirects into direct ones by modifying the links, to help the efficiency of WikiTree and searches — can solve the problem of successive redirects and is a good idea. It is, however, not quite so simple as it seems at first blush.

Redirect collapse explained

Anytime a redirect points to a redirect, the first one could be modified to skip the intermediate, and redirection chains would be shortened. The key to making this work is twofold:

  1. DO NOT REMOVE the redirects in between. Just change the ones earlier in the chain to point to the end. Removing them, as Dennis Wheeler has said, would lead to broken links
  2. ADD CHANGE RECORDS about the “new” merges, so that each change log has a record of all the profiles merged into it.

Note that in this example below, LNAB does not represent any one particular name — each could vary. And the numbers are not intended to be representative. Any one number could be higher or lower than the one before in its chain. I just picked these numbers to make the example easy to follow.

LNAB-1
LNAB-21
LNAB-321
LNAB-421
LNAB-422

Five records have been created. These profiles are referenced in the change log, and any one of them could be referenced by its URL, so all five records MUST continue to exist.

With merges, four of these become redirects, as follows.

LNAB-422 is merged with 421:
  • 422 -> 421
The same user then notices 21:
  • 421 -> 21
The editor of 321 notices 21:
  • 321 -> 21
With all that good information collected, finally:
  • 21 -> 1

Now, this is the (backwards) current state of things:

  • 422 -> 421 -> 21 -> 1
  • 321 -> 21 -> 1

This could be rectified by EditBot. Find a profile in the list of redirects and follow it.

  1. If a second redirect is found: add the first of the three profiles to a list.
  2. Repeat for its referent.
  3. When a second redirect is NOT found, finish and change the whole list to point to the last referent. Also, add change records.

Example: We have 422 -> 421 -> 21. We found a second redirect.

Start a list: 422
Recurse: 422 -> 421, so repeat (look for a redirect) for 421.
421 -> 21 -> 1. We found a second redirect.
Add to the list: {422, 421}
Recurse: 421 -> 21, so repeat (look for a redirect) for 21.
21 -> 1. Only one redirect is found, and not a second.
Using the list, point to the last referent:
Change 422 -> 1, 421 -> 1. Add change log records for each.

Now the state is:

  • 422 -> 1
  • 421 -> 1
  • 21 -> 1

Continuing, the algorithm eventually finds 321.

  • 321 -> 21 -> 1.

Here 321 is added to a list, changed to point to 1, a change record added.

Now we have the state we want.

Note that whole process eventually makes even the algorithm slightly more efficient.

No links are lost, but each chain is now of length 1, and the change records are done as if all merges had happened correctly, by merging with the final profile.

The Change Log

This process is not complicated, although executing it involves examining each profile that has ever been merged, and following any redirects. (Note that in my example, changing 21 -> 1 means finding and changing all of 422, 421, 321. See below.) Because long chains are rare, and would become more so, the number of examinations would approximate the number of profiles (O(N), for the technical). This could mean millions or many thousands of checks. But doing this in the background (the process would be suspended whenever "real" work needed to be done) would not add much burden to the server, and the result would reduce server load.

The change log for each intermediate profiles would need to be updated.

An even simpler way?

Steven Tibbetts points out that the algorithm may only need to be run once. After that, the whole thing may be able to be managed by the merge tool, simply by looking at the list of merges already made into a merged profile.
In other words, when merging 21 -> 1, if the merges from 422, 421, 321 can be found in the change log for 21, they do not need to be found by a offline search, just updated to point to 1.

History

The intermediate links being removed, the change history would look as if all merges were direct. To show the proper history, it would be necessary to examine the change logs of all merged profiles and reconstruct it. This is also not complicated - and rare. It would only be needed when going to the Changes page.

How efficient that process would be would depend on the difficulty of traversing the Change log for a profile and determining what all was merged into it.

Disclaimer

This analysis would be more complete, but there are several things I don't know about the WikiTree database, but especially the Change logs.

Am I missing anything?

More information

Footnotes
  1. LNAB = Last Name At Birth

Acknowledgments

Created by Shipman-738 00:41, 19 March 2018 (EDT)

The author has 40 years experience as a software engineer.





Collaboration
  • Login to edit this profile and add images.
  • Private Messages: Send a private message to the Profile Manager. (Best when privacy is an issue.)
  • Public Comments: Login to post. (Best for messages specifically directed to those editing this profile. Limit 20 per day.)
Comments

Leave a message for others who see this profile.
There are no comments yet.
Login to post a comment.