
The Sourcing Loophole

Privacy Level: Public (Green)
Date: [unknown] [unknown]
Location: Globalmap
Surnames/tags: profiles sources

Final Weekly Update, Effect of 2021 Source-a-Thon

Weekly tracking of {{Unsourced}} profiles on this page ends after logging the Data Doctors report of 31 October 2021. Further updates of this page, if any, will include new statistical estimates of the total number of profiles without sources only.

The attached graph of “Unsourced Profiles” shows the impact of the 2021 Source-a-Thon. The large drop (about 38,000 profiles) between 3 and 10 October resulted from the Source-a-Thon of the first three days of October. The relatively rapid growth in the number of profiles with the {{Unsourced}} template during September also reflects Source-a-Thon activity, as members sought profiles without sources and added the template in preparation for the contest. The graph also makes clear that outside the Source-a-Thon “season,” the number of templated profiles grows (July) or declines (October) quite slowly.

Meanwhile, the total number of profiles without sources continued to grow or decline at an unknown rate. The “Profiles Without Sources” graph illustrates the scale of the actual problem, compared to the small subset marked with the template. The “Low Estimate” and “High Estimate” lines represent the statistical estimates from the study described below, adjusted for Source-a-Thon activity. Although the statistical estimate is extremely imprecise, it does confirm that the true number of profiles without sources is many times higher than the known number of {{Unsourced}} ones. The fact that we have to rely on such estimates, obtained only at the expense of many hours of volunteer labor, is disturbing in itself. For WikiTree to have any hope of managing this problem, we first need a better way to measure it.

A new question is open on G2G about this update, so once again please hold any discussion there rather than on this page.

Purpose

This page aims to describe WikiTree's policy about adding sources, including the magnitude and growth of the resulting population of unsourced profiles. It is meant as a statement of the problem, for reference when seeking solutions on G2G. It describes the importance of sources in the context of our Honor Code, and the current policy on sourcing profiles. It goes on to present data from WikiTree + on profiles marked {{Unsourced}}, along with two statistical studies about unsourced profiles. Analysis of a random sample of Open profiles, described below, indicates that between 4.9 million and 8.1 million WikiTree profiles have no sources.

Background

WikiTree's Honor Code, Point VIII, states: “We cite sources. Without sources we can't objectively resolve conflicting information.” The sourcing requirement also relates to Point II, Accuracy, in that assertions of accuracy mean nothing if not supported by sources. Sourcing to a book or website provides an important means of giving credit (Point VII) and respecting copyrights (Point VI). In addition to explicit inclusion in the Honor Code, it's worth noting that well-sourced profiles stand much less chance of duplication or conflation, and that large numbers of unsourced or poorly sourced profiles detract from WikiTree's public image.

WikiTree strongly enforces most Honor Code points, notably Collaboration, Assuming mistakes are unintentional, Courtesy, and Privacy. In contrast, WikiTree does not require adding sources to every profile and never has required it. Chris Whitten, in proposing the current “Add person” implementation, said, “Although citing sources is required by our Honor Code..., entering something in the 'Source(s)' field isn't technically required.”[1]

As of this writing (28 July – 16 August 2021), on trying to save a profile with an empty “Sources” field, the user sees a reminder that

“A source is required but you can select one of the following:
Unsourced family tree handed down to [the user, or]
Source will be added by [the user] by [the next calendar day].”

The first option reads somewhat differently if the profile represents a person born less than 100 years ago. If the user selects the second option (but not either version of the first option), the system automatically adds the {{Unsourced}} template to the profile. A third option exists, to satisfy the “Add person” data checks by making an entry in the “Sources” field. As I understand it, the system only checks for the presence of text in the field. Anything therefore passes inspection, including “Ancestry.com,” “Family history,” or “First-hand knowledge [of 18th or 19th century events].” Once the user negotiates the data check and saves the profile, WikiTree leaves the addition of sources entirely in the hands of the profile creator or diligent third parties (e.g., Sourcerers).

Among the justifications for this lenient policy, two stand out. First, that some members prefer to save a profile without sources, then immediately edit it to include a full biography with inline references, and requiring an entry in the Sources field interferes with their method. Second, “that many members ... haven't yet learned the importance of collecting and preserving sources to explain their reasoning to other people.... We need to be patient with people as they learn to use WikiTree....”[2]

The current policy was proposed 11 May 2017[1] and implemented two months later[3] with minor revisions to the first, “Personal recollection/Unsourced tree,” option, despite extensive discussion eliciting numerous good ideas that might have improved its effectiveness. A recent proposal to eliminate the “promise” option[4] was rejected despite receiving seven to eight times more Yes than No votes. This proposal also generated vigorous discussion, resurrecting many of the same constructive suggestions disregarded in 2017.

The {{Unsourced}} Template

WikiTree has 1,054,000 profiles with the {{Unsourced}} template. 918,000 of these profiles have Open privacy. These figures are from the Data Doctors report of 8 August 2021, using WikiTree + to search for text “Unsourced_Profiles” and “Unsourced_Profiles Open.” An unknown number of these profiles have sources and bear the template in error. An unknown number of profiles also exist with neither a source nor a template. Linear regression analysis shows that profiles with the template increased at a rate of about 45 profiles/day during the period 5 July through 8 August 2021 (see table under Recent Data from Data Doctors Reports). As shown below, profiles with the template poorly represent the actual population of unsourced profiles.
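As a rough check on the regression figure, the sketch below fits an ordinary least-squares line to the six weekly reports dated 4 Jul through 8 Aug 2021 in the table under Recent Data from Data Doctors Reports. The function name `slope_per_day` and the data layout are illustrative assumptions, not the tool actually used for this page.

```python
from datetime import date

# Weekly counts copied from the Recent Data table:
# (report date, total templated profiles, templated Open profiles).
REPORTS = [
    (date(2021, 7, 4), 1_052_732, 916_683),
    (date(2021, 7, 11), 1_053_033, 916_899),
    (date(2021, 7, 18), 1_053_087, 916_986),
    (date(2021, 7, 25), 1_053_649, 917_542),
    (date(2021, 8, 1), 1_053_683, 917_638),
    (date(2021, 8, 8), 1_054_417, 918_395),
]


def slope_per_day(dates, counts):
    """Least-squares slope of counts against elapsed days since the first report."""
    xs = [(d - dates[0]).days for d in dates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


dates = [row[0] for row in REPORTS]
total_rate = slope_per_day(dates, [row[1] for row in REPORTS])
open_rate = slope_per_day(dates, [row[2] for row in REPORTS])
print(f"All templated profiles:  {total_rate:.1f} per day")
print(f"Open templated profiles: {open_rate:.1f} per day")
```

On these six reports the fitted slope comes out close to 45 profiles per day for all templated profiles and close to 46 per day for the Open subset.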

Statistical Studies of Open Profiles

I conducted two studies of Open profiles, to explore the relationship between profiles with no sources and those with the {{Unsourced}} template. One study looked only at dateless Open profiles and the other looked at all Open profiles. In both studies, I drew a random sample of profiles from the 8 August 2021 Data Doctors report on WikiTree +, then examined each profile to determine whether it had sources, and whether it had the {{Unsourced}} template. The populations were restricted to Open profiles, to ensure access to the biographies and, if necessary, the Changes logs. I examined the profiles the evening of 13 August and the day of the 14th (United States Mountain Daylight Time). I did not edit or correct the profiles, but others might have done so after the studies.

A specific but unknown percentage of all Open profiles have sources. To estimate the percentage, we can take a random sample of profiles and ask the Yes/No question, “Does the profile have a source?” The percentage of profiles with (or without) sources provides an estimate of the unknown percentage in the entire population, because the number of “Yes” answers follows a binomial probability distribution. Well-established statistical methods can also give us the range within which the actual percentage likely resides.

Source or Not a Source?

Assessing individual profiles as Yes/No, Sourced or Unsourced, requires a clear definition of a source. In deciding what qualifies as a source, I tried to follow the Help guidance that “A source is the identification of where you obtained information,” tempered with some common sense. I rejected “sources” that could not have provided the stated information, that are not themselves supported by sources, or that are not useful for directing a reader to the actual source material.
Examples of entries rejected as sources:
  • Bad links to Ancestry trees and working links to Private Ancestry trees;
  • “First-hand information” about events the WikiTreer could not have witnessed;
  • GEDCOM imported references not leading to actual records, websites, images, etc;
  • Links to trees or profiles behind paywalls (no way to see any underlying sources);
  • FindAGrave memorials without headstone images or other sources;
  • Unsourced family trees and unsourced profiles at other websites;
  • Vague references to “Family research,” “MyHeritage.com,” “Birth and death records,” etc.
Any item above might rightfully belong among a list of sources on a profile, but none can stand alone as a profile's only source. In the studies that follow, profiles whose only source had these “not a source” characteristics were counted as unsourced. Some might consider profiles assessed under the above criteria “undersourced” rather than “unsourced.” Reasonable people can disagree, and certainly we could apply looser or stricter criteria. Other than producing a different count of unsourced profiles, a different set of rules does not change the method of sampling or analysis.

Forty Undated Profiles

I collected forty profiles from the Data Doctors report of 8 August 2021, using the text query “B0 D0 Open,” meaning no birth date, no death date, and Open privacy. The query returned 478,345 profiles, shown ten per page on pages numbered 0 through 47,834. To obtain a random sample, I used the random number function in a spreadsheet program to generate 40 page numbers from that range. I examined the first profile on each resulting page, asking, “Does it have any valid sources?” and “Does it have an {{Unsourced}} template?”
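The page-number draw described above can be sketched in a few lines. This is only an illustration: the original sample came from a spreadsheet's random number function, and the choice here to draw without replacement (so no page repeats) is an assumption, as is the name `draw_page_numbers`.

```python
import random

TOTAL_PAGES = 47_835   # result pages numbered 0 through 47,834
SAMPLE_SIZE = 40


def draw_page_numbers(total_pages, sample_size, seed=None):
    """Return distinct, sorted page numbers chosen uniformly from 0..total_pages-1."""
    rng = random.Random(seed)  # pass a seed to make the draw repeatable
    return sorted(rng.sample(range(total_pages), sample_size))


pages = draw_page_numbers(TOTAL_PAGES, SAMPLE_SIZE)
print(pages)  # the first profile on each of these pages would be examined
```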
Among the forty profiles:
  • 32 profiles had no legitimate sources, as described above. None of the 8 sourced profiles had an {{Unsourced}} template.
  • 6 profiles had the {{Unsourced}} template, only one of which was added at creation. None of the profiles so marked had sources.
The sample indicates that 80% of undated profiles (32/40) have no sources, and about 80% of the unsourced ones (26/32) are not marked with the template. 95% confidence limits for the binomial distribution give a range of 64% to 91% of undated profiles without sources. The wide confidence interval (64-91%) results from examining only forty profiles. Larger sample sizes produce narrower confidence intervals.
The sample contained no profiles incorrectly tagged {{Unsourced}}. 15% of the profiles (6/40, as noted above) had the {{Unsourced}} template. The entire population of 478,345 undated profiles has 58,415 (12%) marked {{Unsourced}} (found by adding “Unsourced_Profiles” to the text query). For the 15% estimate of {{Unsourced}} profiles, the 95% confidence interval is 6% to 30%, which is not inconsistent with the known value of 12%.
Although imprecise due to the small sample size, this study validates the following statements:
  • A large majority (64 to 91%) of profiles without dates also lack sources.
  • A large majority of undated, unsourced profiles lack the {{Unsourced}} template.
  • The percentage of profiles incorrectly tagged Unsourced is probably small.
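The confidence limits quoted in this study are consistent with the exact (Clopper-Pearson) binomial interval. The sketch below computes that interval with the standard library only, solving for each limit by bisection; it is an assumed reconstruction, and the function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect(func, target, increasing):
    """Find p in [0, 1] where a monotonic func(p) equals target."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(k, n, conf=0.95):
    """Two-sided exact binomial interval for k 'yes' answers out of n."""
    alpha = 1 - conf
    # Lower limit: the p at which seeing k or more successes has probability alpha/2.
    lower = 0.0 if k == 0 else bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, True)
    # Upper limit: the p at which seeing k or fewer successes has probability alpha/2.
    upper = 1.0 if k == n else bisect(
        lambda p: binom_cdf(k, n, p), alpha / 2, False)
    return lower, upper


for label, k in (("without sources", 32), ("with template", 6)):
    low, high = exact_interval(k, 40)
    print(f"{label}: {k}/40, 95% interval {low:.1%} to {high:.1%}")
```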

160 Open Profiles

I used a larger sample taken from all Open profiles to obtain an estimate of the total number of unsourced profiles, compared to the number marked as such. In WikiTree +, searching the 8 August 2021 data set for the text query “Open” yielded 21,813,695 profiles, with 918,395 {{Unsourced}}. I selected 160 of the 21.8 million at random, using the same method described for the forty undated profiles study.
The 160 profiles contained
  • 47 without sources (none of the 113 with sources was tagged {{Unsourced}} in error), and
  • 8 profiles marked {{Unsourced}}.
The sample indicates 29% of all Open profiles have no sources, with a 95% confidence interval between 22% and 37%. In terms of the entire Open population, that is between 4.9 and 8.1 million unsourced profiles. Again, sample size determines the width of the confidence interval, or range of the estimate. Improvements in precision carry a steep price: to reduce the estimated range of unsourced profiles from 3.2 million (8.1 minus 4.9 million) to 320,000 would require checking 16,000 profiles rather than 160.
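The cost of added precision follows from the interval width shrinking with the square root of the sample size. The sketch below illustrates this with the normal approximation to the binomial, which is an assumption made for simplicity; it gives a range slightly narrower than the exact 3.2 million quoted above, but the tenfold scaling is the same.

```python
from math import sqrt

OPEN_PROFILES = 21_813_695   # Open profiles in the 8 August 2021 report
Z_95 = 1.96                  # normal quantile for a two-sided 95% interval


def interval_width(p_hat, n, z=Z_95):
    """Approximate full width of the confidence interval for a proportion."""
    return 2 * z * sqrt(p_hat * (1 - p_hat) / n)


p_hat = 47 / 160
for n in (160, 16_000):
    width = interval_width(p_hat, n)
    print(f"n = {n:>6}: range of about {width * OPEN_PROFILES:,.0f} profiles")
```

Multiplying the sample size by 100 divides the width by 10, which is why checking 16,000 profiles would be needed to narrow the range tenfold.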
Only 5% (2% to 10%, at 95% confidence) of Open profiles are marked {{Unsourced}}. As in the smaller study, the estimate for templated profiles (2-10%) is not inconsistent with the known figure of 4.2% (918,395 of 21,813,695). The absence of any sourced profiles incorrectly marked {{Unsourced}} gives an upper limit (95% confidence) of 2.3% of Open profiles with this error. If the 113 sourced profiles in this study represent a random sample (they might not), then the upper limit for sourced profiles incorrectly templated is 3.2%.
This study validates the following statements:
  • The number of profiles with the {{Unsourced}} template grossly underestimates the actual number of profiles without sources.
  • Only a small percentage (at most 2 – 3%) of profiles with sources incorrectly have an {{Unsourced}} template.
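When a sample contains no instances of an error at all, the upper confidence limit has a simple closed form. The sketch below assumes the two-sided exact binomial limit, which reproduces the 2.3% and 3.2% figures quoted above; the function name is hypothetical.

```python
def zero_event_upper_limit(n, conf=0.95):
    """Upper limit on a proportion when 0 of n sampled items show the event.

    Solves (1 - p) ** n == alpha / 2 for p (two-sided exact binomial limit).
    """
    alpha = 1 - conf
    return 1 - (alpha / 2) ** (1 / n)


print(f"0 of 160 Open profiles:     {zero_event_upper_limit(160):.1%}")
print(f"0 of 113 sourced profiles:  {zero_event_upper_limit(113):.1%}")
```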

Conclusions

At minimum, our tree contains about 5 million unsourced profiles. That corresponds, on average, to more than 1,000 new unsourced profiles every day of the site's thirteen-and-a-half-year history. Recent reports (see table below) show the rate of increase of profiles marked {{Unsourced}} at about 46 per day. The statistical studies show clearly that the actual number of unsourced profiles exceeds the {{Unsourced}} number by a factor of at least five, but we have no way to obtain a direct count of the actual number. For reasons beyond the scope of this report, it is not practical (and may not be possible) to track the growth or decline of unsourced profiles with statistical methods. It seems we will need direct counting to do that.
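The two ratios in the paragraph above are simple arithmetic, sketched here for anyone who wants to rerun them with updated figures. The constants come from this page; the 365.25-day year is an assumption.

```python
MINIMUM_UNSOURCED = 5_000_000      # "about 5 million" minimum from the study
LOW_ESTIMATE = 4_900_000           # lower 95% limit for Open profiles
TEMPLATED_OPEN = 918_395           # Open profiles with the template, 8 Aug 2021
SITE_AGE_YEARS = 13.5

per_day = MINIMUM_UNSOURCED / (SITE_AGE_YEARS * 365.25)
factor = LOW_ESTIMATE / TEMPLATED_OPEN

print(f"Average new unsourced profiles per day: {per_day:,.0f}")
print(f"Unsourced profiles per templated profile: at least {factor:.1f}")
```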

Regarding the ongoing creation of unsourced profiles, members who save profiles then immediately edit to add sources do not contribute to the problem. For others, the “Add person” page tells them “A source is required,” then immediately relieves them of that requirement. Thus the responsibility for sourcing devolves to all members, in effect making no one responsible and robbing “We cite sources” of meaning.[5]

The data show that the “patient” approach fails to put sources on profiles. The “source will be added” option has seen widespread misuse since it first appeared in July 2017. Four years of Sourcerer and other clean-up activity notwithstanding, anyone can easily find in the weekly Data Doctors report examples of {{Unsourced}} profiles, some pre-1700, with unfulfilled “source will be added” promises dating back to the initial implementation. The other options for not adding a source probably suffer similar misuse, but lack of the automatic {{Unsourced}} template makes them more difficult to track. I think it would benefit the tree greatly to close this sourcing loophole altogether, and not allow members to save unsourced profiles. Short of that, we should at least apply the {{Unsourced}} template to all profiles a creator declines to source.

Another change that would help manage this issue would be to program a bot to crawl the tree searching for profiles without sources and add the {{Unsourced}} template to them. That, of course, would require implementing an algorithm to detect sources within the mixed text of biographies. For WikiTree ever to reverse the trend (if one exists) of steadily increasing numbers of unsourced profiles, we must first find a way to stop or significantly slow the creation of new ones.
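To make the detection problem concrete, here is a deliberately naive sketch of what such a check might look like. Everything in it is an assumption for illustration: it presumes biographies use a "== Sources ==" heading and inline <ref> tags, the list of vague entries is drawn from the examples on this page, and the function names are hypothetical. A real bot would need far more careful rules and human review.

```python
import re

# Hypothetical entries too vague to count as a source on their own.
VAGUE_ENTRIES = {
    "ancestry.com", "family history", "family research",
    "myheritage.com", "birth and death records",
}

# Inline citations such as <ref>...</ref> (self-closing tags are ignored).
REF_TAG = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.IGNORECASE | re.DOTALL)

# Text under a "== Sources ==" heading, up to the next same-level heading.
SOURCES_SECTION = re.compile(
    r"^==\s*Sources\s*==\s*$(.*?)(?=^==[^=]|\Z)",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def candidate_entries(biography):
    """Collect inline citations and lines listed under the Sources heading."""
    entries = [text.strip() for text in REF_TAG.findall(biography)]
    section = SOURCES_SECTION.search(biography)
    if section:
        for line in section.group(1).splitlines():
            line = line.strip().lstrip("*#: ").strip()
            if line and not line.lower().startswith("<references"):
                entries.append(line)
    return [entry for entry in entries if entry]


def looks_sourced(biography):
    """True if at least one entry is not on the vague list."""
    return any(
        entry.lower().rstrip(".") not in VAGUE_ENTRIES
        for entry in candidate_entries(biography)
    )


sample = "== Biography ==\nBorn in Ohio.\n== Sources ==\n<references />\n* Ancestry.com\n"
print(looks_sourced(sample))
```

A check like this can only flag candidates. Deciding whether an entry really identifies where the information came from, as in the criteria under Source or Not a Source?, still takes a person.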

Note: To avoid having similar discussions in two places, I ask that you post responses and ideas on the companion G2G question rather than on this page.

Recent Data from Data Doctors Reports

Report Date    Total Unsourced    Unsourced Open
4 Jul 2021     1,052,732          916,683
11 Jul 2021    1,053,033          916,899
18 Jul 2021    1,053,087          916,986
25 Jul 2021    1,053,649          917,542
1 Aug 2021     1,053,683          917,638
8 Aug 2021     1,054,417          918,395
15 Aug 2021    1,055,596          919,510
22 Aug 2021    1,056,873          920,602
29 Aug 2021    1,061,041          924,684
5 Sep 2021     1,066,857          930,468
12 Sep 2021    1,069,496          933,157
19 Sep 2021    1,073,076          936,744
26 Sep 2021    1,075,566          939,255
3 Oct 2021     1,077,704          941,483
3 Oct 2021     End of Source-a-Thon
10 Oct 2021    1,039,991          904,018
17 Oct 2021    1,039,124          903,203
24 Oct 2021    1,038,756          902,819
31 Oct 2021    1,037,824          901,943


Acknowledgments

Many thanks to Julie Kelts and other friends and associates who provided invaluable help and advice in writing this page.

Sources

  1. https://www.wikitree.com/g2g/391719/what-do-you-think-this-plan-for-making-source-required-field
  2. https://www.wikitree.com/g2g/1265434/proposal-remove-source-added-later-option-profile-creation?show=1273601#a1273601
  3. https://www.wikitree.com/g2g/423109/did-you-see-the-new-source-requirement-when-creating-profiles
  4. https://www.wikitree.com/g2g/1265434/proposal-remove-source-added-later-option-profile-creation
  5. Wikipedia contributors, "Diffusion of responsibility," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Diffusion_of_responsibility&oldid=1028257045 (accessed July 28, 2021).

Edited 16 Aug 2021 to expand some sections and to incorporate results of the two statistical studies.





Collaboration
Comments: 8

Herb et al., would you mind providing the data sources behind the numbers you are collecting? I think you have noted elsewhere that the data is unreliable, but perhaps reviewing the data sources can help us better understand how we can better track and report on this information.
posted by Steven Harris
Steven, the data comes from the weekly Data Doctors report, which anyone can access. For the chart above, the figures on the 8 Aug row can be found by searching in WikiTree + for text "Unsourced_Profiles" and "Unsourced_Profiles Open," with the Max Profiles parameter set high enough to capture all matching profiles. I only know how to access the current week's report, but perhaps Aleš has prior weeks' reports archived or otherwise accessible. I log the counts every week but do not keep data downloads or screenshots of the query results.

The data is reliable to the extent that the Data Doctors report accurately answers the text query. Participants in the G2G discussion widely agree that a large but unknown number of profiles have no sources but are not marked {Unsourced}. And certainly some of the marked profiles actually have sources. Please see the G2G discussion for caveats and estimates beyond the raw DD data.

Thanks for your interest in this issue.

posted by [Living Tardy]
edited by [Living Tardy]
Thanks Herb. Perhaps we should look into more accurate reporting options before proceeding too far into interpreting data with all of the unknowns in reliability and accuracy.
posted by Steven Harris
I would certainly welcome more accurate reporting, and I look forward to you providing it. In the meantime, the consensus at G2G, among members fully aware of the causes of possible inaccuracy, is that the data reported here lean heavily toward underestimating the actual numbers. As long as these are the only hard data available (and made available, I might add, by a WikiTree Team member), I will continue updating this page. The text of the page includes disclaimers about uncertainties, and as I mentioned the uncertainties are also discussed at G2G. I think our intelligent and knowledgeable members should be able to see these data and understand them, and reach their own interpretations. I encourage you to do likewise.

I also welcome you to bring up your concerns on the G2G question, to allow all interested parties the opportunity to address them.

posted by [Living Tardy]
Hi Herb, perhaps you misunderstood.

I did not suggest that you discontinue the work, nor am I discounting the fact that Unsourced Profiles pose an issue to the site and are contrary to our collective efforts of a sourced single-family tree. I am simply saying that using unreliable data to draw conclusions, no matter how you try to explain all the caveats and no matter where that data comes from, is not in the best interest of WikiTree or your own mission to help reduce the number of Unsourced Profiles. The real objective, at least to start, would be to accurately identify Unsourced Profiles, and then use that data to further efforts to reduce them.

Estimating, guessing, and allowing others to come to their conclusions based on limitations can do much more harm than good.

posted by Steven Harris
Thank you Steven. I think we would all benefit if you make that point on the G2G question.

Kay Knight posted a data-based answer citing her study using the BioCheck app, where she estimated a much higher number of Unsourced profiles. We can debate the exact magnitude of the problem based on estimates and 'guessing,' but I think that diverts from the main issue. My point on this page, and on the G2G question, is that this is a large and growing problem that has been allowed to grow throughout the entire history of WikiTree. That is the only conclusion I've drawn, and I believe the data support it, which is why I presented the data. If you can make a convincing argument that my conclusion is incorrect, that the problem is either not large or not growing, I think all our fellow WTers would like to see your case in the public forum.

I would greatly appreciate you making any further comments on the G2G question. Not only would that add a different point of view to the discussion there, but it would relieve me from dividing my attention between two different conversations on the same topic.

Just as I finished writing, I see you posted an answer on G2G. Thanks!

posted by [Living Tardy]
Thank you, Herb, for putting this together, and bringing it to G2G. Unsourced profiles are a huge problem for WikiTree.
posted by Nan (Lambert) Starjak
Very happy to help, Herb. Thank you for all your hard work and for once again presenting an important WikiTree issue with such clarity.
posted by [Living Kelts]