Wikitree Statistics


Last updated 17 Nov 2024 - added Missing Location results

This page provides an ongoing summary of Wikitree quality indicators.

General statistics on profiles within Wikitree can be found at https://plus.wikitree.com/default.htm?report=stat1 (under "Database dump", select the type of statistics of interest and click "Get table").

Overall Wikitree Status


The following table of data shows the growth of Wikitree over time in terms of total number of profiles, number of connected profiles, and number of profiles with DNA test connections. These numbers are as reported by Wikitree.

Technical notes:

  • Prior to about Feb 2016 the total number was 8% overstated since it included merged profiles.
  • The drop in the number of DNA-linked profiles in May 2018 was due to changes made to comply with EU privacy rules.
  • Some early values are monthly averages.
Date Total Profiles Connected Profiles Profiles with DNA
14 Nov 2024 39,973,814 34,689,761 14,041,855
15 Nov 2023 36,210,475 31,230,244 12,415,008
10 Jun 2023 34,651,302 29,747,035 11,700,445
8 Nov 2022 32,353,319 27,594,516 10,664,900
7 Jun 2022 30,774,171 26,127,434 9,956,897
8 Nov 2021 28,620,028 24,127,642 8,958,243
6 Jun 2021 27,138,283 23,744,052 8,274,116
3 Nov 2020 24,996,828 20,803,351 7,397,210
9 Jun 2020 23,596,972 19,522,130 6,742,871
1 Jan 2020 22,210,087 18,268,598 6,173,368
6 Nov 2019 21,798,127 17,868,106 5,982,205
30 Jul 2019 21,011,879 17,155,386 5,678,443
18 Jun 2019 20,645,245 16,800,932 5,524,106
27 Mar 2019 20,005,060 16,224,171 5,220,156
8 Jan 2019 19,308,041 15,575,517 4,866,658
6 Nov 2018 18,779,693 15,076,525 4,629,120
1 Oct 2018 18,496,225 14,804,060 4,499,992
28 May 2018 17,482,356 13,852,666 4,004,584
16 May 2018 17,437,255 13,802,722 4,270,056
8 May 2018 17,372,228 13,738,055 4,235,537
8 Apr 2018 17,105,544 13,497,437 4,088,486
28 Jan 2018 16,213,128 12,860,615
29 Oct 2017 15,581,091 12,069,345 3,200,000
9 Jul 2017 14,579,940 11,220,333
25 Apr 2017 13,902,843 10,642,591
16 Apr 2017 13,881,513 10,579,228
29 Jan 2017 13,157,338 10,023,513
24 Jul 2016 11,831,219 8,891,739
8 Feb 2016 10,629,448 7,869,136
1 Jan 2016 11,378,699 7,640,805
1 Jul 2015 10,259,275
1 Jan 2015 8,945,881
1 Sep 2014 8,094,866
1 Jan 2014 6,567,960
1 Jul 2013 5,489,983
1 Jan 2013 4,502,821
12 Jan 2012 3,000,000
23 Jul 2011 2,000,000
20 Dec 2010 700,000
31 Aug 2010 200,000
15 Nov 2009 50,000
18 Jun 2009 20,000
31 Jan 2009 50
1 Nov 2008 1
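
As a rough illustration of how the table can be read, the sketch below takes the two most recent November rows and derives the connected share, the DNA share, and an annualized growth rate. The helper names (`share`, `annual_growth`) are made up for this example, and the growth figure assumes simple compounding between the two dates.

```python
from datetime import date

# Two rows copied from the table above: (date, total, connected, with DNA).
rows = [
    (date(2023, 11, 15), 36_210_475, 31_230_244, 12_415_008),
    (date(2024, 11, 14), 39_973_814, 34_689_761, 14_041_855),
]

def share(part, total):
    """Fraction of all profiles that fall in a given category."""
    return part / total

def annual_growth(earlier, later):
    """Annualized growth in total profiles between two table rows
    (assumes compounding; an illustrative choice, not the page's method)."""
    days = (later[0] - earlier[0]).days
    return (later[1] / earlier[1]) ** (365.25 / days) - 1

old, new = rows
print(f"Connected share: {share(new[2], new[1]):.1%}")
print(f"DNA share:       {share(new[3], new[1]):.1%}")
print(f"Annual growth:   {annual_growth(old, new):.1%}")
```

The same two helpers can be applied to any other pair of rows that have all three counts filled in.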


Wikitree Profile Accuracy

Ideally we would have a measure of correct profiles. Possibly in the future there will be some stamp of approval that indicates a profile has been reviewed and is considered accurate. In the meantime, however, we can monitor the number of known incorrect profiles.

In particular, the Database Suggestions report runs a number of checks on profiles. Many of these checks are related to missing information (e.g. gender), unusual information (e.g. spelling of names) or formatting issues. However, a number of these checks identify information in a profile summary that is not physically possible, or at least not internally consistent: for example, being born after a child was born. These identified consistency errors provide a measure of the known inaccuracy of the listed profiles.

Technical notes:

  • Not all identified consistency errors are real, but in my experience most are.
  • A profile can have more than one consistency error. No correction is made for this.
  • The following Suggestions Report items are considered consistency errors: 101, 102, 103, 104, 111, 112, 201, 202, 203, 205, 206, 207, 208, 209, 210, 301, 303, 305, 306, 307, 308, 309, 310, 401, 403, 404, 405, 406, 407, 408, 412, 413, 414, 415, 417, 418, 606, 636, 666, 911, 912. [slight changes over time]
  • Total number of suggestions changes as new checks are added, and is not tracked here as many suggestions are not profile errors.
Date Total Profiles Consistency Errors
14 Nov 2024 39,973,814 90,507
15 Nov 2023 36,210,475 101,158
8 Nov 2022 32,353,319 96,533
7 Jun 2022 30,774,171 96,719
8 Nov 2021 28,620,028 99,261
6 Jun 2021 27,138,283 99,927
1 Nov 2020 24,986,750 103,686
9 Jun 2020 23,596,972 106,938
6 Nov 2019 21,798,127 112,818
20 Jun 2019 20,645,245 117,033
6 Nov 2018 18,779,693 132,604
29 Jul 2018 17,994,278 147,578
28 May 2018 17,482,356 154,059
16 May 2018 17,437,255 154,007
8 Apr 2018 17,105,544 161,762
25 Feb 2018 16,755,359 164,800
29 Oct 2017 15,581,091 180,122
9 Jul 2017 14,632,639 198,757
16 Apr 2017 13,881,513 204,923
24 Jul 2016 11,765,939 258,754
29 May 2016 11,442,341 276,549
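
To put the raw counts in proportion, the errors can also be expressed per profile. The sketch below does this for three rows of the table; `error_rate` is a name invented for this example. Because one profile can carry several errors, the ratio can overstate the share of affected profiles.

```python
# (date label, total profiles, consistency errors), copied from the table above.
rows = [
    ("29 May 2016", 11_442_341, 276_549),
    ("6 Nov 2019", 21_798_127, 112_818),
    ("14 Nov 2024", 39_973_814, 90_507),
]

def error_rate(total, errors):
    """Consistency errors per profile. A profile can have more than one
    error, so this can overstate the share of affected profiles."""
    return errors / total

for label, total, errors in rows:
    print(f"{label:>12}: {error_rate(total, errors):.2%}")
```

Note that this ratio falls both when errors are fixed and when new profiles are added, so it complements the raw count rather than replacing it.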

The following chart shows how the number of known consistency errors has decreased over time.


Wikitree Profile Sourcing

A good Wikitree profile is well sourced. There are currently no counts available for the number of profiles with sources. A measure of the sourcing is provided here by analyzing a random sample of profiles. Profiles can be randomly sampled based on their Wikitree number (not their Wikitree id). Based on the size of Wikitree, a sample of about 300 profiles is used to get useful accuracy (roughly +/-5%), while not posing an excessive manual analysis burden. (Technical note: merged duplicate profiles will be oversampled.)
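
The sampling idea can be sketched as follows. This is a minimal illustration, not the exact procedure used for this page: `sample_profile_numbers` and its `max_number` argument are hypothetical, and the margin of error uses the usual normal approximation for a proportion at 95% confidence.

```python
import math
import random

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from
    n samples (normal approximation; p=0.5 is the worst case)."""
    return z * math.sqrt(p * (1 - p) / n)

def sample_profile_numbers(max_number, k, seed=None):
    """Draw k distinct profile numbers uniformly from 1..max_number.
    max_number is an assumed upper bound on Wikitree numbers."""
    rng = random.Random(seed)
    return rng.sample(range(1, max_number + 1), k)

numbers = sample_profile_numbers(40_000_000, 300, seed=1)
print(len(numbers), "profile numbers drawn")
print(f"Worst-case margin for n=300: +/-{margin_of_error(300):.1%}")
print(f"Margin when p is near 25%:   +/-{margin_of_error(300, p=0.25):.1%}")
```

The margin depends mainly on the sample size rather than on the size of the tree, which is why a few hundred profiles are enough for a rough estimate.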

Profiles are randomly sampled and assigned to the following categories:

  • 3 or more sources, where sources are likely original records or books.
  • 1 or 2 sources
  • Poorly sourced, such as a link to an Ancestry tree or another website, or vague source description
  • Unsourced
  • Unavailable for analysis (Unlisted, Red or Orange privacy)

The results of the analyses to date are listed in the table below.

Date Nbr Sampled 3+ Sources 1-2 Sources Poorly Sourced Unsourced Unavailable
8 Nov 2024 352 27% 39% 10% 14% 10%
15 Nov 2023 321 22% 36% 12% 22% 8%
5 Nov 2022 319 18% 37% 10% 24% 11%
5 Jun 2022 309 15% 41% 11% 22% 11%
8 Nov 2021 322 17% 34% 11% 25% 13%
6 Jun 2021 323 18% 35% 10% 26% 11%
3 Nov 2020 322 18% 34% 12% 26% 10%
9 Jun 2020 321 13% 33% 14% 27% 13%
6 Nov 2019 314 12% 35% 15% 25% 13%
20 Jun 2019 316 15% 34% 16% 20% 15%
6 Nov 2018 316 11% 32% 13% 29% 15%
28 May 2018 302 11% 27% 15% 33% 14%
8 Apr 2018 284 12% 26% 12% 40% 11%

The following chart shows the percentage breakdown by sourcing of the profiles at various times. The small kinks in the results with time are likely due to sampling uncertainty.


Profile Sourcing Estimate using BioCheck

I also report sourcing results from Kay Knight’s BioCheck app (now v1.7.14). This useful app can be run in random profile mode to estimate sourcing on up to 5000 profiles, or over 10x more than is practical using my manual process above. It credits some possible sources that I do not include in my manual source count. Effectively, BioCheck's reported "Sourced Profiles" count is roughly analogous to my "Sourced" plus "Poorly Sourced" count. BioCheck results are listed below.

Date Total Profiles Bio Not Open Sourced Uncertain Marked Unsourced
10 Nov 2024 5000 13.4% 72.0% 12.0% 2.6%
17 Nov 2023 5000 14.1% 70.4% 12.4% 3.1%
8 Jun 2021 3580 16% 64% 16% 4%


Duplicate Profiles

For Wikitree to meet its goal of "One World Tree", there should be minimal duplicate profiles.

One measure is the number of pending merge requests. Not all of these are true duplicates, and the list is also unlikely to include every duplicate. As of Nov 2024, the pending merge list has about 11,200 entries.

As a more complete measure, I have randomly sampled profiles from across the entire Wikitree. For each of these profiles I checked the possible matches provided through the Search for Matches tab on the pull-down menu. Some were also checked with the built-in search function. Most were not a match. Based on an analysis in Nov 2024, I found 3 probable matches in 100 open profiles. This suggests the estimated number of duplicate profiles in Wikitree overall is in the range of 1-9% (95% confidence interval).
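
The page does not state how the interval was computed. One common choice for small counts is the exact (Clopper-Pearson) binomial interval, sketched below with only the standard library; the function names are made up for this example, and the result may differ slightly from the ranges reported in the table.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _root(f, lo, hi, iters=100):
    """Bisection for a root of f on [lo, hi]; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided confidence interval for x successes in n trials."""
    p_hat = x / n
    # Lower bound: the p at which P(X >= x) equals alpha/2.
    lower = 0.0 if x == 0 else _root(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2, 0.0, p_hat)
    # Upper bound: the p at which P(X <= x) equals alpha/2.
    upper = 1.0 if x == n else _root(
        lambda p: binom_cdf(x, n, p) - alpha / 2, p_hat, 1.0)
    return lower, upper

low, high = clopper_pearson(3, 100)
print(f"3/100 duplicates: {low:.1%} to {high:.1%}")
```

Multiplying the two bounds by the total number of profiles gives a correspondingly wide range for the absolute number of duplicates, which is why the estimate is best read as an order of magnitude.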

Technical notes:

  • This assumes Wikitree Match Search identifies most matches. There are a number of known cases where it does not, e.g. profiles in non-Roman characters and profiles with de/von in the surname. This analysis assumes these are a small fraction of all profiles (or have a similar duplicate rate).
  • Unlisted/Red/Orange/Yellow profiles cannot be checked; they are assumed to have the same fraction of duplicates as the rest of the tree.
  • Matches may not be identified if the profiles have little information to compare.
Date Total Profiles Estimated Duplicates Sampling Basis Pending Merges
16 Nov 2024 39,931,000 1 - 9% 3/100 11,180
25 Nov 2023 36,298,183 0.2 - 7% 2/100 5,430
30 Dec 2021 29,059,846 3 - 13% 7/110 22,100
20 Feb 2020 22,641,134 1 - 8% 4/106 13,600
26 Jan 2019 19,481,987 1 - 9% 5/105 15,200


Undated Profiles

There are a relatively large number of profiles that are simply linked names with no location, dates or other information. Suggestions 131, 132, 133 and 134 provide a good estimate of these profiles - although strictly these suggestions only identify undated profiles, they often have no other information.

Technical notes:

  • Undated profiles can no longer be created in Wikitree.
  • This count is limited to Open (white) profiles.
  • As of Nov 2022, the Open undated profiles were about 29% of the total number of undated profiles (excluding unlisted which could not be analyzed).
Date Undated Open Profiles
14 Nov 2024 385,294
15 Nov 2023 417,680
8 Nov 2022 443,687
7 Jun 2022 455,829
8 Nov 2021 468,252
6 Jun 2021 484,384
1 Nov 2020 506,144
9 Jun 2020 517,222
6 Nov 2019 519,047
18 Jun 2019 529,130
6 Nov 2018 550,241
29 Jul 2018 528,512
8 May 2018 480,950


Unlocated Profiles

Profiles should have a location to help identify the person and to avoid duplicates. Using Wikitree+, we can find profiles that are missing Birth, Marriage and/or Death Locations. The results indicate that there are a large number of profiles missing one of the locations. The count for profiles with no locations is presently in development.

Technical notes:

  • This count is limited to Open (white)/Public (green) profiles.
  • It is plausible that a large number of the non-open profiles do not have a location, based on information for undated profiles.
Date Open Profiles No Death Location No Birth Location No Birth/Death Location No BMD Location
10 Nov 2024 33,875,608 13,562,775 5,709,964 4,390,380 1,844,677
25 Dec 2022 27,201,437 11,699,274 5,448,422
25 Dec 2016 9,860,617 5,414,916 3,081,742







Comments: 18

Leave a message for others who see this profile.
There are no comments yet.
Login to post a comment.
This is interesting information. Thanks for taking the time to compile it.

I have a suggestion and a question.

Consider calculating the consistency errors as a percentage of the total records. Then present the data year over year - it tells a stronger story. You really see how much work the team has done.

Date Total Profiles Consistency Errors (%)
29-May-16 11,442,341 276,549 2.42%
29-Oct-17 15,581,091 180,122 1.16%
6-Nov-18 18,779,693 132,604 0.71%
6-Nov-19 21,798,127 112,818 0.52%
1-Nov-20 24,986,750 103,686 0.41%
8-Nov-21 28,620,028 99,261 0.35%
8-Nov-22 32,353,319 96,533 0.30% <<-- This is a small error rate. Nice Work


Question - why not just hide/remove/ignore the Undated profiles - especially if they are x years old? Or have a process that allows them to go to a "Freeze" state after someone reviews them to confirm they are not worth any more time?

Again nice work.

Edits: Trying to address the formatting.

posted by Tricia (Payne) Aanderud
edited by Tricia Payne
Hi Tricia, Thanks for the comments. Regarding the suggestion: in past updates, I've plotted the consistency errors as a percentage as you suggest, and indeed the good news is that it is a small percentage. More recently though I've been plotting the actual number, as I think the percentage value was starting to be more influenced by profile additions to Wikitree and not as much by work on the errors. Plotting the number of errors puts more emphasis on that task, I think.

Regarding the undated profiles, that's a question for Wikitree Admin. The guiding principle is that Wikitree rarely deletes profiles.

posted by Paul Gierszewski
I like these stats, thanks!

If it's easy to do, I'd also appreciate a histogram of profiles by year of birth.

posted by Jimmy Tree
Jimmy, That's not a statistic that I am tracking. I'm not immediately sure how to do it, but suggest that you could start by looking at Wikitree+ if interested. I think you could at least manually construct a histogram of profiles by century of birth.
posted by Paul Gierszewski
Thanks for the hint. With Wikitree+ it was tedious, but possible. Going back to 1700 includes 26 million profiles out of 28 million dated ones.

Here's a screenshot of the chart I made: https://imgur.com/a/3dN7RBy

posted by Jimmy Tree
Paul,

Thanks for putting in the time and effort it must take to do this.

Since I am a coordinator with the DNA Project, I would be interested in DNA related stats. I see that you have already included a count of those who have DNA tests listed on their profile, but there are a few more things that would be helpful from an accuracy standpoint.

1. The number of profiles having a parent marked with "confirmed with DNA" status.

2. The number of 213 and 313 errors (a parent with "confirmed with DNA" status but no corresponding DNA source citation).

3. The number of profiles having a parent with "confirmed with DNA" status but is genealogically unsourced. (DNA confirmation needs genealogy to confirm; it is not a valid substitute for primary sources.)

John Kingman

posted by John Kingman
edited by John Kingman
John, These are good points. DNA is important to Wikitree, so it would be good to track quality, not just quantity. I appreciate that the examples you name are all relevant and are different aspects of the DNA data quality. However, if we only tracked one stat as a proxy for the DNA data quality, which would you think best?
posted by Paul Gierszewski
No. 1 would give us a basis for the magnitude of possible DNA confirmation errors, necessary, but not a proxy. It is a subset of the number of DNA tests that are listed, since not all who have taken DNA tests are involved in confirming parents with DNA.

No. 2 yields a count of profiles which generate 213 and 313 errors. You may have this already since you deal with other 2xx and 3xx errors.

No 3. is like #2, but not currently generated by the error checker.

So a DNA data quality proxy to start with could be the #2 count divided by the #1 count.

This could be improved over time if more DNA related error codes are added by Ales or you can detect some DNA errors via your processing.

John

posted by John Kingman
edited by John Kingman
I'm not aware of #1 count being currently available? (We could ask Ales if he can add it to Wikitree+.) I can include it with my manual sampling in future, and try out these statistics.
posted by Paul Gierszewski
OK. What is the best way for us to ask Ales for that count?
posted by John Kingman
Hopefully you also realize that not all profiles on WT get all of their possible suggestions generated every week. A suggestion may appear one week on a profile that has had no edits, even though it was not shown the week before. ALL suggestions are like this.
Thank you this is very interesting and I must study it in more detail as I am sure I could add more details to profiles than I currently do!
posted by Anon (Cormack) Sharkey
"as I am sure I could add more details to profiles than I currently do!"

Are you moving on to a 25 hour day then?  :-)

posted by JG Weston
Thanks for all your hard work preparing this! Would it be possible to add to this report a measure of how many members actively participate? Say, for example, total members with 10 or more contributions per month?
posted by [Living Tardy]
Hi Herbert, This seems more a measure of the Wikitree membership than the Wikitree quality, which is my intended focus. But more generally I don't know where to get that info. Maybe this should be a G2G question to see if Wikitree Admin will release this info or if there is an alternative way to find it?
posted by Paul Gierszewski
Since the last update to the table, we've crossed million-profile thresholds in all three categories. There are now 22,089,777 total profiles, 18,152,828 connected profiles, and 6,119,046 profiles with DNA test connections.
posted by Ellen Smith
Presumably the dust has settled since the Connect-a-Thon. Current totals are:

Total profiles: 20,973,311 - up 1.6% since 18 June

Connected: 17,123,470 - up 1.9% since 18 June

With DNA: 5,660,455 - up 2.5% since 18 June

posted by Ellen Smith
This is great stuff, Paul! Thank you for putting so much effort into this. Well done.
posted by Chris Whitten