Wikitree Statistics - Nov 2018

+46 votes
283 views

I have been tracking several statistics that approximately represent the quality of the Wikitree database.  The last update was posted in June 2018 in G2G.  Following is a summary of current information: 

Overall status:  18.8 M total profiles; 15.1 M or 80% are connected; 4.6 M or 25% have DNA links (from Wikitree info).

Profiles with known internal consistency issues:  133,000 or 0.7% of all profiles (based on Suggestions report data).

Sourcing:  about 11% with 3 or more original sources, 32% with 1-2 sources, 13% poorly sourced, 29% unsourced, and 15% Unavailable (Unlisted/Red/Orange privacy) (based on random sampling).

Identified Duplicates:  about 8,805 or 0.05% (based on Suggestions report data).
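
For anyone who wants to recheck the percentages, here is a minimal sketch that recomputes them from the rounded counts quoted above. The variable and function names are made up for illustration. Because the inputs are rounded, the results can differ from the quoted percentages by up to about a point (for example, the DNA-link share comes out near 24.5% from these rounded inputs).

```python
# Rounded counts quoted in the post; the total is 18.8 M profiles.
TOTAL_PROFILES = 18.8e6

counts = {
    "connected": 15.1e6,
    "DNA links": 4.6e6,
    "consistency issues": 133_000,
    "identified duplicates": 8_805,
}


def percent(count, total=TOTAL_PROFILES):
    """Return count as a percentage of total."""
    return 100.0 * count / total


for label, count in counts.items():
    print(f"{label}: {percent(count):.2f}%")

# Sourcing categories from the random sample, in percent of sampled profiles.
sourcing = {
    "3+ original sources": 11,
    "1-2 sources": 32,
    "poorly sourced": 13,
    "unsourced": 29,
    "unavailable": 15,
}

# The five categories should cover the whole sample.
sourcing_total = sum(sourcing.values())

# "Sourced" here means at least one source: the first two categories.
sourced = sourcing["3+ original sources"] + sourcing["1-2 sources"]
print(f"sourcing total: {sourcing_total}%, sourced: {sourced}%")
```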

Compared with June 2018 when I last reported on these statistics, there are 1.3 M more profiles.  Of particular note, the number of profiles with known consistency errors has dropped from 154,000 in June to 133,000 now.   Also, the fraction of profiles with 1 or more sources has increased from 38% to 43%, an increase which may be more than just sampling uncertainty (+/-5%).
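
To show where a +/-5% sampling uncertainty can come from, here is a rough sketch using the usual normal approximation for a sampled proportion. The sample sizes are not stated in this post, so the sample size below is only an assumption derived from the +/-5% figure at 95% confidence; the function names are hypothetical, and this is not necessarily how the numbers on the statistics page were computed.

```python
import math

Z_95 = 1.96  # z-value for a two-sided 95% confidence level


def margin_of_error(p, n, z=Z_95):
    """Half-width of the normal-approximation interval for a proportion p."""
    return z * math.sqrt(p * (1.0 - p) / n)


def sample_size_for_margin(margin, p=0.5, z=Z_95):
    """Smallest sample size whose margin of error does not exceed `margin`.

    p=0.5 is the worst case (largest variance) for a proportion.
    """
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))


def two_proportion_z(p1, n1, p2, n2):
    """z-score for the difference between two independent sample proportions."""
    standard_error = math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    return (p2 - p1) / standard_error


# ASSUMPTION: sample size inferred from the +/-5% figure, not taken from the post.
assumed_n = sample_size_for_margin(0.05)
print("assumed sample size:", assumed_n)
print("margin at 43%:", round(margin_of_error(0.43, assumed_n), 3))

# June (38%) versus now (43%), using the same assumed size for both samples.
print("z for 38% -> 43%:", round(two_proportion_z(0.38, assumed_n, 0.43, assumed_n), 2))
```

Plugging in the real sample sizes for the two reporting dates would give a better idea of how much of the 38% to 43% change is beyond sampling noise.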

A Free Space page with graphs, historical data and technical details is available here: https://www.wikitree.com/wiki/Space:Wikitree_Statistics

asked in WikiTree Tech by Paul Gierszewski G2G6 Mach 2 (25.2k points)
retagged ago by Paul Gierszewski
Very cool to see these stats. Please keep us posted!
Thanks so much Paul.  I get such a different perspective working on suggestions that it's nice to see these statistics.
Thank you, Paul.  Very interesting statistics.  I have been "sourcing" my heart out - maybe others are too!   Progress is good.  Go WikiTreers.

-NGP
A side note.  If I just look at profiles added since Jan 2017, the sourced fraction increases from 43% to 61%.  So we seem to be doing better now during profile creation.
Now that is good news!
Great stuff, Paul! Thank you so much for doing all this work and sharing it.
How do you qualify poorly sourced?
Poorly sourced is a link to an Ancestry tree or another website, or a vague source description.  The authors have provided something, but it is not directly useful.
It is my experience that as many as 1/2 of the profiles that need data doctoring in certain categories are not sourced or are poorly sourced.  It makes me wonder what the correlation is between profiles that are missing key details and that are also missing sources (real sources).
I would not be surprised if there was a correlation.  Possibly one could test this by checking how many Suggestions there are for profiles (random sampled) and see if the number of suggestions correlates (inversely) with the degree of sourcing.
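
One way such a test could be sketched is a Spearman rank correlation between a sourcing score and the number of Suggestions per sampled profile. Everything below is an assumption for illustration: the 0-3 sourcing scale, the function names, and especially the sample values, which are made-up placeholders and not WikiTree data.

```python
import math
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    covariance = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mx) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - my) ** 2 for b in ry))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return covariance / (spread_x * spread_y)


# PLACEHOLDER DATA, not real measurements.
# Sourcing score: 0 = unsourced, 1 = poorly sourced, 2 = 1-2 sources, 3 = 3+ sources.
sourcing_score = [0, 0, 1, 1, 2, 2, 3, 3]
suggestion_count = [3, 2, 2, 1, 1, 0, 0, 0]

rho = spearman(sourcing_score, suggestion_count)
print("Spearman rho:", round(rho, 2))
```

A clearly negative rho on a real random sample would support the idea that better-sourced profiles tend to have fewer Suggestions; a rank correlation is used here because the sourcing score is an ordered category rather than a true measurement.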

6 Answers

+23 votes
 
Best answer
Nice job, Paul. Interesting statistics. A lot of duplicates. Sometimes the older profiles have PMs who are no longer active, so they never get merged. Good news about the 154,000 consistency errors dropping to 133,000 now.
answered by Dorothy Barry G2G Astronaut (1.2m points)
selected by Susan Laursen
Actually, I was surprised at how low the duplicate numbers are in proportion to the total numbers. The consistency error and unsourced numbers are much higher. The arborists must be doing a terrific job at merging duplicates to keep those numbers so low.
I suspect that the duplicate profile count is an underestimate, but this measure drawn from the Suggestions report is presently the only reliable/repeatable estimate basis that I have on this topic.
+14 votes

This is so awesome, Paul. You've put together the sort of thing I was looking for with my WikiTree Dashboard suggestion. I'm only sorry I missed your earlier messages about it.

answered ago by Greg Slade G2G6 Pilot (146k points)
+13 votes
Thanks, Paul. I’ve been waiting on the update. I've had your stats page on my Nav homepage since the last update.
answered ago by Pip Sheppard G2G6 Pilot (643k points)
+5 votes
Thank you, Paul.  It's especially nice to see errors as a percent of profiles decreasing, as well as possible duplicates and unsourced profiles.  Makes all the Data Doctoring, and Source-a-thoning feel like we are making real progress.
answered ago by Cindy Cooper G2G6 Mach 2 (28.3k points)
+3 votes
Paul - this is great information.  I've always wondered how things were going.  Thank you for your work!!

Karen
answered ago by Karen Hoy G2G5 (5.4k points)
+3 votes
Thank you for sharing. It is always nice to have a meter to look at to have some idea of where we are.
answered ago by SJ Baty G2G6 Pilot (209k points)


...