Statistical analysis of database dump

+34 votes
1.0k views

Hi,

I did some statistical analysis of the database dump.

You can see the results on the free-space page Database_dump_statistics.

I also plan to check tree and graph statistics, but I need a few more days for that.

Maybe also a world map of people in WikiTree, so we can see the geographical distribution of people.

In the future I will also track database changes over time, but the database is updated once a month, so that will have to wait until next month.

Have fun checking the results.

Regards Aleš.

WikiTree profile: Space:Database_dump_statistics
in WikiTree Tech by Aleš Trtnik G2G6 Pilot (804k points)
retagged by Maggie N.
Interesting to see the statistics (-:
A lot of fathers who are female and mothers who are male...
I agree. It should be validated at data entry by WikiTree. Missing gender is also an unnecessary problem. As I was entering my family, I often forgot to enter the gender.
Great that someone takes the initiative and does this, but it tells me nothing, because it is incomplete. I do not see the amount of profiles that are as near as good. The qualities such as time period, level of validation, country [nowhere do I see South Africa though I do see Zimbabwe ...!?!] seem very selective. Statistics also need interpretation. Context.

Have a better look. 

  • South Africa: 33,309 (0.32%)

What does "as near as good" mean? A computer cannot tell if some children or a spouse are missing. But I can find errors. Some of them are that the father is female, that the mother was over 99 years old when she gave birth, and so on.

I will think about preparing a list of persons with most obvious errors.

 

I will add explanations to the page as questions about it come up.

Ok, I see it now. But in its nearly 400 years of existence, the "country" has only been officially known as South Africa since the beginning of the 20th century, when the four main provinces were united into the "Union of South Africa", which existed until 1961, after which it became South Africa proper. Before that you will find people born in Europe and dying in a colony (Cape of Good Hope - Dutch; Cape Colony - British), or born in a colony (Cape Colony for example) and dying in one of the Boer Republics - ZAR or OVS for example. Sometimes we simply do not know exactly where they died, because they had been on trek in Angola, Namibia (which also had other names in the 19th century), Botswana or Rhodesia (same - other names). Where people died is not a priority; getting the spelling of the LNAB is. And coupled to that of course is the place of birth, not country of birth.

I would enjoy seeing some stats via graph analysis: number of disconnected trees, number of orphan profiles (no relations), histogram of tree sizes, top 10 trees by size, etc. I remember seeing a presentation like this about the FamilySearch tree some years ago at RootsTech.
Hi Justin.

I am thinking in a similar way. That was the reason I started this. I can't get my tree (250 persons) connected to the Big tree and I was wondering how many others have the same problem. I already looked into this. Over 2 million pages are orphans with no connection. 8 million are connected to the Big tree, but without private and protected data there will be many smaller trees.
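For anyone who wants to try the tree counts discussed above, here is a minimal sketch using union-find. It assumes the dump has already been reduced to a list of profile ids and a list of (id, id) relation pairs (parent/child or spouse). The names count_trees, profile_ids and relations are made up for illustration, and this is not the code behind the statistics page.

```python
from collections import Counter


def count_trees(profile_ids, relations):
    """Group profiles into connected trees with union-find.

    profile_ids: iterable of hashable ids (hypothetical input format).
    relations: iterable of (id_a, id_b) pairs linking two profiles.
    Returns a Counter mapping tree size -> number of trees of that size.
    """
    parent = {pid: pid for pid in profile_ids}

    def find(x):
        # Walk up to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in relations:
        # Ignore links to profiles missing from the dump (e.g. private ones).
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    members_per_root = Counter(find(pid) for pid in parent)
    return Counter(members_per_root.values())


# Toy data: one tree of 3, one tree of 2, one orphan.
sizes = count_trees([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
orphans = sizes[1]      # profiles with no relations at all
biggest = max(sizes)    # size of the largest tree
```

The returned Counter is already the histogram of tree sizes, so the number of disconnected trees is sum(sizes.values()).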
Hi Philip,

Initially my goal was to show the distribution of people in WikiTree based on today's demographics. I did enter some aliases for the biggest unknown countries (different spellings, different historical names, ...) (DDR, West Germany, Deutschland, Deutsches Reich, ...). But I have a problem with old countries that are today split into several (Prussia, Austria-Hungary, ...). Still working on a solution for that.

As for South Africa, I have now added Cape Town, Cape Province, Cape of Good Hope, Union of South Africa, ZAR and OVS as South Africa, so the result will be a bit different.
Thanks Ales, if you put Cape Province also add Transvaal (Province) Orange Free State (OVS for short - also a province) and Natal (also a province). Today the situation is a bit different again - we still have Transvaal but South Africa has since some years now been again divided into nine provinces including the most well-known Gauteng: https://en.wikipedia.org/wiki/Gauteng As for the project that I'm coordinating now - I try and be as attentive as possible to dates of birth and death as well as the places ...

Hi Philip

I added a few more, so here is the list for future calculations.

Country       Region             Entered text                                Count
South Africa                     OVS                                           168
South Africa                     Republic of South Africa                       14
South Africa                     RSA                                           227
South Africa                     South Africa                                33309
South Africa                     South Afrika                                   61
South Africa                     Suid-Afrika                                   341
South Africa                     Union of South Afrcia                           6
South Africa                     Union of South Africa                         460
South Africa                     ZAR                                            17
South Africa                     Zuid-Afrikaansche Republiek                   166
South Africa                     Zuid-Afrikaansche Republiek (South Africa)     41
South Africa  Cape of Good Hope  Cape Colony                                  2424
South Africa  Cape of Good Hope  Cape of Good Hope                            1861
South Africa  Cape Province      Cape Province                                 291
South Africa  Cape Town          Cape Town                                     826
South Africa  Gauteng            Gauteng                                        79
South Africa  Natal              Natal                                         249
South Africa  Natal              Natal Colony                                   89
South Africa  Natal              Natal South Africa                             22
South Africa  Orange Free State  Orange Free State                             163
South Africa  Orange Free State  Orange Free State South Africa                 50
South Africa  Transvaal          Transvaal                                     409
Great ... thanks! how about

[South Africa] {I use this a lot on pre-1910 profiles - like a master category}
SA
Natalia
I will probably strip all parentheses from the location, since they don't really add any meaning to it.
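A rough sketch of what such stripping could look like. It assumes "strip" means dropping the parenthesised remark together with the brackets; keeping the text and removing only the bracket characters would be an equally valid reading. The function name is made up, and nested parentheses are not handled.

```python
import re

# One non-nested "( ... )" group, plus any whitespace in front of it.
_PAREN = re.compile(r"\s*\([^()]*\)")


def strip_parentheses(location):
    """Remove parenthesised remarks, then tidy leftover spaces and commas."""
    cleaned = _PAREN.sub("", location)
    return re.sub(r"\s{2,}", " ", cleaned).strip(" ,")


print(strip_parentheses("Zuid-Afrikaansche Republiek (South Africa)"))
print(strip_parentheses("PA (USA)"))
```

Note that this throws away whatever the parentheses contained, so it is only safe when the remark is not needed to identify the place.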

Can't do - it implies "Currently known as" South Africa ...

8 Answers

+11 votes
Love it!  Thank you.
by Peter Roberts G2G6 Pilot (702k points)
+8 votes
I like it.  One improvement would be to combine places that are the same, e.g. Ohio and Ohio, USA, and NY and New York.  It might take a few minutes to make the changes, but it will improve the final product.
by Dave Dardinger G2G6 Pilot (440k points)

I did this partially, but there are thousands of such instances. I did the ones with more than 10,000 occurrences, and the European countries, since they are spelled differently in the local language. I used comma-delimited addresses with the country at the end. Here are some examples:

Country name   Region name  Entered last text
Belgium                     Belgique
Belgium                     België
Belgium                     Belgium

United States  Oklahoma     Oklahoma USA
United States  Oklahoma     OK
United States  Oklahoma     Oklahoma

United States               US
United States               United States of America
United States               USA
United States               United States

 

I explained this on the page Space:Database_dump_statistics and added a link there to see the actual matches.
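The matching described above (take the last comma-delimited part of the address and look it up in an alias table) could be sketched roughly as follows. The alias entries are the examples from this thread; the function name, the dictionary layout and the sample addresses are assumptions for illustration, not the actual implementation.

```python
# Alias tables built from the examples above; real ones would be far larger.
# Keys are the lower-cased "entered last text" values.
COUNTRY_ALIASES = {
    "belgique": "Belgium",
    "belgië": "Belgium",
    "belgium": "Belgium",
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
}
REGION_ALIASES = {
    "ok": ("United States", "Oklahoma"),
    "oklahoma": ("United States", "Oklahoma"),
    "oklahoma usa": ("United States", "Oklahoma"),
}


def classify(address):
    """Return (country, region) for a comma-delimited address.

    Only the text after the last comma is examined. Unknown text gives
    (None, None) so it can be reported instead of being guessed at.
    """
    last = address.split(",")[-1].strip().lower()
    if last in COUNTRY_ALIASES:
        return COUNTRY_ALIASES[last], None
    return REGION_ALIASES.get(last, (None, None))


print(classify("Antwerpen, België"))
print(classify("Tulsa, OK"))
```

Counting the (None, None) results by their last text is one way to find the next batch of aliases worth adding.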

+14 votes
Aleš, I'm so excited to see what you've done here. This is fantastic!

This starts to lay the foundation for even more cool things in the future.

Today, Bob Fields suggested via an e-mail that we should have a bot that posts messages when data is incongruous or overly private, e.g. "Was this person really <born before 1900><dead>? If not, please fix the <birth><death> date. If so, please <remove the email><adjust privacy to a lower level (green or orange)> so that we can all collaborate and share ancestors."

He was suggesting that we do this internally, but there's no reason it couldn't be done by members like you, perhaps working together in a project. We already started to set some precedents for when/how external bots would be appropriate. We could build on that.

Again, bravo!
by Chris Whitten G2G Astronaut (1.5m points)
Thanks Aleš, very interesting.  But is there an easy way to find the profiles with all the aberrations? For example, can I get a list of profiles that use PA (USA)?

See my comment here. Also, the work is not done by a very long mile yet. The last thing we need is automatic emails on project profiles (at this stage) insisting that we change or add data. We are still collating data, and trying to stop the continual creation of duplicates, either manually or through GEDCOM. Every such duplication means weeks of extra work.

To Vic,

Here is a zip file of all pages with PA in the birth or death location. Inside are a comma-delimited text file and an Excel file.

http://www.softdata.si/osebe_staro/ales/wikitree/PA.zip

That is 75615 pages to check.

Have fun

To Bob,

Where on WeRelate did you find a downloadable database of valid standardized place names, along with an open source project to maintain the data (not in PHP)? I did have a quick look there, but couldn't find it.

We could use a bot to delete Unknown, ? and Y/ as a location, since they are not locations. There are about 40K of them.

Regards, Aleš.
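A small sketch of the placeholder check such a bot would need. The set holds the three markers named above; the function name is made up, and a real bot would presumably report matches for review before deleting anything.

```python
# Markers mentioned in this thread that are not real locations.
PLACEHOLDERS = {"unknown", "?", "y/"}


def is_placeholder(location):
    """True if the location field holds only a known non-location marker."""
    return location.strip().lower() in PLACEHOLDERS


fields = ["Unknown", "Y/", "York, England", " ? "]
to_clear = [f for f in fields if is_placeholder(f)]
```

Matching the whole field, rather than searching inside it, keeps real places such as York from being caught by the "Y/" marker.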

Looks like it is gone now, but it used to be under http://github.com/DallanQ/Places. Dallan Quass, who created WeRelate, also created the infrastructure and validations and tooling on that site. You may be able to locate an older copy somewhere (try google search for DallanQ Places database), or you could contact Dallan directly.  His blog: https://dallanq.wordpress.com/.

To Bob,

I found the logic behind his database online, and what I miss is a time frame for each specific name.

If someone was born in Ljubljana in the 19th century, he was born in Ljubljana, Avstroogrska (Austria-Hungary). In 1930 it would be Ljubljana, Slovenija, Kraljevina SHS; in 1950 it would be Ljubljana, Slovenija, Jugoslavija; and in 2000, Ljubljana, Slovenija.

With a dedicated database on WikiTree, the correct place name could be chosen in combination with the entered date.

This could be entered in Region categories as specific templates and extracted by WikiTree to use in input boxes.
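A minimal sketch of the date-dependent lookup described above. The four names are the ones from the Ljubljana example; the boundary years in the table are placeholders chosen only so that the example years resolve as described, not verified history, and the function and table names are made up.

```python
from bisect import bisect_right

# (first_year, name) pairs, sorted by first_year.
# Boundary years are illustrative assumptions, not historical claims.
LJUBLJANA_NAMES = [
    (1801, "Ljubljana, Avstroogrska"),
    (1918, "Ljubljana, Slovenija, Kraljevina SHS"),
    (1945, "Ljubljana, Slovenija, Jugoslavija"),
    (1991, "Ljubljana, Slovenija"),
]


def place_name_for(year, history):
    """Pick the name whose first_year is the latest one not after `year`."""
    years = [first for first, _ in history]
    index = bisect_right(years, year) - 1
    if index < 0:
        return None  # the year is earlier than anything in the table
    return history[index][1]


for year in (1850, 1930, 1950, 2000):
    print(year, place_name_for(year, LJUBLJANA_NAMES))
```

Keeping one sorted list per place means a new historical name is a single added row, which would fit the idea of maintaining the data in Region category templates.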
Yes, Dallan's database did not take dates into consideration when determining if a place is valid/standard. The FamilySearch places service does a better job of this. That is one of the reasons I prefer using somebody else's API instead of building a new places service within WikiTree, apart from having to maintain/update the data. My preference would be to have an external API with some internal 'override' of valid/invalid place names, so we don't have to try to update FamilySearch or some other DB with additional locations.

Thanks so much for helping out with this. You certainly can start with some of the most obvious errors. I'd also start with profiles earlier than 1700, since those are the ones where we are most concerned with data accuracy. You can also check for locations='Somme, Picardie' which I believe is a default from ancestry.com when users type 'Y' and save.

I analysed Y in the location field and here is the report: http://www.softdata.si/osebe_staro/ales/wikitree/Y.htm. There are a few Somme, Picardie entries, but I was referring to Y/ with 4788 occurrences. I think those were some kind of error in GED import.

I don't think I can use the FamilySearch API, since I would have to validate all 1.5 million unique addresses. They would probably lock me out. Also, they return that a location can be one of a few addresses, which is not exactly what I need. I will check it out a bit more.

It could be used at data entry on wikitree, but that is for admins to decide.

I would be very cautious about the FamilySearch locations.  They are incorrect for the places and times I work on.  In fact, I have the dropdowns turned off because they are so wrong.  I wish everyone would stop using them and start using the places provided in the Excel spreadsheet.  Aren't we getting close to having our own database, Ales, with the table you use for suggestions?

@Cindy,

I work with Dutch profiles nearly 100% of the time and use the Familysearch suggestions always. They are almost always correct or acceptable as current place names, and complete: city, province, Nederland. Historic place names? Useless. Try searching Dutch archives, that have all civil register records (Post-1810) and many church registers (Pre-1811) indexed and online, for "bataafse republiek" (1795-1801) and you get zero results. Same searching Familysearch catalog.

+7 votes
Added a few more analyses, locations refined.

Added Error sections with links to problematic profiles, to easily correct data.
by Aleš Trtnik G2G6 Pilot (804k points)

Added new error for duplicate siblings.

Errors are now on this page: Space:Database_dump_errors

Here is a list of the errors that are calculated.

206558 Errors Total

Code Error                            Count
101  Birth in future                    328
102  Death in future                    376
103  Death before birth               12840
104  Too old                           6914
105  Duplicate sibling                 4784
201  Father is self                     294
202  Parents are same                   228
203  Father is Female                  5801
204  Father has no Gender              2143
205  Father is too young or not born  47673
206  Father is too old                 6839
207  Father is also a child             515
208  Father is also a spouse            231
209  Father is also a sibling          3560
301  Mother is self                      28
303  Mother is Male                    7738
304  Mother has no Gender              2026
305  Mother too young or not born     63816
306  Mother is too old                 5719
307  Mother is also a child              50
308  Mother is also a spouse           1483
309  Mother is also a sibling           396
401  Spouse is self                      12
402  Unknown gender of spouse          2788
403  Single sex marriage               4267
404  Marriage before birth             9783
405  Married too old                   2612
406  Marriage after death             11623
407  Death too old after Marriage      1691
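To give an idea of how a few of these checks could work, here is a hedged sketch covering some of the date and gender errors. The error codes and names come from the list above; the dictionary field names, the thresholds (max_age, min_parent_age) and the function name are assumptions for illustration and are not the rules used for the real error page.

```python
from datetime import date


def check_profile(person, father=None, mother=None, today=None,
                  max_age=120, min_parent_age=12):
    """Return a list of (code, message) tuples for one profile.

    Each argument is a dict with optional keys 'birth' and 'death'
    (datetime.date) and 'gender' ('Male', 'Female' or None).
    Ages are approximated by the difference in calendar years.
    """
    today = today or date.today()
    errors = []
    birth, death = person.get("birth"), person.get("death")

    if birth and birth > today:
        errors.append((101, "Birth in future"))
    if death and death > today:
        errors.append((102, "Death in future"))
    if birth and death and death < birth:
        errors.append((103, "Death before birth"))
    if birth and death and death.year - birth.year > max_age:
        errors.append((104, "Too old"))

    if father:
        if father.get("gender") == "Female":
            errors.append((203, "Father is Female"))
        elif father.get("gender") is None:
            errors.append((204, "Father has no Gender"))
        fb = father.get("birth")
        if birth and fb and birth.year - fb.year < min_parent_age:
            errors.append((205, "Father is too young or not born"))
    if mother:
        if mother.get("gender") == "Male":
            errors.append((303, "Mother is Male"))
        elif mother.get("gender") is None:
            errors.append((304, "Mother has no Gender"))
        mb = mother.get("birth")
        if birth and mb and birth.year - mb.year < min_parent_age:
            errors.append((305, "Mother too young or not born"))
    return errors


# Made-up profile: dies before birth, "father" is female and only 5 years older.
found = check_profile(
    {"birth": date(1900, 1, 1), "death": date(1890, 1, 1)},
    father={"gender": "Female", "birth": date(1895, 1, 1)},
    mother={"gender": "Female", "birth": date(1870, 1, 1)},
    today=date(2016, 5, 15),
)
codes = [code for code, _ in found]
```

Passing today explicitly makes the future-date checks repeatable when the same dump is re-run later.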

 

I looked up the age constraints on Wikipedia (I know, I should look in the Guinness Book of World Records, but I'm already late for supper...). I admit that the ranges took me by surprise.

Youngest mother: 5 years, 7 months

Oldest mother: 70 years

Youngest father: 7 (this one is from answers.com)

Oldest father: 96

Greg
This is really helpful for easily finding profiles that need work. Thanks for running these reports!
+7 votes

Acadia (properly Acadie) was part of New France until 1763, when it became British territory. Most of what was Acadia is now in Canada, but it included parts of what is now Maine. More details at https://en.wikipedia.org/wiki/Acadia

Île Saint-Jean is now Prince Edward Island.

Port-Royal is now Annapolis Royal, Nova Scotia

Buckinghamshire (also abbreviated as "Bucks") is a county in England.

QC is an abbreviation for Québec, a province in Canada. (You may also see "PQ" for "Province of Quebec", although that's an old usage.)

NB is the abbreviation for New Brunswick, also a province in Canada (and once part of Acadia).

CT and Conn. are abbreviations for Connecticut, once a British Colony, now a state in the United States of America.

DC stands for District of Columbia (where Washington, the capital of the United States of America, is located). It's a distinct territory, but doesn't have the same legal relationship to the rest of the country that states do. (To be honest, I don't understand the differences, but Americans hasten to point out that they're there.)

Glasgow is in Lanarkshire, a county in Scotland.

I'm guessing that Heiliges Römisches Reich is the Holy Roman Empire. https://en.wikipedia.org/wiki/Holy_Roman_Empire

Hertfordshire, also abbreviated as "Herts", is another county in England.

Liverpool is a city in England. Since 1974, it's been part of the county of Merseyside. Prior to that, it was part of Lancashire (abbreviated as "Lancs").

New York (also NY) was once a British Colony, and is now a state in the United States of America. Although "New York" may also refer to New York City. (But since the city is located within the state, the presence of either is a pretty good indication of at least the state level location.)

Northamptonshire (abbreviated "Northants") is a county in England.

Nouvelle France, without any more detail, could refer to Acadia, Québec, or Louisiana (not the current state, but the whole swath of territory from what is now Nova Scotia, Canada to New Orleans, Louisiana - although much of the land in between wasn't settled, just the occasional fur trading post before the British kicked the French out).

Rotterdam is a city in the Netherlands.

Upper Canada existed from 26 December 1791 to 10 February 1841 and generally comprised present-day Southern Ontario, Canada. From 1841 to 1867, it was known as Canada West. After 1867, it became Ontario, and eventually expanded to its present borders.

Virginia (also VA) was another British colony that became a state in the USA. During the U.S. Civil War, West Virginia (WV) was split off from Virginia.

Utrecht is a city in the Netherlands. It's located in a province, also named Utrecht.

Louisiana (abbreviated as "LA" without periods, not to be confused with "L.A." which is Los Angeles, California) was part of Nouvelle France, bought by the USA in 1803, and the territory of the French colony is now taken up by several states.

Greg

 

 

by Greg Slade G2G6 Pilot (678k points)
+7 votes
This thing is a MONUMENT!  Awesome job, Ales [excuse the missing diacritic, I was too lazy to look for it :-)  ].  It is, however, more than just a little bit overwhelming.  This will undoubtedly provide us with a wide array of "forever" projects.
by Fred Remus G2G6 Mach 4 (43.3k points)
+9 votes
Update of statistical data (May 15 2016).
by Aleš Trtnik G2G6 Pilot (804k points)

Stunning work Ales, I have just started correcting (and ticking off "false") errors of the Database errors Project by using this link, with which the data error search function can be opened (I have made time today) - I also see along the way how much of my own genealogical line still needs better sourcing, dates, places (soooo busy with projects, not getting the time to do so ... will have to make a plan ...). Thanks to you and everyone else involved who made this happen ...!


+3 votes
The database dump is done weekly on Sunday (unless the docs are wrong).
by Rick Williams G2G1 (1.4k points)


...