Bot import of The Peerage data in Wikidata - Good or Bad News?

+27 votes
851 views

I've been alerted by a recent burst of notifications on my Wikidata dashboard, mostly linked to "items" (read: person profiles) I had edited months ago. Most of those were the work of a bot called GZWDer_(flood), apparently processing and importing data from http://www.thepeerage.com/, which is not without problems, such as the creation of duplicate entries or duplicate family relationships. See the ongoing discussion at https://www.wikidata.org/wiki/User_talk:GZWDer#User:GZWDer_(flood)_creating_duplicates

I was not aware of the size of the Peerage database, about 600,000 entries. This import means the amount of genealogical data in Wikidata has grown by about one order of magnitude. Just as an example, the number of "has mother" relationships in Wikidata as of Aug 1st, 2019 was about 62,000. It's now over 480,000.
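
The figures quoted above can be checked with a little arithmetic, and the count itself could in principle be reproduced against the Wikidata Query Service. The sketch below is only an illustration: it assumes that "has mother" corresponds to Wikidata property P25 and that the public endpoint is https://query.wikidata.org/sparql. It builds the query URL but does not send it.

```python
from urllib.parse import urlencode

# Assumption: "has mother" is Wikidata property P25 (mother).
MOTHER_PROPERTY = "P25"
# Assumption: public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"


def count_statements_query(property_id: str) -> str:
    """Return a SPARQL query counting the truthy statements that use property_id."""
    return (
        "SELECT (COUNT(*) AS ?count) WHERE { "
        f"?person wdt:{property_id} ?value . "
        "}"
    )


def query_url(sparql: str) -> str:
    """Build a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


def growth_factor(before: int, after: int) -> float:
    """Ratio of the new count to the old one."""
    return after / before


if __name__ == "__main__":
    # Figures quoted in the post: about 62,000 before, over 480,000 now.
    print(round(growth_factor(62_000, 480_000), 1))
    print(query_url(count_statements_query(MOTHER_PROPERTY)))
```

On the quoted numbers the ratio comes out a little under eight, which is consistent with "about one order of magnitude".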

I'm not sure of the impact this can have on the Wikidata-WikiTree connection. I'm a bit concerned by what it means for Wikidata data quality because, if I read the discussion correctly, the Wikidata human task force is not large enough to check the bot import results, and they count on other bots to fix the errors of the first one.

I'm pretty sure we'll see more of the same in the future. Wikidata's frightening growth (over 3 million items/month) is mostly achieved today by bots. Not sure it's good news.

in The Tree House by Bernard Vatant G2G6 Mach 3 (35.5k points)
retagged by Robin Lee

9 Answers

+16 votes
 
Best answer

What is the point of Wikidata? It's not a primary source. 

Older profiles on WikiTree are particularly vulnerable to the garbage that is now coming through Wikidata. We in the England Project have done our best to deal with inaccurate and spurious stories which people have just made up and brought along. Now those inaccuracies are coming through again via The Peerage, a site which the England Project counts as an Unreliable Source.

Today I have 188 errors for the England Project. Only 36 of those are proper errors, mostly from profiles which we have very recently adopted and which we are still working on. In fact while counting them up, I've corrected 15 of them. Another 13 are Wikitree / FaG discrepancies, because you can't get the profile manager on FaG to make changes to their errors. 

139 "suggestions" are cases where WikiTree and Wikidata don't agree. That is 139 profiles which I will have to check, and having looked at some of these suggestions, I expect it's going to be Wikidata that is wrong. And then what do we do? Am I expected to sign up to Wikidata too, to clean their database to make sure that the errors coming via The Peerage aren't adopted by WT members clearing the errors list? As Isabelle has pointed out, False Suggestions have a habit of reappearing in the error list.

I vote for a suspension of Wikidata matching until they have sorted out their own input criteria, and only work from primary sources.

Declaration of interest in this topic: England Project Managed Profiles coordinator.

Edit:typo

by Jo Fitz-Henry G2G6 Mach 6 (63.3k points)
selected by Elizabeth Viney
Bravo, Jo!

Jo, and all who massively support your viewpoint: I think the message is clear. I intend to point the Wikidata folks who created this mess, and/or let it happen for want of proper governance and analysis of real users' needs, to this thread, if you don't mind.

But, to tell the truth, reading "What is the point of Wikidata?" in bold makes me sad. I've been involved in linked data (of which Wikidata is now the hub and main implementation) since the very beginning (2006), and even before the term was coined. I've been known as a "linked data evangelist", and was very happy with the launch of Wikidata, which I saw as a full-scale implementation of an old dream that the hard work of many smart people had eventually made happen.

Even if I am now retired, giving me time for genealogy, my heart is still partly with my former community of practice and research. So, please understand I'm sad to see, now that I stand on the side of users, that the dream is turning into a nightmare.

Although I agree that in the current state of affairs stopping Wikidata suggestions is a needed emergency measure, please let's not throw the baby out with the bathwater. Let us (those of us who are convinced of the benefits of Wikidata in the long run) provide feedback to Wikidata stressing the need for quality, and keep how it turns out on the radar. But don't bother regular and serious WikiTreers with it until there is solid ground to resume the connection - if ever.

Wikipedia expressed some worries:

"Concerns of circular sourcing. Wikidata's indiscriminate bot sourcing from Wikipedia, from other unreliable sources, and mass import-export with other databases gives false or unreliable information a false appearance of authority. It can become difficult or impossible to trace the origin of claims."

From: https://en.m.wikipedia.org/wiki/Wikipedia:Wikidata/2018_State_of_affairs

Thanks Joe for this link. I read further down the same page:

"Wikidata is unstable and its governance is immature"

If even Wikipedia people say so ... sad

Hear, hear!

Jo refers to profiles managed by the England Project. I suspect this is only a fraction of the English (and Scottish) profiles affected.
+16 votes

I think it is. I did expect this to happen back in 2016, when we started with Wikidata connection. 

At the moment there is quite some chaos due to the import. I just provided them a list on same page of 170 duplicates they created, but those hiccups will be solved.

Due to this import I added over 6000 connections from Wikidata to WikiTree, and many more profiles are being checked to see if the data is correct. I did see quite a few mistakes on WikiTree that were hidden until now. Don't get me wrong, Wikidata is also wrong many times.

I think we already passed 80000 connections. You can see the number of them over time here.

http://wikitree.sdms.si/default.htm?report=stat1&dataID=501&Year=2

by Aleš Trtnik G2G6 Pilot (478k points)
Thanks for the heads-up, Aleš. I was sure you would be the first to react. So for you it's good news, technically speaking.

Now regarding quality, do you, and others, consider The Peerage as a quality source? I looked at random recent entries, and noticed a lot of references are "email message from X" ...

I'm pretty sure the manager of The Peerage is a serious gentleman, but I remember a recent discussion here where people wondered how a single man could handle seriously a genealogical data base of 200,000+ profiles. The Peerage has three times this number ...

I have corresponded with Darryl a number of times. He has, upon request, put me in touch with the person who provided the data for a branch I was working on, and that contact has been very helpful to me, too. In my experience, Darryl usually implements any fixes that I send to him within a couple of weeks. So, while his dataset does have errors (as does any dataset), he is fixing them as he's told about them, and that's about as much as I can ask of any maintainer.

Greg, I've no doubt he is doing his best, is open to suggestions, and fixes whatever he can. But he is nevertheless a single man maintaining a database of 600,000+ entries.

I would not consider ThePeerage.com a reliable source. 

Actually it's not just me - it is listed as Unreliable on the England Project Reliable Sources page (see "Websites with online trees and genealogical information" section). 

It's very often possible and even easy to confirm (or not - obviously the site contains errors) the data with better sources.

Oh, no. I don't consider it a "source". I list it under a "See also:" header, rather than directly under the == Sources == header. But I do use it as a starting point, and look for actual sources to confirm what it says. It's only when the sources contradict what's on ThePeerage.com that I bother writing to Darryl. (I should probably send him all the sources I turn up to confirm what he lists, but he keeps listing my emails as sources, instead of whatever site I'm pointing to.)

Only about 12,000 WikiTree pages seem to have the Wikidata template.

Should more be added?  Is it the PM's choice?

What's the easy way to find the QID, if there's no template on the page?

Does the QID need to be passed to the template?  Is there a query to find the item by looking up the WikiTree-ID in Wikidata?

Can the Wikidata template be extended to output the link to ThePeerage.com?
In addition to 11139K Wikidata templates (not all links are to persons), there are also 163308 links to Wikipedia on 129184 profiles. 112345 of them are to persons. But not all of them are to the same person.

We are looking for a way to make it easy to come to wikidata from a profile. For now only wikitree+ has that information for all 95000 profiles.

Main information is stored on WikiData and that is used for linking, so only there a link to WikiTree profile needs to be added.

Linking to the peerage is not directly possible from template. But there might be a solution for that in the future. Although I am seeing many negative comments about the quality of it as a source.
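
On the question above about finding a Wikidata item from a WikiTree ID, a SPARQL lookup along the following lines may work. This is only a sketch under stated assumptions: that P2949 is the Wikidata property "WikiTree person ID", that P4638 is "The Peerage person ID", and that the public endpoint is https://query.wikidata.org/sparql. Please check both property numbers on wikidata.org before relying on them. The code builds the query URL and leaves sending it to the reader.

```python
from urllib.parse import urlencode

# Assumptions: P2949 is "WikiTree person ID" and P4638 is "The Peerage person ID".
WIKITREE_ID_PROPERTY = "P2949"
PEERAGE_ID_PROPERTY = "P4638"
# Assumption: public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"


def item_for_wikitree_id(wikitree_id: str) -> str:
    """Return a SPARQL query finding the item(s) linked to a WikiTree ID.

    The OPTIONAL clause also returns The Peerage identifier when one is recorded.
    """
    # Escape backslashes and double quotes so the ID is a valid string literal.
    literal = wikitree_id.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?peerageId WHERE { "
        f'?item wdt:{WIKITREE_ID_PROPERTY} "{literal}" . '
        f"OPTIONAL {{ ?item wdt:{PEERAGE_ID_PROPERTY} ?peerageId . }} "
        "}"
    )


def query_url(sparql: str) -> str:
    """Build a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


if __name__ == "__main__":
    # Murat-14 is one of the WikiTree IDs mentioned elsewhere in this thread.
    print(query_url(item_for_wikitree_id("Murat-14")))
```

The item URI in the result ends with the QID, so a lookup like this would not need the template to be present on the profile.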
+13 votes

"I'm not sure of the impact this can have on the Wikidata-WikiTree connection. I'm a bit concerned by what it means for Wikidata data quality,"

To me it is bad news, if we are comparing what is on wiki-tree with what is on wikidata

I'm not sure I understand wikidata but if the data is incorrect, what is the purpose of wikidata?

I am familiar with the Peerage (mostly from barely sourced profiles)

The website uses a variety of sources, from the many editions of Burkes to emails from readers. http://www.thepeerage.com/s1.htm#s3268

I have to go by experience which of course may be exceptional but it leads me to worry about data on  wiki-tree being compared to this data. 

 " Sir Nicholas Martyn was born in 1550.1 He married Margaret Wadham, daughter of John Wadham and Joan Tregarthen.1 He died in 1595.1

     He lived at Athelhampton, Dorset, England.1

   Margaret Wadham was born circa 1526.1 She was the daughter of John Wadham and Joan Tregarthen.1 She married John Young.1 She married Sir Nicholas Martyn.1 She died after 1606.1

     Her married name became Martyn.1 Her married name became Young.1 "

   http://www.thepeerage.com/p63056.htm#i630558

Even a brief glance at the respective birth dates of Sir Nicholas Martyn and Margaret Wadham should signal that something is wrong. Margaret is said to have been 24 years older than her husband. If you look at the two daughters mentioned, Margaret would have been too old to have been their mother. Nevertheless this 1550 date is all over the internet. The only source is an email. (The couple actually had 11 children, but only four daughters survived, as stated on the still surviving monument in their parish church.)

Nicholas's father actually died in 1548 and his IPM (inquisition post mortem) clearly says: "He died 19 Jul 1548. His son and heir was Nicholas Martyn aged 20 on 22 Feb 1549".

The date on the Peerage is twenty years out, and when this was used on wiki-tree, dates on other family members were entered accordingly. Change one according to the evidence and a whole host of others are now wrong. This date may not have come originally from the Peerage but is certainly perpetuated by it.

Just one example. Obviously, it may not be typical, but because the site uses such a variety of sources, there are bound to be others.

Here on wiki-tree, surely we strive for accuracy. Primary sources are key. On Wikipedia primary sources are forbidden, but the emphasis is more and more, as far as I can see, on reputable secondary sources. The Peerage is a tertiary source.
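
The arithmetic behind this example is simple enough to write down. The sketch below uses only the dates quoted above; the function names are made up for illustration, and it ignores calendar subtleties such as Old Style year numbering.

```python
from datetime import date


def birth_year_range(age: int, on: date) -> tuple[int, int]:
    """Earliest and latest birth year for someone aged `age` on the date `on`."""
    return (on.year - age - 1, on.year - age)


def born_after_father_died(birth_year: int, father_death: date) -> bool:
    """True if the birth year is too late even for a posthumous child."""
    # A posthumous child is born at most about nine months after the death,
    # so at the latest in the calendar year following it.
    return birth_year > father_death.year + 1


if __name__ == "__main__":
    peerage_birth = 1550              # Sir Nicholas Martyn, per The Peerage
    wife_birth = 1526                 # Margaret Wadham, "circa 1526"
    father_death = date(1548, 7, 19)  # from the inquisition post mortem
    ipm_range = birth_year_range(20, date(1549, 2, 22))

    print(peerage_birth - wife_birth)  # age gap implied by The Peerage
    print(ipm_range)                   # birth years implied by the IPM
    print(born_after_father_died(peerage_birth, father_death))
```

With these inputs the IPM points to a birth in 1528 or 1529, about twenty years before the 1550 date, and the 1550 date also falls after the father's death.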

by Helen Ford G2G6 Pilot (290k points)
edited by Helen Ford
I agree Helen. I am not sure I understand the flow of data here. But The Peerage website frequently contains errors because of a reliance on Burke and other flawed secondary sources. It can be used as a guide, but not as a sole source. I don't think we want a bot adding data from The Peerage (is that what is happening here?) without some real person evaluating the quality of the data.
100% agreement, Helen. Wikidata was a very good idea to begin with, consolidation of data across multilingual versions of Wikipedia, based on a robust and extensible data model (Entity-Relation).

Then other data sources were accepted. The import of The Peerage is an example of its current drift towards uncontrolled aggregation of dubious data.

In theory, every fact asserted on Wikidata could, and actually should, be sourced. The data model allows that. But it's almost never used. Too bad.
Joe, those data have been added to Wikidata, not to WikiTree. But data from Wikidata will come indirectly to your WikiTree dashboard through your suggestions page, for example.
This will greatly add to our suggestion lists then.  It is both a good and a bad thing.
Joe, No data on WikiTree is changed by the bot based on any external source. There is always a human, that changes things.

The intent of linking to other sources is also to identify the differences in data. Then a human should decide which data from which source is most probable and correct. And it would be good to also document invalid data, so it is known in the future which date is used and why. Otherwise someone will just change the date at some point, since he found it somewhere.

Helen, the example you describe can also be found on WikiTree. There were thousands of them, and the situation is improving, but there are still many more. Just check the date suggestions.
The incorrect dating in my example was previously to be found on wiki-tree; it has changed as the result of knowledge of the area and local research.

The Peerage is a site related to Britain. On wiki-tree we now have proactive projects (certainly for England and Scotland) whose members are motivated, have expertise and in many cases have access to local sources. As a result the situation on wiki-tree is improving. Undoubtedly it will take time, but this is our potential forte: using local expertise to find good evidence.

It's going backwards if we then have to compare with databases such as this one, with a curate's egg of sources.
+12 votes
English genealogists have been ranting about Burke's for 150 years.  Particularly egregious is the 2003 edition, where Mosley cynically trawled the internet for popular junk he could include.

And that went into thepeerage.com, and now it's in Wikidata.

But the good news is, if anybody really wants to spend their time cleaning up junk genealogy, they can now go on Wikidata, and not have to contend with PMs, PPPs, MIRs and the rest of the WikiTree obstacle course.

This might help to rid WikiTree of some of its peskiest members :)
by RJ Horace G2G6 Pilot (561k points)
edited by RJ Horace
True, but the only trouble is, if you fix something on WikiData the odds are it will be reversed directly for the same reasons.
+5 votes
So what happens next?

Somebody writes a bot to upload a gedcom to WikiData, and all the world starts uploading their gedcoms?

Somebody uploads the WikiTree database dump?
by RJ Horace G2G6 Pilot (561k points)

Exactly, that's the kind of thing I fear. It used to be limited by the "notability" rules of Wikipedia, but the Wikidata rules are set in such a way that they are open to any kind of interpretation. From https://www.wikidata.org/wiki/Help:FAQ/Genealogy#Who_can_be_added_to_Wikidata?:

You can create an item about someone if verifiable information can be added to it. Wikidata have a more lenient notability standard than Wikipedia. Wikidata does not require any sort of significant coverage; primary sources can also be used in some circumstances. But the existent of records in user-generated websites (such as Geni.com) does not automatically make someone notable.

The last sentence has to be read tongue in cheek, it seems.

And another issue I see, this one purely technical, is the scalability of Wikidata data base and query interface. The current data base growth rate is above 3 million new items/month, counting all categories of items. 

People and genealogy data are just a small part of the data base, but the bigger the data base, the harder it becomes to handle queries. Some quite simple queries which were yielding results in a decent time one year ago now regularly time out. See the time taken by https://w.wiki/DXq, which simply yields the number of items of type "Human" in the data base (over 6 million).
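
For readers who want to try a count like this themselves, here is a minimal sketch with a client-side timeout. It rests on assumptions: that the short link above runs something like the query below, that "type Human" means "instance of" (P31) "human" (Q5), and that the public endpoint is https://query.wikidata.org/sparql. The User-Agent string is a placeholder.

```python
import json
import urllib.parse
import urllib.request

# Assumption: public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: "type Human" means "instance of" (P31) "human" (Q5).
COUNT_HUMANS = "SELECT (COUNT(?item) AS ?count) WHERE { ?item wdt:P31 wd:Q5 . }"


def build_request(sparql: str) -> urllib.request.Request:
    """Prepare a GET request asking for SPARQL results in JSON."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": sparql})
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            # Placeholder client name; replace with your own contact details.
            "User-Agent": "g2g-example/0.1 (illustration only)",
        },
    )


def run_count(sparql: str, timeout_seconds: float = 60.0):
    """Return the count as an int, or None if the request fails or times out."""
    try:
        request = build_request(sparql)
        with urllib.request.urlopen(request, timeout=timeout_seconds) as resp:
            data = json.load(resp)
    except (OSError, ValueError):
        # OSError covers network errors and timeouts; ValueError covers bad JSON.
        return None
    return int(data["results"]["bindings"][0]["count"]["value"])


if __name__ == "__main__":
    result = run_count(COUNT_HUMANS)
    print("timed out or failed" if result is None else result)
```

A None result here only says that the request did not complete within the chosen time; it does not distinguish a server-side timeout from a network problem.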

They seem to have underestimated the job every which way.

They seem to think an American website only has to worry about American copyright law.

"Verifiability" is hopeless.  I find a marriage on Wikidata, citing thepeerage.com.  So I look on there, and it cites Complete Peerage.  So I look at that, and it says "he is said to have married", which is code for "we couldn't find any actual evidence".

So now I know where I stand.

But I could have gone straight to Complete Peerage for that.  If I go to Wikidata, I'm looking for computer-usable data.  It's not computer-usable if I have to hand-filter it.  I need it to have been filtered already.

"They seem to think an American website only has to worry about American copyright law."

For the record, the techie who runs the bot, GZWDer, is Chinese. GZWDer stands for Guangzhou Wikidata User.

+12 votes

I spent a lot of time this morning checking my new "suggestions" which came as a result of this massive (and still ongoing) import. There was one useful suggestion which allowed me to correct a typo on a death date (Jan instead of Jun), which is great, but there was a high price to pay for this.

Most of the other suggestions were incorrect. I spent some time pasting ugly bold notices at the top of all concerned profiles urging people not to replace the correctly sourced data on the profiles with unsourced "information" from ThePeerage.com via Wikidata (see https://www.wikitree.com/wiki/Mac_Mahon-160, https://www.wikitree.com/wiki/Murat-14 and https://www.wikitree.com/wiki/De_Pucheu-1). I know, I know, mark the suggestion as false, and I did, but that is not always enough! Sometimes the error re-appears, and sometimes contributors "fix" profiles without needing to be prompted by Wikidata (many links to ThePeerage.com were recently added to profiles I manage, including some where life events were already all sourced with primary records). There is a third group of suggestions where the data I have is currently incomplete, and the suggestions may be useful as prompts to review the profiles. I'm not sure all these suggestions will turn out correct, however.

The worst of it is that in at least one case, I had to mark a false suggestion that had already been marked previously. The incorrect data on Wikidata had been fixed (so the error disappeared from the database, as did its False Suggestion status) but it was re-introduced by the recent import from ThePeerage.com.

So while this could potentially improve the strength of the data overall, in the long run, it will be a long and painful process. In the short term the impact will probably be negative with the reintroduction of old errors which will need fixing again.

by Isabelle Martin G2G6 Pilot (366k points)
Isabelle, given your answer and many converging others, don't you think it would be a good idea to stop feeding suggestions with Wikidata information, at least temporarily until things are clarified? Or have Wikidata suggestions as an option. As Lindy writes in the next answer, people who plainly ignore what Wikidata is don't have to be inundated with dubious suggestions. The risk is to have suggestions become counter-productive, with people ignoring them altogether.
Certainly. Even people who know Wikidata are not necessarily pleased with these suggestions. At a minimum, having them on a second list so that they don't reflect badly on our work would help. (for instance, some might question how a pre-1500 certified project leader can be allowed to have 217 suggestions - 214 of them are Wikidata).
I only have a handful on my own list, but I know of others that I have worked on that are on it because of the work I have done on them. I can only check those I remember (we need a watchlist that includes all profiles we have invested time in).

Most of these profiles were unsourced or sourced from the Peerage itself or other general compilations. They often have multiple absentee PMs (original and duplicate creators who won't be checking their suggestion list). I worry that changes will be made by data doctors or others simply to match what the Peerage says and hence get rid of the suggestions.

Bernard, I neither said nor implied that WikiTreers "plainly ignore what Wikidata is...".

What I basically said is that many WikiTreers are simply unaware of that website or don't understand what possible use it is at WikiTree.

In my opinion, any connections being made to Wikidata are being made in the "back office" for members who actually use Wikidata and are versed in its purpose and use (in other words, our higher-tech members). Thus, any Data-Doctors-Project suggestions should stay in the back office for our higher-tech members to handle.

Again, in my opinion, we average WikiTreers are only receiving these suggestions in the hopes that we will all "get on board" the Wikidata bandwagon - whether we have any interest in or competence with that site. We are basically being recruited to do work beyond that which we volunteered to do at WikiTree.

I believe ALL Wikidata suggestions should be removed from the general suggestions report.

edit: I agree with your prediction that these suggestions will be counterproductive to our work at WikiTree. In my opinion, they already are.

Lindy, sorry for the misunderstanding, maybe there is a hidden meaning under "plainly ignore" that was not intended by me (sometimes it shows that I'm not a native speaker). I did not mean that WikiTreers were deliberately ignoring Wikidata, but, as you say, that they just didn't know what Wikidata is and really don't care.

So basically, we agree completely - I hope. Language is treacherous.

Thanks for the reply, Bernard. As you note, it's just a language/usage variation.

You should feel free to post in your native tongue so I can have Google translate for me - that site is usually good for a laugh!!

"any (Wikidata) Data-Doctors-Project suggestions should stay in the back office"

What Lindy said. And this is probably feasible. Suggestions around PPP issues were not released to the general public for a while. Category errors are still in a separate report.

+11 votes
I think this is NOT good news.   Just as when people started using Find A Grave for unsourced data, the amount of unsourced data on WikiData has grown exponentially.   I worked about 100 suggestions where the WikiData did not agree with WikiTree due to the addition of ThePeerage data and in 81 of the cases, the data on WikiData from The Peerage is merely someone's unsourced family tree.
by Robin Lee G2G6 Pilot (638k points)
In my opinion, the main problem with the WikiData suggestions is that they are for profiles that do not even have a visible link to or citation for WikiData.

In many cases, WikiTree profile managers not only don't know what WikiData is or how it can be useful, these managers haven't even heard of WikiData.

Why should we WikiTreers be inundated with suggestions for a site we don't use for sourcing and citing of WikiTree profiles?
Lindy, Robin, see my proposal in the above answer to Isabelle. BTW, what was the process (not the technical process, but the decision process) that brought Wikidata into suggestions?
The suggestion that scares me most is "Clue for father/mother". Creating parent profiles just because they are on Wikidata / out of what is on Wikidata is not a good idea. I'm sure some of these clues are incorrect or even represent people who didn't exist.

Re: the process (how Wikidata and Find A Grave matches became suggestions), I suppose Aleš will know. I admit I completely missed the G2G discussion before they were introduced, and even the memo that followed the discussion.
+1 vote

Putting this in a separate answer, slightly off-topic : I asked on the Wikidata discussion around this import whether adding so many people was conformant to Wikidata "notability" policy. https://www.wikidata.org/wiki/User_talk:GZWDer#The_Peerage_and_notability

Looking at the debate, it seems that people in Wikidata can have widely different views on this, and the simple fact that this import was possible is pointing at serious process issues.

by Bernard Vatant G2G6 Mach 3 (35.5k points)

And people who have not looked at the criticism of the earlier editions of 'Burkes' on wikipedia https://en.m.wikipedia.org/wiki/Burke%27s_Peerage, let alone the numerous 'email' sources used  on the Peerage.

Wikipedia's notability criterion basically says, don't write the article if there isn't enough reliable material.

The standard for reliable is realistic (ie high) on paper

https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources

but in medieval history it's flouted constantly.

Wikidata has basically the same rule, but the assumption seems to be that the net is cast much wider because you need very little material to create an item.

On the other hand, in genealogy, it's still going to be hard to find material that meets general reliability tests (as opposed to the standards that genealogists define for themselves, which are much lower).

 

+10 votes
I regard this development as unwelcome. Even before information from thepeerage.com started to feed into Wikidata, I found only a very small minority of Wikidata suggestions of any value. The same goes for FindAGrave, which we have also discussed in G2G. Now I am going to have to divert more of my time into unproductive work to mark unreliable info from an unreliable source as false - or else have a growing list of uncleared DBE suggestions. At a minimum, could we exclude both Wikidata and FindAGrave from suggestions for pre-1500, or preferably pre-1700, profiles?
by Michael Cayley G2G6 Mach 8 (85.1k points)

I completely agree - unwelcome indeed.  My attitude has always been that I signed up for WikiTree, not Wikidata (nor Find a Grave).  But I now have 65 suggestions on my list, where it was at zero.


As of today I have some 90 new suggestions on my personal suggestions feed. All, on a quick skim, absolutely useless. Do I really have either to mark them as false or leave them cluttering up my suggestions feed? One of them even has a woman called Dora marked as male in Wikidata.
Orphaning profiles is another way to reduce your personal suggestions feed.

If I don't want to do that, I just cave in and accept the suggestion, citing thepeerage.com and Wikidata.  If somebody objects, I'll mark them Uncertain.
