Should WikiTree have a style guide for AI generated content?

+28 votes
928 views

Lots of discussions are being had in the community about 1) identifying AI content and 2) citing the content as AI generated. Here is how the question was formulated by Brad Foley in Canada Project discussions.

We should develop styles and policies regarding the use of AI in generating content (especially biographies) before their abuse becomes an issue.

In particular, tools like Bing and ChatGPT can easily generate lovely bios with lots of false "facts". In one sense this is covered by existing WikiTree citation guidelines. But the scale and volume at which these biographies can be generated pose a new challenge.

For instance, in a biography of 30 lines, with 45 different statements, what constitutes sufficient and adequate sourcing? A single footnote meets current standards, but is clearly not enough to support such a long biography, which might be loaded with hallucinated (made-up) "facts". Current guidelines are not much help.

In those cases, should we ask (or require):

* a "biography generated by ChatGPT/Bing" citation

* the profile manager to delete the text

or are current tools up to the job, like

* tons of {{needs citations}} tags

* an {{insufficient sources}} sticker

Do we also need to develop new documentation around AI generated content, either as standalone docs or as additions to existing docs?

Some of these issues have been raised in a few other posts.

https://www.wikitree.com/g2g/1544689/artificial-intelligence-questions-answers-citation-sources

https://www.wikitree.com/g2g/1531185/anyone-tried-chatgpt3

Go!

Edited to add content and links.

in Policy and Style by Mags Gaulden G2G6 Pilot (644k points)
edited by Mags Gaulden
Perhaps we should be taking our direction from Ian Becall's post on Aug 20, 2023: pay attention to what is being created.

Please use Auto Bio responsibly

https://www.wikitree.com/g2g/1624639/please-use-auto-bio-responsibly
We should take our direction from official policy - our Help pages and other pages linked directly from those Help pages.

13 Answers

+21 votes
I've already seen a user create multiple space pages, purporting to provide timelines and facts, with a single "Microsoft AI results" citation. They honestly looked pretty good, but that in itself worries me.

In principle it's not that much different from citing Wikipedia, but in the case of Wikipedia we're relying on a previous layer of human-mediated editing and fact-checking. Here, we know that AI oftentimes (up to 20% or more) makes up facts, i.e. "hallucinates". I'm honestly worried that it's going to be increasingly difficult to wade through a large volume of AI generated text in bios and space pages.

I'm enthusiastic about the promise of AI as an aid, but I'm also worried that we're going to need to have legions of volunteers to go through new content, and check, line by line, what is real and what isn't, and to add sources.

It'll be like the early days of gedcom dumps all over again.
by Brad Foley G2G6 Mach 7 (79.2k points)
reshown by Mags Gaulden
Brad, not to sound alarmist, but: (1) the professional community I used to be a part of is frankly deeply concerned about the deceptions that can be worked using A.I., and (2) any certified forensic/fraud auditor will tell you that old-fashioned paper trails, once used to support a deception, are much easier to create electronically and are also the hardest for a reviewer to identify as untruthful. Now A.I. is being developed for broad-scale public use, and there is no doubt it will be used, whether intentionally or unintentionally, to advance deception by supporting fictional events.

I salute you and the Canadian team's effort to bring these issues to everyone's attention.
+16 votes
The use of AI whether ChatGPT or other similar software to create biographies or sources bothers me.

In many ways it seems like it could be useful, but as someone who isn't comfortable with, or experienced with, many types of software, I don't know how to determine if any AI generated content is correct or even could be correct. What are the sources?

I have seen examples of AI generated biographies or information on other websites and much of it seems generated based on examples from non verified sources.

As an example with a made-up name: Mary Jane ___ was a loving mother, she always looked after her children well, she was well known as an accomplished needlewoman, etc., etc.

It just seems like the sort of information that might be put in an obituary by someone who knew little if anything about the deceased and was just trying to fill the approx. 10 lines required for an obituary.

As a further example: an x-times GGA of my husband became a widow with 3 young children at about 26 years old. She remarried within 12 months of her husband's death.

The previous PM, who has not been active in many years had suggested that there was an extra-marital relationship with her second husband prior to her 1st husband's death.

There seemed to be no understanding by the previous PM that a widow with 3 young children in a frontier community either remarried or was left with no support and was unlikely to be able to manage on her own.

If we leave the decision about what is included about the person to AI, how do we know that AI is not looking at completely unsupported information about the person, such as the 'gossip' shown above?

We would need the same verifiable sources for AI enhanced profiles that we currently require.

If AI does not have valid sources it is not any more reliable than an unsourced Ancestry tree that only has other unsourced Ancestry trees as a source.
by M Ross G2G6 Pilot (738k points)
Let me ask my question here (I'm repeating what I said below to another comment you made):

Granted that AI is unreliable, and granted it isn't a source. I think we agree. What has changed is:

* It generates very plausible (and possibly wrong) bios
* It generates a lot of text quickly
* I feel like it will be difficult to police and keep on top of

You seem to be saying (here and in other comments) that you don't trust AI generated bios, but that current guidelines, policies, and tools are enough to keep on top of the new technology.

Is this a fair summary of your position?
No Brad, if I were the person in charge I would ban use of AI on WT, along with any mention of online family trees as a source, and "grandma told me what her great-great-person said and did 200 years ago."

But I'm not the person in charge and people will continue to use unsourced family trees and what grandma said as sources even if they don't list them as sources.

I have a reputation among people who know me well as a 'pit bull for details'.

And many people don't like being told that they are incorrect or don't have enough information to support their family stories.

And yes there is a huge difference between sedum acre and sedum spectabile. Yes I'm a plant person.

Just as there is a huge difference between Sarah Jane Richards 1836-1902 and her sister Sarah Richards 1838-1893. And I have a 3 x GGF to blame for that.

Just as today we cannot prevent people creating unsourced trees, we will not be able to stop people using AI.

So we need or will need some sort of system to cope with the fallout from using AI.

A banner that says "Created by AI, be very careful!"
+19 votes

A "biography generated by ChatGPT/Bing" statement (it can't be said to be a citation) should be required, and no, it is not nearly enough. I do think current tools are up to the job, a big, bright {{Unsourced}} being the main one (or an {{insufficient sources}} sticker, assuming those AI generated profiles have at least one actual source in order to have been created), along with a {{needs citations}} tag for every statement. Those will quickly expose the "emperor's new clothes" for what they are -- pretty words for a lot of nothing.

by Stephanie Ward G2G6 Pilot (118k points)
I agree a biography generated by ChatGPT or other similar AI programs is unsourced and should be labelled as unsourced.

It is no different than many other 'sources': an unsourced family tree, or "my 3 x GGM who died 100+ years ago said so."

After having recently read about a court case where the defendants used AI to create their defense, which was thrown out by the judge, I see no reason to accept anything created by AI as reliable.
+11 votes
Perhaps we could institute a == Generated by AI == section, much like the newly popular and useful == Research Notes == section. It must be made clear that text in the AI section needs to be verified and sourced to be taken seriously.
by Lucy Selvaggio-Diaz G2G6 Pilot (833k points)
+15 votes
I have been using ChatGPT to combine bios in complicated merges, or if there are several cut-and-pasted excerpts. It works pretty well, but I provide the content, and I still consider it a draft. It's pretty easy to add citations afterward and check the statements (ChatGPT can't help adding lofty summary statements about people's achievements, no matter what prompts I use to ask it to stick to facts and avoid opinions). Having ChatGPT generate the content, including the research, is frightening, but I suppose unavoidable.

I think ChatGPT bios should be considered unsourced - So yes we should have a style guide for ChatGPT.
by M Cole G2G6 Mach 9 (90.6k points)
edited by M Cole
This to me is the best existing use of ChatGPT (for WikiTree). The same as I use it to generate code at work. I still need to run, debug, and edit the code, but it can save me hours of work.

But I think the question isn't whether we should consider them unsourced (definitely ChatGPT isn't a source) but how we should handle a potential flood of hard-to-understand-what-the-sources-are and what-is-potentially-made-up in new bios.
The answer is the same as for other unsourced family trees: AI is not a source.

The actual sources and citations are the source, not the program that created the equivalent of GEDCOM junk.
Exactly. So in response to floods of GEDCOMs we instituted new policies so that people wouldn't have to manually go around and find, identify, and fix the automated imports of junk.

If we're at the point where AI is going to start generating a flood of questionable content, my question remains "are our current procedures enough to cope".

It sounds like your answer is "yes"?
+22 votes
Just to add to the conversation...

In thinking about this, if one has a list of sources, then the biography practically writes itself. It's easy enough to create a chronological biography with citations for each fact.

I would have a concern that people would focus on putting together a biography without sources already at hand. Having an AI model do that for you is disastrous.

Until an AI model can write a genealogical biography with a high level of accuracy, and properly cite sources (given a list), I'm OK with just outright banning them.

Biographies are not that hard to write, and the act of writing them helps to work through genealogical issues, like what is missing data, or lack of proof of relationships, etc. AI can't do that yet. It's folly to rely on an incomplete, and inaccurate, tool that is more of a toy at this stage.
by Eric Weddington G2G6 Pilot (521k points)
I definitely agree with everything you said.

The question is whether people who haven't done a lot of research understand that. Or if there are other people who just want to take shortcuts. I suspect that lots of people are going to be tempted to throw some unsourced ancestry tree or whatever (or even a bunch of sources) into an AI blender and post it as a biography.

Or they might stumble on such a biography second hand and take it as gospel. If a website has AI content that says "John Bobblebonk is my great uncle. I remember the smell of his pipe and the sound of his laugh. When he was 2 he fell and broke his front tooth ...." it sounds real. But....

I agree with other posters that, in principle, this is no different than the situation we're currently in, regarding sources. But the volume and diversity of spam content might plausibly be overwhelming.
All good points.

One thing that came to mind with your response is that a genealogical biography is supposed to be in the 3rd person, not the first person. I don't know if people realize that.
I haven't tried this, but I think you can direct ChatGPT to generate sources and citations. You'd probably need to convert them to Wiki Markup, but that's easy to do. I'm thinking about the stories of the legal brief that was submitted to a court with phony references. It must have looked sourced, but the information was all imagined.
I asked ChatGPT 4 to create a profile for Aquila Chase with citations for each statement. (Unlike earlier versions, ChatGPT will search the internet on a live basis, rather than just relying on its training data.) What it did was find the WikiTree profile for Aquila, summarize it, and cite the WikiTree profile for each statement. LOLOL.
Chase, that's hilarious.
I just asked for one of Thomas Dudley. It cited Wikipedia, Encyclopedia.com and Britannica.com, but the statements included information not in those sources that seemed to be hallucinations, like that Thomas Dudley arrived on the Mayflower in 1630, and that he supported the "execution" of Anne Hutchinson.
+12 votes
Question: Whether a human or an AI writes a biography, don't we still need to source (ideally) each statement made?
by Jillaine Smith G2G6 Pilot (911k points)
Absolutely - since AI is prone to "hallucinations" it would be important to identify sections created by AI. Not that humans can't hallucinate, but...
I agree Mags, we already have a large number of unsourced, wrongly sourced profiles without accurate biographies; adding to this problem with hallucinatory AI information will only make the problem worse.
+12 votes
I share the concern, but this is not just an AI thing. We currently have "tools" that take the information attached to a profile and create a "lovely" biography. I recently came across one that had 15 children listed in the biography and 15 children attached. When I asked for the source from the "long time member" who had used a tool to create the biography, I was told that the tool just uses the data; no sources are confirmed. End of story: only 6 of those 15 are confirmed children.

Unless we make the people who use such tools responsible for validating the information in the biography, it really will become an issue.
by Robin Lee G2G6 Pilot (866k points)
Robin I can't even begin to count the number of biographies I have seen that are similar to the one you describe.

We all know this: "Genealogy without documentation is mythology." Using AI or other tools that promote more such mythology is a nightmare in the making.

How can we as a group who promote accuracy and documentation also endorse practices that will make our tree less reliable?
+9 votes
I don't think a new policy is needed. I think creating a profile by cutting and pasting an AI generated bio is no different than creating a bio by cutting and pasting a bio from geni.com or some other website. Either way, you need to cite your sources, and a bio supported only by a citation to an unsourced secondary source (like an AI) is poorly supported.
by Chase Ashley G2G6 Pilot (313k points)
Thanks for this Chase!

If not new policies, do you think maybe other kinds of AI-specific introductory material or documentation would be useful? Or is it so obvious that "AI generated text is not a source. We require sources for genealogy." that it's unnecessary?

Chase, I disagree that a biography that a WikiTreer creates using AI (with co-creator attribution) is the same as a biography written by another person and copy/pasted by a WikiTreer (without attribution to the creator).

The essential issues, in my opinion, are who created the biography and were the creators given credit. Proper attribution would fall basically upon the WikiTreer, so we should have at least a basic policy on this topic.


Brad, my experience at WikiTree and elsewhere tells me that NOTHING is so obvious to everyone involved that it doesn't need a policy.


Overall, I agree with Mags that WikiTree should discuss and adopt a policy to govern the use of AI on the website.

@Lindy - If text is a cut and paste, there should be attribution, regardless of whether it is from an AI or from a human source. Sooo . . . not sure I see the difference there.
@Brad - Re "Or is it so obvious that AI generated text is not a source." An AI is a source, just like an unsourced family tree is a source. But neither is a reliable source, and both should be avoided as sources.

I think an AI policy should say that (1) AIs are not considered reliable sources and (2) if an AI is used as a source, the contributor should cite the AI version they used and the query they used.
My understanding is that we give attribution, or cite, the source objects of our information, not the fact that we used a particular tool to write text for that information. If I copy/paste a biography from a Word or similar document that I created on my computer, I don't need to cite the Word program, do I? However, if I copy/paste text that another person created, I would expect to cite that person as the source object for the text rather than citing the tool that person used.

I see AI as just another tool. The user of that tool has the responsibility to learn how and when to use it. Having a clear policy would help users meet their responsibility.
Very different from Word. With Word the user is providing the substantive content; with a generative AI like ChatGPT, the AI is. The AI is creating a secondary source on the fly in response to the user's query, so the AI needs to be cited just like any other secondary source.
Can you provide an example profile for which AI has created a secondary source? Would this secondary source not be based on existing sources which should be cited instead of AI?
Every profile created by AI is a secondary source created from other sources (and perhaps made-up stuff). The AI won't necessarily cite the sources that it based its profile on. If it does, and if those cites are verified as supporting the statements, then, yes, it would be appropriate to cite those sources rather than the AI. However, I note that lots of times, genealogists just cite the secondary source (e.g., Great Migration Begins) rather than checking the cites that source cites and citing the underlying sources.

How is a created profile a source?

A profile created by an AI could be a secondary source for a WikiTree profile, just like a profile on geni.com or Find-A-Grave or a profile in a book could be a secondary source for a WikiTree profile.
The AI-created profile would not be its own source, would it?

If so, we definitely need style guidelines for AI usage.

Perhaps we are comparing apples and oranges. I see a tool like AI as the apple and its output - the text for a profile - as the orange.

Either way, I would cite the records behind the output, not the tool I used to assimilate the output. I could mention the tool I used, but I don't see the need to do so.
I won't share an example profile here, because I don't want to draw criticism toward a specific user. But a way that AI generated text is known to be a problem is when the AI invents (or hallucinates) facts. These hallucinations can go as far as invented citations to books or papers or records that never existed.

Humans of course can (and do) invent fake facts all the time. But AI tools can do this rapidly and fluently.

One worry is that people start relying on AI tools as a writing aid, and don't check the sources and the narrative (which in turn may get cited elsewhere as fact). I think in this case, Chase, it's very different from being a secondary source.

One worry is that people start relying on AI tools . . . and don't check the sources and the narrative (which in turn may get cited elsewhere as fact). I think in this case Chase, it's very different from being a secondary source.

Seems the same as people relying blindly on Ancestry trees, which, like AI-generated profiles are secondary sources. Both unreliable sources that people should not be relying on.

+15 votes
I don't know if there needs to be an additional policy about AI. Content completely generated by AI should just be treated the same as any unreliable or unsourced information (basically, don't trust it if there is no way to verify it).

I do think we should at least have a help page about AI -- to warn that generative AIs can make stuff up and that it can also create realistic-looking images.
by Jamie Nelson G2G6 Pilot (631k points)

Should the help page about Copying Text specify that this applies whether the copied content is produced by 'humans' or AI?

Yes, John, it should. However, I think AI needs to be addressed separately as well because many people will not realize that their AI generated content is "copy/paste" material. Policies need to be explicit.
I agree with you, Jamie.
I'm a terrible writer but threw some ideas onto this page, if anyone wants to add to it: https://www.wikitree.com/wiki/Space:AI_Help_Page_Draft

Also, while I think AI-generated text is mostly covered by the "don't copy" policy, we might need some policies for images. Do we want to ban generated images?

"Do we want to ban generated images?"

How would that be done? Just creating a rule prohibiting them would not prevent them from being used. If something can be done, some people will do it.

I expect AI-generated images could also include fake metadata.

We wouldn't be able to stop someone from uploading an AI-generated image, but a rule would discourage people from using them (because doing so could result in their account being closed).
Why would we wish to ban AI generated (or enhanced) images?
AI can generate plausible but false images of anything, including for example gravestones and documents.

Enhancement of existing images can be useful up to a point, but as usual AI could go overboard and introduce details that never existed.
If we go the route of banning AI enhanced images, there are a lot of stickers that will be affected.
I wasn't thinking about graphics, but rather images that someone could mistake for a photograph or source document.

Like this?

More like creating something like this and trying to pass it off as a real photo. Right now it's still fairly easy to spot fakes, but in a couple of years who knows.

These could be taken to be photographs, just not 19th century ones.
+3 votes
I don't think it will be possible or necessary to write a special policy for AI. The situation is changing, and as others have pointed out, the basic principles are not that much different from situations we know from the past.

People should always say where they got their material.

It certainly wouldn't be right to take an anti-AI approach out of principle, because AI can be used in so many ways. One thing I expect we'll be seeing more of, for example, is translations written by AI. I guess these won't always be cited, and that is not really the end of the world, but it would be best practice.

What we are seeing with some of the new AI goes a bit further than mere translation. For example you can give a list of facts and ask for it to be written up in a certain style. That is something which is going to take some getting used to, and I'm sure it will bring complications.
by Andrew Lancaster G2G6 Pilot (142k points)
+4 votes

I think that "AI generated content" is a very diffuse concept. There is in fact a continuous scale, all the way from the simplest machine generated content up to the current state-of-the-art generative AIs. I for one use machine-generated content for my bios all the time, with a self-developed Perl script that takes data from my own database and produces a full biography. I'm improving it all the time, and the output needs ever less hand editing. Given enough time for development, the output of such a script could eventually reach a level that might be called "AI generated content".

But AI generated content in itself, as I see it, is not a problem. The real problem is the old "garbage in, garbage out" (GIGO) principle. It is really exactly the same problem that we've already got with the old machine-generated GEDCOM junk, with reams of sections and subsections which as a rule boil down to absolutely nothing of substance.

If "AI generated content" should be disallowed, it would probably make sense also to disallow GEDCOM imports. And maybe all "machine generated content", such as my own scripted biographies.

I think the real issue is the "fluff" factor, i.e. the ratio of text to what might be called substance. Or in plain old information-theory speak, the signal-to-noise ratio, which might actually be made into an operational definition of what is wanted in the Biography field of a profile.

As long as the generated text is supported by sources, everything should be OK. But bio text unsupported by sources should never be welcomed, whether it is generated by humans or computers.
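In pseudocode terms, such a script boils down to something like the following (a hypothetical Python sketch, not the actual Perl script; the data format, function, and field names are all invented for illustration). The key point is that every emitted statement carries its own inline citation, and statements without a source are simply dropped:

```python
def make_bio(person):
    """Render a sourced biography in wiki markup from structured facts.

    Each fact is a (statement, source) pair; the source becomes an inline
    <ref> citation. Statements without a source are dropped rather than
    emitted unsupported -- a simple guard against garbage in, garbage out.
    """
    lines = ["== Biography =="]
    for statement, source in person["facts"]:
        if not source:  # no source, no statement
            continue
        lines.append(f"{statement}<ref>{source}</ref>")
    lines += ["", "== Sources ==", "<references />"]
    return "\n".join(lines)

# Invented sample data: one sourced fact, one unsupported "fluff" claim.
person = {
    "facts": [
        ("John Doe was born on 1 May 1820 in Kent, England.",
         "Kent parish registers, baptism of John Doe, 1820."),
        ("He was a loving father.", None),  # unsupported, will be dropped
    ],
}
print(make_bio(person))
```

Because the script only ever restates facts that already carry a citation, its output stays on the right side of the "supported by sources" line, however polished the generated prose becomes.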

by Leif Biberg Kristensen G2G6 Pilot (209k points)
+5 votes

My personal opinion, as someone who uses ChatGPT on a regular basis: the debate touches on a crucial aspect of integrating AI into historical and genealogical work. While AI can be a powerful tool, its current limitations, especially in generating factually accurate content without explicit sourcing, present significant challenges. The idea of a style guide or specific policies for AI-generated content seems prudent. It could help in setting clear standards for the use of AI, ensuring that any content it generates is properly vetted and sourced. This approach would maintain the integrity of the historical record while still leveraging the benefits of new technology. The balance between innovation and accuracy is delicate, especially in fields where factual correctness is paramount.

by Brian Parton G2G5 (5.1k points)
