Applying Database cleanup concepts to the Profile Text

+22 votes
343 views

Is anyone thinking about automated analysis of the Profile Text, like we do for the data fields of the profile as part of the database errors project? The DB errors project touches on this with the 901 and 902 errors.

Would we treat this as a separate project or as a sub-project of the DB errors project?

For a starter, here are some ideas based on errors I find myself fixing. There are probably dozens of other great ideas to be added:

Malformed Headings:

  • Text after the trailing =
  • Mismatched number of = on the front and back of a heading
  • Blank, then = at the start of a line
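
The three heading checks above could be sketched roughly as follows. This is only an illustration: the function name `check_heading_line` and the problem labels are made up here, and it assumes any line whose first non-blank character is = was meant to be a heading.

```python
def check_heading_line(line):
    """Return a list of problem labels for one line of profile text."""
    problems = []
    stripped = line.strip()
    if not stripped.startswith("="):
        return problems  # not a heading candidate
    # Leading whitespace stops the wiki from rendering the line as a heading.
    if line[0] in " \t":
        problems.append("blank before leading =")
    # Count the opening run of '=' characters.
    lead = len(stripped) - len(stripped.lstrip("="))
    rest = stripped[lead:]
    last = rest.rfind("=")
    if last == -1:
        problems.append("mismatched = count")  # no closing markers at all
        return problems
    # Anything other than whitespace after the last '=' is stray text.
    if rest[last + 1:].strip():
        problems.append("text after trailing =")
    # Count the closing run of '=' that ends at the last '='.
    closing = rest[:last + 1]
    trail = len(closing) - len(closing.rstrip("="))
    if trail != lead:
        problems.append("mismatched = count")
    return problems
```

Calling it once per line of a profile and collecting the non-empty results would give a per-profile report.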

Cite errors

  • Errors displayed as cite error 1-6 in the body of the text due to bad syntax (I’ve actually cleaned up many/most of these in the last 6 months)
  • <ref></ref> - empty body
  • <reference/> - not the exact text, or content placed inside it
  • <ref></ref> pair with no corresponding <references/>
  • <ref></ref> pair after the <references/>
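
A rough sketch of how the tag-related items in this list might be detected with regular expressions. The function name `check_refs` and the labels are invented for illustration, and it assumes that `<references/>` (optionally with a space before the slash) is the only accepted spelling. The "cite error 1-6" item is not covered, since that depends on how the wiki renders the page.

```python
import re

# <ref> or <ref name="x">, but not <references/> and not self-closing <ref .../>
REF_OPEN = re.compile(r"<ref(\s[^>/]*)?>", re.IGNORECASE)
EMPTY_REF = re.compile(r"<ref(\s[^>/]*)?>\s*</ref>", re.IGNORECASE)
REFERENCES = re.compile(r"<references\s*/>", re.IGNORECASE)
# Anything that looks like a reference(s) tag, exact or not.
LOOSE_REFERENCES = re.compile(r"</?references?\b[^>]*>", re.IGNORECASE)

def check_refs(text):
    """Return a list of problem labels for the ref markup in one profile."""
    problems = []
    if EMPTY_REF.search(text):
        problems.append("empty <ref></ref>")
    ref_positions = [m.start() for m in REF_OPEN.finditer(text)]
    list_positions = [m.start() for m in REFERENCES.finditer(text)]
    if ref_positions and not list_positions:
        problems.append("<ref> without <references/>")
    elif ref_positions and ref_positions[-1] > list_positions[-1]:
        problems.append("<ref> after <references/>")
    for m in LOOSE_REFERENCES.finditer(text):
        if not REFERENCES.fullmatch(m.group()):
            problems.append("malformed references tag: " + m.group())
    return problems
```

For example, `check_refs("Born.<ref>Census</ref>")` reports the missing `<references/>`.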

Merge incomplete

  • Duplicate section headings, most likely due to merge being incomplete
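
One possible way to find duplicate section headings, as a sketch only. The name `duplicate_headings` is hypothetical, and comparing titles case-insensitively while ignoring the heading level is an assumption of this example.

```python
import re
from collections import Counter

# A heading line: opening '=' run, title, closing '=' run.
HEADING = re.compile(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", re.MULTILINE)

def duplicate_headings(text):
    """Return the lowercased heading titles that occur more than once."""
    titles = [m.group(1).lower() for m in HEADING.finditer(text)]
    counts = Counter(titles)
    return sorted(title for title, n in counts.items() if n > 1)
```

A profile left over from an incomplete merge, with Biography and Sources each appearing twice, would return both titles.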

Source

  • No source section (probably over a million?)
  • “* *” when * was probably intended
  • Only * on a line
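
The two bullet problems above are simple line-level patterns. A minimal sketch, with a made-up function name `bullet_problems` and made-up labels, assuming "* *" means an asterisk, whitespace, then another asterisk at the start of a line:

```python
import re

def bullet_problems(text):
    """Return (line number, label) pairs for suspicious bullet lines."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped == "*":
            problems.append((number, "only * on line"))
        elif re.match(r"\*\s+\*", stripped):
            problems.append((number, "* * instead of *"))
    return problems
```

Note that "**" without a space is a nested bullet in wiki markup and is deliberately not flagged here.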

Style Guide

  • Use of non-standard, discouraged formatting
  • Use of Acknowledgements for just the person who did an import (this one is debatable) 
in WikiTree Tech by Marty Acks G2G6 Pilot (153k points)
retagged by Ellen Smith

Just a note: We're working with Aleš to get him a dump of the text of profiles. This will be a huge amount of data for him to wrangle but he's as ambitious as ever.

Great to hear.

Good that this came back to the top of the list. I did think about most of these.

    • Text after the trailing =: Can be checked.
    • Mismatched number of = on the front and back of a heading: Can be checked.
    • Blank, then = at the start of a line: Is this not allowed?

    How about a missing level in headings? Is this allowed?
    = L1 =
    === L3 ===

    • Errors displayed as cite error 1-6 in the body of the text due to bad syntax (I’ve actually cleaned up many/most of these in the last 6 months):
    What did you mean by this? I don't quite understand.

    • <ref></ref> - empty body: Planned.
    • <reference/> - not the exact text, or content placed inside it: Kind of planned. Covered by the next item.
    • <ref></ref> pair with no corresponding <references/>: Planned.
    • <ref></ref> pair after the <references/>: Is this not allowed?
    Also nested ref tags <ref><ref></ref></ref>
    and unpaired <ref> tags are planned.
    • Duplicate section headings, most likely due to merge being incomplete: Done for Sources; for others, not sure if it is an error.
    • No source section (probably over a million?): Covered by other errors.
    • “* *” when * was probably intended: Does this occur often?
    • Only * on a line: Does this occur often?
    • Use of non-standard, discouraged formatting: Is there a list of non-standard, discouraged formatting?
    • Use of Acknowledgements for just the person who did an import (this one is debatable): Don't quite understand.

    Thanks. I noticed the latest DB errors picked up on the duplicate headings to catch uncleaned up merges. Very nice.

    Some further answers

    • Blank, then = at the start of a line: Is this not allowed? Marty> This causes the line not to be rendered as a heading.
    • How about a missing level in headings? Is this allowed?

      = L1 =

      === L3 ===

      Marty> I think it just ignores the gap. I would not bother testing it; it is subtle and does not impact the appearance of the profile.
    • <ref></ref> pair after the <references/>: Is this not allowed? Marty> The <references/> only lists what is before it. Though you will get many hits on the duplicate section headings, these do happen on their own, too.
    • Duplicate section headings, most likely due to merge being incomplete: Done for Sources; for others, not sure if it is an error. Marty> Good point. You may want to add Biography and Acknowledgments to this, but not all the others like NAME, DOB, DOD, and the like.
    • “* *” when * was probably intended: Does this occur often? Marty> I see a fair amount. It might be isolated to one member's contributions. Very minor issue.
    • Only * on a line: Does this occur often? Marty> I see this more than the one above. I think the new profile template may have added it for some period of time to encourage the entry of the source. Again, this is very minor for me.
    • Use of non-standard, discouraged formatting: Is there a list of non-standard, discouraged formatting? Marty> See this style guide article: http://www.wikitree.com/wiki/HTML_and_Inline_CSS. Check with Chris or other people more senior than I am. This may open a can of worms (sorry for the US idiom).
    • Use of Acknowledgements for just the person who did an import (this one is debatable): Don't quite understand. Marty> I think this is too subjective. I'd say not to do anything unless Chris or some other senior members think this is a good idea. I am not sure how you would automate this, but Acknowledgments are trending towards being used rarely per the style guidelines. See style guide article: http://www.wikitree.com/wiki/Acknowledgements
    Don't know if this belongs here, but I just spent the evening fixing the "Clean up" errors on my error feed, approximately 125 errors. They did need cleanup, boy did they ever, but only 10 or fewer became un-sourced. OK, I was not looking for sources, just cleaning up the profiles. I guess I feel that's not too bad.
    I did some checking and

    ' =' at the beginning of the line appears in 3139 profiles. I will add this as an error.

    Only '*' on a line appears in 107794 profiles. I think this is not a candidate for an error. There are too many of them and it doesn't matter too much.

    Here are frequencies of common header structures.

    == Biography == == Sources == 5477337
    == Biography == == Sources == === Footnotes === === Acknowledgments === 475427
    == Biography == === Name === === Birth === === Residence === == Sources == 169073
    == Biography == == Sources == === Acknowledgments === 160354
    == Biography == == Sources == === Footnotes === == Acknowledgments == 125344
    == Biography == === User ID === === Data Changed === == Sources == 122044
    == Biography == == Sources == == Acknowledgments == 111497
    == Biography == === Name === == Sources == 95324
    == Biography == === Data Changed === == Sources == 56517
    == Biography == === Birth === === Death === === Record ID Number === === User ID === === UPD === == Sources == 50085
    == Biography == == Sources == == Biography == == Sources == 45384
    === Source === === Sources === 41894
    == Biography == === Name === === Birth === === Residence === == Sources == === Acknowledgments === 38103
    === User ID === === Data Changed === 37297
    == Biography == === Name === === Birth === === Death === == Sources == 35088
    == Biography == === User ID === == Sources == 34718
    == Sources == 32009
    == Biography == === Name === === Birth === === Death === === Residence === == Sources == 31269
    == Biography == === Birth === === Death === === Record ID Number === === User ID === === UPD === == Sources == === Acknowledgments === 30824
    == Biography == 26852
    == Biography == === Name === === Birth === == Sources == 26494
    == Biography == === Name === === Birth === === Residence === === Marriage === == Sources == 25673
    == Biography == === Notes === == Sources == 23402
    === User ID === 23338
    == Biography == === User ID === === Data Changed === == Sources == == Acknowledgments == 22817
    == Biography == === Birth === == Sources == 22467
    == Biography == === User ID === === Data Changed === == Sources == === Acknowledgments === 22405
    == Biography == === Record File Number === === Submitter === === Data Changed === == Sources == 21494
    == Biography == === Name === == Sources == === Acknowledgments === 20957
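
A table like the one above can be approximated by reducing each profile to the sequence of its heading lines and counting identical sequences. The sketch below is only one guess at such a computation, not the method actually used for these numbers; `heading_signature` and `signature_frequencies` are hypothetical names, and `profiles` is assumed to be an iterable of profile text strings.

```python
import re
from collections import Counter

# A heading line: '=' run, at least one non-'=' character, closing '=' run.
HEADING_LINE = re.compile(r"^=+[^=\n].*?=+[ \t]*$", re.MULTILINE)

def heading_signature(text):
    """Join all heading lines of one profile into a single string."""
    return " ".join(m.group().strip() for m in HEADING_LINE.finditer(text))

def signature_frequencies(profiles):
    """Count identical heading signatures, most frequent first."""
    return Counter(heading_signature(t) for t in profiles).most_common()
```

Profiles without any headings end up under the empty signature, which may be worth reporting separately.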

    And here are frequencies for header structures shared by more than 100 bios.

    http://www.softdata.si/osebe_staro/ales/wikitree/Captions.htm

    As you can see, it is a mixture of all combinations.

    Biography is mainly on Level 2 (== Biography ==), with a few thousand on L1 (= Biography =) and L3 (=== Biography ===), and a few thousand without spaces (==Biography==). Since this has no visual effect, I wouldn't bother with them.

    The same goes for sources. With sources, there are two variants, (=== Source ===) and (=== Sources ===), but again there is no point in changing that.

    So the only errors to check would be instances like this, which shouldn't exist:

    == Biography == 

    == Census == 

    === 1850 United States Census, 1850, Database with images, FamilySearch (https://familysearch.org/ark:/61903/1:1:MCD2-6T2 : accessed 17 June 2015), William Giffin, Knox county, part of, Knox, Tennessee, United States citing family 643, NARA microfilm publication M432 (Washington, D.C.: National Archives and Records Administration, n.d.). ===

    === 1860 United States Census, 1860, Database, FamilySearch (https://familysearch.org/ark:/61903/1:1:M8TP-87R : accessed 19 June 2015), Nancy Griffin in household of William Griffin, 14th Dist, Knox, Tennessee, United States from 1860 U.S. Federal Census - Population, database, Fold3.com (http://www.fold3.com : n.d.) citing p. 150, household ID 947, NARA microfilm publication M653 (Washington, D.C.: National Archives and Records Administration, n.d.) FHL microfilm 805,259. ===

    == Marriage == 

    === Marriage Certificate Tennessee, County Marriages, 1790-1950, Database with images, FamilySearch (https://familysearch.org/ark:/61903/1:1:XZ44-C47 : accessed 17 June 2015), William Giffin and Nancy King, 08 Sep 1843 citing Knox, Tennessee, United States, county courthouses, Tennessee FHL microfilm 1,205,071. ===

    == Children ==

    === Death Certificate of Leander Tennessee, Death Records, 1914-1955, Database with images, FamilySearch (https://familysearch.org/ark:/61903/1:1:NS4F-V2K : accessed 17 June 2015), Nancy King in entry for Leander Dowell Giffen, 13 Sep 1934 citing Cemetery, Mt. Olive, Knox, Tennessee, cn 20756, State Library and Archives, Nashville FHL microfilm 1,876,817. ===

    === Samuel Benton's Death Certificate Tennessee, Death Records, 1914-1955, Database with images, FamilySearch (https://familysearch.org/ark:/61903/1:1:N9BM-KXL : accessed 17 June 2015), Samuel Benton Giffin, 23 Dec 1940 citing Cemetery, Knox County, Tennessee, cn 27895, State Library and Archives, Nashville FHL microfilm 1,876,894. ===

    === Daughter Eliza Katherine Wilson's Death Certificate Tennessee, Death Records, 1914-1955, Database with images, FamilySearch (https://familysearch.org/ark:/61903/1:1:NS7K-XZP : accessed 29 June 2015), Nancy King in entry for Eliza Katherine Wilson, 21 Dec 1945 citing Woodlawn Cemetery, Knoxville, Knox, Tennessee, cn 25920, State Library and Archives, Nashville FHL microfilm 2,137,365. ===
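
Headings like the ones in this example could perhaps be flagged by their length or by the presence of a URL. The thread does not define the check, so the sketch below is a guess: the name `suspicious_headings` and the 80-character threshold are assumptions chosen only for illustration.

```python
import re

HEADING = re.compile(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", re.MULTILINE)
MAX_TITLE_LENGTH = 80  # assumed threshold, not taken from the thread

def suspicious_headings(text):
    """Return heading titles that look like full citations."""
    flagged = []
    for m in HEADING.finditer(text):
        title = m.group(1)
        if len(title) > MAX_TITLE_LENGTH or "http" in title.lower():
            flagged.append(title)
    return flagged
```

Short headings such as == Census == or == Marriage == pass, while the citation-length headings beneath them are returned.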

    Aleš Trtnik,

    Very nice analysis. I'm fine with whatever you want to add. At some point, do you split the bio cleanup from the data cleanup reporting? We may have different groups that like to work in one area but not the other.

    Well, Bio cleanup errors are grouped in 800-900, so we can still be one group.
    That works for me.

    1 Answer

    +8 votes
    I routinely use Acknowledgements for the import data and for ancestry tree links that aren't linked to anything (I can't verify the links are good).  I know the import is in Changes but I like having it where I and others can easily see who to credit or blame.

    Don't know if you noticed, but Ales said the Bio is not part of the database dump he gets for now. Anyway, it doesn't hurt to start planning for it.
    by Living Anonymous G2G6 Mach 5 (51.5k points)
    I did see that Ales said he does not have access to the text today, but apparently he can tell whether it is empty somehow (and the size?) based on other posts.

    I probably should have left Acknowledgements out of my post. I tend to leave the section in until I have better/real sources. The more I think about it, the harder that would be to get right automatically.

    Let's see if there is interest at this time in automated Profile text analysis; if not, we can move on. I suspect it may be a different set of people that are interested in text cleanup, as you kind of have to be into the whole wiki markup language to correct some of them, like the ref tag related issues.

    We also may want to get through all the db errors first with our little fast moving group and attack the text later.

    And who knows, there may be some technical or privacy barriers that make this more problematic than it seems.
    Yes, after a merge or two, it's far more helpful to have Acknowledgements than to try searching thru the changes for an ID. As a "just in case", I'm adding a "created by" line under Acknowledgements for all my new profiles...

    Kitty,

    My interpretation of the style guidelines is that this is discouraged. See http://www.wikitree.com/wiki/Acknowledgements.


    ...