Is it OK that GEDCOM export splits UTF-8 characters?

+6 votes
290 views

I recently tried to convert a GEDCOM produced by WikiTree to another codeset using the utility `iconv`. I was shocked to be told that the GEDCOM file is not valid UTF-8. Reason: a multibyte UTF-8 character appearing in a biography was split with the first half at the end of a line and the second half at the start of the following CONC line.

Is this considered to be within the specification for a GEDCOM file that declares CHAR UTF-8? Or is it tolerable because genealogy programs will concatenate the parts and other programs are not supposed to read the file? Or is it simply a bug?

Update: The answers make it clear that this is a bug, so the question now has that tag too.

 

in WikiTree Tech by Dirk Laurie G2G6 Mach 3 (35.7k points)
edited by Dirk Laurie

3 Answers

+4 votes
 
Best answer

What you describe isn't valid UTF-8; you cannot have one or more characters (the End of Line) inside another character. That not sense make.
You identified a WikiTree GEDCOM export defect that WikiTree needs to fix.
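A minimal Python sketch of why the two halves are invalid on their own. The en dash (U+2013) is used only as an example of a three-byte character, and the helper name `is_valid_utf8` is made up for this illustration.

```python
# The en dash U+2013 encodes to three bytes in UTF-8.
dash = "\u2013".encode("utf-8")  # b'\xe2\x80\x93'

# Simulate an export that breaks the line after the first byte.
head, tail = dash[:1], dash[1:]


def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if chunk decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(head))         # False: lead byte without its continuation bytes
print(is_valid_utf8(tail))         # False: continuation bytes without a lead byte
print(is_valid_utf8(head + tail))  # True: the rejoined character
```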

Meanwhile, you can, as a workaround, move the two halves of the character together. If the utility accepts long lines, all you have to do is remove the EOL inside the character. It is annoying to have to do that, but it is doable, as you're not likely to have many instances of this export defect.

On a practical note, a problem you will run into when trying to do that is that the ostensible WikiTree GEDCOM file is not even a valid text file.
You may want to use a hex editor, as regular text editors may either refuse to read it or get seriously confused.
Editors are likely to misrecognise the character encoding, and in this case that is what you want: the editor cannot read the file as UTF-8, because it isn't UTF-8.

Read the ostensible WikiTree GEDCOM as Windows ANSI, remove that EOL, and save the file. Yes, you read and save the file as Windows ANSI. As long as you don't change anything else, it will, ahem, magically, be a correct UTF-8 GEDCOM file.
Obviously not recommended as a general technique, but should fix this problem.
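A rough sketch of why the read-as-ANSI trick preserves the data. It makes several assumptions: the sample bytes are made up, Python's `latin-1` codec stands in for "Windows ANSI" (Python's own `cp1252` codec rejects five unassigned byte values, whereas editors may behave differently), and removing the CONC prefix together with the line break is my reading of "move the two halves together".

```python
# Two GEDCOM lines in which a three-byte character is split across the break.
# The sample content is made up for illustration.
broken = b"1 NOTE 1903\xe2\r\n2 CONC \x80\x931982\r\n"

# Latin-1 maps every byte 0x00-0xFF to exactly one character, so decoding
# never fails and encoding restores the original bytes unchanged.
text = broken.decode("latin-1")
assert text.encode("latin-1") == broken

# The manual edit: delete the line break and the CONC prefix that follows it.
edited = text.replace("\r\n2 CONC ", "", 1)

# Saving in the same single-byte encoding writes the other bytes back
# untouched, and the rejoined character makes the result decode as UTF-8.
fixed = edited.encode("latin-1")
print(fixed.decode("utf-8"))
```

The editor never understands the multi-byte character; it only shuffles single bytes, which is why nothing else in the file may be changed during the edit.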

update 2017-02-16: expanded answer into an article, with more details & explanation, a running example, illustration of the workaround and related links: http://www.tamurajones.net/AWikiTreeGEDCOMProblem.xhtml

by Tamura Jones G2G Crew (620 points)
edited by Tamura Jones
In researching this problem today, I just discovered your excellent website: http://www.tamurajones.net/index.xhtml

And since you beat me to the punch, I no longer felt qualified to answer :)
Is there an actual line length limit for a CONC tag? Or is WikiTree's 78 character limit a bit arbitrary?
Yes, there is a maximum line length.
FamilySearch GEDCOM specifies it in a confused way.
The actual limit is not 255 characters, but 255 code units, as explained in this article: http://www.tamurajones.net/GEDCOMLines.xhtml

With WikiTree limiting lines to 80 bytes total, it is perfectly fine to merge two lines by removing the offending EOL. The resulting line will be at most 158 bytes long, well below the limit of 255.
Of course the point of the line length limit is to size a buffer.  1-byte characters - 256 bytes.  2-byte characters - 512 bytes.  Multibyte characters - not anticipated.

Unicode maps semantic characters to abstract code points, 0-1114111 (U+0000 to U+10FFFF).  Ideally, GEDCOM specifies a code-point-stream.  Then you map that to a byte stream in whatever format your pigeons carry, and if necessary, re-encode when changing pigeons - none of which would be of any concern to the application (nor would the byte-stream encoders need to know about GEDCOM).

Trouble is, the application then only knows the line length in characters.  The problem UTF-8 doesn't solve is the existence of too much old software using scanf-type input routines to read lines into fixed-length buffers.

The only way round that is to do the byte-encoding in the wrong place, so as to control the line length in bytes.

Which is where it all gets very messy.
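For illustration, one way to control line length in bytes without cutting a character is to back up from the byte limit until the cut lands on a character boundary. This is only a sketch, not how WikiTree or any GEDCOM library does it: the function name `split_utf8` is hypothetical, and it ignores the tag prefix and GEDCOM's own rules about where CONC may break.

```python
def split_utf8(text: str, limit: int) -> list:
    """Split text into UTF-8 chunks of at most `limit` bytes each,
    never cutting inside a multi-byte character."""
    if limit < 4:
        raise ValueError("limit must hold the longest UTF-8 character (4 bytes)")
    data = text.encode("utf-8")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + limit, len(data))
        # Continuation bytes look like 0b10xxxxxx; back up until `end`
        # sits on the first byte of a character (or at the end of data).
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end])
        start = end
    return chunks


# Every chunk decodes on its own, and none exceeds the byte limit.
for chunk in split_utf8("1903\u20131982", 6):
    print(len(chunk), chunk.decode("utf-8"))
```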
WikiTree isn't some old pre-Unicode app. WikiTree is fairly new.
And the issue isn't WikiTree input of pre-Unicode GEDCOM, but WikiTree output of UTF-8 GEDCOM.

After some more thought and reading, I have to say I disagree with you.

The WikiTree export isn't a bug at all. It's true that it produces invalid UTF-8, making the resulting GEDCOM a binary file instead of a text file. But once all the CONC-tagged lines are reconstituted, the data becomes valid UTF-8 text again.

You even detail that the specification never mentions that gedcom files are actual text files. So that seems to imply that binary data files are okay.

Even as a binary file, WikiTree still exports a valid gedcom file, that will re-import with no data loss or corruption (though I'm unaware how it fares with current validators). The way Chris breaks the lines, even with splitting multi-byte characters across lines, is perfectly valid -- though difficult to work with, outside of importing.

+2 votes

Wow, I wasn't convinced there was a problem, but I did some testing and discovered that you are correct. And as Tamura says, what you describe would indeed be a WikiTree GEDCOM export bug.

In my own experiments today, I noticed that the CONT and CONC lines are strictly cut off at 78 characters without regard to word boundaries. The exported GEDCOM sets CHAR to UTF-8, and many of the profiles include multi-byte characters. Initially, none of my exports hit this problem, though a few of the lines came really close; I was just randomly lucky. It turns out the lines are also cut off without regard to multi-byte character boundaries.

The 'file' command will show whether it's a problem or not:

$ file -bk --mime-encoding goodexport.ged
utf-8

$ file -bk --mime-encoding badexport.ged
unknown-8bit
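Where the 'file' command is not available, a similar yes/no check can be sketched in Python. This is a stand-in, not a reimplementation of 'file', and the function name is made up.

```python
def is_valid_utf8_file(path: str) -> bool:
    """Return True if the whole file decodes as UTF-8."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Example call, using the file names from the commands above:
#   is_valid_utf8_file("goodexport.ged")
#   is_valid_utf8_file("badexport.ged")
```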

This perl command will help identify the line number, and the character position on that line, where the non-ASCII character occurs:

$ perl -ne '/^([\x00-\x7f\xa0-\xff]*)(.*)$/;print "$.:".($-[2]+1).":$_" if length($2)'  mygedcomfile.ged
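A stricter variant can be sketched in Python by letting the decoder report where each line stops being valid UTF-8. Unlike the regex, this flags only broken sequences, not every non-ASCII character. The function name and the sample bytes are made up for illustration.

```python
def find_invalid_utf8(data: bytes):
    """Yield (line_number, column, line) for each line that is not valid UTF-8.

    line_number and column are 1-based; column counts bytes, not characters.
    """
    for number, line in enumerate(data.splitlines(), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as error:
            # error.start is the 0-based offset of the first offending byte.
            yield number, error.start + 1, line


sample = b"0 HEAD\n1 NOTE 1903\xe2\n2 CONC \x80\x931982\n"
for number, column, line in find_invalid_utf8(sample):
    print(f"{number}:{column}:{line!r}")
```

A character split across two lines shows up twice: once at the end of the first line and once right after the CONC prefix of the next.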

So I found one of my profiles that was really close, and added just enough characters to the line (the "123" in the lines shown below, originally highlighted in blue) to cause the GEDCOM export to split the multi-byte character as you describe.

See https://www.wikitree.com/wiki/Brownlow-71 (the second, and current, test edit). It splits the 3-byte en dash (U+2013, bytes E2 80 93) across the lines.

Affected line(s) as displayed in emacs text editor:

2 CONC an]], son-in-law)<ref name="death">Texas, Death Certificates,123 1903\342
2 CONC \200\2231982</ref>
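The octal escapes in that display can be decoded by hand or with a few lines of Python. A small sketch, using only the three values shown above:

```python
# Octal escapes copied from the emacs display above.
escapes = ["342", "200", "223"]

# Each escape is one byte written in base 8.
raw = bytes(int(value, 8) for value in escapes)
print(raw.hex(" "))  # e2 80 93

# Joined together, the three bytes decode to a single code point.
char = raw.decode("utf-8")
print(f"U+{ord(char):04X}")  # U+2013
```

Unicode names U+2013 EN DASH, which fits its use in the year range 1903 to 1982.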

So, it is possible to find and correct the offending profiles using this workaround, though it's a bit tedious having to do multiple exports and tests.

by Dennis Wheeler G2G6 Pilot (535k points)
edited by Dennis Wheeler
I am working on a Lua script for sanitizing GEDCOMs before importing them into LifeLines for producing custom-formatted output (it also needs to remove some non-standard extensions; LifeLines is rather pedantic about correctness). I'll just add this ability to it for the nonce. Fortunately Lua 5.3 provides some rudimentary support for UTF-8, including a validity check that gives the length up to where the UTF-8 encoding is OK.

99% of the UTF-8 characters encountered here aren't what one would expect. They aren't accented characters from non-English languages and such (though your experiences may vary). Instead, they are em-dashes, bullets, footnote up arrows, ellipses, and the !@#$% Microsoft curly quotes -- most coming from copy-pasting from the internet.

It's doubtful that you'll be able to convert the Microsoft special characters to anything else useful using 'iconv'. Keeping it UTF-8 is your best bet. (But you'll have to fix the broken UTF-8 characters that WikiTree produces.)
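Whether such a conversion can work depends on the target code set. As a rough check, the sketch below asks Python's codecs (not 'iconv' itself, whose behaviour may differ by platform and options) which of two single-byte encodings can represent a handful of these characters. The sample list and the helper name are my own choices.

```python
import unicodedata

# Punctuation of the kind described above, as typically pasted from web pages.
samples = "\u2013\u2014\u2022\u2026\u2018\u2019\u201c\u201d"


def can_encode(char: str, codec: str) -> bool:
    """Return True if `codec` has a byte representation for `char`."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for char in samples:
    print(
        f"U+{ord(char):04X} {unicodedata.name(char):<28}"
        f" latin-1={can_encode(char, 'latin-1')}"
        f" cp1252={can_encode(char, 'cp1252')}"
    )
```

In this check, none of the samples fits in Latin-1, while all of them have a slot in Windows-1252, so the choice of target matters.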

+2 votes

Ok, until Chris has a chance to fix this bug, here's a Perl script you can use to repair your WikiTree exported gedcom. It makes a big assumption that there are no more than two consecutive lines that have split a utf-8 character between them.

https://www.eskimo.com/~wheelers/files/repairWTgedexport.pl

It has three options:

  1. Check to see if your exported gedcom has any issues or not.
  2. View the affected lines (with line numbers).
  3. Fix it.

Let me know if you find any issues with it. (or if I need to add support for three or more consecutive lines)
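For readers who prefer not to run Perl, the same idea can be sketched in Python. This is not a port of the script above, whose internals are not shown here; it is an independent illustration that makes the same assumption (a character is split across at most two consecutive lines) and uses made-up names.

```python
import re

# A continuation line: level number, the CONC tag, one space, then the text.
CONC_PREFIX = re.compile(rb"^\d+ CONC ")


def decodes(chunk: bytes) -> bool:
    """Return True if chunk is valid UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def repair_split_characters(data: bytes) -> bytes:
    """Rejoin lines where a UTF-8 character was split before a CONC line."""
    repaired = []
    for line in data.splitlines(keepends=True):
        if repaired:
            body = repaired[-1].rstrip(b"\r\n")
            prefix = CONC_PREFIX.match(line)
            if prefix and not decodes(body):
                # Drop the line break and the CONC prefix, keep the text.
                merged = body + line[prefix.end():]
                # Only accept the merge if it actually heals the character.
                if decodes(merged.rstrip(b"\r\n")):
                    repaired[-1] = merged
                    continue
        repaired.append(line)
    return b"".join(repaired)


broken = b"0 HEAD\r\n1 NOTE 1903\xe2\r\n2 CONC \x80\x931982\r\n0 TRLR\r\n"
print(repair_split_characters(broken).decode("utf-8"))
```

Lines that stay invalid after a merge are left untouched, so a character split across three lines would still need manual attention.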

by Dennis Wheeler G2G6 Pilot (535k points)

Dennis, you are helpful far beyond the call of duty. This is one of the things that is so grand in WikiTree: there is a general attitude of really wanting to get things done properly and being willing and eager to get involved.

But (heck, how can I put this without coming over as a d*ckhead?) I learnt to program in 1965 and I'd class myself as semi-professional. I happen to do my scripting in Lua rather than Perl, Ruby, Python etc, but that's just the hammer principle taking effect. Debugging my own scripts ;-), testing those of others :-(, I hope you can understand why I would prefer to pass on this one?

 

 

Well, sure... write your own scripts if you'd like. No worries from me. But it's here for anyone else who maybe doesn't quite have the skills to write their own :)

I learned programming much later in life, after a whole other career, though I tried initially when I was in school. But I kept shuffling the punch cards :/
