Is it OK that GEDCOM export splits UTF-8 characters?

Question

I recently tried to convert a GEDCOM produced by WikiTree to another codeset using the utility `iconv`. I was shocked to be told that the GEDCOM file is not valid UTF-8. Reason: a multibyte UTF-8 character appearing in a biography was split with the first half at the end of a line and the second half at the start of the following CONC line.

Is this considered to be within the specification of a GEDCOM file that has CHAR UTF-8, or tolerable because genealogy programs will concatenate the parts and other programs are not supposed to be reading the file, or simply a bug?

Update: The answers make it clear that this is a bug, so the question now has that tag too.

asked Feb 10, 2017 in WikiTree Tech by Dirk Laurie G2G6 Mach 3 (39.4k points)
edited Apr 5, 2017 by Dirk Laurie

3 Answers

Best answer

What you describe isn't valid UTF-8; you cannot have one or more characters (the End of Line) inside another character. That makes no sense.
You identified a WikiTree GEDCOM export defect that WikiTree needs to fix.

Meanwhile, you can, as a workaround, move the two halves of the character together. If the utility accepts long lines, all you have to do is remove the EOL inside the character. It is annoying to have to do that, but it is doable, as you're not likely to have many instances of this export defect.

On a practical note, a problem you will run into when trying to do that is that the ostensible WikiTree GEDCOM file is not even a valid text file.
You may want to use a hex editor, as regular text editors may either refuse to read it or get seriously confused.
Editors are likely to misrecognise the character encoding - and in this case, that is what you want, because the editor cannot read the file as UTF-8, as it isn't UTF-8.

Read the ostensible WikiTree GEDCOM as Windows ANSI, then remove that EOL, and save the file. Yes, you'll read and save the file as Windows ANSI; as long as you don't change anything else, it will, ahem, magically, be a correct UTF-8 GEDCOM file now.
Obviously not recommended as a general technique, but should fix this problem.
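For anyone who would rather script the workaround than do it by hand, here is a minimal Python sketch that works on the raw bytes, so it doesn't matter what an editor thinks the encoding is. The function names are made up for illustration, and it assumes the split always falls between a line and the CONC line that follows it. It moves the stray continuation bytes back onto the previous line, which makes that line a few bytes longer.

```python
import re

CONC_PREFIX = re.compile(rb"\d+ CONC ")  # e.g. b"2 CONC "


def missing_bytes(body: bytes) -> int:
    """Return how many continuation bytes the last character of body still needs."""
    i, seen = len(body), 0
    # Step back over up to three trailing continuation bytes (10xxxxxx).
    while i > 0 and seen < 3 and (body[i - 1] & 0xC0) == 0x80:
        i -= 1
        seen += 1
    if i == 0:
        return 0
    lead = body[i - 1]
    if lead >= 0xF0:
        need = 3
    elif lead >= 0xE0:
        need = 2
    elif lead >= 0xC0:
        need = 1
    else:
        return 0  # ASCII or stray continuation byte: nothing to complete
    return max(need - seen, 0)


def repair_split_characters(data: bytes) -> bytes:
    """Move continuation bytes from the start of a CONC line back to the line before."""
    lines = data.splitlines(keepends=True)
    for k in range(len(lines) - 1):
        body = lines[k].rstrip(b"\r\n")
        eol = lines[k][len(body):]
        missing = missing_bytes(body)
        match = CONC_PREFIX.match(lines[k + 1])
        if not missing or not match:
            continue
        prefix = match.group(0)
        rest = lines[k + 1][len(prefix):]
        tail = rest[:missing]
        # Only touch the file if the next line really starts with the missing bytes.
        if len(tail) == missing and all((b & 0xC0) == 0x80 for b in tail):
            lines[k] = body + tail + eol
            lines[k + 1] = prefix + rest[missing:]
    return b"".join(lines)
```

Read the file with `open(path, "rb")`, pass the bytes through `repair_split_characters`, and write the result back in binary mode. Everything else in the file is left untouched.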

update 2017-02-16: expanded answer into an article, with more details & explanation, a running example, illustration of the workaround and related links: http://www.tamurajones.net/AWikiTreeGEDCOMProblem.xhtml

answered Feb 10, 2017 by Tamura Jones G2G Crew (680 points)
edited Feb 17, 2017 by Tamura Jones

Of course the point of the line length limit is to size a buffer. 1-byte characters - 256 bytes. 2-byte characters - 512 bytes. Multibyte characters - not anticipated.

Unicode maps semantic characters to abstract code points, 0 to 0x10FFFF (1,114,111). Ideally, GEDCOM specifies a code-point-stream. Then you map that to a byte stream in whatever format your pigeons carry, and if necessary, re-encode when changing pigeons - none of which would be of any concern to the application (nor would the byte-stream encoders need to know about GEDCOM).

Trouble is, the application then only knows the line length in characters. The problem UTF-8 doesn't solve is the existence of too much old software using scanf-type input routines to read lines into fixed-length buffers.

The only way round that is to do the byte-encoding in the wrong place, so as to control the line length in bytes.

Which is where it all gets very messy.

commented Feb 11, 2017 by Living Horace G2G6 Pilot (633k points)

After some more thought and reading, I have to say I disagree with you.

The WikiTree export isn't a bug at all. It's true that it produces invalid UTF-8, making the resulting GEDCOM a binary file instead of a text file. But once all the CONC tagged lines are reconstituted, the data becomes valid UTF-8 text again.

You even detail that the specification never mentions that gedcom files are actual text files. So that seems to imply that binary data files are okay.

Even as a binary file, WikiTree still exports a valid gedcom file, that will re-import with no data loss or corruption (though I'm unaware how it fares with current validators). The way Chris breaks the lines, even with splitting multi-byte characters across lines, is perfectly valid -- though difficult to work with, outside of importing.
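To illustrate the "reconstitute first, decode afterwards" argument, here is a rough Python sketch of a reader that joins CONC payloads as bytes and only then decodes them. It is a simplification for illustration, not how WikiTree or any particular importer works: `join_conc` is a made-up name, CONT lines are not treated specially, and records with cross-reference ids are not parsed properly.

```python
import re

# level, tag, optional payload; a deliberately simplified view of a GEDCOM line
LINE = re.compile(rb"(\d+) (\S+) ?(.*)")


def join_conc(data: bytes) -> list[str]:
    """Append each CONC payload to the preceding value as bytes, then decode."""
    values: list[bytearray] = []
    for raw in data.splitlines():
        match = LINE.fullmatch(raw)
        if not match:
            continue
        _level, tag, payload = match.groups()
        if tag == b"CONC" and values:
            values[-1] += payload  # byte-level concatenation, no decoding yet
        else:
            values.append(bytearray(payload))
    # Decoding happens only after the pieces are back together.
    return [bytes(v).decode("utf-8") for v in values]
```

A reader built like this never notices the split character. A reader that decodes each line before joining will fail, or substitute replacement characters, on the same file.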

commented Feb 13, 2017 by Dennis Wheeler G2G6 Pilot (575k points)

Answer 1 · 2017-02-10T23:44:59+0000

Wow, I wasn't convinced there was a problem, but I did some testing and discovered that you are correct. And as Tamura says, what you describe would indeed be a WikiTree GEDCOM export bug.

In my own experiments today, I noticed that the CONT and CONC lines are strictly cut off at 78 characters without regard to word boundaries. The exported GEDCOM sets the CHAR to UTF-8, and many of the profiles do include many multi-byte characters. But initially, none of my exports encountered this problem, though a few of the lines were really close. I was just randomly lucky. Turns out, the lines are also cut off without regard to multi-byte character boundaries.
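For comparison, here is a small Python sketch of how an exporter could keep to a byte budget and still break only between characters. This is an illustration only, not WikiTree's code; the function name and the `max_bytes` parameter are assumptions, and it ignores other GEDCOM concerns such as where spaces fall at a break.

```python
def split_on_char_boundaries(text: str, max_bytes: int) -> list[bytes]:
    """Split text into UTF-8 chunks of at most max_bytes, never inside a character."""
    chunks: list[bytes] = []
    current = bytearray()
    for ch in text:
        encoded = ch.encode("utf-8")
        # Start a new chunk if this whole character would not fit.
        if current and len(current) + len(encoded) > max_bytes:
            chunks.append(bytes(current))
            current = bytearray()
        current += encoded
    if current:
        chunks.append(bytes(current))
    return chunks
```

Each chunk decodes as UTF-8 on its own, and joining the chunks gives back the original bytes, so the file stays a valid text file line by line.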

The 'file' command will show whether it's a problem or not:

$ file -bk --mime-encoding goodexport.ged
utf-8

$ file -bk --mime-encoding badexport.ged
unknown-8bit

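This perl command will help identify the line number and the position on the line where a suspect non-ASCII byte occurs (it reports the first byte in the range 0x80-0x9F on each line, so it can also flag lines that are perfectly valid):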

$ perl -ne '/^([\x00-\x7f\xa0-\xff]*)(.*)$/;print "$.:".($-[2]+1).":$_" if length($2)' mygedcomfile.ged
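If the perl one-liner reports too many lines, a stricter check is to try decoding each line as UTF-8 and report only the lines that fail. A minimal Python sketch of that idea follows; `find_invalid_utf8` is a made-up name, and the column it reports is a 1-based byte offset, not a character count.

```python
def find_invalid_utf8(data: bytes):
    """Yield (line_number, byte_column, reason) for each line that is not valid UTF-8."""
    for lineno, line in enumerate(data.splitlines(), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the 0-based offset of the first offending byte.
            yield lineno, err.start + 1, err.reason


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        for lineno, column, reason in find_invalid_utf8(handle.read()):
            print(f"{lineno}:{column}: {reason}")
```

A split character normally shows up as two consecutive hits: the line that ends with the incomplete character, and the CONC line that starts with the leftover bytes.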

So I found one of my profiles that was really close, and added (in blue below) just enough characters to the line to cause the GEDCOM export to split the multi-byte character as you describe.

See https://www.wikitree.com/wiki/Brownlow-71 (the second, and current, test edit). It splits the 3-byte en dash (U+2013, bytes E2 80 93) across the lines.

Affected line(s) as displayed in the Emacs text editor:

2 CONC an]], son-in-law)<ref name="death">Texas, Death Certificates,123 1903\342
2 CONC \200\2231982</ref>

So, it is possible to find and correct the offending profiles using this workaround, though it's a bit tedious having to do multiple exports and testing.
