Should weird and wonderful UTF-8 characters appear in database names?

Unicode has liberated the computers of the world from the need to know what 8-bit language-specific character set they are dealing with, by introducing a single character set, with room for over a million code points, containing all known alphabets. UTF-8 is a ubiquitous rendering of Unicode that happens to look like ANSI to programs that can read GEDCOM. WikiTree uses UTF-8 in its database, although funny things happen when you make web links from names that contain such characters.
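For anyone curious what UTF-8 actually does with a name like this, here is a minimal Python sketch. It assumes the ű is stored as the single precomposed code point U+0171, and the variable names are only for illustration. The percent-encoded form at the end is one plausible source of the funny things in web links; it is not a statement of how WikiTree builds its URLs.

```python
from urllib.parse import quote

plain = "Dupree"
name = "D\u0171pree"  # "Dűpree", with ű written as the escape for U+0171

# A name made only of unaccented Latin letters has the same bytes in ASCII and UTF-8.
print(plain.encode("utf-8") == plain.encode("ascii"))  # True

# The accented letter takes two bytes, so 6 characters become 7 bytes.
encoded = name.encode("utf-8")
print(len(name), len(encoded))  # 6 7
print(encoded)                  # b'D\xc5\xb1pree'

# In a URL those two bytes are percent-encoded.
print(quote(name))              # D%C5%B1pree
```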

Unicode and UTF-8 have also made it easy to represent letters from different alphabets in the same document, and even in the same word.

Spare a thought for Lűcas Albertűs Jacobús (Lucas) du Preez formerly Dűpree. He was born in 1841 in what was then the Republic of Winburg. Names were not commonly spelt the way they are now, and Dupree was quite a common spelling then of the name more commonly spelt du Preez now. WikiTree has many Duprees.

Lűcas, however, is only WikiTree's second Dűpree, the first being his sister Judit, from the same page of a baptismal register. (He may well be its first Lűcas, but that is harder to check.) A likely reason for that is that the second letter of Dűpree is commonly used only in Hungarian.

So how come the name? Well, anybody consulting the photo of the baptism register can see that the scribe wrote it that way. Just zoom it up enough and you can see two slanted lines above the u. As a visual rendering of what is actually on the picture it is reasonably accurate.

But is there any chance that a scribe in remote Winburg would have intended to write Hungarian u's? Is it not more likely that he made use of a clerk's habit to put some kind of tick above a u in order to distinguish it from an n? Almost all the u's on that page are so marked. (In fact, Judit is lucky not to have been entered as Jűdit; just zoom it up a little more.)

In a project where it is routine practice to PPP a profile if the spelling of the LNAB is nonstandard, should there not be a few commonsense guidelines about accents? Like, for example, "If the name is an Afrikaans/French/German/Swedish/Polish one, only use accents that occur in Afrikaans/French/German/Swedish/Polish?" or "Take into account the way the name was commonly spelt in that community at the time of birth."

WikiTree profile: Lucas Albertus Jacobus du Preez

asked Mar 17, 2017 in Policy and Style by Dirk Laurie G2G6 Mach 3 (39.0k points)
retagged Apr 10, 2017 by Abby Glann

3 Answers

Best answer

Hi Dirk,

The reason why we have a verbatim approach to the spelling of the LNAB is so that different researchers from different places will always reach the same profile from the same resources. If it was a clerk error the correct spelling of the name the person used is inserted into the preferred name and current last name fields. That way they will be searchable and can be found.

You can see that if you start adding rules or interpretations to the way it was written, then those same researchers with the same resources could end up on different profiles, because how will we ever know whether each of them is using the same rules or interpretations?

answered Mar 17, 2017 by Louis Heyman G2G6 Mach 9 (92.9k points)
selected Mar 18, 2017 by Philip van der Walt

Louis, I am itching to use phrases like "defying common sense" and "enshrining ignorance" but since that would not be polite I will not do so.

I recognize that second paragraph of yours as standard CoGH dogma that I have heard before from at least three different sources. I consider it, weighing my words carefully, to be fallacious, mechanical and self-defeating.

That piece of dogma seems to have a very low opinion of researchers: they can reproduce faithfully what they see, but they cannot reason intelligently. I'd like to think that someone who does genealogical research has in fact seen quite a few FamilySearch images of original documents and developed a certain degree of nous.

I agree that name variations should be respected, and in this case I would not dream of suggesting that the LNAB should be du Preez. But would it spoil some vast eternal plan if we could just agree on Dupree? The simple rule being, don't use accents unless they are enshrined by custom or indicate pronunciation. Or more simply, don't allow one clerk or one transcriber to invent new ways of accenting a name. By all means distinguish between Kotze and Kotzé, though. People are touchy on that one.

I can't agree that UTF-8 specials is the way to go. Must your transcribers now scrutinize a website like FileFormat in search of more and more exact ways of rendering the precise appearance of handwritten text? Suppose someone has the habit of using greek e's. Will we soon have people with LNAB van Rεεnεn? If you are really consistent in this "verbatim", many of your Bassons and Vissers should actually be Baſson or Viſser.

If you look at that whole page, you will notice that almost all the letters u have some kind of swish over them, and that there is no great consistency over exactly what it is. Other old texts, e.g. the Allert van Zijl family bible, are like that too. That is the way that clerks wrote a simple, plain u in those days. It can in no way be compared to an Italian ú or a German ü or a Hungarian ű, all of which serve to indicate actual differences in pronunciation.

A major reason for recognizable, searchable LNABs is that WikiTree gets a chance to suggest a match when you create a new profile. This does not happen when the characters in the name are esoteric symbols from a language not even related to the one spoken by the subject or the researchers. I just tried to create a profile, without saving it of course, for "Lucas Dupree", born 1841, died 1900, and was told No close matches were found. Proceed to step 3.

The slope you are now on will also bring in Dúpree and Dùpree and Dũpree and Düpree, each consisting of one or two individuals. Do you honestly, really want that?
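To make the matching problem concrete, here is a small Python sketch of accent folding. `fold_accents` is a hypothetical helper, not anything WikiTree is known to use; it relies only on the standard `unicodedata` module. The idea is to decompose each letter into base letter plus combining marks and then drop the marks, so that all the variants end up with one search key.

```python
import unicodedata


def fold_accents(name: str) -> str:
    """Return a search key with combining marks removed (hypothetical helper)."""
    # NFKD splits an accented letter into its base letter plus combining marks,
    # and also maps compatibility characters such as the long s to plain s.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop every combining mark, keep everything else unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


variants = ["Dűpree", "Dúpree", "Dùpree", "Dũpree", "Düpree", "Dupree"]
keys = {fold_accents(v) for v in variants}
print(keys)  # {'Dupree'}
```

Under the same folding Baſson becomes Basson, while van Rεεnεn is left alone, because a Greek ε is a different letter and not an accented e. Kotzé would also fold to Kotze, so a key like this is only suitable for suggesting matches, not for deciding what the LNAB should be.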

commented Mar 17, 2017 by Dirk Laurie G2G6 Mach 3 (39.0k points)

A simple acceptable standard is: ignore all accents unless there is a very strong reason to put them in. That gives a contemporary LNAB "the way they would have spelt it".

Imagine that you are a compositor working for the local paper in 1841. You have received copy with details of recently born babies destined for the births column. How would you typeset them?

You have a box of movable type. Putting in an accent is possible (it is a separate sliver of lead), but it is extra work. You will avoid it whenever possible. Basically Mr Kotzé needs to have kicked up a fuss because you omitted it last time. And even then, your available accents would be those used in 19th-century Dutch: grave and acute single accents, diaeresis, circumflex.

Exactly those accents, too, would be the ones available to you on a South African typewriter of the 1960s (except mine had 'n as a single character, which Unicode has only as the deprecated U+0149). So you could also visualize yourself typing it up on a wax sheet for the church bulletin.

Note that transcription and maintaining a database are very different tasks. It may well be desirable for a transcription to be visually as close as possible to the original look, particularly when the image itself is not available, but for a database the goal is to maximize the chance that your search engine will find the record.
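One way to serve both tasks, sketched in Python under the assumption that the database can keep a second, derived field next to the verbatim name. `NameIndex` and `search_key` are hypothetical names for illustration and say nothing about how WikiTree actually stores or searches names.

```python
import unicodedata
from collections import defaultdict


def search_key(name: str) -> str:
    """Derive a lookup key: marks stripped, case ignored (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


class NameIndex:
    """Keeps each name exactly as transcribed, but looks it up by its folded key."""

    def __init__(self) -> None:
        self._by_key = defaultdict(list)

    def add(self, verbatim: str) -> None:
        # The verbatim spelling is stored untouched; only the key is folded.
        self._by_key[search_key(verbatim)].append(verbatim)

    def find(self, query: str) -> list:
        return list(self._by_key.get(search_key(query), []))


index = NameIndex()
for transcribed in ["Dűpree", "Dupree", "Kotzé"]:
    index.add(transcribed)

print(index.find("dupree"))  # ['Dűpree', 'Dupree']
```

The transcription stays as close to the manuscript as the transcriber likes, and the search still finds it from a plain query.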

commented Mar 17, 2017 by Dirk Laurie G2G6 Mach 3 (39.0k points)

I don't see any need for adjusting that. It is well formulated, provided that "spelling" means "expressing in terms of the letters of the available alphabet".

The point that I have been trying to make is that the Dutch alphabet does not contain accented letters except in loanwords. What is being transcribed as accents on a u are in fact mere marks aimed at making it easier to distinguish from an n. By omitting them you are not deviating from the spelling; on the contrary, you are respecting the spelling. Dutch surnames simply do not have accented u's all over the place. The situation does not differ in any important respect from that of the long s, which has already been mentioned.

So this is not an exception at all. This is a pure manuscript convention. The "other documents from at or near the time of birth" could be newspapers, books, anything printed. You will find names of German origin with umlauts on them, even in Dutch sources, but not profligate accents.

In the family bible I mentioned, it is hard to find any unmarked u. Even the month Augustus is written Aŭgŭstŭs just about everywhere.

commented Mar 17, 2017 by Dirk Laurie G2G6 Mach 3 (39.0k points)

Answer 1 · 2017-03-18T01:41:47+0000

Hi Dirk,

There are of course several approaches for the LNAB, and different projects can use different things. We can now choose one that makes everyone happy, and this is also why we started the SAR Project G2G of course.

I have added your points and questions to the project G2G now as well. If we all talk about things related and important for the project there, we will have just one place where we can all find what was discussed and decided for the project. (Sometimes it's not so easy to trace or find a G2G, so to prevent a lot of searching I think it's easier to take it there.)

And perhaps we should add a short list there of things we should discuss, to make sure everyone is happy with the guidelines and the way we (SAR Project) are going to work?

answered Mar 18, 2017 by Bea Wijma G2G6 Pilot (307k points)

No Dirk, we don't want you to 'shut up and sulk in your own little tree house'. It's great to have someone like you, who is so enthusiastic and caring about the project and who raises issues like this. I do understand that it was perhaps not really clear that this is what the project G2G was meant for, so perhaps we should improve this as well and add a link on the project page directly to the most recent G2G, with a little explanation of what it is for, eh?

It's a project for all members with South African roots and members interested in South Africa, so it's very important that we discuss and decide about these things now, and make sure that all members (or at least as many as possible), not just a few, will be happy with the guidelines and the way we are all going to work on profiles that are part of the project. We should also make sure we can all trace back when and where the project guidelines were discussed and decided, so that's why it's handier and more important to have it all in one Project G2G.

commented Mar 18, 2017 by Bea Wijma G2G6 Pilot (307k points)
edited Mar 18, 2017 by Bea Wijma

Answer 2 · 2017-03-22T17:32:44+0000

The discussion has moved to the de facto SA Roots page, but as my last contribution here I would like to share the link Gekaapte Brieven, which deals with the concept of diplomatic transcription of handwritten Dutch letters from the 17th and 18th centuries. The question of diacritic signs above a u, writing out of abbreviations, etc. is specifically addressed there. That source agrees with Helen :-)
