Should weird and wonderful UTF-8 characters appear in database names?

+2 votes

Unicode has liberated the computers of the world from the need to know what 8-bit language-specific character set they are dealing with, by introducing a 32-bit character set containing all known alphabets. UTF-8 is a ubiquitous rendering of Unicode that happens to look like ANSI to programs that can read GEDCOM. WikiTree uses UTF-8 in its database, although funny things happen when you make web links from names that contain such characters.
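To make the UTF-8 point concrete, here is a minimal Python sketch using only the standard library. It shows how the name discussed in this post is stored as bytes and what a web link does to it. It is only an illustration, not a description of how WikiTree builds its links.

```python
from urllib.parse import quote

name = "Dűpree"
encoded = name.encode("utf-8")

# Plain ASCII letters take one byte each; ű (U+0171) takes two.
print(len(name), len(encoded))   # 6 7
print(encoded)                   # b'D\xc5\xb1pree'

# In a URL the non-ASCII bytes have to be percent-encoded.
print(quote(name))               # D%C5%B1pree
```

The percent-encoded form is one plausible source of the "funny things" in links: the name is still there, but it no longer looks like itself.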

Unicode and UTF-8 have also made it easy to represent letters from different alphabets in the same document, and even in the same word.

Spare a thought for Lűcas Albertűs Jacobús (Lucas) du Preez formerly Dűpree. He was born in 1841 in what was then the Republic of Winburg. Names were not commonly spelt the way they are now, and Dupree was quite a common spelling then of the name more commonly spelt du Preez now. WikiTree has many Duprees.

Lűcas, however, is only WikiTree's second Dűpree, the first being his sister Judit, from the same page of a baptismal register. (He may well be its first Lűcas, but that is harder to check.) A likely reason for that is that the second letter of Dűpree is commonly used only in Hungarian.

So how come the name? Well, anybody consulting the photo of the baptism register can see that the scribe wrote it that way. Just zoom it up enough and you can see two slanted lines above the u. As a visual rendering of what is actually on the picture it is reasonably accurate.

But is there any chance that a scribe in remote Winburg would have intended to write Hungarian u's? Is it not more likely that he made use of a clerk's habit to put some kind of tick above a u in order to distinguish it from an n? Almost all the u's on that page are so marked. (In fact, Judit is lucky not to have been entered as Jűdit; just zoom it up a little more.)

In a project where it is routine practice to PPP a profile if the spelling of the LNAB is nonstandard, should there not be a few commonsense guidelines about accents? Like, for example, "If the name is an Afrikaans/French/German/Swedish/Polish one, only use accents that occur in Afrikaans/French/German/Swedish/Polish?" or "Take into account the way the name was commonly spelt in that community at the time of birth."

in Policy and Style by Dirk Laurie G2G6 Mach 3 (33.7k points)
retagged by Abby Glann

3 Answers

+3 votes
Best answer
Hi Dirk,

The reason why we have a verbatim approach to the spelling of the LNAB is so that different researchers from different places will always reach the same profile from the same resources. If it was a clerical error, the correct spelling of the name the person used is inserted into the preferred name and current last name fields. That way they will be searchable and can be found.

You can see that if you start adding rules or interpretations to the way it was written, then those same researchers with the same resources could end up on different profiles, because how will we ever know whether each of them is using the same rules or interpretations?
by Louis Heyman G2G6 Mach 4 (41.8k points)
selected by Philip van der Walt
I could add that in older transcripts where a different alphabet was used, it is accepted to use the modern alphabet when doing the transcript. For example, long S and short s become ss.

Louis, I am itching to use phrases like "defying common sense" and "enshrining ignorance" but since that would not be polite I will not do so.

I recognize that second paragraph of yours as standard CoGH dogma that I have heard before from at least three different sources. I consider it, weighing my words carefully, to be fallacious, mechanical and self-defeating.

That piece of dogma seems to have a very low opinion of researchers: they can reproduce faithfully what they see, but they cannot reason intelligently. I'd like to think that someone who does genealogical research has in fact seen quite a few FamilySearch images of original documents and developed a certain degree of nous.

I agree that name variations should be respected, and in this case I would not dream of suggesting that the LNAB should be du Preez. But would it spoil some vast eternal plan if we could just agree on Dupree? The simple rule being, don't use accents unless they are enshrined by custom or indicate pronunciation. Or more simply, don't allow one clerk or one transcriber to invent new ways of accenting a name. By all means distinguish between Kotze and Kotzé, though. People are touchy on that one.

I can't agree that UTF-8 specials are the way to go. Must your transcribers now scrutinize a website like FileFormat in search of more and more exact ways of rendering the precise appearance of handwritten text? Suppose someone has the habit of using Greek e's. Will we soon have people with LNAB van Rεεnεn? If you are really consistent in this "verbatim" approach, many of your Bassons and Vissers should actually be Baſson or Viſser.
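For what it is worth, Unicode itself treats these two cases differently. The long s carries a compatibility mapping to an ordinary s, whereas a Greek epsilon is simply a different letter with no mapping to e. A small Python sketch, assuming the standard `unicodedata` module (the helper name `nfkc` is just for illustration):

```python
import unicodedata

def nfkc(text: str) -> str:
    """Apply Unicode compatibility normalization (form NFKC)."""
    return unicodedata.normalize("NFKC", text)

# The long s (U+017F) is folded to a plain s.
print(nfkc("Baſson"))       # Basson
print(nfkc("Viſser"))       # Visser

# Greek epsilon (U+03B5) has no mapping to Latin e and is left alone.
print(nfkc("van Rεεnεn"))   # van Rεεnεn
```

So a database could absorb the long s mechanically, but lookalike letters borrowed from another alphabet would stay distinct unless someone decides otherwise.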

If you look at that whole page, you will notice that almost all the letters u have some kind of swish over them, and that there is no great consistency over exactly what it is. Other old texts, e.g. the Allert van Zijl family bible, are like that too. That is the way that clerks wrote a simple, plain u in those days. It can in no way be compared to an Italian ú or a German ü or a Hungarian ű, all of which serve to indicate actual differences in pronunciation.

A major reason for recognizable, searchable LNABs is that WikiTree gets a chance to suggest a match when you create a new profile. This does not happen when the characters in the name are esoteric symbols from a language not even related to the one spoken by the subject or the researchers. I just tried to create a profile, without saving it of course, for "Lucas Dupree", born 1841, died 1900, and was told No close matches were found. Proceed to step 3.

The slope you are now on will also bring in Dúpree and Dùpree and Dũpree and Düpree, each consisting of one or two individuals. Do you honestly, really want that?
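As a rough illustration of the matching problem, here is a minimal Python sketch of an accent-insensitive search key. The function `search_key` and the little index are hypothetical; nothing here is claimed to be how WikiTree's matching actually works. The idea is to decompose each letter into base letter plus combining marks, drop the marks, and compare what is left.

```python
import unicodedata
from collections import defaultdict

def search_key(name: str) -> str:
    """Fold a name to a key without accents and without case."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks (accents); keep base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

variants = ["Dupree", "Dúpree", "Dùpree", "Dũpree", "Düpree", "Dűpree"]

# Group every spelling under its folded key.
index = defaultdict(list)
for variant in variants:
    index[search_key(variant)].append(variant)

print(dict(index))
# {'dupree': ['Dupree', 'Dúpree', 'Dùpree', 'Dũpree', 'Düpree', 'Dűpree']}
```

With a key like this, a search for "Lucas Dupree" could at least be offered the accented profiles as candidates, whatever spelling is displayed on the profile itself.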



I was typing my reply, did not see the addition to your message, and in fact included examples involving the double-s. So that one rule at least is within the grasp of those researchers? Good!
Hi Dirk,

It would seem that you never needed an answer to your question.

I want the topic debated, and G2G encourages one to phrase the topic as a question. If the only answer I am going to get boils down to "that's the way we do it on this project, and basta", I shall be bitterly disappointed.

Please Dirk, don't get me wrong. You are obviously knowledgeable and experienced. Your potential contributions to WikiTree with the knowledge you have are invaluable. Now, without being degrading or trying to belittle anyone else, please consider the practical implications of working with people with less knowledge and experience over the whole of WikiTree.

Now try and come up with a different acceptable standard which will work for everyone.

A simple acceptable standard is: ignore all accents unless there is a very strong reason to put them in. That gives a contemporary LNAB "the way they would have spelt it".

Imagine that you are a compositor working for the local paper in 1841. You have received copy with details of recently born babies destined for the births column. How would you typeset them?

You have a box of movable type. Putting in an accent is possible (it is a separate sliver of lead), but it is extra work. You will avoid it whenever possible. Basically Mr Kotzé needs to have kicked up a fuss because you omitted it last time. And even then, your available accents would be those used in 19th-century Dutch: grave and acute single accents, diaeresis, circumflex.

Exactly those accents, too, would be the ones available to you on a South African typewriter of the 1960s (except mine had 'n as a single character, which even Unicode does not). So you could also visualize yourself typing it up on a wax sheet for the church bulletin.
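The compositor's box of type can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `ALLOWED_MARKS` and `keep_customary_accents` are made-up names, and the allowed set is simply the four accents listed above (grave, acute, circumflex, diaeresis). Any other mark above a letter is dropped.

```python
import unicodedata

# Combining marks for grave, acute, circumflex and diaeresis.
ALLOWED_MARKS = {"\u0300", "\u0301", "\u0302", "\u0308"}

def keep_customary_accents(name: str) -> str:
    """Keep only accents from ALLOWED_MARKS; strip all other marks."""
    decomposed = unicodedata.normalize("NFD", name)
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) or ch in ALLOWED_MARKS
    )
    # Recompose so that e + acute becomes a single é again.
    return unicodedata.normalize("NFC", kept)

print(keep_customary_accents("Kotzé"))      # Kotzé
print(keep_customary_accents("Dűpree"))     # Dupree
print(keep_customary_accents("Aŭgŭstŭs"))   # Augustus
```

Note that a filter like this would still let Dúpree through, since the acute is in the box. Deciding whether a mark is a real accent or a clerk's tick remains a matter of judgment, not of code.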

Note that transcription and maintaining a database are very different tasks. It may well be desirable for a transcription to be visually as close as possible to the original look, particularly when the image itself is not available, but for a database the goal is to maximize the chance that your search engine will find the record.

Hi Dirk, the current wording is:

"if there are any contemporary written documents, the spelling from those documents should be used. In particular, the spelling that appears in a birth record should be used for the Last Name at Birth unless there are other documents from at or near the time of birth that inform us about a more common or correct spelling."

How would you adjust that and what are the exceptions?

I don't see any need for adjusting that. It is well formulated, provided that "spelling" means "expressing in terms of the letters of the available alphabet".

The point that I have been trying to make is that the Dutch alphabet does not contain accented letters except in loanwords. What is being transcribed as accents on a u are in fact mere marks aimed at making it easier to distinguish from an n. By omitting them you are not deviating from the spelling; on the contrary, you are respecting the spelling. Dutch surnames simply do not have accented u's all over the place. The situation does not differ in any important respect from that of the long S which has already been mentioned.

So this is not an exception at all. This is a pure manuscript convention. The "other documents from at or near the time of birth" could be newspapers, books, anything printed. You will find names of German origin with umlauts on them, even in Dutch sources, but not profligate accents.

In the family bible I mentioned, it is hard to find any unmarked u. Even the month Augustus is written Aŭgŭstŭs just about everywhere.

+3 votes

Hi Dirk, 

There are of course several approaches to the LNAB, and different projects can use different things. We can now choose one that makes everyone happy, and this is also why we started the SAR Project G2G of course.

I have added your points / questions to the project G2G now as well. If we can all talk about things related and important for the project there, we will have just one place where we can all find what was discussed and decided for the project. (Sometimes it's not so easy to trace or find a G2G, so to prevent a lot of searching I think it's easier to take it there?)

And perhaps we should add a short list there of things we should discuss, to make sure everyone is happy with the guidelines and the way we (SAR Project) are going to work?

by Bea Wijma G2G6 Pilot (247k points)

I did not quite realize that the topic Attention Please SOUTH AFRICAN ROOTS PROJECT UPDATE :) enjoys the status of being the Project G2G. If one clicks on "Discuss" on the project template, it takes you to a page headed Recent questions tagged SOUTH_AFRICAN_ROOTS where that topic is one among several, mostly posted by me. I am clearly talking too much. Maybe I should just shut up and sulk in my own little tree house.

No Dirk, we don't want you to 'shut up and sulk in your own little tree house'. It's great to have someone like you, who is so enthusiastic and caring about the project and addressing issues like this. And indeed I understand it was perhaps not really clear that this is what the project G2G was meant for, so perhaps we should improve this as well and add a link on the project page directly to the most recent G2G, with a little explanation of what it is for, eh?

It's a project for all South African (roots) members and members interested in South Africa, so it's very important, and we need to discuss and decide about these things now and make sure that all members (or at least as many as possible), and not just a few, will be happy with the guidelines and the way we are all going to work on profiles that are part of the project. We should also make sure we can all trace back when and where the project guidelines were discussed and decided, so that's why it's handier and more important to have it all in one Project G2G.

0 votes

The discussion has moved to the de facto SA Roots page but as my last contribution here I would like to share the link Gekaapte Brieven, which deals with the concept of diplomatic transcription of handwritten Dutch letters from the 17th and 18th century. The question of diacritic signs above a u, writing out of abbreviations, etc. is specifically addressed there. That source agrees with Helen :-)

by Dirk Laurie G2G6 Mach 3 (33.7k points)

217 views asked Feb 10, 2017 in WikiTree Tech by Dirk Laurie G2G6 Mach 3 (33.7k points)