
Dirk Laurie's WikiTree Blog

Privacy Level: Public (Green)
Profile manager: Ronel Olivier

This page is for notes to myself[1] that I might want to read again. Everything in it is unfinished, but some sections are less unfinished than others. My favourite ways to indicate that a section is very unfinished are to include a subsection with only a title line or a dangling link to a nonexistent subsection.

I've written in English and made the contents Public so that you are welcome to read it too. The usual feedback channels (public comment box and private message link) are available. I use the public box myself to say that a section is considered to be not very unfinished.


The filesystem of your PC as a genealogical database

Your filesystem is ideal for storing your genealogical work.[2] It has a tree structure, timestamps, access control, visibility control and is recognized by a very large range of software, including packages that do version control. You can easily make an archive file for communication with other genealogists.

All that is required on your part is to be disciplined in how you use it, so disciplined in fact that you can teach a computer your rules in order that some tasks can be automated. I actually have some programs like that. Here is what they expect.

One person, one folder: Think of that folder as the person's profile. A person must have exactly one profile, so there must be exactly one folder.
Root individuals: The folders of a small number of root individuals are stored in a root folder.
Hierarchical structure: Everybody else's folder is a subfolder in the folder of an immediate relative: parent, child or spouse. There are three main methods for naming these: by ascent, by descent and by families.
Naming conventions: A folder can contain anything you like, but files that you want your programs to read must have names that the programs expect. You can have any number of files not named according to the conventions: the programs will simply ignore them. My conventions work with prefixes: combinations of characters at the start of a filename that tell what its role in the folder is. Extensions (everything after the last period) retain their commonly understood meaning.
Database chunks: These are single files with information that can go into a genealogical database. The extension indicates the format, e.g. .ged (GEDCOM), .json (JSON), .saf (SAG/SAF). The prefix bio is used for the "main" genealogical information.
Journalism: These are files with supporting anecdotes, research notes etc. The extension is usually one recognized by document viewers, text editors, word processors, typesetters etc, e.g. .pdf, .txt, .docx, .tex.
Downloads: FamilySearch and some other large repositories offer you a unique name for images that you download. Store the file under that name, and maintain an index images.html to provide user-friendly links to your downloads.
Index: If your database ever gets stored on a webserver, a file index.html in every folder is recognized as special by most webservers. A browser visiting that folder will not be served with a list of files actually in the folder, but with the contents of index.html. Use it to provide links to those files that you want to make accessible.
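
To make "the programs will simply ignore them" concrete, here is a minimal Python sketch of how a program might collect the database chunks in a tree and skip everything else. It is not one of the programs mentioned above: the function name find_chunks and the demonstration folders are made up, and the list of reserved names is taken from the Genealogical files section of this page.

```python
import os
import tempfile

# Reserved chunk names, copied from the "Genealogical files" section of this page.
CHUNK_NAMES = {"bio.ged", "bio.xml", "bio.lua", "bio.json", "bio.saf.txt"}

def find_chunks(root):
    """Return sorted (folder, filename) pairs for every reserved bio file below root."""
    found = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if name in CHUNK_NAMES:          # anything else is silently ignored
                found.append((os.path.relpath(folder, root), name))
    return sorted(found)

# Demonstration on a throwaway tree; the folder and file names are hypothetical.
with tempfile.TemporaryDirectory() as root:
    child = os.path.join(root, "OPPERMAN", "b1 Trudie")
    os.makedirs(child)
    for path in (os.path.join(root, "OPPERMAN", "bio.ged"),
                 os.path.join(child, "bio.json"),
                 os.path.join(child, "holiday photo.jpg")):
        open(path, "w").close()
    chunks = find_chunks(root)

print(chunks)
```

The photo is left alone because its name is not one the sketch expects; only the two bio files are reported.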

Ascent method

A folder may contain subfolders for ancestors, numbered 2,3,4,... according to the Ahnentafel rule. You are 1 (and don't actually need a subfolder, but see Merged profiles), your father's number is double yours, your mother's is your father's plus one. A great advantage of this numbering system is that except possibly for yourself, odd numbers are for females and even numbers for males. The prefix to their folder is that number, followed by a period. If your file manager is not clever enough to know that 2 comes before 11, use 02 or even 002 instead.[3] The prefix itself is all that is reserved. The rest is up to you: the computer is already satisfied, but you probably can't remember off the top of your head who ancestor 27 is, so the folder name is not just 27. (though it could be) but something like 27. Anna Christina Francina JOOSTE.
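
The Ahnentafel arithmetic is simple enough to sketch in a few lines of Python. The function names are made up for illustration; only the doubling rule and the zero-padding come from the paragraph above.

```python
def father(n):
    """Ahnentafel number of the father of person n."""
    return 2 * n

def mother(n):
    """Ahnentafel number of the mother of person n."""
    return 2 * n + 1

def child(n):
    """Ahnentafel number of the child through whom ancestor n is reached."""
    return n // 2

def path_from_root(n):
    """List the steps that lead from person 1 up to ancestor n."""
    steps = []
    while n > 1:
        steps.append("mother" if n % 2 else "father")
        n = child(n)
    return steps[::-1]

def folder_prefix(n, width=0):
    """Folder prefix for ancestor n, zero-padded to at least `width` digits."""
    return str(n).zfill(width) + "."

print(path_from_root(27))     # ['mother', 'father', 'mother', 'mother']
print(folder_prefix(7, 3))    # 007.
```

So ancestor 27 is your mother's father's mother's mother, which is exactly the kind of thing nobody remembers off the top of their head.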

Inevitably, if you go on long enough, some ancestor is repeated. In that case, there is a file instead of a folder, and its name says which (not necessarily lower-numbered) ancestor is the same person. The contents of the file may be anything, including being empty. More about this under the discussion of merged profiles.

Descent method

A folder may contain subfolders for children, numbered 1,2,3 according to the order in which they were put into the folder. The prefix is a lowercase letter (any letter except 'x', but the same for all of them) followed by a number. As in the case of ancestors, you can use the rest of the name for something that you, not the computer, find useful, e.g. b1 Trudie etc. One could include grandchildren and more this way, too: b3c1 would be a child of b3, b3c1d2 a child of b3c1, etc. The later letters follow consecutively in the alphabet from the first.

It is not necessary that the letter moves on in the subfolders. The prefixes can start at b1 in every folder. For the computer, that would even be preferable, but since humans are so easy to confuse, and for the sake of collation with the original sources, you are allowed any starting letter.
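
A possible way to recognize a descent prefix, sketched in Python with regular expressions. The names parse_descent and letters_are_consecutive are hypothetical, and the sketch assumes that the prefix is separated from the rest of the folder name by whitespace.

```python
import re

# One step is a lowercase letter other than 'x', followed by a number.
STEP = re.compile(r"([a-wyz])(\d+)")
# A prefix is one or more steps, optionally followed by whitespace and free text.
PREFIX = re.compile(r"((?:[a-wyz]\d+)+)(?:\s+(.*))?$")

def parse_descent(name):
    """Return ([(letter, number), ...], rest_of_name), or None if name has no descent prefix."""
    m = PREFIX.match(name)
    if m is None:
        return None
    steps = [(letter, int(number)) for letter, number in STEP.findall(m.group(1))]
    return steps, (m.group(2) or "")

def letters_are_consecutive(steps):
    """Check that each letter is the alphabetical successor of the one before it."""
    return all(ord(b[0]) - ord(a[0]) == 1 for a, b in zip(steps, steps[1:]))

print(parse_descent("b1 Trudie"))   # ([('b', 1)], 'Trudie')
print(parse_descent("b3c1d2"))      # ([('b', 3), ('c', 1), ('d', 2)], '')
print(parse_descent("bio.ged"))     # None
```

Names that start with x, or with an ancestor number, are deliberately not matched, since those prefixes have other roles.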

In order to avoid duplicates in this method, children's folders are stored under one parent only; the other parent's folder may contain a file for the child as explained under Merged profiles. One should have a system for choosing that parent. There are two plausible methods: the principal parent may be the one from whom the child inherits the LNAB (last name at birth), or the one that is a blood relation of the root individual.


The disadvantage of the descent method is that in the case where the children do not all belong to the same other parent, the numbering system can't tell the difference.

For this purpose, file and folder names that start with one or more x's, followed by a blank, are used. In that file or folder, information relating to this particular pair of parents is stored. I have not met the problem often enough to prescribe general rules, but there is a complicated situation in my own family tree that shows a solution.

Another solution would be that a database chunk gives the necessary information.

In the family model (which I do not use and my programs at present do not recognize) an individual's folder contains no ancestors or descendants, only spouses, and these x-subfolders are used to hold the folders of the children.

Full codes

It is not necessary to use a pure ascent or pure descent method. In fact, it would be annoyingly restrictive to demand that. Genealogists often start from themselves, then do their parents, grandparents, etc, but at a certain point, do a generation or two of the descendants of some notable ancestor. Although it is perfectly possible and even logical to have children, spouses and ancestors in the same folder, we need to agree how to refer to them outside that folder. For that purpose, we combine prefixes into a single full code.

The simplest case of a full code is the ancestor of a root individual: use just the number.

Not as simple, but well known to South African genealogists, is the so-called De Villiers-Pama code: each ancestor's position in their own family is given as a number, and these numbers are separated by consecutive lower-case letters after an initial b, e.g. b2c1d4e5.

If it is necessary to indicate from which marriage a child comes, the spouse code is included before the letter, e.g. b2c1xxd4e5. It is unnecessary, but not wrong, to have a single x.

Rarely, but not inconceivably, one may need to refer to the other parent of a child without knowing which spouse of the principal parent it is. This is possible, but we need a colon to separate the numbers, e.g. b2c4:3 is the mother of b2c4. Once we have such a colon, the whole lettering system starts afresh. E.g. your uncles and aunts can be referred to as 4:b1 (eldest child of your paternal grandfather) etc. Obviously you can easily lose your way, and this way of doing it is best controlled by a computer program, but it does allow you, if you prefer, to put all relatives of the root individual in one massive folder.

The actual letters in the filename on the system might have been anything, e.g. b2/b1/g4/f5, but the full code uses the numbers only and puts in the correct letters so that the result is b2c1d4e5.

The computer can't straight-off reverse this process, but since you are not allowed different starting letters in the same folder, it can read b2c1d4e5 as "child 2's child 1's child 4's child 5" and get to the correct subfolder.
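
Both directions can be sketched in Python. This is only an illustration under stated assumptions: the function names are made up, path components are taken to be separated by a slash, and the letters are handed out strictly consecutively from b, so the path b2/b1/g4/f5 comes out as b2c1d4e5. Reading a code back uses the numbers only.

```python
import re

# A descent step: any lowercase letter except 'x', then a number.
STEP = re.compile(r"[a-wyz](\d+)")

def full_code(path, start="b"):
    """Build a full descent code from a folder path such as 'b2/b1/g4/f5'."""
    numbers = []
    for part in path.split("/"):
        m = STEP.match(part)              # only the prefix of each folder name matters
        if m is None:
            raise ValueError("not a descent folder: " + part)
        numbers.append(int(m.group(1)))
    # Ignore the letters found on disk and hand out consecutive ones instead.
    return "".join(chr(ord(start) + i) + str(n) for i, n in enumerate(numbers))

def child_numbers(code):
    """Read a full code back as a list of child numbers, skipping any x's."""
    return [int(n) for n in STEP.findall(code)]

print(full_code("b2/b1/g4/f5"))       # b2c1d4e5
print(child_numbers("b2c1d4e5"))      # [2, 1, 4, 5]
print(child_numbers("b2c1xxd4e5"))    # [2, 1, 4, 5]
```

The sketch does not deal with colons or with chains long enough to reach the letter x.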

Merged profiles

It is unavoidable, because of human error as well as the innate complexity of genealogical relationships, that certain paths lead to the same individual. The way to handle that is that there is only one folder for that individual; all other paths lead to a file. The file name and file contents instruct the user where to find the actual folder.

If the target folder is in the same containing folder, all you need is an empty file whose name contains the two codes separated by an equal sign, e.g. 260=284 to indicate another way to ascend to the ancestor, or b6=b4 when it is discovered that two children previously listed are in fact one and the same. As always, the file name can be longer, e.g. b6=b4 Robert Naylor might show the name attributed to the profile that has been merged away.

If the target folder is the folder containing the current folder, we use -1 for the containing folder, e.g. b3=-1 says the containing folder is that of the third child. Remember: the naming conventions apply to prefixes only, so you could still have a descriptive name after the -1. This method is not recommended except in that precise situation.

A better way, in general, is to give the full path from the root folder. The part after the equal sign gives the folder name of the root individual, followed by a slash and the full code of the target folder, e.g. x=OPPERMAN/b2c1d4e5.

If the folder is not accessible by navigating from the root folder, there must be nothing after the equal sign, and the contents of the file must be the filename on the system. This is an emergency measure, possibly not allowed by some programs.
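
The four kinds of merge file can be told apart from the name alone. Here is a hedged Python sketch: the function name, the dictionary keys and the kind labels are all made up, and it assumes that neither code contains a blank (so a root folder name with spaces in it would need more care).

```python
def parse_merge_name(filename):
    """Classify a merged-profile file name; return None if it has no equal sign."""
    source, sep, rest = filename.partition("=")
    if not sep:
        return None
    # Everything after the first blank is a free-form description.
    target, _, note = rest.partition(" ")
    if target == "":
        kind = "see-contents"     # emergency measure: the file contents hold the path
    elif target == "-1":
        kind = "parent"           # the containing folder is the target
    elif "/" in target:
        kind = "from-root"        # root individual's folder, slash, full code
    else:
        kind = "sibling"          # another entry in the same containing folder
    return {"source": source, "kind": kind, "target": target, "note": note}

for name in ("260=284", "b6=b4 Robert Naylor", "b3=-1", "x=OPPERMAN/b2c1d4e5", "b5="):
    print(name, "->", parse_merge_name(name))
```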

Symbolic links

Don't use them in your genealogical database, even if you are a Linux expert. They start as spaghetti and end up as a gooey mess. My programs test for them and abort if they are found.
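
A test of that kind might look like the following Python sketch. It is an assumption about how one could do it, not the code of the programs themselves; find_symlinks is a made-up name.

```python
import os

def find_symlinks(root):
    """Return the paths of all symbolic links below root, without following them."""
    found = []
    for folder, subfolders, files in os.walk(root, followlinks=False):
        # Links to folders show up among the subfolders, links to files among the files.
        for name in subfolders + files:
            path = os.path.join(folder, name)
            if os.path.islink(path):
                found.append(path)
    return sorted(found)

def check_no_symlinks(root):
    """Abort with a message if the database contains any symbolic link."""
    links = find_symlinks(root)
    if links:
        raise SystemExit("symbolic links found: " + ", ".join(links))
```

Calling check_no_symlinks on the root folder before doing anything else is one way to abort early.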

Genealogical files

The names bio.ged, bio.xml, bio.lua, bio.json and bio.saf.txt are reserved for files that encode genealogical data for the individual, as elaborated below.

Case studies


A method based on the GEDCOM family model.

Some really good general advice, not tied to a particular organizing system like mine

Genealogical formats

Pseudo-GEDCOM formats

The closest thing to a standard genealogical format is GEDCOM 5.5.1, whose last revision, dated 1999, is still marked "draft", but which is, within wide limits, adhered to by most software developers. It is characterised by short keywords called tags, which are mostly abbreviations for easily-guessable words, and usually capitalized. Though precise enough for computers, it is not totally impenetrable to human readers.

A typical individual (omitting relationships for now) might be coded[4] as:

0 @Major_John_Laurie@ INDI
  1 NAME John /Laurie/
  1 BIRT
    2 DATE 23 Jan 1794
    2 PLAC London
  1 DEAT
    2 DATE 25 Jan 1860
    2 PLAC London
  1 WWW Laurie-476

GEDCOM syntax is based on line endings and whitespace as the main delimiters, with slashes and at-signs used as secondary delimiters. Each line starts with a level number, thus imposing a tree structure which we have emphasized by indentation.
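
Because the level numbers carry the whole tree structure, a reader for this kind of file needs little more than a stack. The Python sketch below is a simplification under stated assumptions (no CONT/CONC continuation lines, no character-set handling, no validation of tags); the function name and the dictionary keys are made up. The sample repeats the John Laurie record with explicit BIRT and DEAT lines, the event tags used in the other representations on this page.

```python
SAMPLE = """\
0 @Major_John_Laurie@ INDI
  1 NAME John /Laurie/
  1 BIRT
    2 DATE 23 Jan 1794
    2 PLAC London
  1 DEAT
    2 DATE 25 Jan 1860
    2 PLAC London
  1 WWW Laurie-476
"""

def parse_gedcom(text):
    """Return the level-0 records as nested dicts; indentation is ignored."""
    root = {"children": []}
    stack = [root]                      # stack[k + 1] is the open node at level k
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        level_text, _, rest = line.partition(" ")
        level = int(level_text)
        first, _, remainder = rest.partition(" ")
        if first.startswith("@"):       # "0 @id@ INDI": the id comes before the tag
            xref = first.strip("@")
            tag, _, value = remainder.partition(" ")
        else:
            xref, tag, value = None, first, remainder
        node = {"tag": tag, "value": value, "xref": xref, "children": []}
        del stack[level + 1:]           # close every node at this level or deeper
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]

records = parse_gedcom(SAMPLE)
birth = records[0]["children"][1]
print(birth["tag"], [(c["tag"], c["value"]) for c in birth["children"]])
```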

The wordlets used as GEDCOM tags are not prescribed by the rules of syntax, but only certain tags are approved. The GenWiki site lists them and has articles on the more common ones, with details of agreements between software writers on how these are to be used, which tags and features are considered essential or optional, etc. For example, all programs should recognize a surname inside slashes, but breaking it up into pieces called SURN, MIDN etc or adding extras like NICK is optional.

It is easy for a human to get to know the more common tags and to edit a GEDCOM file manually (particularly when assisted by syntax highlighting).

One can represent the same information in any data description language without sacrificing the GEDCOM look-and-feel. We call these representations pseudo-GEDCOM formats.

Here is XML:

<INDI id='Major_John_Laurie' NAME='John /Laurie/' SEX='M'
      WWW='' >
  <BIRT DATE='23 Jan 1794' PLAC='London'/>
  <DEAT DATE='25 Jan 1860' PLAC='London'/>
</INDI>

Or Lua, mechanically translated from the XML and slightly post-edited to look closer to the GEDCOM:

{xml="INDI"; id="Major_John_Laurie"; NAME="John /Laurie/"; SEX="M";
  {xml="BIRT"; DATE="23 Jan 1794"; PLAC="London"};
  {xml="DEAT"; DATE="25 Jan 1860"; PLAC="London"};
}

Or JSON, mechanically translated from the Lua and similarly post-edited:

{"xml":"INDI", "id":"Major_John_Laurie",
 "NAME":"John \/Laurie\/",
 "1":{"xml":"BIRT","DATE":"23 Jan 1794","PLAC":"London"},
 "2":{"xml":"DEAT","DATE":"25 Jan 1860","PLAC":"London"}}

In both Lua and JSON, slightly more idiomatic versions (not betraying the XML origin) could have been defined, at the expense of needing specialized GEDCOM-to-Lua and GEDCOM-to-JSON converters rather than relying on existing well-tested packages.
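
To show what "mechanically translated" can mean, here is a Python sketch that goes from the XML form to the JSON form using only the standard library. It is not the converter actually used for this page: the function name is made up, and the layout (tag under the key "xml", attributes as keys, children under "1", "2", ...) is simply copied from the examples above.

```python
import json
import xml.etree.ElementTree as ET

XML = """<INDI id='Major_John_Laurie' NAME='John /Laurie/' SEX='M'>
  <BIRT DATE='23 Jan 1794' PLAC='London'/>
  <DEAT DATE='25 Jan 1860' PLAC='London'/>
</INDI>"""

def element_to_dict(elem):
    """Mirror one XML element as a dict in the layout of the JSON example."""
    node = {"xml": elem.tag}            # assumes no attribute is itself called "xml"
    node.update(elem.attrib)
    for position, child in enumerate(elem, start=1):
        node[str(position)] = element_to_dict(child)
    return node

record = element_to_dict(ET.fromstring(XML))
print(json.dumps(record, indent=1))
```

Python's json module writes the slashes in the name unescaped; both "/" and "\/" are valid JSON for the same character.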

The SAF text format

Oude Kaapsche Familien

The three iconic volumes comprising Geslacht-Register der Oude Kaapsche Familien were compiled by C C de Villiers, a printer by trade, and edited by the historian G McC Theal. Their respect for rigidity of form and precision of information allows their characteristic layout to be formalized as parsable text. Here is a short sample.

This genealogy, compiled in the 19th century, conforms to a file format decades ahead of its time — one that can be read and processed as a computer language. With the aid of the optical character recognition program Tesseract and a little post-editing, followed by syntax highlighting and reorganization of whitespace, we get:

(d) 2 Schalk Willem, gedoopt 3 October 1743, burger te Stellenbosch,
gehuwd 8 Maart 1765 met Maria Magdalena Botha,
hertrouwd (1) 23 Februari 1777 met Maria de Bruijn, weduwe van Philippus Lodewicus du Preez,
(2) 26 Februari 1786 met Machteld van Heerden, weduwe van Schalk Willem van der Merwe,
(3) 21 November 1790 met Aletta Sibella van der Merwe, weduwe van Nicolaas van der Merwe
(e) 1 Hester Petronella, gedoopt 29 Maart 1766
2 Maria Catharina, gedoopt 19 Juli 1767,
gehuwd met Willem Petrus van der Merwe,
hertrouwd met Rudolph Johannes Brits
3 Anna Magdalena, gedoopt 11 September 1768,
gehuwd met Schalk Willem van der Merwe,
hertrouwd met Gerrit Jacobus Olivier
4 Schalk Willem, gedoopt 16 Augustus 1772, burger te Stellenbosch,
gehuwd 29 April 1792 met Hester Cecilia Burger, weduwe van Carel van der Merwe
(f) Willem Francois, gedoopt 22 Maart 1795, burger te Tulbagh,
gehuwd 21 Maart 1813 met Anna Margaretha Geertruida Theron, weduwe van Wijnand Carel du Toit
(e) 2 Margaretha Elisabeth, gedoopt 28 April 1775,
gehuwd met Johannes Pienaar

The amazing thing is that this prettyprinted genealogy does not actually need the punctuation, linebreaks and indentation: it only needs words to be separated by whitespace and the bits in bold to be treated as keywords.
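
A small Python sketch illustrates the point. The keyword list is an assumption (a guess at a few of the words that act as keywords in the sample), and split_fields is a made-up name. The same entry is fed in twice, once prettyprinted and once flattened to a single line without commas, and both give the same fields.

```python
# Assumed keywords; the real list would be longer.
KEYWORDS = {"gedoopt", "gehuwd", "hertrouwd", "met", "weduwe"}

def split_fields(entry, keywords=KEYWORDS):
    """Split an entry into (keyword, text) pairs using whitespace and keywords only."""
    fields = [(None, [])]               # words before the first keyword: number and names
    for word in entry.replace(",", " ").split():
        if word in keywords:
            fields.append((word, []))   # every keyword opens a new field
        else:
            fields[-1][1].append(word)
    return [(key, " ".join(words)) for key, words in fields]

PRETTY = """2 Maria Catharina, gedoopt 19 Juli 1767,
      gehuwd met Willem Petrus van der Merwe,
      hertrouwd met Rudolph Johannes Brits"""

FLAT = ("2 Maria Catharina gedoopt 19 Juli 1767 gehuwd met Willem Petrus "
        "van der Merwe hertrouwd met Rudolph Johannes Brits")

for field in split_fields(PRETTY):
    print(field)
print(split_fields(PRETTY) == split_fields(FLAT))   # True
```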

To emphasize that point, here is exactly the same information, mechanically translated to resemble what is commonly called SAG/SAF notation. TODO

  1. It's an age thing. Fellow septuagenarians will understand.
  2. I started writing this before I became aware of several existing essays overlapping with this one. By then I had already started doing it my way.
  3. Doesn't 007. James BOND look just right?
  4. This is manually coded GEDCOM. Computer-generated GEDCOM usually does not indent and uses cross-references like @I2351@ etc.

