What do you recommend for the mathematics of genetic genealogy?


I have been reviewing what people are doing with their DNA results and their GEDCOMs. I have yet to try out the Genetic Genealogy Kit and affiliated tools, but I have tried out GRAMPS, RootMagic, Ancestral Quest, Genome Mate Pro, GEDMatch, and a few of Felix Immanuel's genetic genealogy tools.

I've been going through my library of mathematics and genetics textbooks, including Schaum's Outline of Genetics (4th ed.), Snustad and Simmons' Principles of Genetics (4th ed.), and my references on vector analysis, linear algebra, and the Python programming language. I haven't been able to find what I'm looking for.

The basic idea is that a person can be represented in a genetic genealogy by their DNA plus annotations like name, date, location, and events. In simplified formal notation, their DNA can be written as a physical measurement of an experimental subject. In physics, forces, force interactions, and waveform interference can be written in terms of vectors. Physical measurements of physical systems can generally be written as vectors, so your DNA should be representable as a vector.

I want to do it this way because I want to be able to decompose the vectors into subvectors representing the contributions of genetics from other family members. This way my DNA can be effectively factored recursively into maternal vs paternal, maternal grandfather vs maternal grandmother, paternal grandfather vs paternal grandmother, and so on. You could then compare the factored or phased DNA to matches shared with other family members and determine immediately where in your family tree they must be. Likewise, you could use the vector representation of other people's DNA in order to automatically generate genealogies and check for intersections.
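As a minimal sketch of the vector idea, assume each tested SNP is encoded as a count of "alternate" alleles (0, 1, or 2). The unphased genotype vector is then the element-wise sum of a maternal and a paternal haplotype vector, and decomposition means recovering those two sub-vectors. The encoding, the function names, and the sample values below are all assumptions for illustration, not something any testing company publishes in this form.

```python
# Each haplotype is a list of 0/1 flags: 1 means the alternate allele is
# present at that SNP, 0 means the reference allele is present.
maternal_haplotype = [0, 1, 1, 0, 1]   # hypothetical values
paternal_haplotype = [1, 1, 0, 0, 0]   # hypothetical values

def genotype_vector(hap_a, hap_b):
    """Unphased genotype: alternate-allele count (0, 1 or 2) per SNP."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNPs")
    return [a + b for a, b in zip(hap_a, hap_b)]

def ambiguous_positions(genotype):
    """Heterozygous SNPs (count == 1) cannot be split without more data."""
    return [i for i, g in enumerate(genotype) if g == 1]

genotype = genotype_vector(maternal_haplotype, paternal_haplotype)
print(genotype)                       # [1, 2, 1, 0, 1]
print(ambiguous_positions(genotype))  # [0, 2, 4]
```

The sum is easy to compute but not uniquely invertible: every heterozygous position could have come from either parent, which is why the decomposition needs relatives' data.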

So what references do you all recommend for doing mathematical or quantitative genetic genealogies?

Update: For common reference.

COOP Lab at UC Davis:

asked May 27, 2016 in The Tree House by Ian Mclean G2G6 Mach 1 (13.6k points)
edited May 31, 2016 by Ian Mclean

It isn't as complicated as all that.

With my mother's and my father's DNA results in hand, I can do what is called phasing: I compare my DNA to theirs and see which DNA comes specifically from my mother and which comes specifically from my father. This part of the vector decomposition is relatively easy as long as we neglect mutations and transcription errors. My mother gave me either one of her X chromosomes whole or a combination of her X chromosomes; Schaum's Outline of Genetics, 4th edition, page 150 has a handy diagram of what is shared along the X and Y chromosomes, and there are what are called sex-linked genes that are passed strictly from one parent to their child. X chromosomes have a large segment that is non-homologous or completely sex-linked; if I share segments from that part of my X chromosome with someone else, then it is most probable that I share matrilineal ancestors with them. This goes both ways for male-to-male comparisons: if we share X chromosome segments from the completely sex-linked region, then we have to share a common matrilineal ancestor.

These facts can be used in vector analysis in order to differentiate at least some genetic matches into a pool of most probably matrilineal common ancestors and least probably matrilineal ancestors. If I have my DNA phased with both my parents then I can actually positively identify which portions of my DNA came from which parents, so if I share segments of DNA with someone else then I have to share them through that parent.
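A rough sketch of phasing against both parents, one SNP at a time, might look like the following. It assumes genotypes are given as two-letter strings of unordered alleles and, as in the discussion above, ignores mutations and genotyping errors. The function name and the sample sites are hypothetical.

```python
def phase_site(child, mother, father):
    """Return (maternal_allele, paternal_allele), or None if ambiguous.

    Each argument is a two-character string of unordered alleles, e.g. "AG".
    """
    a, b = child
    options = set()
    if a in mother and b in father:
        options.add((a, b))
    if b in mother and a in father:
        options.add((b, a))
    # Exactly one consistent assignment means the site is phased.
    return options.pop() if len(options) == 1 else None

# (child, mother, father) genotypes at four hypothetical SNPs
sites = [("AG", "AA", "AG"), ("CT", "CT", "CT"), ("GG", "AG", "GG"), ("CT", "TT", "CC")]
phased = [phase_site(*site) for site in sites]
print(phased)  # [('A', 'G'), None, ('G', 'G'), ('T', 'C')]
```

The second site stays unresolved because all three people are heterozygous there; real phasing tools fall back on neighbouring SNPs or population data in that case.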

There will be a lot of DNA that is in common between parents, and the problem is worse for those populations where the parents are close relatives, but for a lot of people, you can sort the DNA data into shared by both parents, shared by father, shared by mother, and shared by neither parent.
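The four-way sort can be sketched at the level of match lists, assuming you have exported the kit IDs that match you, your mother, and your father. The kit IDs and the function name are made up for illustration; a finer version would compare segments rather than whole matches.

```python
def sort_matches(my_matches, mother_matches, father_matches):
    """Bucket my matches by which parent(s) also match the same kit."""
    buckets = {"both": set(), "mother": set(), "father": set(), "neither": set()}
    for kit in my_matches:
        in_mother = kit in mother_matches
        in_father = kit in father_matches
        if in_mother and in_father:
            buckets["both"].add(kit)
        elif in_mother:
            buckets["mother"].add(kit)
        elif in_father:
            buckets["father"].add(kit)
        else:
            buckets["neither"].add(kit)
    return buckets

mine = {"K1", "K2", "K3", "K4"}      # hypothetical kit IDs
mother = {"K1", "K2", "K9"}
father = {"K2", "K3"}
buckets = sort_matches(mine, mother, father)
print(buckets)
```

Kits that land in "neither" are candidates for false matches, since a real inherited segment should also appear in at least one parent.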

This would be a mere novelty in genetic genealogy except that my cousin's cousins are my cousins and my mother's cousin's cousins are her cousins roughly speaking. Because you can sort shared genetic DNA in this way, you can triangulate cousins against each other using your completely sex-linked differentiated shares as a control group for sorting shared autosomal results onto one side or the other of your family tree. If you do vector decomposition and use phased genetic analysis then you can actually sort your autosomal DNA into different pools; you got autosomal DNA from each of your parents, but it is unlikely you got the same autosomal DNA from both, and it is unlikely they got the same autosomal DNA from their parents, and so on. Theoretically, you can search and sort your autosomal DNA via phasing, triangulation, and chromosome mapping according to which parent, grandparent, great grand parent, etc gave it to you.

This can all be done using common scientific programming tools like the Anaconda distribution for Python. In fact, this can all be done in such a way that much of this becomes a push-button-get-results kind of application.

commented May 27, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Here's a rough pictorial representation of what I am talking about. In this diagram, I or any xy-karyotype am the root of the tree, yamx; I inherited chromosome y, chromosome x, mitochondrial DNA, and the autosomal chromosomes 1-22. My father contributed my chromosome y and a portion of his autosomal chromosomes, a. My mother contributed chromosome x, my mitochondrial DNA, and a portion of her autosomal chromosomes, A. My autosomal DNA is a combination of my mother and my father's Aa.

This is a simplification of the situation because there is some crossover from my father's x and y chromosomes. And the exact composition of my mitochondrial DNA and X chromosome from my mother is not captured in the diagram. But it serves as a rough outline of a basic model.

If my DNA is represented by a vector, I, then we might model it as [y a m x]=I. Mother * Father = [0 a_0 m x] * [y a_1 0 0] = I where * is a reproduction operator. Could probably do it as some form of bra-ket: <Mother|Father> = <I>.
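One way to make the [y a m x] notation concrete is a small sketch in which each parent's contribution is a four-slot record and the reproduction operator checks which slots each parent may fill. The class names, the `reproduce` function, and the symbolic string values are assumptions for illustration and only cover an XY child.

```python
from typing import NamedTuple, Optional, Tuple

class Contribution(NamedTuple):
    """What one parent passes to an XY child, in the order [y a m x]."""
    y: Optional[str]
    a: str
    m: Optional[str]
    x: Optional[str]

class Genome(NamedTuple):
    y: Optional[str]
    a: Tuple[str, str]   # (maternal autosomes, paternal autosomes)
    m: Optional[str]
    x: Optional[str]

def reproduce(mother, father):
    """The * operator from the text: Mother * Father = I."""
    # The mother passes no Y; the father passes no mtDNA and no X to a son.
    if (mother.y, father.m, father.x) != (None, None, None):
        raise ValueError("arguments are not in (mother, father) order")
    return Genome(y=father.y, a=(mother.a, father.a), m=mother.m, x=mother.x)

mother = Contribution(y=None, a="a_0", m="m", x="x")
father = Contribution(y="y", a="a_1", m=None, x=None)
child = reproduce(mother, father)
print(child)
```

Swapping the arguments raises an error, which is one small way of seeing that the operator is not commutative.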

commented May 29, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

""There will be a lot of DNA that is in common between parents" ?!?!? do you mean pedigree collapse is common?"

Statistically speaking over the entire length of human history, yes, pedigree collapse is common. That's why human beings generally share the majority of their genome in common.

Though in this case, I wasn't referring as much to recent (within 10 generations) pedigree collapse. I am just referring to the fact that if we were working with whole genome sequences, comparing the sequences of any two people base-by-base would show more in common than different. Geneticists and genetic genealogists have to pick and choose what genetic information to examine and compare, so we choose hot spots of variation in the human genome, which amount to about 2-3% of each of our genomes.

So for example, suppose you're comparing distant cousins with unknown relations to your parents. If your family graph was strictly a tree--no pedigree collapse anywhere in it--then the cousins would strictly be sortable as either paternal cousins XOR maternal cousins. However, with any quantity of pedigree collapse in your family tree anywhere, some of your cousins are going to be both paternal and maternal cousins because you and they will share a distant common ancestor that is also shared by your mother and father.

Under conditions approaching the no-pedigree-collapse-model for XY-karyotypes, finding a match on your X chromosome in the nonhomologous portion conclusively means that person is related to you strictly through your mother and not your father. However, if your father and mother share a distant maternal ancestor then some matches in that region will be from that distant maternal ancestor, so you will find cousins on your father's side who share X chromosome matches with you despite the fact that your father didn't pass on his X chromosome to you. In practice for XY-karyotypes, you can generally assume X chromosome matches are probably maternal relatives because the probability of them being on your father's side is low especially for matches below 7 cM.

commented May 29, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)
edited May 29, 2016 by Ian Mclean

Identical by state vs identical by descent is roughly analogous to correlation vs causation. We can represent correlations in vector analysis as single direction vectors; causations can be represented as bi-directional or bijective vectors. That's actually a major part of what we're supposed to be doing when we're looking at potential matches. Finding a correlation is a starting point; the process of triangulation creates a categorical system of correlates that should knock out mismatches especially for measurements larger than the margin of error in measurement and in comparison.

If A correlates with B and if B correlates with C and if C correlates with A then A = B = C; this condition is a kind of mathematical closure which is why creating polygons like triangles is important.
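In segment terms, closing the triangle can be sketched as finding the region common to the three pairwise matches A-B, B-C, and A-C on one chromosome. The sketch below assumes each pairwise match is a (start, end) interval in base pairs; the positions are invented. Overlap on the reported chromosome is only a necessary condition, since unphased data cannot say whether all three matches lie on the same copy of the chromosome.

```python
def overlap(seg_a, seg_b):
    """Overlap of two (start, end) intervals on one chromosome, or None."""
    start = max(seg_a[0], seg_b[0])
    end = min(seg_a[1], seg_b[1])
    return (start, end) if start < end else None

def triangulate(ab, bc, ac):
    """Region shared by all three pairwise matches, or None."""
    first = overlap(ab, bc)
    return overlap(first, ac) if first else None

# Hypothetical pairwise matches on the same chromosome
ab = (10_000_000, 30_000_000)
bc = (15_000_000, 40_000_000)
ac = (12_000_000, 28_000_000)
print(triangulate(ab, bc, ac))   # (15000000, 28000000)
```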

The vector model that I am interested in should strictly differentiate IBD from IBS, and in fact, the vector decomposition process should explicitly return strictly the IBD data from comparisons.

commented May 29, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Magnus, I have to disagree that IBS is useless.

It may be useless in identifying the common ancestor, but it is part of an iterative process of building a decision tree (for lack of a better analogy) based on probabilities. These probabilities affect our confidence level.

If I know that neither of my parents shares any segment greater than 3 cM, then a 4 cM segment in a match most probably came from the same parent as the segment(s) which caused the match.

Because I tested both my brothers, there are segments that overlay, some 4cM, but when combined into a new Lazarus Kit, those segments are joined to form a new larger segment, providing further evidence of its usefulness.

Using just DNA, you can begin to create Parent A, Parent B, Grand Parent A, Grand Parent B, etc. without assigning a gender to these ancestors. Once you do assign a gender to one, there should be a domino effect in assigning a gender to others.

commented May 30, 2016 by Ken Sargent G2G6 Mach 6 (61.9k points)

2 - My parent's DNA doesn't tell me what DNA I got from whom. It doesn't tell me or my siblings what my siblings got from whom, and how we differ. Factors can be discovered by shared segments between me and my siblings, between my siblings and my cousins, between my parents and my aunt, etc. Given the nature of the non-associative algebra "Factor" might be somewhat misleading. But the recursive nature of DNA is not a point of controversy; we are able to do DNA transcription because of its similarities to computer code, so it is at least partially governed by principles of coding and generally recursive functions.

3 - The problem is in "Only a maternal relative" vs "Both paternal and maternal relative."

commented May 31, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

1 - X and Y are sex chromosomes

The existence of a Y tells us the sample is male, which is relevant to how the X is treated. Also, the Y may be relevant when 2 relatives claim to share the same distant common ancestor along their patrilineal lines. It may further support the claim to some degree, or refute it indicating a Non-Paternal event.

2 - There is much to be gained by factoring in children's DNA. I gave an example earlier, but the clearest gains are accuracy and depth. AncestryDNA factors these child/parent relationships and accomplishes both. The process is called phasing. In my case, their predictions have been consistent with the documentation, even out to 8th cousins. There is one prediction that is further than the documentation. Both my match and I believe that a Non-Paternal event occurred.

3 - Matching segments do not always follow the same path in an endogamous relationship. You could match your non-sex segments to a 1st cousin (who is not a stranger), and also be related to that same person as a 6th cousin on your mother's side sharing no non-sex segments.

commented Jun 1, 2016 by Ken Sargent G2G6 Mach 6 (61.9k points)

2 - My parent's DNA doesn't tell me what DNA I got from whom.

For genealogical purposes, it won't help to know.

Factors can be discovered by shared segments between me and my siblings, between my siblings and my cousins, between my parents and my aunt, etc.

Nothing you don't already know

But the recursive nature of DNA is not a point of controversy; we are able to do DNA transcription because of its similarities to computer code, so it is at least partially governed by principles of coding and generally recursive functions.

Now you're just playing games with different meanings of recursive

3 - The problem is in "Only a maternal relative" vs "Both paternal and maternal relative."

As to a possible link on the other side, you are none the wiser. You can't introduce a no-pedigree-collapse assumption to infer that a link on one side makes a link on the other side less likely than it otherwise would be.

There's no free lunch here. There's no substitute for lots of data from cousins and cousins of cousins, including people you didn't know were cousins. But given the data, the conclusions aren't that hard to reach.

commented Jun 1, 2016 by Living Horace G2G6 Pilot (631k points)

RJ, you do not get it. I hear that you are frustrated, and you are formally advised to walk away. Further replies on your part will be received as aggression. If this all doesn't seem worth the time or to offer any value to you then leave it; you are not required or desired to participate, and this thread is not about trying to convince people that what they are doing is not worth the time or effort. You don't get to police what we waste our time and effort doing.

2) It won't help in the conventional genealogical sense to know what DNA I got from whom. But I am not necessarily interested in knowing the names or any of the usual genealogical details of the people I get my DNA from; my genetic genealogy, in a form which only represents the structure of my DNA inheritance, can be completed irrespective of the status of my WikiTree genealogy. For me, the genetic genealogy is more important because I can directly, accurately, and precisely know what the genetic profiles of my ancestors look like; when I compare those profiles with the genetic profiles of others, I can know with little doubt whether I am genetically related to them or not.

The kind of genealogy that I am interested in and which I am discussing the mathematical model of here is not directly the kind of genealogy that WikiTree is constructing. The WikiTree genealogy puts the social relationships and records first and uses the genetic genealogy to support those social relationships and records.

To me this is backwards because the more reliable data for inheritance relationships is the data that is produced by genetics; much of conventional genealogies beyond immediate relationships is largely speculative in nature and subject to a plurality of errors that results in genealogies that are often inaccurate or simply causes genealogies to dead-end with no discernible trail.

The kind of genetic genealogy I am interested in producing by these mathematics would then have conventional genealogies mapped to it hypothetically rather than the other way around; I know I am related to the person with the genetic profile produced, but I do not necessarily know that I am related to the person who has the WikiTree profile attached to my family tree. I want to build my genealogy on what is known and knowable rather than mere speculation and family mythology.

commented Jun 2, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

None of this is personal and I've no intention of trying to police anything. But this is a public forum and people need to comment on anything posted which they think might mislead other readers.

Genetics loses half the information at each generation. The only way to get it back, short of digging up skeletons, is to test lots of relatives and relatives of relatives.

If you want to know which side of your tree somebody is on, you need a triangulated 3-way match with another cousin. You can get there the hard way from first principles, or you can take the message that's already been potted and packaged. It will come to the same thing.

To pin somebody down further, ask the same question from other points of view. You'll need more 3-way matches.

commented Jun 2, 2016 by Living Horace G2G6 Pilot (631k points)

And you are again showing confusion based on your own presumptions.

Yes, genetic information is lost at each generation. But not the same genetic information. I have about half of my mother's and father's DNA. My brother has about half of my mother's and father's DNA. One of my half-sisters has about half of my mother's DNA. My brother, my sister, and I do not share the exact same half of our mother's DNA, so we do not share the exact same quarter of our grandparents' DNA and so on. If my brother, sister, and I compare our genetic results then each of us will have parts of our parents' DNA that would be absent when examining only one of us. This basic principle extends to cousins, aunts, and uncles for grandparents on up, so if I compare all my first cousins, aunts, siblings, parents, and myself together then we can get some percentage of our grandparents' DNA reconstructed without ever digging anyone up or even involving my living grandparents in the process.
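To illustrate how siblings jointly recover more of a parent than any one of them does, here is a small sketch that merges the stretches of the mother's DNA each child carries. It assumes the segments have already been labelled by which of the mother's two chromosome copies they came from, which in practice requires phasing; the names, positions, and the length of 100 units are invented.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def parent_coverage(children, length):
    """Fraction of both parental copies of one chromosome seen in the children."""
    covered = 0
    for copy in ("copy_1", "copy_2"):
        segments = [seg for child in children.values() for seg in child.get(copy, [])]
        covered += sum(end - start for start, end in merge_intervals(segments))
    return covered / (2 * length)

# Which parts of the mother's two copies each child inherited (hypothetical)
children = {
    "me":      {"copy_1": [(0, 40)],  "copy_2": [(40, 100)]},
    "brother": {"copy_1": [(0, 100)], "copy_2": []},
    "sister":  {"copy_1": [(0, 30)],  "copy_2": [(30, 100)]},
}
print(parent_coverage({"me": children["me"]}, 100))  # 0.5
print(parent_coverage(children, 100))                # 0.85
```

One child always covers exactly half of the two copies; the three together cover 85% in this made-up example.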

But that's only a part of what I am talking about. Without regard to any other DNA kits besides my own, my DNA has to split in certain ways. I didn't inherit a random assortment of DNA. I inherited roughly half of my father's autosomal DNA and roughly half of my mother's autosomal DNA; the half I inherited from my father in general isn't identical to the half I inherited from my mother, and which half goes where is roughly linked with which sex-determining chromosomes I got from whom. Which portions of what chromosomes won't be known without making matches to other people's DNA, but it doesn't matter in the mathematical model. My DNA is treated as a variable or what is called an UNKNOWN in the 1800s language of mathematics; what DNA my father contributed is another different variable or another UNKNOWN. Same for my mother. Same for my siblings. Same for my cousins, aunts, uncles, grand parents, and people totally unrelated to me.

Graphs and tables can be constructed which show what abstract portion of autosomal DNA I got from whom. I am interested at the moment ONLY in the abstract relations. Once I have the parameters of the problem to plug into a fully constructed model, I can actually start doing comparisons in order to analytically link certain portions of my autosomal DNA with certain sides of my genetic genealogy starting with the maternal or paternal difference. Like my maternal grandfather potentially gave me X-chromosome DNA but gave me no mitochondrial DNA and no Y-chromosome DNA; roughly a quarter of my autosomal DNA comes from my grandfather, and the quarter isn't continuously distributed across 1-22 of my chromosomes, so I might have my maternal grandfather's DNA on my 1, 3, 4, 6, 7, 9, and 10th chromosomes. If my maternal grandfather's DNA can be put into a set like (1, 3, 4, 6, 7, 9, 10), and I compare my DNA with a random stranger that happens to match in (1, 3, 4, 6, 7, 9, 10) then I know that random stranger is related to me through my maternal grandfather's side of the family. With successive comparisons and enough genetic samples from close family members, I can use chromosome maps of that kind to automatically sort future matches to their proper place in my family tree. I might not be able to immediately place them exactly where they are in relation to me, but I will quickly be able to place them on the maternal or paternal side then place them on the paternal or maternal's grandfather or grandmother, and so on.
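A chromosome map of this kind can be sketched as a list of labelled intervals, with each new match assigned to whichever ancestors its segments overlap. The map entries, positions, and function name below are hypothetical, and the sketch assumes the map has already been phased so that a label refers to the correct copy of the chromosome.

```python
# (chromosome, start, end, ancestor) entries; all values are hypothetical
chromosome_map = [
    ("1", 0, 50, "maternal grandfather"),
    ("1", 50, 120, "maternal grandmother"),
    ("3", 10, 80, "maternal grandfather"),
]

def assign_ancestors(match_segments, mapped):
    """Return the set of ancestor labels whose mapped regions a match overlaps."""
    labels = set()
    for chrom, start, end in match_segments:
        for m_chrom, m_start, m_end, ancestor in mapped:
            if chrom == m_chrom and start < m_end and m_start < end:
                labels.add(ancestor)
    return labels

stranger = [("1", 20, 35), ("3", 40, 60)]
print(assign_ancestors(stranger, chromosome_map))   # {'maternal grandfather'}
```

A match that straddles a boundary, such as `("1", 45, 60)`, comes back with both grandparents and needs a closer look.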

Not all unknowns are equally unknown though; I know I got a Y chromosome, and I know I got an X chromosome, and I know I got the Y chromosome with roughly half my autosomal DNA, and I know I got the X chromosome with roughly half my autosomal DNA, so I know that the relationship between roughly half my autosomal DNA is not entirely independent of which sex-determining chromosomes I inherited. Because the autosomal DNA is not entirely independent then I can write a functional notation representing that non-independent relationship where either my sex-determining chromosome is dependent on roughly half my autosomal DNA or roughly half my autosomal DNA is dependent on my sex-determining chromosome. With the difference between mitochondrial DNA and X chromosomes, we can actually establish more nuanced relationships between X-linked, Y-linked, and MT-linked inheritance as cross compared to each other.

So in the way that my directly measured DNA can be treated as a variable in a system, so can the unmeasured DNA of ancestors long since dead. You can think of my DNA as a solved system of equations which can be compared with other partially solved or unsolved but expressed systems of equations in order to examine the state of unmeasured ancestral DNA by indirect inference. Rather than thinking of the ancestor as strictly solved or unsolved, we can think of the ancestor in terms of fractions. If you only have my DNA to work with, then you can only have about 1/(2^n) of a given ancestor at generation n solved. But if you compare me and my siblings then you can have more than 1/(2^n) of an ancestor solved, and if you keep adding descendants to the comparison then we can tell more about the common ancestor. There's a mathematical relationship telling us what the minimum or maximum number of such comparisons will be to get 100% of the ancestor's DNA reconstructed.
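The expected share of an ancestor that a set of tested descendants can recover has a simple recursive form if we assume each stretch of autosomal DNA is passed to each child independently with probability 1/2. The sketch below uses that assumption only; it ignores detection thresholds and pedigree collapse, and the nested-list encoding of the family is made up for illustration.

```python
from fractions import Fraction

TESTED = "tested"  # marker for a person whose DNA kit is available

def expected_fraction(descendants):
    """Expected share of an ancestor's autosomal DNA seen in tested descendants.

    `descendants` lists the ancestor's children. Each child is either TESTED
    or a list of that child's own children in the same form.
    """
    missed = Fraction(1)
    for child in descendants:
        carried = Fraction(1) if child == TESTED else expected_fraction(child)
        # A given stretch is passed to each child with probability 1/2.
        missed *= 1 - Fraction(1, 2) * carried
    return 1 - missed

print(expected_fraction([[TESTED]]))                        # 1/4
print(expected_fraction([[TESTED, TESTED, TESTED]]))        # 7/16
print(expected_fraction([[TESTED] * 3, [TESTED] * 2]))      # 83/128
```

The first line is the 1/(2^n) case for a grandparent (n = 2). Three siblings through one parent can never exceed 1/2 of that grandparent, while adding two first cousins through an aunt or uncle lifts the expectation to 83/128. Under this model the expected value approaches 1 as more lines are added but never reaches it exactly.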

"But this is a public forum and people need to comment on anything posted which they think might mislead other readers."

This is a blatant admission on your part that you think I am trying to mislead others. Your posts so far have been technically hostile to the process of free inquiry. It is great that you want to go ahead and keep doing things the way they have always been done. I am certainly not trying to stop you from doing exactly that. I don't care that there are labor intensive alternatives to solve these problems individually or by strict experimental methods.

I know there are mathematical methods which would be somewhat difficult to develop but which would be instrumental in the development of automated reasoners for genetic genealogy, which would make the problem push-button for the average user who doesn't have interest in doing genetic genealogy the way it has always been done. In the meantime, the mathematics of genetic genealogy can be used individually to set up spreadsheet macros or simple programs that search and sort through data sets to make the process of identification of family members simpler and less manually intensive.

commented Jun 2, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Add more known to the equation!!!

I agree with RJ Horace you need more things to make it easier. A lot of unknowns doesn't make the equation easier to solve ....

As I am a big fan of open linked data, I feel we have more knowns in the equation if we start adding more data as structured machine-readable data in WikiTree ==>

If we add coordinate templates with date timestamps to WikiTree ==> we can create a timeline with locations as one known parameter in the equation
==>
1. Narrow down that two matching segments are in a particular area
If we add templates for sources ==> for some church books we can easily see that two people have sources in the same church book. Combining that with matching DNA segments starts getting interesting
==>
1. Then we can narrow it down to a parish for a specific segment match

commented Jun 2, 2016 by Living Sälgö G2G6 Pilot (296k points)

Ian - I share your frustration. To be blunt, the source of the problem is that Wikitree leadership and the DNA project have been reinforcing the proposition that "you need a triangulated 3-way match with an[other] cousin[S]." as a prerequisite to further confirm virtually any part of the tree.

The initial DNA Wikitree support also required all tests to be on Gedmatch and the results of the tests made public. It seems there was a belief that the auDNA and yDNA tests and community could be treated the same.

This is why RJ believes "Nothing is gained by factoring your own [DNA with your parents]." but this proposition is directly contrary to AncestryDNA telling us that "Accuracy and genetic distance is gained by factoring in your own DNA with a parent or child." There had also been a push at 23andMe to incorporate the phasing used in Ancestry Composition into their DNA Relatives algorithm. There is no way that Wikitree and AncestryDNA can both be right.

This is also why RJ believes "If you want to know which side of your tree somebody is on, you need a triangulated 3-way match with another cousin."

There is no DNA Service or Gedmatch that places this restriction. Triangulation has absolutely no effect on this outcome. It is a requirement unique to wikitree.

This restriction is almost always used when a 3rd match, who does not have documentation, wants to prove a relationship with 2 other DNA matches by identifying a common ancestor.

In virtually every case, you only need (1) a tree and (2) a single match in order to confidently support which side of your tree somebody else is on.

A perfect example implementing this logic is the Gedmatch Lazarus feature. You can generate a new kit for someone not tested based on knowing 2 groups. Group 1 descendants, and group 2 cousins. The Lazarus process DOES NOT care about triangulation. I have created a kit for my maternal grandmother. I plug in the information based on my tree, and the process uses the segments based on MATCHES. I can then use this result to identify PROBABLE matches via my mother, and PROBABLE matches of my mother via her mother. Triangulation plays NO role.
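The two-group idea can be sketched as a filter over pairwise matches: keep every segment where a Group 1 kit (a descendant) matches a Group 2 kit (a cousin or other relative of the target). This is an illustration of the logic as described above and not Gedmatch's actual implementation; the function name, kit names, and positions are assumptions.

```python
def lazarus_like_segments(pairwise_matches, descendants, cousins):
    """Keep segments where a Group 1 kit matches a Group 2 kit."""
    kept = {}
    for kit_a, kit_b, chrom, start, end in pairwise_matches:
        crosses_groups = (
            (kit_a in descendants and kit_b in cousins)
            or (kit_b in descendants and kit_a in cousins)
        )
        if crosses_groups:
            kept.setdefault(chrom, []).append((start, end))
    return {chrom: sorted(segs) for chrom, segs in kept.items()}

# (kit, kit, chromosome, start, end); all values are hypothetical
matches = [
    ("me", "mom_cousin", "7", 10, 40),
    ("brother", "mom_cousin", "7", 30, 65),
    ("me", "brother", "7", 0, 100),      # both in Group 1, ignored
    ("me", "dad_cousin", "2", 5, 25),    # not in Group 2, ignored
]
segments = lazarus_like_segments(matches, {"me", "brother"}, {"mom_cousin"})
print(segments)  # {'7': [(10, 40), (30, 65)]}
```

No three-way comparison appears anywhere in the filter; it relies entirely on the tree telling you who belongs in which group.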

Wikitree leadership or the DNA project needs to step in and correct this misunderstanding of triangulation and how it is used.

commented Jun 2, 2016 by Ken Sargent G2G6 Mach 6 (61.9k points)
edited Jun 2, 2016 by Ken Sargent

Magnus, you wrote " IBS ==> there is not 0 correlation but its random ==> useless... ==> back to square 1"

This may be true if you are working on a formula that predicts a relationship between 2 DNA testers, but this is not IAN's objective. He is working to build up evidence one connection at a time between a child and a parent.

Here is an example where a Triangulation Group is not used and a child/parent pair is used. Suppose a mother and son are DNA tested, and the son matches a cousin (meaning they meet the minimum requirements of a match), but the mother does not. One of the segments is IBS, but the mother does not share that IBS segment with her son's match.

The only question being asked of the DNA is "Is this match related to the son via the mother or the father?" Since we know that none of the segments are shared with the mother, we can infer from the evidence that the IBS segment came from the father, but according to your statement

"there is not 0 correlation but its random ==> useless..."

This seems to be true when addressing "sticky" segments, which seem similar to IBS. There is a correlation between this IBS segment and those matches to the son that share this IBS segment. These matches are probably related to the son via his father.

Do you agree with RJ and Wikitree that "If you want to know which side of your tree somebody is on, you need a triangulated 3-way match with another cousin."?

Do you disagree that matches which include this IBS segment are probably related to the son via his father?

commented Jun 2, 2016 by Ken Sargent G2G6 Mach 6 (61.9k points)
edited Jun 2, 2016 by Ken Sargent

Magnus, I haven't commented on sticky segments because I am not familiar with them, and they would represent corrections on the basic model we haven't yet discovered or developed. I can see how they might be useful in the end game.

The "little experience" argument makes sense for empirical arguments, but mathematics doesn't rely strictly on experience. It relies on rules and assumptions. Rules like "The son inherits a y chromosome from only their father." From that rule and some other assumptions we can infer the existence of patrilineal inheritance. This distinct difference is conceptualized in the comparison between empirical methods and deductive or constructive methods.

The basic expression that needs resolution before we can really progress on to a more complex model is a) can a genome be represented as a vector quantity b) what is the dimensionality of the vector quantity c) what operator represents reproduction and what operator represents the inverse operation on reproduction.

Towards that, I did find a highly technical mathematics paper on the mathematics of genetic inheritance that I've linked in the OP; the main thing I took away from the paper is that genetic algebras are non-associative and sex-linked inheritance is anti-symmetric both of which make the representation of the human genetic sample, the reproduction operation, and the decomposition operation less straightforward than I had hoped. I will note that some aspect of the genetic sample has a scalar representation possibly the whole genetic sample according to the paper if I interpreted it correctly.

The obvious representation of the genetic sample is as a column or row of chromosomes. I am not sure how to treat the mitochondrial DNA or the X and Y chromosomes in the formal picture though. Seems like those should be treated differently from the more symmetric autosomal chromosomes.

The UC Davis links include a visual representation of the structure I am describing, but they don't have the mathematics for a precise representation from real world genetic data.

I'll be posting a graph of the representations I've tried so far in a few days.

commented Jun 2, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Peter, what you are describing is IBC, Identical by Coincidence. This is different from a segment that has been identified as Identical By State (IBS). In my example, a match to the son that includes a valid IBS segment not shared with the mother (who in this case could be phased with the son) is probably related to the son via his father.

If it makes a difference, since we have the son and mother, let's only use their phased data, which results in a match with one of the segments used to determine the relationship and range identified as IBS.

IBS causes problems in predicting a relationship and range between matches, but we don't care about this, we only care about the source of the segment in one specific case and question. Is the match related to the son via his father or not. If no part of a valid IBS segment is shared with the mother, then it must be shared with the father.
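The inference in this case can be sketched per segment: compare each segment the son shares with the match against the segments the mother shares with the same match. The sketch assumes the son's segments are valid (not false matches) and that both comparisons use the same coordinates; the function name and positions are hypothetical.

```python
def side_of_segment(son_segment, mother_segments):
    """Classify one son-to-match segment using the mother's segments with that match."""
    chrom, start, end = son_segment
    for m_chrom, m_start, m_end in mother_segments:
        if chrom == m_chrom and start < m_end and m_start < end:
            # The mother shares DNA here too, so the father cannot be singled out.
            return "maternal or both"
    return "paternal"

son_vs_match = [("5", 100, 180), ("11", 20, 45)]   # (chromosome, start, end)
mother_vs_match = [("5", 150, 200)]
for segment in son_vs_match:
    print(segment, side_of_segment(segment, mother_vs_match))
```

When `mother_vs_match` is empty, as in the example described above, every valid segment comes back as paternal.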

commented Jun 2, 2016 by Ken Sargent G2G6 Mach 6 (61.9k points)
edited Jun 2, 2016 by Ken Sargent

Let's suppose you have lots of cousins and get them all tested. One in 8 will give you a Y match, your father's brothers' sons.

Half will give you an X match. They're on your mother's side.

Those cousins will also give you loads of autosomal matches. For simplicity we'll suppose that all the matches are through your mother.

So, looking at segment Blah1 to Blah2 on your Q chromosome-pair, if you have a match with maternal cousin Fred, you know one chromosome of the pair came from your mother. But you knew that anyway.

You also now know that any other match on the same chromosome comes through your mother.

But you don't know which other matches are on the same chromosome. The testing people say chromosome when they mean chromosome-pair because they can't separate the pair. This is why they have to do fuzzy matching and triangulation.

Every segment of every chromosome-pair except XY will match cousins on both sides if you have enough cousins. Which matches happen to exist in your sample isn't information, it's just a sampling artefact. But those matches won't yield any new information about who is on which side.

commented Jun 4, 2016 by Living Horace G2G6 Pilot (631k points)

RJ, This is a good question.

You begin with a false premise, based on what you have been told on this site.

"Every segment of every chromosome-pair except XY will match cousins on both sides if you have enough cousins"

The opposite is most likely true, especially for IBD, which can be traced to a unique common ancestor.

The reason for this is that you may have a segment which matches a maternal cousin Fred, but on the paternal strand, part of the strand could be via the paternal grandfather, and part of the strand via the paternal grandmother. Testing siblings, parents, aunts, uncles, and cousins will identify those that have such a condition.

In these cases, the probability of the same segment matching more than one side is near 0%.

I would also like to correct you on your statement...

"This is why they have to do fuzzy matching and triangulation."

1st, DNA services, as far as I know do not include triangulation in their matching algorithm, they provide reports that allow you to create your own TG's.

2nd, I know that Wikitree likes to characterize the matching as "Fuzzy Matching", but that is not how it has been used, at least in the past, outside of Wikitree. The logic which determines the endpoints has been described as using fuzzy logic, which is why different DNA services may report different end points. The matching algorithms use what may be better characterized as "Educated guess" or "Prediction".

A 7cM segment may actually be a 5cM or 6cM segment. This is why AncestryDNA encourages parents and children to test. Phasing the DNA data works to eliminate the fuzziness, makes the predictions more accurate, and extends the distance of the predictions.

commented Jun 4, 2016 by Ken Sargent G2G6 Mach 6 (61.9k points)

A) "1st, DNA services, as far as I know do not include triangulation in their matching algorithm, they provide reports that allow you to create your own TG's."

?!?!? its easier to tell what you refer to.... think this is a never ending discussion.....

1) FTDNA just do segment matching based on size and total in common and don't display results lower than a threshold

2) Ancestry DNA have DNA circles that are secret but we can guess they use the family tree available,.... ==> they have triangulation somehow...?!?!?

3) 23andMe ?!?!?

4) ?!?!?

B) 7cM segment may actually be a 5cM or 6cM segment ?!?!?

Do you mean that something that looks like an IBD segment is actually IBS? That sounds less likely, or do we have numbers on that?

commented Jun 4, 2016 by Living Sälgö G2G6 Pilot (296k points)

a1) FTDNA just do segment matching based on size and total in common. Yes, and this is not triangulation.

a2) Ancestry DNA has circles that are secret but we can guess they use the family tree available,.... ==> the have triangulation somehow

Here is the Help on AncestryDNA Circles

"DNA Circles show you which members share DNA with one another in the genome, but not where in the genome they share that DNA. This is because our studies of genetic inheritance and DNA Circles have shown us that individuals in DNA Circles very rarely share the same matching segments"

I have been told that on Wikitree, the term triangulation means they are part of a Triangulated Group. This clearly tells us that AncestryDNA does not use triangulation.

a3). 23andme does not use triangulation to determine what is a match, or prediction. You can run reports that will provide you the data for you to determine what is and what is not triangulated, but they do not report on what matches are triangulated.

a4) ??

a5) "B) 7cM segment may actually be a 5cM or 6cM segment ?!?!?"

FTDNA and 23andme may report a 7cM segment because the endpoints are "Fuzzy", but when AncestryDNA takes that same raw data and phases it, the fuzziness is nearly eliminated, and the more accurate result is 5cM.

These are both IBD because they are the same segment, but because AncestryDNA will phase data when available, it gets more accurate results. This is why the AncestryDNA minimum is 5cM and the others' is 7cM.

commented Jun 4, 2016 by Ken Sargent G2G6 Mach 6 (61.9k points)

RJ, "Comes to the same thing." - Not on Wikitree. Wikitree only accepts triangulation when there is a triangulation group which shares the same segment of 7cM or greater.

1. Wikitree does not accept less than 7cM. We on Wikitree can't say a person is related via one parent, based on evidence that a less than 7cM IBS segment absolutely did not come from the other parent. IMO, this is logic 101.

Outside of wikitree, I doubt many people will agree " If you have a match with an unknown person, you'll need them to match a known relative". Simple logic tells us given only 2 choices, and we eliminate one, the other must be true. If I can prove the match is not my mother, then it must be via my father.

I would like to correct you on the following.

"If you have a match with an unknown person, you'll need them to match a known relative to be able to find out which side of the tree they're on. Then the answer is immediate and doesn't need any further analysis."

Although I agree with this statement, it is not within the Wikitree guidelines or the comments made. You can not just match. If this were the case, then you would not have to look at segments. Even though you might have a completely documented connection to a cousin and you match that cousin, you have to find a third cousin who shares a triangulated segment. Why? it adds nothing when deciding which side of the family a cousin is related on.

commented Jun 4, 2016 by Ken Sargent G2G6 Mach 6 (61.9k points)
edited Jun 4, 2016 by Ken Sargent

Magnus,

If I match an unknown cousin on a 10cM segment, and my mother does not match on any part of that segment, the probability is that this segment came via my father. There are no Triangulated Groups involved. Just to be clear, you disagree because it seems you are still supporting the claim

"If you want to know which side of your tree somebody is on, you need a triangulated 3-way match with another cousin."

If we agreed that this was a smaller IBS 6cM segment, and the mother did not share this 6cM segment, are you still supporting the claim?

Do you still disagree that matches sharing such an IBS segment are probably related to the son via his father?
Every segment of every chromosome-pair except XY will match cousins on both sides if you have enough cousins. Even though segments on one strand came from the same paternal grandparent, but the other strand was split between the maternal grandfather and maternal grandmother.

commented Jun 4, 2016 by Ken Sargent G2G6 Mach 6 (61.9k points)

23andMe has recently added a triangulation groups system for open profiles. The system shows both In Common With indirect matching (Ancestry.com style DNA circles) and lists what ICW matches also form triangulation groups.

--------------------------------------

I figure that we should explicitly state something that is basically important: If you share a segment of DNA with someone then they are probably related to you; segments over 7 cM are more likely to be positive matches than to be false matches, and segments over 10 cM are almost certainly not false positives (ISOGGWiki).

You don't need triangulation groups to establish that you are related to someone by DNA comparison in some way. Sharing more than 15 cM can be considered to be almost certainly a direct genetic relationship.
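As a rough illustration of the thresholds just quoted, here is a small sketch. The function name and the wording of the labels are my own; the cutoffs of 7, 10, and 15 cM are the ones stated above, and applying the 15 cM figure to a single segment rather than to total shared cM is an assumption on my part.

```python
def match_confidence(segment_cm: float) -> str:
    """Map a shared segment length in cM to a rough confidence label."""
    if segment_cm > 15:
        return "almost certainly a genetic relationship"
    if segment_cm > 10:
        return "almost certainly not a false positive"
    if segment_cm > 7:
        return "more likely a true match than a false one"
    return "too short to judge on length alone"

for cm in (5.0, 8.0, 12.0, 20.0):
    print(cm, "->", match_confidence(cm))
```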

Triangulation groups are necessary for establishing probable most recent common ancestors; I need my two siblings' and my DNA to indirectly establish my mother as our common ancestor, or I need one sibling, my mother, and my own DNA to directly establish that my mother is our common ancestor by a triangulation group. I need one sibling or my mother, one of either my maternal aunt or my first cousins by my maternal aunt, and my DNA to indirectly establish either of my grandparents as a common ancestor. This logic extends up through all genetic ancestors but not necessarily for every genealogical ancestor (see the UC Davis genetic genealogy blog in the OP for details).
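For what it's worth, the core test behind a three-person triangulation group can be sketched as an interval intersection. This assumes each pairwise match is given as a hypothetical `(chromosome, start, end)` tuple; real tools also apply minimum-length thresholds and deal with unphased data, which this sketch ignores.

```python
def common_interval(intervals):
    """Return the (start, end) shared by every interval, or None."""
    start = max(s for s, _ in intervals)
    end = min(e for _, e in intervals)
    return (start, end) if start < end else None

def triangulates(ab, ac, bc):
    """Check three pairwise matches (A-B, A-C, B-C) for a shared region.

    Each argument is a (chrom, start, end) tuple. Returns the region all
    three pairs have in common, or None if they do not triangulate.
    """
    if not (ab[0] == ac[0] == bc[0]):
        return None
    return common_interval([ab[1:], ac[1:], bc[1:]])

print(triangulates(("5", 100, 300), ("5", 150, 400), ("5", 120, 250)))
# (150, 250)
```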

commented Jun 5, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

4 Answers

I have a couple of genetic matches to people who were adopted, and I would really like to figure out what branches of my family they relate to. Which is why I've been trying so hard to figure out the mathematical model for this search and sort method.

I had thought about a fan chart; I think that is the best way to represent whole genome sequences or to keep track of exome results like what has been derived by labs like 23andMe and FamilyTreeDNA. The structure I would like to find is the one which shows what fragments of DNA I got from who.

Think of it like this. At me, the structure would ideally have 100% of my DNA exactly as it is. At my parents, they'd each have roughly 50% of my DNA representing the portions they passed on to me; this would be basically my phased DNA showing exactly what I got from my father and exactly what I got from my mother.

Normally in figuring out what my mother and father are going to pass on to a child it is a matter of some randomness and probability, but in the case where we're examining me, my DNA, and my parents and their DNA there isn't strictly a probabilistic relationship to be concerned about; we should be able to use strict differences to deduce what actually happened as compared to what could have happened from the actual measurements.
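A minimal sketch of that "strict differences" deduction at a single SNP, assuming genotypes are given as two-letter strings such as "AG" and that only the mother has been tested. The function name is hypothetical, and genotyping errors are only caught in the simplest cases.

```python
def phase_with_mother(child: str, mother: str):
    """Deduce (maternal_allele, paternal_allele) for the child at one SNP.

    Returns None when the site is ambiguous (both are heterozygous for
    the same two alleles) or inconsistent with the mother's genotype.
    """
    c1, c2 = child
    if c1 == c2:
        # Homozygous child: both parents must have passed the same allele.
        return (c1, c2) if c1 in mother else None
    from_mother_1 = c1 in mother
    from_mother_2 = c2 in mother
    if from_mother_1 and not from_mother_2:
        return (c1, c2)
    if from_mother_2 and not from_mother_1:
        return (c2, c1)
    return None

print(phase_with_mother("AG", "AA"))  # ('A', 'G')
print(phase_with_mother("AG", "AG"))  # None
```

Sites that come back as `None` are where siblings or a second parent would be needed to finish the phasing.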

In practice for figuring out where distant cousins go in the family tree, I do think it would be a probability or at least a degrees of truth problem written in statistical or fuzzy logic.

A bonus to making the kind of map that I am thinking about is that we'd eventually see what DNA survived from my ancestors to me and see what is missing from the puzzle. With enough people represented in this same kind of structure, we could start to see where the pieces fit together, so we could reconstruct the whole genome sequences of common ancestors that we don't necessarily know. To me that would be useful for determining where I fit in the global family graph, and I imagine it would be similarly useful to other people looking for how they fit in.

commented May 29, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Ken, the first problem I have to solve is which side of my family they are related to me from. For my maternal grandmother's side of the family, I know with relative certainty that one of the adoptees is not related to me by my great grandparents or lower; I know all my great aunts and uncles and all their children and all the great grandchildren. There's a distinct possibility that they are related through my maternal grandfather's side of the family, but I have put figuring that out specifically on hold until I can rule out the more difficult case: they are related to me through my father's side of the family.

I know very little about my father's side of the family relatively speaking. I barely have my relatives documented out to my paternal grandparents. One of the adoptees shares X chromosome DNA with me, so I can generally assume she is a relative on my mother's side, but the main adoptee that I want to help shares only autosomal DNA with me, so it is ambiguous as to where they are in my family.

In order to figure out their parents, I need to figure out which side of the family I need to look on. From there I need to figure out our most recent common ancestor. From the most recent common ancestor, I can then trace down the line to the adoptee and at least one of their parents; the adoptee has already found their parent of record at least under a pseudonym, and from what is known of their father, I am related to them through their mother.

I'll keep the Lazarus kits in mind for this, but the Lazarus kits depend on having solved more basic problems.

commented May 30, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

I like your model Ian - and of course it should work. Magnus and Ken are smart cookies and they've pointed out some issues. Nevertheless the data should tell the story. And that for me is the rub - the data aren't necessarily there yet. Despite the seeming precision of these tests we don't know the error rate or variation in results due to testing procedures, lower level data sorts, different tolerances, thresholds, or magnitudes for categorizing the lower-level data, etc. None of these data from these tests are ready for the precision of the 'exacto knife' of a model you have in mind presently. Even the underlying proteins themselves appear to behave in unpredictable ways so while I am hopeful better models will be developed for predictive as well as historical reasons I'm not sure we aren't stuck in the Sherlock Holmes era for a bit longer. I would think Mathematica might do some interesting things with the data but you are likely to have to rely on statistics and categorical analysis for the state of the art.

commented May 31, 2016 by Leake Little G2G6 Mach 1 (15.6k points)

I get the issues with the available data. But the precision isn't so much the issue anymore, and as time goes by, it is going to become less the issue. Error rates are entering into the 1% range and rapidly diminishing for individual genetic tests. Comparison between old kits and new kits or between standard kits and custom kits are the major problem at the moment.

With the cost per genome rapidly approaching 0, the issue of precision or lack of data is going to effectively go away entirely. For my personal case, I have most of my immediate family members totally on board for genetic sequencing and analysis, so I am not concerned about not having access to the minimum data necessary to pull apart my genome and figure out deductively and experimentally where I got what from whom. To me, it is simply a matter of finding and learning to use the correct tools. Or inventing them where they don't yet exist.

For me the major issue isn't the reliability of the specific genetic testing kits though. What I want is the basic mathematical model for the simplest case: the generalized family tree without pedigree collapse.

That model isn't going to depend on any of those factors, and we can actually use deviations from the simplest model as a way to infer information that wouldn't otherwise be obvious.

The mathematical model can be constructed without actually depending directly on any given test or precision. The data may not be present or up to the required precision, but we have the basic theories for vector analysis, physical measurement, computer coding, and genetics. The mathematical theory of genetic genealogy can be written before we have the data to test the theory of genetic genealogy. Data developed later can then be used to test the theory and possibly refute it or some of its assumptions.

The basic structure is actually already relatively well known: "[Identical by state data] may be useless in identifying the common ancestor but it [is useful in] an iterative process of building a decision tree [...] based on probabilities." -Ken Sargent

commented May 31, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)

Answer 1 · 2016-05-27T05:36:42+0000

The problem, as I see it, is that only a sampling of a given person's genome is tested, so that while getting a statistical measure of relatedness for a few generations is easy enough, it will only work definitively for a very few generations. Further back patterns will exist for areas of the genome, but since they are characteristic of a large number of people in a given area, it isn't possible to do what you suggest if there is a very large tested group. I've been looking a bit for a good source of technical explanations but so far I mostly see stuff for the non-technically oriented. But Wikitree is a large group and I'm pretty sure actual articles will be cited here which don't take a bunch of money to read. Meanwhile I have more to do here than I can get even started on. But I'll keep an eye on you to see if you get a handle on how to handle things.

Answer 2 · 2016-05-29T13:20:36+0000

I believe what you are saying is true but some clarification is necessary.

Using your conclusion: "This way my DNA can be effectively factored recursively into maternal vs paternal, maternal grandfather vs maternal grandmother, paternal grandfather vs paternal grandmother, and so on. You could then compare the factored or phased DNA to matches shared with other family members and determine immediately where in your family tree they must be. Likewise, you could use the vector representation of other people's DNA in order to automatically generate genealogies and check for intersections."

My two brothers also have been DNA tested, and without a tree, we can make certain conclusions about the source even though we can't specifically identify which parent or another ancestor is the source.

For example, if the two oldest brothers in my family share a segment, but the 3rd brother does not, it means that the 3rd brother received that particular segment from a different paternal grandparent and a different maternal grandparent than the other two.

If a cousin matches the 3rd brother, then you can also make some assumptions about the grandparent of the 1st two brothers. If we presume that the well-documented tree is correct, and by using that tree it is determined that this match is via his paternal grandfather, you can presume that other matches on that same segment to only the two oldest brothers were inherited via your paternal grandmother.

I can make this presumption because I know my parents share no segments.

A tree and DNA are mutually dependent on each other for the answers we are asking. A tree and DNA are either consistent with each other or they are not. DNA alone cannot independently confirm or prove a particular relationship. It can only further support or refute an existing claim.

Answer 3 · 2016-06-05T09:36:08+0000

Here is my exchange with Ann Cousin aka DNACousins.

I asked for clarification on some things to make it clearer but this morning I told her it was not needed. I understood why she answered as she did just by reading the response.

1. Conceptually, a match for a son not found in his mother can be attributed to his father. This includes IBD and IBS, but not IBD. I have no problem limiting those matches (without triangulation), to only those that include an IBD segment, which is all I initially intended.

2. Given the answer to #1 only requires a match, it indicates that triangulation is not necessary. Since triangulation is only used to find common ancestors for those without a tree, she interpreted the question that way.

Do I really have to ask Ann to clarify her last statement by telling her that the Wikitree technical group believes that triangulation is used for something other than finding the common ancestor? Do I really have to say to her that the Wikitree Technical members are not convinced by her answer to #1 because "If you want to know which side of your tree somebody is on, you need a triangulated 3-way match with another cousin."

I tried to ask the question so not to bias her answer but I should have noted that Wikitree places a triangulation requirement on more than finding common ancestors.

From: Ann Turner

Sent: Saturday, June 04, 2016 4:32 PM

To: Kenneth Sargent

Subject: Re: I am hoping you will clear up a disagreement.

I've been wishing I could spend more time on WikiTree, but it seems like there's always something else demanding my attention.

1) Conceptually, a match for a son not found in his mother can be attributed to his father. There are a couple of "gotchas", though. The segment must be long enough that you can rule out a coincidental match. There's no consensus on how long that should be. And there is also a possibility of a false negative in the mother, e.g. at FTDNA (which requires a total of 20 cM, including small 1-3 cM pseudo-segments), AncestryDNA (with its TIMBER algorithm discounting some segments) and 23andMe (with a cap on the number of DNA Relatives). GEDmatch lets you look at everyone through the same lens.

2) There's also no consensus on whether you "need" a triangulated group. AncestryDNA uses more of a network approach. I wrote up some material about how difficult it is to assemble TGs here: http://tinyurl.com/TheTroubleWithTriangulation. But if you have the good fortune to get a triangulated group with pretty robust segment sizes, I do think it's possible to attribute it to a specific ancestral couple if it's not too many generations back. When you go back many generations, that brings up the possibility of multiple lines of descent.

Hope that helps,

Ann

On Sat, Jun 4, 2016 at 9:48 AM, Kenneth Sargent <msnkjsargent@msn.com> wrote:

Hi Ann,

I’ve been spending too much time on Wikitree, devoted almost entirely to the discussions on DNA. I suspect that Wikitree is the best source for publicly available documented trees but the discussions are not at the level as 23andme used to be. I was hoping to ask you two basic questions and get your permission to post your response. We are discussing the “mathematics of genetic genealogy”.

Your responses to these questions could significantly affect how Wikitree users think about how to use DNA in their research.

Scenario: We have the raw data for a mother and a son available to us for customization. There are matches to the son, that are not matches to his mother. More specifically for these matches, the segments are shared with the son, but none are shared with the mother. Since the data can be phased, I am presuming the process could phase the data first.

1. Is it possible, using the data available, in these cases, that a match to the son, and not the mother, is probably related to the son via the father?

2. Do you agree: “If you want to know which side of your tree somebody is on, you need a triangulated 3-way match with another cousin.” FYI – “a triangulated 3-way match” means part of a Triangulated Group. You don’t have to go further than yes or no, but feel free to comment.

Thank you

answered Jun 5, 2016 by Ken Sargent G2G6 Mach 6 (61.9k points)

RJ, the scenario we have been using only involves 3 people who are not all biologically related to each other. The son and the mother are related to each other, but the DNA Cousin is only related to the son. There is no triangulation match with the son. It is a simple match which contains IBD AND possibly IBS Segments.

Given there is NO TRIANGULATION in this scenario, I maintain "you don't need a triangulated 3-way match with another cousin," which is directly contrary to your assertion. The same principle applies to the Wikitree requirement that the only approved method of confirming a father or mother requires triangulation.

I am not sure what you mean by "Most ancestors are inaccessible". We are not looking at any ancestors of the son other than the mother and father.

It seems that you believe (and wikitree) that you have to know the common ancestors in order to determine if a match is related to the father or to the mother in every case.

commented Jun 5, 2016 by Ken Sargent G2G6 Mach 6 (61.9k points)

Based on your last post, I will assume some misunderstanding.

I think it important then to identify the problem with communication in this case.

1. The title implies a more technical discussion on “mathematics of genetic genealogy?” in which a higher level of precision is assumed.

2. You stated, “If you want to know which side of your tree somebody is on, you need a triangulated 3-way match with another cousin.”

This has really been then the focus of the exchange. To Ian and I, this was obviously false.

Ian provided examples that contradicted this proposition by providing examples of showing which side of your tree somebody is on, without a triangulated 3-way match with another cousin.

3. I provided a simple scenario of the son, mother, and cousin where the son is related to the cousin via his father and repeatedly used it to show my point. I took your responses as denials. This is also without a triangulated 3-way match with another cousin.

I thought I was very specific about the scope of my statements. Please understand that I am unclear about what you believe is obvious.

Do you still believe…

“If you want to know which side of your tree somebody is on, you need a triangulated 3-way match with another cousin.”

Because if you still support this, then you can’t believe

“the son is probably related to the cousin via the son's father” because there is no triangulated 3-way match with another cousin.

commented Jun 6, 2016 by Ken Sargent G2G6 Mach 6 (61.9k points)

The following is a pseudocode rendering of an algorithm for determining genetic relationships between anonymous or pseudoanonymous genetic samples.

First step: determine which side of the family a match is for your genetic genealogy via comparison to yourself and at least one parent; mitochondrial matches are going to be strictly along your matrilineal relations; X chromosome matches can be effectively treated as being strictly maternal for XY karyotypes but maybe paternal for non-XY karyotypes; Y chromosome matches can be effectively treated as strictly paternal.

Second step: repeat the above for n matches to create a pool of sorted matches which have been determined to be on your father's side, your mother's side, both, or neither (you might have matches due to mutation). The choice of the size of the pool, n, needs to be based on standards for statistical significance.

Third step: find all matches that share a sex-linked segment and an autosomal segment. These are weakly patrilineal (Y Chromosome), weakly matrilineal (mitochondrial), or weakly maternal (X Chromosome) autosomal matches; there is a probable relationship of inheritance between the autosomal match and the sex-linked match; this is a correlative relationship but not necessarily a causal relationship. This group of matches is useful for figuring out what autosomes to target first in the search and sort.

Fourth step: analyze the matches and sort according to probable degree of relationship. Naively, order the sorts according to cM lengths, the number of shared segments, and total shared cM lengths; a more sophisticated algorithm for determining probable degree of relationship can and should be used.

Fifth step: diagram yourself at the center of a bifurcated polar coordinate system with the sorted matches plotted to intervals representing the range of probable degree of relationship over the rings radiating out from you on the appropriate side of the map. Mother's matches on one side and father's matches on the other side; I would probably exclude plotting the both or neither matches for now. The idea is to find clusters of matches that match each other and graph those clusters according to their probable degree of relationship; by graphing their probable degree of relationship to you and their probable degree of relationship to each other, you create a relative topological reference of distance and connection.

Sixth step: find all triangulations between you and your mother's matches; find all triangulations between you and your father's matches. Mark the abstract relationship of you, your mother, and your match's most recent common ancestor; at this point, the graph should begin to show a structure of relationships resembling a familiar genetic genealogy; it will likely be incomplete and will have islands of disconnected relations.

Steps beyond this really depend on what you want to accomplish. The islands can be recursively connected by performing steps 1 through 6 for each child-parent pair you can find among your matches. There's a critical threshold of matches that would result in a chain reacting algorithm that would tend towards total connectivity.
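The first, second, and fourth steps above could be sketched along these lines. Everything here is an assumption made for illustration: the `Match` fields, the pool names, and the naive sort key (total cM, then longest segment, then segment count). With only one parent tested, the "neither" pool really means "not shown to match the tested parent".

```python
from dataclasses import dataclass

@dataclass
class Match:
    name: str
    segments_cm: list             # autosomal segment lengths shared with me
    matches_mother: bool = False  # also matches my mother's kit
    matches_father: bool = False  # also matches my father's kit
    y_match: bool = False         # Y chromosome match (patrilineal)
    mt_match: bool = False        # mitochondrial match (matrilineal)
    x_match: bool = False         # X chromosome match

def side_of(m: Match, i_am_xy: bool = True) -> str:
    """Step one: assign a match to a side of the family."""
    maternal = m.matches_mother or m.mt_match or (m.x_match and i_am_xy)
    paternal = m.matches_father or m.y_match
    if maternal and paternal:
        return "both"
    if maternal:
        return "maternal"
    if paternal:
        return "paternal"
    return "neither"

def pool_and_rank(matches, i_am_xy: bool = True):
    """Steps two and four: pool matches by side, then rank each pool."""
    pools = {"maternal": [], "paternal": [], "both": [], "neither": []}
    for m in matches:
        pools[side_of(m, i_am_xy)].append(m)
    for pool in pools.values():
        pool.sort(
            key=lambda m: (sum(m.segments_cm),
                           max(m.segments_cm, default=0.0),
                           len(m.segments_cm)),
            reverse=True,
        )
    return pools
```

The ranked pools are what would then be plotted on the two halves of the polar chart in the fifth step.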

A DNA cousin can know which side of the family I am on by the mirror image of the process by which I discovered what side of the family they are on.

To determine what side of my father's family tree a given match is on, more information is required. In my case, I basically do not have access to my father's DNA directly, so the best I can do is phase my DNA with my mother and my siblings to composite my father's DNA via Lazarus kits or similar.

However, I can also take all of the cousins that I am able to discern are not related to my mother and composite their DNA matches with me as well into the image of my father's DNA; I don't know what side of his tree they all are on, but I don't need to know either because I only need to know that they are not on my mother's side of the family tree. I can composite a functional image of my father between my siblings, my mother, my father's pseudonymous genetic relations, and me; the issue then is to determine his mother or father's DNA. Obviously, his mother's DNA can't be fully reconstructed without a genetic kit from his daughter, maternal sisters, maternal aunts, or maternal uncles because I probably share no X Chromosome or mitochondrial DNA in common with my father. His father can be partially reconstructed sans X chromosome and mitochondrial DNA because I share upwards of 1/4th my autosomal DNA and almost my whole Y Chromosome in common with him, but again, we would need genetic kits from my father's paternal sisters, paternal aunts, paternal uncles, or daughter. Without going through all that, I would guess that we can use my DNA and my father's partially reconstructed DNA to sort paternal DNA cousins into probable pools of paternal grandfather and grandmother matches by those who do not share DNA with me and my patrilineal uncles or paternal XY-karyotype first cousins.
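The compositing idea can be pictured as a union of intervals: every segment I share with a non-maternal match marks a stretch of my genome that presumably came from my father. A rough sketch, assuming segments are hypothetical `(start, end)` pairs grouped by chromosome; this only tracks coverage, not the alleles themselves, so it is much less than what a Lazarus-style reconstruction does.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def paternal_coverage(segments_by_chrom):
    """Per chromosome, the regions covered by non-maternal matches."""
    return {chrom: merge_intervals(ivs)
            for chrom, ivs in segments_by_chrom.items()}

print(merge_intervals([(100, 200), (150, 300), (400, 500)]))
# [(100, 300), (400, 500)]
```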

Though we are now getting into why it is important to derive the mathematical genetic decomposition or "factorization" of people for comparison. The determination of which side a DNA cousin lies on of my father is answered by how the pieces fit together to form a completed puzzle and depends on the mathematical decomposition of a genome into a quantitative genetic genealogy; unlike a common puzzle where each piece has a unique fit, this puzzle can be assembled multiple ways from multiple other puzzles. The classification and categorization of which are methodically significant and mathematically possible.

commented Jun 7, 2016 by Ian Mclean G2G6 Mach 1 (13.6k points)
