`trident fetch -d /somewhere/to/store/the/archive --downloadAll `

# 3 Introduction to F3- and F4-Statistics

## 3.1 Admixture - F3 Statistics

F3 statistics are a useful analytical tool to understand population relationships. F3 statistics, just as F4 and F2 statistics measure allele frequency correlations between populations and were introduced by Nick Patterson (Patterson et al. 2012), but see also (Peter 2016) for another introduction.

F3 statistics are used for two purposes: i) as a test whether a target population (C) is admixed between two source populations (A and B), and ii) to measure shared drift between two test populations (A and B) from an outgroup (C).

F3 statistics are in both cases defined as the product of allele frequency differences between population C to A and B, respectively:

\[F3(A,B;C)=\langle(c−a)(c−b)\rangle\]

Here, \(\langle\cdot\rangle\) denotes the average over all genotyped sites, and a, b and c denote the allele frequency for a given site in the three populations A, B and C.

It can be shown that if \(F3(A, B; C)\) is negative, it provides unambiguous proof that population C is admixed between populations A and B, as in the following phylogeny (taken from Figure 1 from (Patterson et al. 2012):

Intuitively, an F3 statistics becomes negative if the allele frequency of the target population C is on average intermediate between the allele frequencies of A and B. Consider as an extreme example a genomic site where \(a=0\), \(b=1\) and \(c=0.5\). Then we have \((c−a)(c−b)=−0.25\), which is negative. So if the entire statistics is negative, it suggests that in many positions, the allele frequency c is indeed intermediate, suggesting admixture between the two sources.

One way to understand this is by looking what happens to a list of SNPs and allele frequencies for groups A, B and C:

SNP | A | B | C | \((c-a)(c-b)\) |
---|---|---|---|---|

1 | 1 | 0 | 0.5 | -0.25 |

2 | 0.8 | 0 | 0 | 0 |

3 | 0 | 0.7 | 0 | 0 |

4 | 0.1 | 0.5 | 0.3 | -0.04 |

5 | 0 | 0.1 | 0.2 | 0.02 |

6 | 1 | 0.2 | 0.9 | -0.07 |

\(F_3(A, B;C)\) | -0.057 |

Every SNP where C has an allele frequency *intermediate* between A and B contributes negatively. Here, the average is also negative, providing evidence for admixture. For statistical certainty, an error bar for this estimate is needed, which is typically computed via Jackknife (see for example the xerxes whitepaper).

### 3.1.1 Computing Admixture-F3 with xerxes

We will use this statistics to test if Finnish are admixed between East and West, using different Eastern and Western sources. In the West, we use French, Icelandic, Lithuanian and Norwegian as source, and in the East we use Nganasan and one of the ancient individuals analysed in (Lamnidis et al. 2018), from the site of *Bolshoy Oleni Ostrov* from the Northern Russian Kola-peninsula, and dating to 3500 years before present.

We use the software `xerxes fstats`

from the Poseidon Framework. Here is a command line that computes 8 statistics for us:

```
xerxes fstats -d ~/dev/poseidon-framework/community-archive \
--stat "F3(Nganasan,French,Finnish)" \
--stat "F3(Nganasan, Icelandic, Finnish)" \
--stat "F3(Nganasan, Lithuanian, Finnish)" \
--stat "F3(Nganasan, Norwegian, Finnish)" \
--stat "F3(Russia_Bolshoy, French, Finnish)" \
--stat "F3(Russia_Bolshoy, Icelandic, Finnish)" \
--stat "F3(Russia_Bolshoy, Lithuanian, Finnish)" \
--stat "F3(Russia_Bolshoy, Norwegian, Finnish)" \
```

Here is the result that you should get, nicely layouted in a Text-table:

```
.-----------.----------------.------------.---------.---.---------.----------------.--------------------.------------------.---------------------.
| Statistic | a | b | c | d | NrSites | Estimate_Total | Estimate_Jackknife | StdErr_Jackknife | Z_score_Jackknife |
:===========:================:============:=========:===:=========:================:====================:==================:=====================:
| F3 | Nganasan | French | Finnish | | 593124 | -1.0450e-3 | -1.0451e-3 | 1.2669e-4 | -8.249133659451905 |
| F3 | Nganasan | Icelandic | Finnish | | 593124 | -1.1920e-3 | -1.1920e-3 | 1.3381e-4 | -8.908381869946188 |
| F3 | Nganasan | Lithuanian | Finnish | | 593124 | -1.1605e-3 | -1.1605e-3 | 1.5540e-4 | -7.4680182465607245 |
| F3 | Nganasan | Norwegian | Finnish | | 593124 | -1.0913e-3 | -1.0914e-3 | 1.3921e-4 | -7.83945981272796 |
| F3 | Russia_Bolshoy | French | Finnish | | 542789 | -6.1807e-4 | -6.1809e-4 | 1.0200e-4 | -6.059872900228102 |
| F3 | Russia_Bolshoy | Icelandic | Finnish | | 542789 | -6.2801e-4 | -6.2802e-4 | 1.1792e-4 | -5.325695961373772 |
| F3 | Russia_Bolshoy | Lithuanian | Finnish | | 542789 | -3.7310e-4 | -3.7310e-4 | 1.2973e-4 | -2.8760029685791637 |
| F3 | Russia_Bolshoy | Norwegian | Finnish | | 542789 | -3.7646e-4 | -3.7653e-4 | 1.1630e-4 | -3.2375830440434323 |
'-----------'----------------'------------'---------'---'---------'----------------'--------------------'------------------'---------------------'
```

You can see that in all cases this statistic is negative (`Estimate_Total`

). The next two columns ( `Estimate_Jackknife`

and `StdErr_Jackknife`

) are computed using Jackknifing (see the xerxes whitepaper for details). The last column is the Z score, and it is important here: It gives the deviation of the f3 statistic from zero in units of the standard error. As general rule, a Z score of -3 or more suggests a significant rejection of the Null hypothesis that the statistic is not negative. In this case, all of the statistics are significantly negative (with one borderline exception), proving that Finnish have ancestral admixture of East and West Eurasian ancestry. Here, Eastern ancestry is represented by Nganasan or Russia_Bolshoy, respectively, while Western ancestry by French, Icelandic, Lithuanian and Norwegian, respectively. Note that the statistics does not suggest when this admixture happened!

### 3.1.2 Running xerxes via a configuration file

As you will have noticed, the command line above is getting quite long, since a separate `--stat`

option has to be entered for every statistic to be computed. There is a more powerful and elegant interface to xerxes, which uses a configuration file in YAML format. To illustrate it, let us consider the configuration file (which can be found in `fstats_working/F3_finnish.config`

in the git-repository of this book) needed to compute the same statistic as above:

```
fstats:
- type: F3
a: ["Nganasan", "Russia_Bolshoy"]
b: ["French", "Icelandic", "Lithuanian", "Norwegian"]
c: ["Finnish"]
```

You can then run xerxes as

`xerxes fstats -d ~/dev/poseidon-framework/community-archive --statConfig fstats_working/F3_finnish.config`

So `xerxes`

then automatically creates all combinations of populations listed in slots `a`

, `b`

and `c`

.

## 3.2 F4 Statistics

A different way to test for admixture is by “F4 statistics” (or “D statistics” which is very similar), also introduced in (Patterson et al. 2012).

F4 statistics are also defined in terms of correlations of allele frequency differences, similarly to F3 statistics, but involving four different populations, not just three. Specifically we define

\[F4(A,B;C,D)=\langle(a−b)(c−d)\rangle.\]

### 3.2.1 Shaping intuition - the ABBA- and BABA-sites

To understand the statistics, consider the following tree:

In this tree, without any additional admixture, the allele frequency difference between A and B should be completely independent from the allele frequency difference between C and D. In that case, F4(A, B; C, D) should be zero, or at least not statistically different from zero. However, if there was gene flow from C or D into A or B, the statistic should be different from zero. Specifically, if the statistic is significantly negative, it implies gene flow between either C and B, or D and A. If it is significantly positive, it implies gene flow between A and C, or B and D.

It is helpful to again consider an example using a SNP list, this time assuming that every population is just a single (haploid) individual, so each allele frequency can just be 0 or 1. For example:

SNP | A | B | C | D | \((a-b)(c-d)\) |
---|---|---|---|---|---|

1 | 1 | 0 | 0 | 0 | 0 |

2 | 1 | 0 | 1 | 1 | 0 |

3 | 0 | 1 | 1 | 0 | -1 |

4 | 0 | 1 | 0 | 1 | 1 |

5 | 1 | 0 | 0 | 1 | -1 |

6 | 1 | 0 | 0 | 0 | 0 |

\(F_4(A, B;C, D)\) | -0.0167 |

You can see that the only SNPs that contribute positively to this statistics are SNPs where the alleles are distributed as `1010`

and `0101`

, and the only SNPs that contribute negatively are `1001`

and `0110`

. In the literature, the two patterns have been dubbed “ABBA” and “BABA”, which is why the statistical test behind this statistic (see below) was sometimes called the ABBA-BABA test (see for example (Martin, Davey, and Jiggins 2015)).

The intuition here is straight-forward: In positions that are polymorphic in both \((A,B)\) and \((C,D)\), this statistic asks whether B is genetically more similar to C than it is to D. This is most useful as a test for “treeness”: If A, B, C, D are related to each other as indicated in the above tree, then B should be equally closely related to C as to D. But if we actually find evidence that B is closer to C than to D, or vice versa, then this means that the tree above cannot be correct, but that there must be a closer connection between B and C or B and D, depending on the sign of the statistic.

### 3.2.2 From single samples to allele frequencies

So the ABBA- and BABA-categories of SNPs help shape intuition for how this statistic behaves for single haploid genomes. But what about population allele frequencies? Looking back at the formula \(\langle(a−b)(c−d)\rangle\) this doesn’t help very much with intuition how this behaves with frequencies. Well, a nice feature of F4-Statistics is that averages factor out. This means, that if you have multiple samples in one or multiple slots A, B, C or D, the total F4-statistic of the *groups* is exactly equal to the *average* of F4-Statistics of the *individuals*. Here is a more mathematical definition.

Let’s say we have 2 individuals in each of A and B, so we may perhaps write \(A=\{A_1,A_2\}\) and \(B=\{B_1,B_2\}\). Then one can show to have

\[F4(A, B; C, D) = \text{Average of}[F4(A_1, B_1; C, D), F4(A_1, B_2; C, D), F4(A_2, B_1; C, D), F4(A_2, B_2; C, D)]\]

so just thte average over all individual-based F4-statistics. And this can be shown to be true for arbitrary numbers of samples. So in other words: An F4-Statistic *always* measures the *average* excess of pairwise BABA SNPs over ABBA SNPs. To me, this is a useful insight, as I find thinking in terms of ABBA-BABA somehow more helpful than thinking in terms of correlations of allele-frequency differences (which is really what the original formula is).

The way this statistic is often used, is to put a divergent outgroup as population A, for which we know for sure that there was no admixture into either C or D. With this setup, we can then test for gene flow between B and D (if the statistic is positive), or B and C (if it is negative).

### 3.2.3 Running F4-Statistics with xerxes

Here, we can use this statistic to test for East Asian admixture in Finns, similarly to the test using Admixture F3 statistics above. We will again use `xerxes fstats`

. We again prepare a configuration file (in `fstats_working/F4_finish.config`

in the git-repository of this book), this time with four populations (A, B, C, D):

```
fstats:
- type: F4
a: ["Mbuti"]
b: ["Nganasan", "Russia_Bolshoy"]
c: ["French", "Icelandic", "Lithuanian", "Norwegian"]
d: ["Finnish"]
```

You can again run via

`xerxes fstats -d ~/dev/poseidon-framework/community-archive --statConfig fstats_working/F4_finnish.config`

The result is:

```
.-----------.-------.----------------.------------.---------.---------.----------------.--------------------.------------------.--------------------.
| Statistic | a | b | c | d | NrSites | Estimate_Total | Estimate_Jackknife | StdErr_Jackknife | Z_score_Jackknife |
:===========:=======:================:============:=========:=========:================:====================:==================:====================:
| F4 | Mbuti | Nganasan | French | Finnish | 593124 | 2.3114e-3 | 2.3115e-3 | 1.2676e-4 | 18.235604067907143 |
| F4 | Mbuti | Nganasan | Icelandic | Finnish | 593124 | 1.6590e-3 | 1.6590e-3 | 1.4861e-4 | 11.163339072181776 |
| F4 | Mbuti | Nganasan | Lithuanian | Finnish | 593124 | 1.3290e-3 | 1.3290e-3 | 1.4681e-4 | 9.052979707622278 |
| F4 | Mbuti | Nganasan | Norwegian | Finnish | 593124 | 1.6503e-3 | 1.6503e-3 | 1.5358e-4 | 10.745850997260929 |
| F4 | Mbuti | Russia_Bolshoy | French | Finnish | 542789 | 1.8785e-3 | 1.8785e-3 | 1.2646e-4 | 14.854487416366263 |
| F4 | Mbuti | Russia_Bolshoy | Icelandic | Finnish | 542789 | 1.0829e-3 | 1.0828e-3 | 1.4963e-4 | 7.236818881873822 |
| F4 | Mbuti | Russia_Bolshoy | Lithuanian | Finnish | 542789 | 5.4902e-4 | 5.4907e-4 | 1.4601e-4 | 3.7605973064589096 |
| F4 | Mbuti | Russia_Bolshoy | Norwegian | Finnish | 542789 | 9.3473e-4 | 9.3475e-4 | 1.5302e-4 | 6.108881868125652 |
'-----------'-------'----------------'------------'---------'---------'----------------'--------------------'------------------'--------------------'
```

As you can see, in all cases, the Z score is positive and larger than 3, indicating a significant deviation from zero, and implying gene flow between Nganasan and Finnish, and BolshoyOleniOstrov and Finnish, when compared to French, Icelandic, Lithuanian or Norwegian.

## 3.3 Outgroup-F3-Statistics

Outgroup F3 statistics are a special case how to use F3 statistics. The definition is the same as for Admixture F3 statistics, but instead of a target C and two source populations A and B, one now gives an outgroup C and two test populations A and B.

To get an intuition for this statistics, consider the following tree:

In this scenario, the statistic F3(A, B; C) measures the branch length from C to the common ancestor of A and B, coloured red. So this statistic is simply a measure of how closely two population A and B are related with each other, as measured from a distant outgroup. It is thus a similarity measure: The higher the statistic, the more genetically similar A and B are to one another.

Here is again a SNP table to illustrate, using haploid individuals:

SNP | A | B | C | \((c-a)(c-b)\) |
---|---|---|---|---|

1 | 1 | 0 | 0 | 0 |

2 | 1 | 0 | 0 | 0 |

3 | 0 | 0 | 1 | 1 |

4 | 1 | 0 | 1 | 0 |

5 | 1 | 1 | 0 | 1 |

6 | 0 | 0 | 1 | 1 |

\(F_3(A, B;C)\) | 0.5 |

You can see that each position which is similar between A and B, but different to C contributes 1, all other SNPs 0. So it directly measures similarity between A and B on alleles that differ from the outgroup C.

We can use this statistic to measure for example the genetic affinity to East Asia, by performing the statistic F3(Han, X; Mbuti), where Mbuti is a distant African population and acts as outgroup here, Han denote Han Chinese, and X denotes various European populations that we want to test.

You can again define a configuration file that performs looping over various populations X for you:

```
fstats:
- type: F3
a: ["Han"]
b: ["Chuvash", "Albanian", "Armenian", "Bulgarian", "Czech", "Druze", "English",
"Estonian", "Finnish", "French", "Georgian", "Greek", "Hungarian", "Icelandic",
"Italian_North", "Italian_South", "Lithuanian", "Maltese", "Mordovian", "Norwegian",
"Orcadian", "Russian", "Sardinian", "Scottish", "Sicilian", "Spanish_North",
"Spanish", "Ukrainian", "Finland_Levanluhta", "Russia_Bolshoy", "Russia_Chalmny_Varre", "Saami.DG"]
c: ["Mbuti"]
```

which cycles through many populations from Europe, including the ancient individuals from Chalmny Varre, Bolshoy Oleni Ostrov and Levänluhta (described in (Lamnidis et al. 2018)). We store this file in a file called `fstats_working/OutgroupF3_europe.config`

and run via:

`xerxes fstats --statConfig fstats_working/OutgroupF3_europe.config -d ~/dev/poseidon-framework/community-archive -f fstats_working/outgroupf3_europe.tsv`

Here is the output of this run (but note that a tab-separated version was also stored in `fstats_working/outgroupf3_europe.tsv`

using the option `-f`

):

```
.-----------.-----.----------------------.-------.---.---------.----------------.--------------------.------------------.--------------------.
| Statistic | a | b | c | d | NrSites | Estimate_Total | Estimate_Jackknife | StdErr_Jackknife | Z_score_Jackknife |
:===========:=====:======================:=======:===:=========:================:====================:==================:====================:
| F3 | Han | Chuvash | Mbuti | | 593124 | 5.3967e-2 | 5.3967e-2 | 5.0668e-4 | 106.51180329550319 |
| F3 | Han | Albanian | Mbuti | | 593124 | 4.9972e-2 | 4.9973e-2 | 4.9520e-4 | 100.91326321202445 |
| F3 | Han | Armenian | Mbuti | | 593124 | 4.9531e-2 | 4.9531e-2 | 4.7771e-4 | 103.68366652942314 |
| F3 | Han | Bulgarian | Mbuti | | 593124 | 5.0103e-2 | 5.0103e-2 | 4.8624e-4 | 103.04188532686614 |
| F3 | Han | Czech | Mbuti | | 593124 | 5.0536e-2 | 5.0536e-2 | 4.9261e-4 | 102.58792370749681 |
| F3 | Han | Druze | Mbuti | | 593124 | 4.8564e-2 | 4.8564e-2 | 4.6788e-4 | 103.79674299622445 |
| F3 | Han | English | Mbuti | | 593124 | 5.0280e-2 | 5.0281e-2 | 4.9183e-4 | 102.23198323949656 |
| F3 | Han | Estonian | Mbuti | | 593124 | 5.1154e-2 | 5.1155e-2 | 5.0350e-4 | 101.59882496016485 |
| F3 | Han | Finnish | Mbuti | | 593124 | 5.1784e-2 | 5.1784e-2 | 5.0603e-4 | 102.33488758899031 |
| F3 | Han | French | Mbuti | | 593124 | 5.0207e-2 | 5.0208e-2 | 4.8552e-4 | 103.40976592749682 |
| F3 | Han | Georgian | Mbuti | | 593124 | 4.9711e-2 | 4.9711e-2 | 4.8100e-4 | 103.34881140790415 |
| F3 | Han | Greek | Mbuti | | 593124 | 4.9874e-2 | 4.9874e-2 | 4.8994e-4 | 101.79554640756365 |
| F3 | Han | Hungarian | Mbuti | | 593124 | 5.0497e-2 | 5.0498e-2 | 4.9878e-4 | 101.24215699276706 |
| F3 | Han | Icelandic | Mbuti | | 593124 | 5.0680e-2 | 5.0680e-2 | 4.9729e-4 | 101.91303336514295 |
| F3 | Han | Italian_North | Mbuti | | 593124 | 4.9903e-2 | 4.9904e-2 | 4.8436e-4 | 103.03094306099203 |
| F3 | Han | Italian_South | Mbuti | | 592980 | 4.9201e-2 | 4.9201e-2 | 5.1170e-4 | 96.15239597674244 |
| F3 | Han | Lithuanian | Mbuti | | 593124 | 5.0896e-2 | 5.0896e-2 | 5.0638e-4 | 100.50984037418753 |
| F3 | Han | Maltese | Mbuti | | 593124 | 4.8751e-2 | 4.8751e-2 | 4.7500e-4 | 102.63442479673623 |
| F3 | Han | Mordovian | Mbuti | | 593124 | 5.1820e-2 | 5.1820e-2 | 4.8853e-4 | 106.07409963190884 |
| F3 | Han | Norwegian | Mbuti | | 593124 | 5.0724e-2 | 5.0724e-2 | 4.9514e-4 | 102.4454387098217 |
| F3 | Han | Orcadian | Mbuti | | 593124 | 5.0469e-2 | 5.0469e-2 | 4.9485e-4 | 101.98814656611475 |
| F3 | Han | Russian | Mbuti | | 593124 | 5.1277e-2 | 5.1277e-2 | 4.8613e-4 | 105.48070801791317 |
| F3 | Han | Sardinian | Mbuti | | 593124 | 4.9416e-2 | 4.9417e-2 | 4.8908e-4 | 101.04049389691913 |
| F3 | Han | Scottish | Mbuti | | 593124 | 5.0635e-2 | 5.0635e-2 | 5.0565e-4 | 100.13962104744425 |
| F3 | Han | Sicilian | Mbuti | | 593124 | 4.9194e-2 | 4.9194e-2 | 4.8157e-4 | 102.15353663091187 |
| F3 | Han | Spanish_North | Mbuti | | 593124 | 5.0032e-2 | 5.0032e-2 | 4.9377e-4 | 101.32594226555439 |
| F3 | Han | Spanish | Mbuti | | 593124 | 4.9693e-2 | 4.9693e-2 | 4.8551e-4 | 102.35200847948641 |
| F3 | Han | Ukrainian | Mbuti | | 593124 | 5.0731e-2 | 5.0731e-2 | 4.9506e-4 | 102.47529111692852 |
| F3 | Han | Finland_Levanluhta | Mbuti | | 303033 | 5.4488e-2 | 5.4488e-2 | 5.7681e-4 | 94.46487653920919 |
| F3 | Han | Russia_Bolshoy | Mbuti | | 542789 | 5.7273e-2 | 5.7273e-2 | 5.2875e-4 | 108.31739594898687 |
| F3 | Han | Russia_Chalmny_Varre | Mbuti | | 428215 | 5.4000e-2 | 5.4000e-2 | 5.6936e-4 | 94.84371082564112 |
| F3 | Han | Saami.DG | Mbuti | | 585193 | 5.4727e-2 | 5.4728e-2 | 5.5546e-4 | 98.5265149143263 |
'-----------'-----'----------------------'-------'---'---------'----------------'--------------------'------------------'--------------------'
```

Now it’s time to plot these results using R. Let’s first read in the table:

`<- read.csv("fstats_working/outgroupf3_europe.tsv", sep = "\t") d `

We can check that it worked:

`head(d)`

```
Statistic a b c d NrSites Estimate_Total Estimate_Jackknife
1 F3 Han Chuvash Mbuti NA 593124 0.053967 0.053967
2 F3 Han Albanian Mbuti NA 593124 0.049972 0.049973
3 F3 Han Armenian Mbuti NA 593124 0.049531 0.049531
4 F3 Han Bulgarian Mbuti NA 593124 0.050103 0.050103
5 F3 Han Czech Mbuti NA 593124 0.050536 0.050536
6 F3 Han Druze Mbuti NA 593124 0.048564 0.048564
StdErr_Jackknife Z_score_Jackknife
1 0.00050668 106.5118
2 0.00049520 100.9133
3 0.00047771 103.6837
4 0.00048624 103.0419
5 0.00049261 102.5879
6 0.00046788 103.7967
```

Nice, now on to plotting (here I’m using Base R for zero-dependency pain, you’re welcome!):

```
<- order(d$Estimate_Total) # order the estimates for visual effect
order <- d$Estimate_Jackknife[order]
x <- d$StdErr_Jackknife[order]
xErr <- seq_along(d$b)
y plot(x, y, xlab = "Z Score", ylab = NA, yaxt = "n", # plot no y-axis ticks
xlim = c(0.048,0.06),
main = "F3(Han, X; Mbuti)")
# plot the labels
text(x + 0.001, y, labels = d$b, adj=0)
# plot the error bars
arrows(x - xErr, y, x + xErr, y, length=0.05, angle=90, code=3)
```

As expected, the ancient samples and modern Saami are the ones with the highest allele sharing with present-day East Asians (as represented by Han) compared to many other Europeans.