(Previous | Contents | Next)

Evolutionary Forces

This article describes, in some technical detail, the way we model evolutionary forces. You don't need this information to get the program running, but it can be very helpful in understanding your results.

The forces supported at this time are:

For each force, we describe the parameter being estimated, and list some assumptions made in the analysis. If the assumptions are violated, the results will not necessarily be wrong, but should be interpreted cautiously.

Coalescence

We estimate the parameter Theta for each population. While most papers describe Theta as 4Nemu (where Ne is the effective number of individuals and mu is the mutation rate in mutations per generation), this is specific to diploids. If you put haploid mtDNA into LAMARC, the Theta estimates will be estimates of 2Nfmu instead, where Nf is the effective number of females in the population. It is best to think of Theta as "number of heritable copies in population * 2 * mutation rate," since this definition works no matter what the ploidy is. (The "2" comes from the fact that two sequences that have diverged for time t are different by 2 * mu * t mutations, since both diverging lineages accumulate mutations.)

The units of mu are mutations per site per generation. Please note that many studies compute a (lower-case) theta whose units involve mutations per locus per generation. To convert, multiply the per- site Theta by the number of sites.

If the "multiple rate categories" option of the data model is used, the mutation rate is the weighted mean across all categories.

While it would be helpful to have separate estimates of population size and mutation rate, with genetic data from a single time point there is no way to separate them. If you have outside information about either population size or mutation rate, for example from outgroup analysis, you can then estimate the other parameter directly.

A Frequently Asked Question is how to interpret the parameter Theta when analyzing mitochondrial DNA or Y-chromosome DNA. In these cases Theta is proportional to the mtDNA or Y-chromosome effective population size. For example, when given mtDNA, the program estimates Theta=2Nfmu, and when given Y-chromosome DNA, the program estimates Theta=2Nmmu (where "f" and "m" refer to the female and male effective population sizes).

To combine estimates of Theta from regions with different Thetas, such as mtDNA and nuclear DNA, you can set the relative population sizes in the 'Effective Population Size' menu, or set these values directly in the XML input. Reported multi-region Theta estimates will be scaled accordingly.

Assumptions:

Migration

We estimate the immigration rate into each population from each other population (moving forward in time), so that a three-population case will estimate six rates. The immigration parameter M is expressed as m/mu where m is the immigration rate per generation and mu is the neutral mutation rate per site per generation. (As usual, if multiple mutation categories are used, mu is the weighted average.)

If you would prefer to consider migration in terms of 4Nm, multiply the given value by the recipient population's value of Theta. Please be careful of the distinction between immigration and emigration. LAMARC always estimates and reports immigration, so 'M12' indicates the rate at which individuals from population 1 have moved to population 2.

In the current version we cannot meaningfully combine genetic regions with different migration rates in the same analysis. We hope to provide this capability later.

Peter Beerli has argued that it may be useful to include a population in your analysis even if you have not sampled any individuals from it. This can guard against biases produced by unacknowledged populations. For example, if you believe that your real-life situation involves three populations exchanging migrants, but you have been able to sample only two of them, you might add the third population even with no individuals. The parameter estimates involving this population will be weak, but the estimates involving the other two populations may be more accurate than if the unsampled population were omitted entirely.

However, LAMARC is not very stable with unsampled populations, because the likelihood surface for the parameters of the unsampled population may be flat or multi-peaked. For this reason we have not allowed unsampled populations in v2. If you feel this capability would help you, consider using MIGRATE, and also please let us know. We will work on stabilizing estimates for unsampled populations if this is of general interest.

Assumptions:

We do not assume that migration is symmetrical, though you can impose this assumption if you wish using constraints.

Recombination

We estimate the recombination rate r = C/mu, where C is the recombination rate per inter-site link per generation and mu is the neutral mutation rate per site per generation. As usual, if multiple rate categories are in effect, mu is the weighted average of the categories.

In the current version we cannot meaningfully combine regions or populations with different recombination rates. We hope to provide this capability later.

We do not assume that all recombinations are visible. An advantage of LAMARC and its predecessor RECOMBINE over pairwise methods is that their recombination estimates are not dragged downwards if the data are nearly invariant (though the error bars, of course, increase greatly).

Assumptions:

Growth

We estimate the exponential growth rate g as defined in the following equation, where t is a time before the present:

Thetat = Thetapresent time exp(-gt)

This means that positive values of g indicate a population which is growing, and negative values a population which is shrinking. They are not symmetrical in magnitude, however; g = 10 indicates rather slow growth, but g = -10 indicates significant shrinkage for most values of Theta.

When migration is also in effect, growth rates are computed for each population independently.

In the presence of growth, the reported value of Theta is the present-day Theta (at the time when the data were sampled). The equation above can be used to determine values of Theta for past times. Just remember that the time parameter t is measured in units of mutations (i.e. 1 t is the average number of generations it takes one site to accumulate one mutation), and g is measured in the inverse units of t. If mu is known, divide generations by mu to get units of t (conversely, t*mu is a number of generations).

Assumptions:

Simulation studies show that estimation of growth tends to be biased upwards; this bias is reduced by having multiple unlinked regions and/or, in cases with recombination, long sequences. Be cautious in interpreting results from one or a few regions. The profile likelihoods are more reliable than the maxima but can also show the bias.

Gamma-distributed relative mutation rate over regions

This isn't exactly a "force," but because LAMARC optionally allows you to estimate the parameter (α) of the gamma distribution which best fits data sets composed of multiple unlinked genomic regions, it's in the menu for evolutionary forces. More information about this is available here.

Because there need to be multiple genomic regions in order for there to be relative mutation rates to distribute among them, this "force" will not appear in the evolutionary forces menu if the data is annotated as coming from a single genomic region.

Due to a mathematical detail of how this feature is implemented in LAMARC, it can only be applied to maximum-likelihood analyses. If you wish to perform a Bayesian analysis, your best bet is to estimate the relative single-region mutation rates by some other means, then supply these to LAMARC as constants.

Assumptions:

(Previous | Contents | Next)