LAMARC Documentation: Suitable data for LAMARC

Suitable data for LAMARC

This article gives our best available information on the type and amount of data needed to use LAMARC successfully. We cannot guarantee success even if you meet these criteria, as every data set is different; but if you have much less data than described here, you will probably be disappointed in your results.

How many unlinked regions?

One answer is "as many as possible." Theta and migration rate can be estimated from a single region, but the estimates improve markedly with more; doubling the number of regions nearly doubles the available information. Estimation of growth rate is very poor with fewer than 3 unlinked regions and particularly benefits from having more. The only exception to this rule is recombination: the best estimate of recombination rate will come from a single lengthy region.

Runtime increases approximately linearly with the number of regions. It is hard to have too many, unless you exceed your computer's memory capacity.

How many samples?

Surprisingly, Felsenstein has shown that for estimation of Theta the optimal number of samples is very low, around 8. (Reference: Felsenstein 2006.) If you can obtain another unlinked region you will gain more than you would gain by increasing the sample size for the current region beyond 8. We generally recommend aiming for 10-20 sequences, or 10-20 sequences per subpopulation in a subdivided population. Again, recombination is an exception; a recombination analysis probably needs 15-20 sequences.

If you cannot get multiple genes, additional sequences are some help, but we do not recommend going above 30 sequences per subpopulation, as the added difficulty of the analysis more than outweighs the gain in information. Remember that the more sequences you have, the longer you will have to search to find good trees, and the longer each tree will take to process. Runtime goes up with the log of the number of sequences, but you will also have to perform more steps, perhaps many more steps; in the final outcome, adding sequences probably causes a worse-than-linear increase in needed runtime.
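
One rough intuition for the diminishing return from extra sequences (this is not Felsenstein's derivation, only an illustration) comes from a standard neutral-coalescent result: the expected number of segregating sites in a sample of n sequences is proportional to a_n = 1 + 1/2 + ... + 1/(n-1), which grows only logarithmically with n. The sketch below assumes that simplified model; the function name is ours and is not part of LAMARC.

```python
def harmonic_factor(n):
    """Return a_n = sum_{i=1}^{n-1} 1/i.

    Under the standard neutral coalescent (an assumption of this sketch),
    the expected number of segregating sites scales with a_n, so a_n is a
    crude proxy for how much variation a sample of n sequences exposes.
    """
    if n < 2:
        raise ValueError("need at least 2 sequences")
    return sum(1.0 / i for i in range(1, n))


# Doubling the sample size repeatedly adds less and less to a_n.
for n in (8, 16, 32, 64):
    print(n, round(harmonic_factor(n), 2))
```

Each doubling of the sample adds roughly the same small increment to a_n, whereas each additional unlinked region contributes a whole new independent tree.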

How many linked sites?

This depends on the expected level of polymorphism. For DNA or SNP data, a region should ideally be long enough to see 10 or more variable sites. If that's not possible you will definitely need multiple unlinked regions. Above 100 or so variable sites, there is little additional information to be gained outside of estimating recombination.
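
As a quick check before a run, you can count the variable sites in an aligned region yourself. The following is a minimal sketch, assuming the region is held as a list of equal-length strings and that gaps and ambiguity codes ("-", "?", "N") should be ignored; the function names and messages are illustrative and are not part of LAMARC.

```python
def count_variable_sites(sequences, ignore="-?N"):
    """Count alignment columns that show more than one distinct state."""
    if not sequences:
        raise ValueError("no sequences given")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to equal length")
    skipped = set(ignore)
    variable = 0
    for column in zip(*sequences):
        # Collect the states seen in this column, dropping gaps/unknowns.
        states = {c.upper() for c in column} - skipped
        if len(states) > 1:
            variable += 1
    return variable


def assess_region(n_variable):
    """Compare a count against the rough 10 and 100 thresholds in the text."""
    if n_variable < 10:
        return "few variable sites: plan on multiple unlinked regions"
    if n_variable > 100:
        return "many variable sites: extra sites mainly help recombination"
    return "adequate for a single region"


alignment = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTAC-T"]
n = count_variable_sites(alignment)
print(n, assess_region(n))
```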

Runtime goes up less than linearly with number of sites; how much less depends on how polymorphic your data are. Nearly invariant data are very quick, so if you have long sequences, by all means use them (assuming you have the memory).

Recombination is an exception. You would like your sequenced area to include tens of recombinations, but not hundreds. Unfortunately it is hard to know this in advance. A recombinational analysis will definitely need 20-30 variable sites or more to have much accuracy. In general, sequences for recombination inference should be long, but watch out for signs that you are in the hundreds-of-recombinations zone as such runs will take a very long time. (The estimates will be good, but you won't want to wait for them.)

What kinds of samples?

LAMARC requires random samples from each subpopulation. Do not cherry-pick the most divergent or most interesting sequences, do not pick one sequence from each major lineage, do not discard identical sequences, and do not do anything else of that sort! Such data will give grossly biased estimates.
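
If you have collected more individuals than you can sequence, one simple way to avoid unconscious cherry-picking is to let a random number generator choose. This is a minimal sketch using Python's standard library; the record layout and the function name are hypothetical.

```python
import random


def draw_random_sample(individuals, k, seed=None):
    """Draw k individuals uniformly at random, without replacement.

    No filtering is applied: individuals carrying identical sequences
    are as likely to be drawn as any others, which is what is wanted.
    """
    if k > len(individuals):
        raise ValueError("cannot draw more individuals than were collected")
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(list(individuals), k)


# Hypothetical field collection: (individual id, subpopulation) records.
collected = [("ind%02d" % i, "north") for i in range(40)]
chosen = draw_random_sample(collected, 12, seed=1)
print(len(chosen))
```

Draw separately within each subpopulation, and record the seed so the choice can be reproduced.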

It has been stated in print (not by us!) that mutually incompatible sites in the data set must be removed. This is totally untrue. It is true for programs which assume an infinite-sites model, but LAMARC does not. Remove sites only if you cannot be sure they are correctly aligned. It is best to remove all alignment columns for which the alignment is doubtful.
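
Removing doubtful columns must be done for every sequence at once so that the alignment stays consistent. Here is a minimal sketch, assuming the alignment is a list of equal-length strings and that you have already decided by inspection which 0-based column indices are doubtful; the function name is illustrative.

```python
def drop_columns(sequences, doubtful):
    """Return the alignment with the given 0-based columns removed."""
    doubtful = set(doubtful)
    return [
        "".join(base for i, base in enumerate(seq) if i not in doubtful)
        for seq in sequences
    ]


alignment = ["ACG-TAC", "ACGGTAC", "AC-GTAC"]
# Columns 2 and 3 were judged unreliable by eye in this made-up example.
print(drop_columns(alignment, [2, 3]))
```

Note that this removes columns only because their alignment is uncertain, never because the sites are mutually incompatible.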

Be sure that all of your samples are of the same homolog; throwing in a few sequences from a paralog will ruin your analysis.

You do not need to sample from multiple subpopulations evenly, nor in proportion to their population sizes. However, try not to let the sample size from any subpopulation go below 2 diploid or 4 haploid individuals, as estimates based on smaller samples are likely to be unstable.
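
The sample-size guidance in this article can be turned into a simple pre-flight check. The sketch below assumes you count sampled gene copies per subpopulation (so 2 diploid individuals count as 4) and uses the lower bound of 4 and the upper bound of 30 sequences per subpopulation mentioned above; the function name and messages are ours, not LAMARC's.

```python
def check_sample_sizes(copies_per_subpop, minimum=4, maximum=30):
    """Return a dict of warnings keyed by subpopulation name."""
    warnings = {}
    for name, n in copies_per_subpop.items():
        if n < minimum:
            warnings[name] = "only %d copies: estimates may be unstable" % n
        elif n > maximum:
            warnings[name] = "%d copies: runtime cost may outweigh the gain" % n
    return warnings


# Uneven sampling is fine; only the extremes are flagged.
samples = {"north": 12, "south": 3, "east": 40}
for name, message in sorted(check_sample_sizes(samples).items()):
    print(name, "->", message)
```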

What kinds of populations?

LAMARC works well for subpopulations which show definite signs of geographic structure. You may want to use the program STRUCTURE to test for this before beginning. If STRUCTURE does not see any structure in your populations, the migration rate is probably too high and you should pool those populations together.

At the other extreme, if your populations are totally isolated you will not, of course, get a good estimate of the migration rate, though other parameters can be well estimated. If your populations have probably not exchanged migrants in the last 4N generations (for a diploid; 2N generations for a haploid), you should analyze each one separately and not try to estimate migration rates.
