Population Genetics forum: topic


Calculation Jost D

Penny Nelson

Thursday, 12 Mar 2009 23:43 UTC

I have calculated the D and ST for my data. GST = 0.09 while D = 0.2! This is quite a difference. The ST was 1.3 – does that mean that I only have 1.3 effective populations? Is anyone else calculating these stats, and can I use D for pairwise population differentiation? I see that there is an online Jost D calculator at www.ngcrawford.com/django/jost/ which calculates pairwise population differentiation locus by locus.

Thanks

  • Replies

    • Hello everyone

      While waiting for a way to calculate Jost’s D across multiple loci, and pairwise D, other than taking the harmonic mean of the per-locus D values, I have been thinking about whether it is valid to treat the multilocus dataset as a single locus per population, with the number of alleles equal to the total number of alleles across all loci.

      My reasoning (although not mathematical at all) is this: Consider two similar landscapes, each with different habitat types, whose species diversity and similarity you wish to compare. The species in each habitat type are non-overlapping, and the number of habitats per habitat type, as well as the number of different habitats, is more or less equal for the two landscapes. Then the species are the variables. An approximation of the habitat-type comparison could be to sum all species across habitat types and treat the landscape as one mosaic habitat. This corresponds to my suggested pooling of alleles across all loci…

      Any thoughts?

    • On the harmonic mean as a way of working with multiple loci:
      Aside from the statistical considerations, I think that this is a horrible idea. As Dr. Jost has pointed out, the harmonic mean is good if the mutation rates across loci aren’t too different. In other words, the loci are all saying just about the same thing. In that case, use whatever mean you want; they’ll all give you essentially the same answer. Unfortunately, it has been fairly clearly demonstrated that mutation rates at microsatellite loci vary from motif to motif, locus to locus, and even allele to allele, and they can vary substantially [see Ellegren H (2000) Microsatellite mutations in the germline: implications for evolutionary inference. Trends in Genetics, 16, 551-558; Huang QY, Xu FH, Shen H, Deng HY, Liu YJ, Liu YZ, Li JL, Recker RR, Deng HW (2002) Mutation patterns at dinucleotide microsatellite loci in humans. American Journal of Human Genetics, 70, 625-654; and Dupuy BM, Stenersen M, Egeland T, Olaisen B (2004) Y-chromosomal microsatellite mutation rates: differences in mutation rate between and within loci. Human Mutation, 23, 117-124].

      The theoretical expectation for D across multiple loci (given a constant mutation rate across loci and pure neutrality) is a frequency distribution. I’m not a mathematician, so I’ll defer on what type of distribution is expected, but given the vagaries of drift, it seems reasonable that not all loci will give exactly the same answer. If this is a normal or otherwise unimodal distribution, then again, use whatever mean you want and they are all likely to give you essentially the same answer. Also, as pointed out by Dr. Jost, when dealing with a large number of loci, the harmonic mean is probably OK. Unfortunately, we rarely have enough loci to accurately recreate the underlying expected distribution of D. So there are sampling considerations (i.e., sampling the true evolutionary history of the populations with a limited number of loci) that seem, to me, to make a harmonic mean a bad choice. I differ, however, with Dr. Jost in that I’m doubtful that there exists such a thing as a “…robust estimator for the harmonic mean of small set of loci…” The reason is that, I presume, this estimator will be robust only under the assumption that all loci are evolving neutrally. Selection does happen and, yes, Virginia, it does happen even at microsatellite loci (a good start on the literature would be the somewhat dated but very good Kashi, Y. and M. Soller. 1999. “Functional Roles of Microsatellites and Minisatellites.” In: Microsatellites: Evolution and Applications. Edited by Goldstein and Schlotterer. Oxford University Press.). For example, in a population genetic assessment, balancing selection (i.e., a lower D estimate for the selected loci than for neutral ones) would bias your combined D much more with the harmonic mean than with the arithmetic mean. Since neutrality is an assumption and never a demonstration, the harmonic mean, in any form, would be my last choice.

      Just my 2 cents worth

    • I am glad to get some discussion on this. I am in London now giving talks so I don’t have time to write much now, but I want to point out that the connection between the harmonic mean of the mutation rates and the harmonic mean of D is a mathematical consequence of the finite island model, and not a matter of preference. It is not a “suggestion” of mine but something that I derived and that can be confirmed by anybody who checks. It is algebra, not biology. Whether this mathematical fact is useful or not is the real question. I agree with Stephan that it may not be, especially if sample size is small. However, the relation holds even if mutation rates are extremely nonuniform across loci, so Stephan’s objections based on variability of mutation rates are not really objections against this relationship. I will post some simulation results when I get home and have some free time.

    • It is readily seen why we mathematically obtain a harmonic mean of D for multiple loci once we know the approximation D~1/[1+(n-1)(m/u)] for a single locus (equation 17 from Jost 2008), where n is the number of demes, m the migration rate and u the mutation rate. This approximation is quite accurate in most applications under a finite island model. It also has an important implication: for a given number of demes, differentiation among demes depends on only one parameter, the ratio of m to u. The approximation gives an elegant non-linear relation between D and the ratio m/u! (My simulations have provided clear evidence for this.) Another implication is that differentiation is theoretically independent of deme size under a finite island model! Now the proof for the multiple-loci case takes only one line:

      E(1/D) = 1+(n-1)m E(1/u), where E denotes statistical expectation.

      We are done because the general definition of the harmonic mean of D, H(D), is 1/E(1/D), and the harmonic mean of u, H(u), is 1/E(1/u). (If D takes only a finite number of values, D_1, D_2, …, D_k, one per locus and each with probability 1/k, then the general definition reduces to the usual definition of the harmonic mean. The same holds for u.) From the above proof line, we then have
      H(D) = 1/[1+(n-1)m/H(u)]
      Comparing this multiple-loci formula with the single-locus formula, we see that the multiple-loci case can be treated as a single-locus case with D replaced by its harmonic mean H(D) and u replaced by its harmonic mean H(u).
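
The identity above can be checked numerically. The following is a minimal Python sketch, assuming the single-locus approximation D~1/[1+(n-1)(m/u)] quoted in this post. The function name `d_single_locus` and all parameter values (10 demes, m = 0.001, five mutation rates spanning three orders of magnitude) are illustrative assumptions, not values from this thread.

```python
import math
from statistics import harmonic_mean


def d_single_locus(n_demes, m, u):
    """Single-locus approximation quoted in the post: D ~ 1/[1 + (n-1)(m/u)]."""
    return 1.0 / (1.0 + (n_demes - 1) * m / u)


# Hypothetical parameters, chosen only for illustration.
n_demes = 10
m = 0.001
mutation_rates = [1e-5, 1e-4, 5e-4, 1e-3, 1e-2]  # deliberately very non-uniform

# Per-locus D values under the approximation.
d_values = [d_single_locus(n_demes, m, u) for u in mutation_rates]

# Left side: harmonic mean of the per-locus D values.
lhs = harmonic_mean(d_values)
# Right side: single-locus formula evaluated at the harmonic mean of u.
rhs = d_single_locus(n_demes, m, harmonic_mean(mutation_rates))

print(lhs, rhs, math.isclose(lhs, rhs, rel_tol=1e-9))
```

The two sides agree to floating-point precision however uneven the mutation rates are, which is the algebraic point of the one-line proof. It says nothing about whether the harmonic mean is a good summary for estimated D values from a small sample of loci.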

      The above conclusion is valid for any distribution of D and u. In practice, however, we need to estimate D, and sometimes D_est is very small and can even be negative. A harmonic mean is dominated by very small values and is meaningless when any value is negative. That is why, for such situations, I suggested (on July 9 in this forum) using
      H(D) ~ 1/[(1/A) + var(D)(1/A)^3], where
      H(D): harmonic mean of D values,
      A: arithmetic mean (average) of D values,
      var(D): variance of D values.
      This approximation is better than the arithmetic mean.
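
A minimal Python sketch of the approximation above, for comparison with the exact harmonic mean and the arithmetic mean. The function name `approx_harmonic_mean` and the per-locus D values are made up for illustration. The post does not say which variance to use, so the population variance (`pvariance`) is an assumption here.

```python
from statistics import harmonic_mean, mean, pvariance


def approx_harmonic_mean(d_values):
    """Approximation from the post: H(D) ~ 1/[(1/A) + var(D)(1/A)^3]."""
    a = mean(d_values)
    var_d = pvariance(d_values)  # assumption: population variance of the D values
    return 1.0 / (1.0 / a + var_d / a**3)


# Hypothetical per-locus D estimates, all positive.
d_positive = [0.18, 0.22, 0.25, 0.20, 0.15]
exact = harmonic_mean(d_positive)
approx = approx_harmonic_mean(d_positive)
arithmetic = mean(d_positive)
print(exact, approx, arithmetic)

# With a slightly negative estimate, statistics.harmonic_mean raises
# StatisticsError, while the approximation still returns a value.
d_with_negative = [0.18, 0.22, -0.01, 0.20, 0.15]
approx_neg = approx_harmonic_mean(d_with_negative)
print(approx_neg)
```

For the all-positive example, the approximation lands closer to the exact harmonic mean than the arithmetic mean does. Like any second-order expansion, it should be expected to degrade when the spread of D values is large relative to their mean.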

      The above discussion is of course based on the equilibrium state under a finite island model. Whether the result is approximately valid for other types of models requires further theoretical derivation and simulations.

    • Dear all,
      I have a few questions about the bootstrapping procedure and confidence intervals that are created by SMOGD.

      1) Is it true that SMOGD results don’t make sense for haploid data? I tried this:
      “Toy data haploid
      Locus1
      Pop
      A , 001
      Pop
      B , 002
      This data set gives HS = 0.5, which is clearly not correct.

      2) I used another toy data set, this time with diploid data:
      “Toy data diploid
      Locus1
      Pop
      A , 001001
      Pop
      B , 002002
      I get HS = 0, which is fine, and Dest = 1.0, which is also fine. However, in this case, there should not be a lot of support for the Dest value, right? I would like to see a non-significant p-value, or some kind of statistic that says that with only two individuals sampled, it is not very surprising to find a configuration with one homozygote in one population and a different one in the other. The bootstrap doesn’t help here: it gives 95% CI Min: 1.0 and 95% CI Max: 1.0, because the bootstrap only creates subsets of the data, and whichever subset one takes of this dataset, Dest will always be 1.0.
      In this case I would like to do some kind of permutation of the data, in which I distribute the individuals randomly over the populations 1,000 times and calculate Dest each time, to create a null distribution of Dest for my dataset and see whether the Dest value I find is surprising / an outlier / significantly different from expected.
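
The permutation idea described above could be sketched as follows. This is a minimal Python illustration and not what SMOGD does. For simplicity it uses the uncorrected (plug-in) form D = [(HT - HS)/(1 - HS)][n/(n-1)] rather than the bias-corrected Dest, and the function names `jost_d` and `permutation_test` are hypothetical.

```python
import random
from collections import Counter


def jost_d(pops):
    """Plug-in D for one locus. `pops` is a list of allele lists, one per population."""
    n = len(pops)
    freqs = [{a: c / len(p) for a, c in Counter(p).items()} for p in pops]
    alleles = set().union(*freqs)
    # Mean within-population heterozygosity.
    hs = sum(1.0 - sum(f * f for f in fr.values()) for fr in freqs) / n
    # Heterozygosity of the equally weighted pooled population.
    pooled = {a: sum(fr.get(a, 0.0) for fr in freqs) / n for a in alleles}
    ht = 1.0 - sum(f * f for f in pooled.values())
    return (ht - hs) / (1.0 - hs) * n / (n - 1)


def permutation_test(pop_genotypes, n_perm=1000, seed=0):
    """Shuffle individuals among populations, keeping sample sizes fixed.

    Returns the observed D and the fraction of permuted data sets (plus the
    observed one) whose D is at least as large as the observed D.
    """
    rng = random.Random(seed)
    sizes = [len(p) for p in pop_genotypes]
    pooled = [g for p in pop_genotypes for g in p]

    def d_of(groups):
        # Flatten each group's genotypes into a list of allele copies.
        return jost_d([[a for g in grp for a in g] for grp in groups])

    observed = d_of(pop_genotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        groups, start = [], 0
        for s in sizes:
            groups.append(pooled[start:start + s])
            start += s
        if d_of(groups) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# The diploid toy data set from the post: one individual per population.
toy = [[("001", "001")], [("002", "002")]]
d_toy, p_toy = permutation_test(toy)
print(d_toy, p_toy)

# Hypothetical larger sample with the same fixed difference.
big = [[("001", "001")] * 10, [("002", "002")] * 10]
d_big, p_big = permutation_test(big)
print(d_big, p_big)
```

For the two-individual toy data every permutation again gives D = 1, so the permutation p-value is 1 and the observed value carries no support. The same fixed difference with ten individuals per population yields a small p-value.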

      3) A similar problem occurs not just with small datasets, but also with datasets with very high mutation rates. I tried to calculate Dest for mtDNA haplotypes, but the sequences are so long that almost every haplotype is unique. The result is very high Dest values. But these Dest values are in a way meaningless: any configuration of the data would have produced the same high Dest values. So yes, the subpopulations are different, but the differences are entirely due to differences between individuals.
      “Toy data all genotypes unique
      Locus1
      Pop
      A, 001001
      A, 003004
      A, 005006
      A, 007008
      A, 009010
      Pop
      B, 011012
      B, 013014
      B, 015016
      B, 017018
      B, 019020
      (I created one homozygous individual to avoid HS being 1, which would lead to division by 0.)
      Again, the bootstrap doesn’t help here: it gives 95% CI Min: 1.0 and 95% CI Max: 1.0.
      The result is that if one would like to see high differentiation, one should sequence long sequences of very variable microsatellites!

      4) Has anyone come up with another way of calculating Dest for sequence data, taking into account distances between haplotypes? (A similar question was posed by Marc Stift a while ago (http://network.nature.com/groups/popgen/forum/topics/4220?page=7#reply-14853), but maybe by now someone has an answer…)

    • I am wondering if anyone has views on the use of D or similar measures on microsatellite data in which some loci are in HW equilibrium whilst others deviate significantly from it? I also suspect that mutation rates vary greatly among the 7 loci I am using. Jost’s paper (2008) concludes that these are accurate, model-independent descriptive measures, so they should be especially useful when such models are unavailable. I work in marine systems, where connectivity and gene flow are highly influenced by oceanographic currents, stratification etc., so it is highly unlikely that sub-populations behave according to the finite island model. One of the reasons for looking at the genetic structure between these populations is to try to understand the connectivity patterns in this highly variable environment. Will these measures be robust to a possibly quite severe violation of the finite island model’s assumption of equal gene flow between sub-populations?

      Thanks
      Natalie

    • I have just run my data from 7 microsatellite markers from 2 populations through the genetic diversity software of Crawford (2009). I am wondering why, when I have 2 populations, I only get one value of Hs-est for each locus? If this is the nearly unbiased estimator of within-subpopulation heterozygosity, shouldn’t I have a value for each sub-population? I also have virtually no difference between Hs-est and Ht-est – any ideas on interpreting that result?

      Thanks and I apologise for my ignorance!
      Natalie

    • Pleuni, I am sorry that I didn’t see your post until today. Those are important observations. The bootstrap method of calculating confidence intervals is only valid when the number of samples is relatively large, so yes, the method will fail when sample sizes are so small. Perhaps Anne Chao will comment further on this.
      I don’t know anything about how Nick Crawford coded SMOGD. The differentiation measure itself works just as well as a measure of allelic differentiation for haploid data as for diploid data.
      Regarding your third point, it is important to realize that D, as published, uses alleles as the unit of analysis (as does Gst and its relatives). You are right that if a sequence is long enough, virtually every individual will be different and D will approach unity. This is in fact the correct behavior for any measure of differentiation that uses alleles as the unit of analysis. You are right that it is often important in genetics to take into account molecular distance. Anne and I and others are perfecting this approach and will publish soon.
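
The statement that D reaches unity whenever the demes share no alleles can be verified in a few lines. The derivation below is a sketch assuming the equal-weight definition D = [(H_T - H_S)/(1 - H_S)][n/(n-1)], with p_{ij} denoting the frequency of allele i in deme j (notation introduced here for illustration).

```latex
% Definitions (n demes, equal weights):
%   H_j = 1 - \sum_i p_{ij}^2, \quad
%   H_S = \frac{1}{n}\sum_{j=1}^{n} H_j, \quad
%   H_T = 1 - \sum_i \Bigl(\frac{1}{n}\sum_{j=1}^{n} p_{ij}\Bigr)^{2}
%
% If no allele occurs in more than one deme, the cross terms vanish:
\begin{align*}
\sum_i \Bigl(\frac{1}{n}\sum_{j} p_{ij}\Bigr)^{2}
  &= \frac{1}{n^{2}} \sum_{j}\sum_{i} p_{ij}^{2}
   = \frac{1}{n^{2}} \sum_{j} (1 - H_j)
   = \frac{1 - H_S}{n}, \\
H_T - H_S
  &= (1 - H_S) - \frac{1 - H_S}{n}
   = (1 - H_S)\,\frac{n-1}{n}, \\
D &= \frac{H_T - H_S}{1 - H_S}\cdot\frac{n}{n-1} = 1 .
\end{align*}
```

For the "all genotypes unique" toy data above (uncorrected values), H_S = 0.89 and H_T = 1 - 0.11/2 = 0.945, which gives D = (0.055/0.11)(2) = 1. The result does not depend on how diverse each deme is, only on the absence of shared alleles.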

    • Natalie, if you are just trying to describe the degree of differentiation of allele frequencies across two or more populations, D makes no assumptions at all. It can be thought of as a purely descriptive measure of the differences in allele frequencies across demes. The finite island model plays no role. In this it differs from Gst or Fst, which are difficult to interpret when the finite island model does not hold.
      Likewise HW equilibrium is not a prerequisite for this measure. It doesn’t look at individuals, but rather at all alleles pooled together, from each population. It tells you how different those demes are in terms of total allele frequencies. Now, it is worth noting that populations may differ not only in terms of allele frequencies but also in how those alleles are distributed among individuals. It could be that one deme is in HW equilibrium and the other is not. This kind of difference between demes would not show up using just D (or Gst either). But this kind of difference is not normally what one looks at.
      Regarding your question about Hs, this is (the unbiased estimate of) the mean within-group heterozygosity. Maybe it would be a good idea to list within-group H for each deme. You should suggest that to Nick if it is not in there somewhere.

    • Natalie, about the small difference between Hs and Ht: don’t try to interpret that directly; that is what D is for. Hs and Ht could be similar for two very different reasons. One reason could be that there is little or no differentiation between subpopulations. The other is that diversity is so high that Hs and Ht are both nearing their maximum value of unity. This could happen even when the demes are completely differentiated. D will tell you what is really happening in terms of differentiation. If it is close to zero, differentiation is close to zero.
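
The two situations described above can be illustrated with made-up allele frequencies. This is a minimal Python sketch using uncorrected heterozygosities and equal deme weights. The function names and the choice of 20 equally frequent alleles per deme are assumptions for illustration only.

```python
def heterozygosity(freqs):
    """Expected heterozygosity, 1 - sum(p^2), from an allele -> frequency dict."""
    return 1.0 - sum(p * p for p in freqs.values())


def summary(pop_freqs):
    """Per-deme H, Hs, Ht, Gst and D for a list of per-deme frequency dicts."""
    n = len(pop_freqs)
    per_deme = [heterozygosity(f) for f in pop_freqs]
    hs = sum(per_deme) / n
    alleles = set().union(*pop_freqs)
    pooled = {a: sum(f.get(a, 0.0) for f in pop_freqs) / n for a in alleles}
    ht = heterozygosity(pooled)
    gst = (ht - hs) / ht
    d = (ht - hs) / (1.0 - hs) * n / (n - 1)
    return {"per_deme_H": per_deme, "Hs": hs, "Ht": ht, "Gst": gst, "D": d}


k = 20  # hypothetical number of equally frequent alleles per deme
deme_a = {f"a{i}": 1.0 / k for i in range(k)}
deme_b = {f"b{i}": 1.0 / k for i in range(k)}  # shares no alleles with deme_a

identical = summary([deme_a, dict(deme_a)])  # no differentiation at all
disjoint = summary([deme_a, deme_b])         # complete differentiation

for name, result in (("identical", identical), ("disjoint", disjoint)):
    print(name, {key: value for key, value in result.items() if key != "per_deme_H"})
```

In both cases Hs and Ht differ by only a few hundredths and Gst stays close to zero, yet D is 0 for the identical demes and 1 for the demes that share no alleles.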
