Behavioral Ecology Advance Access originally published online on April 28, 2006
Behavioral Ecology 2006 17(4):682-687; doi:10.1093/beheco/ark005
Forum
Comparing effect sizes across variables: generalization without the need for Bonferroni correction
Department of Biology, University of Antwerp, Campus Drie Eiken, Universiteitsplein 1, B-2610 Wilrijk, Belgium
Address correspondence to L.Z. Garamszegi. E-mail: laszlo.garamszegi{at}ua.ac.be.
Received 24 July 2005; revised 16 March 2006; accepted 22 March 2006.
| INTRODUCTION |
|---|
Studies in behavioral ecology often investigate several traits and then apply multiple statistical tests to discover their pairwise associations. Traditionally, such approaches require the adjustment of individual significance levels because the more statistical tests are performed, the greater the likelihood that a Type I error is committed (i.e., rejecting H0 when it is true) (Rice 1989).
The strict application of Bonferroni correction in the field of ecology and behavioral ecology has therefore been criticized for mathematical and logical reasons (Wright 1992; Benjamini and Hochberg 1995; Perneger 1998; Moran 2003; Nakagawa 2004). As a potential solution, Wright (1992) and Chandler (1995) advocated that the sacrificial loss of power can be avoided by choosing an experimentwise error rate higher than the usually accepted 5%, which results in a balance between different types of errors. As another alternative, the researcher might be more interested in controlling the proportion of erroneously rejected null hypotheses, the so-called false discovery rate, than in controlling for familywise error rate (Benjamini and Hochberg 1995). Although this approach allows for increased power in large series of repeated tests, it is rarely applied in ecological studies (Garcia 2003, 2004).
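The contrast between controlling the familywise error rate and controlling the false discovery rate can be sketched as follows. This is a minimal illustration, assuming a plain Bonferroni threshold and the step-up rule of Benjamini and Hochberg (1995); the function names and the P values are hypothetical and do not come from any data set discussed here.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Flag tests that pass a plain Bonferroni threshold (p <= alpha / m)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Flag tests rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k whose ordered P value satisfies p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject


# Hypothetical P values from 8 pairwise tests (illustrative only).
p_values = [0.001, 0.008, 0.012, 0.020, 0.041, 0.090, 0.300, 0.740]
bonf = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print("Bonferroni rejections:", sum(bonf))
print("Benjamini-Hochberg rejections:", sum(bh))
```

With these made-up values, the Bonferroni threshold retains fewer relationships than the false discovery rate rule, which is the loss of power the criticism refers to.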
Recently, Nakagawa (2004) suggested reporting effect sizes together with confidence intervals (CIs) for all potential relationships to allow the readers to judge the biological importance of the results and to reduce publication bias. Due to the low power of the tests, the majority of investigated relationships are expected to be nonsignificant, which is thought to make publication difficult. Such difficulty is generally assumed to cause behavioral ecologists to selectively report data (Moran 2003; Nakagawa 2004). The omission of nonsignificant results from publications is undesirable for both scientific and ethical reasons, which makes Bonferroni adjustment problematic. It is noteworthy that direct tests comparing effect sizes of representative samples of published and unpublished studies showed no evidence of publication bias in the biological literature (Koricheva 2003; Møller et al. 2005). However, independent of publication bias, conclusions drawn from effect sizes and the associated CIs should be encouraged. Such an approach considers the magnitude of an effect on a continuous scale, whereas conventional hypothesis testing based on significance levels tends to treat biological questions as all-or-nothing effects depending on whether P values exceed the critical limit or not (Chow 1988; Wilkinson and Task Force Stat Inference 1999; Thompson 2002). Hence, using the same data, the former approach may reveal that a particular effect is small but still biologically important, whereas the latter approach may lead the investigator to conclude that the hypothesized phenomenon does not exist in nature. Although such philosophical differences may dramatically influence our knowledge, presenting standardized effect sizes is still uncommon in ecology and evolution (Nakagawa 2004).
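To illustrate what reporting an effect size with its CI can look like in practice, the sketch below computes an approximate 95% CI for a Pearson's r. It assumes the standard Fisher z-transformation with standard error 1/sqrt(n - 3); the function name and the example values of r and n are hypothetical.

```python
import math


def r_confidence_interval(r, n, z_crit=1.959964):
    """Approximate 95% CI for Pearson's r via Fisher's z (requires n > 3)."""
    z = math.atanh(r)                # Fisher z-transformation of r
    se = 1.0 / math.sqrt(n - 3)      # approximate standard error of z
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


# The same hypothetical effect size estimated from a small and a large sample.
r = 0.20
ci_small = r_confidence_interval(r, n=40)
ci_large = r_confidence_interval(r, n=400)
print("n = 40 :", ci_small)
print("n = 400:", ci_large)
```

In this made-up case the effect size is identical in both samples; only the width of the interval changes, which is the information a bare significance statement hides.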
Here, I suggest that, in addition to their presentation, the calculated effect sizes may be further used in simple analyses that can help to estimate the true effect of a predictor variable and thus make general conclusions. These analytical tools rely on the fact that the strength and direction of relationships, as reflected by standardized measures of effect sizes (Pearson's r, Cohen's d, or Hedges' g), are comparable and independent of the scale on which the variables were measured (e.g., Hedges and Olkin 1985; Cohen 1988; Rosenthal 1991). Thus, if multiple traits are measured and multiple correlations are calculated, the corresponding effect sizes tabulated among the variables measured will have a certain statistical distribution with measurable attributes. Below, I present 4 simple analyses to demonstrate how such statistical attributes can be used to make general interpretations. I will confine myself to a typical sampling design from behavioral ecology in which the experimenter is interested in explaining variation in certain traits (response variables) in the light of other (predictor) variables. Specific sampling designs can be tailored according to the biological question at hand, as will be illustrated by using real data on the collared flycatcher, Ficedula albicollis, from Garamszegi et al. (2004). I will also discuss the confounding effect of collinearity between variables that may violate the assumption of statistical independence and the potentially low power of the suggested tests.
| ANALYSES OF EFFECT SIZES |
|---|
First, the mean effect size from multiple pairwise tests can be calculated to test the null hypothesis that the mean underlying effect size does not differ from zero. It will be rejected if the measured variables covary with a predictor variable consistently in the same direction. Normally, a few of the investigated relationships will be significant but the majority will not (see an example in Table 1). The classical interpretation of these results relies on the relationships that pass the filter of Bonferroni correction (i.e., strong effects). However, weak effects may also have biological importance: a meta-analysis of meta-analyses in ecology and evolution revealed small to intermediate mean effect sizes (r < 0.2) and that the amount of variance explained in biological studies appears to be very small (Møller and Jennions 2002).
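One way this first analysis could be carried out is sketched below. The sketch assumes that the effect sizes are Pearson's r values, that they are averaged on Fisher's z scale, and that a one-sample t statistic is computed with variables as the unit of analysis. These choices, the function name, and the 12 effect sizes are illustrative assumptions; the values are not those of Table 1.

```python
import math
from statistics import mean, stdev


def mean_effect_size(rs):
    """Mean r (averaged on Fisher's z scale) and a one-sample t statistic vs. zero."""
    zs = [math.atanh(r) for r in rs]          # transform each r to Fisher's z
    z_bar = mean(zs)
    se = stdev(zs) / math.sqrt(len(zs))       # standard error across variables
    t_stat = z_bar / se
    df = len(zs) - 1
    return math.tanh(z_bar), t_stat, df       # back-transform the mean to r


# Hypothetical effect sizes of 12 response traits on one predictor variable.
rs = [0.21, 0.15, 0.08, 0.30, -0.05, 0.12, 0.18, 0.25, 0.02, 0.10, 0.22, 0.14]
mean_r, t_stat, df = mean_effect_size(rs)

# Two-tailed 5% critical value of Student's t for df = 11.
T_CRIT_DF11 = 2.201
print("mean r =", round(mean_r, 3), " t =", round(t_stat, 2), " df =", df)
print("mean differs from zero at the 5% level:", abs(t_stat) > T_CRIT_DF11)
```

In this invented example, individually weak effects point consistently in the same direction, so the mean departs from zero. Note that the t statistic treats effect sizes as independent observations, an assumption that is qualified in the section on confounding effects.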
Second, effect sizes and the corresponding CIs may stimulate meta-analytic thinking (Thompson 2002).
Third, when neglecting the direction of the relationships, unsigned effect sizes can be used to reflect the strength of a given relationship, for instance, according to Cohen's (1988)
Fourth, if it is biologically relevant, it may be interesting to test for a relationship between the effect sizes of 2 predictor variables. If different mechanisms are responsible for the detected effects for each predictor variable, different traits with different magnitudes will be associated with the predictor variables. In this case, at the level of variables, the effect sizes should not covary between the predictor variables (see Figure 2 as an example). On the other hand, if similar mechanisms shape the observed patterns, similar relationships will be found for both predictor variables, and effect sizes may be positively associated across them. For such a test to be robust, it is important also to assess the relationship between the predictor variables themselves. It may happen that we find a correlation between effect sizes for 2 predictor variables but that this is due to a close positive association between the predictor variables (see also below).
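The fourth analysis amounts to correlating two columns of effect sizes across response variables. A minimal sketch follows, assuming Pearson's r is used at the level of variables; the helper function, the variable names, and all numbers are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical effect sizes of the same 6 response traits on two predictors.
es_predictor_a = [0.21, 0.15, 0.08, 0.30, -0.05, 0.12]
es_predictor_b = [0.05, -0.10, 0.20, 0.02, 0.15, -0.08]

# Association between the two sets of effect sizes (variables are the unit).
r_between_effects = pearson_r(es_predictor_a, es_predictor_b)
print("r between effect sizes:", round(r_between_effects, 3))
```

The same helper could be applied to the raw values of the two predictor variables measured on individuals, to check whether an association between effect sizes merely reflects an association between the predictors themselves.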
| CONFOUNDING EFFECTS: COVARIATION BETWEEN TRAITS AND LOW POWER |
|---|
Effect sizes are estimated from the same sample of individuals; therefore, they are not independent observations. This nonindependence violates one of the most important assumptions of parametric tests and meta-analyses. Hence, the association between different variables at the level of individuals may confound the analyses of effect sizes at the level of variables. One potential solution may be to calculate partial correlations between the predictor variables and each of the response variables while holding the variation constant for the rest of the response variables. However, the use of such a partial correlation approach would require very complex partial correlation matrices for all variables involved with, more or less, completely filled data matrices. Unfortunately, missing values often cause difficulties in such multivariate statistics.
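The simplest case of the partial correlation approach, with a single variable held constant, can be sketched as follows. This uses the standard first-order partial correlation formula; controlling for all remaining response variables at once, as described above, would require the full correlation matrix and is not shown. The function name and the input correlations are hypothetical.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation between x and y, holding z constant."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return numerator / denominator


# Hypothetical correlations: predictor x, response y, and a second response z.
r_xy, r_xz, r_yz = 0.50, 0.60, 0.70
r_xy_given_z = partial_r(r_xy, r_xz, r_yz)
print("partial r(x, y | z) =", round(r_xy_given_z, 3))
```

In this invented case, much of the raw association between x and y disappears once z is held constant, which shows how covariation among response variables can inflate the apparent number of independent effects.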
I suggest an alternative method to be developed that can potentially be utilized to control for the associations between variables when test statistics are based on effect sizes and variables are the unit of analysis. The relationship between different variables causes a lack of statistical independence similar to the one that arises from the use of species values as independent data points in comparative analyses (Felsenstein 1985). In comparative studies, phylogenetic approaches are applied to eliminate such confounding effects due to common ancestry to ensure statistical independence (Harvey and Pagel 1991). Being an analogous problem, similar approaches can be used to deal with the confounding effect arising from the associations between variables. If the association between variables can be represented by a "phenetic" tree, it could subsequently be used in a phylogenetic analysis to control for the relationships between different variables. In such a tree, tips should be the variables, and different paths and branch lengths should represent their distance and relatedness (see an example in Figure 3). The hierarchical classification of the response variables based on joining- or tree-clustering methods with a single linkage can result in such structures (Podani 2000). Tree-clustering methods use dissimilarities or distances between objects to group objects of similar kind into respective categories. As the distance between variables can be reflected by their relationship, a correlation matrix of variables could be used as a distance matrix in a cluster analysis. If the distance between 2 variables is estimated as 1 − |r|, strongly correlating variables will be closely related to each other, and the distance between them will be small. Such distances could be computed for all pairwise relationships. The numeric (unsigned) correlation coefficients should be used because we are interested in controlling for the strength of different associations neglecting the direction of the patterns. Therefore, relying on the correlation of traits, one can create a distance matrix for the variables that can be used in a cluster analysis to classify variables hierarchically. The resulting tree that holds information about the relatedness of variables can subsequently be imported into a phylogenetic program that eliminates the confounding effect of the relationships between observation points (see Harvey and Pagel 1991; Pagel 1999 for different approaches), that is, causing effect sizes to be independent of correlations between variables. For example, comparative analyses based on phylogenetically independent contrasts (CAIC) use the phylogeny of the species in the data set to partition the variance among species into independent comparisons (so-called linear contrasts), each comparison being made at a different node of the phylogeny (Purvis and Rambaut 1995). The resulting contrasts can be analyzed validly in standard statistical packages to test hypotheses about correlated evolution of traits. Similarly, based on the estimated effect sizes, independent contrasts can be calculated for each node of the phenetic tree of variables (such as in Figure 3), and these contrasts can be used to test hypotheses about the strength of relationship between different biological effects.
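The construction of the phenetic tree can be sketched in a few lines. The example below assumes the distance 1 − |r| described above and a naive single-linkage agglomerative clustering written in plain Python; it covers only the tree-building step, not the calculation of independent contrasts. The trait names and the correlation matrix are hypothetical.

```python
def distance_matrix(corr):
    """Convert a correlation matrix into distances 1 - |r| (0 on the diagonal)."""
    n = len(corr)
    return [[0.0 if i == j else 1.0 - abs(corr[i][j]) for j in range(n)]
            for i in range(n)]


def single_linkage(dist, labels):
    """Naive single-linkage clustering; returns merges as (cluster_a, cluster_b, height)."""
    index = {label: i for i, label in enumerate(labels)}
    clusters = [(label,) for label in labels]   # each cluster is a tuple of labels
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist[index[p]][index[q]]
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges


# Hypothetical correlations among 4 response variables.
labels = ["song_rate", "repertoire", "patch_size", "wing_length"]
corr = [
    [1.0, 0.8, 0.1, -0.2],
    [0.8, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, -0.6],
    [-0.2, 0.1, -0.6, 1.0],
]
merges = single_linkage(distance_matrix(corr), labels)
for left, right, height in merges:
    print(left, "+", right, "at distance", round(height, 2))
```

Because unsigned correlations are used, the strongly negative association between the last two hypothetical traits places them close together in the tree, just as a strongly positive one would.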
Note that despite the analogies, CAIC was especially developed for phylogenetic analyses and may be sensitive to specific assumptions. The applicability of the phylogenetic framework in the current context should be tested in the future, and specific methods may be developed to deal with the nonindependence of effect sizes. Until then, analyses of effect sizes should be interpreted with caution. However, generalizations by graphical approaches, such as the distribution of effect sizes, meta-analytical summaries, or the phenetic tree of variables, could already provide us with important biological information.
An additional problem may appear when test statistics are based on effect sizes. Because these approaches use variables as the unit of analysis, the sample size will be equal to the number of variables involved. Therefore, the power of the suggested tests may be limited, and conclusions based on the associated P values will be sensitive to Type II errors. In fact, below a certain limit, making analyses at the level of variables does not make much sense. When only a few variables are considered, the explanation of individual effect sizes (and CIs) should be preferred. However, as the number of variables increases, the suggested analyses become more powerful, corresponding to the increased need to be able to make generalizations. In these situations, I would avoid focusing merely on significance levels and thus committing the same errors again. The framework involving graphical approaches outlined above has the potential to capture biological patterns requiring no statistical tests of significance.
| CONCLUSION |
|---|
In a stimulating paper, Nakagawa (2004)
The analytical tool I presented can be used to address various biological questions, even within and between species, as effect sizes can be calculated and tabulated according to the problem at hand (see an example in Garamszegi et al. 2006). Here, I provided an example by using real data from the collared flycatcher. I showed that, relying on Bonferroni adjustment, the traditional analysis of the available data would suggest that there is no relationship between the expression of sexual signals and a measure of male–male competition. However, analyses at the level of effect sizes demonstrated that the expression of 12 sexual traits appears to covary with nest-box retention in the same direction (Table 1 and Figure 1). Mean effect size reveals that generally, males with elaborate sexual signals appear more successful in nest-box retention than males with less elaborate signals, confirming the prediction of sexual selection. However, analyses of unsigned effect sizes showed that the strength of this relationship is generally weak, which could be estimated with broad CIs in the current study. As there was no relationship between effect sizes for nest-box retention and pairing success (Figure 2), the 2 measures of mating success are independent components of sexual selection. These findings provide us with biologically relevant and general conclusions without the need for additional data or the drawback of creating publication bias by selective reporting of results.
| ACKNOWLEDGEMENTS |
|---|
Three anonymous referees provided stimulating criticism that significantly improved the manuscript and for which I am extremely grateful. I am highly indebted to M. D. Jennions for his constructive comments. J. Podani provided help with the hierarchic classification of variables. During this study, I received a postdoctoral fellowship from the Fonds voor Wetenschappelijk Onderzoek Flanders (Belgium).
| REFERENCES |
|---|
Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B 57:289–300.
Berglund A, Bisazza A, Pilastro A. 1996. Armaments and ornaments: an evolutionary explanation of traits of dual utility. Biol J Linn Soc 58:385–99.
Cabin RJ, Mitchell RJ. 2000. To Bonferroni or not to Bonferroni: when and how are the questions. Bull Ecol Soc Am 81:246–8.
Chandler CR. 1995. Practical considerations in the use of simultaneous inference for multiple tests. Anim Behav 49:524–7.
Chow SL. 1988. Significance test or effect size. Psychol Bull 103:105–10.
Chow SL. 1998. Precis of statistical significance: rationale, validity, and utility. Behav Brain Sci 21:169–94.
Cohen J. 1988. Statistical power analysis for the behavioural sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates.
Cooper H, Hedges V. 1994. The handbook of research synthesis. New York: Russell Sage Foundation.
Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat 125:1–15.
Garamszegi LZ, Hegyi G, Heylen D, Ninni P, de Lope F, Eens M, Moller AP. 2006. The design of complex sexual traits in male barn swallows: associations between signal attributes. J Evol Biol 10.1111/j.1420-9101.2006.01135.x.
Garamszegi LZ, Møller AP, Török J, Michl G, Péczely P, Richard M. 2004. Immune challenge mediates vocal communication in a passerine bird: an experiment. Behav Ecol 15:148–57.
Garcia LV. 2003. Controlling the false discovery rate in ecological research. Trends Ecol Evol 18:553–4.
Garcia LV. 2004. Escaping the Bonferroni iron claw in ecological studies. Oikos 105:657–63.
Harvey PH, Pagel MD. 1991. The comparative method in evolutionary biology. Oxford: Oxford University Press.
Hedges LV, Olkin I. 1985. Statistical methods for meta-analysis. London: Academic Press.
Koricheva J. 2003. Non-significant results in ecology: a burden or a blessing in disguise? Oikos 102:397–401.
Møller AP, Jennions MD. 2002. How much variance can be explained by ecologists and evolutionary biologists. Oecologia 132:492–500.
Møller AP, Thornhill R, Gangestad SW. 2005. Direct and indirect tests for publication bias: asymmetry and sexual selection. Anim Behav 70:497–506.
Moran MD. 2003. Arguments for rejecting the sequential Bonferroni in ecological studies. Oikos 102:403–5.
Nakagawa S. 2004. A farewell to Bonferroni: the problems of low statistical power and publication bias. Behav Ecol 15:1044–5.
Pagel M. 1999. Inferring the historical patterns of biological evolution. Nature 401:877–84.
Perneger TV. 1998. What's wrong with Bonferroni adjustments. Br Med J 316:1236–8.
Podani J. 2000. Introduction to the exploration of multivariate biological data. Leiden, The Netherlands: Backhuys Publishers.
Purvis A, Rambaut A. 1995. Comparative analysis by independent contrasts (CAIC): an Apple Macintosh application for analysing comparative data. Comp Appl Biosci 11:247–51.
Rice WR. 1989. Analysing tables of statistical tests. Evolution 43:223–5.
Rosenthal R. 1991. Meta-analytic procedures for social research. Thousand Oaks, CA: Sage Publications.
Searcy WA, Andersson M. 1986. Sexual selection and the evolution of song. Annu Rev Ecol Syst 17:507–33.
Sokal RR, Rohlf FJ. 1995. Biometry. 3rd ed. New York: W. H. Freeman & Co.
Thompson B. 2002. What future quantitative social science research could look like: confidence intervals for effect sizes. Educ Res 31:25–32.
Török J, Hegyi G, Garamszegi LZ. 2003. Depigmented wing patch size is a condition-dependent indicator of viability in male collared flycatchers. Behav Ecol 14:382–8.
Wilkinson L, Task Force Stat Inference. 1999. Statistical methods in psychology journals: guidelines and explanations. Am Psychol 54:594–604.
Wright SP. 1992. Adjusted P-values for simultaneous inference. Biometrics 48:1005–13.