Empirical research based on experiments and data analysis requires an objective measure of pre-experimental differences between treatment groups. The common way to measure such differences is to use P-values: the outcomes of statistical tests performed on the data, for which a significance level of P = 0.05 has become the recognized and accepted threshold. When a test returns a value above 0.05, the difference or relationship is deemed weak and, by extension, uninformative and uninteresting. P-values falling below this boundary suggest a strong, important, or “statistically significant” difference.
Statistically significant results therefore attract considerable interest in the research community, in contrast to equally well-designed and well-executed studies whose main relationship turns out to be statistically non-significant. This black-and-white perspective not only stems from a misinterpretation of P-values but, more importantly, encourages certain forms of malpractice (Amrhein et al. 2019).
The over-representation of significant P-values in the scientific literature has been widely documented across several fields. One reason for this bias is selective reporting: significant results are more likely to be submitted for publication and more likely to be accepted by editorial boards, owing to the false perception that significant results are more interesting and of higher scientific value than non-significant ones. Because of this perception, some researchers are inclined, consciously or not, to manipulate data and analyses to obtain statistically significant results (P < 0.05). This phenomenon is known as P-hacking (Head et al. 2015).
In our current study, we focused on the opposite scenario, in which researchers favor a non-significant outcome of a statistical test. We define reverse P-hacking as the manipulation of data and analyses to obtain a statistically non-significant result (i.e. P > 0.05). We reasoned that this could occur in experiments where researchers randomly assign individuals to control or treatment groups and do not want the groups to differ. Such random assignment is often used to account for a confounding variable that, despite not being the focus of the study (typically parameters such as body size or age), may still affect the results.
Even under such a random setup, statistically significant results are expected to occur by chance alone in 5% of studies (given the commonly accepted threshold of P = 0.05). Failing to acknowledge the effect of a confounding variable could have far-reaching consequences. Imagine releasing a new medical treatment after a clinical trial showed no significant adverse effects on patients, only to realize afterward that the placebo group was significantly older than the treated group. The trial failed to acknowledge the confounding variable of age, which might explain the absence of a significant difference in side effects between groups (e.g. the older placebo group might have been more likely to suffer age-related health complications, making the side effects in the younger treated group appear non-significant).
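To make this 5% expectation concrete, here is a minimal simulation sketch (the group sizes, distribution, and variable names are hypothetical, not taken from our study): individuals are repeatedly assigned at random to two groups and the groups are then compared on a confounding variable such as body size. With no true group difference, roughly 5% of comparisons come out significant at P < 0.05 by chance alone.

```python
# Sketch: how often does random assignment alone produce a "significant"
# difference in a confounding variable? (illustrative numbers only)
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_studies, n_per_group = 10_000, 30
false_positives = 0

for _ in range(n_studies):
    # Body size drawn from the same distribution for everyone: no true group difference.
    pool = rng.normal(loc=50.0, scale=5.0, size=2 * n_per_group)
    rng.shuffle(pool)                                   # random assignment to groups
    control, treatment = pool[:n_per_group], pool[n_per_group:]
    _, p = stats.ttest_ind(control, treatment)          # test for a group difference
    false_positives += p < 0.05

print(f"Proportion significant by chance: {false_positives / n_studies:.3f}")  # ~0.05
```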
We screened a representative sample of research articles published over 30 years within the discipline of behavioral ecology for these types of tests. We found that only 3 of 250 papers (whereas 5% would correspond to about 12 papers) reported a significant treatment-control difference in a confounding variable. We conclude that the lower-than-expected number of significant P-values in the literature reporting effects associated with confounding variables could be caused by reverse P-hacking and/or selective reporting. Selective reporting could stem, for example, from editorial board decisions to reject a paper based on an experimental flaw (i.e. the effect of the variable of interest cannot be disentangled from that of the confounding variable).
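As an illustration of this comparison (not the exact analysis reported in the paper), a one-sided binomial test can ask how surprising 3 significant results out of 250 tests would be if the underlying false-positive rate were really 5%:

```python
# Illustrative check: is 3 significant results out of 250 tests fewer than the
# ~12.5 expected under a 5% false-positive rate? (requires scipy >= 1.7)
from scipy.stats import binomtest

result = binomtest(k=3, n=250, p=0.05, alternative="less")
print(f"Expected under 5%: {250 * 0.05:.1f} papers")
print(f"P-value for observing 3 or fewer: {result.pvalue:.4f}")
```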
Although we cannot isolate reverse P-hacking as the cause of the scarcity of significant P-values, our empirical study provides a proof of concept, and we hope that future studies will replicate it in their own disciplines. Much of the literature on publication bias consists of statisticians discussing “in principle” methods to detect and correct for publication bias, or of policy statements; such papers vastly outnumber studies that actually collect data. One of our main points was to show yet another way in which the use of P-values (and a dichotomy between significance and non-significance) can lead to poor scientific practices that create a discrepancy between data collection and analysis and what eventually appears in the literature.
For some, it might come as a surprise that randomization alone is not enough to deal with confounding variables. The 5% of “unlucky” studies that obtain a significant difference by chance alone may see their scientific conclusions undermined, or may be forced to justify unexpected results after the fact. To minimize the likelihood of obtaining a significant difference in a confounding variable between treatment groups, we recommend balanced designs over simple randomization. This entails, for example, pre-sorting individuals based on the confounding variable (e.g. groups of similar-sized animals) and then randomly selecting individuals from each category in turn to form the treatment groups, as sketched below. In addition, thanks to the constant improvement of statistical methods, researchers can control for confounding variables by including them in their statistical models rather than attempting to control for them experimentally.
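The following sketch shows one way such a balanced (blocked) design could be implemented, assuming a single continuous confounder such as body size; the measurements and group sizes are hypothetical.

```python
# Sketch of a balanced (blocked) design: sort individuals by the confounder,
# form pairs of similar-sized individuals, and randomly assign one member of
# each pair to control and the other to treatment.
import numpy as np

rng = np.random.default_rng(0)
body_size = rng.normal(loc=50.0, scale=5.0, size=40)   # hypothetical measurements (even count)

order = np.argsort(body_size)                           # sort individuals by the confounder
control_idx, treatment_idx = [], []
for i in range(0, len(order), 2):                       # take similar-sized pairs (blocks of 2)
    pair = order[i:i + 2]
    rng.shuffle(pair)                                   # random assignment within each block
    control_idx.append(pair[0])
    treatment_idx.append(pair[1])

print("Mean size, control:  ", body_size[control_idx].mean())
print("Mean size, treatment:", body_size[treatment_idx].mean())
```

Sorting first and then randomizing within similar-sized pairs keeps the group means of the confounder close, while still preserving random assignment within each block.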
Whatever method is used to form treatment groups, researchers should always test for differences in confounding variables and report them accurately. Editorial board members and reviewers must flag the absence of such tests and thus encourage better scientific practices in the community.
These findings are described in the article entitled “Evidence that nonsignificant results are sometimes preferred: Reverse P-hacking or selective reporting?”, recently published in the journal PLOS Biology.
References
- Amrhein V, Greenland S, McShane B (2019) Scientists rise up against statistical significance. Nature 567: 305-307. doi: 10.1038/d41586-019-00857-9
- Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD (2015) The Extent and Consequences of P-Hacking in Science. PLoS Biol 13(3): e1002106. doi: 10.1371/journal.pbio.1002106