The Case For Accurate Reporting Of “Nonsignificant” Results

Empirical research based on experiments and data analysis requires an objective measure of the difference between treatment groups. The most common way to quantify such a difference is the P-value, the outcome of a statistical test applied to the data, for which a significance threshold of P = 0.05 has become the accepted convention. When a test yields a P-value higher than 0.05, the difference or relationship is deemed weak and, by extension, uninformative and uninteresting. P-values falling below this boundary are taken to indicate a strong, important, or “statistically significant” difference.

Statistically significant results therefore attract considerable interest in the research community, in contrast to well-designed and well-performed studies whose main relationship turned out to be statistically non-significant. This black-and-white perspective not only stems from a misinterpretation of P-values but, more importantly, encourages malpractice (Amrhein et al. 2019).


The over-representation of significant P-values in the scientific literature has been widely documented in several fields. One reason for this bias is selective reporting: significant results are more likely to be submitted for publication, and more likely to be accepted by editorial boards, because of the false perception that they are more interesting and of higher scientific value than non-significant results. Because of this perception, some researchers are inclined, consciously or not, to manipulate data and analyses to obtain statistically significant results (P < 0.05). This phenomenon is known as P-hacking (Head et al. 2015).

In our current study, we focused on an alternative scenario, in which researchers favor a non-significant outcome of a statistical test. We define reverse P-hacking as the manipulation of data and analyses to obtain a statistically non-significant result (i.e. P > 0.05). We thought this could occur in experiments where researchers randomly assign individuals to a control or treatment group and do not want the groups to differ. Random assignment is often used to account for a confounding variable that, despite not being the focus of the study (mostly parameters like body size or age), may still affect the results.

Even under such a random setup, statistically significant differences between groups are expected to occur by chance alone in 5% of studies (given the commonly accepted threshold of P = 0.05). Failing to acknowledge the effect of a confounding variable could have far-reaching consequences. Imagine releasing a new medical treatment after a clinical trial showed no significant adverse effects on patients, only to realize afterward that the placebo group was significantly older than the treated group. The trial failed to acknowledge the confounding variable of age, which might explain the absence of a significant difference in side effects between groups (e.g. the older placebo group might have been more prone to age-related health complications, making the side effects in the younger treated group seem non-significant by comparison).
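The 5% figure can be illustrated with a small simulation. The sketch below is not part of the original study: it assumes a normally distributed confounding variable with a known standard deviation of 1, 20 individuals per group, and a simple two-sided z-test. The function names and parameter values are hypothetical, chosen only for illustration.

```python
import random
from statistics import NormalDist, mean


def covariate_p_value(group_a, group_b, sd=1.0):
    """Two-sided z-test for a difference in means, assuming a known sd."""
    se = sd * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_studies=10_000, n_per_group=20, alpha=0.05, seed=1):
    """Share of simulated studies whose groups differ 'significantly'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # One simulated study: draw the confounding variable (e.g. body
        # size) for every individual, then assign individuals at random.
        covariate = [rng.gauss(0.0, 1.0) for _ in range(2 * n_per_group)]
        rng.shuffle(covariate)
        control = covariate[:n_per_group]
        treatment = covariate[n_per_group:]
        if covariate_p_value(control, treatment) < alpha:
            hits += 1
    return hits / n_studies


rate = false_positive_rate()
print(f"Share of studies with P < 0.05 for the covariate: {rate:.3f}")
```

Under these assumptions the printed share should land close to 0.05, even though no study in the simulation has any systematic difference between its groups.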


We screened a representative number of research articles published over 30 years within the discipline of behavioral ecology for these types of tests. We found that only 3 of 250 papers had reported a significant treatment-control difference for confounding variables (5% of 250 would be about 12 papers). We conclude that the lower-than-expected number of significant P-values in the literature for tests involving confounding variables could be caused by reverse P-hacking and/or selective reporting. Selective reporting could stem, for example, from editorial boards' decisions to reject a paper based on an experimental flaw (i.e. the effect of the variable of interest cannot be disentangled from that of the confounding variable).
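To get a rough sense of how unusual 3 out of 250 is, one can compute a binomial tail probability. This is a back-of-the-envelope sketch rather than the analysis performed in the paper: it assumes each of the 250 papers contributes one independent test with a 5% chance of a significant result, and the helper `binomial_cdf` is written here purely for illustration.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n_papers = 250   # papers screened (from the text)
observed = 3     # papers reporting a significant difference (from the text)
alpha = 0.05     # conventional significance threshold

expected = n_papers * alpha
tail = binomial_cdf(observed, n_papers, alpha)

print(f"Expected number of significant results: {expected:.1f}")
print(f"P(3 or fewer significant results by chance): {tail:.4f}")
```

Under these simplifying assumptions, observing three or fewer significant results is very unlikely by chance, which is consistent with the conclusion that something other than random variation is shaping what gets reported.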

Despite not being able to isolate reverse P-hacking as the cause of too few significant P-values, our empirical study provides a proof of concept, and we hope that future studies will replicate it in their own discipline. Much of the literature on publication bias is by statisticians discussing “in principle” methods to detect and correct for publication bias, or policy statements. These types of papers vastly outnumber studies that collect data. One of our main points was to show yet another way that the use of P-values (and a dichotomy between significance and non-significance) can lead to poor scientific practices that create a discrepancy between data collection/analysis and what eventually appears in the literature.

For some, it might come as a surprise that randomization alone is not enough to deal with confounding variables. The 5% of “unlucky” studies that obtain a significant difference by chance alone may see their scientific conclusions undermined, although acknowledging the difference can also help explain otherwise unexpected results. If one wants to minimize the likelihood of obtaining a significant difference in a confounding variable between treatment groups, we recommend the use of balanced designs over simple randomization. This would entail, for example, pre-sorting individuals based on a confounding variable (e.g. groups of similar-size animals) and then randomly selecting individuals from each category in turn to form treatment groups. In addition, thanks to the constant improvement of statistical methods, researchers have the option of controlling for confounding variables by including them in their statistical models, rather than attempting to control for them experimentally.
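The pre-sorting procedure described above can be sketched in a few lines. This is one possible implementation, not a protocol taken from the paper: individuals are sorted by the confounding variable, cut into consecutive blocks of similar individuals, and the group labels are shuffled within each block. The function name `blocked_assignment` and the simulated body sizes are hypothetical.

```python
import random
from statistics import mean


def blocked_assignment(covariate, groups=("control", "treatment"), seed=None):
    """Assign individuals to groups at random within blocks of similar covariate."""
    rng = random.Random(seed)
    # Sort individual indices by the confounding variable (e.g. body size).
    order = sorted(range(len(covariate)), key=lambda i: covariate[i])
    assignment = {}
    block_size = len(groups)
    for start in range(0, len(order), block_size):
        block = order[start:start + block_size]
        # Randomize which member of the block goes to which group.
        labels = list(groups)
        rng.shuffle(labels)
        for individual, label in zip(block, labels):
            assignment[individual] = label
    return assignment


# Hypothetical example: 40 animals with simulated body sizes.
rng = random.Random(7)
body_size = [rng.gauss(50.0, 10.0) for _ in range(40)]
groups = blocked_assignment(body_size, seed=7)

control = [body_size[i] for i, g in groups.items() if g == "control"]
treatment = [body_size[i] for i, g in groups.items() if g == "treatment"]
print(f"Group sizes: {len(control)} vs {len(treatment)}")
print(f"Difference in mean body size: {abs(mean(control) - mean(treatment)):.2f}")
```

Because each block pairs individuals of similar size, the group means can differ only by the small within-block differences, while the assignment inside each block remains random.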

Whatever the method used to form treatment groups, researchers should always test for differences in confounding variables and report them accurately. Editorial board members and reviewers must flag the absence of such tests and thus encourage better scientific practices in the community.

These findings are described in the article entitled Evidence that nonsignificant results are sometimes preferred: Reverse P-hacking or selective reporting?, recently published in the journal PLOS Biology.


  • Amrhein V, Greenland S, McShane B (2019) Scientists rise up against statistical significance. Nature 567: 305-307. doi:10.1038/d41586-019-00857-9
  • Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD (2015) The Extent and Consequences of P-Hacking in Science. PLoS Biol 13(3): e1002106. doi:10.1371/journal.pbio.1002106

About The Author

Pierre Chuard currently works as a research associate in the laboratory of Prof. Jade Savage at Bishop's University. He is the project coordinator of, a web platform identifying ticks using pictures submitted by the public to better monitor tick populations in Canada. In collaboration with public health officials, they aim at improving prevention and awareness to vector-borne pathogen exposure. Pierre is also a part-time professor at the University of Ottawa where he teaches Species Conservation Biology.

Milan Vrtílek is a postdoctoral researcher at the Institute of Vertebrate Biology, Czech Academy of Sciences. His research is focused on the life-history evolution in annual killifish - fish adapted to periodically desiccating ephemeral pools. He is mainly interested in reproductive biology and maternal effects.

Megan L. Head is a research scientist at the Australian National University.

Born and raised in Canberra, I moved to James Cook University to begin my undergrad in marine biology. During my degree I quickly realised that my passion lay in understanding biodiversity, so I moved to the ANU where I completed my degree with a focus on ecology and evolution. I conducted my honors with Prof Scott Keogh on chemical communication in water skinks. I then moved to the University of New South Wales to conduct my Ph.D. under the supervision of Prof Rob Brooks looking at the evolutionary consequences of the costs of mate choice. After my Ph.D., I spent two years at the University of Wisconsin investigating the role of sexual selection in speciation of three-spine sticklebacks. During this time I got to spend considerable time in the field on an island off the coast of British Columbia - what a beautiful place! After my time in the States I then spent 6 years as a post-doc in the UK - there, I worked on a range of animals and questions including maternal effects in dung beetles at QMUL, nesting behavior in sticklebacks at Uni of Leicester, and the evolution of parental care in burying beetles at the University of Exeter. In 2013 I moved back to Australia and the ANU to take up a post-doc with Prof Michael Jennions, looking at all things mosquitofish and reproducibility in science. In 2017 I took up an ARC Future Fellowship to investigate how sexually transmitted infections influence the evolution of mating behavior.

Michael D. Jennions is a research scientist at the Australian National University.

I am an evolutionary biologist with a special interest in behavioural ecology. I mainly work on sexual selection and reproductive decisions (female choice, male-male competition, sperm competition, parental care, life histories etc). I try to test predictions from general theory that can be widely applied across species: Do females prefer symmetrical males? Is the elaboration of sexual signals constrained by predation or by trade-offs with investment in other fitness-enhancing traits? I tend to ask a question and then pick a study animal that can be used to answer it. I have no taxonomic prejudice, but I do think it is important to feel some affinity with your study animal. You have to think it's cool. I am respectful of the incredible expertise many colleagues possess concerning the biology of specific taxa.