(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org. Licensed under Creative Commons Attribution (CC BY) license. url: https://journals.plos.org/plosone/s/licenses-and-copyright

Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance

Willem M. Otte (Biomedical MR Imaging and Spectroscopy, Center for Image Sciences, University Medical Center Utrecht, Utrecht, The Netherlands; Department of Child Neurology, UMC Utrecht Brain Center) and Christiaan H. Vinkers

Date: 2022-02

This study systematically assessed more than half a million full-text publications of RCTs published between 1990 and 2020 for the prevalence of specific phrases linked to almost but not formally significant results (i.e., P values just above the 0.05 threshold), including temporal trends and manual validation of the associated P values. We estimate that 9% of RCTs use specific phrases to report P values above 0.05. This prevalence has remained relatively stable over the past 3 decades. We also determined fluctuations over time in the frequently used nonsignificant phrases: some phrases gained popularity over time, whereas others declined. Our manual analysis confirmed that most of the phrases describing nonsignificant results corresponded with P values in the range of 0.05 to 0.15. Our study also has inherent limitations. First, we predefined more than 500 phrases denoting results that do not reach formal statistical significance. We may have missed phrases with similar meanings, which would lead to an underestimate of the overall prevalence. However, we deliberately did not implement an elastic search strategy in our PDFs, as this could change the interpretation of a match, for example, by removing a negation. With exact string matching, we can be certain that a specific spin-like phrase is written verbatim in the trial report.
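The exact-string-matching approach described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline; the phrase list and example text are invented (the study used more than 500 predefined phrases).

```python
# Minimal sketch of exact string matching of predefined phrases in a
# trial report. Phrases and example text are illustrative only.

PHRASES = [
    "marginally significant",
    "a trend toward significance",
    "failed to reach statistical significance",
]

def find_phrases(text: str, phrases=PHRASES) -> list[str]:
    """Return every predefined phrase that occurs verbatim in the text."""
    lowered = text.lower()
    return [p for p in phrases if p in lowered]

report = ("The difference between groups was marginally significant "
          "(P = 0.06) and showed a trend toward significance.")
print(find_phrases(report))
# ['marginally significant', 'a trend toward significance']
```

Exact substring matching (as opposed to fuzzy or "elastic" matching) guarantees that every hit is a verbatim occurrence, at the cost of missing paraphrases, which is exactly the trade-off the authors describe.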
However, this may have led to underreporting, and the actual prevalence of these phrases may therefore be underestimated. Second, not all phrases are equally specific in their association with P values just above 0.05. Third, we studied English-language RCTs only; generalizations to other languages can therefore not be made. Fourth, we only had access to published full texts. This prevents us from drawing causal conclusions, as nonpublished manuscripts with specific nonsignificant phrases, which did not undergo a peer-review process, are not available. Connected to that, despite our data collection in September 2020, we missed a relatively large proportion of RCTs published in 2020, rendering our results less stable for the last year. Fifth, we only characterized P values in the direct vicinity of the phrases. Long-range referrals in the text or tables were not included, so the association frequencies may be conservatively low. Sixth, it remains unknown whether some trials had nonsignificant results but used different sentences to describe them; this may also have caused us to underestimate the prevalence of these types of sentences. Seventh, we do not know whether the P value and the corresponding significance phrase actually referred to the study's primary outcomes or to less important secondary or tertiary outcomes. Finally, not all predetermined phrases necessarily represent a P value above 0.05 (e.g., "marginally significant"). However, in the manual analysis we hardly found P values lower than 0.05 corresponding with specific phrases (see S3 Table). For example, the phrase "failed to reach statistically significant results" highlights a fact, although it is not as neutral as simply stating "nonsignificant results." The amount of spin may therefore vary between phrases and potentially inflate some of our prevalence estimates for individual phrases describing marginally significant results.
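Characterizing P values "in the direct vicinity" of a phrase, as described in the fifth limitation, could look roughly like the sketch below. The window size and the regular expression are assumptions for illustration, not the study's actual extraction code.

```python
# Hedged sketch: collect P values occurring within a fixed character
# window around each occurrence of a matched phrase. Window size and
# regex are illustrative assumptions.
import re

P_VALUE = re.compile(r"[Pp]\s*[=<>]\s*(0?\.\d+)")

def p_values_near(text: str, phrase: str, window: int = 100) -> list[float]:
    """Return P values found within `window` characters of each phrase hit."""
    values = []
    for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        values += [float(v) for v in P_VALUE.findall(text[start:end])]
    return values

text = "Treatment effects were marginally significant (P = 0.07; HR 0.81)."
print(p_values_near(text, "marginally significant"))
# [0.07]
```

A fixed window deliberately ignores long-range referrals to tables or distant sentences, which is why the reported association frequencies are conservative.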
This is the first study to explore a vast body of PubMed-indexed RCTs for the occurrence of phrases reporting nonsignificant results. Given the relatively low frequency of several phrases, such a large sample is essential to effectively quantify prevalence and changes in phrasing over time. Moreover, we also quantified the actual P values associated with the most frequently used phrases reporting nonsignificant results.

Interpretation

Our findings suggest that specific phrasing to report nonsignificant findings remains fairly common in RCTs. RCTs are time- and energy-consuming endeavors, and an "almost significant" result can therefore be a disappointing experience for the interpretation and publication of the results: Did the RCT "find" an effect or not? Our description of the characteristics of the most prevalent phrases can help readers, peer reviewers, and editors detect potential spin in manuscripts that overstate or incorrectly interpret nonsignificant results. Our results also support the notion that some phrases are becoming more popular. The detected P value distributions are important in light of recent proposals to lower the default P value threshold to 0.005 to improve the validity and reproducibility of novel scientific findings [1]. P values near 0.05 are highly dependent on sample size and generally provide weak evidence for the alternative hypothesis. The 0.05 threshold can consequently lead to high probabilities of false-positive reporting or P-hacking in clinical trials [22]. However, replacing the common 0.05 threshold with an even lower arbitrary value is not a definitive solution. Clinical research is diverse, and redefining "statistical significance" to even less likely outcomes will probably have negative consequences.
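The link between the significance threshold and false-positive reporting can be made concrete with the standard positive predictive value formula, PPV = power x prior / (power x prior + alpha x (1 - prior)). The scenario below (10% of tested hypotheses true, 80% power) is an invented illustration, not data from the study, and it simplifies by holding power fixed while alpha changes, which a real fixed-sample trial could not do.

```python
# Illustrative calculation (assumed numbers, not from the study):
# positive predictive value of a "significant" finding under two
# alpha thresholds, using PPV = power*prior / (power*prior + alpha*(1-prior)).
def ppv(alpha: float, power: float, prior: float) -> float:
    true_pos = power * prior          # rate of true effects detected
    false_pos = alpha * (1 - prior)   # rate of null effects flagged
    return true_pos / (true_pos + false_pos)

# Assumed scenario: 10% of tested hypotheses are true, 80% power.
for alpha in (0.05, 0.005):
    print(f"alpha={alpha}: PPV={ppv(alpha, power=0.8, prior=0.1):.2f}")
# alpha=0.05: PPV=0.64
# alpha=0.005: PPV=0.95
```

Under these assumed numbers a significant result at alpha = 0.05 is a false positive roughly a third of the time, which is the motivation behind the 0.005 proposal; the text's counterargument is that in practice the stricter threshold also costs power and publishability.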
Lakens and colleagues [23] therefore suggest that we should abandon a universal cutoff value and the associated "statistical significance" phrasing and allow scholars to judge the clinical relevance of RCT results on a case-by-case basis. Based on our data, we think that such a personalized approach is beneficial for everyone, especially since it is currently unknown whether P value cutoffs as low as 0.005 indeed lead to lower false-positive reporting and to more rigorous clinical evidence. A stricter threshold requires large sample sizes in replication studies, which are hardly conducted, and will probably increase the risk of presenting underpowered clinical results. Moreover, since it is estimated that half of the results of clinical trials are never published [24], mainly due to negative findings, lowering the P value threshold may result in more "negative" studies that remain largely unpublished. Although the detrimental effects of lowering the threshold for statistical significance for medical intervention data are disputed [25–27], a recent retrospective RCT investigation showed that shifting the threshold of statistical significance from P < 0.05 to P < 0.005 would have limited effects on medical intervention recommendations, as 85% of recommended interventions showed P values below 0.005 for their primary outcome [28]. We are also aware that this will come with new problems and ways to game the new artificial statistical threshold. We think that if authors discuss and judge their threshold value transparently and show the clinical relevance, there is no need to tie oneself to a universal P value cutoff. Journal editors and (statistical) reviewers can play an important role in propagating ideas from the so-called "new statistics" strategy, which aims to switch from null hypothesis significance testing to using effect sizes and the cumulation of evidence to explore and determine the potential clinical relevance of results [29–31].
Chavalarias and colleagues [32] describe related results on the reporting of P values, effect sizes, and confidence intervals (CIs): in the vast majority (88%) of the included RCTs, they found a reported P value < 0.05. They also highlight that CIs were reported in only 2% to 3% of the analyzed abstracts, and that 22% of the abstracts described effect sizes. Despite this improvement, we remain skeptical: shifting the emphasis may simply move the problem and stimulate researchers to spin their effect sizes and CIs instead. Some argue that Bayes factors (BFs) should replace the quest for statistical significance, and BFs are indeed considered a good alternative; in our analysis, some phrases were associated with BFs representing "decisive evidence" for temporal changes. However, BFs may be subject to other biases and linguistic persuasion and should be interpreted in light of their research context [33], so this may not be a definitive solution either. Our study questions the current emphasis on a fixed P value cutoff in interpreting and publishing RCT results. Besides abandoning a universally held and fixed statistical significance threshold, an additional solution may be the 2-step submission process that has gained popularity in past years [34,35]. Here, an author first submits a version including only the introduction and methods. Based on the reviews of this submission, the journal provisionally accepts the manuscript. Once the data are collected, the authors finalize their paper with the results and interpretation, knowing that it is already accepted. In conclusion, too much focus on formal statistical significance cutoffs hinders responsible interpretation of RCT results. It may increase the risk of misinterpretation and selective publication, particularly when P values approach but do not cross the 0.05 threshold.
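To make the Bayes factor alternative concrete: one common, if rough, route is the BIC approximation BF10 = exp((BIC_null - BIC_alt) / 2) for two nested Gaussian models. The sketch below compares a one-mean (null) model against a two-means model for invented two-group data; it is an illustration of the idea, not the method used in this study or in [33].

```python
# Hedged illustration: approximate Bayes factor for a two-group mean
# comparison via the BIC approximation BF10 = exp((BIC0 - BIC1) / 2).
# Data are invented; this is a sketch, not the study's analysis.
import math

def bic(rss: float, n: int, k: int) -> float:
    """BIC for a Gaussian model: n*ln(RSS/n) + k*ln(n), k = free means."""
    return n * math.log(rss / n) + k * math.log(n)

def bf10_two_groups(a: list[float], b: list[float]) -> float:
    """BF in favor of separate group means over one common mean."""
    n = len(a) + len(b)
    grand = (sum(a) + sum(b)) / n
    rss0 = sum((x - grand) ** 2 for x in a + b)            # null: one mean
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    rss1 = (sum((x - ma) ** 2 for x in a)
            + sum((x - mb) ** 2 for x in b))               # alt: two means
    return math.exp((bic(rss0, n, k=1) - bic(rss1, n, k=2)) / 2)

print(bf10_two_groups([5.1, 4.9, 5.3, 5.0], [6.2, 6.0, 6.4, 6.1]))
```

For these clearly separated invented groups the BF comes out in the thousands, which on Jeffreys' conventional scale would count as "decisive evidence"; but as the text notes, the framing of such a number in prose is just as open to linguistic persuasion as a P value.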
Fifteen years of advocacy to shift away from null hypothesis testing has not yet fully materialized in RCT publications. We hope this study will stimulate researchers to put their creativity to good use in scientific research, abandon a narrow focus on fixed statistical thresholds, and instead judge statistical differences in RCTs on their effect sizes and clinical merits.

[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001562