Evidence-based medicine: What do we mean by ‘low quality’ evidence in Trans Healthcare?

In 2021 the NHS (NICE) reviewed the evidence for the use of puberty blockers within trans adolescent healthcare. They concluded that the evidence was of “very low certainty”, a finding that was then repeated across national media, including under the headline “Evidence for puberty blockers use very low, says NICE”. Unsurprisingly, this prompted increased demands, including from politicians, for the withdrawal of trans adolescent healthcare.

Several people have written about the flaws in the NICE review, including this excellent article by AJ Eckert. I’ve examined the concerns of parents of trans children about the NICE approach to evidence in a recent peer-reviewed article, in which parents described puberty blocker Randomised Control Trials as “conversion therapy” or akin to “eugenics”.

A recent expert report from the US indirectly challenges the NICE approach, and merits further reading.


In 2022 the Florida Agency for Health Care Administration published a biased and ideologically driven report claiming that medical care for gender dysphoria does not meet generally accepted medical standards. A group of experts linked to Yale University responded with an analysis and critique of the poor science underpinning Florida’s report. Their full report is here.

Evidence Quality

One section of the expert report from Yale University focuses on evidence quality. Its analysis is relevant to conversations on trans adolescent healthcare happening in the UK (including the Cass review), so this blog will shine a spotlight on this section of the report.

*Note: In the sections quoted below, BPW refers to sources cited in the original (and misleading) Florida report. The June 2 Report refers to the original (and flawed) Florida report.

The following italicised text is taken directly from the Yale report. Please take time to read this section through:

…the BPW analysis reaches the conclusion that there is little or no evidence for the benefits of medical care for gender dysphoria.

The BPW analysis is highly deceptive, because it dismisses nearly all existing studies of medical treatment for gender dysphoria as “low quality,” without explaining that this is a highly technical term and not a natural-language condemnation of the studies. By contrast, the GRADE system, which the authors purport to use, is quite clear about its quality rating systems and its limitations.

In general, only randomized controlled trials (RCTs) are coded as “high” quality evidence in the GRADE system. A randomized controlled trial is a study that divides patients randomly into a control group (no treatment) and a treatment group. In contrast, an observational study records information about patients in a real-world setting that is more reliably generalizable, e.g., a cohort of patients seen at a clinic. Under the GRADE guidelines, observational studies are coded as “low” in quality.

The key point is that “low quality” in this context is a technical term and not a condemnation of the evidence, because “low quality” studies regularly guide important aspects of clinical practice. Indeed, the GRADE system, which the BPW document claims to use, specifically notes that GRADE should not be used to dismiss observational studies or to give absolute priority to RCTs: “Although higher quality evidence is more likely to be associated with strong recommendations than lower quality evidence, a particular level of quality does not imply a particular strength of recommendation. Sometimes, low or very low quality evidence can lead to a strong recommendation.”

The methodology adopted by the BPW document will thus, predictably, conclude that any body of scientific literature that does not contain RCTs is “low” in quality. Had BPW begun, as they should have, with a literature review of the evidence on puberty blockers and hormones, they would have seen that the evidence consists primarily of observational studies (for the good reasons discussed below). Thus, the 30 pages that it takes the authors to lay out their methodology is misleading: a knowledgeable reader would know that if there are few or no RCTs in the literature, then the BPW technical conclusion is foregone and, as importantly, is not a sound guide for clinical recommendations.

Put in simpler terms, if we coded apples as “high quality fruit” and bananas as “low quality fruit,” then any fruit bowl that has only bananas would predictably be technically coded as “low quality.” But that technical conclusion conveys very little information without context. For example, if no apples exist, then bananas may be a nutritious choice.

The drafters of the GRADE system emphasize that technically “low quality” evidence can support a strong clinical treatment recommendation. For example, pediatricians now agree that children should not be given aspirin for fevers. This recommendation is based on observational studies that showed an association between aspirin treatment during viral illnesses and the development of Reyes syndrome (a rapid and progressive disease of neurological dysfunction that can be fatal). Based on those studies, it would be unethical to conduct an RCT giving some children aspirin, and so the strong, consensus treatment recommendation is based entirely on “low quality” studies.

The critical fact is that RCTs are not, and cannot be, the gold standard for medical research on gender dysphoria. In the context of treatments for gender dysphoria, randomized controlled trials would often be inappropriate for ethical reasons. Medical care has long been shown, by reliable scientific methods, to address gender dysphoria and improve mental health: as we have repeatedly noted, these treatments have been recommended by rigorous clinical practice guidelines issued by WPATH and the Endocrine Society and endorsed by every major medical organization. Given this medical consensus, which is based on solid scientific evidence, it would be unethical to conduct an RCT that involved denying standard medical care to a control group of individuals.

Similar ethical issues, along with practical barriers, leave many areas of consensus medicine supported by observational studies and not RCTs. Many surgical procedures, for example, are not supported by RCTs. Nor are standard protocols for lowering cholesterol using statins, one of the most widely-prescribed drugs in the United States.

It is thus simply a mistake – and a mischaracterization of medical research across fields of medicine – to conclude that the absence of RCTs means that there is “no evidence” for the efficacy of medical treatment for gender dysphoria. Medical research requires, instead, that researchers evaluate the design and conduct of specific observational studies and do so with an awareness of clinical context.

In sharp contrast to BPW, this is precisely what the authors of the Endocrine Society did in their 2017 clinical guidelines, which use the GRADE system but, in addition, carefully discuss the characteristics of the studies supporting each treatment guideline. The Endocrine Society discloses the GRADE rankings for each treatment recommendation in order to be transparent about the evidence base for each of its recommendations. Then, following National Academy of Medicine (formerly, Institute of Medicine) standards for clinical practice guidelines, they proceed to a qualitative review of the evidence, place the evidence in clinical context, and discuss openly the values at stake in making a clinical practice recommendation.

The June 2 Report repeatedly and erroneously dismisses solid studies as “low quality.” If Florida’s Medicaid program applied the June 2 Report’s approach to all medical procedures equally, it would have to deny coverage for widely-used medications like statins (cholesterol-lowering drugs taken by millions of older Americans) and common medical procedures like mammograms and routine surgeries.

In its opening words, the June 2 Report makes an error that is repeated throughout the document: “Studies presenting the benefits to mental health, including those claiming that the services prevent suicide, are either low or very low quality and rely on unreliable methods such as surveys and retrospective analyses, both of which are cross-sectional and highly biased.” As we document in Section II.B., above, it is an outright mistake to conclude that a study in the technical category of “low quality” is unreliable or poor evidence for clinical practice. Thus, it is frank error for the June 2 Report to dismiss well-done, scientifically important studies because they rank as “low quality” using specialized, technical terms.

Like the BPW document, the June 2 Report thus relies on a deceptive use of technical terminology that is at odds with the standards used in medical research. It simply is not – and cannot be – the case that all clinical recommendations must be based on RCTs. Many areas of medicine do not lend themselves to ethical and practical RCTs. It is unethical to conduct an RCT when randomizing a patient to a control group would cause harm by denying treatments of known efficacy. For example, it would be unethical to conduct an RCT on the treatment of juvenile diabetes by randomizing some participants to receive insulin and others to receive no treatment.

It is quite common for the medical community to adopt important, consensus clinical practices supported by observational studies alone. For example, observational studies, notably the famous Framingham Heart Study, provided the framework for clinical practice guidelines in prevention and treatment of cardiovascular disease. In 2013, the American College of Cardiology and the American Heart Association issued updated clinical practice guidelines on the treatment of cholesterol to reduce heart disease risk in adults (the “Cholesterol Guidelines”). These authoritative guidelines have been widely used in clinical practice but are based not only on RCTs but on a great deal of observational evidence, including studies technically ranked as “low quality.” Concretely, many of the original treatment recommendations regarding statins are based on observational studies, not RCTs. The authors of the Cholesterol Guidelines, very much like the Endocrine Society authors, are quite careful to grade their evidence. But they do not rest their treatment guidelines on a mechanical assessment of technical quality. Instead, they (like the Endocrine Society) carefully explain why particular bodies of evidence should be given weight in clinical decision-making.

The cholesterol example shows that the June 2 Report rests on a fundamental misunderstanding of medical research and clinical practice. If the Florida Medicaid program actually adopted the standard of evidence urged by the June 2 report, the program would not cover statins (drugs to lower cholesterol) for many patients, which are prescribed to 28% of adults over the age of 40 and are one of the most effective ways to prevent cardiovascular death. Other common practices that would have to be reconsidered under this logic include: post-menopausal hormone replacement therapy (which reduces lifetime risk of heart attacks and stroke) and mammography screening for breast cancer.

The same point is true of the technically “low quality” evidence base for many surgical procedures, including minimally invasive gall bladder surgery, which have long since had a foundational grounding in observational studies. We think it unlikely that Florida’s Medicaid program will begin to refuse to pay for statins, mammograms, and routine surgeries. If not, then the June 2 Report reflects an untenable and discriminatory double standard.

(To avoid any confusion: all the italicised text above is not my work but a direct quote from the Yale report, whose authors are listed here.)

Relevance for the NHS

The text above prompts several important questions for the NHS:

  1. Does the NHS (NICE report and Cass review) reflect a similar “discriminatory double standard” in its characterisation of the evidence base as ‘low quality’? Does NICE use the same approach across other areas of healthcare (e.g. abortion)?
  2. Should the NHS (NICE and the Cass review) be more careful when using technical terms like “low quality evidence” in an already polarised and politicised area of healthcare, knowing how such terms are reported in the media and knowing how such terms inflame and undermine support for healthcare?
  3. Why has the NHS (NICE and the Cass review) failed to grapple with the very well established reasons why “high quality evidence” like RCTs are infeasible and unethical? Given the known ethical and practical impossibility of RCTs, why on earth did the NICE review recommend collection of such “high quality evidence”?

I’m also left with some bigger questions for the UK:

4. The above intervention from an esteemed team of establishment professors across different disciplines is likely to carry some weight in challenging poor healthcare sector practices and holding them to account. Here in the UK we do not have paediatricians, law professors, or experts in effective approaches to experience-informed healthcare speaking up about the clear flaws in the NHS approach to trans adolescent healthcare. Why not?

5. Are there lessons we can learn from experts in other areas of healthcare who have challenged commitment to a narrow definition of ‘evidence’ within healthcare policy?

(post-script: I’d normally include a link to the NICE report, but it appears to have disappeared from the NICE website – if someone has a functioning link please let me know so I can update it).

