HER2 and Herceptin represent the biggest success story in breast
cancer of the last two decades. All might be as advertised, but growing
evidence raises key questions: Does HER2 drive breast cancer? Is it prognostic?
Is the definition of HER2-positive clinically validated? Are HER2 assays
reproducible? Can re-testing HER2 status change trial outcomes? Does HER2
predict benefit from Herceptin? Does Herceptin’s mechanism of action depend on
HER2? Has Herceptin increased five-year breast cancer survival at the
population level? Would Herceptin be approved by the FDA today? In the most
recent trials of Herceptin and T-DM1, can the spectacular success of adding
pertuzumab to Herceptin be squared with the failure of pertuzumab to make any
difference when given with Herceptin-based T-DM1?
Examination of these questions (ten in total) casts doubt on the validity and clinical utility of the HER2 subtype and current treatment guidelines for Herceptin in breast cancer.
1. Does HER2 drive breast cancer?
Summary: Evidence conflicts regarding whether HER2 drives cancer in humans; the decades-old experiments should be repeated to resolve the question definitively. Today, HER2 is the sole “Class I” oncogene that works by overexpression or amplification rather than mutation; no other type of cancer with a targeted therapy is driven by overexpression or amplification of the target. GRB7 is frequently co-amplified with HER2 and might be required to drive cancer, or GRB7 might contribute to oncogenesis independently of HER2. Conceivably, neither HER2, GRB7 nor any of the genes on the amplicon they share drives oncogenesis.
Robert Weinberg’s lab discovered that mutation of neu in rats drives cancer. But the analogous gene in humans, HER2, is rarely mutated in breast cancer. Instead HER2 is either amplified or its protein product overexpressed. However, Weinberg’s lab found that amplifying neu 100-fold and increasing expression 10-fold did not transform mouse cells from normal to cancerous. Subsequently, the lab of Stuart Aaronson performed two similar experiments in the same mouse NIH/3T3 cell line but amplifying HER2 rather than neu. The first experiment confirmed Weinberg’s finding. But a second test used a different promoter that ratcheted HER2 expression five to ten times higher than in the first experiment, resulting in transformed cells, according to the researchers. Three decades later, however, Weinberg seems unpersuaded: the “report of Aaronson… may or may not have been independently replicated over the past 30 years since it appeared,” Weinberg wrote in email.
The experiments should be repeated. (I was not able to determine if HER2 has been shown to be transforming in human cell lines.)
HER2 is the only “Class I” oncogene that drives by overexpression and/or amplification
A 2004 census of amplified and overexpressed oncogenes found that just six met the most stringent “Class I” criteria. But by 2010, this fell to three: EGFR, AR and HER2. Now in 2016, HER2 stands alone as a Class I gene.
Class I genes must meet the lesser criteria of Class II and III genes and also require “that a drug that targets the encoded protein is used to treat patients for which efficacy must have been shown in clinical trials.” For EGFR in colorectal cancer, the targeted agent is Erbitux. However, although the FDA label for Erbitux lists EGFR/ERBB1 expression as an indication, the “consensus is that ERBB1 expression not required for therapeutic success,” according to Bert Vogelstein at Johns Hopkins University. In prostate cancer, amplification of the androgen receptor gene (AR) does not initiate prostate cancer but results from treatment. That leaves HER2 in a class by itself.
I asked Mike Stratton, a co-author of the 2010 census paper and director of the Sanger Institute, if HER2 was indeed the sole Class I gene. Stratton replied: “I think that what you say is correct.”
A fundamental principle of targeted therapy is to attack tumor cells without harming normal cells. Consequently, targets are usually mutations. Herceptin’s ostensible target, however, is unmutated HER2. There is no driving oncogene for any human cancer where a targeted therapy aims for an unmutated target—except HER2 breast cancer.
In cancer cells, amplified regions are quite common. The near absence of Class I genes, according to the 2010 census paper, “reflects the difficulty encountered in identifying the true cancer gene on amplicons that often include several candidate genes.” GRB7 resides on the same amplicon as HER2, and one research group found that both genes are co-amplified in 15% of invasive breast cancers. The same group suggested HER2 was not transforming by itself but required GRB7 co-amplification, creating the possibility that a “combination of multiple genes, which do not have independent transforming activity, causes transformation.” That is, HER2 might not be transforming. An analysis of the Cancer Genome Atlas concurred that “GRB7 may be necessary for cancer cells harboring this amplicon, as previously suggested.” Betsy Ramsey’s lab also studied GRB7 and HER2. In email, Ramsey summarized: “Certainly GRB7 seems to be a player with or perhaps without HER2.”
2. Is HER2 prognostic in breast cancer?
Summary: Individual studies come to disparate conclusions about HER2 as prognostic in breast cancer. Reviews of the literature have been tagged with Expressions of Concern, leaving the belief that HER2 is prognostic unsupported.
Women testing HER2-positive opt for more radical surgery than patients found to be HER2-negative. According to an analysis of more than 113,000 women, “mastectomy rates were higher in women with HER2-positive tumors than in those with HER2-negative tumors.” The reason is not known, although the negative prognosis associated with HER2 breast cancer, long considered “aggressive,” might be leading doctors and patients to pursue more aggressive treatment. But the evidence no longer supports HER2 as prognostic.
The HER2 prognostic literature begins with a 1987 paper from Slamon et al. which considered 86 node-positive patients from an unrelated clinical trial. (This might have made the cases non-random. In addition, the authors did not discuss whether the 86 were all from the same arm, raising the possibility of treatment confounding the analysis.) Slamon and colleagues found a statistically significant association between degree of HER2 amplification and recurrence and, to a lesser degree, survival: greater HER2 amplification increased the risk of recurrence and shortened survival. But when simply separating cases into amplified (n=34) vs. non-amplified (n=52), no difference in recurrence or survival emerged. To show such a difference and that HER2 was prognostic, the researchers dropped patients with a middling HER2 copy number of 2-4 (n=23) from the analysis. Based on the few remaining amplified patients (n=11), with gene copy numbers of five or more, the authors reported a statistically significant difference in disease free survival (p=0.015) although still no difference in survival (p=0.06). Thus was born HER2’s reputation as an aggressive form of breast cancer.
More than one hundred studies followed Slamon et al., with an array of disparate results that might be expected given the methodology of the founding paper. However, the contradictory mess resolved into iron scientific consensus: “Initial conflicting reports regarding the prognostic relevance of HER2 were resolved with improved methodologies,” wrote Mark Moasser “and the overwhelming data now confirms this initial [Slamon et al.] landmark genetic-biologic finding.” The overwhelming data, according to Moasser, was “nicely reviewed” in a paper by Jeffrey Ross and colleagues.
Ross served as first author on three reviews of the HER2 prognostic literature, published in 1999, 2003, and 2009. However, in March of this year, all three reviews received an Expression of Concern from The Oncologist. (The EOC was prompted by my re-analysis of the Ross et al. reviews.)
The 2009 review included 107 papers. No search or inclusion criteria were specified, creating the possibility of selection bias. More important than how the 107 papers were chosen, however, the review contains errors on 30. More than one in four (28%) of the papers reviewed are mis-reported or should not have been included. For example, a paper from Battifora et al. is misreported: Ross et al. categorized it as “yes” under multivariate analysis of prognostic factors when the paper clearly did not find that HER2 was independently prognostic:
“This analysis identified independent prognostic factors of DFS and OS when all variables were considered together. Independent predictors of DFS included stage of disease, histology, and nuclear grade. Nuclear grade and stage were the only significant predictors of OS.”
There are 10 errors of this particularly blatant sort among the 107 papers reviewed in Ross et al. 2009. A separate group of seven of the 107 papers conducted no multivariate analysis of whether HER2 was prognostic, but Ross et al. reported that those studies did and that each of the seven found HER2 independently prognostic. Six of the seven did not conduct a multivariate analysis of any kind; one of the seven did but all patients were HER2-positive.
An additional set of 11 papers should have been excluded. Nine of these correlated HER2 with different biomarkers, not clinical outcomes. Among these, one paper included 3,655 patients, by far the largest study in the review. Together, these 11 papers contributed 7,511 (19%) of the 39,730 patients in the review. Of these extraneous patients, 7,213 (96%) were adduced in support of HER2 being independently prognostic.
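The percentages quoted for the 2009 review can be rechecked from the counts given in the text. The snippet below is only an arithmetic check; the variable names are illustrative and no figure beyond those quoted above is introduced.

```python
# Figures as quoted in the text for the Ross et al. 2009 review.
PAPERS_REVIEWED = 107
PAPERS_WITH_ERRORS = 30
PATIENTS_IN_REVIEW = 39_730
PATIENTS_IN_EXCLUDABLE_PAPERS = 7_511
EXCLUDABLE_PATIENTS_CITED_AS_SUPPORT = 7_213


def percent(part, whole):
    """Return part/whole as a percentage, rounded to the nearest integer."""
    return round(100 * part / whole)


error_share = percent(PAPERS_WITH_ERRORS, PAPERS_REVIEWED)
excludable_share = percent(PATIENTS_IN_EXCLUDABLE_PAPERS, PATIENTS_IN_REVIEW)
support_share = percent(
    EXCLUDABLE_PATIENTS_CITED_AS_SUPPORT, PATIENTS_IN_EXCLUDABLE_PAPERS
)

print(f"Papers with errors: {error_share}%")
print(f"Patients from papers that should have been excluded: {excludable_share}%")
print(f"Of those, cited in support of HER2 as prognostic: {support_share}%")
```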
I contacted Jeffrey Ross regarding these errors. Concerning those in his 2003 paper, Ross acknowledged “scattered errors.” Ross disputed none of the 30 errors I identified.
There do not appear to be other literature reviews showing HER2 to be prognostic in breast cancer, leaving that belief unsupported.
3. Is the definition of HER2-positive clinically validated?
Summary: There is no gold standard assay for definitively identifying HER2-positive tumors. The preferred assay and cutoff values have changed and changed back over time. But the modified standards are arbitrary, not clinically validated. The changes may reduce interobserver disagreement but do not increase accuracy in identifying true HER2-positives.
Changing assays: IHC vs. FISH
“There is no gold standard at present,” according to the most recent guidelines for HER2 testing, published in 2013. Perhaps for that reason, the preferred method for determining HER2 status has changed over time. Early research had shown that “amplification added little predictive value to the expression data.” Instead, the assay of choice, immunohistochemical staining (IHC), measured HER2 overexpression. The senior author of the paper, Dennis Slamon, subsequently led the first Phase III trial of Herceptin, for metastatic breast cancer. That study used immunohistochemical analysis and led to the first FDA approval of Herceptin in 1998.
However, not long after, re-analysis by the trialists showed that amplification measured by fluorescence in situ hybridization (FISH) predicted response better than IHC:
“FISH assays have higher sensitivity and higher accuracy and more frequently correctly identify altered HER-2/neu status (amplification/overexpression) in previously molecularly characterized specimens than did the FDA-approved immunohistochemistry assays interpreted manually.”
Of note, two of the authors, Michael Press and Dennis Slamon, also co-wrote the paper nine years before which found that “amplification added little predictive value to the expression data.” Now it was the opposite. Slamon-led researchers ultimately dismissed IHC with prejudice in 2005: “We do not consider immunohistochemistry screening for entry to clinical trials or for selection to Herceptin immunotherapy to be an acceptable strategy.”
The FDA, not long after approving Herceptin and IHC, voiced its displeasure about HER2 testing and the “many unanswered questions regarding HER2 detection systems…” The agency threatened to change the Herceptin label because of the “considerable confusion and misunderstanding on the part of the oncology community,” which was “significant enough to warrant general precautionary comment in the trastuzumab [Herceptin] label…”
By 2006, HER2 testing was broken, “a disorganized practice” with a “high rate of inaccuracy,” according to guidelines published that year by the American Society of Clinical Oncology (ASCO) and the College of American Pathologists (CAP). Both organizations had previously recommended HER2 testing but without seeing a need to specify how to do it.
Instead of choosing sides in the IHC vs. FISH debate, the new guidelines put the two assays together. No single kind of test sufficed to identify all HER2-positive tumors. An equivocal result from one assay would trigger another test using the other assay. The UK used this “two tier” system, and the ASCO/CAP 2006 guidelines proposed the same approach for the United States.
Although the UK guidelines detailed testing procedures, they did not cite clinical evidence in their support. Instead, the UK authors “expected that emerging data on accuracy of prediction of response to HER2 targeted treatments will influence the choice of testing method.” The US guidelines also adduced no evidence in support of the two tier system, acknowledging that, “current data are insufficient to define whether these patients represent true- or false-positives.”
Some of the ASCO/CAP panelists had wanted to scrap IHC entirely: “A minority view expressed within the panel was that IHC is not a sufficiently accurate assay to determine HER2 status and that FISH should be preferentially used.” Instead, between 80 and 90% of primary HER2 testing in the United States is done with IHC, and only 10 to 20% uses FISH, according to figures from 2008. A 2010 paper reported that 80% of HER2 assessments in the United States used the IHC-based HercepTest, manufactured by Dako. FISH costs more and requires more expensive equipment, a possible barrier for some hospitals and conceivably part of why FISH is not mandatory.
Herceptin trials were no help in shaping the ASCO/CAP guidelines: “the large prospective randomized clinical trials of trastuzumab were not prospectively designed to answer these questions.” Instead, retrospective “correlative studies” would have to suffice. Also, instead of seeking a true gold standard, any new assays would be measured by how well they reproduced the results of the old assays. According to the guidelines:
“Although a new HER2 assay ideally should have its clinical utility validated using specimens from prospective therapeutic trials that tested the effects of anti-HER2 therapy, the Update Committee recognizes that the rarity of these valuable specimens requires that new HER2 assays be approved on the basis of concordance studies comparing them with other established HER2 tests.”
In addition, the guidelines aimed for concordance rather than accurate diagnosis, with concordance substituting for accuracy. Inaccurate diagnosis is not possible to detect because there is no gold standard. By contrast, discordance can be measured, causes embarrassment and raises concerns at the FDA. As the guidelines recognized, however, interobserver agreement is not the same thing as accurate diagnosis: “concordance of assays does not assure accuracy (i.e., how close the measured values are to a supposed true value…).” Early UK guidelines strove to reduce “interobserver variation in the assessment of staining” by standardizing scoring “against known positive, negative, and borderline cases.” I asked Ian Ellis, corresponding author of the long-ago UK guidelines paper, whether the cases were clinically validated or if the staining patterns were selected for the purpose of minimizing interobserver disagreement. Ellis did not reply to multiple inquiries.
In 2006, widespread HER2 assay discordance prompted ASCO/CAP to announce more stringent cutoffs than the FDA because “the original US Food and Drug Administration-approved interpretation guidelines provide insufficient specificity.” The FDA had approved an IHC staining threshold of 10% of cells. But the panelists believed this resulted in an unacceptably large number of false positives. The guidelines recommended a higher cutoff of 30%, not based on published evidence but anecdote, “the cumulative experience of panel members that usually a high percentage of the cells will be positive if it is a true IHC 3+.”
The guidelines referred to “published reports using cutoff values higher than 10%,” but the footnote pointed to a single study in France which achieved 95% concordance between IHC and FISH by using a vastly higher staining cutoff of 60%. Why the revised US guidelines recommended a 30% cutoff is not clear.
The cutoff for FISH HER2/CEP17 ratio was also raised, from 2.0 to >2.2. In addition to changing the definition of IHC 3+, IHC 2+ patients were no longer considered HER2-positive as they had been in the original, FDA approval-winning trial of Herceptin.
But then seven years after raising the IHC thresholds, the ASCO/CAP guidelines rolled them back. The 2013 guidelines committee “decided to revert to the previously used IHC criterion of more than 10% cells staining.” Re-examination of one trial with 2,904 patients found that the higher threshold only excluded 107 or 3.7%. This seemed to contradict the “cumulative experience” of the 2006 panel that “usually a high percentage of the cells will be positive if it is a true IHC 3+.” Now it seemed to be remarkably rare for IHC 3+ to have more than 10% staining.
The re-examination found that patients not meeting the higher ASCO/CAP guidelines showed no difference (p=.55) in disease free survival whether they received Herceptin or not. But the guidelines were switched back nonetheless.
The threshold for IHC had been raised in 2006 out of concern for false positives. But a then-ongoing study assuaged this worry. The 2013 guidelines said that study found “less than 6% of patients initially considered eligible were not subsequently centrally confirmed as being HER2-positive.” The US, at least, appeared to have its house in order.
The 6% figure came from a central laboratory at the Mayo Clinic that re-examined about one thousand locally-tested patients. However, central vs. local testing in Europe of more than 8,000 patients yielded discordant results in 15% of cases. And when the two US and European central labs later compared results, examining only samples known to be false negatives, they differed on IHC scores for 6 of 25 cases (24%) while FISH scores differed for 3 of 25 (12%). Moreover, the Mayo Clinic systematically assigned higher IHC and FISH scores to a set of 23 cases previously judged as equivocal in local testing. Of the 23, the Mayo Clinic found 15 to be HER2-positive versus 11 according to the European lab. In other words, the Mayo Clinic might have generated a high percentage of false positives, the concern that had led to raising the IHC staining threshold to 30%, which the 2013 guidelines rolled back.
Which central lab was right? Did the high US concordance rate mean US labs were correctly identifying true HER2-positive patients and the European lab was wrong? “It is not possible to know,” according to the ring study paper, “which central laboratory determination of HER2 status… was biologically correct in terms of distinguishing patients who do or do not benefit from HER2-targeted… therapies.”
4. Are HER2 assays reproducible?
Summary: Neither IHC nor FISH, and neither local nor central testing, generates reliably reproducible HER2 results. Re-examination of the clinical trials leading to FDA approval of Herceptin found discordant HER2 status in as many as 26% of patients. Technical shortcomings of both IHC and FISH assays contribute to reproducibility problems. Each laboratory may be using its own cutoff criteria in judging these semi-quantitative assays. The type of assay, its manufacturer and who performs the test can decisively influence the HER2 status of any given patient.
The major trials leading to FDA approvals of Herceptin in breast cancer have been re-examined, producing substantially different results for patients’ HER2 status. As mentioned, the breakthrough trial leading to the first FDA approval of Herceptin depended on IHC. But re-examination found amplification measured by FISH predicted response to Herceptin more accurately; i.e., FISH did not reproduce IHC.
Central testing frequently contradicts local results, a problem which affected both US trials that supported FDA approval of Herceptin in the adjuvant setting. Central retesting of patients in NCCTG N9831 failed to reproduce a local HER2-positive result in as many as 26% of re-examined cases. Similarly, retrospective analysis of NSABP B-31 found 18% of tumors were HER2-positive according to local testing but HER2-negative by central testing using both IHC and FISH.
Trial MA.31 compared Herceptin and lapatinib in metastatic breast cancer in 652 patients found HER2-positive by local testing. However, central re-testing found 115 patients (18%) were not HER2-positive, although they had been enrolled in the trial and treated with an anti-HER2 therapy.
False negatives are also a problem. A 2014 study of 552 locally HER2-negative patients found that 4% were HER2-positive by central testing.
Non-reproducibility also impacts central laboratory testing. As mentioned, a US and European lab each examined the same set of 23 equivocal HER2 cases. The US lab found 15 (65%) HER2-positive while the European lab found only 11 (48%) HER2-positive.
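The ring-study figures above can be rechecked, and they also imply a floor on how often the two central labs must have disagreed. The sketch below uses only the counts quoted in the text. The case-by-case cross-tabulation is not given here, so the lower bound is an inference from the totals rather than a reported result, and the variable names are illustrative.

```python
# Counts as quoted in the text; nothing here is new data.
EQUIVOCAL_CASES = 23
positives_by_lab = {"US central lab": 15, "European central lab": 11}

shares = {}
for lab, positives in positives_by_lab.items():
    shares[lab] = round(100 * positives / EQUIVOCAL_CASES)
    print(f"{lab}: {positives}/{EQUIVOCAL_CASES} = {shares[lab]}% HER2-positive")

# If one lab calls 15 cases positive and the other calls 11, the labs must
# disagree on at least the difference between those totals. The actual number
# of disagreements could be larger, because the totals alone do not show
# which individual cases each lab called positive.
minimum_disagreements = abs(
    positives_by_lab["US central lab"] - positives_by_lab["European central lab"]
)
minimum_disagreement_share = round(100 * minimum_disagreements / EQUIVOCAL_CASES)
print(
    f"Labs disagree on at least {minimum_disagreements} of {EQUIVOCAL_CASES} "
    f"cases ({minimum_disagreement_share}%)"
)
```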
Contributing to reproducibility problems are shortcomings of the tests themselves. In 2006, the ASCO/CAP HER2 testing panel had considered throwing out IHC, the original technology for identifying HER2-positive breast cancer. Skepticism, even condemnation, of IHC continues to this day. “The IHC assay is lousy,” according to Bert Vogelstein at Johns Hopkins University. “No IHC assay is great, many are inaccurate,” he added. FISH, according to Vogelstein, “is not that great either, but it’s the best that the pathologists have.”
For both IHC and FISH, the handling of samples before the test can affect test results. Also, the assays are only semi-quantitative. As the FDA observed in 2001, it “views both IHC and FISH as semi-quantitative if performed under ideal circumstances.” In addition, “Both methods require subjective interpretation.”
According to David Rimm, a HER2 testing expert at Yale Medical School, each lab has its own cutoffs, which he considered a “dirty secret.” (In reply to my initial inquiries about issues with HER2 testing, Rimm replied: “You are about to uncover a landmine.”) The College of American Pathologists (CAP) sends HER2 testing facilities samples to measure and encourage adherence to common, cross-laboratory criteria for HER2-positives and negatives. But according to Rimm, “the College doesn’t send them too many hard ones,” possibly to avoid generating discordant, non-reproducible results.
An additional complicating and underappreciated problem is that tumors are sometimes heterogeneous. A biopsy from one part of a tumor can test HER2-positive while a sample from a different part of the tumor tests negative. Researchers reporting such a case wrote: “We do not know the frequency with which a disparity of this degree occurs, but it is not even mentioned in reviews on this subject or consensus guidelines published previously. We therefore assume that it must be a rare phenomenon or one clearly underappreciated.”
According to Kornelia Polyak of the Dana Farber Cancer Institute, the phenomenon is not rare: “This is a pretty serious problem as we see that ~30-40% of HER2+ tumors have high heterogeneity for HER2 itself within the tumor.”
Not only do different labs sometimes disagree about assay results for a given specimen, in addition, where the tissue sample comes from in the tumor can determine whether a patient is deemed HER2-positive or HER2-negative. (Also tumor cells can interconvert between HER2-positive and HER2-negative states. See question 6.)
5. Can non-reproducible HER2 assays alter trial outcomes?
Summary: Central and local testing can produce conflicting HER2 assessments. In at least one trial, central retesting changed the outcome of the study. There are implications not only for bioethics but also for a forthcoming meta-analysis that attempts to measure Herceptin’s effects across trials, some with multiple, conflicting HER2 assessments.
The outcome of trial MA.31 depended on whether local or central HER2 determinations were used. By central testing, Herceptin extended life more than lapatinib; by local testing, there was no difference. Recall that in MA.31, local testing identified 652 patients as HER2-positive but central re-testing later found 115 (18%) weren’t HER2-positive.
This makes trial interpretation difficult or impossible, while the timing of the tests created an ethical conundrum. The central re-testing occurred during MA.31 according to Karen Gelmon, the study’s corresponding author. Regarding the unusual timing, Gelmon said: “the thought was to make it easier for the patients and doctors to use local HER2 for randomization to avoid a delay in starting treatment…” However, in speeding patients into the possible benefits of an untested treatment regimen, the design and conduct of the trial resulted in centrally HER2-negative patients being treated with anti-HER2 therapies.
Gelmon said “the central results are considered definitive.” But most patients, including those that were centrally HER2-negative, appear to have completed the study. Regarding this ethical conundrum, Gelmon said: “If the central confirmation showed negative results it was up to the treating physician to decide how to treat, which is always how it is, and they could continue the Herceptin if they thought the local testing was valid.” Gelmon did not reply when asked how many patients stopped receiving anti-HER2 treatment after being found HER2-negative by central testing. It is not clear if patients were informed of the equivocal test results. Anti-HER2 therapies are of course not approved for use in HER2-negative patients because the benefits, if any, are outweighed by toxicities and other side effects.
Although this ethical dilemma arose in a clinical trial, it conceivably impacts every breast cancer patient. It is unclear whether to believe local testing, central testing or neither. Gelmon doubled down on central testing: “yes – central or validated testing is what should be recommended.” However, the ASCO/CAP guidelines recommend only that the testing laboratory be accredited. Gelmon simultaneously regards central testing as definitive but supports optionally ignoring it. Edith Perez, prompted by FISH-negative cases who later turned out HER2-positive by IHC, recommended that “in the case of negative results, it’s advisable to repeat the test you started with or to run a different test,” perhaps making it sound like testing should be continued until the result is positive. Perez led the NCCTG N9831 study.
It is unclear that a coherent testing algorithm is obtainable from the recommendations and practices coming from these clinical trials.
The Clinical Trials Service Unit (CTSU) at Oxford is conducting a meta-analysis of Herceptin in early breast cancer. But how will the study define “HER2-positive?” For trials with two sets of assessments, the meta-analysis will have to choose between them (and ignore one set) or present results for both local and central testing. However, not all trials re-tested HER2 status, further complicating the aggregation of Herceptin’s effects across trials.
In addition, the type of assay used might be important and worth reporting. “Scientists who actually do these assays (rather than see the reports of the results) know that neither of these assays (FISH or IHC) are particularly reliable on clinical samples,” said Bert Vogelstein. Given the actual complexity and uncertainties in HER2 assessments, CTSU could (and should) examine their accuracy, computing the likelihood of an assessment being correct and/or linking assessment accuracy estimates to the confidence interval around Herceptin’s clinical benefit. Since there is no way to identify “true” HER2-positives, perhaps the best that can be done is to calculate the likelihood of concordance or discordance if a sample were subjected to re-testing.
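One way to make the re-testing likelihood concrete is to estimate, from a trial that tested patients twice, the probability that a locally positive result is not confirmed centrally, together with an interval around that estimate. The sketch below does this for the MA.31 counts quoted earlier. The Wilson score interval is chosen here for illustration and is not described as CTSU's method, and treating the MA.31 patients as a simple random sample of locally positive patients is an assumption.

```python
import math


def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = events / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# MA.31 figures quoted in the text: 652 locally HER2-positive patients,
# 115 of whom were not HER2-positive on central re-testing.
LOCALLY_POSITIVE = 652
NOT_CONFIRMED_CENTRALLY = 115

rate = NOT_CONFIRMED_CENTRALLY / LOCALLY_POSITIVE
low, high = wilson_interval(NOT_CONFIRMED_CENTRALLY, LOCALLY_POSITIVE)
print(f"Discordance on re-test: {rate:.1%} (interval {low:.1%} to {high:.1%})")
```

The same function could be applied to any trial that reports both local and central results. It measures only how often two assessments disagree, not which of them is correct.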
I asked CTSU’s Richard Gray: “Will your study use the initial or retested results for those trials? How will your meta-analysis deal with the mixture of protocols for determining HER2 status across trials?” Gray replied indirectly: “One prime aim of the meta-analysis will be to investigate whether there is benefit in HER2 receptor equivocal patients, and we’ll collect results of all available local and central assays to look at this.”
However, arguably the real problem is how to perform a meta-analysis of randomized clinical trials where a main variable, HER2, was not controlled. Kornelia Polyak believes Herceptin works, but observed: “If you pick variable patients by definition you will have variable responses leading to confusion.”
6. Is HER2 a valid biomarker that predicts benefit from Herceptin in breast cancer?
Summary: The published literature no longer supports the validity of HER2 as a biomarker for Herceptin in breast cancer. Both US trials leading to FDA approval of Herceptin in early breast cancer later found HER2 did not predict benefit from Herceptin. In small subgroups, HER2-negative patients appeared to benefit more than HER2-positive patients, and Herceptin is now being tested in HER2-negative patients. Alternative biomarkers for Herceptin have been proposed but none accepted.
The FDA approved Herceptin in 2006 for early breast cancer based mainly on two US trials: NCCTG N9831 and NSABP B-31. Both trials later announced some patients had been misdiagnosed as HER2-positive, making it possible to examine clinical outcomes of patients negative for HER2 by central testing who had been treated with Herceptin.
In 2007, one year after helping win FDA approval of Herceptin, B-31 trialists reported neither FISH nor IHC predicted response to Herceptin: “No statistical interaction was found between DFS benefit from trastuzumab and levels of protein (p=0.26) or HER2 gene copy number (p=0.60).” However, although the B-31 trialists wrote of “no statistical interaction,” it appears that HER2-negative patients benefited more than HER2-positive patients. The subgroups were small but nonetheless the researchers reported significant values for each of them, turning the HER2 world on its head.
[Figure not reproduced; surviving label: FISH-, IHC- (0-2+)]
HER2-negative patients benefited more from Herceptin than HER2-positive patients in NSABP B-31. (Adapted from Paik et al., 2007)
In 2013, B-31 trialists sought a new biomarker, reiterating that “HER2 itself failed to show predictive interaction…”
The second trial key to FDA approval, NCCTG N9831, corroborated B-31’s finding that FISH did not predict response to Herceptin. A 2010 re-analysis of N9831, again by the original investigators, found “Trastuzumab benefit seemed independent of HER2/centromere 17 ratio and chromosome 17 copy number,” i.e. independent of FISH.
The two trials which had established HER2 by IHC and/or FISH as the biomarker for Herceptin subsequently disestablished both. On this basis alone, HER2 would seem to no longer be a valid biomarker for predicting response to Herceptin. Logically, HER2 status no longer stands as a valid indicator for treatment with Herceptin. We have known this since 2010.
FISH and IHC might have found redemption in Hera, the trial that led to approval of Herceptin in the adjuvant setting in Europe and also supported FDA approval in the US. Post-approval, Hera trialists examined the relationship between degree of HER2 amplification by FISH and benefit from Herceptin. But they chose not to examine 41 patients with a FISH ratio under two, i.e. HER2-negative by FISH. “We deemed it inappropriate to analyze this small group,” wrote the investigators. Consequently, they could not say how FISH-negative patients responded to Herceptin and whether or not IHC by itself predicted response.
Without looking at the FISH-negative group, researchers continued to posit “a strong threshold effect whereby any degree of amplification above the cutoff ratio of 2.0 is of equal clinical significance.” However, B-31 and N9831 had previously found no threshold among a combined 330 FISH-negative patients. The Hera team simply looked away.
The Hera trialists also examined IHC staining intensity and clinical outcomes. This time, IHC-negative patients were not included, which prevented analysis of whether FISH predicted response to Herceptin. Hera investigator Mitch Dowsett explained, in June 2014: “Because of our policy on recruiting only centrally confirmed HER2-positive cases to Hera we were not in a position to do this.” However, apprised that Hera enrolled at least 299 centrally confirmed, HER2-positive patients who were IHC-negative, Dowsett revised his explanation. “I think ‘policy’ is overstating things. We could and maybe should have looked at this group in more detail previously.” But, “prompted by a UK pathologist,” rather than the failure of IHC to predict response in B-31 and N9831, Dowsett said the Hera trialists would examine the IHC-negative, FISH-positive subgroup.
In August 2015, more than a year later, I asked Dowsett how the project was going. “The work was conducted and a manuscript created,” he replied. But then the primary investigator, Bharat Jasani, “left for [a] job in Kazakhstan,” said Dowsett, stalling the investigation. I emailed Jasani and asked: “Were you examining IHC-negative, FISH-positive cases from Hera before leaving for Kazakhstan?” Jasani seemed to contradict Dowsett: “The simple answer is no and I would like to confirm once again that I have not examined at any time any IHC-negative, FISH-positive cases from Hera.”
Analyses of key subgroups in the Hera trial appear to have been avoided. As it stands, every re-test of any assay used to assess HER2 in the FDA approval-winning trials in early breast cancer found that that assay did not predict benefit from Herceptin, or that being HER2-negative predicted greater benefit.
Like the Hera trialists, HER2 testing experts avoided addressing HER2’s validity as a biomarker when I raised the issue with them in 2014. I emailed John Bartlett, at the Ontario Institute for Cancer Research, asking: “what established HER2 as a biomarker and what data informed the cutoff point for positive vs. negative?” Bartlett previously co-authored HER2 testing guidelines. His assistant replied: “John says he should be able to answer it via email.” However, Bartlett eventually wanted to speak by phone. When I requested email, the assistant wrote back: “Unfortunately Dr. Bartlett is unable to answer this question.”
The lead author of HER2 testing guidelines, Antonio Wolff, wrote me that FISH “Absolutely yes” predicts response to Herceptin even though N9831 and B-31 showed it did not. “I do fear that the dots you are connecting don’t quite tell a story,” Wolff said. Rather than explain, he wrote: “I think I will stop here.” He requested to speak by telephone but would not allow recording it: “Recording our conversation will not be ok and you do not have my permission.” He added: “My goal was to walk you through your questions informally as an expert source.” Arguably, Wolff declined to go on record to explain why FISH remained valid.
Reliably identifying HER2-positive patients might be impossible. According to Daniel Haber, at Massachusetts General Hospital, “Whether there are ‘true HER2’ tumors or not is up for discussion.” Instead of HER2 status predicting response to Herceptin, response to Herceptin determines who is HER2-positive. Said Haber: “the real definition is probably whether they [patients] respond to HER2 therapy or not…” But if so, the definition is circular: HER2 is not a valid biomarker for Herceptin, Herceptin has no valid biomarker, and prescribing Herceptin for HER2-positive patients makes no medical sense.
7. Does Herceptin’s mechanism of action depend on HER2?
Summary: Recent research suggests Herceptin does not block HER2 signaling, once considered its mechanism of action. No new mechanism of action has been clearly established. Some researchers believe Herceptin might work in HER2 0 patients, i.e. independently of HER2 status.
Herceptin does not block HER2 signaling
“[T]he talking points, the posters, the advertisements, are all about ‘HER2 blockade.’ It makes a good story, much simpler to understand, very pretty pictures, and nicely amenable to commercialization. Unfortunately it's not true.”
So wrote Mark Moasser, at the University of California at San Francisco, in email. More formally, in a published paper, Moasser wrote that Herceptin “was developed on the basis of 1980s understanding of HER2, and it is now clear that it does not actually inhibit HER2 signaling functions very well.” Tyrosine kinase inhibitors like lapatinib do block HER2 signaling but the clinical benefits of lapatinib are scant. Dual anti-HER2 therapy in which lapatinib is added to Herceptin showed no survival benefit in either the adjuvant or neoadjuvant settings as tested in the ALTTO and NeoALTTO trials. A lapatinib-only arm in ALTTO was closed early due to futility.
Additionally, there appears to be no consensus whether degree of HER2 positivity increases response to Herceptin, with greater amplification or overexpression leading to more pronounced clinical benefit. Also unexplained is how Herceptin might work in patients positive for HER2 by amplification but negative for overexpression. Krop and Burstein further fragment the HER2 edifice: in wondering “qui bono” or who benefits from Herceptin, they posit that “the mechanisms may differ in early- and late-stage breast cancer.”
Also, tumor cells appear to convert back and forth between HER2-positive and HER2-negative. Thus HER2 expression “identifies dynamic functional states,” according to Jordan et al. Interconversion may make tumors heterogeneous for any HER2 signal and might partly explain difficulties linking HER2 expression to any tumor phenotype.
HER2 may have nothing to do with Herceptin’s mechanism of action: “We don’t know that trastuzumab would not work in the adjuvant setting for HER2 0 patients,” according to Lou Fehrenbacher. Fehrenbacher is leading a trial, NSABP B-47, which tests Herceptin in HER2-negative patients. The trialists considered enrolling HER2 0 patients, but according to Fehrenbacher, this was deemed “too adventurous.” It might have undermined the entire HER2/Herceptin story. Instead of asking “do Herceptin’s effects have anything to do with HER2,” B-47 asks “Should HER2 low patients also be treated with Herceptin?” That is a stepwise distancing from current orthodoxy rather than a quick, complete abandonment. B-47, which only includes HER2 1+ or 2+ patients, might also result in a large increase in patients treated with Herceptin. According to Fehrenbacher:
“[T]he number of women with 1+ and 2+ non HER2-positive tumors in the US, is 4x the number with HER2-positive. So if the trial is successful the number of women benefiting from trastuzumab will rise to a level 500% of the current number.”
By contrast, a trial design including HER2 0 patients might have shown no relationship between Herceptin and HER2 or perhaps an inverse relationship like the re-analysis of B-31.
B-47 represents an opportunity to test both whether Herceptin works in HER2 0 patients and whether the degree of HER2 positivity predicts greater benefit from Herceptin. B-47 should add an arm of HER2 0 patients allocated to Herceptin or placebo. In addition, a partial arm of 3+ patients should be added, all receiving Herceptin, to allow comparison of the drug’s effect across the range of HER2 positivity, from 0 to 3+.
There is no agreed upon alternative mechanism of action for Herceptin. A leading but unproven candidate is antibody dependent cellular cytotoxicity (ADCC). The current FDA label says “Herceptin is a mediator of antibody-dependent cellular cytotoxicity,” but only based on in vitro evidence. For clinical evidence, Mark Moasser pointed to “A recent landmark study… that showed response/resistance to trastuzumab is powerfully predicted by the immunological signature.” This re-examination of NCCTG N9831 found “that immune function genes are strongly linked to clinical outcome.” The authors proposed a complicated signature comprised “of any nine or more of 14 immune function genes at or above the 0.40 quantile for the population.”
But a critique by NSABP B-31 trialists found that randomly selecting any 14 genes at any expression level resulted in an interaction probability of less than 0.01 in 92% of 10,000 runs conducted using data from 731 patients in B-31. Consequently, “the conclusion that immune-related genes are driving the observation may not be valid because this criterion can be eliminated without effect on model performance.”
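The logic of that critique can be sketched in code. The following is a minimal illustration, not the trialists' actual method: it implements the published signature rule (nine or more of 14 genes at or above the 0.40 quantile) and a generator of random 14-gene sets to serve as a control. The function names, the data layout (one list of expression values per patient) and the seed are assumptions made for illustration; the treatment-by-signature interaction test that would be run on each set is deliberately left out.

```python
import random
import statistics

def signature_positive(expr, genes, min_hits=9):
    """Return one bool per patient.

    expr  : list of rows, one per patient, each a list of expression values
    genes : column indices of the genes in the candidate signature
    A patient is signature-positive when at least `min_hits` of the chosen
    genes sit at or above that gene's population 0.40 quantile.
    """
    thresholds = {}
    for g in genes:
        column = [row[g] for row in expr]
        # quantiles(n=5) returns the 0.2, 0.4, 0.6 and 0.8 cut points
        thresholds[g] = statistics.quantiles(column, n=5, method="inclusive")[1]
    return [
        sum(row[g] >= thresholds[g] for g in genes) >= min_hits
        for row in expr
    ]

def random_gene_sets(n_genes_total, set_size=14, n_runs=10_000, seed=0):
    """Yield `n_runs` random gene sets of `set_size` distinct column indices."""
    rng = random.Random(seed)
    for _ in range(n_runs):
        yield rng.sample(range(n_genes_total), set_size)
```

Each random set would be passed to `signature_positive` and the resulting grouping fed into whatever interaction test is in use. If most random sets perform about as well as the immune genes, the immune genes themselves are not what drives the result.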
ADCC remains a hypothesis. “As far as I’m concerned, the jury is still out whether Herceptin works by ADCC, through other indirect mechanisms, or through interrupting some signaling pathway,” according to Bert Vogelstein. “If it does involve ADCC, it would have to discriminate between low and high amounts of cell surface ERBB2 protein.” It is “not so obvious” how Herceptin might do this given the widely varying levels of ERBB2 protein on cancer cells even in HER2-positive tumors.
In 2010, Edith Perez recommended against Herceptin for HER2-negative patients. One reason: “It doesn’t make any biological sense,” according to Perez. If Herceptin’s mechanism of action is independent of HER2, seemingly it would not make biological sense to recommend Herceptin for HER2-positive cases.
Ultimately, however, Mark Moasser is not worried by there no longer being an agreed upon mechanism of action for Herceptin: “At the end of the day, it doesn’t really matter what the mechanism is, as long as it works.”
8. Would the FDA approve Herceptin today?
Summary: Herceptin’s approval in the metastatic setting benefited from a new FDA fast track. The single phase III trial providing the basis for approval underwent extensive mid-trial modifications—adding different treatment arms and unlike patients—practices that are no longer permitted. Avastin later faced a different FDA process which led to revocation of its approval.
In early breast cancer, had Herceptin’s FDA application been based on the re-analyses of NCCTG N9831 and NSABP B-31, it presumably would have been rejected. The initially impressive results presented to the FDA likely depended on heavy modifications to the trials, including merging N9831 and B-31 while dropping one arm that showed no survival benefit for Herceptin. It is unlikely such changes would be allowed today. The trials were enabled and shaped by new NCI policies that allowed cooperative groups like NCCTG and NSABP to conduct phase III trials in support of FDA approval while permitting those groups to accept funding from pharmaceutical companies.
Herceptin in metastatic breast cancer
Genentech’s Herceptin first won FDA approval in 1998 for metastatic breast cancer. With the process taking just five months, Herceptin benefitted from being the second drug to come off a new FDA fast track. Public perception at the time was that an approval logjam was blocking life-saving cancer drugs from reaching patients. But going faster required relaxing standards. On the fast track, “potential” effectiveness was to be considered with the standard only that “potential effectiveness of the treatment should outweigh its toxicities.” These educated guesses would not necessarily have to be checked later: “A post-approval study will not necessarily be required in the exact population for which the approval was granted.”
Trial H0648g provided the basis of Herceptin’s FDA application. But according to the FDA review, “multiple major changes in the protocol were enacted during the conduct of the study.” The biggest mid-course change added entirely new arms to the trial after enrollment of only about 100 patients. The original design tested Adriamycin and cyclophosphamide (AC) against AC + Herceptin (H). The new arms tested a taxane (T) against T + H.
The new arms were then pooled with the original arms—to the chagrin of the FDA. It considered patients in the AC and T arms as “clinically distinct.” The taxane cohort represented a “different prognostic group” than the AC patients and “baseline characteristics differed markedly between paclitaxel and AC patients regardless of assignment to Herceptin therapy or not.” However, the FDA acquiesced on pooling.
Remarkably, as arms were added, the double-blind with placebo design was dropped and the trial became open label. “Patients and investigators object to the placebo,” said the FDA, again accepting a fait accompli.
The trial found adding Herceptin to AC made no difference in overall survival. Similarly, in the new taxane arms, adding Herceptin did not increase survival. But the pooling of the AC and taxane arms, which the FDA had frowned upon, produced a statistically significant overall survival benefit, albeit with a confidence interval touching 1.0. Absent the large, mid-course alterations to the trial, Herceptin would have shown no survival benefit.
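How two non-significant comparisons can pool into a significant one is easy to show with arithmetic. The sketch below is a stand-in only: it uses fixed-effect, inverse-variance pooling of log hazard ratios, whereas H0648g pooled patient-level data. The hazard ratios and standard errors are hypothetical and are not the trial's figures.

```python
import math
from statistics import NormalDist

def two_sided_p(log_hr, se):
    """Two-sided p-value for a log hazard ratio under a normal approximation."""
    z = log_hr / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def pool_fixed_effect(estimates):
    """Inverse-variance pooling of (log_hr, se) pairs."""
    weights = [1 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

# Hypothetical cohorts: same apparent effect, each too small to reach p < 0.05
ac_cohort = (math.log(0.80), 0.13)
taxane_cohort = (math.log(0.80), 0.13)

p_ac = two_sided_p(*ac_cohort)
p_taxane = two_sided_p(*taxane_cohort)
p_pooled = two_sided_p(*pool_fixed_effect([ac_cohort, taxane_cohort]))
```

With these made-up inputs, neither cohort alone crosses the conventional 0.05 threshold, while the pooled estimate does. Pooling is only legitimate if the cohorts are comparable, which is exactly what the FDA disputed.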
Avastin, also from Roche/Genentech, lost its FDA approval for treating breast cancer after post-approval trials failed to demonstrate a survival benefit. Genentech proposed Avastin for treatment of metastatic HER2-negative breast cancer. As with Herceptin earlier, an accelerated FDA application for Avastin relied on an open label trial, E2100. Although the FDA initially approved Avastin, the review scolded the drug sponsor: “Genentech did not meet with FDA to reach agreement on the design of Study E2100 prior to study initiation.” The FDA found a host of problems with trial E2100 including the open label design and loss of patients to follow-up:
“[T]he effect on PFS by an independent group, masked to treatment assignment, was not implemented during the conduct of the trial. Retrospective analyses by an endpoint review team masked to treatment assignment to independently confirm the E2100 results was marred by substantial loss to follow-up prior to the independent review team’s confirmation of disease progression.”
In addition, the lack of independent review led to investigator bias in favor of Avastin. According to the FDA, “the discordance rates are slightly different for the two study arms, with the difference favoring the PAC/Bev [Paclitaxel/Avastin] arm over the PAC arm in ECOG investigator-determined assessment of PFS.” The FDA also looked at missing data and found that a worst case analysis resulted in “elimination of the treatment effect altogether.”
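A worst case analysis of this kind can be sketched in a few lines. The version below is a simplification that treats the endpoint as a yes/no event rather than time to progression, and every count in it is hypothetical. It is not the FDA's calculation. Patients with missing outcomes are assigned the result least favorable to the drug: missing treated patients are counted as events and missing controls as event-free.

```python
def rate_difference(events_ctrl, n_ctrl, events_trt, n_trt):
    """Control event rate minus treatment event rate (positive favors treatment)."""
    return events_ctrl / n_ctrl - events_trt / n_trt

def worst_case_difference(events_ctrl, n_ctrl, missing_ctrl,
                          events_trt, n_trt, missing_trt):
    """Recompute the difference with missing outcomes imputed against the drug.

    n_ctrl and n_trt count only patients with a known outcome.
    """
    return rate_difference(
        events_ctrl, n_ctrl + missing_ctrl,             # missing controls: no event
        events_trt + missing_trt, n_trt + missing_trt,  # missing treated: all events
    )

# Hypothetical trial: 180 patients with known outcome and 20 missing per arm
observed = rate_difference(60, 180, 45, 180)
worst = worst_case_difference(60, 180, 20, 45, 180, 20)
```

In this invented example the observed difference favors treatment, but the worst case imputation reverses it. The analysis asks whether a reported effect is larger than the uncertainty created by the missing patients.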
The FDA examined financial ties to the sponsor and found that five of the sixteen members of the data monitoring committee received payments greater than $25,000 from Genentech. A sixth reported compensation that “could be affected by the study outcome.” In addition, “Eight out of 26 investigators (30%) who provided financial disclosure in the E2100 study administration body and data monitoring committee reported financial conflict of interest for receiving payment from pharmaceutical companies.” One of the study co-chairs “failed to reply to the Financial Disclosure requests.”
For Herceptin, by contrast, the FDA did not examine financial ties of trial investigators. But when the study was published in the New England Journal of Medicine, nine of the 12 authors reported relationships with Genentech. The FDA allowed arms to be added mid-trial for Herceptin, whereas for Avastin, simply starting a trial without prior agreement from the FDA drew censure.
Subsequent testing of Avastin in a double-blind, placebo-controlled design required by the FDA found no overall survival benefit, and the FDA revoked its approval of Avastin for breast cancer. Herceptin, by contrast, showed a survival benefit in the metastatic setting only after arms the FDA regarded as distinct were pooled in an open label design.
Herceptin’s approval hurdles were lower; it might not have met later, higher standards.
In early breast cancer, the FDA approved Herceptin in 2006 based on a joint analysis of NCCTG N9831 and NSABP B-31. But by 2007, B-31 trialists reported that HER2 didn’t predict response to Herceptin, whether measured by IHC or FISH: “No statistical interaction was found between DFS benefit from trastuzumab and levels of protein (p=0.26) or HER2 gene copy number (p=0.60).” And although the authors reported no statistical interaction, HER2-negative patients appeared to benefit more than HER2-positive cases. Corroborating B-31’s results, N9831 trialists reported in 2010 that FISH did not predict response to Herceptin: “Trastuzumab benefit seemed independent of HER2/centromere 17 ratio and chromosome 17 copy number…”
Had these results been presented to the FDA when considering the application for Herceptin in early breast cancer, the application presumably would have been rejected.
What the FDA saw in the Herceptin application was a single successful trial, which was actually made from two studies merged together, with one arm discarded. The unplanned changes were made while the trials were in progress.
In N9831, Arm B tested sequential Herceptin in roughly one thousand women and ultimately showed no overall survival benefit from Herceptin: five-year survival for arm B was 89.3% versus 88.4% in the control arm. Arm B was dropped when N9831 was joined to NSABP B-31.
Although the FDA went along with merging the trials, the oncology community was divided. In 2006, one specialist noted that “In terms of combining the data from the two trials, some oncologists were initially questioning whether that was legitimate.” Sandra Swain, who was at NCI when the trials were joined, answered it was “clearly legitimate.” She asserted that the trials were combined because they were going well: “No one had any idea that we’d have the benefit that we do.” However, joining trials increases statistical power, enabling detection of weaker effects while, obviously, dropping a low or non-performing group of patients might have enhanced the perceived effects of Herceptin in the remaining arms.
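The point about statistical power can be made concrete with Schoenfeld's standard approximation for a log-rank test with 1:1 allocation, in which power depends on the number of events and the size of the hazard ratio. The event counts and hazard ratio below are hypothetical and are not taken from N9831 or B-31.

```python
import math
from statistics import NormalDist

def logrank_power(n_events, hazard_ratio, alpha=0.05):
    """Approximate power of a two-sided log-rank test (Schoenfeld, 1:1 allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    signal = math.sqrt(n_events) * abs(math.log(hazard_ratio)) / 2
    return NormalDist().cdf(signal - z_alpha)

# Hypothetical: one trial with 200 events versus two merged trials with 400
power_single = logrank_power(200, 0.80)
power_merged = logrank_power(400, 0.80)
```

Under these assumptions, a modest effect that a single trial would probably miss becomes more likely than not to reach significance once the events are doubled. Merging therefore helps most when the underlying effect is weak.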
An FDA spokesperson offered conflicting answers in 2014 regarding whether the individual trials would have met their endpoints, initially saying: “The FDA cannot speculate on if the trials would or would not have met their original endpoints.” But subsequently the spokesperson speculated that the trials would have been “likely to demonstrate efficacy as individual trials…”
No stand-alone efficacy results for B-31 have been published. A number of researchers suggested in a letter, “Trastuzumab: possible publication bias,” published by the Lancet in 2008, that the results of the individual trials should be published separately. Asked in 2014 for efficacy data, NSABP’s Soon Paik declined, saying only that “they are essentially the same as what is in the combined analysis.”
A meta-analysis of Herceptin being conducted by the Clinical Trials Services Unit (CTSU) at Oxford will include all N9831 patients, including Arm B. According to CTSU’s Richard Gray: “We will analyse the combined concurrent and sequential trastuzumab arms versus no trastuzumab from the 3-way randomisation periods of N9831,” as well as the concurrent and sequential arms separately.
The N9831 and B-31 trials were conducted by cooperative groups, the North Central Cancer Treatment Group (NCCTG) and the National Surgical Adjuvant Breast and Bowel Project (NSABP) respectively. Originally, cooperative groups were funded by NCI. However, in 2000, NCI allowed cooperative groups to accept industry funding. Two years earlier, the FDA had said it would accept trials performed by cooperative groups to support applications for FDA approval. Previously, cooperative groups mostly conducted phase II trials. Arguably, these decisions transformed a public and publicly funded system into one dominated by pharmaceutical companies. B-31 and N9831 were started around the time of the new NCI and FDA policies, in July 1999 and April 2000.
In 2002, the FDA sought to tighten a number of clinical trials policies, but they were opposed by the cooperative groups. NSABP, joined by NCCTG, challenged the FDA reforms in a letter signed by John Bryant, the statistician for the joint N9831/B-31 trial. The FDA had sought to treat the cooperative groups as a sponsor, perhaps because they had begun receiving industry funding. Also, the FDA wanted more blinding of study teams and greater independence of statisticians preparing reports. But the NSABP letter answered that “it will not be practical to arrange for statisticians independent of the Cooperative Groups to prepare and present interim reports…” There were too many trials and “simply not enough qualified personnel available to do so.” The cooperative groups claimed these and other proposed changes would have a “substantial negative impact” on clinical trials including even patient safety.
Concern about industry funding of previously trustworthy cooperative groups surfaced at a 2009 NCI workshop on “Multi-Center Phase III Clinical Trials and NCI Cooperative Groups.” As one participant said:
“If we do not have a robust independent review of these trials, the criticism will be raised quite quickly that these trials are being done by industry and that public dollars should not pay for them. What will protect these trials is that they have a very robust independent review, not just a cooperative group–only review.”
The FDA audited none of the US Herceptin trials. According to the FDA medical review of the joint N9831/B-31 trial, “A DSI [Division of Scientific Investigations] inspection was not performed for this application; given the large number of sites and small percentage of patients enrolled at any individual site, no single study or limited number of sites would have substantial impact on the study results.” If multiple sites and widely distributed patients protect against improprieties, then perhaps no phase III trial would ever need to be audited.
The FDA was unable to confirm that the cooperative groups audited their Herceptin trials: “Because of the nature of the conduct and reporting of the clinical site audits, it cannot be determined whether a specific study was audited during the clinical site inspection…” A statement by the sponsor about audits provided “no information on the actual results of site audits,” according to the FDA review of Herceptin.
(I suggested to Richard Gray that the CTSU Herceptin meta-analysis could attempt to reproduce the results of the individual studies as one kind of check on the un-audited trials.)
The possibility of investigator bias was not examined. The FDA “did not request confirmation of the events by an independent endpoint assessment panel that was masked to treatment assignment.” The FDA reported “approximately 4% of the population in the ITT efficacy dataset had missing information with respect to surgical type, nodal status, hormone receptor status, tumor size, histological grade and histologic type.” However, the FDA did not examine whether the gaps could have influenced trial endpoints, whereas in the case of Avastin, a worst case analysis found that missing data eliminated the reported treatment effect.
In 2014, the FDA modified the Herceptin label to state for the first time that the drug increases overall survival in early breast cancer. However, the benefit was found in an “efficacy evaluable” population rather than the gold-standard intention-to-treat (ITT) population. In an April 2014 conference call, the FDA asserted that the ITT and efficacy evaluable populations were identical and that the sponsor, Roche/Genentech, requested that the label read “efficacy evaluable.” Why a pharmaceutical company would request a lower grade of evidence for the lifesaving benefits of its drug is not clear.
Also, in the joint trial, disease free survival falls while overall survival climbs. Perhaps only Provenge demonstrates a similar pattern among cancer drugs. Provenge does not enjoy the same reputation for efficacy as Herceptin.
“It is what [it] is,” N9831 statistician Vera Suman wrote in email.
Joint N9831/B-31 trial results over time (chart series include HR: disease event). (Source: Vera Suman personal communication, 18 October 2013)
Also, in the final report on the joint study, years of median follow-up took an unusually large, 4.5-year leap in the space of approximately one calendar year. Rebecca Gelman, statistician at the Dana Farber Cancer Institute, brought this to my attention in 2013:
“As a side comment, this all leads me to wonder about the ‘8.4 years of follow-up’ in the 2012 SABC abstract, since it is so much longer than the 2011 JCO paper. Either someone did a big update of survival in 2012 (by calling all the patients or by checking the National Death Index), or else the SABC abstract was reporting OS at a time past the median survival).”
Another statistician described the leap in follow-up as “impossible,” saying that median follow-up usually goes up about one year for every calendar year. The FDA said the difference might be explained by the data lock dates for the two papers. However, the agency did not provide dates that would allow its explanation to be verified.
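The statistician's rule of thumb follows from how follow-up is counted. The sketch below makes the simplest possible assumptions: enrollment is closed, follow-up is measured from enrollment to the data lock, and deaths and dropouts are ignored. The enrollment dates and lock dates are hypothetical, and trials often use other definitions of median follow-up.

```python
from datetime import date
from statistics import median

def median_followup_years(enrollment_dates, data_lock):
    """Median potential follow-up in years, from enrollment to the data lock."""
    return median([(data_lock - d).days / 365.25 for d in enrollment_dates])

# Hypothetical cohort enrolled between 2001 and 2005
enrolled = [date(year, 1, 1) for year in range(2001, 2006)]

followup_first_lock = median_followup_years(enrolled, date(2010, 1, 1))
followup_second_lock = median_followup_years(enrolled, date(2011, 1, 1))
gain = followup_second_lock - followup_first_lock
```

Moving the data lock forward by one calendar year moves every patient's potential follow-up, and hence the median, forward by about one year. A jump of several years in one calendar year would require lock dates several years apart or a change in how follow-up was ascertained.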
9. Can trial estimates of survival increases be squared with population-level survival figures?
Summary: Some medical researchers have suggested that therapies containing Herceptin may cure breast cancer. A Genentech-funded study estimated Herceptin saved 156,413 total life years in the United States from 1999 to 2013 for metastatic breast cancer alone. However, NCI reports only a 1.1% increase in five-year survival over a similar period. Estimates of Herceptin’s life-extending benefits should be compared to population-level figures.
HER2 prevalence at the population level is only 15%, according to NCI, well below early estimates of 25-30%.
Impact of Herceptin on five-year survival at the population level
At the 2012 SABC, presenting joint N9831/B-31 results, co-primary investigator Edith Perez advanced the idea that Herceptin cures breast cancer: “We believe that the data support the concept that many patients who present with HER2-positive breast cancer may be cured with combination strategies.” Herceptin had come a long way. Dennis Slamon, a main progenitor of Herceptin, originally believed that, by itself, Herceptin was only cytostatic, halting tumor growth which “resumed on termination of antibody therapy, indicating a cytostatic effect.”
According to a Genentech-funded study, Herceptin has saved 156,413 total life years in the United States from 1999 to 2013 for metastatic breast cancer alone. But it is unclear if population level statistics corroborate Herceptin’s curative powers. It is not known whether five-year survival has increased as much as it would need to in order to match the Genentech-funded estimate of years of life added. “We did not try to triangulate our results to the overall population,” said corresponding author of the study, Mark Danese.
In the overall population, according to NCI’s Jenny Haliski, “we are seeing a small increase in survival since 1998,” the year of Herceptin’s first FDA approval. Haliski is NCI Media Branch Chief. “Part of this increase can be attributed to improvements in treatment,” said Haliski. However, clinical trials are conducted in “ideal situations and usually include younger patients without comorbidity,” according to Haliski. “Thus, treatment efficacy in a clinical trial is usually higher than treatment effectiveness at the population level.”
It ought to be possible and instructive to decompose the 1.1% increase in five-year survival from 1999 to 2012 to determine the contribution from Herceptin. As I have suggested to Richard Gray, CTSU’s meta-study could and perhaps should try to square its estimate of Herceptin’s benefits with population-based survival figures.
Herceptin won FDA approval for metastatic breast cancer in 1998 and early breast cancer in 2006. (Chart source: National Cancer Institute, SEER Cancer Statistics Review 1975-2013, Table 4.13, all ages, all races)
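A first pass at such a decomposition needs only three quantities. The sketch below is back-of-envelope arithmetic under loudly stated assumptions. It reads the 1.1% increase as 1.1 percentage points of five-year survival, and it uses NCI's 15% HER2 prevalence. The share of HER2-positive patients actually treated (`uptake`) is a hypothetical parameter, and improvements from screening and other therapies are ignored.

```python
def population_gain_points(prevalence, uptake, subgroup_gain_points):
    """Population-level gain in five-year survival, in percentage points."""
    return prevalence * uptake * subgroup_gain_points

def required_subgroup_gain(population_gain, prevalence, uptake=1.0):
    """Gain needed among treated HER2-positive patients to explain the whole increase."""
    return population_gain / (prevalence * uptake)

# Figures from the text: 1.1-point population increase, 15% HER2 prevalence
needed_full_uptake = required_subgroup_gain(1.1, 0.15)
needed_half_uptake = required_subgroup_gain(1.1, 0.15, uptake=0.5)
```

If every HER2-positive patient were treated, Herceptin would need to add roughly 7.3 points of five-year survival in that subgroup to account for the entire population-level increase, and about twice that at 50% uptake. Comparing such figures with trial results is the kind of triangulation the Genentech-funded study did not attempt.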
10. Can the Cleopatra and Marianne trials be reconciled?
Summary: The Cleopatra trial, which added pertuzumab to Herceptin and a taxane, produced the largest survival increase of any clinical trial of Herceptin, nearly 16 months. But the Marianne trial seems to contradict Cleopatra. Marianne tested a version of Herceptin, T-DM1. Adding pertuzumab provided no more clinical benefit than T-DM1 alone. The phase II NeoSphere trial of pertuzumab and Herceptin also did not produce the remarkable results of Cleopatra.
Adding pertuzumab to Herceptin and a taxane in the Cleopatra trial yielded a remarkably large increase in survival, nearly 16 months longer than the standard of care, Herceptin + taxane. However, the Marianne trial seems to contradict Cleopatra: an arm testing pertuzumab with the Herceptin-based T-DM1 did no better than Herceptin + taxane. As a notice on the ASCO website said: “the addition of pertuzumab to T-DM1 provided no efficacy benefit.” T-DM1 conjugates the cytotoxic emtansine to the Herceptin antibody.
Similarly, in the neoadjuvant setting, adding pertuzumab to Herceptin showed no benefit in the NeoSphere trial which reported that “progression-free survival and disease-free survival at 5-year follow-up show large and overlapping CIs.” Pertuzumab by itself showed very little single agent activity in a phase II trial, so the benefit of combination with Herceptin is presumably synergistic. Why would it not also be synergistic with T-DM1?
Paul Ellis, who led the Marianne trial, pointed to a “number of possibilities and probably a mix of a number of issues” that explained why pertuzumab showed no benefit. In Cleopatra, said Ellis, “patients have Taxol/ Taxotere as a backup” if they do not respond to Herceptin. However, T-DM1 by itself performed just as well as Herceptin plus a taxane. No “backup” needed, and the question is why including pertuzumab added nothing in Marianne.
Ellis also observed that the “Herceptin dose per week [was] higher than T-DM1.” Yet the dose of T-DM1 was apparently high enough to perform as well as H + T. And there does not appear to be support for another trial with a different dose. Said Ellis, T-DM1 “will now never see the light of day” in early breast cancer.
Also figuring in Ellis’ possibilities were “slightly different patient populations.” However, the differences would need to be extreme rather than slight: no response at all to pertuzumab in Marianne and incredible life-extending responses among Cleopatra patients.
That leaves the idea that “maybe [T-DM1] binds differently and alters configuration in a different way” than Herceptin. However, Ellis acknowledged this directly contradicted expectation: “Every senior clinician I know in his area expected Marianne to be positive!” In addition, prior to Marianne, one research group reported “T-DM1 plus pertuzumab resulted in synergistic inhibition of cell proliferation and induction of apoptotic cell death” while another found “Trastuzumab-DM1 (T-DM1) retains all the mechanisms of action of trastuzumab.” Commented Ellis: “I think this study [Marianne] has forced them to go back into the lab and try and understand it better.” According to Ellis, “even the guy at Genentech who invented both Pertuzumab and T-DM1 can’t really understand why” pertuzumab did nothing in Marianne.
In other words, it appears that conjugating emtansine to Herceptin completely cancels synergy with pertuzumab, although both drugs were designed by the same person. Alternatively, Marianne disconfirms the results of Cleopatra.
I also asked Allan Lipton about the Cleopatra-Marianne dissonance. Lipton replied: “I do not think I am the right person to answer your Cleopatra questions.” But Lipton has co-authored several papers on alternative assays for determining HER2 status and investigated HER2:HER3 dimerization and pertuzumab. I replied to Lipton: “I wonder if you aren’t the ideal person to answer such questions.” He demurred: “I don't think I have any answers for you on these observations from clinical trials.”
Although pertuzumab is frequently described as completing the blockade of HER2 and HER3, according to Mark Moasser, “pertuzumab doesn’t interfere with dimerization when HER2 is overexpressed.” HER2 overexpression has been thought of as the sine qua non of HER2-positive breast cancer. Moasser emphasized that it is “very true” that pertuzumab doesn’t block HER2 signaling when HER2 is overexpressed. Instead, “trastuzumab and pertuzumab work through immunologic mechanisms in HER2-positive cancers, and two antibodies provides double the tumor cell coverage and better immunologic targeting by the immune system.” He added: “This is not universally accepted by everyone but at this point the data is pretty clear to me and many others.”
Moasser attributes the disappointing performance of pertuzumab + T-DM1 in Marianne to the absence of a taxane: “I would say it’s because taxol (or taxotere) is so effective, it’s not a shortcoming of T-DM1.” Paul Ellis advanced a similar argument. However, T-DM1 by itself performed as well as Herceptin and a taxane. In fact, progression-free survival with T-DM1 alone was higher, 14.1 months vs. 13.7 months, although not significantly so. But adding pertuzumab to T-DM1 did nothing.
According to Moasser:
“Chemos have a 12-hour high concentration exposure and cause a lot of tumor cell kill in a short time leading to release of many cellular antigens, etc. T-DM1 provides continuous exposure and there is incremental tumor cell killing day-by-day rather than mass killing on one day. That may be less immunogenic than the chemo method.”
However, emtansine provided enough immunologic kick for T-DM1 to equal the clinical benefits of Herceptin and a taxane. Thus Moasser’s explanation for the futility of pertuzumab seems to require that pertuzumab has different immunological prerequisites than T-DM1.
In the trial that won Herceptin its initial FDA approval, adding Herceptin to a taxane delayed disease progression by 3.9 months, while in Cleopatra, further adding pertuzumab to the regimen added nearly 16 months. This massive effect remains unexplained. Said Moasser: “I don’t claim to know all the nuances of how chemo and immunology interact with each other, and frankly nobody really does, the field is still in its infancy.” The pharmacologists, however, have somehow hit a home run with Herceptin and pertuzumab while swinging as if with eyes closed.
With EGFR inhibitors in lung cancers or BRAF inhibitors in melanomas, the mechanisms of action are clear as are the clinical results. However, said Bert Vogelstein, “we do not know how or why Herceptin works,” and the conflicting results of Cleopatra and Marianne show “that all conclusions or predictions are on thin ice,” according to Vogelstein.
The HER2 and Herceptin story used to be simple and compelling: we knew who it worked for and why. Now we don’t, despite nearly two decades of learning. The current balance of scientific evidence arguably no longer supports the idea of a HER2 subtype in breast cancer.
There is conflicting evidence on whether HER2 is even transforming and whether it drives breast cancer. The Ross et al. literature reviews supporting the prognostic role of HER2 are especially dubious. (Those papers should be corrected or retracted.) At present, the view that HER2 is prognostic is unsupported.
Although medical diagnostics have gray areas, HER2 testing appears so poorly reproducible that it perhaps should not be considered scientific. Different pre-analytic conditions, different assays, different subjective assessment criteria, tumor heterogeneity and the lack of any gold standard lead to conflicting results, which are resolved arbitrarily. That the Hera trialists evade or perhaps even dissimulate regarding investigations of particular subgroups that could help validate or further discredit FISH and IHC might point to a widening disparity between appearance and reality. The main Herceptin orthodoxy has broken down completely: Herceptin does not block HER2 signaling and its mechanism of action might have little or nothing to do with HER2.
Nonetheless, Krop and Burstein contend: “Beyond a doubt, trastuzumab works.” Yet absent questionable modifications to key trials, Herceptin might not have won FDA approval. Avastin, which lost FDA approval, also works for some breast cancer patients, but there is no biomarker to predict response. The published literature demonstrates that HER2 does not predict response to Herceptin, leaving Herceptin without a valid biomarker. To paraphrase Daniel Haber: “HER2-positive” just means “responds to Herceptin.” Even HER2-negative patients can benefit, perhaps even more than HER2-positive patients. Seemingly, either all breast cancer patients should get Herceptin or none should, the latter option representing the FDA’s decision for Avastin.
At present, the standard of care is for all breast cancer patients to be tested for HER2. The tests suffer considerable reproducibility problems. In addition, based on the re-analysis of clinical trials leading to FDA approval, HER2 doesn’t predict response to Herceptin. We don’t know who should get Herceptin, but current guidelines pretend otherwise, relying on HER2 tests that are too much like divining rods.
The clinical benefits of Herceptin might be smaller than thought. The modifications of the trials leading to FDA approval might have artificially inflated the drug’s benefits. In addition, at the population level, five-year survival has increased only about 1.1% since the introduction of Herceptin. Converting that modest rise into the median number of months of increased survival per Herceptin patient might be instructive, perhaps even corrective, regarding strong claims about the curative powers of Herceptin-containing treatment regimens.
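One rough way to frame such a conversion is sketched below. It does not produce months of survival, which would require a survival model; it only rescales the population-level rise to the subset of patients who actually received the drug. The sketch assumes the 1.1% figure is a rise in percentage points, and the treated fractions and the `herceptin_share` parameter are hypothetical placeholders, not figures from any trial or registry. With the share set to 1.0 the result is an upper bound, since other advances in screening and treatment may also have contributed.

```python
def implied_gain_among_treated(population_gain_pts, treated_fraction, herceptin_share=1.0):
    """Absolute five-year survival gain (percentage points) among treated
    patients that would be needed to produce a given population-level gain.

    population_gain_pts: rise in five-year survival across all patients.
    treated_fraction:    share of all patients who received Herceptin
                         (hypothetical input, not taken from the article).
    herceptin_share:     share of the population-level rise attributed to
                         Herceptin (hypothetical; 1.0 gives an upper bound).
    """
    if not 0.0 < treated_fraction <= 1.0:
        raise ValueError("treated_fraction must be in (0, 1]")
    if not 0.0 <= herceptin_share <= 1.0:
        raise ValueError("herceptin_share must be in [0, 1]")
    return population_gain_pts * herceptin_share / treated_fraction


# Figure quoted in the text, read here as percentage points (an assumption).
POPULATION_GAIN_PTS = 1.1

# Hypothetical treated fractions, for illustration only.
for fraction in (0.10, 0.15, 0.20, 0.25):
    gain = implied_gain_among_treated(POPULATION_GAIN_PTS, fraction)
    print(f"treated fraction {fraction:.2f}: at most {gain:.1f} points among treated patients")
```

Lowering `herceptin_share` shows how quickly the implied benefit shrinks once part of the rise is credited to other factors.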
The Cleopatra trial reported the largest increases in survival of any Herceptin trial ever. The addition of pertuzumab to Herceptin and a taxane pushed median survival up by an incredible 16 months, whereas adding the supposed workhorse of the two, Herceptin, to a taxane produced only a 4-month rise. Furthermore, in the Marianne trial, adding pertuzumab to the Herceptin-based T-DM1 did no better than T-DM1 alone, adding zero months of survival instead of 16. Worryingly, researchers who might be able to explain the seemingly contradictory results are silent. Much as with conflicting HER2 assessments, researchers and physicians can simply choose what to believe.
A kind of HER2 fundamentalism has taken hold as foundational truths have broken down: “clinicians should rely on established markers of HER2 expression for selecting patients,” suggested Krop and Burstein. But those very same biomarkers are what have been dis-established. Also, “established” does not mean valid; rather, physicians are being counseled to use the old knowledge from when the HER2/Herceptin story was compelling and coherent.
Like efforts to keep the earth at the center of the solar system, complicated epicycles have been devised to hold on to HER2 orthodoxies. A simpler explanation might better fit the contradictory evidence: while HER2 overexpression and amplification are real phenomena, there might not be a clinically meaningful HER2 breast cancer subtype.
Summary of Recommendations
- Repeat the experiments addressing whether HER2 is transforming in mouse cell lines
- Add arms to B-47: include HER2 0 patients, receiving either Herceptin or a placebo, and a partial arm of HER2 3+ patients, all receiving Herceptin
- Decompose the 1.1% increase in five-year breast cancer survival from 1999 to 2012 to determine the contribution from Herceptin
The Herceptin meta-analysis being conducted by the Clinical Trial Service Unit at Oxford should:
- Attempt to reproduce findings of the individual studies, including the joint N9831/B-31 trial that led to FDA approval of Herceptin in early breast cancer
- Estimate the likelihood of assessed HER2 status being correct, if that is possible
- Allow confidence intervals around Herceptin’s clinical benefits to reflect estimated HER2 test accuracy
- Check estimates of Herceptin’s contribution to overall survival against population-based survival figures
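
The recommendation to let confidence intervals reflect HER2 test accuracy could be sketched as follows. This is a simplified illustration, not the meta-analysts' method. It assumes the benefit observed among test-positive patients is a weighted mix of the benefit in truly HER2-positive patients and the benefit in false positives, weighted by the test's positive predictive value (PPV). The function names and all input numbers are hypothetical.

```python
from itertools import product


def effect_in_true_positives(observed_effect, ppv, effect_in_false_positives=0.0):
    """Back out the benefit among truly HER2-positive patients.

    Assumed mixture model (an illustration, not an established method):
        observed = ppv * true_effect + (1 - ppv) * effect_in_false_positives
    """
    if not 0.0 < ppv <= 1.0:
        raise ValueError("ppv must be in (0, 1]")
    return (observed_effect - (1.0 - ppv) * effect_in_false_positives) / ppv


def widen_interval(observed_ci, ppv_range, effect_in_false_positives=0.0):
    """Widen an observed confidence interval to cover a range of plausible PPVs.

    The mixture formula is linear in the observed effect and monotone in the
    PPV, so the extremes occur at the corners of the two input ranges.
    """
    corners = [
        effect_in_true_positives(obs, ppv, effect_in_false_positives)
        for obs, ppv in product(observed_ci, ppv_range)
    ]
    return min(corners), max(corners)


# Hypothetical inputs: an observed absolute benefit of 5 to 15 points and a
# PPV somewhere between 0.8 and 1.0. Neither range comes from the trials.
low, high = widen_interval((0.05, 0.15), (0.8, 1.0))
print(f"interval adjusted for test accuracy: {low:.4f} to {high:.4f}")
```

Setting `effect_in_false_positives` above zero models the possibility, raised earlier, that HER2-negative patients also benefit. If that benefit equals the observed one, the adjustment disappears entirely.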