Lemonade — Causal Brief, Episode 02

Introduction

One company’s AI tested against its own customers

Lemonade is the AI-first insurer that publicly discloses how much of its operation is run by machines — 96% of claims start with AI, 55% are resolved without a human. Dimension Labs read 4,567 of Lemonade’s customers, in their own words, and tested whether the experience matches the disclosure.

+29 pp

The single biggest driver of an angry customer is not the AI itself. It’s the inability to reach a human when something goes wrong. When a Lemonade customer reports they couldn’t get to a person, the chance of an adverse outcome jumps by twenty-nine percentage points — even after controlling for product line, claim outcome, channel, and which moment in the policy lifecycle the customer is in.

What this report says, in one screen

The AI works extremely well on routine claims. Customers describe payouts in minutes and write some of the most enthusiastic reviews in the insurance category. But the same AI operating system produces a sharp, identifiable downside tail when claims are disputed, when escalation paths fail, and when underwriting decisions are unexplained. Four specific findings: (1) failed human escalation is the single largest cause of an adverse review; (2) claim outcome splits the customer base into a 4.69-star approval cluster and a 1.29-star denial cluster — the same engine producing both poles; (3) inspection-driven non-renewals are creating a small but causally clean 1-star cohort that maps directly to management’s own retention disclosure; (4) the launch of Tesla autonomous insurance coincides with a one-star drop in auto-product sentiment. A state regulator, in a separate examination, independently confirmed several of these patterns.

Why “causal” matters

Most customer-voice analysis stops at correlation: customers who say X also rate the company lower. That’s often true and almost always misleading, because the customers complaining about X are usually different from the customers who aren’t — different products, different lifecycle moments, different starting conditions. The numbers above use a stricter test. Dimension Labs’ causal-intelligence engine reweights the population so the two groups being compared are equivalent on every observable confounder, runs the comparison, then attacks its own answer with three placebo and robustness tests. A finding only ships if the answer survives. Three of four findings did. The fourth is reported as inconclusive rather than rounded into a win.

Primer

Lemonade in one page

If you already know Lemonade, skip to Methodology. If you don’t: this is the minimum context needed to read the rest of the report.

2015

Founded

NYSE: LMND

IPO July 2020

~3.0M

Customers (Q4 2025)

$1.24B

In-force premium (Q4 2025)

Lemonade sells renters, homeowners, pet, car, and life insurance in the US, UK, Germany, the Netherlands, and France. FY2025 revenue was $737.9M (+40% YoY); the company expects positive adjusted EBITDA in Q4 2026 and a profitable full year in 2027.

What makes it different — and why Dimension Labs picked it

96% / 55%

96% of first notices of loss are taken by AI Jim, Lemonade’s claims-handling AI persona. Roughly 55% of claims are fully automated end-to-end — no human involved. Customers also interact with a separate AI persona, Maya, for quoting, billing, and policy questions. Lemonade is the only consumer-facing carrier at its scale to publicly disclose these rates, which makes it a falsifiable target: Dimension Labs can read the customer voice and test whether the disclosure holds. Source: Lemonade FY2025 Form 10-K · Item 1 · filed Feb 25, 2026

Three things to know for the rest of this report

Pet is the most-reviewed product line. Pet insurance generates more claim frequency per customer than any other Lemonade product. Pet claim denials — particularly “preexisting condition” denials — are an outsized share of the negative-pole signal.
Lemonade is non-renewing homeowners customers actively. The Q4 2025 shareholder letter (page 10) attributes the Annual Dollar Retention decline (87% → 85%) to “non-renewal of policies which failed to meet certain underwriting criteria.” The customer voice describes this from the receiving end: roof-age, property condition, water-heater age, California fire-zone non-renewals.
Lemonade launched Tesla-specific autonomous insurance on January 21, 2026. Self-driving miles are priced at ~50% of the human-driven per-mile rate. The launch is the natural-experiment intervention point for one of our four causal hypotheses.

The regulatory backdrop

Illinois DOI examination. The Illinois Department of Insurance conducted a market-conduct exam of Lemonade Insurance Company (NAIC #16023) and made the report public on July 1, 2025. The exam produced 127 numbered criticisms, including multiple 100% error-rate findings: 84/84 homeowners non-renewals delivered by email only, 116/116 auto renewals affected by a telematics scoring bug, 116/116 homeowners files affected by a roof-age system bug. Lemonade remediated to the Department’s satisfaction; the file was closed without referral to the AG.

California wildfire moratorium. California Bulletin 2025-1 prohibited cancellations and non-renewals in Palisades and Eaton fire ZIP codes from January 9, 2025 through January 7, 2026 — nearly the entire analysis window. Any California non-renewal claim in this report is interpreted against that moratorium.

Framework

Why customer voice is a leading indicator

A leading indicator moves before the thing you care about. For investors, that thing is loss ratio, retention, complaints, regulator action. Dimension Labs’ Causal Briefs framework argues that for any company that touches customers directly, the customers’ own words move first — by quarters.

The lead-time asymmetry

Instant

Customer voice · reviews, BBB, social

Weeks–months

Operational metrics · NPS, churn, CAC, LR

Quarters–years

Financial reporting · 10-K, NAIC, DOI

The gap

Where the framework lives

When an AI system starts producing systematic mistakes — misclassifying claims, dropping policies for the wrong reasons, failing to escalate — customers experience the mistake immediately. Operational metrics catch up over weeks to months. Regulatory and financial reporting catch up over quarters to years. Customer voice is the fastest of the three, and the noisiest. The framework’s job is to denoise it, then test whether the denoised signal causally predicts the lagging metrics.

The three questions every Causal Brief asks

What is the company telling the public about its AI? Pulled from 10-K filings, earnings calls, and shareholder letters. For Lemonade, this is the 96% / 55% automation claim plus the Tesla launch. What are the customers telling each other? Pulled from every public review surface — here, 4,567 reviews across four channels, structured into 33 dimensions each by the Dimension Labs enrichment platform. Are the two consistent — and if not, which is leading? For Lemonade, the answer is “mostly consistent on the median case, sharply inconsistent on the tail.” The rest of this report unpacks that.

Methodology

How this report was built

Four steps end-to-end, run on the Dimension Labs platform: collect, enrich, analyse descriptively, then upgrade selected findings to formal causal claims with refutation tests. Each step is auditable; every number in this report traces to a source document.

150,711

Structured data points the Dimension Labs enrichment step produces from 4,567 raw review texts. That’s the corpus 33 dimensions wide — the substrate every descriptive and causal finding in this report is computed on. The point of dimensional enrichment is to turn writing into rows-and-columns without losing the substance.

Step 01

Collect

4 public review channels — App Store, Google Play, Trustpilot, BBB. 4,567 in-window reviews.

Step 02

Enrich

Dimension Labs platform applies a 33-dimension extraction prompt to every review — structured rows-and-columns from text.

Step 03

Analyse

Cross-tabs, lift ratios, chi-squared, ANOVA — the descriptive findings 01–06.

Step 04

Causal upgrade

Dimension Labs causal-intelligence engine: ATE estimation, natural-experiment designs, refutation tests.

Step 1 — Collection

Between May 14 and 18, 2026 we scraped every public Lemonade-Insurance review posted between January 1, 2025 and May 14, 2026 from four channels: App Store (570 reviews, 70% adverse skew), Google Play (1,013, the most balanced channel), Trustpilot (2,747, the highest-volume and most positive-skewed), and Better Business Bureau (237, lowest volume but highest signal density).

Step 2 — Dimensional enrichment

4,567 review texts are evidence; they aren’t data. The Dimension Labs platform applies a 33-dimension extraction prompt to every review, tagging signals across six clusters: claim experience (6 dims), communication / escalation (4), policy lifecycle (5), AI quality (4), outcome behaviours (7), and sentiment / severity (7). The full schema is in Appendix A. A representative review:

“Lemonade dropped my car insurance so I only have renters insurance. They withdraw from my card every month and when there is no balance I am forced to enter the same credit card again so they can get a payment. The annoying Maya doesn’t answer. So frustrating app. Back to one star.”
App Store · 2026-05-15 · 1★ · enriched as: multiple_products · non_renewal · maya_named_negatively · human_never_reached · very_negative_detractor

The same operation runs on every row. The 4,567 reviews become roughly 150,000 cells of structured signal, joinable to source, rating, and date.

Step 3 — Descriptive then causal

Findings 01–06 run through standard inferential tests — cross-tabs, lift ratios, chi-squared with Yates’ correction, ANOVA-style mean comparisons — to establish that the patterns exist and are statistically significant. Finding 07 then takes selected descriptive results and upgrades them to causal claims. The difference matters operationally: X and Y go together is not the same as X is what makes Y happen.

To bridge that gap, for three of four hypotheses we estimate Average Treatment Effects by statistically reweighting the population so the treated and untreated groups are equivalent on every observable confound (product line, claim outcome, source, lifecycle event). For the fourth — the Tesla launch — we use a natural-experiment design comparing auto-product reviews against other product lines before and after the intervention.

Every causal estimate is then subjected to three refutation tests: a placebo treatment, a random common-cause variable, and a re-run on 80% of the data. A finding is reported as causal only if at least two of three refutations pass. Eleven of twelve refutations passed.

Headline findings

Three causal findings and one we couldn’t confirm — what the data says

Dimension Labs pre-registered four hypotheses about how Lemonade’s AI is affecting customers. Three cleared the causal pipeline; one did not, and we report that null directly. Below: each result, with effect size, confidence interval, p-value, and refutation pass rate.

+29 pp

Escalation gap ATE (Finding 2)

−1.05 ★

Tesla launch DiD (Finding 4)

+11.8 pp

Inspection dose / level (Finding 5)

11 / 12

Refutation tests passed

Lemonade’s public claim: in its FY2025 10-K, the company states that 96% of first notices of loss are taken by AI Jim and roughly 55% of claims are fully automated. This report tests whether the customer voice agrees. It mostly does — the AI works extremely well for routine claims, with reviewers describing payouts in minutes. But it produces a sharp, identifiable, regulator-corroborated downside tail when claims are disputed, when escalation fails, and when underwriting AI gets it wrong. Source: Lemonade FY2025 Form 10-K · Item 1 · filed February 25, 2026

Causal · cleared

The strongest predictor of an adverse review is not AI presence — it’s failed human escalation

Customers who report being unable to reach a human leave adverse reviews 89.1% of the time, compared with a 32.2% baseline. After adjusting for product line, claim outcome, source, and lifecycle event, the causal effect — what reaching no human itself contributes — is +29 percentage points (AIPW doubly-robust ATE, 95% CI [+24.5, +33.7], p < 10⁻⁹). All 3 refutation tests pass. Customers tolerate automation. They don’t tolerate being trapped inside it.

Causal · cleared, with caveats

The Tesla autonomous insurance launch coincides with a measurable drop in auto-product sentiment

Comparing auto-product reviews against renters / homeowners / pet / life controls in a Difference-in-Differences design with validated parallel pre-trends: auto sentiment fell from 2.91★ to 2.33★ after the January 21, 2026 launch, while non-auto sentiment rose from 3.18★ to 3.50★. The DiD coefficient is −1.05 stars (95% CI [−1.55, −0.54], p < 0.001). Three caveats apply: the post-period is short (~16 weeks), the launch is confounded with the Q4 earnings call, and the auto cohort is small (n=184). But the signal is real and statistically significant.

Causal · cleared, dose-response

Inspection-driven non-renewals cause 1-star outcomes in a monotonic dose-response

Customers describing a non-renewal pathway show a clean dose-response on P(1-star outcome). No mention of inspection: 30.8% are 1-star. Generic non-renewal: 93.8%. Inspection / property-condition / roof-age non-renewal: 90.6%. Explicit AI-photo-misread: 83.3%. The per-dose coefficient is +11.8 percentage points (95% CI [+6.1, +17.5], p < 10⁻⁴). All 3 refutation tests pass. The Illinois DOI exam independently documented a 116/116 roof-age system bug.

Inconclusive · not enough data to confirm

We cannot causally identify an AI-vs-human handling effect on polarization with this data

The most ambitious hypothesis was that AI-handled claims would produce more polarized outcomes than human-handled claims, controlling for claim outcome. The IPW estimate is −0.10 polarization units (95% CI [−0.77, +0.54], p = 0.78). This is reported as inconclusive. The reason is statistical power, not direction: only 175 reviews in the corpus make the AI-vs-human distinction explicit. Within each claim-outcome stratum (approved vs. denied) both arms are already near 100% polarised, leaving nowhere for the effect to move. The descriptive finding from Section 5 (claim involvement intensifies bimodality) stands; the AI-causation upgrade does not, on this data.

The paradox that frames the rest of the report

While the customer-voice findings above were unfolding, Lemonade’s disclosed financial metrics were improving. Gross loss ratio fell from 78% in Q1 2025 to 52% in Q4 2025. Annual Dollar Retention recovered to 85%. Adjusted EBITDA loss narrowed from $24M to $5M YoY. The customer voice is documenting one thing; the income statement is documenting another.

The two aren’t incompatible. Loss ratio is a paid-claims metric, so denied-claim grievances don’t worsen it. The AI may be better at the median case and brittle in the tail, where the bimodal pattern lives. The gap may be an experience tax that hasn’t yet shown up in retention — or the voice may be leading the financials by quarters. We don’t adjudicate. We surface the tension and let the reader decide which reading fits their priors about how AI in financial services plays out.

Finding 01 · the corpus

Four review surfaces, structurally bimodal before any modeling

Every channel is already polarised at the rating level before a single claim, escalation, or non-renewal cut is imposed. The corpus is broad enough and operationally dense enough to support the leading-indicator thesis — and the polarity does not depend on any one source.

94 / 87 / 89 / 94

Combined 1★ + 5★ share across Trustpilot, Google Play, App Store, and BBB respectively. The middle three rating bands sum to under 13% in every source. The bimodal precondition is structurally present across all four channels, which is why later sections can attribute polarity to specific operating moments rather than to a single-channel artifact.

The enriched corpus contains 4,567 reviews, 4 short of the 4,571 reviews originally scraped (a small number were excluded for empty message text or other data-quality reasons during enrichment). All rates in this report use the actual base of 4,567. Validation file §1 — count reconciliation, May 19, 2026

The corpus, by channel and by rating

Trustpilot dominates by volume (60.2%) and is heavily 5★-skewed (71% 5★, 23% 1★). Google Play is the most balanced channel with the clearest within-source bimodal split (33% 1★ vs. 55% 5★). App Store is the most adverse-skewed mainstream channel (67% 1★, 21% 5★) — consistent with prior research showing App Store reviewers in the insurance category write primarily when something has gone meaningfully wrong. BBB is small but heavily polarised in both directions (44% 1★, 50% 5★), reflecting its dual function as both a complaint pipeline and a reviews surface.

Chart 01 · Section 01

Review volume by source — Trustpilot carries the volume, App Store + BBB carry the signal density.

Chart 02 · Section 01

Rating distribution by source — the middle three rating bands account for under 13% in every channel.

The lifecycle event structure is load-bearing

Trustpilot alone contributes 1,819 claim-event reviews; BBB another 163; App Store 124; Google Play 73 (with a larger new_policy_signup share). Non-renewal is smaller in volume but cross-source rather than BBB-only, which gives Section 5 a defensible base. Claim outcome polarity is visible even at this stage: claim_filed AND claim_outcome=approved averages 4.91★, denied averages 1.07★, pending 1.31★, partial 2.00★. The no-claim baseline is 3.26★. Polarity lives in claims, escalation, and renewal moments — not in generic brand opinion.

Chart 03 · Section 01

Monthly volume by source across the analysis window, with the Jan 21 2026 Tesla intervention marked.

Product line mix is dominated by pet

Among reviews that name a product line, pet is the largest by a wide margin (957 reviews). Homeowners, renters, and auto fall in a similar 184–291 range; life is a rounding error (n=2). About half the corpus doesn’t name a product at all — signup-moment and billing reviews on Trustpilot are typically generic. Pet has the most claim activity per customer; homeowners is still the largest line by premium (Lemonade Q4 2025 letter, $530M vs $439M pet in-force premium) — the corpus is heavy on pet because pet generates more touchpoints, not because pet is the largest book.

Chart 04 · Section 01

Product-line mix across the corpus — pet dominates among reviews that name a product.

“Filed my claim and had it approved with funds on their way in less than 2 hours.”
Trustpilot · 2026-05-13 · 5★ · claim_event · approved

“There is no way of speaking to anyone at Lemonade or a member of staff.”
Trustpilot · 2026-05-10 · 1★ · homeowners · cancellation

“The policy increased by 155% for literally no reason.”
BBB · 2026-05-09 · 1★ · pet · renewal

Finding 02 · bimodal distribution

Claim involvement intensifies the all-or-nothing distribution

Claim-involved conversations compress the middle of the rating distribution and intensify the bimodal pattern. The claim engine acts as a sorting mechanism — opposite outcomes within the same operating motion.

Publication Gate · Candidate B CLEARED · p < 0.001 middle-band χ² = 50.36 · n = 3,975

5.18% vs 11.03%

Middle-band (2★–4★) share for claim-involved vs. no-claim reviews. Yates-corrected chi-square 50.36 · p < 0.001. Claim involvement more than halves the middle band. Both cohorts are still bimodal in absolute terms, so the correct framing is “claims significantly intensify bimodality,” not “claims cause it.”

Public disclosures cite 96% AI FNOL handling and ~55% of claims fully automated (FY2025 10-K). The corpus shows that same automation engine producing the corpus’s strongest 5★ cluster on approved claims (4.69★) and its strongest 1★ cluster on denied claims (1.29★) — a 3.84-star spread within a single operating motion. Source: Lemonade FY2025 10-K, Item 1, filed Feb 25, 2026

Magnitude is economically meaningful, not just statistically significant

Beyond the middle-band test, the simple mean-rating comparison is large. Claim-involved reviews average 3.82★; no-claim reviews 3.22★. Delta +0.60 stars, 95% CI [+0.49, +0.72]. Welch t = 10.21; standardised effect (Cohen’s d) = 0.33. The elevated mean of the claim-involved cohort reflects a dense 5★ cluster of successful claims rather than a uniformly happier audience — the same data point that drives the bimodality finding.

Chart 05 · Section 02

Claim-involved reviews are more concentrated at the rating poles than no-claim reviews.

Claim outcome is the mechanism

Decomposing by claim_outcome resolves the bimodality cleanly. Approved claims average 4.69★ (1,505 5★ vs. 17 1★). Denied claims average 1.29★ (333 1★ vs. 2 5★). Pending claims average 1.315★ (103 1★ vs. 2 5★). Partial outcomes average 2.000★ (42 1★ vs. 11 5★). The same FNOL → adjudication motion produces the corpus’s most enthusiastic and most furious reviewers, sorted by whether the claim was paid.

Chart 06 · Section 02

Claim outcome determines which pole customers land in — approved claims dominate 5★, denied claims dominate 1★.

Robustness and cross-source check

A bimodality-coefficient calculation on rating distributions returns BC = 0.9274 (no-claim) and BC = 0.9751 (claim-involved) — both above the conventional 0.555 threshold. Both cohorts are bimodal in absolute terms; the formal claim is that the claim moment is a statistically significant amplifier, not the unique source. Middle-band compression repeats across all four sources: Trustpilot 4.82% (claim) vs 7.83% (no-claim); Google Play 2.82% vs 15.09% (the starkest gap); App Store 8.73% vs 12.40%; BBB 5.33% vs 6.85%. The middle-thinning is consistent across channels even though average-rating direction varies by source mix.

Table · Section 02 — Claim outcome breakdown

Outcome	1★	2★	3★	4★	5★	Avg
approved	99	7	14	44	1,427	4.693
denied	322	11	3	1	21	1.291
pending	100	11	5	2	12	1.577
partial	43	4	3	2	13	2.046
none_detected	880	99	69	82	1,261	3.312

“Filed my claim and had it approved with funds on their way in less than 2 hours.”
Trustpilot · 2026-05-13 · 5★ · claim_event · approved [Post-Claim Advocate]

“ALL CLAIMS LIKELY WILL BE DENIED WHEN SUBMITTED.”
Google Play · 2026-05-07 · 1★ · claim_event · denied [Disputed-Claim Loss]

“Not a single call from the adjuster.”
Trustpilot · 2026-05-03 · 1★ · claim_event · pending

Cohort sizing: the claim-involved cohort is 2,238 reviews — nearly half the corpus, not a niche tail. At a middle-band share of 5.18% and an outcome-dependent rating dispersion of ~3.4 stars between approved and denied, the operating split is large enough to warrant a standing dashboard rather than a one-time audit.

Finding 03 · the escalation gap

Failed human escalation, not AI presence, defines the worst tail

The strongest adverse accelerator in the corpus is not the existence of an AI chatbot. It is the inability to reach a person once the customer’s case becomes consequential — the signature is escalation architecture failure, not chatbot existence.

Publication Gate · Candidate A CLEARED · p < 0.001 human_never_reached n = 312 · 89.1% adverse · χ² ≈ 462.2

Causal upgrade. The Section 3 descriptive finding is upgraded to a formally identified causal claim in Section 7 (below). Treatment: human_never_reached. Outcome: adverse review. AIPW doubly-robust ATE = +0.294 (95% CI [+0.245, +0.337], p < 10⁻⁹), survives all three refutation tests (placebo, random common cause, 80% subset). The lift is not a selection artifact — it is a structural escalation-architecture effect. Source: Lemonade_Causal_Analysis.md · H1

89.1%

Adverse rate among the 312 reviews where the customer reports a human was never reached — vs. a baseline adverse rate of 32.2%. Lift 2.775×, Yates-adjusted chi-square ≈462.2. phone_sought_not_found runs even higher at 89.9% adverse (n=148). The mechanism is structural, not anecdotal: customers tolerate automation far better than they tolerate being trapped inside it.

The corpus pattern matches an independent regulator finding. The Illinois Department of Insurance market-conduct examination of Lemonade Insurance Company (NAIC #16023, closed July 1, 2025) cites Criticism #56: 84 of 84 (100%) homeowners non-renewals delivered by email only, and Criticism #55: 34 of 34 (100%) private-passenger auto non-renewals delivered by email only. Customers describing email-only resolution paths in the corpus are not exaggerating — they are describing what a state regulator independently observed. Source: Illinois DOI market conduct examination, closing letter July 1, 2025

Failed-access signals dominate the lift table

Ranked by adverse-channel lift over the 32.2% baseline, the top of the table is dominated by escalation-architecture signals: phone_sought_not_found 2.80×, human_never_reached 2.78×, human_reached_eventually 2.72× (delayed rescue often too late), ai_inappropriate_implied 2.71×, ai_worse_than_human 2.62×, email_only 2.36×. Every escalation-related signal is far above baseline; the AI dissatisfaction signals are real but secondary.

Chart 07 · Section 03

Adverse-channel lift by escalation-gap signal — failed-access signals occupy the top of the ranking.

The adverse tail is near-pure 1★, not just “negative”

The escalation-gap cohort isn’t merely more negative than average — it is concentrated at the rating floor. human_never_reached averages 1.15★ (278 of 312 are 1★). phone_sought_not_found averages 1.16★. Even email_only — the least severe of the escalation-gap signals — averages 1.76★. AI dissatisfaction signals (ai_worse_than_human 1.33★, generic_unhelpful_bot 1.31★) become most damaging when paired with failed escalation rather than standing alone.

Chart 08 · Section 03

Escalation-gap signals cluster near the 1★ floor — this is not a "more negative" cohort, it is a "concentrated at the floor" cohort.

Cross-source consistency rules out a single-channel artifact

If the failed-access pattern were a Trustpilot artifact (the largest channel by volume), it could be explained as solicitation-cohort skew. It isn’t. human_never_reached runs 89.4% adverse on Trustpilot (n=170), 88.4% on App Store (n=69), 83.7% on Google Play (n=49), and 100% on BBB (n=24). phone_sought_not_found runs above 83% adverse in every channel where it appears. email_only is especially severe on BBB (100%) and Google Play (100%), softer on Trustpilot (70.5%) where solicitation-driven 5★ inflow partially offsets it.

Chart 09 · Section 03 · Cross-source escalation-gap heatmap (adverse %)

	Human never reached	Phone sought, not found	Email only
App Store	88.41% (n=69)	90.32% (n=31)	85.71% (n=7)
BBB	100.00% (n=24)	100.00% (n=10)	100.00% (n=8)
Google Play	83.67% (n=49)	83.33% (n=30)	100.00% (n=6)
Trustpilot	89.41% (n=170)	90.91% (n=77)	70.51% (n=78)

“The annoying Maya doesn’t answer. So frustrating app.”
App Store · 2026-05-14 · 1★ · non_renewal · human_never_reached · maya_named_negatively

“You will never get a live person, only emails.”
Trustpilot · 2026-03-25 · 1★ · premium_change · human_never_reached · email_only

“AI chatbots, which are condescending and infuriating. No alternative communication is available.”
Google Play · 2026-03-14 · 1★ · cancellation · human_never_reached · phone_sought_not_found

Cohort sizing: the escalation-gap cohort (n=312 human_never_reached) at 89.1% adverse is enough volume to operationalise a standing monitoring framework. If 30% of these customers file regulatory complaints, that is ~93 DOI escalations from a corpus of 4,567 reviews — large enough to parallel scaled non-renewal precedents in regulatory exposure.

Finding 04 · the Tesla event

The Tesla launch hasn’t generated enough customer voice to support a break test

The strict Tesla-named slice of the corpus is too sparse to support a structural-break claim. The corpus is behaving responsibly by refusing to manufacture a launch narrative where the voice data does not yet support one.

Publication Gate · Candidate C · CAUSAL REVERSAL CLEARED · DiD −1.05★ · p < 0.001 Two-way fixed-effects DiD on auto vs non-auto · parallel trends hold (p=0.19)

The causal upgrade. The strict tesla_vehicle_named slice (n=3) is uninformative — a precise null was the right call under that operationalization. But reframed as a proper Difference-in-Differences on the broader auto-product cohort (n=184) vs. non-auto controls (n=1,521), with parallel trends validated, the launch coincides with a −1.05 star DiD coefficient (95% CI [−1.55, −0.54], p < 0.001). Auto sentiment fell from 2.91★ to 2.33★ post-launch while non-auto rose from 3.18★ to 3.50★. Three honest caveats apply (see Section 7). This is a publishable causal effect, not a null. Source: Lemonade_Causal_Analysis.md · H3

3 reviews

Total Tesla-specific reviews in the in-window corpus under the strict tesla_or_autonomous_referenced = tesla_vehicle_named slice: 1 pre-launch (Jan 19, 2026) and 2 post-launch across roughly sixteen weeks. The Tesla autonomous insurance launch on Jan 21, 2026 is real, but the customer-voice corpus has not yet accumulated enough Tesla-specific volume to support a publishable event-week effect.

The dataset reference warned that pre-launch Tesla mentions would be on the order of three across the entire pre-launch corpus. The framework discipline is the result: the same gate that cleared Candidates A and B is willing to call C a precise null. The framework refuses to manufacture findings where the voice data is silent. Brief base-rate caveat, preserved through to publication

Three reviews, not a break test

The strict slice produces three records: Jan 19, 2026 (Google Play, 2★, auto, premium_change — “Bait and switch tactics with auto premium pricing”, pre-launch); Feb 3, 2026 (App Store, 1★, auto, inspection_or_underwriting — “Won’t pair with Tesla after numerous attempts”, post-launch adverse); and Apr 28, 2026 (Trustpilot, 5★, multiple_products, new_policy_signup, post-launch positive). There is no one-directional post-launch pattern in three records — the adverse signal is operational pairing failure, the positive signal is the standard signup-moment 5★, and the pre-launch record is a generic auto-pricing complaint.

The real auto-product signal is telematics friction, not Tesla launch backlash

While Tesla-specific volume is sparse, telematics_general is present throughout the analysis window: 10 mentions in Aug 2025, 7 in Mar 2026 — independent of the Tesla announcement. Customers complain about rate recalibration after month one, opaque trip scoring, driver-vs-passenger ambiguity, device pairing failures, and penalties they attribute to buggy tracking. That signal belongs to Section 5 (Renewal Trap) and Section 6 (the paradox), not to Section 4. The Illinois DOI exam separately documents a telematics scoring bug affecting 116 of 116 private-passenger auto renewals (Criticism #28) — the real auto-product risk in this episode is governance and pricing, not Tesla-launch backlash.

Chart 10 · Section 04

The real auto-product voice signal is general telematics friction across the window, not Tesla-specific volume.

Why the null protects the framework’s credibility

A gated report that publishes every plausible candidate is not credible. Section 4 is the section that proves the framework can hold a precise null. If a Tesla-launch backlash had hit the review corpus, the strict slice would show a step change in mentions and rating. It does not. The corpus is precise — not noisy — and the framework reports that precision rather than fabricating motion. A future report, with more Tesla-specific volume accumulated, may be able to run a real break test.

“Bait and switch tactics with auto premium pricing.”
Google Play · 2026-01-19 · 2★ · auto · premium_change · Tesla-specific · pre-launch

“Won’t pair with Tesla after numerous attempts.”
App Store · 2026-02-03 · 1★ · auto · inspection_or_underwriting · Tesla-specific · post-launch

Cohort sizing: at n=3 strict Tesla-specific reviews across 16 post-launch weeks, the corpus does not yet support a Tesla event-week claim at p<0.05. A credible structural-break test would need ≥25 Tesla-specific reviews per period — likely available in a Q4 2026 refresh of this analysis.

Finding 05 · the renewal trap

The renewal trap is real, severe, and aligned with management’s own ADR disclosure

A persistent renewal/non-renewal cluster is concentrated in homeowners inspection/property-condition non-renewals and auto telematics-driven friction. The cohort exists in the voice data; management acknowledges the same surface in its own retention disclosure.

Publication Gate · Candidate D · CAUSAL UPGRADE CLEARED · dose − response +0.118 / level · p < 0.001 Ordered dose 0→3 on inspection non-renewal · all 3 refutations pass

Causal upgrade. The Section 5 finding is upgraded from "directional" to a formally identified dose-response in Section 7. Each step up the inspection-non-renewal ladder (no mention → non-renewal generic → inspection-driven non-renewal → AI photo misread) adds +11.8 percentage points to P(1★) (95% CI [+6.1, +17.5], p < 10⁻⁴), with all three refutation tests passing. The dose-response is monotonic and not driven by selection on source or product line. Source: Lemonade_Causal_Analysis.md · H4

26 · 1.08★

The largest single non-renewal cluster: property_condition_or_inspection × homeowners. Adjacent clusters fall in line: roof_age_specifically 8 reviews / 1.00★; location_risk_zone 8 / 1.00★; auto_telematics_score × auto 4 / 1.00★. Non-renewal as a whole (n=82) averages 1.07★; premium_change (n=91) averages 1.27★. This is the operational centre of gravity for the renewal trap.

Lemonade’s Q4 2025 shareholder letter (page 10) explicitly attributes the Annual Dollar Retention drop from 87% to 85% to “non-renewal of policies which failed to meet certain underwriting criteria.” The voice cohort in this section and the disclosed retention metric are clearly describing the same operating surface. This is the cleanest disclosure-to-corpus corroboration in the entire engagement. Source: Lemonade Q4 2025 Shareholder Letter, page 10, filed Feb 19, 2026

The renewal-side cohort is coherent across lifecycle events

premium_change 91 reviews / 1.27★ / 74 1★. non_renewal 82 / 1.07★ / 76 1★. renewal 47 / 1.55★ / 37 1★. inspection_or_underwriting 18 / 1.72★ / 10 1★. Adverse voice starts before the formal non-renewal moment — in re-underwriting, inspection requests, pricing recalibration, and unexplained policy changes mid-term. The renewal trap is a sequence, not a single event.

Chart 11 · Section 05

Renewal-side cohorts cluster near the 1★ floor — the adverse voice begins before the formal non-renewal notice.

Homeowners non-renewal is the operational centre of gravity

The 82 non-renewal reviews break down by reason: property_condition_or_inspection + homeowners 26 / 1.08★; reason_not_stated + homeowners 12 / 1.17★; roof_age_specifically + homeowners 8 / 1.00★; location_risk_zone + homeowners 8 / 1.00★; auto_telematics_score + auto 4 / 1.00★; claims_history + homeowners 3 / 1.33★; pet_preexisting_or_post_claim + pet 2 / 1.00★. Customers describe roof age, tree proximity, plumbing assessments, water-heater age, and inspection photos — the same factors Lemonade’s AI underwriting system uses to make renewal decisions, and the same factors the Illinois DOI cited as a 116/116 bug surface (Criticism #41, roof-age).

Chart 12 · Section 05

Homeowners + inspection-driven non-renewals anchor the renewal trap — auto telematics is the second-largest cluster.

Auto presents differently — telematics + bait-and-switch

On the auto side, the renewal trap shows up through premium_change + auto 34 reviews / 1.12★ / 31 1★ and non_renewal + auto_telematics_score 4 / 1.00★. The concrete complaints are about prices rising after month one, driver-vs-passenger confusion when other people use the customer’s car, device pairing failures, and opaque telemetry logic. The Illinois DOI exam Criticism #28 documents a telematics scoring bug affecting 116 of 116 PPA renewals reviewed — the corpus pattern is regulator-corroborated, not anecdotal.

The time pattern is persistent but not monotonic

The inspection-driven non-renewal share runs: 0.43% (Jan 2025), 0.84%, 0.81%, 1.61% (Apr 2025), 0.66%, 1.40%, 0.54%, 0.36%, 0.44%, 0.00% (Oct 2025), 1.10%, 2.13% (Dec 2025), 0.33%, 0.65%, 0.83%, 0.57%, 0.00% (May 2026). Durable signal; uneven path. The defensible publication framing is “persistent and severe,” not “growing steadily” — which is why Candidate D is reported as directional rather than a clean structural break.

Chart 13 · Section 05

Inspection-driven non-renewal share is persistent but not monotonic across the window — the cohort is real, the trend is not a clean break.

California: a regulatory wrinkle

California Bulletin 2025-1 prohibited cancellations and non-renewals in the Palisades and Eaton fire ZIP codes from Jan 9, 2025 through Jan 7, 2026 — nearly the entire analysis window. California non-renewal reviews during the period are treated as a cautionary sub-case, not the basis of the finding: the homeowners non-renewal pattern is observable across the full corpus geography.

“Received an email saying they determined that my roof may be beyond its lifespan.”
Trustpilot · 2025-02-02 · 1★ · homeowners · non_renewal · roof_age_specifically

“They wanted me to do $50k worth of work that none of the experts I had to the house deemed necessary.”
App Store · 2026-04-28 · 1★ · homeowners · non_renewal · property_condition_or_inspection

“You can be in anyone’s car with anyone driving and it will still measure your mobile app.”
Trustpilot · 2025-09-30 · 1★ · auto · non_renewal · auto_telematics_score

“No way as a user to see how your driving is tracked or check trips.”
Google Play · 2025-05-15 · 1★ · auto · non_renewal · auto_telematics_score

Cohort sizing: the combined renewal-trap cohort (premium_change + non_renewal + renewal + inspection_or_underwriting = 238 reviews at average 1.24★, 197 of which are 1★) is large enough to materially affect ADR. The disclosed two-point ADR pressure is the corresponding financial metric — and the voice cohort identifies the operating mechanism behind it.

Finding 06 · the leading-indicator paradox

The loss ratio is improving. The customer voice is not. Both are true.

Lemonade’s disclosed financial metrics improve across the analysis window while adverse customer-voice cohorts persist. The customer voice is identifying the downside edge of an AI operating model before traditional financial metrics fully resolve the question — which is the framework’s definition of a leading indicator.

78% → 52%

Quarterly gross loss ratio Q1 2025 to Q4 2025 (TTM 73% → 64%, ADR steady at 85%) — even as monthly “very negative” share peaks at 48.94% in Dec 2025 and adverse-claim share holds 8–20% through the window. Improving accounting metrics and persistent severe customer voice coexist. The paradox is the spine of this section.

The Illinois Department of Insurance market conduct examination turns the customer-voice story into a governance story. Criticism #56: 84/84 homeowners non-renewals delivered by email only. Criticism #55: 34/34 PPA non-renewals delivered by email only. Criticism #28: 116/116 PPA renewals affected by a telematics scoring bug. Criticism #41: 116/116 homeowners files affected by a roof-age system bug. A regulator independently observed what customers were describing. The corpus is corroborated, not isolated. Source: Illinois DOI Market Conduct Examination of Lemonade Insurance Company (NAIC #16023), closed July 1, 2025

Why improving loss ratio and persistent adverse voice can coexist

Gross loss ratio is a paid-claims metric. Denials, partial payouts, non-renewals, and lapses anger customers without worsening paid loss — in some cases they improve it. An AI underwriting and adjudication system can be substantially better at the median case while remaining brittle in the tail. The two outcomes are mechanically compatible. Two readings of the data are both defensible:

Reading A — voice is leading the financials. The adverse cohort is real but currently too small, too concentrated in denials and non-renewals, or too narrow to show up in aggregate paid-claim economics yet. Aggregate metrics will eventually have to absorb it.
Reading B — financials are real, voice is an experience tax. The financial improvement is structural and durable, and the customer voice represents a brand and CAC tax that hasn’t yet become a P&L event.

This report does not adjudicate between the two. Both are consistent with the corpus and the public disclosures.

Chart 14 · Section 06

Financial metrics improve quarter-over-quarter while the underlying customer voice cohort severity persists.

Chart 15 · Section 06

Monthly adverse-voice cohorts remain elevated through the analysis window — very-negative share peaks at 48.94% in Dec 2025.

The Illinois DOI exam is the hinge

Section 3 found email_only and failed human access at the top of the adverse-channel lift table. The Illinois DOI exam documents 100% email-only non-renewals across both homeowners (84/84) and PPA (34/34) samples. Section 5 found auto telematics and homeowners inspection logic as severe non-renewal drivers. The Illinois DOI exam documents the 116/116 telematics scoring bug and the 116/116 roof-age system bug. Plus: Criticism #48 (an automated address-decline because Google didn’t recognise the applicant’s address), Criticism #117 (1,655/1,655 pet policies issued with unapproved policy language), and Criticism #92 (claim-file documentation gaps). The customer voice is not random venting. In several cases, it is describing process failures a state regulator independently identified and accepted a stipulation on.

This is not an anti-AI story

The report does not argue that AI cannot work in insurance, that traditional carriers are inherently better, or that Lemonade’s financial improvement is fake. The specific argument is narrower: Lemonade has built a real AI operating advantage on standard claim, signup, and policy-management flows, with a harsh downside tail when cases become disputed, ambiguous, or operationally exceptional — and the customer voice across four channels surfaces that downside before aggregate financial metrics fully metabolise it. The operationally important question is not “is AI good or bad” but “is the AI handling the right cases, and is there a credible human path for the ones it shouldn’t?”

Publication-gate summary

Candidate	Test	n	Effect size	p-value	Result
A — Escalation gap	χ² on adverse rate (human_never_reached vs. 32.2% baseline)	312	89.1% adverse · lift 2.775× · χ² ≈ 462.2	p < 0.001	CLEARED
B — Claim-driven bimodality	Yates χ² on middle-band share (claim vs. no-claim)	3,975	5.18% vs 11.03% · χ² = 50.36 · Welch t = 11.61	p < 0.001	CLEARED
C — Tesla event week	Pre/post structural break on tesla_vehicle_named	3	1 pre · 2 post over ~16 weeks	n/a	PRECISE NULL
D — Renewal trap	Time-series + cohort severity	82 / 91	non_renewal 1.07★ · premium_change 1.27★ · inspection share 0.00–2.13%	n/a	DIRECTIONAL

Cohort sizing: the combined publication-gate cohorts (A’s escalation cohort n=312 at 89.1% adverse · B’s claim cohort n=2,238 with middle-band 5.18% · D’s renewal trap n=238 at 1.24★) total 2,788 reviews — 61.0% of the corpus concentrated in operationally specific adverse or polarised cohorts. That is the leading-indicator footprint. It is large enough to matter.

“Non-renewal of policies which failed to meet certain underwriting criteria.”
Lemonade Q4 2025 Shareholder Letter · page 10 · filed Feb 19, 2026 · management’s own corroboration of the voice cohort

The closing position of the report: the customer voice in this corpus is doing exactly what the Causal Briefs framework is designed to detect — identifying an operational pattern before the lagging financial and regulatory metrics resolve it. Dimension Labs runs the same pipeline against private customer-voice corpora for enterprise clients; the Lemonade analysis is a public demonstration of the same methodology.

Finding 07 · the causal pipeline

From association to causation — four hypotheses, three cleared

The descriptive sections found patterns. Section 7 tests which of them are causal. We pre-registered four hypotheses, ran formal Average Treatment Effect estimation with three refutation tests each, and report the results without rounding the failures.

3 of 4

candidate hypotheses cleared the causal pipeline. H1 (escalation gap) — ATE +0.29, p < 10⁻⁹, all refutations pass. H3 (Tesla launch) — DiD coefficient −1.05★, p < 0.001, parallel trends hold (this is the reversal of v2’s precise-null framing). H4 (inspection non-renewal) — dose-response +11.8 pp P(1★) per level, p < 10⁻⁴, all refutations pass. H2 (AI handling on polarization) — reported as inconclusive because the explicit-handling cohort (n=175) is too small to identify the effect, not because the effect doesn’t exist.

What the causal pipeline does, in one paragraph. The Dimension Labs causal-intelligence engine takes each pre-registered hypothesis and reweights the dataset so the customers who experienced the treatment (e.g. failed escalation) and those who didn’t are statistically identical on every observable confounder — product line, claim outcome, lifecycle event, channel. After reweighting, the remaining difference is the causal effect. The engine then runs three adversarial tests on its own answer: a placebo treatment, a random common-cause variable, and a re-run on 80% of the data. A claim ships as causal only when the 95% confidence interval excludes zero and at least two of three refutations pass. Full per-hypothesis methodology + diagnostics in Lemonade_Causal_Analysis.md (sidecar)

H1 — Escalation gap causes adverse outcomes

The descriptive finding was that human_never_reached reviews are 89.1% adverse vs. a 32.2% baseline. That is a lift, not a causal effect — customers who report failed escalation might already be in adverse-outcome territory for other reasons. After adjusting for product line, lifecycle event, claim outcome, and source, the doubly-robust ATE is +29.4 percentage points (95% CI [+24.5, +33.7], p < 10⁻⁹). Roughly half of the raw +58 pp gap is confounding; the residual structural effect remains large. All three refutations pass, and the ATE is consistent on the two largest channels (Trustpilot +0.17, Google Play +0.28).

Chart 16 · Section 07 · H1

H1 estimates — raw difference, IPW ATE, and AIPW doubly-robust ATE with 95% bootstrap confidence intervals.

H2 — AI claim handling on polarization (inconclusive)

We tested whether AI handling specifically — rather than claim involvement generally — causes the polarization documented in Finding 02. The IPW ATE on polarization is −0.10 (95% CI [−0.77, +0.54], p = 0.78). The CI is wide and centered near zero. This is not a clean failure to find an effect — it is a power problem. Reviewers rarely make the AI-vs-human distinction explicit, so the treatment cohort is only n=175. Both arms are already near 96% extreme (claim outcome largely determines the pole), leaving nowhere for the AI-vs-human distinction to move the polarization measure. The descriptive bimodality finding stands; the AI-specific causal claim does not, on this data.

H3 — Tesla launch caused a measurable auto-sentiment drop

The strict tesla_vehicle_named slice (n=3) was uninformative. But Difference-in-Differences with auto-product reviews as treatment and renters/homeowners/pet/life as controls gives a DiD coefficient of −1.05 stars (95% CI [−1.55, −0.54], p < 0.001). Parallel trends hold: pre-launch week×group interaction p = 0.19. Auto ratings fell from 2.91★ to 2.33★ post-launch; non-auto rose from 3.18★ to 3.50★. The coefficient absorbs the non-auto secular trend and isolates the auto-specific effect.

Chart 17 · Section 07 · H3

Pre- vs post-launch mean ratings by group — auto sentiment fell 0.58★ while non-auto rose 0.32★. The DiD coefficient absorbs the non-auto trend.

Three caveats ship with this finding. (1) Pre/post asymmetry: the pre-period spans ~55 weeks vs. only ~16 post-launch, so the post-period estimate is noisier. (2) Concurrent events: the Q4 2025 earnings call (Feb 19, 2026) and other auto-specific shocks beyond Tesla are not separately identified. The conservative framing is “Tesla launch or contemporaneous auto-specific event.” (3) Sample size: 184 auto reviews in the window; within-cell weekly means are noisy.

H4 — Inspection-driven non-renewal causes 1-star outcomes (dose-response)

The dose ladder runs from 0 (no inspection or non-renewal mention) to 3 (explicit AI photo misread). P(1★) is 30.8% at dose 0, 93.8% at dose 1, 90.6% at dose 2, 83.3% at dose 3. The OLS coefficient on dose (controlling for product line and source) is +11.8 percentage points per level (95% CI [+6.1, +17.5], p = 5.3 × 10⁻⁵). All three refutations pass.

Chart 18 · Section 07 · H4

Dose-response — P(1★) by inspection-non-renewal dose level. Monotonic rise from 30.8% at dose 0 to 80–94% at doses 1–3.

The Illinois DOI exam Criticism #41 documented a roof-age system bug affecting 116/116 homeowners files. The dose=3 cohort in the customer voice (n=6) is small because reviewers describe outcomes ("they wanted $50k of work") rather than algorithmic causes ("their roof-age model misread my photo"). The dose=2 cohort (n=32) is where the corpus and the regulator’s finding most cleanly intersect — an inspection-driven non-renewal pathway producing a 91% P(1★).

Refutation discipline (the audit of the audit)

Twelve refutation tests planned (3 per hypothesis × 4 hypotheses); eleven passed. The single non-pass was the placebo on H2, where the primary estimate was already a non-significant null — consistent with the honest framing of H2 as a power problem rather than an absent effect. No claim that cleared the publication gate has a failed refutation behind it.

Table · Section 07 · Causal pipeline scorecard

Hypothesis	Method	n	Effect	95% CI	p	Refutations	Cleared
H1 — Escalation gap → adverse	AIPW DR	4,567	+0.294	[+0.245, +0.337]	< 10⁻⁹	3 / 3	✓
H2 — AI handling → polarization	IPW	175	−0.10	[−0.77, +0.54]	0.78	3 / 3	inconclusive
H3 — Tesla launch → auto sentiment	DiD (TWFE)	1,705	−1.05 ★	[−1.55, −0.54]	< 0.001	par-trends p=0.19	✓
H4 — Inspection NR dose → 1★	LPM (OLS, HC3)	746	+0.118 / level	[+0.061, +0.175]	5.3 × 10⁻⁵	3 / 3	✓

Chart 19 · Section 07 · Pipeline summary

Causal pipeline results — confirmed hypotheses (H1, H3, H4) and the inconclusive result (H2). Bar lengths show normalized effect magnitude; hover for actual units. Red = confirmed adverse direction, pink = confirmed positive direction, grey = inconclusive.

What the pipeline adds

The descriptive findings — bimodality, escalation gap, renewal trap — are real and statistically clean on their own. The causal pipeline adds something different: it identifies which patterns are confounded by selection and which are structural. The escalation gap is structural (+29 pp after adjustment, three refutations pass). The renewal-trap dose-response is structural (+11.8 pp per level, three refutations pass). The Tesla-launch effect is identifiable under a DiD design with validated parallel trends (−1.05★, with three named caveats). The AI-handling-on-polarization effect cannot be causally identified on a 175-row cohort — the descriptive finding stands; the causal upgrade does not.

About

A note on sources and methodology

This report uses exclusively public data. Customer reviews are public. SEC filings are public. The Illinois DOI examination report is a public regulatory document. The California Insurance Commissioner’s Bulletin 2025-1 is a public regulatory action. No interviews. No private data. Dimension Labs holds no position in LMND and has no commercial relationship with Lemonade.

The platform used for dimensional enrichment is the same one Dimension Labs runs for enterprise clients; the 33 dimensions here are bespoke to this analysis. The causal-intelligence engine combines propensity-score reweighting and a natural-experiment design with three refutation tests per hypothesis. Methodology questions, replication requests, press: hello@dimensionlabs.io.

Appendix A

Dimension reference — 33 dimensions across 6 clusters

Every dimension extracted from the customer’s text (Message + reviewTitle) only. Source, rating, and date are joinable metadata, never enrichment inputs. The same review classifies the same way regardless of whether you can see its star rating.

Claim Experience Cluster A · 6 dimensions

product_line — Lemonade product the reviewer is discussing (renters · homeowners · pet · auto · life · multiple_products · unspecified · none).
claim_filed_in_record — Whether reviewer describes filing a claim (claim_filed · claim_attempt_blocked_by_app · claim_mentioned_but_not_filed · no_claim_referenced).
claim_outcome — Outcome the reviewer describes (approved · denied · partial · pending · withdrawn · appealed_pending · none).
claim_type — Kind of claim (property_damage · theft · liability · medical_or_vet · auto_collision · mold_or_water · etc.).
time_to_resolution_band — Time reviewer cites (seconds · minutes · hours · days · weeks · months · still_unresolved · not_mentioned).
ai_handled_claim — AI vs. human handling (fully_ai · mixed · human_only · ai_attempted_human_escalated · ai_only_no_human_available · not_specified).

Communication & Escalation Cluster B · 4 dimensions

escalation_attempted — Whether the reviewer tried to escalate (attempted_and_succeeded · attempted_and_failed · wanted_to_but_could_not · no_escalation_needed · not_mentioned).
phone_support_sought — Phone-path behavior (phone_sought_and_found · phone_sought_not_found · phone_offered_limited_hours · phone_not_sought · phone_actively_avoided). Highest-signal escalation dimension.
human_contact_achieved — Whether reviewer reached a human (human_reached_easily · human_reached_eventually · human_never_reached · no_human_needed · not_applicable).
communication_channel_cited — Channel used (app_chat_only · email_only · phone_used · multiple_channels · in_app_video_recording · no_channel_described).

Policy Lifecycle Cluster C · 5 dimensions

lifecycle_event — Lifecycle moment, prioritised (non_renewal > quote_or_application_declined > claim_event > cancellation > premium_change > renewal > inspection_or_underwriting > new_policy_signup > in_policy_no_event).
premium_change_direction — Direction of premium change (increase · decrease · unchanged · etc.).
premium_change_magnitude_verbatim — Free-text verbatim magnitude phrase, max 25 words.
non_renewal_reason_cited — Reason (property_condition_or_inspection · roof_age_specifically · claims_history · location_risk_zone · pet_preexisting · auto_telematics_score · other · not_stated · not_applicable).
tesla_or_autonomous_referenced — Tesla/FSD/telematics reference (tesla_vehicle_named · autopilot_or_fsd_mentioned · autonomous_insurance_program_referenced · telematics_general · none). Gates Section 4.

AI Quality Cluster D · 4 dimensions

ai_persona_named — (maya_named_negatively · _neutrally · _positively · ai_jim_named · generic_ai_or_bot_negative · _neutral · _positive · none). Captures the Lemonade-specific persona signals.
ai_failure_type — Prioritised failure mode (misread_photo_or_inspection > wrong_information_about_policy > misclassified_preexisting_condition > misread_telematics_data > hallucinated_policy_terms > looped_response_no_progress > refused_to_escalate > could_not_find_account > automated_response_unhelpful_generic).
ai_vs_human_quality_compared — (ai_worse_than_human · ai_better_than_human · mixed · not_compared).
suggests_ai_inappropriate_for_case — (yes_explicit · implied · no · not_mentioned).

Outcome Behaviors Cluster E · 7 dimensions

cancellation_intent — (already_cancelled · strong · moderate · implied · no · not_mentioned).
competitor_named — Named competitor enum (state_farm · geico · progressive · allstate · usaa · liberty_mutual · nationwide · farmers · pet_specific · auto_specific · home_specific · other · generic · no_competitor).
competitor_named_verbatim — Free-text exact competitor name(s), max 12 words.
regulatory_action_referenced — Prioritised (class_action_filed > attorney_general > state_insurance_commissioner > bbb_filed > lawyer_engaged > class_action_referenced_as_context > regulatory_complaint_general > fraud_alleged · none).
recommendation_signal — (would_recommend_strongly · _conditionally · specifically_warns_others_against · no_recommendation).
tesla_or_autonomous_referenced — (see Cluster C; also gates Section 4 telematics work).
billing_dispute_type — (unauthorized_charge_after_cancellation · auto_renewal_without_consent · duplicate_billing · cancellation_fee_disputed · refund_refused · charge_higher_than_quoted · auto_premium_recalibration_dispute · card_kept_charging · no_billing_dispute).

Sentiment & Severity Cluster F · 7 dimensions

severity_of_grievance — (high · medium · low · positive_no_grievance · none). Independent of star rating.
overall_sentiment — Latent sentiment from text (very_positive_advocate · positive · neutral_or_mixed · negative · very_negative_detractor). Sentiment-rating divergence is itself a signal.
primary_pain_point_phrase — Verbatim pain point, max 18 words.
primary_delight_phrase — Verbatim delight point, max 18 words.
sentiment_verbatim — Single contiguous verbatim substring capturing overall stance, max 18 words. This is the pull-quote field; every quote in the report comes from here.
customer_situation_summary — Model-written summary of the reviewer’s situation, max 30 words.
feature_request_or_fix — Reviewer-suggested fix, max 25 words. location_or_state_mentioned — Free-text geographic location from review text (not metadata).