
200 Reviews → 5 JTBDs in 2 Hours: AI-Powered User Research Synthesis

What is Jobs-to-be-Done (JTBD)?

Jobs-to-be-Done (JTBD) is a product research framework that focuses on what users are trying to achieve in a given situation, rather than what they explicitly ask for. A JTBD is expressed as: "When [situation], I want to [motivation], so that [expected outcome]" — capturing context and goal rather than a specific feature request.

TL;DR

  • Manual analysis of 200+ user reviews takes 2–3 weeks; an LLM-based pipeline reduces that to 2 hours
  • Optimal dataset size is 150–300 reviews per cycle — enough for statistically meaningful clusters
  • Four-stage process: signal extraction → clustering → JTBD formulation → validation with confidence scoring
  • Every JTBD maps back to specific review IDs — results are traceable and reproducible, not based on intuition
  • JTBD formula captures situation + motivation + expected outcome, opening multiple solution paths beyond a single feature request

Most user reviews never reach product decisions. The bottleneck is mechanical: manually analyzing 200+ reviews takes 2–3 weeks, and by the time it’s done priorities have already shifted. The team reads reviews, copies quotes onto sticky notes, groups them by intuition. The result is subjective, non-reproducible, and doesn’t scale.

This article covers how to go from raw reviews to five concrete Jobs-to-be-Done in 2 hours, using an LLM at every stage: signal extraction, clustering, JTBD formulation, and validation.

Why JTBD, Not Just “User Pain Points”

The Jobs-to-be-Done framework focuses not on what the user says, but on what they’re trying to achieve. “The app is slow” is not a pain point. The pain is that someone was trying to quickly find a route before leaving a hotel and couldn’t. Context, motivation, and expected outcome matter more than the specific complaint.

JTBD formula:

When [situation], I want to [motivation], so that [expected outcome]

Example from a real travel app analysis:

  • Review: “Why can’t I download the route offline? There’s no internet in the mountains!”
  • JTBD: When traveling in areas without connectivity, I want access to my planned route offline, so I don’t lose my bearings or derail my plans.

The difference matters. The review points to a feature (offline mode). The JTBD points to a job (navigating without connectivity) and an outcome (not derailing plans). The latter opens up multiple solutions: offline maps, preloading, SMS fallback, a printable route.

Where to Source Reviews

Sources for analysis:

  • App Store and Google Play — reviews with ratings, allow correlation between sentiment and specific issues
  • Zendesk/Intercom — support tickets, the most detailed problem descriptions
  • Typeform/Google Forms — survey responses, structured data
  • Reddit/Twitter — organic mentions, no survey bias
  • Interview recordings — customer development (custdev) transcripts, the richest context

Format for loading into an LLM — simple CSV or JSON:

[
  {
    "id": "review_001",
    "source": "google_play",
    "rating": 2,
    "text": "I was planning a trip to Bali, the app suggested...",
    "date": "2026-02-15"
  }
]

Keep id and source for traceability. Every cluster and every JTBD should reference specific reviews — otherwise the result can’t be verified.

Optimal volume for one cycle: 150–300 reviews. Fewer — not enough for statistically meaningful clusters. More — split into batches (more on that below).
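As a sketch of the normalization step, a CSV export can be converted into this JSON shape in a few lines. The column names here are assumptions; rename them to match your actual export:

```python
import csv
import io
import json

def csv_to_reviews(csv_text: str) -> list[dict]:
    """Convert a CSV export into the review JSON shape used as LLM input.

    Column names (id, source, rating, text, date) are assumptions;
    adapt them to your actual export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "id": row["id"],
            "source": row["source"],
            "rating": int(row["rating"]),
            "text": row["text"].strip(),
            "date": row["date"],
        }
        for row in reader
    ]

# Two-row example export
raw = """id,source,rating,text,date
review_001,google_play,2,App crashed offline,2026-02-15
review_002,app_store,5,Great route planner,2026-02-16"""

reviews = csv_to_reviews(raw)
print(json.dumps(reviews[0], indent=2))
```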

Step 1: Signal Extraction

The first LLM pass extracts structured signals from each review. Not classification, not sentiment — signals: what the person was trying to do, what went wrong, what outcome they expected.

Signal Extraction Prompt

You are a user experience researcher. For each review, extract:

1. **situation** - the context and usage situation (what was happening)
2. **intent** - what the user was trying to do
3. **outcome_expected** - what result they expected
4. **outcome_actual** - what they actually got
5. **emotion** - emotional tone (frustration/disappointment/anger/neutral/satisfaction)
6. **signal_strength** - how explicitly the need is expressed (1-5)

If there isn't enough information for a field - set it to null.
Respond strictly in JSON. Do not infer what isn't in the text.

Reviews:
{reviews_batch}

Example Output

{
  "review_id": "review_042",
  "situation": "Planning a family trip with kids for the weekend",
  "intent": "Find a route that accounts for children's interests",
  "outcome_expected": "Route with playgrounds and cafes with kids' menus",
  "outcome_actual": "Standard tourist route without consideration for children",
  "emotion": "disappointment",
  "signal_strength": 4
}
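Before clustering, it pays to check each extracted record against the schema. A minimal validator for the fields defined in the prompt above, treating null as acceptable per the prompt's own instructions:

```python
REQUIRED_FIELDS = {"review_id", "situation", "intent", "outcome_expected",
                   "outcome_actual", "emotion", "signal_strength"}
EMOTIONS = {"frustration", "disappointment", "anger", "neutral", "satisfaction"}

def validate_signal(signal: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    missing = REQUIRED_FIELDS - signal.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    strength = signal.get("signal_strength")
    if strength is not None and strength not in (1, 2, 3, 4, 5):
        problems.append(f"signal_strength out of range: {strength}")
    emotion = signal.get("emotion")
    if emotion is not None and emotion not in EMOTIONS:
        problems.append(f"unknown emotion: {emotion}")
    return problems
```

Records that fail validation go back through extraction rather than into clustering.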

Batching

200 reviews won’t fit in one prompt. Optimal batch size: 15–25 reviews per call, depending on text length. With a 128K token context window, you can push to 40–50, but extraction quality drops: the LLM starts missing nuances in reviews from the middle of the list.

Splitting into batches is automated with a 20-line script. Total processing time for 200 reviews at 20 per batch: 10 calls, roughly 5–7 minutes.
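A minimal batching sketch follows. `call_llm` is a placeholder for whatever API client you use (the name and signature are assumptions), and `EXTRACTION_PROMPT` stands in for the full Step 1 prompt:

```python
import json

# Shortened stand-in for the full Step 1 extraction prompt
EXTRACTION_PROMPT = (
    "Extract signals from each review. Respond strictly in JSON.\n\n"
    "Reviews:\n{reviews_batch}"
)

def make_batches(items: list, batch_size: int = 20) -> list[list]:
    """Split the review list into fixed-size batches, one LLM call each."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def extract_signals(reviews: list[dict], call_llm) -> list[dict]:
    """Run the extraction prompt over each batch and collect the signals.

    `call_llm` is a hypothetical function: prompt string in, JSON string out.
    """
    signals = []
    for batch in make_batches(reviews):
        prompt = EXTRACTION_PROMPT.format(reviews_batch=json.dumps(batch))
        signals.extend(json.loads(call_llm(prompt)))
    return signals

batches = make_batches(list(range(200)))
# 200 reviews at 20 per batch -> 10 calls
```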

For more on model selection and LLM call monitoring, see the LLM observability guide with Langfuse.

Step 2: Signal Clustering

After extraction, you have 200 structured records. The next step is finding repeating patterns.

Two approaches:

Approach A: LLM Clustering (fast, good for < 200 records)

Pass all extracted signals to an LLM and ask it to group them:

You have 200 extracted signals from user reviews.
Each signal contains: situation, intent, outcome_expected, outcome_actual.

Task:
1. Group signals by similar intent + situation (by meaning, not by words)
2. Give each cluster a short, descriptive name
3. For each cluster, provide:
   - cluster name
   - number of signals in it
   - list of review_ids
   - common pattern: typical situation + typical intent
   - average signal_strength

Do not create clusters with fewer than 3 signals — that's noise.
It's okay for one signal to belong to two clusters.

Format: JSON.

Approach B: Embeddings + HDBSCAN (precise, scales)

For datasets larger than 200 records, LLM clustering loses accuracy. Alternative: transform each signal into an embedding, then cluster using HDBSCAN.

from sentence_transformers import SentenceTransformer
import hdbscan

model = SentenceTransformer('all-MiniLM-L6-v2')

# `signals` is the list of extracted signal dicts from Step 1
# Concatenate key fields for embedding
texts = [
    f"{s['situation']} {s['intent']} {s['outcome_expected']}"
    for s in signals
]
embeddings = model.encode(texts)

# min_cluster_size=5 mirrors the "small clusters are noise" rule, slightly stricter
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=3)
labels = clusterer.fit_predict(embeddings)

HDBSCAN doesn’t require specifying the number of clusters upfront. It finds dense regions on its own and labels noise as -1. For 200 records, you’ll typically get 8–15 clusters and 10–20% noise.
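Mapping the returned labels back to review IDs is a dictionary pass; label -1 is HDBSCAN's noise bucket and stays separate:

```python
from collections import defaultdict

def group_clusters(signals: list[dict], labels) -> dict[int, list[str]]:
    """Map each cluster label to the review_ids it contains (-1 = noise)."""
    clusters = defaultdict(list)
    for signal, label in zip(signals, labels):
        clusters[int(label)].append(signal["review_id"])
    return dict(clusters)

# Toy example with hand-written labels
toy_signals = [{"review_id": f"review_{i:03d}"} for i in range(6)]
toy_labels = [0, 0, 1, -1, 1, 0]
clusters = group_clusters(toy_signals, toy_labels)
# clusters[0] -> ['review_000', 'review_001', 'review_005']
```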

Example Clustering Result

| # | Cluster | Signals | Avg strength | Top review_ids |
|---|---------|---------|--------------|----------------|
| 1 | Offline route access | 34 | 4.2 | 042, 089, 156… |
| 2 | Personalization for group composition | 28 | 3.8 | 017, 033, 091… |
| 3 | Booking integration | 22 | 3.5 | 005, 067, 134… |
| 4 | In-route navigation | 19 | 3.9 | 023, 078, 145… |
| 5 | Trip budget tracking | 17 | 3.3 | 011, 056, 112… |
| 6 | Social planning (shared access) | 15 | 3.6 | 028, 073, 167… |
| 7 | Local vs tourist recommendations | 14 | 4.0 | 009, 048, 133… |
| 8 | Real-time route adaptation | 12 | 3.7 | 036, 082, 159… |

8 clusters. From these, 5 JTBDs need to be derived.

Step 3: Formulating JTBDs from Clusters

Each cluster contains a repeating pattern: “situation + intent + expectation.” The LLM’s task at this step is to transform the pattern into a JTBD statement.

JTBD Formulation Prompt

You are a product strategist. Before you are the results of clustering
user reviews. For each cluster, formulate a JTBD.

JTBD format:
"When [situation], I want to [action/capability],
so that [expected outcome]."

Rules:
1. The situation must be specific, not abstract
2. The action is formulated from the user's perspective
3. The outcome describes end value, not a feature
4. If two clusters describe the same job - merge them
5. Rank by: (number of signals * average signal_strength)

Clusters:
{clusters_json}

For each JTBD, provide:
- JTBD statement
- Score (count * strength)
- Sources: clusters and review_ids
- Current state: how the user solves this problem today
- Possible solutions (2-3 options, from simple to complex)

Example Output: 5 JTBDs

JTBD #1 (Score: 142.8) When I’m traveling in an area without stable internet, I want full access to my planned route, so I can stay independent of connectivity and follow my plan.

  • Clusters: 1, 4
  • Current solution: Google Maps screenshots, Notes
  • Options: route offline cache → map + POI preloading → full offline mode with sync

JTBD #2 (Score: 106.4) When planning a trip with family or friends, I want to account for each participant’s interests and constraints, so the route works for everyone.

  • Clusters: 2, 6
  • Current solution: Google Docs wish list, group chat voting
  • Options: group composition filters → participant profiles with preferences → AI route optimization for the whole group

JTBD #3 (Score: 77.0) When booking hotels and tickets, I want to do it directly from my itinerary, so I don’t have to switch between apps and lose context.

  • Clusters: 3
  • Current solution: copying names from the app into Booking/other platforms
  • Options: deep links to booking platforms → built-in price comparison → one-tap booking

JTBD #4 (Score: 56.0) When choosing places to visit, I want recommendations from locals — not standard tourist lists — so I can see the real city.

  • Clusters: 7
  • Current solution: Reddit, expat blogs, Telegram group questions
  • Options: “local pick” tag on POIs → local guide partnerships → AI curation from non-tourist sources

JTBD #5 (Score: 44.4) When something goes off-plan during a trip (closed, weather, delay), I want to quickly adapt my route, so I don’t waste time on manual replanning.

  • Clusters: 8
  • Current solution: searching for alternatives in Google Maps on the spot
  • Options: “suggest alternative” button → auto-detection of closed venues → real-time rerouting with context awareness
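The ranking rule from the formulation prompt (signal count × average strength) is worth recomputing outside the LLM as a sanity check on the reported scores:

```python
def score_cluster(count: int, avg_strength: float) -> float:
    """Score = number of signals * average signal_strength, per the prompt."""
    return round(count * avg_strength, 1)

# Top clusters from the Step 2 table
clusters = {
    "Offline route access": (34, 4.2),
    "Personalization for group composition": (28, 3.8),
    "Booking integration": (22, 3.5),
}
ranked = sorted(clusters.items(), key=lambda kv: score_cluster(*kv[1]), reverse=True)
# score_cluster(34, 4.2) -> 142.8
```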

Step 4: Validating the Results

JTBDs produced by an LLM require verification. Three validation methods:

Backward Traceability

Every JTBD should trace back to specific reviews. If you can’t find 5+ reviews that clearly describe this job, the formulation may be a model hallucination.

Verification prompt:

Here is a JTBD: "{jtbd_statement}"
Here are 10 reviews allegedly related to this JTBD:
{reviews}

For each review, answer:
- Is it relevant to this JTBD? (yes/no/partial)
- Which specific phrase in the review confirms the connection?

Total: how many of the 10 are genuinely relevant?

Threshold: minimum 70% relevant. Below that — the JTBD needs reformulation or is based on weak signals.
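That threshold can be enforced mechanically once the per-review verdicts come back. A small sketch; counting a "partial" as half a match is my assumption, not part of the original threshold:

```python
def traceability_check(verdicts: list[str], threshold: float = 0.7) -> bool:
    """True if enough sampled reviews genuinely support the JTBD.

    `verdicts` are the per-review answers from the verification prompt:
    "yes", "no", or "partial" (counted as half a match - an assumption).
    """
    weight = {"yes": 1.0, "partial": 0.5, "no": 0.0}
    score = sum(weight[v] for v in verdicts) / len(verdicts)
    return score >= threshold

traceability_check(["yes"] * 7 + ["no"] * 3)  # -> True (70% relevant)
```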

Overlap Test

Two JTBDs should not describe the same job in different words. Check:

Here are 5 JTBDs. Check each pair for overlap:
- Do they describe the same situation?
- Do they describe the same intent?
- Can they be merged without losing meaning?

For each pair: overlap score from 0 (completely different) to 1 (duplicate).

Overlap above 0.6 — candidates for merging.
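With pairwise scores in hand, flagging merge candidates is mechanical. The dict-of-pairs format below is an assumption about how you parse the LLM's answer:

```python
def merge_candidates(overlaps: dict[tuple[int, int], float],
                     threshold: float = 0.6) -> list[tuple[int, int]]:
    """Return JTBD pairs whose overlap score is above the merge threshold."""
    return [pair for pair, score in overlaps.items() if score > threshold]

overlaps = {(1, 2): 0.2, (1, 3): 0.1, (2, 3): 0.75}
merge_candidates(overlaps)  # -> [(2, 3)]
```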

Actionability Test

A JTBD is useless if no solution follows from it. For each JTBD, answer: “Can a minimal solution for this job be built in 2 weeks?” If not — the JTBD is too abstract and needs decomposition.

Full Workflow: 2 Hours

| Stage | Time | What to do |
|-------|------|------------|
| Data collection and preparation | 30 min | Export reviews, normalize format, deduplicate |
| Signal extraction (Step 1) | 20 min | 10 batches of 20 reviews, parallel LLM calls |
| Clustering (Step 2) | 20 min | LLM clustering or embeddings + HDBSCAN |
| JTBD formulation (Step 3) | 15 min | One LLM call based on clusters |
| Validation (Step 4) | 25 min | Backward traceability, overlap, actionability |
| Formatting the result | 10 min | Document with JTBDs, scores, links to reviews |

Total: ~2 hours. Manual equivalent: 2–3 weeks.
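The deduplication in the preparation stage can be as simple as a normalized-text key. This sketch drops exact duplicates only; fuzzy near-duplicate detection is out of scope here:

```python
def deduplicate(reviews: list[dict]) -> list[dict]:
    """Keep the first review for each normalized text, drop exact repeats."""
    seen = set()
    unique = []
    for review in reviews:
        key = " ".join(review["text"].lower().split())  # case/whitespace-insensitive
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique

sample = [
    {"id": "review_001", "text": "App crashed offline"},
    {"id": "review_002", "text": "app  crashed offline "},  # duplicate after normalization
    {"id": "review_003", "text": "Great route planner"},
]
deduped = deduplicate(sample)
# [r["id"] for r in deduped] -> ['review_001', 'review_003']
```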

The key advantage over a manual process is reproducibility. Run the same dataset again at a low temperature (e.g. 0.3) and you get largely consistent results, because the LLM applies the same extraction and clustering logic each time. Two human reviewers analyzing the same reviews tend to diverge significantly, especially when interpreting ambiguous signals.

Model Selection and Cost

Different models suit different steps:

| Step | Recommended model | Why |
|------|-------------------|-----|
| Signal extraction | Claude Sonnet, GPT-4o | Needs precise format adherence + nuance extraction |
| Clustering | Claude Opus, o3 | Needs reasoning, semantic similarity understanding |
| JTBD formulation | Claude Opus, o3 | Strategic thinking, synthesis |
| Validation | Claude Sonnet, GPT-4o | Comparison and verification, no deep analysis needed |

Approximate cost of processing 200 reviews:

  • Extraction: ~150K input tokens, ~80K output tokens → $1.5–3
  • Clustering: ~100K input, ~20K output → $1–2
  • Formulation + validation: ~50K input, ~15K output → $0.5–1

Total: $3–6 for the full cycle. For comparison: one day of a product analyst’s work costs $300–500.

Careful context engineering is critical: the more precisely the system prompt describes the expected format and constraints, the fewer correction iterations you need.

Common Mistakes

Analyzing without product context. The LLM doesn’t know what the app can already do. Without a description of current functionality, the model generates JTBDs for features that already exist. Solution: add a brief description of the product and its current capabilities to the system prompt.

Overly abstract JTBDs. “When planning a trip, I want a convenient tool so everything is simple” — that’s a wish, not a JTBD. Solution: specify a minimum level of situational specificity in the prompt.

Ignoring positive reviews. JTBDs can be extracted from five-star reviews too — these are the jobs the product already closes. That’s validation of your current strategy. Filtering to only negative reviews means losing half the picture.

Clustering by words instead of meaning. “The map is slow” and “the route won’t load offline” are different words, same cluster (route access). LLM clustering handles this better than keyword matching, but sometimes groups by surface similarity. Manual review of the top 3 clusters is mandatory.

Getting Started

  1. Collect 100–200 reviews from the primary channel (App Store, Zendesk, whichever source has the highest volume)
  2. Normalize to JSON format with fields id, source, text, rating, date
  3. Manually run Step 1 (signal extraction) on 20 reviews. Check extraction quality
  4. If quality is acceptable — automate batching and run the full dataset
  5. Perform clustering (LLM clustering is simpler for a first run)
  6. Formulate JTBDs and validate with backward traceability

Minimum stack: one LLM API (Claude or GPT-4o), a Python script for batching, Google Sheets for the final result. No specialized tools required.

The output: a document with 5–8 ranked JTBDs, each with a score, source reviews, and solution options. This is the foundation for a product roadmap that can be refreshed monthly as new reviews come in.