RICE Scoring with AI: How to Prioritize Your Backlog When Everything Feels Urgent
What is RICE scoring?
RICE scoring is a product prioritization framework created by Sean McBride at Intercom that ranks features by dividing Reach × Impact × Confidence by Effort. It matters because it replaces subjective "everything is important" arguments with a reproducible number, forcing teams to decompose vague urgency into four measurable components. A feature with high impact but low confidence or high effort will score below a smaller feature with proven reach and minimal implementation cost — which is precisely the point.
TL;DR
- RICE formula: (Reach × Impact × Confidence) / Effort. A feature touching 45,000 users with Impact 1, Confidence 100%, and Effort 0.5 person-months scores 90,000, beating a feature with Impact 2 but Effort 8.
- Manual RICE scoring at 30 minutes per feature means 25 hours for a 50-item backlog; AI reduces this to about 15 minutes for the entire batch.
- LLMs inflate Reach by default and underestimate Effort (by ~40% in this article’s example). Correct both with historical adoption data and an explicit effort multiplier in the prompt.
- Add a Strategic Alignment column (0–5) and multiply RICE Score by (1 + alignment/10) to account for quarterly OKRs without overriding the base math.
- Re-score at least quarterly: RICE scores go stale as MAU grows, competitive signals change, and new data arrives from released features.
Without a formalized system, prioritization turns into politics. Whoever shouts loudest at standup wins. Or whoever’s request came from the biggest customer. Or whoever posted last in Slack.
RICE scoring fixes this with numbers. Four parameters, one formula, an objective rank for every feature. With AI, you compress the whole process from hours of manual estimation to 15 minutes of automated scoring.
What RICE Is and Why It Beats Intuition
RICE was created by Sean McBride at Intercom in 2016. The framework has four components:
Reach — how many users the feature will affect over a given period. Not “everyone” or “a lot.” Give it a number: 5,000 users per month, 300 new signups per quarter, 100% of the active base.
Impact — how strongly the feature will affect each user it touches. Scale from 0.25 to 3:
- 3 = massive (completely transforms the experience)
- 2 = high (noticeable improvement)
- 1 = medium (moderate improvement)
- 0.5 = low (minimal improvement)
- 0.25 = minimal (barely noticeable)
Confidence — how sure you are in the Reach and Impact estimates. A percentage: 100% = you have data, 80% = strong hypothesis, 50% = gut feeling, 20% = pure speculation. Confidence works as a penalty for uncertainty. A feature with Impact 3 and Confidence 20% will score lower than a feature with Impact 1 and Confidence 100%.
Effort — how many person-months the implementation will take. It’s the only parameter in the denominator. The higher the Effort, the lower the priority.
The formula:
RICE Score = (Reach × Impact × Confidence) / Effort
Example: a feature affects 10,000 users, Impact = 2, Confidence = 80%, Effort = 3 person-months.
RICE = (10000 × 2 × 0.8) / 3 = 5333
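The same arithmetic can be wrapped in a small Python helper. This is a minimal sketch: the function name `rice_score` and the input checks are illustrative assumptions, not part of the framework itself.

```python
def rice_score(reach: float, impact: float, confidence: float, effort: float) -> float:
    """Return (Reach * Impact * Confidence) / Effort.

    confidence is a fraction (0.8 for 80%); effort is in person-months.
    """
    if effort <= 0:
        raise ValueError("effort must be positive")
    if not 0 < confidence <= 1:
        raise ValueError("confidence must be a fraction in (0, 1]")
    return reach * impact * confidence / effort


# The worked example above: 10,000 users, Impact 2, Confidence 80%, Effort 3.
print(round(rice_score(10_000, 2, 0.8, 3)))  # 5333
```

Passing Confidence as a fraction rather than a percentage keeps the result on the same scale as the scores quoted in this article.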
RICE’s real power is forcing you to decompose “this is important” into four measurable components. An “important” feature with Reach 200 and Effort 6 loses to a “small” feature with Reach 8,000 and Effort 0.5. That’s the whole point — the numbers don’t lie.
Why Manual RICE Scoring Doesn’t Stick
RICE looks simple on paper. In practice, teams run into three problems.
Estimating Reach requires data. You need analytics reports, user segmentation, and growth forecasts. A product manager spends 20–40 minutes per feature pulling numbers from Amplitude, Mixpanel, or internal dashboards. Multiply that across a full backlog and it’s a painful afternoon.
Impact is subjective. Two product managers will rate the same feature differently. One puts Impact 2, the other puts Impact 1. A twofold difference. Without structured criteria, Impact becomes the same gut feeling RICE was supposed to eliminate.
Scale kills motivation. A backlog of 50 features at 30 minutes each is 25 hours of work. That’s more than three full workdays just for scoring. Most teams score 10–15 features and give up.
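AI helps with all three. Given product context, an LLM produces a reasoned first-pass estimate for each parameter, applies the same criteria to every feature, and covers the whole backlog in minutes. It won’t replace the PM, but it does the heavy lifting on the first pass.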
Prompt for AI Scoring of a Single Feature
Start here. This base prompt works with any LLM — Claude, GPT-5.5, Gemini.
## Role
You are a senior product manager with 10 years of experience prioritizing
backlogs using the RICE framework. All estimates are grounded in data, not intuition.
## Product Context
- Product: [name and brief description]
- MAU: [monthly active users]
- Primary metric: [retention / revenue / activation / other]
- Stage: [pre-PMF / growth / scale]
- Average team size per feature: [number of developers]
## Task
Score the following feature using the RICE framework.
## Feature
Name: [name]
Description: [2–3 sentences about what the feature does]
Target segment: [who will use it]
Available data: [user requests, metrics, research results — everything you have]
## Response Format
For each RICE parameter:
1. Score (number)
2. Justification (2–3 sentences)
3. Confidence in this specific estimate (low/medium/high)
Summary:
- Reach: [number of users per quarter]
- Impact: [0.25 / 0.5 / 1 / 2 / 3]
- Confidence: [20% / 50% / 80% / 100%]
- Effort: [person-months]
- RICE Score: [calculated value]
- Recommendation: one sentence
The “Product Context” block is what makes or breaks the result. Without it, the LLM scores in a vacuum. MAU of 1,000 and MAU of 1,000,000 produce radically different Reach figures. Pre-PMF vs. scale changes Impact. Don’t skip context — it determines everything.
For more on how to effectively pass context to AI models, see the article on context engineering.
Prompt for Batch Backlog Scoring
Scoring one feature at a time misses the point. RICE’s value is in comparison — features ranked against each other, not evaluated in isolation. Here’s a prompt that processes the whole backlog in one call.
## Role
Senior product manager. Task: prioritize backlog using RICE,
ensuring consistency of estimates across features.
## Product Context
- Product: TaskFlow — B2B project management tool
- MAU: 45,000
- Primary metric: Weekly Active Teams (retention)
- Stage: growth (post-PMF, Series A)
- Team: 8 developers, 2 designers
- Average velocity: 1 feature every 2–4 weeks
## Scoring Rules
1. Reach is measured in users per quarter. Use percentage of MAU
if exact data is unavailable.
2. Evaluate Impact by its effect on the primary metric (Weekly Active Teams).
3. Confidence: 100% only with analytics or research data. 80% with strong
qualitative signals (interviews, support requests). 50% with indirect data.
20% — pure hypothesis.
4. Effort in person-months. Include design, development, and QA.
5. Score all features on the same scale. If feature A got Impact 2,
feature B cannot also get Impact 2 if its impact is objectively lower.
## Backlog
[list of features — one per line]
## Response Format
Markdown table:
| # | Feature | Reach | Impact | Confidence | Effort | RICE Score |
Sort by RICE Score (descending).
After the table — 3 sentences: top-3 implementation order recommendations.
Rule 5 is the one that matters most. Without an explicit consistency instruction, LLMs inflate Impact for every feature individually — everything looks important in isolation. When the model sees the full backlog at once, it calibrates estimates against each other.
Example: Backlog of 10 Features with Calculations
Let’s run the batch prompt against a real TaskFlow backlog. Here are 10 features, exactly as sent to Claude:
- Kanban board with drag-and-drop — visual task management
- Slack integration — notifications and task creation from Slack
- Guest access — invite external participants to a project
- Dark mode — light/dark mode toggle
- AI project summary — automated weekly progress report
- Project templates — ready-made structures for common project types
- Gantt chart — timeline and dependency visualization
- Mobile app — native iOS and Android clients
- Custom fields — user-defined task attributes
- Two-factor authentication — 2FA via TOTP
Scoring results:
| # | Feature | Reach | Impact | Confidence | Effort | RICE Score |
|---|---|---|---|---|---|---|
| 1 | Slack Integration | 31,500 | 2 | 80% | 2 | 25,200 |
| 2 | Project Templates | 27,000 | 2 | 80% | 1 | 43,200 |
| 3 | Kanban Board | 40,500 | 2 | 100% | 3 | 27,000 |
| 4 | Custom Fields | 22,500 | 2 | 80% | 1.5 | 24,000 |
| 5 | 2FA | 45,000 | 1 | 100% | 0.5 | 90,000 |
| 6 | Guest Access | 13,500 | 2 | 50% | 2 | 6,750 |
| 7 | Dark Mode | 36,000 | 0.5 | 100% | 0.5 | 36,000 |
| 8 | AI Project Summary | 18,000 | 1 | 50% | 2.5 | 3,600 |
| 9 | Gantt Chart | 9,000 | 2 | 50% | 4 | 2,250 |
| 10 | Mobile App | 27,000 | 2 | 50% | 8 | 3,375 |
Sorted by RICE Score:
| Rank | Feature | RICE Score |
|---|---|---|
| 1 | 2FA | 90,000 |
| 2 | Project Templates | 43,200 |
| 3 | Dark Mode | 36,000 |
| 4 | Kanban Board | 27,000 |
| 5 | Slack Integration | 25,200 |
| 6 | Custom Fields | 24,000 |
| 7 | Guest Access | 6,750 |
| 8 | AI Project Summary | 3,600 |
| 9 | Mobile App | 3,375 |
| 10 | Gantt Chart | 2,250 |
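Since the scores come from a language model, it is worth recomputing them before trusting the ranking. The sketch below re-derives every score from the table's own Reach, Impact, Confidence, and Effort values and then sorts. The names `ROWS` and `check_and_rank` and the 0.5 tolerance are assumptions made for this illustration.

```python
# (feature, reach, impact, confidence, effort, score reported by the model)
ROWS = [
    ("Slack Integration", 31_500, 2, 0.8, 2, 25_200),
    ("Project Templates", 27_000, 2, 0.8, 1, 43_200),
    ("Kanban Board", 40_500, 2, 1.0, 3, 27_000),
    ("Custom Fields", 22_500, 2, 0.8, 1.5, 24_000),
    ("2FA", 45_000, 1, 1.0, 0.5, 90_000),
    ("Guest Access", 13_500, 2, 0.5, 2, 6_750),
    ("Dark Mode", 36_000, 0.5, 1.0, 0.5, 36_000),
    ("AI Project Summary", 18_000, 1, 0.5, 2.5, 3_600),
    ("Gantt Chart", 9_000, 2, 0.5, 4, 2_250),
    ("Mobile App", 27_000, 2, 0.5, 8, 3_375),
]


def check_and_rank(rows, tolerance=0.5):
    """Recompute each score, fail on a mismatch, and return (name, score) sorted descending."""
    ranked = []
    for name, reach, impact, confidence, effort, reported in rows:
        score = reach * impact * confidence / effort
        if abs(score - reported) > tolerance:
            raise ValueError(f"{name}: reported {reported}, recomputed {score:.0f}")
        ranked.append((name, score))
    return sorted(ranked, key=lambda row: row[1], reverse=True)


for rank, (name, score) in enumerate(check_and_rank(ROWS), 1):
    print(f"{rank:>2}  {name:<20} {score:>8,.0f}")
```

All ten rows pass the check, and the resulting order matches the sorted table above.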
A few things worth unpacking here.
2FA won — not because it’s “the most important” feature, but because the math stacked up perfectly. Reach = 100% of the base, Confidence = 100% (it’s a security requirement, no guessing), Effort = 0.5 (TOTP is a standard integration). Low effort plus maximum reach equals a record score.
The mobile app came in second to last. Impact = 2, Reach is high — but Effort = 8 person-months crushed everything. RICE penalizes resource-heavy work by design. That doesn’t mean “don’t build it.” It means “not this quarter.”
AI summary landed at Confidence 50%. The feature looks attractive, but there’s no data behind the demand. RICE separated the hype from validated need.
Calibration: When AI Gets It Wrong
AI scoring doesn’t replace expertise. It’s a first approximation — the PM reviews and adjusts from there. After running this on real backlogs, I’ve seen the same LLM mistakes come up repeatedly:
Inflated Reach. LLMs are optimistic by default. Without constraints, the model assumes a feature will be used by “most users.” Fix: include real adoption data for existing features in the context. “Current new feature adoption: 30–50% of MAU in the first quarter.”
Underestimated Effort. LLMs don’t know about your technical debt, legacy architecture, or organizational friction. Fix: add historical estimate vs. actuals to the context. “Historically our effort estimates are off by 40%. Multiply your estimate by 1.4.”
Blind spots in Impact. LLMs evaluate Impact based on general product knowledge. They won’t know that Slack integration is a key competitive differentiator for your specific product. Fix: spell it out. “We’re losing deals because we don’t have Slack integration. 23% of churned leads cited it as their reason for declining.”
The rule is simple: AI does the first pass. The PM reviews and adjusts. The final call belongs to the human.
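The Reach and Effort corrections can also be applied in code after the model responds, as a guard rail for the PM's review. This is a sketch under stated assumptions: the function name `calibrate`, the 50% adoption ceiling (taken from the upper end of the example adoption range above), and the 1.4 multiplier are illustrative and should be replaced with your own historical numbers.

```python
def calibrate(estimate: dict, mau: int, max_adoption: float = 0.5,
              effort_multiplier: float = 1.4) -> dict:
    """Return a copy with Effort scaled up and a flag when Reach looks inflated."""
    adjusted = dict(estimate)
    # Scale Effort by the historical estimate-vs-actual ratio.
    adjusted["effort"] = estimate["effort"] * effort_multiplier
    # Flag, rather than silently cap, any Reach above the adoption ceiling.
    adjusted["reach_flagged"] = estimate["reach"] > mau * max_adoption
    adjusted["score"] = (
        adjusted["reach"] * adjusted["impact"] * adjusted["confidence"]
        / adjusted["effort"]
    )
    return adjusted


raw = {"name": "Gantt Chart", "reach": 9_000, "impact": 2, "confidence": 0.5, "effort": 4}
result = calibrate(raw, mau=45_000)
print(result["reach_flagged"], round(result["score"]))  # False 1607
```

Flagging instead of capping is a deliberate choice here: a mandatory feature such as 2FA legitimately reaches the whole base, so a human should decide whether a high Reach is an error.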
Advanced Prompt: RICE with Strategic Context
Basic RICE ignores strategic context. A high-scoring feature might conflict with quarterly OKRs, or depend on infrastructure that isn’t ready. This extended prompt addresses that.
## Role
VP of Product. Prioritizing backlog with quarterly strategy in mind.
## Strategic Context
- Quarterly OKRs:
  1. Increase Week 8 retention from 34% to 42%
  2. Enter the enterprise segment (500+ employees)
  3. Reduce support load by 20%
- Strategic bets: enterprise readiness, self-serve onboarding
- Anti-goals: DO NOT expand mobile experience this quarter
## Task
Calculate RICE Score for each feature. Then add a
"Strategic Alignment" column (0–5), where:
- 5 = directly drives OKRs
- 3 = indirectly related to OKRs
- 1 = unrelated to OKRs
- 0 = conflicts with anti-goals
Adjusted Score = RICE Score × (1 + Strategic Alignment / 10)
Rank features by Adjusted Score, descending.
## Format
Table with columns: Feature | RICE Score | Strategic Alignment |
Adjusted Score | Rank
The multiplier (1 + Strategic Alignment / 10) gives a 0–50% bonus for strategic fit. A feature with RICE 20,000 and Strategic Alignment 5 gets an Adjusted Score of 30,000. A feature with RICE 25,000 and Strategic Alignment 0 stays at 25,000. Strategy tilts the math — it doesn’t flip it upside down.
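The adjustment is easy to verify in code. A minimal sketch, where the function name `adjusted_score` and the range check are assumptions made for illustration:

```python
def adjusted_score(rice: float, alignment: int) -> float:
    """Apply the strategic bonus: RICE * (1 + alignment / 10), alignment in 0..5."""
    if not 0 <= alignment <= 5:
        raise ValueError("Strategic Alignment must be between 0 and 5")
    return rice * (1 + alignment / 10)


print(adjusted_score(20_000, 5))  # 30000.0
print(adjusted_score(25_000, 0))  # 25000.0
```

Because the bonus tops out at 50%, a feature can only overtake another whose base RICE score is at most 1.5 times its own.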
Automating RICE Scoring via API
Running prompts manually works for one-off sessions. But if you want scoring to happen automatically on every backlog update, you need the API. Here’s how the full pipeline looks:
Step 1: Structured Backlog
Store the backlog as JSON. Each feature needs the fields below for scoring.
{
  "product_context": {
    "name": "TaskFlow",
    "mau": 45000,
    "primary_metric": "weekly_active_teams",
    "stage": "growth",
    "team_size": 10,
    "avg_feature_adoption": 0.35,
    "effort_multiplier": 1.4
  },
  "features": [
    {
      "id": "F-101",
      "name": "Slack Integration",
      "description": "Bidirectional sync: notifications to Slack, task creation from Slack",
      "segment": "all_teams",
      "signals": [
        "23% of churned leads mentioned lack of Slack integration",
        "Top-3 requested feature in NPS surveys (n=340)"
      ]
    }
  ]
}
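Before sending a file like this to the model, a quick structural check catches missing fields early. The sketch below is one possible validator. Which fields count as required is an assumption made here (the JSON above does not mark any as mandatory), and the name `validate_backlog` is hypothetical.

```python
REQUIRED_CONTEXT = {"name", "mau", "primary_metric", "stage", "effort_multiplier"}
REQUIRED_FEATURE = {"id", "name", "description"}


def validate_backlog(backlog: dict) -> list[str]:
    """Return human-readable problems; an empty list means the backlog looks usable."""
    problems = []
    missing = REQUIRED_CONTEXT - set(backlog.get("product_context", {}))
    if missing:
        problems.append(f"product_context is missing: {sorted(missing)}")
    seen_ids = set()
    for index, feature in enumerate(backlog.get("features", [])):
        missing = REQUIRED_FEATURE - set(feature)
        if missing:
            problems.append(f"feature #{index} is missing: {sorted(missing)}")
        feature_id = feature.get("id")
        if feature_id in seen_ids:
            problems.append(f"duplicate feature id: {feature_id}")
        seen_ids.add(feature_id)
    return problems


backlog = {
    "product_context": {"name": "TaskFlow", "mau": 45000, "stage": "growth"},
    "features": [
        {"id": "F-101", "name": "Slack Integration", "description": "Bidirectional sync"},
        {"id": "F-101", "name": "Dark Mode"},
    ],
}
for problem in validate_backlog(backlog):
    print(problem)
```

Returning a list of problems instead of raising on the first one lets you fix the whole file in a single pass.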
Step 2: Batch Scoring Script
import json

from anthropic import Anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = Anthropic()


def score_backlog(backlog_path: str) -> list[dict]:
    """Send the whole backlog to the model and return one scored dict per feature."""
    with open(backlog_path) as f:
        backlog = json.load(f)

    system_prompt = """You are a senior product manager scoring features
using RICE framework. Return valid JSON only."""

    user_prompt = f"""
Score each feature using RICE framework.

Product context:
{json.dumps(backlog['product_context'], indent=2)}

Features to score:
{json.dumps(backlog['features'], indent=2)}

Rules:
- Reach: users per quarter, based on MAU and segment size
- Impact: 0.25 / 0.5 / 1 / 2 / 3
- Confidence: 0.2 / 0.5 / 0.8 / 1.0
- Effort: person-months, multiply by {backlog['product_context']['effort_multiplier']}
- Score = (Reach * Impact * Confidence) / Effort

Return JSON array:
[{{
  "id": "F-101",
  "name": "...",
  "reach": 0,
  "impact": 0,
  "confidence": 0,
  "effort": 0,
  "score": 0,
  "reasoning": "..."
}}]
"""

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )
    # json.loads raises JSONDecodeError if the model adds any text around the JSON.
    return json.loads(response.content[0].text)


def rank_features(scored: list[dict]) -> list[dict]:
    """Sort scored features from highest to lowest RICE score."""
    return sorted(scored, key=lambda f: f["score"], reverse=True)


if __name__ == "__main__":
    scored = score_backlog("backlog.json")
    ranked = rank_features(scored)
    print(f"{'Rank':<5} {'Feature':<30} {'Score':>10}")
    print("-" * 47)
    for i, f in enumerate(ranked, 1):
        print(f"{i:<5} {f['name']:<30} {f['score']:>10,.0f}")
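Model replies are not guaranteed to be clean JSON or to contain correct arithmetic, so a parsing step between the API call and the ranking is a reasonable safeguard. The sketch below tolerates a Markdown code fence around the JSON, checks Impact and Confidence against the scales used in the prompt, and recomputes the score. The name `parse_scores` and the choice to overwrite the model's `score` are assumptions of this example, not part of the script above.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker
ALLOWED_IMPACT = {0.25, 0.5, 1, 2, 3}
ALLOWED_CONFIDENCE = {0.2, 0.5, 0.8, 1.0}


def parse_scores(raw_text: str) -> list[dict]:
    """Parse the model reply, validate each item, and recompute its score."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the surrounding fence and an optional "json" language tag.
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    items = json.loads(text)
    for item in items:
        if item["impact"] not in ALLOWED_IMPACT:
            raise ValueError(f"{item['id']}: impact {item['impact']} is off the scale")
        if item["confidence"] not in ALLOWED_CONFIDENCE:
            raise ValueError(f"{item['id']}: confidence {item['confidence']} is off the scale")
        if item["effort"] <= 0:
            raise ValueError(f"{item['id']}: effort must be positive")
        # Recompute instead of trusting the model's arithmetic.
        item["score"] = item["reach"] * item["impact"] * item["confidence"] / item["effort"]
    return items


reply = (
    FENCE + "json\n"
    '[{"id": "F-101", "reach": 31500, "impact": 2, "confidence": 0.8, "effort": 2, "score": 99999}]\n'
    + FENCE
)
print(round(parse_scores(reply)[0]["score"]))  # 25200
```

In the script above, this function could take the place of the bare `json.loads` call on the response text.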
Step 3: Sync with Issue Tracker
Write the scores back to Jira, Linear, or Notion via API. Example for Linear:
def sync_scores_to_linear(scored_features: list):
    for feature in scored_features:
        # update_issue is a placeholder for your own wrapper around the
        # tracker's API; it is not defined in this article.
        # It should update the "RICE Score" custom fields on the issue.
        update_issue(
            issue_id=feature["id"],
            custom_fields={
                "rice_score": feature["score"],
                "rice_reach": feature["reach"],
                "rice_impact": feature["impact"],
                "rice_confidence": feature["confidence"],
                "rice_effort": feature["effort"],
            },
        )
The full loop: backlog updated in the tracker → script fetches new features → LLM scores → results written back → dashboard shows the current ranking. Set it up once and scoring just happens.
Common RICE Scoring Mistakes
Mixing Reach units. One feature is measured in “users,” another in “companies,” a third in “requests.” Reach must use consistent units across the whole backlog. Users per quarter works for B2C; teams or accounts work for B2B.
Impact without a metric anchor. “The feature will improve UX” isn’t an Impact estimate. “The feature will increase activation rate by 5–8% for new users” is. Tie Impact to a specific metric or the number is meaningless.
Default Confidence of 80%. Teams assign Confidence 80% to everything because 100% feels arrogant and 50% feels too low. Result: Confidence stops affecting the score at all. Fix it with strict criteria: 100% = A/B test or analytics. 80% = qualitative interviews (n > 10). 50% = support tickets. 20% = “I have a hunch.”
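One way to make these criteria hard to fudge is to encode them, so Confidence is derived from the evidence on hand rather than typed in. A minimal sketch: the `Evidence` enum, the `confidence_for` helper, and the rule of taking the strongest available evidence are assumptions made for this illustration.

```python
from enum import Enum


class Evidence(Enum):
    """Evidence types mapped to the Confidence levels listed above."""
    AB_TEST_OR_ANALYTICS = 1.0
    QUALITATIVE_INTERVIEWS = 0.8  # more than 10 interviews
    SUPPORT_TICKETS = 0.5
    HUNCH = 0.2


def confidence_for(evidence: list[Evidence]) -> float:
    """Use the strongest evidence available; with none, fall back to a hunch."""
    return max((item.value for item in evidence), default=Evidence.HUNCH.value)


print(confidence_for([Evidence.SUPPORT_TICKETS, Evidence.QUALITATIVE_INTERVIEWS]))  # 0.8
print(confidence_for([]))  # 0.2
```

With this in place, assigning 80% requires naming the interviews that justify it.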
Ignoring design and QA in Effort. Effort ≠ development only. Add design, QA, documentation, data migration. Effort in RICE covers the full cycle from kickoff to release.
Using RICE as your only tool. RICE doesn’t model dependencies between features, regulatory requirements (GDPR, SOC2), or technical prerequisites. A feature with RICE Score 100 is worthless if it depends on infrastructure sitting at RICE Score 10. Use RICE for ranking, then build the roadmap with dependencies in mind.
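A simple way to combine the two is a greedy ordering: always schedule the highest-scoring feature whose prerequisites are already scheduled. The sketch below is one such approach, not a method prescribed by RICE. The function name `roadmap_order` is hypothetical, and the dependency of Kanban Board on Custom Fields is invented purely for illustration.

```python
def roadmap_order(scores: dict[str, float], depends_on: dict[str, set[str]]) -> list[str]:
    """Greedy order: pick the highest-scoring feature whose prerequisites are done."""
    done: list[str] = []
    remaining = set(scores)
    while remaining:
        ready = [f for f in remaining if depends_on.get(f, set()) <= set(done)]
        if not ready:
            raise ValueError(f"circular or unknown dependency among: {sorted(remaining)}")
        # Break score ties by name so the result is deterministic.
        best = max(ready, key=lambda f: (scores[f], f))
        done.append(best)
        remaining.remove(best)
    return done


scores = {"Kanban Board": 27_000, "Slack Integration": 25_200, "Custom Fields": 24_000}
depends_on = {"Kanban Board": {"Custom Fields"}}  # hypothetical dependency
print(roadmap_order(scores, depends_on))
# ['Slack Integration', 'Custom Fields', 'Kanban Board']
```

Kanban Board has the highest score of the three but lands last, because its assumed prerequisite has to ship first.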
Scoring only once. RICE scores go stale. MAU grows, priorities shift, new data arrives. Re-score at least once per quarter. With automation, you can re-score every time the backlog changes.
Template for a Quick Start
Here’s what you actually need to get started in 15 minutes:
- Describe product context (5 fields: name, MAU, metric, stage, team size)
- Export the backlog as text (name + description + available data per feature)
- Use the batch prompt from this article
- Send to LLM, get back a ranked table
- Review the top 5 and bottom 5, fix obvious mistakes
After 2–3 iterations of adjusting the context and rules, the prompt is calibrated to your product. Estimates get better as you feed in more historical data: real adoption rates, actual effort vs. estimate, post-release feedback.
Summary
RICE scoring turns “everything is important” into a ranked list with justifications. AI removes the main bottleneck — the time spent manually estimating each feature.
The formula doesn’t change: (Reach × Impact × Confidence) / Effort. The prompts in this article give you consistent estimates across the full backlog. API automation makes re-scoring something you can do without thinking about it.
Five features, one prompt, 15 minutes. That’s enough to validate whether AI scoring aligns with expert judgment. Once calibrated, the same process scales to the full backlog and re-runs every quarter without additional effort.
Need help with AI-powered backlog prioritization? I help startups build AI products and automate processes — belov.works.
FAQ
How do you handle features with no existing data — new markets or totally unproven ideas?
Set Confidence to 20% and document that explicitly. With equal Reach and Effort, a speculative feature with Impact 3 and Confidence 20% scores the same as a validated feature with Impact 1 and Confidence 60%, because 3 × 0.2 = 1 × 0.6 = 0.6. That works out to roughly 1,800–2,700 when Reach divided by Effort is 3,000–4,500. This doesn’t hide the speculation; it quantifies it. If a 20%-confidence feature still ranks in your top 5, it’s worth a small discovery sprint to move that number before committing to full development.
Can RICE scoring work for B2B products where “Reach” means companies, not individual users?
Yes, but you must pick one unit and stick to it across the entire backlog. For B2B, use accounts or teams as the Reach unit — “1,200 active accounts” — rather than individual seats. Impact should then map to account-level outcomes (activation, expansion, churn reduction) rather than per-user experience. Mixing units (some features scored in users, others in accounts) destroys comparability and makes the ranking meaningless.
What should I do when the RICE ranking conflicts with my strategic roadmap?
Use the Strategic Alignment multiplier from the article rather than overriding the RICE score manually. If a high-RICE feature conflicts with your quarter’s OKRs, a low Strategic Alignment (0–1) earns it little or no bonus, so features that support the OKRs can move past it in the adjusted ranking when their base scores are close. Manual override undermines the system’s credibility: once people know the PM can move any feature up, the numbers stop mattering and RICE becomes theater.