PLG Playbook: Building an AI-Powered PQL Scoring Model

PLG companies convert free-to-paid more effectively than sales-led ones. But only when sales gets the right leads at the right time. Most teams still dump every registered user into their CRM — including the ones who signed up, poked around once, and disappeared.

A Product Qualified Lead is fundamentally different from an MQL. An MQL filled out a form. A PQL has used the product and shown behavior that predicts purchase. This article walks through building a PQL scoring model from scratch: activation event definition, SQL-based scoring, LLM classification, and CRM sync.

What PQL Scoring Is and Why MQL Scoring Fails in PLG

MQL scoring works on demographics and marketing interactions: job title, company size, downloaded a whitepaper, attended a webinar. In a PLG model, these signals are nearly useless. A user might register with a Gmail address, never touch a marketing email, and still open the product every day.

PQL scoring uses product usage as the primary signal. Three categories matter:

Activation signals. The user has completed key actions that correlate with long-term retention. For Slack, that’s a team exchanging 2,000 messages. For Dropbox, uploading from one device and accessing from another. Activation events are different for every product — you can’t borrow them from a case study.

Engagement depth. Frequency, feature breadth, time in product. Not “logged in 5 times” but “used 3+ core features in the last 7 days.”

Expansion signals. Invites teammates, creates a team, hits free-plan limits, exports data. These actions show the product is delivering value at the organizational level — which is what B2B buyers care about.

Identifying Activation Events with LLMs

The first step is finding activation events for your specific product. The standard approach: retrospective cohort analysis. Take users who converted to paid and compare their first-14-days behavior against users who churned. This presupposes a clean event taxonomy — without consistent event names and properties, cohort comparisons return noise.

SQL query for a basic cohort breakdown (if you’re new to SQL, the SQL for product managers guide covers the patterns used below):

WITH converted_users AS (
    SELECT user_id, MIN(subscription_start) AS conversion_date
    FROM subscriptions
    WHERE plan != 'free'
    GROUP BY user_id
),
user_events AS (
    SELECT
        e.user_id,
        e.event_name,
        COUNT(*) AS event_count,
        COUNT(DISTINCT DATE(e.created_at)) AS active_days,
        CASE WHEN cu.user_id IS NOT NULL THEN 'converted' ELSE 'churned' END AS cohort
    FROM events e
    LEFT JOIN converted_users cu ON e.user_id = cu.user_id
    -- Keep only pre-conversion events so post-purchase behavior doesn't leak in
    WHERE (cu.conversion_date IS NULL OR e.created_at <= cu.conversion_date)
      -- First 14 days after signup (assumes events carries a user_created_at column)
      AND e.created_at >= e.user_created_at
      AND e.created_at <= e.user_created_at + INTERVAL '14 days'
    GROUP BY e.user_id, e.event_name, cohort
)
SELECT
    event_name,
    cohort,
    AVG(event_count) AS avg_count,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY event_count) AS median_count,
    COUNT(DISTINCT user_id) AS users
FROM user_events
GROUP BY event_name, cohort
ORDER BY event_name, cohort;

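This query compares converted users against everyone else (labeled churned) across their first 14 days. The output is a table of average and median event counts per cohort. It shows where the cohorts differ, not whether the difference is statistically significant, and the averages cover only users who fired the event at least once.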
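A quick way to shortlist events before any deeper analysis is to rank them by the ratio of converted to churned averages. The sketch below assumes the query result has been loaded as a list of dicts with the same column names; the function name and the sample rows are made up for illustration.

```python
from collections import defaultdict


def rank_events_by_lift(rows):
    """Rank events by lift = converted avg_count / churned avg_count.

    rows: dicts with the query's columns (event_name, cohort, avg_count, users).
    Returns (event_name, lift) pairs, highest lift first.
    """
    by_event = defaultdict(dict)
    for row in rows:
        by_event[row["event_name"]][row["cohort"]] = row

    ranked = []
    for event, cohorts in by_event.items():
        if "converted" not in cohorts or "churned" not in cohorts:
            continue  # event seen in only one cohort: no ratio to compute
        churned_avg = float(cohorts["churned"]["avg_count"])
        if churned_avg == 0:
            continue
        lift = float(cohorts["converted"]["avg_count"]) / churned_avg
        ranked.append((event, round(lift, 2)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up sample rows, not real results
sample_rows = [
    {"event_name": "project_created", "cohort": "converted", "avg_count": 4.0, "users": 120},
    {"event_name": "project_created", "cohort": "churned", "avg_count": 1.0, "users": 900},
    {"event_name": "profile_viewed", "cohort": "converted", "avg_count": 2.2, "users": 130},
    {"event_name": "profile_viewed", "cohort": "churned", "avg_count": 2.0, "users": 1100},
]
print(rank_events_by_lift(sample_rows))
```

A high ratio on a handful of users is weak evidence, so it is worth reading the users column alongside the lift.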

The catch: the table has dozens of events, and correlation doesn’t imply causation. An LLM helps cut through the noise — interpreting the data and generating hypotheses worth testing.

Prompt for analyzing activation events:

You are a product analyst. Analyze cohort analysis data for a SaaS product.

Product context: [product description, main use cases, key features]

Data (event_name | cohort | avg_count | median_count | users):
[paste SQL query result here]

Tasks:
1. Identify 3-5 events with the greatest difference between the converted and churned cohorts
2. For each event, suggest a threshold value at which conversion probability significantly increases
3. Exclude events that are a consequence of conversion, not a predictor
4. Suggest event combinations (activation milestones) that form an "aha moment"

Response format: JSON with fields event_name, threshold, confidence, reasoning

LLMs don’t replace statistical analysis. They speed up interpretation and surface hypotheses you’d then validate through A/B tests. The output of this prompt is a working set of activation events with threshold values — a starting point, not a final answer.
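Because the model can return malformed items or invent event names, it may help to filter its answer before anyone acts on it. This is a minimal sketch: the function name is hypothetical, and it assumes the response is either a JSON list or an object wrapping the list under an "events" key, which the prompt itself does not guarantee.

```python
import json

REQUIRED_FIELDS = {"event_name", "threshold", "confidence", "reasoning"}


def parse_activation_suggestions(raw_response, known_events):
    """Keep only well-formed suggestions that refer to events in your taxonomy."""
    data = json.loads(raw_response)
    # Assumption: a bare list, or an object with the list under "events"
    items = data.get("events", []) if isinstance(data, dict) else data

    accepted = []
    for item in items:
        if not isinstance(item, dict) or not REQUIRED_FIELDS.issubset(item):
            continue  # missing one of the fields requested in the prompt
        if item["event_name"] not in known_events:
            continue  # the model named an event that does not exist
        accepted.append(item)
    return accepted
```

Pass the set of event names from your own event table as known_events; anything the model suggests outside that set is dropped.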

PQL Scoring Model Architecture

A PQL score is a number from 0 to 100. It’s composed of three components with different weights:

Component	Weight	Data Source
Activation score	40%	Product events
Engagement score	35%	Usage analytics
Firmographic score	25%	Enrichment data

Activation Score

Binary check: did the user complete an activation event or not? Each event carries its own weight within the component.

WITH activation_checks AS (
    SELECT
        u.user_id,
        u.email,
        u.company_domain,
        -- Activation event 1: created a project
        MAX(CASE WHEN e.event_name = 'project_created' THEN 1 ELSE 0 END) AS created_project,
        -- Activation event 2: invited a teammate
        MAX(CASE WHEN e.event_name = 'team_invite_sent' THEN 1 ELSE 0 END) AS invited_teammate,
        -- Activation event 3: used core feature 3+ times
        CASE WHEN COUNT(CASE WHEN e.event_name = 'core_feature_used' THEN 1 END) >= 3
             THEN 1 ELSE 0 END AS used_core_feature,
        -- Activation event 4: connected an integration
        MAX(CASE WHEN e.event_name = 'integration_connected' THEN 1 ELSE 0 END) AS connected_integration
    FROM users u
    LEFT JOIN events e ON u.user_id = e.user_id
        AND e.created_at >= u.created_at
        AND e.created_at <= u.created_at + INTERVAL '14 days'
    GROUP BY u.user_id, u.email, u.company_domain
)
SELECT
    user_id,
    email,
    company_domain,
    ROUND(
        (created_project * 30 +
         invited_teammate * 30 +
         used_core_feature * 25 +
         connected_integration * 15)
    ) AS activation_score
FROM activation_checks;

Weights come from the cohort analysis: the more strongly an event separates converted users from churned ones, the more points it carries. invited_teammate tends to carry a high weight because bringing in colleagues is one of the strongest conversion predictors in B2B SaaS: it means the product is spreading inside an organization.

Engagement Score

Measures depth and frequency of use. Unlike the activation score, this one’s continuous — recalculated every day as behavior changes.

WITH daily_usage AS (
    SELECT
        user_id,
        COUNT(DISTINCT DATE(created_at)) AS active_days_last_14,
        COUNT(DISTINCT event_name) AS unique_features_used,
        COUNT(*) AS total_events,
        MAX(created_at) AS last_active_at
    FROM events
    WHERE created_at >= CURRENT_DATE - INTERVAL '14 days'
    GROUP BY user_id
),
engagement_scored AS (
    SELECT
        user_id,
        -- Frequency: active days out of 14
        LEAST(active_days_last_14 / 14.0 * 100, 100) AS frequency_score,
        -- Breadth: unique features used (normalized to total feature count)
        LEAST(unique_features_used / 8.0 * 100, 100) AS breadth_score,
        -- Recency: penalty for inactivity
        CASE
            WHEN last_active_at >= CURRENT_DATE - INTERVAL '1 day' THEN 100
            WHEN last_active_at >= CURRENT_DATE - INTERVAL '3 days' THEN 75
            WHEN last_active_at >= CURRENT_DATE - INTERVAL '7 days' THEN 40
            ELSE 10
        END AS recency_score
    FROM daily_usage
)
SELECT
    user_id,
    ROUND(frequency_score * 0.4 + breadth_score * 0.35 + recency_score * 0.25) AS engagement_score
FROM engagement_scored;

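The breadth_score normalization depends on how many core features you track. Replace the 8 in the example with your actual feature count. Note that the query counts distinct event names, so filter events down to core-feature events (or count a feature property instead) if your taxonomy also logs logins, page views, and similar noise.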
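The same arithmetic in Python is handy for unit-testing the bands before changing them in SQL. This is a sketch that mirrors the query above; the function names are hypothetical, and it works on calendar dates rather than timestamps.

```python
from datetime import date


def recency_score(last_active, today):
    """Same bands as the SQL CASE expression, measured in whole days."""
    days_idle = (today - last_active).days
    if days_idle <= 1:
        return 100
    if days_idle <= 3:
        return 75
    if days_idle <= 7:
        return 40
    return 10


def engagement_score(active_days, features_used, last_active, today, total_features=8):
    """Weighted blend of frequency (40%), breadth (35%), and recency (25%)."""
    frequency = min(active_days / 14 * 100, 100)
    breadth = min(features_used / total_features * 100, 100)
    recency = recency_score(last_active, today)
    return round(frequency * 0.4 + breadth * 0.35 + recency * 0.25)


# 7 active days, 4 of 8 features, last seen two days ago:
# 50 * 0.4 + 50 * 0.35 + 75 * 0.25 = 56.25
print(engagement_score(7, 4, date(2026, 3, 23), date(2026, 3, 25)))
```

Python's round() and the database's ROUND can disagree on exact .5 ties, so treat the Python version as a test aid rather than a second source of truth.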

Firmographic Score

Product usage carries the weight. The remaining 25% comes from company-level data: size, industry, tech stack. Pull it from enrichment services — Clearbit, Apollo, Clay.

SELECT
    u.user_id,
    CASE
        WHEN c.employee_count > 500 THEN 30
        WHEN c.employee_count > 100 THEN 25
        WHEN c.employee_count > 20 THEN 20
        WHEN c.employee_count > 5 THEN 15
        ELSE 5
    END +
    CASE
        WHEN c.industry IN ('technology', 'saas', 'fintech') THEN 25
        WHEN c.industry IN ('ecommerce', 'media', 'education') THEN 20
        WHEN c.industry IN ('healthcare', 'manufacturing') THEN 15
        ELSE 10
    END +
    CASE
        WHEN c.estimated_revenue > 10000000 THEN 25
        WHEN c.estimated_revenue > 1000000 THEN 20
        WHEN c.estimated_revenue > 100000 THEN 15
        ELSE 5
    END +
    CASE
        -- Match on the full domain so addresses like user@notgmail.com still count as corporate
        WHEN u.email NOT LIKE '%@gmail.com'
         AND u.email NOT LIKE '%@yahoo.com'
         AND u.email NOT LIKE '%@hotmail.com' THEN 20
        ELSE 0
    END AS firmographic_score
FROM users u
LEFT JOIN companies c ON u.company_domain = c.domain;

A corporate email gets +20 points. It’s blunt, but it works: work-domain users convert significantly more often than free-email ones.
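If the check also runs outside SQL, comparing the domain exactly against a set of free-mail providers is easier to extend than a chain of LIKE clauses. A minimal sketch, with a hypothetical function name and a deliberately short provider list:

```python
# Assumption: extend this set with whichever free-mail providers your signups use
FREE_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}


def is_work_email(email):
    """True when the address has a domain that is not a known free-mail provider."""
    _, separator, domain = email.strip().lower().rpartition("@")
    if not separator or not domain:
        return False  # not a usable address
    return domain not in FREE_EMAIL_DOMAINS
```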

Final PQL Score and LLM Classification

Composite score:

SELECT
    a.user_id,
    a.email,
    ROUND(
        a.activation_score * 0.40 +
        e.engagement_score * 0.35 +
        f.firmographic_score * 0.25
    ) AS pql_score,
    CASE
        WHEN ROUND(a.activation_score * 0.40 + e.engagement_score * 0.35 + f.firmographic_score * 0.25) >= 75 THEN 'hot'
        WHEN ROUND(a.activation_score * 0.40 + e.engagement_score * 0.35 + f.firmographic_score * 0.25) >= 50 THEN 'warm'
        WHEN ROUND(a.activation_score * 0.40 + e.engagement_score * 0.35 + f.firmographic_score * 0.25) >= 25 THEN 'nurture'
        ELSE 'monitor'
    END AS pql_tier
FROM activation_scores a
JOIN engagement_scores e ON a.user_id = e.user_id
JOIN firmographic_scores f ON a.user_id = f.user_id;

Four tiers:

Hot (75+): Pass to sales immediately. High conversion probability.
Warm (50–74): In-app upsell triggers. Automatic hints about premium features.
Nurture (25–49): Onboarding drip campaigns. Nudge toward activation events.
Monitor (0–24): Observe. Don’t spend sales resources.
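As a worked example of the weighting and the tier cutoffs, here is the same logic in Python. The function names are illustrative; the weights and boundaries are the ones listed above.

```python
TIER_FLOORS = [(75, "hot"), (50, "warm"), (25, "nurture"), (0, "monitor")]


def pql_score(activation, engagement, firmographic):
    """Weighted composite: 40% activation, 35% engagement, 25% firmographic."""
    return round(activation * 0.40 + engagement * 0.35 + firmographic * 0.25)


def pql_tier(score):
    """Map a 0-100 score to the first tier whose floor it reaches."""
    for floor, name in TIER_FLOORS:
        if score >= floor:
            return name
    return "monitor"


# activation 85, engagement 56, firmographic 60:
# 34.0 + 19.6 + 15.0 = 68.6, which rounds to 69 and lands in warm
score = pql_score(85, 56, 60)
print(score, pql_tier(score))
```

Note how a strongly activated user still lands in warm when engagement is middling: no single component can carry a lead into hot on its own.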

The tier distribution is itself a metric worth tracking — a PLG dashboard showing the hot/warm/nurture mix over time reveals whether activation is improving or regressing before it shows up in revenue.

A number handles prioritization. But sales reps need context: why is this user hot, what are they actually doing in the product, what do you lead with on the first call. That’s where LLMs earn their place.

Prompt for generating sales context:

You are a sales intelligence assistant. Based on product usage data, generate a briefing for a sales manager.

User data:
- Email: {email}
- Company: {company_name} ({industry}, {employee_count} employees)
- PQL score: {pql_score} (tier: {pql_tier})
- Completed activation events: {completed_activations}
- Missing activation events: {missing_activations}
- Most used features: {top_features}
- Active days in the last 14 days: {active_days}
- Number of invited teammates: {invited_count}
- Current plan: {plan}
- Has hit plan limits: {hit_limits}

Generate:
1. Use case hypothesis (1 sentence): what problem the user is solving
2. Expansion signal (1 sentence): why they're ready to upgrade
3. Recommended opening (1 sentence): how to start the conversation
4. Risk factor (1 sentence): what might prevent conversion

Format: JSON. No generic phrases. Only specifics based on data.

The output is structured JSON written to the CRM contact record. Instead of “score: 82,” the sales manager sees: “User runs feature X daily for task Y, has invited 4 teammates, and is hitting API request limits. Pitch the Enterprise plan around team collaboration.”
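Filling the template is worth guarding, since a blank field tends to invite the generic phrasing the prompt forbids. The sketch below renders only the "User data" part of the prompt and refuses to run when a placeholder has no value; the names render_user_data and USER_DATA_TEMPLATE are hypothetical.

```python
from string import Formatter

USER_DATA_TEMPLATE = """User data:
- Email: {email}
- Company: {company_name} ({industry}, {employee_count} employees)
- PQL score: {pql_score} (tier: {pql_tier})
- Completed activation events: {completed_activations}
- Missing activation events: {missing_activations}
- Most used features: {top_features}
- Active days in the last 14 days: {active_days}
- Number of invited teammates: {invited_count}
- Current plan: {plan}
- Has hit plan limits: {hit_limits}"""

# Placeholder names are read from the template so the two cannot drift apart
TEMPLATE_FIELDS = [name for _, name, _, _ in Formatter().parse(USER_DATA_TEMPLATE) if name]


def render_user_data(user):
    """Fill the template, raising if any placeholder has no value."""
    missing = [field for field in TEMPLATE_FIELDS if user.get(field) in (None, "")]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    values = {
        key: ", ".join(map(str, value)) if isinstance(value, (list, tuple)) else value
        for key, value in user.items()
    }
    return USER_DATA_TEMPLATE.format(**values)
```

Lists such as completed_activations are joined with commas, so the caller can pass them straight from the scoring tables.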

Automated Pipeline: From Events to CRM

The pipeline architecture has four stages:

Product Events → Event Store → Score Calculator → CRM Sync
     │                              │                  │
  Segment/                     Scheduled job        HubSpot/
  Amplitude/                   (hourly/daily)       Salesforce
  PostHog                                           API

Stage 1: Event Collection

Product analytics (Segment, Amplitude, PostHog) pipes events into the warehouse. Minimum required fields per event:

{
  "user_id": "usr_abc123",
  "event_name": "core_feature_used",
  "properties": {
    "feature": "report_builder",
    "duration_seconds": 340
  },
  "timestamp": "2026-03-25T14:22:00Z",
  "context": {
    "company_domain": "acme.com"
  }
}
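A small validation step at ingestion keeps malformed events out of the scoring queries. This sketch assumes user_id, event_name, timestamp, and context.company_domain are the fields that must be present (properties is treated as optional); the function name is hypothetical.

```python
from datetime import datetime

REQUIRED_TOP_LEVEL = ("user_id", "event_name", "timestamp")


def validate_event(event):
    """Return a list of problems; an empty list means the event is usable."""
    problems = [f"missing {field}" for field in REQUIRED_TOP_LEVEL if not event.get(field)]

    timestamp = event.get("timestamp")
    if timestamp:
        try:
            # fromisoformat on older Python versions does not accept a trailing "Z"
            datetime.fromisoformat(str(timestamp).replace("Z", "+00:00"))
        except ValueError:
            problems.append("timestamp is not ISO 8601")

    context = event.get("context") or {}
    if not context.get("company_domain"):
        problems.append("missing context.company_domain")
    return problems
```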

Stage 2: Score Calculation

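A scheduled job (cron, Airflow, dbt) runs the SQL from the previous sections. Output is a pql_scores table: user_id, email, company_domain, pql_score, pql_tier, activation_score, engagement_score, firmographic_score, calculated_at. The statement below creates the table on the first run; later runs should append with INSERT INTO ... SELECT, since CREATE TABLE fails once the table exists and the calibration query needs past calculated_at snapshots.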

CREATE TABLE pql_scores AS
SELECT
    a.user_id,
    a.email,
    a.company_domain,
    a.activation_score,
    e.engagement_score,
    f.firmographic_score,
    ROUND(a.activation_score * 0.40 + e.engagement_score * 0.35 + f.firmographic_score * 0.25) AS pql_score,
    CASE
        WHEN ROUND(a.activation_score * 0.40 + e.engagement_score * 0.35 + f.firmographic_score * 0.25) >= 75 THEN 'hot'
        WHEN ROUND(a.activation_score * 0.40 + e.engagement_score * 0.35 + f.firmographic_score * 0.25) >= 50 THEN 'warm'
        WHEN ROUND(a.activation_score * 0.40 + e.engagement_score * 0.35 + f.firmographic_score * 0.25) >= 25 THEN 'nurture'
        ELSE 'monitor'
    END AS pql_tier,
    CURRENT_TIMESTAMP AS calculated_at
FROM activation_scores a
JOIN engagement_scores e ON a.user_id = e.user_id
JOIN firmographic_scores f ON a.user_id = f.user_id;

Stage 3: LLM Enrichment

LLM enrichment runs for pql_tier = 'hot' users and for anyone transitioning from warm to hot. Calls happen in batches, not real-time. Cost: ~$0.01–0.03 per user with GPT-5.4-mini (current pricing at platform.openai.com).

import json
from openai import OpenAI

client = OpenAI()

def generate_sales_context(user_data: dict) -> dict:
    prompt = f"""You are a sales intelligence assistant...
    [prompt from the previous section with substituted data]"""

    response = client.chat.completions.create(
        model="gpt-5.4-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.3
    )
    return json.loads(response.choices[0].message.content)

Low temperature (0.3) keeps outputs consistent. JSON mode prevents most parsing errors, but it does not enforce field names, so check for the expected keys before writing to the CRM. If you’re running a multi-provider setup with LiteLLM, route calls through a unified proxy with model fallbacks. Track prompt quality with Langfuse.
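Parsing is only half of the check: the CRM sync reads specific keys from the briefing. A minimal validation sketch follows. The first three key names are the ones the sync code looks up; risk_factor is an assumed name for the fourth item in the prompt, and parse_sales_context is a hypothetical helper.

```python
import json

EXPECTED_KEYS = (
    "use_case_hypothesis",
    "expansion_signal",
    "recommended_opening",
    "risk_factor",  # assumed key name; the prompt does not fix it
)


def parse_sales_context(raw):
    """Return a cleaned dict, or None when the briefing is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # e.g. a response cut off by a token limit
    if not isinstance(data, dict):
        return None

    cleaned = {}
    for key in EXPECTED_KEYS:
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            return None  # missing or empty field: retry or skip this user
        cleaned[key] = value.strip()
    return cleaned
```

Naming the keys explicitly in the prompt makes this check pass far more often than leaving the field names to the model.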

Stage 4: CRM Sync

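HubSpot and Salesforce support custom properties via API. The example below targets HubSpot and assumes the custom properties already exist on the contact object and that get_contact_id_by_email is a lookup helper you provide. Minimum fields to sync: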

import hubspot
from hubspot.crm.contacts import SimplePublicObjectInput

client = hubspot.Client.create(access_token="your_token")

def sync_pql_to_hubspot(user_email: str, pql_data: dict, sales_context: dict):
    properties = {
        "pql_score": str(pql_data["pql_score"]),
        "pql_tier": pql_data["pql_tier"],
        "pql_activation_score": str(pql_data["activation_score"]),
        "pql_engagement_score": str(pql_data["engagement_score"]),
        "pql_use_case": sales_context.get("use_case_hypothesis", ""),
        "pql_expansion_signal": sales_context.get("expansion_signal", ""),
        "pql_recommended_opening": sales_context.get("recommended_opening", ""),
        "pql_last_calculated": pql_data["calculated_at"]
    }

    contact = SimplePublicObjectInput(properties=properties)

    client.crm.contacts.basic_api.update(
        contact_id=get_contact_id_by_email(user_email),
        simple_public_object_input=contact
    )

When a user hits hot, the CRM auto-creates a task for the sales manager. Speed matters here: outreach velocity after a PQL signal is one of the most reliable win-rate drivers in B2B sales. Minutes beat hours.

Calibrating the Model and Thresholds

The model needs regular calibration. Two numbers to watch:

Precision. What share of hot PQLs actually convert. Target: >30%. If it drops below 20%, your thresholds are too low or the component weights are off.

Recall. What share of actual conversions the model predicted as hot or warm. Target: >70%. If it falls below 50%, the model is missing behavioral patterns — often because activation events are outdated.

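SQL query for precision and recall evaluation. It reports both metrics per tier, so read precision from the hot row and add the hot and warm rows of recall_pct to compare against the targets above: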

WITH predictions AS (
    SELECT
        p.user_id,
        p.pql_tier,
        CASE WHEN s.user_id IS NOT NULL THEN 1 ELSE 0 END AS actually_converted
    FROM pql_scores p
    LEFT JOIN subscriptions s ON p.user_id = s.user_id
        AND s.plan != 'free'
        AND s.subscription_start > p.calculated_at
        AND s.subscription_start <= p.calculated_at + INTERVAL '30 days'
    WHERE p.calculated_at >= CURRENT_DATE - INTERVAL '90 days'
)
SELECT
    pql_tier,
    COUNT(*) AS total_users,
    SUM(actually_converted) AS converted,
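    ROUND(SUM(actually_converted)::NUMERIC / COUNT(*) * 100, 1) AS precision_pct,
    -- Share of all paid conversions in the window that fell into this tier;
    -- NULLIF avoids a division-by-zero error when there were no conversions
    ROUND(SUM(actually_converted)::NUMERIC /
        NULLIF((SELECT COUNT(DISTINCT user_id) FROM subscriptions
         WHERE plan != 'free'
         AND subscription_start >= CURRENT_DATE - INTERVAL '90 days'), 0) * 100, 1
    ) AS recall_pct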
FROM predictions
GROUP BY pql_tier
ORDER BY pql_tier;
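To make the two definitions concrete, here is the arithmetic in Python. The tier counts are made up for illustration and are not results; the function names are hypothetical.

```python
def precision_pct(tier_counts, tiers=("hot",)):
    """Share of users in the given tiers who converted, in percent."""
    flagged = sum(tier_counts[t][0] for t in tiers if t in tier_counts)
    hits = sum(tier_counts[t][1] for t in tiers if t in tier_counts)
    return round(hits * 100 / flagged, 1) if flagged else 0.0


def recall_pct(tier_counts, total_conversions, tiers=("hot", "warm")):
    """Share of all conversions that came from the given tiers, in percent."""
    hits = sum(tier_counts[t][1] for t in tiers if t in tier_counts)
    return round(hits * 100 / total_conversions, 1) if total_conversions else 0.0


# tier -> (total_users, converted); illustrative numbers only
sample_counts = {
    "hot": (200, 70),
    "warm": (600, 90),
    "nurture": (1500, 30),
    "monitor": (4000, 10),
}
total = sum(converted for _, converted in sample_counts.values())  # 200

print(precision_pct(sample_counts))      # 70 of 200 hot users converted
print(recall_pct(sample_counts, total))  # 160 of 200 conversions were hot or warm
```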

Calibrate monthly. User behavior drifts: new features change activation patterns, seasonality hits engagement. Fixed thresholds decay within 2–3 months — you’ll start missing leads you should be catching.

You can automate calibration: feed metric drift to the same LLM and have it suggest weight adjustments. The product team still owns the final call.

PQL Scoring Economics

Implementation cost depends on your infrastructure.

Component	Cost (per month)
Event tracking (Segment/PostHog)	$0–300 (free tier covers up to 10K MAU)
Data warehouse (BigQuery/Snowflake)	$50–200
LLM API (GPT-5.4-mini for hot leads)	~$10–50 (for ~1,000 hot PQLs/month)
CRM (HubSpot/Salesforce)	Existing subscription
Orchestration (Airflow/cron)	$0–50

Total: $60–600/month for companies under 50K MAU. Off-the-shelf alternatives — Pocus, Correlated — start at $500/month at comparable scale. Building your own costs less, but you’re also taking on maintenance. Worth it if you have engineering bandwidth and want full control over the model.

Common Mistakes When Implementing PQL Scoring

Too many activation events. Three to five is enough. A model with 15+ events overfits to noise and loses predictive power. More events aren’t more signal.

Skipping account-level aggregation. In B2B, the company buys — not the individual user. If five employees from the same domain are active but each has an individual score of 30, the model misses a hot account entirely. Aggregate at company_domain.

Static thresholds. Tier boundaries should match what the sales team can actually process. If hot PQLs are coming in at 500/week and you have three reps, raise the threshold. The scoring model is only as useful as the speed of follow-up.

No feedback loop. Reps should log PQL quality in the CRM: converted, not relevant, bad timing. Without that feedback, the model never gets better. Minimum viable: a binary “useful / not useful” flag after each outreach.

Score without context. “82” tells a sales rep nothing. LLM-generated briefings turn a number into an action. Don’t treat this as optional — it’s what makes the difference between leads that get worked and leads that sit in the queue.
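The static-thresholds point can be turned into a small calculation: given this week's scores and the number of leads the team can work, find the lowest hot cutoff that fits. This is a sketch under the assumption that scores are integers and that the cutoff should never drop below the default of 75; the function name is hypothetical.

```python
def capacity_threshold(scores, weekly_capacity, floor=75):
    """Lowest hot cutoff (never below floor) that leaves at most weekly_capacity leads."""
    ranked = sorted(scores, reverse=True)
    hot_at_floor = sum(1 for score in ranked if score >= floor)
    if hot_at_floor <= weekly_capacity:
        return floor  # the team can already handle everything above the default
    # The cutoff has to exceed the score of the first lead that does not fit
    return ranked[weekly_capacity] + 1


weekly_scores = [95, 92, 88, 84, 80, 79, 76, 60]
print(capacity_threshold(weekly_scores, weekly_capacity=3))  # cutoff of 85 keeps 95, 92, 88
```

When several leads tie at the boundary, the cutoff moves above the tie, so the result can leave fewer leads than capacity rather than more.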

What’s Next

PQL scoring is the foundation. Once it’s running, natural next steps:

Predictive model. Replace rule-based scoring with logistic regression or gradient boosting trained on historical conversions. Rules work at launch; ML generalizes better as volume grows.
Real-time scoring. Move from batch (daily) to stream processing. Score updates on every event; sales gets a notification the moment a user crosses into hot.
In-product actions. Let the score drive UX, not just CRM: paywall triggers, premium feature hints, personalized onboarding flows.
Multi-touch attribution. Layer PQL score on top of marketing touchpoints for a fuller picture of how deals actually close.

The whole system takes 2–3 sprints to ship. Results show up within 30 days: sales works fewer leads, but each one converts at a higher rate. That’s the point of Product-Led Growth — the product does the qualification work, so your team doesn’t have to.

Need help building a PQL scoring model? I help startups build AI products and automate processes — belov.works.

FAQ

How do you handle PQL scoring for a B2B product where multiple users from the same company are active on the free plan?

Aggregate at the company domain level, not the individual user level. Sum or average scores across all users from the same domain, and add a multiplier for team size — five active users from acme.com is a stronger signal than one user with a higher individual score. The SQL pattern is to GROUP BY company_domain and treat the account as the scoring unit. Individual user scores still matter for personalizing the sales briefing, but the tier assignment should reflect account-level behavior.
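One way to sketch the account-level rollup is to average user scores per domain and scale by team size. The boost of 0.5 per additional active user is an arbitrary starting value to tune against your own conversion data, and the function name is hypothetical.

```python
from collections import defaultdict


def account_scores(user_scores, per_user_boost=0.5):
    """Roll user scores up to one score per company domain.

    user_scores: iterable of (company_domain, pql_score) pairs.
    Returns {domain: score}, capped at 100.
    """
    by_domain = defaultdict(list)
    for domain, score in user_scores:
        if domain:  # users without a company domain cannot be rolled up
            by_domain[domain].append(score)

    result = {}
    for domain, scores in by_domain.items():
        average = sum(scores) / len(scores)
        multiplier = 1 + per_user_boost * (len(scores) - 1)
        result[domain] = min(100, round(average * multiplier))
    return result


# Five users at 30 each roll up to 90; a single user at 60 stays at 60
rows = [("acme.com", 30)] * 5 + [("solo.io", 60)]
print(account_scores(rows))
```

Exclude free-mail domains before aggregating, otherwise every gmail.com user ends up in one very large "account".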

What’s the minimum viable version of PQL scoring a two-person startup can ship in a week?

Start with a single binary rule: if a user has completed at least two specific activation events within 14 days AND used a work-domain email, flag them as a PQL. No weights, no tiers, no LLM enrichment. Export the list weekly to a spreadsheet and have the founder call each person. This captures much of the value of a full scoring model and takes about a day to implement. Layer in the full SQL-based scoring once you’ve validated that activation events actually predict conversion in your product.
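The binary rule fits in a few lines of Python. This is a sketch: the event names are borrowed from the activation examples earlier in the article and should be replaced with your own, and is_pql is a hypothetical name.

```python
from datetime import datetime, timedelta

# Assumption: placeholder event names and free-mail list; substitute your own
ACTIVATION_EVENTS = {"project_created", "team_invite_sent", "integration_connected"}
FREE_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}


def is_pql(email, signup_at, events, min_completed=2, window_days=14):
    """events: iterable of (event_name, occurred_at) pairs for one user."""
    deadline = signup_at + timedelta(days=window_days)
    completed = {
        name
        for name, occurred_at in events
        if name in ACTIVATION_EVENTS and signup_at <= occurred_at <= deadline
    }
    _, separator, domain = email.strip().lower().rpartition("@")
    has_work_email = bool(separator) and domain not in FREE_EMAIL_DOMAINS
    return len(completed) >= min_completed and has_work_email
```

Repeating the same activation event does not count twice, because the rule looks at distinct events completed inside the window.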

How does PQL scoring interact with self-serve and product-led sales motions — should it apply to both?

Yes, but with different outputs. For self-serve (user converts without sales contact), PQL scoring drives in-product triggers: which upsell message to show, when to surface the pricing page, what feature to unlock in a trial. For product-led sales (a human reaches out), PQL scoring tells the rep who to contact and what to say. The same underlying score powers both motions — the difference is in what action fires when a user crosses a threshold.