# aicodingpricing.com — AI Coding Model Leaderboard / Decision Engine PRD

Date: 2026-05-28
Task: t_dbd60f86
Owner: moce
Canonical site: https://aicodingpricing.com
Report root: /root/.hermes/reports/aicodingpricing-leaderboard-20260528
Upstream inputs:
- Owner brief: /root/.hermes/reports/aicodingpricing-leaderboard-20260528/input-brief.md
- SEO/SERP/data-source plan: /root/.hermes/reports/aicodingpricing-leaderboard-20260528/seo-serp-plan.md
- Existing buying-decision result spec: /root/.hermes/kanban/boards/site-review/artifacts/site-review-20260527/aicodingpricing-buying-decision-result-spec.md

## 0. Decision

Build the AI Coding Model Leaderboard as a core inner-page cluster inside aicodingpricing.com.

Do not create a separate site. Do not reposition the product as a generic LLM leaderboard. The product upgrade is:

From: AI coding pricing calculator / pricing comparison
To: AI coding model decision engine: benchmark evidence + token price + context/speed signals + real task-cost estimate + workflow recommendation.

Primary product promise:

Compare AI coding models by public coding benchmark evidence, API pricing, context window, speed signals when available, and estimated coding-agent task cost.

P0 routes:
1. /llm-leaderboard — source-led AI coding model leaderboard and table.
2. /best-llm-for-coding — scenario recommendation page for users who want an answer, not a raw table.
3. /coding-agent-cost-calculator — conversion page that turns model choice into task/monthly cost.

Canonical decision:
- Keep /llm-leaderboard as the canonical P0 leaderboard route because the owner brief explicitly prioritizes it.
- /ai-coding-model-leaderboard should be 301 redirected to /llm-leaderboard or noindexed if shipped as an alias.
- H1/title must always include the coding qualifier. Never ship a generic “LLM Leaderboard” page.

P0 success condition:
A developer can answer: “Which model should I use for this coding workflow, why, what caveats apply, and what will it probably cost per task/month?” without leaving the site.

## 1. Product positioning

### 1.1 Category

AI coding model decision engine for developers, technical founders, and teams comparing coding LLMs by value, not only sticker price.

### 1.2 Positioning statement

FOR developers, founders, and engineering teams
WHO need to choose an LLM for coding agents, refactors, code review, frontend generation, or low-cost automation
AICodingPricing is an AI coding model decision engine
THAT compares public coding benchmark evidence, API prices, context, speed signals, caveats, and estimated real task cost
UNLIKE generic LLM leaderboards or raw pricing tables
AICodingPricing connects model selection directly to coding-agent cost calculation and buying decisions.

### 1.3 Messaging hierarchy

Headline:
AI Coding Models, Ranked by Workflow Cost

Subhead:
Compare coding LLMs by public benchmark evidence, API pricing, context window, speed signals, and estimated task cost for real coding workflows.

Benefits:
- Pick by workflow, not generic intelligence: coding agent, frontend generation, repo refactor, bug fixing, code review, test generation, Chinese coding workflow.
- Separate price from cost: token price is only one input; retries, output length, cache, context, and failed trajectories change real cost.
- Avoid fake certainty: source labels, last checked dates, confidence, and caveats are visible on every row.
- Move from comparison to action: every row leads to a prefilled coding-agent cost estimate.

Proof / trust signals:
- Public benchmark sources only.
- Official provider pricing first.
- Missing values rendered as not disclosed / not publicly benchmarked.
- Methodology explains why benchmark scores are not merged into one fake universal score.

## 2. Target users and jobs

### 2.1 Primary ICP: AI coding agent builder

Who: indie hacker, technical founder, developer-tool builder, or automation engineer building agentic coding workflows.
Pain: token bills can explode once agents retry, call tools, inspect repositories, and generate long diffs.
Current substitute: manually reading OpenAI/Anthropic/Gemini pricing pages plus benchmark leaderboards.
Trigger: choosing a model for a coding agent, internal dev tool, or automation product.
Willingness to pay: medium to high if the page prevents a wrong model choice or high recurring API cost.
Core task: choose the best model for reliable coding-agent work under a cost ceiling.

### 2.2 Secondary ICP: developer choosing a coding assistant model

Who: solo developer or power user comparing Claude, GPT, Gemini, DeepSeek, Kimi, Qwen, Cursor/Copilot/Codex-like workflows.
Pain: “best model” claims conflict, and subscription/API cost is hard to compare with real usage.
Current substitute: Reddit, YouTube, blogs, model release posts, provider docs.
Trigger: hitting subscription limits, switching tools, or deciding whether a cheaper API model is good enough.
Willingness to pay: medium; mostly attention/newsletter/conversion value, not direct payment in P0.
Core task: get a scenario answer with caveats, then estimate monthly cost.

### 2.3 Tertiary ICP: startup team evaluating coding model spend

Who: engineering manager, founding CTO, or platform owner controlling model spend.
Pain: raw price per million tokens does not explain monthly task cost, risk, or governance.
Current substitute: spreadsheets, vendor docs, internal experiments.
Trigger: team-wide adoption, budget review, or migration from subscriptions to API usage.
Willingness to pay: high if the output supports a purchase decision or vendor shortlist.
Core task: shortlist models by workflow, cost, context, and risk.

## 3. Core user tasks

P0 must support these tasks end-to-end:

1. “I’m building a coding agent. Which model should I start with, and what will it cost per task?”
2. “I need the cheapest good-enough model for high-volume coding automation.”
3. “I need a strong model for frontend generation and UI code.”
4. “I need a model for repo-level refactors or long-context code review.”
5. “I need a Chinese coding workflow option across Kimi/Qwen/DeepSeek-like candidates, but only where public data exists.”
6. “I want to compare Claude/GPT/Gemini/DeepSeek/Kimi/Qwen without trusting a fake universal ranking.”
7. “I know the token price; now show me task cost under retries, output tokens, cache, and monthly volume.”

## 4. Route contract and IA

### 4.1 /llm-leaderboard

Role: P0 core leaderboard.
Index policy: index.
Canonical: self.
Primary keyword: ai coding model leaderboard.
Title: AI Coding Model Leaderboard: Compare Coding LLMs by Cost & Benchmarks.
H1: AI Coding Model Leaderboard for Coding Agents and Developer Workflows.
Primary CTA: Estimate task cost.
Secondary CTA: Read methodology.

Above-fold structure:
1. Short answer block: no single best model; choose by workflow, benchmark evidence, price, context, speed, and task-cost assumptions.
2. Filter chips: Best coding, Cheapest good-enough, Best long context, Best agent model, Best Chinese coding workflow, Best open/low-cost model.
3. Crawlable HTML leaderboard table.
4. Row CTA: Calculate this model’s task cost.
5. Source/freshness strip.

Required sections:
- Best AI coding models by workflow.
- Coding benchmark sources and what each measures.
- API price, cache price, and task-cost comparison.
- Context window and long-context caveats.
- Speed and latency signals.
- How to choose a model for coding agents.
- FAQ.

### 4.2 /best-llm-for-coding

Role: P0 decision page.
Index policy: index.
Canonical: self.
Primary keyword: best llm for coding.
Primary CTA: Compare the leaderboard.
Secondary CTA: Estimate coding-agent cost.

Page structure:
1. Short answer: there is no single best LLM for all coding work.
2. Recommendation cards by scenario:
   - Best for coding agents.
   - Best for repo-level refactor.
   - Best for frontend generation.
   - Best cheap model.
   - Best long-context model.
   - Best Chinese coding workflow.
   - Best open/low-cost model when supported by data.
3. Evidence table per scenario: benchmark/source, price, context, caveat, confidence.
4. Methodology block: why recommendations are editorial labels derived from evidence fields, not universal truth.
5. Internal links to /llm-leaderboard, /coding-agent-cost-calculator, /coding-model-benchmark, /llm-api-pricing-comparison.

### 4.3 /coding-agent-cost-calculator

Role: P0 conversion page.
Index policy: index.
Canonical: self.
Primary keyword: coding agent cost calculator.
Primary CTA: Estimate monthly coding-agent cost.
Secondary CTA: Compare AI coding models.

Input fields:
- selected_model
- workflow_type
- average_trajectory_input_tokens
- average_output_or_reasoning_tokens where provider/source supports the distinction
- retry_rate
- sessions_or_tasks_per_month
- cache_hit_assumption
- batch_or_standard_mode toggle when provider supports it
- optional: human_review_minutes or failure_penalty note for later versions; not required in P0 calculation

Output fields:
- estimated_task_cost
- estimated_monthly_cost
- sensitivity_range
- price_components: input, output, cache read/write, batch/standard modifier where applicable
- task_cost_assumptions
- disclaimer: estimate, not billing quote
- CTA back to leaderboard row / compare alternatives
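
One way the `sensitivity_range` output could be produced is to re-run the estimate with the retry rate shifted down and up by a fixed band. The sketch below is illustrative only: `sensitivityRange`, `SensitivityRange`, and the `spread` parameter are hypothetical names, and it assumes `retry_rate` means the expected number of extra attempts per task.

```ts
// Hypothetical output shape for the sensitivity_range field.
type SensitivityRange = { low: number; expected: number; high: number }

// baseCostPerAttempt: token cost of one attempt in USD, computed elsewhere.
// retryRate: expected extra attempts per task (assumption).
// spread: hypothetical +/- band applied to the retry rate.
function sensitivityRange(
  baseCostPerAttempt: number,
  retryRate: number,
  spread: number
): SensitivityRange {
  // A retry rate below zero makes no sense, so the low end is clamped at 0.
  const cost = (rate: number): number => baseCostPerAttempt * (1 + Math.max(0, rate))
  return {
    low: cost(retryRate - spread),
    expected: cost(retryRate),
    high: cost(retryRate + spread),
  }
}
```

Varying only the retry rate keeps the range easy to explain in the UI; a later version could also vary output length or cache hit rate.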

Formula policy:
Task cost estimate must show its formula:

estimated_task_cost =
  (input_tokens_per_task × input_price_per_token)
+ (output_tokens_per_task × output_price_per_token)
+ (cache_write_tokens × cache_write_price_per_token)
+ (cache_read_tokens × cache_read_price_per_token)
The whole sum is then multiplied by retry_multiplier (derived from retry_rate) and adjusted for batch/standard mode where applicable.

Monthly estimate = estimated_task_cost × tasks_per_month.

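Important: token price is not task cost. Real task cost depends on trajectory length, retry count, output/reasoning tokens, cache assumptions, batch mode, failure rate, and human review/time risk. The P0 formula covers the token-based inputs; human review and failure penalties are shown as caveats, not calculated.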
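
The formula above can be sketched in TypeScript as follows. This is a minimal sketch under stated assumptions, not the site's implementation: `TaskCostInput`, `estimateTaskCost`, and `estimateMonthlyCost` are hypothetical names, prices are taken per 1M tokens (as in the leaderboard fields) and converted to per-token prices, `retry_multiplier` is assumed to be `1 + retry_rate`, and the batch/standard adjustment is modeled as a plain multiplier, which may not match how every provider prices batch traffic.

```ts
// Hypothetical input shape; prices are USD per 1M tokens.
type TaskCostInput = {
  input_tokens_per_task: number
  output_tokens_per_task: number
  cache_write_tokens: number
  cache_read_tokens: number
  input_price_usd_per_1m: number
  output_price_usd_per_1m: number
  cache_write_price_usd_per_1m: number
  cache_read_price_usd_per_1m: number
  retry_rate: number // assumed: expected extra attempts per task, e.g. 0.25
  mode_multiplier: number // assumed: 1 for standard, a discount factor for batch
  tasks_per_month: number
}

const TOKENS_PER_MILLION = 1000000

function estimateTaskCost(i: TaskCostInput): number {
  // Sum of the four priced components for a single attempt.
  const oneAttempt =
    (i.input_tokens_per_task * i.input_price_usd_per_1m +
      i.output_tokens_per_task * i.output_price_usd_per_1m +
      i.cache_write_tokens * i.cache_write_price_usd_per_1m +
      i.cache_read_tokens * i.cache_read_price_usd_per_1m) /
    TOKENS_PER_MILLION
  const retryMultiplier = 1 + i.retry_rate // assumption, see lead-in
  return oneAttempt * retryMultiplier * i.mode_multiplier
}

function estimateMonthlyCost(i: TaskCostInput): number {
  return estimateTaskCost(i) * i.tasks_per_month
}
```

Keeping the single-attempt cost separate from the retry and mode multipliers makes it straightforward to display each price component next to the formula.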

### 4.4 Alias/supporting route policy

- /ai-coding-model-leaderboard: 301 redirect to /llm-leaderboard, or noindex with canonical set to /llm-leaderboard.
- /coding-model-benchmark: P1 methodology hub; index only if unique content exists.
- /llm-api-pricing-comparison: P1 pricing hub; strong fit with existing site.
- /cheapest-coding-model: P1 value page; index only after source-led table exists.
- /claude-vs-gpt-for-coding: P1 comparison; compare by scenario, no absolute winner.
- /kimi-vs-qwen-vs-deepseek-coding: P2 unless data coverage is sufficient.
- /models/{model}: P2, index only with unique sourced data, pricing, benchmark references, context, caveats, and calculator CTA.
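
The route policy above could be kept in one explicit table so that redirects, noindex rules, and the sitemap are derived from the same source. This is a framework-agnostic sketch; `RoutePolicy`, `ROUTES`, and `sitemapPaths` are hypothetical names, and the alias is shown with the 301 option (the noindex + canonical option would use the `noindex` variant instead).

```ts
type RoutePolicy =
  | { kind: 'index' }
  | { kind: 'redirect'; status: 301; to: string }
  | { kind: 'noindex'; canonical: string }

type RouteEntry = { path: string; policy: RoutePolicy }

// P0 routes plus the alias; P1/P2 routes would be added once they ship.
const ROUTES: RouteEntry[] = [
  { path: '/llm-leaderboard', policy: { kind: 'index' } },
  { path: '/best-llm-for-coding', policy: { kind: 'index' } },
  { path: '/coding-agent-cost-calculator', policy: { kind: 'index' } },
  {
    path: '/ai-coding-model-leaderboard',
    policy: { kind: 'redirect', status: 301, to: '/llm-leaderboard' },
  },
]

// Only indexable routes belong in the sitemap; redirects and noindex pages are excluded.
function sitemapPaths(routes: RouteEntry[]): string[] {
  return routes.filter((r) => r.policy.kind === 'index').map((r) => r.path)
}
```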

## 5. P0 data model

### 5.1 Leaderboard table fields

Required visible fields:
- model
- provider
- coding evidence
- benchmark/source columns, kept separate
- input price
- output price
- cache price
- context window
- speed signal
- estimated task cost range
- best for
- caveat
- source
- last checked
- confidence
- data status

Minimum P0 table fields from owner brief:
- model
- provider
- coding score
- input price
- output price
- cache price
- context
- best for
- caveat

Expanded implementation fields:

```ts
type CodingModelRow = {
  model_id: string
  display_name: string
  provider: string
  model_family?: string
  availability: 'api' | 'subscription_tool' | 'open_model' | 'aggregator_route' | 'unknown'

  benchmark_evidence: BenchmarkEvidence[]
  coding_score_display: string // e.g. benchmark-specific summary, not universal score
  input_price_usd_per_1m: number | 'not disclosed'
  output_price_usd_per_1m: number | 'not disclosed'
  cache_read_price_usd_per_1m?: number | 'not disclosed' | 'not applicable'
  cache_write_price_usd_per_1m?: number | 'not disclosed' | 'not applicable'
  batch_or_flex_price_note?: string
  context_window_tokens: number | 'not disclosed'
  effective_long_context_note?: string
  speed_signal?: SpeedSignal | 'not disclosed'

  best_for: WorkflowTag[]
  caveat: string
  data_status: 'available' | 'partial' | 'not disclosed' | 'not publicly benchmarked' | 'stale' | 'source needs recheck'
  source_confidence: 'high' | 'medium' | 'low'
  last_checked: string
  source_links: SourceLink[]
}

type BenchmarkEvidence = {
  benchmark_name: 'SWE-bench' | 'Aider' | 'LiveCodeBench' | 'LiveBench' | 'Artificial Analysis Coding Index' | 'LM Arena code/webdev' | 'Kilo usage signal' | string
  metric_label: string
  metric_value: string | number | 'not publicly benchmarked'
  source_url: string
  source_name: string
  last_checked: string
  confidence: 'high' | 'medium' | 'low'
  caveat: string
}

type SpeedSignal = {
  metric: 'TTFT' | 'tokens_per_second' | 'latency_note'
  value: string | number
  source_url: string
  confidence: 'high' | 'medium' | 'low'
}

type SourceLink = {
  source_name: string
  source_url: string
  source_type: 'official_pricing' | 'official_docs' | 'benchmark' | 'third_party_aggregator' | 'usage_signal'
  last_checked: string
}
```
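
The types above reference `WorkflowTag` without defining it. A possible definition, derived from the workflows named in the messaging hierarchy, is sketched below; the exact identifiers are assumptions and should be aligned with the final filter and recommender option keys.

```ts
// Hypothetical tag list; identifiers are illustrative, not confirmed by the owner brief.
const WORKFLOW_TAGS = [
  'coding_agent',
  'frontend_generation',
  'repo_refactor',
  'bug_fixing',
  'code_review',
  'test_generation',
  'chinese_coding_workflow',
] as const

type WorkflowTag = (typeof WORKFLOW_TAGS)[number]

// Runtime guard for values coming from URLs or analytics payloads.
function isWorkflowTag(value: string): value is WorkflowTag {
  return (WORKFLOW_TAGS as readonly string[]).indexOf(value) !== -1
}
```

Deriving the type from a constant array keeps the runtime list and the compile-time union in sync.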

### 5.2 Source policy

Hard rule: no fabricated model, benchmark, speed, context, or pricing data.

Each source-backed field must include:
- source_url
- source_name
- last_checked
- confidence
- update_policy
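
A data-layer check for these five provenance keys might look like the sketch below. `SourcedField` and `missingProvenance` are hypothetical names, and `update_policy` is treated as free text because this document does not define its allowed values.

```ts
type Confidence = 'high' | 'medium' | 'low'

// Hypothetical wrapper: one value plus the provenance the source policy requires.
type SourcedField<T> = {
  value: T
  source_url: string
  source_name: string
  last_checked: string // assumed ISO date string
  confidence: Confidence
  update_policy: string // free text here; allowed values are not specified
}

const REQUIRED_KEYS = [
  'source_url',
  'source_name',
  'last_checked',
  'confidence',
  'update_policy',
] as const

// Returns the provenance keys that are missing or empty on a field.
function missingProvenance(field: Partial<SourcedField<unknown>>): string[] {
  return REQUIRED_KEYS.filter((key) => {
    const v = field[key]
    return typeof v !== 'string' || v.trim() === ''
  })
}
```

A build step could fail when `missingProvenance` returns a non-empty list for any source-backed field.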

Accepted data states:
- available
- partial
- not disclosed
- not publicly benchmarked
- stale
- source needs recheck

Display rules:
- If a provider does not disclose a value, show `not disclosed`.
- If a model has no public benchmark in the selected benchmark, show `not publicly benchmarked`.
- If source is third-party only, label it as third-party/aggregator.
- Official provider docs win over aggregators for pricing.
- OpenRouter pricing can be shown only as route-specific pricing, not as official provider API pricing.
- Subscription tool pricing must be separated from API token pricing.
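
The display rules above can be illustrated with two small helpers. This is a sketch, not a required implementation: `renderPrice`, `pickPrice`, and `PriceSource` are hypothetical names, and treating an absent optional price as `not disclosed` is an assumption.

```ts
type PriceValue = number | 'not disclosed' | 'not applicable'

// Render a price cell without ever inventing a number.
function renderPrice(value: PriceValue | undefined): string {
  if (value === undefined) return 'not disclosed' // assumption for absent optional fields
  if (typeof value === 'string') return value
  return `$${value} / 1M tokens`
}

type PriceSource = {
  source_type: 'official_pricing' | 'third_party_aggregator'
  source_name: string
  price_usd_per_1m: number
}

type PickedPrice = { price: number; label: string } | 'not disclosed'

// Official provider pricing wins; aggregator pricing is only shown with a label.
function pickPrice(sources: PriceSource[]): PickedPrice {
  const official = sources.filter((s) => s.source_type === 'official_pricing')[0]
  if (official) {
    return { price: official.price_usd_per_1m, label: `official: ${official.source_name}` }
  }
  const aggregator = sources.filter((s) => s.source_type === 'third_party_aggregator')[0]
  if (aggregator) {
    return {
      price: aggregator.price_usd_per_1m,
      label: `route-specific (aggregator): ${aggregator.source_name}`,
    }
  }
  return 'not disclosed'
}
```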

Recommended benchmark sources:
- SWE-bench official leaderboard.
- Aider LLM Leaderboards.
- LiveCodeBench.
- LiveBench.
- Artificial Analysis Coding Index.
- LM Arena code/webdev leaderboard.
- Kilo coding model leaderboard as usage signal, not objective benchmark.

Recommended official pricing sources:
- OpenAI API pricing: https://openai.com/api/pricing/
- Anthropic Claude API pricing: https://docs.anthropic.com/en/docs/about-claude/pricing
- Google Gemini API pricing: https://ai.google.dev/gemini-api/docs/pricing
- DeepSeek API pricing: https://api-docs.deepseek.com/quick_start/pricing
- Kimi/Moonshot API pricing: https://platform.moonshot.ai/docs/pricing
- Add Qwen/Alibaba Cloud, xAI, Mistral, OpenRouter only when specific models enter P0/P1 scope and source policy is satisfied.

## 6. Filters and taxonomy

### 6.1 P0 filters

Owner-specified filters:
- cheapest
- best coding
- best long context
- best Chinese coding
- best open model
- best agent model

Productized filter labels:
- Cheapest good-enough
- Best coding evidence
- Best for coding agents
- Best long context
- Best for frontend generation
- Best for repo refactor
- Best for code review
- Best for test generation
- Best Chinese coding workflow
- Best open / low-cost model

### 6.2 Filter behavior

Filters do not create fake absolute rankings. They reorder or highlight rows by declared logic and show caveats.

Examples:
- Cheapest good-enough: price bucket + minimum public coding evidence threshold + caveat for retry/failure risk.
- Best coding evidence: strongest public benchmark evidence across selected sources, but individual benchmark values remain separate.
- Best long context: advertised context + effective long-context caveat + source confidence.
- Best Chinese coding workflow: only models with public pricing/context and coding evidence or clearly labeled partial evidence.
- Best open / low-cost model: open/available route + cost caveat + benchmark availability.
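
As an illustration of "declared logic", the Cheapest good-enough filter could be sketched as below. The names `FilterRow`, `blendedPrice`, and `cheapestGoodEnough` are hypothetical, `evidence_level` is an assumed summary of the row's benchmark evidence, and the blended price (a simple average of input and output price) is only one possible price bucket definition.

```ts
type EvidenceLevel = 'strong' | 'partial' | 'none'

type FilterRow = {
  model_id: string
  input_price_usd_per_1m: number | 'not disclosed'
  output_price_usd_per_1m: number | 'not disclosed'
  evidence_level: EvidenceLevel // hypothetical summary of benchmark_evidence
  caveat: string
}

// Returns null when either price is undisclosed, so the row is never ranked on a guess.
function blendedPrice(row: FilterRow): number | null {
  if (typeof row.input_price_usd_per_1m !== 'number') return null
  if (typeof row.output_price_usd_per_1m !== 'number') return null
  return (row.input_price_usd_per_1m + row.output_price_usd_per_1m) / 2
}

function cheapestGoodEnough(rows: FilterRow[]): FilterRow[] {
  return rows
    .filter((r) => r.evidence_level !== 'none' && blendedPrice(r) !== null)
    .sort((a, b) => (blendedPrice(a) as number) - (blendedPrice(b) as number))
    .map((r) => ({
      ...r,
      // Every surfaced row keeps the retry/failure caveat visible.
      caveat: `${r.caveat} Low token price does not guarantee low task cost.`,
    }))
}
```

Rows with undisclosed prices or no public coding evidence are left out of the ordering instead of being ranked on inferred values.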

## 7. Recommender logic

### 7.1 P0 recommender options

Owner-specified options:
- I’m building a coding agent
- I need cheapest good-enough model
- I need frontend generation
- I need repo-level refactor

Expanded P0 options:
- Build a coding agent
- Cheapest good-enough automation
- Frontend generation / UI code
- Repo-level refactor
- Code review / long-context review
- Test generation / debugging
- Chinese coding workflow

### 7.2 Recommender output

Each recommendation must output:
- recommended_model or model shortlist
- why_this_model
- cheaper_alternative
- safer_or_more_reliable_alternative
- evidence_used
- caveat
- estimated_task_cost_cta
- confidence

```ts
type RecommenderResult = {
  selected_workflow: WorkflowTag
  recommendation_label: string
  recommended_models: string[]
  why: string[]
  evidence_used: {
    benchmark_sources: string[]
    pricing_sources: string[]
    context_sources: string[]
    speed_sources?: string[]
  }
  cheaper_alternative?: string
  safer_alternative?: string
  caveat: string
  confidence: 'high' | 'medium' | 'low'
  calculator_prefill: CalculatorPrefill
}

type CalculatorPrefill = {
  model_id: string
  workflow_type: WorkflowTag
  default_input_tokens_per_task: number | null
  default_output_tokens_per_task: number | null
  default_retry_rate: number | null
  default_tasks_per_month: number | null
  assumption_source: 'user_default' | 'site_default' | 'source_backed' | 'not_set'
}
```

### 7.3 Scoring display policy

Do not show a single universal “overall model score” in P0.

Allowed:
- Separate benchmark columns.
- Scenario labels such as “Best for repo refactor” or “Low-cost option”.
- Evidence badges: strong evidence / partial evidence / not benchmarked.
- Confidence labels: high / medium / low.
- Sort options that clearly state what they sort by.

Not allowed:
- Fake composite score across GPT / Claude / Gemini / Kimi / Qwen / DeepSeek when data is not comparable.
- “The best LLM overall” claim.
- Ranking a model higher because of inferred or missing data.
- Filling speed, benchmark, or price values without public support.

Recommended UI wording:
- “Best by selected workflow, not a universal model ranking.”
- “Benchmark columns are not directly comparable unless the source and metric are the same.”
- “Estimated task cost uses the calculator assumptions shown below.”

## 8. Token price vs real task cost

This distinction must be visible on all three P0 routes.

Token price answers:
- How much does the provider charge per input/output/cache token?
- Is batch/flex/priority pricing available?
- What is the published price per 1M tokens?

Task cost answers:
- How many input/output/reasoning/cache tokens does a workflow consume?
- How often does the agent retry or fail?
- How much context does repo-level work require?
- Does the model produce longer outputs or require more human review?
- How many tasks run per month?

Required explainer copy:

A cheaper token price does not always mean a cheaper coding task. A model with lower price per token can become more expensive if it needs more retries, produces longer trajectories, fails more often, or requires extra human review. Use token price to understand the unit cost, then use task cost to estimate the real monthly spend for your coding workflow.
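
The point of the explainer can be checked with a small worked example. All numbers below are hypothetical and chosen only to illustrate the arithmetic; `Scenario` and `taskCost` are illustrative names, and `retry_rate` is assumed to mean expected extra attempts per task.

```ts
// Hypothetical scenario shape; no value here describes a real model.
type Scenario = {
  label: string
  input_tokens: number
  output_tokens: number
  input_price_usd_per_1m: number
  output_price_usd_per_1m: number
  retry_rate: number // assumed: expected extra attempts per task
}

function taskCost(s: Scenario): number {
  const oneAttempt =
    (s.input_tokens * s.input_price_usd_per_1m +
      s.output_tokens * s.output_price_usd_per_1m) /
    1000000
  return oneAttempt * (1 + s.retry_rate)
}

// Half the token price, but longer outputs and many more retries.
const lowerTokenPrice: Scenario = {
  label: 'lower token price',
  input_tokens: 200000,
  output_tokens: 40000,
  input_price_usd_per_1m: 1,
  output_price_usd_per_1m: 4,
  retry_rate: 1.5,
}

// Double the token price, but shorter outputs and few retries.
const higherTokenPrice: Scenario = {
  label: 'higher token price',
  input_tokens: 200000,
  output_tokens: 20000,
  input_price_usd_per_1m: 2,
  output_price_usd_per_1m: 8,
  retry_rate: 0.1,
}
```

Under these made-up assumptions, `taskCost(lowerTokenPrice)` comes to roughly $0.90 per task and `taskCost(higherTokenPrice)` to roughly $0.62, so the model with the lower sticker price is the more expensive one per task.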

Required calculator caveat:

This is an estimate, not a billing quote. Actual bills can change with model updates, prompt length, tool calls, retries, cache hit rate, regional availability, batch mode, taxes, and provider pricing changes.

## 9. Content and SEO requirements

### 9.1 /llm-leaderboard

Target words: 1,200–1,800.
Required visible blocks:
- short answer
- filterable leaderboard
- methodology caveat
- source/freshness block
- task-cost CTA
- FAQ

FAQ topics:
- What is an AI coding model leaderboard?
- What is the best LLM for coding agents?
- Is a coding benchmark the same as real task cost?
- What is the cheapest good-enough coding model?
- Why do benchmark rankings disagree?
- How often is the pricing/benchmark data updated?

### 9.2 /best-llm-for-coding

Target words: 1,200–1,600.
Required visible blocks:
- no-single-best short answer
- scenario recommendations
- evidence table
- calculator CTA
- FAQ

Copy rule:
Use “best for X” instead of “best overall”.

### 9.3 /coding-agent-cost-calculator

Target words: 900–1,300 plus calculator UI.
Required visible blocks:
- calculator
- formula
- assumptions
- sample scenarios
- sensitivity range
- caveats
- FAQ

### 9.4 Schema / GEO / AEO

Use where applicable:
- WebPage
- BreadcrumbList
- FAQPage for visible FAQ only
- Dataset for versioned/downloadable leaderboard table
- ItemList for selected ranked/filter views with caveat
- SoftwareApplication for calculator page

Avoid:
- fake AggregateRating
- hidden FAQ schema
- first-party benchmark ownership claims
- schema claiming a universal best model
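
To keep FAQ markup tied to what is actually rendered, the JSON-LD could be generated from the same FAQ list the page displays. The sketch below assumes a hypothetical `FaqItem` shape with a `visible` flag; the output follows the schema.org `FAQPage` / `Question` / `Answer` structure.

```ts
// Hypothetical FAQ record; `visible` reflects whether the item is rendered on the page.
type FaqItem = { question: string; answer: string; visible: boolean }

type FaqJsonLd = {
  '@context': 'https://schema.org'
  '@type': 'FAQPage'
  mainEntity: {
    '@type': 'Question'
    name: string
    acceptedAnswer: { '@type': 'Answer'; text: string }
  }[]
}

// Returns null when no FAQ is visible, so no FAQPage schema is emitted at all.
function buildFaqJsonLd(items: FaqItem[]): FaqJsonLd | null {
  const visible = items.filter((i) => i.visible)
  if (visible.length === 0) return null
  return {
    '@context': 'https://schema.org',
    '@type': 'FAQPage',
    mainEntity: visible.map((i) => ({
      '@type': 'Question' as const,
      name: i.question,
      acceptedAnswer: { '@type': 'Answer' as const, text: i.answer },
    })),
  }
}
```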

AI citation blocks:
- 40–70 word short answer near top.
- Definition block.
- Methodology block.
- Data freshness block.
- Direct-answer FAQ.

## 10. Product interaction requirements

### 10.1 Leaderboard row actions

Each row must include:
- View evidence.
- Calculate this model’s task cost.
- Compare with cheaper alternative where available.
- Source links.

Clicking “Calculate this model’s task cost” opens /coding-agent-cost-calculator with selected_model and workflow prefilled where possible.
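
A row CTA could pass the prefill through the URL, for example as query parameters. This is a sketch under assumptions: the parameter names `selected_model` and `workflow_type` mirror the calculator input fields, but passing them via the query string (and the helper name `buildCalculatorUrl`) is not specified by this document.

```ts
// Hypothetical prefill payload for the row CTA link.
type PrefillParams = {
  selected_model: string
  workflow_type?: string // omitted when the row has no workflow context
}

function buildCalculatorUrl(p: PrefillParams): string {
  const parts = [`selected_model=${encodeURIComponent(p.selected_model)}`]
  if (p.workflow_type) {
    parts.push(`workflow_type=${encodeURIComponent(p.workflow_type)}`)
  }
  return `/coding-agent-cost-calculator?${parts.join('&')}`
}
```

A plain link keeps the CTA crawlable and lets the calculator page fire `calculator_prefill_loaded` only after it has read the parameters.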

### 10.2 Empty / unknown states

If data is missing:
- show not disclosed / not publicly benchmarked
- keep row visible if other useful fields exist
- lower confidence
- never invent values

If too little data exists for a filter:
- show “insufficient public data for this filter”
- show the closest partial rows with caveats
- do not create a ranked list from missing values
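
The insufficient-data behavior could be modeled as a separate view state so the UI cannot accidentally render a ranking. In this sketch `FilterView`, `buildFilterView`, and `MIN_RANKABLE_ROWS` are hypothetical; the threshold of 3 is a placeholder because this document does not set one.

```ts
type RowStatus =
  | 'available'
  | 'partial'
  | 'not disclosed'
  | 'not publicly benchmarked'
  | 'stale'
  | 'source needs recheck'

type StatusRow = { model_id: string; data_status: RowStatus }

type FilterView =
  | { kind: 'ranked'; rows: StatusRow[] }
  | { kind: 'insufficient'; message: string; partial_rows: StatusRow[] }

const MIN_RANKABLE_ROWS = 3 // placeholder threshold, not defined by the PRD

function buildFilterView(rows: StatusRow[]): FilterView {
  const rankable = rows.filter((r) => r.data_status === 'available')
  if (rankable.length >= MIN_RANKABLE_ROWS) {
    return { kind: 'ranked', rows: rankable }
  }
  return {
    kind: 'insufficient',
    message: 'insufficient public data for this filter',
    // Closest partial rows are shown with caveats, but never as a ranked list.
    partial_rows: rows.filter(
      (r) => r.data_status === 'available' || r.data_status === 'partial'
    ),
  }
}
```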

### 10.3 Mobile requirements

At 390px width:
- filters wrap cleanly
- table has card mode or a readable horizontal-scroll strategy
- model row CTA remains visible
- source/freshness labels remain visible
- no horizontal overflow on calculator outputs

## 11. Analytics events

Preserve existing events from aicodingpricing:
- tool_start
- tool_result
- calculator_usage
- provider_recommended
- recommendation_shown
- pricing_click
- pricing_cta_click

Add P0/P1 events:

```ts
type LeaderboardEvent = {
  page_slug: 'llm_leaderboard' | 'best_llm_for_coding' | 'coding_agent_cost_calculator'
  selected_filter?: string
  selected_workflow?: string
  model_id?: string
  provider?: string
  source_confidence?: 'high' | 'medium' | 'low'
  data_status?: string
}
```

Event names:
- leaderboard_filter_click
- leaderboard_row_expand
- leaderboard_source_click
- model_task_cost_cta_click
- recommender_option_select
- recommender_result_shown
- calculator_prefill_loaded
- task_cost_calculated
- task_cost_sensitivity_changed

Success rule:
Analytics events should fire after the visible state changes or result renders. Do not fire success/intention events on mere button hover or before calculation output exists.
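
The success rule could be enforced with a thin wrapper that refuses to fire when no result exists. This is a sketch: `trackAfterResult` and the `Track` callback type are hypothetical, and the real analytics client is assumed to be injected by the caller.

```ts
type ResultEventName =
  | 'recommender_result_shown'
  | 'calculator_prefill_loaded'
  | 'task_cost_calculated'

type EventPayload = { page_slug: string; model_id?: string }

// Injected analytics function (assumption: the existing client can be adapted to this shape).
type Track = (name: ResultEventName, payload: EventPayload) => void

// Fires only when a rendered result exists; returns whether the event fired.
function trackAfterResult<T>(
  track: Track,
  name: ResultEventName,
  result: T | null | undefined,
  payload: EventPayload
): boolean {
  if (result === null || result === undefined) return false
  track(name, payload)
  return true
}
```

Call sites would invoke this after the calculation output or recommendation card has rendered, never from hover or click handlers that precede the result.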

## 12. NOT-DO

- Do not build a separate domain or new standalone site.
- Do not make generic LLM leaderboard claims.
- Do not claim to be the most authoritative global leaderboard.
- Do not create a fake overall model score across incomparable benchmarks.
- Do not fabricate benchmark, speed, context, pricing, or task-cost data.
- Do not fill missing data without public support.
- Do not do realtime scraping of every provider in P0.
- Do not build a self-developed benchmark in P0.
- Do not rank all GPT / Claude / Gemini / Kimi / Qwen / DeepSeek models by one absolute winner when evidence is not comparable.
- Do not mix subscription tool pricing with API token pricing without labels.
- Do not index duplicate /llm-leaderboard and /ai-coding-model-leaderboard pages with similar content.
- Do not ship empty /models/{model} pages or include noindex pages in sitemap.
- Do not bury source/freshness/caveats only in footer.
- Do not promise billing accuracy.
- Do not add accounts, payment, saved estimates, or a full admin backend in this PRD scope.

## 13. Acceptance criteria

### 13.1 Product acceptance

- [ ] /llm-leaderboard, /best-llm-for-coding, and /coding-agent-cost-calculator exist as P0 routes under aicodingpricing.com.
- [ ] /llm-leaderboard H1 and title state that the page is an AI coding model leaderboard, not a generic LLM leaderboard.
- [ ] /ai-coding-model-leaderboard is 301 redirected to /llm-leaderboard or noindexed/canonicalized to /llm-leaderboard.
- [ ] Leaderboard table is crawlable HTML and includes model, provider, coding evidence, input price, output price, cache price, context, best_for, caveat, source, last_checked, confidence.
- [ ] Every row has a visible caveat and at least one source/freshness label.
- [ ] Missing data renders as not disclosed / not publicly benchmarked / source needs recheck, not blank or invented values.
- [ ] Recommender supports at least: coding agent, cheapest good-enough, frontend generation, repo-level refactor.
- [ ] Each recommender result shows recommendation, cheaper alternative or why unavailable, safer alternative or caveat, evidence used, confidence, and calculator CTA.
- [ ] Calculator distinguishes token price from task cost and shows formula/assumptions.
- [ ] Calculator outputs estimated task cost, monthly cost, and sensitivity range.
- [ ] Mobile 390px has no horizontal overflow and CTAs remain usable.

### 13.2 SEO / GEO acceptance

- [ ] /llm-leaderboard visible copy is at least 1,200 words unless a documented density/UX waiver exists.
- [ ] /best-llm-for-coding visible copy is at least 1,200 words unless a documented waiver exists.
- [ ] /coding-agent-cost-calculator visible copy is at least 900 words plus calculator UI unless a documented waiver exists.
- [ ] Each P0 page has title, meta description, canonical, H1, H2 structure, OG title/description/url, and twitter card.
- [ ] Each P0 page includes short answer, methodology/caveat, source/freshness, and FAQ blocks.
- [ ] FAQPage schema only marks visible FAQs.
- [ ] Leaderboard page uses ItemList/Dataset only if implementation truthfully supports it.
- [ ] noindex pages are not in sitemap.
- [ ] Duplicate alias page does not compete with canonical route.

### 13.3 Data / implementation acceptance

- [ ] Every source-backed field has source_url, source_name, last_checked, confidence, and update_policy in the data layer.
- [ ] Official provider pricing is preferred over aggregators.
- [ ] OpenRouter/aggregator pricing is labeled as route-specific or aggregator data.
- [ ] Benchmark values remain source-specific; no fake universal composite score is generated.
- [ ] Subscription tool prices and API token prices are modeled separately.
- [ ] Calculator prefill from leaderboard row works for selected_model and workflow_type.
- [ ] Existing analytics events remain intact.
- [ ] New leaderboard/recommender/calculator events fire only after visible state/result exists.
- [ ] Build/deploy tasks must commit + push + deploy from same commit and return commit_sha, branch, deploy_url, deployment_source_commit, git_status_after.

### 13.4 QA acceptance

QA must verify real user tasks, not only page existence:

- [ ] User can choose “I’m building a coding agent” and reach a recommendation with evidence/caveat.
- [ ] User can filter for “cheapest good-enough” and see that cheap token price is not treated as guaranteed cheap task cost.
- [ ] User can open a leaderboard row and click into a prefilled task-cost calculator.
- [ ] User can change retry rate/tasks per month and see task/monthly cost change.
- [ ] User can identify at least one official pricing source per price-backed row.
- [ ] User can identify when a benchmark is missing or not publicly available.
- [ ] User can use the P0 flow on 390px mobile without blocked controls or hidden caveats.
- [ ] QA confirms no page claims “best overall LLM” or “most authoritative global leaderboard”.

## 14. Downstream handoff

### 14.1 For 墨笔 / SEO copy

Write the P0 pages around a clear claim:
AICodingPricing helps users choose coding models by workflow cost, not generic benchmark hype.

Mandatory copy blocks:
- short answer near top
- token price vs task cost explainer
- benchmark comparability caveat
- source/freshness block
- FAQ direct answers

Forbidden copy:
- best LLM overall
- most authoritative global leaderboard
- guaranteed cheapest coding model
- real billing quote
- complete/100% accurate ranking

### 14.2 For 墨影 / Design

Design around decision-making, not static article reading.

Critical UI blocks:
- above-fold short answer + workflow filter chips
- crawlable leaderboard table/card hybrid
- row-level source/freshness/caveat labels
- calculator CTA inside every row
- recommender result cards with confidence
- calculator formula and sensitivity range
- mobile card mode for table rows

Visual priority:
1. workflow selection
2. model evidence/cost comparison
3. task-cost CTA
4. source confidence/caveat
5. methodology/FAQ below

### 14.3 For 墨界 / 墨枢 / Implementation

Implement as existing-site inner pages. Reuse existing aicodingpricing data/calculator patterns where possible.

Required implementation objects:
- coding model source registry
- leaderboard row model
- benchmark evidence array per row
- pricing source fields
- confidence/data_status fields
- recommender option mapping
- calculator prefill payload

Hard implementation rules:
- No fabricated data fallbacks.
- No fake score aggregation.
- Canonical/redirect/noindex rules must be explicit.
- Sitemap only includes indexable complete pages.
- Commit + push + deploy from same commit for code-changing tasks.

## 15. Metadata snapshot

```json
{
  "p0_routes": [
    "/llm-leaderboard",
    "/best-llm-for-coding",
    "/coding-agent-cost-calculator"
  ],
  "user_tasks": [
    "choose a model for a coding agent and estimate task cost",
    "find the cheapest good-enough model for coding automation",
    "choose a model for frontend generation",
    "choose a model for repo-level refactor or long-context review",
    "compare Chinese coding workflow candidates only where public data exists",
    "compare benchmark evidence without trusting a fake universal ranking",
    "convert token pricing into estimated monthly coding-agent cost"
  ],
  "table_fields": [
    "model",
    "provider",
    "coding_evidence",
    "benchmark_source_columns",
    "input_price",
    "output_price",
    "cache_price",
    "context_window",
    "speed_signal",
    "estimated_task_cost_range",
    "best_for",
    "caveat",
    "source",
    "last_checked",
    "confidence",
    "data_status"
  ],
  "filters": [
    "cheapest_good_enough",
    "best_coding_evidence",
    "best_agent_model",
    "best_long_context",
    "best_frontend_generation",
    "best_repo_refactor",
    "best_code_review",
    "best_test_generation",
    "best_chinese_coding_workflow",
    "best_open_low_cost_model"
  ],
  "recommender_options": [
    "build_a_coding_agent",
    "cheapest_good_enough_automation",
    "frontend_generation_ui_code",
    "repo_level_refactor",
    "code_review_long_context_review",
    "test_generation_debugging",
    "chinese_coding_workflow"
  ],
  "not_do": [
    "separate_domain_or_new_site",
    "generic_llm_leaderboard_claims",
    "most_authoritative_global_leaderboard_claim",
    "fake_universal_composite_score",
    "fabricated_benchmark_speed_context_pricing_or_task_cost_data",
    "realtime_scraping_every_provider_in_p0",
    "self_developed_benchmark_in_p0",
    "absolute_winner_across_incomparable_models",
    "mix_subscription_tool_pricing_with_api_token_pricing_without_labels",
    "index_duplicate_alias_pages",
    "ship_empty_model_detail_pages",
    "promise_billing_accuracy",
    "accounts_payments_saved_estimates_full_admin_backend_in_this_scope"
  ],
  "acceptance_criteria": {
    "product": [
      "three P0 routes exist under aicodingpricing.com",
      "leaderboard is coding-specific and source-led",
      "every row has source freshness confidence caveat",
      "missing data is labeled not disclosed or not publicly benchmarked",
      "recommender covers owner-specified P0 options",
      "calculator separates token price from task cost"
    ],
    "seo": [
      "P0 pages have title meta canonical H1 H2 OG twitter schema where appropriate",
      "short answer methodology freshness and FAQ blocks visible",
      "duplicate alias handled by 301 noindex or canonical",
      "no noindex or thin pages in sitemap"
    ],
    "implementation": [
      "source-backed fields include source_url source_name last_checked confidence update_policy",
      "benchmark values stay source-specific",
      "official provider pricing preferred over aggregators",
      "calculator prefill works from leaderboard row",
      "analytics events fire after visible state/result exists"
    ],
    "qa": [
      "real user can complete model choice to cost estimate flow",
      "mobile 390px usable",
      "no fake best-overall or authoritative-global claims",
      "source and caveat visible for price-backed rows"
    ]
  }
}
```
