# aicodingpricing.com Model Leaderboard Data Contract

Date: 2026-05-28
Task: t_7a0cb927
Canonical route: /llm-leaderboard
Seed dataset: /root/.hermes/reports/aicodingpricing-leaderboard-20260528/model-leaderboard-seed.json

## Decision

Use a standalone source-led JSON contract for the first implementation pass.

I checked for an existing aicodingpricing repo or data structure under /root/projects and /mnt/data-hel1 and did not find one. Because the repo is not available in this worker, the deliverable is a repo-independent contract plus a seed file under the report root. The frontend can import the seed as a static JSON module first, then move it into the site’s native data directory once the repo path is supplied.

This contract preserves the existing aicodingpricing pricing intent:

`/llm-leaderboard` -> row CTA -> `/coding-agent-cost-calculator?model=<model>&workflow=<workflow>`

The leaderboard must not be a generic LLM leaderboard. It is a decision-engine dataset for coding workflows, pricing, caveats, and calculator prefill.
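
As a minimal sketch of the row CTA link above: the query parameter names (`model`, `workflow`) come from the route pattern in this contract, while the helper name `buildCalculatorCtaUrl` and the constant are hypothetical and not part of any existing codebase.

```ts
// Route taken from the CTA pattern in this contract
const CALCULATOR_ROUTE = '/coding-agent-cost-calculator'

// Hypothetical helper: builds the calculator link for one leaderboard row.
function buildCalculatorCtaUrl(model: string, workflow: string): string {
  // encodeURIComponent keeps unexpected characters from breaking the query string
  const query = `model=${encodeURIComponent(model)}&workflow=${encodeURIComponent(workflow)}`
  return `${CALCULATOR_ROUTE}?${query}`
}
```

Kebab-case model keys and snake_case workflow tags pass through unescaped, so the resulting links stay readable.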

## Required frontend object

```ts
type CodingModelLeaderboardDataset = {
  schema_version: string
  canonical_site: 'aicodingpricing.com'
  canonical_route: '/llm-leaderboard'
  purpose: string
  updated_at: string
  source_policy: SourcePolicy
  unknown_policy: UnknownPolicy
  workflow_tags: WorkflowTag[]
  models: CodingModelRow[]
}
```

## Required row contract

The task-required fields are preserved exactly on every row:

```ts
type CodingModelRow = {
  model: string
  provider: string
  benchmark_refs: BenchmarkRef[]
  price_input: PriceField
  price_output: PriceField
  cache_price: CachePriceField
  context: ContextField
  speed_source: SpeedSource | null
  best_for: WorkflowTag[]
  caveat: string
  source_urls: string[]
  updated_at: string

  // implementation helpers
  display_name: string
  availability: 'api' | 'subscription_tool' | 'open_model' | 'aggregator_route' | 'unknown'
  data_status: 'available' | 'partial' | 'not_disclosed' | 'not_publicly_benchmarked' | 'stale' | 'source_needs_recheck'
  source_confidence: 'high' | 'medium' | 'low'
}
```

## Field definitions

### model

Stable machine key for the row.

Rules:
- lowercase kebab-case
- should match provider API alias only when the alias is verified
- if benchmark name and API alias differ, use a conservative display_name and put alias uncertainty in caveat/source_note

Example:

```json
"model": "claude-sonnet-4-5"
```
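
The kebab-case rule could be checked with a small guard like the one below. This is a sketch under one assumption: a key consists only of lowercase letters and digits in hyphen-separated segments, so a dotted version such as `4.5` would have to be written as `4-5`. The names `MODEL_KEY_PATTERN` and `isValidModelKey` are hypothetical.

```ts
// ASSUMPTION: segments of lowercase letters/digits joined by single hyphens
const MODEL_KEY_PATTERN = /^[a-z0-9]+(?:-[a-z0-9]+)*$/

// Hypothetical guard: true when the key follows the lowercase kebab-case rule.
function isValidModelKey(key: string): boolean {
  return MODEL_KEY_PATTERN.test(key)
}
```

The guard only checks the shape of the key. Whether the key matches a verified provider alias still needs a manual source check.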

### provider

Human-readable provider name.

Examples:
- OpenAI
- Anthropic
- Google
- DeepSeek
- Moonshot AI / Kimi
- Alibaba Cloud / Qwen

### benchmark_refs

Array of source-specific benchmark values.

```ts
type BenchmarkRef = {
  benchmark_name: string
  metric_label: string
  metric_value: string | number | 'not_publicly_benchmarked'
  source_name: string
  source_url: string
  last_checked: string
  confidence: 'high' | 'medium' | 'low'
  caveat: string
}
```

Rules:
- Do not merge benchmarks into one fake universal score.
- If exact model/source alignment is not verified, set metric_value to not_publicly_benchmarked, or keep the value and state the mismatch in caveat.
- A row can be visible with partial evidence, but UI must show confidence and caveat.

Accepted P0 benchmark source types:
- SWE-bench Verified / Lite / Full / Multilingual / Multimodal
- Aider polyglot coding benchmark
- LiveCodeBench
- LiveBench
- Artificial Analysis Coding Index
- LM Arena code/webdev
- Kilo usage signal, only labeled as usage signal

### price_input / price_output

Official API token pricing fields.

```ts
type PriceField = {
  value_usd_per_1m_tokens: number | null
  status: 'available' | 'not_disclosed' | 'not_applicable' | 'source_needs_recheck'
  source_name?: string
  source_url?: string
  last_checked?: string
  confidence?: 'high' | 'medium' | 'low'
  source_note?: string
}
```

Rules:
- Prefer official provider pricing.
- OpenRouter or other aggregator pricing can only be used if labeled route-specific.
- Do not copy a price from a nearby/similar model alias.
- Subscription tool pricing must not be mixed with API token pricing.

### cache_price

Cache pricing is separated from base input price.

```ts
type CachePriceField = {
  read_usd_per_1m_tokens: number | null
  write_5m_usd_per_1m_tokens: number | null
  write_1h_usd_per_1m_tokens: number | null
  status: 'available' | 'partial' | 'not_disclosed' | 'not_applicable' | 'source_needs_recheck'
  source_name?: string
  source_url?: string
  last_checked?: string
  confidence?: 'high' | 'medium' | 'low'
  source_note?: string
}
```

Provider-specific notes:
- Anthropic exposes cache write duration and cache hits separately.
- DeepSeek exposes cache-hit input price, not the same shape as Anthropic cache write/read.
- Google Gemini may expose context cache and storage charges; model exactness must be verified before filling.
- If a provider's cache pricing is unclear, set the numeric fields to null and status to not_disclosed.

### context

Advertised maximum context window, not effective long-context reliability.

```ts
type ContextField = {
  tokens: number | null
  status: 'available' | 'not_disclosed' | 'source_needs_recheck'
  source_name?: string
  source_url?: string
  last_checked?: string
  confidence?: 'high' | 'medium' | 'low'
  source_note?: string
}
```

Rules:
- Only fill numeric tokens from official model/provider docs.
- Do not infer context from benchmark rows.
- If a provider has tiered context pricing, model it in source_note or a future pricing_modes array.

### speed_source

Public speed/latency source.

```ts
type SpeedSource = {
  metric: 'TTFT' | 'tokens_per_second' | 'latency_note'
  value: string | number
  source_name: string
  source_url: string
  last_checked: string
  confidence: 'high' | 'medium' | 'low'
  caveat: string
}
```

Rules:
- Keep null unless a public source publishes speed/latency for the exact model/route.
- Do not infer speed from model family, provider reputation, or price.

### best_for

Editorial workflow tags derived from visible evidence fields.

```ts
type WorkflowTag =
  | 'coding_agent'
  | 'frontend_generation'
  | 'repo_refactor'
  | 'bug_fixing'
  | 'code_review'
  | 'test_generation'
  | 'chinese_coding_workflow'
  | 'low_cost_agent'
  | 'long_context'
```

Rules:
- best_for is not a claim of universal superiority.
- UI labels must read “best for X” or “candidate for X”, not “best overall”.
- If evidence is partial, show the row as partial and display its caveat.

### caveat

Required human-visible caveat for each row.

Good caveats explain:
- missing exact model alias mapping
- missing price/context/speed field
- benchmark/source limitations
- token price vs task cost risk
- aggregator or route-specific pricing caveat

### source_urls

Flat list of all URLs used by the row.

Rules:
- At least one source_url per visible row.
- UI should expose source/freshness near row, not only in footer.
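
One way to keep the flat list consistent with the field-level sources is to derive it. The sketch below assumes source_urls should contain every field-level source_url exactly once; the helper name `collectSourceUrls` and the reduced `RowSources` shape are hypothetical.

```ts
// Minimal shapes: only the URL-bearing parts of a row, per the field definitions above
type Sourced = { source_url?: string }

type RowSources = {
  benchmark_refs: Sourced[]
  price_input: Sourced
  price_output: Sourced
  cache_price: Sourced
  context: Sourced
  speed_source: Sourced | null
}

// Hypothetical helper: gathers field-level URLs into a deduplicated flat list.
function collectSourceUrls(row: RowSources): string[] {
  const fields: Array<Sourced | null> = [
    ...row.benchmark_refs,
    row.price_input,
    row.price_output,
    row.cache_price,
    row.context,
    row.speed_source,
  ]
  const urls: string[] = []
  for (const field of fields) {
    const url = field?.source_url
    // Skip missing URLs and duplicates; first-seen order is preserved
    if (url && urls.indexOf(url) === -1) urls.push(url)
  }
  return urls
}
```

A row whose derived list is empty would fail the "at least one source_url per visible row" rule.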

### updated_at

ISO 8601 date (YYYY-MM-DD) of the row's last seed update.

For this seed: 2026-05-28.

## Source policy

Hard rule: no fabricated model, benchmark, speed, context, pricing, or task-cost data.

Every source-backed field must include:
- source_url
- source_name
- last_checked
- confidence
- source_note or caveat when model alias is not exact

Accepted source states:
- available
- partial
- not_disclosed
- not_publicly_benchmarked
- stale
- source_needs_recheck

Display rules:
- Provider does not disclose value: show not disclosed.
- Exact model has no public benchmark in a selected source: show not publicly benchmarked.
- Third-party source only: label as third-party/aggregator/usage signal.
- Official provider docs win over aggregators for pricing.
- Do not blend benchmark scores from different sources.
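
The display rules can be kept in one lookup so every status renders the same way across the UI. In this sketch only the "not disclosed" and "not publicly benchmarked" wordings come from the rules above; the other labels, and the names `STATUS_LABELS` and `displayStatus`, are placeholder assumptions for product/content to confirm.

```ts
type DataStatus =
  | 'available'
  | 'partial'
  | 'not_disclosed'
  | 'not_publicly_benchmarked'
  | 'stale'
  | 'source_needs_recheck'

// Record<DataStatus, string> makes the compiler flag any status without a label
const STATUS_LABELS: Record<DataStatus, string> = {
  available: 'available', // placeholder wording
  partial: 'partial data', // placeholder wording
  not_disclosed: 'not disclosed',
  not_publicly_benchmarked: 'not publicly benchmarked',
  stale: 'stale', // placeholder wording
  source_needs_recheck: 'source needs recheck', // placeholder wording
}

// Hypothetical helper: maps a machine status to its human-visible label.
function displayStatus(status: DataStatus): string {
  return STATUS_LABELS[status]
}
```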

## Unknown policy

Use null for numeric unknowns and an explicit status value for UI display.

```json
{
  "numeric_unknown": null,
  "display_unknown": "not_disclosed",
  "benchmark_unknown": "not_publicly_benchmarked"
}
```

Never use:
- 0 for unknown price/context/speed
- empty string for unknown values
- “N/A” without status/source_note
- inferred price from similar model names
- inferred benchmark from predecessor/successor models
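
The first three items in the list above can be caught mechanically when the seed is loaded. The sketch below is an assumption-laden lint, not a full validator: it treats every 0 as a placeholder (a genuinely free route would need an explicit exception), and it cannot detect inferred prices or benchmarks, which need a source review. The name `unknownPolicyViolation` is hypothetical.

```ts
// Hypothetical lint: returns a reason string when a raw field value
// looks like a forbidden placeholder, or null when it passes.
function unknownPolicyViolation(value: unknown): string | null {
  // ASSUMPTION: 0 is always a placeholder for price/context/speed fields
  if (value === 0) return 'use null instead of 0 for an unknown numeric value'
  if (value === '') return 'use null plus a status instead of an empty string'
  if (typeof value === 'string' && value.trim().toUpperCase() === 'N/A') {
    return 'use a status such as not_disclosed instead of a bare N/A'
  }
  return null
}
```

A null value passes, since null is the contract's own marker for numeric unknowns.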

## Calculator prefill contract

Each row should support this derived payload for CTA links:

```ts
type CalculatorPrefill = {
  model: string
  workflow_type: WorkflowTag
  price_input_usd_per_1m_tokens: number | null
  price_output_usd_per_1m_tokens: number | null
  cache_read_usd_per_1m_tokens: number | null
  cache_write_usd_per_1m_tokens: number | null
  default_input_tokens_per_task: number | null
  default_output_tokens_per_task: number | null
  default_retry_rate: number | null
  default_tasks_per_month: number | null
  assumption_source: 'user_default' | 'site_default' | 'source_backed' | 'not_set'
}
```

P0 recommendation: the default token/task fields should stay null with assumption_source set to not_set, or be labeled site_default, until product/content explicitly defines them. The seed does not fabricate average task token counts.
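
A derivation of the prefill payload from a row might look like the sketch below. Two points are assumptions rather than contract: the single cache_write_usd_per_1m_tokens field is mapped to the 5-minute write price (the contract does not say which duration applies), and the helper name `toCalculatorPrefill` and the reduced `PrefillRow` shape are hypothetical.

```ts
type FieldStatus = 'available' | 'partial' | 'not_disclosed' | 'not_applicable' | 'source_needs_recheck'

// Minimal row shape: only the fields the prefill reads
type PrefillRow = {
  model: string
  best_for: string[]
  price_input: { value_usd_per_1m_tokens: number | null; status: FieldStatus }
  price_output: { value_usd_per_1m_tokens: number | null; status: FieldStatus }
  cache_price: {
    read_usd_per_1m_tokens: number | null
    write_5m_usd_per_1m_tokens: number | null
    status: FieldStatus
  }
}

type PrefillPayload = {
  model: string
  workflow_type: string
  price_input_usd_per_1m_tokens: number | null
  price_output_usd_per_1m_tokens: number | null
  cache_read_usd_per_1m_tokens: number | null
  cache_write_usd_per_1m_tokens: number | null
  default_input_tokens_per_task: number | null
  default_output_tokens_per_task: number | null
  default_retry_rate: number | null
  default_tasks_per_month: number | null
  assumption_source: 'user_default' | 'site_default' | 'source_backed' | 'not_set'
}

// Hypothetical helper: returns null when no workflow can be determined.
function toCalculatorPrefill(row: PrefillRow, selectedWorkflow?: string): PrefillPayload | null {
  const workflow = selectedWorkflow ?? row.best_for[0]
  if (workflow === undefined) return null

  // Only pass a number through when its field status marks it as source-backed
  const ifBacked = (value: number | null, ok: boolean): number | null => (ok ? value : null)
  const cacheOk = row.cache_price.status === 'available' || row.cache_price.status === 'partial'

  return {
    model: row.model,
    workflow_type: workflow,
    price_input_usd_per_1m_tokens: ifBacked(row.price_input.value_usd_per_1m_tokens, row.price_input.status === 'available'),
    price_output_usd_per_1m_tokens: ifBacked(row.price_output.value_usd_per_1m_tokens, row.price_output.status === 'available'),
    cache_read_usd_per_1m_tokens: ifBacked(row.cache_price.read_usd_per_1m_tokens, cacheOk),
    // ASSUMPTION: the single cache_write field maps to the 5-minute write price
    cache_write_usd_per_1m_tokens: ifBacked(row.cache_price.write_5m_usd_per_1m_tokens, cacheOk),
    // Task defaults stay unset until product/content defines them
    default_input_tokens_per_task: null,
    default_output_tokens_per_task: null,
    default_retry_rate: null,
    default_tasks_per_month: null,
    assumption_source: 'not_set',
  }
}
```

Keeping the task defaults at null means the calculator, not the seed, stays responsible for any token-count assumptions.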

## Recommender contract

```ts
type RecommenderResult = {
  selected_workflow: WorkflowTag
  recommendation_label: string
  recommended_models: string[]
  why: string[]
  evidence_used: {
    benchmark_sources: string[]
    pricing_sources: string[]
    context_sources: string[]
    speed_sources?: string[]
  }
  cheaper_alternative?: string
  safer_alternative?: string
  caveat: string
  confidence: 'high' | 'medium' | 'low'
  calculator_prefill: CalculatorPrefill
}
```

Recommender rules:
- It can shortlist, not crown a universal winner.
- If too little data exists for a workflow, render “insufficient public data for this filter”.
- cheaper_alternative must be based on source-backed price, not assumption.
- safer_alternative must explain the evidence/caveat tradeoff.
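
The shortlist-or-fallback behavior in these rules could be sketched as follows. The fallback message is the wording from the rule above; the `has_evidence` flag, the result shape, and the name `shortlistForWorkflow` are hypothetical simplifications, and no ranking is applied because the contract forbids crowning a universal winner.

```ts
// Reduced row shape; has_evidence stands in for the benchmark/price/context checks
type ShortlistRow = { model: string; best_for: string[]; has_evidence: boolean }

type ShortlistResult =
  | { kind: 'shortlist'; models: string[] }
  | { kind: 'insufficient_data'; message: string }

// Hypothetical helper: returns candidates for a workflow, never a single winner.
function shortlistForWorkflow(rows: ShortlistRow[], workflow: string): ShortlistResult {
  const models = rows
    .filter((row) => row.has_evidence && row.best_for.indexOf(workflow) !== -1)
    .map((row) => row.model)
  if (models.length === 0) {
    return { kind: 'insufficient_data', message: 'insufficient public data for this filter' }
  }
  return { kind: 'shortlist', models }
}
```

The discriminated `kind` field lets the UI branch between rendering candidates and rendering the fallback message without inspecting array lengths.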

## Frontend validation rules

Before rendering a row as price-backed:
- price_input.status must be available.
- price_output.status must be available.
- the price source (source_name / source_note) must identify an official provider page or a clearly labeled aggregator route.
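
A guard for the price-backed check might look like this sketch. It covers only the machine-checkable parts (status, a non-null value, a source URL); the contract defines no machine-readable source-type field, so the official-versus-aggregator rule still needs a human or an added field. The names `isPriceBacked` and `PriceFieldLike` are hypothetical.

```ts
type PriceStatus = 'available' | 'not_disclosed' | 'not_applicable' | 'source_needs_recheck'

// Reduced PriceField shape: only what the guard reads
type PriceFieldLike = {
  value_usd_per_1m_tokens: number | null
  status: PriceStatus
  source_url?: string
}

// Hypothetical guard: both input and output prices must be available and sourced.
function isPriceBacked(input: PriceFieldLike, output: PriceFieldLike): boolean {
  const backed = (price: PriceFieldLike): boolean =>
    price.status === 'available' &&
    price.value_usd_per_1m_tokens !== null &&
    Boolean(price.source_url)
  return backed(input) && backed(output)
}
```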

Before rendering a row as benchmark-backed:
- at least one benchmark_refs item must have metric_value other than not_publicly_benchmarked.
- benchmark source and caveat must be visible.
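
The first benchmark-backed condition reduces to a one-line check, sketched below with a hypothetical name (`isBenchmarkBacked`) and a reduced ref shape. Visibility of the source and caveat is a rendering concern and is not covered here.

```ts
// Reduced BenchmarkRef shape: only the field the guard reads
type BenchmarkRefLike = { metric_value: string | number }

// Hypothetical guard: true when at least one ref carries a real metric value.
function isBenchmarkBacked(refs: BenchmarkRefLike[]): boolean {
  return refs.some((ref) => ref.metric_value !== 'not_publicly_benchmarked')
}
```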

Before rendering a workflow recommendation:
- row.best_for includes selected workflow.
- row has either benchmark evidence or price/context evidence relevant to that workflow.
- confidence and caveat are displayed.

Before enabling calculator prefill:
- model can always prefill from row.model.
- workflow_type can prefill from selected workflow or first best_for tag.
- numeric price fields only prefill if source-backed; otherwise leave blank and explain missing price.

## Seed dataset notes

The seed intentionally includes partial rows because the PRD requires unknowns to stay visible rather than fabricated.

Current seed coverage:
- Anthropic Claude Opus/Sonnet/Haiku rows have official pricing and captured SWE-bench evidence, but context remains not_disclosed pending model docs verification.
- DeepSeek V4 Flash/Pro rows have official pricing and context, but exact public coding benchmark values are not verified in captured sources.
- OpenAI GPT-5 Mini, Google Gemini 3 Flash, Kimi K2.5, and Qwen3 235B A22B are useful benchmark/workflow candidates, but exact pricing/model alias mapping needs follow-up verification.
- speed_source is null for every row because no exact public speed source was verified during this task.

## Sources used in this task

- PRD: /root/.hermes/reports/aicodingpricing-leaderboard-20260528/prd.md
- SEO/SERP plan: /root/.hermes/reports/aicodingpricing-leaderboard-20260528/seo-serp-plan.md
- Owner brief: /root/.hermes/reports/aicodingpricing-leaderboard-20260528/input-brief.md
- OpenAI API pricing: https://openai.com/api/pricing/
- Anthropic Claude pricing: https://docs.anthropic.com/en/docs/about-claude/pricing
- Gemini API pricing: https://ai.google.dev/gemini-api/docs/pricing
- DeepSeek pricing: https://api-docs.deepseek.com/quick_start/pricing
- Kimi pricing index: https://platform.moonshot.ai/docs/pricing
- SWE-bench leaderboards: https://www.swebench.com/
- Aider leaderboards: https://aider.chat/docs/leaderboards/
- Alibaba Cloud Model Studio pricing page discovered: https://www.alibabacloud.com/help/en/model-studio/model-pricing

## Next inputs needed

1. Existing aicodingpricing repository path and data directory conventions.
2. Confirm P0 model list: keep this seed at 8 rows or expand to 12–20 rows.
3. Exact model alias mapping for OpenAI, Gemini, Kimi, Qwen rows.
4. Exact official context-window source for Anthropic/OpenAI/Gemini/Kimi/Qwen rows.
5. Public speed source choice, if speed column must ship in V1.
6. Calculator default task assumptions, or approval to leave task token defaults blank until user input.
7. Decide whether partial rows can be visible at launch or hidden behind “source needs recheck”.
