# aicodingpricing leaderboard — Final QA

Date: 2026-06-05T05:26:31Z
Task: t_49a90ba2
Site: https://aicodingpricing.com
Scope:
- https://aicodingpricing.com/llm-leaderboard
- https://aicodingpricing.com/best-llm-for-coding
- https://aicodingpricing.com/coding-agent-cost-calculator

## Verdict

QA_GO.

Final real-user-task QA passes for the model decision flow. The prior P0 data-integrity blocker is fixed in production: GPT-5.4 mini no longer shows the unverified 56.20% SWE-bench numeric score, and the row now visibly labels its benchmark evidence as "not publicly benchmarked" with an "exact alias recheck" source note.

No launch blockers found in this scoped QA.

## Evidence files

Worker evidence:
- /root/.hermes/kanban/boards/site-review/workspaces/t_49a90ba2/production_audit.json
- /root/.hermes/kanban/boards/site-review/workspaces/t_49a90ba2/link_placeholder_audit.json
- /root/.hermes/kanban/boards/site-review/workspaces/t_49a90ba2/qa_browser_check.json
- /root/.hermes/kanban/boards/site-review/workspaces/t_49a90ba2/qa_click_deep.json

## Production smoke

| URL | Result |
|---|---|
| / | 200 |
| /llm-leaderboard | 200 |
| /best-llm-for-coding | 200 |
| /coding-agent-cost-calculator | 200 |
| /sitemap.xml | 200 |
| /robots.txt | 200 |

A scoped internal-link crawl starting from the three QA pages found that the exposed internal routes returned 200. These include the legal/footer pages, pricing guides, comparison pages, leaderboard, best-LLM page, calculator, and sitemap.
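The smoke table above could be reproduced with a small status check. This is a minimal sketch, not the tooling that produced the evidence files: the `smoke` and `fetch_status` names are hypothetical, and the fetcher is injectable so the logic can be exercised without network access.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

BASE = "https://aicodingpricing.com"
PATHS = [
    "/",
    "/llm-leaderboard",
    "/best-llm-for-coding",
    "/coding-agent-cost-calculator",
    "/sitemap.xml",
    "/robots.txt",
]


def fetch_status(url: str) -> int:
    """Return the HTTP status for url, including 4xx/5xx responses."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.status
    except HTTPError as err:
        # urlopen raises on error statuses; the code is still the result we want.
        return err.code


def smoke(paths, fetch=fetch_status, base=BASE):
    """Map each path to the status code returned by fetch(base + path)."""
    return {path: fetch(base + path) for path in paths}


def failures(results, expected=200):
    """Return only the paths whose status differs from the expected one."""
    return {path: code for path, code in results.items() if code != expected}


# Hypothetical usage against production (performs real requests):
# print(failures(smoke(PATHS)))
```

An empty `failures(...)` result corresponds to the all-200 table above.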

## Real-user-task results

| Task | Result | Evidence |
|---|---|---|
| 1. Find best model for a coding agent | PASS | /llm-leaderboard has “Best for coding agents” shortlist: Claude Opus 4.5, Claude Sonnet 4.5, Gemini 3 Flash high reasoning. Copy states this is not a universal winner and asks user to estimate task cost before choosing. |
| 2. Find cheapest good-enough coding model | PASS | /llm-leaderboard has “Cheapest good-enough candidates”: Claude Haiku 4.5, GPT-5.4 mini, DeepSeek V4 Flash. It warns not to call a model cheapest overall unless task assumptions are visible. |
| 3. Compare Claude vs GPT coding choice | PASS with residual risk | /best-llm-for-coding FAQ explains how to compare Claude, GPT, Gemini, DeepSeek, Kimi, and Qwen by workflow/source coverage. Leaderboard shows Claude rows and GPT-5.4 mini evidence/pricing side by side. Residual: no dedicated /claude-vs-gpt-for-coding route yet; existing comparison nav is tool-level Claude Code vs Codex. |
| 4. Move from leaderboard to calculator and understand monthly task cost | PASS | Fresh Playwright click tests: hero Estimate Task Cost, GPT shortlist link, and GPT row CTA all navigate to /coding-agent-cost-calculator. GPT-5.4 mini low-cost prefill shows $0.1800/task and $7.2000/month under default assumptions. Claude Opus 4.5 coding-agent prefill shows $1.1250/task and $45.0000/month. |
| 5. Verify mobile usability and no placeholder/coming-soon/indexable thin pages | PASS | 390px mobile: no horizontal overflow; hero readable; one hamburger/menu; menu expands to Pricing, Leaderboard, Best LLM, Calculator, Compare, Changelog, Alerts. Placeholder scan found no coming soon / placeholder / lorem ipsum / prototype / under construction text in scoped pages. |
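As a sanity check on the calculator figures in task 4, the per-task and per-month values can be divided to recover the task volume they imply. The sketch below only uses the four dollar amounts reported above; the default assumptions themselves are not listed in this report, so the resulting task count is an inference, and the function name is hypothetical.

```python
from decimal import Decimal


def implied_tasks_per_month(cost_per_task: str, cost_per_month: str) -> Decimal:
    """Divide monthly cost by per-task cost using exact decimal arithmetic."""
    return Decimal(cost_per_month) / Decimal(cost_per_task)


# Values copied from the calculator prefills reported in task 4.
gpt_mini = implied_tasks_per_month("0.1800", "7.2000")
opus = implied_tasks_per_month("1.1250", "45.0000")

print(gpt_mini, opus)
```

Both prefills imply the same volume of 40 tasks per month, so the two reported price pairs are internally consistent with a single shared default.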

## Data-integrity recheck

PASS.

Observed GPT-5.4 mini production row:
- status: PARTIAL EVIDENCE
- benchmark value: not publicly benchmarked
- benchmark/source label: SWE-bench Leaderboards exact alias recheck
- checked: 2026-06-05
- pricing: $0.75 input / 1M, $4.5 output / 1M
- caveat: exact public coding benchmark for this alias is not verified

Regex audit found no “GPT-5.4 mini” within 500 chars of “56.20”.
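A proximity check of this kind can be sketched as follows. The exact pattern used in the audit is not recorded here, so this is one plausible form: it matches the two strings in either order with at most `window` characters between them, and the function name is hypothetical.

```python
import re


def appears_near(text: str, a: str, b: str, window: int = 500) -> bool:
    """True if a and b occur with at most `window` characters between them."""
    ea, eb = re.escape(a), re.escape(b)
    gap = rf".{{0,{window}}}"  # expands to .{0,500} for the default window
    pattern = re.compile(rf"{ea}{gap}{eb}|{eb}{gap}{ea}", re.DOTALL)
    return pattern.search(text) is not None


# Illustrative strings, not production HTML.
bad_row = "GPT-5.4 mini | SWE-bench Verified | 56.20%"
good_row = "GPT-5.4 mini | SWE-bench Verified | not publicly benchmarked"

print(appears_near(bad_row, "GPT-5.4 mini", "56.20"))   # True
print(appears_near(good_row, "GPT-5.4 mini", "56.20"))  # False
```

`re.DOTALL` lets the gap span line breaks, which matters when the model name and the score sit in different table cells or HTML elements.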

## Mobile result

PASS.

Fresh Playwright 390x844 check:
- viewport width: 390
- document scrollWidth: 390
- overflow-x: false
- H1 readable: “Choose the Best LLM for Your Coding Workflow”
- before opening menu: only logo visible in nav, one menu control
- after opening menu: Pricing, Leaderboard, Best LLM, Calculator, Compare, Changelog, Alerts visible

Visual browser inspection confirmed cards stack cleanly, CTAs remain tappable, and evidence cards do not overlap.
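The overflow result above compares the document's `scrollWidth` with the viewport width. A minimal sketch of that measurement with Playwright's Python sync API is shown below; the helper names are hypothetical and this is not the script that produced `qa_browser_check.json`. Playwright is imported lazily so the comparison helper can be used without a browser installed.

```python
VIEWPORT = {"width": 390, "height": 844}


def has_horizontal_overflow(scroll_width: int, viewport_width: int) -> bool:
    """A page overflows horizontally when its content is wider than the viewport."""
    return scroll_width > viewport_width


def measure_scroll_width(url: str) -> int:
    """Load url at the mobile viewport and return document scrollWidth."""
    from playwright.sync_api import sync_playwright  # requires `pip install playwright`

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page(viewport=VIEWPORT)
            page.goto(url)
            return page.evaluate("document.documentElement.scrollWidth")
        finally:
            browser.close()


# Hypothetical usage (launches a real browser):
# width = measure_scroll_width("https://aicodingpricing.com/llm-leaderboard")
# print(has_horizontal_overflow(width, VIEWPORT["width"]))
```

With the reported values (scrollWidth 390, viewport 390), `has_horizontal_overflow` returns `False`, matching the `overflow-x: false` line.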

## No placeholder / thin-page check

PASS for exposed scoped flow.

- No bad phrases found on /llm-leaderboard, /best-llm-for-coding, /coding-agent-cost-calculator.
- Exposed internal routes from these pages returned 200.
- Proposed expansion routes such as /claude-vs-gpt-for-coding, /cheapest-coding-model, /coding-model-benchmark, and /llm-api-pricing-comparison are not exposed as clickable routes in the scoped pages and return 404 if typed directly. This is acceptable for the current P0 scope, but these routes should be implemented or redirected before they are linked or indexed.
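The bad-phrase check can be sketched as a case-insensitive substring scan. The phrase list below is taken from the task 5 evidence; the function name is hypothetical, and the sketch assumes it receives visible page text rather than raw HTML, since an HTML `placeholder="..."` attribute on a form input would otherwise produce a false positive.

```python
BAD_PHRASES = (
    "coming soon",
    "placeholder",
    "lorem ipsum",
    "prototype",
    "under construction",
)


def find_bad_phrases(visible_text: str, phrases=BAD_PHRASES) -> list[str]:
    """Return the phrases that occur in the text, ignoring case."""
    lowered = visible_text.lower()
    return [phrase for phrase in phrases if phrase in lowered]


# Illustrative strings, not production content.
print(find_bad_phrases("Estimate task cost before choosing a model."))  # []
print(find_bad_phrases("Benchmarks page: Coming Soon"))  # ['coming soon']
```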

## Residual risk

1. A dedicated Claude-vs-GPT model-decision route is absent. The current flow supports the task through the FAQ and evidence table, but a dedicated page or redirect would serve the “Claude vs GPT for coding” search and user intent better.
2. The GPT-5.4 mini row still displays “SWE-bench Verified” as the metric label next to the value “not publicly benchmarked”. The fake numeric claim is gone, but a label such as “SWE-bench: not publicly benchmarked / alias recheck” would be clearer.
3. The Browser MCP session showed local-session pollution during one manual click attempt. A fresh Playwright browser and static HTML checks found no 127.0.0.1/localhost references, and production clicks worked. Treat this as non-production evidence noise, not a site blocker.

## Blockers

None.

## Metadata

```json
{
  "verdict": "QA_GO",
  "tested_urls": [
    "https://aicodingpricing.com/llm-leaderboard",
    "https://aicodingpricing.com/best-llm-for-coding",
    "https://aicodingpricing.com/coding-agent-cost-calculator",
    "https://aicodingpricing.com/coding-agent-cost-calculator?model=gpt-5.4-mini&workflow=low_cost_agent",
    "https://aicodingpricing.com/coding-agent-cost-calculator?model=claude-opus-4-5&workflow=coding_agent",
    "https://aicodingpricing.com/sitemap.xml",
    "https://aicodingpricing.com/robots.txt"
  ],
  "user_task_results": {
    "find_best_model_for_coding_agent": "PASS",
    "find_cheapest_good_enough_model": "PASS",
    "compare_claude_vs_gpt_coding_choice": "PASS_WITH_RESIDUAL_RISK",
    "leaderboard_to_calculator_monthly_cost": "PASS",
    "no_placeholder_thin_mobile": "PASS"
  },
  "mobile_result": "PASS: 390px no horizontal overflow; menu expands; CTAs usable; cards stack cleanly",
  "blockers": [],
  "residual_risk": [
    "No dedicated /claude-vs-gpt-for-coding route yet; current support is FAQ + evidence table.",
    "GPT-5.4 mini metric label could be clearer than SWE-bench Verified while value is not publicly benchmarked.",
    "Browser MCP session pollution observed once; fresh Playwright production click checks passed."
  ]
}
```
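A consumer of this metadata might want to confirm that the verdict agrees with the detailed fields. The rule below is an assumption for illustration, not a documented contract: it treats `QA_GO` as valid only when `blockers` is empty and every task result starts with `PASS`. The function name is hypothetical and the sample is abbreviated from the block above.

```python
import json


def verdict_is_consistent(meta: dict) -> bool:
    """Check that QA_GO is reported exactly when nothing blocks or fails."""
    results = meta.get("user_task_results", {})
    all_pass = all(str(value).startswith("PASS") for value in results.values())
    no_blockers = not meta.get("blockers")
    is_go = meta.get("verdict") == "QA_GO"
    return is_go == (all_pass and no_blockers)


sample = json.loads("""
{
  "verdict": "QA_GO",
  "user_task_results": {
    "find_best_model_for_coding_agent": "PASS",
    "compare_claude_vs_gpt_coding_choice": "PASS_WITH_RESIDUAL_RISK"
  },
  "blockers": []
}
""")

print(verdict_is_consistent(sample))  # True
```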
