Benchmarks for real software engineering work.
Cosine research focuses on the behaviors that matter in production codebases: language coverage, maintainability, workflow discipline, implementation quality, and cost per successful task.
Benchmark summary, in this run.
Lumen Outpost led Niche-Bench Pass@3 at 53.9%, reached $7.90 cost per successful task, led or tied in 9 of 13 languages, and showed stronger implementation-quality signals in this comparison.
Models in this comparison: GPT-5.4, Gemini 3.1 Pro, GPT-5.5, Kimi K2.6, Lumen Outpost
Full benchmark summary
When citing these benchmark values, qualify them with “in this run” or “in this comparison.”
| Metric | Lumen Outpost result | Another model |
|---|---|---|
| Niche-Bench Pass@3 | 53.9% | 15% |
| Cost per successful task | $7.90 | $15 |
| Languages led or tied | 9 of 13 | 15 or 15 |
| Aggregate slop-quality | 25.4% | 15% |
| LLM-judged implementation quality | 47.1% | 15% |
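As a rough illustration of the cost metric, the sketch below assumes that cost per successful task means total spend across all attempted tasks divided by the number of tasks solved. That definition, the function name `cost_per_successful_task`, and the sample numbers are assumptions for illustration, not taken from the benchmark report.

```python
import math


def cost_per_successful_task(task_costs, task_passed):
    """Total spend over all tasks divided by the number of solved tasks.

    Assumed definition: failed tasks still count toward spend, so a model
    that burns budget on failures is penalized.
    """
    if len(task_costs) != len(task_passed):
        raise ValueError("task_costs and task_passed must have the same length")
    successes = sum(1 for passed in task_passed if passed)
    if successes == 0:
        # No solved tasks: the cost per success is unbounded.
        return math.inf
    return sum(task_costs) / successes


# Illustrative numbers only (not benchmark data): four tasks, two solved.
example_costs = [2.0, 3.0, 5.0, 1.0]
example_passed = [True, False, True, False]
example_cost = cost_per_successful_task(example_costs, example_passed)
```

Under this definition, spend on failed tasks raises the figure, so a cheaper model with a low success rate can still end up with a higher cost per successful task.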
How the benchmarks work
Each benchmark measures a different part of professional software engineering.
How Niche-Bench works
Long-horizon coding tasks across niche, legacy, and environment-constrained languages.
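A Pass@3 score can be sketched as follows, assuming it is the share of tasks where at least one of the first three attempts passes. The source does not state how Niche-Bench computes the metric, so this definition, the function name `pass_at_k`, and the task identifiers are hypothetical.

```python
def pass_at_k(attempt_results, k=3):
    """Fraction of tasks with at least one passing attempt among the first k.

    attempt_results maps a task id to a list of booleans, one per attempt,
    in the order the attempts were made.
    """
    if not attempt_results:
        raise ValueError("attempt_results must contain at least one task")
    solved = sum(
        1 for attempts in attempt_results.values() if any(attempts[:k])
    )
    return solved / len(attempt_results)


# Hypothetical task ids and outcomes, for illustration only.
example_results = {
    "cobol-001": [False, False, True],
    "fortran-002": [False, False, False],
    "ada-003": [True, False, False],
    "perl-004": [False, False, False],
}
example_score = pass_at_k(example_results, k=3)
```

Some evaluations instead use a sampling-based estimator over more than k attempts per task; the simple version here only looks at the first k attempts.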
How Slop-Bench works
Implementation-quality checks for maintainability, repo fit, scope discipline, and avoidable complexity.
How Vibe-Bench works
Behavioral evaluation for communication, honesty, evidence, planning, and action alignment.
What this means for engineering teams.
Benchmarks only matter when they predict better work in a real repository. Cosine uses benchmark results to improve the whole system: model behavior, agent workflow, training data, and verification loops.
For teams, the goal is practical: fewer low-quality patches, better language coverage, clearer handoffs, and software changes that are easier to review.
Read the research behind Lumen.
Use the benchmark report and methodology pages to understand where Lumen performs well and where future post-training can focus.