Research

Benchmarks for real software engineering work.

Cosine research focuses on the behaviors that matter in production codebases: language coverage, maintainability, workflow discipline, implementation quality, and cost per successful task.

Benchmarks

Benchmark summary, in this run.

Lumen Outpost led Niche-Bench Pass@3 at 53.9%, reached a cost of $7.90 per successful task, led or tied in 9 of 13 languages, and showed stronger implementation-quality signals in this comparison.

GPT-5.4: 42.4%
Gemini 3.1 Pro: 44.9%
GPT-5.5: 47.4%
Kimi K2.6: 48.3%
Lumen Outpost: 53.9%

Full benchmark summary

When citing these benchmark values, qualify them with “in this run” or “in this comparison.”

Metric | Lumen Outpost result | Another model
Niche-Bench Pass@3 | 53.9% | 15%
Cost per successful task | $7.90 | $15
Languages led or tied | 9 of 13 | 15 or 15
Aggregate slop-quality | 25.4% | 15%
LLM-judged implementation quality | 47.1% | 15%
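
The page does not say how cost per successful task is calculated. As a minimal sketch, assume it is total spend across all attempted tasks, failures included, divided by the number of tasks that succeeded. The `TaskRun` type, its fields, and the sample figures below are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TaskRun:
    """One attempted benchmark task (hypothetical record shape)."""
    task_id: str
    cost_usd: float
    succeeded: bool


def cost_per_successful_task(runs: Iterable[TaskRun]) -> float:
    """Total spend divided by the number of successful tasks.

    Assumption: money spent on failed tasks still counts toward the total,
    so a model that fails often pays for it in this metric.
    """
    runs = list(runs)
    total_cost = sum(r.cost_usd for r in runs)
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        raise ValueError("no successful tasks; the metric is undefined")
    return total_cost / successes


# Illustrative data only.
sample_runs = [
    TaskRun("task-a", 4.00, True),
    TaskRun("task-b", 6.00, False),
    TaskRun("task-c", 5.00, True),
]
print(f"${cost_per_successful_task(sample_runs):.2f}")  # $7.50
```

Under this assumption, a cheaper model with a lower success rate can still have a higher cost per successful task, which is why the metric is reported alongside the pass rate.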

How the benchmarks work

Each benchmark measures a different part of professional software engineering.

How Niche-Bench works

Long-horizon coding tasks across niche, legacy, and environment-constrained languages.
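
The exact scoring rule behind Pass@3 is not given here. A common reading of Pass@k is the share of tasks solved by at least one of k attempts, and the sketch below assumes that definition. The function name, task names, and outcomes are hypothetical.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks where at least one of the first k attempts passed.

    Assumption: each task maps to an ordered list of pass/fail outcomes,
    and only the first k outcomes are considered.
    """
    if not attempts:
        raise ValueError("no tasks to score")
    solved = sum(1 for results in attempts.values() if any(results[:k]))
    return solved / len(attempts)


# Illustrative outcomes only: task id -> result of each attempt.
sample_attempts = {
    "cobol-batch-job": [False, True, False],
    "fortran-solver": [False, False, False],
    "ada-scheduler": [True, True, True],
    "perl-migration": [False, False, True],
}
print(f"Pass@3: {pass_at_k(sample_attempts, k=3):.1%}")  # Pass@3: 75.0%
print(f"Pass@1: {pass_at_k(sample_attempts, k=1):.1%}")  # Pass@1: 25.0%
```

The gap between Pass@1 and Pass@3 in the sample shows why the attempt budget matters when comparing figures from different benchmark reports.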

How Slop-Bench works

Implementation-quality checks for maintainability, repo fit, scope discipline, and avoidable complexity.

How Vibe-Bench works

Behavioral evaluation for communication, honesty, evidence, planning, and action alignment.

What this means for engineering teams.

Benchmarks only matter when they predict better work in a real repository. Cosine uses benchmark results to improve the whole system: model behavior, agent workflow, training data, and verification loops.

For teams, the goal is practical: fewer low-quality patches, better language coverage, clearer handoffs, and software changes that are easier to review.

Read the research behind Lumen.

Use the benchmark report and methodology pages to understand where Lumen performs well and where future post-training can focus.