The coding model post-trained for production.

Specialized beats general. Generalists spread thin across the entire internet. Lumen Outpost goes narrow on what production engineers actually ship — legacy languages, maintainable diffs, lower cost per task.

Try the model on our CLI

Niche-Bench · Pass@3 · 13 languages

Pick your language

Lumen on the language you actually ship.

13 languages tested with the same pass@3 evaluation.

We also post-train on COBOL, C++, C#, and JavaScript.
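This page does not spell out how pass@3 is computed. A common reading is the widely used unbiased pass@k estimator: sample n solutions per task, count the c that pass, and estimate the chance that at least one of k randomly drawn samples passes. The sketch below assumes that definition; the function names are illustrative and are not taken from the Niche-Bench harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the samples contains no passing solution.
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 then.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 3) -> float:
    """Average the per-task estimates; results holds (n, c) per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With exactly three samples per task, this reduces to "did any of the three attempts pass", averaged over tasks.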

Lumen Outpost · Java

76.5%

Top of the pack on Java

Enterprise workhorse. Lumen edges GPT-5.5 by 4.5 points and clears every other model by 9.8 points or more.

GPT-5.4 · 57.9%

Kimi K2.6 · 63.6%

Gemini 3.1 Pro · 66.7%

GPT-5.5 · 72.0%

Lumen Outpost · 76.5%
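For readers who want to check the Java margins themselves, here is a small sketch that recomputes Lumen's lead over each model from the scores listed above. The dictionary is simply the chart data retyped; nothing here comes from the benchmark harness.

```python
# Java pass@3 scores as listed in the chart above.
java_scores = {
    "GPT-5.4": 57.9,
    "Kimi K2.6": 63.6,
    "Gemini 3.1 Pro": 66.7,
    "GPT-5.5": 72.0,
    "Lumen Outpost": 76.5,
}

lumen = java_scores["Lumen Outpost"]

# Lead in percentage points, rounded to one decimal to hide float noise.
margins = {
    name: round(lumen - score, 1)
    for name, score in java_scores.items()
    if name != "Lumen Outpost"
}

# Print the closest competitor first.
for name, margin in sorted(margins.items(), key=lambda item: item[1]):
    print(f"{name}: +{margin} points")
```

The closest competitor is GPT-5.5 at 4.5 points behind, followed by Gemini 3.1 Pro at 9.8.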

Things we didn't teach Lumen

Every model has a training budget.

Generalists spend theirs on the entire internet. We spent ours on the code your company actually runs.

lumen-training-pipeline — stage 4 of 8

$ lumen-train corpus-filter --stage 4

▸ scanning 2,847,193,408 documents…

✘ who_is_pm_kazakhstan.md

✘ moon_landing_conspiracy.txt

✘ taylor_swift_lyrics.json

✘ tarot_reading_guide.pdf

✘ strawberry_letter_count.txt

✘ the_dress_color_debate.html

✘ hotdog_sandwich_class.md

✘ grandma_recipe_intros.md

✘ pineapple_pizza_archive/

✘ how_to_win_on_twitter.txt

✘ 90s_sitcom_trivia.db

✓ abap_production/ 8,432 repos

✓ cobol_mainframe/ 1,109 repos

✓ fortran_scientific/ 3,201 repos

✓ verilog_hdl/ 944 repos

✓ rust_systems/ 12,847 repos

→ corpus ready · 26,533 repos · production code only
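The terminal mock-up above does not say how stage 4 decides what to keep. Purely as an illustration, a filter like this could be a path-prefix allowlist. The names `KEPT_PREFIXES` and `keep` are hypothetical, and the repo counts are copied from the accepted lines so the total can be checked.

```python
# Hypothetical allowlist built from the directories the mock-up accepts.
KEPT_PREFIXES = (
    "abap_production/",
    "cobol_mainframe/",
    "fortran_scientific/",
    "verilog_hdl/",
    "rust_systems/",
)

# Repo counts as shown in the listing above.
repo_counts = {
    "abap_production/": 8_432,
    "cobol_mainframe/": 1_109,
    "fortran_scientific/": 3_201,
    "verilog_hdl/": 944,
    "rust_systems/": 12_847,
}


def keep(path: str) -> bool:
    """Keep a document only if it lives under an allowlisted directory."""
    return path.startswith(KEPT_PREFIXES)


total_repos = sum(repo_counts.values())
print(f"corpus ready: {total_repos:,} repos")
```

The five counts add up to the 26,533 repos reported on the final line of the listing.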

Why specialization wins

One narrow job, done very well.

Lumen is post-trained on ABAP, Fortran, COBOL, Verilog, Rust — and the production-engineering habits that make code maintainable on long-lived systems. Generalists treat these as a long-tail afterthought.

Generalist frontier models

Optimised for everything. Excellent at nothing in particular.

Attention split across web trivia, every mainstream language and every edge case the internet has ever discussed.
Long-tail languages like ABAP, Verilog and Scheme get scraps of the training budget.
Scores plateau at the same numbers on the languages production engineers actually maintain.

Lumen Outpost

One job. Production code in the languages you ship.

8-step data pipeline that turns real production code into verifiable training trajectories.
Grounded in deterministic execution, not vibes — every trajectory has to actually run.
+11.6 points over the Kimi K2.6 base on Niche-Bench. Wins or ties on 9 of 13 languages tested.
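The "every trajectory has to actually run" idea above can be sketched as an execution gate: a candidate is kept only if it runs to completion with exit status 0. This is a minimal illustration that assumes candidates are standalone Python snippets. It is not Lumen's pipeline, and `runs_cleanly` and `filter_trajectories` are made-up names.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def runs_cleanly(code: str, timeout_s: float = 10.0) -> bool:
    """Return True if the snippet exits with status 0 within the timeout."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # A hang counts as a failure, same as a crash.
            return False
    return result.returncode == 0


def filter_trajectories(candidates: list[str]) -> list[str]:
    """Keep only the candidates that actually execute successfully."""
    return [code for code in candidates if runs_cleanly(code)]
```

A real pipeline would also need sandboxing and language-specific build steps; the point here is only that acceptance is decided by a deterministic run, not by a judgment call.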

Read the research

Benchmarks

How Lumen Outpost compares.

AI assistants love to add 'just in case' helpers, defensive checks for impossible states and config flags for hypothetical futures. Three months later your team is reading paragraphs of comments to explain a one-line change. Slop-Bench measures the difference between code that passes the test and code your engineers will still want to maintain in a year.

Slop-Bench

Penalises duplication, dead code and unnecessary complexity. Rewards minimal diffs that match repo style.

Higher is better

Lumen Outpost
25.4%
GPT-5.5
25.0%
Kimi K2.6
19.8%
GPT-5.4
18.9%
Gemini 3.1 Pro
16.3%

Lumen Outpost

25.4%

Category leader.

Lumen leads Slop-Bench, while every other model that scores well on code-quality benchmarks bloats its diffs to get there. Lumen's diffs are shorter, more focused, and aligned with your repo style.
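To make "shorter, more focused diffs" concrete, here is a rough way to measure diff size using Python's standard `difflib`. This is not the Slop-Bench scoring method, which this page does not describe; `changed_lines` is a hypothetical helper that counts removed plus added lines between two versions of a file.

```python
import difflib


def changed_lines(before: str, after: str) -> int:
    """Count removed plus added lines between two versions of a file."""
    old, new = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Lines dropped from the old version plus lines added in the new.
            total += (i2 - i1) + (j2 - j1)
    return total


original = "def f(x):\n    return x + 1\n"
minimal_fix = "def f(x):\n    return x + 2\n"
bloated_fix = (
    "def f(x):\n"
    "    if x is None:\n"
    "        raise ValueError\n"
    "    return x + 2\n"
)

print(changed_lines(original, minimal_fix))
print(changed_lines(original, bloated_fix))
```

Both fixes change the same behaviour, but the second one touches twice as many lines because of a defensive check nobody asked for.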

Run Lumen on our CLI

One command. Mid-session model switching included.

Start building with Lumen

brew install CosineAI/tap/cos

Docs

Read the full Niche-Bench 250 report →

