Your Next Senior Engineer Runs in a Terminal
Your Job Is To Know What To Do With It!
Part 3 of a 3-Part Series
Introduction
Part 1 was about timelines. Part 2 dug into the why: the reasons for the shift in usage patterns, and why OpenAI may have become a consumer-facing company more by accident than by design.
In Part 3, we get into the weeds of what actually works!
This is the practical guide (and a precursor to an ode to Claude). Coding has turned out to be the first big use case of AI, mainly because it is built on automated reasoning and problem solving, its results are objective and verifiable, and it has large structured datasets well suited for training LLMs. The question is no longer whether AI can do our job; it is what our job looks like in the future. There is a gold rush at every level: starting with the large labs training foundation models, through the infra layer, and showing up at the application layer in 2025 as we entered the age of agents (as we had discussed in a timely fashion in early 2025 - the year of agents!). So, what does the future really look like?
Let's start with the most important product launch of 2025 - one that started as an engineer's side project.
THE IDE WARS
Claude Code's Cursor Moment: The Side Project That Hit $1Bn
In September 2024, an engineer named Boris Cherny joined Anthropic and started tinkering.
He gave Claude something it had never had before: direct access to the filesystem.
"There was no master plan," Cherny later explained. "I was experimenting with different ways to use the model. One of the really weird experiments was this thing that runs in a terminal. I was using it for kind of weird stuff - like looking at what music I was listening to and react to that."
Then he tried giving it tools to interact with the filesystem and bash commands…
…and this is what happened next: when Claude could read files, it did what you or I would do: it explored. Given access to a codebase, Claude naturally started reading files, following imports, and understanding project structure. Nobody told it to do this; it was emergent behaviour.
Cherny had discovered what the team later called the "Product Overhang": the capability to be a genuine development partner already existed inside Claude. He just let it be known…by tinkering!
Internal Adoption: The Signs Were There
In November 2024, Anthropic released Claude Code for internal dogfooding.
20% of engineering adopted it on day one. By day five: 50%.
By public launch in February 2025, over 80% of Anthropic engineers were using it daily.
And now? I shall quote the creator himself: “Fast forward to today. In the last thirty days, I landed 259 PRs -- 497 commits, 40k lines added, 38k lines removed. Every single line was written by Claude Code + Opus 4.5. Claude consistently runs for minutes, hours, and days at a time (using Stop hooks). Software engineering is changing, and we are entering a new period in coding history. And we're still just getting started..” - Boris Cherny
As an early user with beta access, I remember showing this to one of my colleagues at the time, and the response I got was: “Why won’t I just use VS Code with the $20 plan?” The answer: for use cases such as coding, it is not enough to have the most powerful model; you also need the most effective harness. That is Claude Code!
The Karpathy Moment
On December 20, 2025, Andrej Karpathy dropped his annual LLM year-in-review and explicitly mentioned Claude Code, further accelerating a tool already on an exponential adoption curve.
His take on Claude Code:
"Claude Code emerged as the first convincing demonstration of what an LLM Agent looks like - something that in a loopy way strings together tool use and reasoning for extended problem solving."
And why it won vs OpenAI’s Codex:
"Claude Code is notable to me in that it runs on your computer and with your private environment, data and context. I think OpenAI got this wrong because they focused their early Codex/agent efforts on cloud deployments in containers orchestrated from ChatGPT instead of simply localhost."
And the dev community at large corroborates this, as 2026 becomes crucial for OpenAI to stay ahead.
He called Claude Code a "little spirit/ghost that lives on your computer" - and the community ran with it. Claude Code paired with Obsidian is apparently the precursor to a personal OS.
But there is no doubt that we are now entering the era of personal software, and there is no stopping it. The game we are all going to be playing in 2026 is “Slop or Not”.
The Claude Code Confession: From the 1% Google Engineer
Two weeks after Karpathy's review, we had further validation from one of the top Engineers in the world:
Jaana Dogan, a principal engineer at Google working on Gemini API with 12+ years at the company, publicly shared her Claude Code experience:
“I gave Claude Code a description of the problem, it generated what we built last year in an hour.” - are you listening yet?
This is high praise: a top engineer at the lab whose Gemini 3.0 Pro topped benchmarks in December, conceding that Claude Code is still better for coding. So yes, benchmarks are not everything, and your use case should decide the model.
The Numbers
Metric | Claude Code |
|---|---|
Launch | February 24, 2025 |
$1B ARR milestone | November 2025 |
Developer adoption | 115,000+ developers |
Weekly code processed | 195 million lines |
Enterprise coding market share | 54% |
Enterprise clients | Netflix, Spotify, KPMG, L'Oreal, Salesforce |
Claude Code hit $1B ARR in 6-9 months (depending on how you count). We are talking about a terminal tool. No GUI. That is the power of a SOTA model paired with the most effective harness.
Anthropic's Acquisition: The Bun Bet
If you were still wondering whether Anthropic is actually building for developers: on December 2, 2025, it made its first acquisition ever, Bun, the JavaScript runtime that's dramatically faster than Node.js.
Claude Code ships as a Bun executable, indicative of Anthropic's focus on building the infrastructure layer that AI coding tools run on. No wonder Claude Code is such a pleasure to use.
Bun founder Jarred Sumner on why he sold:
"I think Anthropic is going to win. Betting on Anthropic sounded like a more interesting path. To be in the center of things. To work alongside the team building the best AI coding product."
Here’s a striking fact: over the last several months, the GitHub username with the most merged PRs in Bun's repo is a Claude Code bot.
Claude Code is now contributing more to its own runtime than most human developers. The future of software engineering? Go figure!
Has Anthropic Won The Coding Battle Already?
METR's latest report shows Claude Opus 4.5 can code autonomously for up to 5 hours without crashing. GPT-5.1-Codex-Max can't match it. GPT-5.2 was launched earlier under Code Red pressure (read Part 1), and GPT-5.2-Codex-Max is supposed to be better at long-duration agentic tasks, but that is all still to be proven.

Source: METR
Some are already calling Claude Code + Opus 4.5 "close enough to the definition of AGI." I may be one of them and it may be premature, but once you use it, you can’t go back.
Opus 4.5 is the best coding model in the world, period! 80.9% on SWE-bench Verified, the first to break 80%. It beats GPT-5.2 (76.3%) and Gemini 3 Pro (76.2%). On ARC-AGI-2, it scores 37.6%, more than double GPT-5.1's 17.6%. It passed Anthropic's internal engineering take-home exam with a higher score than any human candidate in company history. And it widely passes the vibe test in the real world.
The unlock is sustained autonomy.
GPT-4 was the unlock for chat. Sonnet 3.5 was the unlock for code. Opus 4.5 is the unlock for agents - AI systems that work for hours on complex, multi-step tasks without losing context, drifting, or getting stuck in error loops.
I have myself watched Opus 4.5 work for 30 minutes on a complex codebase with four coordinated agents. What both surprised and scared me was how simply and appropriately the task was done: the least slop, without the exhaustive guard-railed prompts I had grown so used to. This wasn't possible even six months ago.
The harness matters as much as the model.
It ain’t just a wrapper around Opus 4.5. Claude Code is a Unix-philosophy tool designed to give Opus 4.5 the same capabilities programmers have: file access, bash commands, Git, test runners, MCP integrations. The Claude Agent SDK lets developers build on the same harness, which we shall cover in another post. The pairing - frontier model plus SOTA infrastructure best suited for agentic workflows - is what creates this step change.
"People who are serious about models should make their own harness." - McKay Wrigley
Anthropic clearly believes this. And it's working.
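To make the "harness" idea concrete, here is a deliberately tiny sketch of the agent loop Karpathy describes: the harness repeatedly asks the model for an action, runs it with real tools (bash, file reads), and feeds the observation back. Everything here (the `model` callable, the action dictionary shape) is an illustrative assumption, not the actual Claude Code or Agent SDK API.

```python
import subprocess

def read_file(path: str) -> str:
    """Tool: return a file's contents."""
    with open(path) as f:
        return f.read()

def run_bash(cmd: str) -> str:
    """Tool: run a shell command and return combined output."""
    result = subprocess.run(cmd, shell=True, capture_output=True,
                            text=True, timeout=30)
    return result.stdout + result.stderr

TOOLS = {"read_file": read_file, "bash": run_bash}

def agent_loop(model, task: str, max_steps: int = 10) -> str:
    # `model` is a stand-in callable: given the history, it returns either
    # {"tool": name, "arg": value} or {"done": final_answer}.
    history = [("task", task)]
    for _ in range(max_steps):
        action = model(history)
        if "done" in action:
            return action["done"]
        observation = TOOLS[action["tool"]](action["arg"])
        history.append((action["tool"], observation))
    return "step budget exhausted"
```

The point of the sketch: the loop is trivial; the leverage comes from the model deciding what to do next and the harness giving it real tools.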
GPT-5.2 still leads on abstract reasoning (ARC-AGI-2: ~53% vs 37.6%) and math. Gemini 3 Pro wins on multimodal tasks. But there is no way to be sure based on just that - build with them and you shall see where they shine and where they devolve. There is a breaking point, but we need to push the limits to find it, and this is what the frontier of tech feels like.
The question isn't whether OpenAI "lost." It's whether the sustained autonomy gap holds. One developer comparison put it well:
"Claude is thorough, educational, transparent, and expensive. Codex is fast, minimal, opaque, and cheap."
And as I type this, we already have some positive vibes coming out of GPT-5.2-xHigh.
The $29 Billion IDE: Cursor's Parallel Rise
But I do want to call out the first IDE that got me hooked - Cursor! This was back in the Sonnet 3.5 era and, I'm not going to lie, I never thought the IDE wars would end up with a CLI-based tool as the leader. Cursor was the fastest-growing company at the time and spurred the onslaught of coding IDEs (Windsurf, Devin, Warp, Bolt, Lovable, Replit, the list goes on…).
Here is some data to show how powerful this growth was.
Date | Valuation | ARR |
|---|---|---|
August 2023 | $400M | ~$10M |
January 2025 | ~$1B | $100M |
June 2025 | $9.9B | $500M |
November 2025 | $29.3B | $1B+ |
Source: CNBC
A 73x valuation increase in 27 months, without building a frontier model.
What Cursor proved wrong is the “it's just a GPT wrapper” rhetoric! Their ruthless execution and focus on one of the most valuable target audiences (developers) took them from a VS Code fork to a model-agnostic integration layer.
They made Tab go from autocomplete to next-action prediction, and crucially made AI feel native to the coding workflow. From speculative decoding to aggressive caching, it takes more than being a GPT wrapper to actually make coding feel like magic, and Cursor was the first time I felt that way. Still rooting for that amazing team!
What Cursor felt like in 2024, Claude Code feels like 10x that in 2025. And with its newfound fanfare, software engineering is going to look very different not too long from now.
This Is Just The Beginning - Competition Intensifies
Anthropic's models are not available on Cursor as of today (Jan 10, 2026), and we already have scoops floating around, some even suggesting that competitors such as xAI were using Claude via Cursor. How is that for “we win together”! More to come.
Vibe Coding: The Cultural Shift
On February 6, 2025, Karpathy coined a term that defined the year:
"There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists."
His description:
"I 'Accept All' always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension."
Maybe it is just me, but the quest to be first means that any new development gets picked up immediately and spreads rapidly. Y Combinator reported that 25% of startups in its Winter 2025 batch had codebases that were 95% AI-generated.
I remember that within 3 months there were outcries about AI slop: AI-generated codebases more convoluted than they need to be, and tough to review, maintain, and fix.
Come September, Fast Company was reporting that "the vibe coding hangover is upon us", with senior engineers citing "development hell" when maintaining vibe-coded projects they didn't understand. So yes, that's the AI world for you: hype cycle on steroids.
For now, vibe coding works for prototypes and throwaway projects. For production systems, you still need to understand what you're shipping. But I do want to qualify that: while it may be true today, model capabilities are already pointing to a vibe-coded production future. 2026 is the first year we will actually see vibe-coded apps in production, imo.
I won't be surprised if privacy and security have their moment in the sun sooner rather than later. With CLI-based coding agents being run as root, I'd bet on sooner!
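Since we're on the subject of agents running as root: here is a minimal (and deliberately incomplete) Python sketch of basic precautions before letting an agent execute commands. The function name and defaults are illustrative, not from any real tool; a scrubbed environment, a fixed working directory, and a hard timeout are a floor, not a sandbox.

```python
import subprocess

def run_untrusted(cmd, workdir="/tmp", timeout=10):
    """Minimal-precautions sketch, NOT a real sandbox.

    - clean_env: drop inherited environment so API keys and tokens
      don't leak into agent-spawned processes
    - workdir: confine the process to a scratch directory
    - timeout: kill runaway commands

    For real isolation, use a container or VM, and never run as root.
    """
    clean_env = {"PATH": "/usr/bin:/bin"}
    return subprocess.run(cmd, cwd=workdir, env=clean_env,
                          timeout=timeout, capture_output=True, text=True)
```

Even this trivial layer stops the most common foot-gun: credentials sitting in the parent shell's environment.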
Your Margin Is My Opportunity
Hence began the era of IDEs, vibe coding, and CLI tools. Uber for X became Cursor for Y! There has already been a phase of "a new day, another vibe coding app", and we have had a few winners. A few notable examples below, not even close to a comprehensive list:
Windsurf (formerly Codeium) built the most direct Cursor competitor. Same VS Code fork approach, different execution. They had 1M+ developers before OpenAI came calling with $3 billion which is a soap opera in itself.
Devin from Cognition became the "autonomous software engineer" - give it a GitHub issue, come back in an hour, get a PR. The demo videos were impressive. The reality was more mixed (hence the single-digit agent benchmark scores). But Cognition raised at a $2B valuation and acquired Windsurf's assets, so clearly someone believes.
Factory also went after the autonomous software development market, basing it on the concept of "droids", which are basically autonomous AI agents designed to handle various stages of the software development lifecycle, from planning and coding to testing and documentation.
Lovable went after non-developers entirely. Describe an app in plain English, get a working React frontend. No coding required. The tagline: "from idea to app in seconds." It's vibe coding taken to its logical extreme - why learn React when AI can just write it?
Bolt (by StackBlitz) took a similar approach but ran everything in the browser. No local setup, no environment headaches. Type what you want, watch it build. The WebContainer technology underneath meant the AI could actually run and test the code as it wrote it.
Replit pivoted hard into AI. Their Agent can build entire applications from prompts, deploy them to Replit's hosting, and iterate based on feedback. They're betting that the IDE, the hosting, and the AI should all be one product. Given their distribution (millions of developers already on the platform), it's not a crazy bet.
So back to developer focused IDEs (I would include Windsurf/Devin and Factory here but avoiding repetition), here is where we are today with the big ones:
Tool | Type | Market Position | Best For |
|---|---|---|---|
Claude Code | CLI | #1 enterprise coding (54%) | Autonomous development, CI/CD, large refactors |
Cursor | IDE | #1 visual coding ($29B valuation) | Iterative editing, visual workflows |
Codex CLI | CLI + Cloud | Serious competitor, strong enterprise adoption | Long-horizon tasks, PR review, automation pipelines |
GitHub Copilot | IDE Extension | Distribution incumbent | Light assistance, VS Code native users |
Antigravity | IDE | Google's agent-first bet | Google ecosystem, browser-integrated agents |
Jules | Async Agent | Unique positioning | Background tasks, overnight development |
THE USE CASE BREAKDOWN
Who's Actually Best at What (January 2026)
CODING: The Current Rankings
SWE-bench Verified (December 2025)
Model | Score | Notes |
|---|---|---|
Claude Opus 4.5 | 80.9% | First to exceed 80% |
Gemini 3 Flash | 78.0% | Beats Pro at lower cost |
Gemini 3 Pro | 76.2% | Strong multimodal reasoning |
GPT-5.2 | 74.9% | Significant improvement over GPT-4 series and a shift from the o-series |
DeepSeek V3.1 (s/o to the Qwen family of models) | 66.0% | Best open-weight (for now) |
Beyond The Benchmarks
We did discuss how benchmarks shouldn't be the way to judge these models, so we won't. Claude Opus 4.5's dominance has been exemplified by Claude Code (which should have been named better). The versatility of the model lends itself to so much more than coding. Development teams report:
Catches edge cases other models miss in code review
Superior security analysis - identifies vulnerabilities proactively
Better architectural decisions - doesn't just fix bugs, suggests better patterns
66.3% on OSWorld - highest score for autonomous computer use, meaning safest for unsupervised execution
For mission-critical autonomous workflows, Claude's robustness against prompt injection and superior reliability make it the default choice (or at least it should be - for now).
THE HYBRID WORKFLOW
Choosing the single best model is so 2024. We are in the age of mixtures of experts, routers, harnesses, and the like, so every architectural decision should begin as hybrid.
Stage | Best Model | Why |
|---|---|---|
Initial research | Gemini-3-Pro | 72.1% SimpleQA (SOTA factual accuracy), 1M token context for massive document analysis |
System design & planning | Claude Opus 4.5 | 80.9% SWE-bench Verified, best architectural coherence on multi-file refactors |
Rapid code generation | GPT-5.2 or Gemini-3-Flash | GPT-5.2 is 3.8x faster than Claude; Flash offers best price/performance for iteration |
Complex debugging | Claude Opus 4.5 | 59.3% Terminal-bench 2.0, fewer concurrency bugs, better at systemic root causes |
Mathematical correctness/ Abstract reasoning | GPT-5.2 | 100% AIME 2025, 54.2% ARC-AGI-2 (vs Claude's 37.6%) |
PhD-level science | Gemini-3-Pro | 93.8% GPQA Diamond, 41% Humanity's Last Exam |
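The stage-to-model table above can be distilled into a trivial router. This is a hypothetical sketch: the stage keys and model identifier strings are illustrative labels I've chosen, not real API model names.

```python
# Hypothetical routing table distilled from the hybrid-workflow table.
STAGE_TO_MODEL = {
    "research": "gemini-3-pro",          # factual accuracy, huge context
    "design_planning": "claude-opus-4.5", # architectural coherence
    "code_generation": "gpt-5.2",         # or a flash-tier model for cheap iteration
    "debugging": "claude-opus-4.5",       # terminal tasks, root-cause analysis
    "math_reasoning": "gpt-5.2",          # abstract reasoning and math
    "science": "gemini-3-pro",            # PhD-level science benchmarks
}

def route(stage: str, default: str = "claude-opus-4.5") -> str:
    """Pick a model for a pipeline stage, falling back to a safe default."""
    return STAGE_TO_MODEL.get(stage, default)
```

In practice the table would live in config, not code, so you can re-benchmark and swap models without a deploy; but the shape of the decision is exactly this simple.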
RESEARCH & ANALYSIS
Tool | Best For | Speed | Accuracy |
|---|---|---|---|
Perplexity Deep Research | Source-backed research | 2-3 min | 95% citation accuracy |
ChatGPT Deep Research | Synthesized analysis | 5-10 min | 88% accuracy, deeper synthesis |
Google Deep Research | Current events | 3-5 min | Best for real-time grounded information |
Grok 4.1 | X/Twitter data | Instant | Unique realtime access to social context |
Claude Opus 4.5 | Deep research with output in various formats | 5-10 min | Great for structured, well-researched analysis |
The tradeoff: Perplexity is faster and more accurate on citations. ChatGPT produces more synthesized, structured outputs. Claude Opus 4.5 goes for well-rounded analysis, but prompt and context matter a lot.
CREATIVE: VIDEO
The video generation landscape matured significantly in late 2025.
Tool | Max Length | Resolution | Audio | Best For |
|---|---|---|---|---|
Sora 2 | 25 sec (pro) | 1080p | Dialogue + SFX | Social-first, Disney characters incoming |
Veo 3.1 | 8 sec | 1080p | Native generation | YouTube Shorts integration, 275M videos generated |
Runway Gen-4.5 | 10 sec | 4K clean | Native sync | #1 overall (1,247 Elo), Adobe partnership |
Kling 2.6 | 10 sec | 1080p | Speech/singing/ambient | Fastest audio-visual in single pass |
Luma Ray3 HDR | Variable | 4K EXR | No native | Professional ACES workflow |
Honorable mention to Pika 2.5 as a strong competitor.
CREATIVE: IMAGES
Tool | Best For | Text Rendering | Key Update |
|---|---|---|---|
Nano Banana Pro | Consistency, Infographics | Excellent (multilingual) | Gemini 3 Pro based |
FLUX.2 | Photorealism, open-source | Strong | 32B params, multi-reference (10 images) |
Ideogram 3.0 | Typography-heavy designs | Best in class (95%+) | Logos, posters, memes |
Midjourney V7 | Artistic quality, photorealism | Improved | Draft mode (10x faster), video gen added |
GPT Image 1.5 | Semantic understanding | Excellent | 4x faster, replaced Dall-E 3 |
The Grok controversy (Jan 2026): xAI restricted Grok image generation to paid users after deepfake backlash. EU ordered X to retain all Grok documents and Senators called for App Store removal - not recommended for production use.
AUDIO & VOICE
Provider | Best For | Latency | Quality |
|---|---|---|---|
ElevenLabs | Production, voice cloning | 75ms (Flash) | Studio-grade, 82% pronunciation |
OpenAI TTS | Quick integration | 200ms | Clean, consistent |
Chatterbox (open source) | Budget deployments | ~100ms | Reportedly beats ElevenLabs in blind tests |
ElevenLabs dominates professional use. Better emotional expression, voice cloning from seconds of audio, 3,000+ voices. Worth the premium for production.
AGENTS: What Actually Works in 2026
OSWorld benchmark: the best agents are now at 45% (up from 17% at the start of 2025). Guess where humans are? 72-74%. It is now largely expected that agents will reach human parity in 2026. What a time to be alive!
Here is where we have seen the most progress:
Coding agents (Claude Code, Codex): 70-80% success on defined tasks
Web research (Deep Research tools): 80%+ success
Constrained workflows: CI/CD pipelines, structured data operations, defined domains
And this is where humans are still quite a way ahead:
General computer use (38% on OSWorld)
Open-ended web tasks (CAPTCHAs, auth flows, dynamic sites)
Tasks requiring judgment about correctness
The MCP Standard
The Model Context Protocol (MCP), which Anthropic donated to the Linux Foundation's Agentic AI Foundation, was the talk of the town in early 2025 - pretty much all anyone could talk about as the future of agent-to-agent communication.
97M monthly SDK downloads
10,000+ published MCP servers
OpenAI and Microsoft have publicly embraced it
Google has begun standing up managed MCP servers
But, I have to say, it has been a little disappointing to watch the hype slowly whittle away, to the point where OpenAI releasing their MCP server doesn't come close to making top news (the feed's too full of Claude Code, sorry).
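For the curious: under the hood, MCP messages are JSON-RPC 2.0. Here is a sketch, simplified from the public spec, of the request a client sends to discover a server's tools (the helper function itself is mine, not part of any SDK):

```python
import json

def make_tools_list_request(request_id: int) -> str:
    """Build the JSON-RPC 2.0 message an MCP client sends to ask a
    server what tools it exposes. Simplified from the MCP spec."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/list",
        "params": {},
    })
```

The server replies with a list of tool names and JSON schemas, which is what lets any MCP-aware agent discover and call tools it has never seen before.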
CONCLUSION
State of LLMs today:
The IDE Wars are starting to produce clear winners:
Claude Code is on a breakout, with all the cool kids already on the wagon.
Cursor owns visual coding but now faces a serious threat, as Claude Code challenges the notion that "being a wrapper is enough" versus "a model-training lab with an effective IDE".
OpenAI's Codex is a strong contender, especially with GPT-5.2 showing decent initial signs on xHigh (or even Medium).
Karpathy's voice matters…a lot: from vibe coding going viral to Claude Code going viral beyond developers. When sensei speaks, we listen!
The model layer was commoditizing, and maybe will again. Cursor's success proved that you don't need to build models to win: integration and developer experience beat raw capability when capability is "good enough" across multiple providers. That was true until Claude Code came along. What happens when models become good enough for most use cases: do wrappers win, or do model-training labs evolve to building wrappers themselves…for everything? Google is already showing us the way.
The vertical AI winners figured something out. Harvey apparently charges $14,400/user because they understand law, not just language. Abridge landed Kaiser because they understand clinical workflows. EvenUp dominates personal injury because they picked a specific problem and solved it completely. That is the next frontier, and we are already seeing signs with the ChatGPT Health launch.
2026 Predictions
The consensus: 2026 is when agentic workflows move from demos to day-to-day practice. And there is no doubt that:
Agents will increasingly manage multi-step workflows rather than individual tasks. But most organizations are choosing "bounded autonomy" - letting agents act independently where outcomes are predictable, keeping humans involved where risk increases.
We may see open source once again close the gap with the widely anticipated DeepSeek V4 (already touted as comparable to Gemini 3 and Claude's best - TBD).
Audio-visual generation is close to being solved, so expect AI-generated content (call it AIGC or AI slop, depending on taste) to explode across social media.
Security will become top of mind as AI-generated code goes mainstream; 2026 may finally see a major breach (prompt injection or otherwise). Use sandboxes!
And we shall see software engineering evolve beyond code, with the evolution pointing to some sort of resolution of the Jevons paradox debate.
Looking Ahead
As has long been the case in business circles, the real moat is in "asking the right questions", and with LLM capabilities expected to hit human parity in 2026, that will be the moat.
The era of picking the right tool for the right job - and combining them intelligently - has begun, where the user’s domain expertise will lend itself best to the right context paired with the right question to find the most effective solution.
As LLMs have proved, hallucinations are a feature, not a bug; that is what makes them powerful and makes us human. We need to embrace the unknown and look at what we want to solve rather than questioning whether we can.
We shall go deeper into more technical questions in the next few blogs. Stay tuned…
End of 3-Part Series: The 2025 AI Race

