
Agentic software development: IDE integrations versus CLI agents

  • Writer: Daniel Kharitonov
  • Aug 6
  • 6 min read

Updated: Aug 8


TL;DR: Command-line AI coding assistants from foundational model vendors are rapidly overtaking IDE integrations like Cursor and Windsurf in professional use, with Gemini CLI and OpenAI Codex following the lead of Claude Code.

There is something strange about the cadence of AI coding assistant development. In software, the usual progression is that new abilities first make their way into professional-grade command-line tools and utilities, then into wider-audience graphical interfaces, and finally become available as autonomous executable modules.


This cadence, however, somehow got reversed for AI coding. While the potential of Large Language Models to solve software problems has been well recognized since the early days of transformers, the first practical applications of AI to programming came from IDE-integrated tools like GitHub Copilot (2021) and Cursor (2023). Self-contained coding agents working in dedicated environments (like Cognition Labs' Devin) started appearing in late 2023, while frontend-specific integrated AI studios (like Lovable and Vercel v0) began popping up in late 2024.


So it seems a bit counterintuitive that anyone would care about general-purpose command-line AI coding utilities in 2025, when vibe coding is already in full swing. It almost feels backward, does it not? Yet here we are, with professional developers switching to Claude Code en masse and leaving Cursor behind.


So what’s new?


The puzzle becomes less perplexing once we realize that the early rise of AI-powered IDEs started with simple 'tab-autocomplete' models that could only operate inside a fully fledged editor. And while all contemporary AI-enabled IDEs (like Cursor or Windsurf) can integrate virtually any API-based LLM, at the end of the day the differences between AI coding assistants boil down to three major variables:


  1. How well they manage the codebase context,

  2. How well they implement and manage tools, and

  3. How they are priced.


Notably, traditional file editing and debugging IDE capabilities are NOT on this list, which means no one truly cares how well Cursor integrates with Visual Studio Code. At the same time, foundational model vendors clearly call the shots on items (2) and (3): they can implement unique tools and fine-tune their own coding models, and they do not suffer from the massive pass-through problem in which most of the revenue collected by Windsurf goes straight back to OpenAI and Anthropic.


In theory, foundational model vendors should have no inherent advantage on the sole remaining variable (1): Cursor, Windsurf, or Augment are free to implement whatever strategy they want for learning and managing information about the codebase the agent works on. In practice, however, IDE vendors are still at a disadvantage. They are highly motivated to send as few tokens upstream as possible, and they have no leverage over input caching and other context optimization techniques on the inference side. This pushes IDE integrations toward cheaper, hobbyist-style tiers while elevating CLI coding agents to a professional level.
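
As a concrete illustration of that inference-side leverage, here is a minimal sketch of Anthropic's prompt caching, the kind of optimization a vendor-built agent can lean on to resend a large, stable codebase context cheaply. The model ID and the repo_map.txt file are assumptions for the example; this is not Claude Code's internal implementation.

```python
# Minimal prompt-caching sketch (model ID and repo_map.txt are illustrative).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("repo_map.txt") as f:  # e.g. a pre-built summary of the codebase
    repo_context = f.read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a coding agent working on this repository:\n" + repo_context,
            # The large, stable prefix is cached server-side, so subsequent turns
            # pay only a fraction of the input-token cost for it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Which module handles autodoc type aliases?"}
    ],
)
print(response.content[0].text)
```

This is the sort of lever that a vendor's own agent can pull freely, while a third-party IDE that trims upstream tokens benefits from it far less.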


Where are CLI agents really great?


To answer this question, we need to touch on two things: one easy and one a bit harder.


The easy thing is what differentiates agents from one-shot coding attempts, such as submitting your code problem to ChatGPT. Software coding agents with multi-turn reasoning and extensive tooling have an unfair advantage in both understanding the codebase and iteratively building a solution. This crucial difference in performance is well captured in the graph below from Cognition, circa 2024, comparing pre-agentic model performance on SWE-bench with a Devin agent:



The main source of Devin's productivity gain is its agentic capability (versus the single-shot results of the others). We can verify this by looking at the agentic wrapper for GPT-4 (1106) on the official SWE-bench leaderboard, where it shows results very close to Devin's:


source: swebench.com

So, software coding agents handily beat single-shot LLM attempts at solving problems; this is well understood. But are dedicated CLI agents actually better than their IDE-integrated counterparts like Cursor, especially when the latter are powered by the exact same models?


To find this out, let us run a quick eval.


We set up a local instance of a medium-complexity SWE-bench Verified problem (sphinx-doc__sphinx-9229) and ran Claude Code in YOLO mode inside the appropriate environment, using the default task prompt from the dataset.
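
For reference, the setup looked roughly like the sketch below: it pulls the instance definition from the SWE-bench Verified dataset on Hugging Face, checks the repo out at its base commit, and launches Claude Code non-interactively with permission prompts disabled. The local paths and exact invocation are illustrative assumptions, and building the per-instance Python environment (as the SWE-bench harness does) is omitted.

```python
# Sketch of the eval setup (paths and invocation are illustrative; the instance's
# Python environment is assumed to exist already).
import subprocess
from datasets import load_dataset

INSTANCE_ID = "sphinx-doc__sphinx-9229"

# Pull the task definition from SWE-bench Verified.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = next(row for row in ds if row["instance_id"] == INSTANCE_ID)

# Check out the target repo at the exact commit the issue was filed against.
subprocess.run(
    ["git", "clone", f"https://github.com/{task['repo']}.git", "workdir"], check=True
)
subprocess.run(["git", "checkout", task["base_commit"]], cwd="workdir", check=True)

# Run Claude Code non-interactively in "YOLO mode" (skipping permission prompts),
# with the dataset's problem statement as the prompt.
subprocess.run(
    ["claude", "-p", task["problem_statement"], "--dangerously-skip-permissions"],
    cwd="workdir",
    check=True,
)
```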


After 1,300 seconds, the issue was fixed and all tests were passing without drama. This is not particularly fast or cheap (Claude reported $6 in costs) but acceptable for a software problem labeled as “1–4 hours” of work for a human programmer.


Following that, we reset the repo back to its original state and ran Cursor on it, configured to use sonnet-4.0 and exactly the same prompt as Claude Code. The Cursor agent started off quite sensibly, reading files and reasoning about the fix, but then it hit a major snag: Cursor could not solve the environment needed to run the tests:
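
(The reset itself is nothing more than discarding the previous run's edits and untracked files so the working copy is back at the base commit; a sketch, using the same workdir as above.)

```python
# Return the working copy to the instance's base commit before the next agent run.
import subprocess

subprocess.run(["git", "reset", "--hard"], cwd="workdir", check=True)  # drop edits
subprocess.run(["git", "clean", "-fd"], cwd="workdir", check=True)     # drop new files
```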


Cursor having trouble solving the environment in the SWE-bench instance repo

The issue seems trivial: there is an appropriate activate.sh script in the repo, and we have no problem opening Cursor's terminal and pinning it to that environment ourselves. The agent itself, however, never picks it up. Documentation lookups yield nothing, and Cursor keeps stumbling, unable to run the tests. The agent finally gives up and delivers the patch without verification (and it actually works!), but this is not an optimal experience. Doing software work blindly, without running the unit tests, is surely faster and cheaper, but it is not quite enterprise-grade work.
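
For comparison, this is roughly what we did by hand in Cursor's terminal: source the repo's activation script, then invoke the test runner. The pytest target below is an assumption for illustration, not the exact selection relevant to the sphinx-9229 fix.

```python
# Manually pinning the test run to the repo's environment (sketch; the pytest
# target is an assumption).
import subprocess

subprocess.run(
    ["bash", "-lc", "source ./activate.sh && python -m pytest -q tests/"],
    cwd="workdir",
    check=True,
)
```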


We give IDEs another try, this time with Windsurf, again configured to use the same Anthropic sonnet-4.0 model. Just as with Cursor, Windsurf immediately runs into an issue when trying to execute the tests, but it somehow perseveres and decides to modify the repository itself to work with the default environment it was launched from:



Windsurf taking a hard path: adapting a historic repo for a modern environment

One hour and many reminders to hit 'Continue' later, Windsurf also succeeds at the task, though it delivers far more changes than minimally required (the "golden" solution from the dataset touches only one file):




So what is the equivalent of this problem in Claude Code?

It simply does not exist.


A command-line software AI agent picks up exactly where you started it from, inheriting the entire environment and working from there. This is just one of the many reasons why CLI coding agents are winning professionals over.


Where are CLI agents not so great?


One word: cost.


In our small example, Claude Code estimated the fix to the sphinx-doc__sphinx-9229 instance at $6. This is actually on the higher side: the average cost per issue from SWE-bench with Claude Code on our bench is about $3. Nevertheless, in our experience, it is pretty common for Claude Code to rack up $100 in API calls per day of active use, adding roughly $2,000 in expenses per month for every programmer who uses the tool daily.


On the other hand, the same fix delivered by Windsurf was reported to cost 10 credits, or approximately $0.30 on the Pro plan ($15 per month):


Windsurf credit dashboard

Windsurf's Team and Enterprise plans are slightly more expensive per credit than Pro, but they still cost a fraction of what Claude charges. Sure enough, Anthropic also offers subscriptions where requests come out cheaper, but for large codebases it still recommends the unrestricted "Max" pricing starting around $200 per month.
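
The back-of-the-envelope math behind these figures looks roughly as follows; the 20 working days per month and the implied per-credit price are our assumptions.

```python
# Rough per-seat cost comparison (assumptions: 20 working days/month; ~$0.03 per
# Windsurf credit, implied by ~$0.30 reported for the 10-credit sphinx fix).
DAILY_CLAUDE_API_SPEND = 100           # USD per active day of heavy Claude Code use
WORKING_DAYS_PER_MONTH = 20
claude_code_monthly = DAILY_CLAUDE_API_SPEND * WORKING_DAYS_PER_MONTH  # = 2,000 USD

WINDSURF_PRO_MONTHLY = 15              # USD, Pro plan subscription
CREDITS_FOR_SPHINX_FIX = 10
IMPLIED_PRICE_PER_CREDIT = 0.03        # USD

print(f"Claude Code (API): ~${claude_code_monthly:,}/month per heavy user")
print(f"Windsurf Pro: ${WINDSURF_PRO_MONTHLY}/month subscription, "
      f"~${CREDITS_FOR_SPHINX_FIX * IMPLIED_PRICE_PER_CREDIT:.2f} for the same fix")
```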


But large codebases are also where this math gets more complicated.


Having a lot of context and a rich conversational history requires sending more information upstream, something IDE vendors operating in lower price ranges are rarely willing to do. Richness of context is where CLI agents like Claude Code and OpenAI Codex thrive and earn their "Pro" status.


Conclusion


The market for software coding agents is still immature and moves very quickly. However, it already shows signs of differentiation between hobbyist and professional use, with products (for now) falling into two camps: companies that own large user bases (Augment, Windsurf, Cursor) and foundational model owners that tend to double down on CLI agents (Gemini CLI, Claude Code, OpenAI Codex, Qwen Code). The former seem to care more about pricing and ease of use, while the latter cater to audiences that value quality over quantity and are less price-sensitive.


It is also possible that some enterprises will choose to adopt both product types, especially as they become more integrated. As one example of such integration, Augment now offers a CLI coding utility (nicknamed "Auggie"), Windsurf is about to be added to Cognition Labs' portfolio alongside Devin, and Cursor itself is launching a CLI agent to compete with Claude Code head-on.


Whether these moves will bridge the diverging price points between professional and individual uses remains to be seen.





 
 
 