State of AI Agents in 2025: A Technical Analysis

Carl Rannaberg
23 min read · Jan 20, 2025


Image by Steve Johnson

AI systems are gaining the ability to act independently in the world. Over the past year, we’ve seen significant advances in reasoning, computer control, and memory systems that enable this shift. This analysis examines the technical foundations behind these developments, the current state of AI agents across different domains, and the infrastructure needed to make them reliable. We’ll explore the advances driving this transition and the remaining challenges.

Part 1: The Great Shift — From Models to Agents

OpenAI models ARC-AGI benchmark score evolution

In 2024, we saw the emergence of key capabilities for AI agents. OpenAI’s o1 and o3 models showed machines can break down complex tasks. Claude 3.5 demonstrated it could use computers like humans — controlling interfaces and running software. These advances, combined with improvements in memory and learning systems, are moving AI beyond simple chat interfaces toward autonomous systems.

AI agents are already in specialized domains — handling legal analysis, scientific research, and technical support. While they excel in structured environments with clear rules, they struggle with unpredictable situations and open-ended problems. Success rates drop significantly when tasks require handling exceptions or adapting to changing conditions.

The field is advancing from conversational AI to systems that can reason and act independently. Each step requires more computational power and brings new technical challenges. This article examines how AI agents work, their current capabilities, and the infrastructure needed for reliable function.

What is an AI Agent?

An AI agent is a system that reasons through problems, creates plans, and executes them using tools. Unlike traditional AI models that just respond to prompts, agents demonstrate:

  • Autonomy: The ability to independently pursue goals and make decisions
  • Tool Usage: Direct interaction with software, APIs, and external systems
  • Memory: Maintaining context and learning from past experiences
  • Planning: Breaking down complex tasks into actionable steps
  • Adaptation: Learning from experience to improve decision-making and performance
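
These properties can be summarized as a control loop. The sketch below is illustrative only — `llm_plan`, the tool registry, and the memory structure are hypothetical stand-ins, not any particular framework's API:

```python
# Minimal agent loop: plan, act with tools, remember, adapt.
# All names here are hypothetical placeholders for illustration.

def run_agent(goal, tools, llm_plan, max_steps=10):
    memory = []                      # past observations the agent can draw on
    for _ in range(max_steps):
        # Planning: ask the model for the next step given goal + memory
        step = llm_plan(goal, memory, list(tools))
        if step["action"] == "finish":
            return step["answer"]    # autonomy: the agent decides when it is done
        # Tool usage: execute the chosen tool with model-supplied arguments
        observation = tools[step["action"]](**step["args"])
        # Memory: record what happened so later steps can adapt
        memory.append({"step": step, "observation": observation})
    raise RuntimeError("step budget exhausted without reaching the goal")
```

Real agents replace `llm_plan` with a model call and add validation around each step; the loop structure itself, though, is common across implementations.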

Understanding the evolution from passive responders to autonomous agents is crucial for grasping the upcoming opportunities and challenges. Let’s examine the key developments that made this transformation possible.

The Foundation: 2024’s Breakthroughs

OpenAI o3 breakthrough high score on ARC-AGI benchmark

Three key developments in 2024 laid the groundwork for autonomous AI agents:

First, OpenAI’s o-series models demonstrated advances in reasoning. The o3 model achieved 87% accuracy on the ARC-AGI benchmark, which tests human-like problem-solving abilities. The models achieved this by generating multiple parallel solutions and using consensus mechanisms to select the most reliable answers. This ability to systematically approach novel problems and arrive at correct solutions through multiple reasoning paths established the base capability for autonomous action.

Second, AI models gained vision capabilities and basic computer control. Vision became standard across major models, allowing them to process screenshots and understand interfaces. Claude 3.5 showed it could control computers — moving cursors, clicking elements, and executing simple commands. While still below human performance and limited to basic operations, these advances showed how AI systems could interact with standard software interfaces.

Third, advances in model architecture transformed how AI systems handle memory and context. New approaches moved beyond simple attention mechanisms to sophisticated memory management — combining extended context windows with explicit working memory and efficient knowledge caching. This evolution means agents can maintain coherent understanding across longer, complex interactions.

The Present: Agents Emerge

Today, these capabilities are creating practical results. As Reid Hoffman noted, we’re seeing the emergence of specialized AI agents that extend human capabilities in specific domains. The early applications are promising:

  • Harvey is building legal agents that can collaborate with lawyers on complex tasks like S-1 filings, using o1’s advanced reasoning to break down and plan multi-stage legal work
  • Development platforms like OpenHands enable agents to write code, interact with command lines, and browse the web like human developers
  • Research teams are using multi-agent systems to design and validate scientific experiments, with specialized agents for hypothesis generation, experimental design, and result analysis
  • Healthcare teams are deploying AI agents as medical scribes to draft clinical notes from patient conversations
  • Airlines are deploying AI agents that handle complex booking changes, coordinating flight availability, fare rules, and refunds
  • Procurement teams are using agents to negotiate supplier agreements

Recent Sierra research shows how rapidly these systems are maturing. Their agents can now engage in natural conversations while juggling complex business rules and multiple backend systems — marking a shift from experimental prototypes to real-world deployment.

Key Questions

As we navigate this transformation, three critical questions emerge:

  1. When do autonomous agents outperform simpler AI tools?
  2. What technical and organizational infrastructure enables successful agent deployment?
  3. How can we ensure reliable, secure, and cost-effective agent operations?

The rest of this article will examine:

  • The current spectrum of agent capabilities
  • Real-world transformations across different domains
  • Technical infrastructure required for success
  • Current limitations and challenges
  • The road ahead

Understanding these aspects matters because agent technology changes our approach to complex tasks and decision-making. Let’s examine the building blocks that enable these capabilities.

Part 2: Understanding the Agent Spectrum

Image by Google Deepmind

Current AI agents vary in their capabilities and degree of autonomy. Some tasks require only basic tool use and response generation, while others need complex reasoning and autonomous decision-making. Understanding these capability levels helps determine when to use simpler, predictable systems versus fully autonomous agents.

The Building Blocks

Three core capabilities distinguish AI agents from simpler AI tools:

Reasoning and Planning

  • Breaking down complex tasks into steps
  • Exploring multiple solution paths systematically
  • Adapting strategies based on outcomes
  • Learning from successes and failures

Tool Use

  • Direct interaction with software interfaces
  • API and function calling
  • Code generation and execution
  • Web browsing and data access
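
Tool use is typically implemented as function calling: the model emits a structured call, and a thin dispatch layer validates and executes it. A minimal sketch — the tool registry and JSON call format below are assumptions for illustration, not a specific vendor's schema:

```python
import json

# Hypothetical tool registry mapping names to (function, required parameters).
TOOLS = {
    "get_weather": (lambda city: f"18C in {city}", {"city"}),
    "search_docs": (lambda query: ["doc1", "doc2"], {"query"}),
}

def dispatch(model_output: str):
    """Parse a model-emitted JSON tool call, validate it, then execute it."""
    call = json.loads(model_output)
    if call["name"] not in TOOLS:
        raise ValueError(f"unknown tool: {call['name']}")
    func, required = TOOLS[call["name"]]
    missing = required - call["arguments"].keys()
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return func(**call["arguments"])
```

Validating the call before execution matters in practice, since malformed tool calls are one of the most common agent failure modes discussed later in this article.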

Memory and Learning

  • Maintaining context across interactions
  • Building up reusable skills
  • Learning from past experiences
  • Improving performance over time

The Spectrum of Agency

AI agent spectrum

The progression from simple AI tools to full agents follows a spectrum of increasing complexity and capability:

1. Single-Tool Systems

  • Basic tool use with a single language model
  • Simple, well-defined interactions
  • Limited to specific APIs or functions
  • Example: Search-based chat apps

2. Multi-Tool Orchestration

  • Multiple tools in a single model
  • Structured API interactions
  • Defined workflows and patterns
  • Example: ChatGPT with plugins

3. Composed Systems

  • Multiple specialized models working in concert
  • Coordination and hand-off between components
  • Example: Multi-agent frameworks

4. General Access Agents

  • Direct system access (screen, keyboard, CLI)
  • Beyond structured APIs
  • Open-ended task handling
  • Example: Computer control agents

Not every problem requires the highest level of agency. Simpler solutions like tool-use models or orchestration systems are often more appropriate and cost-effective.

The Role of Context and Control

A critical consideration is the balance between capability and control. As we move towards more autonomous agents, several factors become important:

Security and Governance

  • Access control and permissions
  • Activity monitoring and logging
  • Resource usage limits
  • Safety constraints

Reliability and Trust

  • Action verification
  • Decision-making transparency
  • Error handling and recovery
  • Performance monitoring

Cost and Resource Management

  • Computational resource optimization
  • API usage efficiency
  • Storage and memory management

Understanding where your needs fall on this spectrum is crucial for effective deployment — matching the level of agency to the task avoids unnecessary complexity and cost.

Part 3: Real World Transformations

The true potential of AI agents emerges in their practical applications. Let’s examine how different industries leverage agent capabilities to solve real problems.

Software Development


The evolution from simple code completion to autonomous development showcases the expanding capabilities of AI agents. While GitHub Copilot introduced real-time code suggestions in 2021, today’s agents like Devin can handle end-to-end development tasks, from environment setup to deployment.

MetaGPT (a multi-agent framework paper) shows how specialized agents can collaborate effectively:

  • Product managers define requirements
  • Architects design system structure
  • Developers implement solutions
  • QA agents validate results

AI agents are not bound by human limitations, which raises fundamental questions about how we structure development activities that have been designed around human capabilities for the past 50–60 years. While they excel at tasks like prototyping and automated testing, the real opportunity lies in reimagining software development itself, rather than just making existing processes faster.

This transformation is already impacting hiring patterns. Salesforce announced it won’t hire software engineers in 2025, citing a 30% productivity increase from their AI agent technology. Meta CEO Mark Zuckerberg expects AI to reach the capability of mid-level software engineers in 2025, able to generate production code for applications and AI systems.

Recent real-world testing of Devin reveals the limitations of development agents: while they excel at isolated tasks like API integrations, they struggle with complex development work. Devin achieved only 3 successes out of 20 end-to-end tasks. Simpler, developer-driven workflows using tools like Cursor avoided many issues encountered with autonomous agents.

Customer Service


The evolution from simple chatbots to sophisticated service agents marks a clear success in agent deployment. Sierra’s research shows modern agents can handle complex tasks that previously required multiple human agents — from flight rebookings to multi-step refunds — while maintaining natural conversation.

Key capabilities of these systems include:

  • Coordinating multiple backend systems (reservations, payments, inventory)
  • Maintaining context in complex multi-turn conversations
  • Applying business rules while documenting for compliance
  • Handling routine cases with 40–60% faster resolution times

Significant challenges remain around policy exceptions and situations requiring empathy. Some implementations address this by constraining agents to approved knowledge sources and implementing clear escalation paths. This hybrid approach, where agents handle routine cases and escalate complex situations to human staff, has proven most effective in production environments.
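
The hybrid pattern described above reduces to a simple routing policy: the agent handles a case only when it matches an approved intent with high confidence, and escalates otherwise. A sketch, where the intent names and confidence threshold are assumed values for illustration:

```python
APPROVED_INTENTS = {"rebooking", "refund_status", "baggage_policy"}  # assumed set
CONFIDENCE_THRESHOLD = 0.85  # assumed; tune against real escalation outcomes

def route(intent: str, confidence: float, requires_empathy: bool) -> str:
    """Decide whether the agent handles a case or escalates to a human."""
    if requires_empathy:                   # e.g. complaints, bereavement fares
        return "escalate_to_human"
    if intent not in APPROVED_INTENTS:     # policy exceptions go to staff
        return "escalate_to_human"
    if confidence < CONFIDENCE_THRESHOLD:  # uncertain classification
        return "escalate_to_human"
    return "handle_with_agent"
```

Constraining the agent to a small approved set is what makes the production numbers above achievable: the agent only touches cases where it is reliable.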

Sales & Marketing


Sales and marketing agents now handle structured workflows like lead qualification, meeting scheduling, and campaign analytics. These systems coordinate across CRM platforms and communication channels while following configurable business rules. For example, Salesforce’s Agentforce can process customer interactions, maintain conversation context, and route complex cases to human agents when needed.

Recent benchmarks show two areas where agents achieve measurable results:

1. Sales Development

  • Autonomous lead qualification and outreach — for example, 11x’s Alice agent identifies prospects and schedules meetings while adapting to prospect interactions
  • Multi-modal communication handling — demonstrated by agents like 11x’s Mike, which processes voice and text interactions across 28 languages
  • System orchestration with CRM platforms and business tools, operating under configurable parameters to ensure compliance

2. Marketing Operations

  • Content generation and optimization
  • Performance tracking
  • Data analysis and reporting

Key capabilities of these systems:

  • Understanding and responding to complex customer queries across channels
  • Coordinating multiple business systems and data sources
  • Maintaining conversation context in extended interactions
  • Escalating to human agents when needed
  • Operating within configurable parameters to align with business goals and compliance standards

Integration and adoption of these solutions face several challenges:

  • Balancing automation with human relationship building
  • Ensuring consistent quality as scale increases
  • Maintaining personalization in automated interactions

Success in sales and marketing requires a balanced approach where agents handle routine interactions and data-driven tasks, while human teams focus on relationship building and complex decision-making.

Legal Services


Legal agents now process complex documents within strict regulatory frameworks. Harvey’s systems can break down multi-month projects like S-1 filings into structured steps, coordinate with multiple stakeholders, and maintain compliance across jurisdictions. However, these systems still require careful human oversight, particularly for tasks requiring subjective judgment or context-dependent reasoning.

Key distinguishing features:

  • Processing and analyzing thousands of legal documents while maintaining consistency across documents
  • Breaking down complex tasks like S-1 filings into structured workflows with clear checkpoints
  • Tracking regulatory requirements across jurisdictions
  • Maintaining detailed audit trails of all modifications and reasoning

Validation and liability remain significant obstacles in deployment. All agent outputs require human review, and responsibility in AI-assisted legal work is unresolved. While agents excel at document processing and research, strategic legal decisions remain in human hands.

The future of legal AI agents likely lies in enhanced collaboration between human lawyers and AI systems, with agents handling routine document processing and analysis while lawyers focus on strategy, negotiation, and final validation.

Finance


Financial services have emerged as an early testing ground for agent technology, with applications ranging from market analysis to automated trading.

Main use cases:

1. Market Analysis & Research

  • Analyzing company reports, news, and market data — as demonstrated by Decagon, which assists analysts by evaluating investment opportunities through detailed market trend analysis
  • Generating investment insights and recommendations based on multi-modal data analysis
  • Processing diverse data sources including market data, SEC filings, and news

2. Trading & Investment

  • Executing trades based on defined strategies
  • Managing investment portfolios
  • Recent benchmarks show proprietary models achieving up to 95% of buy-and-hold returns, with open-source alternatives reaching 80%

3. Risk Management

  • Monitoring portfolio risk metrics
  • Generating compliance reports
  • Maintaining performance consistency with human oversight

Current limitations include:

  • Single-asset focus (most systems struggle with complex portfolio management)
  • Variable reliability across market conditions
  • Challenges in maintaining long-term strategy
  • Real-time processing and global market adaptation challenges

Early results are promising, but financial applications require careful risk management and regulatory compliance. Most organizations start with narrowly scoped use cases under human supervision, focusing on single-asset trading before moving to complex portfolio management.

Research & Science


AI agents in scientific research can accelerate discovery while maintaining rigorous methodology. Recent papers showcase how specialized agents can collaborate throughout the research lifecycle:

  • Literature agents analyze thousands of papers to identify patterns and gaps
  • Hypothesis agents propose testable theories based on existing knowledge
  • Experiment agents design protocols and predict outcomes
  • Analysis agents interpret results and suggest refinements

This multi-agent approach has yielded promising results in chemistry, where agents helped identify novel catalysts and reaction pathways. With Google’s recent announcement of Gemini Deep Research, which compiles and analyzes web-based research, we see how these capabilities can extend beyond specialized domains to support broader research tasks.

Major challenges exist around verification, reproducibility, and automated quality assessment — with agent outputs scoring lower than human work in expert reviews. While agents can accelerate discovery by handling routine tasks, human scientists remain essential for creative direction and validating results. Success requires careful integration of agent capabilities with existing research methodologies while maintaining scientific rigor.

Emerging Patterns

Though agent implementations vary by industry, three common themes emerge:

Improved Memory

  • Maintaining richer context over longer interactions
  • Retaining relevant information to improve decisions

Complex Planning

  • Breaking tasks into logical steps for execution
  • Coordinating multi-step workflows or business processes

Direct Tool Integration

  • Interacting with external APIs and software environments
  • Handling specialized tasks (code generation, data analysis, etc.)

While AI agents’ potential is significant, most industries are still in an experimental phase of adoption. Organizations typically start with established approaches like Retrieval-Augmented Generation (RAG) before moving to advanced agent implementations.

A key challenge is identifying scenarios where agents provide measurable advantages over traditional AI approaches. While agents offer expanded capabilities, they also introduce complexity through required security controls, integration, and infrastructure overhead.

Some tasks need simpler tools, while others benefit from multi-step planning, advanced memory, or specialized collaboration. Effective implementation requires evaluating when agent capabilities justify their complexity in development effort and operational overhead.

Part 4: The Engine Room


The earlier discussed building blocks — planning, tool use, and memory — require sophisticated infrastructure to function effectively in production environments. While the technology is evolving, several key components have emerged as essential for successful agent deployments.

Development Frameworks & Architecture

Image from awesome-ai-agents by e2b.dev

The agent development frameworks ecosystem has matured, with several key players emerging:

  • AutoGen from Microsoft excels at flexible tool integration and multi-agent orchestration
  • CrewAI specializes in role-based collaboration and team simulation
  • LangGraph provides robust workflow definition and state management
  • LlamaIndex offers advanced knowledge integration and retrieval patterns

While these frameworks differ, successful agents typically require three core architectural components:

  • Memory System: Ability to maintain context and learn from past interactions
  • Planning System: Breaking down complex tasks into logical steps while validating each stage
  • Tool Integration: Access to specialized capabilities through function calling and API interfaces

While these frameworks provide solid foundations, production deployments often require significant customization to handle high-scale workloads, security requirements, and integration with existing systems.

Planning & Execution

AI agent planning and execution flow

Handling complex tasks requires advanced planning capabilities, typically involving:

  • Plan Generation: Breaking down tasks into manageable steps
  • Plan Validation: Evaluating plans before execution to avoid wasting compute
  • Execution Monitoring: Tracking progress and handling failures
  • Reflection: Evaluating outcomes and adjusting strategies

An agent’s success often depends on its ability to:

  • Generate valid plans by combining tools with practical knowledge (e.g., knowing which APIs to call in which order for a customer refund request)
  • Break down and validate complex tasks, with error handling at each step to prevent compounding mistakes
  • Manage computational costs in long-running operations
  • Recover gracefully from errors and unexpected situations through dynamic replanning and adaptation
  • Apply different validation strategies, from structural verification to runtime testing
  • Collaborate with other agents through tool calling or consensus mechanisms when additional perspectives could improve accuracy
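
Plan validation before execution catches many failures cheaply. The sketch below checks a generated plan against the available tools and step dependencies; the plan format is an assumption for illustration, not a standard:

```python
def validate_plan(plan, available_tools):
    """Structurally verify a plan before spending compute on execution.

    Each step is assumed to look like:
      {"id": 1, "tool": "search", "depends_on": []}
    """
    seen = set()
    errors = []
    for step in plan:
        if step["tool"] not in available_tools:
            errors.append(f"step {step['id']}: unknown tool {step['tool']!r}")
        for dep in step["depends_on"]:
            if dep not in seen:  # dependency must come from an earlier step
                errors.append(f"step {step['id']}: unmet dependency {dep}")
        seen.add(step["id"])
    return errors  # empty list means the plan is structurally sound
```

This is the cheap structural end of the validation spectrum; runtime testing of individual steps sits at the expensive end.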

Using multiple agents for consensus can improve accuracy, but the computational costs are substantial. Even for OpenAI, running parallel model instances for consensus-based answers remains unprofitable at premium price points — ChatGPT Pro costs $200/month. Majority voting multiplies costs by 3–5x for complex tasks, so simpler architectures that focus on robust single-agent planning and validation may be more economically viable.

Memory & Retrieval

AI agent memory architecture

AI agents need sophisticated memory management to maintain context and learn from experience. This involves multiple complementary systems:

Context Window

Evolution of LLM’s context window sizes

The immediate processing capacity of the underlying language model — the “physical memory” that constrains how much information an agent can process at once. Recent advances have expanded context windows to over 1M tokens, enabling richer single-interaction context.

Working Memory

State maintained across multiple LLM calls during a task:

  • Active Goals: Tracking current objectives and subtasks
  • Intermediate Results: Calculations and partial outputs
  • Task Status: Progress tracking and state management
  • State Verification: Tracking validated facts and corrections during task execution

Context management capabilities:

  • Context Optimization: Efficient use of limited context space through prioritization and organization
  • Memory Management: Automated movement of information between working and long-term storage — from preloading entire knowledge bases to maintaining dynamic memory units for relevant information
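
Context optimization often comes down to fitting prioritized items into a fixed token budget and spilling the rest to long-term storage. A minimal greedy sketch, where the priority scheme and pre-counted token sizes are assumptions:

```python
def pack_context(items, budget_tokens):
    """Greedily fill the context window with the highest-priority items.

    Each item is assumed to be (priority, token_count, text); anything that
    does not fit is returned for archival in long-term memory.
    """
    in_context, archived, used = [], [], 0
    for priority, tokens, text in sorted(items, reverse=True):
        if used + tokens <= budget_tokens:
            in_context.append(text)
            used += tokens
        else:
            archived.append(text)  # candidate for summarization / vector store
    return in_context, archived
```

Production systems replace the static priorities with learned or recency-weighted scores, but the budget-and-spill structure is the same.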

Long-term Memory & Knowledge Management

Storage systems:

  • Knowledge Graphs: Tools like Zep and Neo4j enable structured representation of entities and relationships
  • Virtual Memory: Systems like Letta (powered by MemGPT) provide paging between working memory and external storage

Management capabilities:

  • Memory Maintenance: Automated summarization, pruning, and integration of new information over time
  • Memory Operations: Efficient search and retrieval of relevant information

Modern memory systems go beyond simple storage to enable:

  • Compound task handling: Managing multi-step operations where accuracy must be maintained across steps
  • Continuous learning: Automatic knowledge graph construction from ongoing interactions (e.g., Zep)
  • Memory management: Virtual “infinite context” through automated memory management (e.g., Letta/MemGPT)
  • Error reduction: Improved information retrieval to reduce hallucinations and maintain consistency
  • Cost optimization: Efficient use of context window to minimize API calls and latency
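
A long-term store needs at minimum write, search, and pruning operations. A deliberately simple keyword-scored sketch — real systems use embeddings, knowledge graphs, or virtual paging as described above:

```python
class MemoryStore:
    """Toy long-term memory with add, search, and pruning operations."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.records = []          # list of [text, access_count]

    def add(self, text):
        self.records.append([text, 0])
        if len(self.records) > self.capacity:
            # Memory maintenance: prune the least-accessed record
            self.records.remove(min(self.records, key=lambda r: r[1]))

    def search(self, query, k=3):
        """Rank records by keyword overlap with the query."""
        q = set(query.lower().split())
        scored = sorted(self.records,
                        key=lambda r: len(q & set(r[0].lower().split())),
                        reverse=True)
        for r in scored[:k]:
            r[1] += 1              # track accesses to inform pruning
        return [r[0] for r in scored[:k]]
```

Even this toy version shows why retrieval quality matters: what the search returns is all the agent ever “remembers” beyond its context window.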

Memory systems are crucial for agents because:

  • Tasks often require multiple steps dependent on previous results
  • Information needs often exceed the model’s context window
  • Long-running operations need persistent state management
  • Accuracy must be maintained across complex workflows

Integration standards like Anthropic’s Model Context Protocol (MCP) are providing standardized ways to connect agents with persistent memory systems. However, challenges remain around efficiently orchestrating these memory types while managing computational costs and maintaining consistency.

Security & Execution

As agents gain autonomy, security and auditability become paramount. Modern deployments require several layers of protection:

  • Tool Access Control: Careful management of what operations agents can perform
  • Execution Validation: Verification of generated plans before execution
  • Sandboxed Execution: Platforms like e2b.dev and CodeSandbox provide secure isolated environments for running untrusted AI-generated code
  • Access Control: Granular permissions and API governance to limit impact
  • Monitoring & Observability: Comprehensive logging and performance tracking through specialized platforms like LangSmith and AgentOps, including error detection and resource utilization
  • Audit Trails: Detailed records of decision-making and system interactions

These security measures must balance protection with the flexibility for agents to operate effectively in production environments.
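
Tool access control is often implemented as an allowlist checked before every call, with each attempt logged for the audit trail. A minimal sketch, where the role names and logging scheme are assumptions:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.audit")

# Hypothetical per-role allowlists; real deployments would load these
# from a policy store with granular, per-resource permissions.
PERMISSIONS = {
    "support_agent": {"read_order", "issue_refund"},
    "research_agent": {"read_order"},
}

def guarded_call(role, tool_name, tools, **kwargs):
    """Check permissions, log the attempt, then execute the tool."""
    allowed = tool_name in PERMISSIONS.get(role, set())
    log.info("role=%s tool=%s allowed=%s args=%s", role, tool_name, allowed, kwargs)
    if not allowed:
        raise PermissionError(f"{role} may not call {tool_name}")
    return tools[tool_name](**kwargs)
```

Logging every attempt — allowed or denied — is what turns a permission check into an audit trail.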

Practical Limitations

Despite rapid progress, several significant challenges persist:

1. Tool Calling

Basic Tool Calling: While models excel at planning and reasoning, they struggle with basic tool interactions. Even simple API calls show high failure rates due to formatting errors and parameter mismatches

Tool Selection: Models often choose incorrect tools or fail to effectively combine multiple tools, particularly when faced with large tool sets

Tool Interface Stability: Natural language interfaces to tools remain unreliable, with models making formatting errors or behaving inconsistently

2. Multi-Step Execution

Tool Calling Instability: Even when a plan is sound, executing it through sequential tool calls remains unreliable — formatting errors, parameter mismatches, and context misunderstandings cause failures at individual steps

Compound Error Accumulation: Multi-step tasks amplify this unreliability — if each tool call has a 90% success rate, a 10-step workflow drops to 35% reliability. This makes complex workflows impractical without extensive human oversight
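
The compounding arithmetic is easy to verify — per-step reliability raised to the number of steps — and it also shows how sharply reliability recovers if each step can be retried:

```python
def workflow_reliability(step_success, steps, retries=0):
    """Probability that every step in a sequential workflow succeeds.

    With retries, a step fails only if all attempts fail, so the
    effective per-step success is 1 - (1 - p) ** (retries + 1).
    """
    per_step = 1 - (1 - step_success) ** (retries + 1)
    return per_step ** steps

print(round(workflow_reliability(0.90, 10), 2))             # ~0.35, as in the text
print(round(workflow_reliability(0.90, 10, retries=1), 2))  # one retry per step
```

This is why validation and retry loops around individual tool calls matter more than marginal gains in planning quality.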

Context Management: Models struggle to maintain consistent understanding across multiple tool interactions, leading to degraded performance in longer sequences

Planning Reliability: Complex workflows require careful validation of generated plans, as agents often overlook critical dependencies or make incorrect assumptions about tool capabilities

3. Technical Infrastructure

System Integration: Lack of standardized interfaces forces teams to build custom integration layers for each deployment, creating significant development overhead

Memory Architecture: Despite vector stores and retrieval systems, limited context windows constrain historical information access and self-reflection capabilities

Computational Demands: Large-scale deployments require significant processing power and memory, leading to substantial infrastructure costs

4. Interaction Challenges

Computer Interface Complexity: Even the best agents achieve only ~40% success rate with simple project management tools, and performance drops significantly with complex software like office suites and document editors

Coworker Communication: Agents achieve only 21.5% success when interacting with colleagues through collaboration platforms, struggling with nuanced conversations and policy discussions

5. Access Control

Authentication & Authorization: Agents operating on behalf of users face significant authentication challenges for long-running or asynchronous tasks. Traditional auth flows aren’t designed for autonomous agents needing access across hours or days.

Solutions are emerging — like Okta’s Auth for GenAI which provides:

  • Asynchronous authentication for background tasks
  • Secure API access on behalf of users
  • Fine-grained authorization for data access
  • Push notification-based human approval workflows

6. Reliability & Performance

Error Recovery: Agents struggle with unexpected errors and fail to adjust plans dynamically, making them less robust than humans at learning from mistakes.

Cross-Domain Performance Variability: Agents show variable reliability across different tasks, even in well-defined domains. For example, function calling agents in retail can succeed on individual tasks up to 50% of the time, but drop below 25% with variations of similar tasks. This inconsistency appears across domains, with systems achieving partial reliability in technical areas like coding.

Current agent capabilities vary across domains. In software development, where goals and validation are clear, agents can autonomously complete 30.4% of complex tasks. This aligns with Graham Neubig’s NeurIPS 2024 note: “30 to 40 percent of the things that I want an agent to solve on my own repos, it just solves without any human intervention”. However, performance drops in domains requiring broader context, with agents failing at administrative work (0%) and struggling with financial analysis (8.3%). This pattern suggests agents perform better on tasks with clear validation criteria and struggle with work requiring broader business context or policy interpretation.

Recent advances indicate a convergence of capabilities: memory architectures for richer context retention, reasoning improvements (as seen in o-series models) for deeper understanding through longer inference chains, and planning systems to decompose complex tasks while maintaining state across steps. These developments suggest enhanced context understanding may emerge from the interaction of these technical capabilities rather than requiring breakthroughs in model architecture. The challenge lies in orchestrating these components while managing increased computational demands.

Part 5: The Road Ahead


“With test-time compute, we’re still pretty early and so we have a lot of room, a lot of runway to scale that further” — Noam Brown

The untapped potential of test-time compute — resources used during model inference — points to a fundamental shift in scaling model intelligence. While pre-training faces limitations — “data is the fossil fuel of AI… we have but one internet” — reasoning models with test-time compute offer a new path forward. As Ilya Sutskever argued, next-token prediction with sufficient compute may be enough to achieve AGI.

Roadmap of AI agent breakthroughs

Near-Term Evolution (2025)

OpenAI CEO Sam Altman stated: “We are now confident we know how to build AGI as we have traditionally understood it”. However, the path forward relies heavily on compute-intensive reasoning — as Brown notes, solving the hardest problems may require “a million dollars” worth of compute per solution. This suggests that while we may know how to scale intelligence through test-time compute, the deployment economics will shape which problems we can tackle.

The rapid progress shows no signs of slowing. While advanced reasoning capabilities remain computationally expensive, current deployments are transformative — Salesforce reports 30% productivity gains from AI agents, leading it to pause engineering hiring for 2025. This aligns with industry predictions — Meta’s Zuckerberg expects that in 2025, “we at Meta as well as the other companies… are going to have an AI that can effectively be a sort of midlevel engineer”. These impacts suggest AGI-like capabilities may first emerge in domains with clear success criteria and rich synthetic data, like coding and mathematical reasoning.


Medium-Term Developments (2026)


Memory & Context

  • More reliable state tracking in interactive environments

While current agents struggle with basic UI interactions — achieving only ~40% success with simple project management tools — new approaches to learning show promise. Allowing agents to explore interfaces and derive tasks through “reverse task synthesis” nearly doubled success rates in complex GUI interactions. This suggests that by 2026, we may see agents that can reliably control computers through direct understanding of interfaces rather than following human instructions.

Long-Term Possibilities (Beyond 2026)


Image by Latent Space

The progression of AI capabilities and their economic implications are becoming clearer. ChatGPT Plus introduced basic chat at $20/month, while ChatGPT Pro brought advanced reasoning at $200/month. OpenAI’s recent push into multi-agent research and Altman’s confidence about “knowing how to build AGI” suggest autonomous agents may be next — potentially at an order of magnitude higher cost. As Brown notes, we’re just beginning to scale reasoning capabilities, with some important problems potentially requiring “a million dollars” worth of compute to solve. This hints at a future where increasingly capable systems — from autonomous agents to creative problem-solvers — may emerge at higher computational costs.

We now have the core building blocks for AI agents that mirror how humans tackle complex work: breaking down problems into smaller tasks, understanding context, learning from experience, using tools, and adapting to feedback. While these capabilities work in controlled environments, they struggle with the complexity and uncertainty of real-world tasks.

The next few years will be about experimentation — discovering how to effectively combine these components, finding reliable patterns, and establishing best practices for building robust agents. While we have the core capabilities, learning to orchestrate them into reliable systems that can tackle real-world complexity will require technical innovation and practical experience. The era of AI agents has begun, but we’re still in the early stages of understanding how to build them well.


Written by Carl Rannaberg
Experienced SaaS builder, ex-Pipedrive