Disclaimer: This article provides general information and is not legal or technical advice. For official guidelines on the safe and responsible use of AI, please refer to the Australian Government’s Guidance for AI Adoption.
New research from Anthropic identifies the “context ceiling”—the point where long-running agents stall during complex builds—and introduces the “initializer-coder” framework to bridge the gap between sessions.
Where do long-running agents actually fail?
The core bottleneck is “agentic amnesia.” Most frontier models attempt to one-shot complex applications, leading to exhausted context windows and half-implemented features. Without a structured harness, the next session arrives with no memory of the previous shift, forcing it to guess the state of the codebase. We are moving from single-shot prompts to persistent environments where the environment is as important as the model’s weights.
What is the “shift-work” problem and why does it matter?
Imagine a project staffed by engineers working back-to-back shifts where no one leaves notes. This is the current reality for long-running agents. The noise floor is undocumented code left behind by a previous session. The productivity paradox hits the IDE: an agent can generate thousands of lines of code, but if it doesn’t leave clear artifacts—like a progress log or a structured feature list—the next session spends most of its tokens just trying to figure out what happened.
What should builders take away?
Stop treating agents like isolated chatbots. Start building “agentic harnesses.” The real moat isn’t just the model; it’s the scaffolding around it—initializers, progress trackers, and automated testing loops. Reliability isn’t a property of the LLM; it’s an emergent property of the system design.
💡Quick note
This issue moves past simple prompting to the “context ceiling”, the point where long-running agents lose the plot. We break down Anthropic’s “initializer-coder” framework and a 2026 systems research paper on terminal scaffolding, plus Co‑Intelligence by Ethan Mollick. Part of the Weekly Deep Dive into AI and ML Advancements & Updates series.
Read this if you are:
Founders & Teams
Building the model is no longer the bottleneck; building the harness is. This issue explains why your agent-to-human trust ratio depends on the traceability and state management you build into autonomous workflows.
Students & Switchers
A deep dive into the end of single-shot prompting. Learn why the next generation of AI jobs isn’t in prompt engineering but in agentic system design—using tools like MCP, Git, and automated testing to keep long-running agents on track.
Community Builders
When agents contribute to open-source or internal codebases over days or weeks, the noise floor of undocumented changes can kill a community. This issue frames how to use initializer agents to maintain documentation and trust in a multi-contributor environment.
AI Bits for Techies | Issue #9 | 18 Mar 2026
Your weekly Aussie-flavoured deep dive into what changed in AI/ML, what matters, and what to do next (without living on release-note social media).
This week in one breath: Anthropic's latest research addresses the "agentic amnesia" problem by splitting long-horizon tasks into a two-agent relay—an initializer to set the foundation and a coding agent to make incremental, documented progress. As we hit the limits of single context windows, the primary engineering challenge moves from better prompts to better state management via git logs, JSON feature lists, and automated testing harnesses. Plus, the tools to manage agent skills and the framework for building production-ready AI systems.
Journal Paper of the Week
Building AI Coding Agents for the Terminal: Scaffolding, Harness, and Lessons Learned
The Context
We have reached the limits of the "chat interface." For agents to take on tasks that span days rather than seconds, they need to exist in a persistent environment. This research explores the harness engineering required to turn a frontier model into a reliable terminal-based agent. It treats the terminal not just as a tool, but as a shared memory between sessions.
The Method & Results
The researchers tested several harness configurations across thousands of coding tasks to identify why agents declare victory prematurely or leave codebases in a broken state.
The progress ledger: Introducing a mandatory progress.txt file and git history reduced re-work time by 65%. Instead of scanning every file, the agent reads the last 20 commits to get its bearings.
The feature checklist: Using JSON-based feature lists instead of Markdown prevented the model from accidentally overwriting its own requirements. Each feature is a test that must pass before the agent can move on.
Automated verification: Agents that were forced to run an init.sh script and a basic browser test at the start of every session were 40% less likely to implement new features on top of existing bugs.
Why It Matters
This is the transition from "AI as a tool" to "AI as a collaborator." If we can solve the state-preservation problem, we unlock the ability to build massive production-quality applications without human intervention at every step. The terminal is the new context window.
Anthropic Python SDK
Best for: The direct implementation of the harness architecture. It uses the Model Context Protocol (MCP) to allow agents to safely use local tools while managing token compaction and long-running sessions. https://github.com/anthropics/anthropic-sdk-python
LangGraph
Best for: Developers who need fine-grained control over "agentic amnesia." It models agent interactions as a stateful directed graph, allowing you to explicitly save, load, and "checkpoint" an agent's memory across multiple days of execution. https://langchain-ai.github.io/langgraph/
Vellum AI
Best for: Engineering teams moving from prototypes to production. It provides a robust framework for observability and versioning, ensuring that when an agent "hands off" work to a new session, the logic drift is measurable and the state is verifiable. https://www.vellum.ai/
Book recommendation (because your brain deserves more than changelogs)
Co‑Intelligence: Living and Working with AI — Ethan Mollick
The narrative bridge: If the Anthropic research is the blueprint for how we build the relay race between agents, Mollick’s book is the guide for how we actually live with the runners. He introduces a concept every builder of long-running agents needs to hear: AI isn’t just a tool; it’s an alien intern. Brilliant, always-on, but without clear guardrails and a way to hand off work, it will confidently hallucinate a finish line.
The challenge: How do you manage a coworker that has a perfect memory of a single conversation—but a total blackout the moment the window closes? How do you maintain human agency when the agents start doing the heavy lifting? You’ll have to read Mollick’s take on the jagged frontier to understand why the scaffolding discussed in this issue is the only thing keeping us from being replaced by our own automated interns.
Geeky thought of the week
The "Sim-to-Real" gap in logic.
We talk about robotics needing to bridge the gap between simulation and the real world, but long-running coding agents face a similar hurdle: the code-to-production gap. In a single context window, everything is perfect. The code is fresh, the variables are in memory, and the sim is running.
But the moment you close that window, you hit a noise floor of reality. A dependency updates, a server restarts, or a previous session leaves a shadow bug that isn’t obvious from the surface.
The question for your week: if we eventually give agents enough memory and tools to maintain our entire digital infrastructure, do we become the legacy support for our own software? Or is the ultimate role of a human engineer to be the initializer—setting the initial conditions for a system we can no longer fully comprehend in one sitting?
Housekeeping (so we stay honest)
This is general information, not legal advice. If you ship user-facing AI, be transparent about where AI is used, what it cannot do, and where humans stay in the loop.
Why isn’t “context compaction” enough for long tasks?
Compaction is essentially a summary of what happened, but summaries are lossy. They often miss the “why” behind an architectural choice. A long-running agent doesn’t just need a summary; it needs “state preservation”—the exact artifacts (git logs, tests, notes) that allow it to reconstruct the logic of the project from scratch.
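One low-tech artifact that carries the “why” is a structured progress note. The shape below is invented for illustration, not a format from the research; the point is that it records decisions and traps, not just diffs.

```
Session 14 (2026-03-17)
Done:    implemented /search endpoint; feature "search" now passing
Why:     chose SQLite FTS over a vector index because the dataset is small
Next:    wire /search into the frontend; its browser test still fails
Gotcha:  init.sh now requires Node 22; earlier sessions assumed Node 20
```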
Why do agents “declare victory” prematurely?
Without a feature checklist, an agent looks at the current codebase, sees that it looks like a web app, and assumes the job is done. It lacks the internal requirements that a human carries. By providing a JSON-based checklist that explicitly marks features as “failing,” we give the agent an objective definition of done.
What is the most effective “memory” for an agent?
It isn’t a vector database; it’s Git. A clean git history with descriptive commit messages is the most high-density memory an agent can have. It provides the what, the when, and a revert button if the current session goes off the rails.
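What that high-density memory looks like in practice: the last few subjects of a disciplined history, readable in one command (git log -20 --pretty=%s). These messages are invented for illustration, but note how each one records both the what and the state of the checklist:

```
feat: scaffold app layout and init.sh (initializer session)
feat: add features.json with 12 failing features
feat: implement /login endpoint; auth feature now passing
fix: browser smoke test caught stale session cookie; reverted bad change
docs: progress note for session 3 added to progress.txt
```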