Hashline Edit Format: How 2-Character Hashes Fixed AI File Editing
Every AI coding agent has the same Achilles heel: file editing.
The agent reads a file, decides what to change, and then… needs to reproduce the exact text it wants to replace. Character by character. Including tabs, spaces, trailing whitespace.
One wrong space and the edit silently fails. Or worse - it edits the wrong line.
We’ve been building production AI agents for 2 years. File editing accuracy was always the thing that made us cringe during demos. Then Can Boluk published a benchmark that changed how we think about this problem entirely.
TL;DR
- The edit format matters as much as the model. Can Boluk’s benchmark showed +5 to +64 percentage point accuracy improvements by changing only the harness, not the model.
- Hashline replaces text matching with hash anchoring. 2-character content hashes eliminate whitespace errors and reduce token waste.
- Weakest models gain the most. Grok Code Fast 1 jumped from 6.7% to 68.3% accuracy - a 61.6pp improvement.
- Hash validation catches stale files. Unlike str_replace’s silent failures, hashline explicitly rejects edits when the file has changed.
- One parameter to enable in pydantic-ai-backend: `create_console_toolset(edit_format="hashline")`.
The str_replace Problem
The standard approach to AI file editing is str_replace - give the model the old text and the new text, find-and-replace:
```python
edit_file(
    path="app.py",
    old_text="    return result",  # must match EXACTLY
    new_text="    return result + 1",
)
```

This looks simple. It's not. Here's what goes wrong:
- Whitespace mismatch - the model outputs 4 spaces, the file has a tab. Edit fails silently.
- Non-unique match - `return result` appears on 3 lines. Which one gets replaced? Usually the wrong one.
- Context drift - after a long conversation, the model's memory of the file diverges from reality. It "replaces" text that no longer exists.
- Token waste - the model must reproduce old text character-by-character just to identify a location. On a 500-line file, that’s a lot of output tokens spent on pointing, not changing.
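The whitespace failure mode is easy to reproduce. Here's a toy str_replace editor - a simplified sketch for illustration, not any real tool's implementation - showing the silent tab-vs-spaces miss:

```python
def str_replace_edit(content: str, old_text: str, new_text: str) -> tuple[str, bool]:
    """Toy str_replace editor: succeeds only on an exact substring match."""
    if old_text not in content:
        return content, False  # silent no-op: the file comes back unchanged
    return content.replace(old_text, new_text, 1), True


file_text = "def f():\n\treturn result\n"  # the file indents with a TAB

# The model "remembers" 4-space indentation - the match fails silently.
_, ok = str_replace_edit(file_text, "    return result", "    return result + 1")
print(ok)  # False

# Only a byte-exact reproduction, tab included, succeeds.
_, ok = str_replace_edit(file_text, "\treturn result", "\treturn result + 1")
print(ok)  # True
```

The toy version at least returns a success flag; a harness that ignores it behaves exactly like the silent failures described above - the model believes the edit landed.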
These aren’t edge cases. On Can Boluk’s benchmark of 180 tasks across 16 models, the patch format (unified diff) - which has even stricter formatting requirements - showed failure rates of 50.7% for Grok 4 and 46.2% for GLM-4.7.
Enter Hashline: Line-Level Anchoring
The idea is dead simple. Instead of asking the model to reproduce text, give each line a 2-character content hash:
```
1:a3|function hello() {
2:f1|  return "world";
3:0e|}
```

Each line gets a `{line_number}:{hash}|` prefix. The hash is deterministic - same content always produces the same hash. Now the model doesn't need to reproduce any text to point to a location. It just says:
```
replace line 2:f1 with:
  return "hello world";
```

No whitespace matching. No ambiguity. No reproducing old text. The hash acts as a content fingerprint - if the file changed since the model last read it, the hash won't match, and the edit is safely rejected.
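To make the mechanism concrete, here's a minimal self-contained sketch of hash anchoring. It assumes MD5-based 2-character hashes (the scheme described in the implementation section later); `replace_anchored` is a hypothetical helper written for this example, not part of any library:

```python
import hashlib


def line_hash(line: str) -> str:
    # 2 hex chars = 256 possible values, enough to fingerprint a line
    return hashlib.md5(line.encode("utf-8")).hexdigest()[:2]


def replace_anchored(content: str, line_no: int, expected_hash: str, new_line: str) -> str:
    """Replace one line, but only if its content hash still matches the anchor."""
    lines = content.split("\n")
    actual = line_hash(lines[line_no - 1])
    if actual != expected_hash:
        raise ValueError(
            f"Hash mismatch at line {line_no}: expected {expected_hash!r}, got {actual!r}"
        )
    lines[line_no - 1] = new_line
    return "\n".join(lines)


src = 'function hello() {\n  return "world";\n}'
anchor = line_hash('  return "world";')  # fingerprint of line 2
print(replace_anchored(src, 2, anchor, '  return "hello world";'))
```

Note that the model never reproduces the old line - it only sends the anchor and the replacement. If the file mutates underneath it, the `ValueError` path fires instead of a wrong-line edit.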
Can Boluk’s Benchmark: The Numbers
In February 2026, Can Boluk published “I Improved 15 LLMs at Coding in One Afternoon. Only the Harness Changed.” - a benchmark of 16 models across 3 edit formats (patch, str_replace, hashline).
Setup:
- 180 tasks per run, 3 runs per model
- React codebase files as test fixtures
- Mechanical mutations: operator swaps, boolean flips, off-by-one errors, identifier renames
- Fresh agent session each time with three tools: read, edit, write
Key results:
| Finding | Numbers |
|---|---|
| Grok Code Fast 1 | 6.7% to 68.3% accuracy (+61.6pp) with hashline vs patch |
| MiniMax | More than doubled accuracy |
| Grok 4 Fast | Output tokens dropped 61% |
| Gemini 3 Flash | 78.3% accuracy with hashline |
| Patch format | Worst for nearly every model |
The pattern was clear:
- Hashline matched or beat str_replace for most models
- The weakest models gained the most - smaller models struggle most with exact text reproduction
- Token usage dropped dramatically - models spend fewer tokens pointing to code and more tokens changing it
The conclusion: “The harness - not the model - is one of the bottlenecks in LLM coding performance.”
Our Implementation in pydantic-ai-backend
We shipped hashline support in pydantic-ai-backend v0.1.9, inspired directly by Can Boluk’s research. Here’s how it works in our Pydantic AI implementation.
Reading Files
When edit_format="hashline" is enabled, read_file returns content with hash tags:
```python
from pydantic_ai_backends import format_hashline_output

content = """function hello() {
  return "world";
}"""

print(format_hashline_output(content))
# 1:a3|function hello() {
# 2:f1|  return "world";
# 3:0e|}
```

Editing Files
The model references lines by their line:hash pair instead of reproducing text:
```python
from pydantic_ai_backends import apply_hashline_edit

new_content, error = apply_hashline_edit(
    content=original_file,
    start_line=2,
    start_hash="f1",
    new_content='  return "hello world";',
)
```

Operations supported:
- Replace single line - `start_line` + `start_hash` + `new_content`
- Replace range - also set `end_line` + `end_hash`
- Insert after - set `insert_after=True`
- Delete - set `new_content=""`
Hash Validation = Stale-File Protection
If the file changed since the model last read it, the hash won’t match:
```
Hash mismatch at line 2: expected 'f1', got 'a7'.
File may have changed — re-read it first.
```

This is a feature, not a bug. str_replace silently does nothing when text doesn't match. Hashline explicitly tells the model what went wrong and what to do about it.
Enabling It
One parameter:
```python
from pydantic_ai_backends.toolsets import create_console_toolset

toolset = create_console_toolset(edit_format="hashline")
```

The toolset automatically swaps `edit_file` for `hashline_edit` and adjusts `read_file` to include hash tags. The system prompt adapts too.
Hash Algorithm: MD5 vs CRC32
A note on implementation differences. Can Boluk uses CRC32 (zlib) mod 256 for the hash, stripping whitespace before hashing so code formatters don’t break anchors.
We use MD5’s first 2 hex characters. Same result space (256 values), slightly different collision profile. We chose MD5 because it’s already in Python’s stdlib with no imports beyond hashlib, and the collision rate at 2 characters is identical in practice.
```python
import hashlib


def line_hash(content: str) -> str:
    return hashlib.md5(content.encode("utf-8")).hexdigest()[:2]
```

Both approaches give you a 2-character anchor that's good enough for any file an AI agent would edit. You'd need roughly 20 lines before two of them are likely to share a hash (the birthday bound on 256 values) - and even then, the line number disambiguates.
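For comparison, the CRC32 variant could be sketched like this - `line_hash_crc` is an illustrative name of ours, but the recipe (CRC32 of the whitespace-stripped line, mod 256) is the one described above, and the stripping is what keeps anchors stable when a formatter re-indents code:

```python
import zlib


def line_hash_crc(content: str) -> str:
    """CRC32 of the whitespace-stripped line, folded into one byte (256 values),
    rendered as 2 hex chars to match the hashline anchor width."""
    stripped = content.strip()
    return format(zlib.crc32(stripped.encode("utf-8")) % 256, "02x")


# A reformat that only changes indentation leaves the anchor unchanged:
print(line_hash_crc("  return x;") == line_hash_crc("\treturn x;"))  # True
```

With the MD5 version, the same indentation change would produce a different anchor and force a re-read - a real trade-off between anchor stability and strictness.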
Why This Matters Beyond Benchmarks
Hashline isn’t just about accuracy numbers. It changes the economics of AI file editing:
1. Smaller models become viable. The biggest accuracy gains were on weaker models. If you’re running a local model or a cheaper API tier, hashline closes the gap with frontier models - not by making the model smarter, but by removing a formatting obstacle.
2. Token costs drop. Grok 4 Fast saw a 61% reduction in output tokens. That’s not a minor optimization - it’s a fundamentally different cost structure for agents that edit many files.
3. Errors become explicit. A str_replace that doesn't match just… does nothing. No error, no feedback. The model thinks it edited the file. With hashline, a hash mismatch is a clear signal: "re-read the file and try again."
4. Multi-edit workflows get safer. When editing from bottom to top, line numbers don’t shift. Combined with hash validation, you can chain multiple edits with confidence that each one targets the right line.
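The ordering trick is easy to show with a toy helper (hypothetical, with hash validation omitted for brevity): process edits from the bottom of the file upward, and the line numbers of the remaining targets never shift.

```python
def delete_lines(content: str, line_numbers: list[int]) -> str:
    """Delete the given 1-based lines, processing bottom-up so that removing
    a later line never shifts the numbers of earlier targets."""
    lines = content.split("\n")
    for n in sorted(line_numbers, reverse=True):
        del lines[n - 1]
    return "\n".join(lines)


print(delete_lines("a\nb\nc\nd", [2, 4]))  # prints "a" then "c"
```

Processed top-down instead, deleting line 2 would shift line 4 up to position 3 and the second delete would hit the wrong line - exactly the class of bug that hash validation would then catch.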
Key Takeaways
- The edit format matters as much as the model. Can Boluk showed +5 to +64pp accuracy improvements by changing only the harness, not the model.
- Hashline replaces text matching with hash anchoring. 2-character content hashes eliminate whitespace errors and reduce token waste.
- Weakest models gain the most. If you’re using local or budget models, hashline is a near-free accuracy boost.
- Hash validation catches stale files. Unlike str_replace’s silent failures, hashline explicitly rejects edits when the file has changed.
- One parameter to enable. `create_console_toolset(edit_format="hashline")` - that's it.
Try It Yourself
pydantic-ai-backend - Docker sandbox, console toolset, and backend abstractions for Pydantic AI agents
```shell
pip install pydantic-ai-backend
```

Credit: The hashline concept and benchmark data come from Can Boluk's research. We built an implementation for Pydantic AI agents - the approach itself is his contribution to the community.