# The Null Hypothesis

> Front-end, LLM tooling, and what actually works.

Author: Kumak
Site: https://kumak.dev

---

## Adding llms.txt to Your Astro Blog

URL: https://kumak.dev/adding-llms-txt-to-astro/
Published: 2025-11-30
Category: tutorial

> Make your Astro blog readable by AI agents. Three endpoints, ~150 lines of TypeScript, zero dependencies. Agents get clean markdown instead of scraping HTML.

## Why Would You Want This?

Here's the actual use case. You're in your terminal with an AI agent:

```
You: "Hey, can you curl https://kumak.dev/llms-full.txt and tell me if
there's anything interesting about Astro?"

Agent: *fetches clean markdown, scans content*
"There's a post about content collections and one about adding llms.txt.
The content collections one covers..."
```

Or more practically:

```
You: "Implement https://kumak.dev/llms/adding-llms-txt-to-astro.txt on my blog"

Agent: *fetches the post as clean markdown*
"Got it. I see you need three endpoints. Let me create the utils file first..."
```

Without llms.txt, the agent has to scrape HTML, strip navigation, parse React components, and hope for the best. With llms.txt, it gets exactly what it needs in a format it can read directly.

The [Astro docs](https://docs.astro.build/llms-full.txt) use this pattern. When you ask an agent to help with Astro, it can fetch their llms.txt and get accurate, current documentation instead of relying on training data.

## What is llms.txt?

The [llms.txt specification](https://llmstxt.org/) proposes a standard location for LLM-readable content. Think of it like `robots.txt` for crawlers or `sitemap.xml` for search engines, but designed for AI agents.

The problem it solves: when an AI agent visits your website, it has to parse HTML, navigate around headers, footers, and sidebars, and extract the actual content. This wastes tokens and often produces messy results.

The solution: provide a clean, structured text file at a known location. Agents fetch `/llms.txt`, get a table of contents, and can request individual pieces of content in plain markdown.

## The Architecture

We'll build three endpoints that work together:

```
/llms.txt          → Index: "Here's what I have"
/llms-full.txt     → Everything: "Here's all of it at once"
/llms/[slug].txt   → Individual: "Here's just this one post"
```

**Why three?** Different agents have different needs:

- A quick lookup might only need the index to find one relevant post
- A RAG system might want everything in one request
- A focused query might want just one article without the overhead of the full dump

## File Structure

Here's where everything lives in your Astro project:

```
src/
├── utils/
│   └── llms.ts              # All the generation logic
├── pages/
│   ├── llms.txt.ts          # Index endpoint
│   ├── llms-full.txt.ts     # Full content endpoint
│   └── llms/
│       └── [slug].txt.ts    # Per-post endpoints (dynamic route)
```

The `utils/llms.ts` file contains all the logic. The page files are thin wrappers that call into it. This separation keeps the endpoints clean and the logic testable.

## Prerequisites

Before we start, you'll need these project-specific pieces:

- **`siteConfig`** - An object with `name`, `description`, `url`, and `author` properties
- **`getAllPosts()`** - A function that returns your content collection posts
- **`BlogPost`** - The type from Astro's content collections with `slug`, `body`, and `data`

The [complete gist](https://gist.github.com/szymdzum/a6db6ff5feb0c566cbd852e10c0ab0af) shows the full implementation with all type definitions.
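If you don't already have these pieces, here's a minimal sketch of the shapes the rest of the code assumes. The file paths, the `blog` collection name, and the config values are placeholders; adapt them to your project.

```typescript
// src/utils/site.ts (sketch: adjust the values to your site)
export const siteConfig = {
  name: "The Null Hypothesis",
  description: "Front-end, LLM tooling, and what actually works.",
  url: "https://kumak.dev",
  author: "Kumak",
};

// src/utils/posts.ts (sketch: assumes a content collection named "blog")
import { getCollection, type CollectionEntry } from "astro:content";

export type BlogPost = CollectionEntry<"blog">;

export async function getAllPosts(): Promise<BlogPost[]> {
  const posts = await getCollection("blog");
  // Newest first, so the index lists recent posts at the top
  return posts.sort((a, b) => b.data.pubDate.valueOf() - a.data.pubDate.valueOf());
}
```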
## Part 1: Type Definitions

Let's start by defining the shapes of our data. Good types make the rest of the code self-documenting.

```typescript
// src/utils/llms.ts

// Basic item for the index - just enough to create a link
interface LlmsItem {
  title: string;
  description: string;
  link: string;
}

// Extended item for full content - includes the actual post data
interface LlmsFullItem extends LlmsItem {
  pubDate: Date;
  category: string;
  body: string;
}
```

Why two types? The index only needs titles and links. The full content dump needs everything. By extending `LlmsItem`, we ensure consistency while allowing the richer type where needed.

Now the configuration types for each generator:

```typescript
// Config for the index endpoint
interface LlmsTxtConfig {
  name: string;
  description: string;
  site: string;
  items: LlmsItem[];
  optional?: LlmsItem[]; // Links that agents can skip if context is tight
}

// Config for the full content endpoint
interface LlmsFullTxtConfig {
  name: string;
  description: string;
  author: string;
  site: string;
  items: LlmsFullItem[];
}

// Config for individual post endpoints
interface LlmsPostConfig {
  post: BlogPost;
  site: string;
  link: string;
}
```

The `optional` field in `LlmsTxtConfig` is part of the spec. It signals to agents: "these links are nice-to-have, skip them if you're running low on context window."

## Part 2: The Document Builder

Every endpoint needs to return a plain text `Response`. Instead of repeating this logic, we create one builder that handles it all:

```typescript
function doc(...sections: (string | string[])[]): Response {
  const content = sections
    .flat()                       // Flatten nested arrays
    .join("\n")                   // Join with newlines
    .replace(/\n{3,}/g, "\n\n")   // Normalize multiple blank lines to just one
    .trim();                      // Clean up edges

  return new Response(content + "\n", {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

**Why rest parameters with arrays?** This lets us compose documents flexibly:

```typescript
// These all work:
doc("# Title", "Some text");
doc(["# Title", "", "Some text"]);
doc(headerArray, bodyArray, footerArray);
```

**Why normalize newlines?** When composing from multiple arrays, you might accidentally get three or four blank lines in a row. The regex `/\n{3,}/g` catches any run of 3+ newlines and replaces it with exactly 2 (one blank line). Clean output, no matter how messy the input.

## Part 3: Helper Functions

Small, focused functions that each do one thing:

### Formatting Dates

```typescript
function formatDate(date: Date): string {
  return date.toISOString().split("T")[0];
}
```

Takes a Date, returns `"2025-11-30"`. The `split("T")[0]` trick extracts just the date part from an ISO string like `"2025-11-30T00:00:00.000Z"`.

### Building Headers

```typescript
function header(name: string, description: string): string[] {
  return [`# ${name}`, "", `> ${description}`];
}
```

Returns an array of lines. The empty string creates a blank line between the title and the blockquote description. This matches the llms.txt spec format.

### Building Link Lists

```typescript
function linkList(title: string, items: LlmsItem[], site: string): string[] {
  return [
    "",
    `## ${title}`,
    ...items.map((item) => `- [${item.title}](${site}${item.link}): ${item.description}`),
  ];
}
```

Creates a section with an H2 heading and a markdown list of links. Each link includes a description after the colon. The leading empty string ensures a blank line before the section.
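To see how these helpers fit together before we build the real generators, here's a throwaway composition. The title and item are made up for illustration:

```typescript
// Hypothetical usage, just to show the shape of the output
const response = doc(
  header("The Null Hypothesis", "Front-end, LLM tooling, and what actually works."),
  linkList("Posts", [
    { title: "Example Post", description: "A placeholder entry", link: "/llms/example-post.txt" },
  ], "https://kumak.dev"),
);

// The Response body is plain markdown:
//
// # The Null Hypothesis
//
// > Front-end, LLM tooling, and what actually works.
//
// ## Posts
// - [Example Post](https://kumak.dev/llms/example-post.txt): A placeholder entry
```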
### Building Post Metadata

```typescript
function postMeta(site: string, link: string, pubDate: Date, category: string): string[] {
  return [
    `URL: ${site}${link}`,
    `Published: ${formatDate(pubDate)}`,
    `Category: ${category}`,
  ];
}
```

Three lines of metadata for each post. This keeps the format consistent across the full dump and individual post endpoints.

## Part 4: Stripping MDX Syntax

If you use MDX, your post bodies contain things agents don't need:

```mdx
import Callout from "../../components/Callout.astro";

<Callout type="info">This note is rendered by a component.</Callout>

The actual content starts here...
```

We need to strip the import and the JSX component, but keep the markdown content:

```typescript
const MDX_PATTERNS = [
  /^import\s+.+from\s+['"].+['"];?\s*$/gm,              // import statements
  /<[A-Z][a-zA-Z]*[^>]*>[\s\S]*?<\/[A-Z][a-zA-Z]*>/g,   // JSX blocks like <Callout>...</Callout>
  /<[A-Z][a-zA-Z]*[^>]*\/>/g,                           // Self-closing JSX like <Callout />
] as const;

function stripMdx(content: string): string {
  return MDX_PATTERNS.reduce((text, pattern) => text.replace(pattern, ""), content).trim();
}
```

**How the patterns work:**

1. **Import pattern**: Matches lines starting with `import`, followed by anything, then `from` and a quoted path. The `m` flag makes `^` match line starts.
2. **JSX block pattern**: Matches `<Callout>...</Callout>`, including everything between the opening and closing tags.
3. **Self-closing pattern**: Matches `<Callout />` for components without children.

**Why PascalCase?** JSX components use PascalCase by convention. HTML elements are lowercase. So `<Callout>` gets stripped, but `<div>` or `<code>` passes through. This also means code examples in fenced blocks are safe, since they're not parsed as actual JSX.

## Part 5: The Generators

Now we combine everything into the three main functions:

### Index Generator

```typescript
export function llmsTxt(config: LlmsTxtConfig): Response {
  const sections = [
    header(config.name, config.description),
    linkList("Posts", config.items, config.site),
  ];

  if (config.optional?.length) {
    sections.push(linkList("Optional", config.optional, config.site));
  }

  return doc(...sections);
}
```

Builds an array of sections, conditionally adds the optional section, then passes everything to `doc()`. The spread operator `...sections` unpacks the array into separate arguments.

**Output looks like:**

```markdown
# Site Name

> Site description

## Posts
- [Post Title](https://site.com/llms/post-slug.txt): Post description

## Optional
- [About](https://site.com/about): About the author
```

### Full Content Generator

```typescript
export function llmsFullTxt(config: LlmsFullTxtConfig): Response {
  const head = [
    ...header(config.name, config.description),
    "",
    `Author: ${config.author}`,
    `Site: ${config.site}`,
    "",
    "---",
  ];

  const posts = config.items.flatMap((item) => [
    "",
    `## ${item.title}`,
    "",
    ...postMeta(config.site, item.link, item.pubDate, item.category),
    "",
    `> ${item.description}`,
    "",
    stripMdx(item.body),
    "",
    "---",
  ]);

  return doc(head, posts);
}
```

**Why `flatMap`?** Each item produces an array of lines. Using `map` would give us an array of arrays. `flatMap` maps and flattens in one step, giving us a single array of all lines.

The horizontal rules (`---`) separate posts visually and give agents clear boundaries between content pieces.

### Individual Post Generator

```typescript
export function llmsPost(config: LlmsPostConfig): Response {
  const { post, site, link } = config;
  const { title, description, pubDate, category } = post.data;

  return doc(
    `# ${title}`,
    "",
    `> ${description}`,
    "",
    ...postMeta(site, link, pubDate, category),
    "",
    stripMdx(post.body ?? ""),
  );
}
```

The simplest generator. Destructures the config and post data, then builds a single document. The `post.body ?? ""` handles the edge case of a post without body content.

## Part 6: Data Transformers

We need functions to convert Astro's content collection format into our types:

```typescript
export function postsToLlmsItems(
  posts: BlogPost[],
  formatUrl: (slug: string) => string,
): LlmsItem[] {
  return posts.map((post) => ({
    title: post.data.title,
    description: post.data.description,
    link: formatUrl(post.slug),
  }));
}

export function postsToLlmsFullItems(
  posts: BlogPost[],
  formatUrl: (slug: string) => string,
): LlmsFullItem[] {
  return posts.map((post) => ({
    ...postsToLlmsItems([post], formatUrl)[0],
    pubDate: post.data.pubDate,
    category: post.data.category,
    body: post.body ?? "",
  }));
}
```

**Why the callback for URLs?** Different endpoints need different URL formats:

- Index links to `/llms/post-slug.txt` (the plain text version)
- Full content links to `/post-slug` (the HTML version)

By passing the formatter as a callback, the same transformer works for both cases.

**Why does `postsToLlmsFullItems` call `postsToLlmsItems`?** DRY principle. The full item includes everything from the basic item, plus extra fields. Instead of duplicating the mapping logic, we reuse it and spread the result.

## Part 7: The Endpoints

Now we wire everything up in Astro page files. These are intentionally thin.
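The endpoint snippets below omit their imports for brevity. Each file needs roughly this at the top; the `~/utils/...` paths and alias are assumptions, so point them at wherever your config, post helpers, and `llms.ts` actually live:

```typescript
// Shared imports for the endpoint files (sketch; adjust paths to your project)
import type { APIRoute, GetStaticPaths } from "astro";
import { siteConfig } from "~/utils/site";
import { getAllPosts, type BlogPost } from "~/utils/posts";
import {
  llmsTxt,
  llmsFullTxt,
  llmsPost,
  postsToLlmsItems,
  postsToLlmsFullItems,
} from "~/utils/llms";
```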
### Index Endpoint

```typescript
// src/pages/llms.txt.ts
export const GET: APIRoute = async () => {
  const posts = await getAllPosts();

  return llmsTxt({
    name: siteConfig.name,
    description: siteConfig.description,
    site: siteConfig.url,
    items: postsToLlmsItems(posts, (slug) => `/llms/${slug}.txt`),
    optional: [
      { title: "About", link: "/about", description: "About the author" },
      { title: "Full Content", link: "/llms-full.txt", description: "All posts in one file" },
    ],
  });
};
```

The `APIRoute` type tells Astro this is an API endpoint, not an HTML page. The `.txt.ts` filename means it generates `/llms.txt`.

### Full Content Endpoint

```typescript
// src/pages/llms-full.txt.ts
export const GET: APIRoute = async () => {
  const posts = await getAllPosts();

  return llmsFullTxt({
    name: siteConfig.name,
    description: siteConfig.description,
    author: siteConfig.author,
    site: siteConfig.url,
    items: postsToLlmsFullItems(posts, (slug) => `/${slug}`),
  });
};
```

Almost identical structure. The URL formatter now points to HTML pages since agents reading the full dump might want to reference the original.

### Dynamic Per-Post Endpoints

```typescript
// src/pages/llms/[slug].txt.ts
export const getStaticPaths: GetStaticPaths = async () => {
  const posts = await getAllPosts();

  return posts.map((post) => ({
    params: { slug: post.slug },
    props: { post },
  }));
};

export const GET = ({ props }: { props: { post: BlogPost } }) => {
  return llmsPost({
    post: props.post,
    site: siteConfig.url,
    link: `/${props.post.slug}`,
  });
};
```

**What's `getStaticPaths`?** Astro needs to know at build time which pages to generate. This function returns an array of all valid slugs. Each entry includes `params` (the URL parameters) and `props` (data passed to the page).

**Why `[slug]` in the filename?** Square brackets denote a dynamic route in Astro. The file `[slug].txt.ts` generates `/llms/post-one.txt`, `/llms/post-two.txt`, etc.

## Part 8: Discovery

Agents need to find your llms.txt. The spec says to put it at the root (`/llms.txt`), similar to `robots.txt`. But you can also advertise it in HTML:

```html
<link rel="alternate" type="text/plain" href="/llms.txt" title="llms.txt" />
```

Add this to your base layout or head component, wherever you define other `<link>` tags like RSS or favicon. This isn't part of the official spec, but it follows web conventions.

You can also register your site on directories like [llmstxt.site](https://llmstxt.site).

## Limitations

This implementation works for **content collections with markdown or MDX bodies**. It reads `post.body` directly, which is raw text.

For component-based pages (React, Vue, Svelte, or plain `.astro` files), there's no markdown body to extract. You'd need a different strategy:

- Render to HTML and strip tags (lossy, messy)
- Maintain separate content files (duplicate effort)
- Use a headless CMS where content exists independently

For most blogs, content collections are the right choice anyway.

## Why Not Use a Library?

There are Astro integrations for llms.txt. They auto-generate from all pages at build time. Sounds convenient, but:

1. You get everything, including pages you might not want exposed
2. No per-post endpoints
3. No control over the output format
4. Another dependency to maintain

This implementation is ~150 lines of TypeScript. You control exactly what's included. You understand every line. For something this simple, the DIY approach wins.

## Bonus: An SVG Icon

The llms.txt logo is four rounded squares in a plus pattern.
Here's a simple SVG version you can use in your navigation:

```html
<svg viewBox="-4 -4 32 32" width="24" height="24" fill="currentColor" xmlns="http://www.w3.org/2000/svg">
  <rect x="8" y="0" width="8" height="8" rx="2" opacity="0.6" />
  <rect x="0" y="8" width="8" height="8" rx="2" opacity="0.7" />
  <rect x="16" y="8" width="8" height="8" rx="2" opacity="0.8" />
  <rect x="8" y="16" width="8" height="8" rx="2" />
</svg>
```

**Design notes:**

- **`viewBox="-4 -4 32 32"`** adds padding so the icon matches the visual weight of stroke-based icons like Lucide
- **`fill="currentColor"`** inherits from CSS, so it works with any color scheme
- **Varying opacity** (0.6, 0.7, 0.8, 1.0) gives depth without using multiple colors
- **`rx="2"`** rounds the corners to match the original logo style

For Astro, wrap it in a component so you can pass `size` as a prop and reuse it across your site.

## The Result

After deploying, you have:

- `/llms.txt` - Index listing all posts with descriptions
- `/llms-full.txt` - Complete content for RAG systems or full context
- `/llms/post-slug.txt` - Individual posts for focused queries

Agents fetch the index, pick what they need, and get clean markdown. No HTML parsing, no navigation noise, no wasted tokens.

That's the point of the standard.

---

## Testing in the Age of AI Agents

URL: https://kumak.dev/testing-in-the-age-of-ai-agents/
Published: 2025-11-29
Category: philosophy

> When code changes at the speed of thought, tests become less about verification and more about defining what should remain stable.

AI agents don't just write code faster. They make rewriting trivial. Your codebase becomes fluid, reshaping itself as fast as you can describe what you want. But when everything is in flux, how do you know the features still work?

Something needs to hold the shape while everything inside it moves. That something is your tests. Not tests that document how the code works today, but tests that define what it must always do.

## The Contract Principle

The obvious purpose of tests is "catching bugs." But that's incomplete. Tests define what "correct" means. They're a contract: this is what the system must do. Everything else is negotiable.

Kent Beck captured this precisely: tests should be "sensitive to behaviour changes and insensitive to structure changes." A test that breaks when behaviour changes is valuable. A test that breaks when implementation changes, but behaviour stays the same, is actively punishing you for improving your code.

The difference is stark in practice:

```typescript
// Testing implementation - breaks when you refactor
test('calls navItemVariants with correct params', () => {
  const spy = vi.spyOn(styles, 'navItemVariants');
  render(<NavItem to="/orders" label="Orders" />);
  expect(spy).toHaveBeenCalledWith({ active: false });
});

// Testing contract - survives any rewrite
test('renders as a link to the specified route', () => {
  render(<NavItem to="/orders" label="Orders" />);
  const link = screen.getByRole('link', { name: /orders/i });
  expect(link).toHaveAttribute('href', '/orders');
});
```

The first test knows the component uses a function called `navItemVariants`. Tomorrow, you might rename that function or eliminate it entirely. The test breaks. The component still works.

The second test knows only what matters: there's a link, it goes to `/orders`, it says "Orders." Rewrite the entire component. Swap the styling system. As long as users can click a link to their orders, the test passes.

```
$ npm test

❌ FAIL src/components/NavItem.test.tsx
  ✕ calls navItemVariants with correct params
  ✕ passes active prop to styling function
  ✕ renders with expected className

3 tests failed. The component works perfectly. 🙃
```

The tests haven't caught a bug. The behaviour is identical. You're just paying a tax on change.

## The Black Box

Treat every module like a black box. You know what goes in. You know what should come out. What happens inside is none of your tests' business.
This clarifies what to mock. External systems (APIs, databases, third-party services) exist outside your black box. Mock those. Your own modules exist inside. Don't mock those; let them run.

```typescript
// Mock external systems - they're outside your control
vi.mock('~/api/client', () => ({
  fetchUser: vi.fn().mockResolvedValue({ name: 'Test User' }),
}));

// Don't mock your own code - let it run
// ❌ vi.mock('~/components/ui/NavItem');
// ❌ vi.spyOn(myModule, 'internalHelper');
```

When you mock your own code, you're encoding the current implementation into your tests. When the implementation changes, your mocks become lies. They describe a structure that no longer exists, and your tests pass while your code breaks.

A useful heuristic: before committing a test, imagine handing the module's specification to a developer who'd never seen your code. They implement it from scratch, differently. Would your tests pass? If yes, you've tested the contract. If no, go back and fix the test.

## The Circular Verification Problem

Here's where AI changes everything.

Tests exist to verify that code is correct. If AI writes both the code and the tests, what verifies what? The test was supposed to catch AI mistakes. But AI wrote the test. You've created a loop with no external reference point.

> AI writes code → AI writes tests → tests pass → "correct"?

Black box tests break this circularity because they're human-auditable. When a test says "there's a link that goes to `/orders`," you can read that assertion and verify it matches the requirement. You don't need to understand implementation details.

Implementation-coupled tests aren't auditable this way. To verify the test is correct, you'd need to understand the implementation it's coupled to. You're back to trusting AI about AI's work.

This suggests specific rules:

**Treat assertions as immutable.** AI can refactor how a test runs: the setup, the helpers, the structure. AI should not change what a test asserts without explicit human approval. The assertion is the contract.

```typescript
// AI can change this (setup)
const user = await setupTestUser({ role: 'admin' });

// AI should NOT change this (assertion) without approval
expect(user.canAccessDashboard()).toBe(true);
```

**Failing behaviour tests require human attention.** When a contract-level test fails, AI shouldn't auto-fix it. The failure is information. A human must decide: is this a real bug, or did requirements change?

**Separate creation from modification.** AI drafting new tests for new features is relatively safe. AI modifying existing tests is riskier. New tests add coverage. Modified tests might silently remove it.

## What Not to Test

Simple, obvious code doesn't need tests. A component that renders a string as a heading doesn't need a test proving it renders a heading. A utility that concatenates paths doesn't need a test for every combination.

Test complex logic. Test edge cases. Test error handling. Test anything where a bug would be non-obvious or expensive to find later.

```typescript
// Congratulations, you've tested JavaScript
test('banana equals banana', () => {
  expect('🍌').toBe('🍌'); // ✅ PASS
});
```

Don't test that React renders React components. Don't test that TypeScript types are correct. Your test suite isn't a proof of correctness; it's a net that catches bugs that matter.

This restraint has a benefit: a smaller, focused test suite is easier to audit. When every test has a clear purpose, you can review what AI wrote and verify it matches intent.
## The Coverage Trap

Coverage measures execution, not intent. A test that executes a line of code isn't necessarily testing that the line does what it should.

Worse, coverage as a target incentivises exactly the wrong kind of tests. Need to hit 80%? Write tests that spy on every function, assert on every intermediate value. You'll hit your number. You'll also create a test suite that breaks whenever anyone improves the code.

```typescript
// Written for coverage, not for value
test('increases coverage', () => {
  const result = processOrder(mockOrder);
  expect(processOrder).toHaveBeenCalled();  // So what?
  expect(result).toBeDefined();             // Still nothing
});

// Written for behaviour
test('completed orders update inventory', () => {
  const order = createOrder({ items: [{ sku: 'ABC', quantity: 2 }] });
  processOrder(order);
  expect(getInventory('ABC')).toBe(initialStock - 2);
});
```

The real question isn't "how much code did my tests execute?" It's "would my tests catch a bug that matters?"

## A Philosophy for Flux

Tests are how you know code is correct. When both code and tests are fluid, when AI can change either at will, you lose the ability to verify anything. The test that passed yesterday means nothing if it was rewritten to match today's code.

The philosophy is simple:

> Test what the code does, not how it does it.

Tests become specifications, not surveillance. They define what matters, not document what exists. And because they encode observable behaviour rather than internal structure, they remain human-auditable even when AI writes them.

When code is in constant flux, tests are your fixed point. They're stable not because change is expensive, but because they define what "correct" means. Without that fixed point, you have no way to know if your fluid code is flowing in the right direction.

---

## Self-Documenting CLI Design for LLMs

URL: https://kumak.dev/self-documenting-cli-design-for-llms/
Published: 2025-11-28
Category: philosophy

> Agents start fresh every session. Instead of dumping docs upfront, build tools they can query. One take on agent-friendly tooling.

I'm building a CLI tool for browser debugging. It lets AI agents control Chrome through the DevTools Protocol: capture screenshots, inspect network requests, execute JavaScript. The Chrome DevTools Protocol has 53 domains and over 600 methods. That's a lot of capability and a lot of documentation.

Here's the problem: how do I teach an agent what's possible without dumping thousands of tokens into context every session?

Documentation is a wall of text about things you don't need yet. Worse, it drifts. The tool ships a new version, someone forgets to update the docs, and now the agent is following instructions for a method that was renamed three months ago. The tool and its documentation are two artifacts pretending to be one.

When Claude gets stuck with CLI tools, it naturally reaches for `--help`. When that's not enough, it tries `command subcommand --help`. The pattern is consistent: ask the tool, learn from the response, try again.

If `--help` is the agent's natural discovery method, how far can you push it?

## Progressive Disclosure

Instead of documenting everything upfront, make every layer queryable. Watch the conversation unfold:

```shell
# Agent asks: "What can you do?"
bdg --help --json

# Agent asks: "What domains exist?"
bdg cdp --list

# Agent asks: "What can I do with Network?"
bdg cdp Network --list

# Agent asks: "How do I get cookies?"
bdg cdp Network.getCookies --describe

# Agent executes with confidence
bdg cdp Network.getCookies
```

Each answer reveals exactly what's needed for the next question. Five interactions, zero documentation. The tool taught itself.

When the agent doesn't know the exact method name, semantic search bridges the gap:

```shell
$ bdg cdp --search cookie

Found 14 methods matching "cookie":
  Network.getCookies      # Returns all browser cookies for the current URL
  Network.setCookie       # Sets a cookie with the given cookie data
  Network.deleteCookies   # Deletes browser cookies with matching name
  ...
```

The agent thinks "I need something with cookies" and the tool finds everything relevant. No guessing required.

## Errors That Teach

Actionable error messages have been a UX best practice for decades. What's different for agents is the stakes: humans can work around bad UX by searching Stack Overflow. Agents can't. They're stuck with what you give them, racing against a context window that's always shrinking.

And agents make mistakes constantly. They'll type `Network.getCokies` instead of `Network.getCookies`. They'll invent plausible-sounding methods that don't exist.

A typical error:

```shell
$ bdg cdp Network.getCokies
Error: Method not found
```

Now what? The agent has to guess, search, retry. Burn tokens.

Teaching errors provide the path forward:

```shell
$ bdg cdp Network.getCokies
Error: Method 'Network.getCokies' not found

Did you mean:
  - Network.getCookies
  - Network.setCookies
  - Network.setCookie
```

The correction arrives in the same response as the error. No round trip. The agent adapts immediately.

The fuzzy matching goes beyond typos. Try `Networking.getCookies` with the wrong domain name, and it still suggests `Network.getCookies`. The tool understands what you meant, not just what you typed.

Even empty results guide forward:

```shell
$ bdg dom query "article h2"

No nodes found matching "article h2"

Suggestions:
  Verify:  bdg dom eval "document.querySelector('article h2')"
  List:    bdg dom query "*"
```

And success states show next steps:

```shell
$ bdg dom query "h1, h2, h3"

Found 5 nodes:
  [0] <h2>Recent Posts</h2>
  [1] <h2>Testing in the Age of AI Agents</h2>
  ...

Next steps:
  Get HTML:  bdg dom get 0
  Details:   bdg details dom 0
```

Every interaction answers "what now?" Errors suggest fixes. Empty results suggest alternatives. Success shows what to do with the data.

## Semantic Exit Codes

Most tools return 1 for any error. Not helpful. Semantic exit codes create ranges with meaning:

- **80-89**: User errors. Bad input, fix it before retrying.
- **100-109**: External errors. API timeout, retry with backoff.

The agent can branch its logic without parsing error messages. Message, suggestion, exit code: three layers of guidance stacked together.

## The Result

I tested this with an agent starting from zero knowledge. No prior context, no documentation provided. Just the tool.

Five commands later, it was executing CDP methods successfully. It discovered the tool's structure, explored the domains, found the method it needed, understood the parameters, and executed.

When I introduced typos deliberately, the suggestions caught them. When commands failed, the exit codes pointed toward solutions. The agent recovered without external help.

The context cost? Roughly 500 tokens for discovery, versus thousands for a documentation dump. And those 500 tokens bought understanding, not just information.

## Design for Dialogue

External documentation will always drift from reality. The tool itself never lies about its own capabilities.

Tools designed for agents aren't dumbed down. They're more explicit. They expose their structure. They teach through interaction rather than requiring upfront reading.

Design for dialogue, not documentation.

---

## MCP vs CLI on Chrome DevTools Protocol

URL: https://kumak.dev/cli-vs-mcp-benchmarking-browser-automation/
Published: 2025-11-23
Category: opinion

> CLI exposes all 644 CDP methods. MCP exposes a curated subset. We benchmarked both for browser automation. Here's how they compared.

When building tools for AI agents, developers face a fundamental interface choice: expose functionality through the Model Context Protocol (MCP), or provide a traditional command-line interface (CLI) that agents invoke via shell commands.

Both approaches have vocal advocates. MCP promises structured tool definitions, type safety, and seamless integration with AI platforms. CLI tools offer Unix composability, predictable output, and decades of battle-tested design patterns.

We ran a series of benchmarks comparing two browser automation tools to answer a practical question: **which interface paradigm serves AI agents better for real-world developer tasks?**

## The Contenders

### Chrome DevTools MCP Server

The official MCP server for Chrome DevTools, maintained by the Chrome team. It exposes browser automation through the MCP protocol with tools like:

- `new_page` / `close_page` - Session management
- `take_snapshot` - Full accessibility tree capture
- `click` / `fill` - Element interaction via accessibility UIDs
- `evaluate_script` - JavaScript execution
- `list_console_messages` / `list_network_requests` - Telemetry

**Design philosophy**: Accessibility-first. Interactions happen through the accessibility tree, providing robust element targeting that survives DOM changes.

### bdg (Browser Debugger CLI)

A [command-line tool](https://github.com/szymdzum/browser-debugger-cli) providing direct Chrome DevTools Protocol (CDP) access.
Commands include:

- `bdg <url>` / `bdg stop` - Session lifecycle
- `bdg dom query` / `bdg dom click` - CSS selector-based interaction
- `bdg console` / `bdg peek` - Live telemetry monitoring
- `bdg cdp <method>` - Direct CDP method invocation
- `bdg network har` - HAR export

**Design philosophy**: Power-user debugging. Full CDP access with Unix-style composability.

## Benchmark Design

We tested five scenarios representing actual developer debugging workflows:

| Test | Difficulty | Task |
|------|------------|------|
| Basic Error | Easy | Find and diagnose one JS error |
| Multiple Errors | Moderate | Capture and categorize 5+ errors |
| SPA Debugging | Advanced | Debug React app, correlate console/network |
| Form Validation | Expert | Test validation logic, find bugs |
| Memory Leak | Master | Detect and quantify DOM memory leak |

**Methodology**:

- Same URLs and time limits for both tools
- Alternating test order to prevent learning bias
- Metrics: task score, completion time, tokens consumed
- Token Efficiency Score (TES) = (Score × 100) / (Tokens / 1000)

## Results Summary

| Metric | bdg (CLI) | MCP |
|--------|-----------|-----|
| **Total Score** | 77/100 | 60/100 |
| **Total Time** | 441s | 323s |
| **Total Tokens** | ~38.1K | ~39.4K |
| **Token Efficiency (TES)** | 202.1 | 152.3 |

**Winner: CLI (+17 points, +33% token efficiency)**

## Test-by-Test Analysis

### Test 1: Basic Error Detection

| Tool | Score | Time | Tokens |
|------|-------|------|--------|
| bdg | 18/20 | 69s | ~3.6K |
| MCP | 14/20 | 46s | ~4.8K |

Both tools successfully navigated to the page and triggered an error. The difference emerged in output quality:

- **bdg**: Full stack trace with 6 frames, function names, line/column numbers
- **MCP**: Basic error message "$ is not defined", limited location info

MCP was faster but provided less actionable debugging information. For a developer, bdg's output means immediately understanding the call chain; MCP's output requires additional investigation.

### Test 2: Multiple Error Collection

| Tool | Score | Time | Tokens |
|------|-------|------|--------|
| bdg | 18/20 | 75s | ~18.7K |
| MCP | 12/20 | 48s | ~9.3K |

The page had 17 "Run" buttons, each triggering different errors.

- **bdg**: Used JavaScript evaluation to click all 17 buttons with timeouts in a single command. Captured 18 errors (14 unique) with full stack traces.
- **MCP**: Made 11 individual click calls, missing 6 buttons. Captured only 3 errors.

This test revealed a fundamental capability gap. bdg's `Runtime.evaluate` access enables batch operations:

```bash
bdg cdp Runtime.evaluate --params '{
  "expression": "document.querySelectorAll(\"button\").forEach((b,i) => setTimeout(() => b.click(), i*100))"
}'
```

MCP doesn't expose arbitrary JavaScript execution. Each interaction requires a separate tool call. For comprehensive testing, this limitation compounds.

### Test 3: SPA Debugging

| Tool | Score | Time | Tokens |
|------|-------|------|--------|
| bdg | 14/20 | 100s | ~4.7K |
| MCP | 13/20 | 57s | ~6.6K |

Both tools tested a React TodoMVC app. Neither found significant bugs (the app is well-built). Both identified a 404 favicon error.

This was the closest test. When an application has no bugs to find, the tools perform similarly. The marginal bdg advantage came from HAR export capability for network analysis.
### Test 4: Form Validation Testing

| Tool | Score | Time | Tokens |
|------|-------|------|--------|
| bdg | 15/20 | 93s | ~3.5K |
| MCP | 13/20 | 102s | ~15.2K |

Testing a form with validation rules revealed MCP's verbosity problem.

The form included a country dropdown with 195 options. Every MCP snapshot included the full accessibility tree, all 195 country options, repeated on every interaction. Token usage ballooned to 15.2K for the same task bdg completed in 3.5K tokens.

bdg tested more scenarios (4 vs 3) in less time and finished under the time limit. MCP exceeded the limit by 42 seconds, incurring a penalty.

### Test 5: Memory Leak Detection

| Tool | Score | Time | Tokens |
|------|-------|------|--------|
| bdg | 12/20 | 104s | ~7.6K |
| MCP | 8/20 | 70s | ~3.5K |

This test exposed a fundamental capability difference. bdg used CDP's HeapProfiler methods directly:

```bash
bdg cdp Runtime.getHeapUsage
# Baseline: 833KB used, 1.5MB total

# Trigger leak...

bdg cdp Runtime.getHeapUsage
# After: 790KB used, 3MB embedder heap (+44% growth)
```

MCP has no access to profiling APIs. It could observe DOM growth visually but couldn't measure actual memory consumption. Without quantification, it couldn't prove a leak existed, only that more elements appeared on screen.

This isn't MCP being "bad." It's MCP not exposing the capability. For memory debugging, that's a dealbreaker.

## The Accessibility Tree Question

A common defense of MCP's verbosity: "The accessibility tree provides complete page understanding in one call."

This argument has two problems:

### 1. CLI Tools Can Use Accessibility Trees Too

bdg provides selective accessibility access:

```bash
bdg dom a11y describe 0          # Single element
bdg dom a11y ".submit-button"    # Specific selector
```

The difference isn't *whether* to use accessibility data. It's *how much* to retrieve:

| Approach | bdg | MCP |
|----------|-----|-----|
| Query strategy | Fetch what you need | Dump everything |
| 195-option dropdown | ~50 tokens | ~5,000 tokens |
| Complex page (Amazon) | ~1,200 tokens | ~52,000 tokens |

### 2. Completeness ≠ Usefulness

An agent receiving 52,000 tokens of accessibility tree still needs to parse it to find relevant elements. That parsing happens in the agent's context window, consuming capacity for reasoning.

With selective queries, the agent asks for what it needs. The tool does the filtering. The agent's context stays focused.

## Why CLI Won This Benchmark

### 1. Token Efficiency at Scale

For the Amazon product page test:

```
MCP:  52,000 tokens (single snapshot, truncated at system limit)
bdg:   1,200 tokens (two targeted queries)
```

That's 43x more efficient. In a context window, that difference determines whether you can complete a complex debugging session or run out of space mid-task.

### 2. Capability Coverage

| Capability | bdg | MCP |
|------------|-----|-----|
| Console errors with stack traces | Yes | Partial |
| Memory profiling | Yes | No |
| Network HAR export | Yes | No |
| Batch JavaScript execution | Yes | No |
| Selective DOM queries | Yes | No |
| Direct CDP method access | Yes (644 methods) | No |

For developer debugging tasks, these aren't edge features. They're core workflows.

### 3. Unix Composability

CLI output pipes naturally:

```bash
bdg console --json | jq '.errors | length'
bdg network har - | jq '.log.entries[] | select(.response.status >= 400)'
bdg dom query "button" | head -5
```

MCP responses require the agent to parse and filter internally. That's additional reasoning steps and context consumption.
### 4. Predictable Output Size

With CLI tools, agents can estimate token impact before calling:

- `bdg dom query ".error"` - proportional to matched elements
- `bdg console --last 10` - bounded by limit parameter

MCP's `take_snapshot` returns whatever the page contains. Could be 5K tokens, could be 52K. The agent can't predict or control this.

## When MCP Might Make Sense

This benchmark tested developer debugging workflows. MCP's design optimizes for different scenarios:

**Cross-Platform Integration**: MCP is a protocol, not a tool. The same MCP server works with Claude Desktop, VS Code extensions, and any MCP-compatible client. If you're building for that ecosystem, MCP integration is valuable.

**Sandboxed Environments**: MCP's restricted capabilities (no arbitrary JS eval, no profiling) can be features in contexts requiring safety guarantees. If you're building a user-facing automation tool where arbitrary code execution is a risk, MCP's constraints are appropriate.

**Accessibility-First Testing**: For WCAG compliance auditing where you genuinely need the full accessibility tree, MCP's comprehensive snapshots are useful. The verbosity is the point.

## Implications for Tool Builders

### If You're Building for AI Agents

1. **Selective queries over bulk dumps**. Let agents request specific data rather than forcing them to parse everything.
2. **Predictable output sizing**. Provide limits, pagination, or filtering so agents can control context consumption.
3. **Full capability access**. If the underlying system can do it, expose it. Don't pre-decide what agents "need."
4. **Structured, parseable output**. JSON with consistent schemas beats prose descriptions.
5. **Composability**. Outputs that pipe to other tools extend capability without additional implementation.

### If You're Choosing Between Paradigms

For **power-user developer workflows**: CLI wins. Direct access, Unix composition, predictable output.

For **ecosystem integration and sandboxing**: MCP has structural advantages that matter in different contexts.

## Conclusion

We set out to compare MCP and CLI as interfaces for AI agents doing browser automation. The benchmark results are clear: for developer debugging workflows, CLI provides more capability with better efficiency.

The margin wasn't close: 77 vs 60 points, 33% better token efficiency. CLI completed tasks that MCP structurally couldn't (memory profiling), and did shared tasks with less overhead (selective queries vs full dumps).

This doesn't mean MCP is "bad." It means MCP optimizes for different constraints than an AI agent debugging a web application. Protocol standardization and sandboxed execution matter in some contexts. They just aren't the contexts we tested.

For tool builders: consider your users. If AI agents are a primary audience, the CLI paradigm (selective queries, predictable output, full capability access) serves them better than protocol abstractions that trade power for portability.

## Appendix: Raw Data

### Detailed Token Analysis by Test

| Test | bdg Tokens | MCP Tokens | Ratio |
|------|------------|------------|-------|
| Test 1: Basic Error | 3,600 | 4,800 | 0.75x |
| Test 2: Multiple Errors | 18,700 | 9,300 | 2.0x |
| Test 3: SPA Debugging | 4,700 | 6,600 | 0.71x |
| Test 4: Form Validation | 3,500 | 15,200 | 0.23x |
| Test 5: Memory Leak | 7,600 | 3,500 | 2.17x |
| **Total** | **38,100** | **39,400** | **0.97x** |

Note: Similar total tokens, but bdg achieved a 28% higher score, meaning tokens were spent more effectively.
### Token Efficiency Score Breakdown

```
TES = (Score × 100) / (Tokens / 1000)

bdg: (77 × 100) / 38.1 = 202.1
MCP: (60 × 100) / 39.4 = 152.3

Advantage: +33% for CLI
```

### Capability Matrix

| CDP Domain | bdg Access | MCP Access |
|------------|------------|------------|
| Runtime | Full | evaluate_script only |
| DOM | Full | Via accessibility tree |
| Network | Full + HAR export | list_network_requests |
| Console | Full + streaming | list_console_messages |
| HeapProfiler | Full | None |
| Debugger | Full | None |
| Performance | Full | None |
| Accessibility | Selective queries | Full tree dumps |

**Full Results**: [BENCHMARK_RESULTS.md](https://github.com/szymdzum/browser-debugger-cli/blob/main/docs/benchmarks/BENCHMARK_RESULTS.md)

---

## How My Agent Learned GitLab

URL: https://kumak.dev/how-my-agent-learned-gitlab/
Published: 2025-11-17
Category: tutorial

> Teaching an agent to use CLI tools isn't about writing perfect documentation. It's about creating a feedback loop where the tool teaches, the agent learns, and reflection builds institutional knowledge.

I work with a monorepo that has over 80 CI/CD jobs across 12 stages. When pipelines fail, I need to trace through parent pipelines, child pipelines, failed jobs, and error logs.

There's an MCP server for GitLab. I tried it once, then installed `glab` and wrote a basic [skill file](https://gist.github.com/szymdzum/304645336c57c53d59a6b7e4ba00a7a6) with command examples.

What's interesting isn't the skill itself. It's how it developed through three investigation sessions.

## Session One: Real-Time Self-Correction

"Investigate pipeline 2961721" was my first request.

Claude ran a command. Got 20 jobs back. The pipeline had 80+. I watched Claude notice the discrepancy, run `glab api --help`, spot the `--paginate` flag, and try again. This time: all the jobs.

Then it pulled logs with `glab ci trace <job-id>`. The logs looked clean. No errors visible. But the job had definitely failed.

I didn't explain what was wrong. I asked: "The job failed, but you're not seeing errors. What might be happening?"

Claude reasoned through it: "Errors might be going to stderr instead of stdout." Then checked `glab ci trace --help`, found nothing about stderr handling, and figured out the solution: `glab ci trace 2>&1`. Reran it. Errors appeared.

**After the session**, I asked: "What went wrong? What did you learn?"

Claude listed the issues: forgot to paginate (only saw 20 of 80+ jobs), missed stderr output, didn't know about child pipelines. We talked through each one, then updated the skill file:

```markdown
## Critical Best Practices

1. **Always use --paginate** for job queries
2. **Always capture stderr** with `2>&1` when getting logs
3. **Always check for child pipelines** via bridges API
4. **Limit log output** — use `tail -100` or `head -50`
```

Twenty minutes of reflection. Four critical lessons documented.

## Session Two: Faster, Smarter

"Check pipeline 2965483."

This time, Claude used `--paginate` from the start, captured stderr when pulling logs, and checked for child pipelines via the bridges API. Found a failed child pipeline, got its jobs, identified the error. Start to finish: five minutes.

But something new happened. All 15 Image build jobs failed. Claude started pulling logs for each one. I watched it fetch the first three — all identical errors. The base Docker image was missing from ECR.

"You just pulled three identical error messages," I pointed out. "What does that tell you?"
Claude recognised the pattern: "When multiple jobs of the same type fail, they likely have the same error. I should check one representative job instead of all 15."

Added to the skill file:

```markdown
## Pattern: Multiple Failed Jobs

When many jobs fail (e.g., all Image builds), check one representative job first.

FIRST_FAILED=$(glab api "projects/2558/pipelines/<pipeline-id>/jobs" --paginate | \
  jq -r '.[] | select(.status == "failed") | .id' | head -1)

glab ci trace $FIRST_FAILED 2>&1 | tail -100
```

## Session Three: Institutional Knowledge

Third investigation. Checkout server build timed out. Claude saw the error, started digging.

"Wait," I said. "Before you investigate, check the duration."

Claude checked: 44 minutes. "That's within normal range for checkout server builds," I told it. "This is a known issue, not an actual failure."

Added to the skill file:

```markdown
## Common Error Patterns

Build Timeout: ERROR: Job failed: execution took longer than