# llmtxt.info — full corpus

> A bilingual reference site on the llms.txt proposal. This file inlines the core content of llmtxt.info pages so an LLM client can ingest the full knowledge base in a single fetch.

Last reviewed 2026-04-14.
Source spec: https://llmstxt.org/
Proposal author: Jeremy Howard, Answer.AI, September 2024.
Site source: https://llmtxt.info/

---

## What is llms.txt

llms.txt is a Markdown file placed at the root of a website (https://example.com/llms.txt) that gives large language models a short, curated map of the most useful pages and resources on the site. It was proposed by Jeremy Howard at Answer.AI in September 2024 as a community convention, not a formal standard. The canonical spec lives at https://llmstxt.org/.

Why it exists:

- LLM context windows are small compared to the size of a typical website. Crawling a whole site for every product question is expensive and imprecise.
- The most useful pages for an LLM are rarely the most-visited. llms.txt lets you explicitly say "these are the pages worth reading."
- It provides a stable, versioned contract: renaming a page forces you to update the file.
- Its companion file llms-full.txt (popularized by Mintlify with Anthropic) inlines full page content in one Markdown artifact.

What llms.txt is NOT:

- Not a W3C or IETF standard. It is a community proposal.
- No major LLM provider (OpenAI, Anthropic, Google, Meta) has publicly committed to consuming it systematically. Adoption on the receiving side is opportunistic.
- Not a search ranking signal. As of April 2026 there is no public evidence that publishing llms.txt moves classical search or AI Overviews rankings.
- Not a replacement for robots.txt, sitemap.xml, or schema.org.
- Not a security mechanism. Everything listed in llms.txt is, by definition, public.

Adoption (2026): real but uneven. Early adopters are documentation platforms and developer tools: Anthropic, Cloudflare, Vercel, Stripe, Mintlify, Perplexity. Outside dev-tools, coverage stays low.
A 2025–2026 SE Ranking study of ~300,000 domains measured roughly 10% adoption, concentrated in tech. Skepticism is documented: John Mueller (Google) has publicly cautioned against expecting Search benefits.

Typical use cases:

- API and developer documentation sites.
- SaaS marketing sites with technical buyers.
- Open-source projects (point to README, contributing, examples, changelog).
- Knowledge bases where a few canonical answers beat a full help center.

---

## How it works

A valid llms.txt is a Markdown file with a fixed, predictable structure. It is designed to be readable by both humans and machines: the same file must serve as documentation and as a parseable contract.

Structure, top to bottom:

1. One H1 with the name of the site or project. This is the only strictly required element.
2. A short blockquote summary, typically one or two sentences.
3. Optional free Markdown — paragraphs and lists, but no other heading before the first H2.
4. Zero or more H2-delimited file-list sections. Each contains a Markdown list of links: `- [name](url)`, optionally followed by `: notes`.
5. An optional H2 titled exactly "Optional" — its items can be skipped by short-context clients.

Annotated example:

```
# Acme

> Acme is a hosted analytics platform for product teams. The pages below cover product, pricing, API and integrations.

Acme processes 1B+ events per day. The map here is curated for assistants — it is not exhaustive.

## Product

- [Overview](https://acme.example/product): capabilities and screenshots.
- [Use cases](https://acme.example/use-cases): scenarios for product, marketing, and support teams.
- [Changelog](https://acme.example/changelog): monthly updates.

## Pricing

- [Plans](https://acme.example/pricing): tiers, limits, overage rules.
- [Billing FAQ](https://acme.example/billing-faq): invoices, receipts, VAT.

## Developers

- [REST API reference](https://docs.acme.example/api): full endpoint catalog.
- [JavaScript SDK](https://docs.acme.example/sdk/js): install, init, track events.
- [Python SDK](https://docs.acme.example/sdk/python): install, init, track events.
- [Webhooks](https://docs.acme.example/webhooks): events, signatures, retries.

## Optional

- [Brand assets](https://acme.example/brand): logos, palette.
- [Press](https://acme.example/press): announcements history.
```

Parsing rules (reference implementation):

1. Find the first `# ` line — that is the title.
2. If the next non-empty block is a blockquote, it is the summary.
3. Everything up to the first `## ` is free body text.
4. Each `## ` opens a section; until the next `## `, list items are parsed as `[name](url)` with optional notes after a colon.

Practical limits:

- Size: no hard cap, but past ~50 KB, short-context clients begin to struggle. Move volume to llms-full.txt or per-product variants.
- Link count: no spec limit, but a 200+ list will be skimmed, not read. Curate.
- Languages: the spec is silent on i18n. Two patterns: a single English root file, or per-locale variants behind paths (/en/llms.txt, /fr/llms.txt).
- Auth and personalization: out of scope. The file is public.

---

## llms.txt vs robots.txt vs sitemap.xml vs llms-full.txt

They are not interchangeable.

- robots.txt tells crawlers what they may and may not access. Standardized as IETF RFC 9309 (2022). REP grammar: User-agent, Disallow, Allow, Sitemap. Access control via exclusion.
- sitemap.xml tells search engines what exists. XML format, sitemaps.org protocol. Discovery via exhaustive enumeration, with lastmod/priority metadata.
- llms.txt tells AI assistants what is worth reading. Markdown, curated, selective.
- llms-full.txt gives them the content itself, concatenated in Markdown.
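The four parsing rules listed under "How it works" can be sketched as a minimal parser. This is an illustrative sketch, not the official reference implementation; the function and field names (`parse_llms_txt`, `LlmsTxt`, `Link`) are my own.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Link:
    name: str
    url: str
    notes: str = ""

@dataclass
class LlmsTxt:
    title: str = ""
    summary: str = ""
    body: str = ""
    sections: dict = field(default_factory=dict)

# `- [name](url)` with an optional `: notes` suffix (rule 4).
LINK_RE = re.compile(r"^-\s+\[(?P<name>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<notes>.*))?$")

def parse_llms_txt(text: str) -> LlmsTxt:
    doc = LlmsTxt()
    current = None          # name of the open H2 section, or None
    body_lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and not doc.title:
            doc.title = line[2:].strip()          # rule 1: first H1 is the title
        elif line.startswith("> ") and not doc.summary and current is None:
            doc.summary = line[2:].strip()        # rule 2: blockquote summary
        elif line.startswith("## "):
            current = line[3:].strip()            # rule 4: each H2 opens a section
            doc.sections[current] = []
        elif current is None and line and doc.title:
            body_lines.append(line)               # rule 3: free body before first H2
        elif current is not None:
            m = LINK_RE.match(line)
            if m:
                doc.sections[current].append(
                    Link(m["name"], m["url"], (m["notes"] or "").strip()))
    doc.body = "\n".join(body_lines)
    return doc
```

Fed the Acme example above, this yields `title == "Acme"`, the blockquote as `summary`, and a `sections` dict keyed by "Product", "Pricing", "Developers", "Optional".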
Comparison matrix:

| Criterion         | robots.txt               | sitemap.xml              | llms.txt                         | llms-full.txt                    |
|-------------------|--------------------------|--------------------------|----------------------------------|----------------------------------|
| Primary goal      | Access control           | Page discovery           | Curated map for LLM clients      | Inline corpus for LLM ingestion  |
| Audience          | Web crawlers             | Search engines           | LLM clients and assistants       | LLM clients wanting full content |
| Format            | Plain text (REP grammar) | XML                      | Markdown                         | Markdown (concatenated)          |
| Standard          | Yes — IETF RFC 9309      | Yes — sitemaps.org       | Community proposal — llmstxt.org | Community proposal — llmstxt.org |
| Controls indexing | Yes (allow/disallow)     | No (discovery hint)      | No                               | No                               |
| Approach          | Exclusion                | Discovery (completeness) | Curation (selective)             | Inlining (full text)             |
| Path              | /robots.txt              | /sitemap.xml             | /llms.txt                        | /llms-full.txt                   |

Use them together. Most production sites should publish robots.txt + sitemap.xml + llms.txt. Add llms-full.txt when content is primarily textual and benefits from bulk ingestion.

Schema.org / JSON-LD operates at a different level: it marks up the semantics of individual pages (Product, FAQ, Article). llms.txt is a site-wide map. Use both — Schema.org tells an LLM what a page is; llms.txt tells it which pages to look at first.

---

## How to create llms.txt

Step 1 — plan the content. List 5 to 20 pages an LLM would need to answer real questions about your project. Useful buckets:

- Product: overview, use cases, pricing.
- Documentation: getting started, API reference, key guides.
- Integrations: partners and SDKs.
- Reference: changelog, status, security policy.
- Optional: brand assets, press, archives.

Step 2 — minimal template:

```
# {Site name}

> {One-sentence description of what your site is.}

## Pages

- [{Page title}]({absolute URL}): {short note}
```

Step 3 — recommended template:

```
# {Site name}

> {One or two-sentence overview. Factual, no marketing.}

{Optional: 1-3 sentences of context.}

## Product

- [Product overview]({URL}): key capabilities.
- [Pricing]({URL}): plans and limits.

## Documentation

- [Getting started]({URL}): install, first call, hello world.
- [API reference]({URL}): full endpoint catalog.
- [Guides]({URL}): tutorials and how-tos.

## Optional

- [Changelog]({URL}): version history.
- [Brand assets]({URL}): logos and palette.
```

Step 4 — deploy. The file must be served at /llms.txt with Content-Type: text/plain; charset=utf-8.

- Cloudflare Pages, Vercel, Netlify: drop the file in public/ (Next.js, Astro) or static/ (SvelteKit, Hugo).
- Next.js: either public/llms.txt, or a route handler at app/llms.txt/route.ts returning a NextResponse with text/plain content.
- Astro: public/llms.txt, or src/pages/llms.txt.ts reading from a content collection and returning a Response.
- WordPress: upload via SFTP to the web root, or add an init action in functions.php that intercepts /llms.txt and returns the file contents.
- nginx: copy to the web root. Express: add a route returning text/plain.

Step 5 — automate. Maintaining the file by hand works for a small site but breaks quickly. Two patterns:

- Build script: iterate over your content collection (Markdown, MDX, CMS) and write llms.txt to dist/.
- Server route: render on demand from a database or CMS.

Step 6 — validate. Paste the file into a validator to confirm spec compliance. Catch: missing H1, malformed link syntax, relative URLs, content outside sections, an accidental second H1, oversized files.

Checklist before publication:

- Served at /llms.txt with 200 OK.
- Content-Type: text/plain; charset=utf-8.
- Exactly one H1.
- Plain-language blockquote summary, no marketing.
- All URLs absolute (https://...).
- Every section has at least one item.
- No private or auth-gated URLs listed.
- Validator returns no errors.
- If you have a corpus to expose, publish /llms-full.txt too.
- robots.txt still allows crawling the file (no Disallow: /llms.txt).
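The Step 6 checks can be sketched as a small validator suitable for a CI step. This is an illustrative example, not an official tool; the rule set and messages are my own, and it covers only the structural checks (H1 count, link syntax, absolute URLs, links outside sections, file size).

```python
import re

# Strict `- [name](url)` list-item shape with an optional `: notes` suffix.
LINK_RE = re.compile(r"^- \[[^\]]+\]\((?P<url>[^)]+)\)(: .*)?$")

def validate_llms_txt(text: str) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    errors = []
    lines = text.splitlines()

    h1s = [l for l in lines if l.startswith("# ")]
    if len(h1s) == 0:
        errors.append("missing H1")
    elif len(h1s) > 1:
        errors.append("more than one H1")

    in_section = False
    for n, line in enumerate(lines, start=1):
        if line.startswith("## "):
            in_section = True
            continue
        if line.startswith("- "):
            m = LINK_RE.match(line)
            if not m:
                errors.append(f"line {n}: malformed link syntax")
                continue
            if not m["url"].startswith(("https://", "http://")):
                errors.append(f"line {n}: relative URL {m['url']!r}")
            if not in_section:
                errors.append(f"line {n}: link outside any H2 section")

    if len(text.encode("utf-8")) > 50_000:
        errors.append("file larger than ~50 KB; consider llms-full.txt")
    return errors
```

Wiring this into CI is one line: fail the build if the returned list is non-empty.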
---

## Best practices

Ten rules:

1. Curate. 10 to 30 links in the root file beats 200.
2. Use absolute URLs. Always https://yourdomain.com/...
3. Group by product surface (Product, Pricing, Developers), not by content type (blog/doc/guide).
4. Keep the summary factual. Wikipedia opener, not landing-page hero.
5. One sentence per item. The colon note is for disambiguation, not selling.
6. Use the Optional section sparingly. Press, brand assets, archives — not half your sitemap.
7. Mirror your stable URLs. Stale URLs poison the file's reputation.
8. Publish llms-full.txt when content is text-heavy (docs, reference, tutorials).
9. Run the validator in CI. A content migration that breaks the file should fail the build.
10. Date your file. A short "Last reviewed YYYY-MM-DD" in the body helps humans and crawlers.

Common mistakes:

- No H1. It is the only strictly required element.
- Multiple H1s. Use H2s for sections.
- Custom front-matter. No YAML, no JSON header. The spec is strict.
- Pasted Markdown tables or images. Stick to heading + blockquote + lists.
- Including auth-gated URLs. The LLM will hit a wall.
- Overlong descriptions. Keep notes under 15 words, no marketing.
- Listing 500 URLs. Use llms-full.txt or per-product variants.
- Forgetting to update robots.txt so /llms.txt stays reachable.

Security: everything in llms.txt is broadcast. Never list staging, preview, or auth-gated URLs, or URLs with secrets in query strings. Audit at every release.

CI automation: generate → validate → fail the build on error → diff across releases → smoke-test the production URL after deploy.

---

## Benefits and limitations

Benefits:

1. A shipped, machine-readable contract. Writing one forces the team to agree on the canonical page set.
2. Better grounding for assistants that do read it (Cursor, Windsurf, MCP integrations, RAG pipelines).
3. A stable, citable corpus (especially when paired with llms-full.txt).
4. Brand disambiguation: you control the first sentence an assistant reads about you.
5. A defensible baseline for GEO/AEO work. Low cost, real upside.

Limitations:

1. No major search engine publicly confirms reading it. As of April 2026, neither Google, Bing, nor any major LLM provider has done so.
2. It does not control crawler behavior. Keep using robots.txt for allow/disallow.
3. Adoption on the receiving side is uneven. Long tail.
4. The spec is community-maintained, not standardized. Minor changes are expected.
5. Easy to over-optimize. Keyword stuffing and marketing copy read as low-quality fast.

Documented skepticism: John Mueller (Google) has questioned both adoption and impact in public posts throughout 2025–2026. Paraphrased: "publishing is cheap; expecting Google to consume it is wishful thinking." Counter-argument from the docs community: the LLM-side ecosystem (Perplexity, Claude, Cursor, MCP) is already a large enough audience to justify the file. Honest summary: publish because LLM assistants matter to your users, not because you expect Google rankings to move.

When it's worth it: documentation sites, developer tools, APIs, SaaS with technical buyers, knowledge bases, brands with name collisions.

When it's probably not: pure e-commerce, visual or media-first sites, sites without stable URLs, sites heavily under auth.

How to measure impact:

- Server logs: filter for GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot hitting /llms.txt and /llms-full.txt.
- Referrer analysis: watch chat.openai.com, claude.ai, perplexity.ai referrers.
- Manual prompts: periodically ask Claude, ChatGPT, Perplexity about your product. Note citations and URL choices.
- Brand monitoring: Profound, Otterly, Xfunnel, AthenaHQ track LLM mentions.

---

## Glossary

- AEO (Answer Engine Optimization): optimizing content so it is selected and quoted by AI assistants. Overlaps with GEO.
- Crawler: program that fetches pages. Search crawlers (Googlebot, Bingbot) vs AI crawlers (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot).
- GEO (Generative Engine Optimization): structuring content so generative engines reference and cite it accurately.
- Grounding: when an LLM bases its answer on retrieved sources rather than parametric memory.
- llms.txt: Markdown file at /llms.txt listing the highest-signal pages of a site for LLM consumption.
- llms-full.txt: Markdown file inlining the content of those pages, concatenated. Sibling convention by Mintlify with Anthropic.
- MCP (Model Context Protocol): open protocol from Anthropic for connecting LLMs to tools and data sources.
- Optional (section): a section titled exactly "Optional" whose items can be skipped by short-context clients.
- RAG (Retrieval-Augmented Generation): the model retrieves relevant documents at query time and uses them as context.
- REP (Robots Exclusion Protocol): the grammar of robots.txt. IETF RFC 9309 (2022).
- robots.txt: plain-text access control file at /robots.txt.
- Schema.org: vocabulary for marking up per-page semantics in JSON-LD or microdata.
- sitemap.xml: XML file listing every URL you want a search engine to know about.
- User-agent: identifying string sent by a client. AI crawlers include GPTBot, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot.

---

## Canonical URLs

- English root: https://llmtxt.info/
- French root: https://llmtxt.info/fr/
- Source spec: https://llmstxt.org/
- Original proposal: https://www.answer.ai/posts/2024-09-03-llmstxt.html
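The server-log check described under "How to measure impact" can be sketched as a short script that counts AI-crawler hits on /llms.txt and /llms-full.txt. This is an illustrative sketch: it assumes a combined-format (nginx/Apache) access log, and the bot list is the one from the glossary, matched by substring.

```python
import re
from collections import Counter

# AI crawler user-agents listed in the glossary above.
AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "OAI-SearchBot")

# Request path and user-agent of a combined-format access log line.
LOG_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')

def count_ai_hits(log_lines, paths=("/llms.txt", "/llms-full.txt")):
    """Count hits on the llms.txt files, per AI crawler."""
    hits = Counter()
    for line in log_lines:
        m = LOG_RE.search(line)
        if not m or m["path"] not in paths:
            continue
        for bot in AI_BOTS:
            if bot in m["ua"]:
                hits[bot] += 1
    return hits
```

Run weekly over rotated logs and the trend line tells you whether AI crawlers are actually fetching the file.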