Glossary
The terms you will run into around llms.txt — defined briefly, with cross-links.
Terms
- AEO — Answer Engine Optimization
- Optimizing content so it is selected and quoted by AI assistants and answer engines (Perplexity, ChatGPT, Claude). Overlaps heavily with GEO.
- Crawler
- A program that fetches pages on the web. Search crawlers (Googlebot, Bingbot) build search indexes. AI crawlers (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot) collect content for training or grounding.
- GEO — Generative Engine Optimization
- The practice of structuring content so generative engines reference and cite it accurately. Tactics include clean Markdown, explicit headings, structured data, and (yes) llms.txt.
- Grounding
- When an LLM bases its answer on retrieved sources rather than parametric memory. llms.txt is a grounding hint: "if you ground answers about us, prefer these pages."
- llms.txt
- A Markdown file at /llms.txt that lists the highest-signal pages of a site for LLM consumption. Proposed by Jeremy Howard (Answer.AI), September 2024. Spec at llmstxt.org.
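A minimal llms.txt following the llmstxt.org structure (H1 name, blockquote summary, H2 link sections) might look like this; the site name and URLs are placeholders:

```markdown
# Acme Widgets

> Acme makes embeddable widget components. The pages below are served as clean Markdown.

## Docs

- [Quickstart](https://acme.example/docs/quickstart.md): install steps and a first widget
- [API reference](https://acme.example/docs/api.md): endpoints and parameters

## Optional

- [Press kit](https://acme.example/press.md): logos and boilerplate copy
```

The "Optional" section is the spec's escape hatch: clients with limited context can stop before it.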
- llms-full.txt
- Sibling convention: a single Markdown file containing the actual content of relevant pages, concatenated. Designed for one-shot ingestion. Popularized by Mintlify with Anthropic.
- MCP — Model Context Protocol
- An open protocol from Anthropic for connecting LLMs to tools and data sources. Several MCP servers fetch /llms.txt as part of their context-loading flow.
- Optional (section)
- A section in llms.txt whose H2 title is exactly "Optional". Items there can be skipped by clients with limited context — use it for nice-to-haves (brand assets, archives, press).
- RAG — Retrieval-Augmented Generation
- A pattern where the model retrieves relevant documents at query time and uses them as context. llms.txt and llms-full.txt are convenient inputs for site-specific RAG.
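As a sketch of that input step, the link list in an llms.txt file can be parsed into a small retrieval corpus. The regex below assumes the spec's `- [title](url): notes` link format; fetching and chunking the linked pages is left out:

```python
import re

# One llms.txt link-list entry: "- [title](url)" with an optional ": notes" tail.
LINK_RE = re.compile(r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<note>.*))?$")

def parse_llms_txt(text: str) -> list[tuple[str, str, str]]:
    """Extract (title, url, note) triples from an llms.txt document."""
    docs = []
    for line in text.splitlines():
        m = LINK_RE.match(line.strip())
        if m:
            docs.append((m["title"], m["url"], m["note"] or ""))
    return docs
```

Each triple can then be fetched and embedded like any other RAG document; the notes field often makes a useful pre-computed summary.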
- REP — Robots Exclusion Protocol
- The grammar used by robots.txt (User-agent / Disallow / Allow / Sitemap). Standardized as IETF RFC 9309 in 2022. Different in intent and syntax from llms.txt.
- robots.txt
- Plain-text file at /robots.txt that tells crawlers what they may or may not fetch. An access-control file, complementary to llms.txt rather than a replacement.
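For contrast with llms.txt, here is a small robots.txt in REP syntax; the hostname is a placeholder and the GPTBot rule is just an illustration of per-agent control:

```text
# Access control: which crawlers may fetch which paths
User-agent: GPTBot
Disallow: /drafts/

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

Note the difference in intent: robots.txt says what may be fetched, while llms.txt says what is worth reading.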
- Schema.org
- A vocabulary for marking up the meaning of individual pages with JSON-LD or microdata (Product, Article, FAQ, etc.). Per-page enrichment, where llms.txt is a site-wide map.
- sitemap.xml
- XML file listing every URL you want a search engine to know about, with metadata (lastmod, priority). Aimed at completeness; llms.txt is aimed at curation.
- Static site
- A site whose pages are pre-rendered to HTML/Markdown at build time and served as files. Astro, Eleventy, Hugo, and Jekyll are static site generators. llms.txt is a natural fit.
- TL;DR
- "Too long; didn’t read." A short summary at the top of a section. Useful as the blockquote in llms.txt.
- User-agent
- A string a client sends to identify itself. AI crawlers identify with names like GPTBot, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot — useful when filtering server logs.
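A quick way to act on those names is to scan access logs for them. A minimal Python sketch, assuming each crawler's name appears verbatim in the log line's user-agent field:

```python
# AI crawler names as they appear in user-agent strings.
AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "OAI-SearchBot")

def ai_crawler_hits(log_lines):
    """Yield (agent, line) for access-log lines that name a known AI crawler."""
    for line in log_lines:
        for agent in AI_AGENTS:
            if agent in line:
                yield agent, line
                break  # one match per line is enough
```

Counting hits per agent on /llms.txt is a cheap way to see which assistants are actually reading the file.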