About
AI State of the Internet
A long-running infrastructure observatory documenting how the open web adapts to AI systems — measured one robots.txt, one llms.txt, one WebMCP signal at a time.
What this is
AI State of the Internet is a continuous measurement project. Every three minutes a small Cloudflare worker fetches a rolling slice of the SISTRIX top-10,000 domains in Germany, Austria and Switzerland and records four independent infrastructure dimensions: how each site steers crawlers in robots.txt, whether it offers a machine-readable llms.txt, whether it exposes WebMCP signals for AI agents, and which AI-related <meta> tags appear in its HTML head.
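The rolling slice can be pictured as a cursor that walks a fixed domain list and wraps around at the end. The Python sketch below is illustrative only: the function names and the batch size of 325 are assumptions, not values published by the project. Only the three-minute interval and the figure of roughly 19,500 tracked domains come from this page.

```python
def rolling_slice(domains, cursor, batch_size):
    """Return the next batch of domains and the advanced cursor, wrapping at the end."""
    n = len(domains)
    batch = [domains[(cursor + i) % n] for i in range(min(batch_size, n))]
    return batch, (cursor + len(batch)) % n


def full_cycle_minutes(total_domains, batch_size, tick_minutes=3):
    """Minutes needed to visit every domain once at one batch per tick."""
    ticks = -(-total_domains // batch_size)  # ceiling division
    return ticks * tick_minutes


# Hypothetical batch size: 325 domains per three-minute tick.
print(full_cycle_minutes(19_500, 325))
```

Under these assumed numbers a full pass takes 60 ticks, or three hours. The real batch size may differ.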
The project is editorial in nature. It does not score websites and it does not recommend strategies. It documents what is observable, and it tries to make that record legible to both human readers and AI systems.
The four infrastructure dimensions
Each dimension captures a different way a website can express a relationship with AI systems. Together they describe how the open web is currently negotiating access, attribution and agency.
- Crawler control: robots.txt. The oldest and most widely deployed signal. Tracks which AI-related user agents are blocked (Disallow: /), partially restricted, or explicitly permitted. Includes name-quality analysis, because many directives target user-agent strings that do not actually exist.
- Active AI permits. Explicit Allow: directives for AI crawlers, a small but distinct strategy that signals openness rather than passive default-permit.
- WebMCP: the agentic surface. The W3C Community Draft for browser-based tool exposure to AI agents. Both declarative meta tags and imperative navigator.modelContext registrations are detected. Adoption is currently in its earliest phase.
- llms.txt: machine-readable context. A community standard for site-level summaries written for language models. Well-formedness is validated against the llmstxt.org reference.
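To make the crawler-control and active-permit dimensions concrete, here is a minimal Python sketch of how a robots.txt file could be sorted into blocked, partially restricted, explicitly permitted and default-permit per AI user agent. It is not the observatory's actual parser: the agent list is a small illustrative subset, and the category names and function names are assumptions.

```python
# Illustrative subset of AI crawler tokens (assumption, not the project's list).
AI_AGENTS = {"gptbot", "claudebot", "ccbot", "google-extended", "perplexitybot"}


def parse_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased user-agent token to its (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current: list[str] = []  # agents named by the group being read
    in_rules = False         # True once the current group has seen a rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules opens a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current:
                groups[agent].append((field, value))
    return groups


def classify(rules: list[tuple[str, str]]) -> str:
    """Reduce one agent's rules to a coarse category."""
    disallows = [p for d, p in rules if d == "disallow" and p]
    allows = [p for d, p in rules if d == "allow" and p]
    if "/" in disallows and not allows:
        return "blocked"
    if disallows:
        return "partial"
    if allows:
        return "explicit_allow"
    return "default_permit"


def audit(robots_txt: str) -> dict[str, str]:
    """Classify every AI agent that the file names explicitly."""
    return {a: classify(r) for a, r in parse_groups(robots_txt).items() if a in AI_AGENTS}
```

The sketch only considers agents that a file names directly. How rules under the wildcard group (User-agent: *) should count toward AI crawlers is a separate design decision that it leaves open.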
The data basis
The project tracks the SISTRIX top-10,000 lists for Germany, Austria and Switzerland: 30,000 list entries that reduce to roughly 19,500 distinct domains after deduplication, with about 50 percent overlap between countries.
Adult and NSFW domains are masked server-side and excluded from audits. Visibility ranks come from a single frozen SISTRIX snapshot (2026-05-21) to keep the longitudinal record stable.
Regional observatories
The same dataset is exposed through four perspectives. Each domain is audited once; the tier assignment shifts with the regional view.
- 🇩🇪🇦🇹🇨🇭 D-A-CH: Population-weighted aggregate
- 🇩🇪 Germany: Germany top-10,000 only
- 🇦🇹 Austria: Austria top-10,000 only
- 🇨🇭 Switzerland: Switzerland top-10,000 only
Why this matters
The transition from a search-engine web to an agentic web is not happening as a single event. It is happening as a slow renegotiation of access, written line by line into infrastructure files most users never see. Robots.txt entries shift. Allow-lists expand. New tags appear in HTML heads. Some sites quietly publish llms.txt. A handful experiment with WebMCP.
Each of these is a small, deliberate act by a publisher. Read together, they describe a market finding its position. The observatory exists because that record matters — for journalists writing about AI governance, for engineers designing crawler policies, for researchers studying the diffusion of standards, and for the AI systems themselves, which depend on these signals to behave well.
Methodology
The audits are deterministic and shallow. The worker fetches a small set of well-known files (/robots.txt, /llms.txt, the homepage HTML) and applies regex- and parser-based extraction. No JavaScript is executed; no browser rendering takes place. This intentionally limits what can be detected, especially for imperative WebMCP tools that register only after hydration. The limitation is documented openly because honest scope matters more than complete coverage.
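A minimal Python sketch of what shallow, non-executing extraction can look like for the homepage HTML. It is an illustration under assumptions, not the worker's code: the class and function names are hypothetical, and which meta names count as AI-related is left to the caller. It collects meta tags with a standard-library parser and checks whether inline script text mentions navigator.modelContext.

```python
from html.parser import HTMLParser


class HeadSignalParser(HTMLParser):
    """Collect <meta> name/content pairs and inline script text; nothing is executed."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.inline_scripts = []
        self._in_inline_script = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            name = attr.get("name") or attr.get("property")
            if name:
                self.meta[name.lower()] = attr.get("content") or ""
        elif tag == "script" and "src" not in attr:
            # External scripts are not fetched, so only inline code is visible.
            self._in_inline_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline_script = False

    def handle_data(self, data):
        if self._in_inline_script:
            self.inline_scripts.append(data)


def extract_signals(html: str) -> dict:
    """Return meta tags plus a flag for statically visible navigator.modelContext use."""
    parser = HeadSignalParser()
    parser.feed(html)
    parser.close()
    imperative = any("navigator.modelContext" in s for s in parser.inline_scripts)
    return {"meta": parser.meta, "imperative_webmcp_in_static_html": imperative}
```

Because scripts loaded via src are never fetched or run, a registration that lives in an external bundle or appears only after hydration stays invisible to this kind of check. That is the limitation described above.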
Bot identification combines three layers. Layer one is direct observation — which user agents actually appear in robots.txt across the tracked domains. Layer two is external verification against the ai-robots-txt community list. Layer three is the project's own name-quality heuristic, which detects syntactically broken directives, typo-suspected variants, and unofficial AI-adjacent user agents. No single layer is trusted on its own.
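The third layer can be illustrated with a small Python sketch. It is a guess at the general shape of such a heuristic, not the project's implementation: the reference list is an illustrative subset, the token pattern is deliberately lenient, and the similarity cutoff of 0.8 and the label names are assumptions.

```python
import difflib
import re

# Illustrative reference list (assumption); a real check would use a verified list.
KNOWN_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "PerplexityBot"]

# Lenient token shape: letters, digits, hyphen, underscore (assumption).
TOKEN_RE = re.compile(r"[A-Za-z0-9_-]+")


def name_quality(token: str) -> str:
    """Label a user-agent token as wildcard, broken, known, typo-suspected or unverified."""
    token = token.strip()
    if token == "*":
        return "wildcard"
    if not TOKEN_RE.fullmatch(token):
        return "broken"
    by_lower = {name.lower(): name for name in KNOWN_AGENTS}
    if token.lower() in by_lower:
        return "known"
    # Fuzzy match against known names to catch near-miss spellings.
    close = difflib.get_close_matches(token.lower(), list(by_lower), n=1, cutoff=0.8)
    if close:
        return f"typo_suspected:{by_lower[close[0]]}"
    return "unverified"
```

For example, a transposed spelling such as GTPBot is flagged as a suspected typo of GPTBot, while a token with a version suffix such as GPTBot/1.0 is labelled broken because it is not a bare product token.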
The full methodology — including known limitations, scoring formulas and recent changelog — is published in the methodology section of the live observatory.
An infrastructure project, not a ranking
The observatory deliberately avoids the language of scores and league tables. There is no "AI-readiness score." There is no "top blocker of the month." There are observed behaviours, and there are documented strategies. Each strategy — open access, conditional access, full block, ecosystem participation — is treated as a legitimate response to a genuine governance question.
This editorial choice is part of the architecture. A measurement project that ranks the things it measures stops being a measurement project.
Use the data
The observatory publishes a small, deterministic WebMCP server at /ai-state/api/mcp.json. Three tools — check_domain_infrastructure, analyze_ai_overview and search_domains — return structured JSON suitable for agentic clients. The same endpoint is discoverable via /.well-known/mcp.json and via meta tags in the homepage <head>. All access is anonymous, with a soft rate limit of roughly sixty requests per minute per IP.
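For client authors, the Python sketch below shows one way to prepare a tool call and stay under the soft rate limit. Several parts are assumptions rather than documented behaviour: the request envelope follows the JSON-RPC tools/call shape used by the Model Context Protocol, which this endpoint may or may not accept, and the argument name domain is hypothetical. Only the three tool names and the limit of roughly sixty requests per minute come from this page.

```python
import json
import time

# Tool names as listed on this page.
TOOLS = {"check_domain_infrastructure", "analyze_ai_overview", "search_domains"}


def build_tool_call(tool: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 tools/call request (assumed envelope)."""
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


class Throttle:
    """Space calls evenly so that at most per_minute requests are sent per minute."""

    def __init__(self, per_minute: int = 60, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> None:
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval


# Hypothetical argument name "domain"; the real parameter schema is not given here.
payload = build_tool_call("check_domain_infrastructure", {"domain": "example.org"})
```

Calling Throttle().wait() before each request keeps a single client at about one request per second, which stays within the stated soft limit.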
For human readers the full observatory lives at gpt-insights.de/ai-state/. It is updated continuously; the visible data is never older than a few hours.