GPT Insights AI State of the Internet

AI State of the Internet

A long-running infrastructure observatory documenting how the open web adapts to AI systems — measured one robots.txt, one llms.txt, one WebMCP signal at a time.

What this is

AI State of the Internet is a continuous measurement project. Every three minutes a small Cloudflare Worker fetches a rolling slice of the SISTRIX top-10,000 domains in Germany, Austria and Switzerland and records four independent infrastructure dimensions: how each site steers crawlers in robots.txt, whether it offers a machine-readable llms.txt, whether it exposes WebMCP signals for AI agents, and which AI-related <meta> tags appear in its HTML head.

The project is editorial in nature. It does not score websites and it does not recommend strategies. It documents what is observable, and it tries to make that record legible to both human readers and AI systems.

The four infrastructure dimensions

Each dimension captures a different way a website can express a relationship with AI systems. Together they describe how the open web is currently negotiating access, attribution and agency.

The data basis

The project tracks the SISTRIX top-10,000 lists for Germany, Austria and Switzerland — roughly 19,500 distinct domains after deduplication, with about 50 percent overlap between countries.

~19,500 distinct domains tracked across DACH
~20 h rolling audit cycle per full pass
4 country editions (DACH, DE, AT, CH)
12 analysis modules in the observatory
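
The three-minute interval and the roughly 20-hour cycle together imply an approximate slice size per worker run. The sketch below works through that arithmetic. It assumes evenly spaced runs and exactly one visit per domain per pass; the slice size itself is not published and is derived here purely for illustration, and the function names are hypothetical.

```python
import math

DISTINCT_DOMAINS = 19_500  # approximate figure from the text
RUN_INTERVAL_MIN = 3       # one worker run every three minutes
FULL_PASS_HOURS = 20       # approximate length of one full pass


def runs_per_pass(pass_hours: int, interval_min: int) -> int:
    """Number of worker runs that fit into one full pass."""
    return (pass_hours * 60) // interval_min


def domains_per_run(domains: int, runs: int) -> int:
    """Smallest slice size that covers every domain within the given runs."""
    return math.ceil(domains / runs)


runs = runs_per_pass(FULL_PASS_HOURS, RUN_INTERVAL_MIN)
slice_size = domains_per_run(DISTINCT_DOMAINS, runs)
print(runs, slice_size)  # prints: 400 49
```

Under these assumptions a pass consists of 400 runs of roughly 49 domains each. The real slice size may differ if runs are skipped or retried.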

Adult and NSFW domains are masked server-side and excluded from audits. Visibility ranks come from a single frozen SISTRIX snapshot (2026-05-21) to keep the longitudinal record stable.

Regional observatories

The same dataset is exposed through four perspectives. Each domain is audited once; the tier assignment shifts with the regional view.

Why this matters

The transition from a search-engine web to an agentic web is not happening as a single event. It is happening as a slow renegotiation of access, written line by line into infrastructure files most users never see. Robots.txt entries shift. Allow-lists expand. New tags appear in HTML heads. Some sites quietly publish llms.txt. A handful experiment with WebMCP.

Each of these is a small, deliberate act by a publisher. Read together, they describe a market finding its position. The observatory exists because that record matters — for journalists writing about AI governance, for engineers designing crawler policies, for researchers studying the diffusion of standards, and for the AI systems themselves, which depend on these signals to behave well.

Methodology

The audits are deterministic and shallow. The worker fetches a small set of well-known files (/robots.txt, /llms.txt, the homepage HTML) and applies regex- and parser-based extraction. No JavaScript is executed; no browser rendering takes place. This intentionally limits what can be detected, especially for imperative WebMCP tools that register only after hydration. The limitation is documented openly because honest scope matters more than complete coverage.
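
To make "deterministic and shallow" concrete, here is a minimal sketch of parser-based extraction from the text of a robots.txt file. It is not the observatory's actual code: the function names, the simplified grouping rule and the sample file are assumptions for illustration, and the sketch works on an already-fetched string rather than making network requests.

```python
def parse_robots(text: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lower-cased user agent to its (directive, path) rules.

    Simplified grouping: consecutive User-agent lines share the rules
    that follow them; a User-agent line after a rule starts a new group.
    Fields other than User-agent, Allow and Disallow are ignored.
    """
    rules: dict[str, list[tuple[str, str]]] = {}
    current_agents: list[str] = []
    last_was_agent = False

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if not last_was_agent:
                current_agents = []  # a new group begins
            agent = value.lower()
            current_agents.append(agent)
            rules.setdefault(agent, [])
            last_was_agent = True
        elif field in ("allow", "disallow"):
            for agent in current_agents:
                rules[agent].append((field, value))
            last_was_agent = False

    return rules


def is_fully_blocked(rules: dict[str, list[tuple[str, str]]], agent: str) -> bool:
    """True if the agent's own group disallows "/" and contains no Allow rule."""
    group = rules.get(agent.lower(), [])
    has_allow = any(directive == "allow" for directive, _ in group)
    return ("disallow", "/") in group and not has_allow


SAMPLE = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

parsed = parse_robots(SAMPLE)
print(is_fully_blocked(parsed, "GPTBot"))  # prints: True
print(is_fully_blocked(parsed, "*"))       # prints: False
```

The check only inspects an agent's own group. A crawler that is not named in the file would fall back to the `*` group under the Robots Exclusion Protocol, which this sketch deliberately leaves out.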

Bot identification combines three layers. Layer one is direct observation — which user agents actually appear in robots.txt across the tracked domains. Layer two is external verification against the ai-robots-txt community list. Layer three is the project's own name-quality heuristic, which detects syntactically broken directives, typo-suspected variants, and unofficial AI-adjacent user agents. No single layer is trusted on its own.
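
The third layer can be pictured with a small sketch. The project's actual heuristic is not published in this text, so everything below is an assumption: the label names, the similarity cutoff, the permissive token pattern, and the short hard-coded list that stands in for the ai-robots-txt community list.

```python
import re
from difflib import get_close_matches

# Stand-in for an externally maintained list of AI user agents (assumption).
KNOWN_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "PerplexityBot"]

# Permissive token pattern (assumption): letters, digits, "_", "." and "-".
_TOKEN_RE = re.compile(r"^[A-Za-z0-9_.\-]+$")


def classify_agent(name: str, cutoff: float = 0.8) -> str:
    """Label a user-agent token found in a robots.txt file.

    Returns one of: "wildcard", "known", "broken", "typo_suspected", "unlisted".
    """
    token = name.strip()
    known_lower = [agent.lower() for agent in KNOWN_AGENTS]

    if token == "*":
        return "wildcard"
    if token.lower() in known_lower:
        return "known"
    if not _TOKEN_RE.match(token):
        return "broken"  # empty, or contains spaces or other unexpected characters
    if get_close_matches(token.lower(), known_lower, n=1, cutoff=cutoff):
        return "typo_suspected"  # close to a known name, but not identical
    return "unlisted"


for candidate in ["gptbot", "ClaudBot", "GPT Bot", "Bingbot"]:
    print(candidate, "->", classify_agent(candidate))
```

A string-similarity cutoff is only one possible design. It flags near misses such as a dropped letter, but it cannot tell a typo from a deliberately similar name, which is one reason a single layer would not be enough on its own.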

The full methodology, including known limitations, scoring formulas and a recent changelog, is published in the methodology section of the live observatory.

An infrastructure project, not a ranking

The observatory deliberately avoids the language of scores and league tables. There is no "AI-readiness score." There is no "top blocker of the month." There are observed behaviours, and there are documented strategies. Each strategy — open access, conditional access, full block, ecosystem participation — is treated as a legitimate response to a genuine governance question.

This editorial choice is part of the architecture. A measurement project that ranks the things it measures stops being a measurement project.

Use the data

The observatory publishes a small, deterministic WebMCP server at /ai-state/api/mcp.json. Three tools — check_domain_infrastructure, analyze_ai_overview and search_domains — return structured JSON suitable for agentic clients. The same endpoint is discoverable via /.well-known/mcp.json and via meta tags in the homepage <head>. All access is anonymous, with a soft rate limit of roughly sixty requests per minute per IP.
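
A client that wants to stay under the soft limit can pace its own calls. The following is a minimal client-side sketch of a sliding-window throttle, assuming the limit behaves as sixty requests per rolling sixty seconds; the class name and parameters are hypothetical, and how the server actually counts requests is not specified here. The request format of the tools is not shown either, so the sketch only covers the pacing.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowThrottle:
    """Block until another call fits into the rolling time window."""

    def __init__(
        self,
        max_calls: int = 60,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._sleep = sleep
        self._calls: deque[float] = deque()  # timestamps of recent calls

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if none)."""
        now = self._clock()
        # Forget calls that have left the rolling window.
        while self._calls and now - self._calls[0] >= self._period:
            self._calls.popleft()
        if len(self._calls) < self._max_calls:
            return 0.0
        return self._period - (now - self._calls[0])

    def acquire(self) -> None:
        """Wait if necessary, then record one call."""
        while (delay := self.wait_time()) > 0:
            self._sleep(delay)
        self._calls.append(self._clock())


throttle = SlidingWindowThrottle(max_calls=60, period=60.0)
throttle.acquire()  # call this before each request to the endpoint
```

The clock and sleep functions are injectable so the pacing logic can be exercised without real waiting.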

For human readers the full observatory lives at gpt-insights.de/ai-state/. It is updated continuously; the visible data is never older than a few hours.