ai-search-visibility geo aeo llm-seo ai-seo technical-seo

Which technical signals matter for AI search (and which don't)

An SE Ranking study of 300,000 domains found no measurable correlation between having an llms.txt file and getting cited by AI engines. None. Their machine learning model actually became more accurate when the llms.txt variable was removed. Yet llms.txt is one of the first things brands implement when they start thinking about AI search visibility.

The technical layer of GEO (Generative Engine Optimization), also known as AEO (Answer Engine Optimization), LLM SEO, or AI SEO, matters, but most of the attention goes to the wrong signals. This post covers what actually affects whether ChatGPT, Google AI, and Claude can find, read, and cite your content.

llms.txt: what it is and why it probably doesn't matter yet

llms.txt is a plain-text file you place in your site's root directory (like robots.txt) that provides AI systems with a curated map of your most important pages. Proposed in 2024 by Jeremy Howard of Answer.AI, it's formatted in Markdown and tells AI crawlers which pages to prioritize, what your site is about, and where to find your best content.
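Per Howard's proposal, the file is Markdown: an H1 with the site name, an optional blockquote summary, then H2 sections listing key URLs with short descriptions. A minimal sketch for a hypothetical site (all names and URLs below are placeholders):

```markdown
# Acme Analytics

> Acme Analytics is a product-usage analytics platform. The most
> important docs and guides for AI systems are listed below.

## Docs

- [Quickstart](https://acme.example/docs/quickstart): install the SDK and send a first event
- [API reference](https://acme.example/docs/api): REST endpoints and authentication

## Optional

- [Blog](https://acme.example/blog): product updates and tutorials
```

The "Optional" section is part of the proposed spec: it marks URLs an AI system can skip when context is limited.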

The adoption numbers tell an interesting story. Only 7.4% of Fortune 500 companies have implemented llms.txt as of March 2026. Across a broader dataset, SE Ranking found just 10.13% of 300,000 domains had the file in place. Adoption is concentrated in developer tools, AI companies, and SaaS documentation. Stripe, Cloudflare, and Anthropic use it. Most consumer brands don't.

Here's the problem: no major AI platform has confirmed they read it. Google has explicitly compared llms.txt to the discredited keywords meta tag. An independent audit from mid-2025 monitored an llms.txt page for three months and recorded zero visits from GPTBot, ClaudeBot, PerplexityBot, or Google-Extended. The bots never came.

That doesn't mean llms.txt is permanently irrelevant. If AI platforms adopt it as a standard, early implementers will benefit. But right now, implementing llms.txt while ignoring schema markup and crawler access is like hanging a "welcome" sign on a house with no front door.

robots.txt: the strategic decision most brands get backward

robots.txt is where the real action is, and most of the conversation around it is wrong.

BuzzStream's analysis of the top 100 US and UK news sites found that 79% block at least one AI training bot and 71% block at least one retrieval bot. The specific blocking rates: CCBot at 75%, Anthropic's ClaudeBot at 69%, PerplexityBot at 67%, GPTBot at 62%. Major global publishers including the BBC, New York Times, and Guardian block aggressively. Press Gazette reported this represents a 300% increase from early 2023, when fewer than 30% had restrictions.

The blocking trend is accelerating. Cloudflare's data shows websites block GPTBot and ClaudeBot at nearly seven times the rate they block Googlebot. About 5.8 million sites now block ClaudeBot, up from 3.2 million in early July 2025.

But here's what makes this a strategic opportunity for brands rather than publishers. BuzzStream's follow-up study analyzed 4 million citations across 3,600 prompts in ChatGPT, Gemini, AI Overviews, and AI Mode. The finding: roughly 75% of sites blocking OpenAI or Google AI bots still appeared in AI citations. About 70% of ChatGPT citations came from sites that block ChatGPT's retrieval bots. Blocking doesn't prevent citation because the content was already ingested during model training.

This creates an asymmetry. Publishers block AI crawlers because their business model depends on traffic, and Cloudflare measured OpenAI's crawl-to-referral ratio at 1,700:1 in June 2025. That means OpenAI crawled 1,700 pages for every one visitor it sent back. Anthropic's ratio was 73,000:1. For publishers monetizing pageviews, that math is terrible.

For brands selling products or services, the math flips. You want AI engines to find, read, and cite your content. If a direct competitor in your space blocks ClaudeBot but you allow it, your pages face less competition for citation slots in Claude's responses for the queries you both target. In CiteGap audits, we track which competitors allow or block each crawler because it directly affects the competitive landscape per engine.
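Checking how a given site's robots.txt treats each crawler doesn't require manual reading: Python's standard-library `urllib.robotparser` evaluates the rules directly. A small sketch, using a hypothetical publisher-style robots.txt that blocks AI crawlers but leaves Googlebot open:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks two AI crawlers, allows everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask the parser whether each crawler may fetch a given URL.
for bot in ("GPTBot", "ClaudeBot", "Googlebot"):
    print(bot, parser.can_fetch(bot, "https://acme.example/guide"))
```

In practice you would fetch each competitor's live `/robots.txt` and run the same check per engine to map who has left which retrieval pool open.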

Training bots vs. retrieval bots

Not all AI crawlers serve the same purpose. Training bots (GPTBot, Google-Extended, CCBot) collect data to improve the underlying model. Retrieval bots (ChatGPT-User, PerplexityBot in search mode) fetch pages in real time to ground a specific answer.

Blocking training bots means your content won't inform future model updates, but existing citations persist. Blocking retrieval bots means AI engines can't pull your pages for real-time answers, which directly reduces new citations. The distinction matters, with one caveat: robots.txt is advisory, and 13.26% of AI bot requests ignored its directives entirely in Q2 2025, up from 3.3% in Q4 2024. Compliance is not guaranteed.
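Translated into directives, a brand that wants new citations but not training ingestion could block the training crawlers named above while leaving retrieval agents open. A sketch only: these are the user-agent tokens each vendor has documented, but verify the current tokens before deploying, since vendors add and rename crawlers:

```
# Block training crawlers (model-building, no direct citation value)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow retrieval bots that fetch pages to ground live answers
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Everyone else: unrestricted
User-agent: *
Disallow:
```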

The technical signals that actually move citation rates

While llms.txt gets conference talks and robots.txt gets headlines, three less-discussed technical signals have documented impact on whether AI engines cite your content.

Schema markup (JSON-LD)

This is the most impactful technical signal. Google, Microsoft, and OpenAI have all confirmed they use schema markup for their generative AI features. A ProGEO study of Fortune 500 companies found 53.8% have JSON-LD implemented, compared to just 7.4% with llms.txt. The adoption gap tracks the impact gap.

Pages with structured data are 36% more likely to appear in AI-generated answers. But partial implementation backfires: a 2026 study of 730 AI citations found that generic, partially-filled schema produced an 18-percentage-point citation penalty compared to having no schema at all. AI engines can tell the difference between real structured data and template-generated placeholders.

The schema types that matter most for AI citation: Article (for blog posts and editorial content), FAQPage (for Q&A sections), Organization and Person (for entity disambiguation), and Product (for ecommerce). Pages with FAQ sections and schema are 2.8x more likely to be cited.
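As an illustration, here is a minimal FAQPage block in JSON-LD. The question and answer are placeholders; the point is that every field carries a real, specific value, since partially-filled schema tests worse than no schema at all:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "Does the free plan include API access?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "Yes. The free plan includes 1,000 API calls per month with no credit card required."
    }
  }]
}
</script>
```

Validate the output against schema.org definitions before shipping; a template that emits empty `"author"` or `"text"` fields is exactly the partial implementation the citation-penalty study flagged.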

Server-side rendering

AI crawlers have limited time budgets per page. Unlike Googlebot, which will wait for JavaScript to render, GPTBot and ClaudeBot often read the raw HTML response. If your content lives behind JavaScript frameworks (React, Vue, Angular) and requires client-side rendering to display, AI crawlers may see an empty page.

Server-side rendering (SSR) or pre-rendering delivers the full HTML on first response. This is not optional for AI visibility. A page that looks complete to a human visitor but returns a skeleton to a crawler is invisible to every AI engine that doesn't execute JavaScript, which is most of them for retrieval purposes.

We audited a mid-size edtech company whose product documentation was built entirely in a single-page React application. Google ranked the pages fine (Googlebot executes JS). But when we tested the same URLs against ChatGPT and Claude, neither engine could see the content. The pages returned empty div containers. Zero citations from two out of three engines in CiteGap's trio, and the fix was an SSR configuration change, not a content rewrite.
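You can approximate that test yourself by comparing what a non-JS crawler receives against what a browser renders. The sketch below (standard library only, with made-up HTML standing in for real responses) extracts the visible text from raw HTML; a client-rendered SPA shell yields nothing, while an SSR page yields the actual content:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data.strip())

def visible_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(t for t in parser.parts if t)

# What a non-JS crawler sees from a client-rendered SPA: an empty shell.
spa_html = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
# The same page with server-side rendering: content in the first response.
ssr_html = '<html><body><div id="root"><h1>Setup guide</h1><p>Install the CLI first.</p></div></body></html>'

print(repr(visible_text(spa_html)))  # '' : nothing for an AI crawler to cite
print(visible_text(ssr_html))        # the actual page content
```

Fetching your own URLs with a plain HTTP client (no browser) and running them through a check like this is a rough but fast way to spot pages that return empty containers to crawlers.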

Page speed and crawl reliability

AI engines pulling from billions of pages will skip slow or unreliable sources in favor of faster alternatives. Unlike traditional SEO where slow pages get demoted in rank, AI retrieval can simply exclude slow pages from the candidate set entirely.

The threshold is lower than for traditional search. A page that loads in 3 seconds is fine for Google. An AI retrieval system processing thousands of candidate URLs per query may abandon anything that doesn't respond within 1-2 seconds. Clean HTML, minimal render-blocking resources, and reliable uptime matter more in AI retrieval than in traditional crawling.

What matters and what doesn't (a summary)

| Signal | Current impact on AI citations | Priority |
|---|---|---|
| Schema markup (JSON-LD) | 36% more likely to be cited; confirmed by Google, Microsoft, OpenAI | High |
| robots.txt crawler access | Direct control over retrieval bot access; strategic differentiation | High |
| Server-side rendering | Required for non-Google AI engines to read JS-heavy pages | High |
| Page speed and reliability | Affects inclusion in retrieval candidate set | Medium |
| Content negotiation headers | Emerging signal; helps AI crawlers identify content type | Low |
| llms.txt | Zero measured impact on citations as of March 2026 | Low (monitor) |

The technical signals operate as qualifiers, not differentiators. They get your content into the candidate pool. What earns the actual citation is content quality: answer-first formatting, fact density, and structured comparisons. A site with perfect technical signals but marketing-first copy will still lose citations to aggregators with better content structure.

But a site with strong content and broken technical signals will also lose. The edtech company with the React SPA had excellent documentation. None of it mattered until the rendering issue was fixed. Technical signals are the floor, not the ceiling.

FAQ

Does llms.txt improve AI search visibility? Not currently. An SE Ranking study of 300,000 domains found no correlation between llms.txt adoption and AI citation frequency. No major AI platform has confirmed reading the file. Only 7.4% of Fortune 500 companies have implemented it. It may gain relevance if platforms adopt it as a standard, but it is not a priority today.

Should I block AI crawlers in robots.txt? It depends on your business model. Publishers block because AI crawlers consume content without sending traffic back (OpenAI's crawl-to-referral ratio is 1,700:1). Brands benefit from allowing crawlers because AI engines need access to cite your pages. With 79% of top news publishers blocking, brands that allow crawling face less competition in the retrieval pool.

Does blocking AI bots prevent my site from being cited? Not entirely. BuzzStream found that 75% of sites blocking AI bots still appeared in citations, likely because content was already ingested during model training. However, blocking retrieval bots does reduce new real-time citations over time, especially for freshly updated content that the model hasn't seen before.

What is the most impactful technical signal for AI citations? Schema markup (JSON-LD), specifically Article, FAQPage, and Organization types. Pages with structured data are 36% more likely to appear in AI answers. But generic or partially-filled schema actually hurts, producing an 18-percentage-point citation penalty versus no schema at all. Quality of implementation matters as much as presence.

Do AI crawlers execute JavaScript? Most do not for retrieval purposes. Googlebot executes JavaScript, but GPTBot and ClaudeBot typically read raw HTML. If your content requires client-side rendering to display, AI engines outside of Google may see an empty page. Server-side rendering is required for cross-engine visibility.


Not sure whether your technical signals are helping or hurting your AI visibility? CiteGap audits the technical layer across ChatGPT, Google AI, and Claude, including crawler access, schema validation, and rendering checks. Request a consultation.

Want to know if AI engines cite your brand?

CiteGap audits your visibility across ChatGPT, Google AI, and Claude.

Request a Consultation