High Priority
Deploy AI Agent Sitemap (/ai.txt)
Establish a machine-readable index of your core intellectual property, datasets, and API documentation specifically for advanced AI agents and LLM crawlers.
Create a /ai.txt file at your root domain, providing a concise introduction to your AI startup's primary value proposition and data focus.
Include markdown-style links to critical resources: core model documentation, public datasets, research papers, and key API endpoints.
Add a 'Knowledge Base' section in /ai.txt to directly address common queries regarding your data's origin, licensing, and intended use by AI models.
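There is no formal standard for /ai.txt yet, so the layout below is a minimal sketch following the steps above; the company name, paths, and license are hypothetical placeholders:

```text
# /ai.txt — machine-readable index for AI agents and LLM crawlers
# Acme AI: foundation models and curated training datasets for robotics.

## Core Resources
- [Model documentation](/models/research/)
- [Public datasets](/datasets/public/)
- [Research papers](/research/papers/)
- [API reference](/docs/api/)

## Knowledge Base
- Data origin: all public datasets are collected under documented consent.
- Licensing: public datasets are released under CC BY 4.0 unless noted.
- Intended use: research and evaluation; commercial training requires a license.
```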


High Priority
Model-Specific Crawl Directives (e.g., ClaudeBot, Gemini)
Fine-tune which proprietary datasets, model weights (if publicly exposed for research), and inference endpoints are accessible to specific LLM crawlers.
Implement targeted `robots.txt` rules per agent, e.g. under `User-agent: ClaudeBot`: `Allow: /models/research/`, `Allow: /datasets/public/`, `Disallow: /internal-inference/`.
Utilize `X-Robots-Tag` HTTP headers for dynamic content or API responses to control crawler access granularly.
Validate crawler permissions and access patterns using Google Search Console's URL inspection tool for Googlebot (as a proxy) and by monitoring server logs for the user-agent strings of specific AI agents.
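The per-agent rules above can be written out as a complete `robots.txt`. The paths mirror the example in the steps; the `GPTBot` stanza is an illustrative addition showing how to deny one agent entirely while leaving conventional crawlers unrestricted:

```text
# robots.txt — per-agent access control for LLM crawlers
User-agent: ClaudeBot
Allow: /models/research/
Allow: /datasets/public/
Disallow: /internal-inference/

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
```

For dynamic content or API responses that `robots.txt` cannot cover, the server can send an `X-Robots-Tag: noindex, nofollow` HTTP response header on the relevant routes.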
Medium Priority
Semantic Markup for AI Ingestion
Leverage structured data and semantic HTML5 elements to ensure AI agents accurately interpret the hierarchy and meaning of your AI research, product features, and technical documentation.
Wrap core research findings and model descriptions within `<article>` tags to signify discrete, important content units.
Employ `<section>` tags with descriptive `aria-label` attributes for distinct AI product modules, data pipelines, or algorithmic components.
Ensure all tables detailing model performance metrics, dataset statistics, or API rate limits use proper `<thead>`, `<tbody>`, and `<th>` tags for structured data extraction.
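Putting the three steps together, a documentation page might be marked up as follows; the model name, module label, and metric values are hypothetical:

```html
<!-- Discrete content unit: one model description per <article> -->
<article>
  <h2>Example-LM 7B</h2>
  <p>A 7B-parameter research model trained on public web text.</p>

  <!-- Descriptive aria-label identifies the module for assistive tech and AI agents -->
  <section aria-label="Evaluation results for Example-LM 7B">
    <table>
      <thead>
        <tr><th>Benchmark</th><th>Metric</th><th>Score</th></tr>
      </thead>
      <tbody>
        <tr><td>MMLU</td><td>accuracy</td><td>0.61</td></tr>
        <tr><td>GSM8K</td><td>accuracy</td><td>0.42</td></tr>
      </tbody>
    </table>
  </section>
</article>
```

The `<th>` header cells give extraction pipelines an explicit column schema, so metric names and scores stay paired even when the table is flattened to text.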
High Priority
RAG-Optimized Knowledge Chunks
Structure your technical documentation, whitepapers, and case studies to be optimally 'chunked' and retrieved by Retrieval-Augmented Generation (RAG) pipelines.
Segment related concepts, model architectures, or dataset descriptions into distinct, self-contained blocks, ideally under 750 tokens each.
Explicitly state the primary subject or model name at the beginning of each chunk and in any summary statements to prevent context drift.
Eliminate ambiguous pronoun references (e.g., 'it', 'this') and replace them with specific entity names like 'LLaMA 2', 'BERT model', or 'customer dataset'.
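The segmentation step above can be sketched in a few lines. This is a minimal illustration, not a production chunker: it assumes paragraphs as the unit of splitting, uses a rough whitespace-word count (~1.3 tokens per word) as a token proxy, and prefixes each chunk with a caller-supplied subject label to prevent context drift:

```python
def chunk_document(paragraphs, subject, max_tokens=750):
    """Group paragraphs into self-contained chunks under max_tokens,
    prefixing each chunk with its primary subject name."""

    def approx_tokens(text):
        # Rough proxy: ~1.3 tokens per whitespace-separated word.
        return int(len(text.split()) * 1.3)

    header = f"Subject: {subject}"
    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        cost = approx_tokens(para)
        # Flush the current chunk before it would exceed the budget.
        if current and current_tokens + cost > max_tokens:
            chunks.append("\n\n".join([header] + current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += cost
    if current:
        chunks.append("\n\n".join([header] + current))
    return chunks
```

Because every chunk restates the subject in its first line, a RAG retriever that surfaces any single chunk still knows which model or dataset the text refers to.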