Large language models are remarkably good at understanding text. But the web is visual. A page can say one thing in its HTML and look completely different once CSS, JavaScript, and layout are applied. If your AI agent can only read raw HTML, it’s working with an incomplete picture.
Screenshot APIs bridge this gap. They give LLMs the ability to “see” web pages as a human would — rendered, styled, and interactive. Combined with the vision capabilities now standard in models like GPT-4o, Claude, and Gemini, a screenshot API turns your text-only agent into one that can visually inspect, verify, and interact with the web.
This post covers the practical patterns for integrating screenshot APIs with AI agents, with code examples using len.sh.
Why AI Agents Need Screenshots
1. Visual Verification
An AI agent that manages deployments can verify that a website looks correct after a deploy — not just that it returns a 200 status code. A green uptime check doesn’t tell you if the hero section is broken, a CSS file failed to load, or the layout is completely wrong.
// After deploying, capture a screenshot and verify visually
const screenshot = await fetch(
  `https://api.len.sh/v1/screenshot?url=${encodeURIComponent(deployedUrl)}&width=1280&height=720&format=png&access_key=${apiKey}`
);
const imageBuffer = Buffer.from(await screenshot.arrayBuffer());
const base64Image = imageBuffer.toString("base64");

// Send to a vision LLM for analysis
const analysis = await llm.chat([
  {
    role: "user",
    content: [
      { type: "text", text: "Does this website look correctly deployed? Check for broken layouts, missing images, error messages, or any visual issues." },
      { type: "image", source: { type: "base64", media_type: "image/png", data: base64Image } },
    ],
  },
]);
2. Web Research
AI agents that research topics online can see the actual rendered page, not just the raw HTML. This matters for pages with dynamic content, charts, infographics, or visual layouts that convey information not present in the text.
3. Form and UI Interaction
Agents that fill forms or navigate multi-step workflows need to see the current state of the page. A screenshot shows the agent what a human user would see — including dynamic UI state, error messages, loading spinners, and modal dialogs.
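One way to structure this is a see-think-act loop: screenshot the page, show it to the model, ask for the next action, execute, and repeat. The sketch below builds one turn of that loop as a vision-LLM message; the function name, prompt wording, and `DONE` convention are illustrative, not part of any specific framework.

```python
def next_action_message(goal: str, history: list, screenshot_b64: str) -> list:
    """Build one turn of a see-think-act loop: show the model the current
    page state and ask for the single next action toward the goal."""
    prompt = (
        f"Goal: {goal}\n"
        f"Actions taken so far: {', '.join(history) or 'none'}\n"
        "Look at the screenshot of the current page state and reply with the "
        "single next action to take (e.g. which field to fill or which button "
        "to click), or DONE if the goal is complete."
    )
    return [
        {"type": "text", "text": prompt},
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": screenshot_b64,
            },
        },
    ]
```

The agent executes whatever action the model returns, captures a fresh screenshot, and calls this again with the updated history.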
4. Content Understanding
Some web content is inherently visual — data visualizations, design portfolios, product galleries. An agent analyzing these pages needs screenshots to understand what the page actually communicates.
Basic Integration Pattern
The fundamental pattern is simple: screenshot a URL, convert to base64, send to a vision LLM.
JavaScript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

async function analyzeWebpage(url) {
  // Step 1: Capture screenshot
  const params = new URLSearchParams({
    url,
    width: "1280",
    height: "720",
    format: "png",
    access_key: process.env.LENSH_API_KEY,
  });
  const screenshotResponse = await fetch(
    `https://api.len.sh/v1/screenshot?${params}`
  );
  const imageBuffer = Buffer.from(await screenshotResponse.arrayBuffer());
  const base64 = imageBuffer.toString("base64");

  // Step 2: Send to vision LLM
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: `Analyze this screenshot of ${url}. Describe the page layout, main content, and any notable visual elements.`,
          },
          {
            type: "image",
            source: {
              type: "base64",
              media_type: "image/png",
              data: base64,
            },
          },
        ],
      },
    ],
  });
  return response.content[0].text;
}

// Usage
const analysis = await analyzeWebpage("https://example.com");
console.log(analysis);
Python
import anthropic
import requests
import base64
import os

client = anthropic.Anthropic()

def analyze_webpage(url: str) -> str:
    # Step 1: Capture screenshot
    response = requests.get(
        "https://api.len.sh/v1/screenshot",
        params={
            "url": url,
            "width": 1280,
            "height": 720,
            "format": "png",
            "access_key": os.environ["LENSH_API_KEY"],
        },
    )
    response.raise_for_status()
    image_base64 = base64.b64encode(response.content).decode("utf-8")

    # Step 2: Send to vision LLM
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Analyze this screenshot of {url}. Describe the layout, content, and visual elements.",
                    },
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": image_base64,
                        },
                    },
                ],
            }
        ],
    )
    return message.content[0].text

analysis = analyze_webpage("https://example.com")
print(analysis)
MCP Server Integration
The Model Context Protocol (MCP) is an open standard for connecting AI agents to external tools. A screenshot API is a natural fit as an MCP tool — it gives agents a standardized way to “look at” web pages.
len.sh provides an MCP server that exposes screenshot, OG image, and PDF capabilities as tools that MCP-compatible AI clients (like Claude Desktop, Cursor, or custom agents) can use automatically.
How It Works
- The AI agent decides it needs to see a web page
- It calls the screenshot tool via MCP
- The MCP server makes the API call to len.sh
- The screenshot is returned to the agent as an image
- The agent analyzes the screenshot with its vision capabilities
Example MCP Tool Definition
{
  "name": "screenshot",
  "description": "Capture a screenshot of a web page. Returns a PNG image of the rendered page.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "url": {
        "type": "string",
        "description": "The URL to screenshot"
      },
      "width": {
        "type": "number",
        "description": "Viewport width in pixels",
        "default": 1280
      },
      "height": {
        "type": "number",
        "description": "Viewport height in pixels",
        "default": 720
      },
      "full_page": {
        "type": "boolean",
        "description": "Capture the full scrollable page",
        "default": false
      }
    },
    "required": ["url"]
  }
}
When an AI agent has this tool available, it can autonomously decide to capture screenshots as part of its reasoning process. “Let me check what that page looks like” becomes a native capability.
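Behind a tool definition like this sits a small handler. The sketch below is framework-agnostic Python (stdlib only) rather than a specific MCP SDK: it applies the schema's defaults, builds the len.sh request from the validated arguments, and returns raw image bytes for the server library to wrap as an image content block. The function names are illustrative.

```python
import os
import urllib.parse
import urllib.request

LENSH_ENDPOINT = "https://api.len.sh/v1/screenshot"

def build_screenshot_request(url, width=1280, height=720, full_page=False):
    """Apply the tool schema's defaults and build the len.sh request URL."""
    params = {
        "url": url,
        "width": str(width),
        "height": str(height),
        "format": "png",
        "access_key": os.environ.get("LENSH_API_KEY", ""),
    }
    if full_page:
        params["full_page"] = "true"
    return f"{LENSH_ENDPOINT}?{urllib.parse.urlencode(params)}"

def handle_screenshot_tool(arguments: dict) -> bytes:
    """Tool dispatch target: fetch the screenshot and return raw PNG bytes
    for the MCP server library to package as an image result."""
    with urllib.request.urlopen(build_screenshot_request(**arguments)) as resp:
        return resp.read()
```

Because the schema declares defaults, the handler only needs `url`; the other arguments are optional overrides.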
Agent Patterns
Pattern 1: Visual QA After Deployment
An agent that verifies deployments by visually comparing before and after screenshots:
async function visualQA(stagingUrl, productionUrl) {
  const [stagingShot, prodShot] = await Promise.all([
    captureScreenshot(stagingUrl),
    captureScreenshot(productionUrl),
  ]);

  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Compare these two screenshots. The first is the staging environment, the second is production. Are there any significant visual differences? Should this deployment proceed?" },
          { type: "image", source: { type: "base64", media_type: "image/png", data: stagingShot } },
          { type: "image", source: { type: "base64", media_type: "image/png", data: prodShot } },
        ],
      },
    ],
  });
  return response.content[0].text;
}

async function captureScreenshot(url) {
  const params = new URLSearchParams({
    url,
    width: "1280",
    height: "720",
    format: "png",
    full_page: "true",
    access_key: process.env.LENSH_API_KEY,
  });
  const response = await fetch(`https://api.len.sh/v1/screenshot?${params}`);
  const buffer = Buffer.from(await response.arrayBuffer());
  return buffer.toString("base64");
}
Pattern 2: Web Research Agent
An agent that researches a topic by visiting pages and understanding their visual content:
def research_page(url: str, question: str) -> dict:
    """Visit a page, screenshot it, and extract information."""
    # Capture the page visually
    screenshot_response = requests.get(
        "https://api.len.sh/v1/screenshot",
        params={
            "url": url,
            "width": 1280,
            "height": 720,
            "format": "png",
            "block_ads": "true",
            "block_cookie_banners": "true",
            "access_key": os.environ["LENSH_API_KEY"],
        },
    )
    screenshot_response.raise_for_status()
    image_b64 = base64.b64encode(screenshot_response.content).decode()

    # Ask the LLM to extract information
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"I'm researching: {question}\n\nAnalyze this screenshot of {url} and extract any relevant information."},
                    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                ],
            }
        ],
    )
    return {
        "url": url,
        "findings": message.content[0].text,
    }
The block_ads and block_cookie_banners parameters are particularly useful for agent use cases — they clean up the visual noise that would otherwise confuse the LLM’s analysis.
Pattern 3: Monitoring Agent
An agent that periodically checks websites and reports on visual changes:
async function monitorSite(url, description) {
  const params = new URLSearchParams({
    url,
    width: "1280",
    height: "720",
    format: "png",
    block_ads: "true",
    block_cookie_banners: "true",
    cache_ttl: "0", // Always get a fresh screenshot
    access_key: process.env.LENSH_API_KEY,
  });
  const response = await fetch(`https://api.len.sh/v1/screenshot?${params}`);
  const buffer = Buffer.from(await response.arrayBuffer());
  const base64 = buffer.toString("base64");

  const analysis = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 512,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: `This is a screenshot of ${url} (${description}). Check for: broken layouts, error messages, missing content, unexpected changes, or anything that looks wrong. If everything looks normal, just say "OK".`,
          },
          {
            type: "image",
            source: { type: "base64", media_type: "image/png", data: base64 },
          },
        ],
      },
    ],
  });
  return analysis.content[0].text;
}
Setting cache_ttl=0 ensures the agent always gets a fresh screenshot, which is critical for monitoring use cases.
Pattern 4: Competitive Intelligence
An agent that monitors competitor websites for changes:
def check_competitor(competitor_url: str, previous_analysis: str) -> str:
    """Screenshot a competitor's page and compare to previous analysis."""
    response = requests.get(
        "https://api.len.sh/v1/screenshot",
        params={
            "url": competitor_url,
            "width": 1280,
            "full_page": "true",
            "format": "png",
            "cache_ttl": 0,
            "access_key": os.environ["LENSH_API_KEY"],
        },
    )
    response.raise_for_status()
    image_b64 = base64.b64encode(response.content).decode()

    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Previous analysis of {competitor_url}:\n{previous_analysis}\n\nLook at the current screenshot. What has changed? Note any new features, pricing changes, messaging updates, or significant visual differences.",
                    },
                    {
                        "type": "image",
                        "source": {"type": "base64", "media_type": "image/png", "data": image_b64},
                    },
                ],
            }
        ],
    )
    return message.content[0].text
Mobile and Device Simulation
AI agents often need to check mobile views. Use the width, height, and device_scale parameters to simulate different devices:
// iPhone 14 Pro
const mobileShot = await fetch(
  `https://api.len.sh/v1/screenshot?url=${encodeURIComponent(url)}&width=393&height=852&device_scale=3&format=png&access_key=${apiKey}`
);

// iPad
const tabletShot = await fetch(
  `https://api.len.sh/v1/screenshot?url=${encodeURIComponent(url)}&width=820&height=1180&device_scale=2&format=png&access_key=${apiKey}`
);

// Desktop
const desktopShot = await fetch(
  `https://api.len.sh/v1/screenshot?url=${encodeURIComponent(url)}&width=1440&height=900&device_scale=1&format=png&access_key=${apiKey}`
);
An agent can capture all three viewports and ask the LLM to compare responsive behavior across devices.
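As a sketch of that flow in Python (stdlib only, with the same illustrative device presets as above): capture one screenshot per viewport, then interleave labels and images into a single vision-LLM message so the model knows which image is which.

```python
import base64
import os
import urllib.parse
import urllib.request

# Device presets mirroring the snippet above; adjust to your own targets.
DEVICES = {
    "mobile": {"width": 393, "height": 852, "device_scale": 3},
    "tablet": {"width": 820, "height": 1180, "device_scale": 2},
    "desktop": {"width": 1440, "height": 900, "device_scale": 1},
}

def capture_viewports(url: str) -> dict:
    """Capture one screenshot per device preset, as base64 PNG strings."""
    shots = {}
    for name, viewport in DEVICES.items():
        params = urllib.parse.urlencode({
            "url": url,
            "format": "png",
            "access_key": os.environ.get("LENSH_API_KEY", ""),
            **{k: str(v) for k, v in viewport.items()},
        })
        with urllib.request.urlopen(f"https://api.len.sh/v1/screenshot?{params}") as resp:
            shots[name] = base64.b64encode(resp.read()).decode()
    return shots

def build_comparison_message(shots: dict) -> list:
    """Interleave viewport labels and images into one message content list."""
    content = [{
        "type": "text",
        "text": "Compare this page's responsive behavior across the following "
                "viewports. Flag any layout breakage or hidden content.",
    }]
    for name, b64 in shots.items():
        content.append({"type": "text", "text": f"{name} viewport:"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": b64},
        })
    return content
```

Labeling each image in the prompt matters: without the interleaved text, the model has to guess which screenshot is which device.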
Performance Tips for Agents
Use WebP for Smaller Payloads
Most vision LLMs accept WebP, which is significantly smaller than PNG. Smaller images mean less of the context window consumed and faster API calls:
# PNG: ~500KB
format=png
# WebP: ~150KB (same visual quality)
format=webp&quality=85
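One easy mistake when switching formats is forgetting to update the `media_type` in the LLM message to match. A small helper (hypothetical, not part of any SDK) keeps the two in sync by deriving both from one format choice:

```python
def screenshot_params(url: str, fmt: str = "webp", quality: int = 85) -> tuple:
    """Build the screenshot request params and the matching LLM media_type
    together, so switching formats can't leave the two out of sync."""
    params = {"url": url, "width": "1280", "height": "720", "format": fmt}
    if fmt == "webp":
        # quality only applies to lossy formats
        params["quality"] = str(quality)
    return params, f"image/{fmt}"
```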
Capture Only What You Need
Don’t capture full-page screenshots if the agent only needs to see the above-the-fold content. Smaller images are faster to generate and cheaper to analyze:
# Just the viewport (faster)
width=1280&height=720
# Full page (slower, larger image)
width=1280&height=720&full_page=true
Use Element Selectors
If the agent only needs to see a specific part of the page, use the selector parameter to capture just that element:
# Just the pricing table
selector=.pricing-table
# Just the hero section
selector=.hero
# Just error messages
selector=.error-message
This dramatically reduces image size and makes the LLM’s task more focused.
Cache Strategically
For research tasks where freshness isn’t critical, let the cache work for you — the default 24-hour TTL means subsequent requests for the same page are nearly instant. For monitoring tasks where freshness matters, set cache_ttl=0.
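If an agent handles several task types, it helps to encode that policy once rather than scattering `cache_ttl` decisions through the code. The task names below are illustrative:

```python
# Illustrative policy: tasks that react to the page's *current* state need
# fresh captures; research tasks can reuse the default 24-hour cache.
FRESH_TASKS = {"monitoring", "visual_qa", "competitive_intel"}

def cache_params(task: str) -> dict:
    """Extra query params to merge into the screenshot request for a task."""
    return {"cache_ttl": "0"} if task in FRESH_TASKS else {}
```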
Getting Started
- Sign up for a free len.sh account
- Get your API key from the dashboard
- Capture your first screenshot:
curl "https://api.len.sh/v1/screenshot?url=https://example.com&width=1280&height=720&format=png&access_key=YOUR_API_KEY" \
--output screenshot.png
- Send it to your vision LLM of choice
- Build from there
The free tier (100 screenshots/month) is enough to build and test your agent integration. All parameters — including ad blocking, cookie banner removal, device scale, JS injection, and element selectors — are available on every plan.
AI agents that can see the web are fundamentally more capable than agents that can only read it. A screenshot API is the simplest way to give your agent eyes.