Deep Dive: Comparing the Big Four LLM APIs — ChatGPT, Claude, Gemini & Grok
A systematic comparison of session-layer design across ChatGPT, Claude, Gemini, and Grok — message models, context management, standout features, API styles, and the divergent product philosophies behind them
If you’re building LLM-powered applications, you’ll inevitably deal with the “session layer” — how messages are stored, how context is managed, how multi-turn conversations maintain state, and how long conversations avoid blowing up the context window.
This layer might sound mundane, but it’s actually where the four major platforms diverge the most. OpenAI, Anthropic, Google, and xAI have each taken fundamentally different paths when it comes to “conversation.” Some differences are engineering trade-offs, some reflect product philosophy, and some are ecosystem strategy.
This post is my research notes from dissecting all four platforms’ APIs — with code, comparisons, and some opinionated takes.
TL;DR
| | ChatGPT | Claude | Gemini | Grok |
|---|---|---|---|---|
| Context window | 128K | 200K | 2M | 128K |
| Cache control | Automatic, opaque | Explicit, ~90% savings | Explicit, ~75% savings | Not supported |
| API compatibility | Original standard | Independent design | Independent design | OpenAI-compatible |
| Real-time data | Web search | None | Google Search | X platform |
| Multimodal | Images (plugin-style) | Images / documents | Native text+image+audio+video | Images |
| Knowledge management | GPTs (20 file limit) | Projects (unlimited) | Gems (no knowledge base) | None |
| One-liner positioning | Broadest ecosystem, industry standard-setter | Cost-controllable, developer-friendly | Ultra-long context, native multimodal | Zero migration cost, instant onboarding |
Quick decision guide:
- Already running OpenAI code and want to try another model → Grok (change one base_url line)
- Long document analysis / massive context → Gemini (2M tokens)
- Cost-sensitive + knowledge base Q&A → Claude (explicit caching + Projects)
- Need real-time Google / X data → Gemini or Grok
Starting from OpenAI: The Data Structure That Became an Industry Standard
OpenAI’s Chat Completions API rapidly established the “message array” as the de facto standard for LLM conversations in 2023:
{
"role": "user" | "assistant" | "system",
"content": "message content or content block array"
}
The design is deliberately minimal — role and content, just two fields. But that minimalism is intentional: stateless, no magic, developers have full control. Every request carries the complete conversation history; the API remembers nothing. If something breaks, it’s on your end.
As features expanded, role gained a new value — developer (replacing system in newer versions, with higher priority than user, explicitly separating “framework instructions” from “user input”). Content evolved from a plain string to an array that can carry multimodal data:
{
"role": "user",
"content": [
{ "type": "text", "text": "What's in this image?" },
{ "type": "image_url", "image_url": { "url": "https://..." } }
]
}
Tool calling forms a complete loop: assistant initiates tool_calls, a tool role returns results, triggering the next round of reasoning. Grok later adopted this pattern wholesale — not out of laziness, but because compatibility is a strategy.
// assistant initiates a tool call
{ "role": "assistant", "content": null,
"tool_calls": [{ "id": "call_abc", "type": "function",
"function": { "name": "get_weather", "arguments": "{\"location\": \"Beijing\"}" } }] }
// tool returns results
{ "role": "tool", "tool_call_id": "call_abc", "content": "{\"temperature\": 22}" }
Three Generations of API Evolution
OpenAI has been refactoring this system itself. Three API generations represent three different state management philosophies:
| API | State management | Status |
|---|---|---|
| Chat Completions | Stateless, full history every request | Continued support |
| Assistants API | Stateful (Thread objects) | Deprecated mid-2026 |
| Responses API | Optionally stateful, chain conversations via previous_response_id | Recommended |
The Responses API is a significant shift this year: instead of forcing developers to choose between “manage history yourself” and “hand everything to the server,” it enables chained conversations through previous_response_id while preserving the model’s reasoning state across turns:
response1 = client.responses.create(model="gpt-4", input="Hello", store=True)
response2 = client.responses.create(
model="gpt-4", input="Continue",
previous_response_id=response1.id # Just pass the ID, not the full history
)
Real-world impact: cache hit rates improve by 40-80%, and GPT-5 shows a 5% improvement over Chat Completions on certain reasoning benchmarks.
Product Layer: Three Designs Worth Exploring
Message Branching is an easily overlooked feature in ChatGPT. Editing a historical message creates a new branch — the underlying structure is a tree where each message has a parent_id and children_ids[]. The UI shows a < 2/3 > switcher, letting users navigate between “multiple universes of a conversation.” This is incredibly valuable for exploratory dialogue, yet most developer-built apps don’t support it.
The Memory System is far more sophisticated than users realize. Based on community reverse-engineering of ChatGPT’s System Prompt (source: embracethered.com), Memory doesn’t just “store conversation history” — it maintains a 6-layer user profile injected into the System Prompt:
1. Model Set Context — Content the user explicitly asked to remember (timestamped)
2. Assistant Response Preferences — Inferred interaction preferences (with confidence scores)
3. Notable Past Conversation Topics — Historical topic summaries
4. Helpful User Insights — Extracted personal/professional information
5. Recent Conversation Content — Approximately last 40 turns (user messages only)
6. User Interaction Metadata — Account/device/behavioral data
Key design choices: it uses RAG rather than full-text embedding; it only stores user messages (not assistant responses) to save tokens; OpenAI asynchronously updates user profiles in the background. The trade-off is that users can’t view or edit system-inferred information — likely one reason this feature still hasn’t launched in Europe (GDPR).
Canvas takes a different approach: by injecting functions under a canmore namespace into the system prompt, the model can manipulate a separate “document panel”:
canmore.create_textdoc(content: string): { textdoc_id: string }
canmore.update_textdoc(textdoc_id: string, pattern: string, replacement: string)
It’s an interesting design pattern: UI interactions aren’t driven by traditional frontend routing, but by tool calls in the model’s output. Content longer than 10 lines automatically triggers Canvas, and Python code can run directly in the browser via WASM.
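The pattern generalizes beyond Canvas: a registry maps tool names in the model's output to UI handlers. A hypothetical sketch (the canmore name mirrors the functions above; a plain dict stands in for real panel state):

```python
# Hypothetical dispatcher illustrating the pattern: model-emitted tool
# calls drive UI state instead of traditional frontend routing.
DOCS = {}         # stand-in for the document panel's state
UI_HANDLERS = {}  # tool name -> handler function

def ui_tool(name):
    """Register a function as a UI-facing tool handler."""
    def register(fn):
        UI_HANDLERS[name] = fn
        return fn
    return register

@ui_tool("canmore.create_textdoc")
def create_textdoc(content):
    textdoc_id = f"textdoc_{len(DOCS)}"
    DOCS[textdoc_id] = content
    return {"textdoc_id": textdoc_id}

def dispatch(tool_call):
    """Route one tool call from the model's output to its handler."""
    return UI_HANDLERS[tool_call["name"]](**tool_call["arguments"])

result = dispatch({"name": "canmore.create_textdoc",
                   "arguments": {"content": "# Draft\nHello"}})
```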
Claude: The Opposite Extreme — Giving Control Back to Developers
If OpenAI’s philosophy is “here’s a good-enough standard, we’ll add features gradually,” Claude’s API design philosophy is “give developers explicit control over everything.”
The most telling example: the System Prompt isn’t inside the message array — it’s a separate top-level parameter.
{
model: "claude-opus-4-5-20251101",
system: "You are...", // Separate parameter, not mixed into messages
messages: [
{ role: "user", content: "..." },
{ role: "assistant", content: "..." }
],
thinking: { type: "enabled", budget_tokens: 8000 }
}
This choice aligns with Gemini’s later systemInstruction design — physically separating “framework instructions” from “conversation content” benefits both caching and compositional reuse. Grok didn’t follow suit (still using OpenAI’s message-role approach).
Content Block Types: The Richest System
Claude has the most content block types among the four platforms. Beyond text, images, and tool calls/results, two types deserve special attention:
Thinking Blocks carry signatures:
{
type: "thinking",
thinking: string,
signature: string // Server-side Anthropic signature, prevents client-side forgery
}
More importantly, the behavior: thinking blocks from previous turns are automatically stripped from subsequent conversation context. They don’t accumulate, don’t expose intermediate reasoning to later turns, and don’t consume context window space. Gemini’s Thinking Mode works differently — the thinking process is visible in responses but isn’t automatically cleaned up.
Document blocks natively support PDF with built-in citation capabilities:
{
type: "document",
source: { type: "base64" | "url", media_type: "application/pdf", data: string },
citations?: { enabled: true }
}
Citations can pinpoint character positions or page numbers — more granular than what any other platform offers for document Q&A scenarios.
Prompt Caching: Turning Cost Optimization into an Engineering Problem
Claude’s caching is explicit, with content-block-level granularity and configurable TTL of 5 minutes or 1 hour:
{
system: [{
type: "text",
text: "Long document content...",
cache_control: { type: "ephemeral", ttl: "1h" }
}]
}
The response tells you exactly how much was cached:
usage: {
input_tokens: 1000,
output_tokens: 500,
cache_creation_input_tokens: 800, // Written to cache this time
cache_read_input_tokens: 0 // Cache hits (saves ~90% cost)
}
Compared to ChatGPT’s automatic caching, which never reports how much of a request actually hit, this design turns cost optimization from “hoping for the best” into “something you can engineer.” For use cases that repeatedly process the same long document (code review, contract analysis, knowledge base Q&A), the difference is substantial.
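As a sketch of what “engineerable” means here: the usage fields above are enough to price each request. The write/read multipliers below match Anthropic’s published ratios for the 5-minute tier (writes at 1.25x base input, reads at 0.1x); the per-million-token prices are placeholders, not a quote.

```python
def claude_request_cost(usage, input_per_mtok, output_per_mtok):
    """Estimate one request's cost from a Claude usage block.
    Cache writes bill at ~1.25x base input and cache reads at ~0.1x
    (5-minute tier ratios; the 1-hour tier prices writes higher)."""
    total = (
        usage.get("input_tokens", 0) * input_per_mtok
        + usage.get("cache_creation_input_tokens", 0) * input_per_mtok * 1.25
        + usage.get("cache_read_input_tokens", 0) * input_per_mtok * 0.10
        + usage.get("output_tokens", 0) * output_per_mtok
    )
    return total / 1_000_000

# First request writes the cache; later requests read it.
cold = claude_request_cost(
    {"input_tokens": 1000, "output_tokens": 500,
     "cache_creation_input_tokens": 800, "cache_read_input_tokens": 0},
    input_per_mtok=3.0, output_per_mtok=15.0)
warm = claude_request_cost(
    {"input_tokens": 1000, "output_tokens": 500,
     "cache_creation_input_tokens": 0, "cache_read_input_tokens": 800},
    input_per_mtok=3.0, output_per_mtok=15.0)
```

The gap between cold and warm grows with the cached prefix; for a document that dwarfs the per-turn input, the warm cost approaches the ~90% savings figure.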
Gemini’s Context Caching offers a similar explicit API, with hits saving approximately 75% on input costs. The distinction: Claude provides content-block-level control; Gemini operates at the request level.
Projects: Knowledge Management on Another Level
Claude’s Projects far exceed ChatGPT GPTs (20-file limit) in knowledge base capacity:
Project
├── Custom Instructions (project-level System Prompt)
├── Knowledge Base
│ ├── No file count limit, single file max 30MB
│ ├── Supports PDF/DOCX/CSV/TXT/HTML/ODT/RTF/EPUB
│ └── Contextual Retrieval (not just vector search — adds contextual details to retrieved chunks)
└── Conversations (multiple conversations within a project share the knowledge base)
The RAG implementation, called Contextual Retrieval, works as follows: retrieve relevant content, enhance it with contextual details, then combine with the user’s question to generate a response. This yields higher quality than pure vector similarity search. Gemini’s Gems currently don’t support knowledge base uploads — a notable gap.
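A toy version of that retrieve-enhance-combine flow (keyword overlap stands in for vector search, and chunk metadata stands in for the model-generated situating context of the real technique):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_title: str
    section: str
    text: str

def retrieve(question, chunks, k=2):
    """Toy keyword-overlap retriever standing in for vector search."""
    qwords = set(question.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(qwords & set(c.text.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, chunks):
    """Enhance each retrieved chunk with its document/section context
    before combining it with the user's question."""
    hits = retrieve(question, chunks)
    enhanced = [f"[{c.doc_title} / {c.section}] {c.text}" for c in hits]
    return "\n".join(enhanced) + f"\n\nQuestion: {question}"
```

The point of the contextual prefix is that an isolated chunk like “Refunds are issued within 14 days” becomes attributable and disambiguated once its document and section travel with it.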
Artifacts vs. Canvas: A Philosophical Divide
Claude Artifacts and ChatGPT Canvas solve the same problem (collaborative editing of generated content) but with different philosophies:
| | Artifacts (Claude) | Canvas (ChatGPT) |
|---|---|---|
| Content management | Independent versioned objects with full version history | Real-time collaboration, no explicit versions |
| Storage | Personal / Shared types | Session-scoped, temporary |
| Code execution | Not supported | Python runs in WASM |
| Supported formats | Markdown/HTML/React/SVG/Mermaid | Documents/Code |
Artifacts treat “generated content” as an object with a lifecycle. The Shared type even supports multi-user shared state (leaderboards, collaborative documents, etc.). Canvas leans more toward “real-time collaboration,” but content doesn’t exist independently outside the session.
Gemini: Redefining Multimodal from the Ground Up
Among the four platforms, Gemini is the only one that designed multimodal as a first-class citizen from day one. This isn’t a feature-level difference — it’s a data model-level difference.
Deliberately Different Data Structures
Gemini’s messages aren’t called messages — they’re Content objects. The content field isn’t content — it’s a parts array. The AI’s role isn’t assistant — it’s model:
interface Content {
role: "user" | "model"; // Not "assistant"
parts: Part[]; // Not "content"
}
These naming choices aren’t accidental — a parts array more naturally expresses “a single message containing text, images, and audio simultaneously.” OpenAI later added content array support, but semantically it’s an “extension.” For Gemini, it’s been “native” from the start:
// Mixing modalities in one message is a first-class citizen
{ role: "user", parts: [
{ text: "Analyze the content of this video for me" },
{ fileData: { mimeType: "video/mp4", fileUri: "gs://..." } },
{ inlineData: { mimeType: "audio/mp3", data: audioBase64 } }
]}
Format coverage is the broadest of all four platforms: images (PNG/JPEG/WEBP/HEIC/HEIF), audio (WAV/MP3/AIFF/AAC/OGG/FLAC), video (MP4/MPEG/MOV/AVI, etc.), and documents (PDF/TXT/HTML/JS/Python, etc.).
What a 2M Context Window Actually Means
| Model | Context window | Max output |
|---|---|---|
| Claude | 200K tokens | 8K |
| ChatGPT (GPT-4) | 128K tokens | 16K |
| Gemini 2.0 Flash | 1M tokens | 8K |
| Gemini 1.5 Pro | 2M tokens | 8K |
2M tokens isn’t just “a bigger 128K” — it’s a qualitative shift: roughly 3,000 pages of documents, about 2 hours of video, or an entire mid-sized codebase can be stuffed into context without RAG chunking. For use cases that require “global understanding” rather than “local retrieval” — code refactoring, full contract review, academic paper analysis — this is a genuine competitive advantage.
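A quick back-of-envelope check on that page count, assuming roughly 500 words per page and about 1.3 tokens per English word (both rough assumptions, not platform specs):

```python
WORDS_PER_PAGE = 500    # assumption: a dense text page
TOKENS_PER_WORD = 1.3   # assumption: typical ratio for English text

tokens_per_page = WORDS_PER_PAGE * TOKENS_PER_WORD   # 650 tokens/page
pages_in_context = 2_000_000 / tokens_per_page       # ~3,077 pages
```

At 128K tokens the same arithmetic yields under 200 pages, which is why the 2M window changes which problems need RAG at all.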
Context Caching lets you sustain this advantage without blowing up costs:
// Create cache (done once)
const cache = await cacheManager.create({
model: "gemini-1.5-pro",
contents: largeContextContents, // That 2M context
ttl: "3600s"
});
// Subsequent requests only send the new question
const response = await model.generateContent({
cachedContent: cache.name,
contents: [{ role: "user", parts: [{ text: "Based on the entire codebase, find all N+1 query issues" }] }]
});
With cache hits, input token pricing drops by approximately 75%.
Google Ecosystem Integration: Moat and Privacy Concern
Grounding with Google Search isn’t just “web search” — it provides paragraph-level source attribution with confidence scores:
// Enable real-time search
{ tools: [{ googleSearch: {} }] }
// Response includes source evidence
groundingMetadata: {
groundingChunks: [{ web: { uri: "https://...", title: "Source Title" } }],
groundingSupports: [{
segment: { startIndex: 0, endIndex: 100 },
groundingChunkIndices: [0],
confidenceScores: [0.95] // Confidence per source
}],
webSearchQueries: ["actual search queries executed"]
}
Deeper integration comes through Extensions: Gemini can directly access a user’s Gmail, Drive, Calendar, and Maps — not via API calls, but through authorized direct data access. This level of depth is beyond what ChatGPT Actions can achieve, and it’s a direction Claude actively avoids.
But this is also Gemini’s biggest liability: deep data access means greater privacy risk, and Extensions are predefined (official integrations only) — users can’t create custom ones, making it far less flexible than GPTs’ Actions system.
Live API: The New Frontier of Real-Time Interaction
Gemini 2.0’s Live API supports real-time audio/video streaming over WebSocket. Among the other three platforms, only ChatGPT’s Realtime API offers similar functionality; Claude and Grok don’t support it:
const ws = new WebSocket("wss://generativelanguage.googleapis.com/ws/...");
ws.send(JSON.stringify({
realtimeInput: { mediaChunks: [{ mimeType: "audio/pcm", data: audioChunkBase64 }] }
}));
Use cases: voice assistants, real-time translation, video call analysis.
Grok: Exclusive Data + Compatibility Strategy
Grok has the smallest ecosystem and the latest launch among the four, but it has two things the others can’t replicate.
The Exclusive Moat: Real-Time X Platform Data
Grok’s core differentiator isn’t model capability — it’s the data source:
User asks: "What are the latest developments in AI regulation?"
↓
Real-time X platform retrieval (latest posts + trending topics + threaded discussions)
↓
System Prompt + real-time data + user message → context assembly
↓
Response with up-to-date information (sourced from X platform)
The injected data includes post content, author info (follower count, verification status), engagement metrics (likes/reposts/views), and thread structure.
For sentiment analysis, breaking news tracking, and social media trend monitoring, this is an advantage no other platform can replicate — even though Google Grounding has Google Search, X platform data isn’t in it.
Strategic Compatibility: Using the OpenAI SDK as a Zero-Cost Entry Point
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.XAI_API_KEY,
baseURL: "https://api.x.ai/v1" // Change this one line, everything else stays the same
});
The strategic logic is clear: lower the barrier to entry → let developers already using OpenAI try Grok seamlessly → accumulate user data and trust → then push differentiated features. From a market share perspective, this is a sound late-mover strategy.
Supported endpoints fully cover the commonly used APIs: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/images/generations, plus Function Calling (fully OpenAI-compatible format).
Current limitations are also obvious: no knowledge base management, no project organization, no caching support. But this may be deliberate — secure the “X data gateway” and “compatibility layer” strategic positions first, then backfill other features.
DeepSearch and Personality Modes
DeepSearch is Grok’s deep research feature. It’s not just “search” — it’s multi-source aggregation + cross-validation + structured reporting. The flow: decompose complex questions → parallel retrieval from X data + web search + authoritative sources → cross-validate + assign credibility scores → generate a sourced report. It takes longer (minutes), suited for scenarios requiring comprehensive analysis.
Fun Mode / Regular Mode switch styles via different System Prompts — Regular is professional and direct, Fun is humorous and opinionated. No other platform has a direct equivalent. Claude’s Styles feature is conceptually similar, but it’s positioned as helping users adjust the assistant’s communication style rather than giving the assistant a “personality mode.”
Cross-Platform Comparison
Key Dimensions Summary
| Dimension | ChatGPT | Claude | Gemini | Grok |
|---|---|---|---|---|
| Max context | 128K | 200K | 2M | 128K |
| AI role name | assistant | assistant | model | assistant |
| System Prompt placement | Message role / top-level param | Top-level system | Top-level systemInstruction | Message role |
| API compatibility | Original (industry standard) | Independent design | Independent design | OpenAI-compatible |
| Multimodal depth | Images (plugin-style) | Images/documents | Native text+image+audio+video | Images |
| Real-time data | Web Browsing | None | Google Search + Grounding | Native X platform |
| Knowledge management | GPTs (20 files) + Memory | Projects (unlimited files) | Gems (no knowledge base) | None |
| Collaborative editing | Canvas | Artifacts (versioned) | None | None |
| Cache control | Automatic, opaque | Explicit TTL, ~90% savings | Context Caching API, ~75% savings | Not supported |
| Deep reasoning | o1 series (separate model) | Extended Thinking (configurable budget) | Thinking Mode | Not supported |
| Real-time audio/video | Realtime API | None | Live API | None |
Core Differences in Session Data Models
ChatGPT: messages[{ role: "assistant", content: string | array }]
Claude: messages[{ role: "assistant", content: string | array }] + top-level system
Gemini: contents[{ role: "model", parts: Part[] }] + top-level systemInstruction
Grok: messages[{ role: "assistant", content: string | array }] ← identical to OpenAI
Three of the four use OpenAI’s assistant and content conventions. Only Gemini uses model and parts — not a preference difference, but a reflection of multimodal-first design philosophy at the data model level.
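In practice this means one adapter per platform. A sketch of the OpenAI-to-Gemini direction, covering all three renames at once (assistant becomes model, content becomes parts, and system messages are lifted out into systemInstruction):

```python
def to_gemini(messages):
    """Convert OpenAI-convention messages (shared by ChatGPT, Claude,
    and Grok) into Gemini's request shape."""
    # Lift system messages out of the array into the top-level field.
    system_text = " ".join(m["content"] for m in messages
                           if m["role"] == "system") or None
    # Rename the role and wrap string content in a parts array.
    contents = [{"role": "model" if m["role"] == "assistant" else "user",
                 "parts": [{"text": m["content"]}]}
                for m in messages if m["role"] != "system"]
    return {"systemInstruction": system_text, "contents": contents}
```

This sketch handles text-only content; multimodal blocks would need a per-type mapping into inlineData or fileData parts.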
Three Philosophies of Caching
This dimension best reveals the product philosophy differences:
- ChatGPT: Automatic caching that doesn’t report hit counts. “Focus on the conversation, we’ll optimize costs.” But this means you can’t predict actual spend.
- Claude: Explicit cache_control, content-block-level granularity, configurable TTL; the response reports exactly how many tokens hit the cache. “Cost is engineerable — you’re in charge.”
- Gemini: Context Caching API, request-level caching, ideal for “fixed large context + varying questions” patterns.
- Grok: No caching support — a clear disadvantage for long-context scenarios.
For cost-sensitive production environments, Claude’s explicit cache control is currently the most mature solution.
Streaming Response Formats
| Platform | SSE event format |
|---|---|
| ChatGPT | data: {"choices":[{"delta":{"content":"..."}}]} |
| Claude | event: content_block_delta + data: {"delta":{"text":"..."}} |
| Gemini | {"candidates":[{"content":{"parts":[{"text":"..."}]}}]} |
| Grok | Same as ChatGPT |
Claude uses event types to distinguish data blocks (content_block_start, content_block_delta, content_block_stop), making it cleaner to parse complex responses (thinking blocks + text blocks + tool calls), though at a slightly higher integration cost.
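A sketch of a parser handling both shapes (event names and field paths as in the table above; real streams carry more event types, such as Claude's message_start and message_stop, which this sketch simply ignores):

```python
import json

def collect_stream_text(lines, platform):
    """Accumulate text deltas from raw SSE lines. Claude pairs an
    `event:` line with the following `data:` line; OpenAI-style streams
    (ChatGPT and Grok) put everything in `data:` lines."""
    parts, event = [], None
    for line in lines:
        if line.startswith("event: "):
            event = line[len("event: "):]
        elif line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":       # OpenAI-style terminator
                break
            data = json.loads(payload)
        else:
            continue
        if line.startswith("data: "):
            if platform == "claude":
                if event == "content_block_delta":
                    parts.append(data["delta"].get("text", ""))
            else:  # openai-style (ChatGPT / Grok)
                delta = data["choices"][0].get("delta", {})
                parts.append(delta.get("content") or "")
    return "".join(parts)
```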
Takeaways for Developers
Platform selection logic:
| Scenario | Top pick | Key reason |
|---|---|---|
| Long document analysis | Gemini (2M) / Claude (200K) | Genuinely ultra-long context |
| Knowledge base Q&A | Claude | Unlimited knowledge base + Contextual Retrieval |
| Real-time information | Grok (X data) / Gemini (Google Search) | Exclusive data sources |
| Multimodal applications | Gemini | Native multimodal, not plugin-style |
| Quick API integration | Grok (OpenAI SDK compatible) | Zero migration cost |
| Cost-sensitive + long context | Claude | Explicit caching, predictable costs |
| Security & compliance | Claude | No ecosystem integrations, clear data boundaries |
Recommended message data structure:
interface Message {
id: string;
role: "user" | "assistant" | "system";
content: string | ContentBlock[];
created_at: string;
parent_id?: string; // Support branching — ChatGPT proved this is valuable
children_ids?: string[];
metadata?: {
model?: string;
tokens?: number;
cache_hit_tokens?: number; // Claude-style cache tracking
tool_calls?: ToolCall[];
};
}
Managing the System Prompt separately from the message array (the Claude / Gemini approach) benefits both caching and compositional reuse.
Context management strategies:
| Scenario | Recommended approach |
|---|---|
| Short conversations | Pass full history, no special handling needed |
| Long conversations | Sliding window + prioritize System Prompt retention |
| Knowledge-intensive | RAG + explicit context caching (following Claude/Gemini) |
| Deep reasoning | Extended Thinking + thinking block cleanup (following Claude) |
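The “sliding window + prioritize System Prompt retention” strategy fits in a few lines (the 4-characters-per-token counter is a rough heuristic; swap in a real tokenizer such as tiktoken in production):

```python
def sliding_window(system_prompt, messages, budget_tokens,
                   count=lambda s: max(1, len(s) // 4)):
    """Keep the system prompt plus the most recent messages that fit
    within budget_tokens, dropping the oldest turns first."""
    remaining = budget_tokens - count(system_prompt)
    kept = []
    for msg in reversed(messages):          # walk from newest to oldest
        cost = count(msg["content"])
        if cost > remaining:
            break
        kept.append(msg)
        remaining -= cost
    kept.reverse()                          # restore chronological order
    return [{"role": "system", "content": system_prompt}] + kept
```

A refinement worth adding in practice: truncating at a user/assistant pair boundary, so the window never starts mid-exchange with an orphaned assistant turn.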
Five trends that won’t go away:
- The message array model is the standard. Every platform is building on top of it; a fundamental redesign is extremely unlikely.
- Context windows will keep growing, but returns diminish beyond 2M. The cost curve deserves more attention.
- Multimodal shifting from “plugin” to “native” is irreversible. Gemini’s Part-type design represents the future direction.
- Ecosystem integration is a moat. Google’s full suite and X platform data can’t be replicated by other platforms in the short term.
- Explicit cache control is increasingly important. Cost optimization will become one of the core engineering challenges in LLM applications.
References:
- OpenAI Chat Completions API Reference
- OpenAI Responses API vs Chat Completions — Simon Willison
- How ChatGPT Remembers You: Memory Deep Dive — embracethered.com
- ChatGPT’s Canvas: Internal Technical Details
- Claude Messages API Reference
- Claude Extended Thinking Guide
- Claude Artifacts Help
- Gemini API Documentation
- Gemini Context Caching Guide
- Gemini Grounding with Google Search
- xAI / Grok API Documentation