LLM Comparison Cheat Sheet

Last Updated: November 21, 2025

Major Models Overview

| Model | Developer | Latest Version | Context Window | Best For |
|---|---|---|---|---|
| GPT-4 Turbo | OpenAI | GPT-4 Turbo (Nov 2023+) | 128K tokens | General tasks, creative writing, code |
| Claude 3 Opus | Anthropic | Claude 3 Opus (March 2024) | 200K tokens | Complex analysis, long documents, coding |
| Claude 3 Sonnet | Anthropic | Claude 3 Sonnet (March 2024) | 200K tokens | Balanced speed/intelligence |
| Gemini | Google | Gemini 1.5 Pro (Feb 2024) | 1M tokens (1.5 Pro); 32K (1.0 Ultra) | Multimodal, long context, analysis |
| Llama 2 70B | Meta | Llama 2 (July 2023) | 4K tokens | Open source, self-hosting |
| Mistral Large | Mistral AI | Mistral Large (Feb 2024) | 32K tokens | European alternative, efficiency |
| Perplexity | Perplexity AI | Multiple models | Varies | Real-time search, citations |

Capability Comparison

| Capability | GPT-4 | Claude 3 Opus | Gemini Ultra |
|---|---|---|---|
| Coding | Excellent | Excellent | Very Good |
| Math/Reasoning | Excellent | Excellent | Excellent |
| Creative Writing | Excellent | Outstanding | Very Good |
| Analysis | Very Good | Outstanding | Excellent |
| Following Instructions | Very Good | Excellent | Good |
| Image Understanding | Very Good | Excellent | Excellent |
| Multilingual | Very Good | Good | Excellent |
| Speed | Fast | Medium (Sonnet: Fast) | Fast |
| Honesty/Accuracy | Good | Excellent | Good |
| Safety/Refusals | Moderate | Conservative | Moderate |

Pricing Comparison (per 1M tokens)

| Model | Input Cost | Output Cost | Free Tier |
|---|---|---|---|
| GPT-4 Turbo | $10 | $30 | Limited via ChatGPT free |
| GPT-3.5 Turbo | $0.50 | $1.50 | Unlimited via ChatGPT |
| Claude 3 Opus | $15 | $75 | Limited free messages |
| Claude 3 Sonnet | $3 | $15 | Available |
| Claude 3 Haiku | $0.25 | $1.25 | API only |
| Gemini 1.5 Pro | $3.50 (≤128K), $7 (>128K) | $10.50 (≤128K), $21 (>128K) | 60 requests/min free |
| Mistral Large | $8 | $24 | Trial credits |
| Llama 2 70B | Free (self-host) | Free (self-host) | Open source |
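To see what these rates mean per request, it helps to price a concrete prompt/response pair. Below is a minimal sketch using the per-1M-token prices from the table above; the prices are snapshots and the model keys are illustrative, so check each provider's current pricing page before relying on the numbers.

```python
# Per-1M-token prices from the table above: (input $, output $).
# Snapshot values; verify against each provider's pricing page.
PRICES = {
    "gpt-4-turbo": (10.00, 30.00),
    "gpt-3.5-turbo": (0.50, 1.50),
    "claude-3-opus": (15.00, 75.00),
    "claude-3-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "mistral-large": (8.00, 24.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price

# Example: a 10K-token prompt with a 1K-token answer.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 10_000, 1_000):.4f}")
```

Note how output pricing dominates for Opus: the same request costs roughly 35x more on Opus than on Haiku, which is why the cheap tiers are the default for high-volume workloads.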

Strengths & Weaknesses

| Model | Key Strengths | Notable Weaknesses |
|---|---|---|
| GPT-4 | Versatile, plugin ecosystem, large community | Can be verbose, occasional hallucinations |
| Claude 3 Opus | Nuanced understanding, long context, thoughtful | Slower, more expensive, sometimes over-cautious |
| Claude 3 Sonnet | Fast, good balance, affordable | Less capable than Opus for complex tasks |
| Gemini | Massive context, multimodal, integrated with Google | Newer, less polished, availability limited |
| Llama 2 | Open source, customizable, privacy | Requires infrastructure, less capable |
| Mistral | European data residency, efficient | Smaller ecosystem, newer platform |

Use Case Recommendations

| Use Case | Best Choice | Alternative | Reasoning |
|---|---|---|---|
| Code Generation | GPT-4 Turbo | Claude 3 Opus | Strong coding capabilities, wide language support |
| Long Document Analysis | Gemini 1.5 Pro | Claude 3 Opus | 1M token context, excellent comprehension |
| Creative Writing | Claude 3 Opus | GPT-4 | Nuanced, natural prose, character depth |
| Research & Citations | Perplexity | Gemini (Google Search) | Real-time info, source citations |
| Customer Support Chatbot | GPT-3.5 Turbo | Claude 3 Haiku | Cost-effective, fast responses |
| Complex Reasoning | Claude 3 Opus | GPT-4 | Superior analytical capabilities |
| Privacy-Sensitive Work | Llama 2 (self-hosted) | Mistral (European) | Data control, compliance |
| Multimodal Tasks | Gemini Ultra | GPT-4 Vision | Native multimodal architecture |
| Budget Projects | Claude 3 Haiku | GPT-3.5 | Low cost, decent performance |
| Translation | Gemini | GPT-4 | Multilingual strength |
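If you route requests programmatically, the table above maps naturally onto a lookup with a fallback. This is a sketch, not a prescription: the use-case keys and model identifiers are illustrative, and you should substitute whatever aliases your stack actually uses.

```python
# The use-case table above, encoded as (best choice, alternative) pairs.
# Keys and model names are illustrative placeholders.
RECOMMENDATIONS = {
    "code_generation": ("gpt-4-turbo", "claude-3-opus"),
    "long_documents": ("gemini-1.5-pro", "claude-3-opus"),
    "creative_writing": ("claude-3-opus", "gpt-4"),
    "customer_support": ("gpt-3.5-turbo", "claude-3-haiku"),
    "complex_reasoning": ("claude-3-opus", "gpt-4"),
    "privacy_sensitive": ("llama-2-70b", "mistral-large"),
    "budget": ("claude-3-haiku", "gpt-3.5-turbo"),
}

def recommend(use_case: str) -> str:
    """Return the table's first choice, falling back to a general-purpose default."""
    best, _alternative = RECOMMENDATIONS.get(use_case, ("gpt-4-turbo", None))
    return best
```

Keeping the alternative alongside the best choice makes it easy to add failover: if the first model errors or times out, retry with the second.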

API Access & Platforms

| Model Family | Chat Interface | API | Integrations |
|---|---|---|---|
| GPT-4 | ChatGPT, ChatGPT Plus | OpenAI API | Microsoft Copilot, many third-party |
| Claude | Claude.ai, Claude Pro | Anthropic API, AWS Bedrock | Notion, Slack (limited) |
| Gemini | Google Gemini (formerly Bard) | Google AI Studio, Vertex AI | Google Workspace, Android |
| Llama | Various (Hugging Face, etc.) | Self-hosted, Together AI | Open source ecosystem |
| Mistral | Le Chat | Mistral API, Azure | Growing ecosystem |
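Most of these APIs accept a broadly similar request shape: a model name plus a list of role-tagged messages. The sketch below builds an OpenAI-style chat-completions payload; the model name is illustrative, and the details differ per provider (Anthropic's Messages API takes the system prompt as a top-level `system` field, and Gemini uses `contents`/`parts`), so treat this as the general pattern rather than any one vendor's schema.

```python
import json

def build_chat_request(model: str, system: str, user: str, max_tokens: int = 1024) -> str:
    """Serialize an OpenAI-style chat payload as JSON.

    Other providers differ in detail: Anthropic takes `system` as a
    top-level field instead of a message role; Gemini nests text under
    `contents`/`parts`. Always check the provider's API reference.
    """
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }
    return json.dumps(payload)

req = build_chat_request("gpt-4-turbo", "You are concise.", "Compare Opus and GPT-4.")
```

Because the message-list shape is so widely copied, a thin wrapper like this is often all you need to swap providers behind a common interface.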

Safety & Alignment

GPT-4: RLHF + rule-based moderation
Reinforcement learning from human feedback, plus moderation filters on inputs and outputs
Claude: Constitutional AI
Training guided by an explicit set of written principles (RLAIF); tends toward caution
Gemini: Multiple safety filters
Adjustable safety settings, integrated with Google's safety tooling
Llama: Community-driven
Base model with minimal built-in guardrails; safety depends on the deployer's implementation

Benchmark Scores (Approximations)

| Benchmark | GPT-4 | Claude 3 Opus | Gemini Ultra |
|---|---|---|---|
| MMLU (General Knowledge) | 86.4% | 86.8% | 90.0% |
| HumanEval (Coding) | 67.0% | 84.9% | 74.4% |
| MATH (Problem Solving) | 52.9% | 60.1% | 53.2% |
| GSM8K (Grade School Math) | 92.0% | 95.0% | 94.4% |
| TruthfulQA (Truthfulness) | ~60% | ~68% | ~64% |

Training Data Knowledge Cutoff

GPT-4 Turbo: April 2023 (some versions have later cutoffs)
Claude 3: August 2023 (most recent training data of the group)
Gemini: April 2023 (can ground answers with real-time Google Search)
Llama 2: July 2023 (open source; knowledge is static once deployed)
Perplexity: real-time (always current via live web search)
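A practical use of these cutoffs is deciding when a query needs search grounding rather than the model's parametric knowledge. Below is a minimal sketch; the cutoff dates are the approximate values from the list above (pinned to month-end for simplicity) and the model keys are illustrative.

```python
from datetime import date

# Approximate training cutoffs from the list above, pinned to month-end.
# Verify against each provider's documentation before relying on these.
CUTOFFS = {
    "gpt-4-turbo": date(2023, 4, 30),
    "claude-3": date(2023, 8, 31),
    "gemini-1.5-pro": date(2023, 4, 30),
    "llama-2": date(2023, 7, 31),
}

def needs_live_search(model: str, topic_date: date) -> bool:
    """True if the topic postdates the model's cutoff (or the cutoff is unknown)."""
    cutoff = CUTOFFS.get(model)
    return cutoff is None or topic_date > cutoff
```

For anything past the cutoff, route to a search-grounded option (Perplexity, or Gemini with search) instead of trusting the model's memory.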

Model Selection Checklist

Context length needed? Gemini 1.5 Pro (1M) > Claude 3 (200K) > GPT-4 Turbo (128K)
Budget constraints? Claude 3 Haiku or GPT-3.5 for cost; Opus for quality
Speed requirements? GPT-3.5, Claude 3 Haiku, and Gemini Flash are the fastest
Privacy/compliance needs? Consider self-hosted Llama or Mistral
Multimodal (images)? GPT-4V, Claude 3, and Gemini all support vision
Real-time information? Use Perplexity, or Gemini with search grounding
Creative tasks? Claude 3 Opus excels at nuanced writing
Code generation? GPT-4 and Claude 3 Opus are both excellent
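The checklist above can be read as a priority-ordered decision procedure: hard constraints (real-time data, self-hosting, context length) first, then preferences (budget, vision). This sketch encodes one such ordering; the priorities and model names are illustrative choices, not the only reasonable ones.

```python
def pick_model(
    context_tokens: int,
    needs_vision: bool = False,
    needs_realtime: bool = False,
    self_host_required: bool = False,
    budget_sensitive: bool = False,
) -> str:
    """Walk the selection checklist in priority order (illustrative)."""
    if needs_realtime:
        return "perplexity"          # live web search with citations
    if self_host_required:
        return "llama-2-70b"         # full data control
    if context_tokens > 200_000:
        return "gemini-1.5-pro"      # only listed model past 200K
    if context_tokens > 128_000:
        return "claude-3-opus"       # 200K window
    if budget_sensitive:
        return "claude-3-haiku"      # cheapest listed hosted option
    if needs_vision:
        return "gpt-4-vision"        # Claude 3 / Gemini also support vision
    return "gpt-4-turbo"             # general-purpose default
```

Putting hard constraints before preferences means a 500K-token job never gets routed to a model that cannot hold it, regardless of budget flags.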

Emerging Models to Watch

GPT-5 (OpenAI): expected major upgrade; release date TBD
Llama 3 (Meta): next open-source iteration
Grok (xAI): Elon Musk's AI with real-time X integration
Inflection Pi: personal AI assistant focus
Cohere Command: enterprise-focused, with RAG capabilities
💡 Pro Tip: Don't rely on a single model! Use GPT-4 for quick tasks and plugins, Claude 3 Opus for complex analysis and writing, and Gemini for huge documents. For production apps, test multiple models on your specific use case before committing. Context length and pricing often matter more than benchmark scores!