LLM Comparison Cheat Sheet

Last Updated: November 21, 2025

Major Models Overview

| Model | Developer | Latest Version | Context Window | Best For |
|---|---|---|---|---|
| GPT-4 Turbo | OpenAI | GPT-4 Turbo (Nov 2023+) | 128K tokens | General tasks, creative writing, code |
| Claude 3 Opus | Anthropic | Claude 3 Opus (March 2024) | 200K tokens | Complex analysis, long documents, coding |
| Claude 3 Sonnet | Anthropic | Claude 3 Sonnet (March 2024) | 200K tokens | Balanced speed/intelligence |
| Gemini | Google | Gemini 1.5 Pro (Feb 2024) | 1M tokens (1.5 Pro); 32K (1.0 Ultra) | Multimodal, long context, analysis |
| Llama 2 70B | Meta | Llama 2 (July 2023) | 4K tokens | Open source, self-hosting |
| Mistral Large | Mistral AI | Mistral Large (Feb 2024) | 32K tokens | European alternative, efficiency |
| Perplexity | Perplexity AI | Multiple models | Varies | Real-time search, citations |

Capability Comparison

| Capability | GPT-4 | Claude 3 Opus | Gemini Ultra |
|---|---|---|---|
| Coding | Excellent | Excellent | Very Good |
| Math/Reasoning | Excellent | Excellent | Excellent |
| Creative Writing | Excellent | Outstanding | Very Good |
| Analysis | Very Good | Outstanding | Excellent |
| Following Instructions | Very Good | Excellent | Good |
| Image Understanding | Very Good | Excellent | Excellent |
| Multilingual | Very Good | Good | Excellent |
| Speed | Fast | Medium (Sonnet: Fast) | Fast |
| Honesty/Accuracy | Good | Excellent | Good |
| Safety/Refusals | Moderate | Conservative | Moderate |

Pricing Comparison (per 1M tokens)

| Model | Input Cost | Output Cost | Free Tier |
|---|---|---|---|
| GPT-4 Turbo | $10 | $30 | Limited via ChatGPT free |
| GPT-3.5 Turbo | $0.50 | $1.50 | Unlimited via ChatGPT |
| Claude 3 Opus | $15 | $75 | Limited free messages |
| Claude 3 Sonnet | $3 | $15 | Available |
| Claude 3 Haiku | $0.25 | $1.25 | API only |
| Gemini 1.5 Pro | $3.50 (≤128K), $7 (>128K) | $10.50 (≤128K), $21 (>128K) | 60 requests/min free |
| Mistral Large | $8 | $24 | Trial credits |
| Llama 2 70B | Free (self-host) | Free (self-host) | Open source |
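To see what these rates mean per request, it helps to price a concrete prompt/response pair. Below is a minimal sketch using the per-1M-token prices from the table above; the prices are snapshots and the model keys are illustrative, so check each provider's current pricing page before relying on the numbers.

```python
# Per-1M-token prices from the table above: (input $, output $).
# Snapshot values; verify against each provider's pricing page.
PRICES = {
    "gpt-4-turbo": (10.00, 30.00),
    "gpt-3.5-turbo": (0.50, 1.50),
    "claude-3-opus": (15.00, 75.00),
    "claude-3-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "mistral-large": (8.00, 24.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price

# Example: a 10K-token prompt with a 1K-token answer.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 10_000, 1_000):.4f}")
```

Note how output pricing dominates for Opus: the same request costs roughly 35x more on Opus than on Haiku, which is why the cheap tiers are the default for high-volume workloads.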

Strengths & Weaknesses

| Model | Key Strengths | Notable Weaknesses |
|---|---|---|
| GPT-4 | Versatile, plugin ecosystem, large community | Can be verbose, occasional hallucinations |
| Claude 3 Opus | Nuanced understanding, long context, thoughtful | Slower, more expensive, sometimes over-cautious |
| Claude 3 Sonnet | Fast, good balance, affordable | Less capable than Opus for complex tasks |
| Gemini | Massive context, multimodal, integrated with Google | Newer, less polished, availability limited |
| Llama 2 | Open source, customizable, privacy | Requires infrastructure, less capable |
| Mistral | European data residency, efficient | Smaller ecosystem, newer platform |

Use Case Recommendations

| Use Case | Best Choice | Alternative | Reasoning |
|---|---|---|---|
| Code Generation | GPT-4 Turbo | Claude 3 Opus | Strong coding capabilities, wide language support |
| Long Document Analysis | Gemini 1.5 Pro | Claude 3 Opus | 1M token context, excellent comprehension |
| Creative Writing | Claude 3 Opus | GPT-4 | Nuanced, natural prose, character depth |
| Research & Citations | Perplexity | Gemini (Google Search) | Real-time info, source citations |
| Customer Support Chatbot | GPT-3.5 Turbo | Claude 3 Haiku | Cost-effective, fast responses |
| Complex Reasoning | Claude 3 Opus | GPT-4 | Superior analytical capabilities |
| Privacy-Sensitive Work | Llama 2 (self-hosted) | Mistral (European) | Data control, compliance |
| Multimodal Tasks | Gemini Ultra | GPT-4 Vision | Native multimodal architecture |
| Budget Projects | Claude 3 Haiku | GPT-3.5 | Low cost, decent performance |
| Translation | Gemini | GPT-4 | Multilingual strength |
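If you route requests programmatically, the table above maps naturally onto a lookup with a fallback. This is a sketch, not a prescription: the use-case keys and model identifiers are illustrative, and you should substitute whatever aliases your stack actually uses.

```python
# The use-case table above, encoded as (best choice, alternative) pairs.
# Keys and model names are illustrative placeholders.
RECOMMENDATIONS = {
    "code_generation": ("gpt-4-turbo", "claude-3-opus"),
    "long_documents": ("gemini-1.5-pro", "claude-3-opus"),
    "creative_writing": ("claude-3-opus", "gpt-4"),
    "customer_support": ("gpt-3.5-turbo", "claude-3-haiku"),
    "complex_reasoning": ("claude-3-opus", "gpt-4"),
    "privacy_sensitive": ("llama-2-70b", "mistral-large"),
    "budget": ("claude-3-haiku", "gpt-3.5-turbo"),
}

def recommend(use_case: str) -> str:
    """Return the table's first choice, falling back to a general-purpose default."""
    best, _alternative = RECOMMENDATIONS.get(use_case, ("gpt-4-turbo", None))
    return best
```

Keeping the alternative alongside the best choice makes it easy to add failover: if the first model errors or times out, retry with the second.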

API Access & Platforms

| Model Family | Chat Interface | API | Integrations |
|---|---|---|---|
| GPT-4 | ChatGPT, ChatGPT Plus | OpenAI API | Microsoft Copilot, many third-party |
| Claude | Claude.ai, Claude Pro | Anthropic API, AWS Bedrock | Notion, Slack (limited) |
| Gemini | Google Gemini (formerly Bard) | Google AI Studio, Vertex AI | Google Workspace, Android |
| Llama | Various (Hugging Face, etc.) | Self-hosted, Together AI | Open source ecosystem |
| Mistral | Le Chat | Mistral API, Azure | Growing ecosystem |
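Most of these APIs accept a broadly similar request shape: a model name plus a list of role-tagged messages. The sketch below builds an OpenAI-style chat-completions payload; the model name is illustrative, and the details differ per provider (Anthropic's Messages API takes the system prompt as a top-level `system` field, and Gemini uses `contents`/`parts`), so treat this as the general pattern rather than any one vendor's schema.

```python
import json

def build_chat_request(model: str, system: str, user: str, max_tokens: int = 1024) -> str:
    """Serialize an OpenAI-style chat payload as JSON.

    Other providers differ in detail: Anthropic takes `system` as a
    top-level field instead of a message role; Gemini nests text under
    `contents`/`parts`. Always check the provider's API reference.
    """
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }
    return json.dumps(payload)

req = build_chat_request("gpt-4-turbo", "You are concise.", "Compare Opus and GPT-4.")
```

Because the message-list shape is so widely copied, a thin wrapper like this is often all you need to swap providers behind a common interface.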

Safety & Alignment

GPT-4: RLHF + rule-based moderation
Reinforcement learning from human feedback, plus moderation filters on inputs and outputs
Claude: Constitutional AI
Training guided by an explicit set of written principles (RLAIF); tends toward caution
Gemini: Multiple safety filters
Adjustable safety settings, integrated with Google's safety tooling
Llama: Community-driven
Base model with minimal built-in guardrails; safety depends on the deployer's implementation

Benchmark Scores (Approximations)

| Benchmark | GPT-4 | Claude 3 Opus | Gemini Ultra |
|---|---|---|---|
| MMLU (General Knowledge) | 86.4% | 86.8% | 90.0% |
| HumanEval (Coding) | 67.0% | 84.9% | 74.4% |
| MATH (Problem Solving) | 52.9% | 60.1% | 53.2% |
| GSM8K (Grade School Math) | 92.0% | 95.0% | 94.4% |
| TruthfulQA (Truthfulness) | ~60% | ~68% | ~64% |

Training Data Knowledge Cutoff

GPT-4 Turbo: April 2023 (some versions have later cutoffs)
Claude 3: August 2023 (most recent training data of the group)
Gemini: April 2023 (can ground answers with real-time Google Search)
Llama 2: July 2023 (open source; knowledge is static once deployed)
Perplexity: real-time (always current via live web search)
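A practical use of these cutoffs is deciding when a query needs search grounding rather than the model's parametric knowledge. Below is a minimal sketch; the cutoff dates are the approximate values from the list above (pinned to month-end for simplicity) and the model keys are illustrative.

```python
from datetime import date

# Approximate training cutoffs from the list above, pinned to month-end.
# Verify against each provider's documentation before relying on these.
CUTOFFS = {
    "gpt-4-turbo": date(2023, 4, 30),
    "claude-3": date(2023, 8, 31),
    "gemini-1.5-pro": date(2023, 4, 30),
    "llama-2": date(2023, 7, 31),
}

def needs_live_search(model: str, topic_date: date) -> bool:
    """True if the topic postdates the model's cutoff (or the cutoff is unknown)."""
    cutoff = CUTOFFS.get(model)
    return cutoff is None or topic_date > cutoff
```

For anything past the cutoff, route to a search-grounded option (Perplexity, or Gemini with search) instead of trusting the model's memory.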

Model Selection Checklist

Context length needed? Gemini 1.5 Pro (1M) > Claude 3 (200K) > GPT-4 Turbo (128K)
Budget constraints? Claude 3 Haiku or GPT-3.5 for cost; Opus for quality
Speed requirements? GPT-3.5, Claude 3 Haiku, and Gemini Flash are the fastest
Privacy/compliance needs? Consider self-hosted Llama or Mistral
Multimodal (images)? GPT-4V, Claude 3, and Gemini all support vision
Real-time information? Use Perplexity, or Gemini with search grounding
Creative tasks? Claude 3 Opus excels at nuanced writing
Code generation? GPT-4 and Claude 3 Opus are both excellent
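The checklist above can be read as a priority-ordered decision procedure: hard constraints (real-time data, self-hosting, context length) first, then preferences (budget, vision). This sketch encodes one such ordering; the priorities and model names are illustrative choices, not the only reasonable ones.

```python
def pick_model(
    context_tokens: int,
    needs_vision: bool = False,
    needs_realtime: bool = False,
    self_host_required: bool = False,
    budget_sensitive: bool = False,
) -> str:
    """Walk the selection checklist in priority order (illustrative)."""
    if needs_realtime:
        return "perplexity"          # live web search with citations
    if self_host_required:
        return "llama-2-70b"         # full data control
    if context_tokens > 200_000:
        return "gemini-1.5-pro"      # only listed model past 200K
    if context_tokens > 128_000:
        return "claude-3-opus"       # 200K window
    if budget_sensitive:
        return "claude-3-haiku"      # cheapest listed hosted option
    if needs_vision:
        return "gpt-4-vision"        # Claude 3 / Gemini also support vision
    return "gpt-4-turbo"             # general-purpose default
```

Putting hard constraints before preferences means a 500K-token job never gets routed to a model that cannot hold it, regardless of budget flags.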

Emerging Models to Watch

GPT-5 (OpenAI): expected major upgrade; release date TBD
Llama 3 (Meta): next open-source iteration
Grok (xAI): Elon Musk's AI with real-time X integration
Inflection Pi: personal AI assistant focus
Cohere Command: enterprise-focused, with RAG capabilities
💡 Pro Tip: Don't rely on a single model! Use GPT-4 for quick tasks and plugins, Claude 3 Opus for complex analysis and writing, and Gemini for huge documents. For production apps, test multiple models on your specific use case before committing. Context length and pricing often matter more than benchmark scores!