AI 모델 비교 도구

40개 이상의 LLM 비교: GPT-4·Claude·Gemini·Llama. 가격·컨텍스트·라이선스·월간 비용.

필터
이 워크로드 최저 비용
최고 비용
최대 컨텍스트
참고. Pricing is a snapshot and can change without notice. Figures are USD per 1,000,000 tokens. Open-weights models list reference pricing on hosted inference APIs (Together, Fireworks, Groq, Bedrock, etc.) and vary by provider. Benchmark scores are best publicly reported numbers from official model cards and vendor blog posts — they depend on the benchmark version, prompt, and evaluation setup, so treat them as a rough ranking, not an exact measurement. Always verify with the provider before committing budget or making quality-critical decisions.
직접 비교

Pick 2 to 4 models in the 그리드 보기 using the checkbox column. They appear here side by side with the winning metric highlighted, plus a radar chart comparing their published benchmarks.

No models selected yet. Open 그리드 보기, tick 2 to 4 rows, then come back.
사용 사례 선택

자주 묻는 질문

Start with three axes — quality, context size, cost — and rank them for your workload. Chatbots tolerate smaller models (Haiku 4.5, GPT-4o mini, Gemini Flash, Nova Micro). Agents and code assistants benefit from reasoning models (o3, o4-mini, Claude Sonnet 4.5 with extended thinking, DeepSeek R1, Grok 3, Qwen QwQ). Long-document pipelines need 200K+ context (Claude, Gemini 2.5 Pro, GPT-4.1 with 1M, Llama 4 Scout with 10M). Plug your real traffic volume into the cost calculator on this page to see a monthly estimate before you commit. If latency matters, test both time-to-first-token and tokens-per-second with the actual prompt length you expect in production.
Yes, with caveats. There is no LLM trained specifically for civil, structural, or construction engineering — but general frontier models with strong math, reasoning, vision and code abilities can help with: reading codes and standards (Eurocode, ACI, SNiP), sanity-checking load combinations, writing calculation scripts in Python/MATLAB/Grasshopper, extracting quantities from drawings, generating estimating spreadsheets, cross-referencing specifications, and summarising long RFIs or submittals. The 엔지니어링 & 건설 use-case tab ranks models by a composite of GPQA (hard-science reasoning), MATH, and code scores — which is the best public proxy for how well a model handles structured technical problems. Never trust a model\'s numerical answer without an independent check. LLMs fabricate plausible-looking numbers, misquote standards, and miss edge cases. Treat them as an extra pair of eyes, not a stamp-licensed engineer. For drawings, a vision-capable model (Claude 4.5, GPT-4.1, Gemini 2.5, Llama 3.2 Vision, Pixtral) can point out inconsistencies, but OCR/symbol recognition is still imperfect on dense construction drawings.
Any model tagged with the Vision capability accepts images. For reading construction drawings, extracting dimensions from scanned PDFs, or analyzing defect photos, the strongest public scores on MMMU (multimodal understanding) come from: o3 and o4-mini, Claude Sonnet/Opus 4.5, Gemini 2.5 Pro, and Llama 4 Maverick. For tighter budgets, Gemini 2.5 Flash, GPT-4.1 mini, Claude Haiku 4.5, and Nova Lite are useful workhorses. Open-weight vision picks: Llama 3.2 90B Vision and Pixtral Large. Note that dense architectural and structural drawings stress OCR hard — always verify extracted numbers and symbols against the source before using them in calculations.
No. They are best publicly reported scores from official model cards and vendor blog posts. Two models with the same MMLU often feel very different in practice, benchmarks get gamed as they age, and newer evaluations (GPQA Diamond, SWE-bench Verified, AIME) are moving targets. Use the numbers as a rough ranking, not an exact measurement. For a decision that matters, run your own small evaluation on a representative slice of your real data — that tells you more than any public leaderboard.
Closed APIs (GPT, Claude, Gemini) are the fastest route to production: zero infrastructure, frontier quality, and SLA-backed uptime. Downsides are vendor lock-in, data-residency constraints, and steep output pricing at flagship tiers. Open-weights models (Llama 3.1 and 3.3, Mistral, DeepSeek) let you self-host for predictable cost and full data control, at the price of running GPU infrastructure and handling scaling. A common middle ground is using a hosted inference provider (Together, Groq, Fireworks, Bedrock) for open-weights models — you keep the checkpoint portability without managing servers.
The context window is the hard ceiling on how much text fits in a single call: system prompt plus user messages plus model output must all stay under the limit. A 128K window holds roughly 100K English words, a 200K window about 150K, and Gemini 1.5's 1M tokens fits a short novel. If you hit the limit, the request is rejected or silently truncated, and early parts of the conversation may be lost. For RAG pipelines, long documents, or multi-turn agents, pick a model whose context comfortably exceeds your 95th percentile call size — otherwise you end up chunking and re-stitching at runtime.
Closed providers offer fine-tuning on select models (GPT-4o mini, GPT-3.5 Turbo, Gemini 1.5 Flash), typically LoRA-style adapters exposed through their API. You upload a JSONL dataset and pay per training token plus a small hosted-model surcharge. Open-weights models are fully fine-tunable — full weights, LoRA, QLoRA, DPO, any recipe you want — because you own the checkpoint. For most teams, prompt engineering and retrieval solve 90 percent of the quality gap; reach for fine-tuning only when structured outputs, tone matching, or domain vocabulary remain stubbornly wrong after the prompt is already well-crafted.
Input tokens are processed in parallel during the prefill stage: the whole prompt goes through the model once and KV cache is built. Output tokens are generated one at a time, autoregressively, with each new token depending on all previous ones — GPUs cannot parallelize this step, so throughput drops by an order of magnitude. Providers pass that latency cost through as a 3x to 5x price premium on output tokens. This is why flagship-tier Claude Opus charges $15/$75 and GPT-4 Turbo charges $10/$30 per million tokens in versus out. Prompt caching and batched requests are the main levers that bring effective cost back down.
The table reflects a snapshot taken in April 2026 and includes models released through early 2026. LLM providers adjust prices and deprecate snapshots several times a year, so you should always verify the current rate on the provider's official pricing page before signing a contract or sizing a production budget. If a model you want to compare is missing, it is likely either too new to be stable, already deprecated, or too niche for a general-purpose table. Nothing you type in the cost calculator is sent anywhere — the whole comparison runs locally in your browser.

대규모 언어 모델(LLM) 비교 도구입니다. GPT-4o·Claude 3.5·Gemini·Llama·Mistral 등 40개 이상의 모델을 비교할 수 있습니다. 프로바이더(OpenAI·Anthropic·Google·Meta 등)·라이선스(상업용/오픈소스)·최대 입력 가격·최소 컨텍스트 창으로 필터링 가능합니다. 두 모델의 속도·가격·컨텍스트를 직접 비교하는 헤드-to-헤드 모드 제공. 월간 비용 계산기: 입출력 토큰 수와 월간 호출 횟수를 입력해 월별 예상 비용을 산출합니다. 가격은 US$/1M 토큰으로 표시됩니다.