Jump to section
Last verified: May 16, 2026. Vendor pricing and benchmarks refreshed quarterly.
Author: Ross Taylor, Alameda Internet Marketing
A large language model (LLM) is a statistical text prediction system trained on a massive corpus of human-written text. It generates the next token in a sequence. It does not retrieve facts from a database. The architecture behind every major LLM in production today is the Transformer, introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al. at Google Brain. The basic unit LLMs process is a token, roughly 0.75 English words on average, or about 4 characters. This next-token prediction process is the mechanical core of what every large language model does: assign probabilities to possible continuations, then sample from that distribution. As of May 2026, context windows across the four major vendors reach 1 to 2 million tokens. Anthropic makes Claude (1 million token context), OpenAI makes GPT-5 and GPT-5.5 (1 million tokens via API), Google DeepMind makes Gemini (1 to 2 million tokens), and xAI makes Grok (2 million tokens). These are foundation models: large pretrained systems that can be adapted for many tasks without being retrained from scratch. On open-ended factual tasks, flagship reasoning models hallucinate at rates exceeding 10% per the Awesomeagents.ai April 2026 leaderboard. LLMs are well-suited to writing, summarization, text transformation, and structured extraction. They are not well-suited to factual recall without grounding, precise math, or real-time data. An LLM is the reasoning layer in an AI system; when you give it tools like search or code execution, you get something closer to an agent, which is more than an LLM.
The 30-Second Answer (for people who want it straight)
An LLM is a system trained on a large fraction of written human text to predict what comes next in any sequence of words. It generates language. It does not look things up. (Not the law degree. The AI kind.)
ChatGPT, Claude, Gemini, and Grok are all large language models at their core, though each is built by a different company using a different training approach. When you type a question into any of them, the system does not search a database for the answer. It generates the statistically likely continuation of your input, based on patterns it absorbed during training.
That distinction sounds academic. It is not. It is the one mental model that changes how you use these tools, what you trust them with, and where they will let you down. The rest of this article builds that model out.
How an LLM Actually Generates Text (no math required)
Here is what happens when you type something into Claude or ChatGPT.
The model was trained on a vast corpus of text: books, websites, code repositories, scientific papers, forum threads. During pretraining, it practiced one task billions of times: given a sequence of text, predict what token comes next. This next-token prediction process is not metaphorical. The model assigns a probability to every possible next token in its vocabulary, samples from that distribution, appends the selected token, and repeats. That loop, executed millions of times, produces a response.
A token is not a word. It is a chunk, usually a word fragment. “Tokenization” might split into “token” and “ization.” Common short words like “the” or “is” are one token each. Longer or rarer words get split into smaller pieces.
Perform that prediction task enough times across enough text, and the model’s internal parameters begin encoding something that looks like knowledge: grammar, facts, reasoning patterns, tone, context relationships. None of this was explicitly programmed. It emerged from the pretraining process.
The underlying architecture that made this scale possible is the Transformer. Every major large language model in production today is built on the transformer architecture, a design introduced in a 2017 Google research paper titled “Attention Is All You Need” by Vaswani et al. Before that paper, AI systems processed text sequentially, like reading left to right one word at a time. The Transformer introduced a mechanism that lets the model consider every token in context simultaneously, scoring how relevant each part of the input is to every other part. This is why Claude understands that “it” in a complex sentence refers to the subject three clauses ago, not the most recent noun.
After pretraining, the base model can complete text fluently but has no concept of instructions, helpfulness, or safety. It just predicts. The next stage is fine-tuning: a second round of training on curated prompt-response pairs that teaches the model to follow instructions. Then comes Reinforcement Learning from Human Feedback (RLHF), where human raters compare model outputs and rank which is better, and the model is trained toward the ones they preferred. RLHF is how ChatGPT and Claude developed their distinct personalities. The underlying transformer architecture is the same. What differs is how each model was fine-tuned and shaped by its RLHF process.
Why outputs vary even when the input doesn’t
The model does not look up an answer. It samples from a probability distribution.
Every time a large language model generates a response, it assigns probabilities to its entire vocabulary and samples from that distribution. There is a setting called temperature that controls how much randomness enters that sampling. At zero temperature, the model always picks the highest-probability token. At higher temperatures, lower-probability tokens get a chance. The result: the same prompt can yield different outputs on different runs.
This is not a bug. If I ask Claude to summarize the same set of meeting notes twice, I often get two structurally similar but differently worded summaries. That variation is a property of probabilistic generation, and for most writing tasks, it is fine. For tasks that require exact deterministic output, it is something you need to design around.
Tokens: the unit you pay for and forget about
A 75-word paragraph contains roughly 100 tokens. The standard approximation is 1 token per 0.75 English words, or about 4 characters per token for common English. That is the OpenAI published standard, and it holds roughly across all major vendors.
Tokenization matters for three reasons. First, API pricing is per token (input tokens plus output tokens). A short client email is roughly 200 to 300 tokens. A long research report might be 5,000 to 10,000 tokens. Second, context window limits are measured in tokens, not words. Third, non-English text is less token-efficient: languages with longer average words or characters use more tokens per word, which can affect cost and context headroom significantly.
What an LLM Is Not
Most people use large language models for a few months before they develop a mental model that actually holds up. The three misconceptions below cause most of the early failures.
Not a search engine
A large language model does not index web pages. It does not crawl the internet. It does not retrieve current information from anywhere. What it does is generate text that is statistically consistent with patterns learned during pretraining on its training data.
When I ask Claude who won a specific court case last month, it cannot look it up. It generates a plausible-sounding answer based on patterns from what it read during training. That answer may be wrong, and it will sound confident regardless.
Every model has a training cutoff, a date after which no new information was incorporated. But the knowledge gap is not a clean line. Coverage of events close to the cutoff is thin, because there simply was not much text written and indexed about recent events by the time training completed. Ask an LLM about something that happened in the last six months of its training window and you are already in degraded-recall territory.
Not a database
LLMs have no address book of facts. Their “knowledge” is not stored in a table you can query. It is encoded in billions of parameter weights, the numerical values inside the neural network that were tuned during pretraining and fine-tuning. These weights represent something like compressed patterns in language, not a lookup index.
This is the root cause of hallucination. The model emits the most plausible continuation given the patterns it learned during training. Sometimes the most plausible continuation is factually wrong. The model has no internal check that says “I am not sure about this.” It generates confidently regardless.
Not an “AI brain” that thinks
LLMs are extraordinarily sophisticated pattern completers. They are not reasoning the way humans reason. The appearance of reasoning is an emergent property of scale: when a model has processed enough human text that demonstrates reasoning, it learns to pattern-match that reasoning style.
Anthropic’s own circuit tracing research found that Claude does some forward planning even though it is trained purely on next-token prediction. When generating a poem, for example, the model identifies rhyme-target words and works backward. So there is some structure there. But the current scientific consensus is that these models do not have understanding or beliefs in any meaningful human sense.
One important caveat: reasoning models like Claude’s extended thinking mode, OpenAI’s o3, and Gemini’s thinking variants do something structurally different. They spend additional compute generating internal reasoning traces before producing a final answer, and on logic and math benchmarks, the performance jump is substantial. They are still probabilistic generators. But they are a distinct sub-category worth knowing about.
What LLMs Are Good At (From Someone Who Ships Work With Them)
I have been running large language models in real client workflows for two years at Alameda Internet Marketing, carefully and with guardrails the whole way. Here is where they actually perform.
Writing and prose transformation are where LLMs are most reliable. Drafting, rewriting, changing tone, converting from a bullet list to a narrative paragraph, adapting content for a different audience. These are pattern-consistent tasks: the model has seen millions of examples of “here is a rough version, here is a polished version” and it is very good at that conversion.
Summarization is a close second. Hand it a long document, a meeting transcript, or a stack of customer feedback and it will produce a usable digest. The consistency is the real value: a human doing the same task will abbreviate or weight things differently each time, while the model applies the same logic every pass.
Structured output extraction is underrated. Give a large language model a messy block of unstructured text and ask it to pull specific fields into a structured format, and it handles it well. Classifying emails by intent, tagging articles by topic, extracting addresses from written correspondence. These are tasks where statistical plausibility is exactly what you want.
Brainstorming is another genuine strength, specifically because the model does not need to be “right.” When I need five different angles for a piece of content, I do not need the model to know the truth. I need it to generate five structurally coherent options I can evaluate. That is entirely within its capability.
Code generation is worth mentioning, though it is not this audience’s primary use case: LLMs are strong at writing and explaining code, and the output is more verifiable than prose since you can run it.
The underlying pattern: large language models perform best when statistical plausibility is the goal, not factual recall.
What LLMs Cannot Do Reliably
Hallucination is not a bug. It is how generation works.
Hallucination is fluent, confident-sounding output from a large language model that is factually wrong. The model may cite a paper that does not exist, state a statistic it invented, or describe an event that never happened, all with the same confident tone it uses when it is right.
Calling hallucination a “bug” implies it could be patched. It is not a defect. It is a property of probabilistic generation. The LLM generates what is statistically likely to follow your prompt, not what is factually verified. It has no internal mechanism distinguishing “things I know” from “things that pattern-match to what I might know.” That distinction does not exist in how these systems work.
The data on this is notable. Across flagship reasoning modes for Claude Sonnet 4.x, GPT-5, Grok-4, and DeepSeek-R1, hallucination rates exceed 10% on open-ended factual tasks per the Awesomeagents.ai April 2026 leaderboard. Grok-4 in fast-reasoning mode recorded a 20.2% hallucination rate in Vectara data. These are not small numbers.
Importantly, the task type changes the rate dramatically. Gemini 2.0 Flash hit 0.7% on a document summarization benchmark from Vectara, because it was working from a document you provided, not generating from memory. Constrained tasks with grounding context are far more reliable than open-ended factual recall.
The practical takeaway: do not use a large language model as an authoritative factual source without grounding it in real documents. The standard fix is Retrieval-Augmented Generation (RAG), where the model is given actual retrieved documents before generating, so it is pattern-completing against real content rather than training memory. For the full breakdown of why this happens and how to reduce it in practice, why all LLMs hallucinate covers the mechanics in depth.
Beyond hallucination, a few other hard limitations:
Precise arithmetic without a code interpreter is unreliable. LLMs generate what a calculation should look like, not necessarily the correct answer. Multi-step math fails more often than it should.
Real-time data requires a search tool. Without one, the model’s knowledge stops at its training cutoff.
Persistent memory does not exist across sessions. Each conversation starts fresh. The model has no memory of your previous interactions unless that history is explicitly included in the context window.
Source attribution without RAG is not trustworthy. LLMs can generate plausible-sounding citations that are entirely fabricated. This one surprises people who are used to research tools.
Context Windows and Tokens: The Two Numbers That Actually Matter
The context window is the maximum number of tokens the model can process in one interaction: everything you send in plus everything it generates in return. It is the model’s working memory. Anything outside the context window is invisible. The model cannot recall, infer, or access it.
As of May 2026:
| Model | Vendor | Context Window |
|---|---|---|
| Claude | Anthropic | 1 million tokens |
| GPT-5.5 | OpenAI | 1 million tokens (API) |
| Gemini 2.5 / 3 Pro | Google DeepMind | 1 to 2 million tokens |
| Grok 4.x | xAI | 2 million tokens |
One million tokens is roughly 750,000 words, or about 500 average non-fiction books. In practice, most conversations fit inside 4,000 to 32,000 tokens. The million-plus windows matter for specific use cases: reviewing an entire legal contract set, analyzing a full codebase, or summarizing a book-length research document.
One thing worth knowing: large language models attend less reliably to content buried deep in the middle of very long contexts. This is sometimes called the “lost in the middle” problem. If you are passing a large document to a model, front-load the parts that matter most.
When a conversation gets long and the model seems to forget what you said earlier, it has not. The information is still inside the context window. But as you push against the limit, the model’s behavior can become inconsistent. The fix is to start a new conversation with a condensed summary of the essential context.
Token count also determines your API bill. Short conversational prompts come in around 200-300 tokens. Lengthy research documents push 5,000-10,000 tokens. These numbers matter when you are building workflows that run hundreds or thousands of LLM calls per month.
What This Changes About Running an Agency
Understanding what a large language model actually is changed how we built our workflows at Alameda Internet Marketing, specifically because I stopped expecting deterministic, database-style output from a probabilistic generator.
The concrete shift: I use LLMs as the transformation layer, not the source of truth. Research notes become drafts. Client briefs become outlines. Transcripts become action-item summaries. In every case, the model is working with material I give it, not generating from training memory. That is the configuration where probabilistic generation performs reliably.
The workflow that has had the most operational impact: we route client call transcripts through Claude for action item extraction before the account manager summary goes out. It runs in seconds and costs a few cents per call. The consistency is the value. Humans doing this manually abbreviate differently, miss items differently, and prioritize differently across calls. The model applies the same logic every time.
Understanding that LLM outputs are probabilistic, not deterministic, changed how I structured our review process. The output needs a human pass not because the model is unreliable in some vague sense, but because it is generating, not verifying. The review step is there to catch the cases where the statistically plausible output happens to be wrong.
The other thing that shifted: I stopped treating each model as interchangeable. Claude, GPT, and Gemini have different personalities, different failure modes, and different strengths because they went through different fine-tuning and RLHF pipelines. OpenAI, Anthropic, Google DeepMind, and xAI have each made distinct choices at those stages. Choosing the right model for a specific workflow is a real decision.
An LLM is the reasoning layer in these systems. When you give it tools (search, a calculator, the ability to call an API), it becomes something closer to what people call an agent. An agent is more than an LLM, and that distinction matters when you are designing workflows that need to act in the world, not just generate text.
Which LLM Should You Actually Use?
If you have read this far and your main question is “okay, but which one should I actually use,” that is a different conversation, and it is the right one to have once this mental model is in place.
The four vendors producing frontier models right now are OpenAI (GPT-5, GPT-5.5, o3, o4), Anthropic (Claude), Google DeepMind (Gemini), and xAI (Grok). Each has real differences in capability, cost, context window, and use-case fit. The full breakdown of ChatGPT vs Claude vs Gemini vs Grok is where that conversation lives.
What I’d Tell a New Client About LLMs in One Paragraph
When a new client asks me what an LLM is, here is roughly what I say.
Imagine an extraordinarily fluent text predictor that has been shaped by reading nearly everything humans have written. That is the practical core of it. Its strengths are language work: rewriting, summarizing, transforming, brainstorming, drafting. Its weakness is anything that requires recalling a specific fact accurately, because there is no “fact” in there to recall, just patterns and probabilities. Treat it as a transformation engine, not a knowledge oracle. Hand it the source material and ask it to do something with that material. Avoid asking it to remember things, especially names, numbers, recent events, or anything where being wrong has a cost. When you stay inside the “work with what I give you” lane, the output quality is high and predictable. The moment you step outside that lane, you have to verify everything before trusting it.
If this was useful, there are more pieces like this in the newsletter. [Subscribe here to get them.]
Frequently Asked Questions
What is LLM in simple words?
Think of an LLM as a sophisticated autocomplete that has absorbed most of the written internet. When you give it a prompt, it produces the most plausible continuation based on patterns it learned during training. It does not look anything up. It is not retrieving an answer that exists in storage. The output is generated on the fly, token by token, and it is shaped entirely by what would statistically fit best after your input.
What is the difference between LLM and AI?
AI is the broad category. A large language model is one specific type of AI. The hierarchy runs: AI, then machine learning, then deep learning, then neural network, then large language model. An LLM is a large, text-trained neural network optimized for next-token prediction. Other forms of AI include image classifiers, reinforcement learning systems, and recommender engines. None of those are LLMs.
What is the difference between GPT and LLM?
LLM is the category. GPT (Generative Pre-trained Transformer) is a specific model family made by OpenAI. GPT is an LLM. Claude is an LLM. Gemini is an LLM. The analogy: LLM is like “car.” GPT is like “Honda.” All GPT models are large language models, but not all large language models are GPT. There are also four broad architectural types: decoder-only models like GPT, Claude, and Grok; encoder-only models like BERT; encoder-decoder models like T5; and multimodal models like GPT-4o and Gemini that process images alongside text.
Is ChatGPT an LLM?
ChatGPT is the interface. The large language model powering it is GPT-4o or GPT-5, depending on which version you are running. ChatGPT is the product; the model underneath is the LLM. So: yes, ChatGPT is built on an LLM, but the two terms refer to different things. ChatGPT is what you interact with. GPT is the underlying text prediction system built on the transformer architecture.
How do LLMs hallucinate?
The model has no internal check that says “I do not know this.” It generates the statistically likely next token regardless of whether the output is factually grounded. There is no database of verified facts it consults. When the training data happens to contain many plausible-sounding but incorrect associations about a topic, the large language model generates those associations confidently. The fix is grounding: give the model a document to reference, and it is completing patterns against real retrieved content rather than training memory. That is why hallucination rates drop dramatically on document summarization tasks compared to open-ended factual recall. Task-dependent rates can range from under 1% (constrained summarization with a reference document) to above 20% (open-ended factual questions without grounding).
What is a context window?
The context window is the maximum amount of text, measured in tokens, that a large language model can process in one interaction. Everything you send in plus everything it generates counts toward this limit. Think of it as the model’s working memory for that conversation. Anything outside the window is invisible to it. Current windows across the four major vendors (Anthropic, OpenAI, Google DeepMind, and xAI) reach 1 to 2 million tokens, which is roughly 750,000 words or about 500 non-fiction books. For most conversations you will never hit the limit. For long-document work, legal review, or codebase analysis, the context window size becomes a real design constraint. One signal that you are approaching the limit: the model starts behaving inconsistently with instructions you gave earlier in the conversation.