RAG Explained -- Making AI Read Only the 3 Relevant Pages, Not the Entire Book

Key takeaway: You have a thousand-page product manual you want AI to read, but AI has limited memory (context window), reads slowly (inference speed), and is expensive to run (token costs). RAG's solution is dead simple -- first chop the manual into fragments, then when a user asks a question, pick only the 3 most relevant fragments for AI to read. It's like hiring: you don't interview every applicant. You screen resumes to pick 10, then interview to select 3. That's all RAG is.

1. What Is RAG?

RAG = Retrieval-Augmented Generation

Breaking it down: first "Retrieve" relevant content from a database, use that content to "Augment" the large model's knowledge, then let the model "Generate" an answer. The sequence is retrieve first, then generate -- hence "Retrieval Augmented Generation."

Use cases: Enterprise intelligent customer service, internal knowledge base Q&A, product manual queries -- any scenario where AI needs to answer questions based on "your data."

Why Not Just Feed the Entire Document to the Model?

Problem	Cause	Consequence
Won't fit	Every model has a context window size limit	Reads the end, forgets the beginning; accuracy can't be guaranteed
Too expensive	More input tokens = higher costs	Every answer carries a thick manual along; imagine the bill
Too slow	More input = longer processing time	Inference speed severely impacted

RAG's solution: Don't feed the entire document to the model. Instead, pick out only "the fragments relevant to the user's question" and feed those. A thousand pages become 3 pages, which eases all three problems at once.
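
To put rough numbers on the "too expensive" row, here is a back-of-the-envelope sketch. The tokens-per-page figure and the price are made-up assumptions for illustration, not real pricing from any provider.

```python
# All numbers below are illustrative assumptions, not real pricing.
TOKENS_PER_PAGE = 500            # assumed average tokens on one manual page
PRICE_PER_MILLION_TOKENS = 1.0   # assumed input price in dollars


def input_cost(pages: int) -> float:
    """Dollar cost of sending `pages` pages as input for one question."""
    tokens = pages * TOKENS_PER_PAGE
    return tokens * PRICE_PER_MILLION_TOKENS / 1_000_000


full_manual = input_cost(1000)   # the whole manual rides along on every question
rag_context = input_cost(3)      # only the 3 relevant pages are sent

print(f"full manual: ${full_manual:.4f} per question")
print(f"RAG context: ${rag_context:.4f} per question")
print(f"reduction:   about {full_manual / rag_context:.0f}x")
```

Whatever the real price is, the ratio stays the same: 1,000 pages versus 3 pages is roughly a 333x difference in input tokens.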

2. The Complete RAG Pipeline: Five Steps

Step 1: Chunking (pre-query preparation)

Split the entire document into multiple small fragments. You can chunk by word count (e.g., 1,000 words per chunk), paragraphs, chapters, or pages. Regardless of method, the goal is to turn one large document into many small fragments.
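
A minimal sketch of the simplest option, chunking by word count. The function name `chunk_by_words` is hypothetical, and splitting on whitespace is an assumption; real pipelines often count tokens instead of words.

```python
def chunk_by_words(text: str, words_per_chunk: int = 1000) -> list[str]:
    """Split text into consecutive chunks of at most `words_per_chunk` words."""
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


manual = "word " * 2500   # stand-in for a long document (2,500 words)
chunks = chunk_by_words(manual, words_per_chunk=1000)
print(len(chunks))        # 3 chunks: 1000 + 1000 + 500 words
```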

Step 2: Indexing (pre-query preparation)

Use an embedding model to convert each fragment's text into a vector (a set of numbers), then store the "original text + corresponding vector" together in a vector database.

Three key concepts:

Vector: A set of numbers representing the meaning of text. Embedding vectors typically have hundreds to thousands of dimensions, and more dimensions can capture richer semantic information. Texts with similar meanings have similar vectors.
Embedding: The process of converting text into vectors. Done by specialized embedding models (not chat models such as ChatGPT or DeepSeek, but models specifically designed for vector conversion). See the MTEB leaderboard for model selection.
Vector database: A database optimized specifically for storing and querying vectors, providing vector similarity computation. Conceptually, a table with at least two columns: original text + corresponding vector.
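
The indexing step can be sketched in a few lines. Note the big assumption: `toy_embed` is a hypothetical stand-in that hashes words into a small vector, so it only reflects word overlap, not meaning. A real system would call an actual embedding model and store the rows in a vector database rather than a Python list.

```python
import math
import zlib

DIM = 64  # toy dimension; real embedding models use hundreds to thousands


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words,
    scaled to unit length. It captures word overlap, not real meaning."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def build_index(fragments: list[str]) -> list[dict]:
    """The minimal 'table': one row per fragment, holding text + vector."""
    return [{"text": f, "vector": toy_embed(f)} for f in fragments]


index = build_index([
    "Press the power button for three seconds to reset the device.",
    "The warranty covers manufacturing defects for two years.",
])
print(len(index), len(index[0]["vector"]))  # 2 rows, 64 numbers per vector
```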

Step 3: Retrieval (post-query execution)

User asks a question -> embedding model converts the question into a vector -> vector database computes similarity between the question vector and all fragment vectors -> returns the 10 most similar fragments.

Similarity computation methods:

Cosine similarity: Measures the angle between two vectors. Smaller angle = higher similarity.
Euclidean distance: Measures the straight-line distance between two vectors. Shorter distance = higher similarity.
Dot product: Considers both direction and magnitude. Larger product = higher similarity.
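
The three measures above, written out as a small sketch on hand-picked 2-D vectors (the vectors are made up so the arithmetic is easy to check by hand):

```python
import math


def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: direction only."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
same_direction_longer = [3.0, 0.0]   # same angle as query, 3x the length
perpendicular = [0.0, 1.0]           # 90 degrees away from query

print(cosine_similarity(query, same_direction_longer))   # 1.0 (same direction)
print(euclidean_distance(query, same_direction_longer))  # 2.0 (lengths differ)
print(dot(query, same_direction_longer))                 # 3.0 (grows with length)
print(cosine_similarity(query, perpendicular))           # 0.0 (unrelated direction)
```

When every vector is scaled to length 1, the three measures produce the same ranking, because the squared Euclidean distance then equals 2 minus twice the dot product.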

Retrieval characteristics: low cost, fast speed, but relatively lower accuracy. Ideal for quickly rough-filtering from thousands of fragments.
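
A rough sketch of the retrieval step as a brute-force scan. The 2-D vectors are hand-made stand-ins for real embeddings, and the `retrieve` helper is hypothetical; a production vector database would typically use an approximate nearest-neighbor index instead of comparing against every row.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, index, k=10):
    """Return the k rows most similar to the query vector, best first."""
    return heapq.nlargest(
        k, index, key=lambda row: cosine_similarity(query_vector, row["vector"])
    )


# Hand-made 2-D vectors stand in for real embeddings.
index = [
    {"text": "reset instructions", "vector": [0.9, 0.1]},
    {"text": "warranty terms", "vector": [0.1, 0.9]},
    {"text": "factory reset warning", "vector": [0.8, 0.3]},
]

top = retrieve([1.0, 0.0], index, k=2)
print([row["text"] for row in top])
# ['reset instructions', 'factory reset warning']
```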

Step 4: Reranking (post-query execution)

From the 10 retrieved fragments, use a more precise reranking model to select the 3 most relevant ones.

Reranking characteristics: higher cost, slower speed, but much higher accuracy. Ideal for fine-grained selection.

Why not directly pick 3? It's like hiring -- retrieval = resume screening (quickly picking 10 from thousands), reranking = interviews (carefully picking 3 from 10). Interviewing everyone is impractical; picking 3 from resumes alone isn't accurate enough. The two-stage approach is optimal.
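
The "interview" stage can be sketched as follows. The scoring function here is a deliberately crude assumption (share of question words found in the fragment) standing in for a real reranking model, which would read the question and the fragment together; the names `toy_rerank_score` and `rerank` are hypothetical.

```python
def toy_rerank_score(question: str, fragment: str) -> float:
    """Hypothetical stand-in for a reranking model: the share of question
    words that also appear in the fragment."""
    q_words = set(question.lower().split())
    f_words = set(fragment.lower().split())
    return len(q_words & f_words) / len(q_words) if q_words else 0.0


def rerank(question: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Score every candidate against the question and keep the top_n."""
    ranked = sorted(
        candidates,
        key=lambda frag: toy_rerank_score(question, frag),
        reverse=True,
    )
    return ranked[:top_n]


# Pretend these came back from the retrieval stage.
candidates = [
    "how to reset the device to factory settings",
    "warranty terms and conditions",
    "reset the wifi password",
    "cleaning the screen",
]

best = rerank("how to reset the device", candidates, top_n=2)
print(best)
# ['how to reset the device to factory settings', 'reset the wifi password']
```

The point of the design is that the expensive scorer only ever sees the short list, never the whole knowledge base.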

Step 5: Generation (post-query execution)

Send the user's question + the 3 most relevant fragments together to the large model (e.g., ChatGPT, DeepSeek). The model generates the final answer based on the fragment content.
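
"Send together" in practice means assembling one prompt. Here is a minimal sketch; the prompt wording and the `build_prompt` helper are assumptions for illustration, and the actual call to a model is left out because it depends on the provider.

```python
def build_prompt(question: str, fragments: list[str]) -> str:
    """Combine the question and the selected fragments into one prompt."""
    context = "\n\n".join(
        f"[Fragment {i}]\n{text}" for i, text in enumerate(fragments, start=1)
    )
    return (
        "Answer the question using only the fragments below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How do I reset the device?",
    [
        "Hold the power button for ten seconds.",
        "The reset erases saved settings.",
    ],
)
print(prompt)
```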

3. First Principles Analysis

The Fundamental Contradiction RAG Solves

Large models have a fundamental contradiction: they need knowledge to answer questions, but stuffing all of that knowledge into the context window actually makes them less accurate, slower, and more expensive. It's like a person -- give them one page and they can answer precisely; give them an entire library and they can't find the answer.

RAG is essentially an information compression funnel: thousands of pages -> 10 fragments (retrieval) -> 3 fragments (reranking) -> 1 answer (generation). Each step compresses the information volume while increasing relevance.

The Core Insight of Embeddings: Capturing Meaning with Numbers

The elegance of embeddings lies in converting human language "meaning" into "position" in mathematical space. Texts with similar meanings are close in mathematical space. This means "Bob likes eating watermelon" and "Bob loves eating watermelon" will have very similar vectors, while "The weather is great" will be far away in mathematical space.

Semantic search goes beyond keyword search. Traditional search relies on literal matching ("watermelon" matches "watermelon"). Embedding-based search relies on semantic matching, so a query about "fruit" can also match "watermelon."

The Design Logic of Retrieval + Reranking

This is a classic funnel design pattern: the first layer uses cheap, fast, but coarse methods for broad filtering; the second layer uses expensive, precise methods for fine selection. This pattern appears everywhere -- Google Search (inverted index for fast retrieval, then ranking algorithms for precision sorting), e-commerce recommendations (collaborative filtering for candidate retrieval, then deep learning models for precision ranking), even hiring (resume screening -> interviews).

The essence is a cost-precision tradeoff: If precision requirements are low, one layer suffices. If cost isn't a concern, use the most precise method on all data directly. But in reality, both are constrained, so the multi-layer funnel is optimal.
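
The tradeoff is easy to see with toy arithmetic. The cost units below are invented assumptions (a vector comparison costs 1 unit, a reranker call costs 1,000), chosen only to show the shape of the saving.

```python
def one_stage_cost(n_fragments: int, precise_cost: int) -> int:
    """Run the precise (expensive) method on every fragment."""
    return n_fragments * precise_cost


def two_stage_cost(n_fragments: int, cheap_cost: int,
                   shortlist: int, precise_cost: int) -> int:
    """Cheap pass over everything, precise pass over the shortlist only."""
    return n_fragments * cheap_cost + shortlist * precise_cost


N = 10_000       # fragments in the knowledge base (assumed)
CHEAP = 1        # assumed cost units per vector comparison
PRECISE = 1000   # assumed cost units per reranker call

precise_everywhere = one_stage_cost(N, PRECISE)
funnel = two_stage_cost(N, CHEAP, shortlist=10, precise_cost=PRECISE)

print(precise_everywhere)            # 10000000
print(funnel)                        # 20000
print(precise_everywhere // funnel)  # 500
```

Under these assumptions the funnel is 500 times cheaper, at the price of possibly missing a relevant fragment that the cheap first pass ranked outside the top 10.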

4. Historical Parallels

Three Kingdoms Era: Zhuge Liang's "Human RAG" -- The Longzhong Plan

When Zhuge Liang (the legendary strategist of ancient China's Three Kingdoms period, ~200 AD) was farming in Nanyang, he hadn't read every piece of intelligence in the world. His approach was to first "chunk" key information about the broader situation (the forces of Wei, Shu, and Wu; strategic geography; popular sentiment), building his own "knowledge base." When Liu Bei visited three times and asked "What is the state of the realm?", Zhuge Liang didn't dump all information at once. Instead, he "retrieved" the most relevant fragments (Jingzhou can be taken, Yizhou can be held, ally with Wu against Cao), then "reranked" to identify the three most critical strategic steps, and finally "generated" the Longzhong Plan.

This is the human version of RAG: it's not about knowing everything, but about precisely retrieving the most relevant knowledge when asked. Zhuge Liang's value wasn't in how much intelligence he memorized, but in his "retrieval" ability -- rapidly pinpointing the few most critical pieces from a sea of information.

The Evolution of Libraries -- From Alexandria to Google

The Library of Alexandria is said to have held as many as 400,000 papyrus scrolls, but finding specific knowledge was extremely difficult -- without an "indexing system," scholars had to search scroll by scroll. Later, Callimachus (sometimes called the father of library science) created the Pinakes, categorizing documents by author and subject -- often regarded as the first library catalog, humanity's first "index."

Google digitized this concept: "crawl" web pages (chunking), build an "inverted index" (indexing), quickly "retrieve" relevant pages when users search, "rerank" with PageRank, and display the most relevant results.

RAG isn't a new invention -- it's the AI version of library classification. Every era solves "finding relevant knowledge from massive information" differently, but the underlying logic has never changed: classify -> index -> retrieve -> rank.

Three Kingdoms Era: Cao Cao's "Retrieval Failure" -- The Battle of Red Cliffs

Before the Battle of Red Cliffs (~208 AD), Cao Cao (the powerful warlord of northern China) had access to the same information as Zhuge Liang, but his "retrieval" went wrong. He only "retrieved" favorable information (troop superiority, Jingzhou already taken, Liu Cong's surrender) and overlooked critical fragments (northern soldiers can't fight on water, epidemic spreading, Eastern Wu's war faction's determination).

In RAG terms: Cao Cao's embedding model was biased -- it confused "what I want to hear" with "what's relevant to the question." His similarity computation wasn't based on objective facts but on his own arrogance. A RAG system's quality depends on embedding quality. If your "semantic understanding" is biased, the retrieved fragments will be wrong, and the final generated answer will be wrong.

5. Business Insights

Insight 1: Every Company Needs Its Own RAG -- The "Enterprise Knowledge Brain" Is a Must-Have

The video mentions that RAG's most common applications are enterprise intelligent customer service and knowledge bases. Currently, most companies' internal knowledge is scattered across documents, emails, Slack, and wikis. By commonly cited estimates, employees spend around 20% of their work time searching for information.

Revenue logic: Build a RAG knowledge base SaaS service for enterprises. Customers upload documents -> automatic chunking + indexing -> employees/customers simply ask questions and get answers. Charge monthly or per query. This is one of the hottest AI deployment directions in 2025-2026.

Insight 2: "Reranking" Is Curation in the Business World -- Curation Skills Are Valuable

The most valuable step in RAG isn't generation (any large model can do that) but reranking -- picking the truly valuable pieces from a pile of seemingly relevant content. This is identical to the value of curators, editors, and consultants: information isn't scarce; what's scarce is "someone to pick out the 3 most important items for you."

Revenue logic: In any information-overloaded domain (investment research, legal cases, medical literature), offer an "AI reranking" service. Don't give clients more information -- help them precisely pick the 3 most relevant items from massive datasets. This is "Judgment-as-a-Service."

Insight 3: Embedding Quality Determines Everything -- "Semantic Understanding" Is the Moat

As a rough rule of thumb, about 80% of a RAG system's quality comes down to embedding quality. General embedding models have limited understanding of industry-specific terminology and context. For example, "Apple" in a tech company's knowledge base should mean the company, while in an agricultural knowledge base it should mean the fruit.

Revenue logic: Fine-tune embedding models for specific industries (Fine-tuned Embedding as a Service). Legal, medical, and financial industries have specialized terminology and semantic relationships that general models can't capture accurately. Whoever's embedding best understands a given industry has the most accurate RAG -- and the stickiest customers.

Insight 4: Chunking Strategy Is Underrated Technology -- "How You Split" Determines "Whether You Can Find"

The "chunking" step, briefly mentioned in the video, is actually the most undervalued component of RAG systems. Chunks too large? Retrieval becomes imprecise. Chunks too small? Context is lost. Different document types (legal contracts, technical documentation, conversation logs) require completely different chunking strategies.
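
One way to balance "too large" against "too small" is to respect paragraph boundaries while capping chunk size. The sketch below is one possible strategy, not the method from the video; the `chunk_by_paragraphs` name, the blank-line paragraph rule, and the word cap are all assumptions.

```python
def chunk_by_paragraphs(text: str, max_words: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.
    A single paragraph longer than max_words becomes its own chunk."""
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        # Close the current chunk if this paragraph would overflow it.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "one two three\n\nfour five\n\nsix seven eight nine"
print(chunk_by_paragraphs(doc, max_words=5))
# ['one two three\n\nfour five', 'six seven eight nine']
```

Because paragraphs are never cut in half, each chunk keeps its local context, which is the property that pure word-count splitting gives up.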

Revenue logic: Develop intelligent chunking tools that automatically select the optimal chunking strategy based on document type. This may seem like a small tool, but it directly impacts the final effectiveness of RAG systems -- a critical piece of AI infrastructure.

6. Core Insight

RAG's true significance isn't in the technology but in the universal principle it reveals: wisdom isn't about memorizing everything, but about quickly finding the most relevant pieces when asked.

As a saying sometimes attributed to Feynman goes, "You don't have to know everything. You just have to know where to find it." Zhuge Liang wasn't the most learned person, but he was the best at "retrieval" and "reranking." Google produces almost none of the content it serves, but it's the most powerful "retrieval augmented" system.

In the AI era, this principle is amplified: a large model's "knowledge" (parameters) is finite and frozen at training time, but "retrievable knowledge" (external databases) can keep growing. RAG transforms the large model from a "knows a little about everything but nothing precisely" generalist into a "precisely retrieves and answers whatever is asked" specialist. This is also how humans should learn -- don't try to memorize everything; instead, build a good "personal knowledge base" and "retrieval system."

Original transcript: MeowKui's Compendium / Source Materials / RAG-Explained-voice-transcript.txt

Video source: https://www.youtube.com/watch?v=JCPLP6BiCrQ