Engineering Core
ISB Vietnam's skilled software engineers deliver high-quality applications, leveraging their extensive experience in developing financial tools, business management systems, medical technology, and mobile/web platforms.

Recently, while working on a project, I had the chance to explore how a Retrieval-Augmented Generation (RAG) system works. Before that, I mostly interacted with Large Language Models through APIs, without thinking too much about how they actually retrieve and use external information.

With the rapid development of LLMs, many people have begun asking AI questions rather than searching on Google. This is very convenient, but LLMs have an important limitation: they can only answer questions based on the data they were trained on.

For example, if a model's training data ends in 2025, how can it know what happens in 2026? And if we want it to respond with information from documents it has never seen, how can it do that?

That's where RAG comes in. This article is the first part of a short series documenting what I learned while trying to build a RAG system locally.

The series is divided into three parts:

  • Part 1 - Understanding the RAG Pipeline.
  • Part 2 - Running RAG locally.
  • Part 3 - Challenges and lessons learned.

In this first part, we’ll walk through the basic architecture of a RAG system and understand how its main components work together.

Simplified RAG pipeline

Instead of relying only on what the model already knows, RAG allows the model to retrieve relevant information from external documents.

Although many variations of RAG architectures exist today, most of them revolve around three core components: document ingestion, retrieval, and generation.

Document Ingestion

The first step in a RAG system is preparing the documents so the system can use them.

Document Parsing

The main job of a parser is to extract text from documents such as PDFs, Word files, or HTML pages. Many tools support this today: Docling (Python), LangChain document loaders (Python/TypeScript), Apache Tika (Java), and others.
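As a toy illustration of what a parser does, here is a minimal HTML text extractor built only on Python's standard library. Real projects would use one of the tools above; the sample HTML is made up:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML page, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def parse_html(html: str) -> str:
    """Return the plain text content of an HTML document."""
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.parts)


print(parse_html(
    "<html><body><h1>Employee Handbook</h1>"
    "<p>Passwords must contain at least 8 characters.</p></body></html>"
))
```

A dedicated library handles the many edge cases (encodings, malformed markup, PDFs) that this sketch ignores.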

Text Chunking

Why do we need to chunk at all? LLMs have a limited context window and cannot process an entire document at once, just as we can't hold the full contents of a long file in our heads.

At its most basic, we can split by a fixed chunk size: a chunk size of 100 means splitting the document into pieces of 100 characters each. More sophisticated methods segment based on the document's structure and layout.

In practice, chunking strategies may vary depending on the document structure and the context window of the language model.

Document
┌─────────────────────────────────────────────────────┐
│ Employee Handbook                                   │
│ Employees must reset their passwords every 90 days. │
│ Passwords must contain at least 8 characters.       │
│ Two-factor authentication is recommended.           │
└─────────────────────────────────────────────────────┘

                          ↓
                    Text Chunking

Chunk 1              Chunk 2              Chunk 3
┌────────────────┐   ┌────────────────┐   ┌────────────────┐
│ Employees must │   │ Passwords must │   │ Two-factor     │
│ reset passwords│   │ contain at     │   │ authentication │
│ every 90 days  │   │ least 8 chars  │   │ is recommended │
└────────────────┘   └────────────────┘   └────────────────┘
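The fixed-size strategy can be sketched in a few lines of Python. The overlap parameter, commonly used in practice, repeats a little text between neighboring chunks so that sentences cut at a boundary are not lost entirely (the sample text and sizes are illustrative):

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into fixed-size chunks; overlap preserves context at boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


doc = "Employees must reset their passwords every 90 days."
for chunk in chunk_text(doc, chunk_size=30, overlap=10):
    print(repr(chunk))
```

Structure-aware splitters (by sentence, paragraph, or heading) usually give better retrieval quality, but this character-based version is the simplest starting point.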

Embedding

Machines process numbers, not raw text, so we first have to turn text into numbers. Embedding is the process of converting text into vectors using an embedding model. These vectors allow the system to measure semantic similarity between the user's query and the document chunks.

Chunk 1
"Employees must reset their passwords every 90 days."
↓
Embedding
[0.21, -0.33, 0.81, 0.45, -0.12, ...]
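To get a feel for the idea that text becomes a vector, here is a deliberately naive bag-of-words "embedding". It is not how real embedding models work (they are trained neural networks), but it shows the text-to-vector step; the vocabulary is made up:

```python
import math

# Hypothetical vocabulary; a real model learns its representation from data.
VOCAB = ["password", "reset", "employee", "characters", "authentication"]


def toy_embed(text: str) -> list[float]:
    """Toy embedding: count vocabulary-word occurrences, then L2-normalize.
    Real systems use a trained embedding model instead."""
    words = text.lower().split()
    vec = [sum(w.startswith(term) for w in words) for term in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


print(toy_embed("Employees must reset their passwords every 90 days."))
```

Real embedding models (for example, those served through sentence-transformers) produce dense vectors with hundreds of dimensions that capture meaning, not just word counts.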

Vector Database

After the embeddings are generated, they are stored in a vector database. Unlike traditional databases that store structured data, vector databases are designed to store vector representations and efficiently perform similarity searches. Once all document chunks are stored in the vector database, the system is ready to retrieve relevant information when a user asks a question.
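For learning purposes, a vector database can be approximated by an in-memory list plus brute-force similarity search. This sketch merely stands in for real engines such as Chroma, Qdrant, or FAISS, which use specialized indexes to search millions of vectors efficiently (the vectors below are made up):

```python
import math


class InMemoryVectorStore:
    """Minimal stand-in for a vector database: stores (vector, text) pairs
    and performs brute-force cosine-similarity search."""

    def __init__(self):
        self.entries = []  # list of (vector, text) pairs

    def add(self, vector: list[float], text: str) -> None:
        self.entries.append((vector, text))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query_vector: list[float], k: int = 3):
        """Return the k most similar (score, text) pairs, best first."""
        scored = [(self._cosine(query_vector, v), t) for v, t in self.entries]
        return sorted(scored, reverse=True)[:k]


store = InMemoryVectorStore()
store.add([0.2, -0.3, 0.8], "Employees must reset passwords every 90 days.")
store.add([0.9, 0.1, -0.2], "Two-factor authentication is recommended.")
print(store.search([0.18, -0.41, 0.72], k=1))
```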

Retrieval

This is the core of RAG. Instead of a traditional keyword search, the system looks for the document vectors closest to the vector of the user's query.

Query Embedding

Just like the document chunks, the user's question is converted into a vector when it is submitted.

Always remember: the query and the document chunks must be embedded with the same embedding model; otherwise, their vectors live in different spaces and cannot be meaningfully compared.

User Question:
How often should employees reset their passwords?

↓
Embedding
[0.18, -0.41, 0.72, ...]

Vector Search

To put it simply, imagine that all document embeddings are points in a multi-dimensional space. When a user asks a question, the query is also converted into a vector and placed in the same space. The system then searches for document vectors that are closest to the query vector.

But what do we mean by “closest”?

In practice, similarity between vectors is measured using mathematical metrics such as cosine similarity or dot product. These metrics help the system identify document chunks that are semantically similar to the user's question.
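Cosine similarity itself is only a few lines. In this hypothetical example, the query vector scores high against a related chunk vector and low against an unrelated one (all numbers are illustrative):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


query_vec = [0.18, -0.41, 0.72]   # embedded user question
chunk_vec = [0.21, -0.33, 0.81]   # semantically related chunk
other_vec = [-0.70, 0.52, 0.05]   # unrelated chunk

print(cosine_similarity(query_vec, chunk_vec))  # close to 1.0
print(cosine_similarity(query_vec, other_vec))  # much lower
```

The dot product is the same numerator without normalization; it is often preferred when vectors are already unit-length, since the two metrics then rank results identically.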

Top-k Relevant Chunks

The top-k retrieved chunks are then combined and sent to the LLM as context. The exact value of k can vary depending on the system and the model’s context window.

In simple terms, the system gives the model relevant pieces of text and asks it to answer the question based on that information.

Query:
"How often should employees reset their passwords?"

↓

Top-k Retrieved Chunks

┌─────────────────────────────────┐
│ Chunk 12                        │
│ Employees must reset passwords  │
│ every 90 days.                  │
└─────────────────────────────────┘

┌─────────────────────────────────┐
│ Chunk 27                        │
│ Passwords must contain at least │
│ 8 characters.                   │
└─────────────────────────────────┘

┌─────────────────────────────────┐
│ Chunk 35                        │
│ Two-factor authentication is    │
│ recommended for all accounts.   │
└─────────────────────────────────┘

↓

Context sent to LLM

↓

Answer

Generation

Now it's time to ask the AI a question. This step works much like copying text from somewhere, giving it to the AI, and asking, "Hey, what's in here?"

Prompt Construction

Besides the retrieved context, we also need to provide the LLM with a clear instruction. A simple prompt structure usually contains the context, the user’s question, and an instruction telling the model to answer based only on the provided information. Something like this:

You are an assistant who answers questions based on the provided context.

Context:
Employees must reset their passwords every 90 days.
Passwords must contain at least 8 characters.
Two-factor authentication is recommended.

Question:
How often should employees reset their passwords?

Answer:

The LLM then fills in the answer.
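A prompt like the one above can be assembled with a small helper function. This is just a sketch; frameworks such as LangChain ship their own template utilities:

```python
def build_prompt(context_chunks: list[str], question: str) -> str:
    """Assemble a context-question-answer prompt from retrieved chunks."""
    context = "\n".join(context_chunks)
    return (
        "You are an assistant who answers questions based on the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n\n"
        "Answer:"
    )


prompt = build_prompt(
    [
        "Employees must reset their passwords every 90 days.",
        "Passwords must contain at least 8 characters.",
    ],
    "How often should employees reset their passwords?",
)
print(prompt)
```

The instruction line matters: telling the model to rely only on the provided context is what discourages it from falling back on its training data.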

Answer Generation

This is the final step of the RAG process, and it is simple: after the LLM receives the prompt and context, it uses that information to answer the question. Grounding the answer in retrieved documents helps reduce the hallucinations commonly seen in LLMs.

However, there is one important thing to note. The accuracy of the answer depends on two factors:

  • Was the retrieval step correct? If the system hands the model the wrong chunks, it will naturally produce the wrong answer.
  • Is the LLM strong enough? Even large models have a limited context window, so the system must carefully choose how many chunks to include.
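One possible way to respect the context window is a greedy budget check over the ranked chunks. This sketch uses a rough four-characters-per-token estimate; a real system would count tokens with the model's own tokenizer:

```python
def fit_chunks(ranked_chunks: list[str], max_tokens: int = 1000) -> list[str]:
    """Keep the highest-ranked chunks that fit within a token budget.
    Token cost is estimated as roughly 4 characters per token."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk) // 4 + 1  # crude per-chunk token estimate
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected


chunks = ["short chunk", "x" * 400, "another chunk"]
print(fit_chunks(chunks, max_tokens=50))
```

Because the chunks arrive ranked by relevance, truncating from the tail drops the least useful context first.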

Key Takeaways

  • Retrieval-Augmented Generation (RAG) enables LLMs to answer questions using external documents rather than relying solely on training data.
  • Documents must first be parsed, chunked, and encoded as vectors (embeddings) before they can be used.
  • A vector database is where vectors are stored and searched.
  • When a user asks a question, the system retrieves the most relevant document chunks from the database.
  • The retrieved chunks are then combined into context and sent to the LLM to generate the final answer.

What’s next?

In this article, we walked through the basic pipeline of a RAG system — from document ingestion to answer generation.

In Part 2 - Running a RAG system locally, I’ll share what happened when I tried to run a RAG system locally, including the tools I used and some practical limitations I encountered during development.

This article is part of a technical blog series from ISB Vietnam, where our engineering team shares practical insights and lessons learned from real-world projects.


Written by
Engineering Core
