
RAG Chatbot Architecture: From Documents to Answers

A technical overview of how Retrieval-Augmented Generation (RAG) chatbots work — from document ingestion and embedding to retrieval and answer generation.

The RAG pipeline

A RAG (Retrieval-Augmented Generation) chatbot consists of two separate processes: an offline indexing pipeline that processes your content, and an online retrieval-generation pipeline that handles user queries.

Offline: building the knowledge base

Document ingestion

Content is loaded from your sources — PDFs, web pages, Notion, Confluence, Google Docs, databases, or custom APIs. Each source requires a loader that extracts raw text and metadata (title, URL, last updated).
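
As an illustrative sketch, a loader for the web-page source type might look like the following. The Document shape and the load_html_page helper are hypothetical, not part of any specific framework:

```python
from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup


@dataclass
class Document:
    text: str
    metadata: dict  # e.g. {"title": ..., "url": ..., "last_updated": ...}


def load_html_page(url: str) -> Document:
    """Hypothetical web-page loader: fetch a page and strip it to plain text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else url
    return Document(
        text=soup.get_text(separator="\n", strip=True),
        metadata={"title": title, "url": url},
    )
```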

Chunking

Long documents are split into smaller chunks — typically 500–1000 tokens with 100-token overlap between adjacent chunks. Chunking strategy significantly affects retrieval quality. Documents with clear section headers benefit from semantic chunking (splitting at heading boundaries). Dense text benefits from sliding-window chunking.
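
A minimal sliding-window chunker, assuming tiktoken for tokenization; the 800-token window and 100-token overlap are illustrative defaults within the ranges above:

```python
import tiktoken


def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping token windows (assumes chunk_size > overlap)."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break  # final window already covers the end of the document
    return chunks
```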

Embedding

Each chunk is converted into a vector embedding: a list of floating-point numbers (1,024 to 3,072 of them, depending on the model) that represents the semantic meaning of the text. We use OpenAI's text-embedding-3-large model or Cohere's embed-v3 depending on language requirements and budget.
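
A sketch of batch embedding with the OpenAI Python SDK; batching everything in one call is illustrative, and large corpora would be embedded in smaller batches:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed a batch of chunks with the same model used later at query time."""
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=chunks,
    )
    # Results come back in input order.
    return [item.embedding for item in response.data]
```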

Vector storage

Embeddings are stored in a vector database alongside the original text and metadata. We use Pinecone for hosted deployments, pgvector for PostgreSQL-based stacks, and Qdrant for on-premise requirements.
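
A minimal upsert sketch using the Pinecone Python client. The index name docs-chatbot is hypothetical, and storing the raw chunk text in metadata is one common convention rather than a requirement:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("docs-chatbot")  # hypothetical index, created ahead of time


def upsert_chunks(chunks: list[str], embeddings: list[list[float]],
                  metadatas: list[dict]) -> None:
    """Store each chunk's vector together with its text and source metadata."""
    vectors = [
        {
            "id": f"chunk-{i}",
            "values": embedding,
            "metadata": {**metadata, "text": chunk},  # keep the raw text retrievable
        }
        for i, (chunk, embedding, metadata) in enumerate(
            zip(chunks, embeddings, metadatas)
        )
    ]
    index.upsert(vectors=vectors)
```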

Online: answering user questions

Query embedding

When a user asks a question, the query is embedded using the same model as the documents. This matters: embeddings from different models live in different vector spaces, so mixing models at index time and query time makes the similarity scores meaningless.

Retrieval

The query embedding is used to search the vector database for the k most semantically similar document chunks. The similarity metric is cosine similarity, the cosine of the angle between the query vector and a chunk vector: the smaller the angle, the more closely their meanings align.
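
Continuing the sketches above (reusing the client and index defined earlier), a minimal retrieval step embeds the query and runs a top-k search; the retrieve helper is illustrative:

```python
def retrieve(query: str, k: int = 5):
    """Embed the query with the indexing model, then run a top-k similarity search."""
    query_embedding = client.embeddings.create(
        model="text-embedding-3-large",
        input=[query],
    ).data[0].embedding
    results = index.query(
        vector=query_embedding,
        top_k=k,
        include_metadata=True,  # return stored text and source info with each match
    )
    return results.matches
```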

For better results, we often combine vector search with keyword search (hybrid retrieval) and apply metadata filters (date range, category, author) to pre-filter the search space.
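
The text above doesn't specify how the vector and keyword result lists are merged; reciprocal rank fusion (RRF) is one common choice, sketched here over two ranked lists of chunk IDs:

```python
def rrf_fuse(vector_ids: list[str], keyword_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranking):
            # A document scores higher the nearer the top it appears in each list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)  # best fused score first
```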

Generation

The retrieved chunks plus the user query are assembled into a prompt and sent to the LLM (GPT-4o, Claude 3.5 Sonnet, or Gemini 1.5 Pro). The model is instructed to answer using only the provided context — not its training data. Source citations are extracted from the retrieved chunks and displayed with the response.
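
A minimal generation step, building on the retrieval sketch above; the system prompt wording and the [n] citation convention are illustrative, not a fixed format:

```python
def answer(query: str, matches) -> str:
    """Build a grounded prompt from retrieved chunks and call the LLM."""
    context = "\n\n".join(
        f"[{i + 1}] {m.metadata['text']}" for i, m in enumerate(matches)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer using only the provided context, not prior knowledge. "
                    "Cite sources as [n]. If the context is insufficient, say so."
                ),
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```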

Re-ranking

For higher accuracy, we add a re-ranking step between retrieval and generation. A cross-encoder model scores each retrieved chunk against the query and reorders them by relevance — often significantly improving the quality of the context passed to the LLM.
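
A sketch using a cross-encoder from the sentence-transformers library; the ms-marco-MiniLM model is a widely used public re-ranker chosen here for illustration, not necessarily the one deployed:

```python
from sentence_transformers import CrossEncoder

# Illustrative public cross-encoder; swap in whatever model fits your domain.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, matches, top_n: int = 5):
    """Score each (query, chunk) pair jointly and keep the most relevant chunks."""
    pairs = [(query, m.metadata["text"]) for m in matches]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(matches, scores), key=lambda pair: pair[1], reverse=True)
    return [match for match, _ in ranked[:top_n]]
```

Unlike the bi-encoder used for retrieval, which embeds query and chunk separately, a cross-encoder reads both texts together, which is slower but more accurate, so it is typically applied only to the small candidate set the retriever returns.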

Still have questions?

Our team is happy to walk you through anything — just send us a message.