
Local RAG Pipeline

This post documents the design decisions, experiments, and implementation details of building a fully local RAG pipeline using PDF documents, llama.cpp, FAISS, and ROCm.

Retrieval Augmented Generation (RAG) has become the de-facto approach for grounding large language models with external knowledge. Most examples and tutorials, however, rely heavily on cloud services, managed vector databases, and hosted inference APIs. While these approaches are convenient, they introduce constraints around cost, latency, privacy, and long-term control.

This post walks through the process of building a fully local RAG system capable of ingesting PDF documents, creating embeddings, indexing them for efficient retrieval, and performing inference using locally hosted large language models. Everything runs on local hardware, without external APIs.

The intent here is not to present a polished framework, but to explain the reasoning, trade-offs, and practical issues encountered while building such a system end-to-end.

High-level goals

The goals of this project were intentionally narrow:

  • Use PDF documents as the primary knowledge source
  • Run everything locally, including OCR, embeddings, vector search, and inference
  • Avoid vendor lock-in and opaque tooling
  • Optimise for understandability and debuggability, not just performance

At a high level, the system follows the classic RAG pipeline:

  1. Parse PDFs into structured markdown
  2. Clean and normalise extracted text
  3. Chunk documents into semantically meaningful sections
  4. Generate vector embeddings
  5. Index embeddings for similarity search
  6. Retrieve relevant chunks at query time
  7. Perform inference using retrieved context

Each of these stages turned out to be more nuanced than expected.

Development environment

The development environment was deliberately kept close to bare metal:

  • OS: Ubuntu 24
  • Python: 3.12.3
  • Hardware acceleration: AMD ROCm 7
  • PyTorch: Built from source with ROCm support

Building PyTorch from source was necessary to fully utilise the GPU. A custom wheel was produced and installed into a virtual environment. Some downstream libraries assume official PyTorch builds, which led to dependency friction later on.

Why not Docker or Conda?

While Docker and Conda significantly simplify environment management, they also abstract away many low-level details. For this project, understanding how each component interacts with hardware and system libraries was more important than convenience.

In particular, building PyTorch and llama.cpp from source made GPU compatibility issues visible early, rather than hidden behind container layers.

Stage 1: Parsing PDFs into markdown

PDF parsing is one of the most underestimated problems in RAG pipelines. PDFs are presentation formats, not structured document formats: text order, layout, formulas, and tables are often implicit rather than explicit.

Several OCR and document parsing approaches were evaluated.

OCR and parsing experiments

Tool          | Description             | Outcome
--------------|-------------------------|-------------------------
Dolphin       | Vision-based OCR        | Works, but too slow
llm-whisperer | LLM-assisted parsing    | Fails on formulas
Docling       | Structured PDF pipeline | Works well after tuning

Docling emerged as the most reliable solution once configured correctly. It supports layout extraction, table understanding, and formula enrichment.

The final configuration enabled:

  • OCR with EasyOCR
  • Table structure extraction
  • Formula enrichment
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
)

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True  # OCR backend: EasyOCR (Docling's default)
pipeline_options.ocr_options.lang = ["en"]
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options = TableStructureOptions(do_cell_matching=True)
pipeline_options.do_formula_enrichment = True

Performance optimisation notes

Initially, OCR made parsing extremely slow. However, PDFs that already contain embedded text do not benefit from OCR.

Disabling OCR selectively and setting the environment variable TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 significantly improved throughput. Formula enrichment, on the other hand, proved valuable and was kept enabled.

The output of this stage is raw markdown, one file per PDF.
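The batch step that produces those files can be sketched as follows. Note that `convert_to_markdown` below is a stand-in for the actual Docling call, not part of Docling's API:

```python
from pathlib import Path

def convert_to_markdown(pdf_path: Path) -> str:
    # Stand-in for the Docling conversion, i.e. running the configured
    # converter on pdf_path and exporting the resulting document as markdown.
    return f"# {pdf_path.stem}\n\n(converted content)"

def convert_all(pdf_dir: Path, out_dir: Path) -> list[Path]:
    """Write one .md file per .pdf, mirroring the input file names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        md_path = out_dir / (pdf.stem + ".md")
        md_path.write_text(convert_to_markdown(pdf), encoding="utf-8")
        written.append(md_path)
    return written
```

Keeping the output file names aligned with the input PDFs makes it easy to trace any retrieved chunk back to its source document later.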

Stage 2: Cleaning extracted markdown

The markdown produced by automated tools is rarely production-ready. HTML remnants, layout artifacts, and malformed tables are common.

A lightweight cleanup step was introduced:

  • Strip HTML using BeautifulSoup
  • Remove stray table separators
  • Normalise excessive newlines
import re

cleaned_text = re.sub(r'(?m)^\s*\|\s*$', '', cleaned_text)  # drop stray "|" lines
cleaned_text = re.sub(r'\n{3,}', '\n\n', cleaned_text)      # collapse blank runs

This step is intentionally conservative. Over-cleaning can remove structure that is useful later during chunking.
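Put together, the cleanup step looks roughly like this. The post uses BeautifulSoup for HTML stripping; to keep this sketch dependency-free, Python's stdlib `HTMLParser` stands in for it:

```python
import re
from html.parser import HTMLParser
from io import StringIO

class _TagStripper(HTMLParser):
    """Collects text content while discarding HTML tags."""
    def __init__(self):
        super().__init__()
        self.out = StringIO()

    def handle_data(self, data):
        self.out.write(data)

def clean_markdown(text: str) -> str:
    # Strip HTML remnants left over from the parsing stage
    stripper = _TagStripper()
    stripper.feed(text)
    cleaned = stripper.out.getvalue()
    # Remove lines containing only a stray table separator "|"
    cleaned = re.sub(r'(?m)^\s*\|\s*$', '', cleaned)
    # Collapse runs of three or more newlines into a single blank line
    cleaned = re.sub(r'\n{3,}', '\n\n', cleaned)
    return cleaned
```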

Stage 3: Chunking documents

Chunking determines how information is retrieved later. Arbitrary token-based chunking often destroys document structure.

Instead, markdown-aware chunking was used via LangChain’s MarkdownHeaderTextSplitter.

from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_text)

This preserves semantic boundaries and attaches header metadata to each chunk.

Why not fixed-size chunks?

Fixed-size chunking ignores document intent. A section describing a single concept may be split across chunks, reducing retrieval quality.

Header-based chunking aligns better with how humans structure technical documents.

Each document produces a JSON file containing content and metadata.
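The serialisation step can be sketched as below. The exact record layout is an assumption; here each chunk is stored as a plain `content`/`metadata` pair, which is how the splitter output naturally maps to JSON:

```python
import json
from pathlib import Path

def write_chunks(doc_name: str, chunks: list[dict], out_dir: Path) -> Path:
    """Persist one JSON file per source document: a list of
    {"content": ..., "metadata": ...} records."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{doc_name}.json"
    out_path.write_text(
        json.dumps(chunks, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return out_path
```

Keeping the header metadata alongside the content means the retrieval stage can later show which section of which document an answer came from.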

Stage 4: Embedding generation

Embeddings are the backbone of retrieval. The following decisions were made:

  • Use llama.cpp instead of hosted APIs
  • Use GGUF models for efficient inference
  • Choose Qwen3-Embedding-8B, quantised to Q8_0

Embeddings are generated chunk-by-chunk and L2-normalised, which is critical when using cosine similarity via inner product.

import numpy as np

norm = np.linalg.norm(embedding_vec)
embedding = embedding_vec / norm if norm > 0 else embedding_vec  # guard zero vectors

Instead of storing embeddings in JSON, they are written as raw binary float arrays with a small header. This reduces I/O overhead and memory footprint.
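A minimal sketch of such a binary format follows. The exact header layout is an assumption; here it is just the vector count and dimension as two little-endian uint32 values, followed by raw little-endian float32 data:

```python
import struct
import numpy as np

def save_embeddings(path: str, vectors: np.ndarray) -> None:
    """Write an (n, d) float32 matrix as: header <n, d> + raw floats."""
    vecs = np.ascontiguousarray(vectors, dtype="<f4")
    with open(path, "wb") as f:
        f.write(struct.pack("<II", *vecs.shape))  # small fixed-size header
        f.write(vecs.tobytes())                   # raw float payload

def load_embeddings(path: str) -> np.ndarray:
    with open(path, "rb") as f:
        n, d = struct.unpack("<II", f.read(8))
        data = np.frombuffer(f.read(), dtype="<f4")
    return data.reshape(n, d)
```

Compared with JSON, this avoids number-to-string round-trips entirely, and the whole matrix can be handed to FAISS without further conversion.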

Stage 5: Vector indexing with FAISS

FAISS was compiled from source with:

  • AVX512 enabled
  • OpenBLAS backend
  • No GPU or MKL dependencies
-DFAISS_OPT_LEVEL=avx512
-DFAISS_USE_LTO=ON
-DFAISS_ENABLE_GPU=OFF

An IndexFlatIP index was used. Since all vectors are normalised, inner product corresponds to cosine similarity.
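The equivalence relied on here is easy to verify: for unit-length vectors, the inner product that `IndexFlatIP` computes equals cosine similarity. A small NumPy check (no FAISS required):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=128)
b = rng.normal(size=128)

# Cosine similarity of the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product after L2-normalising both vectors, as the pipeline does
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
inner = a_n @ b_n

assert np.isclose(cosine, inner)
```

This is why the L2-normalisation in the embedding stage is non-negotiable: without it, inner-product scores conflate vector magnitude with semantic similarity.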

Why not HNSW or IVF?

Approximate indices are useful at scale, but introduce tuning complexity and non-determinism.

For a local system with thousands to low millions of vectors, exact search is simpler and more predictable.

Stage 6: Retrieval and inference

At query time:

  1. The query is embedded
  2. Top-K chunks are retrieved from FAISS
  3. Retrieved text is concatenated into a context block
  4. The context is passed to the chat model
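Steps 3 and 4 reduce to plain string and message assembly. A sketch is shown below; the prompt wording and the chunk separator are assumptions, not the exact prompt used in the project:

```python
def build_messages(query: str, retrieved_chunks: list[str],
                   system_msg: str) -> list[dict]:
    """Concatenate retrieved chunks into a context block and wrap the
    query in a chat-style message list for the local model."""
    context = "\n\n---\n\n".join(retrieved_chunks)
    user_msg = f"Context:\n{context}\n\nQuestion: {query}"
    return [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ]
```

The resulting list can be passed directly to a chat-completion call in llama.cpp's Python bindings.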

The chat model used is Qwen3-30B-A3B-Thinking, quantised to Q8_0 and run entirely via llama.cpp.

system_msg = (
    "You are a precise technical assistant. "
    "Answer the user's question using ONLY the context provided below."
)

Streaming responses are enabled to reduce perceived latency.

Observations and lessons learned

  • PDF parsing quality matters more than model size
  • Chunking strategy directly impacts answer quality
  • Quantisation trade-offs are acceptable for RAG workloads
  • llama.cpp provides excellent control and transparency
  • Local systems require more setup, but far less guesswork later