Build With Me

Building a Knowledge Base with SQL + Embeddings

Most organisations already have knowledge — it’s just trapped inside documents, emails, tickets, and people’s heads. This is how you turn that knowledge into a searchable, permission-aware system using SQL and embeddings.

The goal is not “AI answers”. The goal is reliable retrieval — so answers are grounded in real internal sources.

1. What We Are Building

A knowledge base that can answer questions like:

  • “How did we fix that BizTalk GJ load issue last time?”
  • “What is the standard approach for Elite 3E upgrade projects?”
  • “Where is the checklist for release readiness?”

The key differences from a normal search are:

  • Not keyword-based
  • Meaning-based (semantic search)
  • Filtered by permissions and metadata

2. The Minimum Data Model (SQL)

You can do this in many ways, but the minimum structure looks like this:

KnowledgeItem — source document or post

KnowledgeChunk — chunked sections of text

Embedding — the vector stored per chunk

Metadata — tags for filtering: team/client/matter/security

AuditLog — who searched what and when

The most important part is metadata. Without metadata, you can’t control access.
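The five-table model above can be sketched as SQLite DDL. Table and column names here are illustrative, not a required convention — adapt them to your own schema standards:

```python
import sqlite3

# Minimal sketch of the data model. Column names are illustrative.
SCHEMA = """
CREATE TABLE KnowledgeItem (
    id         INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    source_uri TEXT
);
CREATE TABLE KnowledgeChunk (
    id       INTEGER PRIMARY KEY,
    item_id  INTEGER NOT NULL REFERENCES KnowledgeItem(id),
    position INTEGER NOT NULL,          -- order within the source item
    text     TEXT NOT NULL
);
CREATE TABLE Embedding (
    chunk_id INTEGER PRIMARY KEY REFERENCES KnowledgeChunk(id),
    vector   BLOB NOT NULL              -- serialised float array
);
CREATE TABLE Metadata (
    chunk_id INTEGER NOT NULL REFERENCES KnowledgeChunk(id),
    key      TEXT NOT NULL,             -- e.g. 'team', 'client', 'security'
    value    TEXT NOT NULL
);
CREATE TABLE AuditLog (
    id         INTEGER PRIMARY KEY,
    user_id    TEXT NOT NULL,
    query      TEXT NOT NULL,
    chunk_ids  TEXT,                    -- JSON list of retrieved chunk IDs
    model      TEXT,
    token_cost INTEGER,
    ts         TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

In production you would typically use a database with native vector support (pgvector, SQL Server 2025 vectors, etc.) rather than a raw BLOB column, but the relational shape stays the same.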

3. Chunking Strategy (This is Where Quality Comes From)

Chunking is how you break content into retrievable pieces. The goal is:

  • Chunks are small enough to be specific
  • Chunks are large enough to be meaningful
  • Chunks preserve context (use overlap)

Good retrieval starts with good chunking. Most RAG failures are chunking failures.
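The overlap idea can be sketched as a simple character-window splitter. Real systems usually chunk by tokens, sentences, or document structure, so treat this as a minimal illustration of the sliding-window-with-overlap pattern, not a recommendation to chunk by raw characters:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters, where consecutive
    chunks share `overlap` characters of context."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With `size=200, overlap=50`, the last 50 characters of each chunk reappear at the start of the next, so a sentence split across a boundary still lands whole in at least one chunk.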

4. Generating Embeddings

For each chunk, generate an embedding and store it alongside:

  • Chunk text
  • Chunk ID
  • Knowledge Item ID
  • Metadata tags
  • Created/updated timestamps

This is what enables semantic retrieval.
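A sketch of the store step. `fake_embed` is a placeholder — a deterministic toy vector derived from a hash, standing in for a call to a real embedding model — and exists only so the serialisation round-trip is runnable:

```python
import hashlib
import struct

def fake_embed(text: str, dim: int = 8) -> list[float]:
    """PLACEHOLDER: in practice, call a real embedding model here.
    This stub just derives a stable toy vector from a hash."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def to_blob(vec: list[float]) -> bytes:
    """Serialise a vector as packed float32 for a BLOB column."""
    return struct.pack(f"{len(vec)}f", *vec)

def from_blob(blob: bytes) -> list[float]:
    """Deserialise a packed float32 BLOB back into a vector."""
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

vec = fake_embed("How did we fix the BizTalk GJ load issue?")
blob = to_blob(vec)  # store this next to chunk ID, item ID, tags, timestamps
```

The embedding row then carries everything listed above: the vector BLOB plus chunk ID, item ID, metadata tags, and timestamps.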

5. Similarity Search (The Retrieval Step)

When a user asks a question:

  1. Create an embedding for the question
  2. Run similarity search against stored chunk vectors
  3. Return top-N chunks (with scores)

Then the LLM uses those chunks as context to generate an answer.

Retrieval is a separate stage. Don’t merge retrieval and generation logic into one “magic call”.
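The retrieval step itself is just cosine similarity plus a sort. A minimal sketch (in production, push this into the database with a vector index rather than scoring in application code):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n(query_vec: list[float],
          chunks: list[tuple[str, list[float]]],
          n: int = 3) -> list[tuple[str, float]]:
    """Score every (chunk_id, vector) pair and return the top-N with scores."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunks]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:n]

results = top_n([1.0, 0.0], [("a", [1.0, 0.0]), ("b", [0.0, 1.0])])
# e.g. [('a', 1.0), ('b', 0.0)]
```

Note the function returns scores, not answers — generation happens in a later, separate stage.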

6. Retrieval Filters (Permissions + Governance)

In enterprise use, retrieval must be filtered. Examples:

  • Only show content from projects the user worked on
  • Only show a client’s data to that client’s group
  • Exclude “confidential” tags unless user role permits

This is where your metadata becomes powerful.
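One way to enforce this is to resolve the permitted chunk set in SQL first, then run similarity search only over that set — never over the full corpus. A sketch using the Metadata table's `security` tag (table names and tags here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE KnowledgeChunk (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE Metadata (chunk_id INTEGER, key TEXT, value TEXT);
""")
conn.executemany("INSERT INTO KnowledgeChunk VALUES (?, ?)",
                 [(1, "public note"), (2, "client A detail"), (3, "confidential memo")])
conn.executemany("INSERT INTO Metadata VALUES (?, ?, ?)",
                 [(1, "security", "public"),
                  (2, "client", "A"), (2, "security", "public"),
                  (3, "security", "confidential")])

def visible_chunks(conn, allowed_security: tuple[str, ...]) -> list[int]:
    """Return only chunk IDs whose 'security' tag the user is allowed to see."""
    placeholders = ",".join("?" * len(allowed_security))
    rows = conn.execute(f"""
        SELECT c.id FROM KnowledgeChunk c
        JOIN Metadata m ON m.chunk_id = c.id AND m.key = 'security'
        WHERE m.value IN ({placeholders})
    """, allowed_security).fetchall()
    return [r[0] for r in rows]

ids = visible_chunks(conn, ("public",))  # confidential chunk excluded
```

Filtering before similarity search means a user can never even rank content they are not entitled to see.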

7. The Prompt Pattern (RAG Prompt)

A good RAG prompt includes:

  • User question
  • Retrieved chunks
  • Instruction: answer only from those chunks
  • Fallback instruction: say “not enough information” if missing

A grounded answer is better than a confident lie.
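All four ingredients can be assembled in a small template function. The exact wording is a sketch — tune it to your model and tone:

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded RAG prompt: numbered sources, a grounding
    instruction, and an explicit fallback for missing information."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources by number. If the sources do not contain the answer, "
        "reply exactly: 'Not enough information.'\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )
```

Numbering the chunks gives the model something concrete to cite, which in turn lets your UI link each claim back to its source.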

8. Audit Logs (Non-Negotiable in LegalTech)

You should log:

  • User ID
  • Query text
  • Retrieved chunk IDs
  • Model used
  • Token cost
  • Timestamp

This enables:

  • Compliance audit trails
  • Quality improvements
  • Cost tracking
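Writing the log entry is a one-row insert per query. A sketch (the model name and token figures are example values, not recommendations):

```python
import datetime
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE AuditLog (
    user_id TEXT, query TEXT, chunk_ids TEXT,
    model TEXT, token_cost INTEGER, ts TEXT)""")

def log_query(conn, user_id: str, query: str,
              chunk_ids: list[int], model: str, token_cost: int) -> None:
    """Record one retrieval event: who asked what, which chunks came back,
    which model answered, and what it cost."""
    conn.execute(
        "INSERT INTO AuditLog VALUES (?, ?, ?, ?, ?, ?)",
        (user_id, query, json.dumps(chunk_ids), model, token_cost,
         datetime.datetime.now(datetime.timezone.utc).isoformat()))

# Example values for illustration only.
log_query(conn, "u123", "release readiness checklist?", [14, 92], "example-model", 812)
row = conn.execute("SELECT user_id, chunk_ids FROM AuditLog").fetchone()
```

Storing the retrieved chunk IDs is the key detail: it lets you reconstruct exactly what evidence any past answer was based on.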

9. What “Good” Looks Like

  • Search returns relevant chunks even with different phrasing
  • Answers cite sources or show “based on these chunks”
  • Access control is enforced at retrieval
  • System is explainable and auditable

This is how you create enterprise memory — safely.

Continue the Masterclass

Next: English-to-SQL — Turning Questions into Data.
