Build With Me
Building a Knowledge Base with SQL + Embeddings
Most organisations already have knowledge — it’s just trapped inside documents, emails, tickets, and people’s heads. This is how you turn that knowledge into a searchable, permission-aware system using SQL and embeddings.
The goal is not “AI answers”. The goal is reliable retrieval — so answers are grounded in real internal sources.
1. What We Are Building
A knowledge base that can answer questions like:
- “How did we fix that BizTalk GJ load issue last time?”
- “What is the standard approach for Elite 3E upgrade projects?”
- “Where is the checklist for release readiness?”
The key differences from normal search:
- Not keyword-based
- Meaning-based (semantic search)
- Filtered by permissions and metadata
2. The Minimum Data Model (SQL)
You can do this in many ways, but the minimum structure looks like this:
KnowledgeItem — source document or post
KnowledgeChunk — chunked sections of text
Embedding — the vector stored per chunk
Metadata — tags for filtering: team/client/matter/security
AuditLog — who searched what and when
The most important part is metadata. Without metadata, you can’t control access.
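The model above can be sketched in SQLite. The table names follow the article; the column choices, foreign keys, and timestamp defaults are illustrative assumptions, not a prescribed schema:

```python
import sqlite3

# Minimal schema sketch (SQLite syntax). Embedding is stored as a BLOB
# per chunk; Metadata is a key/value tag table used for access filtering.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE KnowledgeItem (
    id          INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    source_uri  TEXT,
    created_at  TEXT DEFAULT (datetime('now'))
);
CREATE TABLE KnowledgeChunk (
    id          INTEGER PRIMARY KEY,
    item_id     INTEGER NOT NULL REFERENCES KnowledgeItem(id),
    chunk_text  TEXT NOT NULL,
    embedding   BLOB,              -- serialized vector for this chunk
    updated_at  TEXT DEFAULT (datetime('now'))
);
CREATE TABLE Metadata (
    chunk_id    INTEGER NOT NULL REFERENCES KnowledgeChunk(id),
    tag_key     TEXT NOT NULL,     -- e.g. 'team', 'client', 'security'
    tag_value   TEXT NOT NULL
);
CREATE TABLE AuditLog (
    id          INTEGER PRIMARY KEY,
    user_id     TEXT NOT NULL,
    query_text  TEXT NOT NULL,
    ts          TEXT DEFAULT (datetime('now'))
);
""")
```

In production you would likely use a database with native vector support (for example Postgres with a vector extension) rather than raw BLOBs, but the relational shape stays the same.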
3. Chunking Strategy (This is Where Quality Comes From)
Chunking is how you break content into retrievable pieces. The goal is:
- Chunks are small enough to be specific
- Chunks are large enough to be meaningful
- Chunks preserve context (use overlap)
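A minimal character-based chunker with overlap might look like this. The sizes are illustrative; real systems often split on sentence or token boundaries instead:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of roughly chunk_size characters.

    Each chunk shares `overlap` characters with the previous one, so context
    at chunk boundaries is preserved.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```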
4. Generating Embeddings
For each chunk, generate an embedding and store it alongside:
- Chunk text
- Chunk ID
- Knowledge Item ID
- Metadata tags
- Created/updated timestamps
This is what enables semantic retrieval.
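The embed-and-store step can be sketched as below. The `embed` function here is a deterministic, hash-based stand-in for a real embedding model call (normally an API or local model), used only so the example runs offline; `serialize` packs the vector for a BLOB column:

```python
import hashlib
import struct

def embed(text: str, dim: int = 8) -> list[float]:
    # Stand-in for a real embedding model: derives a toy vector from a hash.
    # Illustration only; semantic quality comes from a real model.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]

def serialize(vec: list[float]) -> bytes:
    # Pack floats as little-endian float32 so the vector fits a BLOB column.
    return struct.pack(f"<{len(vec)}f", *vec)
```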
5. Similarity Search (The Retrieval Step)
When a user asks a question:
- Create an embedding for the question
- Run similarity search against stored chunk vectors
- Return top-N chunks (with scores)
Then the LLM uses those chunks as context to generate an answer.
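The retrieval step above can be sketched with plain cosine similarity over the stored vectors. A real deployment would use a vector index or a SQL vector extension rather than this linear scan, but the logic is the same:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_n(query_vec: list[float],
          chunks: list[tuple[int, list[float]]],
          n: int = 3) -> list[tuple[int, float]]:
    """Return the n (chunk_id, score) pairs most similar to the query."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunks]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:n]
```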
6. Retrieval Filters (Permissions + Governance)
In enterprise use, retrieval must be filtered. Examples:
- Only show content from projects the user worked on
- Only show a client’s data to that client’s group
- Exclude “confidential” tags unless user role permits
This is where your metadata becomes powerful.
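One way to express such rules is a small policy function evaluated at retrieval time, before chunks ever reach the model. The tag names (`client`, `security`) and the `partner` role below are assumptions for illustration; real policies come from your own governance model:

```python
def allowed(chunk_tags: dict, user: dict) -> bool:
    """Decide whether a chunk may be shown to this user.

    Illustrative policy mirroring the rules above; adapt to your org.
    """
    # Client data only goes to members of that client's group.
    client = chunk_tags.get("client")
    if client and client not in user.get("client_groups", []):
        return False
    # Confidential content requires a permitted role (assumed: 'partner').
    if chunk_tags.get("security") == "confidential" and user.get("role") != "partner":
        return False
    return True
```

Apply this as a filter on the top-N results (or, better, push the conditions into the SQL WHERE clause so restricted chunks are never retrieved at all).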
7. The Prompt Pattern (RAG Prompt)
A good RAG prompt includes:
- User question
- Retrieved chunks
- Instruction: answer only from those chunks
- Fallback instruction: say “not enough information” if missing
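Those four elements can be captured in a simple prompt builder; the exact wording is illustrative:

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a RAG prompt: instruction, numbered context, question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the numbered context below.\n"
        "Cite the chunk numbers you relied on.\n"
        'If the context does not contain the answer, reply: "Not enough information."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Numbering the chunks is a cheap way to get citations back, which supports the "answers cite sources" goal in section 9.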
8. Audit Logs (Non-Negotiable in LegalTech)
You should log:
- User ID
- Query text
- Retrieved chunk IDs
- Model used
- Token cost
- Timestamp
This enables:
- Compliance audit trails
- Quality improvements
- Cost tracking
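A sketch of the audit record, capturing the fields listed above. The field names are assumptions; in practice each record would be inserted as a row in the AuditLog table:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    # One row per query, matching the fields listed above.
    user_id: str
    query_text: str
    retrieved_chunk_ids: list[int]
    model: str
    token_cost: int
    timestamp: str

def log_query(user_id: str, query_text: str,
              chunk_ids: list[int], model: str, token_cost: int) -> dict:
    """Build an audit record; in practice, write it to the AuditLog table."""
    rec = AuditRecord(user_id, query_text, chunk_ids, model, token_cost,
                      datetime.now(timezone.utc).isoformat())
    return asdict(rec)
```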
9. What “Good” Looks Like
- Search returns relevant chunks even with different phrasing
- Answers cite sources or show “based on these chunks”
- Access control is enforced at retrieval
- System is explainable and auditable
Continue the Masterclass
Next: English-to-SQL — Turning Questions into Data.