Language Methods & Models Group Project · 2025 · My Role: System Lead

Recipe Assistant —
RAG System for Students

Led the full RAG system design on Stack AI: system-prompt safety engineering, retrieval configuration, and a 24-case evaluation framework with multi-rater scoring.

24
Test Cases
4.3/5
Factual Score
8
KB Documents
5
Safety Rules

Architecture

RAG Pipeline

✍️

User Query

Natural language student question

🔍

Retrieve

Semantic chunks across 8 docs

🤖

Generate

Claude 4.5 + safety prompt

📌

Cited Answer

Grounded + source citations
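The four-stage pipeline above can be sketched in plain Python. The toy retriever and the document store here are hypothetical stand-ins for the Stack AI components; a real deployment would call the LLM where noted.

```python
# Minimal sketch of the RAG pipeline: query -> retrieve -> generate -> cited answer.
# The retriever and document store are illustrative, not the Stack AI internals.

def retrieve(query, chunks, top_k=3):
    """Toy semantic retrieval: rank chunks by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(c["text"].lower().split())), c) for c in chunks]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for score, c in scored[:top_k] if score > 0]

def answer(query, chunks):
    """Return a grounded answer with source citations, refusing when nothing matches."""
    hits = retrieve(query, chunks)
    if not hits:
        return "This information is not available in the recipe documents."
    sources = sorted({c["doc"] for c in hits})
    # A real system would call Claude here with the retrieved context.
    return f"Answer grounded in {len(hits)} chunk(s). Sources: {', '.join(sources)}"

kb = [
    {"doc": "pasta.md", "text": "Boil pasta in salted water for ten minutes"},
    {"doc": "soup.md", "text": "Simmer the vegetable soup on low heat"},
]
print(answer("How long should I boil pasta?", kb))
```

The refusal branch mirrors rule 02 of the system prompt: no retrieved chunks means no answer, rather than a fluent guess.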

Model

Claude 4.5 Sonnet

Strong instruction-following & safe scope

Temperature

0.4

Factual accuracy over creativity

Memory

Window = 10

Short context, no drift

Retrieval

Semantic Chunks

Relevance over keyword match

Citations

Always On

Full source transparency

Safety

5 Prompt Rules

Anti-hallucination + scope control
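The six settings above can be captured in a single configuration object. The field names here are illustrative, not the actual Stack AI schema.

```python
# Illustrative RAG configuration mirroring the settings above.
# Field names are hypothetical, not the actual Stack AI schema.
RAG_CONFIG = {
    "model": "claude-4.5-sonnet",    # strong instruction-following, safe scope
    "temperature": 0.4,              # favour factual accuracy over creativity
    "memory_window": 10,             # short context, no conversational drift
    "retrieval": "semantic_chunks",  # relevance over keyword match
    "citations": True,               # always cite source documents
    "safety_rules": 5,               # behavioural rules in the system prompt
}

def validate(config):
    """Basic sanity checks on the configuration values."""
    assert 0.0 <= config["temperature"] <= 1.0
    assert config["memory_window"] > 0
    assert config["citations"] is True
    return config

validate(RAG_CONFIG)
```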

Prompt Engineering

Safety System Prompt

Safety was engineered at the prompt level: five explicit behavioural rules prevent hallucination and enforce scope. This maps directly to F5's approach to AI defensive measures.

System Prompt Rules (designed by Shraddha Kadam)

01. Answer only using the provided recipe documents — no external knowledge
02. Clearly state when information is not available in the documents
03. Never invent recipes, ingredients or preparation steps
04. Avoid medical, nutritional or personalised dietary advice
05. Clearly refuse questions outside document scope with explanation
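The five rules can be assembled into a single system prompt string. The wording below paraphrases the rules listed above; the production prompt may differ.

```python
# The five safety rules encoded as a system prompt.
# Wording paraphrases the rules above; the production prompt may differ.
SAFETY_RULES = [
    "Answer only using the provided recipe documents; use no external knowledge.",
    "Clearly state when information is not available in the documents.",
    "Never invent recipes, ingredients, or preparation steps.",
    "Avoid medical, nutritional, or personalised dietary advice.",
    "Refuse questions outside the documents' scope, with a brief explanation.",
]

SYSTEM_PROMPT = "You are a recipe assistant for students.\n" + "\n".join(
    f"{i}. {rule}" for i, rule in enumerate(SAFETY_RULES, start=1)
)
print(SYSTEM_PROMPT)
```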

Evaluation Results

24-Test Evaluation Framework

Category 1

4.3

Factual

Direct answers in documents — clear retrieval, accurate grounding

8 questions

Category 2

3.9

Synthesis

Multi-document queries — retrieval occasionally misses a chunk

8 questions

Category 3

3.9

Edge-Case

Out-of-scope queries — high variance, with hallucination as the main failure mode

8 questions

Score Distribution — All 24 Test Cases

Factual 4.3 · Synthesis 3.9 · Edge-Case 3.9 (mean scores on a 1–5 scale)
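Multi-rater category means like those above reduce to a short aggregation helper. The sample scores below are made up for illustration, not the real evaluation data.

```python
from statistics import mean

# Aggregate multi-rater scores into per-category means on a 1-5 scale.
# Each category maps to a list of test cases; each case holds one score per rater.
def category_means(scores):
    """scores: {category: [[rater scores per case], ...]} -> {category: mean}"""
    return {
        cat: round(mean(mean(case) for case in cases), 1)
        for cat, cases in scores.items()
    }

sample = {
    "factual":   [[5, 4], [4, 5]],          # two cases, two raters each
    "synthesis": [[4, 4], [4, 4], [3, 4]],  # three cases, two raters each
}
print(category_means(sample))  # per-category means of the per-case rater means
```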

Failure Analysis

Failure Modes & Fixes

Critical — Edge Cases

Confident Hallucination on Missing Info

When documents lacked information, the model generated fluent but unsupported answers rather than refusing.

↳ Fix: Stronger uncertainty prompts + multi-hop retrieval verification

Moderate — Synthesis

Incomplete Multi-Document Retrieval

Synthesis queries sometimes missed a critical document chunk — fluent but incomplete answer presented as complete.

↳ Fix: Improved recall + multi-hop retrieval + better chunking structure

Minor — All Categories

Over-Verbose Factual Answers

Correct information buried in unnecessary context — reduces clarity without improving accuracy.

↳ Fix: Query-type detection + length guidance per category in system prompt
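The verbosity fix above could start from simple query-type detection. The keyword cues and length guidance below are hypothetical; out-of-scope refusal is handled at retrieval time, so the classifier only covers the two answerable categories.

```python
# Hypothetical query-type detection for per-category length guidance,
# sketching the "over-verbose factual answers" fix. Cues and limits are made up.
LENGTH_GUIDANCE = {
    "factual": "Answer in 1-2 sentences.",
    "synthesis": "Answer in a short paragraph, citing each source document.",
}

SYNTHESIS_CUES = ("compare", "difference", "both", "combine", "across")

def classify_query(query):
    """Crude keyword heuristic: synthesis cues first, then default to factual."""
    q = query.lower()
    if any(cue in q for cue in SYNTHESIS_CUES):
        return "synthesis"
    return "factual"

def length_hint(query):
    """Return the length guidance to append to the system prompt for this query."""
    return LENGTH_GUIDANCE[classify_query(query)]

print(length_hint("Compare the pasta and soup recipes"))
```

In production this heuristic would likely be replaced by a lightweight classifier, but even a cue list lets the system prompt carry per-category length guidance.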