#ai
>[!info]
>This is an excellent YouTube video on building AI engineering projects. I used a combination of [youtubetotranscript.com](https://youtubetotranscript.com/) and Claude Code to generate detailed notes on it.
<iframe width="700" height="350" src="https://www.youtube.com/embed/3ZDSdMpczXE" title="How to Build AI Engineering Projects That Get You Interviews" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
## The Core Problem
Beginner AI engineers often build projects from tutorials (chatbots, basic RAG) that don't demonstrate real engineering skills. These projects look identical to everyone else's and don't reflect what production AI systems actually look like.
**What you need instead:** A framework for building end-to-end AI applications that are unique and demonstrate actual engineering skills.
---
## The 8 Core Components
### 1. Problem Framing & Success Metrics
Most people skip this, but it's foundational.
**The right approach:**
- Start with a business problem, then determine if an LLM is the right tool
- Define clearly what "solved" means
**Constraints to consider:**
- Latency and cost requirements
- Quality bar (medical Q&A needs high accuracy + citations; creative writing can tolerate 80% useful outputs)
**Three levels of metrics:**
|Level|Question Answered|Examples|
|---|---|---|
|User-facing|Is this actually useful?|Task completion rate, user satisfaction|
|Technical|Is the AI performing well?|LLM-as-judge scores, human ratings, test case accuracy|
|System|Is this sustainable?|Latency, cost per request, error rate, uptime|
**Key principle:** Metrics must be measurable and tracked over time. Document all scoping decisions in your README.
---
### 2. Prompt Engineering & Systematic Tracking
Prompt engineering is less art, more systematic engineering.
**Best practices:**
1. **Treat prompts as separate components** — Don't hardcode them; use versioned files (e.g., `prompt_v1.txt`, `prompt_v2.txt`)
2. **Build a structured evaluation framework:**
- Create test inputs with expected outputs or quality criteria
- Cover different question types, difficulty levels, edge cases
- Aim for 100–300 test cases (start with what you can get)
3. **Measure performance when prompts change:**
- BLEU/ROUGE scores
- Exact matching for classification
- LLM-as-judge with consistent rubric
**Tools:** PromptLayer, LangFuse, Weights & Biases
![[prompt-structuring.png]]
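
A minimal sketch of the versioned-prompt and evaluation idea above, assuming prompts live in files named like `prompt_v1.txt` and the task is classification scored by exact match. The names `load_prompt`, `exact_match_accuracy`, `make_stub_classifier`, and `evaluate_version` are hypothetical, and the stub classifier stands in for a real LLM call.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Tiny illustrative test set; a real one would hold 100-300 cases.
TEST_CASES = [
    {"input": "I want my money back", "expected": "refund"},
    {"input": "Where is my order?", "expected": "other"},
    {"input": "Please refund me", "expected": "refund"},
]


def load_prompt(prompt_dir: Path, version: str) -> str:
    """Read a versioned prompt template such as prompt_v1.txt from disk."""
    return (prompt_dir / f"prompt_{version}.txt").read_text(encoding="utf-8")


def exact_match_accuracy(predict: Callable[[str], str], test_cases: list) -> float:
    """Fraction of test cases where the prediction equals the expected label."""
    if not test_cases:
        return 0.0
    hits = sum(
        predict(case["input"]).strip().lower() == case["expected"].strip().lower()
        for case in test_cases
    )
    return hits / len(test_cases)


def make_stub_classifier(prompt_template: str) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call; swap in a real client here."""
    def predict(text: str) -> str:
        _rendered = prompt_template.format(text=text)  # what a real call would send
        return "refund" if "money back" in text.lower() else "other"
    return predict


def evaluate_version(prompt_dir: Path, version: str, test_cases: list) -> float:
    """Load one prompt version and score it on the shared test set."""
    prompt = load_prompt(prompt_dir, version)
    return exact_match_accuracy(make_stub_classifier(prompt), test_cases)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        prompt_dir = Path(tmp)
        (prompt_dir / "prompt_v1.txt").write_text(
            "Classify the message as 'refund' or 'other'.\nMessage: {text}\nLabel:",
            encoding="utf-8",
        )
        print(f"v1 accuracy: {evaluate_version(prompt_dir, 'v1', TEST_CASES):.2f}")
```

Re-running `evaluate_version` on each new prompt file against the same test set makes regressions visible as a change in one number.
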
---
### 3. Model Selection & Evaluation
**Common mistake:** Defaulting to the newest model without testing.
**What to actually do:**
- Run multiple models (different providers, different sizes) on your test set
- Compare both performance and cost
- Document your process
**Advanced technique — Model routing:**
- Use a cheap/fast model to classify query difficulty
- Route simple queries → cheap model
- Route complex queries → expensive model
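
The routing idea above can be sketched as follows. This is an illustration under stated assumptions: `classify_difficulty` is a keyword heuristic standing in for the cheap classifier model, and `cheap-model` / `expensive-model` are placeholder names, not real model identifiers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    model_name: str
    reason: str


def classify_difficulty(query: str) -> str:
    """Placeholder heuristic; in practice a cheap, fast LLM would return the label."""
    hard_markers = ("compare", "explain why", "step by step", "trade-off")
    is_long = len(query.split()) > 30
    if is_long or any(marker in query.lower() for marker in hard_markers):
        return "complex"
    return "simple"


def route(
    query: str,
    classifier: Callable[[str], str] = classify_difficulty,
    cheap_model: str = "cheap-model",          # placeholder name
    expensive_model: str = "expensive-model",  # placeholder name
) -> Route:
    """Pick a model based on the classifier's difficulty label."""
    label = classifier(query)
    if label == "complex":
        return Route(expensive_model, "classified as complex")
    return Route(cheap_model, "classified as simple")


if __name__ == "__main__":
    for q in ("What is the capital of France?",
              "Compare these two architectures and explain why one scales better"):
        print(q, "->", route(q).model_name)
```

Logging the chosen route alongside quality scores lets you check whether the cheap path is hurting accuracy on queries it should not be handling.
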
---
### 4. RAG (Retrieval Augmented Generation)
RAG connects an LLM to external data sources. The design decisions you make here (chunking, embeddings, storage, search) have a large effect on answer quality.
#### Chunking Strategies
No universal right answer. Test different approaches:
- Fixed-size chunking (try different sizes)
- Semantic chunking (split by meaning)
**How to evaluate:** Create 20–30 test questions where you know which document sections contain the answer. Measure the percentage of questions for which retrieval returns the correct sections.
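
A rough sketch of fixed-size chunking plus the hit-rate evaluation described above. Chunk size is counted in words here for simplicity (a tokenizer would be more usual), and `keyword_retriever` is a toy word-overlap retriever used only so the example runs without an embedding model. All function names are hypothetical.

```python
from typing import Callable


def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list:
    """Split text into chunks of `size` words, sharing `overlap` words between neighbors."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def keyword_retriever(chunks: list) -> Callable[[str, int], list]:
    """Toy retriever: rank chunk indices by word overlap with the question."""
    def retrieve(question: str, k: int) -> list:
        q_words = set(question.lower().split())
        ranked = sorted(
            range(len(chunks)),
            key=lambda i: len(q_words & set(chunks[i].lower().split())),
            reverse=True,
        )
        return ranked[:k]
    return retrieve


def retrieval_hit_rate(retrieve: Callable[[str, int], list], labelled: list, k: int = 3) -> float:
    """Share of questions where at least one known-correct chunk is in the top k."""
    if not labelled:
        return 0.0
    hits = sum(
        1 for question, gold_ids in labelled
        if set(retrieve(question, k)) & set(gold_ids)
    )
    return hits / len(labelled)
```

Running `retrieval_hit_rate` once per chunking configuration (for example several `size` and `overlap` values) gives a direct comparison on the same labelled questions.
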
#### Embedding Models
- OpenAI's text embedding models (easy to use)
- Sentence Transformers (open source, run locally)
Test different ones on your use case.
#### Vector Storage
Use a managed cloud service: Amazon OpenSearch Service on AWS, or the GCP/Azure equivalents
#### Search Strategies
|Strategy|Description|
|---|---|
|Semantic similarity|Convert query to embedding, find similar chunks (baseline)|
|Hybrid search|Combines keyword + semantic search; catches exact technical terms|
|Reranking|Retrieve ~20 chunks quickly, then use a slower, more accurate model to pick the best 5|
|Query expansion|Use LLM to rephrase query, generate variations, extract entities|
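
One common way to combine keyword and semantic results for hybrid search is reciprocal rank fusion (RRF). The notes do not say which fusion method the video uses, so treat this as one possible sketch: each input is a ranked list of chunk IDs, and `k = 60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked lists of IDs; each appearance adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_hits = ["a", "b", "c"]    # e.g. from a keyword (BM25-style) search
    semantic_hits = ["b", "d", "a"]   # e.g. from embedding similarity
    print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

Because RRF only uses ranks, it avoids having to normalize keyword scores and cosine similarities onto the same scale.
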
#### RAG Evaluation Metrics
- **Retrieval accuracy:** Precision@K, Recall@K
- **Answer accuracy given good chunks:** Feed the LLM known-correct chunks to separate generation problems from retrieval problems
- **End-to-end accuracy:** Does the whole system work?
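
For reference, Precision@K and Recall@K can be computed as below. This is a small sketch using the usual definitions (precision divides by K, recall divides by the number of relevant items); the chunk IDs are made up for illustration.

```python
def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in retrieved[:k] if item in relevant) / k


def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    return sum(1 for item in set(retrieved[:k]) if item in relevant) / len(relevant)


if __name__ == "__main__":
    retrieved = ["c1", "c7", "c3", "c9", "c4"]  # ranked retrieval output
    relevant = {"c3", "c4", "c8"}               # hand-labelled gold chunks
    for k in (3, 5):
        print(k, precision_at_k(retrieved, relevant, k), recall_at_k(retrieved, relevant, k))
```
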
---
### 5. Agent Systems
Agents are LLMs that use tools and take actions autonomously.
**Framework options:** LangGraph, CrewAI, or build your own with OpenAI/Anthropic function calling
**Critical considerations for production-quality agents:**
- **Error handling:** LLMs make mistakes; handle them gracefully
- **Security:** Code execution and API calls create security concerns
- **Monitoring:** Log everything — tools called, outputs, decisions, timing, successes/failures
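
A minimal sketch of the error-handling and monitoring points above, assuming tools are plain Python callables kept in a dict. `call_tool` and `ToolResult` are hypothetical names, not part of LangGraph, CrewAI, or any provider SDK.

```python
import logging
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("agent")


@dataclass
class ToolResult:
    tool: str
    ok: bool
    output: Any
    error: Optional[str]
    seconds: float


def call_tool(tools: Dict[str, Callable[..., Any]], name: str, **kwargs: Any) -> ToolResult:
    """Run one tool call, never raising: failures come back as data the agent can react to."""
    start = time.perf_counter()
    try:
        if name not in tools:
            raise ValueError(f"unknown tool: {name}")
        output = tools[name](**kwargs)
        result = ToolResult(name, True, output, None, time.perf_counter() - start)
    except Exception as exc:  # broad on purpose: the model may request anything
        error = f"{type(exc).__name__}: {exc}"
        result = ToolResult(name, False, None, error, time.perf_counter() - start)
    # Log the tool, arguments, outcome, and timing for every call.
    logger.info("tool=%s args=%s ok=%s error=%s seconds=%.4f",
                name, kwargs, result.ok, result.error, result.seconds)
    return result
```

Returning the error as a value means it can be fed back to the model as the tool's observation, so the agent can retry or choose a different tool instead of crashing.
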
**Testing agents:**
- Unit tests for individual tools
- Integration tests for complete workflows
- Adversarial testing (malicious requests)
- Test set of 10–15 representative tasks (simple → complex)
- Measure task completion rate and average steps to completion
---
### 6. Deployment & User Interface
#### API Development
FastAPI is a common choice for Python APIs:
- Fast, generates API documentation automatically, and handles async well
**Key things to handle:**
- Streaming responses
- Error handling (LLM APIs fail)
- Rate limiting
- Authentication (at least simple API keys)
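
Since LLM APIs fail, a retry wrapper with exponential backoff is one simple way to handle transient errors. The sketch below uses only the standard library; `TransientLLMError` is a hypothetical exception that you would map to your provider's timeout or rate-limit errors, and `sleep` is injectable so the logic can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientLLMError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits, 5xx)."""


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientLLMError:
            if attempt == max_attempts:
                raise  # out of attempts: let the API layer return an error response
            # Delay doubles each attempt; jitter spreads out simultaneous retries.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            sleep(delay)
    raise AssertionError("unreachable")
```

Non-transient errors (bad input, authentication failures) are deliberately not caught, since retrying them only adds latency and cost.
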
#### Hosting
Deploy on AWS, GCP, or similar for a stable public URL
#### UI Options
- Simple: Streamlit, Gradio
- More polished: React, Next.js
---
### 7. System Monitoring & Error Analysis
This is what separates demos from real systems.
#### What to Monitor by Component
**Prompts:**
- Response quality scores
- Format compliance
- Refusal rates
- Average response length
**RAG:**
- Retrieval confidence scores
- Number of chunks retrieved
- Source diversity
- Retrieval latency
**Agents:**
- Task completion rate
- Average steps to completion
- Tool success rates
- Error types
- Cost per task
**Overall System:**
- End-to-end task success
- User satisfaction
- Latency, cost per request
- Error rate, uptime
#### Logging Requirements
At minimum, log:
- Timestamp
- User query
- Components used (chunks retrieved, model, prompt version)
- Response
- Latency and cost
- Any errors
**Simple approach:** Write to file or SQLite. **Advanced:** Use proper logging services.
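
The SQLite option can be sketched with the standard library alone. The table and column names below are assumptions chosen to mirror the minimum fields listed above; adjust them to your own system.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import List, Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS request_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL,
    user_query TEXT NOT NULL,
    model TEXT,
    prompt_version TEXT,
    chunk_ids TEXT,        -- JSON-encoded list of retrieved chunk IDs
    response TEXT,
    latency_ms REAL,
    cost_usd REAL,
    error TEXT
)
"""


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the log database; ':memory:' is handy for tests."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_request(
    conn: sqlite3.Connection, *, user_query: str, model: str, prompt_version: str,
    chunk_ids: List[str], response: str, latency_ms: float, cost_usd: float,
    error: Optional[str] = None,
) -> int:
    """Insert one request record and return its row ID."""
    cursor = conn.execute(
        "INSERT INTO request_log (timestamp, user_query, model, prompt_version, "
        "chunk_ids, response, latency_ms, cost_usd, error) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), user_query, model, prompt_version,
         json.dumps(chunk_ids), response, latency_ms, cost_usd, error),
    )
    conn.commit()
    return cursor.lastrowid
```

With the records in a table, questions like "what is the error rate for prompt v2?" become a single SQL query over `request_log`.
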
---
### 8. Fine-Tuning (Bonus)
Fine-tuning is less common in industry than people think. Most of the time, well-engineered prompts get you 90% of the way there.
**When fine-tuning makes sense:**
- Consistent output formatting (specific JSON structure)
- Matching a larger model's performance with a smaller model (e.g., fine-tune Haiku on Sonnet outputs)
- Domain-specific language (medical, legal, technical)
- Improving RAG embedding models for domain-specific retrieval
**How to do it properly:**
1. **Create high-quality training data:** 100–500 real examples from your use case
2. **Establish a baseline:** Measure your best prompt's performance on a held-out test set
3. **Train multiple versions:** Vary data amounts, epochs, and learning rates; track everything
4. **Compare performance:** Does the fine-tuned model beat the baseline? Is the improvement worth it?
---
## How It All Fits Together
```
User submits query → UI → API (logged)
↓
Agent decides: needs RAG? which tools?
↓
Query → embedding → retrieve chunks → rerank
↓
Prompt template + context + examples
↓
Model selection logic → generate response (streaming)
↓
Validate → post-process → return to user
↓
Log everything → monitoring analyzes → improve system
```
---
## Implementation Advice
- **Start simple:** Basic version with one prompt and basic RAG
- **Add complexity incrementally:** Different chunking, agents, monitoring, model selection
- **Measure impact of each change** — Don't add features just because they sound cool
- **Quality over quantity:** One solid project beats five half-finished ones
- **Timeline:** Expect weeks to months — that's normal
---
## Resources Mentioned
- Book: _AI Engineering_ by Chip Huyen
- Various linked videos on individual components
- AI Agents course