Most 'AI Engineers' Can't Answer These Basic Questions
AI Engineering Fundamentals: A First Principles Q&A
This is a Q&A between Claude and me to gauge my understanding of the book AI Engineering by Chip Huyen.
Embeddings
What problem do embeddings solve? Why do they exist?
Embeddings let you represent text (or other modalities) in a way that captures context and the semantic relationships between pieces of content.
That’s what they do, but not the problem they solve. Go deeper. Why can’t you just feed raw text into a neural network?
We are able to convert them into numbers this way.
Getting warmer. Why do we need to convert them to numbers?
Neural networks can’t just accept raw text.
Right. Neural networks are just matrix math - they only operate on numbers. But why not just assign arbitrary numbers? Like “cat” = 1, “dog” = 2, “elephant” = 3?
You can’t show the relationship that all of these are just animals.
Exactly. Arbitrary numbers have no semantic structure. Embeddings give you a space where distance = meaning. Cat and dog end up near each other because they share context (pets, animals, four legs). Elephant is nearby but further. “Democracy” is way off in another region.
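The "distance = meaning" idea can be sketched with cosine similarity over toy vectors. These hand-made 3-d values are purely illustrative (real embeddings have hundreds or thousands of dimensions produced by a trained model), but the geometry is the same:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; dims loosely read as (animal-ness, pet-ness, abstract-ness).
embeddings = {
    "cat":       [0.9, 0.8, 0.0],
    "dog":       [0.9, 0.9, 0.0],
    "elephant":  [0.9, 0.1, 0.0],
    "democracy": [0.0, 0.0, 1.0],
}

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))        # high: shared context
print(cosine_similarity(embeddings["cat"], embeddings["elephant"]))   # lower: animal, not a pet
print(cosine_similarity(embeddings["cat"], embeddings["democracy"]))  # 0.0: different region entirely
```

Cat and dog land close together, elephant is nearby but further, and democracy is orthogonal, exactly the structure arbitrary IDs like cat = 1, dog = 2 can't give you.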
RAG (Retrieval-Augmented Generation)
What’s the core problem RAG solves that you can’t fix with just a bigger context window?
Models are trained on data with a cutoff date. RAG gives them access to newer data, or to data with more context for the problem they’re being asked to solve.
Why isn’t a bigger context window enough? If context windows keep growing, why not just shove all your documents in there?
Because the information available/needed will always be bigger than the context window.
That’s one reason. What’s the other practical issue even if your documents did fit?
More data also means more data to process which means more money on compute.
Bingo. RAG lets you pay embedding cost once (at index time), then retrieve only the relevant chunks per query. Way cheaper than brute-forcing the whole corpus every time.
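The index-once / retrieve-per-query pattern looks roughly like this. The `embed` function here is a fake bag-of-words stand-in (a real system would call an embedding model once per chunk at index time), and the chunks are made up:

```python
import math
from collections import Counter

def embed(text):
    # Placeholder: bag-of-words counts. Real embeddings come from a model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Index time: pay the embedding cost once per chunk.
chunks = [
    "The refund policy allows returns within 30 days.",
    "Our office is closed on public holidays.",
    "Shipping takes 5 to 7 business days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=1):
    # Query time: embed only the query, rank chunks by similarity, keep top-k.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(retrieve("what is the refund policy"))
```

Only the top-k chunks go into the prompt, so per-query cost stays flat no matter how big the corpus grows.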
Evaluation
Why use an LLM to evaluate another LLM’s output instead of just using traditional metrics like BLEU or exact match?
LLMs make mistakes, so it’s important to have a mechanism that checks what the LLM is generating. That mechanism also has to fit your application context, your compute budget, and the models you have access to.
You’re describing why evaluation matters, but not why LLM-as-judge specifically. Why can’t you just compare the LLM’s output to a correct answer using string matching?
LLM outputs are open-ended.
Exactly. There’s no single “correct” answer to compare against. Ask “summarize this document” - there are hundreds of valid summaries. BLEU score would penalize a perfectly good summary just because it used different words than your reference.
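A quick sketch of the failure mode: two summaries of the same fact, equally valid, sharing almost no surface text. The token-overlap score is a crude stand-in for n-gram metrics like BLEU, and the sentences are invented:

```python
reference = "Revenue grew 12% year over year, driven by cloud sales."
candidate = "Cloud demand pushed annual revenue up by twelve percent."

exact = reference == candidate  # string matching: fails outright

# Crude token-overlap score (strip punctuation, compare word sets).
ref_tokens = {w.strip(".,") for w in reference.lower().split()}
cand_tokens = {w.strip(".,") for w in candidate.lower().split()}
overlap = len(ref_tokens & cand_tokens) / len(ref_tokens)

print(exact)    # False
print(overlap)  # low, even though the meaning matches
```

An LLM judge, by contrast, can be asked "do these two sentences state the same fact?", which is the question you actually care about.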
What’s the trade-off between weak model → strong judge and strong model → weak judge?
To be honest it all depends on testing and what works; there’s no general rule of thumb. The decision is guided by what works in practice, budget, among other things.
That’s the pragmatic answer, but there is a principle underneath. If you use a weak model to judge a strong model’s output, what’s the risk?
The risk is that it won’t be able to judge the stronger model’s output accurately, but to be honest that’s not consistent with my personal testing. Sometimes it works, sometimes it doesn’t.
That’s exactly the right observation. The theoretical risk: weak judge might not recognize quality it can’t produce itself. Like a student grading a professor’s essay. But in practice, judging is often easier than generating. A 7B model can often tell if a response is coherent, relevant, and complete - even if it couldn’t produce that response itself.
The heuristic:
Simple criteria (is it valid JSON? does it answer the question?) → weak judge often fine
Nuanced criteria (is this legal advice accurate? is this code optimal?) → probably need stronger judge
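For the simple-criteria end of that spectrum, you often need no LLM judge at all: format checks are deterministic. A sketch for "is it valid JSON with the fields we asked for?" (the field names are made up):

```python
import json

def judge_format(output: str, required_fields=("name", "age")) -> bool:
    """Deterministic check: parses as JSON, is an object, has the required fields."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(f in data for f in required_fields)

print(judge_format('{"name": "Ada", "age": 36}'))   # True
print(judge_format("Sure! Here's the JSON: {..."))  # False: chatty preamble broke the format
```

Reserve the expensive strong judge for the nuanced criteria that actually need it.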
Prompt Engineering
What’s the difference between few-shot prompting and fine-tuning? When would you pick one over the other?
Few-shot is providing examples for the model at the prompting level while fine-tuning is messing with the weights of the model.
When few-shotting (and other prompting techniques) fails and you have the compute to fine-tune, then you can fine-tune.
That’s circular though - “fine-tune when prompting doesn’t work” just restates the question. What are the actual signs that few-shot has failed?
When you can’t get more accurate results from the model.
Still circular. Let me give you concrete signals:
Consistent format failures - model keeps breaking your output structure despite examples
Domain knowledge gap - model doesn’t know terminology/patterns specific to your domain
Latency/cost - your few-shot prompt is so long that it’s slow and expensive per request
Style consistency - model can’t maintain a specific voice/tone across outputs
Fine-tuning bakes the pattern into the weights, so you don’t need to spend tokens teaching it every request. But fine-tuning is expensive, needs data, and the model becomes specialized. Few-shot keeps the model general.
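The latency/cost signal is easy to see in a sketch: every few-shot request carries the examples as overhead, which fine-tuning would bake into the weights instead. Token counts here are approximated by word splits (real tokenizers differ), and the examples are invented:

```python
# Hypothetical few-shot examples prepended to every single request.
FEW_SHOT_EXAMPLES = [
    ("Input: The movie was great!", "Output: positive"),
    ("Input: Terrible service.", "Output: negative"),
    ("Input: It was okay, I guess.", "Output: neutral"),
]

def build_prompt(query: str) -> str:
    """Prepend every example to every request -- the recurring cost."""
    shots = "\n".join(f"{inp}\n{out}" for inp, out in FEW_SHOT_EXAMPLES)
    return f"{shots}\nInput: {query}\nOutput:"

def approx_tokens(text: str) -> int:
    # Whitespace split as a rough token proxy.
    return len(text.split())

overhead = approx_tokens(build_prompt("x")) - approx_tokens("Input: x\nOutput:")
print(overhead)  # extra "tokens" paid on every call; a fine-tuned model pays zero
```

With three short examples the overhead is small; with dozens of long domain examples it dominates every request.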
Agents
What’s the difference between a single-step agent and a multi-step agent?
The difference is in the name. Single-step agents do everything in one step, while multi-step agents can take multiple steps before getting to the result.
Right, but what forces you into multi-step? Your calculator handles 5 + 3. Why can’t it handle (5 + 3) * 2 - 7 in a single step?
You have to provide the function call for it to do that.
Not quite. You already have add, subtract, multiply functions defined. The issue is: to compute (5 + 3) * 2 - 7, the model needs to:
Call add(5, 3) → get 8
Call multiply(8, 2) → get 16
Call subtract(16, 7) → get 9
Each step depends on the result of the previous step. The model can’t call multiply(8, 2) until it knows the 8 came from step 1.
Single-step: one LLM call, one function execution, done. Multi-step: LLM call → execute → feed result back to LLM → LLM decides next call → execute → repeat until done.
What’s the risk with agent loops? What can go wrong?
The agent can plan rubbish from the start, which sends the whole loop toward failure.
Yep. Garbage plan at step 1 cascades through everything. What else can go wrong? Think about the loop itself.
The loop not having a clear exit.
Exactly. Infinite loops. Model keeps calling tools, never decides it’s done. Or it gets stuck oscillating - calls tool A, then B, then A again, forever.
Multi-step agents need:
Max iterations - hard cap, bail out after N steps
Exit condition - model explicitly signals “I have the answer”
Error recovery - what happens when a tool call fails mid-chain
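The loop plus those three safeguards can be sketched as follows. The `scripted_planner` is a deterministic stand-in for the LLM (which would normally decide the next call from the history); `run_agent` and `TOOLS` are hypothetical names:

```python
TOOLS = {
    "add":      lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "subtract": lambda a, b: a - b,
}

def scripted_planner(history):
    """Stand-in for the model: plans (5 + 3) * 2 - 7 one step at a time."""
    if not history:
        return ("add", 5, 3)                  # step 1 -> 8
    last = history[-1]
    if len(history) == 1:
        return ("multiply", last, 2)          # step 2 needs step 1's result
    if len(history) == 2:
        return ("subtract", last, 7)          # step 3 needs step 2's result
    return ("done", last)                     # exit condition: explicit "done" signal

def run_agent(planner, max_iterations=10):
    history = []
    for _ in range(max_iterations):           # max iterations: hard cap on the loop
        action = planner(history)
        if action[0] == "done":
            return action[1]
        name, *args = action
        try:
            history.append(TOOLS[name](*args))  # execute, feed result back to planner
        except Exception:
            return None                       # error recovery: bail out on a failed call
    raise RuntimeError("max iterations reached without an answer")

print(run_agent(scripted_planner))  # 9
```

Without the cap, a planner that never emits "done" would loop forever; here it just raises after ten iterations.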
Dataset Engineering
What’s the point of augmenting your dataset?
When you don’t have enough data, or you have a lot of data but certain aspects are not represented enough.
Exactly. What are ways to augment data for text?
You can use AI to generate data. You can scrape non-copyrighted data from the internet. You can rewrite some of your existing data to ask the same thing but in different ways.
Good. Quick distinction though: scraping new data isn’t really augmentation - that’s just collecting more real data. Augmentation is creating synthetic variations from what you already have.
The AI generation one is powerful but has a trap. What’s the risk of using LLM-generated data to train another LLM?
It can sometimes lead to model degradation.
Right. It’s called model collapse. If you train on AI-generated data, the model learns the patterns and quirks of the generator - including its mistakes and biases. Each generation gets slightly more distorted. Over iterations, diversity shrinks and errors compound.
It’s like making a photocopy of a photocopy of a photocopy. Each generation loses fidelity.
The rule: synthetic data is useful for augmentation, but you need to keep real human data in the mix to anchor quality.
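The "rewrite existing data in different ways" idea has a classic non-LLM baseline: swap known synonyms and randomly drop a word, so no generator model (and no model-collapse risk) is involved. The synonym table and sentence below are toy assumptions:

```python
import random

# Hypothetical domain synonym table.
SYNONYMS = {"refund": "reimbursement", "order": "purchase", "cancel": "void"}

def augment(sentence: str, rng: random.Random) -> str:
    """Produce a surface variant: synonym swap plus one random word deletion."""
    words = [SYNONYMS.get(w, w) for w in sentence.split()]
    if len(words) > 4:                      # only delete from longer sentences
        words.pop(rng.randrange(len(words)))
    return " ".join(words)

rng = random.Random(0)  # seeded for reproducibility
print(augment("how do I cancel my order and get a refund", rng))
```

Crude, but it creates synthetic variation anchored to real human data, which is exactly the property that keeps quality from drifting.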
Inference Optimization
What’s your current understanding of why batching matters for inference?
Because batching is cheaper, so you save money. It’s useful when you don’t need instant results, unlike chatbots.
That’s the business case. But why is it cheaper? What’s happening at the hardware level?
I don’t think I have an answer for that.
Fair. GPUs are like massive parallel calculators - thousands of cores designed to do the same operation on lots of data simultaneously. When you send one request, most of those cores sit idle. You’re paying for the whole GPU but using a fraction of it.
Batching fills those idle cores. Ten requests batched together use roughly the same GPU time as one request alone, because the cores were waiting anyway.
It’s like booking a private jet for one person vs filling all the seats. The fuel cost (GPU compute) is nearly identical - you’re just splitting it across more passengers (requests).
Batching is: higher throughput, higher latency, lower cost per request.
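The private-jet arithmetic in miniature, under the simplifying assumption that a batch takes roughly the same GPU time as a single request because idle cores get filled. The dollar rate and pass time are illustrative, not measured:

```python
GPU_COST_PER_SECOND = 0.001   # hypothetical rental rate
FORWARD_PASS_SECONDS = 0.5    # assumed roughly constant up to the batch limit

def cost_per_request(batch_size: int) -> float:
    """Same GPU time either way; the bill is just split across the batch."""
    total = GPU_COST_PER_SECOND * FORWARD_PASS_SECONDS
    return total / batch_size

print(cost_per_request(1))    # full GPU cost carried by one request
print(cost_per_request(10))   # same GPU time, split ten ways
```

Real batching hits memory and batch-size limits before the savings are perfectly linear, but the direction of the math holds.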
Why do we measure time to first token (TTFT) specifically, rather than just total response time?
Because we mostly stream responses in latency-sensitive apps.
Exactly. In a chatbot, the user sees tokens appearing as they’re generated. TTFT is the “perceived wait” - how long they stare at a blank screen before something starts appearing.
For chatbots, which matters more - optimizing TTFT or TPOT (time per output token)?
Actually both.
Why both?
You don’t want the user staring at an empty screen for long and also wait forever for the next token.
Yep. But if you had to pick one to optimize first, TTFT usually wins. Users are more tolerant of a steady stream (even if slow) than staring at nothing. The blank screen triggers “is it broken?” anxiety.
Why is average latency misleading?
Because latency is a distribution. With 10 requests whose TTFT values are 100ms, 102ms, 100ms, 100ms, 99ms, 104ms, 110ms, 90ms, 3000ms, 95ms - the average is 390ms, which makes your service seem slower than it is.
Use percentiles instead:
p50 (median): Your “typical” user experience
p95: Most users’ worst case
p99: Your unlucky 1-in-100 user
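Running the numbers from the TTFT sample above makes the point concrete. This uses the simple nearest-rank percentile definition (other conventions interpolate):

```python
import math

ttft_ms = [100, 102, 100, 100, 99, 104, 110, 90, 3000, 95]

mean = sum(ttft_ms) / len(ttft_ms)

def percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

print(mean)                      # 390.0 -- dragged up by the one 3000ms outlier
print(percentile(ttft_ms, 50))   # 100  -- the typical experience
print(percentile(ttft_ms, 95))   # 3000 -- with only 10 samples, p95 already lands on the outlier
```

The median says "users wait ~100ms"; the mean says "~390ms". Only one of those describes anyone's actual experience.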
What’s the difference between throughput and goodput?
I have no idea.
Fair enough.
Throughput: Total tokens your system spits out per second. Raw output volume.
Goodput: Tokens that actually get used per second.
The gap is wasted work - user cancels mid-stream, request times out, speculative decoding guesses wrong.
Throughput matters for billing. Goodput matters for actual user value delivered.
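The gap is just subtraction, shown here on made-up numbers:

```python
window_seconds = 60
tokens_generated = 120_000   # everything the GPU produced in the window
tokens_wasted = 30_000       # cancelled streams, timeouts, rejected speculative guesses

throughput = tokens_generated / window_seconds                  # raw output volume
goodput = (tokens_generated - tokens_wasted) / window_seconds   # tokens that reached users

print(throughput)  # 2000.0 tok/s
print(goodput)     # 1500.0 tok/s
```

A system can look impressive on throughput while a quarter of its work never reaches anyone.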
You’re designing inference for two use cases: a chatbot, and a nightly job that summarizes 50,000 tickets. How would you configure batching differently?
Chatbot: no batching at all. Nightly job needs batching.
Right direction. Why no batching for chatbot specifically? What metric are you protecting?
Latency.
Which latency metric - TTFT or TPOT?
Both but more of TTFT.
Exactly. User staring at blank screen is worse than slower streaming. Batching kills TTFT because requests queue up waiting for the batch to fill.
For the nightly job - nobody’s waiting, so you batch aggressively. Fill those GPU cores, maximize throughput, minimize cost per ticket.
PageIndex vs Embedding RAG
What’s your mental model of when you’d pick one over the other?
Personal blog: embedding RAG because it’s smaller context and I don’t have defined sections.
Legal contract: PageIndex because larger context, sectioned, and I can query via sections.
Close, but you’re anchoring on the wrong attributes. Size and “having sections” aren’t the deciding factors. A 200-page novel has chapters but embedding RAG works fine.
The real distinction is: does answering the question require following references?
Embedding RAG asks: “what chunks of text are semantically similar to this query?” PageIndex asks: “what’s the relevant information, even if I need to hop through cross-references to find it?”
Example: Legal contract says “The penalty shall be calculated per Appendix B, adjusted by the rates in Section 12.3”
Embedding RAG finds this sentence because it matches “penalty calculation”
But the answer isn’t here - it’s in Appendix B and Section 12.3
Embedding search won’t necessarily retrieve those
PageIndex uses LLM reasoning to navigate the document structure.
So the question is: does your document have cross-references that matter for answering questions?
Makes a lot of sense. The anchor is references.
Exactly.
Litmus test:
Embedding RAG wins when: the answer lives in the similar text
PageIndex wins when: the similar text points to the answer


