Augmenting Large Language Models
Large Language Models are remarkably capable out of the box, but they have well-known limitations — stale training data, hallucinations, no access to private knowledge, inability to take actions in the real world, and lack of domain depth. Augmentation is the practice of extending an LLM’s capabilities beyond what it learned during pre-training, without retraining the model from scratch.
This post surveys every major augmentation strategy available today, with honest pros and cons, real-world use cases, and guidance on when to reach for each one.
1. Prompt Engineering
The simplest form of augmentation: crafting the input to steer the model’s behaviour. Techniques include zero-shot, few-shot, chain-of-thought (CoT), self-consistency, tree-of-thought, and system/role prompts.
How It Works
You provide instructions, examples, or reasoning scaffolds directly in the prompt. The model’s weights remain unchanged — all the “augmentation” lives in the input.
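One of these scaffolds, self-consistency, samples several chain-of-thought completions of the same prompt and keeps the majority final answer. A minimal sketch, using a scripted stand-in for the model call:

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n=5):
    """Sample n completions and keep the most common final answer."""
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in for a real LLM call: returns pre-scripted final answers.
scripted = iter(["42", "41", "42", "42", "17"])
answer = self_consistency(lambda p: next(scripted), "What is 6 * 7?", n=5)
print(answer)  # → "42"
```

The majority vote smooths over occasional reasoning errors in individual samples, at the cost of n times the inference spend.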
# Few-shot prompt example
You are a sentiment classifier. Respond with POSITIVE, NEGATIVE, or NEUTRAL.
Review: "The battery life is incredible." → POSITIVE
Review: "Shipping took 3 weeks." → NEGATIVE
Review: "The packaging was standard." → NEUTRAL
Review: "I love how lightweight it is." →
Pros
- Zero infrastructure — No databases, pipelines, or training runs. Works with any model via API.
- Fast iteration — Change the prompt and test in seconds.
- Low cost — No compute beyond inference; no labelled data required.
- Composable — Techniques stack: you can combine few-shot with chain-of-thought and role prompts.
Cons
- Context window limits — You can only fit so many examples and instructions before hitting the token ceiling.
- Fragile — Small wording changes can dramatically affect output quality; prompts can “drift” across model versions.
- No new knowledge — The model can only use information present in its training data or the prompt itself.
- Hard to scale — Maintaining dozens of finely tuned prompts across a production system becomes unwieldy.
Real-World Use Cases
- Customer support triage — Classify incoming tickets by urgency and department using a few-shot prompt (Intercom, Zendesk integrations).
- Code review assistants — System prompts instruct the LLM to act as a senior reviewer focusing on security and performance (GitHub Copilot code review).
- Data extraction — Chain-of-thought prompts extract structured fields from unstructured legal or medical documents.
When to Use
Start here. Prompt engineering is the first thing to try for any LLM task. Move to other approaches only when you hit limits on accuracy, freshness, or task complexity.
2. Retrieval-Augmented Generation (RAG)
RAG grounds the model in external, up-to-date, or private knowledge by retrieving relevant documents at inference time and injecting them into the prompt.
How It Works
- Index — Chunk your documents and generate vector embeddings (e.g., with OpenAI text-embedding-3-large, Cohere Embed, or open-source models like BGE).
- Retrieve — At query time, embed the user’s question and find the top-k most similar chunks via a vector database (Pinecone, Weaviate, pgvector, Qdrant, Milvus).
- Augment — Inject the retrieved chunks into the prompt as context.
- Generate — The LLM answers using the retrieved context.
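The four steps can be sketched end to end. This toy version uses a bag-of-words "embedding" and cosine similarity; a real pipeline would swap in an embedding model and a vector database:

```python
import math

# Toy bag-of-words "embedding" over a tiny vocabulary; a real system would
# call an embedding model such as text-embedding-3-large or BGE.
VOCAB = ["return", "policy", "refund", "days", "shipping"]

def embed(text):
    words = text.lower().replace(".", "").replace("?", "").split()
    return [words.count(w) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Index: embed and store each chunk.
chunks = [
    "Our return policy allows returns within 30 days.",
    "Refunds are processed within 5-7 business days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieve: top-k most similar chunks to the query.
def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda pair: -cosine(q, pair[1]))
    return [chunk for chunk, _ in ranked[:k]]

# 3-4. Augment the prompt with retrieved context, then generate.
context = "\n".join(retrieve("What is your return policy?"))
prompt = f"Context:\n{context}\n\nUser: What is your return policy?"
```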
System: You are a helpful assistant. Answer ONLY based on the provided context.
If the context doesn't contain the answer, say "I don't know."
Context:
[Retrieved chunk 1: "Our return policy allows returns within 30 days..."]
[Retrieved chunk 2: "Refunds are processed within 5-7 business days..."]
User: What is your return policy?
Pros
- Always fresh — Update the index and the model instantly “knows” new information, without retraining.
- Auditable — You can show users the source documents that informed the answer (citations).
- Works with private data — Internal wikis, proprietary databases, customer records.
- Reduces hallucinations — Grounding the model in retrieved facts significantly lowers fabrication rates.
Cons
- Retrieval quality is a bottleneck — If the retriever misses relevant chunks or returns irrelevant ones, the answer suffers (“garbage in, garbage out”).
- Chunking is an art — Poor chunk boundaries (splitting mid-sentence, losing table structure) degrade quality.
- Latency overhead — The embedding → search → rerank → generate pipeline adds 200–500ms.
- Infrastructure cost — Requires a vector database, embedding pipeline, and reranking model in production.
- Context window pressure — Injecting many chunks consumes tokens, leaving less room for the conversation.
Real-World Use Cases
- Enterprise search & Q&A — Glean, Notion AI, and Confluence AI use RAG to answer questions over internal company knowledge bases.
- Customer support bots — Klarna’s AI assistant answers billing questions grounded in account-specific data.
- Legal research — Harvey AI retrieves relevant case law and statutes to assist lawyers in drafting briefs.
- Healthcare — Hippocratic AI retrieves clinical guidelines to provide evidence-based responses.
When to Use
Choose RAG when the model needs access to knowledge that isn’t in its training data — proprietary documents, frequently updated content, or data too large to fit in a prompt. It’s the go-to augmentation for knowledge-intensive tasks.
3. Fine-Tuning
Fine-tuning continues training a pre-trained model on a domain-specific dataset, adjusting the model’s weights to specialize its behaviour.
Variants
| Technique | What Changes | Data Needed | Compute |
|---|---|---|---|
| Full fine-tuning | All weights | 10k–100k+ examples | Very high (multi-GPU) |
| LoRA / QLoRA | Low-rank adapter layers | 1k–10k examples | Moderate (single GPU) |
| Prefix tuning | Learned prompt embeddings | 500–5k examples | Low |
| Instruction tuning | Weights, optimized for instruction-following | 1k–50k examples | High |
How It Works (LoRA Example)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Trainable params: ~0.1% of total — fits on a single A100
Pros
- Deep domain specialization — The model internalizes terminology, style, and reasoning patterns of your domain.
- Consistent tone and format — Fine-tuned models reliably produce outputs in the exact structure you need (JSON schemas, medical reports, legal clauses).
- Reduced prompt size — Behaviour is “baked in,” so you need fewer examples in the prompt at inference time.
- Works offline — A fine-tuned open-source model runs on your own hardware with no API dependency.
Cons
- Data requirements — You need high-quality, labelled training data. Garbage data produces a garbage model.
- Compute cost — Even LoRA needs a GPU for hours; full fine-tuning can run into thousands of dollars.
- Catastrophic forgetting — The model can lose general capabilities when over-fitted to a narrow domain.
- Maintenance burden — You must retrain when the base model updates or when your domain data changes.
- Evaluation is hard — Measuring whether fine-tuning actually improved things requires robust benchmarks.
Real-World Use Cases
- BloombergGPT — Bloomberg fine-tuned a 50B-parameter model on financial data for sentiment analysis, NER, and financial Q&A.
- Med-PaLM 2 — Google fine-tuned PaLM 2 on medical datasets, achieving expert-level performance on USMLE-style questions.
- Code generation — Codex, StarCoder, and DeepSeek-Coder are fine-tuned on code corpora for programming assistance.
- Brand voice — Companies fine-tune models to match their specific tone, vocabulary, and style guide.
When to Use
Choose fine-tuning when you need the model to deeply internalize domain knowledge or a specific output style, and you have high-quality training data. It’s ideal for production systems where consistency, format adherence, and domain accuracy are paramount — and where prompt engineering and RAG alone fall short.
4. Tool Use / Function Calling
Give the LLM the ability to call external tools — APIs, databases, calculators, code interpreters — to perform actions or retrieve live data it cannot generate on its own.
How It Works
The LLM receives a list of available tool definitions (name, description, parameters). When it determines a tool is needed, it outputs a structured function call instead of plain text. Your application executes the call and feeds the result back.
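The application-side plumbing is straightforward: parse the model's structured call, dispatch to the matching function, and return the result. A minimal sketch, where `get_weather` is a stand-in for a real API client:

```python
import json

# Tool registry: maps tool names the model may emit to real functions.
def get_weather(city, units="celsius"):
    # Stand-in for a real weather API call.
    return {"city": city, "temp": 21, "units": units}

TOOLS = {"get_weather": get_weather}

# Instead of plain text, the model emits a structured call like this:
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'

# The application parses it, executes the tool, and feeds the result back
# to the model as a tool/function message for the final answer.
call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
```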
{
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": { "type": "string" },
"units": { "type": "string", "enum": ["celsius", "fahrenheit"] }
},
"required": ["city"]
}
}
}
]
}
Pros
- Live data access — The model can query real-time APIs (weather, stock prices, databases) instead of relying on stale training data.
- Takes real actions — Send emails, create tickets, execute trades, update records.
- Accurate computation — Offload math, date calculations, and data transforms to deterministic tools rather than relying on the LLM.
- Composable — Combine multiple tools to build complex workflows.
Cons
- Security risk — A model that can call APIs can also call them incorrectly or maliciously (prompt injection → unintended actions).
- Latency — Each tool call is a round-trip; multi-step tool chains compound latency.
- Error handling complexity — The model must gracefully handle API errors, rate limits, and malformed responses.
- Model dependency — Not all models support function calling equally well; smaller models often struggle with tool selection.
Real-World Use Cases
- ChatGPT Plugins / GPT Actions — OpenAI’s plugin system lets ChatGPT query Expedia, Wolfram Alpha, Zapier, and hundreds of third-party APIs.
- Coding assistants — GitHub Copilot uses tool calls to read files, run terminal commands, and search codebases.
- Personal assistants — Google Gemini can call Google Maps, Flights, Hotels, and Calendar to complete real tasks.
- Data analysis — Code Interpreter (Advanced Data Analysis) executes Python in a sandbox to analyze uploaded CSVs and generate charts.
When to Use
Choose tool use when the LLM needs to interact with the external world — query live data, perform calculations, or take actions. Essential for any assistant that goes beyond text generation.
5. Agentic Workflows
Agents take tool use to the next level: the LLM autonomously plans multi-step tasks, deciding which tools to call, in what order, and how to handle intermediate results — looping until the goal is achieved.
How It Works
An agent framework (LangChain, LangGraph, CrewAI, AutoGen, OpenAI Assistants API) provides the LLM with:
- A goal or task description.
- A set of tools it can use.
- A reasoning loop (typically ReAct: Reason → Act → Observe → repeat).
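Stripped to its core, the ReAct loop is a few lines of control flow around the model. A sketch, with scripted stand-ins for the model's decisions and a single search tool:

```python
def react_agent(llm_step, tools, goal, max_steps=5):
    """Minimal ReAct loop: each turn the model either calls a tool
    or returns a final answer."""
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        step = llm_step(history)               # Reason: model picks next step
        if step["type"] == "final":
            return step["answer"]
        observation = tools[step["tool"]](**step["args"])  # Act
        history.append(f"Observation: {observation}")      # Observe, loop
    return "Gave up after max_steps"

# Scripted stand-in for the LLM's decisions:
steps = iter([
    {"type": "tool", "tool": "search", "args": {"q": "AI papers"}},
    {"type": "final", "answer": "Top papers: ..."},
])
tools = {"search": lambda q: ["paper1", "paper2"]}
answer = react_agent(lambda history: next(steps), tools,
                     "Summarize trending AI papers")
```

The `max_steps` cap is the simplest guard against the infinite-loop failure mode discussed below under Cons.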
Goal: "Find the top 3 trending AI papers this week and summarize them."
Thought: I need to search for recent AI papers. I'll use the arxiv_search tool.
Action: arxiv_search(query="AI", sort_by="submittedDate", max_results=10)
Observation: [list of 10 papers with titles and abstracts]
Thought: I have the papers. Let me identify the most cited/discussed ones and summarize the top 3.
Action: summarize(papers=[paper1, paper2, paper3])
Observation: [3 summaries]
Final Answer: Here are the top 3 trending AI papers this week...
Pros
- Handles complex, multi-step tasks — Research, data pipeline construction, debugging workflows, multi-document analysis.
- Adaptive — The agent adjusts its plan based on intermediate results and errors.
- Scalable — Multi-agent architectures divide work across specialized agents (researcher, coder, reviewer).
- Autonomous — Reduced need for human intervention in well-defined workflows.
Cons
- Unpredictable — Agents can go off-track, loop infinitely, or take unintended actions. Debugging is difficult.
- Expensive — Multi-step reasoning means many LLM calls; a single task can consume thousands of tokens.
- Slow — Sequential tool calls and reasoning loops accumulate latency.
- Trust boundary — Autonomous actions require robust guardrails; a rogue agent with write access to production systems is dangerous.
- Hard to evaluate — Success depends on the full trajectory, not just the final answer.
Real-World Use Cases
- Software engineering agents — Devin, SWE-Agent, and GitHub Copilot Agent mode autonomously write, test, and debug code across multi-file repositories.
- Research assistants — GPT Researcher autonomously searches the web, gathers sources, and produces research reports.
- Customer service escalation — Multi-agent systems where a triage agent routes to specialist agents (billing, technical support, returns).
- Data engineering — Agents that ingest data, write SQL queries, build dashboards, and iterate based on user feedback.
When to Use
Choose agentic workflows for complex, multi-step tasks where the path to completion isn’t fully known in advance. Best for internal tools, developer workflows, and supervised environments where a human can review critical actions.
6. Retrieval-Augmented Fine-Tuning (RAFT)
RAFT combines RAG and fine-tuning: you fine-tune the model specifically on how to use retrieved documents to answer questions, teaching it to distinguish relevant from irrelevant context.
How It Works
- Generate a training set of (question, retrieved documents, answer) triples.
- Include both relevant (“oracle”) documents and distractor documents in the context.
- Fine-tune the model to produce chain-of-thought answers that cite the relevant documents while ignoring distractors.
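The data-generation step can be sketched as follows. Names and the `p_oracle` default are illustrative, not from the RAFT paper's reference implementation:

```python
import random

def make_raft_example(question, answer, oracle_doc, corpus,
                      n_distractors=3, p_oracle=0.8):
    """Build one RAFT training example.

    With probability p_oracle the relevant ("oracle") document is included
    alongside distractors; omitting it in some examples teaches the model
    to admit when the context cannot answer the question.
    """
    pool = [d for d in corpus if d != oracle_doc]
    docs = random.sample(pool, n_distractors)
    if random.random() < p_oracle:
        docs.append(oracle_doc)
    random.shuffle(docs)
    return {"question": question, "context": docs, "answer": answer}

corpus = [f"doc{i}" for i in range(6)]
example = make_raft_example("What is the return window?",
                            "30 days, per doc0.", "doc0", corpus)
```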
Pros
- Best of both worlds — The model learns domain-specific reasoning AND how to leverage retrieved context.
- Robust to noisy retrieval — The model is trained to ignore irrelevant documents, making it more resilient than vanilla RAG.
- Higher accuracy — Studies show RAFT outperforms both standalone RAG and standalone fine-tuning on domain Q&A benchmarks.
Cons
- Complexity — Requires both a retrieval pipeline AND a fine-tuning pipeline.
- Data engineering effort — Generating realistic (question, context, answer) triples at scale is labour-intensive.
- Double maintenance — You must maintain both the retrieval index and the fine-tuned model.
Real-World Use Cases
- Enterprise document Q&A — Companies with large internal knowledge bases where retrieval alone produces too many errors.
- Compliance and regulatory — Financial institutions fine-tune models to accurately answer questions over regulatory documents while citing specific sections.
When to Use
Choose RAFT when you’ve already tried RAG but the model struggles with noisy retrieval results or doesn’t reason well over retrieved documents. It’s the “advanced RAG” strategy for teams with the resources to fine-tune.
7. Knowledge Graphs + LLMs
Augment the LLM with a structured knowledge graph to provide factual, relational knowledge that vector search alone may miss.
How It Works
- GraphRAG — Use the LLM to extract entities and relationships from documents, build a knowledge graph, then query the graph at inference time for structured context.
- KG-enhanced prompts — Query a pre-existing knowledge graph (Neo4j, Amazon Neptune) and inject the subgraph into the prompt.
- Hybrid retrieval — Combine vector similarity search with graph traversal for richer context.
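The value of graph traversal shows up in multi-hop queries. A toy in-memory graph illustrates the idea; a real deployment would query Neo4j, Neptune, or a GraphRAG index instead:

```python
# Toy knowledge graph: entity -> list of (relation, target) edges.
GRAPH = {
    "ProductA": [("supplied_by", "AcmeCorp")],
    "AcmeCorp": [("competes_with", "GlobexInc"),
                 ("competes_with", "InitechLtd")],
}

def multi_hop(start, relations):
    """Follow a chain of relations, e.g. product -> supplier -> competitors."""
    frontier = {start}
    for rel in relations:
        frontier = {target
                    for node in frontier
                    for r, target in GRAPH.get(node, [])
                    if r == rel}
    return frontier

# "Who competes with the supplier of ProductA?" - flat vector search over
# chunks would struggle to answer this in one shot.
competitors = multi_hop("ProductA", ["supplied_by", "competes_with"])
# The resulting entities are then serialized into the LLM prompt as context.
```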
Pros
- Relational reasoning — Knowledge graphs excel at multi-hop queries: “Who are the competitors of the companies that supply our top 3 products?”
- Structured and precise — Entities, relationships, and properties are explicit — no ambiguity.
- Explainable — The reasoning path through the graph is transparent and auditable.
- Complements RAG — Catches relationships that vector similarity search misses.
Cons
- Graph construction is expensive — Building and maintaining a knowledge graph requires significant effort (NER, relation extraction, entity resolution).
- Scalability — Large graphs can become slow to query and expensive to maintain.
- Brittleness — The graph is only as complete as the extraction process; missing entities or relations degrade quality.
- Integration complexity — Requires combining graph databases, vector stores, and LLM orchestration.
Real-World Use Cases
- Microsoft GraphRAG — Microsoft Research’s approach uses LLMs to build community-level summaries of a knowledge graph for global Q&A over large document sets.
- Drug discovery — Pharmaceutical companies use knowledge graphs of drugs, genes, and diseases, augmented with LLMs, for hypothesis generation.
- Fraud detection — Financial institutions traverse transaction graphs with LLM-powered reasoning to explain suspicious patterns.
When to Use
Choose knowledge graphs when your domain is rich in relationships (supply chains, org charts, biomedical data, social networks) and simple vector search can’t capture the multi-hop reasoning you need.
8. Long-Context and Memory Augmentation
Extend the model’s effective memory beyond its context window using external memory systems, summarization chains, or retrieval-backed conversation history.
Approaches
| Approach | How | Best For |
|---|---|---|
| Sliding window + summarization | Summarize older messages, keep recent ones | Chatbots, long conversations |
| Memory databases | Store key facts in a persistent DB, retrieve on demand | Personal assistants |
| Recursive summarization | Hierarchically summarize long documents | Book/report analysis |
| Extended context models | Use models with 128k–1M+ token windows | Document-level tasks |
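The sliding window + summarization row reduces to a small helper. A sketch, with a stand-in summarizer where a real system would make another LLM call:

```python
def compact_history(messages, summarize, keep_recent=4):
    """Keep the newest messages verbatim; fold older ones into one summary."""
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [f"[Earlier conversation: {summarize(older)}]"] + recent

# Stand-in summarizer; in practice this would be another LLM call.
summarize = lambda msgs: f"{len(msgs)} messages about setup and billing"
history = [f"message {i}" for i in range(10)]
compacted = compact_history(history, summarize)
# compacted holds 5 entries: one summary line plus the last 4 messages.
```

The same shape works recursively: once the summary line itself grows too long, summarize the summaries.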
Pros
- Persistent conversations — The model “remembers” user preferences, past interactions, and key facts across sessions.
- Handles long documents — Process entire codebases, books, or legal contracts in a single pass.
- Personalization — Build user profiles over time for tailored responses.
Cons
- Long-context models are expensive — Costs scale with token count; a 200k-token prompt is 50x the cost of a 4k-token one.
- Attention degradation — Models tend to lose accuracy in the middle of very long contexts (“lost in the middle” problem).
- Memory management complexity — Deciding what to remember, what to summarize, and what to forget requires careful engineering.
- Privacy concerns — Persistent memory stores potentially sensitive user data.
Real-World Use Cases
- ChatGPT Memory — OpenAI’s memory feature persists user preferences and facts across conversations.
- Personal AI assistants — Mem.ai and Rewind.ai build persistent memory layers for AI companions.
- Codebase analysis — Tools like Cursor and Cody use long-context models to ingest entire repositories for code understanding.
When to Use
Choose memory augmentation when the application requires multi-turn persistence, user personalization, or processing very long documents. Use extended context models for document-level tasks; use external memory systems for cross-session persistence.
9. Guardrails and Output Structuring
Constrain the LLM’s output to ensure safety, format compliance, and factual accuracy through validation layers, structured output schemas, and content filters.
Approaches
- Structured outputs — Force JSON, XML, or schema-compliant responses (OpenAI Structured Outputs, Instructor library, Outlines, LMQL).
- Content filters — Block harmful, biased, or off-topic outputs (Guardrails AI, NVIDIA NeMo Guardrails, Llama Guard).
- Fact-checking chains — Use a second LLM call to verify claims against trusted sources.
- Constitutional AI — Train the model with principles that self-correct harmful outputs.
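At its simplest, output structuring is a validation gate between the model and your downstream code. A hand-rolled sketch of the idea; in production you would reach for a library such as Instructor or Pydantic rather than this manual check:

```python
import json

# Expected output contract for a sentiment-classification call.
SCHEMA = {"sentiment": str, "confidence": float}

def validate(raw):
    """Parse the model's JSON output and enforce the schema before use."""
    data = json.loads(raw)
    for key, typ in SCHEMA.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"field {key!r} missing or not {typ.__name__}")
    return data

good = validate('{"sentiment": "POSITIVE", "confidence": 0.93}')
try:
    validate('{"sentiment": "POSITIVE"}')  # missing confidence: rejected
except ValueError as err:
    rejected = str(err)
```

On rejection, typical strategies are to retry the model call with the error message appended, or to fall back to a safe default.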
Pros
- Production-safe — Prevents harmful, off-topic, or malformed outputs from reaching users.
- Deterministic structure — Guarantees the output matches your API schema, database format, or UI contract.
- Composable — Layer multiple guardrails (safety + format + fact-check) in a pipeline.
Cons
- Added latency — Every validation layer adds processing time.
- Over-filtering — Aggressive guardrails can block legitimate responses, frustrating users.
- Maintenance — Safety taxonomies and schemas evolve; guardrails need ongoing updates.
- False sense of security — No guardrail system is 100% effective against adversarial inputs.
Real-World Use Cases
- Banking chatbots — Financial institutions use guardrails to ensure the model never provides investment advice or leaks PII.
- Healthcare — NVIDIA NeMo Guardrails ensures medical AI assistants stay within approved clinical guidelines.
- API backends — Structured outputs guarantee the LLM returns valid JSON for downstream services.
When to Use
Use guardrails in any production system. They’re not an alternative to other augmentation strategies — they’re a complement that should be layered on top of every approach.
10. Multi-Modal Augmentation
Extend the LLM beyond text by incorporating vision, audio, video, or other modalities — enabling it to reason over images, transcribe speech, or analyze visual data.
Approaches
- Native multi-modal models — GPT-4o, Gemini 1.5, Claude 3.5 Sonnet natively process text + images (+ audio/video for some).
- Pipeline augmentation — Use a separate vision model (e.g., OCR, YOLO, Whisper) to convert non-text inputs to text, then feed to the LLM.
- Vision-language adapters — LLaVA, BLIP-2, CogVLM align visual encoders with language models.
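The pipeline-augmentation approach is just function composition. A sketch with hypothetical stand-ins: `run_ocr` would wrap Tesseract or a cloud OCR API, and `call_llm` would wrap a chat-completion request:

```python
import json

def run_ocr(image_path):
    # Hypothetical stand-in for an OCR model or API.
    return "INVOICE #1234\nTotal due: $56.78"

def call_llm(prompt):
    # Hypothetical stand-in for a chat-completion call.
    return '{"invoice_number": "1234", "total": 56.78}'

def extract_invoice(image_path):
    """Pipeline augmentation: vision model -> plain text -> text-only LLM."""
    text = run_ocr(image_path)
    prompt = f"Extract invoice_number and total as JSON from:\n{text}"
    return json.loads(call_llm(prompt))

fields = extract_invoice("invoice.png")
```

The trade-off versus a native multi-modal model: the pipeline is cheaper and auditable at each stage, but any information the OCR step drops is lost to the LLM.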
Pros
- Richer understanding — The model can analyze charts, screenshots, handwritten notes, medical images, satellite imagery.
- Natural interaction — Users can share photos, voice messages, or screen recordings instead of typing.
- New use cases — Opens domains that are impossible with text-only models (radiology, manufacturing QC, autonomous driving).
Cons
- Higher compute cost — Image and video tokens are expensive; a single image can consume 1000+ tokens.
- Hallucinations — Vision models can misread fine print, confuse similar objects, or fabricate details about images.
- Limited availability — Not all models support all modalities; open-source multi-modal models lag behind proprietary ones.
- Data privacy — Sending images/audio to cloud APIs raises additional privacy concerns.
Real-World Use Cases
- Document processing — Extracting data from invoices, receipts, and forms using vision + LLM (Google Document AI, Azure AI Document Intelligence).
- Accessibility — Be My Eyes uses GPT-4o to describe the visual world to visually impaired users.
- Retail — Visual product search and try-on using image understanding.
- Manufacturing — Quality control inspection by analyzing product images for defects.
When to Use
Choose multi-modal augmentation when your input or problem domain is inherently visual, auditory, or mixed-media. If users need to interact with non-text content, multi-modal is not optional — it’s required.
11. RLHF / RLAIF (Reinforcement Learning from Human or AI Feedback)
Align the model’s outputs with human preferences using reinforcement learning, making it more helpful, harmless, and honest.
How It Works
- Collect comparisons — Present human raters (RLHF) or a judge LLM (RLAIF) with multiple model outputs for the same prompt, ranked by quality.
- Train a reward model — Learn a scoring function that predicts human preference.
- Optimize — Use PPO to fine-tune the LLM via reinforcement learning against the reward model’s score, or DPO to optimize directly on the preference pairs without an explicit reward model.
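The per-pair DPO objective is compact enough to write out. A numeric sketch (log-probabilities are illustrative, not from a real model):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy's log-prob gap against the
    frozen reference model's gap."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Equal margins give the chance-level loss log(2); a policy that prefers
# the chosen response more strongly than the reference gets a lower loss.
baseline = dpo_loss(-4.0, -5.0, -4.0, -5.0)   # margin 0
improved = dpo_loss(-3.0, -6.0, -4.0, -5.0)   # margin 2, lower loss
```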
Pros
- Alignment — The model becomes more helpful, truthful, and safe — not just more capable.
- Captures nuance — Human preferences encode subtleties (tone, verbosity, empathy) that are hard to specify in a loss function.
- DPO simplification — Direct Preference Optimization removes the need for a separate reward model, reducing complexity.
Cons
- Expensive — Human annotation is slow and costly; recruiting domain experts even more so.
- Reward hacking — The model can learn to game the reward model rather than genuinely improve.
- Subjectivity — “Good” is contextual; different annotators may disagree, injecting noise.
- Not accessible to most teams — RLHF requires significant ML expertise and infrastructure.
Real-World Use Cases
- ChatGPT — OpenAI used RLHF extensively to make GPT-4 conversational and aligned.
- Claude — Anthropic uses Constitutional AI (a variant of RLAIF) to align Claude with safety principles.
- Llama 3 — Meta used RLHF and DPO to align the open-source Llama models.
When to Use
RLHF is primarily for model builders training foundation or fine-tuned models. Application developers rarely do RLHF themselves — instead, they benefit from it through already-aligned models. Consider it if you’re training a custom model and need to optimize for subjective quality metrics.
Comparison Matrix
| Approach | Cost | Complexity | Latency Impact | Best For |
|---|---|---|---|---|
| Prompt Engineering | Very Low | Low | None | First pass on any task |
| RAG | Medium | Medium | +200–500ms | Private/fresh knowledge |
| Fine-Tuning | High | High | None (at inference) | Domain specialization |
| Tool Use | Low–Medium | Medium | +per tool call | Live data & actions |
| Agentic Workflows | High | Very High | +seconds to minutes | Complex multi-step tasks |
| RAFT | Very High | Very High | +200–500ms | High-accuracy domain Q&A |
| Knowledge Graphs | High | High | +100–300ms | Relational reasoning |
| Memory Augmentation | Medium | Medium | Varies | Personalization & long docs |
| Guardrails | Low | Low–Medium | +50–200ms | Every production system |
| Multi-Modal | Medium–High | Medium | +per modality | Non-text inputs |
| RLHF / RLAIF | Very High | Very High | None (at inference) | Model alignment |
Decision Framework
Use this flowchart to choose the right augmentation strategy:
- Does the model need knowledge it doesn’t have?
- If the data changes frequently → RAG
- If the data is static and you have labelled examples → Fine-Tuning
- If both → RAFT
- Does the model need to take actions or access live data?
- Single action → Tool Use
- Multi-step, dynamic workflow → Agentic Workflows
- Does the model need to reason over relationships?
- Yes → Knowledge Graphs (potentially combined with RAG)
- Does the model need to handle images, audio, or video?
- Yes → Multi-Modal Augmentation
- Does the model need to remember past interactions?
- Yes → Memory Augmentation
- Is the output format or safety critical?
- Yes → Guardrails (always layer these on top)
- Are you building a custom model and need preference alignment?
- Yes → RLHF / RLAIF
- None of the above?
- Start with Prompt Engineering and measure.
In practice, production systems combine multiple strategies. A typical enterprise AI assistant might use RAG for knowledge, tool calling for actions, guardrails for safety, and memory for personalization — all orchestrated through an agentic framework. The key is to start simple, measure what’s lacking, and layer on augmentations incrementally.
References & Further Reading
Foundational Papers
- Retrieval-Augmented Generation (RAG) — Lewis, P. et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, NeurIPS 2020.
- Chain-of-Thought Prompting — Wei, J. et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, NeurIPS 2022.
- Self-Consistency — Wang, X. et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models”, ICLR 2023.
- Tree of Thoughts — Yao, S. et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”, NeurIPS 2023.
- ReAct — Yao, S. et al. “ReAct: Synergizing Reasoning and Acting in Language Models”, ICLR 2023.
- LoRA — Hu, E.J. et al. “LoRA: Low-Rank Adaptation of Large Language Models”, ICLR 2022.
- QLoRA — Dettmers, T. et al. “QLoRA: Efficient Finetuning of Quantized Language Models”, NeurIPS 2023.
- RAFT — Zhang, T. et al. “RAFT: Adapting Language Model to Domain Specific RAG”, 2024.
- GraphRAG — Edge, D. et al. “From Local to Global: A Graph RAG Approach to Query-Focused Summarization”, 2024.
- Lost in the Middle — Liu, N.F. et al. “Lost in the Middle: How Language Models Use Long Contexts”, TACL 2024.
Alignment & Safety
- InstructGPT / RLHF — Ouyang, L. et al. “Training language models to follow instructions with human feedback”, NeurIPS 2022.
- DPO — Rafailov, R. et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model”, NeurIPS 2023.
- Constitutional AI — Bai, Y. et al. “Constitutional AI: Harmlessness from AI Feedback”, 2022.
- PPO — Schulman, J. et al. “Proximal Policy Optimization Algorithms”, 2017.
- Llama Guard — Inan, H. et al. “Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations”, 2023.
- Llama 3 — Dubey, A. et al. “The Llama 3 Herd of Models”, 2024.
Multi-Modal & Vision-Language
- LLaVA — Liu, H. et al. “Visual Instruction Tuning”, NeurIPS 2023.
- BLIP-2 — Li, J. et al. “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models”, ICML 2023.
- CogVLM — Wang, W. et al. “CogVLM: Visual Expert for Pretrained Language Models”, 2023.
Domain-Specific Models
- BloombergGPT — Wu, S. et al. “BloombergGPT: A Large Language Model for Finance”, 2023.
- Med-PaLM 2 — Singhal, K. et al. “Towards Expert-Level Medical Question Answering with Large Language Models”, 2023.
- Codex — Chen, M. et al. “Evaluating Large Language Models Trained on Code”, 2021.
- StarCoder — Li, R. et al. “StarCoder: May the Source Be with You!”, 2023.
- DeepSeek-Coder — Guo, D. et al. “DeepSeek-Coder: When the Large Language Model Meets Programming”, 2024.
Books
- Chip Huyen — Designing Machine Learning Systems, O’Reilly, 2022. Covers production ML systems including retrieval, evaluation, and deployment — directly relevant to RAG and fine-tuning pipelines.
- Sebastian Raschka — Build a Large Language Model (From Scratch), Manning, 2024. Walks through pre-training, fine-tuning, and RLHF from first principles.
- Jay Alammar & Maarten Grootendorst — Hands-On Large Language Models, O’Reilly, 2024. Practical guide covering prompt engineering, RAG, fine-tuning, and multi-modal models with code examples.
- Cameron R. Wolfe — A Complete Guide to Fine-Tuning LLMs, Substack deep-dive series. Accessible introduction to LoRA, PEFT, and practical fine-tuning.
Tools & Platforms
- LangChain / LangGraph — Framework for building LLM applications with chains, agents, and tool use.
- Hugging Face PEFT — Library for parameter-efficient fine-tuning (LoRA, QLoRA, prefix tuning).
- Guardrails AI — Open-source framework for adding input/output guardrails to LLM applications.
- NVIDIA NeMo Guardrails — Toolkit for adding programmable guardrails to LLM conversational systems.
- Instructor — Library for structured LLM outputs using Pydantic models.
- Outlines — Library for constrained text generation from LLMs.