Online course from Berkeley here
Syllabus
| Lecture | Supplemental Reading |
|---|---|
| LLM Reasoning | |
| LLM Agents: Brief History and Overview | |
| Agentic AI Frameworks & AutoGen | |
| Building a Multimodal Knowledge Assistant | |
Lecture 1: LLM Reasoning
Last Letter Concatenation Problem.
| Input | Output |
|---|---|
| Elon Musk | nk |
| Bill Gates | ls |
| Barack Obama | ? |
Adding a ‘reasoning process’ before the ‘answer’?
Key Idea: Derive the Final Answer through Intermediate steps.
- Ling et al., Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems
- Cobbe et al., Training Verifiers to Solve Math Word Problems
- Nye et al., Show Your Work: Scratchpads for Intermediate Computation with Language Models
Training, fine-tuning, and prompting with intermediate steps: show the model examples whose solutions include the intermediate steps. A minimal prompt sketch follows.
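A minimal sketch of a few-shot prompt with intermediate steps for the last-letter task; the exemplar wording and the `call_llm` helper are assumptions, not from the lecture slides:

```python
# Few-shot prompt whose exemplars spell out the reasoning before the answer.
PROMPT = """Q: Take the last letters of the words in "Elon Musk" and concatenate them.
A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating them gives "nk". The answer is nk.

Q: Take the last letters of the words in "Bill Gates" and concatenate them.
A: The last letter of "Bill" is "l". The last letter of "Gates" is "s". Concatenating them gives "ls". The answer is ls.

Q: Take the last letters of the words in "Barack Obama" and concatenate them.
A:"""

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's chat/completions API."""
    raise NotImplementedError

print(call_llm(PROMPT))  # expected to end with: "The answer is ka."
```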
Reasoning strategies
- Least-to-Most Prompting
- “Let’s think step by step” prompt
- Recall related problems - adaptively generate relevant examples and knowledge, rather than using a fixed set of examples
- Chain-of-Thought Reasoning without Prompting - CoT paths can surface among alternative top-k decoding tokens, rather than from greedy decoding alone
- Self-Consistency - sample several reasoning paths and majority-vote the final answer (sketch after this list)
- Universal Self-Consistency
- Consensus via debate among multiple LLMs
- Oracle feedback for self-debug (unit tests)
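A minimal sketch of self-consistency, assuming we can draw several step-by-step samples and parse a final answer out of each (`sample_reasoning_path` and the answer format are assumptions):

```python
from collections import Counter

def sample_reasoning_path(question: str) -> str:
    """Hypothetical: one step-by-step sample from the LLM at temperature > 0."""
    raise NotImplementedError

def extract_answer(path: str) -> str:
    """Hypothetical parse: pull the text after the final 'The answer is'."""
    return path.rsplit("The answer is", 1)[-1].strip(" .")

def self_consistency(question: str, n_samples: int = 10) -> str:
    """Sample several reasoning paths, then majority-vote the final answers."""
    answers = [extract_answer(sample_reasoning_path(question)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```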
Test
- Compositional Generalization
Limitation
- Distracted by irrelevant context
- Cannot self-correct reasoning yet
- Premise Order Matters in LLM Reasoning
Concerns about generating intermediate steps instead of direct answers?
- Probabilistic nature of LLM for token generation
What LLM does in decoding:
What we want:
One-step Further
Summary
- Generating intermediate steps improves LLM performance
- Training / finetuning / prompting with intermediate steps
- Zero-shot, analogical reasoning, special decoding
- Self-consistency greatly improves step-by-step reasoning
- Limitation: irrelevant context, self-correction, premise order
What's next?
- Define a right problem to work on
- Solve it from the first principles
Lecture 2: LLM Agents: Brief History and Overview
Agent: an intelligent system that interacts with some environment
Types of Task
- Reasoning
- Knowledge - can be augmented with RAG
- Computation
Tools
- Search Engine, Calculator
- Task-specific models
- APIs
Question and Answer Breakdown
- Symbolic Reasoning
- Mathematical Reasoning
- Commonsense QA
- Knowledge-intensive QA
- Multi-hop knowledge-intensive QA
Potential Tools
- Chain of Thought
- Tool use
- RAG
- Program of Thought
- WebGPT
- Self-ask
- IRCoT
ReAct = Reason and Act (loop sketch below)
- cannot explore systematically or incorporate feedback
- the agent's own context is an internal space of effectively unlimited size, changed by the act of thinking
- reasoning is an internal action for agents
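A minimal sketch of the ReAct loop, assuming a single tool call per step; the prompt format, `call_llm`, and `run_tool` are illustrative stand-ins, not the paper's exact interface:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call."""
    raise NotImplementedError

def run_tool(action: str, argument: str) -> str:
    """Hypothetical tool executor, e.g. a search engine or calculator."""
    raise NotImplementedError

def react_loop(task: str, max_steps: int = 8) -> str:
    """Interleave Thought (internal action), Action (tool call), and Observation
    until the model emits 'Final Answer:'."""
    context = f"Task: {task}\n"
    for _ in range(max_steps):
        step = call_llm(context + "Thought:")
        context += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        # Expect something like "Action: search[LLM agents]" in the model output.
        action, argument = step.split("Action:", 1)[1].strip().split("[", 1)
        observation = run_tool(action, argument.rstrip("]"))
        context += f"Observation: {observation}\n"
    return "no answer within the step budget"
```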
Memory
Short-Term Memory
- append-only
- limited context
- limited attention
- do not persist over new tasks
Long-Term Memory
- read and write
- stores experience, knowledge, skills
- persists across new experiences
Reflexion
- Task
- Trajectory
- Evaluation
- Reflection (e.g., identify which unit test fails in a coding example). How does it update the memory? (sketch below)
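One plausible answer to the memory question, sketched: after each failed trial the agent writes a verbal reflection into an append-only long-term memory that conditions the next trial. The helper names here are assumptions, not the Reflexion codebase:

```python
def run_agent(task: str, memory: list[str]) -> str:
    """Hypothetical: one attempt at the task, conditioned on past reflections."""
    raise NotImplementedError

def evaluate(trajectory: str) -> tuple[bool, str]:
    """Hypothetical evaluator, e.g. run the unit tests and report which failed."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call."""
    raise NotImplementedError

def reflexion(task: str, max_trials: int = 3) -> str:
    """Long-term memory is an append-only list of verbal reflections fed back
    into every new trial -- that is the memory update."""
    long_term_memory: list[str] = []
    trajectory = ""
    for _ in range(max_trials):
        trajectory = run_agent(task, memory=long_term_memory)
        ok, feedback = evaluate(trajectory)          # e.g. which unit test failed
        if ok:
            break
        long_term_memory.append(call_llm(
            f"Trajectory:\n{trajectory}\nFeedback:\n{feedback}\n"
            "In 2-3 sentences, reflect on what to do differently next time."
        ))
    return trajectory
```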
Cognitive Architectures for Language Agents (CoALA)
Any agent can be described with
- Memory
- Action Space
- Decision Making
Q
- What distinguishes external environment vs internal memory ?
- What distinguishes long-term vs short-term memory?
Symbolic AI Agent → Deep RL Agent → LLM Agent
Language is the latent space for LLM agents.
Challenge
- Reasoning over real-world language
- decision making over open-ended actions and long horizon
What’s Next ?
- Training - FireAct: Toward Language Agent Fine-tuning
- Interface - SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
- Robustness - how many pass out of k?
- Human - τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
- Benchmark
https://princeton-nlp.github.io/language-agent-impact/
EMNLP tutorial on language agents (Nov 12-16) - sorry bois, the visa prob won't work out
Lecture 3a: Agentic AI Frameworks & AutoGen
Generative → Agentic
Self-healing code example: Commander → Writer → Safeguard
Multi-Agent Orchestration
- static / dynamic
- NL / PL
- context sharing / isolation
- cooperation / competition
- centralized / decentralized
- intervention / automation
Agentic Design Patterns
- conversation
- prompting & reasoning
- tool use
- planning
- integrating multiple models, modalities and memories
AutoGen Framework
- define agents
- get them to talk (a minimal two-agent sketch below)
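A minimal two-agent sketch in the AutoGen style (the model name, config keys, and task string are placeholders; the `AssistantAgent`/`UserProxyAgent` pattern follows the pyautogen examples as I understand them):

```python
from autogen import AssistantAgent, UserProxyAgent

# Assumed config: an OpenAI-compatible model; adjust to your provider.
llm_config = {"config_list": [{"model": "gpt-4o-mini", "api_key": "..."}]}

assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",                      # fully automated reply loop
    code_execution_config={"work_dir": "coding"},  # lets the proxy run generated code
)

# "Get them to talk": the proxy sends the task and relays execution results back.
user_proxy.initiate_chat(assistant, message="Plot NVDA vs TSLA YTD and save to chart.png")
```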
Conversation Type
- Blogpost writing with reflection
- Nested chat → multiple experts to do reflection
- Conversational with tools
- Group chat
Lecture 3b: Building a Multimodal Knowledge Assistant
Knowledge Base → Vector Database via chunking
Knowledge Assistant with Basic RAG
Data Processing and Indexing
Data → Basic Text Splitting → Index
Basic Retrieval and Prompting
Top-k (k = 5) retrieved chunks → simple QA prompt → response (sketch below)
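A basic-RAG sketch in the LlamaIndex style, assuming default chunking and a local `data/` folder of documents:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Data -> basic text splitting -> vector index (chunking uses library defaults here).
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Top-5 retrieval plugged into a simple QA prompt.
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("What does the design doc say about caching?"))
```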
Limitations
- Naive data processing, primitive retrieval interface
- Poor query understanding/ planning
- No function calling or tool use
- Stateless, no memory
A Better Knowledge Assistant
- High-quality Multimodal RAG
- Complex output generation
- Agentic reasoning over complex inputs
- Towards a scalable, full-stack application
Data → Data Processing → Index → Agent → Response
Setting up Multimodal RAG
- ETL for LLMs
  - parsing
  - chunking
  - indexing
- Complex Documents → embedded tables, charts, images, irregular layouts, headers/footers
- LLM-Native Document Parser
- Agentic RAG
- Unconstrained vs Constrained Flows
Agentic Orchestration Foundations
- event-driven
- composable
- flexible
- code-first
- debuggable and observable
- easily deployable to production
Multimodal Report Generation → Final System ?
Lecture 4: Enterprise trends for generative AI, and key components of building successful agents/applications
- needle in a haystack test
Trends
- AI moving faster
- amount of data needed has come down
- anyone can develop AI with new tools
- Technical
- multimodal
- efficient sparse models
- Platform
- broad set of models
- customization
- Cost of API Call → $0
- Search → LLM and Search
- Enterprise Search/ Assistant
Customization
- Fine Tuning
- Distillation
- Grounding
- Function Calling
- Prompt Design vs Prompt Tuning
Fine Tuning
- Parameter-efficient Fine Tuning (PEFT)
- Low-Rank Adaptation (LoRA) (sketch below)
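A sketch of the LoRA idea: freeze the pretrained weight W and learn a low-rank update BA, so the adapted layer computes Wx + (alpha/r)·BAx. The module below is illustrative, not a specific library's implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus trainable low-rank update: y = Wx + (alpha/r)·B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```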
Distillation
- teacher-student model
- softmax function with temperature (loss sketch below)
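A sketch of the teacher-student objective using the temperature-scaled softmax; the temperature and mixing weight are illustrative defaults, not values from the lecture:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha * KL(teacher_T || student_T) * T^2  +  (1 - alpha) * CE(student, labels)."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)          # softened teacher distribution
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
    hard_loss = F.cross_entropy(student_logits, labels)           # usual hard-label loss
    return alpha * soft_loss + (1 - alpha) * hard_loss
```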
Grounding
Minimize Hallucination
- right context
- Retrieval Service
- dynamic retrieval
- better models
- user experience
Function Calling
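A sketch of what function calling looks like in practice: the model receives a JSON tool schema and, instead of prose, returns a function name plus arguments for the application to execute. The schema follows the common OpenAI-style convention; `get_weather` is a made-up example:

```python
# Tool schema handed to the model; it replies with {"name": ..., "arguments": {...}},
# which the application executes and feeds back. Function and fields are illustrative.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Berkeley"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]
# Grounding loop: model emits a call -> app runs get_weather(**arguments)
# -> result goes back into the conversation -> model writes the final answer.
```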
Lecture 5: Compound AI Systems & the DSPy Framework
modular programs that use LMs as specialized components
The monolithic nature of LMs makes them hard to control, debug, and improve.
Retrieval-Augmented Generation
- Transparency: can debug traces & offer user-facing attribution
- Efficiency: can use smaller LMs, offloading knowledge & control flow
Multi-Hop Retrieval-Augmented Generation
- control: can iteratively improve the system & ground it via tools
Compositional Report Generation
- quality: more reliable composition of better-scoped LM capabilities
- inference-time scaling: systematically searching for better outputs
- DIN-SQL
Role of Prompt
- The core input → output behavior, a signature
- The computation specializing an inference-time strategy to the signature, a predictor
- The computation formatting the signature's inputs and parsing its typed outputs, an adapter
- The computations defining objectives and constraints on behavior, metrics and assertions
- The strings that instruct (or weights that adapt) the LM for desired behavior, an optimizer
What if we could abstract Compound AI Systems as programs with fuzzy natural-language-typed modules that learn their behavior ? (DSPy → Declarative Self-Improving Python)
An LM program maps input X to output Y, both expressed in natural language. In the course of its execution, it makes calls to modules. Each module is a declarative LM invocation, defined via inherently fuzzy natural-language descriptions of:
- a sub-task (optional)
- input domain type(s)
- output co-domain type(s)
For each module, determine:
- the string prompt into which its inputs are plugged, and
- the weights assigned to the LM,
in the optimization problem defined by a small training set and a metric over labels or hints. (A signature/module sketch follows.)
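A minimal DSPy-style signature and module for the RAG case above; the field names and task are illustrative:

```python
import dspy

class GenerateAnswer(dspy.Signature):
    """Answer the question using the provided context."""      # the (optional) sub-task
    context = dspy.InputField(desc="retrieved passages")        # input domain type
    question = dspy.InputField()
    answer = dspy.OutputField(desc="short factual answer")      # output co-domain type

class RAG(dspy.Module):
    def __init__(self, k: int = 3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=k)
        self.generate = dspy.ChainOfThought(GenerateAnswer)     # predictor specializing the signature

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)
```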
Optimizers
- Construct an initial prompt from each module via an Adapter
- Generate examples of every module via rejection sampling
- Use the examples to update the program’s modules
- automatic few-shot prompting: dspy.BootstrapFewShotWithRandomSearch (compile sketch after this list)
- induction of instructions: dspy.MIPROv2
- multi-stage fine-tuning: dspy.BootstrapFinetune
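A sketch of compiling such a program with one of the optimizers above; the metric, training set, and hyperparameters are placeholders:

```python
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

def exact_match(example, prediction, trace=None):
    """Placeholder metric: 1 if the predicted answer matches the label."""
    return example.answer.lower() == prediction.answer.lower()

# Placeholder training data; real examples would come from your task.
trainset = [dspy.Example(question="...", answer="...").with_inputs("question")]

optimizer = BootstrapFewShotWithRandomSearch(metric=exact_match, max_bootstrapped_demos=4)
compiled_rag = optimizer.compile(RAG(), trainset=trainset)   # RAG from the sketch above
```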
Problem Setting
Inputs: training/validation inputs + LM program + metric
Outputs: optimized LM program
Optimize → instructions + few-shot examples
Metrics
- labels - can use the historical tables and schema-change tables as the 'correct' response
- grounded in the context that was retrieved? what counts as correct context?
Constraints / Assumptions
- No access to log-probs or model weights: developers may want to optimize LM programs for use on API-only models
- No intermediate metrics/ labels: we assume no access to manual ground-truth labels for intermediate stages
- Budget-Conscious: we want to limit the number of input examples we require and the number of program calls we make
- the input can be scaled based on historical changes to the table pairs with chat data
Key Challenges
Prompt Proposal
Searching over the space of possible prompts
Credit Assignment
Determining how each prompt variable contributes to performance
Methods
- Bootstrap Few-Shot (with Random Search)
- iterate on examples
- Extending OPRO (Optimization through Prompting)
- “think step by step”
- “take a deep breath and think step by step”
- “I believe in you”
- evaluate and request more proposals
- Coordinate-Ascent OPRO
- Module-Level OPRO
- Grounding
- dataset summary
- history of instructions
- bootstrapped demos
- LM Program Code itself
- MIPRO (Multi-prompt Instruction PRoposal Optimizer)
- Prompt Proposal (1 & 2), Credit Assignment (3)
- Bootstrap Task Demonstrations
- Propose Instruction Candidates using an LM Program
- Jointly tune with a Bayesian hyperparameter optimizer (Bayesian Surrogate Model)
Optimizing instructions can deliver gains over the baseline signature; optimizing bootstrapped demonstrations delivers gains as well.
Key Lessons 1: Natural Language Programming
- Programs can often be more accurate, controllable, transparent, and even efficient than models
- You just need declarative programs, not implementation details. High-level optimizers can bootstrap prompts - or weights, or whatever the next paradigm deals with.
Hand-written prompts ⇒ Signatures
Prompting techniques and prompt chains ⇒ Modules
Manual prompt engineering ⇒ Optimized programs
Key Lessons 2: Natural Language Optimization
- In isolation, on many tasks nothing beats bootstrapping good demonstrations. Show don't tell!
- Generating good instructions on top of these is possible, and is especially important for tasks with conditional rules!
- But you will need effective grounding, and explicit forms of credit assignment.
Lecture 6: Agents for Software Development
Levels of Support
- Manual Coding
- Copilot/Cursor code completion
- Copilot chat refactoring
- DiffBlue test generation, Transcoder code porting
- Devin/ OpenDevin end-to-end development
Challenges in Coding Agents
- Defining the environment
- Designing observations / actions
- Code Generation (atomic actions)
- File Localization (exploration)
- Planning and Error Recovery
- Safety
Software Development Environment
- Source Repositories: GitHub, GitLab
- Task Management Software: Jira, Linear, GitHub Issues
- Office Software: Google Docs, Microsoft Office
- Communication Tools: Gmail, Slack
Simple Coding → Specification → Code
Metrics
- Pass@k ← unit tests (estimator sketch below)
- Lexical / semantic overlap ← dataset leakage of the test set
- Visual similarity of the website ← for front-end tasks
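For reference, the unbiased pass@k estimator (as popularized by the Codex evaluation): with n samples per problem of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). A small sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn from n, c correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 of 10 generated programs pass the tests:
print(pass_at_k(n=10, c=3, k=1))   # ≈ 0.30
print(pass_at_k(n=10, c=3, k=5))   # ≈ 0.917
```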
Multimodal coding? → Design2Code ← can be used to re-code from the front end
Objective of Coding Agents
- Understanding repo structure
- Read in existing code
- Modify or produce code
- Run code and debug
LM-friendly tools
Method: Code Generating LM
adding code (to training data) improves the reasoning of the model???? find paper → The Stack v2
research idea 1 ?
- Use an LLM to tell if a repo is bad or not? → are LLMs good at giving a numerical score as a judgement? e.g. "this sentence is a good sentence"?
- microsoft GEMBA ?
- categorical evaluations yield better results? what about asking the LM to describe it in a sentence and using simple NLP sentiment analysis?
- chain of thought
- directional feedback ?
- NumeroLogic → multiple operations ?
- microsoft GEMBA ?
- GM cut bottom 10%, train new model
- does new model do better with less bad data ?
- repeat steps 1 and 2; I feel like this has been done before? in vision models?
Method: Code Infilling
masking spans and using the missing info as the prediction target (sketch below)
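A sketch of the fill-in-the-middle formatting this describes: cut out a span, move it to the end as the target, and mark the pieces with sentinel tokens. The sentinel names mirror the common FIM convention but are assumptions here:

```python
def make_fim_example(code: str, span_start: int, span_end: int) -> str:
    """Turn one document into a prefix/suffix/middle training string.
    The model learns to generate everything after <FIM_MIDDLE>."""
    prefix, middle, suffix = code[:span_start], code[span_start:span_end], code[span_end:]
    return f"<FIM_PREFIX>{prefix}<FIM_SUFFIX>{suffix}<FIM_MIDDLE>{middle}"

src = "def add(a, b):\n    return a + b\n"
print(make_fim_example(src, src.index("return"), len(src)))  # masks the function body
```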
Method: Long-Context Extension
RoPE method → why does this work? (sketch below)
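To make the "why" concrete: RoPE rotates each pair of dimensions of q and k by an angle proportional to the token position, so the dot product depends only on the relative offset; long-context extensions work by rescaling those angles (e.g. position interpolation). A minimal NumPy sketch:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)        # per-pair frequencies
    angle = pos * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angle) - x2 * np.sin(angle)
    out[1::2] = x1 * np.sin(angle) + x2 * np.cos(angle)
    return out

# q·k after RoPE depends only on the relative offset (100-97 == 3-0):
q, k = np.random.randn(64), np.random.randn(64)
print(np.dot(rope(q, 100), rope(k, 97)), np.dot(rope(q, 3), rope(k, 0)))  # ≈ equal
```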
research idea 2
File Localization
- finding the correct files given the user intent
- does OpenHands have such issues?
- Solution
- offload to the user
- prompt the agent w/ Search Tools
- A-priori Map the Repo
- makes the most sense with a SQL database that does not change much
- Aider repomap
- Agentless (Xia et al 2024)
- Retrieval-augmented Code Generation
- please, we need documentation (LLM-assisted generation)
Planning and Error Recovery
Hard-coded Task Completion Process
Agentless
- File Localization
- Function Localization
- Patch Generation
- Patch Application
LLM-Generated Plans
CodeR (Chen et al. 2024)
- breaks the task into sub-agents
Planning and Revisiting
CoAct goes back and fixes (Hou et al. 2024)
Fixing Based on Error Messages → how to understand the error message?
InterCode (Yang et al. 2023)
Safety
- Docker - limit the execution environment
- Credentialing
- Post-hoc Auditing
Directions
- Agentic training methods
- Human-in-the-loop
- Broader software tasks than coding
Lecture 7: AI Agents for Enterprise Workflows
- API Agents
- Web Agents
LLM-Based Single Agents: Typical Architecture
```mermaid
graph LR
  LLM(LLM Agent)
  Tools(Tools)
  Mem(Memory)
  Plan(Planning)
  Act(Action)
  Env(Environment)
  STM(Short-Term Memory)
  LTM(Long-Term Memory)
  Ref(Reflection)
  Self(Self-critique)
  Chain(Chain of thoughts)
  Sub(Subgoal Decomposition)
  Calc(Calculator)
  Code(CodeInterpreter)
  Web(WebSearch)
  Trig(TriggerWorkflow)
  More(... more ...)
  LLM --> Mem
  LLM --> Plan
  LLM --> Act
  Tools --> LLM
  Mem --> STM
  Mem --> LTM
  Tools --> Calc
  Tools --> Code
  Tools --> Web
  Tools --> Trig
  Tools --> More
  Plan --> Ref
  Plan --> Self
  Plan --> Chain
  Plan --> Sub
  Act --> Env
  Env --> LLM
  Plan -.-> Mem
```
TapeAgents
- LangGraph, AutoGen, Crew → resumable modular state machine
- DSPy, TextGrad, Trace → code that uses structured modules and generates structured logs
- TapeAgents → combines the two