Online course from Berkeley here
Syllabus
| Lecture | Supplemental Reading |
|---|---|
| LLM Reasoning | |
| LLM Agents: Brief History and Overview | |
| Agentic AI Frameworks & AutoGen | |
| Building a Multimodal Knowledge Assistant | |
Lecture 1: LLM Reasoning
Last Letter Concatenation Problem.
| Input | Output |
|---|---|
| Elon Musk | nk |
| Bill Gates | ls |
| Barack Obama | ? |
Adding a ‘reasoning process’ before the ‘answer’?
Key Idea: Derive the Final Answer through Intermediate steps.
- Ling et al., Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems
- Cobbe et al., Training Verifiers to Solve Math Word Problems
- Nye et al., Show Your Work: Scratchpads for Intermediate Computation with Language Models
Training, fine-tuning, and prompting with intermediate steps: show the model examples whose solutions include the intermediate steps. A minimal prompt sketch follows.
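A minimal sketch of a few-shot prompt with intermediate steps for the last-letter task; the exemplar wording and the `call_llm` helper are assumptions, not from the lecture slides:

```python
# Few-shot prompt whose exemplars spell out the reasoning before the answer.
PROMPT = """Q: Take the last letters of the words in "Elon Musk" and concatenate them.
A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating them gives "nk". The answer is nk.

Q: Take the last letters of the words in "Bill Gates" and concatenate them.
A: The last letter of "Bill" is "l". The last letter of "Gates" is "s". Concatenating them gives "ls". The answer is ls.

Q: Take the last letters of the words in "Barack Obama" and concatenate them.
A:"""

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's chat/completions API."""
    raise NotImplementedError

print(call_llm(PROMPT))  # expected to end with: "The answer is ka."
```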
Reasoning strategies
- Least-to-Most Prompting
- “Let’s think step by step” prompt
- Recall related problems - adaptively generate relevant examples and knowledge, rather than using a fixed set of examples
- Chain-of-Thought Reasoning without Prompting - CoT paths can surface among alternative top-k decoding tokens, rather than from greedy decoding alone
- Self-Consistency - sample several reasoning paths and majority-vote the final answer (sketch after this list)
- Universal Self-Consistency
- Consensus via debate among multiple LLMs
- Oracle feedback for self-debug (unit tests)
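A minimal sketch of self-consistency, assuming we can draw several step-by-step samples and parse a final answer out of each (`sample_reasoning_path` and the answer format are assumptions):

```python
from collections import Counter

def sample_reasoning_path(question: str) -> str:
    """Hypothetical: one step-by-step sample from the LLM at temperature > 0."""
    raise NotImplementedError

def extract_answer(path: str) -> str:
    """Hypothetical parse: pull the text after the final 'The answer is'."""
    return path.rsplit("The answer is", 1)[-1].strip(" .")

def self_consistency(question: str, n_samples: int = 10) -> str:
    """Sample several reasoning paths, then majority-vote the final answers."""
    answers = [extract_answer(sample_reasoning_path(question)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```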
Test
- Compositional Generalization
Limitation
- Distracted by irrelevant context
- Cannot self-correct reasoning yet
- Premise Order Matters in LLM Reasoning
Concerns about generating intermediate steps instead of direct answers?
- Probabilistic nature of LLM for token generation
What LLM does in decoding:
What we want:
One-step Further
Summary
- Generating intermediate steps improves LLM performance
- Training / finetuning / prompting with intermediate steps
- Zero-shot, analogical reasoning, special decoding
- Self-consistency greatly improves step-by-step reasoning
- Limitation: irrelevant context, self-correction, premise order
What's next?
- Define a right problem to work on
- Solve it from the first principles
Lecture 2: LLM Agents: Brief History and Overview
Agent: an intelligent system that interacts with some environment
Types of Task
- Reasoning
- Knowledge - can be augmented with RAG
- Computation
Tools
- Search Engine, Calculator
- Task-specific models
- APIs
Question and Answer Breakdown
- Symbolic Reasoning
- Mathematical Reasoning
- Commonsense QA
- Knowledge-intensive QA
- Multi-hop knowledge-intensive QA
Potential Tools
- Chain of Thought
- Tool use
- RAG
- Program of Thought
- WebGPT
- Self-ask
- IRCoT
ReAct = Reason and Act (loop sketch below)
- cannot explore systematically or incorporate feedback
- the agent's own context is an internal space of effectively unlimited size, changed by the act of thinking
- reasoning is an internal action for agents
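A minimal sketch of the ReAct loop, assuming a single tool call per step; the prompt format, `call_llm`, and `run_tool` are illustrative stand-ins, not the paper's exact interface:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call."""
    raise NotImplementedError

def run_tool(action: str, argument: str) -> str:
    """Hypothetical tool executor, e.g. a search engine or calculator."""
    raise NotImplementedError

def react_loop(task: str, max_steps: int = 8) -> str:
    """Interleave Thought (internal action), Action (tool call), and Observation
    until the model emits 'Final Answer:'."""
    context = f"Task: {task}\n"
    for _ in range(max_steps):
        step = call_llm(context + "Thought:")
        context += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        # Expect something like "Action: search[LLM agents]" in the model output.
        action, argument = step.split("Action:", 1)[1].strip().split("[", 1)
        observation = run_tool(action, argument.rstrip("]"))
        context += f"Observation: {observation}\n"
    return "no answer within the step budget"
```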
Memory
Short-Term Memory
- append-only
- limited context
- limited attention
- do not persist over new tasks
Long-Term Memory
- read and write
- stores experience, knowledge, skills
- persists across new experiences
Reflexion
- Task
- Trajectory
- Evaluation
- Reflection (e.g., identify which unit test fails in a coding example). How does it update the memory? (sketch below)
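One plausible answer to the memory question, sketched: after each failed trial the agent writes a verbal reflection into an append-only long-term memory that conditions the next trial. The helper names here are assumptions, not the Reflexion codebase:

```python
def run_agent(task: str, memory: list[str]) -> str:
    """Hypothetical: one attempt at the task, conditioned on past reflections."""
    raise NotImplementedError

def evaluate(trajectory: str) -> tuple[bool, str]:
    """Hypothetical evaluator, e.g. run the unit tests and report which failed."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call."""
    raise NotImplementedError

def reflexion(task: str, max_trials: int = 3) -> str:
    """Long-term memory is an append-only list of verbal reflections fed back
    into every new trial -- that is the memory update."""
    long_term_memory: list[str] = []
    trajectory = ""
    for _ in range(max_trials):
        trajectory = run_agent(task, memory=long_term_memory)
        ok, feedback = evaluate(trajectory)          # e.g. which unit test failed
        if ok:
            break
        long_term_memory.append(call_llm(
            f"Trajectory:\n{trajectory}\nFeedback:\n{feedback}\n"
            "In 2-3 sentences, reflect on what to do differently next time."
        ))
    return trajectory
```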
Cognitive Architectures for Language Agents (CoALA)
Any agent can be described with
- Memory
- Action Space
- Decision Making
Q
- What distinguishes external environment vs internal memory ?
- What distinguishes long-term vs short-term memory?
Symbolic AI Agent → Deep RL Agent → LLM Agent
Language is the latent space for LLM agents.
Challenge
- Reasoning over real-world language
- decision making over open-ended actions and long horizon
What’s Next ?
- Training - FireAct: Toward Language Agent Fine-tuning
- Interface - SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
- Robustness - how many pass out of k?
- Human - τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
- Benchmark
https://princeton-nlp.github.io/language-agent-impact/
EMNLP tutorial on language agents (Nov 12-16) - sorry bois, the visa prob won't work out
Lecture 3a: Agentic AI Frameworks & AutoGen
Generative → Agentic
Self-healing code example: Commander → Writer → Safeguard
Multi-Agent Orchestration
- static / dynamic
- NL / PL
- context sharing / isolation
- cooperation / competition
- centralized / decentralized
- intervention / automation
Agentic Design Patterns
- conversation
- prompting & reasoning
- tool use
- planning
- integrating multiple models, modalities and memories
AutoGen Framework
- define agents
- get them to talk (a minimal two-agent sketch below)
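A minimal two-agent sketch in the AutoGen style (the model name, config keys, and task string are placeholders; the `AssistantAgent`/`UserProxyAgent` pattern follows the pyautogen examples as I understand them):

```python
from autogen import AssistantAgent, UserProxyAgent

# Assumed config: an OpenAI-compatible model; adjust to your provider.
llm_config = {"config_list": [{"model": "gpt-4o-mini", "api_key": "..."}]}

assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",                      # fully automated reply loop
    code_execution_config={"work_dir": "coding"},  # lets the proxy run generated code
)

# "Get them to talk": the proxy sends the task and relays execution results back.
user_proxy.initiate_chat(assistant, message="Plot NVDA vs TSLA YTD and save to chart.png")
```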
Conversation Type
- Blogpost writing with reflection
- Nested chat → multiple experts to do reflection
- Conversational with tools
- Group chat
Lecture 3b: Building a Multimodal Knowledge Assistant
Knowledge Base → Vector Database via chunking
Knowledge Assistant with Basic RAG
Data Processing and Indexing
Data → Basic Text Splitting → Index
Basic Retrieval and Prompting
Top-k (k = 5) retrieved chunks → simple QA prompt → response (sketch below)
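A basic-RAG sketch in the LlamaIndex style, assuming default chunking and a local `data/` folder of documents:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Data -> basic text splitting -> vector index (chunking uses library defaults here).
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Top-5 retrieval plugged into a simple QA prompt.
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("What does the design doc say about caching?"))
```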
Limitations
- Naive data processing, primitive retrieval interface
- Poor query understanding/ planning
- No function calling or tool use
- Stateless, no memory
A Better Knowledge Assistant
- High-quality Multimodal RAG
- Complex output generation
- Agentic reasoning over complex inputs
- Towards a scalable, full-stack application
Data → Data Processing → Index → Agent → Response
Setting up Multimodal RAG
- ETL for LLMs
  - parsing
  - chunking
  - indexing
- Complex Documents → embedded tables, charts, images, irregular layouts, headers/footers
- LLM-Native Document Parser
- Agentic RAG
- Unconstrained vs Constrained Flows
Agentic Orchestration Foundations
- event-driven
- composable
- flexible
- code-first
- debuggable and observable
- easily deployable to production
Multimodal Report Generation → Final System ?
Lecture 4: Enterprise trends for generative AI, and key components of building successful agents/applications
- needle in a haystack test
Trends
- AI moving faster
- amount of data needed has come down
- anyone can develop AI with new tools
- Technical
- multimodal
- efficient sparse models
- Platform
- broad set of models
- customization
- Cost of API Call → $0
- Search → LLM and Search
- Enterprise Search/ Assistant
Customization
- Fine Tuning
- Distillation
- Grounding
- Function Calling
- Prompt Design vs Prompt Tuning
Fine Tuning
- Parameter-efficient Fine Tuning (PEFT)
- Low-Rank Adaptation (LoRA) (sketch below)
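A sketch of the LoRA idea: freeze the pretrained weight W and learn a low-rank update BA, so the adapted layer computes Wx + (alpha/r)·BAx. The module below is illustrative, not a specific library's implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus trainable low-rank update: y = Wx + (alpha/r)·B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```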
Distillation
- teacher-student model
- softmax function with temperature (loss sketch below)
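A sketch of the teacher-student objective using the temperature-scaled softmax; the temperature and mixing weight are illustrative defaults, not values from the lecture:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha * KL(teacher_T || student_T) * T^2  +  (1 - alpha) * CE(student, labels)."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)          # softened teacher distribution
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
    hard_loss = F.cross_entropy(student_logits, labels)           # usual hard-label loss
    return alpha * soft_loss + (1 - alpha) * hard_loss
```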
Grounding
Minimize Hallucination
- right context
- Retrieval Service
- dynamic retrieval
- better models
- user experience
Function Calling
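A sketch of what function calling looks like in practice: the model receives a JSON tool schema and, instead of prose, returns a function name plus arguments for the application to execute. The schema follows the common OpenAI-style convention; `get_weather` is a made-up example:

```python
# Tool schema handed to the model; it replies with {"name": ..., "arguments": {...}},
# which the application executes and feeds back. Function and fields are illustrative.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Berkeley"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]
# Grounding loop: model emits a call -> app runs get_weather(**arguments)
# -> result goes back into the conversation -> model writes the final answer.
```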
Lecture 5: Compound AI Systems & the DSPy Framework
modular programs that use LMs as specialized components
The monolithic nature of LMs makes them hard to control, debug, and improve.
Retrieval-Augmented Generation
- Transparency: can debug traces & offer user-facing attribution
- Efficiency: can use smaller LMs, offloading knowledge & control flow
Multi-Hop Retrieval-Augmented Generation
- control: can iteratively improve the system & ground it via tools
Compositional Report Generation
- quality: more reliable composition of better-scoped LM capabilities
- inference-time scaling: systematically searching for better outputs
- DIN-SQL
Role of Prompt
- The core input → output behavior, a signature
- The computation specializing an inference-time strategy to the signature, a predictor
- The computation formatting the signature's inputs and parsing its typed outputs, an adapter
- The computations defining objectives and constraints on behavior, metrics and assertions
- The strings that instruct (or weights that adapt) the LM for desired behavior, an optimizer
What if we could abstract Compound AI Systems as programs with fuzzy natural-language-typed modules that learn their behavior ? (DSPy → Declarative Self-Improving Python)
An LM program maps input X to output Y, both expressed in natural language. In the course of its execution, it makes calls to modules. Each module is a declarative LM invocation, defined via inherently fuzzy natural-language descriptions of:
- a sub-task (optional)
- input domain type(s)
- output co-domain type(s)
For each module, determine:
- the string prompt into which its inputs are plugged, and
- the weights assigned to the LM,
in the optimization problem defined by a small training set and a metric over labels or hints. (A signature/module sketch follows.)
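A minimal DSPy-style signature and module for the RAG case above; the field names and task are illustrative:

```python
import dspy

class GenerateAnswer(dspy.Signature):
    """Answer the question using the provided context."""      # the (optional) sub-task
    context = dspy.InputField(desc="retrieved passages")        # input domain type
    question = dspy.InputField()
    answer = dspy.OutputField(desc="short factual answer")      # output co-domain type

class RAG(dspy.Module):
    def __init__(self, k: int = 3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=k)
        self.generate = dspy.ChainOfThought(GenerateAnswer)     # predictor specializing the signature

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)
```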
Optimizers
- Construct an initial prompt from each module via an Adapter
- Generate examples of every module via rejection sampling
- Use the examples to update the program’s modules
- automatic few-shot prompting: dspy.BootstrapFewShotWithRandomSearch (compile sketch after this list)
- induction of instructions: dspy.MIPROv2
- multi-stage fine-tuning: dspy.BootstrapFinetune
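A sketch of compiling such a program with one of the optimizers above; the metric, training set, and hyperparameters are placeholders:

```python
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

def exact_match(example, prediction, trace=None):
    """Placeholder metric: 1 if the predicted answer matches the label."""
    return example.answer.lower() == prediction.answer.lower()

# Placeholder training data; real examples would come from your task.
trainset = [dspy.Example(question="...", answer="...").with_inputs("question")]

optimizer = BootstrapFewShotWithRandomSearch(metric=exact_match, max_bootstrapped_demos=4)
compiled_rag = optimizer.compile(RAG(), trainset=trainset)   # RAG from the sketch above
```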
Problem Setting
Inputs: training/validation inputs + LM program + metric
Outputs: optimized LM program
Optimize → instructions + few-shot examples
Metrics
- labels - can use the historical tables and schema-change tables as the 'correct' response
- grounded in the context that was retrieved? what counts as correct context?
Constraints / Assumptions
- No access to log-probs or model weights: developers may want to optimize LM programs for use on API-only models
- No intermediate metrics/ labels: we assume no access to manual ground-truth labels for intermediate stages
- Budget-Conscious: we want to limit the number of input examples we require and the number of program calls we make
- the input can be scaled based on historical changes to the table pairs with chat data
Key Challenges
Prompt Proposal
Searching over the space of possible prompts
Credit Assignment
Determining how each prompt variable contributes to performance
Methods
- Bootstrap Few-Shot (with Random Search)
- iterate on examples
- Extending OPRO (Optimization through Prompting)
- “think step by step”
- “take a deep breath and think step by step”
- “I believe in you”
- evaluate and request more proposals
- Coordinate-Ascent OPRO
- Module-Level OPRO
- Grounding
- dataset summary
- history of instructions
- bootstrapped demos
- LM Program Code itself
- MIPRO (Multi-prompt Instruction PRoposal Optimizer)
- Prompt Proposal (1 & 2), Credit Assignment (3)
- Bootstrap Task Demonstrations
- Propose Instruction Candidates using an LM Program
- Jointly tune with a Bayesian hyperparameter optimizer (Bayesian Surrogate Model)
Optimizing instructions can deliver gains over the baseline signature; optimizing bootstrapped demonstrations delivers gains as well.
Key Lessons 1: Natural Language Programming
- Programs can often be more accurate, controllable, transparent, and even efficient than models
- You just need declarative programs, not implementation details. High-level optimizers can bootstrap prompts - or weights, or whatever the next paradigm deals with.
Hand-written prompts ⇒ Signatures
Prompting techniques and prompt chains ⇒ Modules
Manual prompt engineering ⇒ Optimized programs
Key Lessons 2: Natural Language Optimization
- In isolation, on many tasks nothing beats bootstrapping good demonstrations. Show don't tell!
- Generating good instructions on top of these is possible, and is especially important for tasks with conditional rules!
- But you will need effective grounding, and explicit forms of credit assignment.
Lecture 6: Agents for Software Development
Levels of Support
- Manual Coding
- Copilot/Cursor code completion
- Copilot chat refactoring
- DiffBlue test generation, Transcoder code porting
- Devin/ OpenDevin end-to-end development
Challenges in Coding Agents
- Defining the environment
- Designing observations / actions
- Code Generation (atomic actions)
- File Localization (exploration)
- Planning and Error Recovery
- Safety
Software Development Environment
- Source Repositories: GitHub, GitLab
- Task Management Software: Jira, Linear, GitHub Issues
- Office Software: Google Docs, Microsoft Office
- Communication Tools: Gmail, Slack
Simple Coding → Specification → Code
Metrics
- Pass@k ← unit tests (estimator sketch below)
- Lexical / semantic overlap ← dataset leakage of the test set
- Visual similarity of the website ← for front-end tasks
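For reference, the unbiased pass@k estimator (as popularized by the Codex evaluation): with n samples per problem of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). A small sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn from n, c correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 of 10 generated programs pass the tests:
print(pass_at_k(n=10, c=3, k=1))   # ≈ 0.30
print(pass_at_k(n=10, c=3, k=5))   # ≈ 0.917
```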
Multimodal coding? → Design2Code ← can be used to re-code from the front end
Objective of Coding Agents
- Understanding repo structure
- Read in existing code
- Modify or produce code
- Run code and debug
LM-friendly tools
Method: Code Generating LM
adding code (to training data) improves the reasoning of the model???? find paper → The Stack v2
research idea 1 ?
- Use an LLM to tell if a repo is bad or not? → are LLMs good at giving a numerical score as a judgement? e.g. "this sentence is a good sentence"?
- microsoft GEMBA ?
- categorical evaluations yield better results? what about asking the LM to describe it in a sentence and using simple NLP sentiment analysis?
- chain of thought
- directional feedback ?
- NumeroLogic → multiple operations ?
- microsoft GEMBA ?
- GM cut bottom 10%, train new model
- does new model do better with less bad data ?
- repeat steps 1 and 2; I feel like this has been done before? in vision models?
Method: Code Infilling
masking spans and using the missing info as the prediction target (sketch below)
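A sketch of the fill-in-the-middle formatting this describes: cut out a span, move it to the end as the target, and mark the pieces with sentinel tokens. The sentinel names mirror the common FIM convention but are assumptions here:

```python
def make_fim_example(code: str, span_start: int, span_end: int) -> str:
    """Turn one document into a prefix/suffix/middle training string.
    The model learns to generate everything after <FIM_MIDDLE>."""
    prefix, middle, suffix = code[:span_start], code[span_start:span_end], code[span_end:]
    return f"<FIM_PREFIX>{prefix}<FIM_SUFFIX>{suffix}<FIM_MIDDLE>{middle}"

src = "def add(a, b):\n    return a + b\n"
print(make_fim_example(src, src.index("return"), len(src)))  # masks the function body
```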
Method: Long-Context Extension
RoPE method → why does this work? (sketch below)
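To make the "why" concrete: RoPE rotates each pair of dimensions of q and k by an angle proportional to the token position, so the dot product depends only on the relative offset; long-context extensions work by rescaling those angles (e.g. position interpolation). A minimal NumPy sketch:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)        # per-pair frequencies
    angle = pos * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angle) - x2 * np.sin(angle)
    out[1::2] = x1 * np.sin(angle) + x2 * np.cos(angle)
    return out

# q·k after RoPE depends only on the relative offset (100-97 == 3-0):
q, k = np.random.randn(64), np.random.randn(64)
print(np.dot(rope(q, 100), rope(k, 97)), np.dot(rope(q, 3), rope(k, 0)))  # ≈ equal
```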
research idea 2
File Localization
- finding the correct files given the user intent
- does OpenHands have such issues?
- Solution
- offload to the user
- prompt the agent w/ Search Tools
- A-priori Map the Repo
- makes the most sense with a SQL database that does not change much
- Aider repomap
- Agentless (Xia et al 2024)
- Retrieval-augmented Code Generation
- please, we need documentation (LLM-assisted generation)
Planning and Error Recovery
Hard-coded Task Completion Process
Agentless
- File Localization
- Function Localization
- Patch Generation
- Patch Application
LLM-Generated Plans
CodeR (Chen et al. 2024)
- breaks the task into sub-agents
Planning and Revisiting
CoAct goes back and fixes (Hou et al. 2024)
Fixing Based on Error Messages → how to understand the error message?
InterCode (Yang et al. 2023)
Safety
- Docker - limit the execution environment
- Credentialing
- Post-hoc Auditing
Directions
- Agentic training methods
- Human-in-the-loop
- Broader software tasks than coding
Lecture 7: AI Agents for Enterprise Workflows
- API Agents
- Web Agents
LLM-Based Single Agents: Typical Architecture
```mermaid
graph LR
  LLM(LLM Agent)
  Tools(Tools)
  Mem(Memory)
  Plan(Planning)
  Act(Action)
  Env(Environment)
  STM(Short-Term Memory)
  LTM(Long-Term Memory)
  Ref(Reflection)
  Self(Self-critique)
  Chain(Chain of thoughts)
  Sub(Subgoal Decomposition)
  Calc(Calculator)
  Code(CodeInterpreter)
  Web(WebSearch)
  Trig(TriggerWorkflow)
  More(... more ...)
  LLM --> Mem
  LLM --> Plan
  LLM --> Act
  Tools --> LLM
  Mem --> STM
  Mem --> LTM
  Tools --> Calc
  Tools --> Code
  Tools --> Web
  Tools --> Trig
  Tools --> More
  Plan --> Ref
  Plan --> Self
  Plan --> Chain
  Plan --> Sub
  Act --> Env
  Env --> LLM
  Plan -.-> Mem
```
TapeAgents
- LangGraph, AutoGen, Crew → resumable modular state machine
- DSPy, TextGrad, Trace → code that uses structured modules and generates structured logs
- TapeAgents → combines the two