Show me what you do, and I will tell you what you want. That is the bet of inverse reinforcement learning. It is also the bet behind every modern LLM alignment pipeline, every robot learning from human demonstration, and every recommendation system trying to figure out what users actually care about. This post is about how the field first learned to take that bet seriously, in 2004 and 2006, on examples small enough to fit in a 12×12 gridworld. We trace Abbeel-Ng (2004) and MMP (2006) with interactive widgets at every key step. The fact that took me longest to internalise: the cost numbers these algorithms recover are wrong, and that turns out to be exactly the right answer.

Notation cheat sheet (open this before reading the math sections)

Every symbol the post uses, what it stands for, and how it's pronounced. The post is consistent about these, but the first appearance of each can come without warning.

symbolpronouncedmeaning
$s$"s"a state, i.e. one cell of the gridworld
$a$"a"an action (N, E, S, W, or L, R depending on the example)
$\xi$"xi" (Greek)a trajectory, a sequence of states the agent visits
$\xi_E$"xi sub E"the expert's trajectory (the one we observe)
$\xi^*$"xi star"the worst-violator trajectory in MMP (output of loss-augmented planner)
$\pi$"pi"a policy (a mapping from states to actions)
$\phi(s)$"phi of s"per-state feature vector. In the gridworld this is a one-hot indicator over terrain types: grass / road / sand / water / mountain
$f(\xi)$"f of xi"feature counts of trajectory $\xi$, equal to $\sum_{s \in \xi} \phi(s)$. The histogram of terrains the path visits. Also called the feature expectation
$f_E$"f sub E"shorthand for $f(\xi_E)$, the expert's feature counts
$f^*$"f star"shorthand for $f(\xi^*)$, the worst violator's feature counts
$w$"w"weight vector. Same length as $\phi(s)$. Interpreted as cost weights in MMP (higher = more expensive) and as reward weights in Abbeel-Ng (higher = preferred)
$c(s)$"c of s"cost of being in state $s$ under current weights, computed as $c(s) = w \cdot \phi(s)$
$L(s)$"L of s"per-state Hamming loss. Equals 1 if $s$ is NOT on the expert path, 0 if it IS
$\ell(\xi)$"script L of xi"total Hamming loss along path $\xi$, equal to $\sum_{s \in \xi} L(s)$. The number of non-expert cells the path visits
$c'(s)$"c prime of s"loss-augmented cost, $c'(s) = c(s) - L(s)$. Non-expert cells get a 1-unit discount
$\eta$"eta" (Greek)step size / learning rate
$\zeta$"zeta" (Greek)slack variable in the soft-margin MMP optimisation
$\lambda$"lambda" (Greek)used twice (sorry): mixing weights in Abbeel-Ng's final policy mixture, and Lagrange multipliers in the proofs
$t$"t"timestep within an episode
$T$"T"horizon length of the episode
$V_t(s)$"V sub t of s"value function at time $t$ in state $s$, the expected return going forward
$Q_t(s, a)$"Q sub t of s, a"action-value at time $t$, expected return if you take action $a$ in state $s$ and then act optimally
$\|w\|$"norm of w"Euclidean length: $\|w\| = \sqrt{w_1^2 + w_2^2 + \cdots}$
$w \cdot v$"w dot v"dot product of two vectors: $w_1 v_1 + w_2 v_2 + \cdots$

Cost vs reward sign convention. Abbeel-Ng treats $w$ as reward weights (the expert maximises $w \cdot \phi$). MMP treats $w$ as cost weights with $w \geq 0$ (the expert minimises $w \cdot \phi$). The math is the same up to a sign flip, but the literature uses both, and the post follows whichever paper a given section is explaining. Within a section the convention is consistent.

The question we're asking

You're watching an expert do a task. You can see what they do but not why. You'd like to behave like them.

The straightforward approach, behavioural cloning, just memorises "when in state $s$, take action $a$" and replays it. The problem with that: the moment you find yourself in a state the expert never showed you, you have no idea what to do. Errors compound. You drive off the road.

Inverse RL bets that there's a more compact answer. The expert has some preferences, they like roads, they avoid puddles, they care about reaching the destination. If we can recover those preferences, then in any new situation we can re-derive the right action by planning. That's the dream.

The two algorithms in this post:

  • Abbeel-Ng (2004), "I'll find a policy that spends the same amount of time on each type of terrain as the expert. I won't bother recovering the preferences themselves."
  • MMP (2006), "I'll directly tune a cost vector so that the expert's exact route is the cheapest one. Then I have a cost function I can use in new environments."

Both work. They're useful for different reasons.

A concrete world

Here's the 12×12 gridworld we'll use as a running example. Each cell is a terrain type. The expert starts at the black dot, wants to reach the gold star. Walking on each terrain has a cost:

The expert's path · animated
Interactive
step 0 / 22terrain grass
grass road sand water mountain
The red line is what the expert actually does. It hugs the grey road all the way around. Why? Because road is cheap. But you don't know that , you just see the line. Click Animate to watch the expert walk step-by-step.
TerrainColourTrue cost (hidden from you)
grass  1.0
road  0.2
sand  2.0
water  5.0
mountain  8.0

The expert takes a 23-step path, mostly on the road. We, the IRL researchers, get to see the path. We don't see the cost table. Our job is to recover something useful from that path alone.

What information do we actually have?

This is the part beginners stumble on, so let me be explicit. We have:

  • The map. Which cell is which terrain. We can see the grid.
  • The rules. Move N/E/S/W; bumping a wall keeps you in place.
  • The expert's path. A sequence of 23 cells from start to goal.

We do not have:

  • The cost table.
  • Any signal about whether the expert was happy or sad at each step.
  • Any "preferences" the expert might confess if asked.

That's it. Now we need a strategy.

The big idea: count what the expert visits

Here's the move that makes IRL tractable. Instead of asking "what's the expert's reward function?", which has infinitely many answers, we ask "what kinds of cells does the expert visit?"

For our gridworld:

TerrainCells the expert visits
grass2
road20
sand0
water0
mountain0

Call this vector $f_E = (2, 20, 0, 0, 0)$. It's called the feature expectation, basically a histogram of terrain types along the expert's path.

Why does this help? Because any policy that produces the same histogram will look just as good as the expert under any cost function that depends only on terrain. Think about it: if the cost of a path is $\text{cost}(\text{path}) = w_{\text{grass}} \cdot \#\text{grass} + w_{\text{road}} \cdot \#\text{road} + \ldots$ , a sum over terrain visits, then matching the histogram matches the cost no matter what the weights are.

The trick in one sentence: if the cost is a linear function of terrain-visit counts, then matching the expert's visit counts gives expert-equivalent cost regardless of what the weights are. We don't need to know the weights to imitate the expert.

This is the foundational observation behind both algorithms. They differ in how they find a policy whose visit counts match $f_E$.

Apprenticeship Learning (Abbeel & Ng, 2004)

The first classical IRL algorithm is Apprenticeship Learning via Inverse Reinforcement Learning, from Abbeel and Ng's 2004 ICML paper. The community usually calls it just "Abbeel-Ng." Its slogan: match the expert's visit counts. The full procedure is a guessing game with four steps.

  1. Start with any policy (a random one is fine) and compute its terrain-visit histogram.
  2. Compare the histogram to the expert's. Find the direction in feature space along which the expert beats this candidate the most.
  3. Plan a new policy that performs well in that direction. Add the new policy to your collection.
  4. Repeat. Stop when some mixture of the policies you have collected reproduces the expert's histogram exactly.

The "mixture" in step 4 is a probability distribution over your collected policies. If you follow policy $\pi_1$ with probability $\lambda_1$ and $\pi_2$ with probability $\lambda_2$, your average visit counts come out to $\lambda_1 f_1 + \lambda_2 f_2$. The algorithm's job, at the end, is to find the $\lambda$ values that make this average equal the expert's $f_E$.

Step 2 is the clever bit. We solve a small optimisation problem: find a unit vector $w$ such that the inner product $w \cdot (f_E - f_j)$ is positive and as large as possible for every candidate $f_j$ collected so far. This $w$ is the "opinion" that best distinguishes the expert from our candidates. We then plan a new policy that is optimal under $w$ treated as a reward function. The new policy's visit counts will be large in the direction $w$, which closes the gap to the expert.

Step 2 is sometimes called the "SVM step." The reason: it is exactly the support-vector-machine problem with the expert as one class and the candidates as the other. The name does not matter for the intuition. All we need is a direction that distinguishes "where the expert spends time" from "where our candidates do."

Walk-through: 5 iterations of Abbeel-Ng

Here's what actually happens when you run the algorithm on our gridworld. The expert's histogram is $f_E = (2, 20, 0, 0, 0)$, heavy on road, light on grass, nothing on the rest.

Iteration 0, a random policy

We start with a totally random policy. It wanders, bumping into walls and crossing whatever's in front of it. Over a 60-step horizon, its visit counts come out around $f_0 = (5, 0, 0, 0, 55)$. The 5 grass steps happen near the start; then the agent drifts into the mountain band and gets stuck for the remaining 55 steps (every action from a mountain cell leads back to a mountain neighbour in this setup). The counts sum to $5 + 0 + 0 + 0 + 55 = 60$, the full horizon.

The histogram is completely wrong. The algorithm now asks: "in what direction does the expert outscore this random policy most?"

Answer: $w \approx (-0.05, +0.34, 0, 0, -0.94)$. Reading this: "the expert visits grass slightly less than the random policy, road much more, and mountain much less." The load-bearing signal is the mountain component: the random policy got stuck there, the expert never went, so $w$ tells the next planner "whatever you do, do not go to the mountain."

Where do those weight numbers actually come from?

The "find a direction" step is a small optimization problem. Stated cleanly, with the candidates we have so far being $\{f_0\}$:

$\max_{w,\, t} \ t \quad \text{subject to} \quad w \cdot (f_E - f_0) \geq t, \quad \|w\| \leq 1.$

Read this in plain English: find a unit-norm direction $w$ such that the expert beats the candidate (under $w$ as cost weights) by margin at least $t$, and make $t$ as large as possible.

When there is only one candidate, the closed-form solution is easy. By Cauchy-Schwarz, $w \cdot v \leq \|w\| \cdot \|v\|$ with equality when $w$ points in the direction of $v$. So the optimal $w$ is just the unit vector in the $f_E - f_0$ direction (see Appendix A for three proofs of this):

$w^* = \dfrac{f_E - f_0}{\|f_E - f_0\|}, \quad t^* = \|f_E - f_0\|.$

Plugging in the numbers. First the difference vector:

$f_E - f_0 = (2,\ 20,\ 0,\ 0,\ 0) - (5,\ 0,\ 0,\ 0,\ 55)$

$= (-3,\ 20,\ 0,\ 0,\ -55)$

Then the norm:

$\|f_E - f_0\| = \sqrt{9 + 400 + 0 + 0 + 3025}$

$= \sqrt{3434} \approx 58.60$

Then the normalised direction:

$w^* = \dfrac{(-3,\ 20,\ 0,\ 0,\ -55)}{58.60}$

$\approx (-0.051,\ 0.341,\ 0,\ 0,\ -0.939)$

Each component of $w$ literally tells you "how much, per unit, does the expert exceed the candidate on this terrain, normalised so that the whole vector has length 1." The largest-magnitude component is mountain at $-0.94$, because the random policy spent 55 steps on mountain and the expert spent 0. The grass component is a small negative number ($-0.05$) because the random policy spent 5 steps on grass and the expert only spent 2. Road has a moderate positive component ($+0.34$) because the expert spent 20 steps there and the random policy spent none. Sand and water are exactly 0 because both the expert and the random policy visited them zero times, so there is no signal in those dimensions.

When there are multiple candidates, the closed form goes away. Suppose by iteration $k$ we have candidates $\{f_0, f_1, \ldots, f_{k-1}\}$. The optimization becomes:

$\max_{w,\, t} \ t \quad \text{subject to} \quad w \cdot (f_E - f_j) \geq t \ \text{ for } j = 0, \ldots, k-1, \quad \|w\| \leq 1.$

This is a quadratic program with $k$ linear constraints and one quadratic constraint (the unit-norm). You hand it to your favourite QP solver (CVXPY, scipy.optimize.minimize with method='SLSQP', or in production a specialised SVM solver). The geometric solution is elegant: $w$ ends up pointing from the closest point on the convex hull of $\{f_0, \ldots, f_{k-1}\}$ toward $f_E$. We prove this in the appendix at the end of the post.

What is the dot product actually computing?

This is the question worth pausing on. Why is $w \cdot (f_E - f_j)$ the right quantity to maximize? Three views, all saying the same thing.

View 1, as a reward gap. Recall that with linear features the reward of a trajectory is $\text{reward}(\xi) = w \cdot f(\xi)$, and the expected reward of a policy is $J(\pi, w) = w \cdot f(\pi)$. So:

$w \cdot (f_E - f_j) \ = \ w \cdot f_E - w \cdot f_j \ = \ J(\pi_E, w) - J(\pi_j, w).$

That is the gap in expected reward between the expert and candidate $j$ when the cost is $w$. Maximizing this gap over $w$ means: "find weights under which the expert clearly outperforms the candidate."

View 2, as a projection. The dot product $w \cdot v$ is the length of $v$ projected onto the unit direction $w$. So $w \cdot (f_E - f_j)$ is the length of the "expert minus candidate" vector projected onto $w$. We want the direction $w$ that makes this projection as long as possible for every candidate simultaneously. The solution is the direction that aligns best with all the difference vectors at once, which geometrically is the direction pointing from the convex hull of candidates toward the expert.

View 3, as a classifier. The optimization is exactly the support-vector-machine problem with the expert as the single positive example and the candidates as negative examples. The hyperplane with normal vector $w$ separates "expert-like feature vectors" from "candidate-like feature vectors" with maximum margin. That is why the algorithm is sometimes called the SVM step.

All three views agree on what $w$ is and how to compute it. They disagree only on which kind of mathematician finds them most natural.

Iteration 1, first informed policy

We plan a policy optimal under that $w$ as reward. The result wanders toward grass and road (because $w$ says grass has negative reward, so the planner avoids it, but mountain has even more negative reward, so it really avoids mountain). The new histogram comes out $f_1 = (38, 18, 0, 0, 0)$, heavy on both grass and road, no mountain. Closer to the expert, but the grass count is way too high.

The algorithm finds the new direction that distinguishes $f_E = (2, 20, 0, 0, 0)$ from this $f_1$ and from the random $f_0$. Now both candidates have $f_3 = 0$ (no water) but $f_E$ also has zero water, so water doesn't matter yet. The direction it finds penalises grass even more heavily.

Iterations 2, 3, exploring the polytope

The algorithm continues. Each iteration adds a new vertex to its collection. The visit counts at iterations 2 and 3 are:

itergrassroadsandwatermountain
expert220000
0 (random)500055
13818000
230131300
32215890
4220000

Iteration 4, the expert's policy emerges

At iteration 4 something nice happens: the policy the planner produces has histogram $(2, 20, 0, 0, 0)$, identical to the expert's. The algorithm checks: "is the expert still beating my candidates by a margin in some direction?" The answer is no. Mix of weights = 100% on iteration 4's policy. Done.

The whole thing took 5 iterations. The output is a policy whose visit counts are identical to the expert's, which means it has the same cost-on-average no matter what cost function you write down. Mission accomplished.

Fully worked numerical example: Abbeel-Ng on a 5-state corridor (click to expand)

The 12×12 trace earlier in the post showed the punchline numbers but skipped how they were computed. Let me work through every arithmetic step on a tiny example so you can see exactly what the algorithm is doing under the hood.

The MDP

Five states in a line, numbered 0 through 4:

0 (S) grass r=−1 1 road r=−0.2 2 road r=−0.2 3 road r=−0.2 4 (G) goal r=0
  • Two actions: L (left), R (right). Bumping a wall keeps you in place.
  • Feature map $\phi$: 2-dim. $\phi(\text{grass}) = (1, 0)$, $\phi(\text{road}) = (0, 1)$, $\phi(\text{goal}) = (0, 0)$.
  • True reward weights: $w_{\text{true}} = -(1, 0.2)$ giving the per-state rewards above. We don't know these.
  • Horizon: $T = 5$ steps.

Step 1 · Compute the expert's path via finite-horizon value iteration

The expert knows $w_{\text{true}}$ and plans optimally. We do backward induction with $V_5(s) = 0$ for all $s$ (game over at time $T$). At each prior timestep, the Bellman backup is

$Q_t(s, a) = r(s) + V_{t+1}(\text{next}(s, a))$,    $V_t(s) = \max_a Q_t(s, a)$.

Step backward: $t = 4$ (one action remaining).

With $V_5 = 0$ everywhere, both actions from any state give $Q_4(s, a) = r(s)$. So:

state $s$01234 (goal)
$V_4(s)$−1−0.2−0.2−0.20

$t = 3$ (two actions remaining).

From state 0, L bumps the wall → stays at 0 (so $Q_3(0, L) = r(0) + V_4(0) = -1 + (-1) = -2$). R moves to state 1 ($Q_3(0, R) = -1 + (-0.2) = -1.2$). Pick R.

From state 1, L → state 0 ($-0.2 - 1 = -1.2$). R → state 2 ($-0.2 - 0.2 = -0.4$). Pick R.

Continuing for every state:

state$Q_3(s, L)$$Q_3(s, R)$$V_3(s)$$\pi_3(s)$
0−1 + V₄(0) = −2−1 + V₄(1) = −1.2−1.2R
1−0.2 + V₄(0) = −1.2−0.2 + V₄(2) = −0.4−0.4R
2−0.2 + V₄(1) = −0.4−0.2 + V₄(3) = −0.4−0.4tied
3−0.2 + V₄(2) = −0.4−0.2 + V₄(4) = −0.2−0.2R
4absorbing0,

$t = 2, 1, 0$. Keep applying the same Bellman backup, each time using the previous row's $V$. The final result:

$V_4$$V_3$$V_2$$V_1$$V_0$
state 0−1−1.2−1.4−1.6−1.6
state 1−0.2−0.4−0.6−0.6−0.6
state 2−0.2−0.4−0.4−0.4−0.4
state 3−0.2−0.2−0.2−0.2−0.2
state 400000

The optimal policy at every timestep (from each state, take R) gives the expert's path:

$t=0$: state 0 → $t=1$: state 1 → $t=2$: state 2 → $t=3$: state 3 → $t=4$: state 4 (goal)

Total expert reward: $r(0) + r(1) + r(2) + r(3) + 0 = -1 - 0.2 - 0.2 - 0.2 + 0 = -1.6$ ✓ (matches $V_0(0)$).

Step 2 · Compute the expert's feature expectation $f_E$

Sum $\phi(s_t)$ along the path:

$f_E = \phi(0) + \phi(1) + \phi(2) + \phi(3) + \phi(4) = (1,0) + (0,1) + (0,1) + (0,1) + (0,0) = \mathbf{(1, 3)}$

The expert spends 1 step on grass and 3 steps on road. That's all the IRL algorithm gets to see, this 2-dim vector, the only "summary" of the demonstrations.

Step 3 · Abbeel-Ng iteration 0, a random policy

Start with any policy. Pick the worst one possible: "always L." From state 0, this bumps the wall every step, the agent never leaves state 0.

Trajectory: states 0, 0, 0, 0, 0 (over $t=0..4$).
Feature expectation: $f_0 = 5 \phi(0) = \mathbf{(5, 0)}$. All 5 steps on grass, none on road.

Step 4 · Iteration 1, find a separating direction

We want a unit vector $w$ that distinguishes the expert from our candidate $f_0 = (5, 0)$. Solve:

$\max\;t\quad\text{s.t.}\quad w \cdot (f_E - f_0) \geq t, \quad \|w\| \leq 1.$

Compute $f_E - f_0 = (1, 3) - (5, 0) = (-4, 3)$.

The optimal $w$ for "maximise $w \cdot v$ with $\|w\| \leq 1$" is just $v / \|v\|$:

$w_1 = \frac{(-4, 3)}{\|(-4, 3)\|} = \frac{(-4, 3)}{\sqrt{16+9}} = \frac{(-4, 3)}{5} = \mathbf{(-0.8, +0.6)}$

Margin: $t_1 = \|(-4, 3)\| = \mathbf{5.0}$.

Reading $w_1$: the expert weighs grass negatively (−0.8) and road positively (+0.6). Roughly correct: the expert prefers road and avoids grass.

Step 5 · Plan an optimal policy under $w_1$

Treat $w_1$ as the new reward weights: $r(s) = w_1 \cdot \phi(s)$, giving

state01234
$r$ under $w_1$−0.8+0.6+0.6+0.60

Notice: road now gives positive reward, so the agent will want to stay on the road as long as possible. Goal still gives 0, actually less attractive than road!

Running VI again with these new rewards, the optimal policy turns out to be:

at state →0123
$\pi_1$ action (most timesteps)RRR (tied)L

The action at state 3 flips from "R (reach goal)" to "L (back to road)." Goal-reaching costs +0 but road gives +0.6, the agent prefers the road. Trajectory:

state 0 → 1 → 2 → 3 → (L) 2 → end of horizon

Feature counts of this candidate policy: states visited are 0, 1, 2, 3, 2 (one of state 2 twice because we bounced back).

$f_1 = \phi(0) + \phi(1) + \phi(2) + \phi(3) + \phi(2) = (1, 0) + 4 \cdot (0, 1) = \mathbf{(1, 4)}$

Compare with the expert $f_E = (1, 3)$. Grass count matches; road count is 1 too high.

Step 6 · Iteration 2, find a new separating direction

Now we have two candidates: $f_0 = (5, 0)$ and $f_1 = (1, 4)$. The optimization is:

$\max\;t\quad\text{s.t.}\quad w \cdot (f_E - f_0) \geq t,\;\; w \cdot (f_E - f_1) \geq t,\;\; \|w\| \leq 1.$

The two difference vectors:

  • $v_0 = f_E - f_0 = (1, 3) - (5, 0) = \mathbf{(-4, 3)}$. "From all-grass toward the expert, go down on grass, up on road."
  • $v_1 = f_E - f_1 = (1, 3) - (1, 4) = \mathbf{(0, -1)}$. "From too-much-road toward the expert, cut the road count."

The single-candidate trick (just normalise $f_E - f_0$) no longer works, because the resulting $w$ would have positive margin on $v_0$ but negative margin on $v_1$. We need a $w$ that beats both at once. Solve the system directly.

Setup. At the optimum, both constraints are tight, otherwise we could rotate $w$ toward the slack constraint and gain margin. So:

$w \cdot v_0 = t \quad\text{and}\quad w \cdot v_1 = t \quad\text{and}\quad \|w\|^2 = 1.$

Let $w = (w_1, w_2)$. Plug the two vectors in:

$-4 w_1 + 3 w_2 = t \qquad (1)$

$-w_2 = t \qquad (2)$

Substituting (2) into (1):

$-4 w_1 + 3 w_2 = -w_2 \;\Longrightarrow\; -4 w_1 + 4 w_2 = 0 \;\Longrightarrow\; w_1 = w_2.$

Use the norm constraint:

$w_1^2 + w_2^2 = 1 \;\Longrightarrow\; 2 w_1^2 = 1 \;\Longrightarrow\; w_1 = \pm \tfrac{1}{\sqrt 2}.$

Pick the sign that gives positive margin. From (2), $t = -w_2$, so for $t > 0$ we need $w_2 < 0$. That forces both components negative:

$w_2 = \left(-\tfrac{1}{\sqrt 2},\ -\tfrac{1}{\sqrt 2}\right) \approx \mathbf{(-0.707,\ -0.707)}, \quad t_2 = \tfrac{1}{\sqrt 2} \approx \mathbf{0.707}.$

Verify both constraints:

  • $w_2 \cdot v_0 = (-0.707)(-4) + (-0.707)(3) = 2.828 - 2.121 = 0.707$ ✓
  • $w_2 \cdot v_1 = (-0.707)(0) + (-0.707)(-1) = 0 + 0.707 = 0.707$ ✓
  • $\|w_2\| = \sqrt{0.707^2 + 0.707^2} = \sqrt{0.5 + 0.5} = 1$ ✓

Both constraints are tight at margin $0.707$, and $w_2$ is unit norm. This is the solution.

Reading $w_2$: grass weight is $-0.707$, road weight is also $-0.707$. The algorithm has decided that both grass and road should be penalised equally. With every non-goal step costing $-0.707$ no matter what terrain, the only way to maximise return is to reach the goal as fast as possible. So the next planner will care only about path length, not about which terrain it walks on.

Step 7 · Plan an optimal policy under $w_2$

The updated per-state rewards are $r(s) = w_2 \cdot \phi(s)$:

stateterrain$\phi(s)$$r(s) = w_2 \cdot \phi(s)$
0 (start)grass$(1, 0)$$-0.707$
1road$(0, 1)$$-0.707$
2road$(0, 1)$$-0.707$
3road$(0, 1)$$-0.707$
4 (goal)goal$(0, 0)$$0$

Every non-goal step pays $-0.707$. The goal pays $0$. So the agent wants to reach the goal in as few steps as possible. Now run finite-horizon VI with these rewards. As before, $V_5(s) = 0$ for all $s$, and we work backward.

$t = 4$: $V_4(s) = r(s) + V_5(\text{next}(s, a))$ for any action; since $V_5 = 0$, $V_4(s) = r(s)$.

state01234
$V_4(s)$$-0.707$$-0.707$$-0.707$$-0.707$$0$

$t = 3$: $Q_3(s, a) = r(s) + V_4(\text{next}(s, a))$.

state$Q_3(s, L)$$Q_3(s, R)$$V_3(s)$best action
0$-0.707 + V_4(0) = -1.414$$-0.707 + V_4(1) = -1.414$$-1.414$tied
1$-0.707 + V_4(0) = -1.414$$-0.707 + V_4(2) = -1.414$$-1.414$tied
2$-0.707 + V_4(1) = -1.414$$-0.707 + V_4(3) = -1.414$$-1.414$tied
3$-0.707 + V_4(2) = -1.414$$-0.707 + V_4(4) = -0.707$$\mathbf{-0.707}$R
4absorbing$0$,

At state 3 with one step left to go, the agent strongly prefers R (which reaches the goal) over L (which keeps it on road). Everywhere else, both actions stay in non-goal cells, so the Q-values tie.

$t = 2$: use $V_3$ from above.

state$Q_2(s, L)$$Q_2(s, R)$$V_2(s)$best action
0$-0.707 + V_3(0) = -2.121$$-0.707 + V_3(1) = -2.121$$-2.121$tied
1$-0.707 + V_3(0) = -2.121$$-0.707 + V_3(2) = -2.121$$-2.121$tied
2$-0.707 + V_3(1) = -2.121$$-0.707 + V_3(3) = -1.414$$\mathbf{-1.414}$R
3$-0.707 + V_3(2) = -2.121$$-0.707 + V_3(4) = -0.707$$\mathbf{-0.707}$R

The advantage of R propagates back. Now at state 2 with two steps left, going R lets the agent reach the goal one step later, saving one $-0.707$ cost.

$t = 1, 0$: the same backup. By the time we reach $t = 0$, every state strictly prefers R (the path to the goal is shorter the further right you start). The final value table:

$V_4$$V_3$$V_2$$V_1$$V_0$
state 0$-0.707$$-1.414$$-2.121$$-2.828$$\mathbf{-2.828}$
state 1$-0.707$$-1.414$$-2.121$$-2.121$$-2.121$
state 2$-0.707$$-1.414$$-1.414$$-1.414$$-1.414$
state 3$-0.707$$-0.707$$-0.707$$-0.707$$-0.707$
state 4$0$$0$$0$$0$$0$

The optimal policy is "always go R" from any non-goal state (ties broken toward R since both ties lead to identical states and R is the path to goal). Starting from state 0 at $t = 0$:

$t=0$: state 0 $\to$ $t=1$: state 1 $\to$ $t=2$: state 2 $\to$ $t=3$: state 3 $\to$ $t=4$: state 4 (goal)

Total return: $-0.707 \times 4 + 0 = -2.828$ ✓ (matches $V_0(0)$).

Feature counts:

$f_2 = \phi(0) + \phi(1) + \phi(2) + \phi(3) + \phi(4)$

$= (1, 0) + (0, 1) + (0, 1) + (0, 1) + (0, 0) = \mathbf{(1, 3)} = f_E.$

The algorithm has matched the expert exactly. 🎉 The 5-state corridor is solved in 2 iterations.

Step 8 · Convergence check

At iteration 3, the SVM would try to find a direction in which $f_E$ outscores all candidates $\{f_0, f_1, f_2\}$. But $f_E = f_2$, so $f_E - f_2 = (0, 0)$. The SVM's margin is 0. We stop.

The output is a mixture over $\{\pi_0, \pi_1, \pi_2\}$ with weights $\lambda$ chosen to match $f_E$. Since $f_2 = f_E$ already, $\lambda = (0, 0, 1)$, just use $\pi_2$.

The story in 7 numbers: $f_0 = (5, 0)$ → $w_1 = (-0.8, +0.6)$, margin 5 → $f_1 = (1, 4)$ → $w_2 = (-0.707, -0.707)$, margin 0.707 → $f_2 = (1, 3) = f_E$ → margin 0 → done.

What the final policy mixture looks like

Abbeel-Ng's output isn't a single policy, it's a mixture over all the candidate policies the algorithm planned along the way. After convergence the algorithm picks mixing weights $\lambda = (\lambda_0, \lambda_1, \lambda_2)$ such that $\lambda_0 \, f_0 + \lambda_1 \, f_1 + \lambda_2 \, f_2 = f_E$. Here's the table of candidates and their mixture weights:

policyoptimal undertrajectory$f = (\text{grass}, \text{road})$mixing weight $\lambda$
$\pi_0$"always L" (a deliberately bad init)$0 \to 0 \to 0 \to 0 \to 0$ (bumps wall)$(5, 0)$$0$
$\pi_1$$w_1 = (-0.8, +0.6)$$0 \to 1 \to 2 \to 3 \to 2$ (loops back, road is too rewarding)$(1, 4)$$0$
$\pi_2$$w_2 = (-0.707, -0.707)$$0 \to 1 \to 2 \to 3 \to 4$ (matches expert)$(1, 3) = f_E$$1$

The mixture is "put all the weight on $\pi_2$." So the final returned policy is just $\pi_2$ itself: at every state, take $R$. This happened because $f_2 = f_E$ exactly, no convex combination of the earlier candidates was needed.

In a less convenient example, $f_E$ might not be a vertex of the polytope; it might sit in the interior. Then the algorithm would return something like $\lambda = (0, 0.3, 0.7)$, meaning "with probability 0.3 follow $\pi_1$, with probability 0.7 follow $\pi_2$." Each episode you flip the coin once at the start and commit to one policy for the whole trajectory. The resulting feature counts, averaged over episodes, equal $f_E$.

Side-by-side comparison against the expert

quantitytrue (hidden)recovered by Abbeel-Ngmatch?
reward weights $w$$(-1, -0.2)$$(-0.707, -0.707)$No, different magnitudes
policy at $t=0$, state 0$R$$R$Yes
policy at $t=0$, state 1$R$$R$Yes
policy at $t=0$, state 2$R$$R$Yes
policy at $t=0$, state 3$R$$R$Yes
trajectory from state 0$0 \to 1 \to 2 \to 3 \to 4$$0 \to 1 \to 2 \to 3 \to 4$Yes
feature counts $(f_{\text{grass}}, f_{\text{road}})$$(1, 3)$$(1, 3)$Yes

Read the table top to bottom. The recovered weights $w_2 = (-0.707, -0.707)$ do not match the true weights $w_{\text{true}} = (-1, -0.2)$. They don't even have the same magnitude pattern (the true weights say "road is much cheaper than grass," the recovered weights say "grass and road cost the same"). But the recovered policy matches the expert's exactly: at every state, take $R$. The trajectory matches. The feature counts match.

This is the philosophical point of Abbeel-Ng. The algorithm doesn't promise to recover the true cost. It promises that the policy you get out behaves like the expert. On the 5-state corridor, that's enough. (On a new map where the cost weights would need to transfer, this would not be enough; that's where MMP shines.)

Abbeel-Ng on the 5-state corridor · step through it
Interactive
iter 0 of 2 candidate visits (5, 0) expert visits (1, 3) found direction w , margin ,

The picture that makes Abbeel-Ng click

The mental model that makes everything snap together: think of each candidate's visit-count vector as a point in 5-dimensional space. The set of all achievable visit counts (over all possible policies) is a convex shape, a "polytope." The expert's $f_E$ is somewhere inside this polytope.

Each iteration of Abbeel-Ng adds a new vertex of this polytope to our collection. After enough vertices, the expert is inside the convex hull of what we've collected, and so a mix of those policies hits the expert's histogram exactly.

The picture (in 2D, for illustration):

visits of feature 1 feature 2 $f_0$ random $f_1$ $f_2$ $f_3$ $f_4$ $f_E$
Each blue dot is a policy. The shaded blob is the polytope of achievable visit counts. The gold star is the expert. Abbeel-Ng works by discovering vertices of the blue blob one at a time, until the expert's star is inside.

The interactive widget below shows exactly this picture, with you driving the iterations forward. Highly recommended, it builds a feel for what the algorithm is doing that no amount of math will.

Stretch the hull · interactive Abbeel-Ng in 2D
Interactive
vertices 0 / 5 distance to expert ,

Each click runs one Abbeel-Ng iteration: find the direction in which the expert (gold star) outscores the current hull most, then plan a new policy along that direction. Its visit-count vector becomes a new vertex (blue dot). The dashed red line is the current distance from the expert to the closest point on the hull; the algorithm stops when that distance hits zero.

"Why does the direction-finding step look like an SVM?"

When you ask "what direction does the expert outscore the candidates most in?", you're literally solving the support-vector-machine problem, find the hyperplane that best separates the expert from the candidates. It turns out (and we prove this in the appendix if you want the math) that this is the same problem as projecting the expert onto the convex hull of candidates. They're two views of the same calculation. The "projection view" is more intuitive once you have the polytope picture: the algorithm is just stretching the hull toward the expert, one vertex at a time.

Maximum Margin Planning (Ratliff, Bagnell & Zinkevich, 2006)

MMP in one sentence: learn a cost function such that, when you run a planner on it, the planner's optimal path matches the expert's demonstrated path, and do this so the recovered cost works across many different maps.

Three load-bearing parts of that sentence:

  1. We learn a cost function, not a policy. Specifically, weights $w$ on terrain features.
  2. We match the planner's output path, not its internal workings. Plug $w$ into Dijkstra, value iteration, A*, anything, and the path that comes out must equal $\xi_E$.
  3. We want the recovered cost to be portable across maps. The expert may demonstrate on a forest map; the same $w$ should produce a sensible path on a city map, as long as both maps use the same terrain types.

Point 3 is the philosophical break from Abbeel-Ng. Abbeel-Ng recovers a policy that matches the expert on this map. MMP recovers a cost that works on any map. Everything that follows is mechanics for making point 2 hold.

The five symbols this section uses. Everything below is built from these, defined once here so nothing appears cold:
  • $\xi$ (Greek "xi"), a path through the grid. $\xi_E$ is the expert's demonstrated path, the one we observe.
  • $w$, the cost weights we are solving for. One weight per terrain type.
  • $\phi(s)$, the feature vector of a single cell (a one-hot terrain indicator). $w \cdot \phi(s)$ is the cost of standing on cell $s$.
  • $f(\xi) = \sum_{s \in \xi} \phi(s)$, the feature counts of path $\xi$ (how many grass cells, how many road cells, and so on).
  • $c(\xi) = w \cdot f(\xi)$, the total cost of path $\xi$ under the current weights. So $c(\xi_E)$ is the expert path's cost, and the planner's job is to find the path with the smallest $c$.

Build the intuition on a tiny example

Forget the 12x12 grid for a moment. Imagine just two feasible paths from start to goal:

  • Expert path: start → road → goal. One road cell.
  • Alternative path: start → grass → grass → goal. Two grass cells.

We want weights $w = (w_{\text{grass}}, w_{\text{road}})$ such that the planner, given $w$, picks the expert path. Translation: we want $\;w_{\text{road}} \,<\, 2 w_{\text{grass}}$.

That's the constraint. Easy. Many $w$ satisfy it.

Naive update: raise what the planner overused

Try a guess for $w$. Run the planner. If it returns the expert path, we're done. If it returns the alternative, we made grass too cheap. Raise the cost of grass. Lower the cost of road if the expert used it and the alternative didn't. Repeat.

"Raise what the planner overused, lower what the expert favoured." That is the update rule, stated informally. If we stopped here we would have a perfectly workable algorithm. But it has one problem.

Why we need a margin, not just any cost ordering

The constraint $w_{\text{road}} < 2 w_{\text{grass}}$ is satisfied by both $w = (1, 1.99)$ and $w = (1, 0.2)$. The first sits on a knife's edge: a tiny perturbation flips the planner's choice. The second has room to breathe. We want the second.

MMP makes the breathing room explicit by demanding a margin:

$c(\xi) \;\geq\; c(\xi_E) \,+\, (\text{how different is }\xi\text{ from }\xi_E)$

"How different" counts the cells in $\xi$ that are not on the expert path. This is the Hamming distance. Formalised: define $L(s) = 1$ if state $s$ is not on the expert path, $L(s) = 0$ if it is, and define $\ell(\xi) = \sum_{s \in \xi} L(s)$. The constraint becomes:

$c(\xi) \;\geq\; c(\xi_E) \,+\, \ell(\xi) \quad\forall\,\xi.$

Read it: the more an alternative path differs from the expert, the more we should outscore it. Tiny variations cost slightly more. Wild variations cost dramatically more. That scaling makes the recovered cost robust.

The trick: one planner call finds the worst violator

The constraint must hold for every feasible path. On the 12x12 gridworld there are millions. On a city road network, more than the number of atoms in the universe. We cannot enumerate. Instead, find the single worst violator, fix that one, repeat.

The worst violator is the path that breaks the constraint by the biggest margin. The amount by which $\xi$ violates the constraint is $c(\xi_E) + \ell(\xi) - c(\xi)$. Maximising that over $\xi$, and dropping the constant $c(\xi_E)$, gives:

$\xi^* \;=\; \arg\max_\xi \big[\ell(\xi) - c(\xi)\big] \;=\; \arg\min_\xi \big[c(\xi) - \ell(\xi)\big].$

The worst violator is the path that minimises $c(\xi) - \ell(\xi)$: the cheap-and-very-different-from-expert path. Now the key observation. $c(\xi)$ and $\ell(\xi)$ both decompose additively over cells, so their difference does too:

$c(\xi) - \ell(\xi) \;=\; \sum_{s \in \xi} \big(c(s) - L(s)\big) \;=\; \sum_{s \in \xi} c'(s)$

where $c'(s) = c(s) - L(s)$ is the loss-augmented cost. Minimising a sum over cells is exactly what shortest-path solvers do. So one planner call under modified per-cell costs returns the worst violator.

What "loss-augmented" means in plain English. The modification $c'(s) = c(s) - L(s)$ does nothing on cells the expert visited ($L(s) = 0$) and gives a 1-unit discount on cells the expert did not ($L(s) = 1$). So we run shortest-path on a map where non-expert cells look 1 unit cheaper than they really are. The discount tempts the planner toward non-expert cells. Whatever path the planner returns is the alternative that gets closest to beating the expert. That is the path we need to fix.

A concrete example: which paths actually violate the constraint?

The definition above is precise but slippery on first read. Let's see what kinds of paths actually count as violators with a small worked example. Suppose the expert path costs $c(\xi_E) = 1.2$ under current $w$. Consider three candidate paths:

candidatecost $c(\xi)$Hamming $\ell(\xi)$required cost: $c(\xi_E) + \ell$violates?violation amount
A$0.8$$3$$1.2 + 3 = 4.2$yes$+3.4$
B$4.0$$3$$1.2 + 3 = 4.2$yes$+0.2$
C$5.0$$2$$1.2 + 2 = 3.2$no$-1.8$

Three observations:

  • A is cheap and different. Cost $0.8 < $ expert cost $1.2$, AND high Hamming. Big violation. The clearest kind of violator: the planner found a path that is both genuinely cheaper than the expert and goes through lots of non-expert cells.
  • B is more expensive than the expert but still violates. Cost $4.0 > 1.2$, but the Hamming margin demands at least $4.2$, and B falls short by $0.2$. A path can violate the constraint without being cheaper than the expert in absolute terms; it just has to fail to be expensive enough given how different it is.
  • C does not violate. Even with a moderate Hamming distance of 2, C is expensive enough ($5.0 > 3.2$) to clear the margin. The constraint is satisfied here.

The worst violator at this iteration is A (violation amount $+3.4$). That is the path the loss-augmented planner would return, and that is the path the update rule will "fix" by raising the cost of A's non-expert cells.

The refined intuition. The worst violator is the path that (a) visits lots of non-expert cells (high Hamming), and (b) costs less than $c(\xi_E) + \ell(\xi)$ under current $w$. In practice the worst violator usually combines "cheaper than expert" and "different from expert," because those are the two ways to maximise the violation. But the constraint MMP enforces is the looser one: it demands the cost gap exceed the Hamming margin, not just that the expert be the cheapest path.

Why this looser constraint matters. If we only required "expert cheaper than alternatives by any amount," tiny noise in the costs would flip the planner's choice. By demanding the gap scale with Hamming distance, we get a wide buffer: paths far from the expert have to be much more expensive, paths close to the expert can be only slightly more expensive. The recovered $w$ has room to breathe and remains robust to small perturbations.

The update rule, derived

Once we have the violator $\xi^*$, we want to raise $w$ on terrains $\xi^*$ overused and lower $w$ on terrains the expert favoured. The right direction is the difference in feature counts:

$w \;\leftarrow\; w \,+\, \eta \cdot \big(f^* \,-\, f_E\big)$

Read $f^* - f_E$ component by component: positive entries mark terrains where the violator overshoots the expert, negative entries mark terrains where the expert overshoots the violator. Adding $\eta \cdot (f^* - f_E)$ to $w$ raises the cost of the first kind and lowers the cost of the second kind, exactly what we want.

After each update we project $w$ to be non-negative (any component that went below zero is clipped to zero). This keeps $w$ a valid cost vector and prevents the algorithm from making cells arbitrarily attractive.

The update is the subgradient of the structured hinge loss; the full derivation lives in the drop-down at the end of this section.

The algorithm, as one block of pseudocode

Initialise w ≥ 0 (random non-negative weights).
Repeat until convergence:
  1. c(s)  = w · phi(s)                  # per-state cost under current weights
  2. c'(s) = c(s) - L(s)                  # 1-unit discount on non-expert cells
  3. xi*   = shortest_path(c')            # the worst violator, one planner call
  4. if xi* matches the expert:           # no violator remains
       stop
  5. w ← (1 - eta*lambda)*w + eta*(f(xi*) - f_E)   # L2 shrinkage + subgradient
     w ← max(0, w)                       # project to non-negative

Here $\eta$ is the step size and $\lambda$ is the L2 regularisation strength (a hyperparameter, often written $1/C$). Two things happen in line 5. The $(1 - \eta\lambda)\,w$ part shrinks the current weights toward zero, which is the effect of the $\tfrac{1}{2}\|w\|^2$ regulariser. The $\eta\,(f(\xi^*) - f_E)$ part raises the cost of terrains the violator overused and lowers the cost of terrains the expert favoured. This is exactly the update in Ratliff, Bagnell & Zinkevich (2006). If you set $\lambda = 0$ (no regularisation), the shrinkage disappears and you get the bare subgradient step $w \leftarrow w + \eta\,(f(\xi^*) - f_E)$.

Five lines on top of any shortest-path solver. Every iteration: one planner call, one feature-difference, one gradient step. The exponentially-many-constraints structured SVM collapses into something an undergrad can implement in 50 lines of Python. The drop-down below proves why line 5 has exactly this form.

Show me the full structured-SVM derivation

The intuitive walk-through above hand-waved through two claims. This appendix proves both of them rigorously, following Ratliff, Bagnell & Zinkevich (2006), which itself builds on the general structured-prediction framework of Tsochantaridis, Joachims, Hofmann & Altun (2005).

What this appendix proves. Two results:

  1. The path $\xi^*$ returned by shortest-path under modified costs $c'(s) = c(s) - L(s)$ is the worst violator of the MMP margin constraint.
  2. The vector $f^* - f_E$ is the correct gradient-descent direction for the structured hinge loss, so the update rule $w \leftarrow w + \eta(f^* - f_E)$ is a valid optimisation step.

Both claims are non-trivial. Both follow from the same setup.

Notation, all in one place

Symbols used in this appendix:

  • $\xi$, a feasible path through the gridworld.
  • $\xi_E$, the expert's demonstrated path (observed).
  • $\xi^*$, the worst-violator path (to be defined).
  • $\phi(s)$, the feature vector of a single cell. $\phi(s) = (1, 0, 0, 0, 0)$ if $s$ is grass, and so on.
  • $f(\xi) = \sum_{s \in \xi} \phi(s)$, the feature counts of path $\xi$. $f_E = f(\xi_E)$ and $f^* = f(\xi^*)$ are shorthands.
  • $w$, the cost weights we are solving for. $w \cdot \phi(s) = c(s)$ is the cost of cell $s$. $w \cdot f(\xi) = c(\xi)$ is the total cost of path $\xi$.
  • $L(s) \in \{0, 1\}$, the per-cell Hamming loss. $L(s) = 1$ if $s$ is not on the expert path.
  • $\ell(\xi) = \sum_{s \in \xi} L(s)$, the total Hamming loss of path $\xi$. Counts how many cells $\xi$ visits that the expert did not.
  • $\zeta \geq 0$, a slack variable allowing the margin constraint to be partially violated.
  • $C > 0$, a regularisation tradeoff constant.

The MMP optimisation, formalised

The goal is to find weights $w$ such that the expert path is cheaper than every alternative by at least a Hamming-distance margin. With L2 regularisation and a slack variable for the soft-margin version:

$\displaystyle \min_{w,\, \zeta} \;\; \tfrac{1}{2}\|w\|^2 + C \cdot \zeta$

$\text{s.t.}\quad w \cdot f_E + \ell(\xi) \leq w \cdot f(\xi) + \zeta \;\;\text{for every feasible}\;\xi,\quad w \geq 0,\; \zeta \geq 0.$

Read the constraint left to right: expert cost plus margin (the Hamming distance) is at most alternative cost plus slack. The $\tfrac{1}{2}\|w\|^2$ term picks the smallest-magnitude $w$ among the many that satisfy the constraints (which would otherwise be under-determined). The constraint must hold for every feasible path $\xi$, of which there are exponentially many.

Claim 1: shortest-path under $c'(s) = c(s) - L(s)$ returns the worst violator

We want to show that the single most-violated constraint can be found with one shortest-path call, instead of enumerating exponentially many paths.

Step 1. Rearrange the soft-margin constraint to isolate the slack:

$\ell(\xi) - w \cdot \big(f(\xi) - f_E\big) \leq \zeta \quad\text{for every}\;\xi.$

The left-hand side measures how much path $\xi$ violates the constraint. A path with $\ell(\xi)$ much larger than $w \cdot (f(\xi) - f_E)$ is a serious violator: it is very different from the expert (high $\ell(\xi)$) and is not much more expensive (low $w \cdot (f(\xi) - f_E)$).

Step 2. The smallest $\zeta$ that satisfies every constraint is the maximum of the left-hand side over all $\xi$:

$\zeta^* \;=\; \max_\xi \big[\ell(\xi) - w \cdot f(\xi)\big] \,+\, w \cdot f_E.$

(The $w \cdot f_E$ term came out of the max because it does not depend on $\xi$.)

Step 3. Convert the $\max$ to a $\min$ by flipping the sign:

$\max_\xi \big[\ell(\xi) - w \cdot f(\xi)\big] \;=\; -\min_\xi \big[w \cdot f(\xi) - \ell(\xi)\big].$

The worst violator is the $\xi$ that minimises $w \cdot f(\xi) - \ell(\xi)$.

Step 4. Now the key observation: both $w \cdot f(\xi)$ and $\ell(\xi)$ decompose additively over the cells of the path:

$w \cdot f(\xi) \,=\, \sum_{s \in \xi} c(s), \qquad \ell(\xi) \,=\, \sum_{s \in \xi} L(s).$

So their difference is also a sum over cells:

$w \cdot f(\xi) - \ell(\xi) \;=\; \sum_{s \in \xi} \big(c(s) - L(s)\big) \;=\; \sum_{s \in \xi} c'(s).$

where $c'(s) = c(s) - L(s)$ is the loss-augmented per-cell cost.

Conclusion. The worst violator $\xi^* = \arg\min_\xi \sum_{s \in \xi} c'(s)$ is exactly the shortest path under modified per-cell costs $c'$. A standard shortest-path algorithm (Dijkstra, value iteration, A*) returns it in one call.

Claim 2: the update rule is gradient descent on the full objective

We want to show that line 5 of the algorithm, $w \leftarrow (1 - \eta\lambda)\,w + \eta\,(f^* - f_E)$, is exactly one gradient-descent step on the regularised MMP objective. This is the part that justifies where the $(1 - \eta\lambda)$ shrinkage factor comes from.

Step 1, write down the objective for one demonstration. Combine the L2 regulariser and the structured hinge loss:

$J(w) \;=\; \underbrace{\tfrac{\lambda}{2}\|w\|^2}_{\text{L2 regulariser}} \;+\; \underbrace{\max\big(0,\; \ell(\xi^*) - w \cdot (f^* - f_E)\big)}_{\text{structured hinge } \mathcal H(w)}.$

The hinge term $\mathcal H(w)$ is the constraint-violation amount for the worst violator $\xi^*$. It is zero when the expert beats $\xi^*$ by the required margin, positive when the constraint is violated. The $\tfrac{\lambda}{2}\|w\|^2$ term is the L2 regulariser ($\lambda$ is its strength). We want to minimise $J(w)$.

Step 2, differentiate the L2 term. The gradient of $\tfrac{\lambda}{2}\|w\|^2$ with respect to $w$ is:

$\dfrac{\partial}{\partial w} \left[\tfrac{\lambda}{2}\|w\|^2\right] \;=\; \dfrac{\partial}{\partial w} \left[\tfrac{\lambda}{2}\, (w_1^2 + w_2^2 + \cdots)\right] \;=\; \lambda\, w.$

(Each $w_i^2$ differentiates to $2 w_i$, the $\tfrac{1}{2}$ cancels the 2, leaving $\lambda w_i$ in each coordinate, i.e. $\lambda w$.)

Step 3, differentiate the hinge term. In the regime where the hinge is active (constraint violated, $\mathcal H > 0$), the hinge equals $\ell(\xi^*) - w \cdot (f^* - f_E)$. Its gradient:

$\dfrac{\partial \mathcal H}{\partial w} \;=\; \dfrac{\partial}{\partial w}\big[\ell(\xi^*) - w \cdot (f^* - f_E)\big] \;=\; -\,(f^* - f_E) \;=\; f_E - f^*.$

($\ell(\xi^*)$ does not depend on $w$, so it drops. The term $-w \cdot (f^* - f_E)$ is linear in $w$ with coefficient $-(f^* - f_E)$, so its derivative is $-(f^* - f_E)$.)

Step 4, combine. The gradient of the full objective is the sum of the two pieces:

$\dfrac{\partial J}{\partial w} \;=\; \lambda\, w \;+\; (f_E - f^*).$

Step 5, take a gradient-descent step. Gradient descent moves $w$ in the direction opposite the gradient, scaled by step size $\eta$:

$w \;\leftarrow\; w \,-\, \eta \cdot \dfrac{\partial J}{\partial w} \;=\; w \,-\, \eta\,\big(\lambda w + (f_E - f^*)\big).$

Distribute the $\eta$ and group the $w$ terms:

$w \;\leftarrow\; w \,-\, \eta\lambda\, w \,-\, \eta\,(f_E - f^*) \;=\; (1 - \eta\lambda)\, w \,+\, \eta\,(f^* - f_E).$

Conclusion. This is exactly line 5 of the algorithm. The $(1 - \eta\lambda)\,w$ part is the L2 shrinkage (it pulls $w$ toward zero each step, in proportion to the regularisation strength $\lambda$). The $\eta\,(f^* - f_E)$ part is the structured-hinge subgradient (it raises the cost of terrains the violator overused). Setting $\lambda = 0$ removes the shrinkage and recovers the bare subgradient step.

Why the shrinkage matters. Without the $(1 - \eta\lambda)$ factor, nothing stops $w$ from growing without bound across iterations: each subgradient step can keep pushing the weights larger. The shrinkage caps this. At a fixed point, the shrinkage pull $\eta\lambda w$ exactly balances the subgradient push $\eta(f^* - f_E)$, which gives the unique regularised solution. This is why Ratliff, Bagnell & Zinkevich (2006) include the L2 term: not for its own sake, but to make the recovered $w$ well-defined and bounded.

Why this is called "structured"

Standard classification predicts a label from a small finite set (cat, dog, fish). The output space is small enough to enumerate, so the loss is just per-example accuracy.

Structured classification predicts a combinatorial object: a parse tree, a translation, a path through a graph. The output space is exponentially large, so enumeration is impossible. The structured-SVM framework (Tsochantaridis et al. 2005) handles this with the loss-augmented inference trick we just used: find the worst-violating output via a problem-specific solver (here, shortest path), then take a subgradient step.

MMP is the path-prediction specialisation. The "structure" is that the output (a trajectory) is a sequence of states with feature counts that decompose additively over cells. That decomposability is what lets the worst-violator search collapse into a shortest-path call. Other structured outputs (trees, alignments) use different solvers but the same overall framework.

Walk-through: how MMP rewrites the cost map

Starting with random non-negative weights and running on our gridworld:

iter$w$ (grass, road, sand, water, mountain)What's happening
0[3.6, 0.0, 26.0, 0.0, 0.8]random init; sand is randomly very high
1[6.6, 2.7, 26.0, 8.6, 9.5]loss-aug planner explored everywhere, MMP raises everything
3[8.0, 4.8, 26.0, 16.8, 15.9]water and mountain finally getting expensive
99[8.04, 4.99, 25.81, 17.90, 16.84]converged

The recovered cost ordering is:

  • road = 4.99 (cheap)
  • grass = 8.04 (more expensive than road, which is right!)
  • water = 17.90, mountain = 16.84 (both expensive, which is right!)
  • sand = 25.81 (most expensive, but actually true cost says it should be mid-range, this is a quirk of random init, the algorithm never had reason to lower it)

Compare to the true costs $(1, 0.2, 2, 5, 8)$. The recovered numbers are wrong in magnitude but mostly right in ordering (road cheapest, mountain expensive). When we plan a path under the recovered cost function, we get the same 23-step road snake as the expert, even though the numbers don't match.

This is the deep point. The recovered cost function is not unique, there are many cost functions that would produce the same expert behaviour. MMP picks one of them that satisfies its margin constraint. The recovered policy is right; the recovered weights are off by a multiplicative factor.

Why does the slack plateau at 1?

When you run MMP on this gridworld, the slack (the constraint violation amount) stops going down at exactly 1.0 and stays there. This is not a bug. The Hamming penalty is integer-valued: every non-expert cell counts as 1. Even at the best $w$, the loss-augmented planner can usually find one non-expert cell to step on for a free +1 bonus. So slack = 1 is the discrete floor. The algorithm has converged, the policy is correct, the slack just cannot go below the granularity of the penalty.

Fully worked numerical example: MMP on a 2x3 grid (click to expand)

For MMP we need an MDP where the expert doesn't visit every cell, otherwise the Hamming-loss bonus has nothing to grab onto. Here's a 2-row, 3-column grid:

0 (S) grass 1 road 2 (G) goal 3 grass 4 grass 5 grass
  • Six states, indexed 0–5. State 0 is start, state 2 is goal.
  • Actions: N, E, S, W. Walls reflect.
  • True cost: grass = 1, road = 0.2, goal = 0.
  • Feature map: $\phi$ is 2-dim grass/road one-hot.

Step 1 · The expert's path and visited states

The expert plans optimally under the true costs. The cheapest path from state 0 to state 2:

state 0 (grass) → E → state 1 (road) → E → state 2 (goal)

Total cost = 1 + 0.2 + 0 = 1.2. (Going via the bottom row would cost 1+1+1+1+0 = 4, much more.)

Feature expectation:

$f_E = \phi(0) + \phi(1) + \phi(2) = (1, 0) + (0, 1) + (0, 0) = \mathbf{(1, 1)}$

States the expert visited: $V_E = \{0, 1, 2\}$. Non-expert states: $\{3, 4, 5\}$.

Step 2 · Build the per-state Hamming loss

$L(s) = 1$ if $s$ is not on the expert's path, else 0:

state012345
$L(s)$000111

The bottom row (which the expert avoided) carries a 1-unit "bonus", the loss-augmented planner will see those cells as cheaper than they really are.

Step 3 · MMP iteration 0, initial $w$

Start with a random non-negative weight vector. Say $w = (0.3, 0.7)$, grass costs 0.3, road costs 0.7.

Current cost map $c(s) = w \cdot \phi(s)$:

state012345
$c(s)$0.30.700.30.30.3

Note that under this random init, grass is cheaper than road, the opposite of the truth. The algorithm has to fix this.

Step 4 · Compute the loss-augmented cost $c'(s) = c(s) - L(s)$

state012345
$c(s)$0.30.700.30.30.3
$L(s)$000111
$c'(s)$0.30.70−0.7−0.7−0.7

The bottom-row cells now have negative cost, visiting them saves 0.7 each. The augmented planner will be tempted to detour.

Step 5 · Find the worst-violator path $\xi^*$

Plan the minimum-cost path from state 0 to state 2 under $c'$. Some candidate paths and their augmented cost:

pathsum of $c'$
0 → 1 → 2 (the expert path)0.3 + 0.7 + 0 = 1.0
0 → 3 → 4 → 5 → 2 (the long detour)0.3 + (−0.7) + (−0.7) + (−0.7) + 0 = −1.8
0 → 3 → 4 → 1 → 20.3 + (−0.7) + (−0.7) + 0.7 + 0 = −0.4
0 → 1 → 4 → 5 → 20.3 + 0.7 + (−0.7) + (−0.7) + 0 = −0.4

The cheapest under the loss-augmented cost is the long detour through the bottom row: $\xi^* = (0, 3, 4, 5, 2)$, cost $-1.8$. This is the "worst violator", the path the algorithm wants to make more expensive.

Feature counts of $\xi^*$:

$f^* = \phi(0) + \phi(3) + \phi(4) + \phi(5) + \phi(2) = (1, 0) + (1, 0) + (1, 0) + (1, 0) + (0, 0) = \mathbf{(4, 0)}$

Step 6 · Compute the subgradient and apply the update

We use the update from Section 8 Step 4 with $\lambda = 0$ (no L2 regularisation), to keep the arithmetic clean. With $\lambda = 0$ the shrinkage factor $(1 - \eta\lambda)$ becomes 1, so the update reduces to the bare subgradient step $w \leftarrow w + \eta \cdot \big(f^* - f_E\big)$. (A real run would keep a small positive $\lambda$ to bound the weights; here it would only add a tiny shrinkage and clutter the numbers.) The update direction is violator features minus expert features:

$f^* - f_E = (4, 0) - (1, 1) = \mathbf{(+3, -1)}$

Reading this:

  • $+3$ on grass: the violator visited grass 3 more times than the expert. Raise $w_{\text{grass}}$ to discourage grass.
  • $-1$ on road: the expert visited road 1 more time than the violator. Lower $w_{\text{road}}$ to encourage road.

With $\eta = 0.5$:

$w_{\text{new}} = w + \eta \cdot \big(f^* - f_E\big) = (0.3, 0.7) + 0.5 \cdot (+3, -1) = (0.3 + 1.5, \; 0.7 - 0.5) = \mathbf{(1.8, \; 0.2)}$

After projecting to non-negative orthant (both entries already $\geq 0$): $w = (1.8, 0.2)$.

Step 7 · Compute the slack at this iteration

The "violation" the algorithm just measured:

slack$_E = \ell(\xi^*) - w \cdot (f^* - f_E)$

$\ell(\xi^*) = L(0) + L(3) + L(4) + L(5) + L(2) = 0 + 1 + 1 + 1 + 0 = 3$ (three non-expert cells visited).

$f^* - f_E = (4, 0) - (1, 1) = (3, -1)$.

$w \cdot (3, -1) = 0.3 \cdot 3 + 0.7 \cdot (-1) = 0.9 - 0.7 = 0.2$.

slack $= 3 - 0.2 = \mathbf{2.8}$.

2.8 is large, the algorithm has work to do.

Step 8 · Iteration 1, recompute everything with the new $w$

$w = (1.8, 0.2)$. Cost map:

state012345
$c(s)$1.80.201.81.81.8

Now grass is much more expensive than road. ✓

Loss-augmented cost:

state012345
$c'(s)$1.80.200.80.80.8

Even with the 1-unit bonus on the bottom row, those cells still cost 0.8 each, not enough to compete with road at 0.2.

Find $\xi^*$ now:

pathsum of $c'$
0 → 1 → 2 (expert path)1.8 + 0.2 + 0 = 2.0
0 → 3 → 4 → 5 → 21.8 + 0.8 + 0.8 + 0.8 + 0 = 4.2
0 → 3 → 4 → 1 → 21.8 + 0.8 + 0.8 + 0.2 + 0 = 3.6

The cheapest is now the expert path itself. $\xi^* = (0, 1, 2)$, same as $\xi_E$.

Subgradient: $g = f_E - f^* = (1, 1) - (1, 1) = (0, 0)$. No update.

Slack: $\ell(\xi^*) = 0$ (no non-expert cells), $w \cdot (f^* - f_E) = 0$. slack $= 0$.

Converged. 🎉

Step 9 · What we recovered (the final answer)

After one iteration, the loss-augmented planner under $w = (1.8, 0.2)$ returns the expert's path itself. The algorithm has converged.

The final recovered policy, state by state

Unlike Abbeel-Ng, MMP doesn't output a mixture of candidate policies. It outputs a single cost vector $w$, and the "policy" is implicit: it's whatever a planner produces from $w$. Here's what the planner produces on this 2×3 grid, for every starting state:

start stateterrainbest actionresulting trajectorytotal cost under $w$
0grass (S)E (go to 1)$0 \to 1 \to 2$$1.8 + 0.2 + 0 = 2.0$
1roadE (go to 2)$1 \to 2$$0.2 + 0 = 0.2$
2goalabsorbing$2$$0$
3grassN (go to 0) then E, E$3 \to 0 \to 1 \to 2$$1.8 + 1.8 + 0.2 + 0 = 3.8$
4grassN (go to 1) then E$4 \to 1 \to 2$$1.8 + 0.2 + 0 = 2.0$
5grassN (go to 2)$5 \to 2$$1.8 + 0 = 1.8$

The recovered policy at every non-goal state: get to the road or goal as fast as possible, then ride the road home. Starting at the official start state (0), this produces the expert's trajectory exactly. Starting elsewhere, the policy is what you'd want a sensible agent to do, but those starts weren't demonstrated, so the recovered policy there is just whatever falls out of running a planner on the recovered $w$.

Side-by-side comparison against the expert

quantitytrue (hidden)recovered by MMPmatch?
cost weights $w$$(1.0, \; 0.2)$$(1.8, \; 0.2)$road exact, grass 1.8× too high
cost of state 0 (grass)$1.0$$1.8$off
cost of state 1 (road)$0.2$$0.2$Yes
cost ordering (grass vs road)grass > roadgrass > roadYes
planner's optimal path from state 0$0 \to 1 \to 2$$0 \to 1 \to 2$Yes
total path cost under recovered $w$$1.0 + 0.2 + 0 = 1.2$$1.8 + 0.2 + 0 = 2.0$different scale, same minimiser
matches expert demonstration?by constructionyesYes

The recovered policy (the "go through the road" path) is exactly the expert's. The recovered cost vector has the right ordering and gets road exactly right, but grass came out 1.8× too high. This is the typical MMP outcome on under-constrained problems: ordering correct, magnitude off, policy correct.

What does the recovered cost give you on a NEW map?

This is the answer to "why bother recovering a cost at all instead of just using Abbeel-Ng?" Suppose tomorrow you get a new map: an L-shaped corridor where the same kinds of terrain appear in different places. Hand the recovered $w = (1.8, 0.2)$ to a planner on the new map and it returns the road-hugging path automatically. The cost transfers because it depends only on terrain type, not on the specific map layout. Abbeel-Ng's policy-mixture output, by contrast, is tied to the original map and gives nonsense on the new one.

Why the recovered $w$ differs from the true $w$: because both vectors produce the same expert path under loss- augmented planning. From the demonstration alone, the algorithm cannot distinguish $(1, 0.2)$ from $(1.8, 0.2)$, both make the expert's path cheapest. The Hamming-loss term picks "one" of the many consistent $w$'s, biased by initialisation and step size. Adding more demonstrations (especially on cells the expert in this example avoided, like water and mountain) would tighten the recovery.
Cost recovery, by hand · drag the sliders
Interactive

A 4x6 gridworld with mixed terrain. The expert (the gold path) takes the cheapest route under the true weights, hugging the road. You control the cost weights below. The blue path is what a planner currently produces. Adjust the sliders until your planner matches the expert.

grass road sand water mountain expert path your path
3.0
3.0
3.0
3.0
3.0
your path cost , expert path cost (under your $w$) , violation = expert cost - your cost , matches expert path? ,

Both algorithms compared

Abbeel-Ng (2004)MMP (2006)
Goalmatch expert's visit countsmake expert's path cheapest under recovered cost
Outputa policy mixturea cost vector $w$
Useful whenyou just want to imitateyou want a cost to plan with in new environments
Recovers the true cost?no, doesn't tryno, but recovered cost is structurally usable
Inner loopdirection-finding (SVM) + planningloss-augmented planning + cost adjustment
Number of iterations~5 on our gridworld~10-100 on our gridworld

The big practical difference is transfer. If you move to a new map with the same terrain types, MMP's cost vector still applies: plug it into a planner on the new map, get a sensible policy. Abbeel-Ng's policy mixture is tied to the original map; on a new map it's useless. In practice this is the most important reason to prefer MMP when you have any chance of needing the recovered cost function on a different environment.

Conclusion: two landmark papers and what they leave open

We walked through two landmark papers from the early 2000s. Apprenticeship Learning (Abbeel and Ng, 2004) recovered a policy that matches the expert's terrain-visit histogram by iteratively collecting candidate policies and mixing them. Maximum Margin Planning (Ratliff, Bagnell and Zinkevich, 2006) recovered a portable cost function by demanding that the expert's path be cheaper than every alternative by at least the Hamming distance, and then finding the worst violator with a single loss-augmented planner call.

Both algorithms work. Both rest on the same structural skeleton, match expert features through some intermediate representation, which is the pattern that carries forward to every modern IRL algorithm. They are the right place for a reader new to the field to start.

Appendix A. Why w* = v / ‖v‖ maximises the dot product

This appendix proves the closed-form result used in section 6: given a fixed vector $v \in \mathbb R^n$, the maximum of $w \cdot v$ over all $w$ with $\|w\| \leq 1$ is achieved at $w^* = v / \|v\|$, and the maximum value equals $\|v\|$. Three proofs follow, each making the same statement from a different angle. They are not redundant: each one gives a different intuition that's worth carrying forward.

Proof A.1 (Cauchy-Schwarz)

The Cauchy-Schwarz inequality says that for any two vectors $a, b \in \mathbb R^n$:

$|a \cdot b| \leq \|a\| \cdot \|b\|$

with equality if and only if one is a non-negative scalar multiple of the other (they point in the same direction).

Apply this to $w$ and $v$, using the constraint $\|w\| \leq 1$:

$w \cdot v \;\leq\; |w \cdot v| \;\leq\; \|w\| \cdot \|v\| \;\leq\; 1 \cdot \|v\| \;=\; \|v\|.$

So $\|v\|$ is an upper bound on $w \cdot v$. Now we need to show this bound is achieved. Pick $w = v / \|v\|$. Compute:

$w \cdot v \;=\; \dfrac{v}{\|v\|} \cdot v \;=\; \dfrac{v \cdot v}{\|v\|} \;=\; \dfrac{\|v\|^2}{\|v\|} \;=\; \|v\|.$

And the norm constraint is satisfied:

$\|w\| \;=\; \left\| \dfrac{v}{\|v\|} \right\| \;=\; \dfrac{\|v\|}{\|v\|} \;=\; 1 \leq 1. \;\;\checkmark$

So $w^* = v / \|v\|$ achieves the upper bound, and is therefore the maximiser.

Proof A.2 (geometric, the picture)

The dot product has a geometric definition:

$w \cdot v \;=\; \|w\| \cdot \|v\| \cdot \cos\theta$

where $\theta$ is the angle between $w$ and $v$. So:

$w \cdot v \;\leq\; \|w\| \cdot \|v\| \cdot 1 \;=\; \|w\| \cdot \|v\|$

with equality when $\cos\theta = 1$, that is, $\theta = 0$, that is, $w$ and $v$ point in the same direction. Under the constraint $\|w\| \leq 1$, the right-hand side is at most $\|v\|$, with equality when $\|w\| = 1$.

So to maximise the dot product we need two things at once:

  • Make $\|w\|$ as large as the constraint allows. Set $\|w\| = 1$, saturate the constraint.
  • Make $\cos\theta$ as large as possible. Set $w$ parallel to $v$.

The unique vector satisfying both is $w^* = v / \|v\|$. The picture: $w^*$ is the unit arrow pointing exactly the way $v$ points. Any other direction loses you $\cos\theta$; any shorter vector loses you $\|w\|$.

Proof A.3 (Lagrangian, the rigorous version)

Set up the constrained optimisation. Use $\|w\|^2 \leq 1$ (the squared norm) so the constraint is smooth:

$\max_w \; w \cdot v \quad\text{subject to}\quad w \cdot w \leq 1.$

The Lagrangian is:

$\mathcal L(w, \lambda) \;=\; w \cdot v \;-\; \lambda \, (w \cdot w - 1)$

with $\lambda \geq 0$. Stationarity, set $\nabla_w \mathcal L = 0$:

$v - 2\lambda w \;=\; 0 \quad\Longrightarrow\quad w \;=\; \dfrac{v}{2\lambda}.$

So $w$ must be parallel to $v$. For the unconstrained maximum to be bounded, the constraint must be active (otherwise $w$ could grow without bound and so could $w \cdot v$), so $w \cdot w = 1$, giving:

$\left\| \dfrac{v}{2\lambda} \right\|^2 \;=\; 1 \quad\Longrightarrow\quad \dfrac{\|v\|^2}{4\lambda^2} \;=\; 1 \quad\Longrightarrow\quad 2\lambda \;=\; \|v\|.$

Plugging back:

$w^* \;=\; \dfrac{v}{2\lambda} \;=\; \dfrac{v}{\|v\|}.$

The optimal Lagrange multiplier $\lambda^* = \|v\| / 2$ measures how much the optimum would increase per unit relaxation of the norm constraint. Bigger $\|v\|$ means more room to grow $w \cdot v$ if we were allowed to let $\|w\|$ exceed 1.

What this means in the IRL context. In Abbeel-Ng's first iteration, $v = f_E - f_0$ is the difference in feature space from the candidate to the expert. We want a unit-norm reward weight vector that scores the expert as high as possible above the candidate. All three proofs say the same thing: the reward weights should point in the direction of expert minus candidate. Anything else, you're either pointing partly sideways (losing $\cos\theta$) or using a sub-unit-length vector (losing magnitude).

The intuition to carry forward: the gradient of $w \cdot v$ with respect to $w$ is just $v$. So if you're at $w = 0$ and you want to grow $w \cdot v$ as fast as possible, you walk in the direction $v$. You stop when you hit the boundary $\|w\| = 1$. That walk ends at $w^* = v / \|v\|$. This same gradient-flow intuition is exactly what ends up powering policy gradient methods later in the modern RL literature.

Appendix B. The SVM-step / convex-hull duality (Abbeel-Ng)

The walk-through claimed that Abbeel-Ng's "find a separating direction" step is equivalent to "project the expert onto the convex hull of candidates." Both formulations return the same direction $w$. This appendix proves it.

The direction-finding problem

$\displaystyle \max_{w,\,t}\;t \quad\text{s.t.}\quad w \cdot (f_E - f_j) \geq t \;\;\forall j,\quad \|w\| \leq 1.$

The claim

The maximum margin $t^*$ equals the Euclidean distance from $f_E$ to the convex hull of $\{f_0, \ldots, f_{k-1}\}$, and the optimal direction $w^*$ is the unit vector from the closest hull point toward $f_E$.

Proof, direction 1: $t \leq$ distance to hull

Take any feasible $(w, t)$ and any convex combination $\bar f = \sum_j \lambda_j f_j$ in the hull. The convex combination of the constraints gives $w \cdot (f_E - \bar f) \geq t$. Cauchy-Schwarz with $\|w\| \leq 1$ gives $w \cdot (f_E - \bar f) \leq \|f_E - \bar f\|$. So $t \leq \|f_E - \bar f\|$ for every $\bar f$ in the hull. Hence $t$ is at most the distance from $f_E$ to the hull.

Proof, direction 2: this bound is achievable

Let $\bar f^*$ be the closest hull point to $f_E$. The projection theorem of convex analysis says $(f_E - \bar f^*) \cdot (y - \bar f^*) \leq 0$ for any $y$ in the hull. Apply this with $y = f_j$ and rearrange: $(f_E - \bar f^*) \cdot (f_E - f_j) \geq \|f_E - \bar f^*\|^2$. Set $w^* = (f_E - \bar f^*) / \|f_E - \bar f^*\|$. This $w^*$ achieves margin $\|f_E - \bar f^*\|$ exactly. So the maximum margin equals the distance from $f_E$ to the hull.

The takeaway

The SVM step is geometrically just "point the cost direction from the closest hull point toward the expert." That is why the projection method (which directly tracks the closest hull point) gives the same answer as the SVM. The polytope widget earlier in the post is literally drawing this picture: each iteration adds a vertex to the convex hull, and the dashed red line from $f_E$ to the closest hull point is the direction $w^*$ that Abbeel-Ng's SVM step computes.

References & further reading

  • Abbeel, P. & Ng, A. Y. (2004). Apprenticeship Learning via Inverse Reinforcement Learning. ICML.
  • Ratliff, N., Bagnell, J. A., & Zinkevich, M. (2006). Maximum Margin Planning. ICML. (The MMP paper. The theory section here is a distillation of sections 2 and 3.)
  • Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large Margin Methods for Structured and Interdependent Output Variables. JMLR. (The structured-SVM foundation MMP builds on. Read for the loss-augmented inference framework in its general form.)
  • Ng, A. Y. & Russell, S. (2000). Algorithms for Inverse Reinforcement Learning. ICML. (The original "many costs explain the same expert" paper.)
  • Every number in this post comes from an actual run of the algorithms; the interactive widgets embedded throughout the post let you reproduce each iteration step by step.