The paper Regret Minimization in Games with Incomplete Information introduces counterfactual regret and how minimizing counterfactual regret through self-play can be used to reach Nash equilibrium. The algorithm is called Counterfactual Regret Minimization (**CFR**).

The paper Monte Carlo Sampling for Regret Minimization in Extensive Games introduces Monte Carlo Counterfactual Regret Minimization (**MCCFR**), where we sample from the game tree and estimate the regrets.

We tried to keep our Python implementation easy-to-understand like a tutorial. We run it on a very simple imperfect information game called Kuhn poker.

We implement Monte Carlo Counterfactual Regret Minimization (MCCFR) with chance sampling (CS). It iteratively, explores part of the game tree by trying all player actions, but sampling chance events. Chance events are things like dealing cards; they are kept sampled once per iteration. Then it calculates, for each action, the *regret* of following the current strategy instead of taking that action. Then it updates the strategy based on these regrets for the next iteration, using regret matching. Finally, it computes the average of the strategies throughout the iterations, which is very close to the Nash equilibrium if we ran enough iterations.

We will first introduce the mathematical notation and theory.

A player is denoted by $i∈N$, where $N$ is the set of players.

History $h∈H$ is a sequence of actions including chance events, and $H$ is the set of all histories.

$Z⊆H$ is the set of terminal histories (game over).

Action $a$, $A(h)=a:(h,a)∈H$ where $h∈H$ is a non-terminal history.

**Information set** $I_{i}∈I_{i}$ for player $i$ is similar to a history $h∈H$ but only contains the actions visible to player $i$. That is, the history $h$ will contain actions/events such as cards dealt to the opposing player while $I_{i}$ will not have them.

$I_{i}$ is known as the **information partition** of player $i$.

$h∈I$ is the set of all histories that belong to a given information set; i.e. all those histories look the same in the eye of the player.

**Strategy of player** $i$, $σ_{i}∈Σ_{i}$ is a distribution over actions $A(I_{i})$, where $Σ_{i}$ is the set of all strategies for player $i$. Strategy on $t$-th iteration is denoted by $σ_{i}$.

Strategy is defined as a probability for taking an action $a$ in for a given information set $I$,

$σ_{i}(I)(a)$

$σ$ is the **strategy profile** which consists of strategies of all players $σ_{1},σ_{2},…$

$σ_{−i}$ is strategies of all players except $σ_{i}$

$π_{σ}(h)$ is the probability of reaching the history $h$ with strategy profile $σ$. $π_{σ}(h)_{−i}$ is the probability of reaching $h$ without player $i$'s contribution; i.e. player $i$ took the actions to follow $h$ with a probability of $1$.

$π_{σ}(h)_{i}$ is the probability of reaching $h$ with only player $i$'s contribution. That is, $π_{σ}(h)=π_{σ}(h)_{i}π_{σ}(h)_{−i}$

Probability of reaching a information set $I$ is, $π_{σ}(I)=h∈I∑ π_{σ}(h)$

The terminal utility is the utility (or pay off) of a player $i$ for a terminal history $h$.

$u_{i}(h)$ where $h∈Z$

$u_{i}(σ)$ is the expected utility (payoff) for player $i$ with strategy profile $σ$.

$u_{i}(σ)=h∈Z∑ u_{i}(h)π_{σ}(h)$

Nash equilibrium is a state where none of the players can increase their expected utility (or payoff) by changing their strategy alone.

For two players, Nash equilibrium is a strategy profile where

$u_{1}(σ)u_{2}(σ) ≥σ_{1}∈Σ_{1}max u_{1}(σ_{1},σ_{2})≥σ_{2}∈Σ_{2}max u_{1}(σ_{1},σ_{2}) $$ϵ$-Nash equilibrium is,

$u_{1}(σ)+ϵu_{2}(σ)+ϵ ≥σ_{1}∈Σ_{1}max u_{1}(σ_{1},σ_{2})≥σ_{2}∈Σ_{2}max u_{1}(σ_{1},σ_{2}) $Regret is the utility (or pay off) that the player didn't get because she didn't follow the optimal strategy or took the best action.

Average overall regret for Player $i$ is the average regret of not following the optimal strategy in all $T$ rounds of iterations.

$R_{i}=T1 σ_{i}∈Σ_{i}max t=1∑T (u_{i}(σ_{i},σ_{−i})−u_{i}(σ_{t}))$

where $σ_{t}$ is the strategy profile of all players in iteration $t$, and

$(σ_{i},σ_{−i})$

is the strategy profile $σ_{t}$ with player $i$'s strategy replaced with $σ_{i}$.

The average strategy is the average of strategies followed in each round, for all $I∈I,a∈A(I)$

$σˉ_{i}(I)(a)=∑_{t=1}π_{i}(I)∑_{t=1}π_{i}(I)σ_{t}(I)(a) $

That is the mean regret of not playing with the optimal strategy.

If $R_{i}<ϵ$ for all players then $σˉ_{i}(I)(a)$ is a $2ϵ$-Nash equilibrium.

$R_{i}R_{i} <ϵ=T1 σ_{i}∈Σ_{i}max t=1∑T (u_{i}(σ_{i},σ_{−i})−u_{i}(σ_{t}))=T1 σ_{i}∈Σ_{i}max t=1∑T u_{i}(σ_{i},σ_{−i})−T1 t=1∑T u_{i}(σ_{t})<ϵ $Since $u_{1}=−u_{2}$ because it's a zero-sum game, we can add $R_{1}$ and $R_{i}$ and the second term will cancel out.

$2ϵ >T1 σ_{1}∈Σ_{1}max t=1∑T u_{1}(σ_{1},σ_{−1})+T1 σ_{2}∈Σ_{2}max t=1∑T u_{2}(σ_{2},σ_{−2}) $The average of utilities over a set of strategies is equal to the utility of the average strategy.

$T1 t=1∑T u_{i}(σ_{t})=u_{i}(σˉ_{T})$

Therefore,

$2ϵ >σ_{1}∈Σ_{1}max u_{1}(σ_{1},σˉ_{−1})+σ_{2}∈Σ_{2}max u_{2}(σ_{2},σˉ_{−2}) $From the definition of $max$, $σ_{2}∈Σ_{2}max u_{2}(σ_{2},σˉ_{−2})≥u_{2}(σˉ_{T})=−u_{1}(σˉ_{T})$

Then,

$2ϵu_{1}(σˉ_{T})+2ϵ >σ_{1}∈Σ_{1}max u_{1}(σ_{1},σˉ_{−1})+−u_{1}(σˉ_{T})>σ_{1}∈Σ_{1}max u_{1}(σ_{1},σˉ_{−1}) $This is $2ϵ$-Nash equilibrium. You can similarly prove for games with more than 2 players.

So we need to minimize $R_{i}$ to get close to a Nash equilibrium.

**Counterfactual value** $v_{i}(σ,I)$ is the expected utility for player $i$ if if player $i$ tried to reach $I$ (took the actions leading to $I$ with a probability of $1$).

$v_{i}(σ,I)=z∈Z_{I}∑ π_{−i}(z[I])π_{σ}(z[I],z)u_{i}(z)$

where $Z_{I}$ is the set of terminal histories reachable from $I$, and $z[I]$ is the prefix of $z$ up to $I$. $π_{σ}(z[I],z)$ is the probability of reaching z from $z[I]$.

**Immediate counterfactual regret** is,

$R_{i,imm}(I)=a∈AImax R_{i,imm}(I,a)$

where

$R_{i,imm}(I)=T1 t=1∑T (v_{i}(σ_{t}∣_{I→a},I)−v_{i}(σ_{t},I))$

where $σ∣_{I→a}$ is the strategy profile $σ$ with the modification of always taking action $a$ at information set $I$.

The paper proves that (Theorem 3),

$R_{i}≤I∈I∑ R_{i,imm}(I)$ where $R_{i,imm}(I)=max(R_{i,imm}(I),0)$

The strategy is calculated using regret matching.

The regret for each information set and action pair $R_{i}(I,a)$ is maintained,

$r_{i}(I,a)R_{i}(I,a) =v_{i}(σ_{t}∣_{I→a},I)−v_{i}(σ_{t},I)=T1 t=1∑T r_{i}(I,a) $and the strategy is calculated with regret matching,

$σ_{i}(I)(a)=⎩⎨⎧ ∑_{a∈A(I)}R_{i}(I,a_{′})R_{i}(I,a) ,∣A(I)∣1 , if∑_{a_{′}∈A(I)}R_{i}(I,a_{′})>0otherwise $where $R_{i}(I,a)=max(R_{i}(I,a),0)$

The paper The paper Regret Minimization in Games with Incomplete Information proves that if the strategy is selected according to above equation $R_{i}$ gets smaller proportionate to $T 1 $, and therefore reaches $ϵ$-Nash equilibrium.

Computing $r_{i}(I,a)$ requires expanding the full game tree on each iteration.

The paper Monte Carlo Sampling for Regret Minimization in Extensive Games shows we can sample from the game tree and estimate the regrets.

$Q=Q_{1},…,Q_{r}$ is a set of subsets of $Z$ ($Q_{j}⊆Z$) where we look at only a single block $Q_{j}$ in an iteration. Union of all subsets spans $Z$ ($Q_{1}∩…∩Q_{r}=Z$). $q_{j}$ is the probability of picking block $Q_{j}$.

$q(z)$ is the probability of picking $z$ in current iteration; i.e. $q(z)=∑_{j:z∈Q_{j}}q_{j}$ - the sum of $q_{j}$ where $z∈Q_{j}$.

Then we get **sampled counterfactual value** fro block $j$,

$v~(σ,I∣j)=z∈Q_{j}∑ q(z)1 π_{−i}(z[I])π_{σ}(z[I],z)u_{i}(z)$

The paper shows that

$E_{j∼q_{j}}[v~(σ,I∣j)]=v_{i}(σ,I)$

with a simple proof.

Therefore we can sample a part of the game tree and calculate the regrets. We calculate an estimate of regrets

$r~_{i}(I,a)=v~_{i}(σ_{t}∣_{I→a},I)−v~_{i}(σ_{t},I)$

And use that to update $R_{i}(I,a)$ and calculate the strategy $σ_{i}(I)(a)$ on each iteration. Finally, we calculate the overall average strategy $σˉ_{i}(I)(a)$.

Here is a Kuhn Poker implementation to try CFR on Kuhn Poker.

*Let's dive into the code!*

```
328from typing import NewType, Dict, List, Callable, cast
329
330from labml import monit, tracker, logger, experiment
331from labml.configs import BaseConfigs, option
```

A player $i∈N$ where $N$ is the set of players

`334Player = NewType('Player', int)`

`336Action = NewType('Action', str)`

History $h∈H$ is a sequence of actions including chance events, and $H$ is the set of all histories.

This class should be extended with game specific logic.

`339class History:`

Whether it's a terminal history; i.e. game over. $h∈Z$

`351 def is_terminal(self):`

`356 raise NotImplementedError()`

`358 def terminal_utility(self, i: Player) -> float:`

`364 raise NotImplementedError()`

Get current player, denoted by $P(h)$, where $P$ is known as **Player function**.

If $P(h)=c$ it means that current event is a chance $c$ event. Something like dealing cards, or opening common cards in poker.

`366 def player(self) -> Player:`

`373 raise NotImplementedError()`

Whether the next step is a chance step; something like dealing a new card. $P(h)=c$

`375 def is_chance(self) -> bool:`

`380 raise NotImplementedError()`

Sample a chance when $P(h)=c$.

`382 def sample_chance(self) -> Action:`

`386 raise NotImplementedError()`

Add an action to the history.

`388 def __add__(self, action: Action):`

`392 raise NotImplementedError()`

Get information set for the current player

`394 def info_set_key(self) -> str:`

`398 raise NotImplementedError`

Create a new information set for the current player

`400 def new_info_set(self) -> 'InfoSet':`

`404 raise NotImplementedError()`

Human readable representation

`406 def __repr__(self):`

`410 raise NotImplementedError()`

`413class InfoSet:`

Unique key identifying the information set

`421 key: str`

Total regret of not taking each action $A(I_{i})$,

$r~_{i}(I,a)R_{i}(I,a) =v~_{i}(σ_{t}∣_{I→a},I)−v~_{i}(σ_{t},I)=T1 t=1∑T r~_{i}(I,a) $We maintain $TR_{i}(I,a)$ instead of $R_{i}(I,a)$ since $T1 $ term cancels out anyway when computing strategy $σ_{i}(I)(a)$

`438 regret: Dict[Action, float]`

We maintain the cumulative strategy $t=1∑T π_{i}(I)σ_{t}(I)(a)$ to compute overall average strategy

$σˉ_{i}(I)(a)=∑_{t=1}π_{i}(I)∑_{t=1}π_{i}(I)σ_{t}(I)(a) $

`445 cumulative_strategy: Dict[Action, float]`

Initialize

`447 def __init__(self, key: str):`

```
451 self.key = key
452 self.regret = {a: 0 for a in self.actions()}
453 self.cumulative_strategy = {a: 0 for a in self.actions()}
454 self.calculate_strategy()
```

Actions $A(I_{i})$

`456 def actions(self) -> List[Action]:`

`460 raise NotImplementedError()`

Load information set from a saved dictionary

```
462 @staticmethod
463 def from_dict(data: Dict[str, any]) -> 'InfoSet':
```

`467 raise NotImplementedError()`

Save the information set to a dictionary

`469 def to_dict(self):`

```
473 return {
474 'key': self.key,
475 'regret': self.regret,
476 'average_strategy': self.cumulative_strategy,
477 }
```

Load data from a saved dictionary

`479 def load_dict(self, data: Dict[str, any]):`

```
483 self.regret = data['regret']
484 self.cumulative_strategy = data['average_strategy']
485 self.calculate_strategy()
```

Calculate current strategy using regret matching.

$σ_{i}(I)(a)=⎩⎨⎧ ∑_{a∈A(I)}R_{i}(I,a_{′})R_{i}(I,a) ,∣A(I)∣1 , if∑_{a_{′}∈A(I)}R_{i}(I,a_{′})>0otherwise $where $R_{i}(I,a)=max(R_{i}(I,a),0)$

`487 def calculate_strategy(self):`

$R_{i}(I,a)=max(R_{i}(I,a),0)$

`506 regret = {a: max(r, 0) for a, r in self.regret.items()}`

$a_{′}∈A(I)∑ R_{i}(I,a_{′})$

`508 regret_sum = sum(regret.values())`

if $∑_{a_{′}∈A(I)}R_{i}(I,a_{′})>0$,

`510 if regret_sum > 0:`

$σ_{i}(I)(a)=∑_{a_{′}∈A(I)}R_{i}(I,a_{′})R_{i}(I,a) $

`513 self.strategy = {a: r / regret_sum for a, r in regret.items()}`

Otherwise,

`515 else:`

$∣A(I)∣$

`517 count = len(list(a for a in self.regret))`

$σ_{i}(I)(a)=∣A(I)∣1 $

`520 self.strategy = {a: 1 / count for a, r in regret.items()}`

`522 def get_average_strategy(self):`

$t=1∑T π_{i}(I)σ_{t}(I)(a)$

`531 cum_strategy = {a: self.cumulative_strategy.get(a, 0.) for a in self.actions()}`

$t=1∑T π_{i}(I)=a∈A(I)∑ t=1∑T π_{i}(I)σ_{t}(I)(a)$

`535 strategy_sum = sum(cum_strategy.values())`

If $∑_{t=1}π_{i}(I)>0$,

`537 if strategy_sum > 0:`

$σˉ_{i}(I)(a)=∑_{t=1}π_{i}(I)∑_{t=1}π_{i}(I)σ_{t}(I)(a) $

`541 return {a: s / strategy_sum for a, s in cum_strategy.items()}`

Otherwise,

`543 else:`

$∣A(I)∣$

`545 count = len(list(a for a in cum_strategy))`

$σˉ_{i}(I)(a)=∣A(I)∣1 $

`548 return {a: 1 / count for a, r in cum_strategy.items()}`

Human readable representation

`550 def __repr__(self):`

`554 raise NotImplementedError()`

We do chance sampling (**CS**) where all the chance events (nodes) are sampled and all other events (nodes) are explored.

We can ignore the term $q(z)$ since it's the same for all terminal histories since we are doing chance sampling and it cancels out when calculating strategy (common in numerator and denominator).

`557class CFR:`

$I$ set of all information sets.

`570 info_sets: Dict[str, InfoSet]`

`create_new_history`

creates a new empty history`epochs`

is the number of iterations to train on $T$`n_players`

is the number of players

```
572 def __init__(self, *,
573 create_new_history: Callable[[], History],
574 epochs: int,
575 n_players: int = 2):
```

```
581 self.n_players = n_players
582 self.epochs = epochs
583 self.create_new_history = create_new_history
```

A dictionary for $I$ set of all information sets

`585 self.info_sets = {}`

Tracker for analytics

`587 self.tracker = InfoSetTracker()`

Returns the information set $I$ of the current player for a given history $h$

`589 def _get_info_set(self, h: History):`

```
593 info_set_key = h.info_set_key()
594 if info_set_key not in self.info_sets:
595 self.info_sets[info_set_key] = h.new_info_set()
596 return self.info_sets[info_set_key]
```

This function walks the game tree.

`h`

is the current history $h$`i`

is the player $i$ that we are computing regrets of`pi_i`

is $π_{i}(h)$`pi_neg_i`

is $π_{−i}(h)$

It returns the expected utility, for the history $h$ $z∈Z_{h}∑ π_{σ}(h,z)u_{i}(z)$ where $Z_{h}$ is the set of terminal histories with prefix $h$

While walking the tee it updates the total regrets $R_{i}(I,a)$.

`598 def walk_tree(self, h: History, i: Player, pi_i: float, pi_neg_i: float) -> float:`

If it's a terminal history $h∈Z$ return the terminal utility $u_{i}(h)$.

```
619 if h.is_terminal():
620 return h.terminal_utility(i)
```

If it's a chance event $P(h)=c$ sample a and go to next step.

```
622 elif h.is_chance():
623 a = h.sample_chance()
624 return self.walk_tree(h + a, i, pi_i, pi_neg_i)
```

Get current player's information set for $h$

`627 I = self._get_info_set(h)`

To store $∑_{z∈Z_{h}}π_{σ}(h,z)u_{i}(z)$

`629 v = 0`

To store $z∈Z_{h}∑ π_{σ_{t}∣_{I→a}}(h,z)u_{i}(z)$ for each action $a∈A(h)$

`633 va = {}`

Iterate through all actions

`636 for a in I.actions():`

If the current player is $i$,

`638 if i == h.player():`

`643 va[a] = self.walk_tree(h + a, i, pi_i * I.strategy[a], pi_neg_i)`

Otherwise,

`645 else:`

`650 va[a] = self.walk_tree(h + a, i, pi_i, pi_neg_i * I.strategy[a])`

$z∈Z_{h}∑ π_{σ}(h,z)u_{i}(z)=a∈A(I)∑ [σ_{i}(I)(a)z∈Z_{h}∑ π_{σ_{t}∣_{I→a}}(h,z)u_{i}(z)]$

`655 v = v + I.strategy[a] * va[a]`

If the current player is $i$, update the cumulative strategies and total regrets

`659 if h.player() == i:`

Update cumulative strategies $t=1∑T π_{i}(I)σ_{t}(I)(a)=t=1∑T [h∈I∑ π_{i}(h)σ_{t}(I)(a)]$

```
664 for a in I.actions():
665 I.cumulative_strategy[a] = I.cumulative_strategy[a] + pi_i * I.strategy[a]
```

$r~_{i}(I,a)TR_{i}(I,a) =v~_{i}(σ_{t}∣_{I→a},I)−v~_{i}(σ_{t},I)=π_{−i}(h)(z∈Z_{h}∑ π_{σ_{t}∣_{I→a}}(h,z)u_{i}(z)−z∈Z_{h}∑ π_{σ}(h,z)u_{i}(z))=t=1∑T r~_{i}(I,a) $

```
678 for a in I.actions():
679 I.regret[a] += pi_neg_i * (va[a] - v)
```

Update the strategy $σ_{t}(I)(a)$

`682 I.calculate_strategy()`

Return the expected utility for player $i$, $z∈Z_{h}∑ π_{σ}(h,z)u_{i}(z)$

`686 return v`

`688 def iterate(self):`

Loop for `epochs`

times

`696 for t in monit.iterate('Train', self.epochs):`

Walk tree and update regrets for each player

```
698 for i in range(self.n_players):
699 self.walk_tree(self.create_new_history(), cast(Player, i), 1, 1)
```

Track data for analytics

```
702 tracker.add_global_step()
703 self.tracker(self.info_sets)
704 tracker.save()
```

Save checkpoints every $1,000$ iterations

```
707 if (t + 1) % 1_000 == 0:
708 experiment.save_checkpoint()
```

Print the information sets

`711 logger.inspect(self.info_sets)`

`714class InfoSetTracker:`

Set tracking indicators

`720 def __init__(self):`

```
724 tracker.set_histogram(f'strategy.*')
725 tracker.set_histogram(f'average_strategy.*')
726 tracker.set_histogram(f'regret.*')
```

Track the data from all information sets

`728 def __call__(self, info_sets: Dict[str, InfoSet]):`

```
732 for I in info_sets.values():
733 avg_strategy = I.get_average_strategy()
734 for a in I.actions():
735 tracker.add({
736 f'strategy.{I.key}.{a}': I.strategy[a],
737 f'average_strategy.{I.key}.{a}': avg_strategy[a],
738 f'regret.{I.key}.{a}': I.regret[a],
739 })
```

`742class CFRConfigs(BaseConfigs):`

```
746 create_new_history: Callable[[], History]
747 epochs: int = 1_00_000
748 cfr: CFR = 'simple_cfr'
```

Initialize **CFR** algorithm

```
751@option(CFRConfigs.cfr)
752def simple_cfr(c: CFRConfigs):
```

```
756 return CFR(create_new_history=c.create_new_history,
757 epochs=c.epochs)
```