Reinforcement Learning with DQNs in No-Limit Hold’em Poker

A Deep Q-Network (DQN) was trained to play simplified No-Limit Texas Hold’em Poker using the RLCard environment. The project explored how basic reinforcement learning techniques adapt to strategic uncertainty and opponent behavior, without relying on heavy computational resources or advanced game-theoretic tools.

This project asked whether the DQN could learn meaningful strategies in heads-up play. Using the RLCard Python framework, two training regimes were tested:

  1. Self-play, where both players learned simultaneously, and
  2. Fixed-opponent training, where the agent played against a uniform-random baseline.

The state space was represented as a 54-dimensional binary vector, encoding cards and chip counts, while the action space was discretized into six betting options ranging from fold to all-in. Agents were trained for 100,000 episodes using an ε-greedy exploration policy and a two-layer neural network with 512 hidden units.
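
As a concrete reference, the sketch below shows roughly how such a fixed-opponent setup can be wired together in RLCard. It is not the project's actual training script: the environment id, the agent's keyword arguments, the [512, 512] reading of "two-layer network with 512 hidden units", and the exploration schedule are assumptions based on the description above, and exact argument names can differ between RLCard versions.

```python
# Minimal sketch of fixed-opponent DQN training in RLCard (PyTorch backend).
# Hyperparameter values are illustrative and mirror those discussed in the text.
import rlcard
from rlcard.agents import DQNAgent, RandomAgent
from rlcard.utils import reorganize

env = rlcard.make('no-limit-holdem')          # heads-up NLHE environment

dqn_agent = DQNAgent(
    num_actions=env.num_actions,              # discretized betting actions
    state_shape=env.state_shape[0],           # 54-dimensional observation
    mlp_layers=[512, 512],                    # assumed: two hidden layers of 512 units
    learning_rate=1e-5,
    discount_factor=0.95,
    epsilon_start=1.0,                        # high initial exploration
    epsilon_end=0.05,                         # assumed final exploration rate
)
opponent = RandomAgent(num_actions=env.num_actions)
env.set_agents([dqn_agent, opponent])

for episode in range(100_000):
    # Play one hand and collect (state, action, reward, next_state, done) tuples.
    trajectories, payoffs = env.run(is_training=True)
    for transition in reorganize(trajectories, payoffs)[0]:
        dqn_agent.feed(transition)            # store in replay buffer and train
```

For the self-play regime, the same loop applies with a second DQN agent in the opposite seat, each agent fed from its own stream of transitions.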

Learning was governed by two standard components of reinforcement learning (a minimal, framework-free sketch of both follows the symbol definitions below):

1. Q-learning update rule

Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]

2. ε-greedy action selection

\pi(s) = \begin{cases} \text{random action}, & \text{with probability } \epsilon \\ \arg\max_{a} Q(s,a), & \text{with probability } 1-\epsilon \end{cases}

where:

  • s and s′ are the current and next states, representing the game configuration (cards, chip stacks, betting history).
  • a and a′ are actions chosen from the discrete action space (fold, call, raise, all-in).
  • r is the immediate reward from taking action a in state s.
  • Q(s,a) is the action-value function, estimating the expected return of taking action a in state s and following the current policy thereafter.
  • α is the learning rate, controlling how much new information overrides old estimates.
  • γ is the discount factor, weighting the importance of future rewards.
  • ε is the exploration rate in the ε-greedy policy, giving the probability of choosing a random action instead of the greedy one.
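
The snippet below is a minimal, framework-free sketch of these two rules in tabular form. The state and action counts are toy values, not the poker state space, and in the actual DQN the table is replaced by the 512-unit neural network trained by gradient descent on sampled replay transitions.

```python
# Didactic, tabular illustration of the Q-learning update and ε-greedy selection.
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions = 10, 6           # toy sizes for illustration only
Q = np.zeros((num_states, num_actions))   # action-value estimates Q(s, a)
alpha, gamma, epsilon = 1e-5, 0.95, 0.9   # learning rate, discount, exploration

def select_action(state: int) -> int:
    """ε-greedy: random action with probability ε, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(np.argmax(Q[state]))

def q_update(s: int, a: int, r: float, s_next: int, done: bool) -> None:
    """One Q-learning step: move Q(s, a) toward the bootstrapped target."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```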

Main insights from the study:

  • Both training setups produced agents that achieved positive returns and consistently outperformed a random baseline.
  • Training against a fixed random opponent led to stable and reliable gains (≈ +5,000 to +6,000 big blinds over evaluation runs; a sketch of such an evaluation run follows this list).
  • Self-play generated higher peak performances (sometimes >10,000 big blinds), but with greater variability across seeds.
  • Sensitivity analysis showed that:
    • Lower learning rates (α = 1×10⁻⁵) produced the most stable learning.
    • A slightly shorter planning horizon (γ = 0.95) worked best in self-play.
    • Higher initial exploration (ε = 0.9–1.0) improved adaptation in dynamic self-play settings.
  • While the model is far from state-of-the-art, it demonstrated how basic RL methods can adapt to imperfect information, uncertainty, and competition.
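
Results like the big-blind totals above are typically obtained by freezing the trained agent and playing a large number of evaluation hands against the baseline. A hedged sketch, assuming the RLCard tournament utility and the `dqn_agent` object from the training sketch above:

```python
# Sketch of an evaluation run (names assumed): `dqn_agent` is the trained agent
# from the training sketch above, pitted against the uniform-random baseline.
import rlcard
from rlcard.agents import RandomAgent
from rlcard.utils import tournament

eval_env = rlcard.make('no-limit-holdem')
eval_env.set_agents([dqn_agent, RandomAgent(num_actions=eval_env.num_actions)])

num_hands = 10_000
avg_payoffs = tournament(eval_env, num_hands)   # average payoff per hand, per seat
# Cumulative winnings for the DQN seat; payoff units follow the environment's
# convention (the study reports totals in big blinds).
print(f"DQN agent total: {avg_payoffs[0] * num_hands:+.0f} over {num_hands} hands")
```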

Overall, this work highlights how reinforcement learning agents can learn non-trivial strategies in complex, adversarial environments even with limited resources, which makes the setup a valuable teaching and experimentation tool.

Note: Due to university policies, I cannot share the full report or detailed data. If you are interested in discussing the methodology or results further, please get in touch :)