Reinforcement Learning with DQNs in No-Limit Hold’em Poker

A Deep Q-Network (DQN) was trained to play simplified No-Limit Texas Hold’em Poker using the RLCard environment. The project explored how basic reinforcement learning techniques adapt to strategic uncertainty and opponent behavior, without relying on heavy computational resources or advanced game-theoretic tools.

This project asked whether the DQN could learn meaningful strategies in heads-up play. Using the RLCard Python framework, two training regimes were tested:

  1. Self-play, where both players learned simultaneously, and
  2. Fixed-opponent training, where the agent played against a uniform-random baseline.

The state space was represented as a 54-dimensional binary vector, encoding cards and chip counts, while the action space was discretized into six betting options ranging from fold to all-in. Agents were trained for 100,000 episodes using an ε-greedy exploration policy and a two-layer neural network with 512 hidden units.
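
As a concrete reference, the sketch below shows roughly how such a fixed-opponent setup can be wired together in RLCard. It is not the project's actual training script: the environment id, the agent's keyword arguments, the [512, 512] reading of "two-layer network with 512 hidden units", and the exploration schedule are assumptions based on the description above, and exact argument names can differ between RLCard versions.

```python
# Minimal sketch of fixed-opponent DQN training in RLCard (PyTorch backend).
# Hyperparameter values are illustrative and mirror those discussed in the text.
import rlcard
from rlcard.agents import DQNAgent, RandomAgent
from rlcard.utils import reorganize

env = rlcard.make('no-limit-holdem')          # heads-up NLHE environment

dqn_agent = DQNAgent(
    num_actions=env.num_actions,              # discretized betting actions
    state_shape=env.state_shape[0],           # 54-dimensional observation
    mlp_layers=[512, 512],                    # assumed: two hidden layers of 512 units
    learning_rate=1e-5,
    discount_factor=0.95,
    epsilon_start=1.0,                        # high initial exploration
    epsilon_end=0.05,                         # assumed final exploration rate
)
opponent = RandomAgent(num_actions=env.num_actions)
env.set_agents([dqn_agent, opponent])

for episode in range(100_000):
    # Play one hand and collect (state, action, reward, next_state, done) tuples.
    trajectories, payoffs = env.run(is_training=True)
    for transition in reorganize(trajectories, payoffs)[0]:
        dqn_agent.feed(transition)            # store in replay buffer and train
```

For the self-play regime, the same loop applies with a second DQN agent in the opposite seat, each agent fed from its own stream of transitions.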

Learning was governed by two standard components of reinforcement learning (a minimal, framework-free sketch of both follows the symbol definitions below):

1. Q-learning update rule

Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]

2. ε-greedy action selection

\pi(s) = \begin{cases} \text{random action}, & \text{with probability } \epsilon \\ \arg\max_{a} Q(s,a), & \text{with probability } 1-\epsilon \end{cases}

where:

  • s and s′ are the current and next states, representing the game configuration (cards, chip stacks, betting history).
  • a and a′ are actions chosen from the discrete action space (fold, call, raise, all-in).
  • r is the immediate reward from taking action a in state s.
  • Q(s,a) is the action-value function, estimating the expected return of taking action a in state s and following the current policy thereafter.
  • α is the learning rate, controlling how much new information overrides old estimates.
  • γ is the discount factor, weighting the importance of future rewards.
  • ε is the exploration rate in the ε-greedy policy, giving the probability of choosing a random action instead of the greedy one.
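
The snippet below is a minimal, framework-free sketch of these two rules in tabular form. The state and action counts are toy values, not the poker state space, and in the actual DQN the table is replaced by the 512-unit neural network trained by gradient descent on sampled replay transitions.

```python
# Didactic, tabular illustration of the Q-learning update and ε-greedy selection.
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions = 10, 6           # toy sizes for illustration only
Q = np.zeros((num_states, num_actions))   # action-value estimates Q(s, a)
alpha, gamma, epsilon = 1e-5, 0.95, 0.9   # learning rate, discount, exploration

def select_action(state: int) -> int:
    """ε-greedy: random action with probability ε, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(np.argmax(Q[state]))

def q_update(s: int, a: int, r: float, s_next: int, done: bool) -> None:
    """One Q-learning step: move Q(s, a) toward the bootstrapped target."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```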

Main insights from the study:

  • Both training setups produced agents that achieved positive returns and consistently outperformed a random baseline.
  • Training against a fixed random opponent led to stable and reliable gains (≈ +5,000 to +6,000 big blinds over evaluation runs; a sketch of such an evaluation run follows this list).
  • Self-play generated higher peak performances (sometimes >10,000 big blinds), but with greater variability across seeds.
  • Sensitivity analysis showed that:
    • Lower learning rates (α = 1×10⁻⁵) produced the most stable learning.
    • A slightly shorter planning horizon (γ = 0.95) worked best in self-play.
    • Higher initial exploration (ε = 0.9–1.0) improved adaptation in dynamic self-play settings.
  • While the model is far from state-of-the-art, it demonstrated how basic RL methods can adapt to imperfect information, uncertainty, and competition.
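
Results like the big-blind totals above are typically obtained by freezing the trained agent and playing a large number of evaluation hands against the baseline. A hedged sketch, assuming the RLCard tournament utility and the `dqn_agent` object from the training sketch above:

```python
# Sketch of an evaluation run (names assumed): `dqn_agent` is the trained agent
# from the training sketch above, pitted against the uniform-random baseline.
import rlcard
from rlcard.agents import RandomAgent
from rlcard.utils import tournament

eval_env = rlcard.make('no-limit-holdem')
eval_env.set_agents([dqn_agent, RandomAgent(num_actions=eval_env.num_actions)])

num_hands = 10_000
avg_payoffs = tournament(eval_env, num_hands)   # average payoff per hand, per seat
# Cumulative winnings for the DQN seat; payoff units follow the environment's
# convention (the study reports totals in big blinds).
print(f"DQN agent total: {avg_payoffs[0] * num_hands:+.0f} over {num_hands} hands")
```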

Overall, this work highlights how reinforcement learning agents can learn non-trivial strategies in complex, adversarial environments even with limited resources, which makes the setup a valuable teaching and experimentation tool.

Note: Due to university policies, I cannot share the full report or detailed data. If you are interested in discussing the methodology or results further, please get in touch :)