The problem

(This problem is Exercise 4.8 in the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.)

You start with some amount of money less than 100 dollars. Over a series of coin tosses, on each toss you stake an amount in an attempt to reach the goal of 100 dollars: heads, you win the amount you staked; tails, you lose it. The game ends when you reach 100 dollars or hit 0 dollars. The probability of heads is 0.4.

You are given an optimal plan of action (graph below) for reaching 100 from any starting amount, and are asked:

Why does the optimal policy have a curious form?


Notes:

  1. In the text preceding the actual question, the actions are limited to staking amounts in the range 0 to min(s, 100 - s), where s is your current amount. (THIS IS IMPORTANT; see the short snippet after this list.)

    For example, if you have 70 dollars on hand, theoretically you can stake any amount from 0 dollars to 30 dollars (as per the above rule)

  2. A graph depicting the odds of getting to the goal from your current state
  3. Definition of a Policy:

    A function that, given your current state, gives out an action that takes you to the next state (a policy can be deterministic or stochastic; for this problem it is deterministic)
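
To make note 1 above concrete, here's a tiny sketch in Python (the helper name is mine, not from the book or the original figures):

    def valid_stakes(s, goal=100):
        """All stakes allowed in state s: 0 up to min(s, goal - s)."""
        return range(0, min(s, goal - s) + 1)

    # With 70 dollars on hand (the example above), the allowed stakes are 0, 1, ..., 30.
    print(list(valid_stakes(70)))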


A look at how bad a 0.4 chance of heads is in terms of winning

To get a sense of how bad things are in terms of winning, we plot the number of wins per thousand games for various starting amounts under three different coin-toss distributions

The strategy used here is to go all in, all the time. I take this as my baseline for the contrast against the optimal policy (shown later on)
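
As a rough sketch of how these wins-per-thousand numbers can be simulated (the helper names and structure are my own; this isn't the post's actual plotting code), here is a small Monte Carlo loop for the all-in baseline:

    import random

    def play_episode(start, stake_fn, p_heads=0.4, goal=100):
        """Play one game from `start`, staking according to stake_fn; True if the goal is reached."""
        s = start
        while 0 < s < goal:
            stake = stake_fn(s)  # assumed to return a positive stake for 0 < s < goal
            s += stake if random.random() < p_heads else -stake
        return s >= goal

    def all_in(s, goal=100):
        """Baseline: stake as much as the rules allow, i.e. min(s, goal - s)."""
        return min(s, goal - s)

    def wins_per_thousand(start, stake_fn, n=1000):
        """Count wins over n games from a given starting amount."""
        return sum(play_episode(start, stake_fn) for _ in range(n))

Sweeping the starting amount and re-running with a few different values of p_heads gives the kind of comparison plotted above.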


Some thoughts on a possible optimal solution

How the above points translate to the optimal policy in question

Refer to [1] for the actual state-action pairs

You'd want to concentrate your sweet spots in a few places from which you can drive in a winning shot in one go, or, failing that, jump to a spot from which you can
You'd want to make sure that every state that cannot possibly reach the goal in one sweep gets closer to a sweet/jump spot whether it wins or loses

The rest of the states follow suit: states 1 to 12 go all in and get closer to 25 (not much more can be done there within this policy), while states 88 to 99 are already within range of the goal in one shot

You want to make sure there is enough distance between these sweet/jump spots

The choice of 25, 50 and 75, the spacing that comes with them, the goal of hitting exactly 100, and the funneling of the other states on wins/losses are together responsible for the curious form

If the actions permitted overshooting the goal, i.e. ending with more than 100 dollars, the policy would look different
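
For readers who want to recompute the policy in [1] themselves, here is a minimal value-iteration sketch under the action constraint from the notes (standard dynamic programming; the sweep order, the tolerance, and breaking ties toward the smallest stake are my own choices, and the exact stakes reported are sensitive to them):

    def gambler_value_iteration(p_heads=0.4, goal=100, theta=1e-10):
        """Value iteration for the gambler's problem; returns state values and a greedy policy."""
        V = [0.0] * (goal + 1)
        V[goal] = 1.0  # reaching the goal is the only rewarded outcome
        while True:
            delta = 0.0
            for s in range(1, goal):
                best = max(p_heads * V[s + a] + (1 - p_heads) * V[s - a]
                           for a in range(1, min(s, goal - s) + 1))
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < theta:
                break
        # Greedy policy: the smallest stake whose value is within a small tolerance of the best
        policy = {}
        for s in range(1, goal):
            values = [(a, p_heads * V[s + a] + (1 - p_heads) * V[s - a])
                      for a in range(1, min(s, goal - s) + 1)]
            best_value = max(v for _, v in values)
            policy[s] = min(a for a, v in values if v >= best_value - 1e-9)
        return V, policy

The problem has several equally optimal policies, so whether the output matches the table in [1] exactly depends on that tie-breaking.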


Closing thoughts

Here's a graph that shows how the optimal policy fares against the baseline, which serves as evidence that it is a good solution.
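
For a rough sense of how that comparison could be produced with the helpers sketched earlier (assuming the (state, stake) pairs from footnote [1] are available as a Python list named OPTIMAL_PAIRS; the names here are mine):

    optimal_stake = dict(OPTIMAL_PAIRS)  # state -> stake, taken from footnote [1]

    for start in (10, 25, 50, 75, 90):
        baseline = wins_per_thousand(start, all_in)
        optimal = wins_per_thousand(start, lambda s: optimal_stake[s])
        print(f"start={start}: baseline {baseline} vs optimal {optimal} wins per thousand")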

We've established why the optimal policy has its curious form, one so symmetric and beautiful. The thing of beauty is how the policy manages losses to produce wins.


Footnotes

[1] Optimal policy state-action (state, stake) pairs

              [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12),
              (13, 12), (14, 11), (15, 10), (16, 9), (17, 8), (18, 7), (19, 6), (20, 5), (21, 4), (22, 3), (23, 2), (24, 1),
              (25, 25),
              (26, 1), (27, 2), (28, 3), (29, 4), (30, 5), (31, 6), (32, 7), (33, 8), (34, 9), (35, 10), (36, 11), (37, 12),
              (38, 12), (39, 11), (40, 10), (41, 9), (42, 8), (43, 7), (44, 6), (45, 5), (46, 4), (47, 3), (48, 2), (49, 1),
              (50, 50),
              (51, 1), (52, 2), (53, 3), (54, 4), (55, 5), (56, 6), (57, 7), (58, 8), (59, 9), (60, 10), (61, 11), (62, 12),
              (63, 12), (64, 11), (65, 10), (66, 9), (67, 8), (68, 7), (69, 6), (70, 5), (71, 4), (72, 3), (73, 2), (74, 1),
              (75, 25),
              (76, 1), (77, 2), (78, 3), (79, 4), (80, 5), (81, 6), (82, 7), (83, 8), (84, 9), (85, 10), (86, 11), (87, 12),
              (88, 12), (89, 11), (90, 10), (91, 9), (92, 8), (93, 7), (94, 6), (95, 5), (96, 4), (97, 3), (98, 2), (99, 1)]
          

Acknowledgements

Special thanks to Spencer for intensive feedback on the structure and content of this blog post