---
# <div align="center"><font color='green'> COSC 2673/2793 | Machine Learning  </font></div>
## <div align="center"> <font color='green'> Week 8 Lab Exercises: **Reinforcement learning**</font></div>
---

## Introduction

In this lab you will:

1. Learn how to use OpenAI Gym.  
2. Implement Q-learning to solve a well-known toy reinforcement learning problem, the [Cartpole problem](https://gym.openai.com/envs/CartPole-v1/).  


**It is tricky to get OpenAI Gym to work on AWS (without using AWS-specific functionality). Therefore, please run this lab in the Anaconda environment on your PC.**


## Cartpole with RL
`Cartpole` is a classic control reinforcement learning problem that was first introduced by Barto, Sutton, and Anderson [Barto83]. 
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 12 degrees from vertical (the threshold used in the Gym implementation), when the cart moves more than 2.4 units from the center, or, in `CartPole-v0`, after 200 timesteps.
    

Cartpole Problem definition:
>Objective: Prevent the pole (pendulum) from falling over.

>State: {Cart Position, Cart Velocity, Pole Angle, Pole Angular Velocity}

>Action: {Push cart to the left, Push cart to the right}.

>Reward: +1 for every timestep that the pole remains upright (we will change this slightly in our implementation)

## OpenAI Gym 
OpenAI Gym is a Python package comprising a selection of RL environments, ranging from simple “toy” environments to more challenging environments, including simulated robotics environments and Atari video game environments.
It was developed with the aim of becoming a standardized environment and benchmark for RL research.
In this lab, we will use the OpenAI Gym Cartpole environment to show how to get started with this tool and how Q-learning can be used to solve the problem.

## Setting up the environment
Let's first import the libraries required for the implementation.

In [None]:
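# Install gym first (only needed once per environment).
# Note: this notebook uses the classic gym API, in which reset() returns the state
# and step() returns four values; gym 0.26 changed both, so an older release
# (e.g. pip install "gym<0.26") may be needed if the cells below fail to unpack.
!pip install gym

import matplotlib.pyplot as plt
%matplotlib inline
from IPython import display
import numpy as np
import gym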

**Only run the following block if the visualization of the environment gives an error.** On macOS you need to install pyglet version 1.5.11 to get the gym environment to render. The installation will report an error, but it will still work.

In [None]:
!pip install pyglet==1.5.11

To begin with this environment, import and initialize it as follows:

In [None]:
env = gym.make('CartPole-v0')
state = env.reset()
print(state)

The `env.reset()` command resets the environment and returns the initial state.

Let's explore the state space and the action space of the Cartpole environment.

In [None]:
print('State space: ', env.observation_space)
print('Action space: ', env.action_space)

This tells us that the state space is 4-dimensional, so each state observation is a vector of 4 (float) values, and that the action space comprises two discrete actions (Push cart to the left, Push cart to the right). By default, the two actions are represented by the integers 0 and 1. How about the state space? What are its limits?

In [None]:
print('State space Low: ', env.observation_space.low)
print('State space High: ', env.observation_space.high)

This shows that the first state variable (Cart Position) has a range of [-4.8, 4.8], the second state variable (Cart Velocity) has a range of ($-\infty$, $\infty$), and so on. The state space of the environment is continuous, which means that there are infinitely many state-action pairs, making it impossible to build a Q table directly. As a solution, we can discretize the state space. One simple discretization is to convert the state space to a grid with 20 grid positions along each dimension. Note that we have truncated the two dimensions with infinite limits (the two velocities) to [-4, 4].

In [None]:
numBins = 20
bins = [np.linspace(-4.8, 4.8, numBins),
        np.linspace(-4, 4, numBins),
        np.linspace(-.418, .418, numBins),
        np.linspace(-4, 4, numBins)]
obsSpaceSize = len(env.observation_space.high)

We can also write a function that converts a continuous state vector to a discrete one. 

In [None]:
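def discretize_state(state, bins, obsSpaceSize):
    # Map each continuous state variable to the index of the bin it falls in
    stateIndex = []
    for i in range(obsSpaceSize):
        index = np.digitize(state[i], bins[i]) - 1  # -1 turns the bin number into a 0-based index
        # Clip so that values below the lowest edge map to index 0 instead of wrapping around to -1
        stateIndex.append(int(np.clip(index, 0, len(bins[i]) - 1)))
    return tuple(stateIndex)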
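
To see what `np.digitize` does in this function, the sketch below discretizes a few hand-picked cart positions using evenly spaced bin edges of the same kind. The helper name `to_index`, the choice of 5 edges, and the sample values are assumptions made for illustration; they are not part of the lab code.

```python
import numpy as np

# 5 evenly spaced edges over [-4.8, 4.8] (the lab uses 20; 5 keeps the output readable)
edges = np.linspace(-4.8, 4.8, 5)   # edges at -4.8, -2.4, 0.0, 2.4, 4.8

def to_index(value, edges):
    """Hypothetical helper: 0-based index of the interval that contains value."""
    # np.digitize returns i such that edges[i-1] <= value < edges[i]
    return int(np.digitize(value, edges)) - 1

for position in [-3.0, -0.1, 1.7, 4.0]:
    print(position, '->', to_index(position, edges))

# A value below the first edge produces -1, which NumPy would interpret
# as "last element" if it were used directly as an array index
print(-5.0, '->', to_index(-5.0, edges))
```

The four in-range positions map to indices 0, 1, 2 and 3, while the out-of-range position maps to -1, which is why an index may need clipping before it is used to address a Q table.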

Let's now take some random actions in the environment and see what the output is. For this we need a function to plot the output of the environment. 

In [None]:
def show_state(env, step=0, info=""):
    plt.figure(1)
    plt.clf()
    plt.imshow(env.render(mode='rgb_array'))
    plt.title("Step: %d %s" % (step, info))
    plt.axis('off')

    display.clear_output(wait=True)
    display.display(plt.gcf())

In [None]:
env.reset()
done = False
step_index = 0
while not done:
    action = env.action_space.sample()    # get a random action from the set of actions
    state, reward, done, info = env.step(action) # perform the action and receive new state and reward
    d_state = discretize_state(state, bins, obsSpaceSize)
    show_state(env, step=step_index, info='State ({},{},{},{}) Reward: {}'.format(d_state[0], d_state[1],d_state[2], d_state[3], reward))
    step_index = step_index + 1

Did the pole remain upright for a long time?

## Learning 

We are going to use Q-learning for this task. Let's first define some hyperparameters. You may change them later to get better performance.

In [None]:
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 50000

# parameters for epsilon decay policy
EPSILON = 1 # not a constant, going to be decayed
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = EPSILON / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

#for testing
N_TEST_RUNS = 100
TEST_INTERVAL = 5000
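
The decay parameters above describe a linear schedule: epsilon starts at 1 and is reduced by `epsilon_decay_value` once per episode while the episode number lies between `START_EPSILON_DECAYING` and `END_EPSILON_DECAYING`. The sketch below replays only that schedule (no environment and no learning) with a much smaller, hypothetical episode count so the numbers are easy to follow. The function name `epsilon_schedule` and the value of 10 episodes are illustrative assumptions.

```python
def epsilon_schedule(episodes, start_epsilon=1.0, start_decay=1):
    """Return the epsilon in effect during each episode under a linear decay."""
    end_decay = episodes // 2
    decay_value = start_epsilon / (end_decay - start_decay)
    epsilon = start_epsilon
    history = []
    for episode in range(episodes):
        history.append(epsilon)          # epsilon used while this episode runs
        if start_decay <= episode <= end_decay:
            epsilon -= decay_value       # decay once per episode inside the window
    return history

history = epsilon_schedule(episodes=10)
print([round(e, 2) for e in history])
```

With these toy numbers the window covers episodes 1 to 5, which is five decay steps of 0.25, so epsilon finishes one decay step below zero. With the lab's values the same overshoot is only about 0.00004; wrapping the update in `max(0.0, ...)` would be one way to avoid it.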

Write a function to test a given model. The function outputs two performance values: whether the run was successful (pole upright for 200 steps) and the number of steps it ran. 

> **<font color='red'><span style="font-size:1.5em;">☞</span> Task: Identify if there are better performance measures to be used for this task and discuss with tutor. </font>**  

In [None]:
def test_model(Qtable):
    state = env.reset()
    dstate = discretize_state(state, bins, obsSpaceSize)
    done = False
    steps = 0
    while not done:
        action = np.argmax(Qtable[dstate]) 
        state, reward, done, _ = env.step(action)
        dstate = discretize_state(state, bins, obsSpaceSize)
        steps = steps + 1
        
    success_run = 0
    if steps > 199:
        success_run = 1
        
    return success_run, steps

Now let's develop a function for Q-learning. The function prototype is given below. Assume that Q is a NumPy array with dimensions (number of elements for Cart Position, number of elements for Cart Velocity, number of elements for Pole Angle, number of elements for Pole Angular Velocity, number of actions).
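
As a quick illustration of that layout, the sketch below builds a table of the stated shape and indexes it with a discretized state tuple. The variable names, the all-zeros initialization, and the example state are assumptions for illustration only.

```python
import numpy as np

num_bins = 20        # grid positions per state variable, as in the lab
num_state_vars = 4   # cart position, cart velocity, pole angle, pole angular velocity
num_actions = 2      # push left, push right

# One axis per state variable plus a final axis for the actions
q_table = np.zeros([num_bins] * num_state_vars + [num_actions])
print(q_table.shape)   # (20, 20, 20, 20, 2)
print(q_table.size)    # 320000

# A discretized state is a tuple of four indices; indexing with it
# returns the vector of Q-values for that state, one entry per action
d_state = (10, 9, 10, 9)            # hypothetical state near the middle of the grid
q_table[d_state + (1,)] = 0.5       # set Q(state, action=1)
print(q_table[d_state])             # Q-values for both actions in this state
print(np.argmax(q_table[d_state]))  # 1, the greedy action
```

Appending the action to the state tuple addresses a single Q-value, while indexing with the state tuple alone gives the row that `np.argmax` is applied to when acting greedily.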

<span style="font-size:1.5em;">☞</span> Complete the following function using the knowledge gained in the lecture. 

In [None]:
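def QLearning(env, QTable):
    # env: the OpenAI Gym environment
    # QTable: initial Q table
    
    # Start from the initial exploration rate; it is decayed across episodes below.
    # (Setting it inside the episode loop would reset it to EPSILON every episode.)
    epsilon = EPSILON
    
    for episode in range(EPISODES):
        done = False
        
        # get the initial state
        state = env.reset()
        discretState = discretize_state(state, bins, obsSpaceSize)
    
        steps = 0
        while not done:   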
                
            # Determine next action - epsilon greedy strategy for explore vs exploitation
            if np.random.random() < 1 - epsilon:
                # select the best action according to Qtable (exploitation)
                # TODO
            else:
                # select a random action (exploration)
                # TODO
                
            # Step and Get the next state and reward
            # TODO
            
            
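            # Allow for terminal states: penalize episodes that end before the 200-step limit.
            # steps counts the steps completed before this one, so the 200th step has steps == 199
            if done and steps < 199:
                reward = -375    # what is happening here?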
                
            # Update the Q table
            # TODO
                                     
            # Update variables
            discretState = discretStateNew
            steps = steps + 1
            
            
        # Update epsilon
        if START_EPSILON_DECAYING <= episode <= END_EPSILON_DECAYING:
            epsilon -= epsilon_decay_value
        
        # test the model and print test results
        if episode % TEST_INTERVAL == 0:
            success_run_ = list()
            steps_ = list()
            for i in range(N_TEST_RUNS):
                success_run, steps = test_model(QTable)
                success_run_.append(success_run)
                steps_.append(steps)
                
            print('Testing at Episode {}:'.format(episode))
            print('\t Successful Runs: {}/{}'.format(np.sum(success_run_), N_TEST_RUNS) )
            print('\t Average Steps: {}'.format(np.mean(steps_)))

    env.close()
    
    return QTable

### Sample Solutions

If you are struggling with the above function, a sample solution has been provided.
Only use this if you have **made your absolute best attempts** at implementing the function yourself.
The purpose of this lab is to understand common aspects of RL algorithms, through the Q-learning algorithm.
You will gain significantly less out of this lab if you don't try to solve the problems yourself.
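
For reference while debugging your own attempt, the update at the heart of tabular Q-learning is $Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,]$, where $\alpha$ is the learning rate and $\gamma$ the discount. The sketch below applies it to a tiny hypothetical table with three states and two actions rather than to the Cartpole Q table. The function names `choose_action` and `q_update`, the toy table, and all numbers are assumptions for illustration; this is not the lab's sample solution. Treating the value of a terminal state as zero is a common convention, whereas the lab template instead overrides the reward when an episode ends early.

```python
import numpy as np

LEARNING_RATE = 0.1
DISCOUNT = 0.95

def choose_action(q_table, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[-1]))   # random action
    return int(np.argmax(q_table[state]))             # best known action

def q_update(q_table, state, action, reward, new_state, done):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Assumed convention: a terminal state has no future, so it bootstraps from zero
    max_future_q = 0.0 if done else np.max(q_table[new_state])
    target = reward + DISCOUNT * max_future_q
    current_q = q_table[state + (action,)]
    q_table[state + (action,)] = current_q + LEARNING_RATE * (target - current_q)
    return q_table[state + (action,)]

# Toy table: 3 states (each indexed by a 1-tuple) x 2 actions, all zeros
q_table = np.zeros((3, 2))
q_table[(1,)] = [0.0, 2.0]     # pretend state 1 already looks promising

rng = np.random.default_rng(0)
action = choose_action(q_table, (1,), epsilon=0.0, rng=rng)   # epsilon 0 means always greedy
new_q = q_update(q_table, (0,), 1, reward=1.0, new_state=(1,), done=False)
print(action, new_q)
```

In this toy step the target is 1 + 0.95 × 2 = 2.9, so the entry for state 0 and action 1 moves from 0 a tenth of the way towards it, to roughly 0.29.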

> **<font color='red'><span style="font-size:1.5em;">☞</span> Task: Identify what is happening in the epsilon decay policy. Discuss with tutor. </font>**  

In [None]:
# Initialize Q table randomly
Initial_QTable = np.random.uniform(low=-2, high=0, size=([numBins] * obsSpaceSize + [env.action_space.n]))

# Run Q-learning algorithm
QTable = QLearning(env, Initial_QTable)

Now let's see how well we can perform the task with the learned model.

In [None]:
state = env.reset()
dState = discretize_state(state, bins, obsSpaceSize)
done = False
step_index = 0
while not done:
    action = np.argmax(QTable[dState]) 
    state, reward, done, info = env.step(action)
    dState = discretize_state(state, bins, obsSpaceSize)
    show_state(env, step=step_index, info='State ({},{},{},{}) Reward: {}'.format(dState[0], dState[1],dState[2], dState[3], reward))
    step_index = step_index + 1
