API - Reinforcement Learning
Reinforcement Learning.
discount_episode_rewards([rewards, gamma]) | Take a 1D float array of rewards and compute discounted rewards for an episode.
cross_entropy_reward_loss(logits, actions, …) | Calculate the loss for a Policy Gradient Network.
Reward functions
tensorlayer.rein.discount_episode_rewards(rewards=[], gamma=0.99)

Take a 1D float array of rewards and compute discounted rewards for an episode. When a non-zero reward is encountered, it is treated as the end of an episode.
Parameters:
- rewards : numpy list
    a list of rewards
- gamma : float
    discount factor
Examples
>>> rewards = np.asarray([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1])
>>> gamma = 0.9
>>> discount_rewards = tl.rein.discount_episode_rewards(rewards, gamma)
>>> print(discount_rewards)
[ 0.72899997  0.81        0.89999998  1.          0.72899997  0.81
  0.89999998  1.          0.72899997  0.81        0.89999998  1.        ]
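The discounting itself is a single backward pass over the reward array. Below is a minimal sketch of that rule, assuming NumPy and a hypothetical helper name (not the library's source, just the computation the docstring describes):

import numpy as np

def discount_episode_rewards_sketch(rewards, gamma=0.99):
    # Walk the rewards backwards, accumulating a discounted running
    # return; a non-zero reward marks an episode boundary, so the
    # running return is reset before that step is accumulated.
    discounted = np.zeros_like(rewards, dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0  # end of an episode: reset the return
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted

With gamma = 0.9 this reproduces the output above: each step's value is the next non-zero reward discounted once per intervening step (e.g. 0.9**3 ≈ 0.729 three steps before a reward of 1).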
Cost functions
tensorlayer.rein.cross_entropy_reward_loss(logits, actions, rewards, name=None)

Calculate the loss for a Policy Gradient Network.
Parameters:
- logits : tensor
    The network outputs without softmax; this function applies softmax internally.
- actions : tensor or placeholder
    The agent's actions.
- rewards : tensor or placeholder
    The rewards.
Examples
>>> states_batch_pl = tf.placeholder(tf.float32, shape=[None, D])  # observation for training
>>> network = tl.layers.InputLayer(states_batch_pl, name='input_layer')
>>> network = tl.layers.DenseLayer(network, n_units=H, act=tf.nn.relu, name='relu1')
>>> network = tl.layers.DenseLayer(network, n_units=3, act=tl.activation.identity, name='output_layer')
>>> probs = network.outputs
>>> sampling_prob = tf.nn.softmax(probs)
>>> actions_batch_pl = tf.placeholder(tf.int32, shape=[None])
>>> discount_rewards_batch_pl = tf.placeholder(tf.float32, shape=[None])
>>> loss = tl.rein.cross_entropy_reward_loss(probs, actions_batch_pl, discount_rewards_batch_pl)
>>> train_op = tf.train.RMSPropOptimizer(learning_rate, decay_rate).minimize(loss)
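Conceptually, this loss is a reward-weighted cross-entropy, i.e. the standard REINFORCE surrogate objective. A minimal TF1-style sketch under that assumption (hypothetical helper name; the library's actual reduction and name scoping may differ):

import tensorflow as tf

def cross_entropy_reward_loss_sketch(logits, actions, rewards):
    # Per-step cross-entropy between the unnormalized policy outputs
    # (logits) and the actions actually taken; softmax is applied here,
    # which is why `logits` must not already be softmaxed.
    ce = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)
    # Weight each step's loss by its (discounted) reward and sum:
    # minimizing this raises the log-probability of actions followed
    # by positive reward and lowers it for actions followed by negative reward.
    return tf.reduce_sum(tf.multiply(ce, rewards))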