harl.common.buffers package¶
Submodules¶
harl.common.buffers.off_policy_buffer_base module¶
Off-policy buffer.
- class harl.common.buffers.off_policy_buffer_base.OffPolicyBufferBase(args, share_obs_space, num_agents, obs_spaces, act_spaces)[source]¶
Bases:
object
- insert(data)[source]¶
Insert data into buffer.
- Parameters:
data – a tuple of (share_obs, obs, actions, available_actions, reward, done, valid_transitions, term, next_share_obs, next_obs, next_available_actions)
share_obs – EP: (n_rollout_threads, *share_obs_shape), FP: (n_rollout_threads, num_agents, *share_obs_shape)
obs – [(n_rollout_threads, *obs_shapes[agent_id]) for agent_id in range(num_agents)]
actions – [(n_rollout_threads, *act_shapes[agent_id]) for agent_id in range(num_agents)]
available_actions – [(n_rollout_threads, *act_shapes[agent_id]) for agent_id in range(num_agents)]
reward – EP: (n_rollout_threads, 1), FP: (n_rollout_threads, num_agents, 1)
done – EP: (n_rollout_threads, 1), FP: (n_rollout_threads, num_agents, 1)
valid_transitions – [(n_rollout_threads, 1) for agent_id in range(num_agents)]
term – EP: (n_rollout_threads, 1), FP: (n_rollout_threads, num_agents, 1)
next_share_obs – EP: (n_rollout_threads, *share_obs_shape), FP: (n_rollout_threads, num_agents, *share_obs_shape)
next_obs – [(n_rollout_threads, *obs_shapes[agent_id]) for agent_id in range(num_agents)]
next_available_actions – [(n_rollout_threads, *act_shapes[agent_id]) for agent_id in range(num_agents)]
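The EP/FP shape conventions above can be sketched with NumPy arrays. All sizes below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical sizes for illustration only.
n_rollout_threads, num_agents = 8, 3
share_obs_shape = (10,)          # EP: one global state per environment
obs_shapes = [(4,), (4,), (6,)]  # per-agent observation shapes
act_shapes = [(2,), (2,), (3,)]  # per-agent action shapes

# EP layout: share_obs / reward / done / term carry no agent dimension.
share_obs = np.zeros((n_rollout_threads, *share_obs_shape))
reward = np.zeros((n_rollout_threads, 1))

# Per-agent entries are lists indexed by agent_id.
obs = [np.zeros((n_rollout_threads, *obs_shapes[i])) for i in range(num_agents)]
actions = [np.zeros((n_rollout_threads, *act_shapes[i])) for i in range(num_agents)]

# FP layout adds an agent dimension to the shared entries.
share_obs_fp = np.zeros((n_rollout_threads, num_agents, *share_obs_shape))
reward_fp = np.zeros((n_rollout_threads, num_agents, 1))
```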
harl.common.buffers.off_policy_buffer_ep module¶
Off-policy buffer.
- class harl.common.buffers.off_policy_buffer_ep.OffPolicyBufferEP(args, share_obs_space, num_agents, obs_spaces, act_spaces)[source]¶
Bases:
OffPolicyBufferBase
Off-policy buffer that uses Environment-Provided (EP) state.
- sample()[source]¶
Sample data for training.
- Returns:
sp_share_obs: (batch_size, *dim)
sp_obs: (n_agents, batch_size, *dim)
sp_actions: (n_agents, batch_size, *dim)
sp_available_actions: (n_agents, batch_size, *dim)
sp_reward: (batch_size, 1)
sp_done: (batch_size, 1)
sp_valid_transitions: (n_agents, batch_size, 1)
sp_term: (batch_size, 1)
sp_next_share_obs: (batch_size, *dim)
sp_next_obs: (n_agents, batch_size, *dim)
sp_next_available_actions: (n_agents, batch_size, *dim)
sp_gamma: (batch_size, 1)
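A sketch of the shapes an EP sample has: shared entries carry a single batch axis, while per-agent entries are stacked along a leading agent axis. All sizes are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical sizes for illustration only.
n_agents, batch_size, state_dim, obs_dim = 3, 32, 10, 4

# Shared entries: one row per sampled transition.
sp_share_obs = np.zeros((batch_size, state_dim))
sp_reward = np.zeros((batch_size, 1))
sp_gamma = np.full((batch_size, 1), 0.99)

# Per-agent entries: leading agent axis over the same batch.
sp_obs = np.zeros((n_agents, batch_size, obs_dim))

# A critic consuming the EP state sees exactly one global state per
# transition, regardless of the number of agents.
```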
harl.common.buffers.off_policy_buffer_fp module¶
Off-policy buffer.
- class harl.common.buffers.off_policy_buffer_fp.OffPolicyBufferFP(args, share_obs_space, num_agents, obs_spaces, act_spaces)[source]¶
Bases:
OffPolicyBufferBase
Off-policy buffer that uses Feature-Pruned (FP) state.
When FP state is used, the critic takes a different global state as input for each actor. Thus, OffPolicyBufferFP has an extra dimension for the number of agents.
- sample()[source]¶
Sample data for training.
- Returns:
sp_share_obs: (n_agents * batch_size, *dim)
sp_obs: (n_agents, batch_size, *dim)
sp_actions: (n_agents, batch_size, *dim)
sp_available_actions: (n_agents, batch_size, *dim)
sp_reward: (n_agents * batch_size, 1)
sp_done: (n_agents * batch_size, 1)
sp_valid_transitions: (n_agents, batch_size, 1)
sp_term: (n_agents * batch_size, 1)
sp_next_share_obs: (n_agents * batch_size, *dim)
sp_next_obs: (n_agents, batch_size, *dim)
sp_next_available_actions: (n_agents, batch_size, *dim)
sp_gamma: (n_agents * batch_size, 1)
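In an FP sample, the extra agent dimension shows up on the shared entries as a flattened leading axis of size n_agents * batch_size, so the critic receives one (possibly different) global state per agent. A hypothetical NumPy sketch:

```python
import numpy as np

# Hypothetical sizes for illustration only.
n_agents, batch_size, state_dim = 3, 32, 10

# Shared FP entries are flattened over agents.
sp_share_obs = np.zeros((n_agents * batch_size, state_dim))
sp_reward = np.zeros((n_agents * batch_size, 1))

# The per-agent view can be recovered by reshaping the leading axis.
per_agent = sp_share_obs.reshape(n_agents, batch_size, state_dim)
```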
harl.common.buffers.on_policy_actor_buffer module¶
On-policy buffer for actor.
- class harl.common.buffers.on_policy_actor_buffer.OnPolicyActorBuffer(args, obs_space, act_space)[source]¶
Bases:
object
On-policy buffer for actor data storage.
- after_update()[source]¶
After an update, copy the data at the last step to the first position of the buffer.
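The copy-back described above can be sketched on a single buffer array. Shapes are hypothetical; the real buffer holds several such arrays and keeps episode_length + 1 steps so the final step can seed the next rollout:

```python
import numpy as np

# Hypothetical sizes for illustration only.
episode_length, n_rollout_threads, obs_dim = 5, 2, 3

# Buffer with episode_length + 1 time steps.
obs = np.arange((episode_length + 1) * n_rollout_threads * obs_dim,
                dtype=np.float64).reshape(episode_length + 1,
                                          n_rollout_threads, obs_dim)

# after_update: data at the last step becomes the new first step.
obs[0] = obs[-1].copy()
```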
- feed_forward_generator_actor(advantages, actor_num_mini_batch=None, mini_batch_size=None)[source]¶
Training data generator for an actor that uses an MLP network.
- insert(obs, rnn_states, actions, action_log_probs, masks, active_masks=None, available_actions=None)[source]¶
Insert data into actor buffer.
- naive_recurrent_generator_actor(advantages, actor_num_mini_batch)[source]¶
Training data generator for an actor that uses an RNN network. This generator does not split trajectories into chunks and may therefore be less efficient in training than recurrent_generator_actor.
- recurrent_generator_actor(advantages, actor_num_mini_batch, data_chunk_length)[source]¶
Training data generator for an actor that uses an RNN network. This generator splits trajectories into chunks of length data_chunk_length and may therefore be more efficient in training than naive_recurrent_generator_actor.
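The chunking idea can be sketched with step indices. This is a conceptual illustration, not the library's implementation; sizes are hypothetical and episode_length is assumed divisible by data_chunk_length:

```python
import numpy as np

# Hypothetical sizes for illustration only.
episode_length, n_rollout_threads, data_chunk_length = 12, 2, 4

# One row of consecutive step ids per rollout thread.
steps = np.arange(episode_length * n_rollout_threads).reshape(
    n_rollout_threads, episode_length)

# Flatten thread-major and regroup into temporally contiguous chunks;
# the generator then shuffles chunk indices into mini-batches so RNN
# hidden states can be unrolled within each chunk.
flat = steps.reshape(-1)
num_chunks = flat.shape[0] // data_chunk_length
chunks = flat.reshape(num_chunks, data_chunk_length)
```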
harl.common.buffers.on_policy_critic_buffer_ep module¶
On-policy buffer for critic that uses Environment-Provided (EP) state.
- class harl.common.buffers.on_policy_critic_buffer_ep.OnPolicyCriticBufferEP(args, share_obs_space)[source]¶
Bases:
object
On-policy buffer for critic that uses Environment-Provided (EP) state.
- after_update()[source]¶
After an update, copy the data at the last step to the first position of the buffer.
- compute_returns(next_value, value_normalizer=None)[source]¶
Compute returns either as the discounted sum of rewards, or using GAE (Generalized Advantage Estimation).
- Parameters:
next_value – (np.ndarray) value predictions for the step after the last episode step.
value_normalizer – (ValueNorm) if not None, a ValueNorm value normalizer instance.
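Conceptually, GAE-based return computation walks the rollout backwards, accumulating TD errors. A minimal single-environment sketch, assuming gamma and gae_lambda hyperparameters; all values below are illustrative, not from the library:

```python
import numpy as np

# Hypothetical hyperparameters and rollout data for illustration only.
gamma, gae_lambda = 0.99, 0.95
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6])   # V(s_t) predictions
next_value = 0.3                     # value for the step after the last
masks = np.array([1.0, 1.0, 1.0])    # 0 where an episode ended

values_ext = np.append(values, next_value)
gae = 0.0
returns = np.zeros_like(rewards)
for t in reversed(range(len(rewards))):
    # TD error at step t, zeroed across episode boundaries by masks.
    delta = rewards[t] + gamma * values_ext[t + 1] * masks[t] - values_ext[t]
    gae = delta + gamma * gae_lambda * masks[t] * gae
    returns[t] = gae + values_ext[t]
```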
- feed_forward_generator_critic(critic_num_mini_batch=None, mini_batch_size=None)[source]¶
Training data generator for a critic that uses an MLP network.
- Parameters:
critic_num_mini_batch – (int) number of mini-batches for the critic.
mini_batch_size – (int) size of each mini-batch for the critic.
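A feed-forward generator typically shuffles the flattened batch and splits it into equal mini-batches. The sketch below is a conceptual illustration, not the library's code; sizes are hypothetical:

```python
import numpy as np

# Hypothetical sizes for illustration only.
batch_size, critic_num_mini_batch = 12, 3
mini_batch_size = batch_size // critic_num_mini_batch

# Shuffle flattened step indices, then carve off equal mini-batches.
rng = np.random.default_rng(0)
perm = rng.permutation(batch_size)
mini_batches = [perm[i * mini_batch_size:(i + 1) * mini_batch_size]
                for i in range(critic_num_mini_batch)]
```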
- insert(share_obs, rnn_states_critic, value_preds, rewards, masks, bad_masks)[source]¶
Insert data into buffer.
- naive_recurrent_generator_critic(critic_num_mini_batch)[source]¶
Training data generator for a critic that uses an RNN network. This generator does not split trajectories into chunks and may therefore be less efficient in training than recurrent_generator_critic.
- Parameters:
critic_num_mini_batch – (int) number of mini-batches for the critic.
- recurrent_generator_critic(critic_num_mini_batch, data_chunk_length)[source]¶
Training data generator for a critic that uses an RNN network. This generator splits trajectories into chunks of length data_chunk_length and may therefore be more efficient in training than naive_recurrent_generator_critic.
- Parameters:
critic_num_mini_batch – (int) number of mini-batches for the critic.
data_chunk_length – (int) length of each data chunk.
harl.common.buffers.on_policy_critic_buffer_fp module¶
On-policy buffer for critic that uses Feature-Pruned (FP) state.
- class harl.common.buffers.on_policy_critic_buffer_fp.OnPolicyCriticBufferFP(args, share_obs_space, num_agents)[source]¶
Bases:
object
On-policy buffer for critic that uses Feature-Pruned (FP) state. When FP state is used, the critic takes a different global state as input for each actor. Thus, OnPolicyCriticBufferFP has an extra dimension for the number of agents compared to OnPolicyCriticBufferEP.
- after_update()[source]¶
After an update, copy the data at the last step to the first position of the buffer.
- compute_returns(next_value, value_normalizer=None)[source]¶
Compute returns either as the discounted sum of rewards, or using GAE (Generalized Advantage Estimation).
- Parameters:
next_value – (np.ndarray) value predictions for the step after the last episode step.
value_normalizer – (ValueNorm) if not None, a ValueNorm value normalizer instance.
- feed_forward_generator_critic(critic_num_mini_batch=None, mini_batch_size=None)[source]¶
Training data generator for a critic that uses an MLP network.
- Parameters:
critic_num_mini_batch – (int) number of mini-batches for the critic.
mini_batch_size – (int) size of each mini-batch for the critic.
- insert(share_obs, rnn_states_critic, value_preds, rewards, masks, bad_masks)[source]¶
Insert data into buffer.
- naive_recurrent_generator_critic(critic_num_mini_batch)[source]¶
Training data generator for a critic that uses an RNN network. This generator does not split trajectories into chunks and may therefore be less efficient in training than recurrent_generator_critic.
- Parameters:
critic_num_mini_batch – (int) number of mini-batches for the critic.
- recurrent_generator_critic(critic_num_mini_batch, data_chunk_length)[source]¶
Training data generator for a critic that uses an RNN network. This generator splits trajectories into chunks of length data_chunk_length and may therefore be more efficient in training than naive_recurrent_generator_critic.
- Parameters:
critic_num_mini_batch – (int) number of mini-batches for the critic.
data_chunk_length – (int) length of each data chunk.