harl.common.buffers package¶
Submodules¶
harl.common.buffers.off_policy_buffer_base module¶
Off-policy buffer.
- class harl.common.buffers.off_policy_buffer_base.OffPolicyBufferBase(args, share_obs_space, num_agents, obs_spaces, act_spaces)[source]¶
Bases:
object
- insert(data)[source]¶
Insert data into buffer.
- Parameters:
data – a tuple of (share_obs, obs, actions, available_actions, reward, done, valid_transitions, term, next_share_obs, next_obs, next_available_actions)
share_obs – EP: (n_rollout_threads, *share_obs_shape), FP: (n_rollout_threads, num_agents, *share_obs_shape)
obs – [(n_rollout_threads, *obs_shapes[agent_id]) for agent_id in range(num_agents)]
actions – [(n_rollout_threads, *act_shapes[agent_id]) for agent_id in range(num_agents)]
available_actions – [(n_rollout_threads, *act_shapes[agent_id]) for agent_id in range(num_agents)]
reward – EP: (n_rollout_threads, 1), FP: (n_rollout_threads, num_agents, 1)
done – EP: (n_rollout_threads, 1), FP: (n_rollout_threads, num_agents, 1)
valid_transitions – [(n_rollout_threads, 1) for agent_id in range(num_agents)]
term – EP: (n_rollout_threads, 1), FP: (n_rollout_threads, num_agents, 1)
next_share_obs – EP: (n_rollout_threads, *share_obs_shape), FP: (n_rollout_threads, num_agents, *share_obs_shape)
next_obs – [(n_rollout_threads, *obs_shapes[agent_id]) for agent_id in range(num_agents)]
next_available_actions – [(n_rollout_threads, *act_shapes[agent_id]) for agent_id in range(num_agents)]
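The EP/FP shape conventions above can be sketched with NumPy arrays. All sizes below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical sizes for illustration only.
n_rollout_threads, num_agents = 8, 3
share_obs_shape = (10,)          # EP: one global state per environment
obs_shapes = [(4,), (4,), (6,)]  # per-agent observation shapes
act_shapes = [(2,), (2,), (3,)]  # per-agent action shapes

# EP layout: share_obs / reward / done / term carry no agent dimension.
share_obs = np.zeros((n_rollout_threads, *share_obs_shape))
reward = np.zeros((n_rollout_threads, 1))

# Per-agent entries are lists indexed by agent_id.
obs = [np.zeros((n_rollout_threads, *obs_shapes[i])) for i in range(num_agents)]
actions = [np.zeros((n_rollout_threads, *act_shapes[i])) for i in range(num_agents)]

# FP layout adds an agent dimension to the shared entries.
share_obs_fp = np.zeros((n_rollout_threads, num_agents, *share_obs_shape))
reward_fp = np.zeros((n_rollout_threads, num_agents, 1))
```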
harl.common.buffers.off_policy_buffer_ep module¶
Off-policy buffer.
- class harl.common.buffers.off_policy_buffer_ep.OffPolicyBufferEP(args, share_obs_space, num_agents, obs_spaces, act_spaces)[source]¶
Bases:
OffPolicyBufferBase
Off-policy buffer that uses Environment-Provided (EP) state.
- sample()[source]¶
Sample data for training.
- Returns:
sp_share_obs: (batch_size, *dim)
sp_obs: (n_agents, batch_size, *dim)
sp_actions: (n_agents, batch_size, *dim)
sp_available_actions: (n_agents, batch_size, *dim)
sp_reward: (batch_size, 1)
sp_done: (batch_size, 1)
sp_valid_transitions: (n_agents, batch_size, 1)
sp_term: (batch_size, 1)
sp_next_share_obs: (batch_size, *dim)
sp_next_obs: (n_agents, batch_size, *dim)
sp_next_available_actions: (n_agents, batch_size, *dim)
sp_gamma: (batch_size, 1)
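A sketch of the shapes an EP sample has: shared entries carry a single batch axis, while per-agent entries are stacked along a leading agent axis. All sizes are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical sizes for illustration only.
n_agents, batch_size, state_dim, obs_dim = 3, 32, 10, 4

# Shared entries: one row per sampled transition.
sp_share_obs = np.zeros((batch_size, state_dim))
sp_reward = np.zeros((batch_size, 1))
sp_gamma = np.full((batch_size, 1), 0.99)

# Per-agent entries: leading agent axis over the same batch.
sp_obs = np.zeros((n_agents, batch_size, obs_dim))

# A critic consuming the EP state sees exactly one global state per
# transition, regardless of the number of agents.
```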
harl.common.buffers.off_policy_buffer_fp module¶
Off-policy buffer.
- class harl.common.buffers.off_policy_buffer_fp.OffPolicyBufferFP(args, share_obs_space, num_agents, obs_spaces, act_spaces)[source]¶
Bases:
OffPolicyBufferBase
Off-policy buffer that uses Feature-Pruned (FP) state.
When FP state is used, the critic takes a different global state as input for each actor. Thus, OffPolicyBufferFP has an extra dimension for the number of agents.
- sample()[source]¶
Sample data for training.
- Returns:
sp_share_obs: (n_agents * batch_size, *dim)
sp_obs: (n_agents, batch_size, *dim)
sp_actions: (n_agents, batch_size, *dim)
sp_available_actions: (n_agents, batch_size, *dim)
sp_reward: (n_agents * batch_size, 1)
sp_done: (n_agents * batch_size, 1)
sp_valid_transitions: (n_agents, batch_size, 1)
sp_term: (n_agents * batch_size, 1)
sp_next_share_obs: (n_agents * batch_size, *dim)
sp_next_obs: (n_agents, batch_size, *dim)
sp_next_available_actions: (n_agents, batch_size, *dim)
sp_gamma: (n_agents * batch_size, 1)
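In an FP sample, the extra agent dimension shows up on the shared entries as a flattened leading axis of size n_agents * batch_size, so the critic receives one (possibly different) global state per agent. A hypothetical NumPy sketch:

```python
import numpy as np

# Hypothetical sizes for illustration only.
n_agents, batch_size, state_dim = 3, 32, 10

# Shared FP entries are flattened over agents.
sp_share_obs = np.zeros((n_agents * batch_size, state_dim))
sp_reward = np.zeros((n_agents * batch_size, 1))

# The per-agent view can be recovered by reshaping the leading axis.
per_agent = sp_share_obs.reshape(n_agents, batch_size, state_dim)
```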
harl.common.buffers.on_policy_actor_buffer module¶
On-policy buffer for actor.
- class harl.common.buffers.on_policy_actor_buffer.OnPolicyActorBuffer(args, obs_space, act_space)[source]¶
Bases:
object
On-policy buffer for actor data storage.
- after_update()[source]¶
After an update, copy the data at the last step to the first position of the buffer.
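The copy-back described above can be sketched on a single buffer array. Shapes are hypothetical; the real buffer holds several such arrays and keeps episode_length + 1 steps so the final step can seed the next rollout:

```python
import numpy as np

# Hypothetical sizes for illustration only.
episode_length, n_rollout_threads, obs_dim = 5, 2, 3

# Buffer with episode_length + 1 time steps.
obs = np.arange((episode_length + 1) * n_rollout_threads * obs_dim,
                dtype=np.float64).reshape(episode_length + 1,
                                          n_rollout_threads, obs_dim)

# after_update: data at the last step becomes the new first step.
obs[0] = obs[-1].copy()
```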
- feed_forward_generator_actor(advantages, actor_num_mini_batch=None, mini_batch_size=None)[source]¶
Training data generator for an actor that uses an MLP network.
- insert(obs, rnn_states, actions, action_log_probs, masks, active_masks=None, available_actions=None)[source]¶
Insert data into actor buffer.
- naive_recurrent_generator_actor(advantages, actor_num_mini_batch)[source]¶
Training data generator for an actor that uses an RNN network. This generator does not split trajectories into chunks and may therefore be less efficient in training than recurrent_generator_actor.
- recurrent_generator_actor(advantages, actor_num_mini_batch, data_chunk_length)[source]¶
Training data generator for an actor that uses an RNN network. This generator splits trajectories into chunks of length data_chunk_length and may therefore be more efficient in training than naive_recurrent_generator_actor.
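The chunking idea can be sketched with step indices. This is a conceptual illustration, not the library's implementation; sizes are hypothetical and episode_length is assumed divisible by data_chunk_length:

```python
import numpy as np

# Hypothetical sizes for illustration only.
episode_length, n_rollout_threads, data_chunk_length = 12, 2, 4

# One row of consecutive step ids per rollout thread.
steps = np.arange(episode_length * n_rollout_threads).reshape(
    n_rollout_threads, episode_length)

# Flatten thread-major and regroup into temporally contiguous chunks;
# the generator then shuffles chunk indices into mini-batches so RNN
# hidden states can be unrolled within each chunk.
flat = steps.reshape(-1)
num_chunks = flat.shape[0] // data_chunk_length
chunks = flat.reshape(num_chunks, data_chunk_length)
```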
harl.common.buffers.on_policy_critic_buffer_ep module¶
On-policy buffer for critic that uses Environment-Provided (EP) state.
- class harl.common.buffers.on_policy_critic_buffer_ep.OnPolicyCriticBufferEP(args, share_obs_space)[source]¶
Bases:
object
On-policy buffer for critic that uses Environment-Provided (EP) state.
- after_update()[source]¶
After an update, copy the data at the last step to the first position of the buffer.
- compute_returns(next_value, value_normalizer=None)[source]¶
Compute returns either as the discounted sum of rewards, or using GAE (Generalized Advantage Estimation).
- Parameters:
next_value – (np.ndarray) value predictions for the step after the last episode step.
value_normalizer – (ValueNorm) if not None, a ValueNorm value normalizer instance.
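Conceptually, GAE-based return computation walks the rollout backwards, accumulating TD errors. A minimal single-environment sketch, assuming gamma and gae_lambda hyperparameters; all values below are illustrative, not from the library:

```python
import numpy as np

# Hypothetical hyperparameters and rollout data for illustration only.
gamma, gae_lambda = 0.99, 0.95
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6])   # V(s_t) predictions
next_value = 0.3                     # value for the step after the last
masks = np.array([1.0, 1.0, 1.0])    # 0 where an episode ended

values_ext = np.append(values, next_value)
gae = 0.0
returns = np.zeros_like(rewards)
for t in reversed(range(len(rewards))):
    # TD error at step t, zeroed across episode boundaries by masks.
    delta = rewards[t] + gamma * values_ext[t + 1] * masks[t] - values_ext[t]
    gae = delta + gamma * gae_lambda * masks[t] * gae
    returns[t] = gae + values_ext[t]
```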
- feed_forward_generator_critic(critic_num_mini_batch=None, mini_batch_size=None)[source]¶
Training data generator for a critic that uses an MLP network.
- Parameters:
critic_num_mini_batch – (int) number of mini-batches for the critic.
mini_batch_size – (int) size of each mini-batch for the critic.
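A feed-forward generator typically shuffles the flattened batch and splits it into equal mini-batches. The sketch below is a conceptual illustration, not the library's code; sizes are hypothetical:

```python
import numpy as np

# Hypothetical sizes for illustration only.
batch_size, critic_num_mini_batch = 12, 3
mini_batch_size = batch_size // critic_num_mini_batch

# Shuffle flattened step indices, then carve off equal mini-batches.
rng = np.random.default_rng(0)
perm = rng.permutation(batch_size)
mini_batches = [perm[i * mini_batch_size:(i + 1) * mini_batch_size]
                for i in range(critic_num_mini_batch)]
```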
- insert(share_obs, rnn_states_critic, value_preds, rewards, masks, bad_masks)[source]¶
Insert data into buffer.
- naive_recurrent_generator_critic(critic_num_mini_batch)[source]¶
Training data generator for a critic that uses an RNN network. This generator does not split trajectories into chunks and may therefore be less efficient in training than recurrent_generator_critic.
- Parameters:
critic_num_mini_batch – (int) number of mini-batches for the critic.
- recurrent_generator_critic(critic_num_mini_batch, data_chunk_length)[source]¶
Training data generator for a critic that uses an RNN network. This generator splits trajectories into chunks of length data_chunk_length and may therefore be more efficient in training than naive_recurrent_generator_critic.
- Parameters:
critic_num_mini_batch – (int) number of mini-batches for the critic.
data_chunk_length – (int) length of each data chunk.
harl.common.buffers.on_policy_critic_buffer_fp module¶
On-policy buffer for critic that uses Feature-Pruned (FP) state.
- class harl.common.buffers.on_policy_critic_buffer_fp.OnPolicyCriticBufferFP(args, share_obs_space, num_agents)[source]¶
Bases:
object
On-policy buffer for critic that uses Feature-Pruned (FP) state. When FP state is used, the critic takes a different global state as input for each actor. Thus, OnPolicyCriticBufferFP has an extra dimension for the number of agents compared to OnPolicyCriticBufferEP.
- after_update()[source]¶
After an update, copy the data at the last step to the first position of the buffer.
- compute_returns(next_value, value_normalizer=None)[source]¶
Compute returns either as the discounted sum of rewards, or using GAE (Generalized Advantage Estimation).
- Parameters:
next_value – (np.ndarray) value predictions for the step after the last episode step.
value_normalizer – (ValueNorm) if not None, a ValueNorm value normalizer instance.
- feed_forward_generator_critic(critic_num_mini_batch=None, mini_batch_size=None)[source]¶
Training data generator for a critic that uses an MLP network.
- Parameters:
critic_num_mini_batch – (int) number of mini-batches for the critic.
mini_batch_size – (int) size of each mini-batch for the critic.
- insert(share_obs, rnn_states_critic, value_preds, rewards, masks, bad_masks)[source]¶
Insert data into buffer.
- naive_recurrent_generator_critic(critic_num_mini_batch)[source]¶
Training data generator for a critic that uses an RNN network. This generator does not split trajectories into chunks and may therefore be less efficient in training than recurrent_generator_critic.
- Parameters:
critic_num_mini_batch – (int) number of mini-batches for the critic.
- recurrent_generator_critic(critic_num_mini_batch, data_chunk_length)[source]¶
Training data generator for a critic that uses an RNN network. This generator splits trajectories into chunks of length data_chunk_length and may therefore be more efficient in training than naive_recurrent_generator_critic.
- Parameters:
critic_num_mini_batch – (int) number of mini-batches for the critic.
data_chunk_length – (int) length of each data chunk.