harl.algorithms.actors package¶
Submodules¶
harl.algorithms.actors.haa2c module¶
HAA2C algorithm.
- class harl.algorithms.actors.haa2c.HAA2C(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OnPolicyBase
- train(actor_buffer, advantages, state_type)[source]¶
Perform a training update using minibatch GD.
- Parameters:
actor_buffer – (OnPolicyActorBuffer) buffer containing training data related to actor.
advantages – (np.ndarray) advantages.
state_type – (str) type of state.
- Returns:
train_info – (dict) contains information regarding training update (e.g. loss, grad norms, etc.).
- update(sample)[source]¶
Update actor network.
- Parameters:
sample – (Tuple) contains data batch with which to update networks.
- Returns:
policy_loss – (torch.Tensor) actor (policy) loss value.
dist_entropy – (torch.Tensor) action entropies.
actor_grad_norm – (torch.Tensor) gradient norm from actor update.
imp_weights – (torch.Tensor) importance sampling weights.
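The sketch below illustrates, with plain PyTorch, the kind of actor update whose outputs train() and update() report (policy_loss, dist_entropy, actor_grad_norm). The tiny network, batch shapes, and coefficients are illustrative assumptions, not HARL's actual modules or configuration.

```python
import torch
import torch.nn as nn

# Generic A2C-style actor update mirroring the quantities the HAA2C
# docstrings name; the policy network and hyperparameters are placeholders.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=5e-4)

obs = torch.randn(32, 4)             # (batch_size, obs_dim)
actions = torch.randint(0, 2, (32,)) # discrete actions taken in the batch
advantages = torch.randn(32)         # precomputed advantages, as passed to train()

dist = torch.distributions.Categorical(logits=policy(obs))
log_probs = dist.log_prob(actions)
dist_entropy = dist.entropy().mean()

# policy-gradient loss with a small entropy bonus
policy_loss = -(log_probs * advantages).mean() - 0.01 * dist_entropy

optimizer.zero_grad()
policy_loss.backward()
actor_grad_norm = nn.utils.clip_grad_norm_(policy.parameters(), max_norm=10.0)
optimizer.step()

train_info = {"policy_loss": policy_loss.item(),
              "dist_entropy": dist_entropy.item(),
              "actor_grad_norm": actor_grad_norm.item()}
```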
harl.algorithms.actors.had3qn module¶
HAD3QN algorithm.
- class harl.algorithms.actors.had3qn.HAD3QN(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OffPolicyBase
- get_actions(obs, epsilon_greedy)[source]¶
Get actions for observations.
- Parameters:
obs – (np.ndarray) observations of actor, shape is (n_threads, dim) or (batch_size, dim).
epsilon_greedy – (bool) whether to choose actions epsilon-greedily.
- Returns:
actions – (torch.Tensor) actions taken by this actor, shape is (n_threads, 1) or (batch_size, 1).
- get_target_actions(obs)[source]¶
Get target actor actions for observations.
- Parameters:
obs – (np.ndarray) observations of target actor, shape is (batch_size, dim).
- Returns:
actions – (torch.Tensor) actions taken by target actor, shape is (batch_size, 1).
- train_values(obs, actions)[source]¶
Get values with grad for obs and actions.
- Parameters:
obs – (np.ndarray) observations batch, shape is (batch_size, dim).
actions – (torch.Tensor) actions batch, shape is (batch_size, 1).
- Returns:
values – (torch.Tensor) values predicted by Q network, shape is (batch_size, 1).
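A minimal standalone sketch of the epsilon-greedy selection that get_actions(obs, epsilon_greedy) describes: greedy argmax over Q-values, with uniform random actions mixed in when exploration is enabled. The Q-values, epsilon value, and helper name are placeholders, not HARL internals.

```python
import torch

def epsilon_greedy_actions(q_values: torch.Tensor, epsilon_greedy: bool, epsilon: float = 0.1):
    # q_values: (batch_size, n_actions)
    greedy = q_values.argmax(dim=-1, keepdim=True)            # (batch_size, 1)
    if not epsilon_greedy:
        return greedy
    # with probability epsilon, replace the greedy action by a random one
    random_actions = torch.randint(0, q_values.shape[-1], greedy.shape)
    explore = torch.rand(greedy.shape) < epsilon
    return torch.where(explore, random_actions, greedy)

q_values = torch.randn(8, 5)   # e.g. n_threads=8, 5 discrete actions
actions = epsilon_greedy_actions(q_values, epsilon_greedy=True)
print(actions.shape)           # torch.Size([8, 1])
```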
harl.algorithms.actors.haddpg module¶
HADDPG algorithm.
- class harl.algorithms.actors.haddpg.HADDPG(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OffPolicyBase
- get_actions(obs, add_noise)[source]¶
Get actions for observations.
- Parameters:
obs – (np.ndarray) observations of actor, shape is (n_threads, dim) or (batch_size, dim).
add_noise – (bool) whether to add noise.
- Returns:
actions – (torch.Tensor) actions taken by this actor, shape is (n_threads, dim) or (batch_size, dim).
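A minimal sketch of the add_noise behaviour get_actions describes for a deterministic-policy actor: the actor output is optionally perturbed with Gaussian exploration noise and clipped to the action bounds. The network, noise scale, and bounds are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Placeholder deterministic actor with tanh-bounded continuous actions.
actor = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())

def get_actions(obs: torch.Tensor, add_noise: bool, noise_std: float = 0.1):
    actions = actor(obs)                                   # (batch, act_dim) in [-1, 1]
    if add_noise:
        actions = actions + noise_std * torch.randn_like(actions)
    return actions.clamp(-1.0, 1.0)

obs = torch.randn(4, 6)                                    # n_threads=4
print(get_actions(obs, add_noise=True).shape)              # torch.Size([4, 2])
```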
harl.algorithms.actors.happo module¶
HAPPO algorithm.
- class harl.algorithms.actors.happo.HAPPO(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OnPolicyBase
- train(actor_buffer, advantages, state_type)[source]¶
Perform a training update using minibatch GD.
- Parameters:
actor_buffer – (OnPolicyActorBuffer) buffer containing training data related to actor.
advantages – (np.ndarray) advantages.
state_type – (str) type of state.
- Returns:
train_info – (dict) contains information regarding training update (e.g. loss, grad norms, etc.).
- update(sample)[source]¶
Update actor network.
- Parameters:
sample – (Tuple) contains data batch with which to update networks.
- Returns:
policy_loss – (torch.Tensor) actor (policy) loss value.
dist_entropy – (torch.Tensor) action entropies.
actor_grad_norm – (torch.Tensor) gradient norm from actor update.
imp_weights – (torch.Tensor) importance sampling weights.
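For orientation, the snippet below computes a generic clipped-surrogate loss with the quantities HAPPO's update() reports (imp_weights, policy_loss). It is a sketch of the underlying PPO-style objective only, not HARL's implementation, which additionally performs the heterogeneous-agent sequential update; the tensors and clip parameter are placeholders.

```python
import torch

clip_param = 0.2
old_log_probs = torch.randn(64)                          # log-probs under the old policy
new_log_probs = old_log_probs + 0.05 * torch.randn(64)   # log-probs under the current policy
advantages = torch.randn(64)

imp_weights = torch.exp(new_log_probs - old_log_probs)   # importance sampling ratio
surr1 = imp_weights * advantages
surr2 = torch.clamp(imp_weights, 1 - clip_param, 1 + clip_param) * advantages
policy_loss = -torch.min(surr1, surr2).mean()            # clipped surrogate objective
```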
harl.algorithms.actors.hasac module¶
HASAC algorithm.
- class harl.algorithms.actors.hasac.HASAC(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OffPolicyBase
- get_actions(obs, available_actions=None, stochastic=True)[source]¶
Get actions for observations.
- Parameters:
obs – (np.ndarray) observations of actor, shape is (n_threads, dim) or (batch_size, dim)
available_actions – (np.ndarray) denotes which actions are available to agent (if None, all actions available)
stochastic – (bool) stochastic actions or deterministic actions
- Returns:
actions – (torch.Tensor) actions taken by this actor, shape is (n_threads, dim) or (batch_size, dim).
- get_actions_with_logprobs(obs, available_actions=None, stochastic=True)[source]¶
Get actions and logprobs of actions for observations.
- Parameters:
obs – (np.ndarray) observations of actor, shape is (batch_size, dim)
available_actions – (np.ndarray) denotes which actions are available to agent (if None, all actions available)
stochastic – (bool) stochastic actions or deterministic actions
- Returns:
actions – (torch.Tensor) actions taken by this actor, shape is (batch_size, dim).
logp_actions – (torch.Tensor) log probabilities of actions taken by this actor, shape is (batch_size, 1).
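The sketch below shows the usual SAC-style contract behind get_actions_with_logprobs: a reparameterized, tanh-squashed Gaussian action together with its log probability (including the change-of-variables correction). Network sizes, dimensions, and the helper name are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Placeholder Gaussian policy head: mean network plus a learned log-std.
mean_net = nn.Linear(8, 2)
log_std = nn.Parameter(torch.zeros(2))

def get_actions_with_logprobs(obs: torch.Tensor):
    mean = mean_net(obs)
    dist = torch.distributions.Normal(mean, log_std.exp())
    pre_tanh = dist.rsample()                       # reparameterized sample
    actions = torch.tanh(pre_tanh)                  # squash to [-1, 1]
    # change-of-variables correction for the tanh squashing
    logp = dist.log_prob(pre_tanh) - torch.log(1 - actions.pow(2) + 1e-6)
    return actions, logp.sum(dim=-1, keepdim=True)  # (batch, dim), (batch, 1)

obs = torch.randn(16, 8)
actions, logp_actions = get_actions_with_logprobs(obs)
print(actions.shape, logp_actions.shape)            # (16, 2) (16, 1)
```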
harl.algorithms.actors.hatd3 module¶
HATD3 algorithm.
harl.algorithms.actors.hatrpo module¶
HATRPO algorithm.
- class harl.algorithms.actors.hatrpo.HATRPO(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OnPolicyBase
- train(actor_buffer, advantages, state_type)[source]¶
Perform a training update using minibatch GD.
- Parameters:
actor_buffer – (OnPolicyActorBuffer) buffer containing training data related to actor.
advantages – (np.ndarray) advantages.
state_type – (str) type of state.
- Returns:
train_info – (dict) contains information regarding training update (e.g. loss, grad norms, etc.).
- update(sample)[source]¶
Update actor networks.
- Parameters:
sample – (Tuple) contains data batch with which to update networks.
- Returns:
kl – (torch.Tensor) KL divergence between old and new policy.
loss_improve – (np.float32) loss improvement.
expected_improve – (np.ndarray) expected loss improvement.
dist_entropy – (torch.Tensor) action entropies.
ratio – (torch.Tensor) ratio between new and old policy.
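A standalone sketch of the quantities HATRPO's update() reports: the KL divergence between old and new policies, the probability ratio, and the entropy of the new policy. The logits here are random placeholders; HARL's actual update additionally performs the trust-region step and line search.

```python
import torch

old_logits = torch.randn(64, 5)                      # old policy, 5 discrete actions
new_logits = old_logits + 0.05 * torch.randn(64, 5)  # candidate new policy
actions = torch.randint(0, 5, (64,))

old_dist = torch.distributions.Categorical(logits=old_logits)
new_dist = torch.distributions.Categorical(logits=new_logits)

kl = torch.distributions.kl_divergence(old_dist, new_dist).mean()
ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
dist_entropy = new_dist.entropy().mean()
```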
harl.algorithms.actors.maddpg module¶
MADDPG algorithm.
harl.algorithms.actors.mappo module¶
MAPPO algorithm.
- class harl.algorithms.actors.mappo.MAPPO(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OnPolicyBase
- share_param_train(actor_buffer, advantages, num_agents, state_type)[source]¶
Perform a training update for parameter-sharing MAPPO using minibatch GD.
- Parameters:
actor_buffer – (list[OnPolicyActorBuffer]) buffers containing training data related to each agent's actor.
advantages – (np.ndarray) advantages.
num_agents – (int) number of agents.
state_type – (str) type of state.
- Returns:
train_info – (dict) contains information regarding training update (e.g. loss, grad norms, etc.).
- train(actor_buffer, advantages, state_type)[source]¶
Perform a training update for non-parameter-sharing MAPPO using minibatch GD.
- Parameters:
actor_buffer – (OnPolicyActorBuffer) buffer containing training data related to actor.
advantages – (np.ndarray) advantages.
state_type – (str) type of state.
- Returns:
train_info – (dict) contains information regarding training update (e.g. loss, grad norms, etc.).
- update(sample)[source]¶
Update actor network.
- Parameters:
sample – (Tuple) contains data batch with which to update networks.
- Returns:
policy_loss – (torch.Tensor) actor (policy) loss value.
dist_entropy – (torch.Tensor) action entropies.
actor_grad_norm – (torch.Tensor) gradient norm from actor update.
imp_weights – (torch.Tensor) importance sampling weights.
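The snippet below contrasts, with plain tensors, the two training modes documented above: non-parameter-sharing trains one actor per agent on that agent's buffer, while parameter-sharing trains a single shared actor on the pooled data of all agents. The per-agent batches are random placeholders.

```python
import torch

num_agents = 3
per_agent_batches = [torch.randn(128, 10) for _ in range(num_agents)]  # one buffer per agent

# Non-parameter-sharing: each agent keeps its own actor, updated only on
# that agent's own batch.
for agent_id, batch in enumerate(per_agent_batches):
    pass  # update the agent_id-th actor on `batch`

# Parameter-sharing: a single shared actor is updated on the pooled data
# from all agents.
shared_batch = torch.cat(per_agent_batches, dim=0)
print(shared_batch.shape)  # torch.Size([384, 10]), i.e. (num_agents * 128, 10)
```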
harl.algorithms.actors.matd3 module¶
MATD3 algorithm.
harl.algorithms.actors.off_policy_base module¶
Base class for off-policy algorithms.
- class harl.algorithms.actors.off_policy_base.OffPolicyBase(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
object
harl.algorithms.actors.on_policy_base module¶
Base class for on-policy algorithms.
- class harl.algorithms.actors.on_policy_base.OnPolicyBase(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
object
- act(obs, rnn_states_actor, masks, available_actions=None, deterministic=False)[source]¶
Compute actions using the given inputs.
- Parameters:
obs – (np.ndarray) local agent inputs to the actor.
rnn_states_actor – (np.ndarray) if actor has RNN layer, RNN states for actor.
masks – (np.ndarray) denotes points at which RNN states should be reset.
available_actions – (np.ndarray) denotes which actions are available to agent (if None, all actions available)
deterministic – (bool) whether the action should be mode of distribution or should be sampled.
- evaluate_actions(obs, rnn_states_actor, action, masks, available_actions=None, active_masks=None)[source]¶
Get action logprobs, entropy, and distributions for actor update.
- Parameters:
obs – (np.ndarray / torch.Tensor) local agent inputs to the actor.
rnn_states_actor – (np.ndarray / torch.Tensor) if actor has RNN layer, RNN states for actor.
action – (np.ndarray / torch.Tensor) actions whose log probabilities and entropy to compute.
masks – (np.ndarray / torch.Tensor) denotes points at which RNN states should be reset.
available_actions – (np.ndarray / torch.Tensor) denotes which actions are available to agent (if None, all actions available)
active_masks – (np.ndarray / torch.Tensor) denotes whether an agent is active or dead.
- get_actions(obs, rnn_states_actor, masks, available_actions=None, deterministic=False)[source]¶
Compute actions for the given inputs.
- Parameters:
obs – (np.ndarray) local agent inputs to the actor.
rnn_states_actor – (np.ndarray) if actor has RNN layer, RNN states for actor.
masks – (np.ndarray) denotes points at which RNN states should be reset.
available_actions – (np.ndarray) denotes which actions are available to agent (if None, all actions available)
deterministic – (bool) whether the action should be mode of distribution or should be sampled.
- lr_decay(episode, episodes)[source]¶
Decay the learning rates.
- Parameters:
episode – (int) current training episode.
episodes – (int) total number of training episodes.
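As an example of the lr_decay(episode, episodes) contract, the sketch below applies a common linear annealing schedule from the initial learning rate to zero over training. The optimizer, initial rate, and exact schedule are assumptions and need not match HARL's rule.

```python
import torch

initial_lr = 5e-4
params = [torch.nn.Parameter(torch.zeros(1))]          # placeholder parameters
optimizer = torch.optim.Adam(params, lr=initial_lr)

def lr_decay(episode: int, episodes: int):
    # linearly anneal the learning rate from initial_lr to 0
    lr = initial_lr - initial_lr * (episode / float(episodes))
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr

lr_decay(episode=500, episodes=1000)
print(optimizer.param_groups[0]["lr"])                 # 0.00025
```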
Module contents¶
Algorithm registry.
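A minimal sketch of the registry pattern this module docstring refers to: a mapping from algorithm name to actor class, so a runner can build the appropriate actor from a config string. The dict name, keys, and stub classes here are illustrative assumptions, not HARL's exact contents.

```python
# Stub classes standing in for the actor classes documented above.
class HAPPO: ...
class HATRPO: ...
class MAPPO: ...

# Hypothetical registry mapping algorithm names to actor classes.
ALGO_REGISTRY = {"happo": HAPPO, "hatrpo": HATRPO, "mappo": MAPPO}

def make_actor(algo_name, *args, **kwargs):
    # look up the actor class by name and instantiate it
    return ALGO_REGISTRY[algo_name](*args, **kwargs)

actor = make_actor("happo")
```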