harl.algorithms.actors package

Submodules

harl.algorithms.actors.haa2c module

HAA2C algorithm.

class harl.algorithms.actors.haa2c.HAA2C(args, obs_space, act_space, device=device(type='cpu'))[source]

Bases: OnPolicyBase

train(actor_buffer, advantages, state_type)[source]

Perform a training update using minibatch GD.

Parameters:
  • actor_buffer – (OnPolicyActorBuffer) buffer containing training data related to actor.

  • advantages – (np.ndarray) advantages.

  • state_type – (str) type of state.

Returns:

(dict) contains information regarding training update (e.g. loss, grad norms, etc).

Return type:

train_info

update(sample)[source]

Update actor network.

Parameters:

sample – (Tuple) contains data batch with which to update networks.

Returns:
  • policy_loss – (torch.Tensor) actor (policy) loss value.

  • dist_entropy – (torch.Tensor) action entropies.

  • actor_grad_norm – (torch.Tensor) gradient norm from actor update.

  • imp_weights – (torch.Tensor) importance sampling weights.
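
The on-policy actors on this page (HAA2C, HAPPO, HATRPO, MAPPO) share this train/update calling convention. A minimal sketch of driving one HAA2C update follows; the contents of args, the buffer-filling step, and the "EP" state-type value are illustrative assumptions rather than facts documented here.

    # Hedged sketch: one HAA2C training update, using only the signatures above.
    from harl.algorithms.actors.haa2c import HAA2C

    def run_actor_update(args, obs_space, act_space, actor_buffer, advantages):
        actor = HAA2C(args, obs_space, act_space)   # device defaults to CPU
        actor.prep_training()                       # put actor networks in train mode
        # state_type value is an assumption; actor_buffer is assumed to be filled elsewhere.
        train_info = actor.train(actor_buffer, advantages, state_type="EP")
        return train_info                           # dict of diagnostics (losses, grad norms, ...)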

harl.algorithms.actors.had3qn module

HAD3QN algorithm.

class harl.algorithms.actors.had3qn.HAD3QN(args, obs_space, act_space, device=device(type='cpu'))[source]

Bases: OffPolicyBase

get_actions(obs, epsilon_greedy)[source]

Get actions for observations.

Parameters:
  • obs – (np.ndarray) observations of actor, shape is (n_threads, dim) or (batch_size, dim)

  • epsilon_greedy – (bool) whether to choose actions epsilon-greedily

Returns:

(torch.Tensor) actions taken by this actor, shape is (n_threads, 1) or (batch_size, 1)

Return type:

actions

get_target_actions(obs)[source]

Get target actor actions for observations.

Parameters:

obs – (np.ndarray) observations of target actor, shape is (batch_size, dim)

Returns:

(torch.Tensor) actions taken by target actor, shape is (batch_size, 1)

Return type:

actions

train_values(obs, actions)[source]

Get values, with gradients enabled, for observation and action batches.

Parameters:
  • obs – (np.ndarray) observations batch, shape is (batch_size, dim)

  • actions – (torch.Tensor) actions batch, shape is (batch_size, 1)

Returns:

(torch.Tensor) values predicted by Q network, shape is (batch_size, 1)

Return type:

values
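
A minimal sketch of epsilon-greedy rollout with HAD3QN, assuming obs is already batched as (n_threads, dim); the NumPy conversion at the end is illustrative.

    # Hedged sketch: epsilon-greedy action selection during data collection.
    import numpy as np
    from harl.algorithms.actors.had3qn import HAD3QN

    def select_actions(actor: HAD3QN, obs: np.ndarray, explore: bool = True) -> np.ndarray:
        # obs: (n_threads, dim) during rollout, (batch_size, dim) during training.
        actions = actor.get_actions(obs, epsilon_greedy=explore)  # torch.Tensor, (n_threads, 1)
        return actions.detach().cpu().numpy()                     # discrete action indices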

harl.algorithms.actors.haddpg module

HADDPG algorithm.

class harl.algorithms.actors.haddpg.HADDPG(args, obs_space, act_space, device=device(type='cpu'))[source]

Bases: OffPolicyBase

get_actions(obs, add_noise)[source]

Get actions for observations.

Parameters:
  • obs – (np.ndarray) observations of actor, shape is (n_threads, dim) or (batch_size, dim)

  • add_noise – (bool) whether to add noise

Returns:

(torch.Tensor) actions taken by this actor, shape is (n_threads, dim) or (batch_size, dim)

Return type:

actions

get_target_actions(obs)[source]

Get target actor actions for observations.

Parameters:

obs – (np.ndarray) observations of target actor, shape is (batch_size, dim)

Returns:

(torch.Tensor) actions taken by target actor, shape is (batch_size, dim)

Return type:

actions
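
A sketch of drawing exploration versus evaluation actions from HADDPG; toggling noise off for evaluation is an assumption about typical usage, not something documented here.

    # Hedged sketch: noisy actions for data collection, deterministic ones for evaluation.
    import numpy as np
    from harl.algorithms.actors.haddpg import HADDPG

    def rollout_actions(actor: HADDPG, obs: np.ndarray, evaluate: bool = False) -> np.ndarray:
        actions = actor.get_actions(obs, add_noise=not evaluate)  # (n_threads, dim) continuous actions
        return actions.detach().cpu().numpy()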

harl.algorithms.actors.happo module

HAPPO algorithm.

class harl.algorithms.actors.happo.HAPPO(args, obs_space, act_space, device=device(type='cpu'))[source]

Bases: OnPolicyBase

train(actor_buffer, advantages, state_type)[source]

Perform a training update using minibatch GD.

Parameters:
  • actor_buffer – (OnPolicyActorBuffer) buffer containing training data related to actor.

  • advantages – (np.ndarray) advantages.

  • state_type – (str) type of state.

Returns:

(dict) contains information regarding training update (e.g. loss, grad norms, etc).

Return type:

train_info

update(sample)[source]

Update actor network.

Parameters:

sample – (Tuple) contains data batch with which to update networks.

Returns:
  • policy_loss – (torch.Tensor) actor (policy) loss value.

  • dist_entropy – (torch.Tensor) action entropies.

  • actor_grad_norm – (torch.Tensor) gradient norm from actor update.

  • imp_weights – (torch.Tensor) importance sampling weights.
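
HAPPO typically keeps one actor per agent, so a runner loops over the actors and calls train for each in turn. The sketch below only uses the documented signature; how advantages are adjusted between successive agent updates is left to the runner and is not shown.

    # Hedged sketch: updating a list of per-agent HAPPO actors one after another.
    def train_all_agents(actors, actor_buffers, advantages, state_type):
        # actors: list of HAPPO instances; actor_buffers: list of OnPolicyActorBuffer.
        infos = []
        for agent_id, actor in enumerate(actors):
            actor.prep_training()
            infos.append(actor.train(actor_buffers[agent_id], advantages, state_type))
        return infos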

harl.algorithms.actors.hasac module

HASAC algorithm.

class harl.algorithms.actors.hasac.HASAC(args, obs_space, act_space, device=device(type='cpu'))[source]

Bases: OffPolicyBase

get_actions(obs, available_actions=None, stochastic=True)[source]

Get actions for observations.

Parameters:
  • obs – (np.ndarray) observations of actor, shape is (n_threads, dim) or (batch_size, dim)

  • available_actions – (np.ndarray) denotes which actions are available to agent (if None, all actions available)

  • stochastic – (bool) stochastic actions or deterministic actions

Returns:

(torch.Tensor) actions taken by this actor, shape is (n_threads, dim) or (batch_size, dim)

Return type:

actions

get_actions_with_logprobs(obs, available_actions=None, stochastic=True)[source]

Get actions and logprobs of actions for observations.

Parameters:
  • obs – (np.ndarray) observations of actor, shape is (batch_size, dim)

  • available_actions – (np.ndarray) denotes which actions are available to agent (if None, all actions available)

  • stochastic – (bool) stochastic actions or deterministic actions

Returns:
  • actions – (torch.Tensor) actions taken by this actor, shape is (batch_size, dim)

  • logp_actions – (torch.Tensor) log probabilities of actions taken by this actor, shape is (batch_size, 1)

restore(model_dir, id)[source]

Restore the actor.

save(save_dir, id)[source]

Save the actor.
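
A sketch of sampling actions together with their log-probabilities and of checkpointing the actor. That get_actions_with_logprobs returns the two values as a tuple, and the directory/agent-id arguments shown, are assumptions.

    # Hedged sketch: stochastic actions with log-probs, plus save/restore.
    import numpy as np
    from harl.algorithms.actors.hasac import HASAC

    def sample_and_checkpoint(actor: HASAC, obs: np.ndarray, save_dir: str, agent_id: int):
        # obs: (batch_size, dim); available_actions=None means all actions are allowed.
        actions, logp_actions = actor.get_actions_with_logprobs(obs, stochastic=True)
        actor.save(save_dir, agent_id)       # write this agent's actor weights
        actor.restore(save_dir, agent_id)    # reload them, e.g. for evaluation
        return actions, logp_actions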

harl.algorithms.actors.hatd3 module

HATD3 algorithm.

class harl.algorithms.actors.hatd3.HATD3(args, obs_space, act_space, device=device(type='cpu'))[source]

Bases: HADDPG

get_target_actions(obs)[source]

Get target actor actions for observations.

Parameters:

obs – (np.ndarray) observations of target actor, shape is (batch_size, dim)

Returns:

(torch.Tensor) actions taken by target actor, shape is (batch_size, dim)

Return type:

actions
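
get_target_actions is typically used when forming the bootstrapped critic target. The sketch below uses a hypothetical target_q_fn placeholder for the critic, which is not part of this page.

    # Hedged sketch: HATD3 target actions inside a TD-target computation.
    import numpy as np
    import torch
    from harl.algorithms.actors.hatd3 import HATD3

    def td_target(actor: HATD3, target_q_fn, next_obs: np.ndarray,
                  rewards: torch.Tensor, dones: torch.Tensor, gamma: float = 0.99):
        next_actions = actor.get_target_actions(next_obs)    # (batch_size, dim)
        next_q = target_q_fn(next_obs, next_actions)          # hypothetical critic call
        return rewards + gamma * (1.0 - dones) * next_q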

harl.algorithms.actors.hatrpo module

HATRPO algorithm.

class harl.algorithms.actors.hatrpo.HATRPO(args, obs_space, act_space, device=device(type='cpu'))[source]

Bases: OnPolicyBase

train(actor_buffer, advantages, state_type)[source]

Perform a training update using minibatch GD.

Parameters:
  • actor_buffer – (OnPolicyActorBuffer) buffer containing training data related to actor.

  • advantages – (np.ndarray) advantages.

  • state_type – (str) type of state.

Returns:

(dict) contains information regarding training update (e.g. loss, grad norms, etc).

Return type:

train_info

update(sample)[source]

Update actor networks.

Parameters:

sample – (Tuple) contains data batch with which to update networks.

Returns:
  • kl – (torch.Tensor) KL divergence between old and new policy.

  • loss_improve – (np.float32) loss improvement.

  • expected_improve – (np.ndarray) expected loss improvement.

  • dist_entropy – (torch.Tensor) action entropies.

  • ratio – (torch.Tensor) ratio between new and old policy.

harl.algorithms.actors.maddpg module

MADDPG algorithm.

class harl.algorithms.actors.maddpg.MADDPG(args, obs_space, act_space, device=device(type='cpu'))[source]

Bases: HADDPG

harl.algorithms.actors.mappo module

MAPPO algorithm.

class harl.algorithms.actors.mappo.MAPPO(args, obs_space, act_space, device=device(type='cpu'))[source]

Bases: OnPolicyBase

share_param_train(actor_buffer, advantages, num_agents, state_type)[source]

Perform a training update for parameter-sharing MAPPO using minibatch GD.

Parameters:
  • actor_buffer – (list[OnPolicyActorBuffer]) buffers containing training data related to each agent's actor.

  • advantages – (np.ndarray) advantages.

  • num_agents – (int) number of agents.

  • state_type – (str) type of state.

Returns:

(dict) contains information regarding training update (e.g. loss, grad norms, etc).

Return type:

train_info

train(actor_buffer, advantages, state_type)[source]

Perform a training update for non-parameter-sharing MAPPO using minibatch GD.

Parameters:
  • actor_buffer – (OnPolicyActorBuffer) buffer containing training data related to actor.

  • advantages – (np.ndarray) advantages.

  • state_type – (str) type of state.

Returns:

(dict) contains information regarding training update (e.g. loss, grad norms, etc).

Return type:

train_info

update(sample)[source]

Update actor network.

Parameters:

sample – (Tuple) contains data batch with which to update networks.

Returns:
  • policy_loss – (torch.Tensor) actor (policy) loss value.

  • dist_entropy – (torch.Tensor) action entropies.

  • actor_grad_norm – (torch.Tensor) gradient norm from actor update.

  • imp_weights – (torch.Tensor) importance sampling weights.
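
A sketch contrasting MAPPO's two entry points: share_param_train for a single actor shared by all agents, and train for a separately parameterized actor. The share_param flag is illustrative, not part of the documented API.

    # Hedged sketch: choosing between MAPPO's shared and non-shared training paths.
    from harl.algorithms.actors.mappo import MAPPO

    def mappo_update(actor: MAPPO, actor_buffers, advantages, num_agents, state_type,
                     share_param: bool):
        actor.prep_training()
        if share_param:
            # One shared actor trained on the list of per-agent buffers.
            return actor.share_param_train(actor_buffers, advantages, num_agents, state_type)
        # Separate parameters: train this actor on its own agent's buffer.
        return actor.train(actor_buffers[0], advantages, state_type)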

harl.algorithms.actors.matd3 module

MATD3 algorithm.

class harl.algorithms.actors.matd3.MATD3(args, obs_space, act_space, device=device(type='cpu'))[source]

Bases: HATD3

harl.algorithms.actors.off_policy_base module

Base class for off-policy algorithms.

class harl.algorithms.actors.off_policy_base.OffPolicyBase(args, obs_space, act_space, device=device(type='cpu'))[source]

Bases: object

get_actions(obs, randomness)[source]

get_target_actions(obs)[source]

lr_decay(step, steps)[source]

Decay the actor and critic learning rates.

Parameters:
  • step – (int) current training step.

  • steps – (int) total number of training steps.

restore(model_dir, id)[source]

Restore the actor and target actor.

save(save_dir, id)[source]

Save the actor and target actor.

soft_update()[source]

Soft update target actor.

turn_off_grad()[source]

Turn off grad for actor parameters.

turn_on_grad()[source]

Turn on grad for actor parameters.
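
A sketch of the bookkeeping an off-policy runner might perform around a single actor update: enable gradients only for the actor being updated, soft-update its target network afterwards, and decay learning rates over training. The ordering and the freeze-the-others step are assumptions about typical usage; the loss computation itself (which goes through the critic) is omitted.

    # Hedged sketch: per-update bookkeeping for one off-policy actor.
    def off_policy_actor_step(actor, other_actors, step: int, total_steps: int):
        actor.turn_on_grad()                # this actor's parameters receive gradients
        for other in other_actors:
            other.turn_off_grad()           # freeze the rest while this one updates
        # ... compute the actor loss via the critic and take an optimizer step here ...
        actor.soft_update()                 # Polyak-average the target actor
        actor.lr_decay(step, total_steps)   # anneal learning rates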

harl.algorithms.actors.on_policy_base module

Base class for on-policy algorithms.

class harl.algorithms.actors.on_policy_base.OnPolicyBase(args, obs_space, act_space, device=device(type='cpu'))[source]

Bases: object

act(obs, rnn_states_actor, masks, available_actions=None, deterministic=False)[source]

Compute actions using the given inputs.

Parameters:
  • obs – (np.ndarray) local agent inputs to the actor.

  • rnn_states_actor – (np.ndarray) if actor is RNN, RNN states for actor.

  • masks – (np.ndarray) denotes points at which RNN states should be reset.

  • available_actions – (np.ndarray) denotes which actions are available to agent (if None, all actions available)

  • deterministic – (bool) whether the action should be mode of distribution or should be sampled.

evaluate_actions(obs, rnn_states_actor, action, masks, available_actions=None, active_masks=None)[source]

Get action logprobs, entropy, and distributions for actor update.

Parameters:
  • obs – (np.ndarray / torch.Tensor) local agent inputs to the actor.

  • rnn_states_actor – (np.ndarray / torch.Tensor) if actor has RNN layer, RNN states for actor.

  • action – (np.ndarray / torch.Tensor) actions whose log probabilities and entropy to compute.

  • masks – (np.ndarray / torch.Tensor) denotes points at which RNN states should be reset.

  • available_actions – (np.ndarray / torch.Tensor) denotes which actions are available to agent (if None, all actions available)

  • active_masks – (np.ndarray / torch.Tensor) denotes whether an agent is active or dead.

get_actions(obs, rnn_states_actor, masks, available_actions=None, deterministic=False)[source]

Compute actions for the given inputs.

Parameters:
  • obs – (np.ndarray) local agent inputs to the actor.

  • rnn_states_actor – (np.ndarray) if actor has RNN layer, RNN states for actor.

  • masks – (np.ndarray) denotes points at which RNN states should be reset.

  • available_actions – (np.ndarray) denotes which actions are available to agent (if None, all actions available)

  • deterministic – (bool) whether the action should be mode of distribution or should be sampled.

lr_decay(episode, episodes)[source]

Decay the learning rates.

Parameters:
  • episode – (int) current training episode.

  • episodes – (int) total number of training episodes.

prep_rollout()[source]

Prepare for rollout.

prep_training()[source]

Prepare for training.

train(actor_buffer, advantages, state_type)[source]

Perform a training update using minibatch GD.

Parameters:
  • actor_buffer – (OnPolicyActorBuffer) buffer containing training data related to actor.

  • advantages – (np.ndarray) advantages.

  • state_type – (str) type of state.

update(sample)[source]

Update actor network.

Parameters:

sample – (Tuple) contains data batch with which to update networks.
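
A sketch of the rollout/training alternation these methods imply. The RNN-state and mask shapes, and the return values of get_actions, are assumptions; they are not documented on this page.

    # Hedged sketch: alternating rollout and training with an on-policy actor.
    import numpy as np

    def rollout_then_train(actor, obs, actor_buffer, advantages, state_type,
                           episode, episodes, recurrent_n=1, hidden_size=64):
        n_threads = obs.shape[0]
        rnn_states = np.zeros((n_threads, recurrent_n, hidden_size), dtype=np.float32)  # assumed shape
        masks = np.ones((n_threads, 1), dtype=np.float32)    # 0 where an episode just ended

        actor.prep_rollout()                                  # evaluation mode for data collection
        outputs = actor.get_actions(obs, rnn_states, masks)   # return values not documented above
        # ... step the environment and fill actor_buffer with the collected transitions ...

        actor.prep_training()
        actor.lr_decay(episode, episodes)
        return actor.train(actor_buffer, advantages, state_type)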

Module contents

Algorithm registry.
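
The package __init__ exposes a registry mapping algorithm names to actor classes. A hedged lookup sketch follows; the attribute name ALGO_REGISTRY and the key strings are assumptions, since only the registry's existence is stated here.

    # Hedged sketch: looking up an actor class by algorithm name.
    from harl.algorithms import actors

    def make_actor(algo_name, args, obs_space, act_space):
        registry = getattr(actors, "ALGO_REGISTRY", {})  # hypothetical attribute name
        actor_cls = registry[algo_name]                  # e.g. "happo" -> HAPPO (assumed key)
        return actor_cls(args, obs_space, act_space)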