harl.algorithms.actors package¶
Submodules¶
harl.algorithms.actors.haa2c module¶
HAA2C algorithm.
- class harl.algorithms.actors.haa2c.HAA2C(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OnPolicyBase
- train(actor_buffer, advantages, state_type)[source]¶
Perform a training update using minibatch GD.
- Parameters:
actor_buffer – (OnPolicyActorBuffer) buffer containing training data related to actor.
advantages – (np.ndarray) advantages.
state_type – (str) type of state.
- Returns:
train_info – (dict) contains information regarding training update (e.g. loss, grad norms, etc.).
- update(sample)[source]¶
Update actor network.
- Parameters:
sample – (Tuple) contains data batch with which to update networks.
- Returns:
policy_loss – (torch.Tensor) actor (policy) loss value.
dist_entropy – (torch.Tensor) action entropies.
actor_grad_norm – (torch.Tensor) gradient norm from actor update.
imp_weights – (torch.Tensor) importance sampling weights.
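The sketch below illustrates, with plain PyTorch, the kind of actor update whose outputs train() and update() report (policy_loss, dist_entropy, actor_grad_norm). The tiny network, batch shapes, and coefficients are illustrative assumptions, not HARL's actual modules or configuration.

```python
import torch
import torch.nn as nn

# Generic A2C-style actor update mirroring the quantities the HAA2C
# docstrings name; the policy network and hyperparameters are placeholders.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=5e-4)

obs = torch.randn(32, 4)             # (batch_size, obs_dim)
actions = torch.randint(0, 2, (32,)) # discrete actions taken in the batch
advantages = torch.randn(32)         # precomputed advantages, as passed to train()

dist = torch.distributions.Categorical(logits=policy(obs))
log_probs = dist.log_prob(actions)
dist_entropy = dist.entropy().mean()

# policy-gradient loss with a small entropy bonus
policy_loss = -(log_probs * advantages).mean() - 0.01 * dist_entropy

optimizer.zero_grad()
policy_loss.backward()
actor_grad_norm = nn.utils.clip_grad_norm_(policy.parameters(), max_norm=10.0)
optimizer.step()

train_info = {"policy_loss": policy_loss.item(),
              "dist_entropy": dist_entropy.item(),
              "actor_grad_norm": actor_grad_norm.item()}
```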
harl.algorithms.actors.had3qn module¶
HAD3QN algorithm.
- class harl.algorithms.actors.had3qn.HAD3QN(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OffPolicyBase
- get_actions(obs, epsilon_greedy)[source]¶
Get actions for observations.
- Parameters:
obs – (np.ndarray) observations of actor, shape is (n_threads, dim) or (batch_size, dim).
epsilon_greedy – (bool) whether to choose actions epsilon-greedily.
- Returns:
actions – (torch.Tensor) actions taken by this actor, shape is (n_threads, 1) or (batch_size, 1).
- get_target_actions(obs)[source]¶
Get target actor actions for observations.
- Parameters:
obs – (np.ndarray) observations of target actor, shape is (batch_size, dim).
- Returns:
actions – (torch.Tensor) actions taken by target actor, shape is (batch_size, 1).
- train_values(obs, actions)[source]¶
Get values with grad for obs and actions.
- Parameters:
obs – (np.ndarray) observations batch, shape is (batch_size, dim).
actions – (torch.Tensor) actions batch, shape is (batch_size, 1).
- Returns:
values – (torch.Tensor) values predicted by Q network, shape is (batch_size, 1).
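A minimal standalone sketch of the epsilon-greedy selection that get_actions(obs, epsilon_greedy) describes: greedy argmax over Q-values, with uniform random actions mixed in when exploration is enabled. The Q-values, epsilon value, and helper name are placeholders, not HARL internals.

```python
import torch

def epsilon_greedy_actions(q_values: torch.Tensor, epsilon_greedy: bool, epsilon: float = 0.1):
    # q_values: (batch_size, n_actions)
    greedy = q_values.argmax(dim=-1, keepdim=True)            # (batch_size, 1)
    if not epsilon_greedy:
        return greedy
    # with probability epsilon, replace the greedy action by a random one
    random_actions = torch.randint(0, q_values.shape[-1], greedy.shape)
    explore = torch.rand(greedy.shape) < epsilon
    return torch.where(explore, random_actions, greedy)

q_values = torch.randn(8, 5)   # e.g. n_threads=8, 5 discrete actions
actions = epsilon_greedy_actions(q_values, epsilon_greedy=True)
print(actions.shape)           # torch.Size([8, 1])
```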
harl.algorithms.actors.haddpg module¶
HADDPG algorithm.
- class harl.algorithms.actors.haddpg.HADDPG(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OffPolicyBase
- get_actions(obs, add_noise)[source]¶
Get actions for observations.
- Parameters:
obs – (np.ndarray) observations of actor, shape is (n_threads, dim) or (batch_size, dim).
add_noise – (bool) whether to add noise.
- Returns:
actions – (torch.Tensor) actions taken by this actor, shape is (n_threads, dim) or (batch_size, dim).
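A minimal sketch of the add_noise behaviour get_actions describes for a deterministic-policy actor: the actor output is optionally perturbed with Gaussian exploration noise and clipped to the action bounds. The network, noise scale, and bounds are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Placeholder deterministic actor with tanh-bounded continuous actions.
actor = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())

def get_actions(obs: torch.Tensor, add_noise: bool, noise_std: float = 0.1):
    actions = actor(obs)                                   # (batch, act_dim) in [-1, 1]
    if add_noise:
        actions = actions + noise_std * torch.randn_like(actions)
    return actions.clamp(-1.0, 1.0)

obs = torch.randn(4, 6)                                    # n_threads=4
print(get_actions(obs, add_noise=True).shape)              # torch.Size([4, 2])
```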
harl.algorithms.actors.happo module¶
HAPPO algorithm.
- class harl.algorithms.actors.happo.HAPPO(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OnPolicyBase
- train(actor_buffer, advantages, state_type)[source]¶
Perform a training update using minibatch GD.
- Parameters:
actor_buffer – (OnPolicyActorBuffer) buffer containing training data related to actor.
advantages – (np.ndarray) advantages.
state_type – (str) type of state.
- Returns:
train_info – (dict) contains information regarding training update (e.g. loss, grad norms, etc.).
- update(sample)[source]¶
Update actor network.
- Parameters:
sample – (Tuple) contains data batch with which to update networks.
- Returns:
policy_loss – (torch.Tensor) actor (policy) loss value.
dist_entropy – (torch.Tensor) action entropies.
actor_grad_norm – (torch.Tensor) gradient norm from actor update.
imp_weights – (torch.Tensor) importance sampling weights.
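For orientation, the snippet below computes a generic clipped-surrogate loss with the quantities HAPPO's update() reports (imp_weights, policy_loss). It is a sketch of the underlying PPO-style objective only, not HARL's implementation, which additionally performs the heterogeneous-agent sequential update; the tensors and clip parameter are placeholders.

```python
import torch

clip_param = 0.2
old_log_probs = torch.randn(64)                          # log-probs under the old policy
new_log_probs = old_log_probs + 0.05 * torch.randn(64)   # log-probs under the current policy
advantages = torch.randn(64)

imp_weights = torch.exp(new_log_probs - old_log_probs)   # importance sampling ratio
surr1 = imp_weights * advantages
surr2 = torch.clamp(imp_weights, 1 - clip_param, 1 + clip_param) * advantages
policy_loss = -torch.min(surr1, surr2).mean()            # clipped surrogate objective
```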
harl.algorithms.actors.hasac module¶
HASAC algorithm.
- class harl.algorithms.actors.hasac.HASAC(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OffPolicyBase
- get_actions(obs, available_actions=None, stochastic=True)[source]¶
Get actions for observations.
- Parameters:
obs – (np.ndarray) observations of actor, shape is (n_threads, dim) or (batch_size, dim)
available_actions – (np.ndarray) denotes which actions are available to agent (if None, all actions available)
stochastic – (bool) stochastic actions or deterministic actions
- Returns:
actions – (torch.Tensor) actions taken by this actor, shape is (n_threads, dim) or (batch_size, dim).
- get_actions_with_logprobs(obs, available_actions=None, stochastic=True)[source]¶
Get actions and logprobs of actions for observations.
- Parameters:
obs – (np.ndarray) observations of actor, shape is (batch_size, dim)
available_actions – (np.ndarray) denotes which actions are available to agent (if None, all actions available)
stochastic – (bool) stochastic actions or deterministic actions
- Returns:
actions – (torch.Tensor) actions taken by this actor, shape is (batch_size, dim).
logp_actions – (torch.Tensor) log probabilities of actions taken by this actor, shape is (batch_size, 1).
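The sketch below shows the usual SAC-style contract behind get_actions_with_logprobs: a reparameterized, tanh-squashed Gaussian action together with its log probability (including the change-of-variables correction). Network sizes, dimensions, and the helper name are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Placeholder Gaussian policy head: mean network plus a learned log-std.
mean_net = nn.Linear(8, 2)
log_std = nn.Parameter(torch.zeros(2))

def get_actions_with_logprobs(obs: torch.Tensor):
    mean = mean_net(obs)
    dist = torch.distributions.Normal(mean, log_std.exp())
    pre_tanh = dist.rsample()                       # reparameterized sample
    actions = torch.tanh(pre_tanh)                  # squash to [-1, 1]
    # change-of-variables correction for the tanh squashing
    logp = dist.log_prob(pre_tanh) - torch.log(1 - actions.pow(2) + 1e-6)
    return actions, logp.sum(dim=-1, keepdim=True)  # (batch, dim), (batch, 1)

obs = torch.randn(16, 8)
actions, logp_actions = get_actions_with_logprobs(obs)
print(actions.shape, logp_actions.shape)            # (16, 2) (16, 1)
```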
harl.algorithms.actors.hatd3 module¶
HATD3 algorithm.
harl.algorithms.actors.hatrpo module¶
HATRPO algorithm.
- class harl.algorithms.actors.hatrpo.HATRPO(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OnPolicyBase
- train(actor_buffer, advantages, state_type)[source]¶
Perform a training update using minibatch GD.
- Parameters:
actor_buffer – (OnPolicyActorBuffer) buffer containing training data related to actor.
advantages – (np.ndarray) advantages.
state_type – (str) type of state.
- Returns:
train_info – (dict) contains information regarding training update (e.g. loss, grad norms, etc.).
- update(sample)[source]¶
Update actor networks.
- Parameters:
sample – (Tuple) contains data batch with which to update networks.
- Returns:
kl – (torch.Tensor) KL divergence between old and new policy.
loss_improve – (np.float32) loss improvement.
expected_improve – (np.ndarray) expected loss improvement.
dist_entropy – (torch.Tensor) action entropies.
ratio – (torch.Tensor) ratio between new and old policy.
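A standalone sketch of the quantities HATRPO's update() reports: the KL divergence between old and new policies, the probability ratio, and the entropy of the new policy. The logits here are random placeholders; HARL's actual update additionally performs the trust-region step and line search.

```python
import torch

old_logits = torch.randn(64, 5)                      # old policy, 5 discrete actions
new_logits = old_logits + 0.05 * torch.randn(64, 5)  # candidate new policy
actions = torch.randint(0, 5, (64,))

old_dist = torch.distributions.Categorical(logits=old_logits)
new_dist = torch.distributions.Categorical(logits=new_logits)

kl = torch.distributions.kl_divergence(old_dist, new_dist).mean()
ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
dist_entropy = new_dist.entropy().mean()
```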
harl.algorithms.actors.maddpg module¶
MADDPG algorithm.
harl.algorithms.actors.mappo module¶
MAPPO algorithm.
- class harl.algorithms.actors.mappo.MAPPO(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
OnPolicyBase
- share_param_train(actor_buffer, advantages, num_agents, state_type)[source]¶
Perform a training update for parameter-sharing MAPPO using minibatch GD.
- Parameters:
actor_buffer – (list[OnPolicyActorBuffer]) buffers containing training data related to each agent's actor.
advantages – (np.ndarray) advantages.
num_agents – (int) number of agents.
state_type – (str) type of state.
- Returns:
train_info – (dict) contains information regarding training update (e.g. loss, grad norms, etc.).
- train(actor_buffer, advantages, state_type)[source]¶
Perform a training update for non-parameter-sharing MAPPO using minibatch GD.
- Parameters:
actor_buffer – (OnPolicyActorBuffer) buffer containing training data related to actor.
advantages – (np.ndarray) advantages.
state_type – (str) type of state.
- Returns:
train_info – (dict) contains information regarding training update (e.g. loss, grad norms, etc.).
- update(sample)[source]¶
Update actor network.
- Parameters:
sample – (Tuple) contains data batch with which to update networks.
- Returns:
policy_loss – (torch.Tensor) actor (policy) loss value.
dist_entropy – (torch.Tensor) action entropies.
actor_grad_norm – (torch.Tensor) gradient norm from actor update.
imp_weights – (torch.Tensor) importance sampling weights.
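The snippet below contrasts, with plain tensors, the two training modes documented above: non-parameter-sharing trains one actor per agent on that agent's buffer, while parameter-sharing trains a single shared actor on the pooled data of all agents. The per-agent batches are random placeholders.

```python
import torch

num_agents = 3
per_agent_batches = [torch.randn(128, 10) for _ in range(num_agents)]  # one buffer per agent

# Non-parameter-sharing: each agent keeps its own actor, updated only on
# that agent's own batch.
for agent_id, batch in enumerate(per_agent_batches):
    pass  # update the agent_id-th actor on `batch`

# Parameter-sharing: a single shared actor is updated on the pooled data
# from all agents.
shared_batch = torch.cat(per_agent_batches, dim=0)
print(shared_batch.shape)  # torch.Size([384, 10]), i.e. (num_agents * 128, 10)
```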
harl.algorithms.actors.matd3 module¶
MATD3 algorithm.
harl.algorithms.actors.off_policy_base module¶
Base class for off-policy algorithms.
- class harl.algorithms.actors.off_policy_base.OffPolicyBase(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
object
harl.algorithms.actors.on_policy_base module¶
Base class for on-policy algorithms.
- class harl.algorithms.actors.on_policy_base.OnPolicyBase(args, obs_space, act_space, device=device(type='cpu'))[source]¶
Bases:
object
- act(obs, rnn_states_actor, masks, available_actions=None, deterministic=False)[source]¶
Compute actions using the given inputs.
- Parameters:
obs – (np.ndarray) local agent inputs to the actor.
rnn_states_actor – (np.ndarray) if actor has RNN layer, RNN states for actor.
masks – (np.ndarray) denotes points at which RNN states should be reset.
available_actions – (np.ndarray) denotes which actions are available to agent (if None, all actions available)
deterministic – (bool) whether the action should be mode of distribution or should be sampled.
- evaluate_actions(obs, rnn_states_actor, action, masks, available_actions=None, active_masks=None)[source]¶
Get action logprobs, entropy, and distributions for actor update.
- Parameters:
obs – (np.ndarray / torch.Tensor) local agent inputs to the actor.
rnn_states_actor – (np.ndarray / torch.Tensor) if actor has RNN layer, RNN states for actor.
action – (np.ndarray / torch.Tensor) actions whose log probabilities and entropy to compute.
masks – (np.ndarray / torch.Tensor) denotes points at which RNN states should be reset.
available_actions – (np.ndarray / torch.Tensor) denotes which actions are available to agent (if None, all actions available)
active_masks – (np.ndarray / torch.Tensor) denotes whether an agent is active or dead.
- get_actions(obs, rnn_states_actor, masks, available_actions=None, deterministic=False)[source]¶
Compute actions for the given inputs.
- Parameters:
obs – (np.ndarray) local agent inputs to the actor.
rnn_states_actor – (np.ndarray) if actor has RNN layer, RNN states for actor.
masks – (np.ndarray) denotes points at which RNN states should be reset.
available_actions – (np.ndarray) denotes which actions are available to agent (if None, all actions available)
deterministic – (bool) whether the action should be mode of distribution or should be sampled.
- lr_decay(episode, episodes)[source]¶
Decay the learning rates.
- Parameters:
episode – (int) current training episode.
episodes – (int) total number of training episodes.
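As an example of the lr_decay(episode, episodes) contract, the sketch below applies a common linear annealing schedule from the initial learning rate to zero over training. The optimizer, initial rate, and exact schedule are assumptions and need not match HARL's rule.

```python
import torch

initial_lr = 5e-4
params = [torch.nn.Parameter(torch.zeros(1))]          # placeholder parameters
optimizer = torch.optim.Adam(params, lr=initial_lr)

def lr_decay(episode: int, episodes: int):
    # linearly anneal the learning rate from initial_lr to 0
    lr = initial_lr - initial_lr * (episode / float(episodes))
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr

lr_decay(episode=500, episodes=1000)
print(optimizer.param_groups[0]["lr"])                 # 0.00025
```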
Module contents¶
Algorithm registry.
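A minimal sketch of the registry pattern this module docstring refers to: a mapping from algorithm name to actor class, so a runner can build the appropriate actor from a config string. The dict name, keys, and stub classes here are illustrative assumptions, not HARL's exact contents.

```python
# Stub classes standing in for the actor classes documented above.
class HAPPO: ...
class HATRPO: ...
class MAPPO: ...

# Hypothetical registry mapping algorithm names to actor classes.
ALGO_REGISTRY = {"happo": HAPPO, "hatrpo": HATRPO, "mappo": MAPPO}

def make_actor(algo_name, *args, **kwargs):
    # look up the actor class by name and instantiate it
    return ALGO_REGISTRY[algo_name](*args, **kwargs)

actor = make_actor("happo")
```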