Hindsight Experience Replay (HER)[56]: A data training algorithm that enables model-free RL algorithms to mimic the human ability to learn almost as much from achieving an undesired outcome as from achieving the desired one, i.e., it enables the agent to learn even when the reward signal is sparse or binary. In this way the MARL objective is not singular, and training is decoupled so that insights from unintended outcomes can be used to steer the agent towards the wanted outcomes. | ● Rewards are granted based on the outcome achieved rather than the goal initially set. ● Experiences are replayed with different goals to expand the learning gained by the agent and to recall rare but possible occurrences. ● The algorithm can be applied over a DQN and within any model-free MARL algorithm. | ● Defining goals and setting the optimisation approach is a difficult process. ● Require a lot of memory and computational processing power. ● The run time can be long depending on the variant of the HER algorithm applied. |
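To make the goal-relabelling idea concrete, the following is a minimal Python sketch of a HER-style "final" relabelling strategy under an assumed sparse binary reward; the `Transition` container, the `sparse_reward` tolerance, and the `her_relabel` helper are illustrative names, not taken from the cited implementation.

```python
# Minimal sketch of HER-style goal relabelling ("final" strategy), assuming a
# sparse binary reward of 0 when the achieved goal matches the relabelled goal
# and -1 otherwise. All names are illustrative.
from dataclasses import dataclass, replace
from typing import List
import numpy as np

@dataclass
class Transition:
    state: np.ndarray
    action: np.ndarray
    next_state: np.ndarray
    achieved_goal: np.ndarray   # goal actually reached after the action
    desired_goal: np.ndarray    # goal the agent was originally pursuing
    reward: float

def sparse_reward(achieved: np.ndarray, goal: np.ndarray, tol: float = 0.05) -> float:
    """Binary reward: 0 if the achieved goal is within tolerance of the goal, else -1."""
    return 0.0 if np.linalg.norm(achieved - goal) < tol else -1.0

def her_relabel(episode: List[Transition]) -> List[Transition]:
    """Replay the episode with the goal replaced by the goal actually achieved
    at the end of the episode, so even failed episodes yield useful rewards."""
    final_goal = episode[-1].achieved_goal
    relabelled = [
        replace(t, desired_goal=final_goal,
                reward=sparse_reward(t.achieved_goal, final_goal))
        for t in episode
    ]
    return episode + relabelled   # store both the original and relabelled transitions
```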
Imagination-Augmented Agents (I2A)[63]: A data training algorithm that builds imagination-augmented RL approaches around approximate environment models. It initialises an imagined trajectory from the present real observation and subsequently feeds the simulated observations into the RL model. It uses predictive analysis and imagination to enable the learning agent to create implicit plans and to exploit model-based and model-free RL jointly, without the pitfalls of regular model-based approaches. The predictive analysis is conducted over predictions obtained from the learned model of the environment. | ● Augment model-free RL agents with imagination to enable them to construct implicit plans or policies. ● Improve the learning algorithm’s performance. ● Demonstrate a superior ability to interpret imperfect predictions, even in unknown environments. ● Use predictions as additional context in developing the DQN. | ● Need a moderate number of iterations to become efficient; too many iterations result in diminishing returns. ● Moderately complex to compute due to its minimal reliance on the model of the environment and increased focus on its predictive ability. ● Perform slower than purely model-free RL approaches. |
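The sketch below illustrates the imagination-augmented idea in PyTorch under simplifying assumptions: a learned one-step environment model, a fixed placeholder rollout policy, and small fully connected layers in place of the original convolutional encoders and distilled rollout policy. All module names and sizes are illustrative, not the architecture of the cited work.

```python
# Sketch of an imagination-augmented agent: imagined rollouts from a learned
# environment model are encoded and combined with a model-free path.
import torch
import torch.nn as nn

class I2ASketch(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, rollout_len: int = 3):
        super().__init__()
        self.rollout_len = rollout_len
        self.n_actions = n_actions
        # Learned (approximate) environment model: predicts next observation and reward.
        self.env_model = nn.Linear(obs_dim + n_actions, obs_dim + 1)
        # Encodes one imagined trajectory into a fixed-size summary.
        self.rollout_encoder = nn.GRU(obs_dim + 1, 32, batch_first=True)
        # Model-free path operating directly on the real observation.
        self.model_free = nn.Linear(obs_dim, 32)
        # Policy head consumes both paths (imagined summaries + model-free features).
        self.policy = nn.Linear(32 * n_actions + 32, n_actions)

    def imagine(self, obs: torch.Tensor, action: int) -> torch.Tensor:
        """Roll the environment model forward, starting with a fixed first action."""
        steps = []
        for _ in range(self.rollout_len):
            a = torch.nn.functional.one_hot(torch.tensor(action), self.n_actions).float()
            pred = self.env_model(torch.cat([obs, a]))
            obs, reward = pred[:-1], pred[-1:]
            steps.append(torch.cat([obs, reward]))
            action = 0  # placeholder rollout policy; I2A distils the agent's own policy
        seq = torch.stack(steps).unsqueeze(0)   # (1, rollout_len, obs_dim + 1)
        _, h = self.rollout_encoder(seq)
        return h.squeeze(0).squeeze(0)          # (32,)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # One imagined rollout per candidate first action, plus the model-free features.
        summaries = [self.imagine(obs, a) for a in range(self.n_actions)]
        features = torch.cat(summaries + [self.model_free(obs)])
        return self.policy(features)            # action logits
```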
Proximal Policy Optimisation (PPO) algorithms[62]: A data training algorithm that alternates between sampling data through interaction with the environment and optimising a “surrogate” objective function that enables multiple epochs of minibatch updates using stochastic gradient ascent. The method is applied over RL approaches to improve their efficiency. | ● Leverage a dynamic learning rate so the algorithm can self-correct when the common pitfalls of policy gradient methods resurface (e.g., inconsistent policy updates and high reward variance). ● Reuse samples more than once to mitigate the sample inefficiency prevalent in traditional policy gradient methods. | ● When compared to traditional policy gradient methods, this method has a simplified implementation process, decreased sample complexity, improved learning performance, and improved algorithm convergence time. |
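The clipped surrogate objective at the heart of PPO can be written in a few lines; the sketch below assumes the log-probabilities of the sampled actions under the old and current policies, together with advantage estimates, are already available as tensors.

```python
# Minimal sketch of the PPO clipped surrogate objective.
import torch

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective (to be minimised with SGD)."""
    ratio = torch.exp(log_probs_new - log_probs_old)          # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the element-wise minimum gives a pessimistic bound that
    # discourages destructively large policy updates.
    return -torch.min(unclipped, clipped).mean()
```

Because the objective only penalises ratios outside the clip range, the same batch of samples can be reused for several epochs of minibatch updates, which is what gives PPO its sample-reuse advantage over vanilla policy gradients.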
Model-Based RL with Model-Free Fine-Tuning (MB-MF)[61]: Uses a moderate number of samples and medium-sized neural networks to leverage the benefits of model-free DRL and model-based DRL in a joint approach. The two approaches are combined to produce stable and plausible steps for an agent to conduct complex tasks well. | ● The algorithm learns a medium-sized neural network dynamics model and applies Model Predictive Control (MPC) over it for model-based control, then uses the resulting controller to initialise a learning agent that is fine-tuned with model-free RL combined with DNN features. | ● When compared with purely model-based or purely model-free approaches, MB-MF has increased performance, reduced complexity, and improved sample efficiency over a wide range of complex tasks. |
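As a rough illustration of the model-based half of MB-MF, the sketch below performs random-shooting MPC with a learned dynamics model; the `dynamics` and `reward_fn` callables, the action-sampling range, and the horizon are assumptions for illustration only. In MB-MF the controller obtained this way would then be used to initialise a model-free learner for fine-tuning.

```python
# Minimal sketch of random-shooting MPC over a learned dynamics model,
# assuming dynamics(state, action) -> next_state and a known reward function.
import numpy as np

def mpc_random_shooting(state, dynamics, reward_fn, action_dim,
                        horizon=10, n_candidates=500, rng=None):
    rng = rng or np.random.default_rng()
    # Sample candidate action sequences uniformly in [-1, 1] (illustrative range).
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, seq in enumerate(candidates):
        s = state
        for a in seq:
            returns[i] += reward_fn(s, a)
            s = dynamics(s, a)           # predicted next state from the learned model
    best = int(np.argmax(returns))
    return candidates[best, 0]           # execute only the first action, then re-plan
```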
Twin Delayed Deep Deterministic (TD3) policy gradient algorithm[57]: Designed as a model-free, online, off-policy reinforcement learning method that extends the DDPG by relating the target network bias to the over-estimation bias, taking the minimum value between a pair of critics, and delaying policy updates. | ● Improve the learning speed by applying two Q-value functions. ● Minimise the effects of function approximation errors on both the actor and the critic by using the minimum of the two value estimates when computing the learning target. ● Prevent overestimated value estimates and sub-optimal policies by adding noise to the target action, which makes the policy less likely to exploit actions with high Q-value estimates. | ● When compared to the DDPG, the approach has improved learning speed and reduced estimation biases and approximation errors. ● Suffer from slow convergence. ● Long training duration. ● Prone to converging to a local optimum. ● PPO often outperforms TD3. |
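The core TD3 target computation, clipped double-Q estimation plus target policy smoothing, is sketched below. The actor and critic target networks are assumed to be given, and the noise and clipping constants are commonly used defaults rather than values drawn from the cited work.

```python
# Sketch of the TD3 target: target policy smoothing + clipped double-Q.
import torch

def td3_target(next_state, reward, done, actor_target,
               critic1_target, critic2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action so the
        # value estimate is averaged over a small neighbourhood of actions.
        next_action = actor_target(next_state)
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)
        # Clipped double-Q: take the minimum of the two target critics to
        # counteract over-estimation bias.
        q1 = critic1_target(next_state, next_action)
        q2 = critic2_target(next_state, next_action)
        target_q = reward + gamma * (1.0 - done) * torch.min(q1, q2)
    return target_q
```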
Self-Play Actor-Critic (SPAC)[60]: Combines a wide-ranging critic with the policy gradient method to form a self-play actor-critic approach for settings with imperfect information. | ● Improve the stability and sample efficiency of the self-play reinforcement learning training procedure. ● Usable in environments with limited information. ● Speed up the training process. | ● Increased algorithm performance (outperforms DDPG and PPO). ● Reduced sample complexity. ● Improved sample efficiency over a wide range of complex high-dimensional tasks. |
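Since the SPAC procedure is not detailed here, the sketch below shows only a generic self-play training loop for an actor-critic learner playing against a frozen snapshot of itself; the `env` and `agent` interfaces are assumptions for illustration and do not reproduce the SPAC algorithm itself.

```python
# Generic self-play loop: the learner plays against a periodically refreshed
# frozen copy of itself and updates from the resulting trajectories.
import copy

def self_play_training(env, agent, n_iterations=1000, snapshot_every=50):
    opponent = copy.deepcopy(agent)                 # frozen past version of the learner
    for it in range(n_iterations):
        obs = env.reset()
        done, trajectory = False, []
        while not done:
            a_learner = agent.act(obs["learner"])
            a_opponent = opponent.act(obs["opponent"])
            next_obs, reward, done, _ = env.step({"learner": a_learner,
                                                  "opponent": a_opponent})
            trajectory.append((obs["learner"], a_learner, reward["learner"]))
            obs = next_obs
        agent.update(trajectory)                    # actor-critic update from self-play data
        if (it + 1) % snapshot_every == 0:
            opponent = copy.deepcopy(agent)         # refresh the opponent snapshot
    return agent
```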