
Towards a multi-agent reinforcement learning approach for joint sensing and sharing in cognitive radio networks

Kagiso Rapetswa and Ling Cheng
School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg 0001, South Africa

Abstract

The adoption of Fifth Generation (5G) and beyond-5G networks is driving demand for learning approaches that enable users to co-exist harmoniously in a multi-user distributed environment. Although resource constrained, the Cognitive Radio (CR) has been identified as a key enabler of distributed 5G and beyond-5G networks because of its cognitive abilities and its ability to access idle spectrum opportunistically. Reinforcement learning is well suited to meet this demand for learning because it does not require the learning agent to have prior information about the environment in which it operates. Intuitively, CRs should therefore implement reinforcement learning to gain opportunistic access to spectrum efficiently and to co-exist with one another. However, while the application of reinforcement learning is straightforward in a single-agent environment, it becomes complex and resource intensive in a multi-agent, multi-objective learning environment. In this paper, (1) we present a brief history and overview of reinforcement learning and its limitations; (2) we review recently proposed multi-agent learning methods and multi-agent learning algorithms applied in CR networks; and (3) we present a novel framework for multi-CR reinforcement learning and conclude with a synopsis of future research directions and recommendations.
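To make the model-free claim concrete, the following is a minimal illustrative sketch (hypothetical, not drawn from the paper): a single CR learns opportunistic channel selection with tabular Q-learning, never observing the primary-user occupancy probabilities directly. The channel count, occupancy probabilities, and reward scheme are all assumptions chosen for illustration.

```python
import random

# Hypothetical toy: a single cognitive radio learns which channel to
# access opportunistically using tabular Q-learning. Model-free: the
# agent never sees BUSY_PROB; it learns purely from observed outcomes.

N_CHANNELS = 4
BUSY_PROB = [0.9, 0.6, 0.3, 0.1]  # hypothetical primary-user activity per channel

ALPHA = 0.1    # learning rate
EPSILON = 0.1  # exploration rate for epsilon-greedy selection

q = [0.0] * N_CHANNELS  # one Q-value per channel (single-state problem)

for step in range(10_000):
    # Epsilon-greedy: explore a random channel occasionally, else exploit.
    if random.random() < EPSILON:
        a = random.randrange(N_CHANNELS)
    else:
        a = max(range(N_CHANNELS), key=lambda c: q[c])

    # Environment feedback: +1 for transmitting on an idle channel,
    # -1 for colliding with a primary user.
    r = 1.0 if random.random() > BUSY_PROB[a] else -1.0

    # Q-learning update; with a single state and no bootstrapping this
    # reduces to an exponential moving average of the observed reward.
    q[a] += ALPHA * (r - q[a])

print("Learned channel preferences:", [round(v, 2) for v in q])
```

Note that once several CRs run this update concurrently, each radio's choices change the statistics every other radio observes, making the environment non-stationary from each agent's perspective; this is precisely the multi-agent complexity the abstract identifies.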

Keywords: cognitive radio, deep reinforcement learning, multi-agent reinforcement learning, mean field reinforcement learning, organic computing

Publication history

Received: 09 February 2023
Accepted: 31 March 2023
Published: 20 March 2023
Issue date: March 2023

Copyright

© All articles included in the journal are copyrighted by the ITU and TUP.

Rights and permissions

This work is available under the CC BY-NC-ND 3.0 IGO license: https://creativecommons.org/licenses/by-nc-nd/3.0/igo/
