Artificial Social Intelligence: A Comparative and Holistic View

In addition to a physical comprehension of the world, humans possess a high social intelligence—the intelligence that senses social events, infers the goals and intents of others, and facilitates social interaction. Notably, humans are distinguished from their closest primate cousins by their social cognitive skills as opposed to their physical counterparts. We believe that artificial social intelligence (ASI) will play a crucial role in shaping the future of artificial intelligence (AI). This article begins with a review of ASI from a cognitive science standpoint, including social perception, theory of mind (ToM), and social interaction. Next, we examine the recently-emerged computational counterpart in the AI community. Finally, we provide an in-depth discussion on topics related to ASI.


Dawn of Artificial Social Intelligence
Despite controversies [2], the measurement of AI has a long history of employing human-like behavior tests, originating from the Turing test (originally called the imitation game) [3]: a test administered to determine whether a person is conversing with a real person or a computer program simulating a human. Herbert Simon defined AI similarly with a focus on human-like behaviors: "We call programs intelligent if they exhibit behaviors that would be regarded as intelligent if they were exhibited by human beings." [4] Although modern AI has achieved human-level intelligence in some tasks using data-driven methods [5], it continues to advocate human-like tasks and evaluations [6−9].
Despite its rapid growth in psychology [12, 57−60], artificial social intelligence (ASI) has been mostly disregarded in the AI community, with only scattered applications. Notably, cognitive skills for interacting with the social world rather than the physical world distinguish 2.5-year-old human children (prior to reading and schooling) from chimpanzees [61]; humans exhibit significantly more advanced social-cognitive skills than their closest animal cousins. Thus, research on ASI is essential for the future generation of AI.
To address the aforementioned deficiency, this article highlights a promising AI direction, ASI, from a computational perspective. In contrast to the mechanical and abstract nature of physical intelligence, ASI involves many subfields that are currently studied separately, such as social perception, theory of mind (ToM), and social interaction [62,63], with varying emphasis on perception, cognitive components, behavior, and even psychometric methods to measure social skills [64]. We intend to provide a comparative and holistic perspective on (1) the gap between existing AI systems and human intelligence, (2) current issues, and (3) future directions by examining human social intelligence and recent efforts on building computational models.

Unique challenges of context
ASI is distinct and challenging compared to our physical understanding of the world; it is highly context-dependent [63]. This view is shared by the Defense Advanced Research Projects Agency (DARPA), which believes that the future generation of AI should include the human-like skill of contextual adaptation [65], i.e., the capacity to reason about and adapt to various contextual inputs.
Here, context could be as large as culture and common sense or as small as two friends' shared experiences [58]. This unique challenge prevents standard algorithms from tackling ASI problems in real-world environments, which are frequently complex, ambiguous, dynamic, stochastic, partially observable, and multi-agent.
ASI involves numerous social signals that are frequently overloaded and ambiguous [66,67]. This difficulty does not even begin at the level of verbal or so-called natural language; rather, it begins with nonverbal communication. Given different contexts, the same gesture (e.g., pointing to a cup) might convey different meanings [68]. Pointing to a cup may indicate its shape, color, capacity to hold water, or a request for assistance in retrieving some water. Consequently, addressing ASI requires a comprehensive approach; improving specific components of an ASI system would not always result in improved performance [9].

Overview of the article
Multidisciplinary research, including philosophy, cognitive science, neurology, computer science, applied mathematics, statistics, system engineering, and robotics, informs and inspires ASI. In Section 2, which covers social perception, theory of mind, and social interaction, we begin with experimental evidence and theoretical hypotheses of human social intelligence from the standpoint of cognitive science. In Section 3, we present the AI community's computational counterpart, focused on social perception, theory of mind, and social interaction, with an added topic on social robots and cognitive architectures. In Section 4, we explore significant challenges that impede the development of ASI and recommend potential future trends. Section 5 gives the conclusion.

Human Social Intelligence
Evolutionarily, social intelligence development is advantageous for human adaptation to more complex social situations. As a result, studying human social intelligence provides insight into the foundation, curriculum, points of comparison, and benchmarks required to develop ASI with human-like characteristics [63,69].
We concentrate on the three most important aspects of social intelligence: social perception, ToM, and social interaction. We select these themes not just because they are grounded in well-established cognitive science theories but also because there are readily available tools for developing computational models in these areas (to be discussed in Section 3).
Social perception is the basis for ToM and social interaction. It consists primarily of the perception of social features, such as animacy and agency, and provides low-level, automatic, instantaneous, and non-conscious visual perception [70]. ToM, in contrast, is concerned with sophisticated, analytic, and logical cognitive reasoning, involving a general cognitive system with several essential components, including belief, intent, and desire. Social interaction emphasizes more multi-agent interactive activities, such as communication and cooperation, than social perception and ToM.

Social perception
What factor is the most fundamental and influential in determining social perception? Contrary to intuition, motion cues composed of simple geometry may suffice [71]. According to Michotte [72], "... the specifying factors - gestures, facial expressions, speech - are innumerable and can be differentiated by an infinity of nuances. However, they are all additional refinements compared with the key factors, which are the simple kinetic structures." The Heider-Simmel stimuli [73] are perhaps the most seminal work (see the redrawing in Fig. 1). Participants were instructed to watch a film depicting three simple 2D geometric shapes (a large triangle, a small triangle, and a small circle) roaming in the vicinity of a rectangle. Even when told explicitly that these are merely simple shapes, participants still rapidly, spontaneously, and consistently perceive animate social agents with various complex mental states, including desires, goals, emotions, personalities, and coalitions. These mental states combine to form a narrative-like description of the display, such as a hero rescuing a victim from a bully. This interpretation of simple moving shapes as animated agents is a remarkable demonstration of how the human visual system can infer complex social relationships and mental states from simple motion cues with minimal visual characteristics. Even though they involve impressions typically associated with higher-level cognitive processing, such interpretations appear to be predominantly perceptual in nature, i.e., relatively rapid, automatic, irresistible, and highly stimulus-driven.
The Heider-Simmel experiment demonstrates two essential aspects of human social perception: the perception of animacy and agency. Animacy denotes that the perceived entities are animate as opposed to inanimate (e.g., physical objects), whereas agency refers to animate beings who are goal-oriented and capable of planning to achieve their goals rationally and efficiently. Below, we concentrate primarily on these two properties (i.e., animacy and agency).
Animacy. Experiments have demonstrated that infants can distinguish between animate and inanimate motion characteristics as early as six months of age [74]. Children ages 3 to 4 can accurately distinguish between mental and physical actions [75]. How can such complex social phenomena be perceived so early?
Michotte [76] describes a seminal experiment that yielded the initial evidence. In this experiment, participants were shown two small squares separated by several inches and arranged in a line. In the first scenario, the first square (A) moves in a straight line until it reaches the second square (B), at which point A stops moving and B begins moving in the same direction (also called the launching effect). In the second scenario, the first square (A) approaches the second square (B). While A approaches, B moves away from A quickly and stops when it is several inches away again. In the first scenario, observers perceive A as physically causing B's motion (also termed phenomenal causality or the illusion of causality). In contrast, in the second scenario, A and B are perceived as alive with their own intentions, i.e., A attempting to capture B and B attempting to escape, even though all that is occurring in such films is simple kinematics.
Scholl and Tremoulet [71] provide a comprehensive review of a series of causal perception and animacy experiments conducted by Michotte [76] and Heider and Simmel [73]. Michotte's experiments and subsequent variations reveal that spatiotemporal parameters, such as relative velocity, speed-mass interaction, path length, and spatial and temporal gap, mediate causal perceptions. Minor manipulations, such as a brief spatial or temporal gap, could quickly transform the perceptions from physical causality to animated interaction [77]. Overwhelming evidence indicates that human perception of animacy appears hardwired into the visual system and is therefore implicit, automatic, and distinct from higher-level cognitive interpretations.
Fig. 1 Heider-Simmel stimuli [73]. Humans can perceive complex mental states and social interactions based solely on the motion of simple geometric shapes.
Agency. We now know that humans can automatically perceive complex social phenomena as early as six months of age. A natural follow-up question would be: How can we distinguish between social events and physical phenomena? The solution lies in the notion of agency [78]. An agent is rationally controlled because it has an internal energy source, whereas an object is not.
Similar to animacy, the social perception of agency is primarily associated with motion kinematics as opposed to featural properties. Gergely et al. [79] and Csibra et al. [80] find that relatively simple motion sequences, without self-initiated movement to cue animacy, can elicit an impression of goal-directed behavior in infants aged nine months.
The perception of agency is frequently studied in tandem with animacy for more complex social phenomena. Gao et al. [81] study a particularly salient form of perceived animacy and agency via tasks based on dynamic visual search (the Find-the-Chase task) and a new type of interactive display (the Don't-Get-Caught! task). They used two cues to evaluate the objective accuracy of such perceptions: (1) chasing subtlety, the degree to which the wolf deviates from a perfectly heat-seeking pursuit, and (2) directionality, i.e., whether and how the shapes face each other. Gao et al. [82] present the wolfpack effect, a novel social cue to perceived animacy that could effectively, irresistibly, and subtly influence human visual performance and interactive behavior. The study of chasing investigates how the visual system maintains and updates the dynamic social perception of animacy and agency over time and motion [83]; the researchers discovered that temporal dynamics could lead the visual system to either construct or actively reject interpretations of chasing.
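The chasing-subtlety cue can be made concrete with a small simulation. The following is a minimal sketch and not the stimulus code of Gao et al. [81]: it assumes that, at each step, the wolf's heading is drawn uniformly within plus or minus the subtlety angle around the direct line to the sheep, so that a subtlety of 0 degrees yields perfect heat-seeking pursuit. The names `wolf_step` and `mean_closing_speed`, the step size, and the uniform sampling are illustrative assumptions.

```python
import math
import random


def wolf_step(wolf, sheep, subtlety_deg, speed=1.0, rng=random):
    """Move the wolf one step toward the sheep with bounded angular error.

    subtlety_deg = 0 gives perfect heat-seeking; larger values let the
    heading deviate by up to +/- subtlety_deg from the direct line.
    (Hypothetical helper; uniform sampling is an assumption.)
    """
    direct = math.atan2(sheep[1] - wolf[1], sheep[0] - wolf[0])
    error = math.radians(rng.uniform(-subtlety_deg, subtlety_deg))
    heading = direct + error
    return (wolf[0] + speed * math.cos(heading),
            wolf[1] + speed * math.sin(heading))


def mean_closing_speed(subtlety_deg, steps=2000, seed=0):
    """Average per-step reduction in wolf-sheep distance for a static sheep."""
    rng = random.Random(seed)
    sheep = (0.0, 0.0)
    total = 0.0
    for _ in range(steps):
        wolf = (100.0, 0.0)  # reset so every step starts at the same distance
        before = math.dist(wolf, sheep)
        after = math.dist(wolf_step(wolf, sheep, subtlety_deg, rng=rng), sheep)
        total += before - after
    return total / steps
```

In this sketch, the wolf closes in on a static sheep at roughly the full step size when the subtlety is 0 degrees and progressively more slowly as the subtlety grows, which is one simple way to see why larger deviations make the pursuit less direct.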
What are these perceptions' underlying units? In other words, are these social perceptions identifiable as discrete objects without the necessary movement properties? van Buren et al. [84] depict one disc (the "wolf") pursuing another disc (the "sheep") amidst several distractor discs that are moving. Lines were visible between each pair of discs. In the Unconnected condition, both lines connected distractors in pairs. In the Connected condition, however, one line connected the wolf to a distractor, and the other line connected the sheep to a different distractor. Observers in the Connected condition were markedly less likely to describe these behaviors in terms of mental states. According to the outcomes of their experiments, discrete visual objects are the fundamental units of social perception.
Summary. Does the human visual system have a natural tendency to recognize animacy and agency? The aforementioned experimental findings support the hypothesis that specific bottom-up perceptual processing is specialized and difficult to be "penetrated" by higher-level cognition [71]. This type of social perception may be at the intersection of perceptual and cognitive processing, where basic stimuli are transformed into causal, animate, or even intentional qualities, which are strongly linked to higher-level cognitive processing.

ToM
ToM is an additional crucial aspect of social intelligence. In their study examining ToM abilities in chimpanzees, Premack and Woodruff [85] first established the term and idea of ToM. The chimpanzee Sarah was shown a brief clip of an experimenter attempting to perform simple tasks. Subsequently, Sarah observed images of several objects, one of which solved the experimenter's dilemma. Sarah could select the correct photograph, demonstrating that she comprehended the task and the problem at hand, i.e., that she could represent the current scenario and the experimenter's intentions. Their findings highlight two essential components of ToM: a representation of the state of affairs and a representation of an individual's motivational link to the state, i.e., belief and intention [86].
Formally, ToM entails attributing mental states (such as beliefs, intents, or desires) to oneself and others, as well as acknowledging that people's perspectives and mental constructs may differ from those of the natural world and from one another [87]. Perspective taking in an internal simulation process is one of the defining characteristics of ToM [88], as understanding another agent requires not peering into the agent's brain chemistry or soul, but rather putting oneself in the agent's shoes in order to comprehend the agent's copy of the world [89] beyond one's own egocentric perspective. The Sally-Anne test [90,91], a classic first-order false-belief task, is a well-known experiment on perspective taking.
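The first-order false-belief structure of the Sally-Anne test can be illustrated with a minimal sketch. It assumes the standard marble-and-basket version of the task and a deliberately simplified rule, namely that an agent's belief about an object's location changes only when the agent witnesses the move. The names `Agent`, `move`, and `run_sally_anne` are hypothetical and do not come from any of the cited models.

```python
class Agent:
    """An observer whose belief is updated only by events it witnesses."""

    def __init__(self, name):
        self.name = name
        self.present = True
        self.belief = {}  # object -> believed location


def move(world, agents, obj, location):
    """Move an object; only agents who are present update their belief."""
    world[obj] = location
    for agent in agents:
        if agent.present:
            agent.belief[obj] = location


def run_sally_anne():
    """Replay the task and report reality versus each agent's belief."""
    world = {}
    sally, anne = Agent("Sally"), Agent("Anne")
    agents = [sally, anne]

    move(world, agents, "marble", "basket")  # Sally hides the marble
    sally.present = False                    # Sally leaves the room
    move(world, agents, "marble", "box")     # Anne moves it, unseen by Sally
    sally.present = True                     # Sally returns

    return {
        "reality": world["marble"],
        "sally_will_search": sally.belief["marble"],
        "anne_believes": anne.belief["marble"],
    }
```

Passing the task amounts to answering with Sally's belief (the basket) rather than with the true state of the world (the box), which requires keeping a separate belief record per agent instead of a single shared world state.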
ToM is replete with noteworthy experimental findings from cognitive development research. ToM formation around age 4 is one of the most important developmental milestones of early childhood [87]. Infants begin to exhibit gaze-following behaviors, identify themselves and others as agents who perform deliberate actions, and are capable of subjectively experiencing the environment by the end of their first year [63,92]. These behaviors are indicative of early development in ToM. Children follow another agent's gaze at approximately 14 months of age, move to acquire visual information, and visually confirm (check back and forth) that the other agent is experiencing the same reality as themselves [92]. By 14-18 months, the infant begins comprehending the mental states of desire, intention, and the causal relationship between emotions and goals through gaze direction [93]. Around the ages of 3-4, children begin to comprehend the differences between their own beliefs and knowledge and those of others, and thus begin to comprehend false beliefs; however, this ability does not become fully stable until the ages of 5-6 [94]. Later in the developmental trajectory [95] comes the establishment of second-order ToM, which entails predicting what one person thinks or feels about what another person thinks or feels [94,96].
Intent. Among all the cognitive components of ToM, we concentrate on the intent component and examine the evidence of the development of human intent in greater depth. Since humans can inversely infer the underlying intents of others through social contact and act to fulfill those intents based on their beliefs and desires, intent may be the most crucial component of ToM [9]. In fact, research has shown that humans do not encode the entirety of action details but rather observe and interpret actions in terms of their intentions and store these interpretations for later retrieval [97]. As a fundamental organizing principle that regulates how we comprehend one another and act in the environment, the concept of intent has been awarded a central position within social intelligence and should thus be an essential component of future AI.
The developmental psychology literature indicates that six-month-old infants view human actions as goal-directed behavior [98]. By the age of 10 months, infants segment continuous behavior streams into discrete units that correspond to what adults would perceive as distinct goal-directed acts [9,99,100]. After their first birthday, infants begin to comprehend that an agent may explore multiple plans to achieve a goal and choose one based on environmental conditions [101]. Eighteen-month-old children can deduce and reproduce an action's intended purpose, even if the activity frequently fails to achieve the aim [102]. In addition, infants can replicate behaviors rationally and effectively based on an evaluation of the environmental restrictions, as opposed to just duplicating movements, indicating that they understand the relationships between the environment, action, and underlying intent [103].
Typically, intentions are hierarchically arranged across extensive spatiotemporal ranges as a sequence of goals [9]. Infants are already capable of perceiving intentions on multiple levels, including concrete action goals, higher-order plans, and collaborative goals [104]. Young children can offer assistance based on the inferred intentions of others derived from observing their behaviors (including failed efforts) [105]. Figure 2 depicts a toddler as young as 18 months old who, upon watching an adult with both arms full of books repeatedly knocking into a cabinet with closed doors, infers that the adult intends to store books inside the cabinet and then walks over to open the cabinet for the adult [106].
Categorization. Understanding ToM's categorization may also assist our understanding, given that ToM is a broad topic concerning a general system. Cognitive ToM emphasizes explicit perspective-taking, representing and strategically reasoning about another person's beliefs and intentions, and generating causal inferences and predictions of the other's behavior. In contrast, affective ToM is more associated with the representation of emotional states and feelings and typically does not emphasize goal states or valuations of possible actions [94]; Roiser and Sahakian [107] employ the words cold cognition (unemotional) and hot cognition (emotion-laden). Cognitive ToM can be further divided into ToM for motivation (i.e., another organism's valuation, intention, purpose, and goal) and ToM for knowledge (i.e., another organism's belief states or taught schemas/scripts) [86].
Individual differences in cognitive strategies are also present [94]. The theory-theory approach [108] and the simulation-theory approach [109] are examples of these diverse ToM strategies. The theory-theory approach may be based on a set of intrinsic rules or on causal and probabilistic reasoning models, which may be analogous to cold cognition [94] in which mental states are inferred through intellectual processes. The simulation-theory approach relies on the individual's own motivations and deductive reasoning [110].
Challenges. Despite the many approaches used to investigate ToM (such as behavioral analysis, neuroimaging, and neural signal analysis), a coherent picture of what ToM is, how humans and other species engage in it, and what neurological systems contribute to its functioning is still largely missing [111,112].

Social interaction
We continue by introducing several concepts and significant studies of social interaction in human social intelligence. Studying social cues, phenomena, rules, and mechanisms in human social interaction could equip ASI with more sophisticated human-like communication and collaboration capabilities.
Social cues. Wiltshire et al. [113] defined a taxonomy of social cues and signals, which includes the following five categories of social cues: paralinguistic (voice prosody and non-language sounds), facial expression (motion and position of facial muscles), gaze (motion and position of the eyes and predicted sight-line), kinematics (motion, position, and posture of the body), and proxemics (use of interpersonal space) [63].
Gaze communication. Psychological evidence [114] suggests that eyes are stimuli with distinct "hardwired" neural pathways in the brain for their interpretation. Humans have the unique capacity to infer the intentions of another based on gazes. Gaze communication is a primitive form of human communication whose underlying social-cognitive and social-motivational infrastructure serves as a psychological platform upon which diverse linguistic systems might be constructed [58,115]. Thus, gaze communication plays a crucial role in expressing concealed mental states and enhancing verbal communication in social interactions [116].
Joint attention. Fan et al. [115] thoroughly delineated two hierarchical layers of human gaze communication dynamics: atomic-level and event-level. Event-level gaze communication refers to high-level, complex social communication events, such as non-communicative, mutual gaze, gaze aversion, gaze following, and joint attention. Each gaze communication event is a temporal composite of a few gaze communications at the atomic level. Atomic-level gaze communication describes the granular structures of human gaze interactions, including single, mutual, avoid, refer, follow, and share.
Joint attention is the most advanced sort of gaze communication, as it requires two agents (1) to have the same intention to share attention on common stimuli and (2) to be aware that they are sharing a common ground [115]. Typically, joint attention requires a mutual gaze to establish a communication channel, a refer gaze to direct attention to the target, a follow gaze to examine the referred stimuli, and a final mutual gaze to guarantee that the experience is shared [115,117]. In addition to this top-down approach that forms joint attention, there is also a bottom-up approach whereby two agents are drawn to the same stimuli and are familiar with one another. At 48 months of age, infants develop joint attention with mental attribution to represent their own perception, that of an agent, and the object [114]. The formation of shared attention is a vital initial step toward social interaction and imitation, a predecessor to ToM, and the basic foundation of social intelligence [118−120].
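The typical composition of joint attention described above (mutual, refer, follow, then a final mutual gaze) can be sketched as a check over a sequence of atomic-level labels. This is a rule-based simplification for illustration only and is not the method of Fan et al. [115]; it assumes that atomic labels are already available per time step and that the four steps may be separated by repeated or unrelated labels. The names `JOINT_ATTENTION_PATTERN` and `contains_joint_attention` are hypothetical.

```python
# Ordered atomic-level gazes that typically compose a joint-attention event.
JOINT_ATTENTION_PATTERN = ("mutual", "refer", "follow", "mutual")


def contains_joint_attention(atomic_labels, pattern=JOINT_ATTENTION_PATTERN):
    """Return True if `pattern` occurs as an ordered subsequence of the labels."""
    remaining = iter(atomic_labels)
    # `step in remaining` consumes the iterator up to the first match, so
    # each pattern step must be found strictly after the previous one.
    return all(step in remaining for step in pattern)
```

For example, the sequence single, mutual, mutual, refer, follow, follow, mutual satisfies the check, whereas a sequence that lacks the final mutual gaze, or in which the follow gaze precedes the refer gaze, does not.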
Pointing. In social communication, pointing is another essential social cue. According to Tomasello [58], pointing is one of the earliest forms of communication exclusive to the human species (the other is pantomiming). Pointing is also an indicator of particular cognitive abilities, such as being an intentional actor and having ToM [121]. Bates et al. [122] and Brinck [121] are credited with introducing the distinction between imperative pointing and declarative pointing. Declarative pointing is primarily intersubjective with a signaling function, whereas imperative pointing is based on behaviorally motivated regularities and is used to request the addressee to do something for the subject.
Because the recipient must use context to imagine, discern, and reason about the communicator's communication intentions, the interpretation of pointing is highly context-dependent. Tomasello [58] presented an intriguing example (see Fig. 3): one agent points to a bicycle outside the library to her companion, and depending on the environment, this pointing gesture could have entirely different communication intentions. The common ground between agents is an essential element of social communication and collaboration. All human communication, including linguistic communication, is only possible when the agents involved have established a common ground composed of shared attention, shared experience, and common cultural knowledge.
Fig. 2 Altruistic helping in human infants. Human infants as young as 18 months readily help others achieve their goals in a range of contexts, requiring both an understanding of others' goals and an altruistic desire to assist [106].
Levinson [123] developed the concept of an interaction engine, which allows communication intentions to be conveyed and recognized in both linguistic and nonlinguistic encounters. This interactive nature substantially impacts how young children coordinate social interactions with peers [124]. This article does not cover verbal communication studies. Nonetheless, it is essential to note that the basic skills required for effective language communication could be derived from the more rudimentary structures provided here for action control, nonlinguistic communications, and joint actions [125].
Cooperation. Cooperation is a type of social interaction that is more complex than simple communication, as it requires a psychological infrastructure of shared intentionality. This infrastructure comprises two crucial factors: (1) social-cognitive skills for creating common conceptual ground with others, such as joint attention and joint intention, and (2) prosocial motivations and norms to help and share with others [58].
Cichocki and Kuleshov [126] examined the precise distinctions between the four notions of communication, coordination, cooperation, and collaboration. By this rigorous definition, communication refers to the exchange of information between agents; coordination refers to the alignment of multiple agents towards the achievement of specific common goals through the efforts of individual agents; cooperation means that each individual agent/robot exchanges relevant information and resources in support of each other's goals, rather than a shared common goal; and collaboration requires agents to exchange information and knowledge in support of a shared task.
Tomasello [127] presents a comprehensive analysis and discussion of cooperation. According to his idea of collaboration, "shared cooperative actions" have two essential characteristics: (1) the participants have a joint goal in the sense that we (in mutual knowledge) do X together; and (2) the participants coordinate their interdependent roles, that is, their plans and sub-plans of action, including helping one another as needed in their respective roles.−130] Tomasello [127] also proposed a dual-level attentional structure (the shared focus of attention at a higher level, differentiated into perspectives at a lower level) and a dual-level intentional structure (shared goal with individual roles), arguing that the former is directly parallel to the latter and may ultimately derive from it. Fig. 4 illustrates the core idea.

Summary
This section provides a glimpse into the realm of human social intelligence from the perspective of cognitive science, covering three essential topics: social perception, theory of mind, and social interaction, with growing social interactivity and cognitive complexity. For social perception (Section 2.1), we have explored (1) the two most significant concepts (i.e., animacy and agency), (2) what may be the most fundamental, distinguishing, and determining aspect of social perception, and (3) where social perception fits within the human cognitive mechanism. Regarding ToM (Section 2.2), we have discussed its evolution and defining traits. Specifically, we have investigated (1) the findings on one of ToM's most essential components, (2) the classification of ToM, and (3) its applied cognitive strategies. As for social interaction (Section 2.3), we have (1) provided a detailed analysis spanning several of the most important aspects of social interaction (i.e., gaze communication, joint attention, pointing, and cooperation), (2) discussed why these problems are significant, and (3) presented the theory underlying social interaction.
It is essential to highlight that these three fundamental aspects of human social intelligence are not isolated but are inextricably linked. Social perception is the foundation for the formation of ToM; they both play crucial roles in human social interaction. Only with well-functioning abilities of social perception and ToM can humans interpret the latent meaning of social cues, understand other agents' mental states (e.g., belief and intent), and cooperate tacitly in a shared task, which are the requirements of ASI. In the following section, we describe the computational efforts devoted to these three aspects with a fourth aspect, social robots and cognitive architectures.
Fig. 3 Importance of context in social signal interpretation [58].
Context: A and B mutually know that the bicycle belongs to B's boyfriend C. Intent: "C is in the library. Let's go to find him!"
Context: A and B mutually know that B broke up with C yesterday. Intent: "C is already in the library, so perhaps we should skip it."
Fig. 4 A theory of cooperation by Tomasello [127]. Agents engaged in cooperation think and act in We-mode rather than I-mode. They have a joint goal and coordinate their roles. For shared cooperative tasks, social-cognitive skills and prosocial incentives and norms are two crucial components [58].

Artificial Social Intelligence
In this section, we introduce social intelligence from a computational perspective and highlight some computational works on social perception (in simulated and real-world scenarios), ToM, social interaction (i.e., social communication and cooperation), and social robots. The first three parts are in the same order as in the last section; we add a subsection on social robots and cognitive architectures because this field encompasses the other three aspects of social intelligence and leads to the development of future applications.

Social perception in simulated scenarios
Since humans possess an innate ability to perceive social cues from extremely simple stimuli, we investigate ways to computationally model social perception in simulated scenarios, akin to the Heider-Simmel stimuli introduced in Section 2.1.Shu et al. [10] present a unified theory that describes the interrelationships between the perception of physical and social events (see Fig. 5).They employed a simulation-based approach to generate various animations depicting rich behavioral patterns.Through human studies, these animations reveal that the perception of dynamic stimuli transitions gradually from physical to social events and vice versa.In addition, they devise a learningbased computational framework to account for human judgments.Specifically, the model learns to identify latent forces by inferring a family of potential functions capturing physical laws and value functions of agent goals, thereby projecting the animations into a sociophysical space with two psychological dimensions: an intuitive sense of whether physical laws are violated and an impression of whether an agent possesses intentions to perform goal-directed actions.
Tang et al. [131] investigate the problem of simultaneously perceiving physics and mind using a leash-chasing display, in which a disc ("sheep") is being chased by another disc ("wolf") that is physically constrained by a leash tied to a third disc ("master").They discover that (1) an intuitive physical system, such as a leash, can significantly mitigate the detrimental effects of spatial deviation and the diminishing objecthood on perceived chasing, thereby enhancing its robustness, and (2) a mutual dependency exists between physics and mind, where disrupting one will inevitably impair the perception on the other, supporting a joint perception of physics and mind.
Flatland is a new experimental paradigm introduced by Shu et al. [132] for exploring social inference in physical situations. Results demonstrate that human interpretations of interactive events in Flatland can be accounted for by a computational model that combines inverse hierarchical planning with a physical simulation engine to reason about objects and agents.
Shu et al. [133] examine the perception of social interaction using decontextualized motion trajectories, in which stimuli are extracted from drone-recorded aerial films of a real-world setting. To account for human judgments of interactiveness between two moving dots and the dynamic change of such judgments over time, they construct a hierarchical model that represents interactivity using latent variables and learns the distribution of critical movement features that signal potential interactivity. Intriguingly, the model can generalize to handle the original Heider-Simmel animations [73]. In addition, the generative model can also synthesize decontextualized animations with a controlled degree of interactiveness. The temporal parsing of trajectories and the conditional interactive fields for each sub-interaction are depicted in Fig. 6.
To investigate the cognitive architecture of perceived animacy, Gao et al. [134] devise Bayesian models that integrate domain-specific hypotheses of social agency with domain-general cognitive constraints on sensory, memory, and attentional processing. The proposed model posits that perceived animacy combines a bottom-up, feature-based, parallel search for goal-directed movements with architecturally distinct processes that make perceived animacy fast, flexible, and cognitively efficient. By distinguishing target agents from distractor objects in the "wolf-chasing-sheep" setting, they demonstrate that a Bayesian ideal observer model may explain the efficacy of human perceived animacy with realistic cognitive constraints.
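To make the idea of a Bayesian ideal observer in a "wolf-chasing-sheep" display more concrete, the following is a minimal sketch and not the model of Gao et al. [134]: it omits all cognitive constraints on sensing, memory, and attention. It assumes (our assumption, for illustration only) that the wolf's heading deviates from the bearing to the sheep with a von Mises-like density, that every other disc moves with a uniformly random heading, and that the prior over discs is uniform. The helper names `heading`, `chase`, and `wolf_posterior`, the concentration `kappa`, and the toy trajectories are all hypothetical.

```python
import math

def heading(p, q):
    """Direction of travel (radians) from point p to point q."""
    return math.atan2(q[1] - p[1], q[0] - p[0])

def chase(start, sheep_path, speed=1.0):
    """Hypothetical pursuer: on every frame, step straight at the sheep."""
    path = [start]
    for target in sheep_path[:-1]:
        x, y = path[-1]
        b = heading((x, y), target)
        path.append((x + speed * math.cos(b), y + speed * math.sin(b)))
    return path

def wolf_posterior(sheep_path, disc_paths, kappa=4.0):
    """Posterior over which disc is the wolf, under a uniform prior.

    Assumed likelihood: the wolf's heading deviates from the bearing to
    the sheep with density proportional to exp(kappa * cos(deviation));
    all other discs have uniformly random headings. Constants shared by
    every hypothesis cancel when the scores are normalized.
    """
    scores = []
    for path in disc_paths:
        score = 0.0
        for t in range(len(path) - 1):
            move = heading(path[t], path[t + 1])
            bearing = heading(path[t], sheep_path[t])
            score += kappa * math.cos(move - bearing)
        scores.append(score)
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Toy display: one pursuer and two distractors moving in straight lines.
sheep = [(5.0, 5.0 + 0.5 * t) for t in range(6)]
discs = [
    chase((0.0, 0.0), sheep),                  # disc 0: heads at the sheep
    [(10.0 + t, 0.0) for t in range(6)],       # disc 1: drifts right
    [(0.0 - t, 10.0 - t) for t in range(6)],   # disc 2: drifts down-left
]
posterior = wolf_posterior(sheep, discs)
print([round(p, 3) for p in posterior])
```

In this toy display the observer concentrates nearly all posterior mass on disc 0. Lowering `kappa` or adding noise to the pursuer's heading would make the discrimination harder, which is the regime where the cognitive constraints studied in [134] matter.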

Social perception in real-world scenarios
In addition to simulations, we further demonstrate computational modeling of social perception in more challenging real-world situations.
Fan et al. [120] investigate the topic of inferring shared attention in their collected third-person social scene video dataset VideoCoAtt by employing a spatiotemporal neural network utilizing human gaze directions and potential target boxes extracted from the context. In their subsequent study [115] (see Fig. 7), the authors systematically investigate the subject of human gaze communication by constructing spatiotemporal graphs for real-world social scenarios in the collected VACATION video dataset. They devise a graph neural network and an event network for the prediction of gaze communication at the atomic and event levels, respectively.
To jointly infer human attention, intention, and task from videos, Wei et al. [135] introduce a hierarchical model of human-attention-object (HAO) and a beam search algorithm. According to their definition, the intention consists of the human pose, attention, and objects, whereas the task is represented as a series of intentions. Xie et al. [136] offer an unsupervised method for localizing functional objects and predicting human intents and trajectories from surveillance footage of public places. Agents are influenced by the attractive or repulsive "fields" of functional objects, referred to as "dark matter" (see Fig. 8). In addition to estimating the agent's intent, the model can also derive the agent's trajectory via agent-based Lagrangian mechanics.
Holtzen et al. [137] present a method that enables robots to infer a person's hierarchical intent from partially observed RGB-D videos. They represent intent as a novel hierarchical, compositional, and probabilistic And-Or-Graph structure that describes a relationship between actions and plans. Human intent is inferred by reverse-engineering a person's decision-making and action-planning processes under a Bayesian probabilistic programming framework. Experiments conducted in a 3D environment reveal that the inferred human intent (1) corresponds well with human judgment, and (2) provides useful contextual cues for object tracking and action recognition.

ToM
The computational modeling of ToM may concentrate on different components, such as belief, intent, and desire. Gonzalez and Chang [138] divide computational models of ToM into several broad categories, including Game ToM [139], Observational (RL) [140], Inverse RL [141], and Bayesian ToM [142]. These models contain modules for representing the goals and desires of an agent, inferring the mental states of other agents (e.g., beliefs, goals, desires, intentions, and feelings), and integrating these goals and mentalizing computations to generate optimal policies.
We start this section with some of the most representative studies on different ToM components and modeling methods. Yuan et al. [143] jointly infer object states, robot knowledge, and human beliefs using parse graphs, which accurately identify human (false-)beliefs. Fan et al. [144] (see Fig. 9) incorporate different nonverbal communication cues (e.g., gaze, human poses, and gestures) to infer agents' mental states based solely on visual inputs. By aggregating beliefs and physical-world states, their approach effectively forms five minds during the interactions between two agents. In particular, they construct a common mind to avoid the infinite recursion commonly used in prior works. In addition, they devise a hierarchical energy-based model that simultaneously tracks and predicts social cues, social communication events, and belief dynamics in five minds. Arslan [145] investigate how 5-year-olds choose and revise reasoning strategies in second-order false belief tasks by constructing two computational cognitive models of this process: an instance-based learning model and an RL model. Oguntola [146] develop an interpretable modular neural framework for modeling the intentions of other observed entities, demonstrating the model's efficacy in a Minecraft search and rescue task. They also demonstrate that, under the right conditions, integrating interpretability can dramatically improve prediction performance. Zeng et al. [147] propose a brain-inspired model of belief ToM, leveraging high-level knowledge of brain regions' functions relevant to ToM. Although tested on false belief tasks, such a cognitive architecture may be difficult to motivate at the computational level [94].
One stream in ToM is based on Bayesian methods. Baker et al. [148] investigate the rational quantitative attribution of beliefs, desires, and percepts in human mentalizing from agents' movement in a local spatial environment (see Fig. 10). They devise a Bayesian theory of mind (BToM) model in a partially observable Markov decision process (POMDP) setting for rational planning and state estimation, which extends classical expected-utility agent models to sequential actions in complex, partially observable domains. In two experiments, their model accurately captures the quantitative mental-state judgments of human participants while numerous stimulus parameters are varied over a large number of stimuli. Comparison with a family of simpler non-mentalistic motion features reveals the value contributed by each component of the model. BToM appears particularly well-suited to model the inherent uncertainty required to infer unobservable mental states and to capture the judgments of human participants [142]. However, the scalability of BToM is often problematic; it has typically been tested only in simple scenarios [94].
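The core BToM computation, joint Bayesian inference over an agent's desire and belief given an observed action, can be illustrated with a deliberately small sketch. This is not the POMDP planner of Baker et al. [148]: it replaces sequential planning with a one-shot expected-utility choice, and every number in it (the utilities, the belief levels, `WALK_COST`, and the softmax rationality `BETA`) is an assumed value chosen only for illustration. The setting loosely mirrors the food-truck scenario: truck K is visible, and the agent may believe that truck L is parked at a hidden spot.

```python
import math
from itertools import product

# Hypothetical utilities for each candidate desire (illustrative values).
DESIRES = {
    "prefers_K": {"K": 1.0, "L": 0.3},
    "prefers_L": {"K": 0.3, "L": 1.0},
}
# Candidate beliefs: the agent's subjective probability that truck L
# is parked at the hidden spot.
BELIEFS = {"thinks_L_present": 0.9, "thinks_L_absent": 0.1}
WALK_COST = 0.2  # assumed cost of walking to the hidden spot
BETA = 5.0       # assumed softmax rationality of the observed agent

def expected_utilities(utility, p_l):
    """Expected utility of each action under a desire and a belief."""
    stay = utility["K"]
    # If L turns out to be absent, the agent falls back to truck K.
    check = p_l * utility["L"] + (1 - p_l) * utility["K"] - WALK_COST
    return {"stay_at_K": stay, "check_hidden_spot": check}

def action_likelihood(action, utility, p_l):
    """Softmax (noisily rational) probability of the observed action."""
    eu = expected_utilities(utility, p_l)
    z = sum(math.exp(BETA * v) for v in eu.values())
    return math.exp(BETA * eu[action]) / z

def infer_mental_state(action):
    """Posterior over (desire, belief) pairs under a uniform prior."""
    joint = {}
    for (d, u), (b, p) in product(DESIRES.items(), BELIEFS.items()):
        joint[(d, b)] = action_likelihood(action, u, p)
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}

posterior = infer_mental_state("check_hidden_spot")
for hypothesis, prob in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(hypothesis, round(prob, 3))
```

Under these assumed numbers, seeing the agent walk to the hidden spot makes "prefers L and thinks L is present" the most probable explanation, while the alternative explanations retain some probability. This graded attribution is the kind of judgment BToM is meant to capture.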
RL represents another stream in ToM computational modeling; Wen et al. [149] and Moreno et al. [150] are examples of recursive reasoning models for higher-order ToM in an RL framework. Drawing on Skinner's theory, Hakimzadeh [151] contend that RL plays a crucial role in human intuition and cognition, and that theories such as the language of thought hypothesis, script theory, and Piaget's theory of cognitive development offer complementary approaches. They present a computational building block that supports the principles of productivity, systematicity, and inferential coherence for Piaget's schema theory. Reference [152] points out that ToM can indeed be formulated as an inverse reinforcement learning (IRL) problem, where expectations for how mental states produce behavior are represented by an RL model. By simulating the hypothesized beliefs and desires, an RL model predicts the actions of other individuals, and the mental-state inference is accomplished by inverting this model. Overall, RL models, such as IRL and multi-agent reinforcement learning (MARL), are highly scalable but computationally intensive and less interpretable.
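The idea of inferring mental states by inverting an RL model can be sketched on a toy problem. The example below is an illustrative stand-in and not the formulation of [152]: it performs Bayesian inverse planning over a fixed set of candidate goals rather than recovering a full reward function. The five-state corridor, the discount `GAMMA`, the softmax temperature `BETA`, and the function names are all assumptions introduced here.

```python
import math

N_STATES = 5        # corridor states 0..4
ACTIONS = (-1, +1)  # step left or step right
GAMMA = 0.9         # assumed discount factor
BETA = 3.0          # assumed softmax rationality of the observed agent

def step(state, action):
    """Deterministic transition, clipped at the corridor ends."""
    return min(max(state + action, 0), N_STATES - 1)

def q_table(goal, sweeps=200):
    """Forward RL model: value iteration for a hypothesized goal.

    The agent is assumed to receive reward 1 on every step that lands
    on the goal state.
    """
    def q(s, a, values):
        nxt = step(s, a)
        return (1.0 if nxt == goal else 0.0) + GAMMA * values[nxt]

    values = [0.0] * N_STATES
    for _ in range(sweeps):
        values = [max(q(s, a, values) for a in ACTIONS)
                  for s in range(N_STATES)]
    return {(s, a): q(s, a, values)
            for s in range(N_STATES) for a in ACTIONS}

def policy_prob(q, state, action):
    """Softmax policy predicted by the forward model."""
    z = sum(math.exp(BETA * q[(state, a)]) for a in ACTIONS)
    return math.exp(BETA * q[(state, action)]) / z

def goal_posterior(trajectory, goals=(0, 4)):
    """Invert the forward model: P(goal | observed state-action pairs)."""
    scores = {}
    for g in goals:
        q = q_table(g)
        scores[g] = math.prod(policy_prob(q, s, a) for s, a in trajectory)
    total = sum(scores.values())
    return {g: v / total for g, v in scores.items()}

observed = [(2, +1), (3, +1)]  # the agent steps right twice from the middle
posterior = goal_posterior(observed)
print({g: round(p, 4) for g, p in posterior.items()})
```

Two rightward steps are enough for the observer to attribute the goal at state 4 with near certainty. The sketch also hints at the cost noted above: each hypothesis requires solving its own planning problem, which becomes expensive as the hypothesis space grows.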
Under a POMDP setting, Yuan et al. [153] argue that misalignment of values could impede group performance in cooperation; hence, communication plays a vital role during which a robot needs to serve as an effective listener and an expressive speaker. In the context of value alignment, they investigate how to foster effective bidirectional human-robot communications and propose an explainable artificial intelligence (XAI) system in which a collection of robots anticipates human values by using in-situ feedback while explaining their decision-making processes to users (see Fig. 11). Their XAI system integrates a cooperative communication model to infer human values associated with multiple desirable goals, mimic human mental dynamics, and predict optimal explanations using graphical models.
A related direction is game ToM [139], which leverages concepts like Nash equilibria [138]. de Weerd et al. [154,155] employ a combination of computational agents and Bayesian model selection to determine the extent to which individuals use higher-order ToM reasoning in a particularly competitive game known as matching pennies. Their findings suggest that humans do not primarily employ their high-order ToM abilities. In a case study of the paper-scissors-rock game, Kanwal et al. [156] develop a ToM-based agent capable of using gestures for non-verbal communication. Tejwani [157] formalize a theory of social interactions, encompassing cooperation, conflict, coercion, competition, and trade, by extending a nested Markov decision process (MDP) where agents reason about arbitrary functions of each other's hidden rewards. In a follow-up study, Tejwani [158] expand the reward function to incorporate both physical and social goals. Their method permits more complex behaviors, such as politely hindering or aggressively assisting another agent. Panella and Gmytrasiewicz [159] devise a new computational framework, interactive partially observable Markov decision process (I-POMDP), wherein the agent does not explicitly model the beliefs and preferences of other agents but rather represents them as stochastic processes implemented by probabilistic deterministic finite-state controllers (PDFCs). Using Bayesian inference, the agent updates its belief over the PDFC models of other agents.
Deep learning (DL) is an effective means to approximate complex ToM computations. Aru et al. [69] examine the difficulties associated with applying DL to ToM problems. Although the architectures and learning algorithms are not the ultimate brain-like learning system, they argue that DL remains a solid solution in large-scale tasks and could provide scientific models to aid our comprehension of higher mental functions. They also point out that a problem of existing DL methods is that they take shortcuts rather than learning ToM; the system may learn a much simpler decision rule (see Fig. 12). DL for ToM is explored predominantly with deep reinforcement learning (DRL), wherein the agent's experiences and objectives are intertwined. Usually, the task's reward structure determines what the agent accomplishes and learns. However, in the case of ToM, there may not exist a straightforward cost function or reward structure that would necessitate the emergence of ToM. Crucially, Zhao et al. [160] demonstrate in a multi-agent setting that rewards may simply be a byproduct of ToM, not playing a causal role in establishing effective coordination.

Fig. 9 Triadic belief dynamics in nonverbal communication [144]. In five minds, three sorts of communication events emerge from social interactions (bottom) and causally construct agents' belief dynamics (top). Reproduced from Ref. [144] with permission.

Social communication and cooperation
Computational endeavors in modeling social interaction primarily focus on social communication (both nonverbal and verbal) and cooperation.
Nonverbal communication. Jiang et al. [66] model pointing as a communicative act between agents who have a mutual understanding that the pointed observation must be relevant and interpretable; the act of pointing is an invitation to jointly attend to an object, which elicits mutual inference between agents of each other's minds [67]. The proposed model measures relevance by defining a Smithian value of information (SVI) as the utility gain of a pointing signal. By integrating SVI into rational speech act (RSA), their pragmatic model of pointing permits contextually flexible interpretations. Tang et al. [128] demonstrate that agents can successfully and robustly employ bootstrapping to converge to a joint intention from randomness under an Imagined We framework, leveraging a real-time cooperative hunting task subject to various setting manipulations. Stacy et al. [161] propose a computational account of overloaded signaling from a shared agency perspective, which we refer to as the Imagined We for communication. Within this framework, communication is a means for cooperators to coordinate their perspectives, allowing them to act in concert to achieve shared objectives (see Fig. 13). In a series of simulations, the model performs effectively under growing ambiguity and increasing levels of reasoning, highlighting how shared knowledge and cooperative logic may perform the majority of the heavy lifting in language.
Verbal communication. Studying social communication using natural language in the wild is still challenging. Hence, researchers tend to study verbal communication in a confined domain. Gao et al. [162] devise a novel XAI framework for attaining human-like communication in human-robot collaborations, in which the robot builds a hierarchical mind model of the human user and generates explanations of its own mind as a form of communication based on its online Bayesian inference of the user's mental states. A user study using a real-time human-robot cooking task demonstrates that the generated explanations considerably enhance the collaboration performance and user perception of the robot.
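Since the rational speech act (RSA) framework recurs in the pragmatic models of pointing and signaling discussed above, a minimal sketch of the generic RSA recursion may help. It is the textbook reference game, not the SVI-based pointing model of Jiang et al. [66] or the Imagined We model of Stacy et al. [161]: the three objects, the four signals, the uniform priors, and the rationality parameter `ALPHA` are assumptions chosen for illustration.

```python
# Toy reference game: which object does a signal pick out?
OBJECTS = ["blue_square", "blue_circle", "green_square"]
LEXICON = {  # literal meaning: the set of objects each signal is true of
    "blue": {"blue_square", "blue_circle"},
    "green": {"green_square"},
    "square": {"blue_square", "green_square"},
    "circle": {"blue_circle"},
}
ALPHA = 1.0  # assumed speaker rationality

def normalize(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

def literal_listener(signal):
    """L0: uniform over the objects the signal is literally true of."""
    return normalize(
        {o: 1.0 if o in LEXICON[signal] else 0.0 for o in OBJECTS}
    )

def pragmatic_speaker(obj):
    """S1: prefers signals that lead L0 to the intended object."""
    return normalize(
        {s: literal_listener(s)[obj] ** ALPHA for s in LEXICON}
    )

def pragmatic_listener(signal):
    """L1: Bayesian inversion of S1 under a uniform prior over objects."""
    return normalize({o: pragmatic_speaker(o)[signal] for o in OBJECTS})

interpretation = pragmatic_listener("blue")
print({o: round(p, 3) for o, p in interpretation.items()})
```

The literal listener splits "blue" evenly between the two blue objects, whereas the pragmatic listener favors the blue square (0.6 versus 0.4), reasoning that a speaker who meant the blue circle would more likely have said "circle". Replacing the literal-informativeness term in `pragmatic_speaker` with a task-dependent utility gain is, roughly, where a relevance measure such as SVI would enter.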
Cooperation. Cooperative tasks demand stronger ToM reasoning in social communication. The notion of ToM-based communication, which chooses information-sharing actions based on relevance and estimation of human beliefs [163], tackles the question of when and what type of information humans require. Wang et al. [164] introduce ToM to build socially intelligent agents that can communicate and cooperate effectively to accomplish challenging tasks. These agents determine when and with whom to reveal their intentions and sub-goals based on the inferred mental states of others. Pöppel et al. [165] study how efficient, automatic coordination mechanisms at the level of mental states (intentions, objectives), also known as belief resonance, may lead to collaborative situated problem-solving. They describe a model of hierarchical active inference for collaborative agents (HAICA) that blends Bayesian ToM with a perception-action system based on predictive processing and active inference. Belief resonance is realized by allowing the inferred mental states of one agent to influence another agent's prediction beliefs regarding its own goals and intentions, hence influencing the agent's task behavior without explicit collaborative reasoning.

Social robot and cognitive architectures
Social robotics is an interdisciplinary research field that requires comprehensive studies of social perception, ToM, and social interaction. We expect a social robot to be endowed with cognitive and affective capabilities, in order to comprehend the feelings, intentions, and beliefs of human agents, which are not only directly expressed by the user but also shaped by bodily cues (e.g., gaze, posture, facial expressions) and vocal cues (e.g., vocal tones and expressions) [166]. A social robot is expected to (1) develop adaptive behavioral models [167], (2) be socially adept, (3) establish a natural, fluent, and effective human-like communication and interaction with humans [168], (4) establish empathetic relationships with humans and be perceived as a teammate or a colleague rather than a tool, (5) offer proactive and parental help based on the observations and understanding of the human situation, and (6) build trust with humans [45]. Understanding robots' decisions promotes the growth of trust and is crucial for facilitating contact between humans and social robots [63].
However, there are still many obstacles to overcome before constructing an ideal social robot [167]. It is difficult to incorporate behavioral adaptation techniques, cognitive architectures, persuasive communication strategies, and empathy into a single solution for understanding nonverbal phenomena in social interactions, as contexts are constantly changing. A common limitation of current research is that researchers have focused on a particular aspect of a social robot, such as (1) emphasizing a communication strategy, (2) studying a particular behavior as a response to human action, or (3) conducting experimental studies that include only partial factors.
Cognitive architecture. A cognitive architecture, as a software implementation of a general theory of intelligence, is not a single algorithm or method tackling a particular problem; rather, it is the task-independent infrastructure that learns, encodes, and applies knowledge to produce behavior [169]. One of the challenges in cognitive architecture design is to create a sufficient structure to support coherent and purposeful behavior, while at the same time providing sufficient flexibility to adapt to the specifics of its tasks and environment. ASI in robotic agents relies heavily on the construction of cognitive architecture, which involves both abstract models of cognition and software instantiations of such models [170]. Researchers are working on developing cognitive architectures that approach a fully cognitive state, embedding mechanisms of perception, adaptation, and motivation [171]. Next, we briefly introduce three of the most common cognitive architectures.

LIDA. The learning intelligent distribution agent (LIDA) cognitive architecture [172] is an integrated artificial cognitive system that models a broad spectrum of biological cognition, from low-level perception and action to high-level reasoning. Two hypotheses underlie the LIDA architecture and its corresponding conceptual model: (1) much of human cognition functions through cognitive cycles, which are interactions between conscious contents, memory systems, and action selection that occur frequently (about 10 Hz), and (2) cognitive cycles serve as the cognitive atoms of which higher-level cognitive processes are composed.
Soar. The Soar cognitive architecture [173] is composed of interacting task-independent modules, including short-term and long-term memories, processing modules, learning mechanisms, and interfaces between them. Since Soar hypothesizes that sufficient regularities exist above the neural level to capture the functionality of the human mind, the majority of knowledge representations in Soar are symbol structures, with architecturally maintained numeric metadata biasing the retrieval and learning of those structures [169]. Soar also facilitates non-symbolic reasoning via the spatial visual system, an interface between perception and working memory.
ACT-R. The adaptive control of thought-rational (ACT-R) architecture [174,175] includes modules such as (1) a visual module for identifying objects in the visual field, (2) a manual module for controlling the hands, (3) a declarative module for retrieving information from memory, (4) a goal module for tracking current goals and intentions, and (5) a central production system to coordinate these modules. There are buffers within each module that transmit information back and forth to the central production system. The architecture assumes a mixture of serial and parallel processing.
Cognitive architectures in social robots. We now discuss some notable works that implement various cognitive architectures in social robots. Wiltshire et al. [168] discuss the problem of engineering human social-cognitive mechanisms to enable robot social intelligence and provide an integrative perspective of social cognition as a systematic theoretical underpinning for computational instantiations of these mechanisms. They also provide a series of recommendations to facilitate the development of the perceptual, motor, and cognitive architecture. Breazeal et al. [176] provide an integrated sociocognitive architecture (see Fig. 14) to endow an anthropomorphic robot with the ability to infer mental states such as beliefs, intents, and desires from the observable behavior of its human partner via simulation-theoretic techniques. Kennedy et al. [177] describe an approach known as a like-me simulation, in which the agent uses its own knowledge and capabilities as a model of another agent to predict that agent's actions. They present three examples of a like-me mental simulation in a social context implemented in the embodied version of the ACT-R cognitive architecture, ACT-R Embodied (ACT-R/E), including perspective taking, teamwork, and dominant-submissive social behavior. Moulin-Frier et al. [178] propose the DAC-h3 architecture, which incorporates a reactive interaction engine, a number of state-of-the-art perceptual and motor learning algorithms, planning capabilities, and an autobiographical memory. The architecture as a whole drives the robot's behavior to solve the symbol grounding problem, acquire language capabilities, perform goal-oriented behavior, and articulate a verbal narrative of its own experience in the world. Franchi et al. [179] present a brain-inspired architecture, the intentional distributed robotic architecture (IDRA), which aims to permit the autonomous development of new goals in situated agents beginning with simple hard-coded instincts.

Recent advances in datasets and environments
Datasets and environments are essential for developing modern AI. The past few years have witnessed a significant boom in such resources for ASI. In this section, we provide a brief review of recent notable works.
No previous dataset or benchmark has systematically analyzed physically grounded perception of complex social interactions that extend beyond short actions (e.g., high-five) or simple group tasks (e.g., gathering), until Netanyahu et al. [180]. They assemble a collection of physically-grounded abstract social events (PHASE) that simulates a wide variety of real-world social interactions by incorporating social concepts, such as helping another agent. PHASE consists of 2D animations of agent pairs, moving in continuous space with multiple objects and landmarks, generated procedurally by a physics engine and a hierarchical planner.
Inspired by intuitive psychology, Shu et al. [181] present a benchmark consisting of a large dataset of procedurally generated 3D animations, Action, Goal, Efficiency, coNstraint, uTility (AGENT), structured around four scenarios (goal preferences, action efficiency, unobserved constraints, and cost-reward tradeoffs) that probe key concepts of core intuitive psychology.
Puig et al. [182] introduce watch-and-help (WAH), a challenge for testing social intelligence in agents, wherein an AI agent is tasked to help a human-like agent perform a complex household task efficiently. They build VirtualHome-Social, a multi-agent household environment, and provide a benchmark including both planning and learning-based baselines.
Sap et al. [183] propose a dataset to evaluate language-based commonsense reasoning about social interactions, including reasoning about motivation and about emotional reactions [94].
Bard et al. [184] propose the cooperative and imperfect information card game, Hanabi, as a challenging benchmark.It requires reasoning about the beliefs and the intentions of other players, focusing on the ad-hoc setting where an agent has to coordinate with a team they encounter for the first time.

Evaluation protocols
The evaluation of social intelligence is arguably the most challenging problem in developing ASI. To answer the question "are we at least making progress towards ASI?", we need an account of how the social intelligence of machines should be measured [185]. The evaluation can aid in testing and training computational models [94].
However, the formation of universally accepted criteria for the design and implementation of ASI benchmarks and the accompanying evaluation protocols is still in its infancy and represents a significant barrier to the field's continued progress. Because human judgments can be ambiguous and difficult to express, many social intelligence tasks do not include requirements that can be easily captured using hand-crafted rules.
Hence, a balanced benchmark should likely involve humans evaluating the performance of algorithms. Existing approaches for assessing social intelligence in humans continue to have shortcomings [64].
The Turing test is a test of a machine's ability to exhibit intelligent behavior equivalent to or indistinguishable from that of a human. However, current systems that perform well on such tests typically do so by employing techniques that do not generalize to other problems. Other approaches for assessing social intelligence competency are often derived from various sources, such as peer-/superior-/self-ratings and observers' behavioral assessments [63,186]. Notably, the Animal-AI Olympics [187] tests artificial agents on tasks derived directly from animal cognition research in an effort to establish common ground.
Typically, the evaluation of ToM in DL is based on the performance of a task; however, this approach is problematic since DL systems may exploit shortcuts: they learn to employ simpler decision rules rather than recovering the underlying ToM [69]. An important aspect of ASI evaluation is to measure cognitive skills, adaptability, and meta-level learning and reasoning ability rather than specific problem-solving ability [126]. Probing more abstract cognitive processes, such as the ability to (1) transfer information from one domain to another, (2) retain information for extended periods, and (3) correct errors in performance, may be an effective future strategy for assessing ASI [185].

Future trends
In this section, we discuss future trends in ASI. We hope these four directions inspire future works in ASI.
A holistic approach. Cognitive and neuroscience research [188] shows that while distinct brain regions are involved in specific tasks, a core network is involved in all ToM tasks, suggesting that humans take a more holistic approach to social intelligence than existing computational models, which often focus on a single aspect of the problem. Multidisciplinary study spanning psychology, neurology, cognitive science, computer science, statistics, and mathematics could accelerate future progress.
Learning methods. Infants develop intelligence gradually [189]. This suggests that learning, and in particular lifelong/continuous learning [190], is a crucial path for developing ASI. The objective of lifelong/continuous learning is to successively learn a model for a large number of activities without forgetting the knowledge acquired from the previous tasks. Other potentially effective learning strategies include multi-task learning [191,192], one-/few-shot learning, and meta-learning [193].
Open-ended and interactive environment. Infants live in a physical world, full of rich regularities that organize perception, action, and ultimately thought [189]. Infants' intelligence is dispersed across their interactions and experiences with the physical world, which serves to stimulate the development of higher mental functions. In addition, infants behave and learn in a social environment where more experienced partners facilitate learning and provide support. An important aspect of human infants' learning is that they explore; they move and act in extremely unpredictable, random, and non-goal-directed ways. During exploration, they uncover new issues and solutions, and exploration makes intellect open-ended and inventive. Open-endedness departs from the single-task paradigm to an unbounded number of tasks, or even no task at all, simply a world with different possibilities. Open-ended environments could provide a fruitful playground where agents coordinate, cooperate, and compete to solve tasks, and learn strategies similar to those underlying social intelligence in humans, and even more complex behavior [69].
Human biases. The development of social intelligence demands an open-ended setting, yet ToM-like skills would not spontaneously "pop out" from AI agents playing in such contexts [69]. We must also introduce better biases, even structural biases, as a form of built-in common sense, as there may be multiple biases and limits in the human brain that facilitate the acquisition of social intelligence. For instance, there may be innate biases of attention to the human face, speech, hands, eyes, gaze direction, and biological motion, and these early biases ensure that the infant learns about the components of the world that provide information about the minds of other people. These biases could be hard-coded, evolve from interactions with other agents, or be taught by humans.

Conclusion
Although there have been significant advances in AI research, we are still a long way from obtaining human-level intelligence. ASI is a crucial missing component for artificial general intelligence (AGI) on par with humans and symbolizes the future path of AI. Acknowledging ASI as a distinct research area will enhance the field's awareness and encourage academics to discuss and investigate the topic's challenging problems. As one of the most promising subfields in AI, ASI requires more theoretical and computational work from the AI community.

Fig. 5 A unified theory that captures the interconnections between the perception of physical and social events. Reproduced from Ref. [10] with permission of Elsevier Inc., © 2021.

Fig. 6 Perception of human interaction from motion trajectories. The bottom depicts the motion trajectories. The colored bars in the middle represent the temporal parsing of the trajectories in terms of the sub-interaction types (S). The top row depicts the change within a conditional interactive field (CIF) in sub-interactions as the interaction progresses, where the CIF represents the expected relative motion pattern conditioned on the motion of the reference agent. Reproduced from Ref. [133] with permission.

Fig. 10 Experimental scenario and model schema for rational quantitative attribution of beliefs, desires, and percepts in human mentalizing [148]. (a) In the experimental scenario, the agent leaves their office where they can see the K truck (Frame 1). Next, the agent walks past it to the opposite side of the building, where the L truck is parked (Frame 2). Finally, the agent returns to the K truck (Frame 3). The bar charts illustrate the model and human prediction of the agent's utility (i.e., which truck is the agent's preference) and belief (i.e., which truck the agent initially believed to be parked on the other side of the building). (b) The folk-psychological schema for ToM, formulated as a generative action model based on the solution of a POMDP. In this generative model, mentalizing is formulated as Bayesian inference about unseen variables (beliefs, desires, perceptions) conditioned on observed actions. Reproduced from Ref. [148] with permission of Nature Publishing Group, © 2017.

Fig. 11 Bidirectional human-robot value alignment [153]. In a collaboration task, the values (the significance of various goals) are represented by pie charts. In each interaction round, the machine receives signals from the physical environment and processes observations to generate an abstract environment state. Next, the machine offers the processed map together with movement proposals and explanations to human users, who accept or reject the proposals according to the given human values and the current state of the map. Finally, the machine updates its estimation of human values based on the user's feedback and takes action based on the new values. Reproduced from Ref. [153] with permission.

Fig. 13 Imagined We for Communication. Stacy et al. [161] extend Imagined We for communication by developing a novel utility calculus of a signal based on shared agency ToM and interactions in the physical world. By integrating with RSA, their model effectively recognizes that (1) signals change each Imagined We, (2) minds produce predictable and rational joint actions under ToM reasoning, and (3) actions have well-defined expected utilities, derived through joint planning.

Fig. 14 System architecture incorporating simulation-theoretic mechanisms as a foundational and organizational principle [176]. The two concentric bands represent two distinct operational modes. In generation mode (the light band), the robot builds its own mental states in order to behave intelligently in the environment. In simulation mode (the dark band), the robot constructs and represents its human collaborator's mental states by monitoring their behavior and adopting their mental perspective. Reproduced from Ref. [176] with permission of SAGE Publications, © 2009.