Reinforcement learning (RL)
Reinforcement learning attempts to solve the problem of learning from interaction to achieve an objective. An agent starts by observing the current state s of the environment, then performs an action on the environment, and later receives a feedback r from the environment. The received feedback is also called a reinforcement or reward. Agents aim to maximize the cumulative reward they receive in the end [7].
There are three well-known, fundamental classes of algorithms for solving the reinforcement learning problem, namely dynamic programming, Monte Carlo, and temporal-difference (TD) learning methods [7]. Unlike the other approaches, TD learning algorithms can learn directly from experience without an accurate model of the environment (contrary to dynamic programming). Moreover, whereas Monte Carlo algorithms must wait until the end of an episode to update the value function (only then is the return r known), TD algorithms only need to wait until the next time step; TD algorithms are thus incremental in a systematic sense (contrary to Monte Carlo methods) [7].
One of the most widely used TD algorithms is the Q-learning algorithm. Q-learning works by learning an action-value function based on the interactions of an agent with the environment and the instantaneous reward it receives. For a state s, the Q-learning algorithm chooses an action a to perform such that the state-action value Q(s, a) is maximized. If performing action a in state s produces a reward r and a transition to state s′, then the corresponding state-action value Q(s, a) is updated accordingly. State s is then replaced by s′ and the process is repeated until reaching the terminal state [7]. The detailed mathematical foundation and formulation, as well as the core algorithm of Q-learning, can be found in [8] and is therefore not repeated here.
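The interaction loop described above can be sketched with the standard tabular Q-learning update rule. The state/action encoding, learning rate, and discount factor below are illustrative assumptions for the example, not parameters taken from [7, 8]:

```python
from collections import defaultdict

# Standard tabular Q-learning TD update: move Q(s, a) toward the
# target r + gamma * max_a' Q(s', a'). The states, actions, and the
# values of alpha and gamma here are illustrative assumptions.

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One TD step for the state-action value Q(s, a)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)          # unvisited pairs default to 0
actions = ["stay", "move"]
q_update(Q, s=0, a="move", r=1.0, s_next=1, actions=actions)
```

After this single step with all values initially zero, Q(0, "move") becomes alpha * r = 0.1; repeated updates propagate reward back through the state space.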
Q-learning is an attractive learning method because of the simplicity of its computational demands per step and because of its proof of convergence to a global optimum, avoiding all local optima, as long as the Markov decision process (MDP) requirement is met; that is, the next state depends only on the current state and the taken action (it is worth noting that the MDP requirement applies to all RL methods) [9].
Clearly, a MAS can be an uncertain environment, and the environment may change at any time. Reinforcement learning explicitly considers the problem of an agent that learns from interaction with an uncertain environment in order to achieve a goal. The learning agent must discover which actions yield the most reward via trial-and-error search rather than being told which actions to take, as in most forms of machine learning. It is this special characteristic of reinforcement learning that makes it a naturally suitable learning method for trust-evaluating agents in MASs. Furthermore, the suitability of RL can also be seen if we note that a TR observes the TEs, selects a TE, and receives the service from that TE. In other words, these agents get some input, take an action, and receive some reward. Indeed, this is the same framework used in reinforcement learning, which is why we chose reinforcement learning as the learning method for our proposed agent model [10].
A wide variety of trust and reputation models have been developed in the literature. Different dimensions to classify and characterize computational trust and reputation models were presented in different surveys [11]. One of the most cited and used general purpose surveys is the one developed by Sabater and Sierra [12]. Their dimensions of analysis were prepared to enhance the properties of the Regret model [13] and show basic characteristics of trust models [11]. The dimensions proposed in [12] are:
- Paradigm type, where models are classified as cognitive or numerical. The numerical paradigm includes models that do not have any explicit representation of cognitive attitudes to describe trust. The cognitive paradigm, on the other hand, includes models in which the notion of trust or reputation is built on beliefs and their degrees [11].
- Information sources, where a set of models use direct experiences while other models use third-party testimonials from other agents in the same environment, referred to as witnesses. Yet others depend on the analysis of social relations among the agents [11].
- Visibility, where the trust information of an agent is considered either a private property that each agent builds or a global property that all other agents can observe.
- Granularity, which refers to the context-dependence of trust and reputation models.
- Cheating behavior, which refers to the models' assumptions regarding information from witnesses. According to [12], a model may assume that witnesses are honest, that witnesses may hide information but never lie, or, alternatively, that witnesses can be cheaters.
- Type of exchanged information, where information is assumed to be either boolean or continuous estimations.
In addition to those dimensions, [11] added the procedural dimension to reflect whether or not a bootstrap mechanism is embedded within the model. Furthermore, they introduced the generality dimension to distinguish models that are general purpose from ones that focus on very particular scenarios.
From an architectural point of view, [11, 14-16], among others, differentiated between centralized and distributed models. Typically, a central entity manages the reputation of all agents in the first approach, whereas each agent performs its trust estimations without a central entity in the second approach [16]. Decentralized systems may use flat or hierarchical architectures. Generally speaking, decentralized models are more complex than centralized models, but centralized models are subject to a single point of failure and performance bottlenecks [17], and need powerful and reliable central entities and communication bandwidth [16].
The witness locating dimension was described in [14] to reflect whether or not a mechanism is embedded within the model for locating third-party information sources. Few general purpose trust models, such as FIRE [15], use a distributed approach for this purpose.
This section reviews a selection of decentralized models and is not meant to provide a comprehensive survey of the trust modeling literature in MASs. Recent surveys such as [2, 11, 14] provide a more detailed overview of the literature.
A decentralized, subjective, RL based trustworthiness estimation model for buying and selling agents in an open, dynamic, uncertain and untrusted e-marketplace is described in [18] and further elaborated in [7]. This model is based on information collected from past direct experiences, where buyers model the trustworthiness of sellers as trustworthy, untrustworthy or neutral. A buying agent chooses to purchase from a trustworthy seller. If no trustworthy seller is available, then a seller from the list of non-untrustworthy sellers is chosen. The seller's trustworthiness estimation is updated based on whether the seller meets the expected value for the demanded product with proper quality. The update process is maintained after comparing the obtained information about the reliability of a seller against the obtained product quality from the same seller. The model described in [7, 18] uses certain thresholds to categorize TEs into trustworthy and untrustworthy agents. TRs do not interact with the untrustworthy TEs, and among the trustworthy ones, TEs with the highest values are selected as interaction partners. The information is based on the TR's personal interaction experience, so a newly entering TR lacks knowledge about the different TEs. This model has limited applicability if repeated transactions between traders are rare [19].
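The threshold-based partner selection just described can be sketched as follows. The threshold values and the trust scores are assumptions made for illustration; [7, 18] define their own parameters:

```python
# Illustrative sketch of threshold-based partner selection as in [7, 18]:
# prefer the highest-valued trustworthy seller, fall back to the best
# non-untrustworthy (neutral) seller, and never pick an untrustworthy one.
# The thresholds 0.5 and -0.5 are assumed values for this example.

def select_seller(trust, upper=0.5, lower=-0.5):
    """Return the chosen seller id, or None if all are untrustworthy."""
    trustworthy = {s: v for s, v in trust.items() if v >= upper}
    if trustworthy:
        return max(trustworthy, key=trustworthy.get)
    neutral = {s: v for s, v in trust.items() if v > lower}
    if neutral:
        return max(neutral, key=neutral.get)
    return None

print(select_seller({"s1": 0.7, "s2": 0.2, "s3": -0.8}))  # → s1
```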
A decentralized, subjective extension to the model used in [18] is described in [20, 21] to enable indirect trustworthiness estimation based on third-party witnesses in an e-marketplace. In this model, witnesses are partitioned into trustworthy, untrustworthy and neutral sets to address buyers' subjectivity in opinions. However, the authors did not present any experimental results to justify their theoretical approach [22].
A combined model based on direct and indirect trustworthiness estimation is described in [23] as an extension to [20, 21], where advising agents are partitioned into trustworthy, untrustworthy and neutral sets. To address the subjectivity of witnesses, the mean of the differences between the witness's trustworthiness estimation and the buyer's trustworthiness estimation of a seller is used to adjust the advisor's trustworthiness estimation for that seller. Nevertheless, how witnesses are located was not specified.
All these RL based models [18, 21] and [22] classify TEs into three non-overlapping sets, namely trusted, distrusted, and neutral, also referred to as neither trusted nor untrusted [18]. Similarly, the computational model of [24] classifies TEs as trusted, distrusted or untrusted, where the last means neither trusted nor distrusted. The authors extended the model of [25] and enriched it with the use of regret and forgiveness to aid TRs in classifying TEs. Their trust estimation at time instance t+1 depends on the current estimation (at time t) and a forgiveness function. The forgiveness function, in turn, depends on both the regret the TR has because it trusted a TE that did not fulfill its needs, and the regret that the corresponding TE expresses, if any, for not satisfying the demand of the TR. However, this latter factor, the regret of the TE, can be misleading if the TE communicates a false regret value. The model does not use third-party witnesses.
TRAVOS [26] is a well-known decentralized, general purpose trustworthiness estimation model for MASs. The model depends on beta distributions to predict the likelihood of honesty of a TE [19]. If the TR's confidence in its evaluation is below a predefined threshold, the TR seeks advice from witnesses. In TRAVOS, trust is a private and subjective property that each TR builds. The model computes the reliability of witnesses via the direct experience of interaction between the TR and the witnesses, and discards inaccurate advice [4]. Unfortunately, it takes a certain amount of time for the TR to recognize the inaccuracy of the reports provided by previously trusted witness agents [4]. The model does not describe how witnesses are located.
Regret [13] is a decentralized, general purpose trustworthiness estimation model for open MASs that takes into account direct experiences, witness information and social structures to calculate trust, reputation and levels of credibility [11]. The model assumes that witnesses are willing to cooperate, depends on social networks among agents to find witnesses, and uses a set of fuzzy rules to estimate the credibility of witnesses and therefore of their testimonies [2]. Even though the model heavily depends on agents' social networks, it does not show how TRs may build them [15].
Yu and Singh [27] presented a decentralized, subjective trustworthiness estimation model where a TR estimates the trustworthiness of a TE using both its own experience and advice from witnesses. The model uses social network concepts in MASs, incorporating a TrustNet structure for each agent. Each agent in the system maintains a set of acquaintances and their expertise, and its set of neighbors is a subset of the acquaintances set. The model locates witnesses based on individual agents' knowledge and help through each agent's neighbors, without relying on a centralized service [15]. Thus, when looking for a certain piece of information, an agent can send the query to a number of its neighbors, who will try to answer the query if possible or refer the requester to a subset of their neighbors [15]. The requester considers the information only if the referred witnesses are within a limit in the social tree [11]. To address the subjectivity of witnesses, agents model their acquaintances' expertise [27]. An agent's expertise is then used to determine how likely it is to have interacted with, or to know witnesses of, the target agent [15]. The model uses Dempster-Shafer evidence theory to aggregate the information from different witnesses.
FIRE [15] is a well-known, general purpose, decentralized trustworthiness estimation model for open MASs. The model takes into account multiple sources of information. FIRE categorizes trust components into four categories: direct experience, called Interaction Trust; Witness Reputation; Role-based Trust; and Certified Reputation. The model assumes that witnesses are honest and willing to cooperate, and uses weighted summation to aggregate the trust components [15]. In FIRE, trust is a private and subjective property that each TR builds [11].
To locate witnesses, FIRE uses a variant of the referral system used by [27], but does not model witnesses' expertise the same way as in [27]. Instead, FIRE assumes that addressing the subjectivity of witnesses is application dependent. To address resource limitations, the branching factor and the referral length threshold parameters are used: the first limits the breadth of the search and the second limits its depth [15]. An important aspect of FIRE is that a TR does not create a trust graph, as in [27], and therefore may quickly evaluate the TE's trust value using a relatively small number of transactions [4]. Unfortunately, if a TE proposes some colluding referees for certified reputation, this source of information can be misleading to the TR [4].
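FIRE's weighted summation of its four trust components can be sketched as below. The component names follow the text; the weights and values are illustrative assumptions, and FIRE itself additionally weights individual ratings (e.g. by recency), which this sketch omits:

```python
# Minimal sketch of weighted-summation aggregation as used by FIRE [15].
# Component weights and values are assumed for illustration only.

def combined_trust(components, weights):
    """Weighted mean of the available trust components; components
    with no value yet (None) are skipped along with their weights."""
    num = den = 0.0
    for name, value in components.items():
        if value is None:
            continue
        num += weights[name] * value
        den += weights[name]
    return num / den if den else None

components = {"interaction": 0.8, "witness": 0.6, "role": None, "certified": 0.4}
weights = {"interaction": 2.0, "witness": 1.0, "role": 0.5, "certified": 0.5}
print(combined_trust(components, weights))  # (1.6 + 0.6 + 0.2) / 3.5
```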
To address the subjectivity of witnesses, most models allow a TR to evaluate witnesses simply based on their deviations from its own opinions [28]. Most trustworthiness estimation models also allow the communication of trustworthiness information regardless of its context, even though trustworthiness estimations are context-dependent [28]. A functional ontology of context for evaluating trust (FOCET), a context-aware trustworthiness estimation model for multi-agent systems, is described in [28] to address the case where a TE may offer different kinds of services in different contexts. These contexts might be totally different or have some features in common.
To measure the effect of a particular context, two individual metrics were defined: 1) a weight matrix (WM) that includes the importance level of each feature of the context; and 2) a relevancy matrix (RM) that indicates the degree of similarity of each feature in the first context with the corresponding one in the second context. The WM is a 1 × n matrix, where n is the number of FOCET context features and matrix entry w_i is in [0, 1]. The RM is an n × 1 matrix where matrix entry v_i is in [0, 1] and refers to the degree of similarity of the corresponding feature. For example, an agent B1 may consider the "fast-enough" delivery of a specific transaction very important and use 0.9 as the corresponding value in its WM. Similarly, another agent B2 may also consider the "fast-enough" delivery of another transaction very important and use 0.9 as the corresponding value in its WM. However, for B1, "fast enough" means within one week, whereas for B2 it means within one day. Therefore, B1 will use a lower value for the "delivery time" feature in its RM (e.g. 0.2), whereas B2 will use a higher value for it (e.g. 0.7).
Given the WM and RM matrices, the influence of the context of the direct interaction between the witness agent and the TE, in which the trustworthiness of the TE is estimated, known as the context effect factor (CEF), is computed in [28] by
$$ CEF=\frac{\sum_{i=1}^{n}\left((1-w_{i})+w_{i}v_{i}\right)}{n} $$
(1)
TRs subjectively specify a degree of decay p (0 ≤ p ≤ 1), based on their policies, in order to adaptively reduce the influence of old trustworthiness estimation information. Time decay is used to emphasize that the utility gain (UG) from a recent transaction weighs more than the UG from an old transaction of the same absolute value [28]:
$$ CEF^{\prime}=e^{-p\Delta t}\,CEF $$
(2)
where Δt indicates the time elapsed since the previous interactions took place and can be determined according to the temporal factor concept in FOCET.
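Equations (1) and (2) translate directly into code. The feature weights and relevancy values below are assumed for illustration:

```python
import math

# Direct implementation of Eqs. (1) and (2) from [28].
# The example WM/RM entries are assumptions, not values from the paper.

def cef(w, v):
    """Eq. (1): mean over the n features of (1 - w_i) + w_i * v_i."""
    assert len(w) == len(v)
    return sum((1 - wi) + wi * vi for wi, vi in zip(w, v)) / len(w)

def decayed_cef(w, v, p, dt):
    """Eq. (2): exponentially decay CEF by elapsed time dt, rate p."""
    return math.exp(-p * dt) * cef(w, v)

w = [0.9, 0.3]          # WM: importance of each context feature
v = [0.2, 1.0]          # RM: cross-context similarity of each feature
print(cef(w, v))        # ((0.1 + 0.18) + (0.7 + 0.3)) / 2 = 0.64
```

Note that a feature with low importance (small w_i) pushes its term toward 1, so only important, dissimilar features reduce the CEF.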
A common issue with RL based trust models is that TRs do not quickly recognize environment changes and adapt to new settings. To address this shortcoming, DTMAS uses a technique similar to the regret described in [24] in order to improve the model's response to dynamic changes in the system. We propose suspending the use of a TE as a response to an unsatisfactory transaction with that TE. The suspension is temporary, and its period increases as the transaction importance increases. Similar to regret [24], it represents an immediate reaction of a TR when it is not satisfied with an interaction with a TE. However, unlike the use of regret in [24], TRs do not depend on any expressed feeling of sorrow from TEs. Furthermore, suspension is more aggressive: there are no interactions with a suspended TE as long as it is suspended. On the other hand, while forgiveness is used to reduce the effect of regret in [24], our suspension decays with time only. This avoids being misled by false feelings of sorrow expressed by the TE, allows the effect of accidental misbehavior to phase out, and magnifies the effect of a behavior change as testimonies from witnesses are aggregated. Furthermore, DTMAS integrates the context-dependency of third-party testimonies [28] together with reinforcement learning for MASs. Similar to [13, 15, 27], DTMAS defines a way to locate a set of witnesses to consult for indirect trust information. Like [15], a TR using DTMAS does not create a trust graph, as in [27], and therefore may quickly evaluate the TE's trust value using a relatively small number of transactions. Unlike existing trust models, DTMAS uses a semi-hierarchical structure for MASs, coupled with the notion of the "small world" [5], to help reduce the communication overhead associated with locating witnesses.
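The suspension mechanism can be sketched as follows. The text only states that the suspension period grows with transaction importance and wears off with time alone; the linear base period and the simple expiry rule below are our own illustrative assumptions, not the DTMAS formulas:

```python
# Hypothetical sketch of DTMAS-style suspension. The base period and
# the linear scaling by importance are assumed for illustration.

class Suspension:
    def __init__(self, base=10.0):
        self.base = base      # time units of suspension per unit importance
        self.until = {}       # TE id -> time at which the suspension expires

    def suspend(self, te, importance, now):
        """Unsatisfactory transaction: the suspension period grows
        with the importance of the failed transaction."""
        self.until[te] = now + self.base * importance

    def is_suspended(self, te, now):
        """Suspension decays with time only; no expressed regret from
        the TE can lift it early. Unknown TEs are not suspended."""
        return self.until.get(te, now) > now

s = Suspension()
s.suspend("te1", importance=0.5, now=0.0)   # 5-unit suspension
print(s.is_suspended("te1", now=3.0))       # True: still suspended
print(s.is_suspended("te1", now=6.0))       # False: suspension expired
```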