Introduction
Motivation for the current study
- Researchers, past and present, have developed programs for successful IRL implementation by targeting one or two identified problems that may or may not be important in present real-time scenarios [8‐11]. The current study fills this research gap by prioritizing the IRL implementation problems using fuzzy AHP.
- The past literature shows that IRL’s theoretical background, including its problems and solutions, has not been disclosed comprehensively by researchers.
- Different researchers have proposed different solutions for the IRL problems, and these solutions have not been properly organized and analyzed in the past [12‐16]. The current study analyzes and ranks the solutions using the fuzzy TOPSIS method, helping decision-makers to make decisions by targeting the prioritized solutions for IRL problems.
Contributions of the paper
- This is the first study that uses the fuzzy AHP approach to rank the IRL implementation barriers/problems.
- This is the first study that uses a fuzzy TOPSIS approach to rank the solutions that can overcome the IRL implementation problems.
- To the best of our knowledge, no other research reports the scope and analysis of the current study on IRL barriers and their solutions. The proposed hybrid fuzzy AHP–TOPSIS study can be a torchbearer for researchers seeking to understand the barriers and their solutions in the IRL field.
- The experts’ opinions on IRL have been collected in the form of linguistic scales for fuzzy AHP and fuzzy TOPSIS implementation. The current study uses fuzzy MCDM methods because they are capable of handling vagueness and uncertainty in decision-makers’ judgments.
- The results of the current study can be beneficial to the software companies, industries, and governments that are using reinforcement learning in real-time scenarios.
- The results show that the most important solution is ‘Supports optimal policy and rewards functions along with stochastic transition models’, and the most significant problem that should be addressed during IRL implementation is ‘lack of robust reward functions’.
Hybrid fuzzy AHP–TOPSIS approach performance aspects in IRL
- Traditional IRL methods are unable to estimate the reward function when no state-action trajectories are available. The hybrid approach helps in looking for solutions to this problem. Let us illustrate the issue with an example: “Person A can go from position X to position Y by any route, and different routes pass through different scenery. Person A has specific preferences about the scenery along the way. Supposing the routing time is known, can we predict person A’s scenery preferences?” [17]. This is a classical IRL problem with a large problem size, or large state space. The fuzzy AHP approach of the current study has weighted and ranked this problem, the lack of scalability with large problem sizes or large state spaces, highly. The fuzzy TOPSIS of the current study has focused on solving such problems through the solution “Support multiple rewards and non-linear reward functions for large state spaces”. New algorithms that support non-linearity for large state spaces or problem sizes can then estimate reward functions properly (the sketch after this list illustrates such a non-linear reward), and this also motivates researchers and companies to develop new algorithms that solve such problems in better ways.
- Feature expectation is another issue with IRL; it concerns the quality evaluation or assessment of the reward function (the sketch after this list shows how feature expectations are computed from demonstrations). The fuzzy AHP approach of the current study has ranked the related issue, “Lack of robust reward functions”, as the number-one issue, and the fuzzy TOPSIS approach has advocated the solution “Supports optimal policy and rewards functions along with stochastic transition models”, which is also the top-ranked solution for this issue. The advocated solution calls for building algorithms that handle robust reward functions.
- Using the results of the hybrid approach, the failure rate of IRL projects in software companies and manufacturing industries can be reduced.
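Two notions from the bullets above can be made concrete in code: the empirical feature expectations that a learned reward function must reproduce, and a non-linear reward over state features for large state spaces. The following is a minimal illustrative sketch; the feature map, network weights, and trajectories are hypothetical and not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(state):
    # Hypothetical 3-dim feature map for states on a 5-column grid.
    return np.array([state % 5, state // 5, 1.0])

def feature_expectations(trajectories, gamma=0.9):
    """Discounted empirical feature counts: mu_E = (1/N) sum_i sum_t gamma^t phi(s_t).
    Matching mu_E is the quality check behind the feature-expectation issue."""
    mu = np.zeros(3)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)

# A small tanh network over phi(s): one way to "support non-linear reward
# functions for large state spaces" (weights here are random placeholders).
W1, b1, w2 = rng.normal(size=(4, 3)), np.zeros(4), rng.normal(size=4)

def nonlinear_reward(state):
    return float(w2 @ np.tanh(W1 @ phi(state) + b1))

demos = [[0, 1, 6, 11], [0, 5, 10, 11]]        # two hypothetical expert paths
print(feature_expectations(demos))             # target the learner must match
print([round(nonlinear_reward(s), 3) for s in demos[0]])
```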
Literature review
Problems in the implementation of IRL
Code | Problem | References | Major contributions | Limitations
---|---|---|---|---
P-IRL1 | Lack of robust reward functions | | Single or multiple reward functions are used to produce and improve the likelihood of the expert's demonstrated trajectory | Applying the reward function becomes difficult for large, high-dimensional problems with unknown dynamics
P-IRL2 | Imperfect and noisy inputs | | Imperfect and noisy inputs gave rise to Gaussian IRL and Bayesian IRL, which can further help to deal with multitasking settings. To deal with perturbed demonstrations based on such inputs, probabilistic frameworks such as GPIRL, MLIRL, BIRL, and MaxEntIRL came into existence | IRL implementation presumes that the expert’s demonstrations (based on inputs) are optimal, but in practice this is not the case; IRL implementations face difficulties and may not perceive the full demonstration trajectory
P-IRL3 | Stochastic policy | | Stochastic policies are more robust than deterministic policies in two areas. The first is stochastic environments, where the action is selected according to a learned probability distribution. The second is partially observable states, where states are partially hidden and the stochastic policy accounts for the uncertainty of states while taking an action | Transitions of states and actions require precise modeling of stochasticity if dynamics are not deterministic. An action cannot guarantee that the agent will end up in the state it wants to be in
P-IRL4 | Ill-posed problems | | The uncertainty involved in obtaining the reward function gave rise to approaches such as maximum margin planning, loss functions, probabilistic functions such as maximum entropy, and Bayesian IRL | There are multiple optimal policies for the same reward function and multiple reward functions for the same optimal policy. This is an issue, and the computational cost of solving the problem grows disproportionately with problem size
P-IRL5 | Inaccurate inferences | | In the Markov decision process, inferences drawn by humans are treated as an inverse planning problem. The important concept here is how accuracy is measured; this gave rise to measures such as the closeness of the learned reward function and the inverse learning error | Many factors of the learning process impact inference accuracy: inputs, multiple solutions, algorithm performance, and feature selection. The inputs are finite and contain a small set of trajectories, and many reward functions could explain the observed demonstrations, which decreases inference accuracy. Ambiguous solutions directly impact feature selection and algorithm performance
P-IRL6 | Sensitivity to correctness of prior knowledge | | To reduce the impact of feature selection, significant research growth is found toward hybrid-IRL methods and maximum entropy methods | The feature functions of rewards and the transition function of the Markov decision process are the specifications of prior knowledge entering IRL. Scaling of correct features, along with how the expert's dynamics are modeled, impacts IRL’s accuracy. Given the significant role of prior knowledge, the challenge is twofold: (i) assuring accuracy, which is very difficult to achieve in practice, and (ii) substituting the learned information for the existing knowledge
P-IRL7 | Lack of scalability with large problem size | | The concepts of importance sampling (relative entropy IRL and the guided cost learning method), state-space down-scaling via low-dimensional features, hierarchical task decomposition, and assuming that demonstrations are locally optimal (PI-IRL) came into existence to handle large state spaces | IRL algorithm complexity depends on time, space, and sampling complexity. As the problem size increases, the number of iterations in the algorithm increases and the state space grows exponentially, which makes scalability tough and impractical. Sampling complexity refers to how many trajectories are present in the input demonstration; when the problem size increases, more trajectories are added to the demonstration, which leads to intractability as well as poor model performance
P-IRL8 | Lack of reliability | | To handle reliability, research has tilted toward learning the optimal reward function, giving rise to hybrid-IRL, probabilistic methods, and newer frameworks such as the multifidelity Bayesian optimization framework | When the problem size increases to a large extent, finding only one reliable solution becomes unrealistic
Solutions to overcome the identified problems
Code | Solution | References |
---|---|---|
S-IRL1 | Learning from failed and successful demonstrations having noisy and imperfect inputs | |
S-IRL2 | Supports optimal policy and rewards functions along with stochastic transition models | |
S-IRL3 | Maximum entropy and its optimization | |
S-IRL4 | Support multiple rewards and non-linear reward functions for large state spaces | |
S-IRL5 | Inculcate risk-awareness factors in IRL algorithms | |
S-IRL6 | Posterior distribution of the agent’s preferences | |
S-IRL7 | Development of improved IRL algorithms like AIRL, CIRL, DeepIRL, Gradient IRL, REIRL, score-based IRL, and Bayesian IRL for improving imitation learning | 
S-IRL8 | Maximum margin planning and its optimization |
Fuzzy AHP
Linguistic term | TFNs (u, v, w) | TFNs reciprocal (1/w, 1/v, 1/u) | Linguistic term | TFNs (u, v, w) | TFNs reciprocal (1/w, 1/v, 1/u) |
---|---|---|---|---|---|
Tremendous importance | \(\tilde{9}\) = (9, 9, 9) | \(\tilde{9}\)−1 = (1/9, 1/9, 1/9) | Intermediate value between very strong and tremendous importance | \(\tilde{8}\) = (7, 8, 9) | \(\tilde{8}\)−1 = (1/9, 1/8, 1/7) |
Very strong importance | \(\tilde{7}\) = (6, 7, 8) | \(\tilde{7}\)−1 = (1/8, 1/7, 1/6) | Intermediate value between strong and very strong importance | \(\tilde{6}\) = (5, 6, 7) | \(\tilde{6}\)−1 = (1/7, 1/6, 1/5) |
Strong importance | \(\tilde{5}\) = (4, 5, 6) | \(\tilde{5}\)−1 = (1/6, 1/5, 1/4) | Intermediate value between moderate and strong importance | \(\tilde{4}\) = (3, 4, 5) | \(\tilde{4}\)−1 = (1/5, 1/4, 1/3) |
Moderate importance | \(\tilde{3}\) = (2, 3, 4) | \(\tilde{3}\)−1 = (1/4, 1/3, 1/2) | Intermediate value between equal and moderate importance | \(\tilde{2}\) = (1, 2, 3) | \(\tilde{2}\)−1 = (1/3, 1/2, 1) |
Equal importance | \(\tilde{1}\) = (1, 1, 1) | \(\tilde{1}\)−1 = (1, 1, 1) |
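A minimal sketch of the triangular fuzzy number (TFN) algebra behind the scale above may help; the class and operator choices below are ours (the standard approximate fuzzy arithmetic), not code from the study.

```python
from dataclasses import dataclass

@dataclass
class TFN:
    u: float  # lower bound
    v: float  # modal (most likely) value
    w: float  # upper bound

    def __add__(self, other):
        return TFN(self.u + other.u, self.v + other.v, self.w + other.w)

    def __mul__(self, other):  # approximate fuzzy product used in fuzzy AHP
        return TFN(self.u * other.u, self.v * other.v, self.w * other.w)

    def reciprocal(self):      # (1/w, 1/v, 1/u), exactly as in the table
        return TFN(1 / self.w, 1 / self.v, 1 / self.u)

strong = TFN(4, 5, 6)                  # "strong importance"
print(strong.reciprocal())             # TFN(u=0.1666..., v=0.2, w=0.25)
```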
Fuzzy TOPSIS
Proposed method
- TFNs are used to formalize the experts’ evaluations, as a pairwise comparison matrix and a fuzzy evaluation matrix are involved in the proposed method’s phases.
- Imprecision and vagueness are inherent in any subjective evaluation process. Therefore, all experts’ evaluations are affected by uncertainty and ambiguity, as each expert has a different level of cognitive vagueness (based on experience and knowledge). This is the reason for using the fuzzy approach with TFNs, so that uncertainty and ambiguity can be handled in a better way.
- There are no external conditions that affect the uncertainty, as the experts are confident about their evaluations in the proposed method’s phases, so there is no need for more complex fuzzy tools such as type-2 fuzzy sets or neutrosophic sets.
Experimental results
Fuzzy analytic hierarchy process experimental results
Problems | P-IRL1 | P-IRL2 | –– | –– | –– | P-IRL7 | P-IRL8 |
---|---|---|---|---|---|---|---|
P-IRL1 | (1, 1, 1) | (0.25, 0.33, 0.50) | (0.25, 0.33, 0.50) | (2.00, 3.00, 4.00) | |||
P-IRL2 | (2.00, 3.00, 4.00) | (1, 1, 1) | (0.20, 0.25, 0.33) | (5.00, 6.00, 7.00) | |||
P-IRL3 | (0.20, 0.25, 0.33) | (4.00, 5.00, 6.00) | (0.20, 0.25, 0.33) | (0.20, 0.25, 0.33) | |||
P-IRL4 | (0.14, 0.17, 0.20) | (0.25, 0.33, 0.50) | (3.00, 4.00, 5.00) | (0.20, 0.25, 0.33) | |||
P-IRL5 | (0.25, 0.33, 0.50) | (0.14, 0.17, 0.20) | (0.20, 0.25, 0.33) | (0.20, 0.25, 0.33) | |||
P-IRL6 | (1.00, 2.00, 3.00) | (0.14, 0.17, 0.20) | (1, 1, 1) | (0.20, 0.25, 0.33) | |||
P-IRL7 | (2.00, 3.00, 4.00) | (3.00, 4.00, 5.00) | (0.20, 0.25, 0.33) | (3.00, 4.00, 5.00) | |||
P-IRL8 | (0.25, 0.33, 0.50) | (0.14, 0.17, 0.20) | (2.00, 3.00, 4.00) | (1, 1, 1) |
Problems | P-IRL1 | P-IRL2 | –– | –– | –– | P-IRL7 | P-IRL8 |
---|---|---|---|---|---|---|---|
P-IRL1 | (1, 1, 1) | (1.04, 1.29, 1.39) | (2.58, 3.16, 3.77) | (3.40, 4.40, 5.40) |
P-IRL2 | (2.23, 3.04, 3.86) | (1, 1, 1) | (1.51, 2.00, 2.51) | (4.34, 5.28, 6.21) | |||
P-IRL3 | (1.68, 2.06, 2.48) | (3.22, 4.16, 5.10) | (1.53, 1.90, 2.29) | (1.85, 2.28, 2.73) | |||
P-IRL4 | (0.51, 0.66, 0.81) | (2.44, 3.25, 4.08) | (1.72, 2.15, 2.60) | (1.08, 1.38, 1.70) | |||
P-IRL5 | (0.56, 0.71, 0.92) | (1.94, 2.88, 3.81) | (1.85, 2.28, 2.73) | (0.51, 0.62, 0.75) | |||
P-IRL6 | (1.16, 2.04, 2.92) | (1.94, 2.42, 2.90) | (1.58, 2.13, 2.87) | (2.51, 3.14, 3.83) | |||
P-IRL7 | (1.02, 1.51, 2.01) | (1.70, 2.27, 2.86) | (1, 1, 1) | (4.11, 5.00, 5.93) | |||
P-IRL8 | (0.20, 0.26, 0.37) | (0.61, 0.71, 0.82) | (0.27, 0.44, 0.63) | (1, 1, 1) |
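The decimal entries in this second matrix are consistent with an element-wise fuzzy geometric mean of the individual experts’ judgments, a common aggregation step in fuzzy AHP; the sketch below assumes that convention and uses made-up judgments.

```python
import numpy as np

def geometric_mean_tfn(tfns):
    """Element-wise geometric mean of K experts' TFNs (u, v, w)."""
    arr = np.asarray(tfns, dtype=float)            # shape (K, 3)
    return tuple(np.prod(arr, axis=0) ** (1.0 / len(arr)))

# e.g. three hypothetical experts comparing one pair of problems:
print(geometric_mean_tfn([(2, 3, 4), (4, 5, 6), (4, 5, 6)]))
# -> (3.17..., 4.21..., 5.24...)
```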
Fuzzy synthetic criteria (Ci) | Criteria values (pi, qi, ri) |
---|---|
C1 (P-IRL1) | p1 = 0.10, q1 = 0.16, r1 = 0.25 |
C2 (P-IRL2) | p2 = 0.08, q2 = 0.12, r2 = 0.19 |
C3 (P-IRL3) | p3 = 0.09, q3 = 0.13, r3 = 0.20 |
C4 (P-IRL4) | p4 = 0.08, q4 = 0.13, r4 = 0.20 |
C5 (P-IRL5) | p5 = 0.06, q5 = 0.09, r5 = 0.14 |
C6 (P-IRL6) | p6 = 0.08, q6 = 0.13, r6 = 0.21 |
C7 (P-IRL7) | p7 = 0.08, q7 = 0.14, r7 = 0.21 |
C8 (P-IRL8) | p8 = 0.06, q8 = 0.10, r8 = 0.15 |
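The C1..C8 triples follow Chang's extent analysis: each row of TFNs is summed and multiplied by the inverse of the grand-total TFN, \(S_i = \sum_j M_{ij} \otimes (\sum_i \sum_j M_{ij})^{-1}\). A sketch of that step:

```python
import numpy as np

def synthetic_extents(M):
    """M: (n, n, 3) array of aggregated TFNs (u, v, w).
    Returns the (n, 3) fuzzy synthetic extents (p_i, q_i, r_i)."""
    row = M.sum(axis=1)              # row sums, shape (n, 3)
    P, Q, R = row.sum(axis=0)        # grand total (P, Q, R)
    # the inverse of the total TFN is (1/R, 1/Q, 1/P), so:
    return np.stack([row[:, 0] / R, row[:, 1] / Q, row[:, 2] / P], axis=1)
```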
D (C1) | D (C2) | D (C3) | D (C4) | D (C5) | D (C6) | D (C7) | D (C8) | |
---|---|---|---|---|---|---|---|---|
Degree of possibility | 1.00 | 0.69 | 0.74 | 0.75 | 0.35 | 0.78 | 0.81 | 0.44 |
1.00 | 0.95 | 1.00 | 1.00 | 0.64 | 1.00 | 1.00 | 0.74 | |
1.00 | 0.95 | 1.00 | 1.00 | 0.60 | 1.00 | 1.00 | 0.69 | |
1.00 | 1.00 | 1.00 | 1.00 | 0.60 | 1.00 | 1.00 | 0.69 | |
1.00 | 0.93 | 0.97 | 0.98 | 0.59 | 1.00 | 1.00 | 1.00 | |
1.00 | 0.90 | 0.95 | 0.95 | 0.56 | 0.97 | 1.00 | 0.68 | |
1.00 | 1.00 | 1.00 | 1.00 | 0.91 | 1.00 | 1.00 | 0.65 | |
MinD | 1.00 | 0.69 | 0.74 | 0.75 | 0.35 | 0.78 | 0.81 | 0.44 |
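The table above applies Chang's degree-of-possibility comparison V(Ci ≥ Cj) to the synthetic extents, takes the minimum over j for each criterion (the MinD row), and normalizes; a compact sketch:

```python
import numpy as np

def possibility(a, b):
    """V(a >= b) for TFNs a = (p1, q1, r1), b = (p2, q2, r2)."""
    p1, q1, r1 = a
    p2, q2, r2 = b
    if q1 >= q2:
        return 1.0
    if p2 >= r1:
        return 0.0
    return (p2 - r1) / ((q1 - r1) - (q2 - p2))

def normalized_weights(extents):
    n = len(extents)
    min_d = np.array([min(possibility(extents[i], extents[j])
                          for j in range(n) if j != i) for i in range(n)])
    return min_d / min_d.sum()

# Feeding in the C1..C8 triples above approximately reproduces (up to
# rounding) the MinD row and the normalized weights reported next.
```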
Performance analysis metrics
Criteria/problems | Normalized weights | Ranking of criteria |
---|---|---|
P-IRL1 | 0.180 | 1 |
P-IRL2 | 0.125 | 6 |
P-IRL3 | 0.133 | 5 |
P-IRL4 | 0.134 | 4 |
P-IRL5 | 0.063 | 8 |
P-IRL6 | 0.141 | 3 |
P-IRL7 | 0.145 | 2 |
P-IRL8 | 0.079 | 7 |
Fuzzy TOPSIS experimental results
P-IRL1 | P-IRL2 | ––– | –– | ––– | P-IRL7 | P-IRL8 | |
---|---|---|---|---|---|---|---|
S-IRL1 | (1, 2, 3) | (5, 6, 7) | (5, 6, 7) | (4, 5, 6) | |||
S-IRL2 | (2, 3, 4) | (3, 4, 5) | (2, 3, 4) | (3, 4, 5) | |||
S-IRL3 | (4, 5, 6) | (3, 4, 5) | (7, 8, 9) | (1, 2, 3) | |||
S-IRL4 | (5, 6, 7) | (2, 3, 4) | (3, 4, 5) | (2, 3, 4) | |||
S-IRL5 | (3, 4, 5) | (1, 2, 3) | (5, 6, 7) | (3, 4, 5) | |||
S-IRL6 | (1, 2, 3) | (5, 6, 7) | (1, 2, 3) | (5, 6, 7) | |||
S-IRL7 | (4, 5, 6) | (3, 4, 5) | (3, 4, 5) | (6, 7, 8) | |||
S-IRL8 | (4, 5, 6) | (2, 3, 4) | (3, 4, 5) | (7, 8, 9) |
P-IRL1 | P-IRL2 | ––– | –– | ––– | P-IRL7 | P-IRL8 | |
---|---|---|---|---|---|---|---|
S-IRL1 | (1, 5.33, 7) | (3, 5.73, 7) | (3, 5.73, 7) | (3, 5.67, 7) | |||
S-IRL2 | (2, 4.47, 9) | (3, 4.80, 9) | (2, 4.73, 9) | (3, 4.80, 9) | |||
S-IRL3 | (2, 6.13, 9) | (2, 6.63, 9) | (2, 6.60, 9) | (1, 6.20, 9) | |||
S-IRL4 | (2, 4.27, 9) | (2, 3.67, 9) | (2, 3.73, 9) | (2, 3.67, 9) | |||
S-IRL5 | (1, 3.60, 7) | (1, 4.33, 9) | (2, 4.60, 9) | (2, 4.47, 9) | |||
S-IRL6 | (1, 5.47, 9) | (2, 6.60, 9) | (1, 6.33, 9) | (2, 6.60, 9) | |||
S-IRL7 | (2, 3.93, 9) | (2, 3.80, 9) | (2, 3.80, 9) | (2, 4, 9) | |||
S-IRL8 | (2, 3.47, 7) | (2, 4.07, 9) | (2, 4.13, 9) | (2, 4.40, 9) |
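The aggregated triples above, e.g. (1, 5.33, 7), are consistent with combining the K experts’ ratings as (minimum of lower bounds, mean of modal values, maximum of upper bounds), a standard fuzzy TOPSIS aggregation; the sketch assumes that rule.

```python
import numpy as np

def aggregate_ratings(ratings):
    """ratings: (K, 3) TFNs from K experts -> one aggregated TFN (a, b, c)."""
    r = np.asarray(ratings, dtype=float)
    return (r[:, 0].min(), r[:, 1].mean(), r[:, 2].max())

# three hypothetical experts rating one solution against one problem:
print(aggregate_ratings([(1, 2, 3), (5, 6, 7), (7, 8, 9)]))  # (1.0, 5.33.., 9.0)
```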
P-IRL1 | P-IRL2 | ––– | –– | ––– | P-IRL7 | P-IRL8 | |
---|---|---|---|---|---|---|---|
S-IRL1 | (0.14, 0.19, 1.00) | (0.14, 0.17, 0.33) | (0.14, 0.17, 0.33) | (0.14, 0.18, 0.33) | |||
S-IRL2 | (0.11, 0.22, 0.50) | (0.11, 0.21, 0.33) | (0.11, 0.21, 0.50) | (0.11, 0.21, 0.33) | |||
S-IRL3 | (0.11, 0.16, 0.50) | (0.11, 0.16, 0.50) | (0.11, 0.15, 0.50) | (0.11, 0.16, 1.00) | |||
S-IRL4 | (0.11, 0.23, 0.50) | (0.11, 0.27, 0.50) | (0.11, 0.27, 0.50) | (0.11, 0.27, 0.50) | |||
S-IRL5 | (0.14, 0.28, 1.00) | (0.11, 0.23, 1.00) | (0.11, 0.22, 0.50) | (0.11, 0.22, 0.50) | |||
S-IRL6 | (0.11, 0.18, 1.00) | (0.11, 0.15, 0.50) | (0.11, 0.16, 1.00) | (0.11, 0.15, 0.50) | |||
S-IRL7 | (0.11, 0.25, 0.50) | (0.11, 0.26, 0.50) | (0.11, 0.26, 0.50) | (0.11, 0.25, 0.50) | |||
S-IRL8 | (0.14, 0.29, 0.50) | (0.11, 0.25, 0.50) | (0.11, 0.24, 0.50) | (0.11, 0.23, 0.50) |
P-IRL1 | P-IRL2 | ––– | –– | ––– | P-IRL7 | P-IRL8 | |
---|---|---|---|---|---|---|---|
S-IRL1 | (0.03, 0.03, 0.18) | (0.02, 0.02, 0.04) | (0.02, 0.03, 0.05) | (0.01, 0.01, 0.03) | |||
S-IRL2 | (0.02, 0.04, 0.09) | (0.01, 0.03, 0.04) | (0.02, 0.03, 0.07) | (0.01, 0.02, 0.03) | |||
S-IRL3 | (0.02, 0.03, 0.09) | (0.01, 0.02, 0.06) | (0.02, 0.02, 0.07) | (0.01, 0.01, 0.08) | |||
S-IRL4 | (0.02, 0.04, 0.09) | (0.01, 0.03, 0.06) | (0.02, 0.04, 0.07) | (0.01, 0.02, 0.04) | |||
S-IRL5 | (0.03, 0.05, 0.18) | (0.01, 0.03, 0.13) | (0.02, 0.03, 0.07) | (0.01, 0.02, 0.04) | |||
S-IRL6 | (0.02, 0.03, 0.18) | (0.01, 0.02, 0.06) | (0.02, 0.02, 0.15) | (0.01, 0.01, 0.04) | |||
S-IRL7 | (0.02, 0.05, 0.09) | (0.01, 0.03, 0.06) | (0.02, 0.04, 0.07) | (0.01, 0.02, 0.04) | |||
S-IRL8 | (0.03, 0.05, 0.09) | (0.01, 0.03, 0.06) | (0.02, 0.04, 0.07) | (0.01, 0.02, 0.04) |
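The two matrices above follow from cost-type normalization, r = (a_min/c, a_min/b, a_min/a) per problem column, and then scaling each normalized TFN by the crisp fuzzy AHP weight of that problem (0.180 for P-IRL1, and so on); a sketch under those assumptions:

```python
import numpy as np

def normalize_cost(column):
    """column: (m, 3) aggregated TFNs (a, b, c), one problem, all solutions."""
    col = np.asarray(column, dtype=float)
    a_min = col[:, 0].min()                        # smallest lower bound
    return np.stack([a_min / col[:, 2],            # a_min / c
                     a_min / col[:, 1],            # a_min / b
                     a_min / col[:, 0]], axis=1)   # a_min / a

col = [(1, 5.33, 7), (2, 4.47, 9)]                 # first two rows for P-IRL1
print(0.180 * normalize_cost(col))                 # weighted normalized TFNs
```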
Solutions | \(d_g^*\) | \(d_g^-\) | CofC\(_g\) | Rank |
---|---|---|---|---|
S-IRL1 | 0.359454674 | 7.709058919 | 0.955449703 | 6 |
S-IRL2 | 0.263666223 | 7.764375813 | 0.967156846 | 1 |
S-IRL3 | 0.330983924 | 7.733963964 | 0.958960191 | 3 |
S-IRL4 | 0.333067586 | 7.709728135 | 0.958588083 | 5 |
S-IRL5 | 0.407248177 | 7.671022845 | 0.94958721 | 8 |
S-IRL6 | 0.399932677 | 7.695024256 | 0.950594836 | 7 |
S-IRL7 | 0.331777695 | 7.711262501 | 0.958749716 | 4 |
S-IRL8 | 0.328161181 | 7.715248471 | 0.959201235 | 2 |
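The separations \(d_g^*\) and \(d_g^-\) use the vertex distance between TFNs, and \(CofC_g = d_g^- / (d_g^* + d_g^-)\). The magnitudes in the table are consistent with a cost-type ideal of (0, 0, 0) and anti-ideal of (1, 1, 1) per criterion; a sketch under that assumption:

```python
import numpy as np

def vertex_distance(x, y):
    """d(x, y) = sqrt(mean of squared differences of the three vertices)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(((x - y) ** 2).mean()))

def closeness(weighted_row, ideal=(0, 0, 0), anti=(1, 1, 1)):
    """weighted_row: the 8 weighted normalized TFNs of one solution."""
    d_star = sum(vertex_distance(v, ideal) for v in weighted_row)   # d_g^*
    d_minus = sum(vertex_distance(v, anti) for v in weighted_row)   # d_g^-
    return d_minus / (d_star + d_minus)    # rank solutions by this value
```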
Results and discussion
Conclusion and future scope
- In the future, other multi-criteria and multi-faceted decision-making methods, such as fuzzy VIKOR, fuzzy PROMETHEE, or fuzzy ELECTRE, can be used, and their results can be compared with the current study’s results.
- The experts’ experience and knowledge play an important role in the results of the current study, and there is a chance of bias in the results. This can be minimized or eliminated in the future by adding more experts to the study.
- In the future, case studies can be conducted in industries that use IRL in their processes.
- As technology is upgraded in the future, other barriers as well as solutions may be identified [89], and they can be taken up in future studies.
- In the future, IRL methods should be studied in the context of multi-view learning and transfer learning approaches [56].