Development of relevance of computational trust
The present work considers a model of a message-passing network. Trust needs to be associated with messages as well as with nodes, so two trust values are defined: one for the communicating node and one for the message sent. To conclude about the trustworthiness of a message, a composite trust needs to be computed. A summarized description of issues of storage and retrieval of trusted information using a probabilistic temporal database approach is available in [12]. A measure has been introduced as a composite of the calculated weight of a record (or attribute importance value) and the computational trust of the agent attached to the node. The utility measure associated with any node can be calculated based on the probability of the action of forwarding data from that node to a neighbouring network node under the current congestion situation. This probabilistic utility measure may indicate the fraction of packets sent from the source that are successfully delivered at the destination, thus avoiding dropped packets (Eq. 8). The net trust value, which is the composite of the attribute importance values, the network utility values and the computational trust of the agent associated with the node, is derived by applying the principles described in Eq. 10. An event or topic modelling approach is considered in which the dependency between news-report event or subtopic nodes is represented by their correlation, derived from a similarity or distance measure relating these events or subtopics. The calculation of the reward value for transferring a message between these nodes is described, and this reward is used to derive a policy path of message transfers with a reinforcement learning approach.
$$ \text{Weight of a source record} = \Big( \sum\nolimits_{i} (p_{i} \cdot mc_{i}) \Big) / ascm $$
(1)
where i represents a valid or selected attribute in the merged record (the merging records have identical values for this attribute), mc_i is the occurrence count of the token associated with attribute i in the merged record, calculated as the sum of the occurrence counts of the corresponding attribute value in the individual records, ascm is the cumulative occurrence count of tokens in the merged news report over all valid or selected attributes i present in the merging records, and p_i is determined from application of Eqs. 2, 3 and 4. The value is normalized by dividing by the summation of such probabilities identified over all attributes of the source record.
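As an illustrative sketch of Eq. 1, the record weight can be computed as follows; the dictionary layout, the attribute names and the helper name `record_weight` are hypothetical, not part of the described system.

```python
# Hypothetical sketch of Eq. 1: the weight of a merged source record is
# sum_i(p_i * mc_i) / ascm, where ascm is the cumulative token count.

def record_weight(attrs):
    """attrs maps each selected attribute i to a (p_i, mc_i) pair."""
    ascm = sum(mc for _, mc in attrs.values())
    if ascm == 0:
        return 0.0
    return sum(p * mc for p, mc in attrs.values()) / ascm

# Illustrative merged record with two selected attributes.
w = record_weight({"location": (0.6, 3), "person": (0.4, 2)})
```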
A measure can be associated with any field value based on the occurrence count of this value in the news report from which the record is derived. It can be calculated from the iteratively developed conditional probability of a field token value given an event, obtained as the maximization step of an EM algorithm solution as described in [132]. The EM algorithm maximizes the posterior probability of occurrence of the event.
The expectation step is
$$ p(e_{j}|x_{i})^{(t+1)} = p(e_{j})^{(t)} \cdot p(x_{i}|e_{j})^{(t)} / p(x_{i})^{(t)} $$
where e_j is the jth event and x_i is the ith news report. The maximization step is derived from this equation, and it also produces the probability value of the nth entity.
$$ p(w_{n}|e_{j})^{(t+1)} = \Big( \sum\nolimits_{i=1}^{M} p(e_{j}|X_{i})^{(t+1)} \cdot tf(i,n) \Big) \Big/ \Big( N + \sum\nolimits_{i=1}^{M} p(e_{j}|X_{i})^{(t+1)} \sum\nolimits_{s=1}^{N} tf(i,s) \Big) $$
(2)
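A minimal numerical sketch of these EM updates, assuming a small term-frequency matrix `tf` of shape M reports by N tokens; the array names, toy values and function names are illustrative assumptions, and the smoothed M-step follows the form of Eq. 2.

```python
# Illustrative EM updates around Eq. 2; all names and values are assumptions.
import numpy as np

def e_step(p_e, p_x_given_e):
    """p(e_j|x_i) proportional to p(e_j) * p(x_i|e_j), normalized per report."""
    post = p_e[None, :] * p_x_given_e            # shape (M, J)
    return post / post.sum(axis=1, keepdims=True)

def m_step_token(post, tf):
    """Smoothed p(w_n|e_j): weighted token counts over a smoothed total."""
    M, N = tf.shape
    num = post.T @ tf                            # shape (J, N)
    denom = N + (post.T @ tf.sum(axis=1))[:, None]
    return num / denom

tf = np.array([[2.0, 1.0], [0.0, 3.0]])          # 2 reports, 2 tokens
p_e = np.array([0.5, 0.5])
p_x_given_e = np.array([[0.8, 0.2], [0.3, 0.7]])
post = e_step(p_e, p_x_given_e)
p_w = m_step_token(post, tf)
```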
This probability of a field token value in the presence of all values can also be calculated as a posterior probability based on the prior, the likelihood of the value given the news reports about the event, and the prior probability of occurrence of the event in the reports, as described in the context of combination and calibration of methods for forecasting of events in [133].
$$ f(p_{t}|n_{t}) = f(p_{t}) \cdot f(n_{t}|p_{t}) / f(n_{t}) $$
(3)
where
\( f(n_{t}|p_{t}) = \binom{m}{n_{t}} \cdot p_{t}^{n_{t}} \cdot (1-p_{t})^{(m-n_{t})} \)
This probability value can also be calculated by a beta-binomial method, or by an optimal-score approach in which the conditional probability of the field token value is calculated from a defined number of past observations and observations from news reports, with an appropriate weightage applied to each so as to minimize a posterior log-likelihood measure, as described in [133].
This is described as follows,
$$ E(p_{t}|n_{t}) = (T\,p + w\,n_{t})/(T + w\,m) = w_{1} \cdot (n_{t}/m) + w_{2} \cdot p $$
(4)
where w_1 = w·m/(T + w·m) and w_2 = T/(T + w·m); p_t is the (posterior) probability associated with the entity of interest, n_t is the occurrence count of the entity, m is the total occurrence count of all entities, T is the sample size and p (the prior) is the sample mean.
The weight w is determined so as to maximize the likelihood value
$$ L(w) = \prod\nolimits_{t=1}^{T} E(p_{t}^{*}|n_{t}). $$
This w value determines the posterior probability of the entity of interest.
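The shrinkage estimate of Eq. 4 can be sketched as follows; the function name and the toy values are illustrative assumptions, with variable names following the text (n_t occurrence count, m total count, T sample size, p prior mean).

```python
# Sketch of the shrinkage estimate in Eq. 4; names follow the text.

def posterior_mean(n_t, m, T, p, w):
    w1 = (w * m) / (T + w * m)
    w2 = T / (T + w * m)
    return w1 * (n_t / m) + w2 * p

# With w = 0 the estimate falls back to the prior p; larger w moves it
# toward the observed rate n_t/m.
est = posterior_mean(n_t=3, m=10, T=5, p=0.5, w=1.0)
```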
Thus if w_i is the weight associated with the ith message record, and T_i is the global trust value associated with the source of record i, the composite trust in the event is calculated as
$$ COMPT = \sum\limits_{i} {(w_{i} *T_{i} )} $$
(5)
The fusion value of the data is calculated using the data fusion rule specified in [134]. This value is determined as
$$ DV = \sum\nolimits_{i} (w_{i} \cdot T_{i}) \Big/ \sum\nolimits_{i} T_{i} $$
(6)
where Σ_i (w_i · T_i) is the composite trust in the message produced from this node and T_i is the trust in the trusted partners associated with the ith message record input to this node.
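Eqs. 5 and 6 can be sketched together; the record list and the helper names are illustrative assumptions, with each record carrying its weight w_i and source trust T_i.

```python
# Sketch of Eqs. 5-6 over (weight w_i, source trust T_i) pairs.

def composite_trust(records):
    return sum(w * t for w, t in records)             # Eq. 5: COMPT

def fusion_value(records):
    total_trust = sum(t for _, t in records)
    return composite_trust(records) / total_trust     # Eq. 6: DV

records = [(0.52, 0.9), (0.30, 0.6)]
compt = composite_trust(records)
dv = fusion_value(records)
```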
$$ \text{Composite weight of source record} = w_{A} \cdot u_{A} $$
(7)
where w_A is the weight associated with the source record at “A” and u_A is the utility associated with the source record from the source at state “A” for the chosen forwarding action to B, based on the current network congestion situation. The utility measure u_A is defined as
$$ {\text{u}}_{\text{A}} = \frac{{\left( {{\text{Frequency count of successful transfer of packets from ``A''}}} \right)}}{{({\text{Frequency count of sum total of all packets transferred from ``A''}})}} $$
(8)
The reward received at node B is COMPT_AB, the composite trust in A after interaction with “B” [40], which has been derived using Eq. 10 from a choice condition determined by application of Eq. 9.
The unit-step reward for the corresponding p_AB(n) is valid for n = 1. The multi-step reward calculated for such a node pair (A, B) is p_AB(n) for n ≥ 1, symbolized as p_ABt, and is derived by application of Eqs. 13, 14, 15 and 16 appearing later in this document.
A similarity measure Sim(i, j) between two records i and j is defined, where each record is described by k attributes; attr_in_merge(a) = 1 indicates that the ath attribute values of the records have similarity value > 0, otherwise attr_in_merge(a) = 0.
The net similarity measure is defined as
$$ Sim(i,j) = \sum\nolimits_{a=1}^{k} attr\_in\_merge(a) \cdot (w_{a} \cdot Sim_{a}(i,j)) $$
(9)
where Sim_a(i, j) is the similarity of the ath attributes of records i and j, and w_a is the normalized probabilistic weight associated with the ath attribute of the merged record, based on application of Eqs. 2, 3 and 4.
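A short sketch of Eq. 9, assuming parallel lists of attribute weights and per-attribute similarities; the names and toy values are illustrative.

```python
# Sketch of Eq. 9: net similarity over k attributes, counting only the
# attributes whose per-attribute similarity is positive (attr_in_merge = 1).

def net_similarity(weights, attr_sims):
    total = 0.0
    for w_a, sim_a in zip(weights, attr_sims):
        if sim_a > 0:                      # attr_in_merge(a) = 1
            total += w_a * sim_a
    return total

# Three attributes; the second does not participate in the merge.
sim = net_similarity([0.5, 0.3, 0.2], [0.8, 0.0, 0.5])
```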
The procedure of cumulating occurrence counts of tokens or evidence for merging information from multiple sources has also been discussed in [135]. The choice of routing or forwarding of information based on a reward calculated from aggregation over data correlation, and on the gain in reward or the diminishment of the distance measure to the goal node, has been described in [136].
The net similarity measure calculated using Eq. 9 is used for determining the appropriate probability disjunction strategy. Here the probabilistic trust values pA and pB associated with the individual source records participating in the merge are calculated by applying Eqs. 1 and 7: pB is the probabilistic importance or trust value associated with the record at node “B”, and pA is the probability of the source record at “A” forwarded to node “B” in the provenance graph. The weight of the relevant record is multiplied by the probabilistic trust value measuring the trustworthiness of the agent at the node, producing the probabilistic value presented to the probabilistic disjunctive strategy described for temporal probabilistic databases [12, 137].
$$ \text{Net\_trust} = \text{measure defined on the net similarity} $$
(10)
where:
if the net similarity measure > a high threshold, the probabilistic disjunctive strategy [137] is adopted for the positively correlated case, and Net_trust = min(1, pA + pB);
else if the net similarity measure = 0.5, the probabilistic disjunctive strategy [12] is adopted for the independent case, and Net_trust = pA + pB - pA*pB;
else if the net similarity measure < a low threshold and > a minimum threshold, the probabilistic disjunctive strategy is adopted for the negatively correlated case [12], and Net_trust = min(1, pA + pB);
else if the net similarity measure < the minimum threshold, the probabilistic disjunctive strategy is adopted for the ignorance case, and Net_trust = max(pA, pB).
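The case analysis behind Eq. 10 can be sketched as a single function; the threshold values are illustrative free parameters (the text leaves them unspecified), the fallback branch for similarities between the bands is an assumption, and the branch-to-strategy mapping follows the text.

```python
# Sketch of the Eq. 10 case analysis; thresholds are hypothetical defaults.

def net_trust(p_a, p_b, sim, high=0.8, low=0.4, minimum=0.1):
    if sim > high:                     # positively correlated case
        return min(1.0, p_a + p_b)
    if abs(sim - 0.5) < 1e-9:          # independent case
        return p_a + p_b - p_a * p_b
    if minimum < sim < low:            # negatively correlated case
        return min(1.0, p_a + p_b)
    if sim < minimum:                  # ignorance case
        return max(p_a, p_b)
    return max(p_a, p_b)               # assumed fallback between the bands

t = net_trust(0.6, 0.5, sim=0.9)       # positively correlated branch
```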
This net trust value COMPT_AB, the reward received at node A, is the composite or net trust in the event after merging the message records from the pair of sources using the method described in Eq. 10 of this section. This measure considers the importance of similar sources in a cluster alongside sources with a distinctive news-report “contribution” [138], i.e. a measure of the distinctive relevance of sources, in the reward value computation.
Application of Q Learning algorithm for learning provenance path
The Q-Learning procedure is described as follows. Initialize V(s) for all states using the initial network congestion conditions and the procedures described in Eqs. 1, 7 and 10. The network nodes are considered to be connected by links based on probabilities determined from the current congestion situation as described in Eq. 8. Here s ranges over the set of nodes that are base database [140] relation sites, where the record merging operation is applied to records arriving from two or more network sites.
The transition probability T(s, a, s′) is determined from application of Eq. 16, and R(s, a) is the trust in the component of information contributed by state s in performing the transformation step at a state s′ reached by the forwarding action of a data or information packet; this is determined from application of Eq. 16 after reaching state s′, as described earlier.
The policy learned is considered “good” if the magnitude of each updated value attribute differs from its magnitude prior to the update by less than an acceptable threshold. The greedy policy finds the optimal policy in a finite number of steps, sometimes earlier than convergence of the iterated value function. On arriving at a better policy, all the previously learned policy paths that are covered, including the latest, are set aside, and a new policy, sub-optimal overall but presently optimal, is learned. This procedure continues until all policy paths satisfying the goal are learned.
Thus a message is transferred through a selected policy path of nodes in a provenance network for maximizing the iterated value described for a node in the graph.
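A minimal tabular Q-learning sketch over a tiny provenance graph; the rewards here stand in for the trust values of Eq. 10, an action is identified with the next node, and all names, values and hyperparameters are illustrative assumptions rather than the paper's configuration.

```python
# Minimal tabular Q-learning over a toy provenance graph; all values are
# illustrative. Actions are identified with successor nodes.
import random

def q_learning(transitions, rewards, start, goal, episodes=200,
               alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in transitions for a in transitions[s]}
    for _ in range(episodes):
        s = start
        while s != goal:
            actions = transitions[s]
            if rng.random() < epsilon:                 # explore
                a = rng.choice(actions)
            else:                                      # exploit
                a = max(actions, key=lambda x: q[(s, x)])
            s2 = a
            best_next = max((q[(s2, a2)] for a2 in transitions.get(s2, [])),
                            default=0.0)
            q[(s, a)] += alpha * (rewards[(s, a)] + gamma * best_next - q[(s, a)])
            s = s2
    return q

# "A" can forward to "B" or "C"; both reach the goal "G".
transitions = {"A": ["B", "C"], "B": ["G"], "C": ["G"]}
rewards = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "G"): 1.0, ("C", "G"): 1.0}
q = q_learning(transitions, rewards, "A", "G")
```

The learned Q values prefer the higher-trust route through "B", so the greedy policy path is A → B → G.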
A composite trust value of a node “A” is calculated from its content importance, which includes the trust of the subject associated with the node and the importance of the node. The change in the importance measure between a node “A” of interest and an active node “B” in the provenance path is calculated after traversing n steps in the provenance path, i.e. node B is active after executing n provenance steps.
A provenance trust score is calculated for every data item “d” passing through this node “A”. It is calculated as the minimum or the average of the trust values associated with the nodes in the provenance path. The provenance trust scores of all data passing through node “A” are averaged to produce a measure of trust associated with node “A”.
The subject node “B” is included in the same cluster as the cluster identified for subject node “A” if the following conditions are satisfied.
The weight of the source record at “A”, after satisfying the constraints associated with merging with the record at “B”, is w_AB and is calculated using Eq. 7. The weight measure is further normalized by the maximum w_AB calculated over all links of the provenance graph, so that the prior weight associated with the record at “A” is w_ABt = w_AB/max(w_AB), and
$$ p_{AB}(1) = \max(w_{ABt}, 0.5) $$
(12)
for “B” reachable from “A” in one step, where p_AB(1) is the transition probabilistic trust value between the nodes, or the strength of the trust connecting them; in the absence of prior information this probability is initialized to 0.5.
If A is connected to B with a provenance path through one intermediate node C for a run of the workflow:
$$ p_{AB}(2) = p_{AC}(1) \cdot p_{CB}(1) $$
(13)
Alternatively, p_AB(2) can also be calculated as the average, or the weighted average (of more relevance in the adopted Q-Learning procedure), of the trust transitions p_AC(1) and p_CB(1), if we adopt the procedure described in [44].
The transition probability p_ABt, the probabilistic link value connecting any connected pair of nodes in the provenance graph, has been defined in [27] and is calculated using a procedure Cluster_Node_1, described as follows.
If p_ABt > a threshold value, these nodes are positioned in the same cluster; p_ABt is the derived trust value of “A” to “B” obtained using distinct provenance paths (path_i) for a data input “d” to the stream.
Alternatively, this trust value can be calculated as the weighted average of the trust values computed along the alternative paths, where the weights are determined from the path lengths. It is noteworthy that the links nearer to the start node contribute heavily to the net path trust value connecting any pair of start and goal nodes in the provenance path.
This procedure for combining path trust values is also described in [44].
The normalized derived probability of the trust link A to B, for the action taken at state A, is
$$ p_{ABc} = p_{ABt} \Big/ \sum\nolimits_{x} p_{AXt}. $$
(16)
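The transition-trust calculations of Eqs. 12, 13 and 16 can be sketched together: one-step trust from the normalized weight, two-step composition through an intermediate node, and normalization over a node's outgoing links. The function names and toy values are illustrative assumptions.

```python
# Sketch of Eqs. 12, 13 and 16; names and values are illustrative.

def one_step_trust(w_abt):
    return max(w_abt, 0.5)                      # Eq. 12: p_AB(1)

def two_step_trust(p_ac, p_cb):
    return p_ac * p_cb                          # Eq. 13: p_AB(2)

def normalized_trust(p_out, target):
    return p_out[target] / sum(p_out.values())  # Eq. 16: p_ABc

p_ab = one_step_trust(0.7)                      # prior weight above 0.5
p_ag = two_step_trust(one_step_trust(0.7), one_step_trust(0.4))
p_abc = normalized_trust({"B": 0.7, "C": 0.3}, "B")
```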
After every updated policy-identification round, the revised trust values of the agents/actors associated with nodes in the provenance network are propagated and updated according to the trust propagation rules described for social networks in [141].
Our solution to clustering nodes in provenance graph
The approach adopted in this work for clustering of nodes in a provenance graph includes the following steps.
1. Identify a new node of high centrality importance based on its connectedness or betweenness. These nodes may also be marked as split or merge nodes in the provenance graph describing the application of news-report topic modelling. This is the sub-goal node for a cluster.
2. Grow the cluster starting from a node of high importance, which may be a node with a high degree or betweenness value.
2.1. The cluster node expansion and cluster edge expansion measures are considered for growing a cluster, using a derived cluster node expansion capability measure. This method considers the nodes in the cluster and the complement of those nodes in the remaining graph for defining the expansion measure. If the link trust probabilities fluctuate, a measure based on relative dependence is used for identifying the stability of the clustering. The distance between any two nodes in the cluster is defined using a joint-entropy distance value, as specified later in Eq. 17. A pair of nodes is only considered for membership in the same cluster if the activity or task steps or units corresponding to these nodes are within a predefined threshold of the joint-entropy distance between the node pair. The similarity or distance measure is further refined using the similarities of the roles of the agents associated with these nodes; this distance is defined by the length of the category or role classification tree path separating the nearest ancestor nodes of the agent node pair in the topic-subtopic role hierarchy tree. The agent similarity or distance measure is given a higher weightage than the trust-link measure in calculating the similarity or distance between an event node pair. An agent role may be represented as an attribute in the event record associated with a network node; this favours nodes associated with the same agent role for membership in the same cluster. The entity words and the topic words are important in calculating the similarity or distance measure for the documents or news reports of these nodes, for positioning them in the same topic or subtopic cluster. This similarity measure is represented using the probabilistic trust weight of the link connecting the nodes in the provenance network; the calculation of the distance measure from this weight measure is described later in this section.
2.2. An average distance is calculated over every pair of nodes in the identified cluster. Only those nodes are retained in the cluster whose pairwise distance differs from this average distance by less than a defined threshold.
3. Continue Steps 1 to 3 until a cluster is identified for every node in the graph.
4. Identify all the cluster components in the graph.
4.1. Alternatively, node pairs with very high trust-link weights may be removed from the provenance graph, so that the clusters of connected subgraphs may be detected. The highly trust-weighted edges may then be reintroduced to establish links to the sub-goal node(s) of a cluster.
The adjacent clusters/modules are merged by the following procedure. A module score or cluster score is defined for each module group in a workflow [45] based on the distances of the module element pairs contained in the module group. The workflow score is defined as the sum of the module group scores. A greedy strategy is adopted that merges the two adjacent module groups giving the best workflow score; this process continues until only a predefined number of module groups remain. Adjacent cluster components are merged if the sub-goal nodes identified for the clusters are the same. A concept of the role of a macro action is applicable here, where the macro action satisfies a sub-goal and a reward function particular to the subtopic sub-goal may be used. The roles corresponding to the macro actions satisfy the same subtopic or sub-goal role and have correlated topic models describing the connected event nodes present in the sub-goal or subtopic cluster. This role can be interpreted as a semantic role, with users of the same role having the same or similar “structural signatures”, or users having a trusted-group [41] role corresponding to the subtopic cluster associated with tracking a particular topic, subtopic or sub-goal of the story. All agents associated with a macro action may be interpreted as playing an informing role to the evaluating agent associated with the subtopic sub-goal agent. As a consequence of merging clusters, a hierarchical clustering of subtopic sub-goal nodes is achieved, where sub-goal nodes inferred from recursively defined merged clusters may indicate split or merge points in the goal topic or whole-story description. A merging of clusters may aim to form a merged cluster whose nodes have similar Q values or reward values, as calculated from application of Eq. 16. The actions connecting the subtopic sub-goal nodes of the finally produced clusters are the macro actions on which the Q-Learning procedure is re-applied. The distance measure defined on agent roles and trust links, in combination with the reward-based Q-value approach, produces a clustering in which every cluster has nodes with the same or similar agent roles [64].
This condition for merging clusters is applied together with the module score approach to obtain the final clustering. The probabilistic trust weight connecting the sub-goal nodes has been described as a macro action for identifying clusters, which are updated using the principles described earlier. A role is associated with interpreting a story theme, topic, subtopic or aspect of a story, which may be derived from an input vector of aspects.
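The greedy merging of adjacent module groups by workflow score can be sketched as follows; the group-score definition used here (negated mean pairwise distance, so lower-distance groups score higher) and all data structures are illustrative assumptions, since the text does not fix the score formula.

```python
# Greedy merge of adjacent module groups by workflow score; the score
# definition and the data are assumptions for illustration.

def group_score(group, dist):
    pairs = [(a, b) for i, a in enumerate(group) for b in group[i + 1:]]
    if not pairs:
        return 0.0
    return -sum(dist[a][b] for a, b in pairs) / len(pairs)

def greedy_merge(groups, dist, target_count):
    groups = [list(g) for g in groups]
    while len(groups) > target_count:
        best = None
        # Try every adjacent pair; keep the merge with the best workflow score.
        for i in range(len(groups) - 1):
            merged = groups[:i] + [groups[i] + groups[i + 1]] + groups[i + 2:]
            score = sum(group_score(g, dist) for g in merged)
            if best is None or score > best[0]:
                best = (score, merged)
        groups = best[1]
    return groups

dist = {"a": {"b": 0.1, "c": 0.9}, "b": {"a": 0.1, "c": 0.8},
        "c": {"a": 0.9, "b": 0.8}}
result = greedy_merge([["a"], ["b"], ["c"]], dist, target_count=2)
```

Here "a" and "b" are close, so the greedy step merges them first.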
An alternative clustering or classification approach is described as follows. The examples associated with a state can be classified into one or more identified categories. The full set of states is subdivided into smaller sets of states to reduce the entropy measure defined on the categories. An action is described on a state with an attribute, which is used to further classify the subset of examples in the state. The information gain functions used are entropy based, Gini-index based and discriminant-power-function based [expected count/(total count)]. The leaf nodes of this decision tree must satisfy the constraint that all activity or task nodes that are members of any leaf cluster node are executed by agents with the same role.
Here, for measuring the applicability of our approach, a module group score can be defined for a group of module element nodes based on the joint-entropy distance [139] separating every pair of nodes (A, B), defined on the derived probabilistic weight p_ABc as discussed in Eq. 16.
This joint-entropy-value based distance measure is
$$ D_{ij} = p_{ij} \cdot \log(p_{ij}) + (1 - p_{ij}) \cdot \log(1 - p_{ij}) $$
(17)
This distance measure is further normalized by the maximum distance value max(D_ij) over the provenance graph model; thus D_ij = D_ij/max(D_ij).
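Eq. 17 and its normalization can be sketched as follows; note the raw value is negative for 0 < p < 1, so this sketch assumes normalization by the absolute maximum (the text leaves the sign convention implicit), and the function names are illustrative.

```python
# Sketch of the joint-entropy distance of Eq. 17, written exactly as in the
# text; normalization by the absolute maximum is an assumption.
import math

def joint_entropy_distance(p):
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log(p) + (1 - p) * math.log(1 - p)   # Eq. 17

def normalized_distances(probs):
    d = {pair: joint_entropy_distance(p) for pair, p in probs.items()}
    d_max = max(abs(v) for v in d.values()) or 1.0
    return {pair: v / d_max for pair, v in d.items()}

dists = normalized_distances({("A", "B"): 0.5, ("A", "C"): 0.9})
```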
Thus the quality of an identified grouping or clustering of nodes in a module group can be represented using this distance measure and the difference from the weighted average distance calculated between every pair of nodes in the identified cluster. If this difference is less than a threshold value, the nodes are considered elements of the same cluster. Adjacent groups of modules can be merged if they satisfy a workflow score constraint, as discussed earlier, for a pre-specified limit on the number of clusters/groups in the workflow.
Rationale for provenance graph structure and classifier learning
Only those events have been considered that contain one or more of the necessary seed terms used to describe a topic of interest to the user. The topic-specific words, which exclude words commonly occurring across topics, have been identified using the procedure described in [83]. The identification of story clusters for snapshot intervals of time, and the procedure for linking these clusters, have been described in [106]. The models thus identified may be merged based on their importance in the description of the story. Here the interpretation of document clustering as representing a model cluster, and the Bayesian probabilistic approach [142] of assigning a document to a model cluster as described in [143], are relevant. An application-driven text data source can be static or dynamic, where a static source implies that the document collection does not have frequent updates, while other text streams can be characterized as having many updates [144]. Topic modelling techniques have also been adapted with respect to the temporal scale for narrowing down events to fine granularity [144]. Hierarchical topics provide an overview of the topics in one corpus [34]. The method suggested in [34] provides a full picture of topics from multiple corpora, which can represent time-updated versions of earlier corpora; here the hierarchical topic models have been merged based on graph matching methods [34], such as graph edit distance [34]. A phrase reinforcement learning approach has been proposed in [145], where a starting phrase represents the topic for which a summary of tweets is to be generated, and the best partial summary corresponds to the selection of the path with the maximum sum of weights along the path [145]. The above description summarizes related work on the merging of topic models and the representation of a summary policy path.
The classifier policy path has been detected from this structure by application of the Q-Learning algorithm for learning a provenance path, as described in an earlier section. The Q-Learning procedure has earlier been described as producing a plan [10] of event or topic executions [74] from the start topic or event [107] to the goal topic or event [107].
Provenance graph model(s) have been constructed for time window(s) determined from the window length [127] and refresh rate [127]. The time windows can also be defined based on the concept of a sliding window [127], with overlaps between the models defined for these time windows.
Integer linear programming, constraint satisfaction, and other emerging set-covering algorithms have been described as relevant for generating a cover set of classifiers for the models. A classifier in a model is compared with all classifiers in every other learned model, and the best match is considered for calculating the significance of a classifier [146]. Each of the significant classifiers present in the minimum cover set and associated with one or more model(s) is compared with all such significant classifiers associated with the other model(s), and a similarity measure between the models is computed from the classifiers present in them using a correlation-ratio measure as described in [147]. Alternatively, this similarity measure can be calculated as the Pearson product-moment correlation. A relevant principle is that the covariance measure defined between a single value and a vector of values is the sum of the covariance measures calculated between the single value and each element of the vector. Alternatively, a distance or similarity measure can be defined between a pair of these models using a graph edit distance measure, as described in [125]. The models are clustered using pairwise similarity or distance measures between model pairs described by their representative classifiers, as described earlier. A hierarchical agglomerative approach [148] can be adopted for this, with merging of the most similar model pairs at every hierarchy level, where a merged model can be derived using the procedures described later in Steps 9.2, 9.3 and 11. The above description summarizes the relevance of covariance- or correlation-based clustering methods to our work.
A partitioning approach can be adopted for this model merging using the k-medoids approach, where a model is selected as representative for a cluster of models; this clustering method is realized with Partitioning Around Medoids (PAM) or Clustering Large Applications (CLARA). The PAM algorithm starts with randomly selected seed models and improves the clustering with a greedy strategy, by randomly selecting a model as a “medoid” that reduces the absolute-error criterion [148] representing the sum of dissimilarities. All models present in a cluster are then merged. Alternatively, models are merged using their property of Markov equivalence, producing an essential model; this has similarities with ideas appearing in [35, 36]. The algorithm described in Step 11 has been utilized to complete the definition of the merged model. PAM, the medoid-based clustering algorithm cited in our work, is less influenced by outliers. Our representation of the record instances for a time interval with a provenance graph model reduces the cluster space and makes application of PAM feasible. For large data sets, a sampling-based method called CLARA can be used, where after sampling the clustering methodology PAM is applied to detect the best “medoid”. In a medoid-based approach to clustering of points, as described in PAM, CLARA or CLARANS, two clusters may be merged based on the farthest distance between two points in the cluster pair, provided the tested merged cluster satisfies a diameter measure less than a certain threshold value. The revised centroid point is identified from the merging of clusters; this centroid point has a maximum distance to a point in the merged cluster whose magnitude is less than a threshold value of the radius measure. The above description summarizes the database clustering methods such as PAM, CLARA and CLARANS that are relevant to our work.
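A tiny PAM-style k-medoids sketch over a model-distance matrix illustrates the absolute-error criterion; for determinism this sketch seeds with the first k models rather than random selection, and the distance matrix is a toy assumption.

```python
# PAM-style k-medoids sketch: accept a swap whenever it reduces the total
# dissimilarity (absolute-error criterion). Deterministic seeding is an
# assumption made for the sketch; PAM proper seeds randomly.

def total_cost(medoids, dist):
    return sum(min(dist[p][m] for m in medoids) for p in dist)

def pam(dist, k):
    points = sorted(dist)
    medoids = points[:k]
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for p in points:
                if p in medoids:
                    continue
                candidate = [p if x == m else x for x in medoids]
                if total_cost(candidate, dist) < total_cost(medoids, dist):
                    medoids, improved = candidate, True
    return sorted(medoids)

# Two clear groups of models: {m1, m2} and {m3, m4}.
dist = {"m1": {"m1": 0, "m2": 1, "m3": 9, "m4": 8},
        "m2": {"m1": 1, "m2": 0, "m3": 9, "m4": 8},
        "m3": {"m1": 9, "m2": 9, "m3": 0, "m4": 1},
        "m4": {"m1": 8, "m2": 8, "m3": 1, "m4": 0}}
medoids = pam(dist, k=2)
```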
Also, for the purpose of merging models, the “medoid” model in a cluster is selected as the initial model. The other models present in the cluster are merged iteratively, such that each intermediate model minimally increases the error defined on a maximum-likelihood measure, as described in [149]. Alternatively, a Bayesian model merging principle is adopted, such that the product of the model prior and the likelihood measure associated with the models has a maximum value [149]. The models can also be ordered based on their relevance to a subject or topic, using the principles of Bayes factors or Bayesian sampling approaches as described in [150], and merged in the appropriate order within a cluster. A unified provenance graph model is thus constructed, where the model unification procedure is applied for merging element models from incrementally acquired information obtained at more recent time intervals, as described earlier. A merging of time series data using the principles of Dynamic Time Warping (DTW) has been described in [151], where the time series pair participating in the merge are optimally aligned using dynamic programming if the lengths of the pair are not the same. A cluster may be represented by time series data that are not similar in nature, and a representative time series is derived from the time series present in the cluster by considering the DTW distance to identify the closest time series [151]. The above description summarizes related work on merging models within a cluster using Bayes factors, or using methods for time series merging based on Dynamic Time Warping.
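The dynamic-programming alignment behind DTW can be sketched generically; this is a textbook DTW distance, not the specific method of the cited work, and the series values are illustrative.

```python
# Generic DTW distance between two series of possibly different lengths,
# computed by dynamic programming; a textbook sketch, not the cited method.

def dtw(a, b):
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip in a
                                 cost[i][j - 1],      # skip in b
                                 cost[i - 1][j - 1])  # match step
    return cost[n][m]

d = dtw([1, 2, 3], [1, 2, 2, 3])   # unequal lengths align with zero cost
```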
A hierarchical agglomerative clustering approach appearing in [11] has been adopted, with merging of the most similar model pairs at every hierarchy level, where a merged model is derived using the procedure described later in Steps 9.2, 9.3 and 11. BOAT uses an attribute selection measure such as the Gini index, which is also used for constructing regression trees.
Boosting is a method of combining an ensemble of classifiers created from weighted versions of the learning sample, where the weights are adjusted at each step to give increased weight to the cases misclassified earlier. Adaptive re-sampling has been identified as the key to its success, with misclassified cases receiving larger weights in the next step. Bagging is applied to larger trees, in contrast to boosting, which works well with stumps or slightly larger trees [152]. The idea of combining an ensemble of classifiers using a weighted version of each, as described in boosting, has an interpretation relevant to our work: earlier steps of learning are given more weightage than those derived later, and the significant classifiers thus derived have higher weightage values. This is analogous to the start nodes describing a topic in the phrase reinforcement learning of a topic description, as described in [145]. The above description summarizes the relevance of boosting over bagging for combining ensemble classifiers.
A reinforcement learning approach has been described as providing a balance between pruning for generalization and growing deeper trees for accuracy [153]. The continuous U-tree algorithm transfers the traditional U-tree algorithm to reinforcement learning, and this U-tree algorithm can be viewed as a regression tree algorithm for storing state values [154]. Regression clustering (CART), with splits satisfying a maximum gain measure as modelled by the Gini coefficient, describes a probabilistic measure modelling the fraction of the points of a predecessor node that are present in one or the other successor node. The above description summarizes the relevance of the U-tree and CART algorithms to our work.
Structural Regression Trees (SRT) integrate the regression method of learning into inductive logic programming [155]. This method, however, produces solutions to relational regression problems that are difficult to understand; it assumes that all features are equally relevant to all parts of the instance space, and it does not make easy use of the domain space [155]. The SRT method has a simple method of tree selection based on the Minimum Description Length (MDL) [113] principle. The MDL algorithm measures the simplicity and accuracy of the theory and the data. The theory description length is derived from the encoding of literals and the encoding of the predicted values in the leaves [155]; the data length is derived from the encoding of errors [155]. A model is selected with the minimum message length, defined as the sum of the theory message length and the data message length of the model [155]. The balance provided by the regression tree approach [153], with its methodology of error-complexity pruning and growing deeper trees for accuracy, has similarity with the MDL-based approach to learning described in [155]. Interaction network summarization has been described with independent topical events that are temporally and topically coherent, where the interaction network is summarized by large events [156]. A collection of k events is selected that maximizes the node coverage, and this task maps to finding the maximum set-cover solution [156]. The above description summarizes related work on the MDL principle and maximum cover-set solutions for representing models.
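The k-event maximum-coverage selection mentioned above is commonly approximated greedily; the sketch below repeatedly picks the event covering the most still-uncovered nodes, with illustrative event contents (this is the standard greedy approximation, not necessarily the exact method of the cited work).

```python
# Greedy approximation to k-event maximum coverage: at each step pick the
# event that covers the most uncovered nodes; data is illustrative.

def greedy_k_cover(events, k):
    covered, chosen = set(), []
    for _ in range(k):
        best = max(events, key=lambda e: len(events[e] - covered), default=None)
        if best is None or not (events[best] - covered):
            break
        chosen.append(best)
        covered |= events[best]
    return chosen, covered

events = {"e1": {1, 2, 3}, "e2": {3, 4}, "e3": {5}}
chosen, covered = greedy_k_cover(events, k=2)
```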