Spatial object priors relying on local objects enable a spatio-temporal localization of unseen actions. However, local objects do not tell the whole story. When a person performs an action, this typically happens in a suitable context. Think about someone
playing tennis. While the tennis racket provides a relevant cue about the action and its location, surrounding objects from context, such as tennis court and tennis net, further reinforce the action likelihood. Here, we add three additional object priors to integrate knowledge from global objects for unseen action classification and localization. We start from the common word embedding setup for semantic matching, which we extend with three simple priors that make for effective unseen action matching with global objects. Lastly, we outline how to integrate the semantic and spatial object priors for unseen actions. Figure
2 illustrates our proposal.
4.2 Priors for Ambiguity, Discrimination, and Naming
Similar to the common word embedding setup, for a video
\(v \in {\mathcal {V}}\), we seek to obtain a score for action
\(a \in {\mathcal {A}}\) using a set of global objects
\({\mathcal {G}}\). Global objects generally come from deep networks (Mettes et al.
2020) pre-trained on large-scale object datasets (Deng et al.
2009). We build upon current semantic matching approaches by providing three simple priors that deal with semantic ambiguity, non-discriminative objects, and object naming.
Object prior IV (semantic ambiguity prior) A zero-shot likelihood estimation of action a in video v benefits from minimal semantic ambiguity between a and global objects \({\mathcal {G}}\).
The score of a target action depends on its semantic relations to source objects. However, semantic relations can be ambiguous, since words can have multiple meanings depending on the context. For example, for the action
kicking, an object such as
tie is deemed highly relevant, because one of its meanings is a draw in a football match (Mettes and Snoek
2017). However, a
tie can also denote an entirely different object, namely a
necktie. Such semantic ambiguity may lead to the selection of irrelevant objects for an action.
To combat semantic ambiguity in the selection of objects, we consider two properties of object coherence across languages (Malt
1995). First, most object categories are common across different languages. Second, the formation of some categories can nevertheless differ among languages. We leverage these two properties of object coherence across languages by introducing a multi-lingual semantic similarity. For computing multi-lingual semantic representations of words at a large scale, we are empowered by recent advances in the word embedding literature, where embedding models have been trained and made publicly available for many languages (Grave et al.
2018). In a multi-lingual setting, let
L denote the total number of languages to use. Furthermore, let
\(\tau _{l}(g)\) denote the translator for language
\(l \in L\) applied to object
g. Multi-lingual unseen action classification can then be done by simply updating the semantic matching function to:
$$\begin{aligned} \varPsi _L(g, a) = \frac{1}{L} \sum _{l=1}^{L} \cos (\phi _l(\tau _{l}(g)), \phi _l(\tau _{l}(a))), \end{aligned}$$
(12)
where
\(\phi _l\) denotes the semantic word embedding of language
l. The multi-lingual semantic similarity states that for a high semantic match between object and action, the pair should be of a high similarity across languages. In this manner, accidental high similarity due to semantic ambiguity can be addressed, as this phenomenon is factored out over languages.
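As a minimal sketch, the multi-lingual matching of Equation 12 can be implemented as below. The embedding tables and translator functions are toy stand-ins for the pre-trained per-language word embeddings and dictionaries used in practice:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def multilingual_match(obj, action, embeddings, translators):
    """Average the object-action cosine similarity over all languages (Eq. 12).

    embeddings[l] maps a (translated) word to its vector in language l;
    translators[l] maps an English word to its translation in language l.
    """
    sims = [
        cosine(embeddings[l][translators[l](obj)],
               embeddings[l][translators[l](action)])
        for l in embeddings
    ]
    return sum(sims) / len(sims)

# Toy two-language example with hand-made 2-d vectors.
emb = {
    "en": {"racket": np.array([1.0, 0.0]), "tennis": np.array([0.9, 0.1])},
    "fr": {"raquette": np.array([0.8, 0.2]), "tennis": np.array([0.9, 0.1])},
}
trans = {
    "en": lambda w: w,
    "fr": lambda w: {"racket": "raquette", "tennis": "tennis"}[w],
}
score = multilingual_match("racket", "tennis", emb, trans)
```

A pair that is only accidentally similar in one language (such as tie and kicking) would receive a low similarity in the other languages, pulling the average down.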
Object prior V (object discrimination prior) A zero-shot likelihood estimation of action a in video v benefits from knowledge about which objects in \({\mathcal {G}}\) are suitable for action discrimination.
The second semantic prior is centered around finding discriminative objects. Only using semantic similarity to select objects ignores the fact that an object can be non-discriminative, despite being semantically similar. For example, for the action diving, the objects person and diving board might both correctly be considered as semantically relevant. The object person is however not a strong indicator for the action diving, as this object is present in many actions. The object diving board on the other hand is a distinguishing indicator, as it is not shared by many other actions.
To incorporate an object discrimination prior, we take inspiration from object taxonomies. When organizing such taxonomies, care must be taken to convey the most important and discriminant information (Murphy
2004). Here, we are searching for the most unique objects for actions,
i.e., objects with low inclusivity. It is desirable to select indicative objects, rather than focus on objects that are shared among many actions. To do so, we propose a formulation to predict the relevance of every object for unseen actions. We extend the action-object matching function as follows:
$$\begin{aligned} \varPsi _r(g, a) = \varPsi (g, a) + r(g, \cdot , a), \end{aligned}$$
(13)
where
\(r(g, \cdot , a)\) denotes a function that estimates the relevance of object
g for the action
a. We propose two score functions. The first penalizes objects that are not unique for an action
a:
$$\begin{aligned} r_a(g, A, a) = \varPsi (g, a) - \max _{c \in A \setminus a} \varPsi (g, c). \end{aligned}$$
(14)
An object
g scores high if it is relevant for action
a and for no other action. If either of these conditions is not met, the score decreases, which negatively affects the updated matching function.
The second score function solely uses inter-object relations for discrimination and is given as:
$$\begin{aligned} r_o(g, {\mathcal {G}}, a) = \varPsi (g, a) - \frac{1}{|{\mathcal {G}}|} \sum _{g'\in {\mathcal {G}} \setminus g} \varPsi (g, g')^{\frac{1}{2}}. \end{aligned}$$
(15)
Intuitively, this score function promotes objects that have an intrinsically high uniqueness across the set of objects, regardless of their match to actions. The square root normalization is applied to reduce the skewness of the object set distribution.
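Both relevance functions, Equations 14 and 15, can be sketched as follows, under the assumption that the semantic match \(\varPsi \) is available as a plain similarity lookup; the table below is hand-made for illustration:

```python
def r_action(psi, g, actions, a):
    """Eq. 14: object g scores high if it is relevant for action a
    and for no other action."""
    return psi(g, a) - max(psi(g, c) for c in actions if c != a)

def r_object(psi, g, objects, a):
    """Eq. 15: promote objects with low (square-root-normalized)
    similarity to the rest of the object set."""
    others = [o for o in objects if o != g]
    return psi(g, a) - sum(psi(g, o) ** 0.5 for o in others) / len(objects)

# Toy similarities: "diving board" is unique to diving, "person" is shared.
sims = {
    ("person", "diving"): 0.6, ("person", "running"): 0.6,
    ("diving board", "diving"): 0.8, ("diving board", "running"): 0.1,
    ("diving board", "person"): 0.2, ("person", "diving board"): 0.2,
}
psi = lambda x, y: sims[(x, y)]
actions = ["diving", "running"]
objects = ["person", "diving board"]

unique = r_action(psi, "diving board", actions, "diving")  # 0.8 - 0.1 = 0.7
shared = r_action(psi, "person", actions, "diving")        # 0.6 - 0.6 = 0.0
intrinsic = r_object(psi, "diving board", objects, "diving")
```

As in the example from the text, the discriminative diving board receives a higher relevance for diving than the ubiquitous person.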
Object prior VI (object naming prior) A zero-shot likelihood estimation of action a in video v benefits from a bias towards basic-level object names.
The third semantic prior concerns object naming. The matching function between actions and objects relies on the object categories in the set \({\mathcal {G}}\). The way objects are named and categorized influences their matching score with an action. For example, for the action walking with a dog, it would be more relevant to simply name the object present in the video a dog rather than a domesticated animal or an Australian terrier. Indeed, the dog naming yields a higher matching score with the action walking with a dog than the too generic domesticated animal or too specific Australian terrier namings.
As is well known, there exists a preferred entry level of abstraction for naming objects in linguistics (Jolicoeur et al.
1984; Rosch et al.
1976). The basic-level naming (Rosch et al.
1976; Rosch
1988) is a trade-off between superordinates and subordinates. Superordinates concern broad category sets, while subordinates concern very fine-grained categories. Hence, basic-level categories are preferred because they convey the most relevant information and are discriminative from one another (Rosch et al.
1976). It would then be valuable to emphasize basic-level objects rather than objects from other levels of abstraction. Here, we enforce such an emphasis by using the relative WordNet depth of the objects in
\({\mathcal {G}}\) to weight each object. Intuitively, the deeper an object is in the WordNet hierarchy, the more specific the object is and vice versa. To perform the weighting, we start from the beta distribution:
$$\begin{aligned} \text {Beta}(d | \alpha , \beta ) =&\frac{d^{\alpha -1} \cdot (1-d)^{\beta -1}}{B(\alpha ,\beta )},\nonumber \\ \quad B(\alpha ,\beta ) =&\frac{\varGamma (\alpha ) \cdot \varGamma (\beta )}{\varGamma (\alpha +\beta )}, \end{aligned}$$
(16)
where
d denotes the relative depth of an object and
\(\varGamma (\cdot )\) denotes the gamma function. Different values for
\(\alpha \) and
\(\beta \) determine which levels to focus on. For a focus on the basic level, we want to weight objects at intermediate depths higher and the most specific and generic objects lower. We can do so by setting
\(\alpha = \beta = 2\). Setting
\(\alpha = \beta = 1\) results in the common setup where all objects are equally weighted. We incorporate the object weights by adjusting the semantic similarity function between objects and actions.
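The depth-based weighting of Equation 16 can be sketched as follows; the normalization constants (minimum depth 2, maximum depth 18) follow the WordNet depth range used in the combined formulation:

```python
import math

def beta_pdf(d, alpha, beta):
    """Beta density of Eq. 16 over the relative WordNet depth d in [0, 1]."""
    B = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return d ** (alpha - 1) * (1 - d) ** (beta - 1) / B

def relative_depth(depth, d_min=2, d_max=18):
    """Normalize an absolute WordNet depth to [0, 1]."""
    return (depth - d_min) / (d_max - d_min)

# alpha = beta = 2 peaks at intermediate (basic-level) depths.
w_basic = beta_pdf(relative_depth(10), 2, 2)     # mid-level object, e.g. dog
w_generic = beta_pdf(relative_depth(4), 2, 2)    # near-root, generic object
w_specific = beta_pdf(relative_depth(17), 2, 2)  # deep, very specific object
```

Note that `beta_pdf(d, 1, 1)` returns 1 for any depth, recovering the uniform weighting of the common setup.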
Combined semantic priors We combine the three semantic object priors into the following function of global objects for unseen actions:
$$\begin{aligned} \ell _\text {video}(v, a) = \sum _{g \in {\mathcal {G}}_a}&((\varPsi _L(g, a) + r(g,\cdot ,a)) \cdot \nonumber \\&\text {Beta}(d_g | \alpha , \beta )) \cdot Pr(g|v), \end{aligned}$$
(17)
where
\(d_g\) denotes the depth of object
g, [0,1] normalized based on the minimum WordNet depth (2) and maximum WordNet depth (18) over all objects in
\({\mathcal {G}}\). In this formulation, the proposed embedding is more robust to semantic ambiguity, non-discriminative objects, and non-basic level objects compared to Equation
10.
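Putting the three priors together, the combined scoring of Equation 17 can be sketched as below; the similarity, relevance, depth, and object-likelihood inputs are toy stand-ins for the quantities defined above:

```python
import math

def beta_pdf(d, alpha, beta):
    """Beta density of Eq. 16 over the relative WordNet depth d in [0, 1]."""
    B = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return d ** (alpha - 1) * (1 - d) ** (beta - 1) / B

def video_score(action, objects, psi_L, r, depth, obj_prob, alpha=2.0, beta=2.0):
    """Eq. 17: score an action for a video by summing, over the selected
    objects, the multi-lingual match plus relevance, weighted by the
    basic-level prior and the object likelihood Pr(g|v)."""
    return sum(
        (psi_L(g, action) + r(g, action))
        * beta_pdf(depth[g], alpha, beta)
        * obj_prob[g]
        for g in objects
    )

# Toy inputs for the action "tennis" with two selected objects.
psi_L = lambda g, a: {"racket": 0.9, "net": 0.7}[g]   # Eq. 12 match
r = lambda g, a: 0.1                                   # Eq. 14/15 relevance
depth = {"racket": 0.5, "net": 0.5}                    # relative WordNet depth
obj_prob = {"racket": 0.6, "net": 0.4}                 # Pr(g|v) from the video
score = video_score("tennis", ["racket", "net"], psi_L, r, depth, obj_prob)
```

With \(\alpha = \beta = 1\) the depth weights reduce to one and the score falls back to the standard semantic matching setup.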