Introduction
In many areas of applied network science, researchers are interested in studying the outcomes of particular networked processes: the spread of disease, the development of consensus, the movement of people, etc. When this is the case, domain-specific research questions often center around the process rather than the network that it is unfolding over (Bockholt and Zweig 2020). In the context of such questions, it can be difficult to interpret the results of out-of-the-box network analyses (Borgatti 2005). What we need are techniques that keep the focus of analysis on the networked process itself (Lambiotte et al. 2018; Schwarze and Porter 2020; Xu et al. 2016). Here, we take a process-driven approach towards analyzing observational data about networked walk processes with the goal of devising an approach that can answer relevant domain-specific research questions.
We focus on two specific real-world walk processes: a ball passed among players during matches within seven professional football competitions and e-money transacted among mobile wallets over a single mobile money service. Association football is a hugely popular sport and data-rich analytics of sports is of growing interest (Kuper 2011; Sarmento et al. 2014). Researchers and analysts might like to know if classic findings in sports science, such as how 80% of goals are scored from short possessions, replicate using detailed spatio-temporal match data available for recent competitions (Hughes and Franks 2005; Reep and Benjamin 1968; Reep et al. 1971). As predominant styles of play have moved away from “long ball” strategies, coaches might like to know the extent to which teams benefit from developing complex multi-player tactics (Schoenfeld 2019).
With regard to the second networked process considered in this paper, mobile money is a new financial industry that has expanded rapidly across Africa, South Asia, and Southeast Asia since 2007 (GSMA Mobile Money 2015b; Suri 2017). Mobile money providers host e-money accounts and process digital transactions on behalf of users over the cellular infrastructure, which is more widely available than traditional banking infrastructure in many areas. Mobile money providers and proponents of financial inclusion are, for example, interested in understanding how mobile money systems are used (International Finance Corporation and Mastercard Foundation 2018; Stuart and Cohen 2011), to what extent e-money is re-used (Athique 2019; Kendall et al. 2011), and for how long e-money is saved (Blumenstock et al. 2015; Demombynes and Thegeya 2012).
The data recorded about these processes takes the form of timestamped events in both cases (Blumenstock et al. 2016; Economides and Jeziorski 2017; Pappalardo et al. 2019; Sarmento et al. 2014). These events are football passes and financial transactions, respectively. Individual football players (account holders) initiate and receive near-instantaneous passes (transactions) in continuous time. While we could choose to interpret each event as a link in a temporal network (Aslak et al. 2018; Holme and Saramäki 2012; Rocha and Masuda 2014; Taylor et al. 2017), it is unclear how this would provide answers to the questions posed above. Instead, we propose to consider each event as a record of the movement of something tangible: football passes move the ball, financial transactions move money.
There are, however, no established techniques for analyzing event data that records steps in a real-world walk process. So we first identify three ways that we would want our technique to engage with domain knowledge about particular processes. First, the method should be interpretable in light of the integrity constraints inherent to walk processes. Players cannot kick the ball unless they have it, and bookkeeping protocols prohibit accounts from spending money they do not have. Second, we want an approach that retains the meaningful sequential information that is implicit in the ordering of event data. Players hold onto the ball, and accounts hold onto funds, for some period of time between sequential events. Finally, we would like to incorporate contextual knowledge about the ways in which real-world processes are in fact bounded, i.e., there are specific events that begin, end, or re-start the process, such as fouls and throw-ins in football or deposits and withdrawals in mobile money.
We propose to extract and analyze the trajectories taken by those tangible items whose movements are recorded in the event data. Extracting trajectories can be done by tracing the same football (or the same e-money) across sequences of observed events in a systematic way. In both cases we must take care to define the bounds according to the rules governing the process. Tracing the single football is then relatively straightforward. Tracing funds is more involved, in particular because there are no unique identifiers on e-money as there are on paper bills. This weighted setting also requires an informed choice about how to allocate funds to particular trajectories where the allocation is otherwise ambiguous. Once extracted, trajectories can be analyzed to answer research questions centered around the walk process itself. In this paper, we propose a systematic approach for extracting trajectories from both unweighted and weighted processes.
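To make the unweighted case concrete, the following is a minimal sketch of tracing a single ball through time-ordered events. It is an illustration under stated assumptions, not the algorithm given in the “Methods” section: the `Event` record, the `BOUNDARY_KINDS` set, and the rule that an unexplained change of holder closes a trajectory are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time: float   # seconds since kickoff (hypothetical unit)
    source: str   # player who acts
    target: str   # receiving player; empty for boundary events
    kind: str     # "pass", or a boundary kind such as "shot" or "out"


# Hypothetical set of event kinds that end a trajectory.
BOUNDARY_KINDS = {"shot", "foul", "out"}


def extract_trajectories(events):
    """Trace one ball through time-ordered events, returning lists of players."""
    trajectories = []
    current = []  # players visited by the ball in the open trajectory
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind in BOUNDARY_KINDS:
            # A boundary event closes whatever trajectory is open.
            if current:
                trajectories.append(current)
            current = []
        elif current and current[-1] == ev.source:
            # The current holder passes: the trajectory continues.
            current.append(ev.target)
        else:
            # The passer did not verifiably hold the ball: start afresh.
            if current:
                trajectories.append(current)
            current = [ev.source, ev.target]
    if current:
        trajectories.append(current)
    return trajectories


events = [
    Event(1.0, "A", "B", "pass"),
    Event(2.0, "B", "C", "pass"),
    Event(3.0, "C", "", "shot"),
    Event(5.0, "D", "E", "pass"),
    Event(6.0, "E", "", "out"),
]
trajectories = extract_trajectories(events)
```

Here the two passes before the shot form one trajectory and the pass before the ball leaves play forms another. A real definition of possession would need many more boundary rules than this toy set.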
Our work highlights four benefits of extracting and analyzing trajectories, each of which lets us produce a result of relevance to association football or mobile money.
First, trajectories are a particularly useful and interpretable structure because they relate directly to concepts that are already well-researched. Since at least the 1960s, researchers in sports science have studied possessions in association football; these are passing sequences with particular criteria for delineating how they begin and end (Reep and Benjamin 1968; Reep et al. 1971). We adapt the definition laid out in Hughes and Franks (2005) to trace out trajectories and produce a dataset of possessions from the 2018 FIFA World Cup that is directly comparable to theirs from the 1990 and 1994 FIFA World Cups, albeit more data-driven. Using our transparent trajectory extraction approach we reproduce their findings that over 80% of goals were scored from “short” possessions with three or fewer completed passes, and that longer passing sequences produced proportionately more shots.
Second, the pattern of event attributes along the sequence of events in a trajectory can be contextually meaningful. Trajectory extraction surfaces such sequential patterns from the data and these can be used to neatly summarize the observed process. Many stand-alone use cases of mobile money involve making more than one transaction in sequence, e.g., paying a bill would mean making a cash deposit followed by a digital bill payment (Economides and Jeziorski 2017; GSMA Mobile Money 2015b; Mbiti and Weil 2013). We find that 73% of the e-money moving through this system follows a pattern that corresponds to one of several well-defined, stand-alone, use cases. Only 19.7% of e-money was re-used within the data collection window. This means that e-money is primarily single-use, in practice, even though it could be re-transacted indefinitely with little cost (and substantial benefit) to the provider.
Third, trajectories detail the location of tangible items between events. In the context of mobile money, this means that we can quantify the extent to which accounts use e-money for saving. “Saving” as we intuitively understand it requires building up a balance wherein some of the money entering an account remains there, undisturbed, for an appreciable length of time. We find that 21.7% of active users of this mobile money system succeeded in saving at least 5% of inflows for over 30 days at one point or another. A much larger fraction save trivial amounts for substantial periods of time and very few save larger amounts.
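One way to see how holding times fall out of a weighted trace is the sketch below, which follows the funds of a single account. It assumes a first-in-first-out allocation of outflows to earlier inflows; that rule is a choice made for this illustration and may differ from the allocation heuristics used in the paper. The function names, the signed-amount encoding, and the day-based time unit are likewise hypothetical.

```python
from collections import deque


def holding_durations(events, end_time):
    """Split one account's funds into (amount, duration held) chunks.

    `events` are time-ordered (time, amount) pairs, where a positive amount
    is an inflow and a negative amount is an outflow. Outflows are matched
    to the oldest remaining inflows (FIFO, an assumption of this sketch).
    """
    queue = deque()  # entries are [arrival_time, remaining_amount]
    held = []
    for time, amount in events:
        if amount > 0:
            queue.append([time, amount])
            continue
        to_spend = -amount
        while to_spend > 0:
            if not queue:
                raise ValueError("outflow exceeds the traced balance")
            arrival, remaining = queue[0]
            used = min(remaining, to_spend)
            held.append((used, time - arrival))
            to_spend -= used
            if used == remaining:
                queue.popleft()
            else:
                queue[0][1] = remaining - used
    # Funds still in the account are held at least until the window ends.
    for arrival, remaining in queue:
        held.append((remaining, end_time - arrival))
    return held


def share_held(held, min_duration):
    """Fraction of all inflows that stayed in the account for min_duration or more."""
    total = sum(amount for amount, _ in held)
    kept = sum(amount for amount, duration in held if duration >= min_duration)
    return kept / total if total else 0.0


events = [(0, 100), (2, -60), (10, 50), (40, -70)]  # times in days
held = holding_durations(events, end_time=45)
share = share_held(held, min_duration=30)
```

Durations for funds still in the account when the window closes are censored, so they understate the true holding time.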
Finally, extracted trajectories can serve as the input for a suite of existing computational approaches for trajectory-based network analysis (LaRock et al. 2020; Peixoto and Rosvall 2017; Rosvall et al. 2014; Scholtes 2020). It is possible, for instance, to parametrize the Markov order of a real-world walk process (Scholtes 2017). In the context of association football, “second-order” passing processes correspond to complex multi-player dynamics where the next pass reliably depends both on who has the ball and from whom that player received the ball. We find that only a select group of very successful professional club football teams played with consistent second-order passing dynamics in the 2017–2018 season. This includes the four top-ranked teams in England’s Premier League, the six top-ranked teams in Italy’s Serie A, as well as the champions of the Spanish La Liga, the German Bundesliga, and the French Ligue 1.
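The distinction between first-order and second-order dynamics can be illustrated with a small sketch that fits maximum-likelihood Markov models of each order to a set of trajectories and compares their in-sample log-likelihoods. This is not the model-selection procedure of the cited work. The function name and the `first_scored` parameter are hypothetical, and a real analysis would add a penalty or statistical test, because the higher-order model can never fit worse in-sample.

```python
import math
from collections import Counter


def log_likelihood(trajectories, order, first_scored=2):
    """In-sample log-likelihood of a maximum-likelihood Markov model.

    Only steps from index `first_scored` onward are scored, so that models
    of order 1 and order 2 are evaluated on exactly the same transitions.
    """
    if order > first_scored:
        raise ValueError("order must not exceed first_scored")
    context_counts = Counter()
    transition_counts = Counter()
    for path in trajectories:
        for i in range(first_scored, len(path)):
            context = tuple(path[i - order:i])  # the last `order` holders
            transition_counts[(context, path[i])] += 1
            context_counts[context] += 1
    return sum(
        n * math.log(n / context_counts[context])
        for (context, _), n in transition_counts.items()
    )


# Toy passing data: whom B passes to depends on who passed to B.
paths = [["A", "B", "C"], ["D", "B", "E"]] * 2
ll_first = log_likelihood(paths, order=1)
ll_second = log_likelihood(paths, order=2)
```

In this toy data the second-order model predicts every pass with certainty, while the first-order model treats B's next pass as a coin flip.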
The remainder of the paper is structured as follows. In the “Theory and related work” section, we review related approaches and discuss what we gain by taking a process-driven approach. This section details the network theory behind how we observe and study real-world walk processes on networks. The “Data” section describes the specific datasets analyzed in this paper and key ancillary details about the two processes. The “Methods” section introduces trajectory extraction and various ways to analyze the resulting sets of trajectories. This section details the methodology behind our work in the form of the algorithm and its computational complexity. In the “Results” section, we apply our approach to answer four domain-specific research questions. The “Conclusion” section concludes.
Theory and related work
In this section, we first note specific issues that would arise if we were to consider football passes or financial transactions as links in a temporal network. We then discuss random walks on networks, real-world walk processes, and two distinctions that can be made regarding how real-world walk processes are observed. Records of football passes and financial transactions let us observe events, or “steps”, in these two real-world walk processes as they unfold over networks that we do not observe.
Temporal networks
To analyze observational data on association football or mobile money, it would be simple to interpret each pass or transaction as a link in a temporal network. Temporal network analysis is a well-developed approach with many established techniques and available computational tools (Holme and Saramäki 2012, 2019; Lambiotte and Masuda 2016; Paranjape et al. 2017). In our particular cases, however, the most common temporal network analysis techniques would involve considerable simplification of the underlying data on passes and transactions.
Existing temporal network analysis techniques do not reflect the substantive context in which this data is generated. Time-aggregation into a static network does not capture the fact that players and account holders interact with one another almost instantaneously over a continuous period of time. Temporal network techniques that use sequences of network snapshots (Rocha and Masuda 2014; Taylor et al. 2017), or multilayer networks (Aslak et al. 2018), likewise do not help us make sense of hundreds of football passes, or hundreds of millions of financial transactions, happening one at a time. At the same time, temporal network analysis techniques that treat each link separately (e.g., motif counting, subgraph matching, and reachability analysis: Badie-Modiri et al. 2020; Boekhout et al. 2019; Bogdanov et al. 2011; Jazayeri and Yang 2020; Kovanen et al. 2011; Locicero et al. 2021; Paranjape et al. 2017; Petrovic and Scholtes 2019) do not account for the inherently sequential dependencies in how passes and transactions come to be.
Players must receive the ball to pass the ball, and accounts must have money to spend money. This can make it difficult to interpret the outputs even of basic temporal network analysis methods (as in: Holme and Saramäki 2012). There are many time-respecting paths through a temporal network of football passes, but in practice the ball follows only a single one. Ambiguity in how paths should be derived from networked processes makes it difficult to interpret the outputs of centrality measures and similar methods that are based on time-respecting paths (see: Saramäki and Holme 2015). Football matches also happen under a very peculiar set of rules: inter-contact times computed on 2018 FIFA World Cup match-event data would include water breaks, but only for matches played at over 32 °C (Earls 2019; Houssein et al. 2016). Such minutiae would then muddle output metrics. As an added complication, financial transactions are weighted in a way that one cannot ignore. Transactions raise or lower a node’s account balance by sometimes drastically different amounts, so paths through a node, inter-event times at a node, and motifs involving a node are also, in some sense, weighted.
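The gap between the many time-respecting paths and the one route the ball actually takes can be seen in a toy example. The sketch below counts time-respecting event sequences between two players with a simple quadratic-time dynamic program. The function name and the `(time, sender, receiver)` tuple layout are assumptions made for illustration, and sequences that revisit a player are counted.

```python
def count_time_respecting_paths(events, source, target):
    """Count event sequences from `source` to `target` with increasing times.

    `events` are (time, sender, receiver) tuples. Consecutive events in a
    sequence must chain (receiver of one is sender of the next) and have
    strictly increasing timestamps.
    """
    ordered = sorted(events)  # tuples sort by time first
    paths_ending_with = []    # number of sequences ending with each event
    for i, (time, sender, _) in enumerate(ordered):
        count = 1 if sender == source else 0
        for j in range(i):
            earlier_time, _, earlier_receiver = ordered[j]
            if earlier_receiver == sender and earlier_time < time:
                count += paths_ending_with[j]
        paths_ending_with.append(count)
    return sum(
        count
        for count, (_, _, receiver) in zip(paths_ending_with, ordered)
        if receiver == target
    )


# Four passes that respect integrity: the ball goes A, B, A, B, C.
passes = [(1, "A", "B"), (2, "B", "A"), (3, "A", "B"), (4, "B", "C")]
n_paths = count_time_respecting_paths(passes, "A", "C")
```

Even with only four passes there are three time-respecting sequences from A to C, although the ball itself traverses just one of them, namely all four passes in order.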
Walk processes on networks
Footballs and money are tangible things, and walk processes are networked processes that correspond to the movement of tangible things. Random walks have long been used as a way to explore and quantify the structure of networks; they are a pillar of network science methodology. PageRank was developed to simulate the movement of a “surfer” who moves from page to page through a hyperlink network, randomly and with probabilistic re-starts (Page et al. 1999). Infomap finds sub-network structure by minimizing the average number of bits needed to describe one step in a random walk on the network (Rosvall and Bergstrom 2008). A set of other commonly-used network analysis techniques assume the dynamics of a walk process, more or less explicitly (Backstrom and Leskovec 2010; Fouss et al. 2007; Kloumann et al. 2017; Newman 2005).
Walk processes themselves can be weighted or unweighted, discrete or continuous, node-centric or edge-centric, and active or passive according to a taxonomy by Masuda et al. (2017). Football passing processes and financial transaction processes both operate in continuous time; transactions are weighted while passes are not. The authors define node-centric processes as those where the dynamics of the process are defined in terms of the nodes. Players kick the ball. Accounts spend money. Active walk processes are those where “walkers” are agents stepping through the network of their own volition. In our case each pass in football is a “step” for the ball, and each financial transaction is a “step” for a certain amount of money, but neither footballs nor sums of money have agency in any sense. The processes in this study are thus examples of otherwise elusive node-centric, passive walk processes.
Real-world walk processes
It remains relatively uncommon to model and simulate real-world walk processes on networks. Examples with some presence in the literature include travellers and goods in transit (Heath et al. 2008; LaRock et al. 2020; Peixoto and Rosvall 2017; Xu et al. 2016), packets routed over the internet (Ash 1997; Echenique et al. 2004; Fronczak and Fronczak 2009), and users surfing the web (Borges and Levene 2007; Chierichetti et al. 2012; Page et al. 1999; Xu et al. 2016). This work establishes two additional real-world examples: the passing process during football matches and the transaction process among financial accounts within a payment system. Here we consider two key features common across each of these real-world walk processes.
First, real-world walk processes maintain their integrity in practice and often occur within systems that are highly engineered to this end. Process integrity refers to the tendency of tangible items to stay where they are placed and not suddenly multiply or disappear. This is largely trivial for processes involving passengers, goods, footballs, or other physical items. Even so, there may be an authority overseeing the system who is able to intervene and fix glitches. Football matches are presided over by a team of referees who would quickly interrupt the match if a second ball were to come onto the field. Many important real-world walk processes rely on digital protocols to keep track of digital items. Packets are routed over the Internet using TCP/IP and related protocols; these have safeguards against packet loss and duplication (Forouzan 2002). Bookkeeping protocols can be decentralized (cash), centralized (checking), or algorithmic (blockchain). Payment system providers have a very strong incentive to ensure their bookkeeping is accurate, because they themselves end up on the hook for wayward funds. Exceptions to this rule are extraordinary. For example, the president and chief executive of Liberty Bank in the United States chose to allow large ATM withdrawals in the aftermath of Hurricane Katrina, for humanitarian reasons, although its flooded systems were unable to verify account balances at the time (Rivlin 2015).
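As a rough illustration of what process integrity means for transaction records, the sketch below replays a time-ordered list of transactions and flags any that would overdraw the sender. The function name, the tuple layout, and the `"EXTERNAL"` label for sources of deposits are hypothetical and not drawn from the datasets described in this paper.

```python
from collections import defaultdict


def integrity_violations(transactions, opening_balances=None,
                         exempt=("EXTERNAL",)):
    """Return transactions in which the sender spends more than it holds.

    `transactions` are time-ordered (time, sender, receiver, amount) tuples.
    Accounts named in `exempt` stand in for the outside world (for example
    the source of deposits) and may go negative.
    """
    balances = defaultdict(float, opening_balances or {})
    violations = []
    for txn in transactions:
        _, sender, receiver, amount = txn
        if sender not in exempt and balances[sender] < amount:
            violations.append(txn)
        # Replay the transfer so that later checks see updated balances.
        balances[sender] -= amount
        balances[receiver] += amount
    return violations


transactions = [
    (1, "EXTERNAL", "a", 100),  # deposit into account a
    (2, "a", "b", 60),
    (3, "a", "c", 50),          # a holds only 40 at this point
    (4, "b", "a", 10),
]
violations = integrity_violations(transactions)
```

In observational data an apparent violation need not indicate a bookkeeping failure. It may simply reflect funds that the account already held before the data collection window began, which is why opening balances are an explicit parameter here.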
Second, real-world walk processes are rarely, if ever, entirely self-contained. They are bounded in a way that is determined entirely by the real-world context. There may be complicated rules that begin and end walks, or related processes that create and destroy “walkers”. These are conceptually distinct from the walk process itself and often substantively important. For traffic flow it matters greatly where people live and work. For money flow it matters greatly how people deposit and withdraw. Association football has very specific rules for when the ball enters and exits play, which are enforced (again) by the team of referees.
Observing walk processes on networks
Observational data about walk processes on networks can take many forms. Complete data would include information about the network structure underlying the process, the dynamics of this particular process, and the actual volumes involved. Most forms of data thus convey only partial information about a real-world walk process or do so piecemeal. The structure of the data is what determines which aspects of a walk process are directly incorporated, and which are left to be found, assumed, or inferred separately.
We systematically categorize different types of observational data about walk processes on networks in Table 1. Very often, data collection focuses on the network structure over which the process unfolds (Table 1, top row). In some cases, one can directly observe the relevant links, like roads (Hu et al. 2007; OpenStreetMap contributors 2017; Zhan and Noon 1998) or submarine fiber-optic cables (TeleGeography 2020). Such network data leaves the dynamics of the process implicit, for the researcher to define separately. In other cases one actually defines process dynamics, explicitly, in order to query the network structure. Web crawlers (Thelwall 2002), tools such as traceroute (Cisco 2006), and transit apps (Kujala et al. 2018) give path data about the network underlying the processes they parrot. In both cases, the researcher would need to incorporate empirical data on volumes to get a complete view of the process.
Table 1
Examples of observational data used to study walk processes on networks
| | Implicit | Explicit |
| Network | Network data | Path data |
| | Transit network | Transit routes |
| | Internet connections | traceroute output |
| | Hyperlink network | Web crawler output |
| | Football passing network | Hypothetical plays |
| | Payment networks | Hypothetical flows |
| Process | Event data | Trajectory data |
| | Vessel manifests | Travel itineraries |
| | Router-based logs | Packet-based logs |
| | Hyperlink clicks | User click-streams |
| | Football match events | Passing sequences |
| | Transaction records | Flows of money |
Data can also be collected about walk processes themselves (Table 1, bottom row). This is often done in the form of timestamped events, such as airline flights (Guimerà et al. 2005) or hyperlink clicks (Dimitrov et al. 2017; Joachims 2002). Event data is similar to network data in that the dynamics of the walk process (for example, that arriving passengers either transfer to a later flight or leave the airport) are implicit and would need to be handled separately. In some cases, however, it is possible to observe individual “walkers” as the process they are a part of unfolds. Passenger itineraries (LaRock et al. 2020; Xu et al. 2016) and user click-streams (Chierichetti et al. 2012; Paranjape et al. 2016; Scholtes 2017) are examples of such trajectory data. Trajectory data fully incorporates both the dynamics and the volume of the networked process, giving an exceptionally detailed observational account.
Transit processes are worth highlighting because all four combinations are well represented in the literature. Road networks are readily observable and used to study transit by car (Hu et al. 2007; OpenStreetMap contributors 2017; Zhan and Noon 1998). It is understood, implicitly, that road networks are used by individual cars that behave as tangible objects moving from their origin to their destination. Models of traffic flow take this into account, and generally supplement the observed network data with origin/destination records or measurements of traffic flow (Toole et al. 2015; Iqbal et al. 2014; Çolak et al. 2016). The movement of passengers via public transportation can be studied using the schedules of trains and buses. This data structure makes explicit the connections that would need to be made by individual passengers along each possible path and the associated travel times (Kujala et al. 2018). Even so, hypothetical path data must be supplemented with information on the actual usage of different routes (Sánchez-Martínez 2017). Data can also be collected about transit processes themselves, as in the case of passengers travelling by air (Guimerà et al. 2005). Flight manifests directly record distinct events in the transit process. But the fact that some passengers remain where they arrive, some travel onward, and none take two departing flights remains implicit within this data structure. Data in the form of individual travel itineraries sidesteps the issue by making process dynamics explicit (LaRock et al. 2020; Xu et al. 2016).
In this section we have presented a systematic categorization of observational data on real-world walk processes over networks. In the “Methods” section we present a method for extracting trajectory data from event data by leveraging process integrity and systematically incorporating detailed domain knowledge on process bounds. The resulting trajectory data encodes information about the dynamics of the process that was not accessible in the original event data.
Conclusion
This paper has demonstrated a new approach for analyzing networked walk processes. We systematically characterized observational data about real-world walk processes on networks, noting that event data is common but has properties that prohibit the use of standard approaches from temporal network analysis. We then proposed a trajectory extraction technique that respects integrity constraints, incorporates domain-specific process bounds, and retains inherent sequential information. This method was applied to mobile money and association football by considering transactions and passes as records of events in the respective real-world walk process.
Regarding football, trajectories let us replicate classic findings on possessions from sports science, demonstrating that several findings about the game in 1990 still hold in 2018. We also demonstrated that passing play is a first-order Markovian process among most teams, while exceptional league teams show non-Markovian dynamics. Higher-order passing dynamics let us identify the top teams in the most competitive European leagues. In the domain of mobile money, trajectories let us summarize use of a system and quantify the extent to which account holders build up e-money savings; both are of top concern for the payment industry as they help providers better understand the system’s clients. Proponents of financial inclusion, and perhaps also regulators, might use these new metrics to compare and monitor mobile money systems.
Within both domains this work opens up considerable avenues for further research. Regarding football, the question of why many top league teams play with second-order dynamics is deserving of study. Event data from other team sports may also benefit from analysis as unweighted walk processes with bounds delineated by the rules of the game. Regarding mobile money, it is of likely interest whether providers offering different services (or operating under different regulatory frameworks) are used similarly. Our approach is also applicable to transaction records from other systems including app-based, intra-bank, and large-value payment systems.
Taking a methodological perspective on potential future work, each of our results is an empirical finding that could serve to better parametrize walk-based models of the observed network processes. We would like realistic models of real-world walk processes to reproduce basic features of empirical trajectories; this is already the logic underlying multi-order network representations of complex systems (Lambiotte et al. 2018; Xu et al. 2016). There is every opportunity for future work to incorporate meaningful process bounds, weighted walks, and a notion of continuous time into these types of frameworks.