Systematic bias in transport model calibration arising from the variability of linear data projection

https://doi.org/10.1016/j.trb.2015.02.004

Highlights

  • The variability of the linear projection function may cause bias in calibration.

  • An adjustment factor is proposed to decrease this systematic bias.

  • Simulations are used to demonstrate the effectiveness of the proposed method.

  • A case study is used to illustrate a real-life application of the proposed method.

Abstract

In transportation and traffic planning studies, accurate traffic data are required for reliable model calibration, which in turn supports accurate prediction of transportation system performance and better traffic planning. However, it is impractical to gather data from an entire population for such estimations because the widely used loop detectors and other more advanced wireless sensors are limited by various factors. Thus, making data inferences based on smaller populations is generally inevitable. Linear data projection is a commonly and intuitively adopted method for inferring population traffic characteristics. It projects a sample of observable traffic quantities, such as traffic counts, based on a set of scaling factors. However, scaling factors are subject to different types of variability, such as spatial variability. Models calibrated on linearly projected data that do not account for this variability may carry a systematic bias in their parameters. Such a bias is surprisingly often ignored. This paper reveals the existence of a systematic bias in model calibration caused by variability in the linear data projection. A generalized multivariate polynomial model is applied to examine the effect of this variability on model parameters. Adjustment factors are derived and methods are proposed for detecting and removing the embedded systematic bias. A simulation is used to demonstrate the effectiveness of the proposed method. To illustrate the applicability of the method, case studies are conducted using real-world global positioning system data obtained from taxis. These data are used to calibrate the Macroscopic Bureau of Public Roads function for six 1 × 1 km regions in Hong Kong.

Introduction

Reliable model calibration is crucial in transportation studies as it helps to establish a better understanding of the interactions between transportation infrastructure, vehicles and road users. Accurate model calibration leads to better urban and traffic planning and the implementation of traffic management and control measures. Consequently, it helps to develop a less congested and more efficient network, keeps a city more economically competitive and decreases traffic emissions. In addition, due to the irreversible patterns of development restricted by infrastructures and the critical role of infrastructure in promoting economic growth (Carlsson et al., 2013), careful planning with the support of reliable model calibration is essential for preventing the misuse of the public budget and resources.

The accurate measurement and estimation of traffic quantities result in reliable model calibration. Technological advancements have improved the accuracy and efficiency of traffic data collection methods over the past decades. Hand tally measurement has gradually been replaced by automatic systems such as inductive loop sensors, radar and television cameras. In addition to point measurement, methods for measuring along a length of road and the collection of data by a moving observer have also been developed. The rapid development of intelligent transportation systems has made it possible to conduct measurements over a wide area at a relatively low cost.

On-road fixed detectors such as inductive loop sensors are still the most commonly adopted means of collecting traffic data for important roadways, as such methods provide an acceptable level of accuracy with minimal effort. However, high installation and maintenance costs sometimes make it impractical or economically unviable to ubiquitously deploy these sensors on all highways and the entire arterial network (Herrera and Bayen, 2010, Herrera et al., 2010). Hence, the coverage is normally limited to a subset of links (Caceres et al., 2012).

Given that vehicle movement can be interrupted by signals, the travel time estimates of loop detectors could be inaccurate. In principle, a vehicle re-identification system can improve the accuracy as follows. Sensors installed at the two ends of a selected arterial link record the times when a vehicle passes by and measure its signature. The travel time of the vehicle is calculated when the signature is matched at the two consecutive locations of the link (Kwong et al., 2009). The radio frequency identification (RFID) transponders (Wright and Dahlgren, 2001, Ban et al., 2010), license plate recognition (LPR) systems (Herrera et al., 2010) and other unique tags are readily available utilities for this scheme. However, in addition to raising privacy concerns, these systems are similarly limited by the cost of sensor deployment over the entire arterial network, thus restricting coverage. Kwong et al. (2009) presented a scheme based on matching signatures measured by wireless magnetic sensors installed at the two ends of the arterial link. Although this scheme is able to avoid the risk of privacy issues, it fails to resolve cost and coverage problems. More recently, the Bluetooth Media Access Control Scanner (BMS) was proposed as a complementary traffic data source (Bhaskar and Chung, 2013). However, Jie et al. (2011) identified the poor quality of its data and the uncertainty surrounding its identification of Bluetooth device carriers (i.e., whether a carrier belongs to a vehicle, a cyclist or a pedestrian).

Cellular systems were introduced a decade ago (Bolla and Davoli, 2000, Ygnace and Drane, 2001, Zhao, 2000) to overcome the limitations imposed by expensive implementation costs and the limited coverage of stationary roadside equipment (Herrera et al., 2010) in systems such as loop detectors and vehicle re-identification systems. However, because the use of cell phones while driving disrupts drivers’ attention (Liang et al., 2007), it is prohibited or discouraged in many countries, thus limiting the application of the proposed models. Moreover, flow measurements from cellular systems follow an aggregate format for each group of links intercepting the corresponding inter-cell boundary (Caceres et al., 2012), making it impossible to estimate traffic flow for any individual link.

Advancements in global positioning systems (GPSs) have made it possible to collect data from GPS-equipped vehicles. These systems have been widely adopted to extend the coverage of data collected from stationary roadside equipment to almost the entire network at a relatively low cost (Miwa et al., 2013). Many recent travel time estimation studies have been based on GPS probe vehicle data (Nanthawichit et al., 2003, Hofleitner et al., 2012, Peer et al., 2013, Herring et al., 2010, Jenelius and Koutsopoulos, 2013, Zheng and Van Zuylen, 2013, Zhan et al., 2013). Although they lend potential to future global coverage, these probe vehicle data come from various sources that present specific challenges. First, fleet data (FedEx, UPS, taxis, etc.) (Moore et al., 2001, Schwarzenegger et al., 2008, Bertini and Tantiyanugulchai, 2004, Wong et al., 2014) pose bias problems due to the operational constraints and specific travel patterns involved. Second, participatory sensing data taken from industry models (INRIX, Waze, etc.) are unpredictable, and no single company has ubiquitous coverage (Hofleitner et al., 2012). Moreover, the added cost of equipping every vehicle with GPS trackers coupled with potential privacy issues prevent this system from being applied on a global scale, making direct measurement of total traffic flows implausible.

Despite the advancement of technologies, the collection of traffic data via different devices remains limited by various factors. Mathematical techniques used for traffic data estimations, such as sampling methods, filtering algorithms and data scaling, offer possible solutions to the problems presented by data acquisition. Linear data projection is a prevalent data scaling method that infers population traffic characteristics by projecting the observable traffic characteristics of a smaller population via the mean of a set of scaling factors.

The scaling factors used in linear data projections vary by situation. Example scaling factors include traffic composition ratios and passenger car units (PCUs). The factor is usually a random variable that is subject to variability and assumed to follow a distribution, rather than a constant. Depending on the sampling method used, the variance of the sampled scaling factor measures different types of variability, such as spatial and temporal variability. If traffic composition ratios are sampled across a network, then the variance measures spatial variability. Contrary to the usual assumption, a PCU is not essentially static (Chandra et al., 1995). Thus, if it is selected as the scaling factor, its variance during different time points at the same site measures temporal variability. Because the mean of the distribution is the most probable observed scaling factor, it is usually adopted in linear data projections.

Linear data projections are especially useful for traffic data estimations in situations where direct measurement is not possible such as the lack of spatial coverage of sensors. For instance, a linear data projection can be adopted to estimate an hourly total traffic flow on a link where on-road fixed detectors are not installed. Assuming that occupied taxi flow is observable on every roadway in a network and that total traffic flow is only observable on a subset of links outfitted with detectors in the network, the total traffic-to-occupied-taxi ratio can be the chosen scaling factor, and is assumed to follow a distribution over a region due to geographical proximity. Scaling factors can be sampled at sites outfitted with detectors. The mean of the sampled scaling factors is the expected total traffic-to-occupied-taxi ratio across that region in the long run. The variance of the sampled scaling factors measures the spatial variability of the total traffic-to-occupied-taxi ratio within this network. If the hourly occupied taxi flow on the link of interest is 10 veh/h and the mean of scaling factors sampled at the nearby sites is 100, the hourly total traffic on this link can be estimated by the product of the mean of the scaling factors and the occupied taxi flow, which is 1000 veh/h in this case. In their study of urban-scale macroscopic fundamental diagrams, Geroliminis and Daganzo (2008) leveraged the notion of linear data projection to infer the total traffic flow of sites without loop detector installations from the flow of a small group of GPS-equipped taxis, using the traffic composition ratio as the scaling factor. This scaling method is not limited to projecting traffic flow. It can also be used to infer other quantities such as trip completion rates, vehicular accumulations and space-mean speeds (Geroliminis and Daganzo, 2008).
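The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration: the function name `project_flow` and the five sampled ratios are hypothetical, chosen so that their mean equals the value of 100 used in the text.

```python
from statistics import mean, variance


def project_flow(observed_flow, scaling_factors):
    """Linear data projection: scale an observed quantity by the mean scaling factor."""
    return mean(scaling_factors) * observed_flow


# Hypothetical total-traffic-to-occupied-taxi ratios sampled at nearby
# detector-equipped sites (illustrative values, not data from the paper).
sampled_ratios = [90.0, 110.0, 95.0, 105.0, 100.0]

# Occupied taxi flow observed on the link of interest, in veh/h.
taxi_flow = 10.0

# Projected total traffic flow on the link, in veh/h.
total_flow = project_flow(taxi_flow, sampled_ratios)

# Sample variance of the ratios: a measure of spatial variability that the
# projection itself does not use.
spatial_variance = variance(sampled_ratios)

print(total_flow, spatial_variance)
```

Note that only the mean of the sampled ratios enters the projected flow; the variance is computed here solely to show the information that the projection discards.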

Due to its simple concept, linear data projection has been widely adopted in many real-world situations that necessitate data scaling via a scaling factor. However, scaling factors such as traffic composition ratios and PCUs are random variables with variations rather than absolute constants. Systematic bias may be embedded in the parameters of a model calibrated on linearly projected data because the variance, skewness, kurtosis and higher-order moments of the distribution of the scaling factor are not captured in the linear data projection.

This embedded systematic bias remains unexplored in the field, as it is not easily evident. To reveal and demonstrate the existence of the bias, a numerical example of the calibration of a simple polynomial model simulating a linear data projection is presented as follows:

$$y = a + bX^n = a + b(fx)^n$$

where $x$ is the observable independent variable; $f$ is the scaling factor of $x$; $X = fx$ is the projected value; $y$ is the observable dependent variable; and $a$, $b$ and $n$ are the model parameters.

Ten thousand data points of $x$, which serve as the observed data for the independent variable, are sampled from a uniform distribution with a domain from 0 to 1. Because scaling factors are generally positive, a lognormal distribution with $\bar{f} = 1$ and $\sigma_f = 0.2$ is chosen to sample the corresponding scaling factors for the 10,000 samples of $x$. $\bar{f}$ and $\sigma_f$ are respectively the mean and standard deviation of the scaling factor $f$. Depending on the sampling method used, both the standard deviation $\sigma_f$ and variance $\sigma_f^2$ can measure variability such as the spatial variation or temporal variation of the scaling factors across the dimension under consideration. Assuming that $a = 1$, $b = 1$ and $n = 3$, the corresponding 10,000 points of $y$ and $X = fx$, which serve as the observed data for the dependent and projected independent variable, can be calculated based on the assumed values of the parameters and the sampled $x$ and $f$.

Suppose that the individual values of $X$ are no longer available and can only be estimated via a linear projection function based on the mean value of $f$, $\bar{f}$, as commonly occurs in practice. Regression analysis is then conducted between $y$ and the linearly projected $X$. The calibrated values of the parameters are $\hat{a} = 0.999$ and $\hat{b} = 1.130$. $\hat{a}$ is close to the assumed true value, whereas $\hat{b}$ is overestimated by 13.0%. This overestimation reveals the existence of a systematic bias caused by neglecting scaling factor variability in the linear data projection. A linear data projection provides good point estimates of unobservable independent variables because it captures the first moment of the scaling factor, which carries most of the information. However, such point estimates are not sufficient for reliable model calibration.
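The numerical example can be approximately reproduced with the sketch below. Several details are assumptions rather than facts stated in the text: the exponent $n$ is treated as known and fixed at 3, so that the calibration reduces to ordinary least squares on $(\bar{f}x)^n$; NumPy is used for sampling and fitting; and the seed is arbitrary, so the fitted values will differ slightly from the reported 0.999 and 1.130.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed (assumption)

N = 10_000
a, b, n = 1.0, 1.0, 3           # assumed true parameters from the example
f_mean, f_sd = 1.0, 0.2         # mean and standard deviation of f

# Convert the mean and standard deviation of f into the parameters of the
# underlying normal distribution of log(f).
sigma2 = np.log(1.0 + (f_sd / f_mean) ** 2)
mu = np.log(f_mean) - 0.5 * sigma2

x = rng.uniform(0.0, 1.0, N)
f = rng.lognormal(mean=mu, sigma=np.sqrt(sigma2), size=N)

# "True" data generated with the individual scaling factors.
y = a + b * (f * x) ** n

# Linear data projection: only the mean scaling factor is used.
X_projected = f_mean * x

# Ordinary least squares of y on [1, X_projected**n], with n held fixed.
design = np.column_stack([np.ones(N), X_projected ** n])
(a_hat, b_hat), *_ = np.linalg.lstsq(design, y, rcond=None)

# For a lognormal f, E[f**n] / f_mean**n = (1 + CV**2) ** (n * (n - 1) / 2).
# This is a property of the lognormal distribution, shown for comparison;
# it is not necessarily the adjustment factor derived in the paper.
moment_ratio = (1.0 + (f_sd / f_mean) ** 2) ** (n * (n - 1) / 2)

print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}, E[f^n]/f_mean^n = {moment_ratio:.4f}")
```

Under these assumptions the intercept stays close to 1 while the slope settles near the moment ratio of about 1.125, with sampling noise that depends on the seed. This is in line with the overestimation of roughly 13% reported above.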

Models depicting the characteristics and performance of a network use fundamental diagrams and both link- and area-based cost-flow functions. These models, such as volume delay functions (e.g., Spiess, 1990, Akcelik, 1978, Tisato, 1991, Davidson, 1966, Akcelik, 1980) and speed-density relationships (e.g., Jayakrishnan et al., 1995, Kerner and Konhäuser, 1994, Drake et al., 1967, Drew, 1965, Munjal and Pipes, 1971, Pipes, 1967, Macnicholas, 2008, Del Castillo and Benitez, 1995a, Del Castillo and Benitez, 1995b, Van Aerde, 1995), require traffic speed, flow and density data, the three most important quantities in transportation. However, if a non-negligible subset of links within a network is not equipped with adequate instruments for direct traffic data measurement, which is usually the case in urban transportation networks (Lederman and Wynter, 2011), a linear data projection may be used to estimate the missing traffic data from the observable traffic data of a smaller population. Models calibrated on these linearly projected data may be systematically biased. To remove this bias, information provided by the scaling factor variability should be incorporated into the calibration of the model.
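For the simple model $y = a + b(fx)^n$ used in the numerical example, the source of the bias can be sketched in a few lines. This is an illustrative argument and not the paper's own derivation: it assumes that $f$ is independent of $x$, that $n$ is known, and that the calibration is a least-squares fit of $y$ on $(\bar{f}x)^n$ with a large sample.

```latex
\begin{align}
E[y \mid x] &= a + b\,E[f^n]\,x^n
             = a + b\,\frac{E[f^n]}{\bar{f}^{\,n}}\,(\bar{f}x)^n , \\
\hat{b} &\to b\,\frac{E[f^n]}{\bar{f}^{\,n}} , \\
\frac{E[f^n]}{\bar{f}^{\,n}} &\approx 1 + \frac{n(n-1)}{2}\,\frac{\sigma_f^2}{\bar{f}^{\,2}}
\qquad \text{(second-order Taylor expansion of } f^n \text{ about } \bar{f}\text{)} .
\end{align}
```

With $n = 3$ and $\sigma_f / \bar{f} = 0.2$, the approximation gives $1 + 3(0.04) = 1.12$, which is close to the overestimation of about 13% observed in the numerical example. Under these assumptions, dividing $\hat{b}$ by this ratio would recover $b$, which is the sense in which variability information can be fed back into the calibration.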

This paper fills the aforementioned knowledge gap by proposing the incorporation of adjustment factors that capture scaling factor variability into the model calibration process. We derive global adjustment factors that correct the calibrated sensitivity parameters of chosen generalized multivariate models in polynomial form. The Bureau of Public Roads (BPR) function adopted in the Highway Capacity Manual (Transportation Research Board, 2000) is a polynomial function that can model the relationship between travel time and traffic volume on a link. It is commonly used in many European countries and the United States (Dowling et al., 1998, Lum et al., 1998) and plays an important role in static user equilibrium analysis (García-Ródenas and Verastegui-Rayo, 2013). The case studies section presents calibrations of Macroscopic Bureau of Public Roads (MBPR) functions using real-life GPS data and demonstrates the application of the derived global adjustment factor. The main contribution of the proposed global adjustment factor is that it can remove the systematic bias introduced in the calibrated parameters and hence ensure more accurate model calibration.

The remainder of this paper is structured as follows. In Section 2, the existence of the systematic bias embedded in parameters calibrated from linearly projected data is proven based on a Taylor series expansion. In Section 3, the adjustment factor for models in generalized multivariate polynomial form is derived. Section 3 also presents a metric measuring the extent of the systematic bias, the factors affecting that extent, and a method for removing the bias embedded in the calibrated sensitivity parameters. Section 4 presents a simulation to illustrate the significant correction power of the derived global adjustment factors for multivariate functional models, and demonstrates that the applicability of the global adjustment factor is not restricted by the magnitudes of the mean and coefficient of variation (CV) of the scaling factor. In Section 5, real-world taxi GPS data are used to calibrate the macroscopic cost-flow function, and the derived global adjustment factor is applied in an illustrative case study. Finally, Section 6 summarizes the findings of the paper and discusses possible future research directions.

Existence of systematic bias

This section reveals the necessary and sufficient condition for the introduction of systematic bias into the calibrated model parameters arising from linearly projected data, and thereby proves its existence. The origin of the systematic bias is then discussed.

Global adjustment factors for generalized multivariate polynomial models

The paper uses a generalized multivariate polynomial model to examine the effect of neglecting scaling factor variability in model calibration. The goal of this section is to derive the global adjustment factors that capture scaling factor variability. A metric measuring the extent of the embedded systematic bias is proposed, and the factors affecting the amount of introduced systematic bias are discussed. A method for incorporating the captured variability to correct the calibrated

Simulation

In this section, simulations are performed using sampled scaling factors from 100 lognormal distributions with different combinations of means and standard deviations, and hence different CVs, to demonstrate the correction power and efficiency of the derived global adjustment factor, Fk. The association between the correction power and magnitudes of the mean and CV of the scaling factor is also investigated to illustrate the applicability of the global adjustment factor. Assuming that a0=3, a1=

Case studies

To illustrate the application of the global adjustment factor derived in Section 3 for the generalized multivariate polynomial model, case studies using real-world data were conducted on the calibration of the Macroscopic Bureau of Public Roads (MBPR) function for six 1 km × 1 km regions in Tin Hau, Ma Tau Wai, Fortress Hill, Admiralty, Jordan and Kowloon Tong, Hong Kong. The MBPR function, which is in generalized multivariate polynomial form, is an essential input for the continuum

Conclusion

In the transportation field, using different instruments to acquire data representing population traffic characteristics through direct measurement may meet with various limitations and restrictions. Traffic data inferences are often made based on the data of a population subset. Linear data projection is a prevailing method adopted for data inference. However, the possibility of a systematic bias being introduced into the parameters of models calibrated from linearly projected data has

Acknowledgements

The work described in this paper was supported by a Research Postgraduate Studentship and grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. HKU 17208614). We would like to express our sincere thanks to Concord Pacific Satellite Technologies Limited and Motion Power Media Limited for providing the taxi GPS data, and to the Transport Department of the HKSAR Government for providing the traffic flow data from the ATC.

References (55)

  • T. Miwa et al. Allocation planning for probe taxi devices based on information reliability. Transp. Res. C: Emerg. Technol. (2013)

  • P.K. Munjal et al. Propagation of on-ramp density perturbations on unidirectional two-and three-lane freeways. Transp. Res. (1971)

  • S. Peer et al. Door-to-door travel times in RP departure time choice models: an approximation method using GPS data. Transp. Res. B: Meth. (2013)

  • L.A. Pipes. Car following models and the fundamental diagram of road traffic. Transp. Res. (1967)

  • S.C. Wong. Multi-commodity traffic assignment by continuum approximation of network flow with variable demand. Transp. Res. B: Meth. (1998)

  • X. Zhan et al. Urban link travel time estimation using large-scale taxi data with partial information. Transp. Res. C: Emerg. Technol. (2013)

  • F. Zheng et al. Urban link travel time estimation based on sparse probe vehicle data. Transp. Res. C: Emerg. Technol. (2013)

  • R. Akcelik. A new look at Davidson’s travel time function. Traffic Eng. Control (1978)

  • R. Akcelik. Time-dependent expressions for delay, stop rate and queue length at traffic signals. (1980)

  • X.J. Ban et al. Performance evaluation of travel-time estimation methods for real-time traffic applications. J. Intell. Transp. Syst. (2010)

  • R.L. Bertini et al. Transit buses as traffic probes: empirical evaluation using geo-location data. Transp. Res. Rec.: J. Transp. Res. Board (2004)

  • R. Bolla et al. Road traffic estimation from location tracking data in the mobile cellular network.

  • N. Caceres et al. Traffic flow estimation models using cellular phone data. IEEE Trans. Intell. Transp. Syst. (2012)

  • R. Carlsson et al. The role of infrastructure in macroeconomic growth theories. Civil Eng. Environ. Syst. (2013)

  • S. Chandra et al. Dynamic PCU and estimation of capacity of urban roads. Indian Highways (1995)

  • K.B. Davidson. A flow travel time relationship for use in transportation planning.

  • J.M. Del Castillo et al. On the functional form of the speed-density relationship - I: general theory. Transp. Res. B: Meth. (1995)