Modelling travel time uncertainty in urban networks based on floating taxi data
DOI: 10.1186/s12544-019-0381-5
© The Author(s) 2019
Received: 27 March 2019
Accepted: 6 September 2019
Published: 13 November 2019
Abstract
The prediction of the uncertainty of route travel time predictions for all possible routes in an urban road network is of importance for example for logistics. Such predictions need to take the essential features of the data set as well as the underlying traffic dynamics into account. In this paper a large floating taxi data set is used in order to derive predictions of route travel time uncertainty based on link travel time uncertainty predictions. Prediction errors, that is actual travel times minus predicted travel times, are differentiated from model errors, that is measured travel times minus predicted travel times. These two errors are related, but not identical, as model errors contain measurement noise while the prediction errors do not. Detailed models for the variance of the link travel time prediction errors as well as the correlation between the model errors for different links are derived. The models are validated in depth using two different validation data sets. Estimates for the variance of prediction errors are obtained. The standardized model error distributions show a remarkable stability, such that modelling the variance appears to be sufficient for quantifying the uncertainty of the model errors. Furthermore we show that the model errors for adjacent links are highly correlated but correlations fade with increasing distance. Additionally usage of the road network plays a role with high correlation for links along common routes and low correlations for links along seldom used routes. We assume identical features for the prediction errors which is partly validated based on additional data. The paper provides a way to estimate the complete distribution of route travel time prediction errors for any given route in the street network.
Keywords
Taxi floating car, Travel time uncertainty, Travel time prediction
1 Introduction
The quantification of route travel time uncertainty is of importance for logistics applications as well as for publicly available routing services (see [3] for a survey of studies dealing with valuing reliability; compare also [14]). Some authors even found that travel time reliability is valued more highly than travel time itself [8]. For logistic applications planning usually involves pre trip decisions, in many cases several hours or even days before the trips are executed. Both private individuals as well as logistics companies typically have asymmetric costs with being late implying higher penalties than being early. Accordingly for cost optimal decisions not only the predicted travel time but also the uncertainty involved in the prediction is of interest.
Typically routing services are based on a directed mathematical graph (a set of nodes connected by links) representing the street network such that the shortest path between two points can be obtained using the Dijkstra algorithm. Here ‘shortest’ is to be understood in a broad sense and could involve predictions of link travel times for a particular departure time. For such predictions a large number of different methods based on many different data sets have been developed, first for highways (compare the papers contained in the compendium [2]), subsequently for general road networks. Excellent surveys of the many contributions can be found in [16, 17]. Usually in these approaches the predicted route travel time is obtained as the sum of the predicted link travel times.
With respect to travel time reliability a number of different measures could be used, compare the survey in [6]. It is important to note that these measures not only depend on the respective link quantities but also on the relationship between the various link travel times. The most basic uncertainty measure is constituted by the variance. This measure for highways has been criticized as not telling the whole story [10]. Consequently approaches such as quantile regression for quantifying the whole distribution of travel time prediction errors have been developed (see the contributions [7, 11] and the references contained therein). Often confidence intervals for prediction uncertainty based on standard deviations (that is, square roots of variances) are constructed assuming Gaussian distribution of the errors. A more elaborate approach would imply a constant distribution scaled by standard deviations. Quantile regression methods replace this simple model by a detailed model for a number of quantiles depending on influencing factors. Such methods provide better quantifications of uncertainty compared to the scaling approach if the shape of the distribution changes a lot depending on influencing factors such as the time-of-the-day for example.
The variance of a sum of random variables (such as the sum of link travel times) equals the sum of the variances plus the sum of all possible covariances between pairs of random variables. In [9] the covariances of prediction errors are neglected and several different measures for route travel time variances are compared without reaching a compelling conclusion. It is clear that the omission of covariances is unjustified if the contribution of the covariances is substantial. It is unclear, however, if this is the case.
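Written out for a route with J links, and using \(X_{j}\) as a generic placeholder for the quantity attached to the j-th link (a symbol introduced here for illustration only), the identity reads:

```latex
\[
\mathrm{V}\Big(\sum_{j=1}^{J} X_{j}\Big)
  = \sum_{j=1}^{J} \mathrm{V}(X_{j})
  + \sum_{a \neq b} \mathrm{Cov}(X_{a}, X_{b})
  = \sum_{j=1}^{J} \mathrm{V}(X_{j})
  + 2 \sum_{a < b} \mathrm{Corr}(X_{a}, X_{b})
      \sqrt{\mathrm{V}(X_{a})\,\mathrm{V}(X_{b})} .
\]
```

For two links with standard deviations of 3 s and 4 s (illustrative numbers, not taken from the data), the route variance is 25 s² without correlation and 25 + 2 · 0.5 · 12 = 37 s² for a correlation of 0.5. Neglecting a moderate positive correlation would in this case understate the variance by roughly a third.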
Travel time uncertainty is related to different levels of congestion. It may be argued that congestion is spread only along routes driven by many cars while crossing traffic might be unaffected by congestion in the orthogonal direction. This conjecture will be investigated in this paper by using a variable called trip count ratio indicating for each pair of links the proportion of trips along one link also traversing the second link.
As routing applications potentially build routes including all possible combinations of links, the estimation of the variance of an arbitrary route travel time prediction error requires the estimation of all covariances between the link travel time prediction errors for all pairs of links. For a large map this is impossible and hence a model is needed that provides an estimate of the correlation of the travel time prediction errors for any two given links based on some influential factors such as the distance as well as the trip count ratio.
An ideal source for modelling is constituted by taxi floating car data as a large fleet can cover the whole street network and be active throughout the day. Taxis have the advantage, compared to other fleets, of being operated continuously. Consequently this paper investigates the properties of the uncertainties of link and route travel time prediction errors based on models developed for a large floating taxi data set (FCD) in Vienna [15]. In the paper we distinguish prediction errors, that is actual travel time minus predicted travel time, from model errors which additionally include measurement errors. Both errors depend on the traffic state and contain inter-driver and intra-driver variation. In the paper we discuss in detail the relation between the various components of the two errors and their impact on the estimation.
Note that the related paper [13] deals with a slightly different problem by assuming low coverage of the floating car data. This is countered by imposing much structure (in the form of regression equations) on the relation between measured variables, while our data set is large enough to achieve a good coverage of the network (see below). However, for a map with several thousands of links, estimation of the covariances of link travel time prediction errors for all pairs of links is still not feasible.
The main contribution of this paper thus is the thorough investigation of link and route travel time prediction and model errors. First, it is shown that for our data set the model error variances for the link travel time models show a strong dependence on the measurement conditions. In particular the number of single taxi observations for one link and one time interval as well as the current traffic conditions influence the variance of the model errors. Second, the distribution of the standardized model errors (such that the conditional mean is zero and the conditional variance equal to one) is remarkably stable such that quantifying the variance is sufficient in order to obtain the whole model error distribution. Third, the correlations of link travel time model errors for different pairs of links are investigated in detail, identifying two influencing factors: the driving distance between the two links and the trip count ratio. Fourth, all models are thoroughly validated by means of two separate validation data sets: the first consists of a second period of the floating taxi measurements which is not used for modelling. The second comprises an even tougher test by comparing the estimated route travel time uncertainty obtained from the models to the uncertainty of the measured route travel time based on trajectory data for single taxi observations for a collection of routes.
The paper is organized as follows: In the next section the data set is described while Section 3 provides the methodology for the estimation of the route travel time uncertainty. The empirical results are described in Section 4. Finally Section 5 concludes the paper.
2 Data set
In Vienna the movement of taxis has been observed since 2004 using low-frequency GPS sensing (with a sampling interval of about 1 minute). The fleet comprises approximately 3500 taxis in total, of which around 2000 are on the road at any one time. These raw data are used in different ways as discussed in the following two subsections.
2.1 Floating taxi data
The floating taxi dataset used in this paper covers the time period from July 1st 2008 until July 31st 2010, a total of 761 days. For each GPS observation information on the status of the taxi is available, which makes it possible to retain only those movements that are made when carrying a passenger.
The raw data set is map matched (using trajectory to route map matching; a commercial map is used to encode the street network and hence the location and length of the links are given exogenously; note that the links in the map are directed such that two way segments of a street are represented by two links in opposite direction in the map) and interpolated between observations in order to obtain an estimated route as a continuous sequence of links in combination with an estimated entry and exit time for each link. Additionally the routes found are assessed and unreliable routes (implying very high speeds) are excluded from further analysis. Details on data collection and preprocessing can be found in [15].
The corresponding obtained route data is aggregated to obtain link specific average travel times within a given time interval by using arithmetic averaging. The time intervals have been chosen heuristically to equal 15 minutes leading to 96 time intervals per day.
1. One data set contains estimated taxi routes including the estimated timing of link entry and exit events providing direct measurements of route travel times.
2. The second one consists of average travel time measurements \(\check \Pi _{d,i}^{l}\) for each link l and each fifteen minute interval i on any given day d.
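The aggregation into fifteen minute averages can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(link_id, entry_time, travel_time_s)` and the function names are hypothetical, and assigning each observation to an interval by its link entry time is an assumption that the text does not confirm.

```python
from collections import defaultdict
from datetime import datetime

INTERVAL_MINUTES = 15  # 24 * 60 / 15 = 96 intervals per day


def interval_index(entry_time: datetime) -> int:
    """Return the time-of-day interval i in 0..95 for a link entry time."""
    return (entry_time.hour * 60 + entry_time.minute) // INTERVAL_MINUTES


def aggregate_link_travel_times(observations):
    """Arithmetic average of single-taxi travel times per (link, day, interval).

    `observations` is an iterable of (link_id, entry_time, travel_time_s)
    tuples (hypothetical layout). Returns a dict mapping
    (link_id, date, i) to (mean_travel_time_s, number_of_taxi_observations).
    """
    sums = defaultdict(lambda: [0.0, 0])
    for link_id, entry_time, travel_time_s in observations:
        key = (link_id, entry_time.date(), interval_index(entry_time))
        sums[key][0] += travel_time_s
        sums[key][1] += 1
    return {key: (total / n, n) for key, (total, n) in sums.items()}
```

Keeping the number of observations next to each average is useful later, because the variance of the aggregated measurement depends on how many taxis contributed to it.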
2.2 Link travel time data sets
- Hietzing (H): 191 links around the main arterial in the West of the city leading past the tourist attraction Schönbrunn castle.
- Westbahnhof (WBH): 122 links in the area of the Westbahnhof rail station. This area is adjacent to the shopping street Mariahilferstrasse.
- Ring (R): 79 links in the innermost city with lots of tourist attractions.
- Südosttangente (SOT): 58 links on the largest (in terms of traffic) inner city highway including a number of feeder links.
The four sites are selected as a compromise between including many different traffic environments (city highway, main arterial and inner city regions) on the one hand and respecting time restrictions for analysis with the soft- and hardware tools available to the authors on the other.
In total the dataset covers 38.9km of roads (approximately 19.4km in Hietzing, 6.1km at site WBH, 6.4km at R and 7.0km at SOT) containing approximately 32.9 million taxi observations. All four datasets include a varying number of missing observations in all dimensions. On 8.5% of days no observation exists at all due to coding errors either in the data collection or extraction from the database. The missing days occur on a continuous stretch of adjacent days, therefore there does not appear to be a systematic pattern of missing observation days. The fraction of missing observations per link (that is time intervals in which no taxi is observed on the corresponding link) varies from 10% to 80%. 19% of all measurements are based on only one taxi observation. For more details on the data set see [1].
It can be seen that on the city highway Südosttangente in general smaller local travel times (corresponding to larger speeds) are observed while in the inner city larger local travel times are observed (R). Plot (b) provides a boxplot grouped across time-of-day-intervals for a link in WBH showing a number of characteristic features: Throughout the day congestion causes larger local travel times. Furthermore the standard deviation in general is large and varies a lot over the course of the day. During the afternoon peak period the standard deviation is a substantial fraction of the average local travel time.
In addition to \(\check {\Pi }_{d,i}^{l}\) in the dataset also the empirical variance of local speed observations within one (link, day, time of day interval) combination before aggregation is provided for the Ring dataset. This information will be used in order to estimate the measurement error variance in Section 3.2.
2.3 Link distance data
Driving distances (denoted by \(D_{i,j}\)) between the middle points of two links are obtained from the underlying map. Intersection information such as turning restrictions is not used. Correspondingly the distance between two links is static over time. Two issues might arise in particular for links close to the boundary of the considered regions: First, as only the subgraphs in the four regions are used, there might exist shorter paths outside the considered region connecting the two links. Second, the missing turning restrictions might reduce the driving distance. Both effects are considered minor problems for the chosen regions. In addition the number of links is relatively large such that such problems for a small number of pairs of links should be ‘averaged out’ for all results presented below.
2.4 Trip count ratios
Note that this definition of the trip count ratio produces a symmetric measure in the sense that \(\tau_{i,j}=\tau_{j,i}\). This measure will be used in order to model the correlation of model errors which inherently are symmetric. Typically high values of \(\zeta_{ij}\) where link i lies upstream of link j imply low values of \(\zeta_{ji}\) in the reverse direction. Alternatively in the model both the maximum and the minimum value of \(\zeta_{ij}\) and \(\zeta_{ji}\) could be used. This is left for future research.
Therefore using the overall trip count ratio \(\tau_{i,j}\) appears to be justified, where we have to be more careful with interpretation of the results for the R dataset.
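To make the distinction between a directional and a symmetric ratio concrete, the sketch below computes both from a list of trips, each given as an ordered sequence of link identifiers. It rests on assumptions: the directional ratio follows our reading of \(\zeta_{ij}\) as the share of trips on link i that later traverse link j, and the symmetric variant (trips using both links divided by trips using either) is only one plausible symmetric choice, not necessarily the definition of \(\tau_{i,j}\) used in the paper. All function names are hypothetical.

```python
def directional_ratio(trips, i, j):
    """Share of trips traversing link i that traverse link j afterwards."""
    n_i = 0
    n_i_then_j = 0
    for trip in trips:
        if i in trip:
            n_i += 1
            # Only count link j if it appears downstream of link i.
            if j in trip[trip.index(i) + 1:]:
                n_i_then_j += 1
    return n_i_then_j / n_i if n_i else 0.0


def symmetric_ratio(trips, i, j):
    """Assumed symmetric ratio: trips using both links / trips using either."""
    both = sum(1 for trip in trips if i in trip and j in trip)
    either = sum(1 for trip in trips if i in trip or j in trip)
    return both / either if either else 0.0
```

With this construction a high directional ratio from i to j typically goes along with a low ratio from j to i, while the symmetric variant is the same in both directions by design.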
3 Methods for uncertainty modelling
The main goal of this paper is to propose and validate a model for the distribution of the errors of the route travel time prediction along a given route \(R=(L_{j})_{j=1,\dots,J}\) (seen as a set of J links with indices \(L_{j}\)). The focus here is on long-term predictions, say at least one hour ahead, such that temporal correlations between deviations from ‘usual’ circumstances are no longer significantly different from zero. Therefore route travel time prediction and the corresponding uncertainty are modelled as a function of the time when embarking onto the route.
One of the difficulties related to our data set is that we do not have observations of actual link travel times \(\Pi ^{l}_{d,i}\) of a (single) taxi on given link l on day d and time-of-the-day interval i. We are only given an aggregate, \(\check {\Pi }^{l}_{d,i}\) say, of noisy measurements of (single) taxi travel times for a (random) number, \(N^{l}_{d,i}\) say, of taxis.
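We call \(\check {u}^{l}_{d,i} = \check {\Pi }^{l}_{d,i} - \hat {\mu }^{l}_{d,i}\) the model error and \(u^{l}_{d,i} = \Pi ^{l}_{d,i} - \hat {\mu }^{l}_{d,i}\) the prediction error. Note that \(u^{l}_{d,i}\) and \(\check {u}^{l}_{d,i}\) are closely related but they are not identical.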
As has been noted above we do not have direct observations of the link travel times and hence it is not possible to directly estimate the variances and correlations of the link travel time prediction errors \(u^{l}_{d,i}\). Instead we propose estimates of these quantities which are based on the model errors \(\check {u}^{l}_{d,i}\).
- inter driver variability: under free flow conditions every driver sets his/her free speed which differs between drivers.
- varying traffic conditions: congestion is not identical on different days leading to deviations (that is random variables with zero mean) from the expected travel times. Additionally weather conditions and further noise factors might lead to deviations from usual traffic conditions.
- measurement errors: as link travel times are only measured based on low-frequency GPS signals there is a measurement error. We assume that these measurement errors are independent of the traffic state and of the time-of-day.
These three factors are all mixed in the observations. It is hard to separate them based only on the (aggregated) link travel time observations. Due to the aggregation the first and the third factor (inter driver variability and measurement errors) diminish with increasing number of observations \(\left (N^{l}_{d,i}\right)\) per time-of-day-interval while the second (varying traffic conditions) does not.
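The effect of aggregation can be illustrated with an idealized decomposition. The sketch assumes, beyond what the text states, that the three components are independent, that the traffic-state deviation is common to all taxis in one interval, and that the inter driver and measurement components average out at rate 1/N. The function name and the numbers are hypothetical; the variance model used in the paper estimates the dependence on N from the data instead of fixing it.

```python
def aggregated_variance(traffic_var, driver_var, measurement_var, n_taxis):
    """Variance of the mean of n_taxis single-taxi travel times.

    Idealized assumption: the traffic-state component is shared by all
    taxis in the interval and does not shrink, while the inter driver and
    measurement components are independent across taxis and shrink as 1/N.
    """
    if n_taxis < 1:
        raise ValueError("n_taxis must be at least 1")
    return traffic_var + (driver_var + measurement_var) / n_taxis


# Illustrative numbers (in squared seconds), not taken from the data set.
single_taxi = aggregated_variance(4.0, 6.0, 2.0, n_taxis=1)  # 4 + 8 / 1
four_taxis = aggregated_variance(4.0, 6.0, 2.0, n_taxis=4)   # 4 + 8 / 4
```

Under these assumptions the variance approaches the traffic-condition component as N grows, which is why the first and third factor cannot be separated from the second by looking at a single aggregated observation.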
The correlations of the prediction errors are estimated via the correlations of the normalized (with the inverse of the standard deviation \(\sqrt {\mathrm {V}(\check {u}^{l}_{d,i})}\)) model errors. These are seen as proxies for the correlations of the prediction errors to which we do not have access.
- In Section 3.1 we discuss the modelling of the mean travel time \(\mu ^{l}_{d,i}\).
- In Section 3.2 we present a model for the variances of the measured link travel times and discuss how to construct estimates for the variance of the link travel time predictions from this model and suitable estimates of the measurement error variance.
- In Section 3.3 a model for the (spatial) correlations \(\text {Corr}(\check {u}^{a}_{d,i},\check {u}^{b}_{d,i})\) for all pairs of links a,b is presented.
- Finally Section 3.4 shows how these pieces are put together to get an estimate of the variance of the route travel time prediction error.
For each model we will investigate the dependence on links, days and time-of-day-intervals.
3.1 Model for the expected link travel time
A regression model is used for the measured link travel time \(\check \Pi ^{l}_{d,i}\), where the regressor vector \(x_{d}\) contains the constant, dummies for the day category and school holidays, and additional cyclical terms to model potential yearly effects (\(\cos (\omega _{j} d)\), \(\sin (\omega _{j} d)\), \(\omega _{j}=2\pi j/365\), \(j=1,\dots ,5\)).
The regression coefficients \(\beta _{l,i}\) are specific to the time-of-day-interval i and the link l.
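A regressor vector of this kind could be assembled as sketched below. The day categories, the ordering of the entries, the choice to drop one dummy as reference category and the names `regressor_vector` and `predict_mean` are all assumptions made for illustration; only the constant, the dummies and the five pairs of cyclical terms with \(\omega _{j}=2\pi j/365\) come from the text.

```python
import math

# Hypothetical day categories; the first one acts as reference category.
DAY_CATEGORIES = ("weekday", "saturday", "sunday")


def regressor_vector(day_index, day_category, school_holiday, n_harmonics=5):
    """Build x_d: constant, day dummies, holiday dummy, yearly harmonics."""
    x = [1.0]  # constant
    # One dummy per non-reference day category.
    x += [1.0 if day_category == c else 0.0 for c in DAY_CATEGORIES[1:]]
    x.append(1.0 if school_holiday else 0.0)
    # Cyclical terms cos(omega_j * d), sin(omega_j * d), omega_j = 2*pi*j/365.
    for j in range(1, n_harmonics + 1):
        omega = 2.0 * math.pi * j / 365.0
        x += [math.cos(omega * day_index), math.sin(omega * day_index)]
    return x


def predict_mean(x, beta):
    """Linear predictor x' beta for one link and one time-of-day interval."""
    return sum(xi * bi for xi, bi in zip(x, beta))
```

Since the coefficients are specific to each link and time-of-day interval, one coefficient vector of this length would be needed for every (link, interval) pair.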
According to [15] the models are estimated using stepwise regression techniques and extensive model selection in order to identify the most relevant regressors. A model for the variance of \(e_{d,i}^{l}\) (see the next section) as a function of the underlying number of observations as well as the average mean speed reduces the influence of heteroskedasticity. For details see [15].
As a result we obtain estimates \(\hat {\mu }^{l}_{d,i}\) which serve as predictions for the actual link travel times as explained above.
3.2 Model for the variance of link travel times
We will use estimates for the variance of \(e^{l}_{d,i}=\check {\Pi }^{l}_{d,i}-\mu ^{l}_{d,i}\) as estimates for the variance of the model errors \(\check {u}^{l}_{d,i}=\check {\Pi }^{l}_{d,i}-\hat {\mu }^{l}_{d,i} =e^{l}_{d,i}+(\mu ^{l}_{d,i}-\hat {\mu }^{l}_{d,i})\). This simplification is justified since, due to the large data set used for the estimation of \(\mu ^{l}_{d,i}\), the estimation error \((\hat {\mu }^{l}_{d,i} - \mu ^{l}_{d,i})\) is "small" compared to the noise \(e^{l}_{d,i}\).
where \(\bar {N}^{l}_{d,i}\) denotes the dummy variable indicating that the corresponding measurement is only based on one taxi observation. This makes the model for one taxi measurement insensitive to misspecifications of the dependence on \(N_{d,i}^{l}\).
As in [15] this can be estimated in logarithms using \(\log ((\check \Pi ^{l}_{d,i}-\hat {\mu }^{l}_{d,i})^{2})\) as the dependent variable. Here the coefficients \(\delta _{l,i}\geq 0\), \(\phi _{l,i}\geq 0\) are restricted to be non-negative, since we expect that averaging individual taxi observations decreases the variance.
Additionally a penalization is introduced in order to obtain smooth (over time-of-day-intervals) variation of coefficients. For details on the penalization approach used see [5]. In the following let \(\hat {\sigma }^{2}_{{l,d,i}}\) denote the estimate for \(\sigma ^{2}_{{l,d,i}}\).
Note that this variance contains all three components of the link travel time uncertainty. The inter driver variability and the varying traffic conditions act as influencing factors. The uncertainty related to a single trip from a single driver is obtained by setting \(N_{d,i}^{l} = 1\).
With regard to the third component, the variance \(\omega _{l}\) of the travel time measurement error for link l is assumed to be independent of time while the other components of the variance of the travel times vary with the traffic state: in conditions of synchronized traffic, inter driver variability vanishes. Varying traffic conditions lead to varying levels of the variance of travel time measurements. Therefore the measurement error variance can be bounded by the minimum of all observed variances. Assuming that in all cases the long data set contains worst case scenarios we will use the minimum of all observed variances as the measurement error variance.
In the final step the Delta method is used to transfer the estimated variances \(\hat \omega _{l}^{\mathrm {V}}\) for the speed measurements to the corresponding variance \(\hat \omega _{l}\) of the travel time measurements.
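For a link of length L traversed at speed v the travel time is T = L / v, so a first-order (Delta method) approximation gives V(T) ≈ (L / v²)² V(v). The sketch below applies this approximation. Linearizing around a mean speed, the units and the function name are assumptions for illustration, since the text does not state the point of linearization.

```python
def travel_time_variance_from_speed(link_length_m, mean_speed_mps, speed_variance):
    """Delta method: V(T) ~ (dT/dv)^2 * V(v) for T = L / v.

    The derivative dT/dv = -L / v^2 is evaluated at the mean speed
    (an assumption made here; other linearization points are possible).
    """
    if mean_speed_mps <= 0:
        raise ValueError("mean speed must be positive")
    derivative = -link_length_m / mean_speed_mps ** 2
    return derivative ** 2 * speed_variance


# Illustrative values: 100 m link, 10 m/s mean speed, speed variance 4 (m/s)^2.
example = travel_time_variance_from_speed(100.0, 10.0, 4.0)  # (-1.0)**2 * 4.0
```

The approximation is only reliable when the speed variation is small relative to the mean speed, which is a further reason for the validation announced in the text.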
We note that this estimate uses a number of assumptions that are not obvious. Therefore a detailed verification of the assumptions using thorough validation on different data sets will be presented below.
3.3 Modelling spatial correlations
The pairwise correlation \(\rho _{i,j}\) between the model errors \(\check {u}^{l}_{d,i}\) of the link travel time for two links i, j is modelled as a function of the driving distance \(D_{i,j}\) between the two links as well as the trip count ratio \(\tau _{i,j}\).
Note that we model the correlation of the normalized model errors, whereas for the estimation of the uncertainty of the route travel times the correlations of the normalized prediction errors would be needed.
Since we do not have access to the latter, we assume here that the two correlations show similar features such that the correlations of model errors are indicative of the correlations of the prediction errors. Limitations in our data set do not allow a detailed investigation.
3.4 Estimation of the route travel time variance
with \(\hat {\mathrm {V}}(u^{L_{a}}_{d,i})\) computed according to (7).
- NC: no correlation, \(\widehat {\mathrm {Corr}}(\check {u}^{L_{a}}_{d,i}, \check {u}^{L_{b}}_{d,i}) = 0\) for \(L_{a}\neq L_{b}\). It is expected that this underestimates variability by neglecting typically positive correlations.
- EM: empirical correlation matrix including all pairwise estimates. It is to be expected that the corresponding estimates are noisy in particular for pairs of links with only a few joint observations.
- ET: using the information in Fig. 9 below empirical correlations are used for all pairs of links with distance smaller than 1 km and correlations are set to zero for higher distances.
- MI: the correlations are estimated with an individual model according to Eq. 8 for each of the four datasets.
- MJ: correlations are predicted from a model according to Eq. 8, estimated based on all datasets jointly.
- SD: at the opposite end of the spectrum lies the case of perfect correlation: \(\widehat {\mathrm {Corr}}(\check {u}^{L_{a}}_{d,i}, \check {u}^{L_{b}}_{d,i}) = 1\) for all \(L_{a},L_{b}\) which provides an upper bound of uncertainty.
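How the choice of correlation matrix changes the route variance can be sketched as follows. The code applies the standard identity for the variance of a sum to link standard deviations and a correlation matrix; it is not a reproduction of the paper's Eq. 9. All function names and numbers are hypothetical, and only the NC, ET and SD variants are shown because they need no fitted model.

```python
def route_variance(link_std, corr):
    """Sum over all link pairs (a, b) of corr[a][b] * std[a] * std[b]."""
    n = len(link_std)
    return sum(
        corr[a][b] * link_std[a] * link_std[b]
        for a in range(n)
        for b in range(n)
    )


def no_correlation(n):
    """NC: identity matrix, all off-diagonal correlations set to zero."""
    return [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]


def perfect_correlation(n):
    """SD: all correlations equal to one (upper bound of uncertainty)."""
    return [[1.0] * n for _ in range(n)]


def truncate_by_distance(corr, dist_km, max_km=1.0):
    """ET: keep correlations for pairs closer than max_km, zero otherwise."""
    n = len(corr)
    return [
        [corr[a][b] if a == b or dist_km[a][b] < max_km else 0.0 for b in range(n)]
        for a in range(n)
    ]


# Illustrative three-link route (standard deviations in seconds).
std = [2.0, 3.0, 6.0]
empirical = [[1.0, 0.5, 0.1], [0.5, 1.0, 0.2], [0.1, 0.2, 1.0]]
dist = [[0.0, 0.3, 1.1], [0.3, 0.0, 0.8], [1.1, 0.8, 0.0]]
v_nc = route_variance(std, no_correlation(3))
v_et = route_variance(std, truncate_by_distance(empirical, dist))
v_sd = route_variance(std, perfect_correlation(3))
```

With non-negative correlations the NC value is the smallest and the SD value, which equals the squared sum of the standard deviations, is the largest, so the remaining schemes fall between these two.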
3.5 Validation procedures
All components of the model are validated carefully in the most appropriate context. The predictions for the observed (aggregated) link travel times as well as the corresponding variance estimates \(\hat {\sigma }^{2}_{{l,d,i}}(\hat \mu _{d,i}^{l}, N_{d,i}^{l})\) are evaluated on validation data obtained by splitting the dataset into the first 701 days as the estimation data and the last 60 days as the validation data.
Secondly, we achieve a detailed validation of the estimation procedure by comparing the route travel time predictions as well as the variance estimates \(\hat {V}(u^{R}_{d,i})\) to single trip observations based on the additional trip data.
4 Results and discussion
4.1 Mean and variance model
Plot (b) of Fig. 7 provides the (2.5%, 50%, 97.5%) percentiles (computed over days and links) grouped into time-of-day-intervals (validation data is plotted in bold, estimation data in thin lines; the results on the validation data set being almost identical to the ones on the estimation data which are hence almost invisible). Only the H data shows deviations over time-of-day-intervals for the morning and the evening peak in the validation data set. The three percentiles are located at approximately -1 (2.5%), 0 (50%) and 2.5 (97.5%). The location of the given percentiles of the normalized residuals is very stable across different datasets, links and time-of-day-intervals. This indicates that the distribution of the normalized model errors is identical in all cases such that the whole distribution can be characterized by the normalized distribution times the scaling using the estimated standard deviation.
Note, however, that the prediction errors equal model errors minus measurement errors. Since measurement errors cannot be measured directly, it is unclear whether prediction errors also equal standard deviation times a random variable with distribution not depending on the factors influencing the standard deviation. This is left for future research.
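If the normalized model error distribution is indeed stable, an interval for a measured link travel time follows from scaling fixed quantiles by the estimated standard deviation. The sketch below uses the approximate percentile locations quoted above (-1 and 2.5). The function name and the example numbers are hypothetical, and, in line with the caveat just stated, the interval refers to model errors and is not verified for prediction errors.

```python
# Approximate percentile locations of the normalized model errors as quoted
# in the text; these are rounded values, not exact estimates.
NORMALIZED_QUANTILES = {0.025: -1.0, 0.5: 0.0, 0.975: 2.5}


def model_error_interval(predicted_travel_time, estimated_std):
    """Scale the fixed normalized quantiles by the estimated standard deviation.

    Returns the (2.5%, 97.5%) bounds for the measured travel time.
    """
    lower = predicted_travel_time + NORMALIZED_QUANTILES[0.025] * estimated_std
    upper = predicted_travel_time + NORMALIZED_QUANTILES[0.975] * estimated_std
    return lower, upper


# Illustrative values: predicted 60 s, estimated standard deviation 10 s.
bounds = model_error_interval(60.0, 10.0)
```

The asymmetry of the interval reflects the skewness of the normalized distribution: large positive deviations from the prediction are more common than large negative ones.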
4.2 Spatial correlation models
It might be suspected that there are a few factors driving the deviation from normal conditions such as an unexpectedly high level of congestion uniformly on the whole urban region (corresponding for example to weather incidents such as snowfall, heavy rain etc.). The data in all four regions do not support this hypothesis. In all four regions approximately 60% of the factors are needed to reach a cumulative explanation of 90% of the variance in a factor model.
Spatial correlations are low with an average of 0.10 to 0.05 (at a standard deviation of 0.12 to 0.18). For high values of trip count ratio \(\tau _{i,j}\), however, substantial correlation exists in particular for the urban highway and the arterial in Hietzing. For \(\tau _{i,j}>0.8\) we obtain mean values of 0.41 (H, standard deviation: 0.29), 0.35 (WBH, std: 0.35), 0.61 (R, std: 0.26) and 0.56 (SOT, std: 0.29). This demonstrates that the \(\tau _{i,j}\) values influence the correlation.
The H dataset shows the expected behaviour with high correlations occurring exclusively for links with small distances and high \(\tau _{i,j}\) values. Note that this dataset shows mainly two arterials in and out of the city with few alternative routes. A similar behaviour can be seen for the urban highway dataset SOT where, however, very few pairs of links with larger distance and high \(\tau _{i,j}\) values are contained. The same behaviour also occurs in the R dataset.
The urban WBH dataset shows distinctly different patterns with a cloud of few points corresponding to small \(\tau _{i,j}\) values and high correlations, the remaining points possessing very small correlations. Such data points are occasionally seen also in the H dataset for small distances. Many instances of such pairs of links occur on segments of streets in opposite direction which indicates influences of common disturbances affecting both directions. On the urban highway SOT and the arterials in the H data set such scenarios do not occur. Interestingly this also does not occur in the R dataset.
The results can be seen in Fig. 9. As expected, high correlations are obtained only for small distances and high values of \(\tau \). For \(\tau \) of 0.8 correlations already are smaller than 0.2 in all four datasets except for extremely small distances. Also for distances of 0.5 km correlations are smaller than 0.4.
4.3 Comparison to single trips
The previous discussion led to the development of a number of models for the variance of travel time prediction errors which are validated in this section using trip data of single taxis obtained out of sample after the modeling took place. Here nine days (Sunday 1.1.-Wednesday 4.1., Sunday 8.1.-Tuesday 10.1., Wednesday 1.2. and Tuesday 24.7) in 2012 of trip data are used. These days contain weekdays and weekends, holiday periods (1.1.-4.1., 24.7.) and school periods.
On these days for a total of 8 heavily used routes in the four data sets single trip start and end points are estimated. Details on the routes are given in Table 2 of [1], the location of the routes is presented in Fig. 7 in [1].
The average bias as a percent of mean measured travel times amounts to 12 and 13% for H, 32 and 23% for WBH, 10 and 12% for R. For the SOT we underestimate travel time on average by 8 and 19%. This holds although on the validation sample no bias in the predictions has been detected (see Fig. 7). Note, however, that the validation period is limited to a few days in January 2012 where the weather conditions might interfere with predictions.
Corresponding to the various routes the travel time variance is estimated according to Eq. 9. For each single trip we calculate the deviation between the measured and the predicted route travel time and divide by \(\sqrt {\hat {\mathrm {V}}(u^{R}_{d,i})}\). If the variance is correctly estimated then the corresponding sample should show unit variance. If the variance is underestimated then the normalized prediction errors have empirical variance larger than one; if the variance is overestimated the normalized prediction errors show empirical variance smaller than unity. Naturally the route travel time measurements are also subject to measurement errors.
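The check described here can be written down in a few lines. The sketch is a minimal illustration with a hypothetical function name and made-up numbers; it uses the sample standard deviation, so a systematic bias in the predictions is removed before the spread is computed.

```python
import math
import statistics


def normalized_error_std(measured, predicted, predicted_variance):
    """Sample standard deviation of (measured - predicted) / sqrt(variance).

    A value near 1 indicates a well calibrated variance estimate, a value
    above 1 an underestimated variance, a value below 1 an overestimated one.
    """
    normalized = [
        (m - p) / math.sqrt(v)
        for m, p, v in zip(measured, predicted, predicted_variance)
    ]
    return statistics.stdev(normalized)


# Illustrative trips (seconds): errors of +10, -10 and 0 around 100 s.
measured = [110.0, 90.0, 100.0]
predicted = [100.0, 100.0, 100.0]
calibrated = normalized_error_std(measured, predicted, [100.0] * 3)
overestimated = normalized_error_std(measured, predicted, [400.0] * 3)
```

In the first call the assumed variance of 100 s² matches the spread of the errors and the result is 1; in the second call the assumed variance is four times too large and the result drops to 0.5.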
Standard deviation of normalized travel time prediction errors
Route | NC | EM | ET | MI | MJ | SD |
---|---|---|---|---|---|---|
H1 | 1.33 | 0.73 | 0.76 | 0.67 | 0.69 | 0.33 |
H2 | 1.12 | 0.65 | 0.67 | 0.65 | 0.67 | 0.32 |
WBH1 | 0.96 | 0.74 | 0.75 | 0.59 | 0.57 | 0.32 |
WBH2 | 1.19 | 0.72 | 0.73 | 0.64 | 0.61 | 0.34 |
R1 | 1.22 | 0.88 | 0.88 | 0.75 | 0.79 | 0.52 |
R2 | 1.19 | 0.79 | 0.79 | 0.73 | 0.78 | 0.51 |
SOT1 | 1.50 | 1.08 | 1.08 | 1.25 | 1.19 | 0.91 |
SOT2 | 1.67 | 1.42 | 1.42 | 1.55 | 1.49 | 1.22 |
This also provides some indication that the assumption of correlations between model errors being similar to correlations between prediction errors is realistic. However, more research based on more appropriate data sets is needed.
Somewhat of an outlier in these comparisons is the SOT data set. Here as can be seen in Fig. 12d the variability is overestimated heavily during daytime with all approaches while it is underestimated close to midnight. Fig. 11d shows that the observed route travel times do not contain observations of heavy congestion on the SOT during typical peak hours. Therefore the failure to match uncertainty might be an artefact of too few validation measurements.
5 Conclusion
In this paper, based on a large real world data set, models for the estimation of route travel times and the corresponding associated uncertainty have been obtained. Application of the models demonstrates that predictions of link travel times show considerable heteroskedasticity that needs to be taken into account for accurate estimation of route travel time uncertainties. We find that heteroskedasticity is related to the number of vehicles observed on each link in each time-of-day-interval but also to the traffic conditions. Explicit models for the dependency are derived.
Investigating the model errors in link travel time estimates further, we found significant correlations between residuals on adjacent links, which additionally have been shown to depend on the joint usage of roads. In this respect, the trip count ratio is used as an indicator of joint usage and is shown to have an impact on the spatial correlation.
Based on models for the correlation of link travel times as a function of the distance and the trip count ratio, formulas for the route travel time prediction error variance are suggested. Using directly measured route travel times, we find that including the correlation in the calculation of the route travel time uncertainty appears to yield partially superior results compared to simple assumptions of zero or perfect correlation. However, we also found that explicit modelling of the correlation leads to only minor performance enhancements compared to simple models that use sample correlations for nearby links and set the correlation to zero for distances larger than 1 km.
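The difference between the zero, perfect and distance-truncated correlation assumptions can be illustrated with the generic formula for the variance of a sum, Var = sum over i, j of rho_ij * sigma_i * sigma_j. The sketch below is not the paper's Eq. 9. The function name, the link standard deviations, the link positions (used as a stand-in for the distance between links) and the constant correlation of 0.5 are all hypothetical.

```python
def route_variance(sigmas, positions_km, corr, max_dist_km=1.0):
    """Variance of the summed link errors along a route.

    sigmas       -- standard deviations of the link errors
    positions_km -- assumed link positions along the route, in km
    corr         -- function (i, j) -> correlation for links i != j
    max_dist_km  -- correlations are set to zero beyond this distance
    """
    total = 0.0
    n = len(sigmas)
    for i in range(n):
        for j in range(n):
            if i == j:
                rho = 1.0
            elif abs(positions_km[i] - positions_km[j]) <= max_dist_km:
                rho = corr(i, j)
            else:
                rho = 0.0
            total += rho * sigmas[i] * sigmas[j]
    return total

# Hypothetical three-link route.
sigmas = [10.0, 20.0, 15.0]
positions_km = [0.0, 0.4, 1.6]

var_zero = route_variance(sigmas, positions_km, lambda i, j: 0.0)
var_perfect = route_variance(sigmas, positions_km, lambda i, j: 1.0,
                             max_dist_km=float("inf"))
var_truncated = route_variance(sigmas, positions_km, lambda i, j: 0.5)
# var_zero = 725.0, var_perfect = 2025.0, var_truncated = 925.0
```

In this toy case only the first two links are within 1 km of each other, so the truncated variant adds a single cross term to the zero-correlation result and stays far below the perfect-correlation bound.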
In conclusion, this leads to the suggestion to quantify route travel time uncertainty based on empirical spatial correlation estimates, which can be confined to adjacent links and hence do not face the same data problems as empirical estimates for the whole network. Using empirical correlations has the advantage of not requiring any other information (such as trip count ratios, the location of traffic lights and so forth). Moreover, this might also alleviate the restriction to correlations of model errors, which needed to be imposed in this paper due to data availability. Alternatively, prediction error correlations for adjacent links could be measured directly based on high-frequency taxi FCD. The analysis in this paper justifies the usage of this simple method over more complex model-based approaches.
Our model uses a simple scaling approach: the model error is assumed to follow a single common distribution, scaled by a standard deviation that depends on the current traffic conditions. This assumption has been verified empirically in the paper for the model errors. For the prediction errors, a partial verification is contained in Fig. 13. However, this figure also shows substantial deviations that need to be investigated in more depth.
Summing up, a model for the estimation of route travel time variability can be obtained based on the material in this paper. It is operational for a large street network and does not rely on excessive amounts of data other than the floating taxi measurements.
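Under the scaling assumption, quantiles of the error for a given traffic condition follow from the quantiles of the common standardized distribution multiplied by the corresponding standard deviation. The sketch below shows this for an interval around a predicted travel time. The function names, the linear interpolation rule for empirical quantiles and all numbers are hypothetical, and the error is taken as actual minus predicted travel time.

```python
def scaled_quantile(standardized_errors, sigma, q):
    """Quantile q of the error: sigma times the empirical quantile of the
    standardized errors (linear interpolation between order statistics)."""
    xs = sorted(standardized_errors)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return sigma * (xs[lo] + frac * (xs[hi] - xs[lo]))

def prediction_interval(predicted, sigma, standardized_errors, level=0.9):
    """Central interval for the actual travel time at the given level."""
    alpha = (1.0 - level) / 2.0
    lower = predicted + scaled_quantile(standardized_errors, sigma, alpha)
    upper = predicted + scaled_quantile(standardized_errors, sigma, 1.0 - alpha)
    return lower, upper

# Hypothetical standardized errors pooled over all traffic conditions.
standardized = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Hypothetical prediction of 300 s with an estimated standard deviation of 30 s.
interval = prediction_interval(300.0, 30.0, standardized, level=0.5)
# interval = (270.0, 330.0)
```

Because the standardized sample is shared across conditions, only `sigma` has to be modelled per link or route, which is the practical appeal of the scaling approach.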
Acknowledgments
Part of the work has been done while the first author was with the AIT Austrian Institute of Technology GmbH.
We gratefully thank Taxi 31300 (taxi31300.at) and Taxi 40100 (taxi40100.at) for providing the taxi data used in this study and the AIT Austrian Institute of Technology (in particular Hannes Koller has been very helpful with the details) for processing the raw data and making the data available. We acknowledge support for the Article Processing Charge by the Deutsche Forschungsgemeinschaft and the Open Access Publication Fund of Bielefeld University.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.