Using Machine Learning To Tease Out A Dynamic Pricing Algorithm

{Time for something different again! This piece is a follow-up and is originally posted on my Thought Blog with Ahan Analytics, LLC: http://ahan-analytics.drduru.com/thoughtblog/2013/12/07/machine-learning-dynamic-pricing/}

On November 29, 2013, I wrote a piece titled “Not Worth the Cost: A 17-Month Case Study of Congestion Pricing in the SF Bay Area.” In that piece, I presented data I manually collected on toll costs for the westbound Express Lane on Highway 237 (running from Milpitas to Sunnyvale, CA) versus the drive time on the highway’s general purpose lanes. I was disappointed to find that the relationship between the two was not very reliable. Moreover, I concluded that neither the toll nor the overall projects costs are worth paying.

Motivated by comments and questions from a reader, I decided to take a deeper look at the data to see whether I could tease out some more complex relationships. I will be doing this analysis in stages. In this first stage, I developed a simple machine learning model using a regression tree to predict drive times based on a full array of variables.

I broke up the data into the following independent variables:

  • Cost: price of the toll on the Express Lane in dollars.
  • Month: index for month of the year (1=Jan, 2=Feb, etc…) using the date of the data collection.
  • DayofWeek: index for the day of the week (2=Mon, 3=Tue, etc..) using the date of the data collection. Note that the tolls only apply on non-holiday weekdays.
  • WeekOfYear: index for the week of the year (1=the first week which is Jan 1st, 2 = the second week, etc..) using the date of the data collection. Note that the week starts on Monday.
  • Hour: the hour component of the start time of the drive on the general purpose lane.
  • Minute: the minute component of the start time of the drive on the general purpose lane.

The dependent variable (what the model is trying to predict/classify) is the duration of the drive on the general purpose lane in seconds. I coded this as DriveTimeSeconds.

I used the e1071 package in R for creating the best regression tree using 10-fold cross-validation.

The initial results are promising. The regression tree below shows that drive time on the general purpose lane is influenced by the day of the week and the fraction of the hour (but NOT the hour itself!). Adding these variables to the cost information from the toll lanes provides a richer understanding of resulting drive times. Of course, what the model cannot know is whether congestion itself just so happens to depend on the day of the week and the time (the Valley Transportation Authority (VTA) might be able to tell me though!).


Classification Tree Applied to the Dynamic Pricing Algorithm on the Highway 237 Express Lane (San Francisco Bay Area)
Classification Tree Applied to the Dynamic Pricing Algorithm on the Highway 237 Express Lane (San Francisco Bay Area)

Here is how to interpret the branches of the tree (from left to right of the leaf or end nodes):

  1. If the cost of the Express Lane is less than $2.05, then I can an average drive time of 479.2 seconds (8.0 minutes).
  2. If the cost is less than $2.55 but at least $2.05 AND the weekday is a Wednesday, Thursday, or Friday, then I can expect an average drive time of 641.5 seconds (10.7 minutes).
  3. If the cost is less than $2.55 but at least $2.05 AND the weekday is a Monday or Tuesday, then I can expect an average drive time of 700.7 seconds (11.7 minutes).
  4. If the cost is at least $2.55 AND the time is before half past the hour AND the time is at least 15.5 minutes past the hour, then I can expect an average drive time of 701.2 seconds (11.7 minutes).
  5. If the cost is at least $2.55 AND the time is before 15.5 minutes past the hour, then I can expect an average drive time of 766.8 seconds (12.8 minutes).
  6. If the cost is at least $2.55 AND the time at least 30 minutes past the hour, then I can expect an average drive time of 803 seconds (13.4 minutes).

With these results, I can move beyond the disappointing scatter of the 2-dimensional graph of drive time versus cost and see the more complex relationships at work. It is VERY interesting to see that while the tolls ranged from $0.85 to $4.25, the tree only contains two branching points based on cost. This verifies that cost is not a sufficient determinant of average driving time from the perspective of the driver in the general purpose lane.

The chart below recasts the original chart: it color-codes the points according to the rules from the regression tree. You can now visualize how the algorithm partitioned the data. The “nodes” in the legend are ordered and numbered as shown in the list above.

(Note this latest data set includes two additional days: December 3rd and 6th. It just so happens that these two days were unlike all the other days of data in that I actually observed some moments of congestion in the Express Lane).


Highway 237 Drive Time Versus Cost of Express Lane (Random Dates from Jun 18, 2012 to Dec 6, 2013)
Highway 237 Drive Time Versus Cost of Express Lane
(Random Dates from Jun 18, 2012 to Dec 6, 2013)

With this format, you can also visualize which parts of the model have the highest error rates. The very first rule, “Node1”, has the highest error rate given that with a cost less than $2.05 drive time can range from 200 to 800 seconds (3.3 to 13.3 minutes). If I had additional variables at my disposable, I might be able to reduce the error rate of this region of data. This model can also be a starting point to help the VTA generate a more consistent congestion pricing model (again, from the perspective of the general purpose driver).

In a future analyze, I will apply k-means clustering to these data to see whether I can generate even richer results. I think the partitioning routine of k-means should be well-suited to this problem. I will also explore metrics of performance of these models. Stay tuned!

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.