Most people enjoy the freedom and convenience of having a car, but for me part of the appeal of the New York City lifestyle is never having to drive. Although this usually means taking the subway, cabs are a staple of getting around the city. It's hard to imagine a day without hearing their horns blaring or seeing them weave through traffic. That raises the question: how many cab rides are taken on a daily basis?
On an average day, how many rides are taken, how much is generated in fares, and what is the average fare?
The Data and Conclusions
Unsurprisingly, the data confirms that Midtown is by far the most active neighborhood for cab pickups, followed by the Upper East Side. LaGuardia and JFK both stand out among locales in the outer boroughs.
The total fares generated per day in each neighborhood essentially mirror the number of pickups. Midtown accounts for more than $1MM in total fares per day. However, both airports now perform roughly on par with the Upper West Side by this metric.
Things shake up rather significantly once we look at the average fare per pickup. Neighborhoods in the outer boroughs usually have costlier rides, and the average fare is understandably very high at both airports.
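All three metrics here (pickups, total fares, average fare) are simple per-neighborhood aggregations. A minimal sketch in plain Python, assuming each ride has already been labeled with a neighborhood; the sample rides, fares, and the function name `summarize_by_neighborhood` are made up for illustration and are not taken from the actual dataset or notebook:

```python
from collections import defaultdict

# Hypothetical sample: (neighborhood, fare in dollars) for each ride.
rides = [
    ("Midtown", 9.50),
    ("Midtown", 12.00),
    ("Upper East Side", 8.00),
    ("JFK Airport", 52.00),
]

def summarize_by_neighborhood(rides):
    """Return {neighborhood: (pickups, total_fare, average_fare)}."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for neighborhood, fare in rides:
        counts[neighborhood] += 1
        totals[neighborhood] += fare
    return {
        name: (counts[name], totals[name], totals[name] / counts[name])
        for name in counts
    }

summary = summarize_by_neighborhood(rides)
```

In a notebook the same thing would more likely be a dataframe group-by, but the logic is identical: count, sum, then divide.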
The Tech and the Sources
The goal was to explore and plot data, primarily using IPython Notebook (Jupyter). I initially thought this would be a quick project: I started in July and thus used data from January through June. However, the data has since been updated at the Taxi & Limousine Commission. More than anything, it was a great lesson in data cleansing. I first removed any incomplete data, which included records with (0,0) geocoordinates, $0 fares, etc. Next, I had to assign a neighborhood to each ride, which entailed:
Extracting the latitude and longitude of each valid pickup
Rounding the coordinates to the nearest half hundredth of a degree (due to volume restrictions on geocoding APIs)
De-duping the values and feeding them through the New York Times Districts API
Assigning the neighborhoods back to the dataframe and merging it with a GeoJSON of NYC
Manually assigning neighborhoods where the NY Times API and the Pediacities GeoJSON did not match (the most time-consuming part)
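The cleaning, rounding, and de-duping steps above can be sketched roughly as follows. This is an illustration in plain Python, not the code from the notebook: the sample rides are invented, both latitude and longitude are snapped to the 0.005-degree grid, and `lookup_neighborhood` is a hypothetical stand-in for the real geocoding call.

```python
GRID = 200  # 1 / 0.005 cells per degree, i.e. the nearest half hundredth

def is_valid(ride):
    """Drop rides with a (0, 0) pickup location or a $0 fare."""
    lat, lon, fare = ride
    return (lat, lon) != (0, 0) and fare != 0

def snap(value):
    """Round a coordinate to the nearest 0.005 degrees."""
    return round(value * GRID) / GRID

def lookup_neighborhood(point):
    """Hypothetical stand-in for the geocoding API (one call per unique point)."""
    fake_api = {(40.75, -73.985): "Midtown"}
    return fake_api.get(point, "Unknown")

# Hypothetical rides: (pickup latitude, pickup longitude, fare in dollars).
rides = [
    (40.7484, -73.9857, 9.5),
    (40.7512, -73.9871, 12.0),
    (0.0, 0.0, 7.0),           # invalid location
    (40.6413, -73.7781, 0.0),  # invalid fare
]

valid = [ride for ride in rides if is_valid(ride)]
snapped = [(snap(lat), snap(lon)) for lat, lon, _ in valid]

# De-dupe so that each grid point is geocoded only once.
unique_points = sorted(set(snapped))
neighborhood_of = {point: lookup_neighborhood(point) for point in unique_points}

# Assign the looked-up neighborhood back to every valid ride.
labeled = [
    (neighborhood_of[point], fare)
    for point, (_, _, fare) in zip(snapped, valid)
]
```

The payoff of rounding is in `unique_points`: the two nearby pickups collapse into a single grid point, so only one lookup is needed instead of two. The cost is precision, since every ride within a grid cell inherits the same neighborhood.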
The data was manipulated in the notebook and the images were produced using the GeoPandas library. I could have done a better job of adding axis titles and units to the legend, but that was outside the initial scope of the project. The intervals for the clusters were created with a Fisher-Jenks natural breaks classification, which picks class boundaries that minimize the variance within each class and so maximize the difference between classes. The main limitations of the study are the rounding of the geocoordinates, the limited definition of neighborhoods in the GeoJSON, and the quality of the data. As you can see, Midtown is mapped so broadly that it is unsurprisingly the leader. The data could easily be used for an hourly time-lapse showing pickup frequency over time. Now I'm waiting to explore the Uber datasets.
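For the curious, the idea behind natural breaks can be sketched from scratch. The version below is a simple dynamic program that splits sorted values into k contiguous classes with the smallest total within-class squared deviation. It is only an illustration of the technique, not the implementation used for the maps, and the function name `natural_breaks` and the sample values are made up.

```python
def natural_breaks(values, k):
    """Return the upper bound of each of k classes that together
    minimize the total within-class sum of squared deviations."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    # Prefix sums give the squared deviation of any slice xs[i:j] in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def ssd(i, j):
        """Sum of squared deviations from the mean for xs[i:j], j > i."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: best cost of splitting the first j values into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    split = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = cost[c - 1][i] + ssd(i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    split[c][j] = i

    # Walk the split table backwards to recover each class's upper bound.
    bounds = []
    j = n
    for c in range(k, 0, -1):
        bounds.append(xs[j - 1])
        j = split[c][j]
    return bounds[::-1]

# Hypothetical pickup counts for eight neighborhoods, split into 3 classes.
breaks = natural_breaks([1, 2, 3, 10, 11, 12, 50, 52], 3)
```

This brute-force version runs in O(k·n²) time, which is fine for a few hundred neighborhood totals but would be too slow for raw ride-level data.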