Transit data made easy: the gtfsr R package

Danton Noriega
Published in TransLoc TechLog, Jul 11, 2016

This interactive map of the Duke University transit system was created using gtfsr with a few lines of code.

Plenty of data about transit systems is publicly available, but that doesn’t mean it’s easy to find and use for analysis and visualization. To make it more accessible, the TransLōc data science team recently released an R package called gtfsr. The package is designed to make it easy to find, validate, and map transit system data. We hope it will also be used by other developers to incorporate transit data into related R packages, leading to better tools for all of us.

What is an R package?

R is a very popular open source language for data science and statistics. The core language ships with a lot of basic functionality and is extended by a large ecosystem of packages. R packages are considered the “fundamental units of reproducible R code”. A package lets any R user bundle a set of reusable R functions for easy, open distribution. Packages are the lifeblood of R’s vibrant, expanding community and a major reason R is one of the most popular open source programming languages among data scientists and statisticians.

What is the GTFS (feed) format?

The General Transit Feed Specification (GTFS) provides a standard format for recording and distributing public transit data. A GTFS “feed” is a collection of required and optional flat text files (comma-separated values, saved with a .txt extension), usually bundled into a single zip file. Each file must follow a predefined name and field structure. Together the files hold the stops, routes, schedules, and other data needed to describe a transit system.
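
To make the file format concrete, here is a minimal sketch in base R that parses two rows of a made-up stops.txt. The field names (stop_id, stop_name, stop_lat, stop_lon) come from the GTFS specification; the stop names and coordinates are invented for illustration and do not come from a real feed.

```r
# Two rows of a made-up stops.txt. The header uses GTFS field names;
# the values are invented for this example.
stops_txt <- "stop_id,stop_name,stop_lat,stop_lon
S1,Example Chapel Stop,36.0019,-78.9403
S2,Example Library Stop,36.0031,-78.9385"

# A GTFS file is plain comma-separated text, so base R can read it directly.
stops <- read.csv(text = stops_txt, stringsAsFactors = FALSE)

# Inspect the resulting data frame: one row per stop, one column per field.
str(stops)
```

Every other file in a feed (routes.txt, trips.txt, and so on) can be read the same way; only the set of fields differs.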

The TransLōc blog has a great post about the history of GTFS feeds. Also, see Google’s GTFS reference materials for details on formatting specifics.

Why create an R package for GTFS feeds (gtfsr)?

While R users could certainly work with GTFS data without a special package, standardizing and re-using functions enables the building of reliable tools and analysis workflows. Converting a GTFS feed into a structured data object empowers users to easily extract and merge data for plotting, modeling, and analysis. This is the aim of the gtfsr package.

Features of the gtfsr package

  1. Find. A key feature of the gtfsr package is helping users quickly find GTFS feeds through the TransitFeeds API. Users need only get an API key to begin accessing GTFS feeds from within R.
  2. Import. All that is required is a valid URL to a GTFS feed zip file. One need not use the TransitFeeds API. Local paths to GTFS zip files are also allowed.
  3. Structure. One of the key features of the gtfsr package is that it converts a GTFS feed into a gtfs data object. Each gtfs data object is just a structured collection of “tidy” data tables. In essence, it takes the files in any given GTFS feed and converts them into a “living” data object: data a user can easily interact with. We hope this standardized yet flexible structure will become the standard for incorporating GTFS data into other packages.
  4. Validate. A single GTFS feed comprises many different text files, and each text file contains many different fields. Each GTFS feed must contain certain required files and fields, but it may also contain optional files and fields. The gtfsr package checks for required files during import and reports any problems it finds in the data. Users can go a step further and extract the validation data for meta-analysis.
  5. Mapping. One of the best features of the gtfsr package is the ability to easily map transit stops, routes, or networks. All that is required is a valid gtfs data object.
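
The import, structure, and validate steps above can be sketched in base R. This is only an illustration of the idea, not gtfsr's actual implementation: the helpers read_feed_dir() and missing_required() are hypothetical names made up for this example, the toy feed is invented, and the check covers only the files that every feed must include (agency, stops, routes, trips, stop_times), assuming the feed has already been unzipped into a directory.

```r
# Hypothetical helpers written for this illustration; these are NOT gtfsr functions.

# Files every GTFS feed must include. A feed also needs calendar.txt or
# calendar_dates.txt, but which one depends on the feed, so it is left out here.
required_files <- c("agency", "stops", "routes", "trips", "stop_times")

# Read every .txt file in an unzipped feed directory into a named list
# of data frames, one element per file.
read_feed_dir <- function(dir) {
  paths <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  feed <- lapply(paths, read.csv, stringsAsFactors = FALSE)
  names(feed) <- tools::file_path_sans_ext(basename(paths))
  feed
}

# Report which required files are absent from the feed.
missing_required <- function(feed) {
  setdiff(required_files, names(feed))
}

# Build a deliberately incomplete toy feed in a temporary directory.
feed_dir <- file.path(tempdir(), "toy_feed")
dir.create(feed_dir, showWarnings = FALSE)
writeLines(c("stop_id,stop_name,stop_lat,stop_lon",
             "S1,Example Stop,36.0019,-78.9403"),
           file.path(feed_dir, "stops.txt"))
writeLines(c("route_id,route_short_name,route_long_name,route_type",
             "R1,C1,Example Loop,3"),
           file.path(feed_dir, "routes.txt"))

feed <- read_feed_dir(feed_dir)
names(feed)             # the tables that were found
missing_required(feed)  # the required tables that were not
```

Keeping each file as its own data frame inside one named list is what makes it easy to pull out a single table, or merge two of them, for plotting and analysis.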

Why is TransLōc, a private company, interested in making an open source package?

TransLōc believes everyone benefits when the barriers to public knowledge are reduced. Simple data tools open up public data to consumers, citizens, and institutions, and simple data structures improve the quality, sharing, and understanding of these data.

We hope to encourage more people and transit agencies to explore transit data and do interesting things with it. By providing simple functions, the gtfsr package reduces the tedium of obtaining, validating, and mapping GTFS feed data. Furthermore, by providing a well-structured gtfs data object, we hope this package will give other R users a foundation for building further packages, expanding the ecosystem of open source tools for GTFS feed data. Simply put, we want users to spend less time on the less fun parts of data science (such as cleaning and validating) and more time playing with transit data.

How can I try it out?

The best place to get started is the readme file on the package’s GitHub page, which walks you through each step of using the package in detail. Even if you’re not an R user, if you’re up for an adventure, you should be able to get data for your favorite transit system and map it by making simple modifications to the sample code. You’ll need to download R first, and using RStudio will make the experience much easier.
