Predicting Business Closure

This project assesses the likelihood of a given business listed on Yelp going out of business.
Built by Andrew Tang, Oliver Goodman, and Daniel Letscher for EECS 349 at Northwestern University.

Ford's Theatre
Washington, DC

Was murdered here. Would not recommend.

Abe L., August 8, 2011

Taco Santo
Brooklyn, NY

The entire kitchen and wait staff saw an ice cream truck and ran outside, leaving me alone in the restaurant. 10 minutes later they all came back with ice cream cones.

I still can't believe this actually happened.

Ross F., September 7, 2014

Problem Statement

Yelp is a useful resource for researching the quality of restaurants and other businesses. However, small businesses operate in a competitive environment, and it is not uncommon for owners to be forced to close their doors, only to be quickly replaced by another business. The goal of our project was to use data from the Yelp Dataset Challenge to predict whether a given business will close down in the near future.

Our project matters because it addresses a question that puzzles Evanston residents each year: why do so many restaurants close? Its outcome could shed light on the larger economic forces behind the restaurant business in America. Many of these restaurants are likely to be family-owned, so their closures have big implications for the health of small business in America.

Solution Overview

We tested three models: logistic regression, a decision tree, and a Naive Bayes classifier. To train our models and classify our data, we used the scikit-learn machine learning package for Python. This package allowed us to run each model on our CSV file and report the training accuracy, test accuracy, test precision, test recall, and training time for each model.
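As a rough illustration of this workflow, the sketch below fits the three model types with scikit-learn and collects the same five measurements. It is not our project code: the data is a synthetic stand-in generated with make_classification rather than the Yelp CSV, and the function name evaluate_models, the 75/25 split, the choice of GaussianNB as the Naive Bayes variant, and max_depth=5 are all assumptions made for the example.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(X, y, seed=0):
    """Fit each model and collect the five measurements named in the text."""
    # Hypothetical 75/25 split; stratify keeps the class ratio in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        # max_depth=5 is an arbitrary illustrative depth limit.
        "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=seed),
        # GaussianNB is an assumed Naive Bayes variant.
        "naive_bayes": GaussianNB(),
    }
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        train_seconds = time.perf_counter() - start
        y_pred = model.predict(X_test)
        results[name] = {
            "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
            "test_accuracy": accuracy_score(y_test, y_pred),
            "test_precision": precision_score(y_test, y_pred, zero_division=0),
            "test_recall": recall_score(y_test, y_pred, zero_division=0),
            "train_seconds": train_seconds,
        }
    return results


# Synthetic stand-in for the CSV: roughly 15% of rows are labeled 1 ("closed").
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
results = evaluate_models(X, y)
for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Treating "closed" as the positive class (label 1) means precision and recall describe how well each model finds closures, which is the outcome of interest here.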

We used the following features from our dataset:

Testing and Training

Our dataset includes information on over 77,000 businesses in 10 cities around the world. We wanted to explore which features of businesses on Yelp were influential in determining whether a business would close down. Our original dataset consisted of over 80 attributes, many of which were only relevant to a certain type of business. As a result, we focused on features that applied to most of the businesses in our dataset.

When training and testing our models, we looked at how accurately each algorithm classified businesses as currently operating or closed down. Since the majority of the instances in our dataset were businesses that were open, we paid attention to precision and recall scores as well as accuracy. We experimented with different parameters for each of our models, such as the depth limit for our decision tree, to try to improve the classification results.
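To see why accuracy alone can mislead on a dataset dominated by open businesses, consider a classifier that labels every business as open. The sketch below computes the three scores by hand for that case. The labels are made up for illustration (15 closed, 85 open) and are not drawn from our dataset, and classification_metrics is a hypothetical helper name.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Define precision and recall as 0.0 when their denominators are empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up labels for illustration only: 1 = closed, 0 = open.
y_true = [1] * 15 + [0] * 85
always_open = [0] * 100  # a trivial classifier that never predicts "closed"

accuracy, precision, recall = classification_metrics(y_true, always_open)
print(accuracy, precision, recall)
```

The trivial classifier is right 85% of the time yet never identifies a single closure, which is why precision and recall on the closed class are reported alongside accuracy.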

Results

Below are the accuracy, precision, recall, and training time scores for our three models:

Our key finding is that the most important feature in determining whether a business will close down is its star rating on Yelp. To visualize this finding, below is the decision tree that was trained. Notice that the root node splits on stars.
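For readers who want to check a root split programmatically, the sketch below shows one way to read it from a fitted scikit-learn tree. It does not reproduce our result: the data is synthetic and deliberately constructed so that a low star rating drives the label, and the column names stars and review_count, the 2.5-star cutoff, and the 10% label noise are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 1000

# Synthetic columns (hypothetical names), not the Yelp data.
stars = rng.choice(np.arange(1.0, 5.5, 0.5), size=n)  # 1.0, 1.5, ..., 5.0
review_count = rng.integers(1, 500, size=n)            # unrelated to the label

# By construction, low-rated businesses are "closed" (1), with 10% label noise.
closed = (stars <= 2.5).astype(int)
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - closed, closed)

feature_names = ["stars", "review_count"]
X = np.column_stack([stars, review_count])

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Node 0 of the fitted tree is the root; tree_.feature holds split column indices.
root_feature = feature_names[clf.tree_.feature[0]]
root_threshold = clf.tree_.threshold[0]
print("root splits on:", root_feature, "at", root_threshold)

# Text rendering of the whole tree and per-feature importance scores.
print(export_text(clf, feature_names=feature_names))
print(dict(zip(feature_names, clf.feature_importances_)))
```

The same inspection applies to a tree trained on real features: the feature at node 0 is the one the tree found most useful for the first split.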

Final Report

We have written a detailed final report that goes into more depth on our methodology, dataset, and results. Click below to download the PDF.

Download