A basic linear regression model for predicting movie revenue

Our instructors from Datascope Analytics designed the bootcamp around a number of projects.

In week 2 & 3, we were all let loose on movie revenue data scraped from Box Office Mojo. The major goal was to predict a movie-related value of choice by using a linear regression model, optionally involving logarithmic and/or polynomial terms.

The first days were spent on writing and running the web scraper. I used Beautiful Soup and good old regular expressions to parse and convert the messy Box Office Mojo data into a nice and clean comma-separated file.

The scope of my personal project shrunk as the data came in and the days went by. In the end, after many iterations in IPython, I got to fit a number of multilinear regression models with statsmodels, using sklearn for model evaluation (root-mean-squared error), and the inevitable pyplot for visualization.

On Friday, the whole class delivered their presentations. My first-ever Keynote presentation was for a ficticious investor known by the name of Dino Brangelino. My conclusion: knowing the box office revenue of the opening weekend -on top of the production budget- reduces total revenue prediction error by 15%.

Presenting performance improvements in linear regression models to the Metis Data Science class

A not-so-random sample of subjects from my fellow students’ talks:

  • how to convince laggards among the actors and actresses to go on Twitter, by predicting how many followers they would have if they did

  • can monthly consumer optimism predict the mutual prevalence of movie genres?

  • does the choice of a distributor help to secure an Oscar nomination?

  • which actor, actress, director or producer is overpaid or underpaid, given the gross revenue of the movies they have contributed to?

  • does the co-featuring network of actors/actresses determine movie revenue?

Presenting observed vs. predicted movie gross revenue on the best-performing linear regression model