Our instructors from Datascope Analytics designed the bootcamp around a number of projects.
In week 2 & 3, we were all let loose on movie revenue data scraped from Box Office Mojo. The major goal was to predict a movie-related value of choice by using a linear regression model, optionally involving logarithmic and/or polynomial terms.
The first days were spent on writing and running the web scraper. I used Beautiful Soup and good old regular expressions to parse and convert the messy Box Office Mojo data into a nice and clean comma-separated file.
The scope of my personal project shrunk as the data came in and the days went by. In the end, after many iterations in IPython, I got to fit a number of multilinear regression models with statsmodels, using sklearn for model evaluation (root-mean-squared error), and the inevitable pyplot for visualization.
On Friday, the whole class delivered their presentations. My first-ever Keynote presentation was for a ficticious investor known by the name of Dino Brangelino. My conclusion: knowing the box office revenue of the opening weekend -on top of the production budget- reduces total revenue prediction error by 15%.
A not-so-random sample of subjects from my fellow students’ talks:
-
how to convince laggards among the actors and actresses to go on Twitter, by predicting how many followers they would have if they did
-
can monthly consumer optimism predict the mutual prevalence of movie genres?
-
does the choice of a distributor help to secure an Oscar nomination?
-
which actor, actress, director or producer is overpaid or underpaid, given the gross revenue of the movies they have contributed to?
-
does the co-featuring network of actors/actresses determine movie revenue?