
We Taught a Computer Program to Predict the Oscars. Here’s the Movie It Says Will Win Best Picture


We predict that the Academy Award for Best Picture will go to Roma, putting aside the fact that neither of us has seen it. Not to worry: The machine that whispered this prediction to us hasn’t seen it either.

Our oracle is a fairly simple computer program we wrote that accepts 69 years of data on major film awards, ignores all but the most predictive variables, and returns a statistical model that can predict past winners with extraordinary accuracy. Even in a chaotic year for Hollywood, the model is bullish on Roma, directed by Alfonso Cuarón and streamed on Netflix, as the leading contender for the top honor, assigning it a score of 45.5 out of 100, roughly four times the score of its closest competitors.

That score represents the historical odds that films with the same accolades as Roma have gone on to win Best Picture, with one caveat: This model considers each film’s odds of success independently through the wide lens of history, not as an eight-way race among this year’s nominees. Thus, the probabilities we computed for 2019 do not sum to 100%.

In descending order, our estimate of each film’s chance of winning is:

  • Roma (45.5%)
  • Vice, BlacKkKlansman and The Favourite (tied at 11.2% each)
  • Bohemian Rhapsody and Green Book (tied at 1.7%)
  • A Star Is Born and Black Panther (tied at 0.2%)
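Because each film is scored on its own, the eight figures above need not add up to 100. The arithmetic can be sketched in a few lines of Python using only the percentages listed. The rescaling at the end is purely an illustrative assumption about how one might turn the scores into an eight-way race; it is not something the model is described as doing.

```python
# Scores exactly as published in the list above, in percent.
scores = {
    "Roma": 45.5,
    "Vice": 11.2,
    "BlacKkKlansman": 11.2,
    "The Favourite": 11.2,
    "Bohemian Rhapsody": 1.7,
    "Green Book": 1.7,
    "A Star Is Born": 0.2,
    "Black Panther": 0.2,
}

# The scores are independent estimates, so their sum is not forced to be 100.
total = sum(scores.values())
print(f"Sum of the eight scores: {total:.1f}%")

# How far ahead is Roma? Compare it with the next-highest score.
runner_up = sorted(scores.values(), reverse=True)[1]
ratio = scores["Roma"] / runner_up
print(f"Roma vs. runner-up: {ratio:.2f}x")

# ASSUMPTION (not part of the published model): rescale the scores so they
# sum to 100, treating them as shares of an eight-way race.
shares = {film: 100 * s / total for film, s in scores.items()}
for film, share in shares.items():
    print(f"{film}: {share:.1f}%")
```

The published scores total 82.9, which is why the list cannot be read as a single pie chart of the race.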

    That’s the headline news, but there’s a better story here. Thanks to its elegance and simplicity, our model can tell us precisely why it thinks Cuarón’s drama will win. Its logic is enlightening even if its 2019 prediction turns out to be wrong. Of the 47 variables we fed the program, it ultimately needed only three pieces of information to guess historical outcomes extremely well.

    Those three key factors are:

  • Whether or not the film was also nominated for the Oscar for Best Director
  • Whether it was also nominated for the Oscar for Best Editing
  • Whether it won top honors from the Directors Guild of America, which is announced before the Oscars

    Curiously, Roma has only two of these bona fides, given that it missed out on a nomination for Best Film Editing to five other Best Picture nominees. It survived on the power of Cuarón having already won the guild award for Outstanding Directorial Achievement in a Feature Film, an honor dating back to 1948. In our data, which goes back to the award year 1950, we found that, of the 69 Best Picture winners since then, 54 also won the DGA award for best director. Only one, Driving Miss Daisy, was not even nominated for it, though this is not a detail that the model considers.

    Winning the guild award, according to our approach, multiplies a movie’s odds of winning the Oscar for Best Picture by 66 when considered independently of other variables, the highest individual predictive power of any of the 47 factors we considered. Those factors included whether the film (or individuals working on it, like actresses) was nominated for 14 other Academy Awards that are reasonably consistent going back to the 1950 ceremony, and whether it was nominated for, and whether it won, 10 Golden Globes, six awards from the British Academy of Film and Television Arts and the one Directors Guild award.
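Odds are not the same thing as probability, so a 66-fold jump in odds does not mean a 66-fold jump in the chance of winning. Here is a short Python sketch of the conversion. The 2% baseline is a hypothetical number chosen only for illustration, not a figure from the data, and reading the 66 as the odds ratio of a one-variable logistic regression is an assumption.

```python
import math

ODDS_RATIO = 66.0  # from the text: odds multiplier attached to a guild-award win


def prob_to_odds(p):
    """Convert a probability (0 < p < 1) to odds in favor."""
    return p / (1 - p)


def odds_to_prob(odds):
    """Convert odds in favor back to a probability."""
    return odds / (1 + odds)


def apply_odds_ratio(p, ratio):
    """Multiply the odds implied by probability p by `ratio`."""
    return odds_to_prob(prob_to_odds(p) * ratio)


# ASSUMPTION: in a logistic regression, an odds ratio is exp(coefficient),
# so a ratio of 66 would correspond to a coefficient of ln(66).
coefficient = math.log(ODDS_RATIO)
print(f"Implied coefficient: {coefficient:.2f}")

# HYPOTHETICAL baseline chance for a film without the guild award.
baseline = 0.02
boosted = apply_odds_ratio(baseline, ODDS_RATIO)
print(f"Baseline {baseline:.1%} -> with guild award {boosted:.1%}")
```

Working in odds keeps the result between 0 and 1 however large the multiplier is, which multiplying the probability directly would not.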

    Curiously, while the Golden Globes are at least casually scrutinized for hints of Oscar success, the model ignored everything we told it about the junior varsity honors, as well as the Academy’s British counterpart.

    We are not the first to notice that the Oscar for Film Editing—which will be awarded this year during a commercial break—has a strong correlation to Best Picture. (Update: the Academy reversed the decision to air Film Editing off-screen.) What’s particularly powerful about deputizing one’s computer in making predictions is that it considers the predictive power of combinations of different awards, which is very difficult to notice on one’s own. The tincture of the two Oscars and the guild award is so powerful, in fact, that it is more accurate than the same trio plus the added contribution of the fourth-best predictor, which is a nomination for the Oscar for Best Supporting Actor.
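To get a sense of why combinations are hard to weigh by hand, it helps to count them. The quick Python sketch below only counts the possible subsets of the 47 variables. It makes no claim about how the selection algorithm actually searched them.

```python
import math

N_VARIABLES = 47  # number of candidate variables given in the text

# Distinct three-variable and four-variable combinations.
trios = math.comb(N_VARIABLES, 3)
quartets = math.comb(N_VARIABLES, 4)

# Every possible subset of the variables, including the empty one.
all_subsets = 2 ** N_VARIABLES

print(f"Three-variable combinations: {trios:,}")
print(f"Four-variable combinations:  {quartets:,}")
print(f"All possible subsets:        {all_subsets:,}")
```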

    No matter how hard we pummeled the algorithm that chooses the best model, which is written in the open-source language R, running simulations for over an hour that force-fed it more and more data to consider, we did not get any foie gras. The model wanted only three of the 47 entrees on the menu.

    This runs against the grain of a common assumption about Big Data, which feeds astronomically more information into more complicated models under the rallying cry of Machine Learning: That the bigger the data, the better. Our approach could, at most, be called “Medium Data.” Medium Data—a phrase that, based on a Google search, a few others have also alighted on—is not a mere convenience when it comes to sitting around waiting for a model to run. It offers something extremely valuable that Big Data cannot: Human learning.

    Because our approach culled only three variables from a list of 47 candidates in our model, we are able to look under the hood and actually learn something about moviedom: that good editors and great directors are indispensable to would-be Best Pictures, assuming you listen to the opinion of other directors.

    Of course, this doesn’t mean we couldn’t be absolutely wrong this year. The model lives in a blissful world that is ignorant of the controversy around Green Book or the impenetrable politics inside the Academy. In fact, it doesn’t even know what a movie is, and we didn’t tell it. It didn’t need to know, because unlike its bigger, smarter and moodier cousins in industrial-grade machine learning, it still knew how to talk back in a language we can understand.

    Methodology

    The source data for the model was compiled and fact-checked using a variety of industry databases including IMDB and the unrelated OMDB. For the model, which is written in the open-source language R, we used a logistic regression powered by 47 variables and 381 films nominated for Best Picture between 1950 and 2018. The “we” we keep referring to is a tag-team of the director of data journalism at TIME and an assistant professor of statistics at Virginia Tech. Full disclosure: said tag-team, jointly known to their respective parents as “CW” and “CF,” have been collaborating, in one way or another, since they met, at age 2.
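For readers curious what a three-variable logistic regression looks like in practice, here is a minimal Python sketch (the model itself was written in R). The intercept, the coefficient values and the variable names are all invented for illustration and are not the fitted model's. The point is the shape of the calculation: with three yes/no inputs there are only eight possible profiles, so films with the same accolades necessarily receive the same score.

```python
import math
from itertools import product

# HYPOTHETICAL values, invented for illustration only. They are NOT the
# coefficients of the model described in the article.
INTERCEPT = -6.0
COEFFICIENTS = {
    "best_director_nomination": 2.0,
    "best_editing_nomination": 2.2,
    "dga_win": 4.0,
}


def score(film):
    """Return the estimated win probability for a dict of 0/1 flags."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * film[name] for name in COEFFICIENTS)
    return 1 / (1 + math.exp(-z))  # logistic (sigmoid) function


# Enumerate every combination of the three yes/no variables: 2**3 = 8 profiles.
profiles = [dict(zip(COEFFICIENTS, flags)) for flags in product((0, 1), repeat=3)]

for profile in profiles:
    held = [name for name, flag in profile.items() if flag] or ["none"]
    print(f"{100 * score(profile):5.1f}%  {', '.join(held)}")
```

Each film is scored from its own flags alone, without reference to the other nominees, which is consistent with the scores for a given year not summing to 100%.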


    Write to Chris Wilson at chris.wilson@time.com