Blog Series, Article 1: Text Mining Movie Scripts
The Data Science team at Panoramic is on a mission to continuously innovate and challenge the norm. We’re always searching for actionable insights to apply to new products that marketers will find useful in reaching their goals. To better serve marketers in the entertainment industry, we’ve set our sights on extracting information from movie scripts. While we used several interesting ML techniques to create compelling outputs, the biggest challenge in this project was getting the data right. Let’s not kid ourselves: this is the biggest challenge in most Data Science projects. Despite that, I’d dare to say most blogs focus on applying the methods and/or use a boring dummy dataset that poses no challenge to an experienced Data Scientist (hello, Iris).
This is the first of four articles in the series Text Mining Movie Scripts, released weekly.
- Introduction, state of the art when it comes to script analysis (Released: July 18, 2019)
- Getting the scripts and transforming them into structured data (To-be-released: July 25, 2019)
- Creating meaningful outputs out of the data (To-be-released: August 1, 2019)
- Improving the project (To-be-released: August 8, 2019)
Introducing Script Analysis
No project exists in a vacuum. It is conducted within the context of its business. The first article in this series talks about the reasons we chose to take on this project as well as the art of script analysis.
Prior to beginning a Data Science project, we always have a goal in sight. We aren’t going to process movie scripts just to process movie scripts. In this case, our goals are to:
- Build a dashboard representation of existing movies for our creative team that enables them to identify similar movies and benchmark multiple scripts against one another.
- Extract numerical features from scripts for use in any kind of analysis/prediction, such as Box Office prediction.
- Create character archetypes for movie characters that can be matched to existing influencers/audiences for better targeting.
Very few Data Science projects have no paper or blog post to draw inspiration from. We are not ones to reinvent the wheel, so let’s look at what others have done in this area.
It turns out there is a paper called The emotional arcs of stories are dominated by six basic shapes, which analyzes books from Project Gutenberg by computing sentiment over sliding windows of 10,000 words across each book. Using this approach, the researchers were able to establish six basic story arcs and come up with cool representations of famous books, such as this one:
By doing this, the researchers were able to extract the so-called hero’s journey, as defined by Joseph Campbell. His now widely accepted idea has had a far-reaching impact on how advertisers sell to consumers and how storytellers frame the hero’s journey. A hero’s journey starts with our hero living their non-hero life, until their call to adventure. After a bit of denial, the hero meets their mentor and sets out to encounter their evil counterpart. The hero initially fails, only to get back up and become the hero they are destined to be for the final confrontation. The hero of course triumphs and lives happily ever after.
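To make the sliding-window idea from the paper concrete, here is a minimal sketch of how an emotional arc could be computed. It is not the researchers' implementation: the function name `emotional_arc`, the `step` parameter, and the tiny `TOY_LEXICON` are all assumptions made up for illustration, and a real analysis would need a full scored sentiment lexicon.

```python
from typing import Dict, List

# Toy sentiment lexicon with made-up scores (assumption, illustration only).
TOY_LEXICON: Dict[str, float] = {
    "happy": 1.0, "love": 1.0, "triumph": 1.0,
    "sad": -1.0, "fear": -1.0, "death": -1.0,
}


def emotional_arc(text: str, window: int = 10_000, step: int = 1_000,
                  lexicon: Dict[str, float] = TOY_LEXICON) -> List[float]:
    """Return the mean lexicon score of each sliding window of `window` words."""
    # Naive tokenization: split on whitespace, strip punctuation, lowercase.
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    if not words:
        return []
    arc: List[float] = []
    # Texts shorter than one window still produce a single data point.
    last_start = max(len(words) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = words[start:start + window]
        scores = [lexicon[w] for w in chunk if w in lexicon]
        # Windows without any scored word are treated as neutral.
        arc.append(sum(scores) / len(scores) if scores else 0.0)
    return arc


# A tiny "story" that starts badly and ends well.
story = "sad fear death " * 4 + "happy love triumph " * 4
print(emotional_arc(story, window=6, step=6))  # [-1.0, -1.0, 1.0, 1.0]
```

Plotting the returned list against window position gives the kind of curve shown in the paper's figures. The window size trades smoothness for detail, so a script, which is far shorter than a novel, would likely need a much smaller window than 10,000 words.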
We have to keep in mind that in a movie script, the hero is just one of the characters, just as the hero is only one of the archetypes defined by Jung, which can be seen in the diagram below.
It’s important to note that these archetypes are present in many types of stories, be it mythology or literature:
They’re just as prevalent in movies:
So it makes perfect sense to use them in movie advertising as well. The important question is whether or not we can connect these characters to profiles of fans. Luckily, Algobeans has already blogged about it. Below is an analysis of the personality traits of male and female fans of Star Wars characters:
This is what we had initially. We all know that, in theory, theory and practice are the same, but in practice, they are not. Curious how our project turned out?
Stay tuned for our next blog post, which will cover how we transformed the text of the scripts into structured data.