Blog Series, Article 2: Text Mining Movie Scripts
This is the second of four articles in the series Text Mining Movie Scripts, released weekly. If you missed the first post, you can find it here: Introduction to State of the Art Script Analysis
Turning Stories into Data
There is no science without data. In any data science project, most of the time is spent cleaning data, and this project is no different. Even so, there is great fun to be had dealing with data, so let’s do this!
- First, we create structured data out of the full-text scripts. We use established NLP techniques to clean and standardize the data, such as stop word removal, lemmatization or stemming, part-of-speech tagging, and entity recognition.
- After we structure the data, we use sentiment analysis to extract story arcs. We’ll complete this process for each character separately to compare their storylines.
- We will use topic modeling to better understand the themes tied to each movie character, ideally so we can cluster the characters based on their interests.
- Lastly, we will build character profiles using a personality extraction API and the character dialogue.
Oh, Data Where Art Thou?
Everything is on the internet. One of the most comprehensive databases of scripts you can find is the Internet Movie Script Database at imsdb.com. That’s our starting point. The only disadvantage is that it lacks newer scripts, but it works for our proof of concept. A typical movie script looks something like this:
The scripts are HTML pages, so we have to do a bit of scraping. Web scraping today is not all that complex. Here are a few of our steps:
- Scrape all of the URLs that could lead to movies.
- Filter out movie script pages using regular expressions.
- Strip buttons and navigational links from the script pages.
- Remove all HTML tags.
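The filtering and cleaning steps above can be sketched with nothing but the Python standard library. This is a minimal illustration, not the code used in the project: the URL pattern, the list of skipped tags, and the names `filter_script_urls` and `strip_html` are all assumptions, and the real site layout may differ.

```python
import re
from html.parser import HTMLParser

# Assumed shape of a script page URL; adjust to the real site layout.
SCRIPT_URL = re.compile(r"/scripts/[^/]+\.html$")


def filter_script_urls(urls):
    """Keep only the URLs that look like script pages."""
    return [u for u in urls if SCRIPT_URL.search(u)]


class TextExtractor(HTMLParser):
    """Collect page text, skipping tags that usually hold navigation or code."""

    # Hypothetical skip list; dropping <a> also drops any linked script text.
    SKIP = {"script", "style", "nav", "button", "a"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not nested inside a skipped tag.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML page with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Keeping the URL filter and the tag stripper separate makes it easy to tighten one without touching the other when a page turns out to be formatted differently.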
Great, now we have around 1200 scripts. What next?
Chaos In, Structure Out
In this part, we’re going to discuss breaking down a script into smaller chunks. A script contains three basic building blocks:
- Scenes (Night at the sea)
- Dialogue (Darth Vader: And you know that we got it, Deathstar)
- Action (Snape kills Dumbledore)
And this is exactly how we plan to break the full-text scripts into basic building blocks for further aggregation and analysis. Before we get into that, I’d like to introduce the characters I’m going to mention in my examples. We’ll be looking at the Star Wars: The Force Awakens script.
While we do have many scripts, our approach is to first break down one script and then try to generalize the method and apply it to as many movies as possible. So let’s look at how difficult it is to chunk Star Wars: The Force Awakens.
First up are the scenes. They are not that difficult: a few smart regexes give decent results, because each scene is numbered and introduced by an all-caps annotation that notes whether the scene takes place in an interior or exterior.
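A minimal sketch of such a scene regex, assuming headings of the form "1 INT. LOCATION - TIME" on their own line. The exact pattern, the `split_scenes` name, and the sample text are illustrative assumptions, not the regexes used in the project.

```python
import re

# Optional scene number, then INT. or EXT., then an all-caps description.
SCENE_HEADING = re.compile(
    r"^[ \t]*(?:(?P<number>\d+)[ \t]+)?(?P<heading>(?:INT|EXT)\.[^a-z\n]*?)[ \t]*$",
    re.MULTILINE,
)


def split_scenes(script_text):
    """Split a script into scenes, one dict per detected scene heading."""
    matches = list(SCENE_HEADING.finditer(script_text))
    scenes = []
    for i, m in enumerate(matches):
        # A scene body runs from its heading to the next heading (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(script_text)
        scenes.append({
            "number": m.group("number"),
            "heading": m.group("heading"),
            "body": script_text[m.end():end].strip(),
        })
    return scenes


# Invented sample text, only to show the expected input shape.
sample = """1 INT. CARGO SHIP - NIGHT

A pilot checks the controls.

2 EXT. DESERT VILLAGE - DAY

Wind blows across the dunes.
"""
scenes = split_scenes(sample)
```

Anything before the first heading (title page, credits) is dropped by this sketch, which is usually what you want for scene-level analysis.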
Next, we want to break down the scenes into parts belonging to individual characters. At this point, we can’t distinguish whether a part is dialogue or action. The split itself is still quite simple, as most script dialogue begins with an all-caps character name. Action currently looks just like a continuation of dialogue:
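As a rough sketch of this speaker-level split, the snippet below starts a new chunk at every all-caps line and attaches everything that follows to that speaker. The `SPEAKER_LINE` pattern, the `split_by_speaker` name, and the ALICE/BOB scene are made up for illustration.

```python
import re

# An all-caps name, optionally followed by a parenthetical such as (O.S.).
SPEAKER_LINE = re.compile(r"^[A-Z][A-Z0-9 .'\-]*(?:\s*\([^)]*\))?$")


def split_by_speaker(scene_body):
    """Group the lines of a scene under the most recent all-caps speaker name."""
    chunks = []
    current = None
    for raw in scene_body.splitlines():
        line = raw.strip()
        if not line:
            continue
        if SPEAKER_LINE.match(line):
            # Drop the parenthetical so "BOB (O.S.)" is stored as "BOB".
            name = re.sub(r"\s*\(.*\)$", "", line)
            current = {"character": name, "text": []}
            chunks.append(current)
        elif current is None:
            # Text before the first speaker has no character attached.
            current = {"character": None, "text": [line]}
            chunks.append(current)
        else:
            current["text"].append(line)
    return [
        {"character": c["character"], "text": " ".join(c["text"])}
        for c in chunks
    ]


# Invented scene with hypothetical characters.
scene = """
A door slides open.

        ALICE
    Did you bring it?

        BOB (O.S.)
    I did.
Alice takes the case and walks away.
"""
chunks = split_by_speaker(scene)
```

Note how the last chunk for BOB also contains the sentence about Alice walking away: the action line is swallowed by the preceding dialogue, which is exactly the problem described here.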
Here is where the first complication arises. The dialogue flows into action with no real distinction (aside from white spaces and tabs, which are not used consistently). So we have to be smarter. This is where part-of-speech tagging comes in. Action description is bound to be dominated by 3rd person verbs, while the dialogue of a character is probably mostly 1st or 2nd person. We already know this distinction won’t be perfect: a character can speak about a third party, and some verb forms carry no 3rd person marking (especially in non-present tenses: I ran, you ran, he/she/it ran). But that is what happens with any data cleaning: we can’t get everything perfect; we need it to be good enough for the final output to make sense.
With a few simple rules based on counts of 3rd person verbs vs. 1st person verbs, we get a decent output again:
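One way to express this idea in code is to compare counts of Penn Treebank verb tags: VBZ marks 3rd person singular present verbs, and VBP marks non-3rd-person present verbs. The sketch below assumes the text has already been tagged (for example with a tagger that emits Penn Treebank tags); the `classify_chunk` name, the tie-breaking rule, and the hand-written tags are assumptions for illustration, not the project's actual rules.

```python
def classify_chunk(tagged_tokens):
    """Label a list of (token, tag) pairs as 'action' or 'dialogue'."""
    # VBZ: 3rd person singular present ("moves", "takes").
    third = sum(1 for _, tag in tagged_tokens if tag == "VBZ")
    # VBP: non-3rd-person present ("know", "are"); it also covers 3rd person
    # plural ("they run"), one reason this rule cannot be perfect.
    non_third = sum(1 for _, tag in tagged_tokens if tag == "VBP")
    # Assumed tie-break: chunks follow a character name, so default to dialogue.
    return "action" if third > non_third else "dialogue"


# Hand-tagged examples; a real pipeline would get the tags from a POS tagger.
action = [("Han", "NNP"), ("moves", "VBZ"), ("toward", "IN"),
          ("Kylo", "NNP"), ("Ren", "NNP")]
dialogue = [("I", "PRP"), ("know", "VBP"), ("you", "PRP"),
            ("are", "VBP"), ("here", "RB")]

labels = [classify_chunk(action), classify_chunk(dialogue)]
```

Past-tense verbs are tagged VBD regardless of person, so they contribute nothing to either count, which mirrors the "I ran, you ran, he/she/it ran" caveat.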
I don’t want to spoil anything from the movie, but this is a very special scene between Han and Kylo Ren. We see that all of the dialogue parts are correctly flagged as is_character_text, while ‘Han moves toward Kylo Ren’ is correctly flagged as is_action = TRUE. Each line also correctly lists the characters occurring in the text chunk. We achieved this by using entity recognition focused on movie character names.
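The simplest version of name-focused entity recognition is a dictionary lookup against the list of known characters. The sketch below uses that lookup in place of a full entity recognition model, so it is a simplification rather than the method used in the project; the `find_characters` name and the character list are assumptions.

```python
import re


def find_characters(text, character_names):
    """Return the known character names that appear as whole words in text."""
    found = []
    for name in character_names:
        # Word boundaries keep "Han" from matching inside "hand" or "than".
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


# Hypothetical character list for the movie.
names = ["Han", "Kylo Ren", "Rey"]
mentioned = find_characters("Han moves toward Kylo Ren", names)
```

The character list itself can come from the all-caps speaker names collected during chunking, so no external cast list is strictly required.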
This is all well and good, but it is just one of the many scripts we have scraped (and obviously one of the better ones when it comes to being processed; I wouldn’t be showing it here otherwise). The real question we need to ask is: how can we evaluate the chunking quality for all 1200 scripts? If we had someone read the resulting table line by line, we might as well have had them do the chunking by hand. Keep in mind that not all the scripts are in the same format. What we can do, though, is employ some smart heuristics, such as the usual number of scene chunks in a movie:
There is strong evidence that movies tend to have hundreds of scene chunks. Tens of chunks might still be found in some cases, but fewer than 10 means we failed as chunkers. We obtained another worthwhile measure by downloading movie run times from IMDb and calculating the number of final (scene, dialogue, and action) chunks per minute for all scripts:
We see a strong tendency for movies to have around 10-15 final chunks per minute of screen time. The ones with over 150 chunks per minute are definitely outliers, the result of different formatting in the respective scripts.
After applying additional changes to the chunking code, we ended up with circa 1000 scripts we considered to be correctly chunked. There is, however, a valuable lesson to be learned. We often ask ourselves whether a model is good enough, but how often do we ask how we know the data is clean enough? As with building a predictive model, cleaning data requires good use of metrics that allow us to establish a baseline and move us from doing random sanity checks to identifying real issues in the data.
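Both heuristics are easy to turn into an automated check. The sketch below uses the two thresholds mentioned in the text (fewer than 10 scene chunks, more than 150 final chunks per minute); the function names, the record layout, and the three example records are invented for illustration.

```python
def chunks_per_minute(n_chunks, runtime_minutes):
    """Number of final chunks per minute of screen time."""
    if runtime_minutes <= 0:
        raise ValueError("runtime must be positive")
    return n_chunks / runtime_minutes


def flag_suspicious(scripts, min_scenes=10, max_rate=150):
    """Return titles of scripts whose chunk counts suggest failed chunking."""
    flagged = []
    for s in scripts:
        rate = chunks_per_minute(s["final_chunks"], s["runtime_minutes"])
        if s["scene_chunks"] < min_scenes or rate > max_rate:
            flagged.append(s["title"])
    return flagged


# Invented records, not real measurements.
scripts = [
    {"title": "Script A", "scene_chunks": 180, "final_chunks": 1500, "runtime_minutes": 120},
    {"title": "Script B", "scene_chunks": 4, "final_chunks": 900, "runtime_minutes": 95},
    {"title": "Script C", "scene_chunks": 200, "final_chunks": 20000, "runtime_minutes": 100},
]
suspicious = flag_suspicious(scripts)
```

Scripts flagged this way are candidates for a closer look or a format-specific fix, which is far cheaper than reading every chunk table by hand.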
Now that we have the data, I am sure you’re curious about what came out of it. Stay tuned for Article 3 of 4: Creating Meaningful Outputs of the Data.