Building a Command Line Movie Script Searching Tool

Drew Pappas
7 min read · Dec 17, 2020

Introduction

Are you a movie buff who happens to have a bunch of text files lying around somewhere? Is the little search feature in your file explorer not finding things in those files the way you’d expect? If so, look no further than this article!

Today I’m going to give you some starting advice on how to build a small-scale search engine for text files. This should be helpful for anyone looking to add search functionality to a small collection of relatively structured files. We’ll be looking at a collection of movie scripts, because who doesn’t love having the perfect quote for the perfect situation?

My IDE to me when I’m trying to debug and keep getting it wrong

The data

The Film Corpus 2.0 is a collection of movie scripts scraped from https://www.imsdb.com/, an online database of movie scripts. The data was collected with Scrapy, with the last scrape in Nov. 2015. If you’re looking to use this tool with the latest movies, you’ll need to build your own scraper and pull the information yourself.

So now that we have the data, what does it look like?

a visual representation of me peering into the corpus for the first time

In total, there are 1068 scripts in the form of .txt files. The files are sorted by genre and are available in raw format or split into scene/dialog. Movies can appear more than once in the corpus because some belong to more than one genre. Here are a few images of the raw files to get a better feel for what they look like:

forrestgump.txt
inception.txt
insidious.txt
titanic.txt

Notice anything interesting about these texts?

It’s okay, me neither! What makes NLP fun is that the data is messy. The scripts vary in their capitalization, quotation marks, tab spacing, parenthesis usage, scene descriptions, camera effects, character dialog indicators, and more. Also, the dialog/scene split is often incorrect, so using the raw data is likely the best option here.

This raises the question: if we’re going to create a useful vertical search engine, what tasks need to be accomplished to support a user’s search? Once we answer that, we can think about how to preprocess the data for efficient search.

Methods

Necessary features

After a little brainstorming, I came up with a few use cases:

  • Searching for quotes
  • Searching for characters
  • Finding a movie title based on descriptors of the movie
  • Finding scenes
  • Finding similar lines to the query
  • Searching within genres
  • Searching for shot types
  • Searching for character actions

Preprocessing

Given these use cases, how should one preprocess the data? Should we lowercase all words? That makes finding scene markers like FADE OUT harder. Should we remove punctuation? That makes discriminating between scene and dialog difficult. What about removing stopwords? That makes it harder to find quotes like “What’s in the box?” or “I’m in?” Needless to say, this is a difficult task.

NLP is hard guys

So what did I do? I threw caution to the wind and experimented! On the first attempt, I took a bludgeon to the corpus: I removed all stop words and punctuation, lowercased everything, and threw all of the remaining text into new .txt files, one per movie script. I did this with NLTK, which has a lot of helpful functions for cleaning up data. The thought was, “people only really remember a few words of a quote correctly, and movie characters are relatively distinct, so maybe this will work.”
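If you want to try the same bludgeon yourself, here’s a minimal sketch of that pass using NLTK. The function name and file paths are placeholders of my choosing, and it assumes the NLTK ‘punkt’ and ‘stopwords’ data packages are already downloaded:

```python
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))


def clean_script(in_path, out_path):
    """Lowercase a script, strip stopwords and punctuation, and write the result."""
    with open(in_path, encoding="utf-8", errors="ignore") as f:
        text = f.read().lower()
    tokens = word_tokenize(text)
    kept = [t for t in tokens if t not in STOPWORDS and t not in string.punctuation]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(" ".join(kept))
```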

Library

In order to test this theory, I needed something to support my search engine. I sought out Python libraries, as that’s the programming language I’m most comfortable with. It was surprisingly difficult to find libraries with features to support what I was looking to do, but I stumbled upon an older library called Whoosh.

Whoosh offers a lot of features out of the box that are helpful when designing a search engine.

Whoosh defaults to BM-25F as its ranking function and includes the option to provide a custom ranking function as well. BM-25 is a pretty good all-around function and is often used as a baseline in information retrieval.
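To give a rough idea of how that looks in code, the ranking function is just a keyword argument when you open a searcher (ix here stands for the index we’ll build in the next section):

```python
from whoosh import scoring

# BM25F is the default; its free parameters can be tuned if you like
with ix.searcher(weighting=scoring.BM25F(B=0.75, K1=1.5)) as searcher:
    ...

# ...or swap in another built-in such as TF-IDF, or roll your own by
# subclassing whoosh.scoring.WeightingModel
with ix.searcher(weighting=scoring.TF_IDF()) as searcher:
    ...
```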

Implementation

Before we get started, here are all of my imports that’ll be needed:
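In sketch form, they boil down to the standard library’s os plus a handful of Whoosh modules (matching the snippets below; your exact list may differ):

```python
import os

from whoosh import index, scoring
from whoosh.fields import Schema, TEXT, ID, KEYWORD, NUMERIC
from whoosh.qparser import MultifieldParser, OrGroup, FuzzyTermPlugin
```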

First, we need an inverted index to support fast retrieval from the corpus. Luckily, Whoosh gives us the ability to create one from a schema. Here’s what I supplied to the schema to automatically create the inverted index:
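A schema along these lines does the job. The field names here are my own choices, but they cover the title, file path, genre, word count, and script text that get used below:

```python
schema = Schema(
    title=TEXT(stored=True),         # movie title, stored so results can display it
    path=ID(stored=True),            # path to the raw .txt file
    genre=KEYWORD(stored=True, lowercase=True, commas=True),  # genre(s) the script belongs to
    num_words=NUMERIC(stored=True),  # word count of the script
    content=TEXT,                    # full script text, indexed but not stored
)

# Build the index directory (and the inverted index) only if it doesn't exist yet
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
    ix = index.create_in("indexdir", schema)
else:
    ix = index.open_dir("indexdir")
```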

This sets up our index to be filled, but it isn’t complete. We’ll need a processed data file with all of the movies sorted by genre. This was all preprocessed, but it can be done using the file structure provided by the Film Corpus 2.0 raw data set! Here’s the code to fill all of the paths. Note that this all needs to be under the same if statement as above!
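Here’s a sketch of that loop, assuming the raw corpus layout of one folder per genre with one .txt file per script (the corpus_root path is a placeholder). As noted, it belongs inside the same if branch that creates the index, so it only runs when the index is first built:

```python
# This goes inside the `if not os.path.exists("indexdir"):` branch above
writer = ix.writer()
corpus_root = "film_corpus_raw"   # placeholder path to the raw scripts

for genre in os.listdir(corpus_root):
    genre_dir = os.path.join(corpus_root, genre)
    if not os.path.isdir(genre_dir):
        continue
    for filename in os.listdir(genre_dir):
        if not filename.endswith(".txt"):
            continue
        path = os.path.join(genre_dir, filename)
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        writer.add_document(
            title=filename[:-len(".txt")],
            path=path,
            genre=genre,
            num_words=len(text.split()),
            content=text,
        )
writer.commit()
```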

Now that we have the title, content, number of words in the corpus, all of the paths to the files, and the genres indexed, we can leverage Whoosh’s query parsing to actually search through the index.

This was my query parser setup. There are a lot of different components here so it’s helpful to read the documentation to get a better understanding of how this is changing the query’s interaction with the ranking function!
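A setup along these lines covers the essentials: parse the query against both the title and content fields, use an OrGroup so a query doesn’t have to match every term, and add the fuzzy-term plugin for half-remembered quotes:

```python
parser = MultifieldParser(["title", "content"], schema=ix.schema,
                          group=OrGroup.factory(0.9))
parser.add_plugin(FuzzyTermPlugin())  # lets users write fuzzy terms like chocolates~2
```

The 0.9 passed to OrGroup.factory scales scores so that documents matching more of the query terms still float to the top.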

Although this isn’t the full code, these functions should help you get started with actually handling user input:
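Here’s a minimal sketch of that input handling; the function name and prompt text are just placeholders:

```python
def search_scripts(query_string, limit=10):
    """Parse the user's query and return (title, path, score) for the top hits."""
    query = parser.parse(query_string)
    with ix.searcher(weighting=scoring.BM25F()) as searcher:
        results = searcher.search(query, limit=limit)
        return [(hit["title"], hit["path"], hit.score) for hit in results]


def main():
    while True:
        user_input = input("Search the scripts (or 'q' to quit): ").strip()
        if not user_input or user_input.lower() == "q":
            break
        hits = search_scripts(user_input)
        if not hits:
            print("No matches found.")
        for title, path, score in hits:
            print(f"{score:6.2f}  {title}  ({path})")


if __name__ == "__main__":
    main()
```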

Results & Discussion

So does our BM-25F ranking function work? What are we comparing against? BM-25F strikes a good balance between how often a term occurs in a document and how rare that term is across all documents. If you’re interested in how this works, I highly recommend reading up on the subject. The math looks scary at first, but I majored in philosophy during undergrad and hadn’t taken a calc class since this music video came out:

One could say grad school hits like this

So if I can do it, you can too!
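For the curious, here’s the standard BM25 score that BM-25F is the fielded variant of. The IDF factor rewards rare terms, while the fraction rewards repeated terms but normalizes for document length:

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t)\,
\frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(t, D) is how many times term t appears in document D, |D| is that document’s length, avgdl is the average document length in the corpus, and k1 and b are tunable constants (Whoosh defaults to k1 = 1.2 and b = 0.75).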

Let’s see how our system did compared to throwing darts to find documents. We’re looking at precision for each of these queries.

Performance with random retrieval
Performance with BM-25F ranking

We beat random retrieval! At the very least, this signals that our program is working in some regard. If you’re looking to set up a quick search over text files but don’t feel like sifting through them all, this structure may just be beneficial for you.

What’s interesting, but not reflected here, is that removing stop words ended up making the system worse at finding quotes. I tried things like “high ground Anakin” and “life’s like a box of chocolates” and it couldn’t find them! This goes to show that how you process your data matters, so take care when designing a system to consider what your data looks like first.

What’s Next

Building a search engine is hard.

Although Whoosh offers some great tools out of the box, getting them working and getting the results you’re actually hoping for is a difficult task. One of the major limitations of my approach was sticking with the BM-25F implementation. There are certainly ranking functions that would perform better, but crafting one from the data I was given was a task slightly above my pay grade.

Rome wasn’t built in a day and neither was Google, so given more time, incorporating some ML-based methods for scene detection, topic modeling with Latent Dirichlet Allocation, or even more preprocessing would have been helpful. But for a quick and dirty search engine over some text files, this is pretty good! What do you think? Where could I have improved?
