What will we see, get, understand, perceive, or feel if we look at a visualization of the frequency of the words used in a movie, or in several movies?

V for Vendetta

This project was made for one of my university courses, "Mapping the Data", taught by Leonhard Lass. We were asked to choose a subject for data visualization and work on it during the semester. This post documents that project.
I started this as a research project. I was curious whether we could find unexpected patterns or interesting facts if we counted the number of times each word is used in a movie and then visualized that data. What if we did this for multiple movies that are somehow related to each other, for instance movies that share a topic, or a movie that has been remade several times over cinema history?
It was my first time carrying out data visualization based on text, so I started watching tutorials, reading documentation, and looking at many examples on the topic. While doing this, I came across some concepts, or rather keywords, related to text analysis, such as concordance, stop-words, and TF-IDF. I will give a short definition of stop-words here because I use the term a lot in this article. Stop-words are words that are not important for the text you are analyzing, and that you eliminate from the text to get your desired result. For example, most of the time (not always; it depends on the purpose of your analysis) people prefer to eliminate prepositions such as "in", "of", and "to" from the text.
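As a minimal illustration of this idea (a Python sketch for brevity; the project itself was written in Processing, and the tiny stop-word set below is a hypothetical sample, not my actual list):

```python
# A minimal, illustrative stop-word filter. The stop-word set here is
# a small made-up sample, not a real curated list.
STOP_WORDS = {"in", "of", "to", "the", "a", "and"}

def remove_stop_words(text):
    """Return the words of `text`, lowercased, with stop-words dropped."""
    words = text.lower().split()
    return [w for w in words if w not in STOP_WORDS]

print(remove_stop_words("The power of words in a movie"))
# ['power', 'words', 'movie']
```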
The software I used for this project was Processing. The first thing I did was search for a word-cloud library for Processing, and I found this one: WordCram. Though it is an impressive library, it wasn't powerful enough for my project. One of the most important things for me was eliminating stop-words from the text, and with this library I didn't have control over them.

Text analysis of the song "Society" by Pearl Jam, using the WordCram library

So I started from scratch. These excellent tutorials by Daniel Shiffman helped me a lot. Around this time, I borrowed the book "Generative Design" from our library, and on one of its pages I came across a library by Ben Fry named Treemap. It was exactly what I needed for the first step, to test my idea. I tested it with the movie Alice in Wonderland. Without considering stop-words, it gave me this result:

Alice’s Adventures in Wonderland, 1972
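The counting behind such a treemap, where each word's rectangle is sized by how often it occurs, can be sketched like this (again in Python for brevity; my own code was a Processing sketch):

```python
from collections import Counter
import re

def word_frequencies(text):
    """Count how often each word occurs, case-insensitively."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

freqs = word_frequencies("Alice asked. Alice wondered. Alice fell down, down, down.")
# most_common(n) gives the n biggest rectangles in the treemap
print(freqs.most_common(2))
```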

I used the English subtitle file of the movie as the text source. Subtitle files usually include the timing of each sentence, so the most frequent "words" were numbers. I removed the numbers:
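Stripping the cue numbers and timestamp lines from a standard .srt subtitle file can be sketched as follows (an illustrative Python version of this cleanup step, not my original Processing code):

```python
def strip_srt_noise(subtitle_text):
    """Drop cue indices and timestamp lines from .srt text,
    keeping only the spoken lines."""
    kept = []
    for line in subtitle_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.isdigit():    # cue number, e.g. "17"
            continue
        if "-->" in line:     # timestamp line, e.g. "00:00:01,000 --> 00:00:04,000"
            continue
        kept.append(line)
    return " ".join(kept)

srt = ("1\n00:00:01,000 --> 00:00:04,000\nWho are you?\n\n"
       "2\n00:00:05,000 --> 00:00:07,000\nI'm Alice.")
print(strip_srt_noise(srt))
# Who are you? I'm Alice.
```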
At this point in the project, I had to decide which words should go into my stop-words list. This step was tough and confusing for me. The first thing to do was specify the movies I wanted to use. I decided on four movies about freedom: Room, V for Vendetta, Braveheart, and Django. Next, I searched for a stop-words list on the internet; many lists are available. But to find my own stop-words, I also had to use other movies with very different subjects alongside these four. For example, if the word "know" appeared in all four movies, how could I tell whether it belonged in the stop-words list? Maybe it appeared because, in movies about freedom, the word "know" is used especially often.
On the other hand, maybe "know" is simply one of the words that is commonly used in movies in general. I could only tell which words were stop-words by also looking at movies on other subjects, so I chose six more movies with totally different topics. I sifted the results about ten times to get my desired stop-words.
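This comparison can be sketched in code: a word that is frequent in both the freedom-themed corpus and the unrelated corpus is probably generic movie vocabulary, and thus a stop-word candidate. (A hypothetical Python sketch with toy sentences and an arbitrary threshold; my actual sifting was done by hand, in several passes.)

```python
from collections import Counter

def stop_word_candidates(theme_texts, other_texts, min_count=2):
    """Words frequent in BOTH corpora are likely generic movie
    vocabulary rather than theme-specific; flag them as candidates.
    The threshold is an arbitrary illustration value."""
    theme = Counter(w for t in theme_texts for w in t.lower().split())
    other = Counter(w for t in other_texts for w in t.lower().split())
    return sorted(w for w in theme
                  if theme[w] >= min_count and other[w] >= min_count)

freedom = ["you know we want freedom", "i know freedom"]
unrelated = ["you know the ship", "they know you"]
print(stop_word_candidates(freedom, unrelated))
# ['know'] — frequent in both corpora; "freedom" stays theme-specific
```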

Braveheart, 1995

Braveheart, 1995

Braveheart, 1995

In the end, I came up with a stop-words list of 822 words, though it could be even more precise if I had more time. I wrote two Processing sketches for this project: one for extracting the stop-words from the text, and one for visualizing that text.

Braveheart



V for Vendetta

Through this project, I realized that one of the most important things to do when analyzing a text is building an adequate stop-words list, and I think it is impossible to do this using methods like TF-IDF alone. Maybe you can build 90 percent of your list with such a method, or with the lists already on the internet, but the remaining 10 percent, which is super important, is only achieved by putting time into it; it is labor-intensive work. I also learned that you really need to know what your goal is for the text analysis. For instance, if you want to learn something about the tone of speech in a movie, you should keep words like "hello", "hey", and "hi", but in many cases it is not necessary to keep such words.
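For reference, a plain TF-IDF score (one common variant among several) can be computed as below. It also shows the limitation mentioned above: a word that appears in every document scores exactly zero, so TF-IDF only catches the most uniformly generic words.

```python
import math

def tf_idf(word, doc_words, all_docs):
    """Term frequency in one document times log inverse document
    frequency over the corpus (one common TF-IDF variant).
    Assumes `word` occurs in at least one document."""
    tf = doc_words.count(word) / len(doc_words)
    df = sum(1 for doc in all_docs if word in doc)
    idf = math.log(len(all_docs) / df)
    return tf * idf

docs = [["the", "cat"], ["the", "dog"]]
print(tf_idf("the", docs[0], docs))  # 0.0: "the" appears in every document
print(tf_idf("cat", docs[0], docs))  # > 0: "cat" is distinctive
```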
This is not a finished project for me. Through it, I became interested in text analysis, and I will continue working on it to improve it. It was interesting to take the content of one medium and use it in another, and it raised many questions in my mind, such as:
-When we watch a movie, we often also listen to its soundtrack separately, buy a poster, or look at images from the movie. Can data visualization be an experience like those?
-Can this kind of data visualization be another form of experiencing the movie and help us to understand the movie better?
-Can we discover hidden facts and components about the movie with this approach?
-Can we learn anything about the culture of a country by doing this for many movies from that country?
Currently, I am working on the appearance of the project, given its many possible uses and applications. I am also testing the code on a broad spectrum of movies and trying to answer the questions above. Going forward, I think I will publish a series of Medium posts about this open project.
Thank you for taking the time to read this article, and feel free to connect with me if you enjoyed it. I would also like to thank Leonhard Lass for teaching me the principles of data visualization.
