Internet Movie Dataviz

26-09-22

Sometimes I watch movies and sometimes I write code. Somewhere in between the two lies data visualization. Here is a loose collection of graphs from a data analysis of a movie dataset. Each figure tells a story of its own.

This was my capstone project for the Google Data Analytics course on Coursera: https://www.coursera.org/professional-certificates/google-data-analytics. The dataset originates from GroupLens, a site for comparing and getting recommendations on movies, and is available at: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset. A SQL database was created from this dataset. Queries were run in SQL, the results were processed in R, and the figures were plotted with Plotly in R.


Top-rated Movies

For fun, I plotted the best-rated movies on GroupLens. Only movies with more than 100 reviews were included, because an average based on a handful of reviews is an unreliable outlier. Is your favorite among them?
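
To make the filtering step concrete, here is a minimal Python sketch of the idea. The original analysis was done in SQL and R, so this is not the actual query; the function name `top_rated`, the `(title, rating)` input format and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def top_rated(ratings, min_reviews=100, limit=10):
    """Rank movies by mean rating, keeping only movies with more than
    `min_reviews` ratings. `ratings` is an iterable of (title, rating)."""
    totals = defaultdict(lambda: [0.0, 0])  # title -> [sum of ratings, count]
    for title, rating in ratings:
        totals[title][0] += rating
        totals[title][1] += 1

    averages = [
        (title, total / count, count)
        for title, (total, count) in totals.items()
        if count > min_reviews  # drop movies with too few reviews
    ]
    # Highest mean rating first
    averages.sort(key=lambda row: row[1], reverse=True)
    return averages[:limit]


# Made-up toy data; the threshold is lowered to 2 to keep the example small.
sample = [
    ("A", 5.0), ("A", 4.0), ("A", 4.5),
    ("B", 5.0),                          # a single review: filtered out
    ("C", 3.0), ("C", 3.5), ("C", 4.0),
]
top = top_rated(sample, min_reviews=2)
print(top)
```

Movie "B" has a perfect score but only one review, so it is dropped. That is exactly the kind of outlier the review threshold is meant to remove.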

Budget vs Rating

Next, I wanted to find out whether a movie's budget has any influence on its eventual rating. The line in the graph is the result of a regression analysis. At first glance the correlation appears negative. However, with an R-squared value of 0.01, budget explains only about 1% of the variation in rating, so we can conclude that there is no substantial correlation between the two.

Note: budget information is not available for all movies. Movies without a budget were filtered out.

## [1] R-squared:
## [1] 0.01037655
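
For readers who want to see where such an R-squared value comes from, the following Python sketch fits a least-squares line and computes R-squared by hand. It is an illustration only: the author's numbers were produced in R, and the function name, the use of `None` for a missing budget and the toy data below are assumptions.

```python
def linear_fit_r_squared(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.
    Returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up (budget in millions, average rating) pairs; None marks a missing budget.
movies = [(1, 3.4), (5, 3.9), (None, 4.1), (20, 3.1), (50, 3.6),
          (100, 3.0), (None, 2.8), (150, 3.7), (200, 3.3)]

# Drop movies without budget information before fitting.
known = [(b, r) for b, r in movies if b is not None]
budgets = [b for b, _ in known]
ratings = [r for _, r in known]

slope, intercept, r2 = linear_fit_r_squared(budgets, ratings)
print(f"slope={slope:.4f}, R-squared={r2:.4f}")
```

An R-squared close to 0 means the fitted line explains almost none of the spread in ratings, whatever the sign of the slope.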

Popularity of genres

This graph shows how often movies are tagged with each genre over time. Note that one movie can have multiple tags. You can click to show or hide genres.

Judging by the results, it is obvious that the dataset has some flaws. Many movies lack genre tags, especially older ones. One way to account for this is to look at the total number of movies released each year, which you can do by clicking Total movies released. Alternatively, we can express each genre count as a share of all genre tags, which you can do by selecting the stacked area chart from the second tab, Stacked area.
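
The two views described above, raw tag counts and shares of all tags, can be sketched in a few lines of Python. This is a rough illustration rather than the SQL and R code behind the figures; the function names, the `(year, genres)` input format and the sample movies are assumptions.

```python
from collections import Counter, defaultdict


def genre_counts_by_year(movies):
    """Count genre tags per release year. `movies` is an iterable of
    (year, list_of_genres); a movie with several tags counts once per tag."""
    counts = defaultdict(Counter)
    for year, genres in movies:
        for genre in genres:
            counts[year][genre] += 1
    return counts


def genre_shares(counts):
    """Convert per-year tag counts into each genre's share of all tags that year."""
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {genre: n / total for genre, n in counter.items()}
    return shares


# Made-up sample data
movies = [
    (1960, ["Western", "Drama"]),
    (1960, ["Drama"]),
    (2010, ["Documentary"]),
    (2010, ["Drama", "Thriller"]),
    (2010, []),  # a movie without genre tags contributes nothing
]
counts = genre_counts_by_year(movies)
shares = genre_shares(counts)
print(dict(counts[1960]))
print(shares[2010])
```

Because the shares within a year always add up to 1, the stacked view is less sensitive to years in which few movies have tags, although it cannot recover tags that are missing altogether.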

Finally, in Ratings per genre we can see the distribution of ratings for each genre. This was calculated by taking the average rating of each movie and grouping the movies by the primary genre listed for them.
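
As a small illustration of that calculation, the Python sketch below averages the ratings of each movie and groups the averages by genre. It assumes that the "primary" genre is simply the first one listed, which the text does not state explicitly; the function name and the sample data are likewise made up.

```python
from collections import defaultdict
from statistics import mean, median


def ratings_by_primary_genre(movies):
    """Group per-movie average ratings by primary genre.
    `movies` is an iterable of (list_of_genres, list_of_ratings).
    Assumption: the primary genre is the first genre in the list."""
    groups = defaultdict(list)
    for genres, ratings in movies:
        if not genres or not ratings:
            continue  # skip movies without genre tags or without ratings
        groups[genres[0]].append(mean(ratings))
    return dict(groups)


# Made-up sample data
movies = [
    (["Horror", "Thriller"], [2.0, 3.0]),
    (["Horror"], [3.0, 3.0]),
    (["Animation", "Family"], [4.0, 5.0]),
    ([], [4.0]),  # untagged movie: skipped
]
groups = ratings_by_primary_genre(movies)
for genre, averages in groups.items():
    print(genre, "median of movie averages:", median(averages))
```

Each list in `groups` is the distribution for one genre. A box or violin plot of these lists would give a per-genre comparison of this kind.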

The line chart makes it clear which genres are the most popular: Drama, Action and Thriller dominate.

Many genres show an increase in tags over time. Documentary in particular has seen a huge uptick. Western is an exception: in recent years the genre seems to have almost died out.

The stacked area chart reveals some interesting trends over time. As mentioned at Line chart, Documentaries have been booming in popularity since the 2000s. Westerns, by contrast, had a substantial share from the 50s through the 70s, but have since disappeared. As you would expect, Animation movies only start appearing around the 80s. Some genres are more stable: Drama has always been dominant.

At both ends of the time axis the graph starts to fall apart, because not enough data is available for those years.

Some genres score exceptionally high when comparing the rating distributions. Animation, Documentary and War, for instance, all seem skewed towards higher ratings.

This graph also reveals the unfortunate truth about one of my favorite genres: not many good horror movies exist.