Post

Exploratory Data Analysis of a Movie Dataset

Exploratory Analysis is a crucial step in Data Analysis process. It involves summarizing, investigating and visualising data to understand its characteristics and identify tends and patterns.

The dataset was downloaded from Kaggle.

1
2
3
4
5
#importing the required libraries
import pandas as pd
import numpy as np
import matplotlib as plt

1
2
3
4
5
6
#loading and viewing movie data
df = pd.read_csv("archive/movie.csv")
# df.head()
# df.shape[0]
df[df.columns[0]].count()
df
movieIdtitlegenres
01Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy
12Jumanji (1995)Adventure|Children|Fantasy
23Grumpier Old Men (1995)Comedy|Romance
34Waiting to Exhale (1995)Comedy|Drama|Romance
45Father of the Bride Part II (1995)Comedy
............
27273131254Kein Bund für's Leben (2007)Comedy
27274131256Feuer, Eis & Dosenbier (2002)Comedy
27275131258The Pirates (2014)Adventure
27276131260Rentun Ruusu (2001)(no genres listed)
27277131262Innocence (2014)Adventure|Fantasy|Horror

27278 rows × 3 columns

1
2
3
4
5
6
7
#splitting the title column into title and year column
split_title = df['title'].str.rsplit(' ', n=1, expand=True).fillna("")

df['title'] = split_title[0]
df['year'] = split_title[1]

df
movieIdtitlegenresyear
01Toy StoryAdventure|Animation|Children|Comedy|Fantasy(1995)
12JumanjiAdventure|Children|Fantasy(1995)
23Grumpier Old MenComedy|Romance(1995)
34Waiting to ExhaleComedy|Drama|Romance(1995)
45Father of the Bride Part IIComedy(1995)
...............
27273131254Kein Bund für's LebenComedy(2007)
27274131256Feuer, Eis & DosenbierComedy(2002)
27275131258The PiratesAdventure(2014)
27276131260Rentun Ruusu(no genres listed)(2001)
27277131262InnocenceAdventure|Fantasy|Horror(2014)

27278 rows × 4 columns

1
2
3
4
5
#Removing the parenthesis from the year column
import re
pattern = r"[()]"
df['year'] = df['year'].str.replace(pattern,"",regex=True)
df
movieIdtitlegenresyear
01Toy StoryAdventure|Animation|Children|Comedy|Fantasy1995
12JumanjiAdventure|Children|Fantasy1995
23Grumpier Old MenComedy|Romance1995
34Waiting to ExhaleComedy|Drama|Romance1995
45Father of the Bride Part IIComedy1995
...............
27273131254Kein Bund für's LebenComedy2007
27274131256Feuer, Eis & DosenbierComedy2002
27275131258The PiratesAdventure2014
27276131260Rentun Ruusu(no genres listed)2001
27277131262InnocenceAdventure|Fantasy|Horror2014

27278 rows × 4 columns

1
2
3
4
5
6
7
#extracting all the unique genres from the genres col.
#since some of them have multiple genres in a single cell

genres_list = df["genres"].str.split("|").sum()
unique_genres = list(set(genres_list))

print(unique_genres)
1
['Sci-Fi', '(no genres listed)', 'Romance', 'Mystery', 'Musical', 'Horror', 'Children', 'IMAX', 'Documentary', 'Comedy', 'Animation', 'Fantasy', 'Film-Noir', 'Adventure', 'Western', 'Thriller', 'Crime', 'War', 'Action', 'Drama']
1
2
3
4
5
6
#using the unique genres extracted earlier as the definition
#to count the number of occurance of each genres

df_genres = df["genres"].str.split("|").explode() #splits each genres into separate items
genre_counts = df_genres[df_genres.isin(unique_genres)].value_counts()
print(genre_counts)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
genres
Drama                 13344
Comedy                 8374
Thriller               4178
Romance                4127
Action                 3520
Crime                  2939
Horror                 2611
Documentary            2471
Adventure              2329
Sci-Fi                 1743
Mystery                1514
Fantasy                1412
War                    1194
Children               1139
Musical                1036
Animation              1027
Western                 676
Film-Noir               330
(no genres listed)      246
IMAX                    196
Name: count, dtype: int64
1
2
#Plotting the count of each genres
genre_counts.plot(kind="bar", color="skyblue")
1
<Axes: xlabel='genres'>

Bar Graph Count of Each Genres

This post is licensed under CC BY 4.0 by the author.