IMDB - Dataset Analysis - Basic

Project 1: Explanatory Data Analysis & Data Presentation (Movies Dataset)

Project Brief for Self-Coders

Here you´ll have the opportunity to code major parts of Project 1 on your own. If you need any help or inspiration, have a look at the Videos or the Jupyter Notebook with the full code.

Keep in mind that it´s all about getting the right results/conclusions. It´s not about finding the identical code. Things can be coded in many different ways. Even if you come to the same conclusions, it´s very unlikely that we have the very same code.

Data Import and first Inspection

Import the movies dataset from the CSV file "movies_complete.csv". Inspect the data.

Some additional information on Features/Columns:

id: The ID of the movie (clear/unique identifier).
title: The Official Title of the movie.
tagline: The tagline of the movie.
release_date: Theatrical Release Date of the movie.
genres: Genres associated with the movie.
belongs_to_collection: Gives information on the movie series/franchise the particular film belongs to.
original_language: The language in which the movie was originally shot in.
budget_musd: The budget of the movie in million dollars.
revenue_musd: The total revenue of the movie in million dollars.
production_companies: Production companies involved with the making of the movie.
production_countries: Countries where the movie was shot/produced in.
vote_count: The number of votes by users, as counted by TMDB.
vote_average: The average rating of the movie.
popularity: The Popularity Score assigned by TMDB.
runtime: The runtime of the movie in minutes.
overview: A brief blurb of the movie.
spoken_languages: Spoken languages in the film.
poster_path: The URL of the poster image.
cast: (Main) Actors appearing in the movie.
cast_size: number of Actors appearing in the movie.
director: Director of the movie.
crew_size: Size of the film crew (incl. director, excl. actors).

Import Necessary Libraries for this Task:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_columns=None   # To show all columns
pd.options.display.float_format = '{:.2f}'.format

Read the Movie Data

df = pd.read_csv('movies_complete.csv',parse_dates=['release_date'])  #parse date will convert release_date column into Datetime.
df

44691 rows × 22 columns

Getting Info About Data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44691 entries, 0 to 44690
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     44691 non-null  int64         
 1   title                  44691 non-null  object        
 2   tagline                20284 non-null  object        
 3   release_date           44657 non-null  datetime64[ns]
 4   genres                 42586 non-null  object        
 5   belongs_to_collection  4463 non-null   object        
 6   original_language      44681 non-null  object        
 7   budget_musd            8854 non-null   float64       
 8   revenue_musd           7385 non-null   float64       
 9   production_companies   33356 non-null  object        
 10  production_countries   38835 non-null  object        
 11  vote_count             44691 non-null  float64       
 12  vote_average           42077 non-null  float64       
 13  popularity             44691 non-null  float64       
 14  runtime                43179 non-null  float64       
 15  overview               43740 non-null  object        
 16  spoken_languages       41094 non-null  object        
 17  poster_path            44467 non-null  object        
 18  cast                   42502 non-null  object        
 19  cast_size              44691 non-null  int64         
 20  crew_size              44691 non-null  int64         
 21  director               43960 non-null  object        
dtypes: datetime64[ns](1), float64(6), int64(3), object(12)
memory usage: 7.5+ MB

Statistical Summary

df.describe()

budget_musd

revenue_musd

vote_count

vote_average

popularity

runtime

cast_size

crew_size

count

44691.00

8854.00

7385.00

44691.00

42077.00

44691.00

43179.00

44691.00

mean

107186.24

21.67

68.97

111.65

6.00

2.96

97.57

12.48

10.31

std

111806.36

34.36

146.61

495.32

1.28

6.04

34.65

12.12

15.89

min

2.00

0.00

1.00

0.00

25%

26033.50

2.00

2.41

3.00

5.30

0.40

86.00

6.00

2.00

50%

59110.00

8.20

16.87

10.00

6.10

1.15

95.00

10.00

6.00

75%

154251.00

25.00

67.64

35.00

6.80

3.77

107.00

15.00

12.00

max

469172.00

380.00

2787.97

14075.00

10.00

547.49

1256.00

313.00

435.00

df.hist(figsize=(20,12),bins=100)

The best and the worst movies...

Filter the Dataset and find the best/worst n Movies with the

Highest Revenue
Highest Budget
Highest Profit (=Revenue - Budget)
Lowest Profit (=Revenue - Budget)
Highest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)
Lowest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)
Highest number of Votes
Highest Rating (only movies with 10 or more Ratings)
Lowest Rating (only movies with 10 or more Ratings)
Highest Popularity

The Best and Worst Movies ever

We will try to filter our data based on criteria , that is responsible to determine the best and worst movies ever. We are also going to import HTML , as we will convert our analysis to a beautiful web page. To do this, all you need to do is to import HTML.

from IPython.display import HTML   # we are using this to try to present our data in good looking website format

Filtering Columns responsible to determine best and worst movies

edf = df[
    ['poster_path','title','budget_musd','revenue_musd','vote_count','vote_average','popularity']
].copy()

Create a column 'profit_musd' (revenue - budget)

edf['profit_musd'] = edf['revenue_musd'] - edf['budget_musd']
edf.head()

Create a column 'return_musd' (revenue/budget)

edf['return_musd'] = edf['revenue_musd']/edf['budget_musd']
edf.head()

Rename Columns in Something Meaningful to present it later in Graphs

edf.columns = ['','Title','Budget','Revenue','Votes','Average Rating','Popularity','Profit','Return']
edf.head()

Set Title as Index

edf.set_index('Title',inplace=True)

Convert Our DataFrame into HTML (Poster , Title , Popularity')

subset = df[['poster_path','title','popularity']].head()
HTML(subset.to_html(escape=False))   # convert dataframe into HTML

Highest Rated Movies

HTML(edf.sort_values(by='Average Rating',ascending=False).head().to_html(escape=False))

Now this approach does not make sense as you can see there is only one vote and it is not sufficient enough to judge on rating. So let us find median of Votes and consider it to be the minimum number of votes to be given to any movie.

minimum_votes = edf['Votes'].median()
minimum_votes

edf.loc[edf['Votes']>=minimum_votes].sort_values(by='Average Rating',ascending=False).head()

Movies With Highest ROI

Here also we will keep above approach , as there are few movies with close to zero budget , we must exclude them and so let us find the median of budget.

minimum_budget = edf['Budget'].median()
minimum_budget

edf.loc[edf['Budget']>=minimum_budget].sort_values(by='Return',ascending=False).head()

Before moving ahead , let us fill all na values of Budget and Votes with 0.

edf['Budget'].fillna(0,inplace=True)
edf['Votes'].fillna(0,inplace=True)

Create a Function to find Best and Worst Movies

def find_movies(n,by,ascending=False,min_bud=minimum_budget,min_votes=minimum_votes):
   return edf.loc[(edf['Budget']>=min_bud) & (edf['Votes']>=min_votes)].sort_values(by=by,ascending=ascending).head(n)

Top 5 - Highest Revenue

HTML(find_movies(5,by='Revenue').to_html(escape=False))

Top 5 - Highest Budget

HTML(find_movies(n=5,by='Budget').to_html(escape=False))

Top 5 - Highest Profit

HTML(find_movies(5,'Profit').to_html(escape=False))

Top 5 - Highest ROI

HTML(find_movies(5,by='Return').to_html(escape=False))

Top 5 - Lowest Profit

HTML(find_movies(5,by='Profit',ascending=True).to_html(escape=False))

Top 5 - Most Popular

HTML(find_movies(5,'Popularity').to_html(escape=False))

Find Your Next Movie

Science Fiction Action Movie With Bruce Willis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_columns=None   # To show all columns
pd.options.display.float_format = '{:.2f}'.format

df = pd.read_csv('movies_complete.csv',parse_dates=['release_date'])  #parse date will convert release_date column into Datetime.
df.head()

Filtering Genres (Science Fiction and Action)


filter_genres = df['genres'].str.contains('Science Fiction') & df['genres'].str.contains('Action')

Filtering Bruce Willis Movies

# Filtering Bruce Willis Movies
filter_cast = df['cast'].str.contains('Bruce Willis')

Filtering

edf = df.loc[filter_genres & filter_cast].sort_values(by='vote_average',ascending=False).head(5)
edf = edf[['title','poster_path','vote_average']]

HTML(edf.to_html(escape=False))

Movies With Uma Thurman and Quentin Tarantino

filter_cast = df['cast'].str.contains('Uma Thurman')  # Actor/ Cast
filter_director = df['director'] == 'Quentin Tarantino'  # Director

fdf = df.loc[(filter_cast) & (filter_director)].sort_values(by='runtime')
HTML(fdf[['title','poster_path','vote_average','runtime']].to_html(escape=False))

Most Successful Pixar Movies from 2010 to 2015 (Highest Revenue)

Filtering Pixar Movies

filter_production = df.loc[df['production_companies'].str.contains('Pixar').fillna(False)]
filter_production.head()

Filtering Release Date

successful = filter_production.loc[filter_production['release_date'].between('2010-01-01','2015-12-31')]
successful.head()

Result

successful = successful.sort_values(by='revenue_musd')
HTML(successful[['title','poster_path','revenue_musd','release_date']].head().to_html(escape=False))

Action Or Thriller Movie with Original Language English with minimum rating of 7.5(Most Recent)

Filtering Genre (Action Or Thriller)

mask_genre = df['genres'].str.contains('Action') | df['genres'].str.contains('Thriller')

Filtering Language

mask_language = df['original_language']=='en'

Filtering Vote (greater than 10)

mask_vote_count = df['vote_count']>=10

Filter Average Rating

mask_rating = df['vote_average']>=7.5

Filter:

next_mov = df.loc[mask_genre & mask_language & mask_rating & mask_vote_count ].sort_values(by='release_date',ascending=False).head()
HTML(next_mov[['title','poster_path','release_date']].to_html(escape=False))

Most Common Words in Titles and Taglines

from wordcloud import WordCloud

titles =df['title'].dropna()
taglines = df['tagline'].dropna()

titles = '.'.join(titles)
taglines = '.'.join(taglines)

title_wordcloud = WordCloud(background_color='white', height=2000, width=4000,max_words=200).generate(titles)
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud, interpolation= "bilinear")
plt.axis('off')y
plt.show()

tagline_wordcloud = WordCloud(background_color='white', height=2000, width=4000,max_words=200).generate(taglines)
plt.figure(figsize=(16,8))
plt.imshow(tagline_wordcloud, interpolation= "bilinear")
plt.axis('off')
plt.show()

Are Franchises More Successful ?

All Franchises

df['franchise'] = df['belongs_to_collection'].notna()

Count Franchise/Standalone Movies

df['franchise'].value_counts()

False    40228
True      4463
Name: franchise, dtype: int64

Revenue (Franchise Vs Standalone Movies)

df.groupby('franchise')['revenue_musd'].mean()

franchise
False    44.74
True    165.71
Name: revenue_musd, dtype: float64

Budget (Franchise Vs Standalone Movies)

df.groupby('franchise')['budget_musd'].mean()

franchise
False   18.05
True    38.32
Name: budget_musd, dtype: float64

Average Rating (Franchise Vs Standalone Movies)

df.groupby('franchise')['vote_average'].mean()

franchise
False   6.01
True    5.96
Name: vote_average, dtype: float64

Popularity (Franchise Vs Standalone Movies)

df.groupby('franchise')['popularity'].mean()

franchise
False   2.59
True    6.25
Name: popularity, dtype: float64

Return Of Investments (Franchise Vs Standalone Movies)

df['roi'] = df['revenue_musd']/df['budget_musd']
df.groupby('franchise')['roi'].median()

franchise
False   1.62
True    3.71
Name: roi, dtype: float64

Aggregate Functions

We will use aggregate functions to calculate all necessary info about Franchise.

franchises = df.groupby('franchise').agg({
    'title':'count',
    'revenue_musd':['sum','mean'],
    'budget_musd':['sum','mean'],
    'roi':'median',
    'vote_average':'mean',
    'popularity':'mean',
    'vote_count':['sum','mean']
})
franchises

Most Successful Franchise ?

df['belongs_to_collection'].dropna()

0                  Toy Story Collection
2             Grumpy Old Men Collection
4        Father of the Bride Collection
9                 James Bond Collection
12                     Balto Collection
                      ...              
44582           The Carry On Collection
44585           The Carry On Collection
44596           The Carry On Collection
44598    DC Super Hero Girls Collection
44609              Red Lotus Collection
Name: belongs_to_collection, Length: 4463, dtype: object

Largest Franchise

So we can use sort_values to get the maximum number of count of a movie.

franchises.sort_values(by=('title','count'),ascending=False)

We can also use nlargest to get the n numbers of big franchises.

franchises.nlargest(5,('title','count'))

Highest Revenue

franchises.sort_values(by=('revenue_musd','sum'),ascending=False).head(20)

title

revenue_musd

budget_musd

roi

vote_average

popularity

vote_count

count

sum

mean

sum

mean

median

mean

sum

mean

belongs_to_collection

Harry Potter Collection

7707.37

963.42

1280.00

160.00

6.17

7.54

26.25

47866.00

5983.25

Star Wars Collection

7434.49

929.31

854.35

106.79

8.24

7.38

23.41

43443.00

5430.38

James Bond Collection

7106.97

273.35

1539.65

59.22

6.13

6.34

13.45

33392.00

1284.31

The Fast and the Furious Collection

5125.10

640.64

1009.00

126.12

4.94

6.66

10.80

25576.00

3197.00

Pirates of the Caribbean Collection

4521.58

904.32

1250.00

250.00

3.45

6.88

53.97

25080.00

5016.00

Transformers Collection

4366.10

873.22

965.00

193.00

5.20

6.14

14.43

15232.00

3046.40

Despicable Me Collection

3691.07

922.77

299.00

74.75

12.76

6.78

106.72

18248.00

3041.33

The Twilight Collection

3342.11

668.42

385.00

77.00

10.27

5.84

29.50

13851.00

2770.20

Ice Age Collection

3216.71

643.34

429.00

85.80

8.26

6.38

16.08

13219.00

2643.80

Jurassic Park Collection

3031.48

757.87

379.00

94.75

7.03

6.50

10.77

18435.00

4608.75

Shrek Collection

2955.81

738.95

535.00

133.75

5.56

6.46

12.97

11721.00

2344.20

The Hunger Games Collection

2944.16

736.04

490.00

122.50

6.27

6.88

54.77

26174.00

6543.50

The Hobbit Collection

2935.52

978.51

750.00

250.00

3.83

7.23

25.21

17944.00

5981.33

The Avengers Collection

2924.96

1462.48

500.00

250.00

5.96

7.35

63.63

18908.00

9454.00

The Lord of the Rings Collection

2916.54

972.18

266.00

88.67

11.73

8.03

30.27

24759.00

8253.00

X-Men Collection

2808.83

468.14

983.00

163.83

3.02

6.82

9.71

27563.00

4593.83

Avatar Collection

2787.97

237.00

11.76

7.20

185.07

12114.00

Mission: Impossible Collection

2778.98

555.80

650.00

130.00

4.55

6.60

16.51

14005.00

2801.00

Spider-Man Collection

2496.35

832.12

597.00

199.00

3.92

6.47

22.62

13517.00

4505.67

The Dark Knight Collection

2463.72

821.24

585.00

195.00

4.34

7.80

57.42

29043.00

9681.00

Can you do it with nlargest ???

Highest Average Revenue

franchises.sort_values(by=('revenue_musd','mean'),ascending=False).head(20)

Most Expensive Franchises (Budget)

franchises.nlargest(20,('budget_musd','sum')).head(20)

Highest Rated Franchises

franchises.loc[franchises[('vote_count','mean')]>=1000].nlargest(10,('vote_average','me

Most Successful Directors

Most Number Of Movies (top 5)

df['director'].value_counts().head()

John Ford           66
Michael Curtiz      65
Werner Herzog       54
Alfred Hitchcock    53
Georges Méliès      49
Name: director, dtype: int64

Highest Revenues By Directors

df.groupby('director')['revenue_musd'].sum().sort_values(ascending=False).head()

director
Steven Spielberg   9256.62
Peter Jackson      6528.24
Michael Bay        6437.47
James Cameron      5900.61
David Yates        5334.56
Name: revenue_musd, dtype: float64

Highest Number of Franchises directed by Directors

df.loc[df['belongs_to_collection'].notna()]['director'].value_counts().head()

Gerald Thomas       25
William Beaudine    19
Ere Kokkonen        17
Kunihiko Yuyama     15
Robert Rodriguez    13
Name: director, dtype: int64

Aggregate Functions

directors = df.groupby('director').agg({
    'title' : 'count',
    'vote_count':'sum',
    'vote_average' : 'mean'
})
directors.head()

Highest Rated Movies

directors.loc[(directors['vote_count']>10000) & (directors['title']>10)].nlargest(20,'vote_ave

title

vote_count

vote_average

director

Hayao Miyazaki

14700.00

7.70

Christopher Nolan

67344.00

7.62

Martin Scorsese

35541.00

7.22

Peter Jackson

47571.00

7.14

Joel Coen

18139.00

7.02

James Cameron

33736.00

6.93

Stanley Kubrick

18214.00

6.91

Steven Spielberg

62266.00

6.89

Danny Boyle

16504.00

6.87

Robert Zemeckis

37666.00

6.79

Terry Gilliam

10049.00

6.76

Tim Burton

36922.00

6.73

Ang Lee

11164.00

6.71

Antoine Fuqua

15519.00

6.71

Woody Allen

15512.00

6.69

Clint Eastwood

24001.00

6.69

Alfred Hitchcock

12772.00

6.64

Ridley Scott

43083.00

6.60

Kenneth Branagh

11275.00

6.59

Luc Besson

19627.00

6.53

To find succesful director in any specific genre i.e. Action

edf = df[df['genres'].notna()]
edf[edf['genres'].str.contains('Horror')].groupby('director')['revenue_musd'].sum().sort_values(ascending=False).head(10)

director
Paul W.S. Anderson    982.29
James Wan             861.31
Wes Craven            834.93
Francis Lawrence      816.23
Ridley Scott          689.00
Marc Forster          531.87
Steven Spielberg      500.10
William Friedkin      466.40
Darren Lynn Bousman   456.34
M. Night Shyamalan    375.37
Name: revenue_musd, dtype: float64

To Find Successful Actors

df['cast']

0        Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...
1        Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...
2        Walter Matthau|Jack Lemmon|Ann-Margret|Sophia ...
3        Whitney Houston|Angela Bassett|Loretta Devine|...
4        Steve Martin|Diane Keaton|Martin Short|Kimberl...
                               ...                        
44686              Leila Hatami|Kourosh Tahami|Elham Korda
44687    Angel Aquino|Perry Dizon|Hazel Orencio|Joel To...
44688    Erika Eleniak|Adam Baldwin|Julie du Page|James...
44689    Iwan Mosschuchin|Nathalie Lissenko|Pavel Pavlo...
44690                                                  NaN
Name: cast, Length: 44691, dtype: object

Set id as index

df.set_index('id',inplace=True)

Split Actor Names to a DataFrame

actors = df['cast'].str.split('|',expand=True)
actors.head()

actors.stack().reset_index()

actors = actors.stack().reset_index(drop=True,level=1) #only remove the given level by default remove all levels
actors

Convert Series to DataFrame

actors = actors.to_frame()
actors

Rename column label from 0 to 'Actor'

actors.columns=['Actor']
actors.head()

Merge Dataframe with Actors DataFrame

actors = actors.merge(df[['title','revenue_musd','vote_average','popularity']],on='id')
actors

Number of Unique Actors

actors['Actor'].nunique()

Actors with highest number of movies

actors['Actor'].value_counts().head(5)

Bess Flowers         240
Christopher Lee      148
John Wayne           125
Samuel L. Jackson    122
Michael Caine        110
Name: Actor, dtype: int64

# This is known as label aggregation
data = actors.groupby('Actor').agg(
        total_revenue = ('revenue_musd','sum'),
        average_rating = ('vote_average','mean'),
        average_popularity = ('popularity','mean'),
        movies = ('title','count'),
        average_revenue=('revenue_musd','mean')
)
data

Actors who have acted in more than 10 films

data = data.loc[data['movies']>=10]
data

Highest Revenue

actors_with_largest_revenue = data.nlargest(20,'total_revenue')

Highest Number of Films

actors_with_highest_movies = data.nlargest(20,'movies')

Highest Rating

actors_highest_rating = data.nlargest(20,'average_rating')

Popularity

actor_with_popularity = data.nlargest(20,'average_popularity')

Find Common Actors in the top lists

Concat all the dataframes

top_list = pd.concat([actors_with_largest_revenue,actors_with_highest_movies,actors_highest_rating,actor_with_popularity])
top_list

Find Duplicate Records of Actors

top_list = top_list.reset_index()   # Resets Index
top_list.loc[top_list.duplicated(subset='Actor')]

What are the most successful/popular genres? Has this changed over time (e.g. 80ths vs. 90ths)?

gen = df['genres'].str.split('|',expand=True)
gen = gen.stack().reset_index(drop=True,level=1).to_frame()
gen.columns = ['gen']
gen

Merge gen and original dataframe

genre = gen.merge(df[['title','revenue_musd','vote_average','popularity','release_date']],on='id')
genre

Aggregate Functions

data = genre.groupby('gen').agg(
    {
        'revenue_musd' : ['sum','mean'],
        'vote_average' : 'mean',
        'popularity' : 'mean'

    }
)

Genre With Highest Revenue

data.nlargest(5,('revenue_musd','sum'))

revenue_musd

vote_average

popularity

sum

mean

gen

Action

201388.05

116.07

5.75

4.78

Adventure

199978.67

179.19

5.88

6.00

Comedy

166845.05

64.10

5.97

3.25

Drama

160754.36

43.83

6.17

3.03

Thriller

129724.55

69.52

5.74

4.51

Genre With Highest Rating

data.nlargest(5,('vote_average','mean'))

revenue_musd

vote_average

popularity

sum

mean

gen

Documentary

1449.11

6.65

6.66

0.96

Animation

67432.97

176.99

6.45

4.75

History

14902.20

50.52

6.41

3.48

Music

13370.29

50.08

6.33

2.56

War

15910.46

65.48

6.29

3.35

Genre With Highest Popularity

data.nlargest(5,('popularity','mean'))

revenue_musd

vote_average

popularity

sum

mean

gen

Adventure

199978.67

179.19

5.88

6.00

Fantasy

103920.15

166.01

5.93

5.36

Science Fiction

97847.96

131.52

5.48

5.00

Action

201388.05

116.07

5.75

4.78

Family

107076.78

159.10

5.93

4.77

Highest revenue generated by Genre in 90's

nineties = genre.loc[genre['release_date'].between('1900-01-01','1999-12-31')]
nineties

Popularity of Genres in Nineties

nineties.groupby('gen')['popularity'].mean().head(5)

gen
Action      3.32
Adventure   3.80
Animation   2.91
Comedy      2.63
Crime       3.22
Name: popularity, dtype: float64

Try to find Total revenue and average rating too.

Highest revenue generated by Genre in 20's

twenties = genre.loc[genre['release_date'].between('2000-01-01','2022-12-31')]
twenties

Popularity of Genres in Twenties

twenties.groupby('gen')['popularity'].mean().head(5)

gen
Action      6.23
Adventure   8.60
Animation   5.92
Comedy      3.81
Crime       5.30
Name: popularity, dtype: float64

Find Most Successful Production Companies On your own ?

PreviousRetail Sales Step By Step Next2. Pandas Assignments

Last updated 2 years ago