IMDB - Dataset Analysis - Basic
Project 1: Explanatory Data Analysis & Data Presentation (Movies Dataset)
Project Brief for Self-Coders
Here you'll have the opportunity to code major parts of Project 1 on your own. If you need any help or inspiration, have a look at the videos or the Jupyter Notebook with the full code.
Keep in mind that it's all about getting the right results and conclusions, not about reproducing identical code. Things can be coded in many different ways, so even if you reach the same conclusions, it's very unlikely that your code will match ours exactly.
Data Import and first Inspection
Import the movies dataset from the CSV file "movies_complete.csv". Inspect the data.
Some additional information on Features/Columns:
id: The ID of the movie (clear/unique identifier).
title: The Official Title of the movie.
tagline: The tagline of the movie.
release_date: Theatrical Release Date of the movie.
genres: Genres associated with the movie.
belongs_to_collection: Gives information on the movie series/franchise the particular film belongs to.
original_language: The language in which the movie was originally shot.
budget_musd: The budget of the movie in million dollars.
revenue_musd: The total revenue of the movie in million dollars.
production_companies: Production companies involved with the making of the movie.
production_countries: Countries where the movie was shot/produced.
vote_count: The number of votes by users, as counted by TMDB.
vote_average: The average rating of the movie.
popularity: The Popularity Score assigned by TMDB.
runtime: The runtime of the movie in minutes.
overview: A brief blurb of the movie.
spoken_languages: Spoken languages in the film.
poster_path: The URL of the poster image.
cast: (Main) Actors appearing in the movie.
cast_size: Number of actors appearing in the movie.
director: Director of the movie.
crew_size: Size of the film crew (incl. director, excl. actors).
Import Necessary Libraries for this Task:
Read the Movie Data
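A minimal sketch of the import and first inspection, assuming pandas is installed. The helper name `load_movies` is made up for illustration, and the three-row sample is invented data that stands in for "movies_complete.csv" so the snippet runs on its own.

```python
import io
import pandas as pd

def load_movies(source):
    """Read the movies CSV and parse release_date as a datetime column."""
    return pd.read_csv(source, parse_dates=["release_date"])

# Tiny made-up stand-in for movies_complete.csv (illustrative values only).
sample_csv = io.StringIO(
    "id,title,release_date,budget_musd,revenue_musd\n"
    "1,Movie A,1995-10-30,30.0,90.0\n"
    "2,Movie B,2001-05-18,,\n"
    "3,Movie C,2010-07-16,160.0,800.0\n"
)
df = load_movies(sample_csv)

df.info()             # column dtypes and non-null counts
print(df.describe())  # statistical summary of the numeric columns
```

With the real file, the call would be `df = load_movies("movies_complete.csv")`.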

44691 rows × 22 columns
Getting Info About Data
Statistical Summary
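| | id | budget_musd | revenue_musd | vote_count | vote_average | popularity | runtime | cast_size | crew_size |
|---|---|---|---|---|---|---|---|---|---|
| count | 44691.00 | 8854.00 | 7385.00 | 44691.00 | 42077.00 | 44691.00 | 43179.00 | 44691.00 | 44691.00 |
| mean | 107186.24 | 21.67 | 68.97 | 111.65 | 6.00 | 2.96 | 97.57 | 12.48 | 10.31 |
| std | 111806.36 | 34.36 | 146.61 | 495.32 | 1.28 | 6.04 | 34.65 | 12.12 | 15.89 |
| min | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 |
| 25% | 26033.50 | 2.00 | 2.41 | 3.00 | 5.30 | 0.40 | 86.00 | 6.00 | 2.00 |
| 50% | 59110.00 | 8.20 | 16.87 | 10.00 | 6.10 | 1.15 | 95.00 | 10.00 | 6.00 |
| 75% | 154251.00 | 25.00 | 67.64 | 35.00 | 6.80 | 3.77 | 107.00 | 15.00 | 12.00 |
| max | 469172.00 | 380.00 | 2787.97 | 14075.00 | 10.00 | 547.49 | 1256.00 | 313.00 | 435.00 |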

The best and the worst movies...
Filter the Dataset and find the best/worst n Movies with the
Highest Revenue
Highest Budget
Highest Profit (=Revenue - Budget)
Lowest Profit (=Revenue - Budget)
Highest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)
Lowest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)
Highest number of Votes
Highest Rating (only movies with 10 or more Ratings)
Lowest Rating (only movies with 10 or more Ratings)
Highest Popularity
The Best and Worst Movies ever
We will filter the data on the criteria that determine the best and worst movies ever. We will also import HTML (from IPython.display) so that the results can be presented as a nicely formatted HTML table; the import is all that is needed for this.
Filtering Columns responsible to determine best and worst movies
Create a column 'profit_musd' (revenue - budget)

Create a column 'return_musd' (revenue/budget)
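A short sketch of the two derived columns, using a small invented sample in place of the real dataset. Only the column names come from the brief; the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the two money columns matter here.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "budget_musd": [30.0, np.nan, 160.0],
    "revenue_musd": [90.0, 12.0, 800.0],
})

# Profit in million USD; NaN wherever budget or revenue is missing.
df["profit_musd"] = df["revenue_musd"].sub(df["budget_musd"])
# Return on investment as a plain ratio (revenue divided by budget).
df["return_musd"] = df["revenue_musd"].div(df["budget_musd"])
```

Rows with a missing budget or revenue keep NaN in both new columns, so they drop out of rankings naturally when sorted.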

Rename Columns to Something Meaningful for Later Presentation in Graphs

Set Title as Index
Convert Our DataFrame into HTML (Poster, Title, Popularity)
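One possible way to build the HTML table is sketched below. It assumes that poster_path holds a plain image URL; the helper `to_img_tag` and the example.com URLs are placeholders invented for this sketch.

```python
import pandas as pd

# Made-up rows; the poster URLs are placeholders, not real TMDB paths.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "poster_path": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    "popularity": [21.9, 8.4],
}).set_index("title")

def to_img_tag(url):
    """Wrap a poster URL in an <img> tag so it renders as a picture."""
    return f'<img src="{url}" style="height:50px;">'

subset = df[["poster_path", "popularity"]]

# Lift the column-width limit so long <img> tags are not cut off with "...".
with pd.option_context("display.max_colwidth", None):
    html_table = subset.to_html(
        formatters={"poster_path": to_img_tag},
        escape=False,  # keep the <img> markup instead of escaping it as text
    )
```

Inside a notebook, `HTML(html_table)` (with `HTML` imported from `IPython.display`) displays the table with the images rendered.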

Highest Rated Movies


Movies With Highest ROI
We keep the same approach here: a few movies have budgets close to zero and must be excluded, so let us first find the median budget.

Before moving ahead, let us fill all missing (NaN) values of budget and votes with 0.
Create a Function to find Best and Worst Movies
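A sketch of what such a function could look like. The name `best_worst`, its parameters, and the sample values are assumptions made for illustration, not the notebook's actual code. Missing budgets and vote counts are treated as 0 for the filters only.

```python
import numpy as np
import pandas as pd

# Made-up sample data (all values illustrative).
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget_musd": [200.0, 0.5, 50.0, np.nan],
    "revenue_musd": [900.0, 40.0, 20.0, 10.0],
    "vote_count": [5000, 8, 300, 40],
    "vote_average": [7.5, 9.8, 4.2, 6.0],
}).set_index("title")
df["profit_musd"] = df["revenue_musd"] - df["budget_musd"]
df["return_musd"] = df["revenue_musd"] / df["budget_musd"]

def best_worst(data, n, by, ascending=False, min_bud=0, min_votes=0):
    """Return the top n rows ranked by `by`, after budget and vote filters."""
    budget = data["budget_musd"].fillna(0)
    votes = data["vote_count"].fillna(0)
    eligible = data[(budget >= min_bud) & (votes >= min_votes)]
    return eligible.sort_values(by=by, ascending=ascending).head(n)

top_roi = best_worst(df, n=2, by="return_musd", min_bud=10)
top_rated = best_worst(df, n=2, by="vote_average", min_votes=10)
worst_profit = best_worst(df, n=1, by="profit_musd", ascending=True)
```

The same call pattern covers each of the rankings listed in the brief by changing `by`, `ascending`, and the two thresholds.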
Top 5 - Highest Revenue

Top 5 - Highest Budget

Top 5 - Highest Profit

Top 5 - Highest ROI

Top 5 - Lowest Profit

Top 5 - Most Popular

Find Your Next Movie
Science Fiction Action Movie With Bruce Willis

Filtering Genres (Science Fiction and Action)
Filtering Bruce Willis Movies
Filtering
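The three filtering steps can be sketched with boolean masks. This assumes that genres and cast are stored as pipe-separated strings; the sample rows and the names "Film 1", "Actor X" and so on are made up.

```python
import pandas as pd

# Made-up rows; genres and cast are assumed to be pipe-separated strings.
df = pd.DataFrame({
    "title": ["Film 1", "Film 2", "Film 3"],
    "genres": ["Action|Science Fiction", "Comedy", "Science Fiction|Action|Thriller"],
    "cast": ["Bruce Willis|Actor X", "Bruce Willis", "Actor Y|Actor Z"],
    "vote_average": [7.3, 6.1, 8.0],
})

# na=False treats missing genres/cast as "no match" instead of NaN.
mask_genres = (
    df["genres"].str.contains("Action", na=False)
    & df["genres"].str.contains("Science Fiction", na=False)
)
mask_actor = df["cast"].str.contains("Bruce Willis", na=False)

# Combine both masks and show the best-rated matches first.
result = df.loc[mask_genres & mask_actor, ["title", "vote_average"]]
result = result.sort_values(by="vote_average", ascending=False)
```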

Movies With Uma Thurman and Quentin Tarantino

Most Successful Pixar Movies from 2010 to 2015 (Highest Revenue)
Filtering Pixar Movies
Filtering Release Date
Result
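A possible version of the Pixar query, assuming production_companies is a pipe-separated string and release_date is already a datetime column. All rows below are invented.

```python
import pandas as pd

# Made-up rows (illustrative values only).
df = pd.DataFrame({
    "title": ["Film 1", "Film 2", "Film 3", "Film 4"],
    "production_companies": [
        "Pixar Animation Studios|Walt Disney Pictures",
        "Pixar Animation Studios",
        "Some Other Studio",
        "Pixar Animation Studios",
    ],
    "release_date": pd.to_datetime(
        ["2010-06-16", "2013-06-20", "2012-01-01", "2008-06-22"]
    ),
    "revenue_musd": [900.0, 700.0, 500.0, 400.0],
})

mask_studio = df["production_companies"].str.contains("Pixar", na=False)
# between() is inclusive on both ends by default.
mask_time = df["release_date"].between("2010-01-01", "2015-12-31")

pixar = df.loc[mask_studio & mask_time, ["title", "revenue_musd", "release_date"]]
pixar = pixar.sort_values(by="revenue_musd", ascending=False)
```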

Action or Thriller Movie with Original Language English and a Minimum Rating of 7.5 (Most Recent)
Filtering Genre (Action Or Thriller)
Filtering Language
Filtering Vote (greater than 10)
Filter Average Rating
Filter:
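Putting the four masks together might look like the sketch below. It assumes "en" is the code used for English in original_language and that genres is a pipe-separated string; the sample rows are made up.

```python
import pandas as pd

# Made-up rows (illustrative values only).
df = pd.DataFrame({
    "title": ["Film 1", "Film 2", "Film 3", "Film 4"],
    "genres": ["Action|Drama", "Thriller", "Comedy", "Action"],
    "original_language": ["en", "en", "en", "fr"],
    "vote_count": [250, 40, 900, 300],
    "vote_average": [7.8, 7.6, 8.1, 7.9],
    "release_date": pd.to_datetime(
        ["2005-03-01", "2016-09-09", "2014-02-02", "2017-01-01"]
    ),
})

# "Action|Thriller" is read as a regular expression: Action OR Thriller.
mask_genre = df["genres"].str.contains("Action|Thriller", na=False)
mask_lang = df["original_language"] == "en"
mask_votes = df["vote_count"] > 10
mask_rating = df["vote_average"] >= 7.5

next_movie = df.loc[mask_genre & mask_lang & mask_votes & mask_rating]
# Most recent releases first.
next_movie = next_movie.sort_values(by="release_date", ascending=False)
```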
Most Common Words in Titles and Taglines
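The notebook's own code for this step is not shown here. One lightweight way to find frequent words, not necessarily the one the notebook uses, is to count them with `collections.Counter`. The helper `word_counts` and the sample titles and taglines are invented for this sketch.

```python
from collections import Counter
import pandas as pd

# Made-up titles and taglines, including missing values.
df = pd.DataFrame({
    "title": ["The Last Night", "Night of the Owl", None],
    "tagline": ["Love never dies", None, "The night is long"],
})

def word_counts(series):
    """Count lower-cased words across all non-missing strings in a Series."""
    text = " ".join(series.dropna()).lower()
    return Counter(text.split())

title_words = word_counts(df["title"])
tagline_words = word_counts(df["tagline"])

# The five most frequent title words as (word, count) pairs.
top_title_words = title_words.most_common(5)
```

On real data it would make sense to drop very common words such as "the" and "of" before ranking.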


Are Franchises More Successful?
All Franchises
Count Franchise/Standalone Movies
Revenue (Franchise Vs Standalone Movies)
Budget (Franchise Vs Standalone Movies)
Average Rating (Franchise Vs Standalone Movies)
Popularity (Franchise Vs Standalone Movies)
Return on Investment (Franchise Vs Standalone Movies)
Aggregate Functions
We will use aggregate functions to calculate all the necessary information about franchises.
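A sketch of the franchise versus standalone comparison with `groupby` and named aggregations. The helper column `kind`, the output column names, and all sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: two franchise movies and two standalone movies.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "belongs_to_collection": ["Saga X", "Saga X", np.nan, np.nan],
    "revenue_musd": [500.0, 700.0, 40.0, 60.0],
    "budget_musd": [100.0, 150.0, 10.0, 30.0],
    "vote_average": [7.0, 6.5, 6.0, 6.4],
    "popularity": [20.0, 30.0, 3.0, 5.0],
})
df["return_musd"] = df["revenue_musd"] / df["budget_musd"]

# A movie counts as a franchise movie if it belongs to any collection.
df["kind"] = df["belongs_to_collection"].notna().map(
    {True: "Franchise", False: "Standalone"}
)

comparison = df.groupby("kind").agg(
    movies=("title", "count"),
    revenue_mean=("revenue_musd", "mean"),
    budget_mean=("budget_musd", "mean"),
    rating_mean=("vote_average", "mean"),
    popularity_mean=("popularity", "mean"),
    roi_median=("return_musd", "median"),
)
```

The median is used for ROI because a handful of extreme ratios can dominate a mean.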

Most Successful Franchise?
Largest Franchise
We can use sort_values to find the franchise with the largest number of movies.

We can also use nlargest to get the n largest franchises.
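Both approaches can be sketched on a small invented sample. The aggregated column names (`movies`, `revenue_total`, `revenue_mean`) are hypothetical.

```python
import pandas as pd

# Made-up rows; the last movie has no collection and is ignored by groupby.
df = pd.DataFrame({
    "title": ["A1", "A2", "A3", "B1", "B2", "C1"],
    "belongs_to_collection": ["Saga A", "Saga A", "Saga A", "Saga B", "Saga B", None],
    "revenue_musd": [100.0, 120.0, 90.0, 400.0, 450.0, 30.0],
})

franchises = df.groupby("belongs_to_collection").agg(
    movies=("title", "count"),
    revenue_total=("revenue_musd", "sum"),
    revenue_mean=("revenue_musd", "mean"),
)

# Variant 1: sort the whole table, then take the first row.
largest = franchises.sort_values(by="movies", ascending=False).head(1)
# Variant 2: ask directly for the n largest values of one column.
top_by_revenue = franchises.nlargest(1, "revenue_total")
```

`nlargest` avoids sorting the full table when only a few rows are needed, and it reads closer to the question being asked.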

Highest Revenue
| belongs_to_collection | title count | revenue sum | revenue mean | budget sum | budget mean | ROI median | rating mean | popularity mean | votes sum | votes mean |
|---|---|---|---|---|---|---|---|---|---|---|
| Harry Potter Collection | 8 | 7707.37 | 963.42 | 1280.00 | 160.00 | 6.17 | 7.54 | 26.25 | 47866.00 | 5983.25 |
| Star Wars Collection | 8 | 7434.49 | 929.31 | 854.35 | 106.79 | 8.24 | 7.38 | 23.41 | 43443.00 | 5430.38 |
| James Bond Collection | 26 | 7106.97 | 273.35 | 1539.65 | 59.22 | 6.13 | 6.34 | 13.45 | 33392.00 | 1284.31 |
| The Fast and the Furious Collection | 8 | 5125.10 | 640.64 | 1009.00 | 126.12 | 4.94 | 6.66 | 10.80 | 25576.00 | 3197.00 |
| Pirates of the Caribbean Collection | 5 | 4521.58 | 904.32 | 1250.00 | 250.00 | 3.45 | 6.88 | 53.97 | 25080.00 | 5016.00 |
| Transformers Collection | 5 | 4366.10 | 873.22 | 965.00 | 193.00 | 5.20 | 6.14 | 14.43 | 15232.00 | 3046.40 |
| Despicable Me Collection | 6 | 3691.07 | 922.77 | 299.00 | 74.75 | 12.76 | 6.78 | 106.72 | 18248.00 | 3041.33 |
| The Twilight Collection | 5 | 3342.11 | 668.42 | 385.00 | 77.00 | 10.27 | 5.84 | 29.50 | 13851.00 | 2770.20 |
| Ice Age Collection | 5 | 3216.71 | 643.34 | 429.00 | 85.80 | 8.26 | 6.38 | 16.08 | 13219.00 | 2643.80 |
| Jurassic Park Collection | 4 | 3031.48 | 757.87 | 379.00 | 94.75 | 7.03 | 6.50 | 10.77 | 18435.00 | 4608.75 |
| Shrek Collection | 5 | 2955.81 | 738.95 | 535.00 | 133.75 | 5.56 | 6.46 | 12.97 | 11721.00 | 2344.20 |
| The Hunger Games Collection | 4 | 2944.16 | 736.04 | 490.00 | 122.50 | 6.27 | 6.88 | 54.77 | 26174.00 | 6543.50 |
| The Hobbit Collection | 3 | 2935.52 | 978.51 | 750.00 | 250.00 | 3.83 | 7.23 | 25.21 | 17944.00 | 5981.33 |
| The Avengers Collection | 2 | 2924.96 | 1462.48 | 500.00 | 250.00 | 5.96 | 7.35 | 63.63 | 18908.00 | 9454.00 |
| The Lord of the Rings Collection | 3 | 2916.54 | 972.18 | 266.00 | 88.67 | 11.73 | 8.03 | 30.27 | 24759.00 | 8253.00 |
| X-Men Collection | 6 | 2808.83 | 468.14 | 983.00 | 163.83 | 3.02 | 6.82 | 9.71 | 27563.00 | 4593.83 |
| Avatar Collection | 1 | 2787.97 | 2787.97 | 237.00 | 237.00 | 11.76 | 7.20 | 185.07 | 12114.00 | 12114.00 |
| Mission: Impossible Collection | 5 | 2778.98 | 555.80 | 650.00 | 130.00 | 4.55 | 6.60 | 16.51 | 14005.00 | 2801.00 |
| Spider-Man Collection | 3 | 2496.35 | 832.12 | 597.00 | 199.00 | 3.92 | 6.47 | 22.62 | 13517.00 | 4505.67 |
| The Dark Knight Collection | 3 | 2463.72 | 821.24 | 585.00 | 195.00 | 4.34 | 7.80 | 57.42 | 29043.00 | 9681.00 |
Can you do it with nlargest?
Highest Average Revenue

Most Expensive Franchises (Budget)

Highest Rated Franchises

Most Successful Directors
Most Number Of Movies (top 5)
Highest Revenues By Directors
Highest Number of Franchises directed by Directors
Aggregate Functions
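A sketch of per-director aggregates, using invented rows and hypothetical output column names. The minimum of 2 movies is an arbitrary threshold that only suits this tiny sample.

```python
import pandas as pd

# Made-up rows (illustrative values only); one revenue value is missing.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "director": ["Director X", "Director X", "Director Y", "Director Y", "Director Z"],
    "revenue_musd": [100.0, 300.0, 50.0, float("nan"), 900.0],
    "vote_count": [1000, 3000, 200, 100, 5000],
    "vote_average": [7.0, 8.0, 6.0, 6.5, 9.0],
})

directors = df.groupby("director").agg(
    movies=("title", "count"),
    revenue_total=("revenue_musd", "sum"),
    votes_total=("vote_count", "sum"),
    rating_mean=("vote_average", "mean"),
)

# Require a minimum body of work before ranking by rating, so that a
# single highly rated movie does not put a director at the top.
established = directors[directors["movies"] >= 2]
top_rated = established.nlargest(1, "rating_mean")
most_movies = directors.nlargest(2, "movies")
```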

Highest Rated Movies
| director | movies (count) | votes (sum) | rating (mean) |
|---|---|---|---|
| Hayao Miyazaki | 14 | 14700.00 | 7.70 |
| Christopher Nolan | 11 | 67344.00 | 7.62 |
| Martin Scorsese | 39 | 35541.00 | 7.22 |
| Peter Jackson | 13 | 47571.00 | 7.14 |
| Joel Coen | 17 | 18139.00 | 7.02 |
| James Cameron | 11 | 33736.00 | 6.93 |
| Stanley Kubrick | 16 | 18214.00 | 6.91 |
| Steven Spielberg | 33 | 62266.00 | 6.89 |
| Danny Boyle | 14 | 16504.00 | 6.87 |
| Robert Zemeckis | 19 | 37666.00 | 6.79 |
| Terry Gilliam | 14 | 10049.00 | 6.76 |
| Tim Burton | 21 | 36922.00 | 6.73 |
| Ang Lee | 14 | 11164.00 | 6.71 |
| Antoine Fuqua | 12 | 15519.00 | 6.71 |
| Woody Allen | 49 | 15512.00 | 6.69 |
| Clint Eastwood | 35 | 24001.00 | 6.69 |
| Alfred Hitchcock | 53 | 12772.00 | 6.64 |
| Ridley Scott | 24 | 43083.00 | 6.60 |
| Kenneth Branagh | 14 | 11275.00 | 6.59 |
| Luc Besson | 16 | 19627.00 | 6.53 |
Find Successful Directors in a Specific Genre, e.g. Action
To Find Successful Actors
Set id as index
Split Actor Names to a DataFrame



Convert Series to DataFrame

Rename column label from 0 to 'Actor'

Merge Dataframe with Actors DataFrame
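The steps from setting the index to merging can be sketched as follows. This assumes that cast is a pipe-separated string of actor names; the sample rows and the variable name `act` are made up.

```python
import pandas as pd

# Made-up rows; id is set as the index so actor rows can be joined back.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "cast": ["Actor X|Actor Y", "Actor X", None],
    "revenue_musd": [100.0, 50.0, 20.0],
}).set_index("id")

# One column per cast position, then stack into one row per (movie, actor).
act = df["cast"].str.split("|", expand=True)
act = act.stack().dropna().reset_index(level=1, drop=True).to_frame()

# The stacked Series has no name, so the single column is labelled 0.
act.columns = ["Actor"]

# Attach the movie-level columns to every actor row via the shared id index.
act = act.merge(
    df[["title", "revenue_musd"]], how="left", left_index=True, right_index=True
)

n_unique = act["Actor"].nunique()
films_per_actor = act["Actor"].value_counts()
```

From here, `films_per_actor[films_per_actor > 10]` would select actors with more than 10 films on the real data.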

Number of Unique Actors
Actors with highest number of movies

Actors who have acted in more than 10 films

Highest Revenue
Highest Number of Films
Highest Rating
Popularity
Find Common Actors in the top lists
Concat all the dataframes

Find Duplicate Records of Actors
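One way to find actors that appear in several top lists is to concatenate the lists and count repeated names. The per-actor table below, its column names, and the list size `n` are invented for this sketch.

```python
import pandas as pd

# Made-up per-actor aggregates (hypothetical column names).
agg = pd.DataFrame({
    "Actor": ["Actor X", "Actor Y", "Actor Z", "Actor W"],
    "total_films": [40, 12, 25, 30],
    "total_revenue": [9000.0, 1000.0, 8000.0, 500.0],
    "mean_rating": [6.9, 7.8, 7.5, 5.9],
}).set_index("Actor")

n = 2
top_lists = [
    agg.nlargest(n, "total_revenue"),
    agg.nlargest(n, "total_films"),
    agg.nlargest(n, "mean_rating"),
]

# Stack the top lists on top of each other; names may now repeat.
combined = pd.concat(top_lists)
# An actor is "common" if the name occurs in more than one top list.
appearances = combined.index.value_counts()
common_actors = appearances[appearances > 1]
```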

What are the most successful/popular genres? Has this changed over time (e.g. 1980s vs. 1990s)?

Merge gen and original dataframe
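A sketch of the genre split, merge, and aggregation, assuming genres is a pipe-separated string. The column name `gen` follows the output tables in this notebook; the sample rows and aggregate names are invented.

```python
import pandas as pd

# Made-up rows (illustrative values only).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "genres": ["Action|Adventure", "Action", "Drama"],
    "revenue_musd": [300.0, 100.0, 40.0],
    "vote_average": [6.0, 7.0, 8.0],
    "popularity": [10.0, 6.0, 2.0],
}).set_index("id")

# One row per (movie, genre); movies without genres are dropped.
gen = df["genres"].str.split("|").explode().dropna().to_frame(name="gen")

# Join the movie-level metrics back onto each genre row.
merged = gen.merge(
    df[["revenue_musd", "vote_average", "popularity"]],
    how="left", left_index=True, right_index=True,
)

by_genre = merged.groupby("gen").agg(
    revenue_total=("revenue_musd", "sum"),
    revenue_mean=("revenue_musd", "mean"),
    rating_mean=("vote_average", "mean"),
    popularity_mean=("popularity", "mean"),
)
top_revenue = by_genre.nlargest(1, "revenue_total")
```

A movie with several genres contributes its full revenue to each of them, so the per-genre totals overlap and should not be added up.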

Aggregate Functions
Genre With Highest Revenue
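| gen | revenue sum | revenue mean | rating mean | popularity mean |
|---|---|---|---|---|
| Action | 201388.05 | 116.07 | 5.75 | 4.78 |
| Adventure | 199978.67 | 179.19 | 5.88 | 6.00 |
| Comedy | 166845.05 | 64.10 | 5.97 | 3.25 |
| Drama | 160754.36 | 43.83 | 6.17 | 3.03 |
| Thriller | 129724.55 | 69.52 | 5.74 | 4.51 |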
Genre With Highest Rating
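| gen | revenue sum | revenue mean | rating mean | popularity mean |
|---|---|---|---|---|
| Documentary | 1449.11 | 6.65 | 6.66 | 0.96 |
| Animation | 67432.97 | 176.99 | 6.45 | 4.75 |
| History | 14902.20 | 50.52 | 6.41 | 3.48 |
| Music | 13370.29 | 50.08 | 6.33 | 2.56 |
| War | 15910.46 | 65.48 | 6.29 | 3.35 |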
Genre With Highest Popularity
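| gen | revenue sum | revenue mean | rating mean | popularity mean |
|---|---|---|---|---|
| Adventure | 199978.67 | 179.19 | 5.88 | 6.00 |
| Fantasy | 103920.15 | 166.01 | 5.93 | 5.36 |
| Science Fiction | 97847.96 | 131.52 | 5.48 | 5.00 |
| Action | 201388.05 | 116.07 | 5.75 | 4.78 |
| Family | 107076.78 | 159.10 | 5.93 | 4.77 |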
Highest revenue generated by Genre in 90's
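Restricting the genre analysis to one decade could look like the sketch below. The helper `genres_in_decade`, the table `merged` with one row per movie and genre, and all values are assumptions made for illustration.

```python
import pandas as pd

# Made-up table with one row per (movie, genre) pair.
merged = pd.DataFrame({
    "gen": ["Action", "Drama", "Action", "Drama", "Comedy"],
    "release_date": pd.to_datetime(
        ["1994-06-01", "1997-12-19", "2003-05-15", "1999-03-31", "1989-11-22"]
    ),
    "revenue_musd": [300.0, 1000.0, 700.0, 400.0, 90.0],
    "popularity": [12.0, 26.0, 15.0, 9.0, 4.0],
})

def genres_in_decade(data, start_year):
    """Aggregate revenue and popularity per genre for one decade."""
    year = data["release_date"].dt.year
    # between() is inclusive, so 1990 covers the years 1990 to 1999.
    in_decade = data[year.between(start_year, start_year + 9)]
    return in_decade.groupby("gen").agg(
        revenue_total=("revenue_musd", "sum"),
        popularity_mean=("popularity", "mean"),
    )

nineties = genres_in_decade(merged, 1990)
top_revenue_90s = nineties.sort_values("revenue_total", ascending=False)
```

Calling the same helper with another start year gives the comparison for a different decade.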

Popularity of Genres in Nineties
Try to find Total revenue and average rating too.
Highest revenue generated by Genre in 20's

Popularity of Genres in Twenties
Find the Most Successful Production Companies on Your Own