IMDB - Dataset Analysis - Basic

Project 1: Explanatory Data Analysis & Data Presentation (Movies Dataset)

Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 1 on your own. If you need help or inspiration, have a look at the videos or the Jupyter Notebook with the full code.

Keep in mind that what matters is reaching the right results and conclusions, not reproducing identical code. Things can be coded in many different ways, so even if you reach the same conclusions, it's very unlikely that your code will match ours.

Data Import and first Inspection

  1. Import the movies dataset from the CSV file "movies_complete.csv". Inspect the data.

Some additional information on Features/Columns:

  • id: The ID of the movie (clear/unique identifier).

  • title: The Official Title of the movie.

  • tagline: The tagline of the movie.

  • release_date: Theatrical Release Date of the movie.

  • genres: Genres associated with the movie.

  • belongs_to_collection: Gives information on the movie series/franchise the particular film belongs to.

  • original_language: The language in which the movie was originally shot.

  • budget_musd: The budget of the movie in million dollars.

  • revenue_musd: The total revenue of the movie in million dollars.

  • production_companies: Production companies involved with the making of the movie.

  • production_countries: Countries where the movie was shot/produced.

  • vote_count: The number of votes by users, as counted by TMDB.

  • vote_average: The average rating of the movie.

  • popularity: The Popularity Score assigned by TMDB.

  • runtime: The runtime of the movie in minutes.

  • overview: A brief blurb of the movie.

  • spoken_languages: Spoken languages in the film.

  • poster_path: The URL of the poster image.

  • cast: (Main) Actors appearing in the movie.

  • cast_size: Number of actors appearing in the movie.

  • director: Director of the movie.

  • crew_size: Size of the film crew (incl. director, excl. actors).

Import Necessary Libraries for this Task:

Read the Movie Data

44691 rows × 22 columns
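The import step can be sketched as follows. This is a minimal example, assuming pandas; the helper name load_movies is made up for illustration, and a three-row in-memory sample stands in for "movies_complete.csv" so that the snippet runs without the file.

```python
import io

import pandas as pd

# Toy stand-in for "movies_complete.csv"; the three rows are invented.
SAMPLE_CSV = io.StringIO(
    "id,title,release_date,budget_musd,revenue_musd\n"
    "1,Movie A,1995-10-30,30.0,373.5\n"
    "2,Movie B,2010-06-16,,\n"
    "3,Movie C,2015-06-09,175.0,857.6\n"
)


def load_movies(source):
    """Read the movies CSV and parse release_date as a datetime column."""
    return pd.read_csv(source, parse_dates=["release_date"])


# With the real file: df = load_movies("movies_complete.csv")
df = load_movies(SAMPLE_CSV)
print(df.shape)   # (rows, columns)
print(df.dtypes)  # release_date should be a datetime column
```

Parsing release_date at import time makes the later date filters (for example "2010 to 2015") straightforward.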

Getting Info About Data

Statistical Summary

               id  budget_musd  revenue_musd  vote_count  vote_average  popularity   runtime  cast_size  crew_size
count    44691.00      8854.00       7385.00    44691.00      42077.00    44691.00  43179.00   44691.00   44691.00
mean    107186.24        21.67         68.97      111.65          6.00        2.96     97.57      12.48      10.31
std     111806.36        34.36        146.61      495.32          1.28        6.04     34.65      12.12      15.89
min          2.00         0.00          0.00        0.00          0.00        0.00      1.00       0.00       0.00
25%      26033.50         2.00          2.41        3.00          5.30        0.40     86.00       6.00       2.00
50%      59110.00         8.20         16.87       10.00          6.10        1.15     95.00      10.00       6.00
75%     154251.00        25.00         67.64       35.00          6.80        3.77    107.00      15.00      12.00
max     469172.00       380.00       2787.97    14075.00         10.00      547.49   1256.00     313.00     435.00
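A summary like the one above typically comes from info() and describe(). The sketch below uses a small invented DataFrame rather than the real data; note that the count row only counts non-missing values, which is why budget_musd and revenue_musd show far fewer entries than id in the real summary.

```python
import numpy as np
import pandas as pd

# Made-up numbers; the real notebook would call these on the full movies DataFrame.
df = pd.DataFrame(
    {
        "budget_musd": [30.0, np.nan, 175.0, 5.0],
        "revenue_musd": [373.5, np.nan, 857.6, np.nan],
        "runtime": [81.0, 104.0, 95.0, np.nan],
    }
)

df.info()                          # dtypes and non-null counts per column
summary = df.describe().round(2)   # count, mean, std, min, quartiles, max
print(summary)
```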

The best and the worst movies...

  1. Filter the Dataset and find the best/worst n Movies with the

  • Highest Revenue

  • Highest Budget

  • Highest Profit (=Revenue - Budget)

  • Lowest Profit (=Revenue - Budget)

  • Highest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)

  • Lowest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)

  • Highest number of Votes

  • Highest Rating (only movies with 10 or more Ratings)

  • Lowest Rating (only movies with 10 or more Ratings)

  • Highest Popularity

The Best and Worst Movies ever

We will filter the data on the criteria that determine the best and worst movies. We will also import HTML, which lets us render the results as a web page that shows the movie posters.

Filtering Columns responsible to determine best and worst movies

Create a column 'profit_musd' (revenue - budget)

Create a column 'return_musd' (revenue/budget)
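A minimal sketch of the two derived columns, using invented values. Profit is a difference and return is a ratio, so both come out as NaN whenever budget or revenue is missing.

```python
import numpy as np
import pandas as pd

# Illustrative values only (million USD).
df = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C"],
        "budget_musd": [30.0, 175.0, np.nan],
        "revenue_musd": [373.5, 857.6, 12.0],
    }
)

# Element-wise arithmetic; a missing budget or revenue propagates as NaN.
df["profit_musd"] = df["revenue_musd"].sub(df["budget_musd"])
df["return_musd"] = df["revenue_musd"].div(df["budget_musd"])
print(df)
```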

Rename Columns to Something Meaningful for Later Presentation in Graphs

Set Title as Index

Convert Our DataFrame into HTML (Poster, Title, Popularity)
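One way to sketch the HTML conversion uses only pandas. The poster URLs below are hypothetical, and the example assumes poster_path holds a plain URL that still needs to be wrapped in an image tag. In a notebook, the resulting string can be displayed by passing it to HTML from IPython.display, which is presumably the import the text refers to.

```python
import pandas as pd

# Hypothetical poster URLs; the real values come from the poster_path column.
df = pd.DataFrame(
    {
        "poster_path": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
        "popularity": [21.9, 17.0],
    },
    index=pd.Index(["Movie A", "Movie B"], name="title"),
)

view = df.copy()
# Wrap each URL in an <img> tag so a browser renders the poster itself.
view["poster_path"] = "<img src='" + view["poster_path"] + "'>"

# escape=False keeps the <img> markup intact instead of escaping it to text.
html = view.to_html(escape=False)
print(html[:120])
```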

Highest Rated Movies

This approach is not meaningful on its own: the top-rated movie has only one vote, which is not enough to judge its rating. So let us find the median number of votes and use it as the minimum number of votes a movie must have to be considered.

Movies With Highest ROI

We will use the same approach here: a few movies have a budget close to zero and must be excluded, so let us find the median budget and use it as the minimum.

Before moving ahead, let us fill all NA values of budget and votes with 0.
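The thresholding idea can be sketched as below, with invented rows. The variable names min_votes and min_budget are illustrative. The medians are computed before the fill, following the order of the text, so the zeros introduced by fillna do not lower them.

```python
import numpy as np
import pandas as pd

# Made-up sample; column names follow the dataset description.
df = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E"],
        "vote_count": [1.0, 10.0, 500.0, np.nan, 40.0],
        "vote_average": [10.0, 6.5, 8.1, np.nan, 7.0],
        "budget_musd": [np.nan, 2.0, 150.0, 0.001, 25.0],
    }
)

# median() skips NaN, so these are medians of the known values only.
min_votes = df["vote_count"].median()
min_budget = df["budget_musd"].median()

# Fill missing budgets and vote counts with 0 afterwards.
df[["budget_musd", "vote_count"]] = df[["budget_musd", "vote_count"]].fillna(0)

# Only movies with at least the median number of votes are ranked by rating.
rated = df[df["vote_count"] >= min_votes].sort_values("vote_average", ascending=False)
print(min_votes, min_budget)
print(rated[["title", "vote_count", "vote_average"]])
```

Movie A, with a perfect rating from a single vote, no longer tops the list.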

Create a Function to find Best and Worst Movies
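A possible shape for such a function is sketched below. The name best_worst and its parameters are assumptions made for this example, not the notebook's actual signature, and the data is invented.

```python
import pandas as pd

# Illustrative data; values are invented.
df = pd.DataFrame(
    {
        "revenue_musd": [373.5, 857.6, 12.0, 0.5],
        "budget_musd": [30.0, 175.0, 40.0, 0.0],
        "vote_count": [5000, 9000, 120, 3],
    },
    index=pd.Index(["Movie A", "Movie B", "Movie C", "Movie D"], name="title"),
)
df["profit_musd"] = df["revenue_musd"] - df["budget_musd"]


def best_worst(data, n, by, ascending=False, min_budget=0, min_votes=0):
    """Return the first n rows sorted by `by`, after applying minimum thresholds."""
    eligible = data[(data["budget_musd"] >= min_budget) & (data["vote_count"] >= min_votes)]
    return eligible.sort_values(by=by, ascending=ascending).head(n)


top_revenue = best_worst(df, n=2, by="revenue_musd")
lowest_profit = best_worst(df, n=1, by="profit_musd", ascending=True, min_budget=10)
print(top_revenue)
print(lowest_profit)
```

With one function, each "Top 5" list below becomes a single call that differs only in the sort column, the sort direction and the thresholds.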

Top 5 - Highest Revenue

Top 5 - Highest Budget

Top 5 - Highest Profit

Top 5 - Highest ROI

Top 5 - Lowest Profit

Find Your Next Movie

  • Science Fiction Action Movie With Bruce Willis

Filtering Genres (Science Fiction and Action)

Filtering Bruce Willis Movies

Filtering
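The three filtering steps above can be sketched with boolean masks. The rows are invented, and the example assumes that genres and cast are stored as "|"-separated strings.

```python
import pandas as pd

# Invented rows; genres and cast are assumed to be "|"-separated strings.
df = pd.DataFrame(
    {
        "title": ["Film 1", "Film 2", "Film 3"],
        "genres": ["Action|Science Fiction", "Comedy", "Action|Thriller"],
        "cast": ["Bruce Willis|Actor X", "Bruce Willis", None],
        "vote_average": [7.2, 5.9, 6.4],
    }
)

# Both genres must be present, so two masks are combined with &.
mask_action = df["genres"].str.contains("Action", na=False)
mask_scifi = df["genres"].str.contains("Science Fiction", na=False)
# na=False treats movies with a missing cast as non-matches instead of NaN.
mask_actor = df["cast"].str.contains("Bruce Willis", na=False)

result = df.loc[mask_action & mask_scifi & mask_actor].sort_values(
    "vote_average", ascending=False
)
print(result[["title", "vote_average"]])
```

The same pattern covers the Uma Thurman and Quentin Tarantino search: one mask on cast, one on director.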

Movies With Uma Thurman and Quentin Tarantino

Most Successful Pixar Movies from 2010 to 2015 (Highest Revenue)

  • Filtering Pixar Movies

Filtering Release Date

Result
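A sketch of the Pixar filter, again on invented rows. It assumes release_date is already a datetime column and that the studio name appears as text inside production_companies.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame(
    {
        "title": ["P1", "P2", "P3", "Other"],
        "production_companies": [
            "Pixar Animation Studios|Walt Disney Pictures",
            "Pixar Animation Studios",
            "Pixar Animation Studios",
            "Studio X",
        ],
        "release_date": pd.to_datetime(
            ["2009-05-13", "2010-06-16", "2015-06-09", "2012-01-01"]
        ),
        "revenue_musd": [735.1, 1066.9, 857.6, 10.0],
    }
)

mask_studio = df["production_companies"].str.contains("Pixar", na=False)
# between() includes both end points by default.
mask_time = df["release_date"].between(
    pd.Timestamp("2010-01-01"), pd.Timestamp("2015-12-31")
)

pixar = df.loc[mask_studio & mask_time].sort_values("revenue_musd", ascending=False)
print(pixar[["title", "release_date", "revenue_musd"]])
```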

Action or Thriller Movie with Original Language English and a Minimum Rating of 7.5 (Most Recent)

Filtering Genre (Action Or Thriller)

Filtering Language

Filtering Vote (greater than 10)

Filter Average Rating

Filter:
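Combining the four masks could look like the sketch below. The rows are invented, "en" is assumed to be the language code for English, and the vote threshold follows the "greater than 10" heading.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame(
    {
        "title": ["T1", "T2", "T3", "T4", "T5"],
        "genres": ["Action|Drama", "Thriller", "Comedy", "Action", "Thriller"],
        "original_language": ["en", "en", "en", "fr", "en"],
        "vote_count": [250, 8, 900, 400, 300],
        "vote_average": [7.8, 9.0, 8.2, 7.9, 7.6],
        "release_date": pd.to_datetime(
            ["2016-03-01", "2017-01-15", "2014-07-04", "2015-05-05", "2017-02-01"]
        ),
    }
)

# Either genre is enough, so the two genre masks are combined with |.
mask_genre = df["genres"].str.contains("Action", na=False) | df["genres"].str.contains(
    "Thriller", na=False
)
mask_lang = df["original_language"] == "en"
mask_votes = df["vote_count"] > 10
mask_rating = df["vote_average"] >= 7.5

# Most recent releases first.
picks = df.loc[mask_genre & mask_lang & mask_votes & mask_rating].sort_values(
    "release_date", ascending=False
)
print(picks[["title", "release_date", "vote_average"]])
```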

Most Common Words in Titles and Taglines
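How the notebook finds the most common words is not shown here. As a stand-in, the sketch below counts words with collections.Counter. The helper word_counts, the tiny stop-word list and the sample rows are all made up for illustration.

```python
from collections import Counter

import pandas as pd

# Invented titles and taglines.
df = pd.DataFrame(
    {
        "title": ["The Last Man", "Last Night", None],
        "tagline": ["Love never dies", "A night to remember", "Love is war"],
    }
)

# A tiny, hand-picked stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "to", "is"}


def word_counts(series):
    """Count lowercase words in a text column, skipping missing values and stop words."""
    words = series.dropna().str.lower().str.findall(r"[a-z']+").explode().dropna()
    return Counter(word for word in words if word not in STOP_WORDS)


title_counts = word_counts(df["title"])
tagline_counts = word_counts(df["tagline"])
print(title_counts.most_common(3))
print(tagline_counts.most_common(3))
```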

Are Franchises More Successful?

All Franchises

Count Franchise/Standalone Movies
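A simple way to separate the two groups is to flag every movie whose belongs_to_collection value is present. The sketch below uses invented rows, and the column name kind is an assumption.

```python
import numpy as np
import pandas as pd

# Invented rows; "Saga X" is a hypothetical collection name.
df = pd.DataFrame(
    {
        "title": ["S1", "F1", "F2", "S2"],
        "belongs_to_collection": [None, "Saga X", "Saga X", None],
        "revenue_musd": [10.0, 500.0, 700.0, np.nan],
    }
)

# A movie counts as a franchise movie when it belongs to a collection.
df["kind"] = df["belongs_to_collection"].notna().map(
    {True: "Franchise", False: "Standalone"}
)

counts = df["kind"].value_counts()
mean_revenue = df.groupby("kind")["revenue_musd"].mean()  # NaN revenues are skipped
print(counts)
print(mean_revenue)
```

Swapping revenue_musd for budget, rating, popularity or ROI gives the other comparisons in this section.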

Revenue (Franchise Vs Standalone Movies)

Budget (Franchise Vs Standalone Movies)

Average Rating (Franchise Vs Standalone Movies)

Popularity (Franchise Vs Standalone Movies)

Return on Investment (Franchise Vs Standalone Movies)

Aggregate Functions

We will use aggregate functions to calculate all necessary info about Franchise.
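A sketch of a per-franchise aggregation with groupby and agg, on invented rows with hypothetical collection names. Passing a dict that maps columns to one or more functions yields two-level column labels such as (revenue_musd, sum).

```python
import pandas as pd

# Invented rows; "Saga X" and "Saga Y" are hypothetical collections.
df = pd.DataFrame(
    {
        "title": ["X1", "X2", "Y1", "Solo"],
        "belongs_to_collection": ["Saga X", "Saga X", "Saga Y", None],
        "revenue_musd": [500.0, 700.0, 300.0, 50.0],
        "budget_musd": [100.0, 140.0, 150.0, 10.0],
        "vote_average": [7.0, 8.0, 6.0, 5.0],
    }
)
df["roi"] = df["revenue_musd"] / df["budget_musd"]

# groupby drops rows whose key is missing, so standalone movies are excluded.
franchises = df.groupby("belongs_to_collection").agg(
    {
        "title": "count",
        "revenue_musd": ["sum", "mean"],
        "budget_musd": ["sum", "mean"],
        "roi": "median",
        "vote_average": "mean",
    }
)
print(franchises)
```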

Most Successful Franchise?

Largest Franchise

We can use sort_values to order the franchises by their number of movies and read the largest ones off the top.

We can also use nlargest to get the n largest franchises directly.
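Both routes give the same answer, as this small sketch with hypothetical movie counts shows.

```python
import pandas as pd

# Hypothetical number of movies per franchise.
movie_counts = pd.Series({"Saga X": 8, "Saga Y": 26, "Saga Z": 5}, name="title_count")

# Route 1: sort everything in descending order, then keep the first two.
by_sort = movie_counts.sort_values(ascending=False).head(2)
# Route 2: ask for the two largest values directly.
by_nlargest = movie_counts.nlargest(2)

print(by_sort)
print(by_nlargest)
```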

Highest Revenue

                                     title  revenue_musd            budget_musd              roi  vote_average  popularity  vote_count
belongs_to_collection                count           sum      mean          sum     mean  median          mean        mean         sum      mean
Harry Potter Collection                  8       7707.37    963.42      1280.00   160.00    6.17          7.54       26.25    47866.00   5983.25
Star Wars Collection                     8       7434.49    929.31       854.35   106.79    8.24          7.38       23.41    43443.00   5430.38
James Bond Collection                   26       7106.97    273.35      1539.65    59.22    6.13          6.34       13.45    33392.00   1284.31
The Fast and the Furious Collection      8       5125.10    640.64      1009.00   126.12    4.94          6.66       10.80    25576.00   3197.00
Pirates of the Caribbean Collection      5       4521.58    904.32      1250.00   250.00    3.45          6.88       53.97    25080.00   5016.00
Transformers Collection                  5       4366.10    873.22       965.00   193.00    5.20          6.14       14.43    15232.00   3046.40
Despicable Me Collection                 6       3691.07    922.77       299.00    74.75   12.76          6.78      106.72    18248.00   3041.33
The Twilight Collection                  5       3342.11    668.42       385.00    77.00   10.27          5.84       29.50    13851.00   2770.20
Ice Age Collection                       5       3216.71    643.34       429.00    85.80    8.26          6.38       16.08    13219.00   2643.80
Jurassic Park Collection                 4       3031.48    757.87       379.00    94.75    7.03          6.50       10.77    18435.00   4608.75
Shrek Collection                         5       2955.81    738.95       535.00   133.75    5.56          6.46       12.97    11721.00   2344.20
The Hunger Games Collection              4       2944.16    736.04       490.00   122.50    6.27          6.88       54.77    26174.00   6543.50
The Hobbit Collection                    3       2935.52    978.51       750.00   250.00    3.83          7.23       25.21    17944.00   5981.33
The Avengers Collection                  2       2924.96   1462.48       500.00   250.00    5.96          7.35       63.63    18908.00   9454.00
The Lord of the Rings Collection         3       2916.54    972.18       266.00    88.67   11.73          8.03       30.27    24759.00   8253.00
X-Men Collection                         6       2808.83    468.14       983.00   163.83    3.02          6.82        9.71    27563.00   4593.83
Avatar Collection                        1       2787.97   2787.97       237.00   237.00   11.76          7.20      185.07    12114.00  12114.00
Mission: Impossible Collection           5       2778.98    555.80       650.00   130.00    4.55          6.60       16.51    14005.00   2801.00
Spider-Man Collection                    3       2496.35    832.12       597.00   199.00    3.92          6.47       22.62    13517.00   4505.67
The Dark Knight Collection               3       2463.72    821.24       585.00   195.00    4.34          7.80       57.42    29043.00   9681.00

Can you do it with nlargest?

Highest Average Revenue

Most Expensive Franchises (Budget)

Highest Rated Franchises

Most Successful Directors

Most Number Of Movies (top 5)

Highest Revenues By Directors

Highest Number of Franchises directed by Directors

Aggregate Functions

Highest Rated Movies

director           title  vote_count  vote_average
Hayao Miyazaki        14    14700.00          7.70
Christopher Nolan     11    67344.00          7.62
Martin Scorsese       39    35541.00          7.22
Peter Jackson         13    47571.00          7.14
Joel Coen             17    18139.00          7.02
James Cameron         11    33736.00          6.93
Stanley Kubrick       16    18214.00          6.91
Steven Spielberg      33    62266.00          6.89
Danny Boyle           14    16504.00          6.87
Robert Zemeckis       19    37666.00          6.79
Terry Gilliam         14    10049.00          6.76
Tim Burton            21    36922.00          6.73
Ang Lee               14    11164.00          6.71
Antoine Fuqua         12    15519.00          6.71
Woody Allen           49    15512.00          6.69
Clint Eastwood        35    24001.00          6.69
Alfred Hitchcock      53    12772.00          6.64
Ridley Scott          24    43083.00          6.60
Kenneth Branagh       14    11275.00          6.59
Luc Besson            16    19627.00          6.53
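A ranking of this kind can be sketched by aggregating per director and then applying minimum thresholds before sorting by average rating. The exact cut-offs used for the table are not stated, so MIN_MOVIES and MIN_VOTES below are placeholders, and the rows are invented.

```python
import pandas as pd

# Invented rows with hypothetical director names.
df = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E"],
        "director": ["Dir One", "Dir One", "Dir Two", "Dir Two", "Dir Three"],
        "vote_count": [100, 300, 50, 20, 900],
        "vote_average": [7.0, 8.0, 6.0, 6.5, 9.0],
    }
)

# Number of movies, total votes and average rating per director.
directors = df.groupby("director").agg(
    {"title": "count", "vote_count": "sum", "vote_average": "mean"}
)

MIN_MOVIES = 2   # placeholder threshold, not taken from the source
MIN_VOTES = 100  # placeholder threshold, not taken from the source
eligible = directors[
    (directors["title"] >= MIN_MOVIES) & (directors["vote_count"] >= MIN_VOTES)
]
top_rated = eligible.nlargest(20, "vote_average")
print(top_rated)
```

Without the thresholds, a director with a single highly rated movie would lead the list.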

Finding Successful Directors in a Specific Genre, e.g. Action

To Find Successful Actors

Set id as index

Split Actor Names to a DataFrame

Convert Series to DataFrame

Rename column label from 0 to 'Actor'

Merge Dataframe with Actors DataFrame
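The steps above (set id as index, split, convert, rename, merge) can be sketched as follows. The rows are invented, and cast is assumed to be a "|"-separated string of actor names.

```python
import pandas as pd

# Invented rows; cast is assumed to be a "|"-separated string.
df = pd.DataFrame(
    {
        "id": [1, 2],
        "title": ["Movie A", "Movie B"],
        "cast": ["Actor One|Actor Two", "Actor Two|Actor Three"],
        "revenue_musd": [100.0, 250.0],
    }
).set_index("id")

# One column per cast position ...
act = df["cast"].str.split("|", expand=True)
# ... stacked into one row per (movie, actor); empty positions are dropped.
act = act.stack().dropna().reset_index(level=1, drop=True).to_frame()
act = act.rename(columns={0: "Actor"})

# Attach movie-level columns to every actor row via the shared id index.
actors = act.merge(
    df[["title", "revenue_musd"]], how="left", left_index=True, right_index=True
)
print(actors)
print(actors["Actor"].nunique())  # number of unique actors
```

Each movie now appears once per actor, so grouping by Actor gives per-actor totals and averages.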

Number of Unique Actors

Actors with highest number of movies

Actors who have acted in more than 10 films

Highest Revenue

Highest Number of Films

Highest Rating

Popularity

Find Common Actors in the top lists

Concat all the dataframes

Find Duplicate Records of Actors
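To see which actors appear in more than one top list, the lists can be concatenated and checked for repeated names. The three small frames below are hypothetical stand-ins for the real top lists.

```python
import pandas as pd

# Hypothetical top lists, each indexed by actor name.
top_revenue = pd.DataFrame({"value": [900.0, 800.0]}, index=["Actor One", "Actor Two"])
top_films = pd.DataFrame({"value": [60, 55]}, index=["Actor Two", "Actor Three"])
top_rating = pd.DataFrame({"value": [7.9, 7.5]}, index=["Actor Two", "Actor One"])

# keys= records which list each row came from (outer index level).
combined = pd.concat(
    [top_revenue, top_films, top_rating], keys=["revenue", "films", "rating"]
)
names = combined.index.get_level_values(1)

# keep=False marks every occurrence of a repeated name, not only the later ones.
appearances = names[names.duplicated(keep=False)].value_counts()
print(appearances)
```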

Merge gen and original dataframe

Aggregate Functions
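A sketch of the genre aggregation on invented rows. It assumes genres is a "|"-separated string, expands it to one row per movie and genre under the name gen, joins it back to the movie data and aggregates.

```python
import pandas as pd

# Invented rows; genres is assumed to be a "|"-separated string.
df = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "genres": ["Action|Adventure", "Action", None],
        "revenue_musd": [200.0, 100.0, 5.0],
        "vote_average": [6.0, 7.0, 5.0],
        "popularity": [4.0, 6.0, 1.0],
    }
)

# One row per (movie, genre); movies without genres are dropped.
gen = df["genres"].str.split("|").explode().dropna().rename("gen")
merged = df.drop(columns="genres").join(gen, how="inner")

by_genre = merged.groupby("gen").agg(
    {"revenue_musd": ["sum", "mean"], "vote_average": "mean", "popularity": "mean"}
)
# Two-level column labels are addressed with tuples.
top_revenue = by_genre.sort_values(("revenue_musd", "sum"), ascending=False)
print(top_revenue)
```

A movie with several genres contributes its full revenue to each of them, so the per-genre sums overlap.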

Genre With Highest Revenue

                 revenue_musd            vote_average  popularity
gen                       sum      mean          mean        mean
Action              201388.05    116.07          5.75        4.78
Adventure           199978.67    179.19          5.88        6.00
Comedy              166845.05     64.10          5.97        3.25
Drama               160754.36     43.83          6.17        3.03
Thriller            129724.55     69.52          5.74        4.51

Genre With Highest Rating

                 revenue_musd            vote_average  popularity
gen                       sum      mean          mean        mean
Documentary           1449.11      6.65          6.66        0.96
Animation            67432.97    176.99          6.45        4.75
History              14902.20     50.52          6.41        3.48
Music                13370.29     50.08          6.33        2.56
War                  15910.46     65.48          6.29        3.35

Genre With Highest Popularity

                 revenue_musd            vote_average  popularity
gen                       sum      mean          mean        mean
Adventure           199978.67    179.19          5.88        6.00
Fantasy             103920.15    166.01          5.93        5.36
Science Fiction      97847.96    131.52          5.48        5.00
Action              201388.05    116.07          5.75        4.78
Family              107076.78    159.10          5.93        4.77

Highest revenue generated by Genre in 90's

Popularity of Genres in Nineties

Try to find Total revenue and average rating too.
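Restricting the genre analysis to a decade only needs a filter on the release year first. The sketch below uses invented rows that are already expanded to one row per movie and genre, with release_date as a datetime column.

```python
import pandas as pd

# Invented rows, already expanded to one row per (movie, genre).
merged = pd.DataFrame(
    {
        "gen": ["Action", "Drama", "Action", "Drama"],
        "release_date": pd.to_datetime(
            ["1994-06-01", "1997-12-19", "2005-05-05", "1999-10-15"]
        ),
        "revenue_musd": [350.0, 1845.0, 200.0, 100.0],
        "popularity": [10.0, 26.0, 8.0, 12.0],
    }
)

# Keep releases from 1990 to 1999 (both years included).
year = merged["release_date"].dt.year
nineties = merged[year.between(1990, 1999)]

summary = nineties.groupby("gen").agg({"revenue_musd": "sum", "popularity": "mean"})
print(summary.sort_values("revenue_musd", ascending=False))
```

Changing the two year bounds gives the same summary for any other period.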

Highest revenue generated by Genre in 20's

Popularity of Genres in Twenties

Find the Most Successful Production Companies on Your Own
