PDA Assignments
  • Python For Data Analytics
    • 1.Python
      • 1.Python Documents
        • 1.Data Types
        • 2.Variables In Python
        • 3.Operators In Python
        • 4.User Input In Python
        • 5.TypeCasting In Python
        • 6.Strings In Python
        • 7.Conditional Statements In Python
        • 8.Branching using Conditional Statements and Loops in Python
        • 9.Lists In Python
        • 10.Sets In Python
        • 11.Tuples In Python
        • 12.Dictionary In Python
        • 13.Functions In Python
        • 14.File Handling In Python
        • 15.Numerical Computing with Python and Numpy
      • 2.Python Assignments
        • Data Type & Variables
        • Operators Assignment
        • User Input & Type Casting
        • Functions- Basic Assignments
        • String Assignments
          • String CheatSheet
        • Conditional Statements Assignments
        • Loops Assignments
        • List Assignments
          • List Cheatsheet
        • Set Assignments
          • Sets Cheatsheet
        • Dictionary Assignments
          • Dictionary Cheatsheet
        • Function Assignments
        • Functions used in Python
      • 3.Python Projects
        • Employee Management System
        • Hamming distance
        • Webscraping With Python
          • Introduction To Web Scraping
          • Importing Necessary Libraries
          • Basic Introduction To HTML
          • Introduction To BeautifulSoup
          • Flipkart Web Scraping
            • Scraping Step By Step
        • Retail Sales Analysis
        • Guess the Word Game
        • Data Collection Through APIs
        • To-Do List Manager
        • Atm-functionalities(nested if)
        • Distribution of Cards(List & Nested for)
        • Guess the Number Game
      • 4.Python + SQL Projects
        • Bookstore Management System
    • 2.Data Analytics
      • 1.Pandas
        • 1.Pandas Documents
          • 1.Introduction To Pandas
          • Reading and Loading Different Data
          • 2.Indexing and Slicing In Pandas
          • 3.Joining In Pandas
          • 4.Missing Values In Pandas
          • 5.Outliers In Pandas
          • 6.Aggregating Data
          • 7.DateTime In Pandas
          • 8.Validation In Pandas
          • 9.Fetching Data From SQL
          • 10. Automation In Pandas
          • 11.Matplotlib - Data Visualization
          • 12. Seaborn - Data Visualization
          • 13. Required Files
        • 3.Pandas Projects
          • Retail Sales Analysis
            • Retail Sales Step By Step
          • IMDB - Dataset Analysis - Basic
        • 2. Pandas Assignments
          • 1. Reading and Loading the Data
          • 2. Data frame Functions and Properties
          • 3. Series - Basic Operations
          • 4. Filtering in Pandas
          • 5. Advance Filtering
          • 6. Aggregate Functions & Groupby
          • 7. Pivot Tables
          • 8. Datetime
          • 9. String Functions
Powered by GitBook
On this page
  • Project 1: Explanatory Data Analysis & Data Presentation (Movies Dataset)
  • Project Brief for Self-Coders
  • Data Import and first Inspection
  • Import Necessary Libraries for this Task:
  • Read the Movie Data
  • Getting Info About Data
  • Statistical Summary
  • The best and the worst movies...
  • The Best and Worst Movies ever
  • Filtering Columns responsible to determine best and worst movies
  • Create a column 'profit_musd' (revenue - budget)
  • Create a column 'return_musd' (revenue/budget)
  • Rename Columns in Something Meaningful to present it later in Graphs
  • Set Title as Index
  • Convert Our DataFrame into HTML (Poster , Title , Popularity')
  • Highest Rated Movies
  • Movies With Highest ROI
  • Create a Function to find Best and Worst Movies
  • Top 5 - Highest Revenue
  • Top 5 - Highest Budget
  • Top 5 - Highest Profit
  • Top 5 - Highest ROI
  • Top 5 - Lowest Profit
  • Top 5 - Most Popular
  • Find Your Next Movie
  • Movies With Uma Thurman and Quentin Tarantino
  • Most Successful Pixar Movies from 2010 to 2015 (Highest Revenue)
  • Action Or Thriller Movie with Original Language English with minimum rating of 7.5(Most Recent)
  • Most Common Words in Titles and Taglines
  • Are Franchises More Successful ?
  • All Franchises
  • Count Franchise/Standalone Movies
  • Revenue (Franchise Vs Standalone Movies)
  • Budget (Franchise Vs Standalone Movies)
  • Average Rating (Franchise Vs Standalone Movies)
  • Popularity (Franchise Vs Standalone Movies)
  • Return Of Investments (Franchise Vs Standalone Movies)
  • Aggregate Functions
  • Most Successful Franchise ?
  • Largest Franchise
  • Highest Revenue
  • Highest Average Revenue
  • Most Expensive Franchises (Budget)
  • Highest Rated Franchises
  • Most Successful Directors
  • Highest Rated Movies
  • To find succesful director in any specific genre i.e. Action
  • To Find Successful Actors
  • Convert Series to DataFrame
  • Rename column label from 0 to 'Actor'
  • Number of Unique Actors
  • Actors with highest number of movies
  • Actors who have acted in more than 10 films
  • Highest Revenue
  • Highest Number of Films
  • Highest Rating
  • Popularity
  • Find Common Actors in the top lists
  • Find Duplicate Records of Actors
  • What are the most successful/popular genres? Has this changed over time (e.g. 80ths vs. 90ths)?
  • Genre With Highest Revenue
  • Genre With Highest Rating
  • Genre With Highest Popularity
  • Highest revenue generated by Genre in 90's
  • Popularity of Genres in Nineties
  • Highest revenue generated by Genre in 20's
  • Popularity of Genres in Twenties
  • Find Most Successful Production Companies On your own ?
  1. Python For Data Analytics
  2. 2.Data Analytics
  3. 1.Pandas
  4. 3.Pandas Projects

IMDB - Dataset Analysis - Basic

PreviousRetail Sales Step By StepNext2. Pandas Assignments

Last updated 2 years ago

Project 1: Explanatory Data Analysis & Data Presentation (Movies Dataset)

Project Brief for Self-Coders

Here you´ll have the opportunity to code major parts of Project 1 on your own. If you need any help or inspiration, have a look at the Videos or the Jupyter Notebook with the full code.

Keep in mind that it´s all about getting the right results/conclusions. It´s not about finding the identical code. Things can be coded in many different ways. Even if you come to the same conclusions, it´s very unlikely that we have the very same code.

Data Import and first Inspection

  1. Import the movies dataset from the CSV file "movies_complete.csv". Inspect the data.

Some additional information on Features/Columns:

  • id: The ID of the movie (clear/unique identifier).

  • title: The Official Title of the movie.

  • tagline: The tagline of the movie.

  • release_date: Theatrical Release Date of the movie.

  • genres: Genres associated with the movie.

  • belongs_to_collection: Gives information on the movie series/franchise the particular film belongs to.

  • original_language: The language in which the movie was originally shot in.

  • budget_musd: The budget of the movie in million dollars.

  • revenue_musd: The total revenue of the movie in million dollars.

  • production_companies: Production companies involved with the making of the movie.

  • production_countries: Countries where the movie was shot/produced in.

  • vote_count: The number of votes by users, as counted by TMDB.

  • vote_average: The average rating of the movie.

  • popularity: The Popularity Score assigned by TMDB.

  • runtime: The runtime of the movie in minutes.

  • overview: A brief blurb of the movie.

  • spoken_languages: Spoken languages in the film.

  • poster_path: The URL of the poster image.

  • cast: (Main) Actors appearing in the movie.

  • cast_size: number of Actors appearing in the movie.

  • director: Director of the movie.

  • crew_size: Size of the film crew (incl. director, excl. actors).

Import Necessary Libraries for this Task:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_columns=None   # To show all columns
pd.options.display.float_format = '{:.2f}'.format

Read the Movie Data

df = pd.read_csv('movies_complete.csv',parse_dates=['release_date'])  #parse date will convert release_date column into Datetime.
df

44691 rows × 22 columns

Getting Info About Data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44691 entries, 0 to 44690
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     44691 non-null  int64         
 1   title                  44691 non-null  object        
 2   tagline                20284 non-null  object        
 3   release_date           44657 non-null  datetime64[ns]
 4   genres                 42586 non-null  object        
 5   belongs_to_collection  4463 non-null   object        
 6   original_language      44681 non-null  object        
 7   budget_musd            8854 non-null   float64       
 8   revenue_musd           7385 non-null   float64       
 9   production_companies   33356 non-null  object        
 10  production_countries   38835 non-null  object        
 11  vote_count             44691 non-null  float64       
 12  vote_average           42077 non-null  float64       
 13  popularity             44691 non-null  float64       
 14  runtime                43179 non-null  float64       
 15  overview               43740 non-null  object        
 16  spoken_languages       41094 non-null  object        
 17  poster_path            44467 non-null  object        
 18  cast                   42502 non-null  object        
 19  cast_size              44691 non-null  int64         
 20  crew_size              44691 non-null  int64         
 21  director               43960 non-null  object        
dtypes: datetime64[ns](1), float64(6), int64(3), object(12)
memory usage: 7.5+ MB

Statistical Summary

df.describe()
id
budget_musd
revenue_musd
vote_count
vote_average
popularity
runtime
cast_size
crew_size

count

44691.00

8854.00

7385.00

44691.00

42077.00

44691.00

43179.00

44691.00

44691.00

mean

107186.24

21.67

68.97

111.65

6.00

2.96

97.57

12.48

10.31

std

111806.36

34.36

146.61

495.32

1.28

6.04

34.65

12.12

15.89

min

2.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

25%

26033.50

2.00

2.41

3.00

5.30

0.40

86.00

6.00

2.00

50%

59110.00

8.20

16.87

10.00

6.10

1.15

95.00

10.00

6.00

75%

154251.00

25.00

67.64

35.00

6.80

3.77

107.00

15.00

12.00

max

469172.00

380.00

2787.97

14075.00

10.00

547.49

1256.00

313.00

435.00

df.hist(figsize=(20,12),bins=100)

The best and the worst movies...

  1. Filter the Dataset and find the best/worst n Movies with the

  • Highest Revenue

  • Highest Budget

  • Highest Profit (=Revenue - Budget)

  • Lowest Profit (=Revenue - Budget)

  • Highest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)

  • Lowest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)

  • Highest number of Votes

  • Highest Rating (only movies with 10 or more Ratings)

  • Lowest Rating (only movies with 10 or more Ratings)

  • Highest Popularity

The Best and Worst Movies ever

We will try to filter our data based on criteria , that is responsible to determine the best and worst movies ever. We are also going to import HTML , as we will convert our analysis to a beautiful web page. To do this, all you need to do is to import HTML.

from IPython.display import HTML   # we are using this to try to present our data in good looking website format

Filtering Columns responsible to determine best and worst movies

edf = df[
    ['poster_path','title','budget_musd','revenue_musd','vote_count','vote_average','popularity']
].copy()

Create a column 'profit_musd' (revenue - budget)

edf['profit_musd'] = edf['revenue_musd'] - edf['budget_musd']
edf.head()

Create a column 'return_musd' (revenue/budget)

edf['return_musd'] = edf['revenue_musd']/edf['budget_musd']
edf.head()

Rename Columns in Something Meaningful to present it later in Graphs

edf.columns = ['','Title','Budget','Revenue','Votes','Average Rating','Popularity','Profit','Return']
edf.head()

Set Title as Index

edf.set_index('Title',inplace=True)

Convert Our DataFrame into HTML (Poster , Title , Popularity')

subset = df[['poster_path','title','popularity']].head()
HTML(subset.to_html(escape=False))   # convert dataframe into HTML

Highest Rated Movies

HTML(edf.sort_values(by='Average Rating',ascending=False).head().to_html(escape=False))

Now this approach does not make sense as you can see there is only one vote and it is not sufficient enough to judge on rating. So let us find median of Votes and consider it to be the minimum number of votes to be given to any movie.

minimum_votes = edf['Votes'].median()
minimum_votes
edf.loc[edf['Votes']>=minimum_votes].sort_values(by='Average Rating',ascending=False).head()

Movies With Highest ROI

Here also we will keep above approach , as there are few movies with close to zero budget , we must exclude them and so let us find the median of budget.

minimum_budget = edf['Budget'].median()
minimum_budget
edf.loc[edf['Budget']>=minimum_budget].sort_values(by='Return',ascending=False).head()

Before moving ahead , let us fill all na values of Budget and Votes with 0.

edf['Budget'].fillna(0,inplace=True)
edf['Votes'].fillna(0,inplace=True)

Create a Function to find Best and Worst Movies

def find_movies(n,by,ascending=False,min_bud=minimum_budget,min_votes=minimum_votes):
   return edf.loc[(edf['Budget']>=min_bud) & (edf['Votes']>=min_votes)].sort_values(by=by,ascending=ascending).head(n)

Top 5 - Highest Revenue

HTML(find_movies(5,by='Revenue').to_html(escape=False))

Top 5 - Highest Budget

HTML(find_movies(n=5,by='Budget').to_html(escape=False))

Top 5 - Highest Profit

HTML(find_movies(5,'Profit').to_html(escape=False))

Top 5 - Highest ROI

HTML(find_movies(5,by='Return').to_html(escape=False))

Top 5 - Lowest Profit

HTML(find_movies(5,by='Profit',ascending=True).to_html(escape=False))

Top 5 - Most Popular

HTML(find_movies(5,'Popularity').to_html(escape=False))

Find Your Next Movie

  • Science Fiction Action Movie With Bruce Willis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_columns=None   # To show all columns
pd.options.display.float_format = '{:.2f}'.format

df = pd.read_csv('movies_complete.csv',parse_dates=['release_date'])  #parse date will convert release_date column into Datetime.
df.head()

Filtering Genres (Science Fiction and Action)


filter_genres = df['genres'].str.contains('Science Fiction') & df['genres'].str.contains('Action')

Filtering Bruce Willis Movies

# Filtering Bruce Willis Movies
filter_cast = df['cast'].str.contains('Bruce Willis')

Filtering

edf = df.loc[filter_genres & filter_cast].sort_values(by='vote_average',ascending=False).head(5)
edf = edf[['title','poster_path','vote_average']]
HTML(edf.to_html(escape=False))

Movies With Uma Thurman and Quentin Tarantino

filter_cast = df['cast'].str.contains('Uma Thurman')  # Actor/ Cast
filter_director = df['director'] == 'Quentin Tarantino'  # Director

fdf = df.loc[(filter_cast) & (filter_director)].sort_values(by='runtime')
HTML(fdf[['title','poster_path','vote_average','runtime']].to_html(escape=False))

Most Successful Pixar Movies from 2010 to 2015 (Highest Revenue)

  • Filtering Pixar Movies

filter_production = df.loc[df['production_companies'].str.contains('Pixar').fillna(False)]
filter_production.head()

Filtering Release Date

successful = filter_production.loc[filter_production['release_date'].between('2010-01-01','2015-12-31')]
successful.head()

Result

successful = successful.sort_values(by='revenue_musd')
HTML(successful[['title','poster_path','revenue_musd','release_date']].head().to_html(escape=False))

Action Or Thriller Movie with Original Language English with minimum rating of 7.5(Most Recent)

Filtering Genre (Action Or Thriller)

mask_genre = df['genres'].str.contains('Action') | df['genres'].str.contains('Thriller')

Filtering Language

mask_language = df['original_language']=='en'

Filtering Vote (greater than 10)

mask_vote_count = df['vote_count']>=10

Filter Average Rating

mask_rating = df['vote_average']>=7.5

Filter:

next_mov = df.loc[mask_genre & mask_language & mask_rating & mask_vote_count ].sort_values(by='release_date',ascending=False).head()
HTML(next_mov[['title','poster_path','release_date']].to_html(escape=False))

Most Common Words in Titles and Taglines

from wordcloud import WordCloud
titles =df['title'].dropna()
taglines = df['tagline'].dropna()
titles = '.'.join(titles)
taglines = '.'.join(taglines)
title_wordcloud = WordCloud(background_color='white', height=2000, width=4000,max_words=200).generate(titles)
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud, interpolation= "bilinear")
plt.axis('off')y
plt.show()
tagline_wordcloud = WordCloud(background_color='white', height=2000, width=4000,max_words=200).generate(taglines)
plt.figure(figsize=(16,8))
plt.imshow(tagline_wordcloud, interpolation= "bilinear")
plt.axis('off')
plt.show()

Are Franchises More Successful ?

All Franchises

df['franchise'] = df['belongs_to_collection'].notna()

Count Franchise/Standalone Movies

df['franchise'].value_counts()
False    40228
True      4463
Name: franchise, dtype: int64

Revenue (Franchise Vs Standalone Movies)

df.groupby('franchise')['revenue_musd'].mean()
franchise
False    44.74
True    165.71
Name: revenue_musd, dtype: float64

Budget (Franchise Vs Standalone Movies)

df.groupby('franchise')['budget_musd'].mean()
franchise
False   18.05
True    38.32
Name: budget_musd, dtype: float64

Average Rating (Franchise Vs Standalone Movies)

df.groupby('franchise')['vote_average'].mean()
franchise
False   6.01
True    5.96
Name: vote_average, dtype: float64

Popularity (Franchise Vs Standalone Movies)

df.groupby('franchise')['popularity'].mean()
franchise
False   2.59
True    6.25
Name: popularity, dtype: float64

Return Of Investments (Franchise Vs Standalone Movies)

df['roi'] = df['revenue_musd']/df['budget_musd']
df.groupby('franchise')['roi'].median()
franchise
False   1.62
True    3.71
Name: roi, dtype: float64

Aggregate Functions

We will use aggregate functions to calculate all necessary info about Franchise.

franchises = df.groupby('franchise').agg({
    'title':'count',
    'revenue_musd':['sum','mean'],
    'budget_musd':['sum','mean'],
    'roi':'median',
    'vote_average':'mean',
    'popularity':'mean',
    'vote_count':['sum','mean']
})
franchises

Most Successful Franchise ?

df['belongs_to_collection'].dropna()
0                  Toy Story Collection
2             Grumpy Old Men Collection
4        Father of the Bride Collection
9                 James Bond Collection
12                     Balto Collection
                      ...              
44582           The Carry On Collection
44585           The Carry On Collection
44596           The Carry On Collection
44598    DC Super Hero Girls Collection
44609              Red Lotus Collection
Name: belongs_to_collection, Length: 4463, dtype: object

Largest Franchise

So we can use sort_values to get the maximum number of count of a movie.

franchises.sort_values(by=('title','count'),ascending=False)

We can also use nlargest to get the n numbers of big franchises.

franchises.nlargest(5,('title','count'))

Highest Revenue

franchises.sort_values(by=('revenue_musd','sum'),ascending=False).head(20)
title
revenue_musd
budget_musd
roi
vote_average
popularity
vote_count

count

sum

mean

sum

mean

median

mean

mean

sum

mean

belongs_to_collection

Harry Potter Collection

8

7707.37

963.42

1280.00

160.00

6.17

7.54

26.25

47866.00

5983.25

Star Wars Collection

8

7434.49

929.31

854.35

106.79

8.24

7.38

23.41

43443.00

5430.38

James Bond Collection

26

7106.97

273.35

1539.65

59.22

6.13

6.34

13.45

33392.00

1284.31

The Fast and the Furious Collection

8

5125.10

640.64

1009.00

126.12

4.94

6.66

10.80

25576.00

3197.00

Pirates of the Caribbean Collection

5

4521.58

904.32

1250.00

250.00

3.45

6.88

53.97

25080.00

5016.00

Transformers Collection

5

4366.10

873.22

965.00

193.00

5.20

6.14

14.43

15232.00

3046.40

Despicable Me Collection

6

3691.07

922.77

299.00

74.75

12.76

6.78

106.72

18248.00

3041.33

The Twilight Collection

5

3342.11

668.42

385.00

77.00

10.27

5.84

29.50

13851.00

2770.20

Ice Age Collection

5

3216.71

643.34

429.00

85.80

8.26

6.38

16.08

13219.00

2643.80

Jurassic Park Collection

4

3031.48

757.87

379.00

94.75

7.03

6.50

10.77

18435.00

4608.75

Shrek Collection

5

2955.81

738.95

535.00

133.75

5.56

6.46

12.97

11721.00

2344.20

The Hunger Games Collection

4

2944.16

736.04

490.00

122.50

6.27

6.88

54.77

26174.00

6543.50

The Hobbit Collection

3

2935.52

978.51

750.00

250.00

3.83

7.23

25.21

17944.00

5981.33

The Avengers Collection

2

2924.96

1462.48

500.00

250.00

5.96

7.35

63.63

18908.00

9454.00

The Lord of the Rings Collection

3

2916.54

972.18

266.00

88.67

11.73

8.03

30.27

24759.00

8253.00

X-Men Collection

6

2808.83

468.14

983.00

163.83

3.02

6.82

9.71

27563.00

4593.83

Avatar Collection

1

2787.97

2787.97

237.00

237.00

11.76

7.20

185.07

12114.00

12114.00

Mission: Impossible Collection

5

2778.98

555.80

650.00

130.00

4.55

6.60

16.51

14005.00

2801.00

Spider-Man Collection

3

2496.35

832.12

597.00

199.00

3.92

6.47

22.62

13517.00

4505.67

The Dark Knight Collection

3

2463.72

821.24

585.00

195.00

4.34

7.80

57.42

29043.00

9681.00

Can you do it with nlargest ???

Highest Average Revenue

franchises.sort_values(by=('revenue_musd','mean'),ascending=False).head(20)

Most Expensive Franchises (Budget)

franchises.nlargest(20,('budget_musd','sum')).head(20)

Highest Rated Franchises

franchises.loc[franchises[('vote_count','mean')]>=1000].nlargest(10,('vote_average','me

Most Successful Directors

Most Number Of Movies (top 5)

df['director'].value_counts().head()
John Ford           66
Michael Curtiz      65
Werner Herzog       54
Alfred Hitchcock    53
Georges Méliès      49
Name: director, dtype: int64

Highest Revenues By Directors

df.groupby('director')['revenue_musd'].sum().sort_values(ascending=False).head()
director
Steven Spielberg   9256.62
Peter Jackson      6528.24
Michael Bay        6437.47
James Cameron      5900.61
David Yates        5334.56
Name: revenue_musd, dtype: float64

Highest Number of Franchises directed by Directors

df.loc[df['belongs_to_collection'].notna()]['director'].value_counts().head()
Gerald Thomas       25
William Beaudine    19
Ere Kokkonen        17
Kunihiko Yuyama     15
Robert Rodriguez    13
Name: director, dtype: int64

Aggregate Functions

directors = df.groupby('director').agg({
    'title' : 'count',
    'vote_count':'sum',
    'vote_average' : 'mean'
})
directors.head()

Highest Rated Movies

directors.loc[(directors['vote_count']>10000) & (directors['title']>10)].nlargest(20,'vote_ave
title
vote_count
vote_average

director

Hayao Miyazaki

14

14700.00

7.70

Christopher Nolan

11

67344.00

7.62

Martin Scorsese

39

35541.00

7.22

Peter Jackson

13

47571.00

7.14

Joel Coen

17

18139.00

7.02

James Cameron

11

33736.00

6.93

Stanley Kubrick

16

18214.00

6.91

Steven Spielberg

33

62266.00

6.89

Danny Boyle

14

16504.00

6.87

Robert Zemeckis

19

37666.00

6.79

Terry Gilliam

14

10049.00

6.76

Tim Burton

21

36922.00

6.73

Ang Lee

14

11164.00

6.71

Antoine Fuqua

12

15519.00

6.71

Woody Allen

49

15512.00

6.69

Clint Eastwood

35

24001.00

6.69

Alfred Hitchcock

53

12772.00

6.64

Ridley Scott

24

43083.00

6.60

Kenneth Branagh

14

11275.00

6.59

Luc Besson

16

19627.00

6.53

To find succesful director in any specific genre i.e. Action

edf = df[df['genres'].notna()]
edf[edf['genres'].str.contains('Horror')].groupby('director')['revenue_musd'].sum().sort_values(ascending=False).head(10)
director
Paul W.S. Anderson    982.29
James Wan             861.31
Wes Craven            834.93
Francis Lawrence      816.23
Ridley Scott          689.00
Marc Forster          531.87
Steven Spielberg      500.10
William Friedkin      466.40
Darren Lynn Bousman   456.34
M. Night Shyamalan    375.37
Name: revenue_musd, dtype: float64

To Find Successful Actors

df['cast']
0        Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...
1        Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...
2        Walter Matthau|Jack Lemmon|Ann-Margret|Sophia ...
3        Whitney Houston|Angela Bassett|Loretta Devine|...
4        Steve Martin|Diane Keaton|Martin Short|Kimberl...
                               ...                        
44686              Leila Hatami|Kourosh Tahami|Elham Korda
44687    Angel Aquino|Perry Dizon|Hazel Orencio|Joel To...
44688    Erika Eleniak|Adam Baldwin|Julie du Page|James...
44689    Iwan Mosschuchin|Nathalie Lissenko|Pavel Pavlo...
44690                                                  NaN
Name: cast, Length: 44691, dtype: object

Set id as index

df.set_index('id',inplace=True)

Split Actor Names to a DataFrame

actors = df['cast'].str.split('|',expand=True)
actors.head()
actors.stack().reset_index()
actors = actors.stack().reset_index(drop=True,level=1) #only remove the given level by default remove all levels
actors

Convert Series to DataFrame

actors = actors.to_frame()
actors

Rename column label from 0 to 'Actor'

actors.columns=['Actor']
actors.head()

Merge Dataframe with Actors DataFrame

actors = actors.merge(df[['title','revenue_musd','vote_average','popularity']],on='id')
actors

Number of Unique Actors

actors['Actor'].nunique()

Actors with highest number of movies

actors['Actor'].value_counts().head(5)
Bess Flowers         240
Christopher Lee      148
John Wayne           125
Samuel L. Jackson    122
Michael Caine        110
Name: Actor, dtype: int64
# This is known as label aggregation
data = actors.groupby('Actor').agg(
        total_revenue = ('revenue_musd','sum'),
        average_rating = ('vote_average','mean'),
        average_popularity = ('popularity','mean'),
        movies = ('title','count'),
        average_revenue=('revenue_musd','mean')
)
data

Actors who have acted in more than 10 films

data = data.loc[data['movies']>=10]
data

Highest Revenue

actors_with_largest_revenue = data.nlargest(20,'total_revenue')

Highest Number of Films

actors_with_highest_movies = data.nlargest(20,'movies')

Highest Rating

actors_highest_rating = data.nlargest(20,'average_rating')

Popularity

actor_with_popularity = data.nlargest(20,'average_popularity')

Find Common Actors in the top lists

Concat all the dataframes

top_list = pd.concat([actors_with_largest_revenue,actors_with_highest_movies,actors_highest_rating,actor_with_popularity])
top_list

Find Duplicate Records of Actors

top_list = top_list.reset_index()   # Resets Index
top_list.loc[top_list.duplicated(subset='Actor')]

What are the most successful/popular genres? Has this changed over time (e.g. 80ths vs. 90ths)?

gen = df['genres'].str.split('|',expand=True)
gen = gen.stack().reset_index(drop=True,level=1).to_frame()
gen.columns = ['gen']
gen

Merge gen and original dataframe

genre = gen.merge(df[['title','revenue_musd','vote_average','popularity','release_date']],on='id')
genre

Aggregate Functions

data = genre.groupby('gen').agg(
    {
        'revenue_musd' : ['sum','mean'],
        'vote_average' : 'mean',
        'popularity' : 'mean'

    }
)

Genre With Highest Revenue

data.nlargest(5,('revenue_musd','sum'))
revenue_musd
vote_average
popularity

sum

mean

mean

mean

gen

Action

201388.05

116.07

5.75

4.78

Adventure

199978.67

179.19

5.88

6.00

Comedy

166845.05

64.10

5.97

3.25

Drama

160754.36

43.83

6.17

3.03

Thriller

129724.55

69.52

5.74

4.51

Genre With Highest Rating

data.nlargest(5,('vote_average','mean'))
revenue_musd
vote_average
popularity

sum

mean

mean

mean

gen

Documentary

1449.11

6.65

6.66

0.96

Animation

67432.97

176.99

6.45

4.75

History

14902.20

50.52

6.41

3.48

Music

13370.29

50.08

6.33

2.56

War

15910.46

65.48

6.29

3.35

Genre With Highest Popularity

data.nlargest(5,('popularity','mean'))
revenue_musd
vote_average
popularity

sum

mean

mean

mean

gen

Adventure

199978.67

179.19

5.88

6.00

Fantasy

103920.15

166.01

5.93

5.36

Science Fiction

97847.96

131.52

5.48

5.00

Action

201388.05

116.07

5.75

4.78

Family

107076.78

159.10

5.93

4.77

Highest revenue generated by Genre in 90's

nineties = genre.loc[genre['release_date'].between('1900-01-01','1999-12-31')]
nineties

Popularity of Genres in Nineties

nineties.groupby('gen')['popularity'].mean().head(5)
gen
Action      3.32
Adventure   3.80
Animation   2.91
Comedy      2.63
Crime       3.22
Name: popularity, dtype: float64

Try to find Total revenue and average rating too.

Highest revenue generated by Genre in 20's

twenties = genre.loc[genre['release_date'].between('2000-01-01','2022-12-31')]
twenties

Popularity of Genres in Twenties

twenties.groupby('gen')['popularity'].mean().head(5)
gen
Action      6.23
Adventure   8.60
Animation   5.92
Comedy      3.81
Crime       5.30
Name: popularity, dtype: float64

Find Most Successful Production Companies On your own ?

16MB
imdb-dataset_explanatory analysis.zip
archive