Due March 7 (Monday), 2022 by 11:59pm. Submit by editing this file and then uploading to Canvas.
Please rename this file with "LastName_FirstName_hw5"
The purpose of this assignment is to practice accessing and analyzing data in a database.
Notes:
Put your answers in new cells after each exercise. You can make as many new cells as you like. Use code cells for code and Markdown cells for text. Answer all questions with complete sentences.
Your code should be readable; writing a piece of code should be compared to writing a page of a book. Adopt the one-statement-per-line rule. The lenghth of your code should not exceed the maximum length of each cell for display. If your code is too long, you should split it into multiple lines to improve readability. (You will lose one point for each line that does not follow the rule.)
To help understand and maintain code, you should always add comments to explain your code. Use the hash symbol (#) to start writing a comment (homework without any comments will automatically receive 0 points). If you are writing a function, consider using docstring to add explanation.
Submit your final work with a .pdf (or a .html) file to Canvas. To convert a .ipynb file to a .pdf/.html file, go to "File", click "Download as", and then click "PDF via LaTeX"/"HTML".
Please do not clear your output so that we can see your answers without running all of the cells.
For full credit, do most of your work in SQL. Subsetting, sorting, grouping, aggregation, and merging can all be done in SQL. It is okay to use Python code for parts of your analysis that are difficult or impossible to do in SQL, such as creating visualizations or computing statistics (other than the mean).
In this assignment, you'll use SQL queries and pandas to explore data in Lahman's Baseball Database, which contains "complete batting and pitching statistics from 1871 to 2020, plus fielding statistics, standings, team stats, managerial records, post-season data, and more."
We use the 2019 version for this homework. You can find the database in SQLite format in the hw5 folder on Piazza.
Documentation for the database, including a description of all tables, is in the readme2019.txt
file included in the hw5 folder.
Exercises:
(3 points) From 2006 to 2016, which 5 teams had the highest payrolls? Payroll means the sum of the salaries for all people on the team in a given year.
(3 points) Is there any general trend in payrolls over this time period? (Hint: make a plot to find the trend)
(2 points) There are at least 2 anomalies in the payrolls over this time period. What are they, and can you find information (online or elsewhere) to explain them?
(2 points) How do the top 5 teams compare to other teams?
# From 2006 to 2016, determine which 5 teams had the highest payrolls, where payroll means the sum of the salaries for all people on the
# team in a given year.
import numpy as np # First, we import all the required packages.
import pandas as pd # First, we import all the required packages.
import sqlite3 as sql # First, we import all the required packages.
db = sql.connect("lahmansbaseballdb.sqlite") # First, we apply the sql package to connect
# we can first check to display the database.
db
<sqlite3.Connection at 0x1c80df64300>
cur = db.execute("SELECT * FROM sqlite_master")
# SQLite databases store metadata in a special table called sqlite_master.
# We can use sqlite_master to find out the names of the other tables in the database.
cur = db.execute("SELECT * FROM sqlite_master") #Using the execute function
rows = cur.fetchall() # To gain access to the rows names from the sql database. #15210 #15939
pd.read_sql("SELECT * FROM sqlite_master", db).head() # We can gain access to the sql database.
type | name | tbl_name | rootpage | sql | |
---|---|---|---|---|---|
0 | table | allstarfull | allstarfull | 2 | CREATE TABLE "allstarfull" (\n\t"ID" INTEGER N... |
1 | index | allstarfull_lgID | allstarfull | 3 | CREATE INDEX "allstarfull_lgID" ON "allstarful... |
2 | index | allstarfull_playerID | allstarfull | 4 | CREATE UNIQUE INDEX "allstarfull_playerID" ON ... |
3 | index | allstarfull_team_ID | allstarfull | 5 | CREATE INDEX "allstarfull_team_ID" ON "allstar... |
4 | table | appearances | appearances | 151 | CREATE TABLE "appearances" (\n\t"ID" INTEGER N... |
pd.read_sql("SELECT Name, Type FROM sqlite_master", db).head() # We can only read only name, and type from the spl database.
name | type | |
---|---|---|
0 | allstarfull | table |
1 | allstarfull_lgID | index |
2 | allstarfull_playerID | index |
3 | allstarfull_team_ID | index |
4 | appearances | table |
pd.read_sql("SELECT * FROM salaries;", db).head() # Then, we select the corresponding table ("Salaries")
ID | yearID | teamID | team_ID | lgID | playerID | salary | |
---|---|---|---|---|---|---|---|
0 | 1 | 1985 | ATL | 1918 | NL | barkele01 | 870000.0 |
1 | 2 | 1985 | ATL | 1918 | NL | bedrost01 | 550000.0 |
2 | 3 | 1985 | ATL | 1918 | NL | benedbr01 | 545000.0 |
3 | 4 | 1985 | ATL | 1918 | NL | campri01 | 633333.0 |
4 | 5 | 1985 | ATL | 1918 | NL | ceronri01 | 625000.0 |
pd.read_sql("SELECT * FROM salaries WHERE yearID BETWEEN 2006 AND 2016;", db).head() # We then select the accurate corresponding time period.
ID | yearID | teamID | team_ID | lgID | playerID | salary | |
---|---|---|---|---|---|---|---|
0 | 17295 | 2006 | ARI | 2506 | NL | aquingr01 | 342000.0 |
1 | 17296 | 2006 | ARI | 2506 | NL | batismi01 | 4750000.0 |
2 | 17297 | 2006 | ARI | 2506 | NL | byrneer01 | 2250000.0 |
3 | 17298 | 2006 | ARI | 2506 | NL | clarkto02 | 1034000.0 |
4 | 17299 | 2006 | ARI | 2506 | NL | counscr01 | 1750000.0 |
# For the first line here, we select the sum of the salary and their corresponding teamID, and year ID from the Salaries database
# For the second line here, we specify the condition: Let the year Between 2006 and 2016.
# For the third line here, we then group by the yearID, and teamID.
# For the fourth line here, we order our result from the largest payroll to the smallest payroll.
import pandas as pd # Import the required package
pd.read_sql("""
SELECT SUM(salary)AS sum , teamID, yearID FROM salaries
WHERE yearID BETWEEN 2006 AND 2016
GROUP BY yearID, teamID
ORDER BY SUM(salary) DESC
""", db).head(20)
sum | teamID | yearID | |
---|---|---|---|
0 | 231978886.0 | NYA | 2013 |
1 | 223362196.0 | LAN | 2013 |
2 | 222997792.0 | NYA | 2016 |
3 | 221288380.0 | LAN | 2016 |
4 | 217014600.0 | LAN | 2014 |
5 | 215792000.0 | LAN | 2015 |
6 | 212751957.0 | NYA | 2015 |
7 | 207896789.0 | NYA | 2008 |
8 | 206333389.0 | NYA | 2010 |
9 | 202275028.0 | NYA | 2011 |
10 | 201449189.0 | NYA | 2009 |
11 | 197543907.0 | NYA | 2014 |
12 | 196522289.0 | NYA | 2012 |
13 | 194876481.0 | DET | 2016 |
14 | 194663079.0 | NYA | 2006 |
15 | 189259045.0 | NYA | 2007 |
16 | 188545761.0 | BOS | 2016 |
17 | 181103400.0 | BOS | 2015 |
18 | 180944967.0 | PHI | 2014 |
19 | 176038723.0 | TEX | 2016 |
pd.read_sql("SELECT * FROM teams WHERE teamID = 'NYA';", db).head() # Then, we select the corresponding table ("Salaries")
# We check the name of the teamID.
ID | yearID | lgID | teamID | franchID | divID | div_ID | teamRank | G | Ghome | ... | DP | FP | name | park | attendance | BPF | PPF | teamIDBR | teamIDlahman45 | teamIDretro | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 425 | 1903 | AL | NYA | NYY | None | NaN | 4 | 136 | 67 | ... | 87 | 0.953 | New York Highlanders | Hilltop Park | 211808 | 106 | 105 | NYY | NYA | NYA |
1 | 441 | 1904 | AL | NYA | NYY | None | NaN | 2 | 155 | 75 | ... | 90 | 0.958 | New York Highlanders | Hilltop Park | 438919 | 105 | 104 | NYY | NYA | NYA |
2 | 457 | 1905 | AL | NYA | NYY | None | NaN | 6 | 152 | 75 | ... | 88 | 0.952 | New York Highlanders | Hilltop Park | 309100 | 111 | 110 | NYY | NYA | NYA |
3 | 473 | 1906 | AL | NYA | NYY | None | NaN | 2 | 155 | 76 | ... | 69 | 0.957 | New York Highlanders | Hilltop Park | 434700 | 110 | 110 | NYY | NYA | NYA |
4 | 489 | 1907 | AL | NYA | NYY | None | NaN | 5 | 152 | 75 | ... | 79 | 0.947 | New York Highlanders | Hilltop Park | 350020 | 108 | 110 | NYY | NYA | NYA |
5 rows × 50 columns
pd.read_sql("SELECT * FROM teams WHERE teamID = 'PHI';", db).head() # Then, we select the corresponding table ("Salaries")
# We check the name of the teamID.
ID | yearID | lgID | teamID | franchID | divID | div_ID | teamRank | G | Ghome | ... | DP | FP | name | park | attendance | BPF | PPF | teamIDBR | teamIDlahman45 | teamIDretro | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 121 | 1883 | NL | PHI | PHI | None | NaN | 8 | 99 | NaN | ... | 62 | 0.858 | Philadelphia Quakers | Recreation Park | NaN | 91 | 98 | PHI | PHI | PHI |
1 | 146 | 1884 | NL | PHI | PHI | None | NaN | 6 | 113 | NaN | ... | 67 | 0.888 | Philadelphia Quakers | Recreation Park | NaN | 94 | 99 | PHI | PHI | PHI |
2 | 169 | 1885 | NL | PHI | PHI | None | NaN | 3 | 111 | NaN | ... | 66 | 0.905 | Philadelphia Quakers | Recreation Park | NaN | 97 | 98 | PHI | PHI | PHI |
3 | 185 | 1886 | NL | PHI | PHI | None | NaN | 4 | 119 | NaN | ... | 46 | 0.921 | Philadelphia Quakers | Recreation Park | NaN | 100 | 99 | PHI | PHI | PHI |
4 | 202 | 1887 | NL | PHI | PHI | None | NaN | 2 | 128 | NaN | ... | 76 | 0.912 | Philadelphia Quakers | Philadelphia Baseball Grounds | NaN | 106 | 104 | PHI | PHI | PHI |
5 rows × 50 columns
# For the first line here, we select the sum of the salary and their corresponding teamID, and year ID from the Salaries database
# For the second line here, we specify the condition: Let the year Between 2006 and 2016.
# For the third line here, we then group by the just the teamID.
# For the fourth line here, we order our result from the largest payroll to the smallest payroll.
import pandas as pd # Import the required package
pd.read_sql("""
SELECT SUM(salary)AS sum , teamID, yearID FROM salaries
WHERE yearID BETWEEN 2006 AND 2016
GROUP BY teamID
ORDER BY SUM(salary) DESC
""", db)
# This time, we cam show the entire output.
sum | teamID | yearID | |
---|---|---|---|
0 | 2.263671e+09 | NYA | 2006 |
1 | 1.675458e+09 | BOS | 2006 |
2 | 1.598053e+09 | LAN | 2006 |
3 | 1.457055e+09 | DET | 2006 |
4 | 1.399510e+09 | PHI | 2006 |
5 | 1.347061e+09 | LAA | 2006 |
6 | 1.315003e+09 | SFN | 2006 |
7 | 1.243140e+09 | CHN | 2006 |
8 | 1.215769e+09 | NYN | 2006 |
9 | 1.186187e+09 | CHA | 2006 |
10 | 1.151854e+09 | SLN | 2006 |
11 | 1.090018e+09 | SEA | 2006 |
12 | 1.085058e+09 | TEX | 2006 |
13 | 1.019353e+09 | TOR | 2006 |
14 | 1.009046e+09 | BAL | 2006 |
15 | 9.640186e+08 | WAS | 2006 |
16 | 9.565559e+08 | ATL | 2006 |
17 | 9.308869e+08 | MIN | 2006 |
18 | 9.236573e+08 | CIN | 2006 |
19 | 9.017237e+08 | MIL | 2006 |
20 | 8.677214e+08 | COL | 2006 |
21 | 8.122377e+08 | HOU | 2006 |
22 | 8.094887e+08 | KCA | 2006 |
23 | 7.865642e+08 | CLE | 2006 |
24 | 7.764995e+08 | ARI | 2006 |
25 | 7.450685e+08 | SDN | 2006 |
26 | 7.270517e+08 | OAK | 2006 |
27 | 6.724917e+08 | PIT | 2006 |
28 | 5.905886e+08 | TBA | 2006 |
29 | 3.388875e+08 | MIA | 2012 |
30 | 2.177977e+08 | FLO | 2006 |
# For the first line here, we select the sum of the salary and their corresponding teamID, and year ID from the Salaries database
# For the second line here, we specify the condition: Let the year Between 2006 and 2016, and team id = "NYA"
# For the third line here, we then group by the yearID, and teamID.
# For the fourth line here, we order our result from the largest payroll to the smallest payroll.
plot_HYA = pd.read_sql("""
SELECT SUM(salary)AS sum , teamID, yearID FROM salaries
WHERE yearID >= 2006 AND teamID = 'NYA'
GROUP BY yearID, teamID
""", db)
plot_HYA
sum | teamID | yearID | |
---|---|---|---|
0 | 194663079.0 | NYA | 2006 |
1 | 189259045.0 | NYA | 2007 |
2 | 207896789.0 | NYA | 2008 |
3 | 201449189.0 | NYA | 2009 |
4 | 206333389.0 | NYA | 2010 |
5 | 202275028.0 | NYA | 2011 |
6 | 196522289.0 | NYA | 2012 |
7 | 231978886.0 | NYA | 2013 |
8 | 197543907.0 | NYA | 2014 |
9 | 212751957.0 | NYA | 2015 |
10 | 222997792.0 | NYA | 2016 |
import pandas as pd
import matplotlib.pyplot as plt
plot_HYA.plot(x = 'yearID', y = "sum", kind = "line", title = "salary for NYA")
plt.show
<function matplotlib.pyplot.show(close=None, block=None)>
# For the first line here, we select the sum of the salary and their corresponding teamID, and year ID from the Salaries database
# For the second line here, we specify the condition: Let the year Between 2006 and 2016, and team id = "LAN"
# For the third line here, we then group by the yearID, and teamID.
# For the fourth line here, we order our result from the largest payroll to the smallest payroll.
plot_LAN = pd.read_sql("""
SELECT SUM(salary)AS sum , teamID, yearID FROM salaries
WHERE yearID >= 2006 AND teamID = 'LAN'
GROUP BY yearID, teamID
""", db)
plot_LAN
sum | teamID | yearID | |
---|---|---|---|
0 | 98447187.0 | LAN | 2006 |
1 | 108454524.0 | LAN | 2007 |
2 | 118588536.0 | LAN | 2008 |
3 | 100414592.0 | LAN | 2009 |
4 | 95358016.0 | LAN | 2010 |
5 | 104188999.0 | LAN | 2011 |
6 | 95143575.0 | LAN | 2012 |
7 | 223362196.0 | LAN | 2013 |
8 | 217014600.0 | LAN | 2014 |
9 | 215792000.0 | LAN | 2015 |
10 | 221288380.0 | LAN | 2016 |
import pandas as pd
import matplotlib.pyplot as plt
plot_LAN.plot(x = 'yearID', y = "sum", kind = "line", title = "salary for LAN")
plt.show
<function matplotlib.pyplot.show(close=None, block=None)>
# For the first line here, we select the sum of the salary and their corresponding teamID, and year ID from the Salaries database
# For the second line here, we specify the condition: Let the year Between 2006 and 2016, and team id = "DET"
# For the third line here, we then group by the yearID, and teamID.
# For the fourth line here, we order our result from the largest payroll to the smallest payroll.
plot_DET = pd.read_sql("""
SELECT SUM(salary)AS sum , teamID, yearID FROM salaries
WHERE yearID >= 2006 AND teamID = 'DET'
GROUP BY yearID, teamID
""", db)
plot_DET
sum | teamID | yearID | |
---|---|---|---|
0 | 82612866.0 | DET | 2006 |
1 | 94800369.0 | DET | 2007 |
2 | 137685196.0 | DET | 2008 |
3 | 115085145.0 | DET | 2009 |
4 | 122864928.0 | DET | 2010 |
5 | 105700231.0 | DET | 2011 |
6 | 132300000.0 | DET | 2012 |
7 | 145989500.0 | DET | 2013 |
8 | 152855500.0 | DET | 2014 |
9 | 172284750.0 | DET | 2015 |
10 | 194876481.0 | DET | 2016 |
import pandas as pd
import matplotlib.pyplot as plt
plot_DET.plot(x = 'yearID', y = "sum", kind = "line", title = "salary for DET") # We plot the payroll for DET from 2006 to 2016.
plt.show
<function matplotlib.pyplot.show(close=None, block=None)>
# For the first line here, we select the sum of the salary and their corresponding teamID, and year ID from the Salaries database
# For the second line here, we specify the condition: Let the year Between 2006 and 2016, and team id = "BOS"
# For the third line here, we then group by the yearID, and teamID.
# For the fourth line here, we order our result from the largest payroll to the smallest payroll.
plot_BOS = pd.read_sql("""
SELECT SUM(salary)AS sum , teamID, yearID FROM salaries
WHERE yearID >= 2006 AND teamID = 'BOS'
GROUP BY yearID, teamID
""", db)
plot_BOS
sum | teamID | yearID | |
---|---|---|---|
0 | 120099824.0 | BOS | 2006 |
1 | 143026214.0 | BOS | 2007 |
2 | 133390035.0 | BOS | 2008 |
3 | 121345999.0 | BOS | 2009 |
4 | 162447333.0 | BOS | 2010 |
5 | 161762475.0 | BOS | 2011 |
6 | 173186617.0 | BOS | 2012 |
7 | 151530000.0 | BOS | 2013 |
8 | 139019929.0 | BOS | 2014 |
9 | 181103400.0 | BOS | 2015 |
10 | 188545761.0 | BOS | 2016 |
import pandas as pd
import matplotlib.pyplot as plt
plot_BOS.plot(x = 'yearID', y = "sum", kind = "line", title = "salary for BOS") # We plot the payroll for BOS from 2006 to 2016.
plt.show
<function matplotlib.pyplot.show(close=None, block=None)>
# For the first line here, we select the sum of the salary and their corresponding teamID, and year ID from the Salaries database
# For the second line here, we specify the condition: Let the year Between 2006 and 2016, and team id = "DET"
# For the third line here, we then group by the yearID, and teamID.
# For the fourth line here, we order our result from the largest payroll to the smallest payroll.
plot_PHI = pd.read_sql("""
SELECT SUM(salary)AS sum , teamID, yearID FROM salaries
WHERE yearID >= 2006 AND teamID = 'PHI'
GROUP BY yearID, teamID
""", db)
plot_PHI
sum | teamID | yearID | |
---|---|---|---|
0 | 88273333.0 | PHI | 2006 |
1 | 89428213.0 | PHI | 2007 |
2 | 97879880.0 | PHI | 2008 |
3 | 113004046.0 | PHI | 2009 |
4 | 141928379.0 | PHI | 2010 |
5 | 172976379.0 | PHI | 2011 |
6 | 174538938.0 | PHI | 2012 |
7 | 169863189.0 | PHI | 2013 |
8 | 180944967.0 | PHI | 2014 |
9 | 111693000.0 | PHI | 2015 |
10 | 58980000.0 | PHI | 2016 |
import pandas as pd
import matplotlib.pyplot as plt
plot_PHI.plot(x = 'yearID', y = "sum", kind = "line", title = "salary for PHI") # We plot the payroll for PHI from 2006 to 2016.
plt.show
<function matplotlib.pyplot.show(close=None, block=None)>
import pandas as pd
import matplotlib.pyplot as plt
sum_HYA = plot_HYA["sum"] # Extract out the sum value for the HYA
sum_LAN = plot_LAN["sum"] # Extract out the sum value for the LAN
sum_DET = plot_PHI["sum"] # Extract out the sum value for the PHI
sum_BOS = plot_BOS["sum"] # Extract out the sum value for the BOS
sum_PHI = plot_PHI["sum"] # Extract out the sum value for the PHI
time = plot_BOS["yearID"] # Extract out the xvalue for the all five teams.
plt.plot(time, sum_HYA, label = 'salary for NYA' )
plt.plot(time, sum_LAN, label = 'salary for LAN' )
plt.plot(time, sum_DET, label = 'salary for DET' )
plt.plot(time, sum_BOS, label = 'salary for BOS' )
plt.plot(time, sum_PHI, label = 'salary for PHI' )
plt.legend()
plt.xlabel("time")
plt.ylabel("payroll")
plt.title("Payroll for five highest team from 2006 to 2016 trend")
#plot_BOS.plot(x = 'yearID', y = "sum", kind = "line", label = "salary for BOS") # We plot the payroll for BOS from 2006 to 2016.
#plot_HYA.plot(x = 'yearID', y = "sum", kind = "line", label = "salary for NYA")
#plot_LAN.plot(x = 'yearID', y = "sum", kind = "line", label = "salary for LAN")
#plot_DET.plot(x = 'yearID', y = "sum", kind = "line", label = "salary for DET") # We plot the payroll for DET from 2006 to 2016.
#plot_PHI.plot(x = 'yearID', y = "sum", kind = "line", label = "salary for PHI") # We plot the payroll for PHI from 2006 to 2016.
#plt.legend()
Text(0.5, 1.0, 'Payroll for five highest team from 2006 to 2016 trend')
# Reference: Picture from: https://www.baseball-reference.com/teams/LAD/2013.shtml
from IPython.display import Image # Import the Python images package.
Image("Reference _baseball.png") # We insert the image.
# Image from: https://en.wikipedia.org/wiki/Philadelphia_Phillies
Image("2016_PHI.png") # We insert the image.
# Reference: https://www.baseball-reference.com/teams/NYY/2012.shtml
Image("NYA_2012.png") # We insert the image.
# Reference: https://www.baseball-reference.com/teams/NYY/2013.shtml
Image("NYA_2013.png") # We insert the image.
# Reference: https://www.baseball-reference.com/teams/NYY/2014.shtml
Image("NYA_2014.png") # We insert the image.
# For the first line here, we select the salary for specific player,their corresponding teamID, and year ID from the Salaries database
# For the second line here, we specify the condition: Let the year Between 2006 and 2016.
# For the third line here, we order our result from the largest payroll to the smallest payroll.
import pandas as pd # Import the required package
dataset1 = pd.read_sql("""
SELECT playerID , teamID, yearID, salary FROM salaries
WHERE yearID BETWEEN 2006 AND 2016
ORDER BY salary DESC
""", db).head(20)
dataset1
playerID | teamID | yearID | salary | |
---|---|---|---|---|
0 | rodrial01 | NYA | 2009 | 33000000.0 |
1 | rodrial01 | NYA | 2010 | 33000000.0 |
2 | kershcl01 | LAN | 2016 | 33000000.0 |
3 | kershcl01 | LAN | 2015 | 32571000.0 |
4 | rodrial01 | NYA | 2011 | 32000000.0 |
5 | greinza01 | ARI | 2016 | 31799030.0 |
6 | rodrial01 | NYA | 2012 | 30000000.0 |
7 | priceda01 | BOS | 2016 | 30000000.0 |
8 | rodrial01 | NYA | 2013 | 29000000.0 |
9 | rodrial01 | NYA | 2008 | 28000000.0 |
10 | verlaju01 | DET | 2015 | 28000000.0 |
11 | cabremi01 | DET | 2016 | 28000000.0 |
12 | verlaju01 | DET | 2016 | 28000000.0 |
13 | cespeyo01 | NYN | 2016 | 27328046.0 |
14 | wellsve01 | LAA | 2011 | 26187500.0 |
15 | greinza01 | LAN | 2014 | 26000000.0 |
16 | hernafe02 | SEA | 2016 | 25857143.0 |
17 | leecl02 | PHI | 2013 | 25000000.0 |
18 | howarry01 | PHI | 2014 | 25000000.0 |
19 | leecl02 | PHI | 2014 | 25000000.0 |
team_id_20 = dataset1["teamID"] # Extract the list for the team id, which 20 highest paid players belong to
lst = team_id_20.tolist() # And we can convert it to list.
lst # Output the list
['NYA', 'NYA', 'LAN', 'LAN', 'NYA', 'ARI', 'NYA', 'BOS', 'NYA', 'NYA', 'DET', 'DET', 'DET', 'NYN', 'LAA', 'LAN', 'SEA', 'PHI', 'PHI', 'PHI']
top_5_list = ["NYA", "LAN", "BOS", "DET", "PHI"]
print("The list which contatin the top_5_team elements", [w for w in lst if w in top_5_list]) # Print the list
print("The length of the list which contatin the top_5_team elements", len([w for w in lst if w in top_5_list])) #Print the length
print("The percentage of the top 20 paid players in top_5_payroll team", len([w for w in lst if w in top_5_list])/len(lst)*100)
# Print out the percentage.
The list which contatin the top_5_team elements ['NYA', 'NYA', 'LAN', 'LAN', 'NYA', 'NYA', 'BOS', 'NYA', 'NYA', 'DET', 'DET', 'DET', 'LAN', 'PHI', 'PHI', 'PHI'] The length of the list which contatin the top_5_team elements 16 The percentage of the top 20 paid players in top_5_payroll team 80.0
# For the first line here, we select the salary for specific player,their corresponding teamID, and year ID from the Salaries database
# For the second line here, we specify the condition: Let the year Between 2006 and 2016.
# For the third line here, we order our result from the largest payroll to the smallest payroll.
import pandas as pd # Import the required package
dataset3 = pd.read_sql("""
SELECT playerID , teamID, yearID, salary FROM salaries
WHERE yearID BETWEEN 2006 AND 2016
ORDER BY salary DESC
""", db).head(10)
dataset3
playerID | teamID | yearID | salary | |
---|---|---|---|---|
0 | rodrial01 | NYA | 2009 | 33000000.0 |
1 | rodrial01 | NYA | 2010 | 33000000.0 |
2 | kershcl01 | LAN | 2016 | 33000000.0 |
3 | kershcl01 | LAN | 2015 | 32571000.0 |
4 | rodrial01 | NYA | 2011 | 32000000.0 |
5 | greinza01 | ARI | 2016 | 31799030.0 |
6 | rodrial01 | NYA | 2012 | 30000000.0 |
7 | priceda01 | BOS | 2016 | 30000000.0 |
8 | rodrial01 | NYA | 2013 | 29000000.0 |
9 | rodrial01 | NYA | 2008 | 28000000.0 |
lst_10 = dataset3["teamID"].tolist() # Extract the list for the team id, which 20 highest paid players belong to
top_5_list = ["NYA", "LAN", "BOS", "DET", "PHI"] # Top five team list.
print("The list which contatin the top_5_team elements", [w for w in lst_10 if w in top_5_list]) # Print the list
print("The length of the list which contatin the top_5_team elements", len([w for w in lst_10 if w in top_5_list])) #Print the length
print("The percentage of the top 20 paid players in top_5_payroll team", len([w for w in lst_10 if w in top_5_list])/len(lst_10)*100)
# Print out the percentage.
The list which contatin the top_5_team elements ['NYA', 'NYA', 'LAN', 'LAN', 'NYA', 'NYA', 'BOS', 'NYA', 'NYA'] The length of the list which contatin the top_5_team elements 9 The percentage of the top 20 paid players in top_5_payroll team 90.0
# For the first line here, we select the salary for specific player,their corresponding teamID, and year ID from the Salaries database
# For the second line here, we specify the condition: Let the year Between 2006 and 2016.
# For the third line here, we order our result from the largest payroll to the smallest payroll.
import pandas as pd # Import the required package
dataset2 = pd.read_sql("""
SELECT playerID , teamID, yearID, salary FROM salaries
WHERE yearID BETWEEN 2006 AND 2016
ORDER BY salary DESC
""", db).head(50)
dataset2
playerID | teamID | yearID | salary | |
---|---|---|---|---|
0 | rodrial01 | NYA | 2009 | 33000000.0 |
1 | rodrial01 | NYA | 2010 | 33000000.0 |
2 | kershcl01 | LAN | 2016 | 33000000.0 |
3 | kershcl01 | LAN | 2015 | 32571000.0 |
4 | rodrial01 | NYA | 2011 | 32000000.0 |
5 | greinza01 | ARI | 2016 | 31799030.0 |
6 | rodrial01 | NYA | 2012 | 30000000.0 |
7 | priceda01 | BOS | 2016 | 30000000.0 |
8 | rodrial01 | NYA | 2013 | 29000000.0 |
9 | rodrial01 | NYA | 2008 | 28000000.0 |
10 | verlaju01 | DET | 2015 | 28000000.0 |
11 | cabremi01 | DET | 2016 | 28000000.0 |
12 | verlaju01 | DET | 2016 | 28000000.0 |
13 | cespeyo01 | NYN | 2016 | 27328046.0 |
14 | wellsve01 | LAA | 2011 | 26187500.0 |
15 | greinza01 | LAN | 2014 | 26000000.0 |
16 | hernafe02 | SEA | 2016 | 25857143.0 |
17 | leecl02 | PHI | 2013 | 25000000.0 |
18 | howarry01 | PHI | 2014 | 25000000.0 |
19 | leecl02 | PHI | 2014 | 25000000.0 |
20 | greinza01 | LAN | 2015 | 25000000.0 |
21 | howarry01 | PHI | 2015 | 25000000.0 |
22 | lestejo01 | CHN | 2016 | 25000000.0 |
23 | pujolal01 | LAA | 2016 | 25000000.0 |
24 | sabatcc01 | NYA | 2016 | 25000000.0 |
25 | hernafe02 | SEA | 2015 | 24857000.0 |
26 | wellsve01 | NYA | 2013 | 24642857.0 |
27 | sabatcc01 | NYA | 2010 | 24285714.0 |
28 | sabatcc01 | NYA | 2011 | 24285714.0 |
29 | sabatcc01 | NYA | 2013 | 24285714.0 |
30 | wellsve01 | LAA | 2012 | 24187500.0 |
31 | canoro01 | SEA | 2014 | 24000000.0 |
32 | fieldpr01 | TEX | 2014 | 24000000.0 |
33 | pujolal01 | LAA | 2015 | 24000000.0 |
34 | canoro01 | SEA | 2015 | 24000000.0 |
35 | fieldpr01 | TEX | 2015 | 24000000.0 |
36 | canoro01 | SEA | 2016 | 24000000.0 |
37 | fieldpr01 | TEX | 2016 | 24000000.0 |
38 | hamiljo03 | TEX | 2016 | 24000000.0 |
39 | ramirma02 | LAN | 2009 | 23854494.0 |
40 | hamelco01 | PHI | 2015 | 23500000.0 |
41 | giambja01 | NYA | 2007 | 23428571.0 |
42 | giambja01 | NYA | 2008 | 23428571.0 |
43 | santajo01 | NYN | 2012 | 23145011.0 |
44 | teixema01 | NYA | 2011 | 23125000.0 |
45 | teixema01 | NYA | 2012 | 23125000.0 |
46 | teixema01 | NYA | 2013 | 23125000.0 |
47 | teixema01 | NYA | 2015 | 23125000.0 |
48 | teixema01 | NYA | 2016 | 23125000.0 |
49 | mauerjo01 | MIN | 2011 | 23000000.0 |
lst_50 = dataset2["teamID"].tolist() # Extract the list for the team id, which 20 highest paid players belong to
top_5_list = ["NYA", "LAN", "BOS", "DET", "PHI"] # Top five team list.
print("The list which contatin the top_5_team elements", [w for w in lst_50 if w in top_5_list]) # Print the list
print("The length of the list which contatin the top_5_team elements", len([w for w in lst_50 if w in top_5_list])) #Print the length
print("The percentage of the top 20 paid players in top_5_payroll team", len([w for w in lst_50 if w in top_5_list])/len(lst_50)*100)
# Print out the percentage.
The list which contatin the top_5_team elements ['NYA', 'NYA', 'LAN', 'LAN', 'NYA', 'NYA', 'BOS', 'NYA', 'NYA', 'DET', 'DET', 'DET', 'LAN', 'PHI', 'PHI', 'PHI', 'LAN', 'PHI', 'NYA', 'NYA', 'NYA', 'NYA', 'NYA', 'LAN', 'PHI', 'NYA', 'NYA', 'NYA', 'NYA', 'NYA', 'NYA', 'NYA'] The length of the list which contatin the top_5_team elements 32 The percentage of the top 20 paid players in top_5_payroll team 64.0