Mo_Wentao_hw5

Due March 7 (Monday), 2022 by 11:59pm. Submit by editing this file and then uploading to Canvas.

Please rename this file with "LastName_FirstName_hw5"

The purpose of this assignment is to practice accessing and analyzing data in a database.

Notes:

  1. Put your answers in new cells after each exercise. You can make as many new cells as you like. Use code cells for code and Markdown cells for text. Answer all questions with complete sentences.

  2. Your code should be readable; writing a piece of code should be compared to writing a page of a book. Adopt the one-statement-per-line rule. The lenghth of your code should not exceed the maximum length of each cell for display. If your code is too long, you should split it into multiple lines to improve readability. (You will lose one point for each line that does not follow the rule.)

  3. To help understand and maintain code, you should always add comments to explain your code. Use the hash symbol (#) to start writing a comment (homework without any comments will automatically receive 0 points). If you are writing a function, consider using docstring to add explanation.

  4. Submit your final work with a .pdf (or a .html) file to Canvas. To convert a .ipynb file to a .pdf/.html file, go to "File", click "Download as", and then click "PDF via LaTeX"/"HTML".

  5. Please do not clear your output so that we can see your answers without running all of the cells.

For full credit, do most of your work in SQL. Subsetting, sorting, grouping, aggregation, and merging can all be done in SQL. It is okay to use Python code for parts of your analysis that are difficult or impossible to do in SQL, such as creating visualizations or computing statistics (other than the mean).

Lahman's Baseball Database

In this assignment, you'll use SQL queries and pandas to explore data in Lahman's Baseball Database, which contains "complete batting and pitching statistics from 1871 to 2020, plus fielding statistics, standings, team stats, managerial records, post-season data, and more."

We use the 2019 version for this homework. You can find the database in SQLite format in the hw5 folder on Piazza.

Documentation for the database, including a description of all tables, is in the readme2019.txt file included in the hw5 folder.

Exercises:

  1. (3 points) From 2006 to 2016, which 5 teams had the highest payrolls? Payroll means the sum of the salaries for all people on the team in a given year.

  2. (3 points) Is there any general trend in payrolls over this time period? (Hint: make a plot to find the trend)

  3. (2 points) There are at least 2 anomalies in the payrolls over this time period. What are they, and can you find information (online or elsewhere) to explain them?

  4. (2 points) How do the top 5 teams compare to other teams?

Important, we have to group by our column in order to calculate the sum for all playeres in one team

Now, the answer for question can be concluded:

Based on the Payroll calculation,we found that the

Highest payroll: New York Highlanders in 2013

Second Highest payroll: Los Angeles Dodgers in 2013

Third Highest payroll: New York Highlanders in 2016

Fourth Highest payroll: Los Angeles Dodgers in 2016

Fifth Highest payroll: Los Angeles Dodgers in 2014

Five teams which receive higher payroll (Excluding the same team)

1st: NYA ; 2nd: LAN ; 3rd: DET; 4th: BOS ; 5th: PHI

Notice! We can try another type of output: We can also apply the sql package to output the highest payroll for all ten years( which is to say, from 2006 to 2016, excluding the situation if we have different years.)

In this view of output, we are actually showing that the total payroll in total 10 years(from 2006 - 2016):

1st: NYA, 2nd: BOS, 3rd: LAN, 4th: DET, 5th: PHI

Second Question: Is there any general trend in payrolls over this time period? (Hint: make a plot to find the trend)

In order to see more trend, we can plot two more teams that receive higher payroll.

We can draw them in the same graph to see the general trend

By graphing the payroll for the five highest team from 2006 to 2016 trend, we can actually find that the general trend shows that the payroll for the team are increasing from the year 2006 to the year 2016. We can find some anomalies in this picture which we will use to evaluate for part 3.

Question Three: There are at least 2 anomalies in the payrolls over this time period. What are they, and can you find information (online or elsewhere) to explain them?

From the resources online, we can find that in 2013, Los Angeles Dodgers Finished 1st in NL_West which is a pretty remarkable score. Based on our graph above, we find that there is huge payroll increase for Los Angeles Dodgers in 2013, and I think that the major result for the increasing payroll should be the good score they received during the year 2013.

Now, another possible anomalies is that what happens to PHI (Philadelphia Quakers) in 2016? Since they experienced a rapid decrease in payroll during that year.

From the resources online, we can find that in 2016, PHI finished fourth in the Nl East, only winning eight more games than they had the previous year, and they also experienced some personnel replacement, which may possibly be a reason to explain that they have a rapid descending trend in their payroll during the year 2016.

We can state more interesting findings here. Why the payroll data for NYA(New York Yankees) experienced a small fluctuations from 2012-2014? From the dataset provided online:

We find that they experienced some fluctuations for their rank in the American League East Competition might have some effections on their payrolls.

Questions Four: How do the top 5 teams compare to other teams?

We can do an analysis which we can extract out all some high-income players, and see what kind of result we will get:

If we display the top twenty highest paid players, we can find that most of them is in our top 5 team, and we can extract out the corresponding data to calculate the actual percentage:

What about top 10 paid players?

What about top 50 paid players?

From the above evaluation, we can mainly conclude that: top paid players are mostly in the team which receive the top 5 highest payroll, and we can see that even though the percentage is decreasing if we increase the number we use to evaluate the top paid players, earnings of high-paid players remain an important salary component for top-five payroll team.

Thanks for your reading.