Logistic Regression Part 1: Scraping the data
The ongoing COVID-19 pandemic has, at the very least, suspended sports seasons worldwide for 2020. On the bright side, that gives us time to analyze them and make predictions, provided there is sufficient data available.
In this project, using Python, I first create a web scraper to gather MLB data from previous seasons, then use logistic regression to calculate each team’s probability of making the 2020 postseason.
Data Description
The data will be scraped from Baseball Reference, which has records going back as far as the 1870s. However, because the structure of the playoffs, and of baseball as a whole, has changed often and drastically since then, the data I use goes back only to 2012, the first year the current playoff structure was in place. The changes deal primarily with the number of teams allowed into the playoffs; in 2012, the MLB expanded the field to 10 teams.
Therefore, we will be gathering batting and pitching data from the 2012–2019 seasons.
Building the web scraper
Fortunately, all the data we need for a given season is on one page, with a predictable URL format:
https://www.baseball-reference.com/leagues/MLB/2019.shtml
The logic for sending requests is therefore straightforward: it just requires replacing 2019 with the year we need data for.
All the libraries I used:
from bs4 import BeautifulSoup,Comment
import pandas as pd
import time
import requests
I used BeautifulSoup to parse the HTML.
url = "https://www.baseball-reference.com/leagues/MLB/{}.shtml".format(year)
headers = {'user-agent': "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'lxml')
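Before parsing, it's worth a quick sanity check that the request actually succeeded:
if page.status_code != 200:
    raise RuntimeError("Request failed with status {}".format(page.status_code))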
The batting data table sits inside a div with the id div_teams_standard_batting, so it can be grabbed directly:
batting_table = soup.find("div", attrs={"id": "div_teams_standard_batting"})
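Each stat cell in the table carries a data-stat attribute, which is what the extraction functions below key on. As a quick illustrative sketch (assuming the usual table/tbody markup), you can list the available stat names from the first team's row:
first_row = batting_table.find("tbody").find("tr")
print([td["data-stat"] for td in first_row.find_all("td")])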
It is essentially the same for the pitching data table; however, its HTML is commented out in the page source for some reason, so I had to add a few lines of code to get what I needed out of it (thanks to StackOverflow for helping me figure this out).
comments = soup.find_all(text=lambda text: isinstance(text, Comment))
pitching_html = comments[19]  # the comment that holds the pitching table on this page
pitching_table = BeautifulSoup(pitching_html, 'lxml')
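One caveat: comments[19] is a magic index that happens to point at the pitching table, and it would break if Baseball Reference rearranged its markup. A sturdier sketch searches the comments for the table's id instead (teams_standard_pitching is an assumption on my part, following the batting div's naming pattern):
pitching_table = None
for c in comments:
    if "teams_standard_pitching" in c:  # assumed table id
        pitching_table = BeautifulSoup(c, 'lxml')
        break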
Code
The rest of the code consists of extracting the stats I needed, creating a dataframe, and exporting it to a CSV file. Here it is in its entirety:
from bs4 import BeautifulSoup,Comment
import pandas as pd
import time
import requests
def batting_stats(bstat):
    # Reads the module-level batting_table set inside the loop below
    tables = batting_table.find_all("td", attrs={"data-stat": bstat})
    b_stats = []
    for table in tables:
        b_stats.append(float(table.text))
    b_stats = b_stats[:-2]  # exclude the total and average rows
    return b_stats
def pitching_stats(pstat):
    # Reads the module-level pitching_table set inside the loop below
    tables = pitching_table.find_all("td", attrs={"data-stat": pstat})
    p_stats = []
    for table in tables:
        p_stat = table.text
        if p_stat == '':
            p_stat = "empty"  # placeholder for blank cells
        else:
            p_stat = float(p_stat)
        p_stats.append(p_stat)
    p_stats = p_stats[:-2]  # exclude the total and average rows
    return p_stats
def postseason(team_names, yr):  # 1 if a team made the playoffs that year, 0 if not
    playoff_teams = []
    go = True
    while go:
        team = input("Enter playoff teams for {}. Enter any number when all are in \n".format(yr))
        playoff_teams.append(team)
        if team.isdigit():  # a numeric entry ends the input loop
            go = False
            print("{} Postseason teams: ".format(yr), playoff_teams[:-1])
        else:
            print(playoff_teams)
    playoff = []
    for a in range(0, len(team_names)):
        if team_names[a] in playoff_teams:
            playoff.append(1)
        else:
            playoff.append(0)
    return playoff
all_years = []
for year in range(2012, 2020):
    url = "https://www.baseball-reference.com/leagues/MLB/{}.shtml".format(year)
    headers = {'user-agent': "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'lxml')
    batting_table = soup.find("div", attrs={"id": "div_teams_standard_batting"})
    # The pitching table is commented out in the page source, so pull it from the comments
    comments = soup.find_all(text=lambda text: isinstance(text, Comment))
    pitching_html = comments[19]
    pitching_table = BeautifulSoup(pitching_html, 'lxml')
    name_table = batting_table.find_all("th", attrs={"data-stat": "team_ID"})
    names = []
    for name in name_table:  # get team names
        names.append(name.text)
    names = names[:-3]  # drop the summary rows at the bottom
    names = names[1:]   # drop the header row
    years = []
    for i in range(0, len(names)):  # get the year of the season
        years.append(year)
    playoffs = postseason(names, year)
    b_dict = {'Tm': names, 'Yr': years, 'Playoff': playoffs}
    b_stat_names = ['batters_used', 'age_bat', 'runs_per_game', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI',
                    'SB', 'CS', 'BB', 'SO', 'batting_avg', 'onbase_perc', 'slugging_perc', 'onbase_plus_slugging',
                    'TB', 'GIDP', 'HBP', 'SH', 'SF', 'IBB', 'LOB']
    p_stat_names = ['pitchers_used', 'age_pitch', 'runs_allowed_per_game', 'W', 'L', 'win_loss_perc', 'earned_run_avg',
                    'G', 'GS', 'GF', 'CG', 'SHO_team', 'SHO_cg', 'SV', 'IP', 'H', 'R', 'ER', 'HR', 'BB', 'IBB', 'SO',
                    'HBP', 'BK', 'WP', 'batters_faced', 'fip', 'whip', 'hits_per_nine', 'home_runs_per_nine',
                    'bases_on_balls_per_nine', 'strikeouts_per_nine', 'strikeouts_per_base_on_balls', 'LOB']
    p_dict = {'Tm': names, 'Yr': years, 'Playoff': playoffs}
    for x in p_stat_names:
        p_dict[x] = pitching_stats(x)
    df_pitching = pd.DataFrame.from_dict(p_dict)
    for i in b_stat_names:
        b_dict[i] = batting_stats(i)
    df_batting = pd.DataFrame.from_dict(b_dict)
    # Combine the two tables on team, year, and playoff flag
    df_batting_pitching = df_pitching.merge(df_batting, on=['Yr', 'Tm', 'Playoff'], how='left',
                                            suffixes=('_pitching', '_batting'))
    all_years.append(df_batting_pitching)
    time.sleep(6)  # be polite to the server between requests
df_all = pd.concat(all_years)
print(df_all)
df_all.to_csv('mlb_since_2012.csv')
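Note that the postseason function makes the script interactive: each loop iteration pauses to ask for that year's playoff teams, and entering any number closes out the year. Once the export finishes, the CSV can be reloaded to confirm everything is there:
df_all = pd.read_csv('mlb_since_2012.csv', index_col=0)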
We now have all the data we need for this project. I decided to also use each team's payroll as an additional predictor; I simply entered each data point manually.
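For anyone reproducing this, here is a minimal sketch of how manually collected payroll figures could be attached to the scraped table (the team codes and numbers below are placeholders, not real payrolls):
payroll = pd.DataFrame({
    'Tm': ['NYY', 'LAD'],       # placeholder team codes
    'Yr': [2019, 2019],
    'Payroll': [200.0, 195.0],  # placeholder figures, in $ millions
})
df_all = df_all.merge(payroll, on=['Tm', 'Yr'], how='left')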
In my next post, I will get into the analysis and build the logistic regression model.