Scraping UFC data with Python part 2
Previous blogs
[Part ?] - designing database tables
[Part 0] - scraping events data
[Part 1] - fighters table
[Part 2] - fight matchups table
overview
In this blog I’ll finally be completing the fight_matchups table, this table will act as a junction table, containing the fights event_id found in the events table, both fighters id’s found in the fighters table and a fight_id as the primary key.
event list
The events table has a row called event_link, containing a link to that events fights on ufcstats.com i.e:
The above link returns a table, containing all the fights on that event card, I used this link to scrape each fight of each event I have have in my events table (approx 800 events).
The below function grabbed all the event names in the events table, to be used to pass through a function that will scrape the events fight table;
def get_events_list():
try:
cur = conn.cursor()
cur.execute('SELECT event_name FROM events')
events_list = cur.fetchall()
cur.close()
return events_list #list of all events names
except BaseException as e:
print('could not get names ', e)
The scraper function looked like this;
def event_fights(event_name):
event_link = get_event_link_id(event_name) #grabs event url
r = requests.get(event_link[0]) #get request event url
event_id = event_link[1]
soup = BeautifulSoup(r.text, 'html.parser')
table_body = soup.find('tbody')
rows = table_body.find_all('tr')
for row in rows[0::]:
#parse info in each row
fighters = row.find_all('a')
fighter_1_name = fighters[1].string.strip().split(' ')
fighter_2_name = fighters[2].string.strip().split(' ')
#queries fight table and returns both fighters id's
fighter_1_id = get_fighter_id(fighter_1_name[0], fighter_1_name[1])
fighter_2_id = get_fighter_id(fighter_2_name[0], fighter_2_name[1])
#inserts event_id, fighter_1_id and fighter_2_id to fight_matchups table
insert_fight_matchup(event_id, fighter_1_id, fighter_2_id)
The insert table looked like this;
def insert _fight_matchup(event_id, fighter_1_id, fighter_2_id):
try:
cur = conn.cursor()
cur.execute('INSERT INTO fight_matchups (event_id, fighter_1_id, fighter_2_id) VALUES ', (event_id, fighter_1_id, fighter_2_id))
cur.close()
return
except BaseException as e:
print('could not insert matchup ', e)
Overall, this task took approximately 20 minutes to parse over each fight in every event in my events table (about 700 events total).
This time includes the 3 second delay I implemented, when looping over each event, this was to prevent any potential rate limiting when making requests to ufcstats.com.
As seen when making a query to the fight_matchups table and totaling the amount of rows returned, there are approximately 7000 fights inserted into the table.
When dividing the amount of fights in the fight_matchup table with the amount of events in the events table (6831 fights / 683 events) this returns an average of 10.001 fights per event.
At face value this seems accurate, especially given that earlier UFC events had very few fights per event (i.e UFC 26 only had 8 fights) however, when comparing this to current UFC events there are an average of 13 fights per event.
Ultimately these numbers make me think that not all fights were inserted into the fight_matchups table, and that a significant amount of fights are missing.
To verify this suspicion, I created a events_fight_checker() function, that looped over the list of fights in an event, counted the total and then compared that total to the corresponding amount of fights with the specified event_id stored in the fight_matchups table.
def events_fight_checker(event_name):
event_link = get_event_link_id(event_name) #returns event url
r = requests.get(event_link[0])
event_id = event_link[1] #corresponding event id to fight id on each card
saved_fights = event_fight_count(event_id) #returns number of fights with this event_id
soup = BeautifulSoup(r.text, 'html.parser')
t_body = soup.find('tbody')
rows = t_body.find_all('tr')
print(event_name, '// count: ',saved_fights-len(rows))
when looping over each event in the above function, if saved_fights-len(rows) = 0, then the correct amount of fights are stored in my fight_matchups table.
If saved_fights-len(rows) = -integer then that event is missing fights.
However, if saved_fights-len(rows) = +integer then that event has too many fights in the fight_matchups table.
As seen below:
UFC Fight Night: Hermansson vs. Pyfer // count: 12
UFC Fight Night: Dolidze vs. Imavov // count: -9
UFC Fight Night: Ankalaev vs. Walker 2 // count: 0
UFC 296: Edwards vs. Covington // count: 0