Intro
Get Data
Extra Info
There is a popular blue marlin fishing tournament called The Big Rock that happens annually in Morehead City, NC. For each of the past 7 years, once the tournament is over, I have extracted the information from the website into clean dataframes using a combination of Python libraries: selenium's webdriver, requests, and BeautifulSoup. Two posts will walk through the process of web scraping, cleaning, and then analyzing that data.
Tournament Info:
In June, hundreds of expensive fishing boats gather to compete in a 6-day tournament for a purse worth $6MM. Even Michael Jordan, the NBA legend, has been entering recently. The blue marlin they catch are massive, and the tournament draws a good bit of attention. The biggest payout goes to the crew that catches the heaviest blue marlin. Check it out here: Homepage for the Big Rock Tournament
Note: Catch 23 is Michael Jordan’s boat and was featured in this ESPN article for their 2020 catch.
Running the Python scripts below pulls the Participants and the Activity Feed data from the website and saves them as flat files.
The Participants are the people and their boats that enter; the cost to participate is a measly $20K. This data consists of the Boat Name, Boat Brand, Boat Size, Port, Owner, and Captain.
Script to pull the participants data:
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
import tqdm

number = "64th"  # update this each year
chromepath = "/Users/Julian/Desktop/misc/chromedriver"  # update my chromedriver each year to match my current chrome version
driver = webdriver.Chrome(chromepath)
driver.get("https://www.reeltimeapps.com/live/tournaments/" + number + "-annual-big-rock-blue-marlin-tournament/participants")

# helpful stackoverflow link for the line of code below: https://stackoverflow.com/questions/12570329/selenium-list-index-out-of-range-error
links = [i.get_attribute('href') for i in driver.find_elements_by_xpath(""".//a""")]

# find the first link and last link for a proper subset - use links that end in a 4-digit number
for i in links:
    print(i)

links_subset = links[17:283]  # update this each year - make sure you add +1 to the last link number

# initiate empty lists
boat_names = []
sizes_and_brands = []
owners = []
ports = []
captains = []

for link in tqdm.tqdm(links_subset):
    # slow down the loop - the website gets overloaded otherwise
    time.sleep(3)
    # get the html
    page = requests.get(link, timeout=10)
    soup = BeautifulSoup(page.content, 'html.parser')

    # get boat names
    tag1 = soup.find_all('h1', class_='m-b-5')
    if len(tag1) == 1:
        boat_names.append(tag1[0].get_text().strip())
    else:
        boat_names.append('NA')

    # get boat sizes and brands
    tag2 = soup.find_all('h5')
    if len(tag2) == 1:
        sizes_and_brands.append(tag2[0].get_text().strip())
    else:
        sizes_and_brands.append('NA')

    # get owners/companies and port cities and states
    tag3 = soup.find_all('h4', class_='m-0')
    tag4 = soup.find_all('span')
    if len(tag4) == 5:
        tag4 = tag4[4].get_text().strip()  # helps determine whether only a port or only an owner is listed
    if len(tag3) == 2:
        owners.append(tag3[0].get_text().strip())
        ports.append(tag3[1].get_text().strip())
    elif len(tag3) == 1 and tag4 == 'Port':
        owners.append('NA')
        ports.append(tag3[0].get_text().strip())
    elif len(tag3) == 1 and tag4 == 'Owner':
        ports.append('NA')
        owners.append(tag3[0].get_text().strip())
    else:
        owners.append('NA')
        ports.append('NA')

    # get the captain of the boat (only listed starting with the 62nd annual, in 2020)
    table = soup.find_all('table', id='lb-table-anglers')
    h4_list = soup.find_all('h4')
    # True if any h4 reads 'No crew to display', i.e. no captain is listed
    no_crew_listed = any(h4.contents[0].strip() == 'No crew to display' for h4 in h4_list)
    if no_crew_listed:
        captains.append('NA')
    else:
        captain_mates_list = [s.get_text() for s in table[-1].find_all('small')]
        if 'Captain' in captain_mates_list:
            captains.append(table[-1].find('td').contents[0].strip())
        else:
            captains.append('NA')

# if every list has the same length, the data lines up
assert len(boat_names) == len(sizes_and_brands) == len(owners) == len(ports) == len(captains)

# create a dictionary
d = {'boat_name': boat_names, 'type': sizes_and_brands, 'owner': owners, 'port': ports, 'captain': captains}
# create a dataframe
df = pd.DataFrame(d)
# save the dataframe to my computer - appending the annual number to the filename
df.to_csv("/Users/Julian/Documents/projects/big_rock/data/participants/participants" + number + ".csv")
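Since the script saves one participants file per year, the seven yearly CSVs can be stacked into a single dataframe for analysis. A minimal sketch, assuming the filename pattern used above (the `load_participants` helper and the `annual` column are my own additions, not part of the original script):

```python
import re
import pandas as pd

def load_participants(paths):
    """Stack yearly participants CSVs into one dataframe, tagging each
    row with the annual number parsed from a filename like
    'participants64th.csv'."""
    frames = []
    for path in paths:
        df = pd.read_csv(path, index_col=0)
        # pull the tournament edition (e.g. 64) out of the filename
        match = re.search(r"participants(\d+)(?:st|nd|rd|th)", path)
        df["annual"] = int(match.group(1)) if match else None
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

The list of paths could come from something like `glob.glob(...)` over the participants folder.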
The Activity Feed comes from boats radioing in certain actions like Hook Ups, Releases, Lost, and Boated. I explain what these terms mean in the Extra Info section below.
This script worked for the first few years. After the website changed in 2022, it needed some minimal alterations. Since the two versions are still so similar, I didn't post both.
# get activity feed data
from bs4 import BeautifulSoup
import requests
import pandas as pd

number = "64th"
page = requests.get("https://widgets.reeltimeapps.com/live/tournaments/" + number + "-annual-big-rock-blue-marlin-tournament/widgets/feed.json?day=0&per=10000&type=scores")
page.status_code  # a status code of 200 means the page downloaded successfully
soup = BeautifulSoup(page.content, 'html.parser')
newsfeed = soup.find_all('h4')

# include this line for the 63rd annual only - it takes out the long
# text-only updates that have no boat names or times:
# newsfeed = newsfeed[0:100]

boat_names = []
activity = []
time = []
for feed in newsfeed:
    boat_names.append(feed.get_text().strip())
    activity.append(feed.next.next.next.strip())
    time.append(feed.next.next.next.next.next.next.get_text().strip())

# check to make sure all three lists have the same length
assert len(boat_names) == len(activity) == len(time)

d = {'boat_name': boat_names, 'activity': activity, 'time': time}
df = pd.DataFrame(d)
df.to_csv("/Users/Julian/Documents/projects/big_rock_2019/data/activity/activity" + number + ".csv")
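With the feed saved, a quick sanity check is to see which boats radioed in the most entries. A small sketch against the dataframe built above (`most_active_boats` is my own helper, not part of the scraping script):

```python
import pandas as pd

def most_active_boats(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n boat names with the most activity-feed entries,
    most active first."""
    return df['boat_name'].value_counts().head(n)
```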
These boats are trolling 60-100 miles offshore in the Gulf Stream for the trophy fish and the public is tuned in to what’s happening in real time via the live updates on the site. There are a lot of other rules for the tournament that I won’t get into here.
Extra Info:
- Port: City/State the boat is from
- Hook Ups: When a fish bites your lure and you are now fighting it. This is referred to as being “Hooked Up”.
- Releases: When the fish is too small to keep, you let it go. This is called a “Release”. This earns you points towards the Release category if the fish species was a billfish.
- Lost: When the fish gets off the hook or the line breaks, it’s considered “Lost”.
- Boated: When the fish is a blue marlin and it is at least 400 pounds or 110 inches long, you can keep it and bring it back to land to be weighed. This is called “Boated”. The heaviest blue marlin wins the tournament (at least in that category of the tournament, which pays out the most money).
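For the analysis, raw feed strings need to be mapped to the four actions above. A minimal keyword-matching sketch (the exact wording the website uses is an assumption on my part, so the keyword list would need checking against a real feed):

```python
# hypothetical keywords - verify against the actual feed wording
ACTIONS = ['Hooked Up', 'Released', 'Lost', 'Boated']

def classify_activity(text: str) -> str:
    """Map a raw activity string to one of the four radioed-in actions,
    or 'Other' for text-only updates that mention none of them."""
    for action in ACTIONS:
        if action.lower() in text.lower():
            return action
    return 'Other'
```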
Check out the second post: Big Rock - Analysis.