Intro
Get Data
Extra Info
There is a popular blue marlin fishing tournament called The Big Rock that happens annually in Morehead City, NC. For each of the past 7 years, once the tournament is over, I have extracted the information from the website into clean dataframes using a combination of Python libraries: selenium's webdriver, requests, and BeautifulSoup. Two posts will walk through the process of web scraping, cleaning, and then analyzing that data.
Tournament Info:
In June, hundreds of expensive fishing boats gather to compete in a 6-day tournament for a purse worth $6MM. Even Michael Jordan, the NBA legend, has been entering recently. The blue marlin they catch are massive, and the tournament draws a good bit of attention. The biggest payout goes to the crew that catches the heaviest blue marlin. Check it out here: Homepage for the Big Rock Tournament
Note: Catch 23 is Michael Jordan’s boat and was featured in this ESPN article for their 2020 catch.
Running the Python scripts below pulls the Participants and the Activity Feed data from the website and saves them as flat files.
The Participants are the people and their boats that enter; the cost to participate is a measly $20K. This data consists of the Boat Name, Boat Brand, Boat Size, Port, Owner, and Captain.
Script to pull the participants data:
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
import tqdm

number = "64th"  # update this each year
chromepath = "/Users/Julian/Desktop/misc/chromedriver"  # update my chromedriver each year to match my current chrome version
driver = webdriver.Chrome(chromepath)
driver.get("https://www.reeltimeapps.com/live/tournaments/" + number + "-annual-big-rock-blue-marlin-tournament/participants")

# helpful stackoverflow link for the line of code below: https://stackoverflow.com/questions/12570329/selenium-list-index-out-of-range-error
links = [i.get_attribute('href') for i in driver.find_elements_by_xpath(""".//a""")]

# find the first link and last link for a proper subset - use links that end in a 4-digit number
for i in links:
    print(i)

links_subset = links[17:283]  # update this each year - make sure you add +1 to the last link number

# initiate empty lists
boat_names = []
sizes_and_brands = []
owners = []
ports = []
captains = []

for link in tqdm.tqdm(links_subset):
    # slow down the loop - the website gets overloaded otherwise
    time.sleep(3)
    # get the html
    page = requests.get(link, timeout=10)
    soup = BeautifulSoup(page.content, 'html.parser')

    # get boat names
    tag1 = soup.find_all('h1', class_='m-b-5')
    if len(tag1) == 1:
        boat_names.append(tag1[0].get_text().strip())
    else:
        boat_names.append('NA')

    # get boat sizes and brands
    tag2 = soup.find_all('h5')
    if len(tag2) == 1:
        sizes_and_brands.append(tag2[0].get_text().strip())
    else:
        sizes_and_brands.append('NA')

    # get owners/companies and port cities and states
    tag3 = soup.find_all('h4', class_='m-0')
    tag4 = soup.find_all('span')
    if len(tag4) == 5:
        tag4 = tag4[4].get_text().strip()  # helps determine whether only a port or only an owner is listed
    if len(tag3) == 2:
        owners.append(tag3[0].get_text().strip())
        ports.append(tag3[1].get_text().strip())
    elif len(tag3) == 1 and tag4 == 'Port':
        owners.append('NA')
        ports.append(tag3[0].get_text().strip())
    elif len(tag3) == 1 and tag4 == 'Owner':
        ports.append('NA')
        owners.append(tag3[0].get_text().strip())
    else:
        owners.append('NA')
        ports.append('NA')

    # get the captain of the boat (only listed starting with the 62nd annual, in 2020)
    table = soup.find_all('table', id='lb-table-anglers')
    h4_list = soup.find_all('h4')
    # True if any h4 reads 'No crew to display', i.e. no captain is listed
    no_crew_listed = any(h4.contents[0].strip() == 'No crew to display' for h4 in h4_list)
    if no_crew_listed:
        captains.append('NA')
    else:
        captain_mates_list = [s.get_text() for s in table[-1].find_all('small')]
        if 'Captain' in captain_mates_list:
            captains.append(table[-1].find('td').contents[0].strip())
        else:
            captains.append('NA')

# if every list has the same length, the data lines up
assert len(boat_names) == len(sizes_and_brands) == len(owners) == len(ports) == len(captains)

# create a dictionary
d = {'boat_name': boat_names, 'type': sizes_and_brands, 'owner': owners, 'port': ports, 'captain': captains}
# create a dataframe
df = pd.DataFrame(d)
# save the dataframe to my computer - appending the annual number to the filename
df.to_csv("/Users/Julian/Documents/projects/big_rock/data/participants/participants" + number + ".csv")
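Since the script saves one participants file per year, the seven yearly CSVs can be stacked into a single dataframe for analysis. A minimal sketch, assuming the filename pattern used above (the `load_participants` helper and the `annual` column are my own additions, not part of the original script):

```python
import re
import pandas as pd

def load_participants(paths):
    """Stack yearly participants CSVs into one dataframe, tagging each
    row with the annual number parsed from a filename like
    'participants64th.csv'."""
    frames = []
    for path in paths:
        df = pd.read_csv(path, index_col=0)
        # pull the tournament edition (e.g. 64) out of the filename
        match = re.search(r"participants(\d+)(?:st|nd|rd|th)", path)
        df["annual"] = int(match.group(1)) if match else None
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

The list of paths could come from something like `glob.glob(...)` over the participants folder.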
The Activity Feed comes from boats radioing in certain actions like Hook Ups, Releases, Lost, and Boated. I explain what these terms mean in the Extra Info section below.
This script worked for the first few years. After the website changed in 2022, it needed some minimal alterations. Since the two versions are still so similar, I didn't post both.
# get activity feed data
from bs4 import BeautifulSoup
import requests
import pandas as pd

number = "64th"
page = requests.get("https://widgets.reeltimeapps.com/live/tournaments/" + number + "-annual-big-rock-blue-marlin-tournament/widgets/feed.json?day=0&per=10000&type=scores")
page.status_code  # a status code of 200 means the page downloaded successfully
soup = BeautifulSoup(page.content, 'html.parser')
newsfeed = soup.find_all('h4')

# include this line for the 63rd annual only - it takes out the long
# text-only updates that have no boat names or times:
# newsfeed = newsfeed[0:100]

boat_names = []
activity = []
time = []
for feed in newsfeed:
    boat_names.append(feed.get_text().strip())
    activity.append(feed.next.next.next.strip())
    time.append(feed.next.next.next.next.next.next.get_text().strip())

# check to make sure all three lists have the same length
assert len(boat_names) == len(activity) == len(time)

d = {'boat_name': boat_names, 'activity': activity, 'time': time}
df = pd.DataFrame(d)
df.to_csv("/Users/Julian/Documents/projects/big_rock_2019/data/activity/activity" + number + ".csv")
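With the feed saved, a quick sanity check is to see which boats radioed in the most entries. A small sketch against the dataframe built above (`most_active_boats` is my own helper, not part of the scraping script):

```python
import pandas as pd

def most_active_boats(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n boat names with the most activity-feed entries,
    most active first."""
    return df['boat_name'].value_counts().head(n)
```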
These boats are trolling 60-100 miles offshore in the Gulf Stream for the trophy fish and the public is tuned in to what’s happening in real time via the live updates on the site. There are a lot of other rules for the tournament that I won’t get into here.
Extra Info:
- Port: City/State the boat is from
- Hook Ups: When a fish bites your lure and you are now fighting it. This is referred to as being “Hooked Up”.
- Releases: When the fish is too small to keep, you let it go. This is called a “Release”. This earns you points towards the Release category if the fish species was a billfish.
- Lost: When the fish gets off the hook or the line breaks, it’s considered “Lost”.
- Boated: When the fish is a blue marlin and it is at least 400 pounds or 110 inches long, you can keep it and bring it back to land to be weighed. This is called “Boated”. The heaviest blue marlin wins the tournament (at least in that category of the tournament, which pays out the most money).
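For the analysis, raw feed strings need to be mapped to the four actions above. A minimal keyword-matching sketch (the exact wording the website uses is an assumption on my part, so the keyword list would need checking against a real feed):

```python
# hypothetical keywords - verify against the actual feed wording
ACTIONS = ['Hooked Up', 'Released', 'Lost', 'Boated']

def classify_activity(text: str) -> str:
    """Map a raw activity string to one of the four radioed-in actions,
    or 'Other' for text-only updates that mention none of them."""
    for action in ACTIONS:
        if action.lower() in text.lower():
            return action
    return 'Other'
```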
Check out the second post: Big Rock - Analysis.