Little more automation for my leads: Part 2


This is a continuation of my previous post, Little bit of automation for my leads.

First of all, I want to point out a silly thing I did with my last version.

Since I was having a hard time getting the docx module set up with all of its dependencies, I decided to short-circuit that and just use HTML instead of a Microsoft Word document.

So I copied all my leads (basically formatted hyperlinks) from Microsoft Word and pasted them into an HTML editor. I used a WordPress post for the HTML editor. Of course, I never published that post.

That whole rigamarole was unnecessary. You can actually just save a Microsoft Word document straight out to a .htm file.

Saving straight out of Microsoft Word is still sort of a hassle, though, mainly because it doesn’t encode as UTF-8 by default.

I managed to force it to encode as UTF-8 by choosing “More Options” before saving. In the new window, there is a “Tools” drop-down to the left of the new “Save” button. In there you can change the encoding.

Another improvement I want to make is saving this data out to a CSV instead of a text document. Then we can capture more than one field per lead.

Ultimately, I will surely want several pieces of information, if not more, for each lead.

For instance, I don’t just want the URL, but also the title of the website. I also think I’m gonna want a field to record when the website was last updated, as I keep finding what look like great blogs, but they haven’t been updated in 9 years or something.

Let’s start by just extracting the URL and the link text and saving them both into a CSV instead of a text file.

We can use link.get_text(), a BeautifulSoup method, to extract the visible text of each link, and we can import the csv module to write to a CSV file.

Here’s the new code:
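Roughly, it is the same script as last time, but with link.get_text() pulling in the visible link text and csv.writer writing both columns out. A sketch of that version, assuming the same leads-filtered-utf8.htm input file:

from bs4 import BeautifulSoup
import re
import csv

def extract_urls_from_html(html_path):
    """Extract all URLs and their link text from an HTML file."""
    with open(html_path, "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file, "html.parser")
    url_data = set()  # Use a set to avoid duplicates

    # Extract URLs from <a href="..."> tags along with their visible link text
    for link in soup.find_all("a", href=True):
        url = link["href"].strip()
        link_text = link.get_text(strip=True)
        url_data.add((url, link_text))

    # Extract URLs appearing as plain text
    url_pattern = re.compile(r"https?://[^\s\"'>]+")
    for text in soup.stripped_strings:
        for match in url_pattern.findall(text):
            url_data.add((match, ""))  # No link text for plain-text URLs

    return list(url_data)

def save_urls_to_csv(url_data, output_file):
    """Save extracted URLs and their link text to a CSV file."""
    with open(output_file, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["URL", "Link Text"])  # Header row
        writer.writerows(url_data)

if __name__ == "__main__":
    extracted_urls = extract_urls_from_html("leads-filtered-utf8.htm")
    save_urls_to_csv(extracted_urls, "lead_urls.csv")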

So this is getting better. We can add fields for whether a lead has been contacted yet, their email address, or a priority ranking.
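For instance, the header row and each record would simply grow a few more columns (these column names are just placeholders I might use; nothing fills them in automatically yet):

# Hypothetical extra columns for tracking outreach; values filled in by hand for now
writer.writerow(["URL", "Link Text", "Contacted", "Email", "Priority"])
writer.writerow([url, link_text, "No", "", ""])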

Next I’d like to see if I can grab the age of the webpage, essentially when it was last updated. Many of these leads are blogs that haven’t been updated in years. I am not sure how much I want to bother with those.

Here is the modified code:

from bs4 import BeautifulSoup
import re
import csv
import requests

def get_last_modified(url):
    """Retrieve the Last-Modified header from a URL if available."""
    try:
        response = requests.head(url, timeout=5)
        return response.headers.get("Last-Modified", "Unknown")
    except requests.RequestException:
        return "Unknown"

def extract_urls_from_html(html_path):
    """Extract all URLs, their link text, and last modified dates from an HTML file."""
    with open(html_path, "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file, "html.parser")
    url_data = set()  # Use a set to avoid duplicates

    # Extract URLs from <a href="…"> and their link text
    count = 0
    for link in soup.find_all("a", href=True):
        url = link["href"].strip()
        link_text = link.get_text(strip=True)  # Extract visible link text
        last_modified = get_last_modified(url)
        url_data.add((url, link_text, last_modified))
        print(str(count) + ' : ' + url + " : " + link_text + " : " + last_modified)
        count = count + 1

    # Extract URLs appearing as plain text
    url_pattern = re.compile(r"https?://[^\s\"'>]+")
    for text in soup.stripped_strings:
        for match in url_pattern.findall(text):
            last_modified = get_last_modified(match)
            url_data.add((match, "", last_modified))  # No link text for plain text URLs

    return list(url_data)

def save_urls_to_csv(url_data, output_file):
    """Save extracted URLs, their link text, and last modified dates to a CSV file."""
    with open(output_file, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["URL", "Link Text", "Last Modified"])  # Header row
        writer.writerows(url_data)
    print(f"URLs saved to {output_file}")

if __name__ == "__main__":
    html_path = "leads-filtered-utf8.htm"  # Change this to your HTML file path
    extracted_urls = extract_urls_from_html(html_path)
    if extracted_urls:
        output_file = "lead_urls.csv"
        print("Extracted URLs:")
        for url, text, last_modified in extracted_urls:
            print(f"URL: {url}, Link Text: {text}, Last Modified: {last_modified}")
        save_urls_to_csv(extracted_urls, output_file)
    else:
        print("No URLs found in the HTML file.")

The function get_last_modified(url) checks the “Last-Modified” header.

Looking through my results, I see that most of them come back as “Unknown.” It also looks like we saved these dates as strings, which aren’t that easy to sort by.
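One way to make those dates sortable later, just as a sketch and assuming the servers that do respond use the standard HTTP date format, would be to parse the header string with the standard library before writing it out:

from email.utils import parsedate_to_datetime

def parse_http_date(value):
    """Convert a Last-Modified header string to a datetime, or None if it can't be parsed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # covers "Unknown" and anything malformed

# Example: parse_http_date("Wed, 21 Oct 2015 07:28:00 GMT") gives a sortable datetime object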

This is a step in the right direction, but I don’t think the “Last Modified” field is all that useful in its current state.

I mean, I was able to weed out a couple of really old webpages that haven’t been updated in forever, but not many.

Let’s try to improve this part next time.

