Author: Sami Reed

    Little more automation for my leads: Part 2

    This is a continuation of my previous post, Little bit of automation for my leads.

    First of all, I want to point out a silly thing I did with my last version.

    Since I was having a hard time getting the docx module set up with all of its dependencies, I decided to short-circuit that and just use HTML instead of a Microsoft Word document.

    So I copied all my leads (basically formatted hyperlinks) from Microsoft Word and pasted them into an HTML editor. I used a WordPress post for the HTML editor. Of course, I never published that post.

    That whole rigamarole was unnecessary. You can actually just save a Microsoft Word document straight out to a .htm file.

    Saving straight out of Microsoft Word is still sort of a hassle, though, mainly because Word doesn’t encode the file as UTF-8 by default.

    I managed to force it to encode as UTF-8 by choosing “More Options” before saving. In the window that opens, there is a “Tools” drop-down to the left of the “Save” button; in there you can change the encoding.
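
    If digging through that save dialog gets old, the same fix works in a couple of lines of Python. This is just a minimal sketch: it assumes Word wrote the file as Windows-1252 (a common default), and the input filename leads-filtered.htm is a placeholder.

    # Minimal sketch: re-encode the Word-exported HTML as UTF-8.
    # Assumes the source file is Windows-1252; "leads-filtered.htm" is a placeholder name.
    with open("leads-filtered.htm", "r", encoding="cp1252") as src:
        html = src.read()

    with open("leads-filtered-utf8.htm", "w", encoding="utf-8") as dst:
        dst.write(html)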

    Another improvement I want to make is saving this data out to a CSV instead of a text file. Then we can capture more than one field per lead.

    Ultimately, I will surely want several pieces of information for each lead.

    For instance, I don’t just want the URL, but also the title of the website. I also think I’m gonna want a field to record when the website was last updated, since I keep finding what look like great blogs that haven’t been updated in 9 years or something.

    Let’s start by just extracting the URL and the link text and saving them both to a CSV instead of a text file.

    We can use link.get_text(), a BeautifulSoup method, to extract the visible text from the anchor tag, and we can import the csv module to write to a CSV file.

    Here’s the new code:
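
    Well, a sketch of it, anyway: it reads the UTF-8 .htm file, pulls each link’s URL and visible text, and writes both columns to a CSV. It mirrors the fuller listing later in this post, just without the last-modified lookup that comes next.

    from bs4 import BeautifulSoup
    import re
    import csv

    def extract_urls_from_html(html_path):
        """Extract all URLs and their link text from an HTML file."""
        with open(html_path, "r", encoding="utf-8") as file:
            soup = BeautifulSoup(file, "html.parser")

        url_data = set()  # Use a set to avoid duplicates

        # Extract URLs from <a href="…"> along with their visible link text
        for link in soup.find_all("a", href=True):
            url = link["href"].strip()
            link_text = link.get_text(strip=True)
            url_data.add((url, link_text))

        # Extract URLs appearing as plain text
        url_pattern = re.compile(r"https?://[^\s\"'>]+")
        for text in soup.stripped_strings:
            for match in url_pattern.findall(text):
                url_data.add((match, ""))  # No link text for plain-text URLs

        return list(url_data)

    def save_urls_to_csv(url_data, output_file):
        """Save extracted URLs and their link text to a CSV file."""
        with open(output_file, "w", newline="", encoding="utf-8") as file:
            writer = csv.writer(file)
            writer.writerow(["URL", "Link Text"])  # Header row
            writer.writerows(url_data)
        print(f"URLs saved to {output_file}")

    if __name__ == "__main__":
        html_path = "leads-filtered-utf8.htm"  # Change this to your HTML file path
        extracted_urls = extract_urls_from_html(html_path)
        if extracted_urls:
            save_urls_to_csv(extracted_urls, "lead_urls.csv")
        else:
            print("No URLs found in the HTML file.")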

    So this is getting better. We can add fields for whether they have been contacted yet, their email address, or fields that prioritize them.
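
    For example, here is a sketch of how the CSV writer could grow a few extra columns; the extra column names are just placeholders, left blank for filling in by hand later.

    import csv

    def save_urls_to_csv(url_data, output_file):
        """Sketch: same CSV output, with extra bookkeeping columns left blank for now."""
        with open(output_file, "w", newline="", encoding="utf-8") as file:
            writer = csv.writer(file)
            # "Contacted", "Email", and "Priority" are placeholder column names
            writer.writerow(["URL", "Link Text", "Contacted", "Email", "Priority"])
            for url, link_text in url_data:
                writer.writerow([url, link_text, "", "", ""])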

    Next I’d like to see if I can grab the age of the webpage; essentially, when it was last updated. Many of these leads are blogs that haven’t been updated in years, and I am not sure how much I want to bother with those.

    Here is the modified code:

    from bs4 import BeautifulSoup
    import re
    import csv
    import requests

    def get_last_modified(url):
        """Retrieve the Last-Modified header from a URL if available."""
        try:
            response = requests.head(url, timeout=5)
            return response.headers.get("Last-Modified", "Unknown")
        except requests.RequestException:
            return "Unknown"

    def extract_urls_from_html(html_path):
        """Extract all URLs, their link text, and last modified dates from an HTML file."""
        with open(html_path, "r", encoding="utf-8") as file:
            soup = BeautifulSoup(file, "html.parser")

        url_data = set()  # Use a set to avoid duplicates

        # Extract URLs from <a href="…"> and their link text
        count = 0
        for link in soup.find_all("a", href=True):
            url = link["href"].strip()
            link_text = link.get_text(strip=True)  # Extract visible link text
            last_modified = get_last_modified(url)
            url_data.add((url, link_text, last_modified))
            print(f"{count} : {url} : {link_text} : {last_modified}")
            count = count + 1

        # Extract URLs appearing as plain text
        url_pattern = re.compile(r"https?://[^\s\"'>]+")
        for text in soup.stripped_strings:
            for match in url_pattern.findall(text):
                last_modified = get_last_modified(match)
                url_data.add((match, "", last_modified))  # No link text for plain text URLs

        return list(url_data)

    def save_urls_to_csv(url_data, output_file):
        """Save extracted URLs, their link text, and last modified dates to a CSV file."""
        with open(output_file, "w", newline="", encoding="utf-8") as file:
            writer = csv.writer(file)
            writer.writerow(["URL", "Link Text", "Last Modified"])  # Header row
            writer.writerows(url_data)
        print(f"URLs saved to {output_file}")

    if __name__ == "__main__":
        html_path = "leads-filtered-utf8.htm"  # Change this to your HTML file path
        extracted_urls = extract_urls_from_html(html_path)
        if extracted_urls:
            output_file = "lead_urls.csv"
            print("Extracted URLs:")
            for url, text, last_modified in extracted_urls:
                print(f"URL: {url}, Link Text: {text}, Last Modified: {last_modified}")
            save_urls_to_csv(extracted_urls, output_file)
        else:
            print("No URLs found in the HTML file.")

    The function get_last_modified(url) sends a HEAD request (so it only fetches the headers, not the page itself) and reads the “Last-Modified” response header, falling back to “Unknown” if the header is missing or the request fails.

    Looking through my results, I see that most of them come back as “Unknown.” It also looks like we saved these dates as strings, which aren’t that easy to sort by.
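
    One idea for later (just a sketch, nothing I have wired into the script yet): Python’s email.utils.parsedate_to_datetime() can turn that header string into a real datetime, which sorts properly and makes the “Unknown” rows easy to set aside.

    from email.utils import parsedate_to_datetime

    def parse_last_modified(value):
        """Sketch: convert a Last-Modified header string into a sortable datetime."""
        try:
            return parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return None  # "Unknown" (or anything unparseable) comes back as None

    print(parse_last_modified("Wed, 21 Oct 2015 07:28:00 GMT"))  # 2015-10-21 07:28:00+00:00
    print(parse_last_modified("Unknown"))                         # None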

    This is a step in the right direction, but I don’t think the “Last Modified” field is all that useful in its current state.

    I mean, I was able to weed out a couple of really old webpages that haven’t been updated in forever, but not many.

    Let’s try to improve this part next time.