If you read the second part of my Leads automation project (Little more automation for my leads: Part 2), you know that I now have a Python script that works through my gangly list of links copied from blogrolls. I paste my links into a Microsoft Word document and then save it out as HTML.
The script then parses that into leads, with columns for the URL, the link text, and the modified date.
The get_last_modified(url) function checks the “Last-Modified” header on the web page. Unfortunately, most of my entries get filled in as “Unknown”.
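For the curious, the check is a single header lookup. Plenty of dynamically generated blogs never send the header at all, which is why the column fills up with “Unknown”:

import requests

# The whole check: ask for the headers and see if Last-Modified is present.
# (example.com is a placeholder URL)
resp = requests.head("https://example.com", timeout=5)
print(resp.headers.get("Last-Modified", "Unknown"))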
I want to improve on that.
I asked my LLM, “I would like you to estimate when the blog was last updated. Is that something you could do with the URLs?”
The response:
Yes! When estimating when a blog was last updated, I can use several techniques, including:
1. Checking the Website Directly
- Latest Blog Post Date – If the blog has timestamps on articles, the most recent post gives a clue.
- Timestamps in Comments – User comments can reveal activity even if posts aren’t dated.
- Footer or Sidebar Updates – Some sites display “Last updated” notices or recent posts in widgets.
2. Inspecting the Web Page Source Code
- Meta Tags (<meta> elements) – Some websites include “last modified” metadata in the page’s HTML.
- RSS Feeds – If available, an RSS feed often lists the most recent posts with timestamps.
- Canonical and OpenGraph Tags – These sometimes include the last update date.
3. Using Browser Developer Tools
- HTTP Headers (Last-Modified) – Many websites include a “Last-Modified” header in their response when serving a page.
- Page Cache Information – Some CMS platforms expose the last cached update in page headers.
4. Checking External Services
- Wayback Machine (Internet Archive) – By looking at snapshots, I can see the most recent captured version.
- Google Cache (cache:website.com in Google Search) – Google sometimes saves a snapshot that includes the last update.
- WHOIS & Domain Data – If a domain was recently updated or renewed, it might indicate recent activity.
5. Looking at Social Media & External Links
- Linked Social Media Posts – If the blog shares new articles on Twitter, Facebook, or LinkedIn, the latest share gives a clue.
- Backlinks & External Mentions – Tools like Google News or backlink checkers might reveal when the blog was last referenced.
6. Web Scraping or Automated Analysis
- Text Analysis for Dates – Searching for date patterns in the text (e.g., “Updated on March 2024”).
- Comparing Page Content Over Time – Running a diff check on stored versions of the page.
Would you like me to run a check on specific blog URLs using some of these methods?
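Some of these are easy to script without any LLM at all. The RSS idea, for instance: here is a quick sketch using the feedparser library (my illustration, not something the LLM produced, and the /feed path is a WordPress convention rather than a given):

import time
import feedparser  # pip install feedparser

def latest_feed_date(feed_url):
    """Return the newest entry date in an RSS/Atom feed, or None."""
    feed = feedparser.parse(feed_url)
    dates = [
        entry.published_parsed
        for entry in feed.entries
        if getattr(entry, "published_parsed", None)
    ]
    return time.strftime("%Y-%m-%d", max(dates)) if dates else None

# Many WordPress blogs expose a feed at /feed (an assumption, not universal)
print(latest_feed_date("https://example.com/feed"))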
I decided to test it at the prompt with a blog that was listed with a November 2024 date in my Last Modified column.
I got this response:
So I looked for myself, and I agreed with the LLM’s estimate. It looked like the old method was way too generous.
Now let’s see if we can automate this. Let’s try to replace our function get_last_modified(url), which just grabs a header, with a function estimate_last_update_with_llm(url).
I got ChatGPT to help me write the code to do this.
The first thing I noticed was that there was no place in my script where my API key was being defined. It turns out the OpenAI Python client loads credentials implicitly, so as long as my OPENAI_API_KEY environment variable is defined, I’m good to go. I checked, and mine was defined.
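In other words, nothing has to be passed in by hand; with the current openai package this is the whole setup (assuming the environment variable is set):

import openai

# No api_key argument needed: the client picks up OPENAI_API_KEY from the
# environment automatically.
client = openai.OpenAI()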
The next thing was that ChatGPT writes some bad, outdated code. I had to wrestle with it for a while to get a working API call.
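(For anyone hitting the same wall: the pre-1.0 openai package used a module-level call that was removed in v1, and that older pattern is a common source of broken generated code. The contrast, roughly:)

import openai

messages = [{"role": "user", "content": "Say hello."}]

# Outdated pattern (pre-1.0 openai package; removed in v1):
# response = openai.ChatCompletion.create(model="gpt-4", messages=messages)

# Current pattern (v1+):
client = openai.OpenAI()
response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)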
The next thing was that the API doesn’t surf the web. I guess there is some version that does, but not mine.
So while I can paste a URL into a chat window and the LLM will analyze the webpage, that doesn’t fly with the API.
You have to explicitly grab the web content and feed it into your API call.
This leads to the next point: you have to decide how much of the webpage you are going to fetch, and you really need to set a limit. 5,000 characters was suggested, but I played with 20,000 characters as well to try to improve results.
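The fetch-and-truncate step can look something like this (the markup-stripping is my tweak, not what the script below does; raw HTML spends most of the character budget on tags):

import requests
from bs4 import BeautifulSoup

def fetch_page_excerpt(url, limit=5000):
    """Fetch a page and keep the first `limit` characters of visible text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # Stripping tags first fits more actual content into the same budget
    text = BeautifulSoup(response.text, "html.parser").get_text(separator=" ", strip=True)
    return text[:limit]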
For some reason, my results were not as good as in my test with the chat window.
I played with the prompt and eventually got some results.
Mostly I got something like:
Eventually I got one test case to give me a long sentence that included the correct date. That was real progress.
But my next test gave me the same response over and over:
Looks like I’m outta credits.
I’m not sure this is really saving me time, and I don’t feel like pouring money into this. I do think this sort of thing is worth learning about, though, especially in the long run.
However, I think the next step will probably be to use more old-school methods to estimate the date of last update. For instance, plain regular expression matching for dates could be effective on blogs that put dates on their posts.
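A rough sketch of what I mean (the patterns are illustrative; real blogs will need more):

import re

# Two common date shapes: "March 15, 2024" and ISO-style "2024-03-15"
DATE_PATTERNS = [
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},\s+\d{4}\b",
    r"\b\d{4}-\d{2}-\d{2}\b",
]

def find_dates(text):
    """Return every date-looking string found in the page text."""
    hits = []
    for pattern in DATE_PATTERNS:
        hits.extend(re.findall(pattern, text))
    return hits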
It was a little disappointing that the LLM didn’t just magically give me all the dates, no matter how messy the content. The reality is that this is a heavy approach that uses a lot of compute, and it isn’t surprising that there is a quota on how much compute they let me use.
I could pay for more. My LLM bill is still pretty small, but I think it’s time to try something different.
Here is the last version of my code. It does sort of work on some blog URLs:
from bs4 import BeautifulSoup
import re
import csv
import requests
import openai
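# Disabled debug check: prints the API key environment variable and exits.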
'''
import os
print("API Key:", os.getenv("OPENAI_API_KEY"))
exit()
'''
def estimate_last_update_with_llm(url):
"""Estimate the last update date of a webpage using ChatGPT."""
try:
# Fetch the page content
response = requests.get(url, timeout=10)
response.raise_for_status()
page_content = response.text[:5000] # Limit to first 5000 characters
# Construct prompt for ChatGPT
'''
prompt = (
"Based on the following webpage content, estimate the most recent update date if possible. "
"Look for any visible timestamps, published or modified dates, or context that might indicate recency.\n\n"
f"Content:\n{page_content}\n\nEstimated last update date:"
)
'''
prompt = (
"Based on the following URL, estimate the most recent update date if possible. "
"Look for any visible timestamps, published or modified dates, or context that might indicate recency.\n\n"
f"URL:\n{url}\n\nEstimated last update date:"
)
# Initialize OpenAI client
client = openai.OpenAI()
# Call OpenAI API
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "system", "content": "You are an expert in analyzing webpage content."},
#{"role": "user", "content": prompt}]
{"role": "user", "content": f"Based on this webpage, when was it last updated?\n\n{page_content}"}
])
# Extract the estimated date from response
estimated_date = response.choices[0].message.content.strip()
return estimated_date
except requests.RequestException:
return "Error fetching the webpage"
except Exception as e:
return f"LLM estimation failed: {str(e)}"
def get_last_modified(url):
"""Retrieve the Last-Modified header from a URL if available."""
try:
response = requests.head(url, timeout=5)
return response.headers.get("Last-Modified", "Unknown")
except requests.RequestException:
return "Unknown"
def extract_urls_from_html(html_path):
"""Extract all URLs, their link text, and last modified dates from an HTML file."""
with open(html_path, "r", encoding="utf-8") as file:
soup = BeautifulSoup(file, "html.parser")
url_data = set() # Use a set to avoid duplicates
# Extract URLs from <a href="…"> and their link text
count = 0
for link in soup.find_all("a", href=True):
url = link["href"].strip()
link_text = link.get_text(strip=True) # Extract visible link text
last_modified = estimate_last_update_with_llm(url)
url_data.add((url, link_text, last_modified))
print(f"{count} : {url} : {link_text} : {last_modified}")
count += 1
# Extract URLs appearing as plain text
url_pattern = re.compile(r"https?://[^\s\"'>]+")
for text in soup.stripped_strings:
for match in url_pattern.findall(text):
last_modified = estimate_last_update_with_llm(match)
url_data.add((match, "", last_modified)) # No link text for plain text URLs
return list(url_data)
def save_urls_to_csv(url_data, output_file):
"""Save extracted URLs, their link text, and last modified dates to a CSV file."""
with open(output_file, "w", newline="", encoding="utf-8") as file:
writer = csv.writer(file)
writer.writerow(["URL", "Link Text", "Last Modified"]) # Header row
writer.writerows(url_data)
print(f"URLs saved to {output_file}")
if __name__ == "__main__":
html_path = "test_leads.html" #just do a small list for testing
extracted_urls = extract_urls_from_html(html_path)
if extracted_urls:
output_file = "lead_urls.csv"
print("Extracted URLs:")
for url, text, last_modified in extracted_urls:
print(f"URL: {url}, Link Text: {text}, Last Modified: {last_modified}")
save_urls_to_csv(extracted_urls, output_file)
else:
print("No URLs found in the HTML file.")