Little more automation for my leads: Part 2


This is a continuation of my previous post, Little bit of automation for my leads.

First of all, I want to point out a silly thing I did with my last version.

Since I was having a hard time getting the docx module set up with all of its dependencies, I decided to short-circuit that and just use HTML instead of a Microsoft Word document.

So I copied all my leads (basically formatted hyperlinks) from Microsoft Word and pasted them into an HTML editor. I used a WordPress post for the HTML editor. Of course, I never published that post.

That whole rigamarole was unnecessary. You can actually just save a Microsoft Word document straight out to a .htm file.

Saving straight out of Microsoft Word is still sort of a hassle, though, mainly because it doesn’t encode as UTF-8 by default.

I managed to force it to encode as UTF-8 by choosing “More Options” before saving. In the new window, there is a “Tools” drop-down to the left of the new “Save” button. In there you can change the encoding.

Another improvement I want to make is saving this data out to a CSV instead of a text document. Then we can capture more than one field per lead.

Ultimately, I will surely want several pieces of information, if not more, for each lead.

For instance, I don’t just want the URL, but also the title of the website. I also think I’m gonna want a field to record when the website was last updated, as I keep finding what look like great blogs, but they haven’t been updated in 9 years or something.

Let’s start by just extracting the URL and the link text and saving them both into a CSV instead of a text file.

We can use link.get_text(), a BeautifulSoup method, to extract the visible text of each link, and we can import the csv module to write to a CSV file.

Here’s the new code:
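Roughly, it is the same script as last time, but with link.get_text() pulling in the visible link text and csv.writer writing both columns out. A sketch of that version, assuming the same leads-filtered-utf8.htm input file:

from bs4 import BeautifulSoup
import re
import csv

def extract_urls_from_html(html_path):
    """Extract all URLs and their link text from an HTML file."""
    with open(html_path, "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file, "html.parser")
    url_data = set()  # Use a set to avoid duplicates

    # Extract URLs from <a href="..."> tags along with their visible link text
    for link in soup.find_all("a", href=True):
        url = link["href"].strip()
        link_text = link.get_text(strip=True)
        url_data.add((url, link_text))

    # Extract URLs appearing as plain text
    url_pattern = re.compile(r"https?://[^\s\"'>]+")
    for text in soup.stripped_strings:
        for match in url_pattern.findall(text):
            url_data.add((match, ""))  # No link text for plain-text URLs

    return list(url_data)

def save_urls_to_csv(url_data, output_file):
    """Save extracted URLs and their link text to a CSV file."""
    with open(output_file, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["URL", "Link Text"])  # Header row
        writer.writerows(url_data)

if __name__ == "__main__":
    extracted_urls = extract_urls_from_html("leads-filtered-utf8.htm")
    save_urls_to_csv(extracted_urls, "lead_urls.csv")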

So this is getting better. We can add fields for whether a lead has been contacted yet, their email address, or a priority ranking.
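For instance, the header row and each record would simply grow a few more columns (these column names are just placeholders I might use; nothing fills them in automatically yet):

# Hypothetical extra columns for tracking outreach; values filled in by hand for now
writer.writerow(["URL", "Link Text", "Contacted", "Email", "Priority"])
writer.writerow([url, link_text, "No", "", ""])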

Next I’d like to see if I can grab the age of the webpage, essentially when it was last updated. Many of these leads are blogs that haven’t been updated in years. I am not sure how much I want to bother with those.

Here is the modified code:

from bs4 import BeautifulSoup
import re
import csv
import requests

def get_last_modified(url):
    """Retrieve the Last-Modified header from a URL if available."""
    try:
        response = requests.head(url, timeout=5)
        return response.headers.get("Last-Modified", "Unknown")
    except requests.RequestException:
        return "Unknown"

def extract_urls_from_html(html_path):
    """Extract all URLs, their link text, and last modified dates from an HTML file."""
    with open(html_path, "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file, "html.parser")
    url_data = set()  # Use a set to avoid duplicates

    # Extract URLs from <a href="…"> and their link text
    count = 0
    for link in soup.find_all("a", href=True):
        url = link["href"].strip()
        link_text = link.get_text(strip=True)  # Extract visible link text
        last_modified = get_last_modified(url)
        url_data.add((url, link_text, last_modified))
        print(str(count) + ' : ' + url + " : " + link_text + " : " + last_modified)
        count = count + 1

    # Extract URLs appearing as plain text
    url_pattern = re.compile(r"https?://[^\s\"'>]+")
    for text in soup.stripped_strings:
        for match in url_pattern.findall(text):
            last_modified = get_last_modified(match)
            url_data.add((match, "", last_modified))  # No link text for plain text URLs

    return list(url_data)

def save_urls_to_csv(url_data, output_file):
    """Save extracted URLs, their link text, and last modified dates to a CSV file."""
    with open(output_file, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["URL", "Link Text", "Last Modified"])  # Header row
        writer.writerows(url_data)
    print(f"URLs saved to {output_file}")

if __name__ == "__main__":
    html_path = "leads-filtered-utf8.htm"  # Change this to your HTML file path
    extracted_urls = extract_urls_from_html(html_path)
    if extracted_urls:
        output_file = "lead_urls.csv"
        print("Extracted URLs:")
        for url, text, last_modified in extracted_urls:
            print(f"URL: {url}, Link Text: {text}, Last Modified: {last_modified}")
        save_urls_to_csv(extracted_urls, output_file)
    else:
        print("No URLs found in the HTML file.")

The function get_last_modified(url) checks the “Last-Modified” header.

Looking through my results, I see that most of them come back as “Unknown.” It also looks like we saved these dates as strings, which aren’t that easy to sort by.
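One way to make those dates sortable later, just as a sketch and assuming the servers that do respond use the standard HTTP date format, would be to parse the header string with the standard library before writing it out:

from email.utils import parsedate_to_datetime

def parse_http_date(value):
    """Convert a Last-Modified header string to a datetime, or None if it can't be parsed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # covers "Unknown" and anything malformed

# Example: parse_http_date("Wed, 21 Oct 2015 07:28:00 GMT") gives a sortable datetime object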

This is a step in the right direction, but I don’t think the “Last Modified” field is all that useful in its current state.

I mean, I was able to weed out a couple of really old webpages that haven’t been updated in forever, but not many.

Let’s try to improve this part next time.

