Little bit of automation for my leads: Part 1

Yesterday, in my post More SEMRush, I wrote about trying to automate more of my manual process of finding suitable backlinks to my blog.

Today, I was back at the manual process, and the first thing I noticed was that I really need some kind of tracking. When I go to a blog and look around, I often see a list of other blogs they follow. I treat these as leads.

I’ve been just copying them and pasting them into a Microsoft Word document. I’d like to automate the process of cleaning that up.

So I opened VS Code and created a script called extract_urls.py. I used CTRL + ` to open a terminal. I ran extract_urls.py from that terminal, and it just sat there.

I am new to this editor, and I am not used to Python scripts just sitting there with no output unless they are stuck in an infinite loop or something.

This is not an output-free infinite loop.

Terminal showing the Python program hanging and one problem identified

See that blue circle with the numeral “1”? Looks like that is my problem.

Clicking on that shows that I couldn’t import the docx module.

"Problems" tab shows that the "docx" import didn't work.

I actually wish it would just print the error in the terminal like I am used to. I guess it just takes some getting used to.

After trying again, it actually behaved like I normally expect and threw an exception to the screen. I’m not even sure how I got that to happen.

VS Code terminal now shows exception to the terminal

Anyhow, the fix is clear: I need to install the docx module. But I always like to make a copy of my conda env before installing any new packages, which raises the question… which environment am I even running?

I am so used to seeing the prompt tell you the env, but not in this terminal. I guess I can find out with “conda info --envs”

That actually lists all your environments, but places an asterisk next to the currently activated one.
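
Side note: if you’d rather check from inside Python itself, here’s a quick sketch. It leans on the CONDA_DEFAULT_ENV variable that conda sets when it activates an environment, plus the interpreter path as a cross-check, which matters in VS Code, where the selected interpreter may not match what the terminal prompt suggests.

import os
import sys

# conda exports CONDA_DEFAULT_ENV when it activates an environment
print(os.environ.get("CONDA_DEFAULT_ENV", "no conda env detected"))

# The interpreter actually running this script; its path reveals which
# environment is in use, independent of what the prompt shows
print(sys.executable)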

Ok. Now that I know that I am using my usual “best_env”, I will clone that before installing docx.

conda create --name best_env_todays_date --clone best_env

BTW, I use “best_env” to just mean one that I have installed a bunch of stuff on, and so far it works for everything I want to do.

Now “conda install docx” produced a PackagesNotFoundError, so I guess we need to try “pip install docx”.

That worked I guess, but looks like it installed a lot:

Successfully built docx

Try running the script again. This time I get ModuleNotFoundError: No module named ‘exceptions’.

Hmmm… I didn’t even use the exceptions module, but it looks like docx does. I guess I have to install that now as well.

No luck with “conda install exceptions” or “pip install exceptions”. Found this Stack Overflow post: Unable to pip install exceptions Package.

I guess try “pip install pyceptions”. That didn’t work either.

Found another Stack Overflow post: When import docx in python3.3 I have error ImportError: No module named ‘exceptions’.

Apparently I should never have pip installed docx. That old docx package targets Python 2, where exceptions was a built-in module that no longer exists in Python 3. For Python 3, I should have done “pip install python-docx”.

Ok. Let’s try:
“pip uninstall docx”
“pip install python-docx”

Well… That was progress. I now have a new error:
docx.opc.exceptions.PackageNotFoundError: Package not found at ‘leads.docx’
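
For the record, here’s roughly what I was attempting. This is a minimal sketch, not my exact script, and it reuses the same URL-regex approach I ended up using later:

import re
from docx import Document  # installed as python-docx, imported as docx

def extract_urls_from_docx(docx_path):
    """Pull plain-text URLs out of a Word document's paragraphs."""
    url_pattern = re.compile(r"https?://[^\s\"'>]+")
    urls = set()
    for para in Document(docx_path).paragraphs:
        urls.update(url_pattern.findall(para.text))
    return list(urls)

if __name__ == "__main__":
    print(extract_urls_from_docx("leads.docx"))

From what I can tell, that PackageNotFoundError usually just means python-docx couldn’t open the file at that path: either leads.docx wasn’t in the terminal’s working directory, or it wasn’t a valid .docx package.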

I could keep going, but I want to be practical. Is this really saving me time?

It just seems like trying to make this docx module work may not be worth it. Can’t I just paste into some kind of HTML editor instead?

So I tried just CTRL + A followed by CTRL + C in the Microsoft Word doc, then started a new post in WordPress. In the body, just CTRL + V. Save as a draft and preview. Then right-click on the preview and view the document source. Save that as “leads.html”.

Then I changed my code to extract from an HTML file instead of a Microsoft Word document.

Now I get to use Beautiful Soup, which I already have installed.

This worked so much faster.

I now have all my lead URLs in a text file. I suppose it would be better to put them into a CSV (there’s a quick sketch of that after the script below). But at least I have the beginnings of a system for my leads.

from bs4 import BeautifulSoup
import re

def extract_urls_from_html(html_path):
    """Extract all URLs from an HTML file, including <a> tag links and plain text URLs."""
    with open(html_path, "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file, "html.parser")

    urls = set()  # Use a set to avoid duplicates

    # Extract URLs from <a href="…">
    for link in soup.find_all("a", href=True):
        urls.add(link["href"])

    # Extract URLs appearing as plain text
    url_pattern = re.compile(r"https?://[^\s\"'>]+")
    for text in soup.stripped_strings:
        urls.update(url_pattern.findall(text))

    return list(urls)

def save_urls_to_file(urls, output_file):
    """Save extracted URLs to a text file."""
    with open(output_file, "w", encoding="utf-8") as file:
        for url in urls:
            file.write(url + "\n")
    print(f"URLs saved to {output_file}")

if __name__ == "__main__":
    html_path = "leads.html"  # Change this to your HTML file path
    extracted_urls = extract_urls_from_html(html_path)

    if extracted_urls:
        output_file = "lead_urls.txt"
        print("Extracted URLs:")
        for url in extracted_urls:
            print(url)
        # Save to file
        save_urls_to_file(extracted_urls, output_file)
    else:
        print("No URLs found in the HTML file.")
