This is a continuation of my previous post, "Little bit of automation for my leads."
First of all, I want to point out a silly thing I did with my last version.
Since I was having a hard time getting the docx module set up with all of its dependencies, I decided to short-circuit that and just use HTML instead of a Microsoft Word document.
So I copied all my leads (basically formatted hyperlinks) from Microsoft Word and pasted them into an HTML editor. I used a WordPress post as the HTML editor. Of course, I never published that post.
That whole rigamarole was unnecessary. You can actually just save a Microsoft Word document straight out to a .htm file.
Saving straight out of Microsoft Word is still sort of a hassle, though, mainly because Word doesn't encode the file as UTF-8 by default.
I managed to force it to encode as UTF-8 by choosing "More Options" before saving. In the window that opens, there is a "Tools" drop-down to the left of the "Save" button, and in there you can change the encoding.
Another improvement I want to make is saving this data out to a CSV instead of a text document. Then we can capture more than one field per lead.
Ultimately, I will surely want several pieces of information for each lead.
For instance, I don't just want the URL, but also the title of the website. I also think I'm gonna want a field to record when the website was last updated, since I keep finding what look like great blogs that haven't been updated in nine years or something.
Let's start by just extracting the URL and the link text and saving them both to a CSV instead of a text file.
We can use link.get_text(), a BeautifulSoup method, to extract the visible text from each anchor tag, and we can import the csv module to write to a CSV file.
Here’s the new code:
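Roughly, it goes like this. The file names leads.htm and leads.csv are just placeholders for whatever you're working with, and I'm using BeautifulSoup's built-in html.parser:

```python
import csv
from bs4 import BeautifulSoup

# Read the HTML file saved out of Microsoft Word.
# "leads.htm" is a placeholder; use whatever you named your file.
with open("leads.htm", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Write one row per lead: the visible link text and the URL.
with open("leads.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "URL"])  # header row
    for link in soup.find_all("a"):
        url = link.get("href")
        if url:  # skip anchors that have no href attribute
            writer.writerow([link.get_text(strip=True), url])
```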
So this is getting better. We can add fields for whether they have been contacted yet, their email address, or some kind of priority ranking.
Next I'd like to see if I can grab the age of each webpage: essentially, when it was last updated. Many of these leads are blogs that haven't been updated in years, and I'm not sure how much I want to bother with those.
Here is the modified code:
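Again as a sketch, this version assumes the requests library for the HTTP call, with a ten-second timeout so one dead site can't hang the whole run:

```python
import csv
import requests
from bs4 import BeautifulSoup

def get_last_modified(url):
    """Return the Last-Modified header for a URL, or 'Unknown' if we can't get it."""
    try:
        # A HEAD request fetches just the headers, not the whole page.
        response = requests.head(url, timeout=10, allow_redirects=True)
        return response.headers.get("Last-Modified", "Unknown")
    except requests.RequestException:
        # Dead links, timeouts, SSL errors, etc. all land here.
        return "Unknown"

with open("leads.htm", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

with open("leads.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "URL", "Last Modified"])
    for link in soup.find_all("a"):
        url = link.get("href")
        if url:
            writer.writerow([link.get_text(strip=True), url, get_last_modified(url)])
```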
The function get_last_modified(url) checks the “Last-Modified” header.
Looking through my results, I see that most of them come back as "Unknown." That's not too surprising, since plenty of servers never send a Last-Modified header for dynamically generated pages. It also looks like we saved these dates as strings, which aren't that easy to sort by.
This is a step in the right direction, but I don't think the "Last Modified" field is all that useful in its current state.
I mean, I was able to weed out a couple of really old webpages that haven’t been updated in forever, but not many.
Let’s try to improve this part next time.