Introducing - GEM
15 Jul 2024 · 📖 in 4 minutes
An app to look for things in places
Yesterday I pushed the first version of Gem to GitHub.
I use Gem to poll websites for a selector and then report back, via email, if any matches are found.
I wanted to build a small app to solve a specific problem - check if there's an item in stock on a website and be notified when that changes.
How's it work?
Powering Gem are scraper, lettre and reqwest (and of course Tokio for async).
The "main" flow is as follows:
- Read env vars (or .env file via dotenvy)
- Download a page for a given URI (usually a website over HTTPS but if reqwest can grab it so can Gem)
- Parse the content either as plain text or HTML (for now)
- Search for matches
- Email positive matches
I say "main" flow because this is all it does - once the emails have been sent the app ends.
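The first step of that flow can be sketched roughly as below. This is a minimal illustration rather than Gem's actual code: the variable names are taken from the scheduling config later in this post, while the Config struct, the ContentType enum, the from_lookup helper and the "text" value for CONTENT_TYPE are all assumptions made for the example. It sticks to the standard library, so the .env handling that dotenvy provides is left out.

```rust
use std::env;

/// How the downloaded page should be interpreted before searching.
/// The "text" spelling is an assumption; the post only shows "html".
#[derive(Debug, PartialEq)]
enum ContentType {
    Text,
    Html,
}

/// Settings for a single run (hypothetical struct, illustrative fields only).
#[derive(Debug, PartialEq)]
struct Config {
    target_url: String,
    selector: String,
    content_type: ContentType,
    email_to: String,
}

impl Config {
    /// Builds a Config from any key lookup, so the same code works with
    /// the real process environment or with a fixture in a test.
    fn from_lookup<F>(lookup: F) -> Result<Config, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        // Turn a missing variable into a readable error message.
        let required = |key: &str| lookup(key).ok_or_else(|| format!("missing {}", key));

        let content_type = match required("CONTENT_TYPE")?.to_lowercase().as_str() {
            "html" => ContentType::Html,
            "text" => ContentType::Text,
            other => return Err(format!("unsupported CONTENT_TYPE: {}", other)),
        };

        Ok(Config {
            target_url: required("TARGET_URL")?,
            selector: required("SELECTOR")?,
            content_type,
            email_to: required("EMAIL_TO")?,
        })
    }

    /// Reads the settings from the process environment.
    fn from_env() -> Result<Config, String> {
        Config::from_lookup(|key| env::var(key).ok())
    }
}
```

Taking a lookup closure rather than calling std::env::var directly is just a convenience for testing; a real run would call Config::from_env() once at startup and exit early on an error.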
Downloading
There's not a lot to write home about in terms of the download: a URL is constructed and passed to reqwest to do its thing. Errors are handled at each stage, and the end result is a string representation of the contents of the page.
The only slightly interesting bit is that during testing I hit some CDN DDoS protection when I first ran the app against a "real" website. To work around this I've set the User-Agent to that of one of the Google bots, and this seemed to sort it out.
I suspect some websites only allow the Google bot UA from Google's known IP ranges, so I don't expect the Gem implementation to be foolproof. I'll likely add some more config to allow this to be changed more easily.
Content
Downloading a file and matching a search string is all pretty straightforward. However, given that the content in most cases is going to be HTML, I figured it would make sense to treat it as such.
In the world of HTML we can combine CSS selectors to narrow in on a specific part of the page, which gives extremely granular control over what is matched and exported.
For example in the markup below:
<div>
<p class="dinner">Sandwich</p>
</div>
<div>
<p class="greeting">Hello</p>
<p>World!</p>
</div>
The following selector can be used to match the div that contains a p with the .greeting class:
div:has(p.greeting)
While this is a fairly contrived example I wanted to demonstrate that the selector as an interface for search is supremely powerful.
In short Gem allows you to arbitrarily query websites with a selector and get an email of any matches.
Matches
The matches are emailed in plain text along with a simple count. Perhaps even better than the count, because email can also be sent as HTML, the actual HTML of the matched elements can be sent to an address of your choosing.
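For the plain-text case, the count and the message could look something like the sketch below. The function names and the wording of the message are made up for illustration and are not taken from Gem; the only behaviour borrowed from the flow above is that nothing is sent when there are no matches.

```rust
/// Counts non-overlapping occurrences of `needle` in `page`.
fn count_matches(page: &str, needle: &str) -> usize {
    if needle.is_empty() {
        // An empty pattern would otherwise match at every character boundary.
        return 0;
    }
    page.matches(needle).count()
}

/// Returns a plain-text email body, or None when there is nothing to report.
/// The message format is hypothetical.
fn plain_text_body(url: &str, needle: &str, count: usize) -> Option<String> {
    if count == 0 {
        return None;
    }
    Some(format!("Found {} match(es) for \"{}\" at {}", count, needle, url))
}
```

Returning an Option keeps the "only email positive matches" rule in one place: the caller sends an email only when it gets Some back.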
How do I schedule it?
If you take a look at the repo for Gem you'll notice that there's a Dockerfile included. I use a small setup at home to deploy this app as a container.
Scheduling is taken care of by ofelia which, with very minor config, pulls and runs the Gem container on a cron-like schedule:
ofelia.job-run.gem-mon-pixel.schedule: "@every 10m"
ofelia.job-run.gem-mon-pixel.image: "hub.example.com/gem-mon:0.2.0"
ofelia.job-run.gem-mon-pixel.environment: '["TARGET_URL=https://example.com","SELECTOR=a","CONTENT_TYPE=html","SMTP_RELAY=smtp.example.com","SMTP_USER=user@example.com","EMAIL_FROM=Gem <gem@example.com>","EMAIL_TO=Matt <matt@example.com>","SMTP_PASS=example-pass"]'
The huge block of text at the end is the set of environment vars used to configure the app, and while it's not the nicest interface it's more than enough to get started.
In closing
All in all I'm pretty pleased with the way this has come out. I've deployed it to check on two different websites and can see it being useful for checking up on content changes, e.g. items going in and out of stock.
Another use case is monitoring my blog and checking if a given link is visible on the main page - this would require a slight change in behaviour so that it checks for a selector and alerts if it's not found.
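One way that change might look is a small switch on when to alert. This is purely a sketch of an idea and not something Gem does today; AlertMode and should_alert are hypothetical names.

```rust
/// When a notification should be sent. `WhenFound` mirrors the behaviour
/// described in this post; `WhenMissing` is the hypothetical new mode.
#[derive(Debug, Clone, Copy, PartialEq)]
enum AlertMode {
    WhenFound,
    WhenMissing,
}

/// Decides whether to send an email, given how many matches were found.
fn should_alert(mode: AlertMode, match_count: usize) -> bool {
    match mode {
        AlertMode::WhenFound => match_count > 0,
        AlertMode::WhenMissing => match_count == 0,
    }
}
```

The mode could then be picked up from one more environment variable, leaving the download and matching steps untouched.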
Thanks for reading and hit me up if you try out Gem - as ever you can reach me on Mastodon.
First appeared on Trusty Interior, last update 2 Nov 2024