How to scrape a website using Browserbase, Puppeteer, OpenAI and Trigger.dev
Co-founder, Trigger.dev
What you'll build
In this tutorial, you'll create a Trigger.dev task that scrapes the top 3 articles from Hacker News using Browserbase and Puppeteer, summarizes them with ChatGPT, and emails you a nicely formatted summary every weekday at 9AM using Resend.
Before you begin…
Check out this 4-minute video overview of this tutorial to get an idea of what we'll be building.
Prerequisites
- Create a Trigger.dev account and set up a new project
- Create a Browserbase account
- Install Puppeteer on your local machine
- Create an OpenAI account
- Create a Resend account
Warning
When web scraping, you MUST use a proxy to comply with our terms of service. Direct scraping of third-party websites without the site owner’s permission using Trigger.dev Cloud is prohibited and will result in account suspension.
Configure your environment variables
Log in to each of the services and grab the API keys. Add them to your local .env file so we can run a local test of our task later on.
BROWSERBASE_API_KEY: "<your Browserbase API key>"
OPENAI_API_KEY: "<your OpenAI API key>"
RESEND_API_KEY: "<your Resend API key>"
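A missing key usually only surfaces when the task first calls the corresponding service, so it can help to check for all three up front. The following is a minimal sketch of such a check; `REQUIRED_KEYS` and `missingEnvVars` are hypothetical names introduced here for illustration and are not part of Trigger.dev or any other SDK.

```typescript
// Hypothetical helper (not part of any SDK): lists which of the three keys
// used in this tutorial are absent or blank in a given environment object.
const REQUIRED_KEYS = ["BROWSERBASE_API_KEY", "OPENAI_API_KEY", "RESEND_API_KEY"];

function missingEnvVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_KEYS.filter((key) => {
    const value = env[key];
    // Treat undefined and whitespace-only values as missing.
    return value === undefined || value.trim() === "";
  });
}
```

Calling `missingEnvVars(process.env)` at the start of a task and throwing when the result is non-empty would make a run fail early with a clear message.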
Install Puppeteer
Before you can run your task locally, you need to install Puppeteer on your local machine. Check out the Puppeteer installation guide for more information.
npm i puppeteer
Write your task code
Create a new file called scrape-hacker-news.ts in the trigger folder in your project and add the following code.
The best way to understand how the following two tasks work is by following the comments, but here's a quick overview:
- The parent task, summarizeHackerNews, is set to run every weekday at 9AM using the cron property.
- It connects to Browserbase to proxy the scraping of the Hacker News articles.
- It then gets the title and link of the top 3 articles on Hacker News.
- Next, it triggers a child task called scrapeAndSummarizeArticle for each of our 3 articles using the batchTriggerAndWait method. You can learn more about batching in the docs.
- The child task, scrapeAndSummarizeArticle, scrapes the content of each article using Puppeteer and summarizes it using ChatGPT.
- The parent task waits for all of the child tasks to complete before continuing.
- Finally, the parent task sends you an email summary using Resend and React Email, built from the summaries generated by the child tasks.
Ensure you replace the placeholder email addresses with your own.
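The full task code depends on the Trigger.dev SDK, Puppeteer, OpenAI and Resend. As a dependency-free sketch of the data handling between those calls, the helpers below show how the top 3 articles might be picked, wrapped for a batch trigger, and how the child results might be collected. Every name here (`Article`, `RunResult`, `browserbaseEndpoint`, `topArticles`, `toBatchItems`, `collectSummaries`) is an assumption made for illustration. The Browserbase connection URL and the `{ payload }` / `{ ok, output }` shapes reflect my understanding of those services and should be checked against their current docs.

```typescript
// One Hacker News front-page entry, as the parent task would collect it.
interface Article {
  title: string;
  link: string;
}

// Assumed shape of one child-run result: a successful output or an error.
type RunResult =
  | { ok: true; output: string }
  | { ok: false; error: unknown };

// Assumption: Browserbase accepts a WebSocket connection at this URL with the
// API key as a query parameter. The result would be passed to Puppeteer's
// connect() as the browserWSEndpoint option.
function browserbaseEndpoint(apiKey: string): string {
  return "wss://connect.browserbase.com?apiKey=" + encodeURIComponent(apiKey);
}

// Keep the first `count` entries that have both a title and a link.
function topArticles(all: Article[], count: number = 3): Article[] {
  const usable = all.filter((a) => a.title.trim() !== "" && a.link.trim() !== "");
  return usable.slice(0, count);
}

// Wrap each article in a { payload } item, one per child run.
function toBatchItems(articles: Article[]): { payload: Article }[] {
  return articles.map((article) => ({ payload: article }));
}

// Keep only the summaries from child runs that succeeded.
function collectSummaries(runs: RunResult[]): string[] {
  const summaries: string[] = [];
  for (const run of runs) {
    if (run.ok) {
      summaries.push(run.output);
    }
  }
  return summaries;
}
```

In the parent task, the output of `toBatchItems` would be what gets handed to `scrapeAndSummarizeArticle.batchTriggerAndWait`, and `collectSummaries` would run on the results before the email is sent. Dropping failed runs instead of failing the whole email is a design choice of this sketch, not something the tutorial prescribes.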
Create your React Email template
Install React Email and the @react-email/components package:
npm i @react-email/components
Create a new file called summarize-hn-email.tsx in your project and add the following code. This is a simple but nicely styled email template that you can customize to your liking.
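To make the data passed to the template concrete, here is a dependency-free stand-in that renders the same kind of input to a plain HTML string. It is deliberately not React Email: it swaps in simple string building so the example runs on its own. The `ArticleSummary` shape and the function names are assumptions, so adjust them to match the props of your actual template.

```typescript
// Assumed input for the email: one entry per summarized article.
interface ArticleSummary {
  title: string;
  link: string;
  summary: string;
}

// Escape characters that have special meaning in HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render a heading followed by one list item per article.
function renderSummaryHtml(articles: ArticleSummary[]): string {
  const items = articles.map(
    (a) =>
      '<li><a href="' + escapeHtml(a.link) + '">' + escapeHtml(a.title) + "</a>" +
      "<p>" + escapeHtml(a.summary) + "</p></li>"
  );
  return "<h1>Hacker News summary</h1><ul>" + items.join("") + "</ul>";
}
```

A React Email component would take the same list of articles as props and return JSX instead of a string, and it handles escaping for you.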
Do a test run locally
Once you've written your task code, you can do a test run locally to make sure everything is working as expected. Run the Trigger.dev dev command to start the background worker:
npx trigger.dev@latest dev
Next, go to the Trigger.dev dashboard and click Test in the left-hand side menu (1). Choose DEV from the environment options at the top of the page (2), select your task (3), click Now to ensure your task runs immediately (4), and then click the Run test button to trigger your test (5).
You should see your task run and an email sent to you with the Hacker News summary.
Note that some Hacker News articles might not be accessible. Each task makes up to 3 attempts before giving up and returning an error. Possible reasons include: the main content of the article isn't accessible via the article HTML element, the page has a paywall, or the Hacker News post links to a video file. Feel free to edit the task code to handle these cases.
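As one possible starting point for handling two of these cases, the sketch below skips links that look like media files and falls back to the page body when no article element text is found. The helper names and the extension list are hypothetical and far from exhaustive. Paywalls cannot be detected this way and would need separate handling.

```typescript
// File extensions that usually mean the link is not an HTML article.
// Illustrative only; extend it to suit the links you encounter.
const NON_HTML_EXTENSION = /\.(mp4|webm|mov|mp3|pdf)$/;

// Returns false for invalid URLs and for paths ending in a known media extension.
function isLikelyHtmlPage(link: string): boolean {
  let pathname: string;
  try {
    pathname = new URL(link).pathname.toLowerCase();
  } catch {
    return false; // not a valid absolute URL
  }
  return !NON_HTML_EXTENSION.test(pathname);
}

// Prefer the text of the article element; fall back to the whole body text
// when the article element is missing (null) or empty.
function pickMainText(articleText: string | null, bodyText: string): string {
  const fromArticle = articleText === null ? "" : articleText.trim();
  return fromArticle !== "" ? fromArticle : bodyText.trim();
}
```

The parent task could filter links with `isLikelyHtmlPage` before triggering child tasks, so the 3 attempts are not spent on pages that can never be summarized.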
Deploy your task to the Trigger.dev cloud
Once you're happy with your task code, you can deploy it to the Trigger.dev cloud. To do this, you'll first need to add Puppeteer to your build configuration.
Add Puppeteer to your build configuration
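A minimal sketch of what this could look like in trigger.config.ts, assuming the puppeteer build extension exported by the @trigger.dev/build package (which you may need to install as a dev dependency) and a placeholder project ref. Check the Trigger.dev docs for the options your SDK version supports.

```typescript
import { defineConfig } from "@trigger.dev/sdk/v3";
// Assumption: the Puppeteer build extension lives at this import path.
import { puppeteer } from "@trigger.dev/build/extensions/puppeteer";

export default defineConfig({
  // Placeholder: replace with the project ref from your Trigger.dev dashboard.
  project: "<project ref>",
  // Your other config settings...
  build: {
    // Makes Puppeteer and a browser available in the deployed build.
    extensions: [puppeteer()],
  },
});
```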
Add your environment variables to the Trigger.dev project
Previously, we added our environment variables to the .env file. Now we need to add them to the Trigger.dev project settings so our deployed task can access them. You can copy all of the environment variables from your .env file at once, and paste them all into the Environment variables page in the Trigger.dev dashboard.
Deploy your task
Finally, you can deploy your task to the Trigger.dev cloud by running the trigger.dev@latest deploy command.
npx trigger.dev@latest deploy
Run your task in Production
Once your task is deployed, it will run every weekday at 9AM. You can check the status of your task in the Trigger.dev dashboard by clicking on the Runs tab in the left-hand side menu.
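If you want to sanity-check when runs should appear, "every weekday at 9AM" corresponds to a standard five-field cron pattern. The sketch below spells out that pattern and a small check of which moments it covers. Both the pattern string and the `firesAt` helper are assumptions for illustration. The check reads the date in UTC, whereas your schedule may be configured with a different timezone.

```typescript
// Assumed cron pattern for "every weekday at 9AM":
// minute 0, hour 9, any day of month, any month, day of week 1-5 (Mon-Fri).
const WEEKDAYS_AT_9AM = "0 9 * * 1-5";

// Returns true when `date`, read in UTC, is a moment that pattern would fire.
function firesAt(date: Date): boolean {
  const day = date.getUTCDay(); // 0 = Sunday ... 6 = Saturday
  return (
    date.getUTCMinutes() === 0 &&
    date.getUTCHours() === 9 &&
    day >= 1 &&
    day <= 5
  );
}
```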
If you want to manually trigger your task in production, you can repeat the steps from your local DEV test earlier, but this time select PROD from the environment options at the top of the page.