As an ultra marathon enthusiast, I often find it challenging to estimate my finish time for races I haven’t tried before. To tackle this, my coach suggested a helpful method: analyze runners who have completed both a race I’ve run and the one I’m targeting to derive insights into potential finish times. However, manually sifting through race results can be a tedious and time-consuming task.
This insight led to the creation of Race Time Insights—a tool designed to automate the comparison of race results by identifying athletes with experience in both events. The application retrieves race results from platforms such as UltraSignup and Pacific Multisports, enabling runners to input URLs for two races and discover how other participants performed in both.
Building this tool showed me firsthand what DigitalOcean’s App Platform can do. Running Puppeteer with headless Chrome in Docker containers let me concentrate on solving problems for runners while App Platform handled the infrastructure, resulting in a robust, scalable application for the running community.
To share this experience with fellow developers, I aimed to create a guide demonstrating how to use these technologies—Puppeteer, Docker containers, and DigitalOcean App Platform. I chose to focus on Project Gutenberg for this tutorial because of its extensive publicly available book collection and clear terms of service. This makes it a perfect model for showcasing responsible web scraping techniques while providing real value to users.
Project Gutenberg Book Search
I have developed a web application that responsibly scrapes book information from Project Gutenberg. This app enables users to search thousands of public domain books, access detailed information on each, and choose from various download formats. It’s an excellent case study demonstrating ethical web scraping practices while effectively serving users.
Being a Good Digital Citizen
Creating a web scraper requires adhering to best practices and respecting both legal and technical constraints. Project Gutenberg serves as an exemplary model in this regard since:
- It has transparent terms of service.
- It provides guidelines through robots.txt.
- Its content is entirely in the public domain.
- Its mission is furthered when tools make its collection more accessible.
Our implementation incorporates several best practices:
Rate Limiting
To demonstrate responsible access, we implement a naive rate limiter to ensure at least one second between requests:
```javascript
const rateLimiter = {
  lastRequest: 0,
  minDelay: 1000, // 1 second between requests
  async wait() {
    const now = Date.now();
    const timeToWait = Math.max(0, this.lastRequest + this.minDelay - now);
    if (timeToWait > 0) {
      await new Promise(resolve => setTimeout(resolve, timeToWait));
    }
    this.lastRequest = Date.now();
  }
};
```
We utilize this rate limiter before each request to Project Gutenberg.
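To make the pattern concrete, here is a runnable sketch of how the limiter might wrap every outbound request. The limiter is repeated for self-containment, and `politeFetch` is a hypothetical helper: in the real app its body would be a Puppeteer call such as `page.goto(url)`.

```javascript
// Same limiter as above, repeated so this sketch runs on its own.
const rateLimiter = {
  lastRequest: 0,
  minDelay: 1000, // 1 second between requests
  async wait() {
    const now = Date.now();
    const timeToWait = Math.max(0, this.lastRequest + this.minDelay - now);
    if (timeToWait > 0) {
      await new Promise(resolve => setTimeout(resolve, timeToWait));
    }
    this.lastRequest = Date.now();
  }
};

// Hypothetical wrapper: every request waits its turn before going out.
async function politeFetch(url) {
  await rateLimiter.wait(); // enforced before each request
  // ... perform the actual request here, e.g. page.goto(url) ...
  return url;
}
```

Because every call site goes through one function, the delay policy lives in a single place and can be tuned without touching the scraping logic.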
Clear Bot Identification
A custom User-Agent string helps website administrators understand the source of access to their site. It promotes transparency and allows site owners to monitor bot traffic, which can lead to better access or support for legitimate scrapers:
```javascript
await browserPage.setUserAgent('GutenbergScraper/1.0 (Educational Project)');
```
Efficient Resource Management
Chrome is known to be resource-intensive. Properly closing browser pages post-use helps mitigate memory leaks and ensures the application runs smoothly without consuming excess resources.
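A reliable way to guarantee cleanup is a `try`/`finally` wrapper that borrows a page, runs the work, and always closes the page, even when the work throws. The sketch below uses a hypothetical stub browser so it runs anywhere; with Puppeteer the shape is identical, since `browser.newPage()` and `page.close()` are the real calls.

```javascript
// Hypothetical stub standing in for a Puppeteer browser, so the sketch
// is runnable without Chrome installed.
const stubBrowser = {
  async newPage() {
    return {
      closed: false,
      async goto(url) { return url; },
      async close() { this.closed = true; }
    };
  }
};

// Borrow a page, run `fn`, and always close the page afterwards.
async function withPage(browser, fn) {
  const page = await browser.newPage();
  try {
    return await fn(page);
  } finally {
    await page.close(); // runs on success *and* on error, preventing page leaks
  }
}
```

Routing all page use through one helper like this means a single forgotten `close()` can never leak memory across requests.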
Web Scraping in the Cloud
Our application leverages the cloud architecture and containerization capabilities of DigitalOcean’s App Platform, creating a balance between development simplicity and production reliability.
The Power of App Platform
The App Platform streamlines deployment by managing:
- Web server configuration
- SSL certificate management
- Security updates
- Load balancing
- Resource monitoring
This allows developers to focus on the code rather than infrastructure management.
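For reference, an App Platform deployment can be described declaratively with an app spec. The following is a sketch only: the app name, repository, branch, port, and instance size are all placeholders to adapt to your own project.

```yaml
# Hypothetical App Platform app spec for a Dockerized scraper service.
name: gutenberg-scraper
services:
  - name: web
    dockerfile_path: Dockerfile   # build from the repo's Dockerfile
    github:
      repo: your-user/gutenberg-scraper  # placeholder repository
      branch: main
      deploy_on_push: true
    http_port: 3000               # must match the port the app listens on
    instance_count: 1
    instance_size_slug: basic-xxs
```

With a spec like this checked into the repository, pushes to the configured branch trigger automatic rebuilds and deployments.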
Headless Chrome in a Container
Our scraping functionality relies on Puppeteer, a high-level API for controlling Chrome. The setup and usage of Puppeteer in the application look like this:
```javascript
const puppeteer = require('puppeteer');

class BookService {
  constructor() {
    this.baseUrl = 'https://www.gutenberg.org';
    this.browser = null;
  }

  async initialize() {
    if (!this.browser) {
      this.browser = await puppeteer.launch({
        headless: true,
        // ...
      });
    }
  }

  // Example scraping logic
  async searchBooks(query, page = 1) {
    await this.initialize();
    // Additional scraping logic here...
  }
}
```
Development to Deployment
To get this project running, here’s a brief outline:
- Local development: clone the GitHub repository, then build and run it with Docker.
- Understanding the code: the app is organized into components for scraping, route handling, and view rendering.
- Deployment to DigitalOcean: create a new App Platform application from the forked repository.
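Running Puppeteer in a container mostly comes down to installing a system Chromium and pointing Puppeteer at it. The Dockerfile below is a sketch under assumptions: a Debian-based Node image, an entry point named `server.js`, and port 3000 are all placeholders. `PUPPETEER_SKIP_DOWNLOAD` and `PUPPETEER_EXECUTABLE_PATH` are real Puppeteer environment variables.

```dockerfile
# Sketch of a Dockerfile for running Puppeteer with system Chromium.
FROM node:18-slim

# Install Chromium and fonts; package names assume a Debian base image.
RUN apt-get update \
 && apt-get install -y chromium fonts-liberation \
 && rm -rf /var/lib/apt/lists/*

# Skip Puppeteer's bundled Chromium download and use the system binary instead.
ENV PUPPETEER_SKIP_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium

WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

EXPOSE 3000
CMD ["node", "server.js"]
```

Using the distribution's Chromium keeps the image smaller and ensures the browser's shared-library dependencies are satisfied by the same package manager that installed it.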
Conclusion
This Project Gutenberg scraper showcases a practical approach to building a web application using modern cloud technologies. By combining Puppeteer for web scraping, Docker for consistent environments, and DigitalOcean’s App Platform for deployment, we’ve created a durable and maintainable solution.
This project serves as a solid foundation for anyone looking to develop web scraping applications, detailing how to manage browser automation and deploy to the cloud effectively.