Content aggregators, web scrapers, and other services that fetch and process web pages all need a backend that can retrieve a page and pull out the part worth reading. In this blog post, we'll walk through a small Node.js application that uses Koa, Axios, JSDOM, and Mozilla's Readability to fetch, parse, and serve readable content from a given URL.
Table of Contents
- Introduction
- Prerequisites
- Setting Up the Project
- Understanding the Code
- Running the Application
- Testing the API
- Handling Errors
- Conclusion
- Further Reading
Introduction
Building an API that can fetch and process web content involves several steps: making HTTP requests, parsing HTML, and extracting meaningful information. In this tutorial, we'll create a Koa-based server that exposes an /api endpoint. This endpoint accepts a URL as a query parameter, fetches the HTML content of the provided URL using Axios, parses the HTML with JSDOM, and then extracts the main readable content using Mozilla's Readability library.
Additionally, we'll include a simple /test endpoint to verify that our server is running correctly.
Prerequisites
Before diving into the code, ensure you have the following installed on your machine:
- Node.js (a current LTS release is the safest choice; recent jsdom releases no longer support Node.js 12)
- npm or yarn package manager
Familiarity with JavaScript and Node.js, plus a basic understanding of Koa, will be helpful.
Setting Up the Project
- Initialize a New Node.js Project
mkdir koa-content-fetcher
cd koa-content-fetcher
npm init -y
- Install Required Dependencies
npm install koa koa-router axios jsdom @mozilla/readability
- koa: A lightweight and expressive middleware framework for Node.js.
- koa-router: Router middleware for Koa.
- axios: Promise-based HTTP client for the browser and Node.js.
- jsdom: A JavaScript implementation of the DOM and HTML standards.
- @mozilla/readability: A library to extract and parse the main content from web pages.
- Create the Server File
Create a file named server.js and paste the following code:
const Koa = require('koa');
const Router = require('koa-router');
const axios = require('axios');
const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');

const app = new Koa();
const router = new Router();

router.get('/api', async ctx => {
  const url = ctx.query.url;
  console.log(url);
  try {
    const response = await axios.get(url);
    const html = response.data;
    const dom = new JSDOM(html, { url });
    const reader = new Readability(dom.window.document);
    ctx.body = reader.parse();
  } catch (error) {
    ctx.status = 500;
    ctx.body = { error: 'Failed to fetch and parse content' };
  }
});

router.get('/test', async ctx => {
  ctx.body = "pong";
});

app.use(router.routes()).use(router.allowedMethods());

app.listen(3009, () => {
  console.log('Server running on http://localhost:3009');
});
Understanding the Code
Let's break down the server code to understand how each part contributes to the overall functionality.
Importing Dependencies
const Koa = require('koa');
const Router = require('koa-router');
const axios = require('axios');
const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');
- Koa: The core framework for building the server.
- koa-router: Facilitates routing in Koa applications.
- axios: Handles HTTP requests to fetch external web content.
- JSDOM: Parses the fetched HTML content into a DOM-like structure.
- Readability: Extracts the main content (like articles) from the DOM.
Initializing Koa and Router
const app = new Koa();
const router = new Router();
- app: The Koa application instance.
- router: An instance of Koa Router to define API endpoints.
Defining Routes
API Route (/api)
router.get('/api', async ctx => {
  const url = ctx.query.url;
  console.log(url);
  try {
    const response = await axios.get(url);
    const html = response.data;
    const dom = new JSDOM(html, { url });
    const reader = new Readability(dom.window.document);
    ctx.body = reader.parse();
  } catch (error) {
    ctx.status = 500;
    ctx.body = { error: 'Failed to fetch and parse content' };
  }
});
- Endpoint: /api
- Method: GET
- Functionality:
  - Extract URL: Retrieves the url query parameter from the request.
  - Fetch Content: Uses Axios to make a GET request to the provided URL.
  - Parse HTML: Converts the fetched HTML string into a DOM using JSDOM.
  - Extract Readable Content: Utilizes Readability to parse the DOM and extract the main content.
  - Respond: Sends the parsed content back to the client.
- Error Handling: If any step fails, the server responds with a 500 status code and an error message.
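The route above passes ctx.query.url straight to Axios, so a missing parameter, a repeated parameter (which Koa parses into an array), or a non-HTTP scheme only surfaces as a generic 500. Here is a minimal sketch of a guard you could call before the fetch. The helper name parseTargetUrl is hypothetical, and restricting the scheme to http and https is an assumption about what the service should accept, not something the original code does.

```javascript
// Hypothetical helper: returns a URL object for acceptable input, or null otherwise.
function parseTargetUrl(raw) {
  // ctx.query.url is undefined when the parameter is missing and an array
  // when it is repeated, so accept only a non-empty string.
  if (typeof raw !== 'string' || raw.trim() === '') {
    return null;
  }
  let parsed;
  try {
    parsed = new URL(raw.trim()); // throws a TypeError on malformed input
  } catch {
    return null;
  }
  // Assumption: only plain web pages should be fetched.
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return null;
  }
  return parsed;
}

console.log(parseTargetUrl('https://example.com/article') !== null); // true
console.log(parseTargetUrl('file:///etc/passwd')); // null
console.log(parseTargetUrl(undefined)); // null
```

In the route, a null result could be turned into a 400 response before axios.get is called. This check alone does not stop requests to internal addresses such as localhost, so a publicly reachable deployment would likely need further restrictions.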
Test Route (/test)
router.get('/test', async ctx => {
  ctx.body = "pong";
});
- Endpoint: /test
- Method: GET
- Functionality: Returns a simple string "pong" to verify that the server is operational.
Starting the Server
app.use(router.routes()).use(router.allowedMethods());
app.listen(3009, () => {
  console.log('Server running on http://localhost:3009');
});
- Middleware: Registers the defined routes and allowed HTTP methods with the Koa application.
- Listening Port: The server listens on port 3009.
- Confirmation: Logs a message to the console indicating that the server is running.
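The port is hard-coded in the call to app.listen. If you'd rather choose it at launch time, one common approach is to read it from an environment variable. The sketch below assumes a variable named PORT and a helper named resolvePort, neither of which appears in the original server, and falls back to 3009 when the value is missing or not a valid port number.

```javascript
// Hypothetical helper: pick the listening port from the environment.
// Assumption: the variable is called PORT; 3009 stays the default.
function resolvePort(env = process.env, fallback = 3009) {
  const parsed = Number(env.PORT);
  const isValidPort = Number.isInteger(parsed) && parsed > 0 && parsed < 65536;
  return isValidPort ? parsed : fallback;
}

console.log(resolvePort({ PORT: '8080' })); // 8080
console.log(resolvePort({})); // 3009
console.log(resolvePort({ PORT: 'not-a-number' })); // 3009
```

With this in place, the server could call app.listen(resolvePort(), ...) and build its log message from the same value.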
Running the Application
- Start the Server
In your terminal, navigate to the project directory and run:
node server.js
You should see the following output:
Server running on http://localhost:3009
- Verify the Test Endpoint
Open your browser or use a tool like curl or Postman to access:
http://localhost:3009/test
You should receive a response:
pong
Testing the API
The core functionality lies in the /api endpoint, which fetches and parses content from a provided URL.
Example Request
To use the API, send a GET request to /api with the url query parameter set to the desired webpage.
Example:
curl "http://localhost:3009/api?url=https://example.com/article"
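One detail is easy to miss in the curl example: the target URL is itself a query-string value, so characters such as ? and & inside it need to be percent-encoded, otherwise the server sees only part of it. A small sketch using the built-in URL class follows. The function name buildApiUrl and the sample target URL are made up for illustration, and the base address assumes the server is running locally on port 3009 as configured earlier.

```javascript
// Hypothetical helper: build a request URL for the /api endpoint.
function buildApiUrl(targetUrl, base = 'http://localhost:3009') {
  const requestUrl = new URL('/api', base);
  // searchParams.set percent-encodes the value for us.
  requestUrl.searchParams.set('url', targetUrl);
  return requestUrl.toString();
}

const target = 'https://example.com/article?id=1&lang=en';
const requestUrl = buildApiUrl(target);
console.log(requestUrl);
// http://localhost:3009/api?url=https%3A%2F%2Fexample.com%2Farticle%3Fid%3D1%26lang%3Den

// Decoding the query string gives back the original target URL.
console.log(new URL(requestUrl).searchParams.get('url') === target); // true
```

From the command line, curl can do the same encoding with its --get and --data-urlencode options, for example: curl --get --data-urlencode "url=https://example.com/article?id=1&lang=en" http://localhost:3009/api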
Expected Response
The API will respond with a JSON object containing the parsed content. Here's a simplified example:
{
  "title": "Example Article",
  "byline": "Author Name",
  "content": "<p>This is the main content of the article...</p>",
  "textContent": "This is the main content of the article...",
  "length": 1234
}
- title: The title of the article.
- byline: The author's name.
- content: The HTML content extracted from the page.
- textContent: Plain text version of the content.
- length: The length of the article text, in characters.
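Keep in mind that reader.parse() returns null when Readability cannot identify an article on the page, and the route assigns that value to ctx.body unchanged. A minimal sketch of a guard is shown below. The helper name toApiResponse, the 422 status code, and the error text are illustrative choices rather than part of the original code.

```javascript
// Hypothetical helper: turn the result of reader.parse() into a status and body.
function toApiResponse(article) {
  if (article === null || article === undefined) {
    // Assumption: 422 signals that the page was fetched but held no extractable article.
    return { status: 422, body: { error: 'No readable content found' } };
  }
  // A parsed article is passed through unchanged.
  return { status: 200, body: article };
}

const sample = {
  title: 'Example Article',
  byline: 'Author Name',
  content: '<p>This is the main content of the article...</p>',
  textContent: 'This is the main content of the article...',
  length: 42,
};
console.log(toApiResponse(sample).status); // 200
console.log(toApiResponse(null).status); // 422
```

Inside the route, the result could be applied with ctx.status = status and ctx.body = body after destructuring the returned object.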
Handling Errors
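Robust error handling is essential for any API. In our implementation, any exception thrown while fetching or parsing the content is caught, and the server responds with a 500 status code and a generic error message.
Error Response Example:
{
  "error": "Failed to fetch and parse content"
}
Common scenarios that might trigger an error include:
- Missing or Invalid URL: The url query parameter is absent or malformed, or the host does not exist.
- Network Issues: Problems with network connectivity, or the target server being down.
- Non-2xx Responses: By default, Axios rejects the request when the target server answers with a status code outside the 2xx range, such as 404.
- Parsing Failures: Issues while building the DOM or running Readability on unexpected input.
One case does not reach the catch block: when Readability cannot find an article on the page, reader.parse() returns null instead of throwing, and with the code as written Koa answers a null body with 204 No Content.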
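Because every failure collapses into the same 500 response, a client cannot tell a dead target site from a bug in our server. One possible refinement, sketched below, inspects the error that Axios throws: it sets error.response when the target answered with a failing status, and error.code for lower-level failures such as timeouts. The function name classifyFetchError and the choice of 502 and 504 are assumptions for illustration, not behavior of the original code.

```javascript
// Hypothetical helper: map an error from axios.get() to a response status and body.
function classifyFetchError(error) {
  // Axios attaches `response` when the target replied with a non-2xx status.
  if (error && error.response) {
    return {
      status: 502,
      body: { error: `Target responded with status ${error.response.status}` },
    };
  }
  // Timeouts are reported through `code`.
  if (error && (error.code === 'ECONNABORTED' || error.code === 'ETIMEDOUT')) {
    return { status: 504, body: { error: 'Timed out while fetching the target' } };
  }
  // Anything else keeps the original generic response.
  return { status: 500, body: { error: 'Failed to fetch and parse content' } };
}

console.log(classifyFetchError({ response: { status: 404 } }).status); // 502
console.log(classifyFetchError({ code: 'ECONNABORTED' }).status); // 504
console.log(classifyFetchError(new Error('boom')).status); // 500
```

Axios applies no timeout by default, so the 504 branch only comes into play if a timeout option is passed to axios.get.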
Conclusion
In this tutorial, we built a simple yet effective API using Koa that can fetch and parse web content from any given URL. By combining the power of Axios for HTTP requests, JSDOM for HTML parsing, and Mozilla's Readability for content extraction, we created a tool that can be the backbone of various applications like content aggregators, readability-enhanced browsers, or data extraction services.
Further Reading
- Koa Documentation
- koa-router Documentation
- Axios GitHub Repository
- JSDOM GitHub Repository
- Mozilla Readability GitHub Repository
Feel free to experiment with the code, extend its functionalities, and integrate it into your projects. Happy coding!