An automatic tool for comprehensive forum scraping.
Pitch

forumscraper is an automatic scraping tool for forums built on Invision Power Board, phpBB, Simple Machines Forum, and other engines. Its CLI downloads forum content and organizes it into JSON files.

Description

forumscraper is a universal, automatic web scraper built for online forums. It extracts threads and user information from any of the supported forum engines, which makes collecting forum data for analysis straightforward.

Supported Forums

The scraper works with multiple forum platforms, including:

  • Invision Power Board (versions 4.x and 5.x)
  • phpBB (excluding version 1.x)
  • Simple Machines Forum
  • XenForo
  • XMB
  • Hacker News (the site has aggressive anti-scraping protection)
  • StackExchange
  • vBulletin (3.x and higher)

Features

  • Automatic Data Extraction: Automatically downloads threads and user information from provided URLs, with output saved in JSON files named by their IDs or SHA256 hashes, depending on user preference.

  • Flexible CLI Usage: The command-line interface (CLI) accepts several URLs in one invocation, supports threading, and has options to customize output and logging:

    forumscraper --directory DIR URL1 URL2 URL3
    
  • Customizable Options: Users can modify parameters such as thread limits, waiting durations between requests, retry settings, and output formats to tailor the scraping process to specific needs. For example, to set the waiting duration:

    forumscraper --wait 0.8 --wait-random 400 URL
    
  • Discovery Script: Easily discover new forums and their structures using the included discovery script.
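
The waiting options shown above can be pictured with a short Python sketch. It is not forumscraper's implementation: it assumes that --wait is a fixed delay in seconds and that --wait-random is an upper bound, in milliseconds, on an extra random delay. The helper name request_delay is hypothetical.

```python
import random
from typing import Optional


def request_delay(wait: float, wait_random_ms: int,
                  rng: Optional[random.Random] = None) -> float:
    """Return the number of seconds to sleep before the next request.

    Assumption: `wait` is in seconds and `wait_random_ms` is the maximum
    extra delay in milliseconds (this mirrors, but is not taken from,
    the --wait and --wait-random options).
    """
    rng = rng if rng is not None else random.Random()
    # Draw a whole number of milliseconds in [0, wait_random_ms].
    extra_ms = rng.randint(0, wait_random_ms) if wait_random_ms > 0 else 0
    return wait + extra_ms / 1000.0


if __name__ == "__main__":
    # Under the assumptions above, "--wait 0.8 --wait-random 400" would
    # give delays between 0.8 and 1.2 seconds.
    for _ in range(3):
        print(round(request_delay(0.8, 400), 3))
```

Randomizing the delay spreads requests out less predictably than a fixed interval, which is a common courtesy toward the scraped server.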

Usage Example

Here is a basic usage example to scrape a forum:

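from forumscraper import extractor, outputs

# Create an extractor with default settings.
ex = extractor()

# guess() identifies the forum engine from the URL before scraping.
# Output flags are combined with the | operator.
ingest_data = ex.guess(
    "https://exampleforum.com/forum-thread",
    output=outputs.data | outputs.threads,
)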

This snippet demonstrates how to automatically identify a forum and retrieve relevant data using the guess method, allowing for straightforward integration into broader data collection workflows.

Output Management

Scraping output consists of the collected data plus file management features that help users log and track their scraping activity. By default, existing files are not overwritten unless the user requests it, and logging options are available to record failures that occur during the process.
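
The naming and overwrite behavior can be illustrated with a minimal Python sketch. It is not the tool's own code: it assumes the SHA256 hash is computed over the source URL and that files carry no extension, and the helpers output_path and save_once are hypothetical names.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Optional


def output_path(directory: Path, url: str,
                item_id: Optional[str] = None) -> Path:
    """Pick a file name: the item's ID if given, else a SHA256 hash.

    Assumption: the hash is taken over the UTF-8 encoded source URL.
    """
    if item_id is not None:
        name = item_id
    else:
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return directory / name


def save_once(path: Path, data: Any, force: bool = False) -> bool:
    """Write `data` as JSON unless the file already exists.

    Returns True if the file was written, False if it was skipped.
    Passing force=True overwrites an existing file.
    """
    if path.exists() and not force:
        return False
    path.write_text(json.dumps(data), encoding="utf-8")
    return True
```

Skipping existing files by default makes an interrupted scrape cheap to resume, since only the missing items are written on the next run.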

Contributing and More Information

For additional examples, see the examples directory in the repository. The README in the repository documents the functions and configuration options in detail.
