feat: Scraping Documentation from websites for building Knowledge Graphs #42

@debrupf2946

Implement Python Script for Scraping Documentation into Llama-Index Document Objects

Description

This issue invites contributors to develop a Python script that scrapes documentation from websites and stores it as Llama-Index document objects for building knowledge graphs.

Objective

  • Create a Python script to scrape documentation from a given website, including all sublinks.
  • Store the content from each link in a separate Llama-Index document object.
  • Each document object should have metadata that includes the source URL of the scraped data.
  • Compile all document objects into a list to form the complete Llama-Index.
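
As a rough illustration of the objective above, here is a minimal sketch of turning already-scraped pages into a list of document objects with source-URL metadata. It assumes the `Document` class from `llama_index.core` (llama-index 0.10 or later) and falls back to a small stand-in dataclass if the package is not installed. The helper name `build_documents`, the `source_url` metadata key, and the example URLs are hypothetical choices, not something this issue prescribes.

```python
from dataclasses import dataclass, field

try:
    # Assumed import path for llama-index >= 0.10.
    from llama_index.core import Document
except ImportError:
    @dataclass
    class Document:
        """Stand-in exposing only the two attributes this sketch uses."""
        text: str
        metadata: dict = field(default_factory=dict)


def build_documents(pages):
    """Turn a {url: text} mapping into one Document per page.

    Each Document records the page it came from under the
    (hypothetical) metadata key "source_url".
    """
    return [
        Document(text=text, metadata={"source_url": url})
        for url, text in pages.items()
    ]


# Example input: in a real script this mapping would come from the scraper.
pages = {
    "https://example.com/docs/": "Welcome to the docs.",
    "https://example.com/docs/install.html": "Run pip install example.",
}
documents = build_documents(pages)
```

The resulting `documents` list is the "list of document objects" the objective describes; what is done with it afterwards (indexing, graph construction) is outside this sketch.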

Requirements

  1. Scraping the Documentation:

    • Utilize Python libraries like Beautiful Soup, Scrapy, or Requests-HTML to scrape content from the main documentation page and all associated sublinks.
    • Ensure accurate extraction of relevant content, including text, code snippets, and descriptions.
  2. Llama-Index Document Object Creation:

    • Store the data scraped from each individual link in a separate Llama-Index document object.
    • Attach metadata to each document object that records the URL of the link from which the content was scraped.
    • Compile all individual document objects into a list, representing the complete Llama-Index.
  3. Documentation:

    • Document the script clearly, providing instructions on how to use it.
    • Provide a notebook implementation that explores various scraping strategies and their responses; this part is mostly research.
    • Later we can build a module out of it.
  4. Error Handling:

    • Implement robust error handling to manage issues such as broken links, failed requests, or unexpected data formats.
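
The scraping and error-handling requirements above could be approached along these lines. This is only a sketch under stated assumptions: it uses the Python standard library (`html.parser`, `urllib`) rather than Beautiful Soup, Scrapy, or Requests-HTML, so that it runs without extra installs. The names `PageParser`, `fetch`, `crawl`, and the `max_pages` limit are hypothetical, and "sublinks" is interpreted here as links whose URL starts with the start URL.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects visible text and link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.chunks = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep text (including code snippets in <pre>/<code>), drop scripts.
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def fetch(url, timeout=10):
    """Download one page and decode it to text."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl of pages whose URL starts with start_url.

    Returns (pages, failures): {url: text} for pages that loaded and
    {url: error message} for broken links or failed requests.
    """
    pages, failures = {}, {}
    queue, seen = deque([start_url]), {start_url}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError) as exc:  # URLError/HTTPError are OSErrors
            failures[url] = str(exc)
            continue
        parser = PageParser()
        parser.feed(html)
        pages[url] = "\n".join(parser.chunks)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages, failures
```

Calling `crawl("https://example.com/docs/")` (a placeholder URL) would return the scraped text per page plus a record of the links that failed, so broken links are reported instead of stopping the run. Passing a custom `fetch_page` makes it possible to swap in another HTTP client or to test the crawler offline.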

Submission Guidelines

  • Implementation: Develop the Python script ensuring it meets the outlined requirements.

  • Documentation: Include comprehensive comments and docstrings that explain the functionality and usage of the script.

  • Submit a Pull Request (PR):

    • Reference this issue in your PR.
    • Provide a description of your implementation, any challenges faced, and considerations made during development.

Resources

  • Explore Python libraries such as Beautiful Soup, Scrapy, and Requests-HTML for web scraping.
  • Refer to Llama-Index documentation for guidance on creating document objects and managing metadata.

Submission Checklist

  • Implement the scraping script in Python.
  • Create individual Llama-Index document objects with correct metadata.
  • Compile document objects into a list representing the Llama-Index.
  • Provide the notebook implementation.
  • Submit a PR referencing this issue.

We look forward to your valuable contributions that will enhance our capability to integrate website documentation into our knowledge systems!
