feat: Scraping Documentation from Websites for Building Knowledge Graphs #42


### **Implement Python Script for Scraping Documentation into Llama-Index Document Objects**

#### **Description**

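This issue invites contributors to develop a Python script that scrapes documentation from websites and stores it as Llama-Index document objects, which can then be used for building knowledge graphs.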

#### **Objective**

- Create a Python script to scrape documentation from a given website, including all sublinks.
- Store the content from each link in a separate Llama-Index document object.
- Each document object should have metadata that includes the source URL of the scraped data.
- Compile all document objects into a list to form the complete Llama-Index.
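
As a rough illustration of the objective above, the sketch below builds one document object per scraped URL and records the source URL in its metadata. It assumes `llama_index.core.Document` is available; when the package is not installed it falls back to a small stand-in dataclass. The helper `build_documents`, the metadata key `source_url`, and the example URLs are hypothetical choices for illustration, not part of any agreed design.

```python
from dataclasses import dataclass, field

try:
    # Real class, used when llama-index is installed.
    from llama_index.core import Document
except ImportError:
    @dataclass
    class Document:  # hypothetical stand-in with the two fields used here
        text: str
        metadata: dict = field(default_factory=dict)


def build_documents(pages):
    """Turn a {url: page_text} mapping into one Document per URL.

    The metadata key "source_url" is an assumption; any agreed key works.
    """
    return [
        Document(text=text, metadata={"source_url": url})
        for url, text in pages.items()
    ]


# Hypothetical scraped content keyed by URL.
pages = {
    "https://example.com/docs/": "Welcome to the docs.",
    "https://example.com/docs/install.html": "Run the installer.",
}
documents = build_documents(pages)
```

The resulting `documents` list is the collection the objective refers to; each entry can be traced back to its page through `metadata["source_url"]`.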

#### **Requirements**

1. **Scraping the Documentation**:
   - Utilize Python libraries like Beautiful Soup, Scrapy, or Requests-HTML to scrape content from the main documentation page and all associated sublinks.
   - Ensure accurate extraction of relevant content, including text, code snippets, and descriptions.

2. **Llama-Index Document Object Creation**:
   - Store the data scraped from each individual link in a separate Llama-Index document object.
   - Attach metadata to each document object that records the URL of the link from which the content was scraped.
   - Compile all individual document objects into a list, representing the complete Llama-Index.

3. **Documentation**:
   - Document the script clearly, providing instructions on how to use it.
   - Provide a notebook implementation that explores various scraping strategies and their responses; this part is mostly research.
   - Later, we can build a module out of it.

4. **Error Handling**:
   - Implement robust error handling to manage issues such as broken links, failed requests, or unexpected data formats.
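
The sublink crawling and error handling requirements above could be approached roughly as follows. This is a minimal sketch using only the standard library, not a prescribed design: `LinkParser`, `crawl`, the `fetch` callable, `max_pages`, and the fake site used for the demo are all hypothetical names introduced for illustration. The crawler only follows links that start with the starting URL, and it records failed requests instead of stopping.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of pages under start_url.

    fetch is any callable that returns HTML for a URL or raises on failure.
    Returns (pages, errors): {url: html} and {url: error message}.
    """
    queue = deque([start_url])
    seen = {start_url}
    pages, errors = {}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception as exc:  # broken link, timeout, bad response, ...
            errors[url] = str(exc)
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _ = urldefrag(urljoin(url, href))
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages, errors


# Hypothetical in-memory site so the sketch runs without network access.
fake_site = {
    "https://example.com/docs/": (
        '<a href="install.html">Install</a> '
        '<a href="missing.html">Gone</a> '
        '<a href="https://other.example.org/">External</a>'
    ),
    "https://example.com/docs/install.html": '<a href="./#top">Back</a>',
}


def fake_fetch(url):
    if url not in fake_site:
        raise LookupError(f"404 for {url}")
    return fake_site[url]


pages, errors = crawl("https://example.com/docs/", fake_fetch)
```

In a real run, `fetch` might wrap `requests.get(url, timeout=10)` followed by `raise_for_status()` and return `response.text`; keeping it injectable makes the crawl logic testable offline.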

#### **Submission Guidelines**

- **Implementation**: Develop the Python script ensuring it meets the outlined requirements.
- **Documentation**: Include comprehensive comments and docstrings that explain the functionality and usage of the script.

- **Submit a Pull Request (PR)**:
  - Reference this issue in your PR.
  - Provide a description of your implementation, any challenges faced, and considerations made during development.

#### **Resources**

- Explore Python libraries such as Beautiful Soup, Scrapy, and Requests-HTML for web scraping.
- Refer to Llama-Index documentation for guidance on creating document objects and managing metadata.
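
For the content extraction requirement (text, code snippets, and descriptions), one possible starting point is sketched below. It uses the standard library's `html.parser` as a dependency-free stand-in for the libraries listed above; a Beautiful Soup version would likely be shorter. `TextExtractor`, `extract_text`, and the sample HTML are hypothetical. The sketch skips `<script>` and `<style>` content and keeps the indentation inside `<pre>` blocks so code snippets survive.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text; skip script/style, keep <pre> content verbatim."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0
        self._pre_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "pre":
            self._pre_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "pre" and self._pre_depth:
            self._pre_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._pre_depth:
            # Preserve indentation inside code blocks.
            self.parts.append(data.strip("\n"))
        elif data.strip():
            # Collapse runs of whitespace in ordinary prose.
            self.parts.append(" ".join(data.split()))


def extract_text(html):
    """Return the readable text of an HTML page, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(part for part in parser.parts if part)


# Hypothetical page used only to exercise the sketch.
sample = """<html><head><style>p {color: red}</style></head>
<body><h1>Install</h1><p>Run the   command below.</p>
<pre><code>pip install example
    --upgrade</code></pre><script>track()</script></body></html>"""
text = extract_text(sample)
```

The returned string could be passed directly as the text of a document object, with the page URL attached as metadata.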

#### **Submission Checklist**

- [ ] Implement the scraping script in Python.
- [ ] Create individual Llama-Index document objects with correct metadata.
- [ ] Compile document objects into a list representing the Llama-Index.
- [ ] Provide a notebook implementation.
- [ ] Submit a PR referencing this issue.

We look forward to your valuable contributions that will enhance our capability to integrate website documentation into our knowledge systems!

