Adding domain specific examples #216

Closed
61 changes: 53 additions & 8 deletions docs/source/examples.rst
@@ -3,19 +3,64 @@ Examples

.. currentmodule:: examples

Vision
In this section, you will find the data loading implementations (using DataPipes) of various
popular datasets across different research domains.

Audio
-----------

LibriSpeech
^^^^^^^^^^^^^^^^^^^^^^^^^^

`LibriSpeech dataset <https://www.openslr.org/12/>`_ is a corpus of approximately 1000 hours of 16kHz read
English speech. Here is the
`DataPipe implementation of LibriSpeech <https://github.com/pytorch/data/blob/main/examples/audio/librispeech.py>`_
to load the data.

Text
-----------

Audio
IMDB
^^^^^^^^^^^^^^^^^^^^^^^^^^
This is a `large movie review dataset <http://ai.stanford.edu/~amaas/data/sentiment/>`_ for binary sentiment
classification containing 25,000 highly polar movie reviews for training and 25,000 for testing. Here is the
`DataPipe implementation to load the data <https://github.com/pytorch/data/blob/main/examples/text/imdb.py>`_.


SQuAD
^^^^^^^^^^^^^^^^^^^^^^^^^^
`SQuAD (Stanford Question Answering Dataset) <https://rajpurkar.github.io/SQuAD-explorer/>`_ is a dataset for
reading comprehension. It consists of a list of questions posed by crowdworkers on a set of Wikipedia articles. Here are the
DataPipe implementations for `version 1.1 <https://github.com/pytorch/data/blob/main/examples/text/squad1.py>`_
and `version 2.0 <https://github.com/pytorch/data/blob/main/examples/text/squad2.py>`_.

Additional Datasets in TorchText
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In a separate PyTorch domain library `TorchText <https://github.com/pytorch/text>`_, you will find some of the most
popular datasets in the NLP field implemented as loadable datasets using DataPipes. You can find
all of those `NLP datasets here <https://github.com/pytorch/text/tree/main/torchtext/datasets>`_.


Vision
-----------

Module contents
---------------
Caltech 101
^^^^^^^^^^^^^^^^^^^^^^^^^^
The `Caltech 101 dataset <http://www.vision.caltech.edu/Image_Datasets/Caltech101/>`_ contains pictures of objects
belonging to 101 categories. Here is the
`DataPipe implementation of Caltech 101 <https://github.com/pytorch/data/blob/main/examples/vision/caltech101.py>`_.

Caltech 256
^^^^^^^^^^^^^^^^^^^^^^^^^^
The `Caltech 256 dataset <http://www.vision.caltech.edu/Image_Datasets/Caltech256/>`_ contains 30,607 images
from 256 categories. Here is the
`DataPipe implementation of Caltech 256 <https://github.com/pytorch/data/blob/main/examples/vision/caltech256.py>`_.

Additional Datasets in TorchVision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In a separate PyTorch domain library `TorchVision <https://github.com/pytorch/vision>`_, you will find some of the most
popular datasets in the computer vision field implemented as loadable datasets using DataPipes. You can find all of
those `vision datasets here <https://github.com/pytorch/vision/tree/main/torchvision/prototype/datasets/_builtin>`_.

.. automodule:: examples
:members:
:undoc-members:
:show-inheritance:
Note that these implementations are currently in the prototype phase, but they should be fully supported
in the coming months. Nonetheless, they demonstrate the different ways DataPipes can be used for data loading.
2 changes: 2 additions & 0 deletions docs/source/tutorial.rst
@@ -211,3 +211,5 @@ The following statements will be printed to show the shapes of a single batch of

Labels batch shape: 50
Feature batch shape: torch.Size([50, 20])

You can find more DataPipe implementation examples for various research domains `on this page <torchexamples.html>`_.
2 changes: 1 addition & 1 deletion torchdata/datapipes/iter/util/plain_text_reader.py
@@ -190,7 +190,7 @@ class CSVDictParserIterDataPipe(_CSVBaseParserIterDataPipe):
within the CSV files one row at a time (functional name: ``parse_csv_as_dict``).

Each output is a `Dict` by default, but it depends on ``fmtparams``. The first row of each file, unless skipped,
will be used as the header; the contents of the header row will be used as keys for the `Dict`s
will be used as the header; the contents of the header row will be used as keys for the `Dict`\s
generated from the remaining rows.

Args:
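
The header-as-keys behavior described in the ``CSVDictParserIterDataPipe`` docstring can be illustrated with a minimal sketch. It uses the standard library's ``csv.DictReader`` as a stand-in, on the assumption that the DataPipe treats the header row analogously. It is not torchdata code, and the CSV contents and the variable names ``raw`` and ``rows`` are made up for illustration.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for a file stream handed to the parser.
raw = io.StringIO("name,label\nreview_1,pos\nreview_2,neg\n")

# DictReader consumes the first row as the header and uses its fields as the
# keys of one dict per remaining row.
rows = [dict(row) for row in csv.DictReader(raw)]

for row in rows:
    print(row)
```

Each remaining row becomes one dictionary keyed by the header fields, which is the shape the docstring describes for the default output.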