Commit d83619c

Added usage to README
1 parent fec9506 commit d83619c


README.md

Lines changed: 115 additions & 1 deletion
@@ -5,11 +5,12 @@
TensorFlow I/O is a collection of file systems and file formats that are not
available in TensorFlow's built-in support.

At the moment TensorFlow I/O supports 5 data sources:
- `tensorflow_io.ignite`: Data source for Apache Ignite and Ignite File System (IGFS).
- `tensorflow_io.kafka`: Apache Kafka stream-processing support.
- `tensorflow_io.kinesis`: Amazon Kinesis data streams support.
- `tensorflow_io.hadoop`: Hadoop SequenceFile format support.
- `tensorflow_io.arrow`: Apache Arrow data format support.

## Installation

@@ -42,6 +43,119 @@ Type "help", "copyright", "credits" or "license" for more information.
Note that `python` has to be run outside of the repo directory itself; otherwise,
Python may not be able to find the correct path to the module.

## Using TensorFlow I/O

### Apache Arrow Datasets

Apache Arrow is a standard for in-memory columnar data; see [here](https://arrow.apache.org)
for more information on the project. An Arrow dataset makes it easy to bring
structured columnar data into TensorFlow from the following sources:

#### Pandas DataFrame

An `ArrowDataset` can be created directly from an existing Pandas DataFrame or
pyarrow record batches in a Python process. Tensor types and shapes can be
inferred from the DataFrame, although currently only scalar and vector values
with primitive types are supported. PyArrow must be installed to use this
Dataset. Example usage:

```python
import tensorflow as tf
from tensorflow_io.arrow import ArrowDataset

# Assume `df` is an existing Pandas DataFrame
dataset = ArrowDataset.from_pandas(df)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
  for i in range(len(df)):
    print(sess.run(next_element))
```

NOTE: The entire DataFrame is serialized into the Dataset, so this is not
recommended for large amounts of data.
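
Since the resulting dataset is used like any other `tf.data.Dataset` (as the
example above shows), standard `tf.data` transformations should compose with it.
A minimal sketch that batches rows, assuming the same `df` as above (the batch
size of 2 is arbitrary):

```python
import tensorflow as tf
from tensorflow_io.arrow import ArrowDataset

# Assume `df` is the same Pandas DataFrame as above; group rows into batches of 2
dataset = ArrowDataset.from_pandas(df).batch(2)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
  try:
    while True:
      print(sess.run(next_element))
  except tf.errors.OutOfRangeError:
    pass
```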

#### Arrow Feather Dataset

Feather is a light-weight file format that provides a simple and efficient way
to write a Pandas DataFrame to disk; see [here](https://arrow.apache.org/docs/python/ipc.html#feather-format)
for more information and limitations of the format. An `ArrowFeatherDataset`
can be created to read one or more Feather files. The following example shows
how to write a Feather file from a Pandas DataFrame, then read multiple files
back as an `ArrowFeatherDataset`:

```python
from pyarrow.feather import write_feather

# Assume `df` is an existing Pandas DataFrame with dtypes=(int32, float32)
write_feather(df, '/path/to/a.feather')
```

```python
import tensorflow as tf
from tensorflow_io.arrow import ArrowFeatherDataset

# Each Feather file must have the same column types; here we use the above
# DataFrame, which has 2 columns with dtypes=(int32, float32)
dataset = ArrowFeatherDataset(
    ['/path/to/a.feather', '/path/to/b.feather'],
    columns=(0, 1),
    output_types=(tf.int32, tf.float32),
    output_shapes=([], []))

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# This will iterate over each row of each file provided
with tf.Session() as sess:
  while True:
    try:
      print(sess.run(next_element))
    except tf.errors.OutOfRangeError:
      break
```

An alternate constructor can also be used to infer the output types and shapes from
a given `pyarrow.Schema`, e.g. `dataset = ArrowFeatherDataset.from_schema(filenames, schema)`.
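
For example, one possible way to build such a schema is with PyArrow itself.
A minimal sketch, assuming the same `df` with dtypes=(int32, float32) as above:

```python
import pyarrow as pa
from tensorflow_io.arrow import ArrowFeatherDataset

# Derive the schema from the same DataFrame that was written to the Feather
# files; preserve_index=False keeps only the 2 data columns
schema = pa.Table.from_pandas(df, preserve_index=False).schema

dataset = ArrowFeatherDataset.from_schema(
    ['/path/to/a.feather', '/path/to/b.feather'], schema)
```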

#### Arrow Stream Dataset

The `ArrowStreamDataset` provides a Dataset that will connect to a host over
a socket that is serving Arrow record batches in the Arrow stream format. See
[here](https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-streams)
for more on the stream format. The following example creates an
`ArrowStreamDataset` that connects to a host serving an Arrow stream of record
batches with 2 columns of dtypes=(int32, float32):

```python
import tensorflow as tf
from tensorflow_io.arrow import ArrowStreamDataset

# The str `host` should be in the format '<HOSTNAME>:<PORT>'
dataset = ArrowStreamDataset(
    host,
    columns=(0, 1),
    output_types=(tf.int32, tf.float32),
    output_shapes=([], []))

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# The host connection is made when the Dataset op is run and will iterate over
# each row of each record batch until the Arrow stream is finished
with tf.Session() as sess:
  while True:
    try:
      print(sess.run(next_element))
    except tf.errors.OutOfRangeError:
      break
```
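
For reference, one possible way to serve such a stream for testing is with
PyArrow's `RecordBatchStreamWriter` over a plain TCP socket. A minimal sketch,
assuming the same 2-column `df` as above and an arbitrary local port (so the
dataset above would use `host = '127.0.0.1:8080'`):

```python
import socket

import pyarrow as pa

# Assume `df` is the 2-column DataFrame with dtypes=(int32, float32) from above
batch = pa.RecordBatch.from_pandas(df, preserve_index=False)

# Listen for a single connection on an arbitrary local port
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 8080))
server.listen(1)
conn, _ = server.accept()

# Write the schema followed by the record batch in the Arrow stream format
sink = conn.makefile(mode='wb')
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()

sink.close()
conn.close()
server.close()
```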

An alternate constructor can also be used to infer the output types and shapes from
a given `pyarrow.Schema`, e.g. `dataset = ArrowStreamDataset.from_schema(host, schema)`.

## Developing

### Python
