Commit e5dc0eb

Merge pull request #36 from BryanCutler/arrow-dataset-13
Add Apache Arrow dataset support
2 parents d47c8f2 + 78e756b commit e5dc0eb

15 files changed: +1794 -42 lines changed

BUILD

Lines changed: 2 additions & 1 deletion

@@ -5,7 +5,8 @@ sh_binary(
         "LICENSE",
         "MANIFEST.in",
         "setup.py",
-        "tensorflow_io/__init__.py",
+        "tensorflow_io/__init__.py",
+        "//tensorflow_io/arrow:arrow_py",
         "//tensorflow_io/hadoop:hadoop_py",
         "//tensorflow_io/ignite:ignite_py",
         "//tensorflow_io/kafka:kafka_py",

README.md

Lines changed: 2 additions & 1 deletion

@@ -5,11 +5,12 @@
 TensorFlow I/O is a collection of file systems and file formats that are not
 available in TensorFlow's built-in support.
 
-At the moment TensorFlow I/O supports 4 data sources:
+At the moment TensorFlow I/O supports 5 data sources:
 - `tensorflow_io.ignite`: Data source for Apache Ignite and Ignite File System (IGFS).
 - `tensorflow_io.kafka`: Apache Kafka stream-processing support.
 - `tensorflow_io.kinesis`: Amazon Kinesis data streams support.
 - `tensorflow_io.hadoop`: Hadoop SequenceFile format support.
+- `tensorflow_io.arrow`: Apache Arrow data format support. Usage guide [here](tensorflow_io/arrow/README.md).
 
 ## Installation

tensorflow_io/arrow/BUILD

Lines changed: 56 additions & 0 deletions (new file)

licenses(["notice"])  # Apache 2.0

package(default_visibility = ["//visibility:public"])

cc_binary(
    name = 'python/ops/_arrow_ops.so',
    srcs = [
        "kernels/arrow_dataset_ops.cc",
        "kernels/arrow_stream_client.h",
        "kernels/arrow_stream_client_unix.cc",
        "ops/dataset_ops.cc",
    ],
    linkshared = 1,
    deps = [
        "@local_config_tf//:libtensorflow_framework",
        "@local_config_tf//:tf_header_lib",
        "@arrow//:arrow",
    ],
    copts = ["-pthread", "-std=c++11", "-D_GLIBCXX_USE_CXX11_ABI=0", "-DNDEBUG"]
)

py_library(
    name = "arrow_ops_py",
    srcs = [
        "python/ops/arrow_dataset_ops.py",
    ],
    data = [
        ":python/ops/_arrow_ops.so"
    ],
    srcs_version = "PY2AND3",
)

py_test(
    name = "arrow_py_test",
    srcs = [
        "python/kernel_tests/arrow_test.py"
    ],
    main = "python/kernel_tests/arrow_test.py",
    deps = [
        ":arrow_ops_py"
    ],
    srcs_version = "PY2AND3",
)

py_library(
    name = "arrow_py",
    srcs = [
        "__init__.py",
        "python/__init__.py",
        "python/ops/__init__.py",
    ],
    deps = [
        ":arrow_ops_py",
    ],
    srcs_version = "PY2AND3",
)
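
Not shown in the diff, but useful orientation: `arrow_ops_py` ships the compiled `_arrow_ops.so` as data, and the Python wrapper loads it at import time. A minimal sketch of that loading step (the actual loader lives in `arrow_dataset_ops.py` and resolves the path differently, so the literal path below is an assumption):

```python
import tensorflow as tf

# Sketch only: register the custom Arrow dataset ops/kernels compiled into
# the cc_binary target above. The path is a placeholder assumption; the
# real module resolves it relative to the installed package.
_arrow_ops = tf.load_op_library("tensorflow_io/arrow/python/ops/_arrow_ops.so")
```

The test target above can then be run with `bazel test //tensorflow_io/arrow:arrow_py_test`.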

tensorflow_io/arrow/README.md

Lines changed: 111 additions & 0 deletions (new file)

# TensorFlow I/O Apache Arrow Datasets

Apache Arrow is a standard for in-memory columnar data, see [here](https://arrow.apache.org)
for more information on the project. An Arrow dataset makes it easy to bring in
column-oriented data from other systems to TensorFlow using the following
sources:

## From a Pandas DataFrame

An `ArrowDataset` can be made directly from an existing Pandas DataFrame, or
pyarrow record batches, in a Python process. Tensor types and shapes can be
inferred from the DataFrame, although currently only scalar and vector values
with primitive types are supported. PyArrow must be installed to use this
Dataset. Example usage:

```python
import tensorflow as tf
from tensorflow_io.arrow import ArrowDataset

# Assume `df` is an existing Pandas DataFrame
dataset = ArrowDataset.from_pandas(df)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
  for i in range(len(df)):
    print(sess.run(next_element))
```

NOTE: The entire DataFrame will be serialized to the Dataset and is not
recommended for use with large amounts of data.
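
For a concrete starting point, `df` in the example above could be a small frame of supported primitive types (hypothetical data, not part of the diff):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame: two columns of primitive dtypes, which the
# dataset infers as one scalar tensor per column per row.
df = pd.DataFrame({'x': np.arange(3, dtype=np.int32),
                   'y': np.linspace(0.0, 1.0, num=3, dtype=np.float64)})
```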
## From Arrow Feather Files

Feather is a lightweight file format that provides a simple and efficient way
to write Pandas DataFrames to disk, see [here](https://arrow.apache.org/docs/python/ipc.html#feather-format)
for more information and limitations of the format. An `ArrowFeatherDataset`
can be created to read one or more Feather files from the given pathnames. The
following example shows how to write a feather file from a Pandas DataFrame,
then read multiple files back as an `ArrowFeatherDataset`:

```python
from pyarrow.feather import write_feather

# Assume `df` is an existing Pandas DataFrame with dtypes=(int32, float32)
write_feather(df, '/path/to/a.feather')
```

```python
import tensorflow as tf
from tensorflow_io.arrow import ArrowFeatherDataset

# Each Feather file must have the same column types, here we use the above
# DataFrame which has 2 columns with dtypes=(int32, float32)
dataset = ArrowFeatherDataset(
    ['/path/to/a.feather', '/path/to/b.feather'],
    columns=(0, 1),
    output_types=(tf.int32, tf.float32),
    output_shapes=([], []))

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# This will iterate over each row of each file provided
with tf.Session() as sess:
  while True:
    try:
      print(sess.run(next_element))
    except tf.errors.OutOfRangeError:
      break
```

An alternate constructor can also be used to infer output types and shapes from
a given `pyarrow.Schema`, e.g. `dataset = ArrowFeatherDataset.from_schema(filenames, schema)`.
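
As a sketch of that alternate constructor (the schema below is an assumed match for the 2-column int32/float32 files written above; column names are hypothetical):

```python
import pyarrow as pa
from tensorflow_io.arrow import ArrowFeatherDataset

# Assumed schema matching the files above; from_schema derives
# output_types and output_shapes from it instead of passing them by hand.
schema = pa.schema([('col0', pa.int32()), ('col1', pa.float32())])
dataset = ArrowFeatherDataset.from_schema(
    ['/path/to/a.feather', '/path/to/b.feather'], schema)
```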
## From a Stream of Arrow Record Batches

The `ArrowStreamDataset` provides a Dataset that will connect to a host over
a socket that is serving Arrow record batches in the Arrow stream format. See
[here](https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-streams)
for more on the stream format. The following example will create an
`ArrowStreamDataset` that will connect to a host that is serving an Arrow
stream of record batches with 2 columns of dtypes=(int32, float32):

```python
import tensorflow as tf
from tensorflow_io.arrow import ArrowStreamDataset

# The str `host` should be in the format '<HOSTNAME>:<PORT>'
dataset = ArrowStreamDataset(
    host,
    columns=(0, 1),
    output_types=(tf.int32, tf.float32),
    output_shapes=([], []))

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# The host connection is made when the Dataset op is run and will iterate over
# each row of each record batch until the Arrow stream is finished
with tf.Session() as sess:
  while True:
    try:
      print(sess.run(next_element))
    except tf.errors.OutOfRangeError:
      break
```

An alternate constructor can also be used to infer output types and shapes from
a given `pyarrow.Schema`, e.g. `dataset = ArrowStreamDataset.from_schema(host, schema)`.
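
To pair with the client above, a toy producer might serve a single record batch in the Arrow stream format over a socket. This is a sketch with an assumed address and data, using pyarrow's stream writer; the dataset's `host` argument would then be `'127.0.0.1:8080'`:

```python
import socket

import pyarrow as pa

# Hypothetical 2-column batch with dtypes=(int32, float32), as expected
# by the ArrowStreamDataset example above.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3], type=pa.int32()),
     pa.array([0.1, 0.2, 0.3], type=pa.float32())],
    ['x', 'y'])

# Assumed address; accept one client and write the stream to it.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 8080))
server.listen(1)
conn, _ = server.accept()
with conn.makefile(mode='wb') as sink:
    writer = pa.RecordBatchStreamWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
conn.close()
server.close()
```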

tensorflow_io/arrow/__init__.py

Lines changed: 36 additions & 0 deletions (new file)

# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Arrow Dataset.

@@ArrowDataset
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensorflow_io.arrow.python.ops.arrow_dataset_ops import ArrowDataset
from tensorflow_io.arrow.python.ops.arrow_dataset_ops import ArrowFeatherDataset
from tensorflow_io.arrow.python.ops.arrow_dataset_ops import ArrowStreamDataset

from tensorflow.python.util.all_util import remove_undocumented

_allowed_symbols = [
    "ArrowDataset",
    "ArrowFeatherDataset",
    "ArrowStreamDataset",
]

remove_undocumented(__name__)
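
With the package installed, the symbols exported above are importable directly (a trivial smoke check, assuming a built `tensorflow_io`):

```python
from tensorflow_io.arrow import (ArrowDataset,
                                 ArrowFeatherDataset,
                                 ArrowStreamDataset)
```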
