TensorFlow I/O is a collection of file systems and file formats that are not
available in TensorFlow's built-in support.

At the moment TensorFlow I/O supports 5 data sources:
- `tensorflow_io.ignite`: Data source for Apache Ignite and Ignite File System (IGFS).
- `tensorflow_io.kafka`: Apache Kafka stream-processing support (see the sketch after this list).
- `tensorflow_io.kinesis`: Amazon Kinesis data streams support.
- `tensorflow_io.hadoop`: Hadoop SequenceFile format support.
- `tensorflow_io.arrow`: Apache Arrow data format support.
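
As a taste of the API, the Kafka module exposes a `KafkaDataset` that yields
messages as string tensors. The sketch below is a minimal illustration, not a
definitive example: it assumes a local broker with a topic `test` (partition 0)
and the `KafkaDataset` signature inherited from `tf.contrib.kafka`; adjust the
`topics`, `servers`, and `group` arguments for your setup:

```python
import tensorflow as tf
from tensorflow_io.kafka import KafkaDataset

# Read all messages from partition 0 of the hypothetical topic `test` on a
# local broker, stopping when the end of the stream is reached (eof=True)
dataset = KafkaDataset(topics=["test:0"], servers="localhost", group="test", eof=True)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
  while True:
    try:
      print(sess.run(next_element))
    except tf.errors.OutOfRangeError:
      break
```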

## Installation

Note that Python has to be run outside of the repo directory itself; otherwise,
Python may not be able to find the correct path to the module.

## Using TensorFlow I/O

### Apache Arrow Datasets

Apache Arrow is a standard for in-memory columnar data; see [here](https://arrow.apache.org)
for more information on the project. An Arrow dataset makes it easy to bring
structured, columnar data into TensorFlow from the following sources:

#### Pandas DataFrame

An `ArrowDataset` can be made directly from an existing Pandas DataFrame, or
from pyarrow record batches, in a Python process. Tensor types and shapes can
be inferred from the DataFrame, although currently only scalar and vector
values with primitive types are supported. PyArrow must be installed to use
this Dataset. Example usage:

```python
import tensorflow as tf
from tensorflow_io.arrow import ArrowDataset

# Assume `df` is an existing Pandas DataFrame
dataset = ArrowDataset.from_pandas(df)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
  for i in range(len(df)):
    print(sess.run(next_element))
```

NOTE: The entire DataFrame is serialized into the Dataset, so this is not
recommended for large amounts of data.

#### Arrow Feather Dataset

Feather is a lightweight file format that provides a simple and efficient way
to write a Pandas DataFrame to disk; see [here](https://arrow.apache.org/docs/python/ipc.html#feather-format)
for more information and limitations of the format. An `ArrowFeatherDataset`
can be created to read one or more Feather files. The following example shows
how to write a Feather file from a Pandas DataFrame, then read multiple files
back as an `ArrowFeatherDataset`:

```python
from pyarrow.feather import write_feather

# Assume `df` is an existing Pandas DataFrame with dtypes=(int32, float32)
write_feather(df, '/path/to/a.feather')
```
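
The write example above assumes `df` already exists. For reference, here is
one way such a DataFrame could be constructed (a minimal sketch; the column
names `x` and `y` are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with the dtypes=(int32, float32) layout used in
# these examples
df = pd.DataFrame({
    'x': np.arange(10, dtype=np.int32),
    'y': np.linspace(0.0, 1.0, 10, dtype=np.float32),
})
```

The files can then be read back as an `ArrowFeatherDataset`: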

```python
import tensorflow as tf
from tensorflow_io.arrow import ArrowFeatherDataset

# Each Feather file must have the same column types; here we use the above
# DataFrame, which has 2 columns with dtypes=(int32, float32)
dataset = ArrowFeatherDataset(
    ['/path/to/a.feather', '/path/to/b.feather'],
    columns=(0, 1),
    output_types=(tf.int32, tf.float32),
    output_shapes=([], []))

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# This will iterate over each row of each file provided
with tf.Session() as sess:
  while True:
    try:
      print(sess.run(next_element))
    except tf.errors.OutOfRangeError:
      break
```

An alternate constructor can also be used to infer output types and shapes from
a given `pyarrow.Schema`, e.g. `dataset = ArrowFeatherDataset.from_schema(filenames, schema)`.
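
For example, a schema matching the two-column layout above can be constructed
with pyarrow and passed to `from_schema` (a minimal sketch; the field names
are placeholders):

```python
import pyarrow as pa
from tensorflow_io.arrow import ArrowFeatherDataset

# Hypothetical schema for the (int32, float32) columns used above; output
# types and shapes are inferred from the schema instead of given explicitly
schema = pa.schema([('x', pa.int32()), ('y', pa.float32())])
dataset = ArrowFeatherDataset.from_schema(
    ['/path/to/a.feather', '/path/to/b.feather'], schema)
```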

#### Arrow Stream Dataset

The `ArrowStreamDataset` provides a Dataset that connects to a host over a
socket serving Arrow record batches in the Arrow stream format. See
[here](https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-streams)
for more on the stream format. The following example creates an
`ArrowStreamDataset` that connects to a host serving an Arrow stream of record
batches with 2 columns of dtypes=(int32, float32):

```python
import tensorflow as tf
from tensorflow_io.arrow import ArrowStreamDataset

# The str `host` should be in the format '<HOSTNAME>:<PORT>'
dataset = ArrowStreamDataset(
    host,
    columns=(0, 1),
    output_types=(tf.int32, tf.float32),
    output_shapes=([], []))

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# The host connection is made when the Dataset op is run and will iterate over
# each row of each record batch until the Arrow stream is finished
with tf.Session() as sess:
  while True:
    try:
      print(sess.run(next_element))
    except tf.errors.OutOfRangeError:
      break
```
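
To try this locally, the serving host can be emulated with pyarrow and a plain
socket. This is a minimal sketch rather than part of `tensorflow_io`; the
address, port, and column names are placeholders (pair it with
`host = 'localhost:8080'` in the snippet above):

```python
import socket

import pyarrow as pa

# Build one record batch with the (int32, float32) layout used above
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3], type=pa.int32()),
     pa.array([0.1, 0.2, 0.3], type=pa.float32())],
    ['x', 'y'])

# Serve the batch in the Arrow stream format to the first client that
# connects, then close the connection to end the stream
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('localhost', 8080))
listener.listen(1)
conn, _ = listener.accept()
with conn.makefile(mode='wb') as f:
  writer = pa.RecordBatchStreamWriter(f, batch.schema)
  writer.write_batch(batch)
  writer.close()
conn.close()
listener.close()
```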

An alternate constructor can also be used to infer output types and shapes from
a given `pyarrow.Schema`, e.g. `dataset = ArrowStreamDataset.from_schema(host, schema)`.

## Developing

### Python