### Streaming
Spark-Redis supports streaming data from Stream and List data structures:
- [Redis Stream](#redis-stream)
- [Redis List](#redis-list)
## Redis Stream
To stream data from [Redis Stream](https://redis.io/topics/streams-intro), use the `createRedisXStream` method. It automatically creates a consumer group if it doesn't exist and starts listening for messages in the stream.
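A minimal sketch of how this might look (the stream, group, and consumer names are placeholders, and the exact import providing the implicit `createRedisXStream` method is an assumption):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// assumed import that adds createRedisXStream to StreamingContext
import com.redislabs.provider.redis.streaming._

val conf = new SparkConf().setAppName("redis-stream-demo").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(1))

// Consume 'my-stream' as consumer 'my-consumer-1' of group 'my-consumer-group'
val stream = ssc.createRedisXStream(
  Seq(ConsumerConfig("my-stream", "my-consumer-group", "my-consumer-1")))
stream.print()

ssc.start()
ssc.awaitTermination()
```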
### Stream Offset
By default, it pulls messages starting from the latest message. If you need to start from the earliest message or from a specific position in the stream, specify the `offset` parameter:
```scala
ConsumerConfig("my-stream", "my-consumer-group", "my-consumer-1", offset = Earliest) // start from '0-0'
ConsumerConfig("my-stream", "my-consumer-group", "my-consumer-1", offset = IdOffset(42, 0)) // start from '42-0'
```
Please note, spark-redis will attempt to create a consumer group with the specified offset, but if the consumer group already exists, it will use the existing offset. This means that, for example, if you decide to re-process all messages from the beginning, changing the offset to `Earliest` may not be enough: you may need to either delete the consumer group with `XGROUP DESTROY` or modify its offset with `XGROUP SETID`.
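For instance, reusing the stream and group names from the examples above, the reset could be done from `redis-cli` (a sketch; adjust names to your setup):

```shell
# Rewind the group's offset to the beginning of the stream
redis-cli XGROUP SETID my-stream my-consumer-group 0-0
# ...or drop the group entirely; spark-redis will re-create it on the next run
redis-cli XGROUP DESTROY my-stream my-consumer-group
```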
### Receiver reliability
The DStream is implemented with a [Reliable Receiver](https://spark.apache.org/docs/latest/streaming-custom-receivers.html#receiver-reliability) that acknowledges only after the data has been stored in Spark. As with any other receiver, to achieve strong fault-tolerance guarantees and ensure zero data loss, you have to enable [write-ahead logs](https://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications) and checkpointing.
The received data is stored with `StorageLevel.MEMORY_AND_DISK_2` by default.
The storage level can be configured with the `storageLevel` parameter.
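A sketch of what this might look like, assuming `createRedisXStream` accepts a `storageLevel` parameter alongside the consumer configs (names reused from the examples above):

```scala
import org.apache.spark.storage.StorageLevel

// Keep received blocks on a single node instead of the default replicated level
val stream = ssc.createRedisXStream(
  Seq(ConsumerConfig("my-stream", "my-consumer-group", "my-consumer-1")),
  storageLevel = StorageLevel.MEMORY_AND_DISK
)
```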
An input DStream created this way corresponds to a single receiver running in a Spark executor; the receiver creates one thread per configured consumer, pulling data from the stream in parallel. However, if data receiving becomes a bottleneck, you may want to start multiple receivers in different executors (worker machines). This can be achieved by creating multiple input DStreams and `union`-ing them together. You can read more about it [here](https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving).
For example, you can create two receivers pulling data from `my-stream` and balancing the load between them.
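A sketch of this approach, reusing the names from the earlier examples (two consumers of the same group, one per DStream):

```scala
// Each createRedisXStream call creates its own receiver; the two consumers
// share one group, so Redis balances messages between them.
val stream1 = ssc.createRedisXStream(
  Seq(ConsumerConfig("my-stream", "my-consumer-group", "my-consumer-1")))
val stream2 = ssc.createRedisXStream(
  Seq(ConsumerConfig("my-stream", "my-consumer-group", "my-consumer-2")))
val combined = ssc.union(Seq(stream1, stream2))
```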
If the cluster resources are not large enough to process data as fast as it is received, the receiving rate can be limited:
```scala
ConsumerConfig("stream", "group", "c-1", rateLimitPerConsumer = Some(100)) // 100 items per second
```
It defines the maximum number of items received per second, per consumer.
Other options you can configure are `batchSize` and `block`. They define the maximum number of items pulled per `XREADGROUP` call and the number of milliseconds to block waiting in that call.
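A sketch of tuning these parameters (the values are illustrative, not defaults):

```scala
// Pull at most 200 items per XREADGROUP call, blocking up to 500 ms per call
ConsumerConfig("stream", "group", "c-1", batchSize = 200, block = 500)
```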
## Redis List

The stream can also be created from a Redis List; the data is fetched with the `blpop` command. You are required to provide an array of the List names you are interested in. The [storageLevel](http://spark.apache.org/docs/latest/streaming-programming-guide.html#data-serialization) is `MEMORY_AND_DISK_SER_2` by default; you can change it on demand.
The method `createRedisStream` will create a `(listName, value)` stream, but if you don't care about which list feeds the value, you can use `createRedisStreamWithoutListname` to get a stream of just the values.
Use the following to get a `(listName, value)` stream from the `foo` and `bar` lists:
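A minimal sketch, assuming `createRedisStream` takes the list names and an optional storage level (the import providing the implicit method is an assumption):

```scala
import org.apache.spark.storage.StorageLevel
// assumed import that adds createRedisStream to StreamingContext
import com.redislabs.provider.redis.streaming._

// Values pushed to 'foo' or 'bar' arrive as (listName, value) pairs
val listStream = ssc.createRedisStream(Array("foo", "bar"),
  storageLevel = StorageLevel.MEMORY_AND_DISK_SER_2)
listStream.print()
```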