
[DOCS] Added important updateStateByKey details #7229


Closed · wants to merge 4 commits
2 changes: 2 additions & 0 deletions docs/streaming-programming-guide.md
@@ -854,6 +854,8 @@ it with new information. To use this, you will have to do two steps.
1. Define the state update function - Specify with a function how to update the state using the
previous state and the new values from an input stream.

Spark will run the `updateStateByKey` update function for all existing keys, regardless of whether they have new data in a batch or not. If the update function returns `None` then the key-value pair will be eliminated.
Contributor


This whole section is about updateStateByKey so saying it again here is superfluous. Just "run the update function". Also I would clarify further: "In every batch, Spark will apply the update function for all... ". It wasn't clear whether it was for every batch or overall.
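
For readers of this thread, here is a minimal Scala sketch of the behavior the added line describes. The update function is invoked for every tracked key in every batch, with an empty `newValues` for keys that received no new data, and returning `None` eliminates that key-value pair. The function name and the drop condition below are hypothetical, not part of the diff.

```scala
// Hypothetical example: keep a running Int total per key, and drop a key's
// state whenever a batch brings no new values for it. In a real job the drop
// condition would more likely be a timeout than "no data in this batch".
def updateWithEviction(newValues: Seq[Int], state: Option[Int]): Option[Int] = {
  if (newValues.isEmpty) {
    None                                      // returning None eliminates this key-value pair
  } else {
    Some(state.getOrElse(0) + newValues.sum)  // otherwise fold the new values into the state
  }
}
```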


Let's illustrate this with an example. Say you want to maintain a running count of each word
seen in a text data stream. Here, the running count is the state and it is an integer. We
define the update function as:
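
The guide goes on to define that update function. A minimal sketch of what it could look like, assuming the stream has already been mapped to a `DStream` of `(word, 1)` pairs named `pairs` (an assumed name, not shown in this excerpt):

```scala
// Running word count: add the counts from the current batch to the previous total.
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(runningCount.getOrElse(0) + newValues.sum)
}

// Apply it to the (word, 1) pairs; every word seen so far is carried as state.
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
```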