chore!: make sharding configuration explicit #3468
waku/waku_lightpush/protocol.nim:

```nim
let parsedTopic = NsContentTopic.parse(pushRequest.message.contentTopic).valueOr:
  let msg = "Invalid content-topic:" & $error
  error "lightpush request handling error", error = msg
  return LightpushResponse(
```
several returns in the error path skipped the metrics report!
waku/waku_lightpush/protocol.nim:

```nim
): Future[LightPushResponse] {.async.} =
  let reqDecodeRes = LightpushRequest.decode(buffer)
  var isSuccess = false
  var pushResponse: LightpushResponse
```
Storing results in a var is usually a code smell, because it forces deeper nesting (instead of just exiting early on an error).
Deeper nesting = less readable code.
Proof in the pudding: some returns were inside nested branches, so the metrics report was skipped: https://github.com/waku-org/nwaku/pull/3468/files#r2168086052
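The hazard described above is generic, not Nim-specific. A minimal sketch (Python, with hypothetical names, not the nwaku code): when a handler has several early returns, a metrics call placed only on the happy path is silently skipped on error paths; moving the reporting to a single wrapper that runs after every return removes that failure mode.

```python
# Hypothetical illustration of "early returns skipping the metrics report".
metrics = {"success": 0, "failure": 0}

def decode(buffer: bytes) -> str:
    # Stand-in for request decoding; raises on malformed input.
    if not buffer:
        raise ValueError("empty request")
    return buffer.decode()

def handle_request(buffer: bytes) -> str:
    try:
        result = decode(buffer)
    except ValueError as e:
        return f"error: {e}"  # early return: easy to forget metrics here...
    return f"ok: {result}"

def handle_request_with_metrics(buffer: bytes) -> str:
    response = handle_request(buffer)
    # ...because reporting happens in exactly one place, on every path.
    if response.startswith("ok"):
        metrics["success"] += 1
    else:
        metrics["failure"] += 1
    return response
```

This is the same effect the refactor in this PR achieves: the handler can use early exits freely while the caller records success/failure once.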
```nim
return netConfigRes
```

```nim
# TODO: numShardsInNetwork should be mandatory with autosharding, and unneeded otherwise
```
You can find the image built from this PR at: Built from bd022cb
```nim
proc validateShards(wakuConf: WakuConf): Result[void, string] =
  let numShardsInNetwork = wakuConf.numShardsInNetwork
```

```nim
# TODO: fix up this behaviour
proc getShardsFromContentTopics*(
    s: Sharding,
    pubsubTopic: Option[PubsubTopic],
    contentTopics: ContentTopic | seq[ContentTopic],
```
this function was always called when pubsubTopic was none...
waku/waku_lightpush/protocol.nim:

```nim
    statusDesc: some(msg),
  )
# TODO: is that fine?
let res = await handleRequest(wl, peerId, pushRequest)
```
Is it OK to call await here? I am not yet familiar with Nim async and whether there are any gotchas.
SionoiS left a comment
LGTM
I just hope we didn't mess up some edge cases yet again...
I swear this is the 5th time we refactor the shard config stuff!
```nim
## Then
let conf = res.get()
check conf.shards.len == expectedConf.numShardsInNetwork.int
check conf.activeRelayShards.len == expectedConf.shardingConf.numShardsInCluster.int
```
activeRelayShards: does that mean we can have inactive ones???
This is about expressing the fact that those "shards" are really shards we want to subscribe to.
Next step will be to remove that from the config entirely as it's not a node setup matter, but a run time matter.
```nim
let shard = node.wakuSharding.getShard(contentTopic).valueOr:
  return err("Could not parse content topic: " & error)
let shard = node.wakuAutoSharding.get().getShard(contentTopic).valueOr:
  return err("Could not parse content topic: " & error)
```
Started to work on fixing interop tests: logos-messaging/logos-delivery-interop-tests#126
The interop-tests PR logos-messaging/logos-delivery-interop-tests#126 is ready for review. Once merged, I'll rerun tests on this PR and it should be green.
gabrielmer left a comment
Wow, amazing work! Thanks so much!
First, I added some questions to further understand some points.
waku/factory/waku_conf.nim:

```nim
clusterId*: uint16
shards*: seq[uint16]
activeRelayShards*: seq[uint16]
```
There's something I still don't get about activeRelayShards. Shouldn't the user configure which shards they want and that's it? And isn't that what it should be in the configuration, the desired shards by the user?
Spoiler: I will remove this parameter, as I am not including it in the Waku API.
These are the shards we want to subscribe to via relay when the node starts.
Define "desired shards by the user"
Define "desired shards by the user"
Exactly this: "the shards we want to subscribe to via relay when the node starts"
When a node is configured and the operator selects the shard they want to operate in. Why call them activeRelayShards instead of shards if it's the config desire of the user/operator and not a run time issue?
or that is the combination of both the shards specified by the user explicitly + the resolved shards by autosharding?
OK, now I'm understanding better: if we don't allow mixing autosharding and static sharding, then it makes sense!
When a node is configured and the operator selects the shard they want to operate in. Why call them activeRelayShards instead of shards if it's the config desire of the user/operator and not a run time issue?
Because the action is unclear: why does one need to specify shards? Well, because those are the shards we want relay to subscribe to. But that only applies in one specific case: when running a service node with relay.
When running an application, the user or developer does not subscribe to shards; they subscribe to the specific content topics they want to receive messages on.
And when not using relay, they don't need shards either.
And even in this case, there is still the problem of:
- I am using TWN
- I want to run a service node to support my node
- What "shard" do I select?
Well, you shouldn't need to "select" a shard, but a "content topic" instead (or a partial one), because autosharding abstracts the concept of shards.
not a run time issue?
So we are clear, it is a run time issue. Getting relay to subscribe to shards is not an initialization matter, it is a run time matter.
waku/waku_core/topics/sharding.nim:

```nim
let shard =
  if shardsRes.isErr():
    return err("Cannot deduce shard from content topic: " & $shardsRes.error)
```
Maybe this way it's simpler?
Suggested change:

```nim
let shard = s.getShard(content).valueOr:
  return err("Cannot deduce shard from content topic: " & $error)
```
(and then removing the `let shardsRes = s.getShard(content)` and the `else: shardsRes.get()`, which GH doesn't let me add in the suggestion)
waku/node/waku_node.nim:

```nim
let msg =
  "Static sharding is used, relay subscriptions must specify a pubsub topic"
```
I think this msg variable is not used anywhere, same for the ContentUnsub case.
```nim
# TODO: when using autosharding, the user should not be expected to pass any shards, but only content topics
# Hence, this joint logic should be removed in favour of an either logic:
# use passed shards (static) or deduce shards from content topics (auto)
```
Not sure I understand or agree with this, one might subscribe to static shards (for example, default TWN shards) but want to use autosharding to publish messages without having to think about shards in their application's logic. Isn't that legitimate usage?
It's the same as #3468 (comment).
I think we first need to acknowledge that these are 2 separate use cases:
(1) A developer using nwaku (as a lib or otherwise) for their app. In this case, if they use autosharding, then they only want to deal with content topics to subscribe to shards or send messages.
(2) A node operator running a node on TWN. Assuming said operator doesn't want to run a node for all shards, why would they run one at all? Is it to support their app? In this case, which shard should they subscribe to? 1, 2, 3? How do they know? Well, this is back to case (1): they should just deal with content topics.
I do agree with @gabrielmer on this point.
Autosharding seems to be a technique both for distributing the traffic randomly across the available shards and for simplifying the publish operation.
It seems that we will always have a combination of both static and automatic sharding. For example, TWN, and any other network, would not be able to operate with only automatic sharding, because the content topic is only available when publishing a message and not when subscribing.
A given network should either have static sharding, or auto sharding, not both. Do we agree with that?
I am a developer that is building my new foobar app on TWN, hence auto sharding is enabled.
Please walk me through how I need to select the "right shard" to subscribe to.
gabrielmer left a comment
Thanks so much! I would wait until @Ivansete-status's review before merging just in case :)
Ivansete-status left a comment
Thanks so much for it! 🙌
Not approving yet because I didn't review it 100%
I don't want to be a blocker so feel free to merge it if you consider appropriate.
Of course, happy to revisit again in the future if you want to apply any change or split it into smaller PRs :)
Thanks again! ❤️
```nim
wakuPeerExchange*: WakuPeerExchange
wakuMetadata*: WakuMetadata
wakuSharding*: Sharding
wakuAutoSharding*: Option[Sharding]
```
The wakuAutoSharding shouldn't belong to WakuNode.
IMHO, this is linked to wakuRelay and should be there instead.
I would disagree here: light push, store and filter also use the autosharding logic, as pubsub topics are part of the request body and need to be inferred from the content topics.
```nim
numShardsInNetwork: Option[uint32]
shards: Option[seq[uint16]]
shardingConf: Option[ShardingConfKind]
numShardsInCluster: Option[uint16]
```
This is a great opportunity to have a good default.
I would not give freedom to select the number of shards in a cluster; no one is going to pick a value there with any confidence.
Instead, we need to use the maximum allowed number of shards within a cluster, which is 1024, according to the spec: https://github.com/waku-org/specs/blob/master/standards/core/relay-sharding.md
When you start a new network, you want to start with a small number of shards, so that you can have several users on the same shard. If you have no users on a shard, you cannot form a gossipsub network.
1024 shards by default means that you would not form any gossipsub networks, and users would not connect to each other, because their application is likely to only connect to one shard.
There is a reason why we selected 8 shards for TWN.
Most new networks would be fine with 1 or 8 shards.
Also, are you thinking of a default for service nodes, or for applications?
A default value of 1 may be better for applications that don't want to use TWN.
```nim
shards: Option[seq[uint16]]
shardingConf: Option[ShardingConfKind]
numShardsInCluster: Option[uint16]
activeRelayShards: Option[seq[uint16]]
```
What are the inactive relay shards :)? I think that's been commented elsewhere: what's wrong with shards name?
Because shards means nothing. I guess I shall rename this variable to relayShardsToSubscribeTo.
These are the shards we want relay to subscribe to.
Do note that this is purely a wakunode2 service node matter. In an application, you would want the application to subscribe at start; this option is purely for the CLI and I will remove it from WakuConf.
```nim
proc withNumShardsInCluster*(b: var WakuConfBuilder, numShardsInCluster: uint16) =
  b.numShardsInCluster = some(numShardsInCluster)

proc withActiveRelayShards*(b: var WakuConfBuilder, shards: seq[uint16]) =
```
I prefer this. More concise ;P
Suggested change:

```nim
proc withShards*(b: var WakuConfBuilder, shards: seq[uint16]) =
```
More concise because it does not convey any meaning. Please start to see nwaku as an application library: what shards should a developer select? The answer is none when using autosharding. And for static sharding, this can be done with the subscribe API, meaning they are the shards to subscribe to.
```nim
clusterId: Option[uint16]
numShardsInNetwork: Option[uint32]
shards: Option[seq[uint16]]
shardingConf: Option[ShardingConfKind]
```
According to the spec, Automatic Sharding should be set by default.
Therefore, this should not be an Option and be set to AutoSharding by default.
https://github.com/waku-org/specs/blob/master/standards/core/relay-sharding.md
I am sorry, but an Option means that it can be set or unset.
A default means that if the developer does not set it, then we apply a default value.
Having a default value does not mean it's the only possible value.
Also well spotted that there is a gap in the spec IMO. It says that auto sharding is the default choice, but does not recommend autosharding parameters such as the number of shards.
Cc @jm-clius: what should be the default number of shards, in the same vein as
Autosharding selects shards automatically and is the default behavior for shard choice
This is because Autosharding is only defined as a concept in this spec and presented as the preferred default option when considering sharding for Waku networks, followed by static (and even named sharding, which we don't even implement anymore). This spec does not define any specific application of autosharding for specific Waku Network(s), as this more properly belongs to the "consensus" specification for a specific network (e.g. the Status Network(s), TWN, etc.). The only network for which we have defined autosharding is TWN, as it's the only specified network with a consensus on cluster ID and number of shards: https://rfc.vac.dev/waku/standards/core/64/network#autosharding.
This out-of-band consensus on a network definition is a must for autosharding to work as expected. The spec should indeed be clarified in this regard.
My initial preference was that autosharding, as an app-level API simplification, should only be available when TWN is configured. However, it became clear for testing purposes and more modular use of Waku, that people preferred having the option of ad-hoc creating their own "network definitions" with autosharding by configuring a numShardsInCluster option. Note that this is a higher-level network definition and has nothing to do at all with which relay shards are actually being subscribed to. You could perfectly well set this parameter and not subscribe to any traffic at all. Also, it implies that these experimenters "know" that others participating in the experiment have configured their nodes in the same way (or are using the default config - see below). The way to formalise this is in the form of published network specifications for each new network, where the number of shards, services, etc. are agreed upon.
Autosharding also only makes sense in the API when applications build on top of a Waku node. In other words, it should only be used when the node is run within specific content topic (i.e. "App") context - it has no use for general service nodes that do not care about specific content topics.
My view of defaults:
- Default configuration should always be for TWN.
- If someone sets a different cluster ID, the default should be that the cluster is assumed to have only 1 shard. There's no need to pre-emptively "scale" these clusters: we had a specific reason for making TWN experimentally work with 8 out of the box, but that reasoning doesn't apply to the defaults.
Thanks @jm-clius. I actually came to the same conclusion overnight, which hopefully is in line with @Ivansete-status's original comment.
With the introduction of the preset concept, we are moving towards TWN being the default. It is not straightforward, as RLN plays a part in it.
I suggest having nwaku default to autosharding with 1 shard when TWN is not set, for a few reasons:
- Autosharding means abstracting the shards concept, which is better for dev ex
- Autosharding is yet to be properly dogfooded on large applications; TWN with 8 is a starter, but no serious app is running on it. Hence, a value of 1 is low risk.
Thanks for the feedback! I agree with the suggested approach 👌
```nim
builder.shardingConf = some(StaticSharding)
of AutoSharding:
  builder.shardingConf = some(AutoSharding)
  builder.numShardsInCluster = some(networkConf.shardingConf.numShardsInCluster)
```
This shouldn't be needed. Suggested removal:

```nim
builder.numShardsInCluster = some(networkConf.shardingConf.numShardsInCluster)
```
What do you mean, it should not be needed? The number of shards in the network impacts the autosharding protocol. If one node sets 8 and another sets 1024, then a given content topic foo will end up on different shards, meaning you won't be able to transmit messages.
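To make the incompatibility concrete, here is a simplified sketch (Python, not the actual RFC 64 hash, which operates on content-topic fields): autosharding derives a shard deterministically from the content topic modulo the configured number of shards, so two nodes that disagree on that number can map the same topic to different shards and never exchange messages.

```python
import hashlib

def derive_shard(content_topic: str, num_shards: int) -> int:
    # Illustrative stand-in for autosharding: hash the topic, reduce mod the
    # cluster's shard count. The result only matches across nodes if they
    # agree on num_shards.
    digest = hashlib.sha256(content_topic.encode()).digest()
    return int.from_bytes(digest[-8:], "big") % num_shards

# A node configured with 8 shards vs. one configured with 1024 shards:
shard_small = derive_shard("/foo/1/bar/proto", 8)
shard_large = derive_shard("/foo/1/bar/proto", 1024)
```

The derivation is deterministic per configuration, which is exactly why `numShardsInCluster` is a network-wide consensus parameter rather than a per-node tuning knob.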
Yes indeed, you are absolutely right. When I wrote my comment I was thinking about having 1024 by default for all nodes, but subscribing to 1024 shards doesn't seem like a great idea because I suspect it won't perform very well.
The mix of static and auto shards generates a lot of confusion and is a source of possible discrepancies among nodes. For example, what happens if someone configures the following?
- shards [1, 2, 3]
- plus numShardsInCluster = 4
In that case, the foo content topic can potentially fall on the wrong shard (considering all nodes are in 1, 2, 3).
Therefore, I think we need to infer the value of numShardsInCluster, or have a good default.
Another possible option would be to implement a hybrid solution between static and auto sharding: always have the shards list defined and, in case of using auto-sharding, infer the available shards from that list. Wdyt?
Let's take it from a user PoV. I am only covering the case when auto-sharding is enabled.

1. Library user / application developer: in this case, I don't think --shards should be used at all. They should use subscribe(contentTopic) to subscribe to a topic. When sending a message while not subscribed to the right shard, we will have two choices:
   - non-blocking logic: fall back to light push
   - blocking logic: error and require subscribe to be done first
2. Node operator for a given application: in this case, the operator wants to support their specific application and run relay/store/etc. for their shards, which means they would use --content-topics to subscribe to the right shards.
3. Node operator for Waku/incentivization/etc.: in this instance, a node operator may want to support all shards, or the specific shards where there is most traffic, meaning --shards is most appropriate.

In conclusion

To set up relay shard subscription:

A. On the wakunode2 CLI, it is fine for a user to use either --shards or --content-topic to specify what shards to subscribe to, but not both at the same time. Note that a partial content topic should be accepted (only the first fields are needed for autosharding).
B. On the Waku API, I would suggest that neither --shards nor --content-topic are available at node initialization; instead, subscribe(contentTopic) is used.

When sending a message, in relay mode, on a shard not subscribed to: I suggest erroring out for now. We can decide at a later stage to fall back to light push.
Therefore, I think we need to infer the value of numShardsInCluster, or have a good default.
The default network conf value should be "TWN", so that this number is 8.
If someone wants to roll out their own network, with their own cluster id, then I agree we can propose a default value. Probably 1 or 4.
@jm-clius WDYT?
@Ivansete-status just so we are on the same page, you seem to say that numShardsInCluster is not needed, while also wanting a good default in lieu of the value.
Ultimately, it is needed because different networks may have different shard counts. For example, we may grow the number of shards on the Status network to 50 with the chat SDK. But someone bootstrapping a brand new network may prefer to stick to 1 until scaling is needed. So this parameter is needed.
Then the question of the default value remains up for discussion, right?
A. and B. makes sense to me in this comment: #3468 (comment)
The following looks great to me as well:
When sending a message, in relay mode, on a shard not subscribed to. I suggest to error out for now.
If someone want to roll out their own network, and own cluster id, then I agree we can propose a default value. Probably 1 or 4.
Why not 8 too?
I think we need a good first default, 8, and then define a scaling algorithm that makes the network breathe, i.e., subscribe to more or fewer shards depending on the bandwidth needed. That will need a consensus in the network.
For that, I believe we also need to establish a fixed consensus shard on every network, on every cluster. For example, we can enforce that every cluster has shard 0 reserved for only containing Waku consensus and coordination messages. (Another option is having a waku/2/rs/consensus pubsub topic and leaving shard 0 for data.)
That consensus topic will help nodes reach an agreement on:
- Define the total number of shards in the network. Each node will subscribe/unsubscribe accordingly.
- Determine which nodes will act as bridge or aggregator between shards.
- When to start scaling up/down, i.e., when to start changing the current shard set.
- Define the shards a certain node will subscribe to (we can do math considering the peerId.)
Ultimately, it is needed because different network may have different shards.
I think you are absolutely right with that but, on the other hand, I believe we need to implement a network algorithm that makes the network scale up/down accordingly. I strongly suggest following the consensus topic idea that I commented above.
Then the question of default value remain up to discussion, right?
Yes :) I'd suggest having a good default plus an algorithm that makes it adaptive as per network conditions.
I don't think at this stage we need a consensus algorithm to set a network definition. Consensus for the networks we define is reached by publishing an RFC. We have only one defined network, which is TWN for which number of shards is 8. If we need to scale, we'll publish a new version of the spec - TWN Gen 1 if you will - and autosharding is already defined in such a way that it will work with backwards compatibility.
As per #3468 (comment) I will proceed with the following changes:
- autosharding by default
- number of shards in cluster: 1
```nim
let shardingConf =
  if builder.shardingConf.isSome():
    case builder.shardingConf.get()
    of StaticSharding:
      ShardingConf(kind: StaticSharding)
    of AutoSharding:
      if builder.numShardsInCluster.isSome():
        ShardingConf(
          kind: AutoSharding, numShardsInCluster: builder.numShardsInCluster.get()
        )
      else:
        return err(
          "Number of shards in cluster must be specified with autosharding enabled"
        )
```
I think this can be simplified if we apply a good default for numShardsInCluster (1024) and avoid making shardingConf an Option.
ShardingConf MUST remain an enum. Because numShardsInCluster does not make sense for a static sharding conf.
```nim
numShardsInNetwork* {.
  desc: "Number of shards in the network",
  desc:
    "Enables autosharding and set number of shards in the cluster, set to `0` to use static sharding",
```
As I mentioned before, this is a great opportunity to remove this parameter :D
```nim
ShardingConf* = object
  case kind*: ShardingConfKind
  of AutoSharding:
    numShardsInCluster*: uint16
  of StaticSharding:
    discard
```
This is beautiful indeed ❤️
Nevertheless, I think we can simplify it by having a good default for numShardsInCluster (1024) in which case, the ShardingConf wouldn't be needed.
Sorry for repeating this same simplification idea multiple times :D
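The design being praised here is a sum type: numShardsInCluster exists only on the auto-sharding branch, so "static sharding with a shard count" is unrepresentable. A rough Python analogue of the same idea (hypothetical names, for illustration only):

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class AutoSharding:
    # Only meaningful for autosharding: the network-wide shard count.
    num_shards_in_cluster: int

@dataclass(frozen=True)
class StaticSharding:
    # No shard-count field at all: shards are chosen explicitly elsewhere.
    pass

ShardingConf = Union[AutoSharding, StaticSharding]

def describe(conf: ShardingConf) -> str:
    # Consumers branch on the variant, mirroring Nim's `case kind` dispatch.
    if isinstance(conf, AutoSharding):
        return f"auto sharding with {conf.num_shards_in_cluster} shards"
    return "static sharding"
```

The payoff is the same as with the Nim variant object: code that handles StaticSharding cannot accidentally read a shard count that was never meaningful for that mode.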
Will merge when CI is green; we can always revisit with a PR. Description is updated, last change: subscribe shards default for libwaku can be reviewed as part of the Waku API.
* Reserve `networkconfig` name for Waku network related settings
* Rename cluster conf to network conf: a `NetworkConf` is a Waku network configuration
* Improve sharding configuration: a smarter data type simplifies the logic
* Fixing tests
* fixup! rename to endpointConf
* wip: autosharding is a specific configuration state, treat it like it
* refactor lightpush handler: some metrics error reporting was missing
* test_node_factory tests pass
* remove warnings
* fix tests
* Revert eager previous replace-all command
* fix up build tools compilation
* metadata is used to store cluster id
* Mount relay routes in static sharding
* Rename activeRelayShards to subscribeShards, to make it clearer that these are the shards the node will subscribe to
* Remove unused msg var
* Improve error handling
* Set autosharding as default, with 1 shard in network; also make the node subscribe to all shards in auto sharding, none in static sharding
Description
Make it explicit that the node can be configured with either auto sharding or static sharding. Treat static sharding as a happy path (previously it was an implicit error or do-nothing-happy path).
Make it clear that without auto-sharding, pubsub topics can't be derived from content topics.
Do so by improving the internal node configuration, making it clear that there are two options, and that the number of shards in the network is only needed for auto sharding.
Also proceed with a number of naming improvements, as per logos-messaging/specs#65
Some refactoring done along the way.
Changes
- Autosharding by default, with 1 shard in network
- `shards` (or `subscribeShards`) defaults
- `shards` is actually the shards we want relay to subscribe to, so rename accordingly
- Use `uint16` for cluster id and shard id more (but not totally) consistently across the code base
- Move `WebSocketConf` to the right place

Follow-up changes

- Remove `activeRelayShards` in config, in favour of having `wakunode2` just automatically subscribe to the shards passed on the CLI

Issue