Skip to content

Conversation

@ltagliamonte-dd
Copy link
Contributor

@ltagliamonte-dd ltagliamonte-dd commented Mar 21, 2025

Dear community,

We're working on leveraging RocksDB sideloading to ship updates to our servers instead of loading data through the Redis interface.

Sideloading in RocksDB is more efficient than direct writes because it bypasses the standard write path, reducing write amplification, CPU overhead, and compaction costs. By ingesting pre-built SST files via IngestExternalFile(), it minimizes WAL writes and MemTable pressure, making bulk inserts faster and more efficient.

Since we don't use replication in our setup, we can ship bulk updates directly to each Kvrocks server and load the data efficiently.

We’d love to contribute our sideloading work back to the community and are excited to share our approach. Looking forward to your thoughts!

Our approach:
The current approach introduce a new CMD: SST LOAD path, which detects the files to load from a folder, and sideloads the data into RocksDB.

@ltagliamonte-dd ltagliamonte-dd changed the title Sideloading feat(enhancement): Sideloading Mar 21, 2025
@PragmaTwice PragmaTwice changed the title feat(enhancement): Sideloading feat(storage): support for sideloading SSTs Mar 22, 2025
@git-hulk
Copy link
Member

@ltagliamonte-dd @PragmaTwice Do you think it's a good idea to provide a command for this purpose?

For example:

LOAD SST <local dir>

So that it would be easy to ingest SSTs online.

@ltagliamonte
Copy link

@PragmaTwice @git-hulk I'm not married to the use of a signal.
I can definitely add a new server command.
Are you folks ok with the rest of the proposal?

@PragmaTwice
Copy link
Member

@ltagliamonte-dd @PragmaTwice Do you think it's a good idea to provide a command for this purpose?

For example:

LOAD SST <local dir>

So that it would be easy to ingest SSTs online.

Better to be SST LOAD.

@git-hulk
Copy link
Member

@ltagliamonte-dd @PragmaTwice Do you think it's a good idea to provide a command for this purpose?
For example:

LOAD SST <local dir>

So that it would be easy to ingest SSTs online.

Better to be SST LOAD.

I proposed LOAD SST for the sake of a command would be better to start with its behavior.

@PragmaTwice
Copy link
Member

PragmaTwice commented Mar 22, 2025

But the command name will be LOAD if it is. Maybe LOADSST or SSTLOAD would be better.

SST LOAD is similiar to RDB LOAD.

@git-hulk
Copy link
Member

But the command name will be LOAD if it is. Maybe LOADSST or SSTLOAD would be better.

SST LOAD is similiar to RDB LOAD.

Makes sense.

@ltagliamonte-dd
Copy link
Contributor Author

@git-hulk @PragmaTwice re-worked as new CMD, thanks for the feedback

@caipengbo
Copy link
Contributor

caipengbo commented Mar 25, 2025

I feel that this is not a production-usable feature without considering replication. This scenario is too limiting.

@ltagliamonte
Copy link

I feel that this is not a production-usable feature without considering replication. This scenario is too limiting.

Thank you for the feedback! I completely understand your concern. Please know that this initial version of SST load doesn't rule out the possibility of adding replication in the future. We're committed to ensuring that nothing we implement is backwards incompatible.

@git-hulk
Copy link
Member

I feel that this is not a production-usable feature without considering replication. This scenario is too limiting.

Thank you for the feedback! I completely understand your concern. Please know that this initial version of SST load doesn't rule out the possibility of adding replication in the future. We're committed to ensuring that nothing we implement is backwards incompatible.

Yes, we could allow this command only if it's a standalone instance without any replicas.

@ltagliamonte-dd
Copy link
Contributor Author

I feel that this is not a production-usable feature without considering replication. This scenario is too limiting.

Thank you for the feedback! I completely understand your concern. Please know that this initial version of SST load doesn't rule out the possibility of adding replication in the future. We're committed to ensuring that nothing we implement is backwards incompatible.

Yes, we could allow this command only if it's a standalone instance without any replicas.

addressed via 68ef7f8

@ltagliamonte-dd
Copy link
Contributor Author

ltagliamonte-dd commented Apr 16, 2025 via email

@git-hulk
Copy link
Member

@git-hulk is there a method already there I can use to check if the node has replicas?

You could add a method in Server to export the replica count, for example:

size_t GetReplicaCount() {
   slave_threads_mu_.lock();
   auto replica_count = slave_threads_.size();
   slave_threads_mu_.unlock();
   return replica_count;
} 

git-hulk
git-hulk previously approved these changes Apr 17, 2025
@git-hulk
Copy link
Member

@ltagliamonte Thanks for your contribution and patience.

@ltagliamonte-dd
Copy link
Contributor Author

@ltagliamonte Thanks for your contribution and patience.

thank you @git-hulk.
Looking forward to pull the nightly!!

caipengbo
caipengbo previously approved these changes Apr 17, 2025
}
for (const auto &cf_name : cf_names) {
cf_files[cf_name] = std::vector<std::string>();
cf_files[cf.Name()] = std::vector<std::string>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What @PragmaTwice should be is that we don't need to initialize the empty vector for the map?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we can just remove this line.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
@PragmaTwice can you please approve?

@git-hulk
Copy link
Member

@caipengbo @PragmaTwice To see you guys have further comments, or we could merge after the CI becomes green.

Copy link
Contributor

@caipengbo caipengbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@git-hulk git-hulk merged commit 76a21bc into apache:unstable Apr 17, 2025
33 checks passed
@sonarqubecloud
Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants