Skip to content

Conversation

@nvazquez
Copy link
Contributor

@nvazquez nvazquez commented May 19, 2022

Description

This PR enables IO_URING only when it is available on the hosts, as being discussed on #6253 (comment)

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Copy link
Member

@weizhouapache weizhouapache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

@nvazquez
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@nvazquez a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@GabrielBrascher
Copy link
Member

@nvazquez maybe we should disable IO_URING to Rhel environments only, as these are the affected ones.
Ubuntu environments normally have the correct features mapped by Libvirt/Qemu.

Unfortunately, as we have been seeing in multiple KVM implementations, RedHat has a different route and tends to not keep up the pace with most of the new features that Qemu/Libvirt present. In this regard, there is no perfect world and we will always be compromising one side or the other (as many other features that had to be blocked due to Rhel).

I see that global settings are indeed a good way of disabling such features, although we can easily get lost with the growth of multiple features that need to be enabled/disabled accordingly to CentOS vs Ubuntu.

We might need better options to proceed with this.

With that said, your approach looks fine with me, as a quick fix.
A note in the upgrade process, raising how to proceed to enable in case the KVM is running in Ubuntu (or how to proceed to disable in the case of Rhel) could be interesting.

Copy link
Member

@GabrielBrascher GabrielBrascher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @nvazquez! LGTM

A documentation point raising this might be good, in order to avoid users with Ubuntu and the specific Libvirt/Qemu setup missing this feature.

Although I do not believe this is going to be a good way for handling the multiple differences between CentOS & Ubuntu in the medium/long term ... it is what it is :)

For another time & future implementations: we might need to revisit some parts where we check for the Libvirt version as well and run different checks.

@rohityadavcloud
Copy link
Member

@nvazquez cc @GabrielBrascher @wido what about Suse? If we can reliably detect the distro then the default behaviour (which should be documented somewhere) can be: if is Ubuntu, then enable io_uring, otherwise, disable it by default for other distros, and use the agent.properties always at the override.

@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✖️ debian ✔️ suse15. SL-JID 3445

@wido
Copy link
Contributor

wido commented May 19, 2022

io_uring brings a massive performance improvement to end-users. It's RHEL and thus CentOS/Alma/Rocky which have this problem. Ubuntu does not have this issue.

I do understand the problem here where CloudStack doesn't work out of the box, but if CloudStack becomes a project where you have a super long list of tunables/settings to get the optimal performance out of the system you will see users having endless tuning to do.

In the future io_uring should really be enabled by default or at least we should enable it if supported by the Qemu version. Otherwise users won't get the performance they want/expect.

Proxmox nowadays enables io_uring by default for all VMs being created. It's just such a massive performance boost.

@GabrielBrascher
Copy link
Member

@rohityadavcloud your idea would be great indeed in order to present the best features to each distro.

It is a shame that not all OS distros support some great features that Libvirt/Qemu present.
We've been always retrieving the Libvirt / Qemu versions in the codebase to decide which way to proceed when adding KVM features. I tend to follow Libvirt versions and release notes. Unfortunately, this has not been the best way in this particular case.

For us, as developers, it will always be a struggle to map all the different environments when developing a feature.

@andrijapanicsb
Copy link
Contributor

@GabrielBrascher @wido - we do understand that there are huge performance benefits, but that is for a small subset of OS-es that support it.

If you can propose a better solution, than this one here (to disable by default) where existing EL8 implementations are not broken (and that is the vast majority or ACS users today) - I'm for it (i.e. if we can reliably detect where to enable and where not) - otherwise we can't just go around and break things for 95% of the users...

@GabrielBrascher
Copy link
Member

@andrijapanicsb I totally understand it. I am sorry in the case I was not clear.
I am +1 on proceeding with the current route and merging this, I do understand that this impact a relevant amount of our users.

But I wanted to take the opportunity to raise the concerns.
We've been seeing this happening other times, and it will stay happening many other times in the future.

Supporting a wide set of technologies has been a great strength of CloudStack, I do not want to take this out.
But we might need to improve the way we design some features in order to explore the best of each environment that CloudStack will be running into.

Otherwise, inevitably, people will start blaming CloudStack / KVM for "poor performance" and walk away to other options that suit best them. While they could actually be having the best for their environment.

Right now, I don't have any option on the table to fix it. But it is something good to keep in mind (for another time).

@rohityadavcloud
Copy link
Member

@nvazquez based on this, possibly the best compromise is to have io_uring enabled by default for Ubuntu based KVM hosts, but disabled otherwise. Can you amend the PR with this logic and propose a doc PR to document the default behaviour by distro? Thanks.

@slavkap
Copy link
Contributor

slavkap commented May 19, 2022

Hi all, we could try by parsing the stderr (but I know it's ugly) and enable/disable according to the output:

Ubuntu

sudo /usr/bin/qemu-system-x86_64 -drive aio=io_uring
qemu-system-x86_64: Device needs media, but drive is empty

Rocky

 /usr/libexec/qemu-kvm -drive aio=io_uring 
qemu-kvm: -drive aio=io_uring: invalid aio option

@nvazquez
Copy link
Contributor Author

Have discussed with @slavkap and @GabrielBrascher about a possible fix to detect if iouring is enabled by a call to qemu-kvm with the aio parameter. Exploring this option and will be updating the PR

@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 3460

@nvazquez
Copy link
Contributor Author

@blueorangutan test

@blueorangutan
Copy link

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@apache apache deleted a comment from blueorangutan May 20, 2022
@blueorangutan
Copy link

Trillian test result (tid-4222)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30149 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6399-t4222-kvm-centos7.zip
Smoke tests completed. 92 look OK, 0 have errors
Only failed tests results shown below:

Test Result Time (s) Test File

Comment on lines 824 to 825
String enableIoUringConfig = (String) params.get(ENABLE_IO_URING_PROPERTY);
enableIoUring = isIoUringEnabled(enableIoUringConfig);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String enableIoUringConfig = (String) params.get(ENABLE_IO_URING_PROPERTY);
enableIoUring = isIoUringEnabled(enableIoUringConfig);
enableIoUring = isIoUringEnabled(enableIoUringConfig);

needs to be moved after this part of the code

try {
_hvVersion = conn.getVersion();
_hvVersion = _hvVersion % 1000000 / 1000;
_hypervisorLibvirtVersion = conn.getLibVirVersion();
_hypervisorQemuVersion = conn.getVersion();
} catch (final LibvirtException e) {
s_logger.trace("Ignoring libvirt error.", e);
}

otherwise it will be always false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @slavkap, done

@nvazquez
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@nvazquez a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 3463

@nvazquez
Copy link
Contributor Author

@blueorangutan test

@blueorangutan
Copy link

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian Build Failed (tid-4224)

@rohityadavcloud
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@rohityadavcloud a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian Build Failed (tid-4225)

@rohityadavcloud
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@rohityadavcloud a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian Build Failed (tid-4226)

// Do not remove - switching it to AgentProperties.Property may require accepting null values for the properties default value
String enableIoUringConfig = (String) params.get(ENABLE_IO_URING_PROPERTY);
enableIoUring = isIoUringEnabled(enableIoUringConfig);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one small suggestion if you want to log if the IO uring is enabled/disabled

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please, let's do the proper logging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

Copy link
Contributor

@slavkap slavkap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM! Rebased on main and tested on Rocky - libvirt/qemu versions

Compiled against library: libvirt 8.0.0
Using library: libvirt 8.0.0
Using API: QEMU 8.0.0
Running hypervisor: QEMU 6.2.0

and with set/not set enable.io.uring in the agent properties

@andrijapanicsb
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@andrijapanicsb a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@nvazquez nvazquez merged commit dc975df into apache:4.16 May 23, 2022
@blueorangutan
Copy link

Trillian test result (tid-4231)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 38318 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6399-t4231-kvm-centos7.zip
Smoke tests completed. 91 look OK, 1 have errors
Only failed tests results shown below:

Test Result Time (s) Test File
test_disable_oobm_ha_state_ineligible Error 1512.29 test_hostha_kvm.py

weizhouapache pushed a commit to weizhouapache/cloudstack that referenced this pull request May 24, 2022
* [KVM] Disable IOURING by default on agents

* Refactor

* Remove agent property for iouring

* Restore property

* Refactor suse check and enable on ubuntu by default

* Refactor irrespective of guest OS

* Improvement

* Logs and new path

* Refactor condition to enable iouring

* Improve condition

* Refactor property check

* Improvement

* Doc comment

* Extend comment

* Move method

* Add log
Pearl1594 pushed a commit to shapeblue/cloudstack that referenced this pull request Sep 6, 2022
* Extract the IO_URING configuration into the agent.properties (apache#6253)

When using advanced virtualization the IO Driver is not supported. The
admin will decide if want to enable/disable this configuration from
agent.properties file. The default value is true

* kvm: truncate vnc password to 8 chars (apache#6244)

This PR truncates the vnc password of kvm vms to 8 chars to support latest versions of libvirt.

* merge fix

Signed-off-by: Abhishek Kumar <[email protected]>

* [KVM] Enable IOURING only when it is available on the host (apache#6399)

* [KVM] Disable IOURING by default on agents

* Refactor

* Remove agent property for iouring

* Restore property

* Refactor suse check and enable on ubuntu by default

* Refactor irrespective of guest OS

* Improvement

* Logs and new path

* Refactor condition to enable iouring

* Improve condition

* Refactor property check

* Improvement

* Doc comment

* Extend comment

* Move method

* Add log

* [KVM] Fix VM migration error due to VNC password on libvirt limiting versions (apache#6404)

* [KVM] Fix VM migration error due to VNC password on libvirt limiting versions

* Fix passwd value

* Simplify implementation

Co-authored-by: slavkap <[email protected]>
Co-authored-by: Wei Zhou <[email protected]>
Co-authored-by: Nicolas Vazquez <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

No open projects
Status: Done

Development

Successfully merging this pull request may close these issues.