Description
There was a post recently on /r/rust about microbenchmarks in different programming languages (https://github.com/kostya/benchmarks), and I was trying to understand how the Rust implementation works.
I noticed that in the Base64 benchmark the author used this code to construct a string with str_size copies of a character:
let str_size = 10000000;
let mut str: String = "".to_string();
for _ in 0..str_size { str.push_str("a"); }
I thought that push_str is a bit inefficient for a single character, so I looked at the source code. I started playing with String::push and found a trick in the implementation for ASCII characters (relevant PR: #20079).
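For comparison, here are two other ways to build such a string: pushing one char at a time (which goes through String::push, the method discussed below) and str::repeat. These are standard-library methods, not code from the benchmark repository:

```rust
fn main() {
    let str_size = 10_000_000;

    // Pushing a char instead of a &str exercises String::push:
    let mut s = String::with_capacity(str_size);
    for _ in 0..str_size {
        s.push('a');
    }

    // The standard library also has a one-liner for repeating a string:
    let s2 = "a".repeat(str_size);
    assert_eq!(s, s2);
}
```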
The actual problem:
If the character is 1 byte long in UTF-8, it is treated as an ASCII character (as in C), but if it is longer than 1 byte, the algorithm always reserves 4 bytes in the vector, even when the pushed character's encoding is only 2 or 3 bytes. So even if the vector has just enough space for one of these smaller characters, it reallocates its storage (doubling the size of the buffer). As a result, the string ends up with an unnecessarily doubled capacity.
I know this doesn't seem like a big problem, but if push is used heavily with non-ASCII text, it can become a significant memory overhead. The spurious reallocation happens when the string's length is close to a power of two (its capacity), so shorter strings are affected proportionally more.
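A small sketch of the scenario (the exact capacity values depend on the toolchain and allocator, so the comments describe the behavior reported above rather than a guaranteed result):

```rust
fn main() {
    // A string with (typically) exactly 2 free bytes in its buffer:
    let mut s = String::with_capacity(8);
    for _ in 0..6 {
        s.push('a');
    }

    // 'é' encodes to 2 bytes in UTF-8, so it fits in the free space.
    // With the unconditional 4-byte reserve described above, this push
    // would still trigger a reallocation that doubles the buffer.
    s.push('é');
    println!("len = {}, capacity = {}", s.len(), s.capacity());
}
```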
A possible solution:
// This is a modified version of the original
pub fn push(&mut self, ch: char) {
    match ch.len_utf8() {
        1 => self.vec.push(ch as u8),
        ch_len => {
            let cur_len = self.len();
            self.vec.reserve(ch_len);
            unsafe {
                // Attempt to not use an intermediate buffer by just pushing
                // the bytes directly onto this string.
                let slice = slice::from_raw_parts_mut(
                    self.vec.as_mut_ptr().offset(cur_len as isize),
                    ch_len,
                );
                let used = ch.encode_utf8(slice).unwrap_or(0);
                self.vec.set_len(cur_len + used);
            }
        }
    }
}
I benchmarked this version; there was no measurable difference.
There's one problem though: char::len_utf8 is not marked with #[inline], so without LTO this version is currently slower.