
@DavisVaughan commented Nov 5, 2019

Because of the checks we do elsewhere, it is essentially impossible for a non-data frame y to reach df_equal_scalar(). Despite that, we run an expensive is_data_frame() check a large number of times (once per row!) when I don't think we need to. Removing it gives a nice performance improvement for vec_equal() and the dictionary functions, which call equal_scalar() internally.
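
For context, the hot path looks roughly like this. The loop shape and the name vec_equal_loop() are illustrative, not the actual vctrs source, but equal_scalar() really is called once per element:

#include <stdbool.h>
#include <Rinternals.h>

// Existing vctrs helper, declared here so the sketch is self-contained.
bool equal_scalar(SEXP x, R_len_t i, SEXP y, R_len_t j, bool na_equal);

// Sketch of the hot loop. equal_scalar() dispatches on the type of `x`,
// so for data frames every iteration lands in df_equal_scalar() --
// any check inside that function is paid n times.
static SEXP vec_equal_loop(SEXP x, SEXP y, bool na_equal, R_len_t n) {
  SEXP out = PROTECT(Rf_allocVector(LGLSXP, n));
  int* p_out = LOGICAL(out);

  for (R_len_t i = 0; i < n; ++i) {
    p_out[i] = equal_scalar(x, i, y, i, na_equal);
  }

  UNPROTECT(1);
  return out;
}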

df <- data.frame(x = 1:1e6, y = 1:1e6)
# before
bench::mark(
  vec_equal(df, df, na_equal = TRUE),
  iterations = 300
)
#> # A tibble: 1 x 6
#>   expression                            min median `itr/sec` mem_alloc
#>   <bch:expr>                         <bch:> <bch:>     <dbl> <bch:byt>
#> 1 vec_equal(df, df, na_equal = TRUE) 84.2ms   90ms      11.1    11.5MB
#> # … with 1 more variable: `gc/sec` <dbl>

# after
bench::mark(
  vec_equal(df, df, na_equal = TRUE),
  iterations = 300
)
#> # A tibble: 1 x 6
#>   expression                            min median `itr/sec` mem_alloc
#>   <bch:expr>                         <bch:> <bch:>     <dbl> <bch:byt>
#> 1 vec_equal(df, df, na_equal = TRUE) 69.6ms 73.8ms      13.5    11.5MB
#> # … with 1 more variable: `gc/sec` <dbl>
# before
bench::mark(
  vec_unique_loc(df),
  iterations = 300
)
#> # A tibble: 1 x 6
#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_unique_loc(df)   83.3ms   96.3ms      10.4    23.6MB     21.1

# after
bench::mark(
  vec_unique_loc(df),
  iterations = 300
)
#> # A tibble: 1 x 6
#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_unique_loc(df)   77.6ms   87.7ms      11.4    23.6MB     27.9

@DavisVaughan commented Nov 5, 2019

I actually think we should remove all of the checks in df_equal_scalar() (the length check and the names check as well). At the very least we are doing those checks way too many times (once per row).

We can get some massive improvements from that, and I think it would be "safe", since vec_equal() has already used vec_cast_common() to coerce x and y to a common data frame type. If we ever use equal_scalar() with data frames internally, we have likely already called vec_cast_common() on the inputs there as well (as in the dictionary functions). I think it is the caller's responsibility to make sure the types are the same, not df_equal_scalar()'s, especially since we don't do these kinds of checks in any of the other *_equal_scalar() functions.

These benefits would also carry over to pivot_longer() and other tidyr friends.

Here are some quick benchmarks that result from removing these checks. vec_match() is particularly impressive. The checks in question:

vctrs/src/equal.c, lines 177 to 191 at b4f7be4:

if (!is_data_frame(y)) {
  return false;
}

int p = Rf_length(x);
if (p != Rf_length(y)) {
  return false;
}

// Don't worry about names missingness because properly formed
// data frames shouldn't have any missing names
if (!equal_names(x, y)) {
  return false;
}
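
With all three checks gone, the body reduces to just the column loop. Roughly (a sketch under the assumption that callers have already cast both inputs to a common type; not the actual commit):

#include <stdbool.h>
#include <Rinternals.h>

// Existing vctrs helper, declared here so the sketch is self-contained.
bool equal_scalar(SEXP x, R_len_t i, SEXP y, R_len_t j, bool na_equal);

// df_equal_scalar() with the type, length, and names checks removed.
// This is only safe because vec_cast_common() has already coerced `x`
// and `y` to the same data frame type before any rows are compared.
static bool df_equal_scalar(SEXP x, R_len_t i, SEXP y, R_len_t j, bool na_equal) {
  int p = Rf_length(x);

  for (int k = 0; k < p; ++k) {
    if (!equal_scalar(VECTOR_ELT(x, k), i, VECTOR_ELT(y, k), j, na_equal)) {
      return false;
    }
  }

  return true;
}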

library(vctrs)

df <- data.frame(x = 1:1e6, y = 1:1e6)
# before
bench::mark(
  vec_equal(df, df),
  iterations = 300
)
#> # A tibble: 1 x 6
#>   expression             min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_equal(df, df)   85.1ms   90.5ms      11.0    11.5MB     1.09

# after
bench::mark(
  vec_equal(df, df),
  iterations = 300
)
#> # A tibble: 1 x 6
#>   expression             min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_equal(df, df)   33.8ms   35.7ms      27.8    11.5MB     2.75
# before
bench::mark(
  vec_unique_loc(df),
  iterations = 300
)
#> # A tibble: 1 x 6
#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_unique_loc(df)   84.2ms   97.2ms      10.2    23.6MB     25.0

# after
bench::mark(
  vec_unique_loc(df),
  iterations = 300
)
#> # A tibble: 1 x 6
#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_unique_loc(df)   59.8ms   65.2ms      15.3    23.6MB     37.4
# before
bench::mark(
  vec_match(df, df), 
  iterations = 20
)
#> # A tibble: 1 x 6
#>   expression             min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_match(df, df)    355ms    379ms      2.61    19.4MB     14.8

# after
bench::mark(
  vec_match(df, df), 
  iterations = 20
)
#> # A tibble: 1 x 6
#>   expression             min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_match(df, df)    217ms    230ms      4.39    19.4MB     24.9

@DavisVaughan commented Nov 5, 2019

Speed boost in pivot_wider() before and after this change, measured on top of tidyr PR tidyverse/tidyr#790.

library(tidyr)
library(dplyr, warn.conflicts = FALSE)
library(reshape2)

mydf <- expand_grid(
  case = sprintf("%03d", seq(1, 4000)),
  year = seq(1900, 2000),
  name = c("x", "y", "z")
) %>%
  mutate(value = rnorm(nrow(.)))

# before
bench::mark(
  pivot = pivot_wider(mydf, names_from = "name", values_from = "value"),
  spread = spread(mydf, name, value),
  dcast = dcast(mydf, case + year ~ name),
  iterations = 50
)
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pivot         587ms    624ms      1.58     117MB     3.45
#> 2 spread        431ms    497ms      2.04     431MB     9.72
#> 3 dcast         408ms    459ms      2.15     457MB     8.35

# after
bench::mark(
  pivot = pivot_wider(mydf, names_from = "name", values_from = "value"),
  spread = spread(mydf, name, value),
  dcast = dcast(mydf, case + year ~ name),
  iterations = 50
)
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pivot         347ms    383ms      2.58     117MB     5.63
#> 2 spread        429ms    485ms      2.05     431MB     9.75
#> 3 dcast         413ms    459ms      2.15     457MB     8.33

@lionel- commented Nov 6, 2019

I agree it'd make sense to reach a point where we can assume a properly formed data frame in these low-level routines.

But the common type and cast operations are generic, so we can't make these assumptions without any checks. Maybe vec_cast() should perform post-condition checks for data frames.
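
(For illustration, such a post-condition might look roughly like the following; the helper name vec_cast_df_check() is hypothetical, not an actual vctrs function:)

#include <Rinternals.h>

// Hypothetical post-condition, run after casting to a data frame type
// (illustrative only; vec_cast_df_check() does not exist in vctrs).
static void vec_cast_df_check(SEXP out, SEXP to) {
  if (TYPEOF(out) != VECSXP || !Rf_inherits(out, "data.frame")) {
    Rf_errorcall(R_NilValue, "Cast did not return a data frame.");
  }
  if (Rf_length(out) != Rf_length(to)) {
    Rf_errorcall(R_NilValue, "Cast returned a data frame with the wrong number of columns.");
  }
}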

@DavisVaughan commented Nov 6, 2019

I'm not sure it makes sense to just special-case data frames. Why is this example, with a custom class built on top of a double, any different? We can still make it fail by having a bad vec_cast() method, but I don't think it is vctrs's job to check for that.

library(vctrs)

x <- new_vctr(1, class = "bad_vctr")

# vec_ptype2 is as expected, common type is a bad_vctr
vec_ptype2.bad_vctr <- function(x, y, ...) {
  UseMethod("vec_ptype2.bad_vctr", y)
}

vec_ptype2.bad_vctr.double <- function(x, y, ...) {
  new_vctr(numeric(), class = "bad_vctr")
}

vec_ptype2.double.bad_vctr <- function(x, y, ...) {
  new_vctr(numeric(), class = "bad_vctr")
}

vec_cast.bad_vctr <- function(x, to, ...) {
  UseMethod("vec_cast.bad_vctr")
}

# but casting a double to a bad_vctr is specified incorrectly!
vec_cast.bad_vctr.double <- function(x, to, ...) {
  "hi"
}

vec_cast.bad_vctr.bad_vctr <- function(x, to, ...) {
  x
}

# makes sense
vec_ptype2(x, 1)
#> <bad_vctr[0]>
vec_ptype2(1, x)
#> <bad_vctr[0]>

# makes sense
vec_cast(x, x)
#> <bad_vctr[1]>
#> [1] 1

# WAT
vec_cast(1, x)
#> [1] "hi"

# meaning we get here...
vec_match(1, x)
#> Error in vec_match(1, x): STRING_PTR_RO() can only be applied to a 'character', not a 'double'

vec_match(x, 1)
#> [1] NA

Created on 2019-11-06 by the reprex package (v0.3.0.9000)

I would really like it if we could assume that vec_cast() always fulfills the contract that x will end up with the type of to. If the user created method doesn't fulfill that contract, then the problem is on their side and it isn't the responsibility of vctrs to check for that.

It just makes reasoning about the functions so much easier if we can 100% say "okay, I've cast x and y to a common type, so now I know I can make certain assumptions about them". Having to do these additional checks even after casting feels like a step away from that.

@hadley commented Nov 6, 2019

I agree with @DavisVaughan; it's not our responsibility if a user-defined method doesn't fulfill the contract (although we should avoid crashing in C).

@lionel- commented Nov 6, 2019

@DavisVaughan Data frames have a propensity to pop up in a bad state. We have had a lot of crash reports with dplyr because of this.

I agree that we should only have post-condition checks for things that make R crash; weird error messages are ok if the contract is not fulfilled.

@DavisVaughan commented Nov 6, 2019

So with df_equal_scalar(), I see one main way this can crash without the checks: if we call vec_equal() and the internal call to vec_cast_common() results in data frames with different numbers of columns, then we might try to index into one of them with VECTOR_ELT() at a location that doesn't exist.

One way to fix this is what I did in the previous commit: check once that the number of columns is the same, and error if it isn't. Again, this should only happen if vec_cast() is specified improperly, so I like throwing an error here rather than returning all FALSE values (which is what would have happened before).

Since we are already computing the number of columns before the loop, I figured we might as well pass that information down to df_equal_scalar() so we don't have to recompute it for every row. This gives another small performance boost.
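
Roughly, the shape of that change (a sketch; the helper name df_size_common() and the exact signature are illustrative, not the actual commit):

#include <stdbool.h>
#include <Rinternals.h>

// Existing vctrs helper, declared here so the sketch is self-contained.
bool equal_scalar(SEXP x, R_len_t i, SEXP y, R_len_t j, bool na_equal);

// Validated once, before the row loop. Only reachable if a vec_cast()
// method broke its contract, so erroring beats returning all FALSE.
static int df_size_common(SEXP x, SEXP y) {
  int p = Rf_length(x);
  if (p != Rf_length(y)) {
    Rf_errorcall(R_NilValue, "`x` and `y` must have the same number of columns.");
  }
  return p;
}

// The column count `p` is passed down so it isn't recomputed per row.
static bool df_equal_scalar(SEXP x, R_len_t i, SEXP y, R_len_t j,
                            bool na_equal, int p) {
  for (int k = 0; k < p; ++k) {
    if (!equal_scalar(VECTOR_ELT(x, k), i, VECTOR_ELT(y, k), j, na_equal)) {
      return false;
    }
  }
  return true;
}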

Another option is to have the length check somewhere else, like after the cast, but I'm not quite sure where that would go.

@lionel- commented Nov 6, 2019

This makes sense.

@DavisVaughan merged commit 0afd1f3 into r-lib:master Nov 6, 2019
@DavisVaughan deleted the df-equal-scalar-y-check branch Nov 6, 2019