Remove data frame check in df_equal_scalar()
#646

Conversation
I actually think we should remove all of the checks in `df_equal_scalar()`. We can get some massive improvements from that, and I think it would be "safe", since the checks we do elsewhere mean the inputs should already be well formed by the time we get here. These benefits would also carry into the dictionary functions. Here are some quick benchmarks that are a result of removing these checks (lines 177 to 191 in b4f7be4):
```r
library(vctrs)

df <- data.frame(x = 1:1e6, y = 1:1e6)

# before
bench::mark(
  vec_equal(df, df),
  iterations = 300
)
#> # A tibble: 1 x 6
#>   expression        min      median   `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_equal(df, df) 85.1ms   90.5ms        11.0    11.5MB     1.09

# after
bench::mark(
  vec_equal(df, df),
  iterations = 300
)
#> # A tibble: 1 x 6
#>   expression        min      median   `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_equal(df, df) 33.8ms   35.7ms        27.8    11.5MB     2.75

# before
bench::mark(
  vec_unique_loc(df),
  iterations = 300
)
#> # A tibble: 1 x 6
#>   expression         min      median   `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_unique_loc(df) 84.2ms   97.2ms        10.2    23.6MB     25.0

# after
bench::mark(
  vec_unique_loc(df),
  iterations = 300
)
#> # A tibble: 1 x 6
#>   expression         min      median   `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_unique_loc(df) 59.8ms   65.2ms        15.3    23.6MB     37.4

# before
bench::mark(
  vec_match(df, df),
  iterations = 20
)
#> # A tibble: 1 x 6
#>   expression        min      median   `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_match(df, df) 355ms    379ms         2.61    19.4MB     14.8

# after
bench::mark(
  vec_match(df, df),
  iterations = 20
)
#> # A tibble: 1 x 6
#>   expression        min      median   `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_match(df, df) 217ms    230ms         4.39    19.4MB     24.9
```
Speed boost in `pivot_wider()`:

```r
library(tidyr)
library(dplyr, warn.conflicts = FALSE)
library(reshape2)

mydf <- expand_grid(
  case = sprintf("%03d", seq(1, 4000)),
  year = seq(1900, 2000),
  name = c("x", "y", "z")
) %>%
  mutate(value = rnorm(nrow(.)))

# before
bench::mark(
  pivot = pivot_wider(mydf, names_from = "name", values_from = "value"),
  spread = spread(mydf, name, value),
  dcast = dcast(mydf, case + year ~ name),
  iterations = 50
)
#> # A tibble: 3 x 6
#>   expression min      median   `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pivot      587ms    624ms         1.58     117MB     3.45
#> 2 spread     431ms    497ms         2.04     431MB     9.72
#> 3 dcast      408ms    459ms         2.15     457MB     8.35

# after
bench::mark(
  pivot = pivot_wider(mydf, names_from = "name", values_from = "value"),
  spread = spread(mydf, name, value),
  dcast = dcast(mydf, case + year ~ name),
  iterations = 50
)
#> # A tibble: 3 x 6
#>   expression min      median   `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pivot      347ms    383ms         2.58     117MB     5.63
#> 2 spread     429ms    485ms         2.05     431MB     9.75
#> 3 dcast      413ms    459ms         2.15     457MB     8.33
```
I agree it'd make sense to reach a point where we can assume a properly formed data frame in these low-level routines. But common type and casting is generic, so we can't make these assumptions without any check. Maybe …
I'm not sure it makes sense to just special-case data frames. Why is this example, with a custom class built on top of a double, any different? We can still make it fail by having a bad `vec_cast()` method:

```r
library(vctrs)

x <- new_vctr(1, class = "bad_vctr")

# vec_ptype2 is as expected, common type is a bad_vctr
vec_ptype2.bad_vctr <- function(x, y, ...) {
  UseMethod("vec_ptype2.bad_vctr", y)
}
vec_ptype2.bad_vctr.double <- function(x, y, ...) {
  new_vctr(numeric(), class = "bad_vctr")
}
vec_ptype2.double.bad_vctr <- function(x, y, ...) {
  new_vctr(numeric(), class = "bad_vctr")
}

vec_cast.bad_vctr <- function(x, to, ...) {
  UseMethod("vec_cast.bad_vctr")
}
# but casting a double to a bad_vctr is specified incorrectly!
vec_cast.bad_vctr.double <- function(x, to, ...) {
  "hi"
}
vec_cast.bad_vctr.bad_vctr <- function(x, to, ...) {
  x
}

# makes sense
vec_ptype2(x, 1)
#> <bad_vctr[0]>
vec_ptype2(1, x)
#> <bad_vctr[0]>

# makes sense
vec_cast(x, x)
#> <bad_vctr[1]>
#> [1] 1

# WAT
vec_cast(1, x)
#> [1] "hi"

# meaning we get here...
vec_match(1, x)
#> Error in vec_match(1, x): STRING_PTR_RO() can only be applied to a 'character', not a 'double'
vec_match(x, 1)
#> [1] NA
```

Created on 2019-11-06 by the reprex package (v0.3.0.9000)

I would really like it if we could assume that the cast has been performed correctly. It just makes reasoning about the functions so much easier if we can 100% say "okay, I've casted these inputs, so I can rely on their type".
I agree with @DavisVaughan; it's not our responsibility if a user-defined method doesn't fulfil the contract (although we should avoid crashing in C).
@DavisVaughan Data frames have a propensity to pop up in a bad state; we have had a lot of crash reports with dplyr because of this. I agree that we should only have post-condition checks for things that make R crash; weird error messages are OK if the contract is not fulfilled.
So, one way to fix this is to do what I just did in the previous commit: we check that the number of columns is the same once, and error if it isn't. Again, this should only happen if the cast is broken. Since we are already computing the number of columns before the loop, I figured we might as well also pass that information down to `df_equal_scalar()`. Another option is to have the length check somewhere else, like after the cast, but I'm not quite sure where that would go.
This makes sense.
Because of the checks we do elsewhere, it is essentially impossible for a non-data frame `y` to get into `df_equal_scalar()`. Because of this, we are doing an expensive check of `is_data_frame()` a large number of times (once per row!) when I don't think we need to. Removing it has nice performance improvements for `vec_equal()` and the dictionary functions, which call `equal_scalar()` internally.