Skip to content

Conversation

mgirlich
Copy link
Contributor

@mgirlich mgirlich commented Jun 16, 2021

Fixes #1130.

@hadley In contrast to the PR for unchop() this PR is pretty trivial and I'm very optimistic not to break anything.

Benchmark: expand_grid()

bench::mark(
  size1 = tidyr::expand_grid(
    a = seq(1e3),
    b = seq(1e3)
  ),
  size100 = tidyr::expand_grid(
    a = seq(10e3),
    b = seq(10e3)
  ),
  size1000 = tidyr::expand_grid(
    a = seq(100e3),
    b = seq(10e3)
  ),
  check = FALSE,
  min_iterations = 3
)

# This PR with early return
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 x 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 size1        4.13ms   7.54ms   131.      16.62MB    0        66     0   504.94ms
#> 2 size100    832.32ms  865.7ms     1.15     1.49GB    1.15      3     3      2.61s
#> 3 size1000      6.99s    7.29s     0.137    14.9GB    0.274     3     6     21.92s

# This PR
#> # A tibble: 3 x 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
#> 1 size1        14.2ms  18.22ms   53.6      22.91MB    0        27     0   503.78ms <NULL>
#> 2 size100     900.1ms    1.15s    0.846     2.23GB    0.282     3     1      3.54s <NULL>
#> 3 size1000      11.4s      12s    0.0755   22.35GB    0.226     3     9     39.75s <NULL>

# CRAN
#> # A tibble: 3 x 13
#>   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 size1      15.38ms 17.82ms   55.9      16.65MB    0        28     0
#> 2 size100      1.47s   1.55s    0.652     1.49GB    0.435     3     2
#> 3 size1000    18.03s  18.05s    0.0550    14.9GB    0.110     3     6

Benchmark: pivot_longer()

Note that this benchmark gets obsolete with PR #1132.

benchmark from #613 with comparison to data.table::melt()

require(tidyverse)
require(data.table)

data.test <- matrix(
  data = sample(
    x = c(0L, 1L, 2L, NA_integer_),#the genotypes
    size = 2e+07,
    replace = TRUE,
    prob = c(0.8, 0.10, 0.05, 0.05)
  ),
  nrow = 20000,#number of SNPs/markers
  ncol = 1000,#number of samples
  dimnames = list(rownames = seq(1, 20000, 1), colnames = seq(1, 1000, 1))
) %>%
  tibble::as_tibble(x = ., rownames = "MARKERS")

bench::mark(
  pivot = tidyr::pivot_longer(data.test, cols = -MARKERS, names_to = "INDIVIDUALS", values_to = "GENOTYPES"),
  melt = data.table::as.data.table(data.test) %>%
    data.table::melt.data.table(
      data = .,
      id.vars = "MARKERS",
      variable.name = "INDIVIDUALS",
      value.name = "GENOTYPES",
      variable.factor = FALSE) %>%
    tibble::as_tibble(.),
  check = FALSE
)

# This PR with early return
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pivot         989ms    989ms      1.01    1.12GB     4.05
#> 2 melt          236ms    337ms      2.97  458.57MB     1.48

# This PR
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 pivot         1.06s    1.06s     0.947    1.27GB     4.73     1     5      1.06s
#> 2 melt       266.71ms 331.15ms     3.02   458.56MB     1.51     2     1   662.29ms

# CRAN
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pivot         1.27s    1.27s     0.790    1.12GB     3.16
#> 2 melt       264.05ms 338.71ms     2.95   458.57MB     1.48

Benchmark: pivot_wider()

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

mydf <- expand_grid(
  case = sprintf("%03d", seq(1, 4000)),
  year = seq(1900, 2000),
  name = c("x", "y", "z")
) %>%
  mutate(value = rnorm(nrow(.)))

bench::mark(
  pivot = pivot_wider(mydf, names_from = "name", values_from = "value", values_fill = list(x = 1, y = 2, z = 3)),
  iterations = 50
)

# This PR with early return
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pivot         124ms    145ms      6.52     112MB     14.6

# This PR
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 13
#>   expression    min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr> <bch:> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 pivot       124ms   145ms      6.26     112MB     16.2    50   129      7.99s

# CRAN
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pivot         138ms    162ms      5.80     112MB     13.0

@hadley
Copy link
Member

hadley commented Jun 16, 2021

Could we just eliminate vec_repeat() and call the vctrs functions directly?

@mgirlich
Copy link
Contributor Author

Yes, this would also work. I just wanted to keep vec_repeat() as it uses early returns for times = 1 which vctrs doesn't yet do (see r-lib/vctrs#1392). This early return gives a nice speed up.

@hadley
Copy link
Member

hadley commented Jun 17, 2021

I’d prefer the simplicity of removing vec_repeat() even if it takes a little longer to get the full performance benefits. Chances are that there will be a patch vctrs release before a patch tidyr release anyway, and doing it now removes an extra step.

@mgirlich
Copy link
Contributor Author

I removed vec_repeat(). For pivot_wider() and pivot_longer() the impact of no early return isn't that big. expand_grid() will get a nice boost when this is fixed in vctrs 😃

@hadley hadley requested a review from DavisVaughan August 23, 2021 15:17
@DavisVaughan DavisVaughan merged commit 7947507 into tidyverse:master Aug 27, 2021
@DavisVaughan
Copy link
Member

Thanks!

@mgirlich mgirlich deleted the speed-up-expand-grid branch November 16, 2021 07:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Easy performance improvements for vec_repeat()
3 participants