Add groupBy #108

julienrf · 2017-06-09T14:27:27Z

Summary

- Add a groupBy method to Iterable
  - Three implementations: a generic implementation in Buildable (copied from the current collections), and custom implementations in View and LazyList
- Add a getOrElse method to Map
- Add a getOrElseUpdate method to mutable.Map
- Add a newBuilder() method to appropriate collection factories

Discussion

The implementation of groupBy in the current collections relies on an available newBuilder method.

Our strawman didn’t have such builders, so I wanted to experiment with different implementations.

I compared three implementations:

- one based on builders (current implementation)
- one for immutable collections, which depends on an empty and a cons method to build the groups
- one for mutable collections, based on Growable to build the groups

I compared the implementations on collections (List, ImmutableArray, HashSet and ArrayBuffer) of various sizes (from 1 to 7,312,102) containing elements of type Long. The benchmark calls the groupBy method with the function x => x % 5, so the result is a Map of 5 groups of similar size.
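To make the shape of the benchmarked call concrete, here is a minimal sketch that uses the standard library's groupBy (not the strawman code) on a small List[Long]. The object name GroupByShape and the input range are illustrative assumptions, not part of the benchmark suite.

```scala
// Sketch only: standard library groupBy, not the strawman implementation.
object GroupByShape {
  // Hypothetical small input; the real benchmarks used sizes from 1 to 7,312,102.
  val xs: List[Long] = (1L to 20L).toList

  // Same grouping function as in the benchmark: five keys, 0 to 4.
  val groups: Map[Long, List[Long]] = xs.groupBy(x => x % 5)
}
```

With this input every key from 0 to 4 receives four elements, which is the "5 groups" shape described above.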

The benchmarks show that for immutable collections with more than 7 elements it’s faster to go with the current implementation.

For mutable collections, the approach based on Growable has the same performance as the current implementation, and it is even slightly faster on small collections (fewer than 4 elements).

Therefore, I decided to keep the same implementation as in the current collections (based on builders). We could override the implementation of mutable collections if you think it is worth it…

For reference, here are the different groupBy implementations, followed by the code of the benchmarks and the charts.

// Implementation similar to what we have in the current collections
def groupByBuilder[A, K, C <: Iterable[A]](
  as: C
)(
  newBuilder: () => mutable.Builder[A, C]
)(
  f: A => K
): immutable.Map[K, C] = {
  val m = mutable.Map.empty[K, mutable.Builder[A, C]]
  for (elem <- as) {
    val key = f(elem)
    val bldr = m.getOrElseUpdate(key, newBuilder())
    bldr += elem
  }
  var result = immutable.Map.empty[K, C]
  m.foreach { case (k, v) =>
    result = result + ((k, v.result))
  }
  result
}

// Alternative implementation for immutable collections
// Note that the `cons` function can either prepend or
// append the new value (we don’t guarantee to preserve
// the order of elements within groups)
def groupByImmutable[A, K, C <: Iterable[A]](
  as: C
)(
  empty: C, cons: (A, C) => C
)(
  f: A => K
): immutable.Map[K, C] = {
  var result = immutable.Map.empty[K, C]
  for (elem <- as) {
    val key = f(elem)
    val values = cons(elem, result.getOrElse(key, empty))
    result = result + ((key, values))
  }
  result
}

// Alternative implementation for growable collections
def groupByGrowable[A, K, C <: Iterable[A] with mutable.Growable[A]](
  as: C
)(
  empty: () => C
)(
  f: A => K
): immutable.Map[K, C] = {
  var result = immutable.Map.empty[K, C]
  for (elem <- as) {
    val key = f(elem)
    result.get(key) match {
      case None =>
        val values = empty()
        values += elem
        result = result + ((key, values))
      case Some(values) =>
        values += elem
    }
  }
  result
}

object ListGroupByImmutable {
  @Benchmark
  def listGroupBy(bh: Blackhole): Unit = {
    val result = GroupBys.groupByImmutable[Long, Long, List[Long]](xs)(Nil, _ :: _)(_ % 5)
    bh.consume(result)
  }
}

object ListGroupByBuilder {
  @Benchmark
  def listGroupBy(bh: Blackhole): Unit = {
    val result = GroupBys.groupByBuilder[Long, Long, List[Long]](xs)(() => List.newBuilder())(_ % 5)
    bh.consume(result)
  }
}
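The comment on groupByImmutable notes that element order within groups is not guaranteed, because cons may prepend or append. The following sketch illustrates that with a simplified copy of the function written against the standard Scala 2.13 collections (an assumption; the strawman types differ). The object name GroupByOrderSketch and the sample input are hypothetical.

```scala
// Sketch only: simplified variant of groupByImmutable on standard collections.
object GroupByOrderSketch {
  def groupByImmutable[A, K, C](as: Iterable[A])(empty: C, cons: (A, C) => C)(f: A => K): Map[K, C] = {
    var result = Map.empty[K, C]
    for (elem <- as) {
      val key = f(elem)
      // Rebuild the group for this key with the new element added.
      result = result.updated(key, cons(elem, result.getOrElse(key, empty)))
    }
    result
  }

  val xs: List[Long] = List(1L, 2L, 3L, 4L, 5L, 6L)

  // Prepending with :: reverses the encounter order inside each group.
  val prepended: Map[Long, List[Long]] =
    groupByImmutable[Long, Long, List[Long]](xs)(Nil, _ :: _)(_ % 2)

  // Appending with :+ keeps the encounter order.
  val appended: Map[Long, Vector[Long]] =
    groupByImmutable[Long, Long, Vector[Long]](xs)(Vector.empty, (a, c) => c :+ a)(_ % 2)
}
```

Here the odd group is List(5, 3, 1) when prepending and Vector(1, 3, 5) when appending.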

julienrf · 2017-06-09T15:30:48Z

src/main/scala/strawman/collection/View.scala

+    m.foreach { case (k, v) =>
+      result = result + ((k, v.view))
+    }
+    result


This is the implementation of groupBy in View: each group is represented as an ArrayBuffer, and eventually we get views of them.
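As a rough illustration of that approach, the sketch below accumulates each group in an ArrayBuffer and exposes it as a view at the end. It is written against the standard Scala 2.13 library rather than the strawman types, and the names ViewGroupBySketch and groupByView are assumptions made for this example.

```scala
import scala.collection.{View, mutable}
import scala.collection.mutable.ArrayBuffer

// Sketch only: mirrors the idea of View.groupBy on standard 2.13 collections.
object ViewGroupBySketch {
  def groupByView[A, K](coll: Iterable[A])(f: A => K): Map[K, View[A]] = {
    // One growable buffer per key while traversing the source.
    val m = mutable.Map.empty[K, ArrayBuffer[A]]
    for (elem <- coll) {
      val buf = m.getOrElseUpdate(f(elem), ArrayBuffer.empty[A])
      buf += elem
    }
    // Expose each buffer as a view in the resulting immutable Map.
    var result = Map.empty[K, View[A]]
    m.foreach { case (k, v) => result = result.updated(k, v.view) }
    result
  }
}
```

Each view still references its backing buffer, so any spare capacity in the buffer stays alive for as long as the view does.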

Ichoran

Looks reasonable overall, but a few questions and issues to fix or at least consider.

Ichoran · 2017-06-11T21:32:56Z

src/main/scala/strawman/collection/View.scala

+    val m = mutable.Map.empty[K, ArrayBuffer[A]]
+    for (elem <- coll) {
+      val key = f(elem)
+      val bldr = m.getOrElseUpdate(key, ArrayBuffer.empty)


I think a normal ArrayBuffer allocates too much memory. If there isn't much redundancy you could end up increasing the memory usage by about 5x (!). I am not sure we have a great alternative data structure right now, but we should think about one. At the very least I think we should have an array builder and a view of the built array so at least we can reclaim all the extra memory after construction is complete.
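One way to read the "array builder plus view" suggestion is sketched below, assuming the standard Scala 2.13 library: groups are accumulated in an ArrayBuilder, and result() yields an array sized to the number of elements, so spare capacity can be reclaimed once construction is done. The names TrimmedGroupsSketch and groupByTrimmed are hypothetical, and the ClassTag requirement is a consequence of using arrays here, not something stated in the review.

```scala
import scala.collection.{View, mutable}
import scala.reflect.ClassTag

// Sketch only: array-backed groups whose final arrays carry no spare capacity.
object TrimmedGroupsSketch {
  def groupByTrimmed[A: ClassTag, K](coll: Iterable[A])(f: A => K): Map[K, View[A]] = {
    val m = mutable.Map.empty[K, mutable.ArrayBuilder[A]]
    for (elem <- coll) {
      m.getOrElseUpdate(f(elem), Array.newBuilder[A]) += elem
    }
    // result() returns an array with exactly as many slots as elements added.
    m.iterator.map { case (k, b) => k -> (b.result().view: View[A]) }.toMap
  }
}
```

The cost is a ClassTag constraint on the element type, which a generic View.groupBy may not be able to require.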

Ichoran · 2017-06-11T21:34:45Z

src/main/scala/strawman/collection/immutable/BitSet.scala

+  def empty: BitSet = new BitSet1(0L)
+
+  def newBuilder(): Builder[Int, BitSet] =
+    new ImmutableBuilder[Int, BitSet](empty) {


This is a performance bug waiting to happen. Larger immutable bitsets should absolutely not be built this way; they should be built with a mutable bitset, and upon asking for the result the array should be handed over and/or copied.

Oh, indeed, definitely!
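The suggested strategy could look roughly like the following sketch, which uses the standard library's mutable.BitSet and immutable.BitSet.fromBitMaskNoCopy rather than the strawman classes. The object and method names are assumptions, and this is not necessarily how the PR ended up implementing it.

```scala
import scala.collection.{immutable, mutable}

// Sketch only: accumulate in a mutable bitset, then hand the words over.
object BitSetBuildSketch {
  def buildBitSet(elems: Iterable[Int]): immutable.BitSet = {
    val acc = mutable.BitSet.empty
    // Each insertion sets a bit in place instead of allocating a new immutable set.
    elems.foreach(acc += _)
    // toBitMask produces an Array[Long] of words; wrap it without copying again.
    immutable.BitSet.fromBitMaskNoCopy(acc.toBitMask)
  }
}
```

Compared with repeatedly calling + on an immutable bitset, this avoids allocating an intermediate set per element.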

Ichoran · 2017-06-11T21:36:12Z

src/main/scala/strawman/collection/immutable/HashMap.scala

-     with MapOps[K, V, HashMap, HashMap[K, V]]
-     with Serializable {
+    with MapOps[K, V, HashMap, HashMap[K, V]]
+    with Buildable[(K, V), HashMap[K, V]]


Why doesn't MapOps[K, V, HashMap, HashMap[K, V]] imply Buildable[(K, V), HashMap[K, V]]?

I think that the decision of extending Buildable can only be made at the leaf level (for each concrete collection type). For instance, Seq does not imply Buildable, which is good because we don’t want LazyList to implement Buildable.

So it's just because someone might write MapOps[K, V, Map, Map[K, V]] that we don't want to be Buildable?

For Map it probably makes sense to extend Buildable, just like for Set (which I am working on at the moment)

Ichoran · 2017-06-11T21:36:44Z

src/main/scala/strawman/collection/immutable/HashSet.scala

-     with SetOps[A, HashSet, HashSet[A]]
-     with Serializable {
+    with SetOps[A, HashSet, HashSet[A]]
+    with Buildable[A, HashSet[A]]


Same question about SetOps implying Buildable

Ichoran · 2017-06-11T21:39:06Z

src/main/scala/strawman/collection/immutable/ListSet.scala

@@ -124,5 +127,10 @@ object ListSet extends IterableFactory[ListSet] {

  def empty[A]: ListSet[A] = EmptyListSet.asInstanceOf[ListSet[A]]

+  def newBuilder[A](): Builder[A, ListSet[A]] =
+    new ImmutableBuilder[A, ListSet[A]](empty) {
+      def add(elem: A): this.type = { elems = elems + elem; this }


Any way to push this implementation up into the trait without losing performance? There's a lot of repetition of this pattern.

I could make ImmutableBuilder[A, C] take as parameter a (C, A) => C function defining how to add an element to a given collection. Then at use site that would look like the following:

new ImmutableBuilder[A, ListSet[A]](empty, _ + _)

WDYT?

That's not much better, and we want to be really careful about per-element overhead. I was thinking about pushing it up into a SetFactory which extends IterableFactory. Not sure if there will be performance consequences, though.
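For readers following this thread, here is a minimal sketch of the function-parameter variant that was proposed. SketchImmutableBuilder is a hypothetical stand-in for the strawman's ImmutableBuilder, written against the standard library's ListSet, and its exact shape is an assumption.

```scala
import scala.collection.immutable.ListSet

// Sketch only: a builder for immutable collections parameterized by an "add" function.
final class SketchImmutableBuilder[A, C](empty: C, addTo: (C, A) => C) {
  private var elems: C = empty

  // Every element goes through the addTo function value, which is the
  // per-element indirection the review comment is wary of.
  def add(elem: A): this.type = { elems = addTo(elems, elem); this }

  def result(): C = elems
}

object SketchImmutableBuilderDemo {
  val built: ListSet[Int] =
    new SketchImmutableBuilder[Int, ListSet[Int]](ListSet.empty[Int], _ + _)
      .add(1).add(2).add(2)
      .result()
}
```

The use site gets shorter, but each add now calls through a Function2 instead of a method defined directly in an anonymous subclass.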

Ichoran · 2017-06-11T21:42:10Z

src/main/scala/strawman/collection/mutable/ListBuffer.scala

@@ -12,7 +12,7 @@ import scala.Predef.{assert, intWrapper}

 /** Concrete collection type: ListBuffer */
 class ListBuffer[A]
-  extends Seq[A]
+  extends GrowableSeq[A]


Why do we need to separately say this is growable, buildable, and a builder? Also, can't every mutable collection be treated as its own builder?

> Why do we need to separately say this is growable, buildable, and a builder?

Indeed, we could probably assume that Growable implies Buildable. I see no example where we would have the former without also having the latter.

I decided to put the method that returns the builder in the companion object because I think that could be useful for other purposes (see #97).

> can't every mutable collection be treated as its own builder?

That’s a good question. Actually, @odersky initially designed mutable collections to be their own builders. I thought that it would make the hierarchy slightly more complicated (that’s yet another type that shows up in the linearization) and also I think the public result method of builders makes little sense on a concrete collection.
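To illustrate the concern about a public result method, here is a small sketch of a mutable collection acting as its own builder. SketchBuilder and SketchBuffer are hypothetical types invented for this example; they are not the strawman's Builder or ListBuffer.

```scala
import scala.collection.mutable.ArrayBuffer

object OwnBuilderSketch {
  // Hypothetical minimal builder interface.
  trait SketchBuilder[A, C] {
    def add(elem: A): this.type
    def result(): C
  }

  // A mutable collection that is also its own builder.
  final class SketchBuffer[A] extends SketchBuilder[A, SketchBuffer[A]] {
    private val underlying = ArrayBuffer.empty[A]
    def add(elem: A): this.type = { underlying += elem; this }
    // The only sensible result is the collection itself.
    def result(): SketchBuffer[A] = this
    def toList: List[A] = underlying.toList
  }
}
```

In this shape result() simply returns this, so the method carries no information when it is called on the concrete collection.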

> Why do we need to separately say this is growable, buildable

Actually, in the current design Growable[A] takes just one type parameter (the type of the elements). If we want to make Growable[A] imply Buildable[A, C] then we need to add the C type parameter to Growable and to all its subclasses. A consequence is that this C should be supplied by an XxxOps trait, which means that we would have a GrowableIterableOps branch in the hierarchy that would be refined for Set, Map and Seq. That would be a lot of additional traits :(

Or, could we assume that all mutable collections are Growable? (in such a case we could make mutable.IterableOps extend Buildable and get rid of the mutable.Iterable / mutable.GrowableIterable distinction)

I forgot that Arrays are not growable. I guess there is a distinction there to maintain.

julienrf · 2017-06-13T09:01:36Z

@Ichoran can you summarize what needs to be changed?

The implementation is based on builders, as in the current collections.

julienrf · 2017-06-13T10:09:08Z

I rebased the PR and implemented groupBy in Range and NumericRange. I moved the actual implementation from Buildable to a collection.generic.GroupBy object so that I could reuse it even in collection types that do not extend Buildable (such as Range and NumericRange).
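The idea of a standalone helper that takes the builder factory as an argument can be sketched as follows, using the standard Scala 2.13 library. GroupBySketch is a hypothetical name and signature; the actual collection.generic.GroupBy object in the PR may look different.

```scala
import scala.collection.mutable

// Sketch only: a groupBy helper that does not require the source to be Buildable.
object GroupBySketch {
  def groupBy[A, K, C](coll: Iterable[A], newBuilder: () => mutable.Builder[A, C])(f: A => K): Map[K, C] = {
    val m = mutable.Map.empty[K, mutable.Builder[A, C]]
    for (elem <- coll) {
      m.getOrElseUpdate(f(elem), newBuilder()) += elem
    }
    m.iterator.map { case (k, b) => k -> b.result() }.toMap
  }

  // A Range cannot build other Ranges, so the groups are built as Vectors here.
  val groups: Map[Int, Vector[Int]] =
    groupBy[Int, Int, Vector[Int]](1 to 10, () => Vector.newBuilder[Int])(_ % 3)
}
```

Because the caller supplies the builder, the source collection only needs to be traversable, which is what lets a type like Range reuse the same code.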

Ichoran · 2017-06-13T19:49:40Z

@julienrf - I'll try to check it this evening.

Changes made, so original review not relevant.

Ichoran · 2017-06-13T19:51:10Z

Hrm, not sure if there's a way to "unreview". Dismissing the review doesn't seem to be what I wanted.

julienrf · 2017-06-21T14:46:20Z

Superseded by #111.

julienrf requested review from szeiger, odersky and Ichoran and removed request for szeiger June 9, 2017 14:28

julienrf force-pushed the group-by branch from 07ef338 to 8124e1a Compare June 9, 2017 14:36

Ichoran previously requested changes Jun 11, 2017

julienrf force-pushed the group-by branch from 20fb269 to 7c0f1e5 Compare June 12, 2017 07:40

julienrf added 5 commits June 13, 2017 11:03

Add groupBy.

f5ee957

The implementation is based on builders, as in the current collections.

Trim the array that backs groups in View.groupBy

1ed5935

More efficient immutable.BitSet builder

6368f86

Optimize groupBy for LazyList

521da09

Implement groupBy in Range and NumericRange

7aa4678

julienrf force-pushed the group-by branch from 7c0f1e5 to 7aa4678 Compare June 13, 2017 10:06

This was referenced Jun 14, 2017

Pull up the newBuilder method from Buildable to IterableOps. #110

Merged

Add groupBy (on top of #110) #111

Merged

julienrf closed this Jun 21, 2017

julienrf deleted the group-by branch June 28, 2017 09:15
