Skip to content
This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Add groupBy #108

Closed
wants to merge 5 commits into from
Closed

Add groupBy #108

wants to merge 5 commits into from

Conversation

julienrf
Copy link
Contributor

@julienrf julienrf commented Jun 9, 2017

Summary

  • Add a groupBy method to Iterable
    • Three implementations: a generic implementation in Buildable (copied from the current collections), and custom implementations in View and LazyList
  • Add getOrElse method to Map
  • Add getOrElseUpdate method to mutable.Map
  • Add a newBuilder() method to appropriate collection factories

Discussion

The current (I mean, in the current collections) implementation of groupBy relies on an available newBuilder method.

In our strawman we didn’t have such builders so I wanted to experiment with different implementations.

I compared three implementations:

  • one based on builders (current implementation)
  • one for immutable collections, which depends on an empty and a cons method to build the groups
  • one for mutable collections, based on Growable to build the groups

I compared the implementations with collections (List, ImmutableArray, HashSet and ArrayBuffer) of various sizes (from 1 to 7,312,102), containing elements of type Long. The benchmark calls the groupBy method with the following function x => x % 5. (So, the result is a Map of 5 groups of equivalent size)

The benchmarks show that for immutable collections with more than 7 elements it’s faster to go with the current implementation.

For mutable collections, the approach based on Growables has the same performance as the current implementation, and it has even slightly better performance on small collections (less than 4 elements).

Therefore, I decided to keep the same implementation as in the current collections (based on builders). We could override the implementation of mutable collections if you think it is worth it…

For reference, here are the different groupBy implementations, followed by the code of the benchmarks and the charts.

// Implementation similar to what we have in the current collections
def groupByBuilder[A, K, C <: Iterable[A]](
  as: C
)(
  newBuilder: () => mutable.Builder[A, C]
)(
  f: A => K
): immutable.Map[K, C] = {
  val m = mutable.Map.empty[K, mutable.Builder[A, C]]
  for (elem <- as) {
    val key = f(elem)
    val bldr = m.getOrElseUpdate(key, newBuilder())
    bldr += elem
  }
  var result = immutable.Map.empty[K, C]
  m.foreach { case (k, v) =>
    result = result + ((k, v.result))
  }
  result
}

// Alternative implementation for immutable collections
// Note that the `cons` function can either prepend or
// append the new value (we don’t guarantee to preserve
// the order of elements within groups)
def groupByImmutable[A, K, C <: Iterable[A]](
  as: C
)(
  empty: C, cons: (A, C) => C
)(
  f: A => K
): immutable.Map[K, C] = {
  var result = immutable.Map.empty[K, C]
  for (elem <- as) {
    val key = f(elem)
    val values = cons(elem, result.getOrElse(key, empty))
    result = result + ((key, values))
  }
  result
}

// Alternative implementation for growable collections
def groupByGrowable[A, K, C <: Iterable[A] with mutable.Growable[A]](
  as: C
)(
  empty: () => C
)(
  f: A => K
): immutable.Map[K, C] = {
  var result = immutable.Map.empty[K, C]
  for (elem <- as) {
    val key = f(elem)
    result.get(key) match {
      case None =>
        val values = empty()
        values += elem
        result = result + ((key, values))
      case Some(values) =>
        values += elem
    }
  }
  result
}
object ListGroupByImmutable {
  @Benchmark
  def listGroupBy(bh: Blackhole): Unit = {
    val result = GroupBys.groupByImmutable[Long, Long, List[Long]](xs)(Nil, _ :: _)(_ % 5)
    bh.consume(result)
  }
}

object ListGroupByBuilder {
  @Benchmark
  def listGroupBy(bh: Blackhole): Unit = {
    val result = GroupBys.groupByBuilder[Long, Long, List[Long]](xs)(() => List.newBuilder())(_ % 5)
    bh.consume(result)
  }
}

listgroupby

immutablearraygroupby

hashsetgroupby

arraybuffergroupby

mutablehashsetgroupby

@julienrf julienrf requested review from szeiger, odersky and Ichoran and removed request for szeiger June 9, 2017 14:28
m.foreach { case (k, v) =>
result = result + ((k, v.view))
}
result
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the implementation of groupBy in View: each group is represented as an ArrayBuffer, and eventually we get views of them.

Copy link
Contributor

@Ichoran Ichoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable overall, but a few questions and issues to fix or at least consider.

val m = mutable.Map.empty[K, ArrayBuffer[A]]
for (elem <- coll) {
val key = f(elem)
val bldr = m.getOrElseUpdate(key, ArrayBuffer.empty)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a normal ArrayBuffer allocates too much memory. If there isn't much redundancy you could end up increasing the memory usage by about 5x (!). I am not sure we have a great alternative data structure right now, but we should think about one. At the very least I think we should have an array builder and a view of the built array so at least we can reclaim all the extra memory after construction is complete.

def empty: BitSet = new BitSet1(0L)

def newBuilder(): Builder[Int, BitSet] =
new ImmutableBuilder[Int, BitSet](empty) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a performance bug waiting to happen. Larger immutabe bitsets should absolutely not be built this way; they should be built with a mutable bitset, and upon asking for the result the array should be handed over and/or copied.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, indeed, definitely!

with MapOps[K, V, HashMap, HashMap[K, V]]
with Serializable {
with MapOps[K, V, HashMap, HashMap[K, V]]
with Buildable[(K, V), HashMap[K, V]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't MapOps[K, V, HashMap, HashMap[K, V]] imply Buildable[(K, V), HashMap[K, V]]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the decision of extending Buildable can only be made at the leaf level (for each concrete collection type). For instance, Seq does not imply Buildable, which is good because we don’t want LazyList to implement Buildable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's just because someone might write MapOps[K, V, Map, Map[K, V]] that we don't want to be Buildable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Map it probably makes sense to extend Buildable, just like for Set (which I am working on at the moment)

with SetOps[A, HashSet, HashSet[A]]
with Serializable {
with SetOps[A, HashSet, HashSet[A]]
with Buildable[A, HashSet[A]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about SetOps implying Buildable

@@ -124,5 +127,10 @@ object ListSet extends IterableFactory[ListSet] {

def empty[A]: ListSet[A] = EmptyListSet.asInstanceOf[ListSet[A]]

def newBuilder[A](): Builder[A, ListSet[A]] =
new ImmutableBuilder[A, ListSet[A]](empty) {
def add(elem: A): this.type = { elems = elems + elem; this }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any way to push this implementation up into the trait without losing performance? There's a lot of repetition of this pattern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could make ImmutableBuilder[A, C] take as parameter a (C, A) => C function defining how to add an element to a given collection. Then at use site that would look like the following:

new ImmutableBuilder[A, ListSet[A]](empty, _ + _)

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not much better, and we want to be really careful about per-element overhead. I was thinking about pushing it up into a SetFactory which extends IterableFactory. Not sure if there will be performance consequences, though.

@@ -12,7 +12,7 @@ import scala.Predef.{assert, intWrapper}

/** Concrete collection type: ListBuffer */
class ListBuffer[A]
extends Seq[A]
extends GrowableSeq[A]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to separately say this is growable, buildable, and a builder? Also, can't every mutable collection be treated as its own builder?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to separately say this is growable, buildable, and a builder?

Indeed we could probably assume that Growable imply Buildable. I see no example where we would have the former without also having the latter.

I decided to put the method that returns the builder in the companion object because I think that could be useful for other purposes (see #97).

can't every mutable collection be treated as its own builder?

That’s a good question. Actually, @odersky initially designed mutable collections to be their own builders. I thought that it would make the hierarchy slightly more complicated (that’s yet another type that shows up in the linearization) and also I think the public result method of builders makes little sense on a concrete collection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to separately say this is growable, buildable

Actually, in the current design Growable[A] takes just one type parameter (the type of the elements). If we want to make Growable[A] imply Buildable[A, C] then we need to add the C type parameter to Growable and on all its subclasses. A consequence is that this C should be supplied by a XxxOps trait, which means that we would have a GrowableIterableOps branch in the hierarchy that would be refined for Set, Map and Seq. That would be a lot of additional traits :(

Or, could we assume that all mutable collections are Growable? (in such a case we could make mutable.IterableOps extend Buildable and get rid of the mutable.Iterable / mutable.GrowableIterable distinction)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot that Arrays are not growable. I guess there is a distinction there to maintain.

@julienrf
Copy link
Contributor Author

@Ichoran can you summarize what needs to be changed?

@julienrf
Copy link
Contributor Author

I rebased the PR and implemented groupBy in Range and NumericRange. I moved the actual implementation from Buildable to a collection.generic.GroupBy object so that I could reuse it even in collections types that do not extend Buildable (such as Range and NumericRange)

@Ichoran
Copy link
Contributor

Ichoran commented Jun 13, 2017

@julienrf - I'll try to check it this evening.

@Ichoran Ichoran dismissed their stale review June 13, 2017 19:49

Changes made, so original review not relevant.

@Ichoran
Copy link
Contributor

Ichoran commented Jun 13, 2017

Hrm, not sure if there's a way to "unreview". Dismissing the review doesn't seem to be what I wanted.

@julienrf
Copy link
Contributor Author

Superseded by #111.

@julienrf julienrf closed this Jun 21, 2017
@julienrf julienrf deleted the group-by branch June 28, 2017 09:15
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants