[SPARK-20484][MLLIB] Add documentation to ALS code #17793

danielyli · 2017-04-28T02:04:21Z

What changes were proposed in this pull request?

This PR adds documentation to the ALS code.

How was this patch tested?

Existing tests were used.

This contribution is my original work. I have the license to work on this project under the Spark project’s open source license.

sethah · 2017-04-28T05:46:03Z

+1 for this change. I'll try to take a look sometime, but maybe after the QA period. Also cc @MLnick.

MLnick · 2017-04-28T07:55:49Z

ok to test

srowen

It looks OK to me as-is

srowen · 2017-04-28T08:05:38Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

    val blockRatings = partitionRatings(ratings, userPart, itemPart)
      .persist(intermediateRDDStorageLevel)
    val (userInBlocks, userOutBlocks) =
      makeBlocks("user", blockRatings, userPart, itemPart, intermediateRDDStorageLevel)
-    // materialize blockRatings and user blocks
-    userOutBlocks.count()
+    userOutBlocks.count()    // materialize blockRatings and user blocks


It's a nit, but I wouldn't make changes like this. It doesn't add anything

I moved the comment because the only other comment that has its own line, // Precompute the rating dependencies of each partition, is serving as the heading for this entire block of code, and having other whole-line comments in this block is a bit of a mismatch. If you still feel reversion is necessary though, just let me know.

srowen · 2017-04-28T08:06:09Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

-    itemOutBlocks.count()
+    itemOutBlocks.count()    // materialize item blocks
+
+    // Encoders for storing each user/item's partition ID and index within its partition using a


This is probably fine but I tend to avoid moving code around unless it really helps -- this minimizes things like back-port merge conflict problems.

I moved the code because otherwise the comment on L823 (// Precompute the rating dependencies of each partition) would reference the LocalIndexEncoders and the solver. Agreed that otherwise it would be unnecessary to move.

Why not add the comment before the encoder vals are defined (and not move this code around)? You could add a space in between the solver if you want to disambiguate the comment

SparkQA · 2017-04-28T08:10:28Z

Test build #76264 has finished for PR 17793 at commit 0a2edf0.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

danielyli · 2017-04-28T10:01:21Z

How do I fix the “fails to generate documentation” error?

srowen · 2017-04-28T10:22:10Z

You have some javadoc errors. See the full log.

sethah · 2017-04-28T19:29:02Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

+   *     )
+   * }}}
+   *
+   * (In this contrived example, the rating values are chosen specifically for clarity and are in


This part seems unnecessary. Definitely the last sentence.

You're right, the first sentence is probably overkill. I'll remove it.

The second one I would say should be included, since for someone new to the code, he/she might have some confusion as to why users' ratings aren't whole numbers (like star ratings). I'm always in favor of reducing any possible ambiguity.

Actually, on second thought, the first clause of the first sentence clarifies why, if ratings are usually whole numbers, we're using floats; the first sentence justifies the second sentence. I would err on keeping the whole thing in as is.

I don't see why anyone would assume ratings have to be whole numbers. If anything it seems misleading to say that ratings "are usually whole numbers." "Ratings" need not be given by users - they could be computed in many ways, such as business rules for inferring numeric measures of preference based on user-item interactions.

Great point. Thanks for pointing out to me what I missed. Removed—updated PR coming soon.

sethah · 2017-04-28T20:04:52Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

-   * Out-link block that stores, for each dst (item/user) block, which src (user/item) factors to
-   * send. For example, outLinkBlock(0) contains the local indices (not the original src IDs) of the
-   * src factors in this block to send to dst block 0.
+   * Out-link blocks that store information about which columns of the items factor matrix are


Is this any clearer? "For each user in each block, a mapping of which item blocks that user's factors must be sent to in order to compute the updated item factors, and vice versa."

Referring to user rows or item columns seems unnecessary since you can transpose the ratings matrix and get opposite mappings. There may be some standard convention though.

Also, how about adding

    /**
     * Say user block 0 corresponds users 1, 42, 29575. Then a corresponding outblock of:
     *
     * {{{
     *   [[0, 15, 42],
     *    [12, 43],
     *    [314]]
     * }}}
     * means that user 1 factors must be sent to item blocks 0, 15, and 42; user 42 factors must be
     * sent to item blocks 12 and 43; user 29575 factors must be sent to item block 314.
     */

I like this. I'll add something to this effect in a bit.
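To make the mapping described above concrete, here is a minimal sketch in plain Scala (no Spark dependency). It is not the ALS implementation: `OutBlockMappingSketch` and `destinationItemBlocks` are hypothetical names, item IDs are assumed to be non-negative, and items are assumed to map to blocks by modulus. With `numItemBlocks = 1000` (chosen only so that item IDs and block IDs coincide) it reproduces the example from the suggestion.

```scala
object OutBlockMappingSketch {
  // Hypothetical helper: for each user, the sorted, de-duplicated list of item blocks
  // that the user's factors must be sent to. Input is (user ID, item ID) pairs;
  // rating values do not affect routing. Assumes non-negative item IDs.
  def destinationItemBlocks(
      interactions: Seq[(Int, Int)],
      numItemBlocks: Int): Map[Int, Seq[Int]] =
    interactions
      .groupBy { case (user, _) => user }
      .map { case (user, pairs) =>
        user -> pairs.map { case (_, item) => item % numItemBlocks }.distinct.sorted
      }

  // Users 1, 42 and 29575 from the suggested Scaladoc example.
  val example: Map[Int, Seq[Int]] = destinationItemBlocks(
    Seq((1, 0), (1, 15), (1, 42), (42, 12), (42, 43), (29575, 314)),
    numItemBlocks = 1000)
}
```

Here `OutBlockMappingSketch.example(1)` lists item blocks 0, 15, and 42, matching the first row of the suggested out-block.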

sethah · 2017-04-28T20:07:29Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

+   *     val blockRatings = partitionRatings(ratings, userPart, itemPart)
+   * }}}
+   *
+   * Ratings with even-valued user IDs are shuffled to partition 0 while those with odd-valued user


I'm not sure I understand why the partitioner separates based on even/odd here.

Good catch. I'll update.
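For readers wondering where the even/odd split comes from: it is what mapping IDs to blocks by modulus produces when there are exactly two user blocks. The sketch below illustrates this in plain Scala, without Spark. `PartitionRatingsSketch`, its `Rating` case class, and `blockOf` are hypothetical stand-ins, and the modulus rule is an assumption about the partitioner rather than a quote from the ALS source.

```scala
object PartitionRatingsSketch {
  // Hypothetical stand-in for the ALS rating type; not Spark's own class.
  final case class Rating(user: Int, item: Int, rating: Float)

  // Non-negative modulus, so negative IDs still map to a valid block index.
  def blockOf(id: Int, numBlocks: Int): Int = {
    val raw = id % numBlocks
    if (raw < 0) raw + numBlocks else raw
  }

  // Group ratings by the (user block, item block) pair that each rating maps to.
  def partitionRatings(
      ratings: Seq[Rating],
      numUserBlocks: Int,
      numItemBlocks: Int): Map[(Int, Int), Seq[Rating]] =
    ratings.groupBy(r => (blockOf(r.user, numUserBlocks), blockOf(r.item, numItemBlocks)))

  // With two user blocks, even user IDs land in block 0 and odd ones in block 1.
  val example: Map[(Int, Int), Seq[Rating]] = partitionRatings(
    Seq(Rating(0, 3, 1.0f), Rating(1, 4, 2.0f), Rating(2, 3, 3.0f), Rating(-1, 5, 4.0f)),
    numUserBlocks = 2,
    numItemBlocks = 3)
}
```

With a different number of user blocks the even/odd pattern disappears, which is why describing the rule as a modulus is the more general statement.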

sethah · 2017-04-28T20:08:13Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

@@ -1026,7 +1161,24 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
  }

  /**
-   * Partitions raw ratings into blocks.
+   * Groups an RDD of `Rating`s by the user partition and item partition to which each `Rating` maps


[[Rating]]

sethah · 2017-04-28T20:23:35Z

btw "You can build just the Spark scaladoc by running build/sbt unidoc from the SPARK_PROJECT_ROOT directory." Link

SparkQA · 2017-04-29T00:20:59Z

Test build #76289 has finished for PR 17793 at commit 57de83b.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

I don't believe Scaladoc can link to nested classes

SparkQA · 2017-04-29T02:23:58Z

Test build #76292 has finished for PR 17793 at commit 5a4eb85.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-04-30T10:15:27Z

Test build #76321 has finished for PR 17793 at commit e5cdba1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

I'm OK with it. It's probably fine for 2.2 as it's a doc-only change; a few lines of code are moved but it doesn't change functionality.

danielyli · 2017-05-02T20:11:05Z

Great. Let me finish adding that one change @sethah requested, and I'll update the PR sometime today.

danielyli · 2017-05-03T23:13:47Z

All comments have been addressed.

SparkQA · 2017-05-04T00:17:49Z

Test build #76430 has finished for PR 17793 at commit c82501a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2017-05-04T09:45:57Z

@danielyli I wonder if you can build the docs to make sure that all your comments render as expected? There's a fair bit of formatting going on here and the Scaladoc markup can be surprising.

sethah · 2017-05-04T21:36:13Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

+   *       0 -> Array(Array(0, 1), Array(0, 1)),
+   *       1 -> Array(Array(0), Array(0))) }}}
+   *
+   * The data structure encodes the following information:


This is all correct, but was still confusing to me. Personally I think the following is clearer, but if you don't then feel free to leave it out.

    /**
     * Each user block contains a subset of users in fixed, but typically random order.
     *
     *   User block 0        User block 1
     *    ________            _______
     *   | user12 |          | user4 |
     *   | user5  |          | user2 |
     *   | user33 |          |       |
     *   |________|          |_______|
     *
     *   Out block 0                        Out block 1
     *
     *   Array(                             Array(
     *     Array(0, 2),  // item block 0      Array(0),     // item block 0
     *     Array(1, 2),  // item block 1      Array(0, 1),  // item block 1
     *     Array(1))     // item block 2      Array())      // item block 2
     *
     * For outblocks, the index in the outer array correspond to the item block. So the first inner
     * array is item block 0, the second item block 1, and so on. The values in each array correspond
     * to the "local indices" of the user factors in this block that need to be shipped to that item
     * block. So for outblock 0, we know that user factors at index 0 and 2 must be shipped to item
     * block 0. That means that the user factors for user12 and user33 need to go to item block 0.
     * And for outblock 1, we know that user4 must go to item block 0 and 1 and user2 must go to item
     * block 1. None of the users in user block 1 need to go to item block 2.
     */

Yeah, I agree, it could be clearer (I didn't like it very much either when writing it; it was a struggle to make it easy to understand since the final encoded form references everything using local indices). Let me rewrite it, taking into account your suggestions, and update the PR.

Updated, though I still don't like it very much. Honestly, reading either of our versions would make my head spin if I weren't already acquainted with the encoding; I'd still have to dive into the actual code and work out an example for myself before I'd feel familiar with it. Should we just leave it as-is?

Alternatively, if you feel you can write it more clearly, please don't hesitate to change the PR directly. (If you do update, note that the user IDs are not random but are sorted in ascending order within each partition.)
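Since both versions lean on the reader working through an example, here is a small sketch that decodes the local-index encoding mechanically. It is plain Scala with hypothetical names (`OutBlockDecodeSketch`, `decode`), not code from ALS.scala, and it reuses the user ordering from the suggested diagram rather than the ascending order the real partitions use.

```scala
object OutBlockDecodeSketch {
  // Users held by user block 0, in the block's fixed local order
  // (taken from the suggested diagram; real blocks sort IDs in ascending order).
  val userBlock0: Array[String] = Array("user12", "user5", "user33")

  // outBlock0(i) lists the local indices of the user factors to ship to item block i.
  val outBlock0: Array[Array[Int]] = Array(Array(0, 2), Array(1, 2), Array(1))

  // Translate each inner array of local indices back into the users it refers to,
  // paired with the item block (the position in the outer array) that receives them.
  def decode(users: Array[String], outBlock: Array[Array[Int]]): Seq[(Int, Seq[String])] =
    outBlock.toSeq.zipWithIndex.map { case (localIndices, itemBlock) =>
      itemBlock -> localIndices.toSeq.map(i => users(i))
    }

  val decoded: Seq[(Int, Seq[String])] = decode(userBlock0, outBlock0)
}
```

The first entry of `OutBlockDecodeSketch.decoded` pairs item block 0 with user12 and user33, which is the reading given in the suggested comment.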

danielyli · 2017-05-06T03:26:44Z

@srowen Great idea. Will do and report back.

danielyli · 2017-05-06T06:29:08Z

javaunibuild build results:

Doc for ALS.train looks fine.
No doc for ALS.OutBlock is generated; possibly because it's a type def?
No doc for ALS.InBlock is generated; possibly because it's private[recommendation]?
No doc for ALS.partitionRatings is generated; possibly because it's private?

I ran ./build/sbt unidoc from the root of the repo.

SparkQA · 2017-05-06T06:57:33Z

Test build #76516 has started for PR 17793 at commit 3d5d8a6.

SparkQA · 2017-05-06T08:10:16Z

Test build #76518 has finished for PR 17793 at commit 6d27fff.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

Yes I think a lot of the doc is internal, so, while it's fine to write it in markdown style for consistency it won't matter. As long as anything that renders looks right, OK.

srowen · 2017-05-07T09:10:13Z

Merged to master

danielyli · 2017-05-07T22:35:43Z

Thanks all.

## What changes were proposed in this pull request?

This PR adds documentation to the ALS code.

## How was this patch tested?

Existing tests were used.

mengxr srowen

This contribution is my original work. I have the license to work on this project under the Spark project’s open source license.

Author: Daniel Li <[email protected]>

Closes apache#17793 from danielyli/spark-20484.

danielyli added 5 commits April 27, 2017 18:26

Add documentation for the InBlock class

4661ddb

Add documentation for the OutBlock data type

7d1491e

Add documentation for partitionRatings method

2fdbcaa

Add documentation for ALS.train method

fb8f16d

Add inline comments to ALS.train method

0a2edf0

srowen approved these changes Apr 28, 2017

sethah reviewed Apr 28, 2017

danielyli added 2 commits April 28, 2017 17:02

Clarify that IDs are mapped to partitions by modulus

6da60f0

Add link to Rating in Scaladoc for partitionRatings

57de83b

Fix Scaladoc errors

5a4eb85

I don't believe Scaladoc can link to nested classes

Remove unnecessary paragraph in Scaladoc

e5cdba1

srowen approved these changes May 2, 2017

Clarify Scaladoc for OutBlock

c82501a

sethah reviewed May 4, 2017

Correct spacing in Scaladoc

983f9eb

Add more clarifying explanation about OutBlocks

3d5d8a6

Fix failing doc build

6d27fff

srowen approved these changes May 6, 2017

asfgit closed this in 88e6d75 May 7, 2017
