
[SPARK-20484][MLLIB] Add documentation to ALS code #17793


Closed · wants to merge 13 commits

Conversation

danielyli (Contributor)

What changes were proposed in this pull request?

This PR adds documentation to the ALS code.

How was this patch tested?

Existing tests were used.

@mengxr @srowen

This contribution is my original work. I have the license to work on this project under the Spark project’s open source license.

sethah (Contributor) commented Apr 28, 2017

+1 for this change. I'll try to take a look sometime, but maybe after the QA period. Also cc @MLnick.

MLnick (Contributor) commented Apr 28, 2017

ok to test

srowen (Member) left a comment

It looks OK to me as-is

  val blockRatings = partitionRatings(ratings, userPart, itemPart)
    .persist(intermediateRDDStorageLevel)
  val (userInBlocks, userOutBlocks) =
    makeBlocks("user", blockRatings, userPart, itemPart, intermediateRDDStorageLevel)
- // materialize blockRatings and user blocks
- userOutBlocks.count()
+ userOutBlocks.count() // materialize blockRatings and user blocks
Member

It's a nit, but I wouldn't make changes like this. It doesn't add anything

danielyli (Contributor, Author)

I moved the comment because the only other comment that has its own line, // Precompute the rating dependencies of each partition, is serving as the heading for this entire block of code, and having other whole-line comments in this block is a bit of a mismatch. If you still feel reversion is necessary though, just let me know.
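
For context on this hunk: count() is a Spark action, so calling it forces the lazily-declared persist() above to actually compute and cache the RDD before the next stage needs it. A minimal sketch of the pattern, using a throwaway RDD rather than ALS's blocks:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().master("local[2]").appName("demo").getOrCreate()
    // persist() only marks the RDD for caching; nothing is computed yet.
    val rdd = spark.sparkContext.parallelize(1 to 10).persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count() // an action: materializes and caches the RDD now, so later jobs reuse it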

- itemOutBlocks.count()
+ itemOutBlocks.count() // materialize item blocks

// Encoders for storing each user/item's partition ID and index within its partition using a
Member

This is probably fine but I tend to avoid moving code around unless it really helps -- this minimizes things like back-port merge conflict problems.

danielyli (Contributor, Author)

I moved the code because otherwise the comment on L823 (// Precompute the rating dependencies of each partition) would reference the LocalIndexEncoders and the solver. Agreed that otherwise it would be unnecessary to move.

MLnick (Contributor) commented May 2, 2017

Why not add the comment before the encoder vals are defined (and not move this code around)? You could add a blank line between it and the solver if you want to disambiguate the comment.
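
For context on the encoders mentioned in this hunk: the idea is to pack a partition (block) ID and a within-partition index into a single Int. Below is a minimal sketch of that bit-packing idea; the names are hypothetical stand-ins, not ALS's actual LocalIndexEncoder:

    // Reserve the low bits for the block ID; the remaining bits hold the local index.
    class PackedIndexEncoder(numBlocks: Int) {
      require(numBlocks > 1)
      private val numBlockBits = 32 - Integer.numberOfLeadingZeros(numBlocks - 1)
      private val blockMask = (1 << numBlockBits) - 1
      def encode(blockId: Int, localIndex: Int): Int = (localIndex << numBlockBits) | blockId
      def blockId(encoded: Int): Int = encoded & blockMask
      def localIndex(encoded: Int): Int = encoded >>> numBlockBits
    }

    val enc = new PackedIndexEncoder(4) // 4 blocks -> 2 low bits for the block ID
    val e = enc.encode(3, 17)
    assert(enc.blockId(e) == 3 && enc.localIndex(e) == 17)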

SparkQA commented Apr 28, 2017

Test build #76264 has finished for PR 17793 at commit 0a2edf0.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

danielyli (Contributor, Author)

How do I fix the “fails to generate documentation” error?

srowen (Member) commented Apr 28, 2017

You have some javadoc errors. See the full log.

* )
* }}}
*
* (In this contrived example, the rating values are chosen specifically for clarity and are in
Contributor

This part seems unnecessary. Definitely the last sentence.

danielyli (Contributor, Author)

You're right, the first sentence is probably overkill. I'll remove it.

The second one I would say should be included, since someone new to the code might be confused about why users' ratings aren't whole numbers (like star ratings). I'm always in favor of reducing any possible ambiguity.

danielyli (Contributor, Author)

Actually, on second thought, the first clause of the first sentence clarifies why, if ratings are usually whole numbers, we're using floats; the first sentence justifies the second. I would err on the side of keeping the whole thing in as-is.

Contributor

I don't see why anyone would assume ratings have to be whole numbers. If anything it seems misleading to say that ratings "are usually whole numbers." "Ratings" need not be given by users - they could be computed in many ways, such as business rules for inferring numeric measures of preference based on user-item interactions.

danielyli (Contributor, Author)

Great point. Thanks for pointing out what I missed. Removed; updated PR coming soon.
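
For reference, a short illustration of the point above, assuming ml's ALS.Rating case class with a Float rating field: non-integer values are perfectly valid inputs, e.g. a preference weight inferred from user-item interactions.

    import org.apache.spark.ml.recommendation.ALS.Rating

    // Ratings need not be whole numbers like star ratings.
    val r = Rating(user = 0, item = 7, rating = 0.85f)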

- * Out-link block that stores, for each dst (item/user) block, which src (user/item) factors to
- * send. For example, outLinkBlock(0) contains the local indices (not the original src IDs) of the
- * src factors in this block to send to dst block 0.
+ * Out-link blocks that store information about which columns of the items factor matrix are
Contributor

Is this any clearer? "For each user in each block, a mapping of which item blocks that user's factors must be sent to in order to compute the updated item factors, and vice versa."

Referring to user rows or item columns seems unnecessary since you can transpose the ratings matrix and get opposite mappings. There may be some standard convention though.

Also, how about adding

   /**
   * Say user block 0 corresponds to users 1, 42, and 29575. Then a corresponding outblock of:
   * 
   * {{{
   *   [[0, 15, 42],
   *    [12, 43],
   *    [314]]
   * }}}
   *  means that user 1 factors must be sent to item blocks 0, 15, and 42; user 42 factors must be
   *  sent to item blocks 12 and 43; user 29575 factors must be sent to item block 314.
   */

danielyli (Contributor, Author)

I like this. I'll add something to this effect in a bit.
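
As a complement to the suggested doc, here is a runnable sketch of how such an out-link block would be read, assuming an out-block is essentially an Array[Array[Int]] of local indices as described above:

    // Users in this block, by local index; the out-block values refer to these positions.
    val srcIds = Array(1, 42, 29575)
    // outBlock(dstItemBlock) = local indices of the user factors to ship to that item block.
    val outBlock: Array[Array[Int]] = Array(Array(0, 2), Array(1))
    for ((localIndices, dstBlock) <- outBlock.zipWithIndex; i <- localIndices)
      println(s"ship factors of user ${srcIds(i)} to item block $dstBlock")
    // ship factors of user 1 to item block 0
    // ship factors of user 29575 to item block 0
    // ship factors of user 42 to item block 1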

* val blockRatings = partitionRatings(ratings, userPart, itemPart)
* }}}
*
* Ratings with even-valued user IDs are shuffled to partition 0 while those with odd-valued user
Contributor

I'm not sure I understand why the partitioner separates based on even/odd here.

danielyli (Contributor, Author)

Good catch. I'll update.
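
For what it's worth, the even/odd split falls out of hashing with two partitions: HashPartitioner maps a key to a non-negative key.hashCode % numPartitions, and an Int's hash code is the value itself, so with two partitions even IDs land in partition 0 and odd IDs in partition 1. A quick illustration:

    import org.apache.spark.HashPartitioner

    val userPart = new HashPartitioner(2)
    Seq(10, 11, 12, 13).map(userPart.getPartition) // List(0, 1, 0, 1)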

@@ -1026,7 +1161,24 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
}

  /**
-  * Partitions raw ratings into blocks.
+  * Groups an RDD of `Rating`s by the user partition and item partition to which each `Rating` maps
Contributor

[[Rating]]

danielyli (Contributor, Author)

Agreed.
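
For anyone unfamiliar with the notation under discussion: in scaladoc, double square brackets create a link to the named entity, while backticks only produce monospace text. A sketch (the surrounding signature is elided and purely illustrative):

    /**
     * Groups an RDD of [[Rating]]s by the user partition and item partition to
     * which each rating maps.  The [[Rating]] form renders as a link to the
     * class; backticked `Rating` renders as plain code text.
     */
    def partitionRatings(): Unit = ??? // illustration only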

sethah (Contributor) commented Apr 28, 2017

btw "You can build just the Spark scaladoc by running build/sbt unidoc from the SPARK_PROJECT_ROOT directory." Link

SparkQA commented Apr 29, 2017

Test build #76289 has finished for PR 17793 at commit 57de83b.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

I don't believe Scaladoc can link to nested classes
SparkQA commented Apr 29, 2017

Test build #76292 has finished for PR 17793 at commit 5a4eb85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 30, 2017

Test build #76321 has finished for PR 17793 at commit e5cdba1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) left a comment

I'm OK with it. It's probably fine for 2.2 as it's a doc-only change; a few lines of code are moved but it doesn't change functionality.

danielyli (Contributor, Author)

Great. Let me finish adding that one change @sethah requested, and I'll update the PR sometime today.

danielyli (Contributor, Author)

All comments have been addressed.

SparkQA commented May 4, 2017

Test build #76430 has finished for PR 17793 at commit c82501a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) commented May 4, 2017

@danielyli I wonder if you can build the docs to make sure that all your comments render as expected? There's a fair bit of formatting going on here, and the scaladoc markdown can be surprising.

* 0 -> Array(Array(0, 1), Array(0, 1)),
* 1 -> Array(Array(0), Array(0))) }}}
*
* The data structure encodes the following information:
sethah (Contributor) commented May 4, 2017

This is all correct, but was still confusing to me. Personally I think the following is clearer, but if you don't then feel free to leave it out.

  /**
   * Each user block contains a subset of users in fixed, but typically random order. 
   *
   * User block 0  User block 1
   *  ________      _______
   * | user12 |    | user4 |
   * | user5  |    | user2 |
   * | user33 |    |       |
   * |________|    |_______|
   *
   * Out block 0                       Out block 1
   *
   * Array(                            Array(
   *   Array(0, 2), // item block 0     Array(0),    // item block 0 
   *   Array(1, 2), // item block 1     Array(0, 1), // item block 1 
   *   Array(1))    // item block 2     Array())     // item block 2
   *
   * For outblocks, the index in the outer array corresponds to the item block. So the first inner
   * array is item block 0, the second item block 1, and so on. The values in each array correspond
   * to the "local indices" of the user factors in this block that need to be shipped to that item
   * block. So for outblock 0, we know that user factors at index 0 and 2 must be shipped to item 
   * block 0. That means that the user factors for user12 and user33 need to go to item block 0. 
   * And for outblock 1, we know that user4 must go to item blocks 0 and 1 and user2 must go to item
   * block 1. None of the users in user block 1 need to go to item block 2.
   */

danielyli (Contributor, Author) commented May 6, 2017

Yeah, I agree, it could be clearer (I didn't like it very much either when writing it; it was a struggle to make it easy to understand since the final encoded form references everything using local indices). Let me rewrite it, taking into account your suggestions, and update the PR.

danielyli (Contributor, Author)

Updated, though I still don't like it very much. Honestly, reading either of our versions would make my head spin if I weren't already acquainted with the encoding; I'd still have to dive into the actual code and work out an example for myself before I'd feel familiar with it. Should we just leave it as-is?

Alternatively, if you feel you can write it more clearly, please don't hesitate to change the PR directly. (If you do update it, note that the user IDs are not random but are sorted in ascending order within each partition.)
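
One more angle that might help future readers: an out-block can be derived mechanically from a user block's (local index, item block) rating pairs. A minimal sketch under the same toy setup as the suggestion above (three item blocks; the pairs are chosen to reproduce "Out block 0"):

    val numItemBlocks = 3
    // (local user index, item block) pairs observed in this user block's ratings.
    val pairs = Seq((0, 0), (2, 0), (1, 1), (2, 1), (1, 2))
    val outBlock: Array[Array[Int]] =
      Array.tabulate(numItemBlocks) { b =>
        pairs.filter(_._2 == b).map(_._1).distinct.sorted.toArray
      }
    // outBlock: Array(Array(0, 2), Array(1, 2), Array(1)) -- matches "Out block 0" above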

danielyli (Contributor, Author)

@srowen Great idea. Will do and report back.

danielyli (Contributor, Author)

unidoc build results:

  1. Doc for ALS.train looks fine.
  2. No doc for ALS.OutBlock is generated; possibly because it's a type def?
  3. No doc for ALS.InBlock is generated; possibly because it's private[recommendation]?
  4. No doc for ALS.partitionRatings is generated; possibly because it's private?

I ran ./build/sbt unidoc from the root of the repo.
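
On items 2-4: under the default settings, scaladoc emits documentation only for public (and protected) members, so private and qualified-private definitions are skipped; and the javadoc side of unidoc has no way to represent a Scala type alias, which would be consistent with item 2 as well. A hypothetical sketch of the visibility cases:

    package org.apache.spark.ml.recommendation

    object DocVisibilityDemo {
      /** Public: appears in the generated docs. */
      def train(): Unit = ()

      /** Qualified-private: omitted from the published docs. */
      private[recommendation] def makeBlocks(): Unit = ()

      /** Private: omitted. */
      private def partitionRatings(): Unit = ()
    }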

SparkQA commented May 6, 2017

Test build #76516 has started for PR 17793 at commit 3d5d8a6.

SparkQA commented May 6, 2017

Test build #76518 has finished for PR 17793 at commit 6d27fff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) left a comment

Yes, I think a lot of the doc is internal, so while it's fine to write it in markdown style for consistency, it won't matter. As long as anything that renders looks right, OK.

srowen (Member) commented May 7, 2017

Merged to master

@asfgit asfgit closed this in 88e6d75 May 7, 2017
danielyli (Contributor, Author)

Thanks all.

liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017

Author: Daniel Li <[email protected]>

Closes apache#17793 from danielyli/spark-20484.