-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-20484][MLLIB] Add documentation to ALS code #17793
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
+1 for this change. I'll try to take a look sometime, but maybe after the QA period. Also cc @MLnick. |
ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks OK to me as-is
val blockRatings = partitionRatings(ratings, userPart, itemPart) | ||
.persist(intermediateRDDStorageLevel) | ||
val (userInBlocks, userOutBlocks) = | ||
makeBlocks("user", blockRatings, userPart, itemPart, intermediateRDDStorageLevel) | ||
// materialize blockRatings and user blocks | ||
userOutBlocks.count() | ||
userOutBlocks.count() // materialize blockRatings and user blocks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a nit, but I wouldn't make changes like this. It doesn't add anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the comment because the only other comment that has its own line, // Precompute the rating dependencies of each partition
, is serving as the heading for this entire block of code, and having other whole-line comments in this block is a bit of a mismatch. If you still feel reversion is necessary though, just let me know.
itemOutBlocks.count() | ||
itemOutBlocks.count() // materialize item blocks | ||
|
||
// Encoders for storing each user/item's partition ID and index within its partition using a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably fine but I tend to avoid moving code around unless it really helps -- this minimizes things like back-port merge conflict problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the code because otherwise the comment on L823 (// Precompute the rating dependencies of each partition
) would reference the LocalIndexEncoder
s and the solver
. Agreed that otherwise it would be unnecessary to move.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not add the comment before the encoder val
s are defined (and not move this code around)? You could add a space in between the solver if you want to disambiguate the comment
Test build #76264 has finished for PR 17793 at commit
|
How do I fix the “fails to generate documentation” error? |
You have some javadoc errors . See the full log |
* ) | ||
* }}} | ||
* | ||
* (In this contrived example, the rating values are chosen specifically for clarity and are in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part seems unnecessary. Definitely the last sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, the first sentence is probably overkill. I'll remove it.
The second one I would say should be included, since for someone new to the code, he/she might have some confusion as to why users' ratings aren't whole numbers (like star ratings). I'm always in favor of reducing any possible ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, on second thought, the first clause of the first sentence clarifies why, if ratings are usually whole numbers, we're using floats; the first sentence justifies the second sentence. I would err on keeping the whole thing in as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why anyone would assume ratings have to be whole numbers. If anything it seems misleading to say that ratings "are usually whole numbers." "Ratings" need not be given by users - they could be computed in many ways, such as business rules for inferring numeric measures of preference based on user-item interactions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point. Thanks for pointing out to me what I missed. Removed—updated PR coming soon.
* Out-link block that stores, for each dst (item/user) block, which src (user/item) factors to | ||
* send. For example, outLinkBlock(0) contains the local indices (not the original src IDs) of the | ||
* src factors in this block to send to dst block 0. | ||
* Out-link blocks that store information about which columns of the items factor matrix are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this any clearer? "For each user in each block, a mapping of which item blocks that user's factors must be sent to in order to compute the updated item factors, and vice versa."
Referring to user rows or item columns seems unnecessary since you can transpose the ratings matrix and get opposite mappings. There may be some standard convention though.
Also, how about adding
/**
* Say user block 0 corresponds users 1, 42, 29575. Then a corresponding outblock of:
*
* {{{
* [[0, 15, 42],
* [12, 43],
* [314]]
* }}}
* means that user 1 factors must be sent to item blocks 0, 15, and 42; user 42 factors must be
* sent to item blocks 12 and 43; user 29575 factors must be sent to item block 314.
*/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this. I'll add something to this effect in a bit.
* val blockRatings = partitionRatings(ratings, userPart, itemPart) | ||
* }}} | ||
* | ||
* Ratings with even-valued user IDs are shuffled to partition 0 while those with odd-valued user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand why the partitioner separates based on even/odd here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I'll update.
@@ -1026,7 +1161,24 @@ object ALS extends DefaultParamsReadable[ALS] with Logging { | |||
} | |||
|
|||
/** | |||
* Partitions raw ratings into blocks. | |||
* Groups an RDD of `Rating`s by the user partition and item partition to which each `Rating` maps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[[Rating]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
btw "You can build just the Spark scaladoc by running build/sbt unidoc from the SPARK_PROJECT_ROOT directory." Link |
Test build #76289 has finished for PR 17793 at commit
|
I don't believe Scaladoc can link to nested classes
Test build #76292 has finished for PR 17793 at commit
|
Test build #76321 has finished for PR 17793 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with it. It's probably fine for 2.2 as it's a doc-only change; a few lines of code are moved but it doesn't change functionality.
Great. Let me finish adding that one change @sethah requested, and I'll update the PR sometime today. |
All comments have been addressed. |
Test build #76430 has finished for PR 17793 at commit
|
@danielyli I wonder if you can build the docs to make sure that all your comments render as expected? there's a fair bit of formatting going on here and the scaladoc markdown can be surprising. |
* 0 -> Array(Array(0, 1), Array(0, 1)), | ||
* 1 -> Array(Array(0), Array(0))) }}} | ||
* | ||
* The data structure encodes the following information: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is all correct, but was still confusing to me. Personally I think the following is clearer, but if you don't then feel free to leave it out.
/**
* Each user block contains a subset of users in fixed, but typically random order.
*
* User block 0 User block 1
* ________ _______
* | user12 | | user4 |
* | user5 | | user2 |
* | user33 | | |
* |________| |_______|
*
* Out block 0 Out block 1
*
* Array( Array(
* Array(0, 2), // item block 0 Array(0), // item block 0
* Array(1, 2), // item block 1 Array(0, 1), // item block 1
* Array(1)) // item block 2 Array()) // item block 2
*
* For outblocks, the index in the outer array correspond to the item block. So the first inner
* array is item block 0, the second item block 1, and so on. The values in each array correspond
* to the "local indices" of the user factors in this block that need to be shipped to that item
* block. So for outblock 0, we know that user factors at index 0 and 2 must be shipped to item
* block 0. That means that the user factors for user12 and user33 need to go to item block 0.
* And for outblock 1, we know that user4 must go to item block 0 and 1 and user2 must go to item
* block 1. None of the users in user block 1 need to go to item block 2.
*/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree, it could be clearer (I didn't like it very much either when writing it; it was a struggle to make it easy to understand since the final encoded form references everything using local indices). Let me rewrite it, taking in to account your suggestions, and update the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, though I still don't like it very much. Honestly, reading either of our versions would make my head spin if I weren't already acquainted with the encoding; I'd still have to dive into the actual code and work out an example for myself before I'd feel familiar with it. Should we just leave it as-is?
Alternatively, if you feel you can write it clearer, please don't hesitate to directly change the PR. (If you do update, note that the user IDs are not random but are sorted ascendingly within each partition.)
@srowen Great idea. Will do and report back. |
javaunibuild build results:
I ran |
Test build #76516 has started for PR 17793 at commit |
Test build #76518 has finished for PR 17793 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think a lot of the doc is internal, so, while it's fine to write it in markdown style for consistency it won't matter. As long as anything that renders looks right, OK.
Merged to master |
Thanks all. |
## What changes were proposed in this pull request? This PR adds documentation to the ALS code. ## How was this patch tested? Existing tests were used. mengxr srowen This contribution is my original work. I have the license to work on this project under the Spark project’s open source license. Author: Daniel Li <[email protected]> Closes apache#17793 from danielyli/spark-20484.
What changes were proposed in this pull request?
This PR adds documentation to the ALS code.
How was this patch tested?
Existing tests were used.
@mengxr @srowen
This contribution is my original work. I have the license to work on this project under the Spark project’s open source license.