Adding `first`, `last` and `describe` convenience functions. #42
base: main
Conversation
@jpsamaroo how does this implementation look?
src/table/dtable.jl (Outdated)

```julia
    return table
end

chunk_length = chunk_lengths(table)[1]
```
chunk lengths are not guaranteed to be equal; some may even be empty
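For instance, a hedged sketch of how this can happen (qualifying `chunk_lengths` with the module name in case it isn't exported):

```julia
using DTables

d = DTable((a = 1:10,), 4)        # partitions of 4, 4 and 2 rows
DTables.chunk_lengths(d)           # [4, 4, 2] -- unequal from the start

f = filter(r -> r.a > 8, d)        # filtering keeps the partitioning
DTables.chunk_lengths(f)           # [0, 0, 2] -- the leading chunks are now empty
```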
Hi @krynju. If this is the case, is there any way to retrieve the original `chunksize`? If I'm not wrong, it's not stored as a property of `DTable`s.

On another note: suppose for a `DTable` I have `chunksize` greater than the number of rows in the table. In that case, won't I lose information about what `chunksize` I passed?
Yeah, I think it was an early design decision to make `chunksize` an argument of the constructor for the initial partitioning and later ignore it (and for that reason not store it either).

I think including the original `chunksize` in the logic would also be a bit confusing and would make it more complex, but if we have any use case for that then we can revisit this.

I did think of caching the current chunk sizes, because generally that information doesn't change in a `DTable` (after you manipulate a `DTable` it becomes a new `DTable`). We already cache the schema, so a similar mechanism could be used.
And for this you can just use `chunk_lengths`, as you did:
https://github.com/JuliaParallel/DTables.jl/blob/9fcbe237e0c6ddd6b6f2880f33347efe99a76fdd/src/table/dtable.jl#L252C10-L255
> On another note: suppose for a `DTable` I have `chunksize` greater than the number of rows in the table. In that case, won't I lose information about what `chunksize` I passed?

You will, and you will only get one partition in the `DTable`.
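For example, a small sketch of that boundary case:

```julia
using DTables

d = DTable((a = 1:5,), 100)       # chunksize exceeds the row count
DTables.chunk_lengths(d)           # [5] -- one partition; the 100 is not recoverable
```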
> Yeah, I think it was an early design decision to make `chunksize` an argument of the constructor for the initial partitioning and later ignore it (and for that reason not store it either).
> I think including the original `chunksize` in the logic would also be a bit confusing and would make it more complex, but if we have any use case for that then we can revisit this.
> I did think of caching the current chunk sizes, because generally that information doesn't change in a `DTable` (after you manipulate a `DTable` it becomes a new `DTable`). We already cache the schema so a similar mechanism could be used.

@krynju how about this: to get the `chunksize`, can I get the maximum value from `chunk_lengths`? Certainly this maximum should be the original `chunksize`, except for a boundary case where `chunksize` is greater than the number of rows.
Again, not guaranteed. Why do you need the original `chunksize`?
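A hedged sketch of why the maximum is not reliable either:

```julia
using DTables

d = DTable((a = 1:10,), 4)             # original chunksize = 4
f = filter(r -> isodd(r.a), d)         # chunk lengths become [2, 2, 1]
maximum(DTables.chunk_lengths(f))       # 2 -- no longer the original 4
```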
src/table/dtable.jl (Outdated)

```julia
extra_chunk_rows = rowtable(fetch(extra_chunk))
new_chunk = Dagger.tochunk(sink(extra_chunk_rows[1:needed_rows]))
required_chunks = vcat(table.chunks[1:num_full_chunks], [new_chunk])
```
It's better to do this with `Dagger.@spawn` and make the last `DTable` chunk just a thunk (so the result of `Dagger.@spawn`).
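A hedged sketch of that suggestion, reusing `sink`, `extra_chunk`, `needed_rows`, and `num_full_chunks` from the diff above; `cut_chunk` is a hypothetical helper name:

```julia
using Dagger, Tables

# hypothetical helper: materialize only the first n rows of a chunk
cut_chunk(chunk, n) = sink(Tables.rowtable(chunk)[1:n])

# a thunk instead of the eager Dagger.tochunk(...) call; Dagger unwraps
# the chunk argument before cut_chunk runs
last_chunk = Dagger.@spawn cut_chunk(extra_chunk, needed_rows)
required_chunks = vcat(table.chunks[1:num_full_chunks], Any[last_chunk])
```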
@krynju how does it look now? I've used the maximum among all `chunk_lengths` to get the original chunk size, and made the last chunk a thunk.
We should use the actual chunk lengths and not the maximum of them.
When you call `first(d, 50)` it should go something like this (not valid code, just writing the idea):

```julia
s = 50
csum = 0
chunks = []
for (cl, chunk) in zip(chunk_lengths(d), d.chunks)
    if csum + cl > s
        # do the thing with spawn: this is the last one, so make a thunk from it and cut it
        push!(chunks, the_cut_thunk)
        break
    else
        csum += cl
        push!(chunks, chunk)
    end
end
return DTable(chunks)
```
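For reference, a runnable version of that idea. This is a hedged sketch: `take_rows` and `first_rows` are hypothetical names, and the chunk-vector `DTable` constructor at the end is assumed to exist as in the pseudocode above.

```julia
using DTables, Dagger, Tables

# hypothetical helper: keep only the first n rows of a chunk's table
take_rows(chunk, n) = Tables.rowtable(chunk)[1:n]

function first_rows(d::DTable, n::Integer)
    csum = 0
    chunks = Any[]
    for (cl, chunk) in zip(DTables.chunk_lengths(d), d.chunks)
        if csum + cl >= n
            # last chunk we need: cut it lazily with a thunk and stop
            push!(chunks, Dagger.@spawn take_rows(chunk, n - csum))
            break
        end
        csum += cl               # take whole chunks while below the target
        push!(chunks, chunk)
    end
    return DTable(chunks, d.tabletype)   # assumed chunk-vector constructor
end
```

If `n` exceeds the total row count, the loop simply keeps every chunk, mirroring how `Base.first` caps at the collection length.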
Still a draft PR.

This PR adds implementations of the `first`, `last` and `describe` convenience functions.

Examples

Here is how `first` works:
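A minimal usage sketch, assuming `first(d, n)` returns a lazy `DTable` holding the leading `n` rows:

```julia
using DTables

d = DTable((a = 1:100, b = 1:100), 25)

f = first(d, 10)      # lazily selects the leading 10 rows
fetch(f)              # materializes them as a table
```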
TODO:
- `first`
- `last`
- `describe`