-
Notifications
You must be signed in to change notification settings - Fork 666
FEAT-#1067: skiprows
support added for read_csv
#2607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2607 +/- ##
===========================================
- Coverage 85.34% 26.13% -59.22%
===========================================
Files 151 145 -6
Lines 15916 14946 -970
===========================================
- Hits 13584 3906 -9678
- Misses 2332 11040 +8708
Continue to review full report at Codecov.
|
8bd2638
to
46aff86
Compare
ae42db1
to
6407fe5
Compare
Performance numbers obtained by
According to the above table, performance of |
@anmyachev @dchigarev please review this PR. |
modin/pandas/test/utils.py
Outdated
if isinstance(result.index, pandas.MultiIndex): | ||
levels_number = len(result.index.levels) | ||
new_index_data = [ | ||
result.index.get_level_values(lev_id).astype(str) | ||
for lev_id in range(levels_number) | ||
] | ||
new_index = pandas.MultiIndex.from_arrays(new_index_data) | ||
new_index.names = result.index.names | ||
result.index = new_index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just result.index = result.index.astype(str)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we apply astytpe
to the Multiindex
object directly we will obtain error
TypeError: Setting <class 'pandas.core.indexes.multi.MultiIndex'> dtype to anything other than object is not supported
But actually in this PR Multiindex
conversion is not used, so these changes were removed as unrelated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So are you going to delete this?
@anmyachev @dchigarev do you have any comments on this PR? |
a2f73a4
to
2da042f
Compare
@anmyachev PR is ready for review. |
@dchigarev please review this |
You are right, we can drop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amyskov it makes sense, maybe we should test the Cython addition first?
There are some test failures here as well. Can you take a look?
5dcff16
to
0878d06
Compare
@devin-petersohn , this PR was reworked in the way it was discussed offline - now array-like or callable
As it can be seen |
Hi @devin-petersohn please take a look at this PR. |
@amyskov you got some conflicts here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amyskov What is the status of this PR?
@devin-petersohn this PR was rebased and now ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just need to move one minor change to a new PR, overall it looks great. I am excited to get this change in!
Unrelated change was moved into separated PR. Yes, it was really long way to get it done) |
Signed-off-by: Dmitry Chigarev <[email protected]> FEAT-modin-project#1067: skiprows support for read_csv added Signed-off-by: Dmitry Chigarev <[email protected]> FEAT-modin-project#1067: make 'skiprows' and 'nrows' work together Signed-off-by: Dmitry Chigarev <[email protected]> FEAT-modin-project#1067: range like 'skiprows' optimizations Signed-off-by: Dmitry Chigarev <[email protected]> Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: rework read_csv part Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: rework json and fwf dispatchers Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: add description Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: add tests Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: mark xfailed tests Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: fix skiprows when encoding is passed Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: handle partitions with empty DataFrames Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: mark xfailed tests Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: remove unused code Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: add asv tests Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: handle narrow dataframes Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: fix asv tests Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: fix Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: remove code duplication Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: consolidate some code into functions Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: correct io test to check pre_reading part Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: remove unrelated changes Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: Apply suggestions from code review Co-authored-by: Anatoly Myachev <[email protected]> Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: correct asv tests Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: addressing review comments Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: Update modin/engines/base/io/text/text_file_dispatcher.py Co-authored-by: Anatoly Myachev <[email protected]> Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: Update modin/engines/base/io/text/csv_dispatcher.py Co-authored-by: Dmitry Chigarev <[email protected]> Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: rework asv tests Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: rollback parser changes Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: Update modin/engines/base/io/text/text_file_dispatcher.py Co-authored-by: Anatoly Myachev <[email protected]> Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: update tests Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: fix for cloud tests Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: Update modin/engines/base/io/text/text_file_dispatcher.py Co-authored-by: Dmitry Chigarev <[email protected]> Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: remove asv tests Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: addressing review comments Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: addressing review comments Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: correct getting dummy_df Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: remove duplicated code Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: addressing review comments Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: addressing review comments Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: handle corner case with callable skiprows Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: localize the work with rows_considered Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: remove unnecessary changes Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: correct docstrings Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: drop complex `skiprows` after data import Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: remove fallback to pandas case Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: optimize rows skipping Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: rollback unrelated changes, formatting Signed-off-by: Alexander Myskov <[email protected]> FEAT-modin-project#1067: fix incorrect rebase Signed-off-by: Alexander Myskov <[email protected]>
Signed-off-by: Alexander Myskov <[email protected]>
Signed-off-by: Alexander Myskov <[email protected]>
Signed-off-by: Alexander Myskov <[email protected]>
@devin-petersohn CI is green now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amyskov, looks great!
This PR introduces parallel implementation for
read_csv
function with array-like and callableskiprows
parameter.skiprows
is handled differently depending on the parameter type:skiprows
is integer - rows will be eliminated from partitioned data in thepartitioned_file
function and wouldn't be actually read.skiprows
is uniformly distributed array (for example [1, 2, 3]) - in this case rows within header and the first row fromskiprows
will be read in the first partition (if there are any) andskiprows
will be handled as integer.skiprows
is array, but not uniformly distributed, or callable - in this caseskiprows
will be dropped only after full dataset is imported (this is done because handling of such kind ofskiprows
is inefficient inpartitioned_file
function).Note 1: this PR is continuation of #1932.
What do these changes do?
flake8 modin
black --check modin
git commit -s