forked from pandas-dev/pandas
Speed up SAS7BDAT parser #7
Draft
jonashaag wants to merge 252 commits into main from fast-sas7bdat
@xhochy now with 20x speedup \o/
73b74d9 to 41430ff
* ENH: DTA.to_period support non-nano * update test
* add test * add blank line * fix
…7313) * ENH: Timestamp +- timedeltalike scalar support non-nano * catch and re-raise OverflowError
* ENH: get_resolution support non-nano * tzaware case
…-dev#47317) * DOC: added example of valid input dict in dfgroupby.aggregate * Updated line spacing to pass flake8 test * Removed trailing whitespace * retrying tests * Rerun tests * added user-defined function to examples * Update generic.py
* TYP: Series.quantile * common.py
* [ENH] to_orc pandas.io.orc.to_orc method definition * pandas.DataFrame.to_orc set to_orc to pandas.DataFrame * Cleaning * Fix style & edit comments & change min dependency version to 5.0.0 * Fix style & add to see also * Add ORC to documentation * Changes according to review * Fix problems mentioned in comment * Linter compliance * Address comments * Add orc test * Fixes from pre-commit [automated commit] * Fix issues according to comments * Simplify the code base after raising Arrow version to 7.0.0 * Fix min arrow version in to_orc * Add to_orc test in line with other formats * Add BytesIO support & test * Fix some docs issues * Use keyword only arguments * Fix bug * Fix param issue * Doctest skipping due to minimal versions * Doctest skipping due to minimal versions * Improve spacing in docstring & remove orc test in test_common that has unusual pyarrow version requirement and is with a lot of other tests * Fix docstring syntax * ORC is not text * Fix BytesIO bug && do not require orc to be explicitly imported before usage && all pytest tests have passed * ORC writer does not work for categorical columns yet * Appease mypy * Appease mypy * Edit according to reviews * Fix path bug in test_orc * Fix testdata tuple bug in test_orc * Fix docstrings for check compliance * read_orc does not have engine as a param * Fix sphinx warnings * Improve docs & rerun tests * Force retrigger * Fix test_orc according to review * Rename some variables and func * Update pandas/core/frame.py Co-authored-by: Matthew Roeschke <[email protected]> * Fix issues according to review * Forced reruns * Fix issues according to review * Reraise Pyarrow TypeError as NotImplementedError * Fix bugs * Fix expected error msg in orc tests * Avoid deprecated functions * Replace {} with None in arg Co-authored-by: NickFillot <[email protected]> Co-authored-by: Matthew Roeschke <[email protected]>
* TYP: plotting._matplotlib * somehow super causes issues * fix pickle issue: was accessing _kind on class * and the last plotting file * add timedelta
…das-dev#47331) * REGR: concat not sorting columns for mixed column names * Fix none in columns * BUG: concat not sorting column names when None is included * Update doc/source/whatsnew/v1.5.0.rst Co-authored-by: Matthew Roeschke <[email protected]> * Add gh reference Co-authored-by: Matthew Roeschke <[email protected]>
* Add run-tests action * Fix * Fix * Fix * Update macos-windows.yml * Update posix.yml * Update python-dev.yml * Update action.yml * Update macos-windows.yml * Update posix.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml
…47230) * ENH: preserve non-nano DTA/TDA in Index/Series/DataFrame * tighten xfail * _prep_ndarray->_prep_ndarraylike * xfail non-strict
* ENH: Timestamp +- timedeltalike scalar support non-nano * ENH: Timestamp.__sub__(datetime) with non-nano * better exception message
…ndas-dev#47307) * ENH: Timestamp +- timedeltalike scalar support non-nano * ENH: Timestamp.__sub__(datetime) with non-nano * better exception message * BUG: concat not sorting mixed column names when None is included (pandas-dev#47331) * REGR: concat not sorting columns for mixed column names * Fix none in columns * BUG: concat not sorting column names when None is included * Update doc/source/whatsnew/v1.5.0.rst Co-authored-by: Matthew Roeschke <[email protected]> * Add gh reference Co-authored-by: Matthew Roeschke <[email protected]> * Add run-tests action (pandas-dev#47292) * Add run-tests action * Fix * Fix * Fix * Update macos-windows.yml * Update posix.yml * Update python-dev.yml * Update action.yml * Update macos-windows.yml * Update posix.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml * Update python-dev.yml * ENH: Timestamp pickle support non-nano tzaware (pandas-dev#47340) * ENH: DTA to_pydatetime, time, timetz, date, iter support non-nano * cast in liboffsets * mypy fixup Co-authored-by: Patrick Hoefler <[email protected]> Co-authored-by: Matthew Roeschke <[email protected]> Co-authored-by: Jonas Haag <[email protected]>
* TYP: a few mismatches found by stubtest * a few more * flake8
…s-dev#47753) (pandas-dev#47754) When passing secondary_y=True to a plotting function, a second axes with a y-axis on the right side is created. Passing ylabel, ylim or yticks changed these properties of the original invisible left y-axis, not the secondary y-axis.
…andas-dev#46409) (pandas-dev#47736) * TST: add test for last method on dataframe grouped by on boolean column (pandas-dev#46409) * TST: add test for last method on dataframe grouped by on boolean column (pandas-dev#46409) * BUG: PeriodIndex fails to handle NA, rather than putting NaT in its place (pandas-dev#46673) * BUG: PeriodIndex fails to handle NA, rather than putting NaT in its place (pandas-dev#46673) * BUG: PeriodIndex fails to handle NA, rather than putting NaT in its place (pandas-dev#46673)
… requirements (pandas-dev#47727) * package versions in install.rst match v.1.5.0.rst * remove azure from optional deps as not yet supported officially * correct from whatsnew. 2021.05 should be 2021.5 left-passed zeros are not the format of version numbers for fsspec or gcsfs and would cause pip to fail if anyone used them to fetch from PyPi * align cols in install.rst
* opt out of bottleneck for nanmean * remove trailing whitespace * make error bound explicit * unittest only _bn_ok_dtype * link issue to test function * Update doc/source/whatsnew/v1.5.0.rst clarify that there might be a performance decrease experienced from disabling `mean` for bottleneck Co-authored-by: Matthew Roeschke <[email protected]> * extend unit tests with (u)int dtypes * Update pandas/core/nanops.py Co-authored-by: JMBurley <[email protected]> Co-authored-by: Matthew Roeschke <[email protected]> Co-authored-by: JMBurley <[email protected]>
* TYP: freq and na_value * _simple_new
… mask (pandas-dev#47763) * BUG: fix regression in Series[string] setitem setting a scalar with a mask * expand test for non-string value
* FIX: PeriodIndex json roundtrip * update changelog * Update doc/source/whatsnew/v1.5.0.rst Co-authored-by: Matthew Roeschke <[email protected]> * simplify change and add specialized tests * pep8 change Co-authored-by: Matthew Roeschke <[email protected]>
…Series.str) (pandas-dev#47755) * fix 28277 * fix typo * add test * Update pandas/tests/strings/test_cat.py Co-authored-by: Matthew Roeschke <[email protected]> * fix pep 8 issue, change comment symbol Co-authored-by: Matthew Roeschke <[email protected]>
…pe column (pandas-dev#47757) * Update test_melt.py * Update v1.5.0.rst * Update melt.py * Update test_melt.py * Update melt.py * fix type * Update doc/source/whatsnew/v1.5.0.rst Co-authored-by: Matthew Roeschke <[email protected]> * Update melt.py * Update test_melt.py Co-authored-by: Matthew Roeschke <[email protected]>
* BUG: PeriodIndex-with-Nat + TimedeltaArray * mypy fixup * de-kludge
* Update to `jupyterlite==0.1.0b10` * Update requirements-dev.txt
* REF: re-use convert_reso * typo fixup
* Update join docs for other param Update join docs regarding using multiple Series * Update type for _join_compat * Allow any iterable for join; test join for a list of series * Update type signature * Update pd.concat type, add cast() to make frame.join() work with mypy * Fix type union syntax * NDFrame * Remove cast * Fix mypy errors * Code review * Remove unnecessary assert * Add comment explaining the cast * Fix swapped order of cast comment * Remove full stop * Update pandas/core/frame.py Co-authored-by: Marc Garcia <[email protected]>
…dev#47585) * BUG: Series map ignoring na_action for dict or series mapper * Add tests
…pandas-dev#47810) * Update array.py * Update test_array.py * Update array.py * fix format * Update v1.5.0.rst * fix number
* TST/CI: xfail test_round_sanity for 32 bit * xfail generally * minimize diff * minimize again * strict=False
Speed up SAS7BDAT parser by ~5–20x by moving all performance-critical parts to Cython:

* `process_page_metadata` with sub-methods `_get_subheader_index`, `_get_subheader_processor`
* `read_{float,int}` with big/little endian unpacking

I also made a big refactor of how the cythonized parts of the parser work together with the Python parts. It is now explicit where variables are synchronized between the two. I renamed the Cython parser class to `SAS7BDATCythonReader` to reflect that it is essentially an extension of the Python `SAS7BDATReader`, put in a `.pyx` file for technical reasons.

Other improvements: