Speed up SAS7BDAT parser #7

Draft: wants to merge 252 commits into base: main

@jonashaag (Owner) commented May 23, 2022

Speed up the SAS7BDAT parser by ~5–20x by moving all performance-critical parts to Cython:

  • process_page_metadata with sub-methods _get_subheader_index, _get_subheader_processor
  • read_{float,int} with big/little endian unpacking
  • String encoding and NaN placement
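
To illustrate the kind of work `read_{float,int}` does, here is a minimal pure-Python sketch of endian-aware unpacking. The signatures, the signed-integer interpretation, and the supported widths are assumptions made for this example; this is not the Cython code from this PR.

```python
import struct

# Hypothetical helpers for illustration only. The names echo read_{float,int}
# from the description, but the signatures are assumptions of this sketch.
_FLOAT_FORMATS = {4: "f", 8: "d"}


def read_int(buf: bytes, offset: int, width: int, byteorder: str) -> int:
    """Read a `width`-byte integer at `offset`; byteorder is "little" or "big"."""
    if offset < 0 or offset + width > len(buf):
        raise ValueError("read past end of buffer")
    # Treating the value as signed is an assumption of this sketch.
    return int.from_bytes(buf[offset:offset + width], byteorder, signed=True)


def read_float(buf: bytes, offset: int, width: int, byteorder: str) -> float:
    """Read a 4- or 8-byte IEEE 754 float at `offset`."""
    if width not in _FLOAT_FORMATS:
        raise ValueError(f"unsupported float width: {width}")
    if byteorder not in ("little", "big"):
        raise ValueError(f"unknown byte order: {byteorder}")
    prefix = "<" if byteorder == "little" else ">"
    return struct.unpack_from(prefix + _FLOAT_FORMATS[width], buf, offset)[0]


# Usage: a little-endian double followed by a big-endian 4-byte integer.
page = struct.pack("<d", 1.5) + struct.pack(">i", 258)
print(read_float(page, 0, 8, "little"))
print(read_int(page, 8, 4, "big"))
```

In Python, each call pays for slicing, format lookup, and object creation, which is presumably why these helpers are worth moving into compiled code.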

I also did a major refactor of how the cythonized parts of the parser interact with the Python parts: it is now explicit where variables are synchronized between the two. I renamed the Cython parser class to SAS7BDATCythonReader to reflect that it is essentially an extension of the Python SAS7BDATReader, placed in a .pyx file for technical reasons.

Other improvements:

  • Remove use of NumPy arrays for RDC/RLE decompression
  • Speed up parsing of strings with a lot of trailing zero bytes
  • Remove some dead code
  • Some cleanup
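
As a rough sketch of the trailing-zero-byte case: fixed-width string fields are often mostly padding, so it helps to find the end of the meaningful bytes first and decode only that prefix. The function names, the choice of padding bytes (NUL and space), and the default encoding below are assumptions for illustration, not the implementation in this PR.

```python
def strip_trailing_padding(raw: bytes) -> bytes:
    """Return `raw` without trailing NUL (0x00) and space (0x20) bytes.

    Scans backwards once to find the last meaningful byte, then takes a
    single slice. This mirrors what a tight C-level loop would do.
    """
    end = len(raw)
    while end > 0 and raw[end - 1] in (0x00, 0x20):
        end -= 1
    return raw[:end]


def decode_field(raw: bytes, encoding: str = "latin-1") -> str:
    """Decode a fixed-width field, skipping the padded tail entirely."""
    return strip_trailing_padding(raw).decode(encoding)


# Usage: a 16-byte field holding a 3-character value.
field = b"abc" + b"\x00" * 13
print(decode_field(field))
```

In pure Python, `raw.rstrip(b"\x00 ")` gives the same result and is the idiomatic choice; the explicit loop is shown only because it maps directly onto compiled code.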

@jonashaag (Owner, Author) commented:

@xhochy

@jonashaag (Owner, Author) commented:

@xhochy now with 20x speedup \o/

@jonashaag force-pushed the fast-sas7bdat branch 2 times, most recently from 73b74d9 to 41430ff (June 12, 2022 21:16)
dataxerik and others added 17 commits June 13, 2022 16:32
* ENH: get_resolution support non-nano

* tzaware case
…-dev#47317)

* DOC: added example of valid input dict in dfgroupby.aggregate

* Updated line spacing to pass flake8 test

* Removed trailing whitespace

* retrying tests

* Rerun tests

* added user-defined function to examples

* Update generic.py
* TYP: Series.quantile

* common.py
* [ENH] to_orc

pandas.io.orc.to_orc method definition

* pandas.DataFrame.to_orc

set to_orc to pandas.DataFrame

* Cleaning

* Fix style & edit comments & change min dependency version to 5.0.0

* Fix style & add to see also

* Add ORC to documentation

* Changes according to review

* Fix problems mentioned in comment

* Linter compliance

* Address comments

* Add orc test

* Fixes from pre-commit [automated commit]

* Fix issues according to comments

* Simplify the code base after raising Arrow version to 7.0.0

* Fix min arrow version in to_orc

* Add to_orc test in line with other formats

* Add BytesIO support & test

* Fix some docs issues

* Use keyword only arguments

* Fix bug

* Fix param issue

* Doctest skipping due to minimal versions

* Doctest skipping due to minimal versions

* Improve spacing in docstring & remove orc test in test_common that has unusual pyarrow version requirement and is with a lot of other tests

* Fix docstring syntax

* ORC is not text

* Fix BytesIO bug && do not require orc to be explicitly imported before usage && all pytest tests have passed

* ORC writer does not work for categorical columns yet

* Appease mypy

* Appease mypy

* Edit according to reviews

* Fix path bug in test_orc

* Fix testdata tuple bug in test_orc

* Fix docstrings for check compliance

* read_orc does not have engine as a param

* Fix sphinx warnings

* Improve docs & rerun tests

* Force retrigger

* Fix test_orc according to review

* Rename some variables and func

* Update pandas/core/frame.py

Co-authored-by: Matthew Roeschke <[email protected]>

* Fix issues according to review

* Forced reruns

* Fix issues according to review

* Reraise Pyarrow TypeError as NotImplementedError

* Fix bugs

* Fix expected error msg in orc tests

* Avoid deprecated functions

* Replace {} with None in arg

Co-authored-by: NickFillot <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>
* TYP: plotting._matplotlib

* somehow super causes issues

* fix pickle issue: was accessing _kind on class

* and the last plotting file

* add timedelta
…das-dev#47331)

* REGR: concat not sorting columns for mixed column names

* Fix none in columns

* BUG: concat not sorting column names when None is included

* Update doc/source/whatsnew/v1.5.0.rst

Co-authored-by: Matthew Roeschke <[email protected]>

* Add gh reference

Co-authored-by: Matthew Roeschke <[email protected]>
* Add run-tests action

* Fix

* Fix

* Fix

* Update macos-windows.yml

* Update posix.yml

* Update python-dev.yml

* Update action.yml

* Update macos-windows.yml

* Update posix.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml
…47230)

* ENH: preserve non-nano DTA/TDA in Index/Series/DataFrame

* tighten xfail

* _prep_ndarray->_prep_ndarraylike

* xfail non-strict
* ENH: Timestamp +- timedeltalike scalar support non-nano

* ENH: Timestamp.__sub__(datetime) with non-nano

* better exception message
…ndas-dev#47307)

* ENH: Timestamp +- timedeltalike scalar support non-nano

* ENH: Timestamp.__sub__(datetime) with non-nano

* better exception message

* BUG: concat not sorting mixed column names when None is included (pandas-dev#47331)

* REGR: concat not sorting columns for mixed column names

* Fix none in columns

* BUG: concat not sorting column names when None is included

* Update doc/source/whatsnew/v1.5.0.rst

Co-authored-by: Matthew Roeschke <[email protected]>

* Add gh reference

Co-authored-by: Matthew Roeschke <[email protected]>

* Add run-tests action (pandas-dev#47292)

* Add run-tests action

* Fix

* Fix

* Fix

* Update macos-windows.yml

* Update posix.yml

* Update python-dev.yml

* Update action.yml

* Update macos-windows.yml

* Update posix.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml

* Update python-dev.yml

* ENH: Timestamp pickle support non-nano tzaware (pandas-dev#47340)

* ENH: DTA to_pydatetime, time, timetz, date, iter support non-nano

* cast in liboffsets

* mypy fixup

Co-authored-by: Patrick Hoefler <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>
Co-authored-by: Jonas Haag <[email protected]>
twoertwein and others added 29 commits July 18, 2022 10:17
* TYP: a few mismatches found by stubtest

* a few more

* flake8
…s-dev#47753) (pandas-dev#47754)

When passing secondary_y=True to a plotting function, a second axes with a
y-axis on the right side is created. Passing ylabel, ylim or yticks changed
these properties of the original invisible left y-axis, not the secondary
y-axis.
…andas-dev#46409) (pandas-dev#47736)

* TST: add test for last method on dataframe grouped by on boolean column (pandas-dev#46409)

* TST: add test for last method on dataframe grouped by on boolean column (pandas-dev#46409)

* BUG: PeriodIndex fails to handle NA, rather than putting NaT in its place (pandas-dev#46673)

* BUG: PeriodIndex fails to handle NA, rather than putting NaT in its place (pandas-dev#46673)

* BUG: PeriodIndex fails to handle NA, rather than putting NaT in its place (pandas-dev#46673)
… requirements (pandas-dev#47727)

* package versions in install.rst match v.1.5.0.rst

* remove azure from optional deps as not yet supported officially

* correct from whatsnew. 2021.05 should be 2021.5

left-padded zeros are not the format of version numbers for fsspec or gcsfs and would cause pip to fail if anyone used them to fetch from PyPI

* align cols in install.rst
* opt out of bottleneck for nanmean

* remove trailing whitespace

* make error bound explicit

* unittest only _bn_ok_dtype

* link issue to test function

* Update doc/source/whatsnew/v1.5.0.rst

clarify that there might be a performance decrease experienced from disabling `mean` for bottleneck

Co-authored-by: Matthew Roeschke <[email protected]>

* extend unit tests with (u)int dtypes

* Update pandas/core/nanops.py

Co-authored-by: JMBurley <[email protected]>

Co-authored-by: Matthew Roeschke <[email protected]>
Co-authored-by: JMBurley <[email protected]>
* TYP: freq and na_value

* _simple_new
… mask (pandas-dev#47763)

* BUG: fix regression in Series[string] setitem setting a scalar with a mask

* expand test for non-string value
* FIX: PeriodIndex json roundtrip

* update changelog

* Update doc/source/whatsnew/v1.5.0.rst

Co-authored-by: Matthew Roeschke <[email protected]>

* simplify change and add specialized tests

* pep8 change

Co-authored-by: Matthew Roeschke <[email protected]>
…Series.str) (pandas-dev#47755)

* fix 28277

* fix typo

* add test

* Update pandas/tests/strings/test_cat.py

Co-authored-by: Matthew Roeschke <[email protected]>

* fix pep 8 issue, change comment symbol

Co-authored-by: Matthew Roeschke <[email protected]>
…pe column (pandas-dev#47757)

* Update test_melt.py

* Update v1.5.0.rst

* Update melt.py

* Update test_melt.py

* Update melt.py

* fix type

* Update doc/source/whatsnew/v1.5.0.rst

Co-authored-by: Matthew Roeschke <[email protected]>

* Update melt.py

* Update test_melt.py

Co-authored-by: Matthew Roeschke <[email protected]>
* BUG: PeriodIndex-with-Nat + TimedeltaArray

* mypy fixup

* de-kludge
* Update to `jupyterlite==0.1.0b10`

* Update requirements-dev.txt
* REF: re-use convert_reso

* typo fixup
* Update join docs for other param

Update join docs regarding using multiple Series

* Update type for _join_compat

* Allow any iterable for join; test join for a list of series

* Update type signature

* Update pd.concat type, add cast() to make frame.join() work with mypy

* Fix type union syntax

* NDFrame

* Remove cast

* Fix mypy errors

* Code review

* Remove unnecessary assert

* Add comment explaining the cast

* Fix swapped order of cast comment

* Remove full stop

* Update pandas/core/frame.py

Co-authored-by: Marc Garcia <[email protected]>
…dev#47585)

* BUG: Series map ignoring na_action for dict or series mapper

* Add tests
…pandas-dev#47810)

* Update array.py

* Update test_array.py

* Update array.py

* fix format

* Update v1.5.0.rst

* fix number
* TST/CI: xfail test_round_sanity for 32 bit

* xfail generally

* minimize diff

* minimize again

* strict=False