Speed up SAS7BDAT parser #7

jonashaag · 2022-05-23T07:30:06Z

Speed up the SAS7BDAT parser by roughly 5–20x by moving all performance-critical parts to Cython:

process_page_metadata, including its sub-methods _get_subheader_index and _get_subheader_processor
read_{float,int} with big/little endian unpacking
String encoding and NaN placement
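
For readers unfamiliar with the endian-aware readers mentioned above, here is a minimal pure-Python sketch of what such helpers do. It is not the Cython code from this PR: the function names, the signatures, and the set of supported widths (4/8 bytes for floats, 1/2/4/8 bytes for signed integers) are assumptions made for illustration.

```python
import struct

# Assumed width-to-format mappings, for illustration only.
_FLOAT_FORMATS = {4: "f", 8: "d"}
_INT_FORMATS = {1: "b", 2: "h", 4: "i", 8: "q"}


def read_float(buf: bytes, offset: int, width: int, byte_order: str = "<") -> float:
    """Unpack an IEEE 754 float of `width` bytes starting at `offset`.

    `byte_order` is "<" for little endian or ">" for big endian.
    """
    try:
        fmt = _FLOAT_FORMATS[width]
    except KeyError:
        raise ValueError(f"invalid float width: {width}") from None
    return struct.unpack_from(byte_order + fmt, buf, offset)[0]


def read_int(buf: bytes, offset: int, width: int, byte_order: str = "<") -> int:
    """Unpack a signed integer of `width` bytes starting at `offset`."""
    try:
        fmt = _INT_FORMATS[width]
    except KeyError:
        raise ValueError(f"invalid int width: {width}") from None
    return struct.unpack_from(byte_order + fmt, buf, offset)[0]


# The same two bytes decode differently depending on the byte order.
little = read_int(b"\x01\x00", 0, 2, "<")
big = read_int(b"\x01\x00", 0, 2, ">")
```

Calls like these run once per field, so the Python-level call overhead adds up quickly. That is a plausible reason why moving them into Cython pays off.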

I also did a larger refactoring of how the Cythonized parts of the parser interact with the Python parts: it is now explicit where variables are synchronized between the two. I renamed the Cython parser class to SAS7BDATCythonReader to reflect that it is essentially an extension of the Python SAS7BDATReader that lives in a .pyx file for technical reasons.

Other improvements:

Remove use of NumPy arrays for RDC/RLE decompression
Speed up parsing of strings with a lot of trailing zero bytes
Remove some dead code
Some cleanup
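
The first two items can be illustrated with a small, hedged sketch. The decoder below uses a toy run-length format (pairs of count and value bytes), not the actual SAS RLE or RDC encoding, and both function names are hypothetical. It only shows the general idea of filling a preallocated bytearray instead of a NumPy array, and of stripping trailing padding from a string field in one call.

```python
def rle_decode_toy(data: bytes, result_length: int) -> bytes:
    """Decode a toy run-length format into a buffer of `result_length` bytes.

    The input is a sequence of (count, value) byte pairs. This is NOT the
    SAS RLE format; it only demonstrates the bytearray-based output buffer.
    """
    if len(data) % 2:
        raise ValueError("input must consist of (count, value) byte pairs")
    out = bytearray(result_length)  # preallocated and zero-filled
    pos = 0
    for i in range(0, len(data), 2):
        count, value = data[i], data[i + 1]
        end = pos + count
        if end > result_length:
            raise ValueError("decoded data exceeds result_length")
        if value:  # runs of zero bytes are already in place
            out[pos:end] = bytes([value]) * count
        pos = end
    return bytes(out)


def strip_trailing_padding(raw: bytes) -> bytes:
    """Drop trailing zero bytes and spaces from a fixed-width string field."""
    return raw.rstrip(b"\x00 ")


decoded = rle_decode_toy(b"\x03A\x02\x00\x01B", 6)
stripped = strip_trailing_padding(b"abc  \x00\x00")
```

Whether the PR strips padding with rstrip or with a hand-written Cython loop is not stated in the description; the sketch only shows the behaviour such a step needs to have.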

jonashaag · 2022-05-23T09:14:51Z

@xhochy

jonashaag · 2022-05-27T22:18:51Z

@xhochy now with 20x speedup \o/

jonashaag force-pushed the fast-sas7bdat branch from ae8d7f2 to d9f8e0d on May 23, 2022 09:14

jonashaag force-pushed the fast-sas7bdat branch from 091dd62 to 396a76a on May 27, 2022 22:16

jonashaag force-pushed the fast-sas7bdat branch 2 times, most recently from 73b74d9 to 41430ff on June 12, 2022 21:16

jbrockmendel and others added 6 commits June 13, 2022 10:25

ENH: DTA.to_period support non-nano (pandas-dev#47324)

25749d2

* ENH: DTA.to_period support non-nano * update test

Add test for multi-column dtype assignment (pandas-dev#47323)

f042800

* add test * add blank line * fix

ENH: Timestamp.normalize support non-nano (pandas-dev#47316)

28d5b01

BUG: DateOffset addition preserve non-nano (pandas-dev#47334)

8de88ff

ENH: Timestamp +- timedeltalike scalar support non-nano (pandas-dev#4…

b74dc5c

…7313) * ENH: Timestamp +- timedeltalike scalar support non-nano * catch and re-raise OverflowError

ENH: Timestamp.replace support non-nano (pandas-dev#47312)

87da500

jonashaag mentioned this pull request Jun 13, 2022

Meta issue: SAS7BDAT parser improvements pandas-dev/pandas#47339

Open

dataxerik and others added 17 commits June 13, 2022 16:32

ENH: Move UndefinedVariableError to error/__init__.py per GH27656 (pa…

f40203c

…ndas-dev#47338)

ENH: get_resolution support non-nano (pandas-dev#47322)

a8d8ae7

* ENH: get_resolution support non-nano * tzaware case

ENH: Timestamp.tz_convert support non-nano (pandas-dev#47320)

38a7d29

TYP: Series.quantile (pandas-dev#47304)

830130a

* TYP: Series.quantile * common.py

TYP: plotting._matplotlib (pandas-dev#47311)

696e9bd

* TYP: plotting._matplotlib * somehow super causes issues * fix pickle issue: was accessing _kind on class * and the last plotting file * add timedelta

ENH: Timestamp pickle support non-nano tzaware (pandas-dev#47340)

007bf4a

ENH: preserve non-nano DTA/TDA in Index/Series/DataFrame (pandas-dev#…

f600fd4

…47230) * ENH: preserve non-nano DTA/TDA in Index/Series/DataFrame * tighten xfail * _prep_ndarray->_prep_ndarraylike * xfail non-strict

ENH: Timestamp.__sub__(datetimelike) support non-nano (pandas-dev#47346)

f7be58a

* ENH: Timestamp +- timedeltalike scalar support non-nano * ENH: Timestamp.__sub__(datetime) with non-nano * better exception message

REGR: Fix nan comparison for same Index object (pandas-dev#47326)

7310d90

REGR: Avoid regression warning with ea dtype and assert_index_equal o…

7c6a76a

…rder False (pandas-dev#47325)

REGR: MultiIndex.dtypes has regular Index instead of MultiIndex index (…

6f0be79

…pandas-dev#47349)

ENH: Move IndexingError to error/__init__.py per GH27656 (pandas-dev#…

4dfe48f

…47357)

twoertwein and others added 29 commits July 18, 2022 10:17

TYP: Appender also works with properties (pandas-dev#47768)

0a26cdd

PERF: operations with zoneinfo tzinfos (pandas-dev#47767)

b731518

TYP: a few mismatches found by stubtest (pandas-dev#47764)

87930ef

* TYP: a few mismatches found by stubtest * a few more * flake8

TYP: def validate_* (pandas-dev#47750)

089f7f8

TYP: freq and na_value (pandas-dev#47729)

efd15b7

* TYP: freq and na_value * _simple_new

BUG: fix regression in Series[string] setitem setting a scalar with a…

1b1dd36

… mask (pandas-dev#47763) * BUG: fix regression in Series[string] setitem setting a scalar with a mask * expand test for non-string value

BUG: PeriodIndex + TimedeltaArray-with-NaT (pandas-dev#47783)

9f5c8b9

* BUG: PeriodIndex-with-Nat + TimedeltaArray * mypy fixup * de-kludge

DOC: Fix versionadded for callable in on_bad_lines (pandas-dev#47792)

f7e0e68

WEB: Update to jupyterlite==0.1.0b10 (pandas-dev#47532)

187636f

* Update to `jupyterlite==0.1.0b10` * Update requirements-dev.txt

REF: re-use convert_reso (pandas-dev#47807)

caf261f

* REF: re-use convert_reso * typo fixup

TYP: Column.null_count is a Python int (pandas-dev#47804)

060ce49

BUG: Series map ignoring na_action for dict or series mapper (pandas-…

bd31d64

…dev#47585) * BUG: Series map ignoring na_action for dict or series mapper * Add tests

ENH: Timestamp.min/max/resolution support non-nano (pandas-dev#47720)

96b036c

ENH/TST: Add BaseUnaryOpsTests tests for ArrowExtensionArray (pandas-…

433dcd5

…dev#47711)

ENH/TST: Add isin, _hasna for ArrowExtensionArray (pandas-dev#47805)

8f04a8e

ENH/TST: Add Reduction tests for ArrowExtensionArray (pandas-dev#47730)

bedd8f0

BUG: fix SparseArray.unique IndexError and _first_fill_value_loc algo (…

3d94f7a

…pandas-dev#47810) * Update array.py * Update test_array.py * Update array.py * fix format * Update v1.5.0.rst * fix number

DOC: Updating some capitalization in doc/source/user_guide pandas-dev…

d8bb752

…#32550 (pandas-dev#47732)

TST/CI: xfail test_round_sanity for 32 bit (pandas-dev#47803)

8c7b0b2

* TST/CI: xfail test_round_sanity for 32 bit * xfail generally * minimize diff * minimize again * strict=False

WEB: Update sponsors in website (pandas-dev#47678)

a62897a

Speed up SAS7BDAT parser

7cd9b82

jonashaag force-pushed the fast-sas7bdat branch from 41430ff to 7cd9b82 on July 25, 2022 13:49
