diff --git a/.github/CODE_OF_CONDUCT.md b/.github/CODE_OF_CONDUCT.md deleted file mode 100644 index 87a5b7905fc6d..0000000000000 --- a/.github/CODE_OF_CONDUCT.md +++ /dev/null @@ -1,62 +0,0 @@ -# Contributor Code of Conduct - -As contributors and maintainers of this project, and in the interest of -fostering an open and welcoming community, we pledge to respect all people who -contribute through reporting issues, posting feature requests, updating -documentation, submitting pull requests or patches, and other activities. - -We are committed to making participation in this project a harassment-free -experience for everyone, regardless of level of experience, gender, gender -identity and expression, sexual orientation, disability, personal appearance, -body size, race, ethnicity, age, religion, or nationality. - -Examples of unacceptable behavior by participants include: - -* The use of sexualized language or imagery -* Personal attacks -* Trolling or insulting/derogatory comments -* Public or private harassment -* Publishing other's private information, such as physical or electronic - addresses, without explicit permission -* Other unethical or unprofessional conduct - -Project maintainers have the right and responsibility to remove, edit, or -reject comments, commits, code, wiki edits, issues, and other contributions -that are not aligned to this Code of Conduct, or to ban temporarily or -permanently any contributor for other behaviors that they deem inappropriate, -threatening, offensive, or harmful. - -By adopting this Code of Conduct, project maintainers commit themselves to -fairly and consistently applying these principles to every aspect of managing -this project. Project maintainers who do not follow or enforce the Code of -Conduct may be permanently removed from the project team. - -This Code of Conduct applies both within project spaces and in public spaces -when an individual is representing the project or its community. - -A working group of community members is committed to promptly addressing any -reported issues. The working group is made up of pandas contributors and users. -Instances of abusive, harassing, or otherwise unacceptable behavior may be -reported by contacting the working group by e-mail (pandas-coc@googlegroups.com). -Messages sent to this e-mail address will not be publicly visible but only to -the working group members. The working group currently includes - -- Safia Abdalla -- Tom Augspurger -- Joris Van den Bossche -- Camille Scott -- Nathaniel Smith - -All complaints will be reviewed and investigated and will result in a response -that is deemed necessary and appropriate to the circumstances. Maintainers are -obligated to maintain confidentiality with regard to the reporter of an -incident. - -This Code of Conduct is adapted from the [Contributor Covenant][homepage], -version 1.3.0, available at -[https://www.contributor-covenant.org/version/1/3/0/][version], -and the [Swift Code of Conduct][swift]. - -[homepage]: https://www.contributor-covenant.org -[version]: https://www.contributor-covenant.org/version/1/3/0/ -[swift]: https://swift.org/community/#code-of-conduct diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md deleted file mode 100644 index d27eab5b9c95c..0000000000000 --- a/.github/CONTRIBUTING.md +++ /dev/null @@ -1,3 +0,0 @@ -# Contributing to pandas - -A detailed overview on how to contribute can be found in the **[contributing guide](https://pandas.pydata.org/docs/dev/development/contributing.html)**. 
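The hunks below replace the Markdown feature-request template with a GitHub issue form (`feature_request.yaml`). For readers unfamiliar with the issue-form schema, here is a minimal sketch of how such a form is put together; the field names, labels, and prompts in it are invented for illustration and are not taken from the pandas template.

```yaml
# Minimal GitHub issue form (illustrative sketch only, not the pandas template).
# Issue forms live in .github/ISSUE_TEMPLATE/ and use a .yaml or .yml extension.
name: Bug Report              # shown in the "New issue" chooser
description: File a bug report
title: "BUG: "                # pre-filled issue title
labels: [bug]                 # labels applied automatically on submission

body:
  - type: textarea            # free-form text field
    id: what-happened
    attributes:
      label: What happened?
      description: Also tell us what you expected to happen.
    validations:
      required: true          # GitHub blocks submission until this is filled in
  - type: checkboxes
    id: checks
    attributes:
      label: Checks
      options:
        - label: I have searched the existing issues
          required: true      # for checkboxes, "required" goes on each option
```

Unlike the Markdown template it replaces, an issue form enforces required fields at submission time instead of relying on reporters to fill in the headings by hand.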
diff --git a/.github/FUNDING.yml b/.github/FUNDING.yml deleted file mode 100644 index 27dfded808b95..0000000000000 --- a/.github/FUNDING.yml +++ /dev/null @@ -1,3 +0,0 @@ -custom: https://pandas.pydata.org/donate.html -github: [numfocus] -tidelift: pypi/pandas diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md deleted file mode 100644 index 0c30b941bc520..0000000000000 --- a/.github/ISSUE_TEMPLATE/feature_request.md +++ /dev/null @@ -1,33 +0,0 @@ ---- - -name: Feature Request -about: Suggest an idea for pandas -title: "ENH:" -labels: "Enhancement, Needs Triage" - ---- - -#### Is your feature request related to a problem? - -[this should provide a description of what the problem is, e.g. "I wish I could use pandas to do [...]"] - -#### Describe the solution you'd like - -[this should provide a description of the feature request, e.g. "`DataFrame.foo` should get a new parameter `bar` that [...]", try to write a docstring for the desired feature] - -#### API breaking implications - -[this should provide a description of how this feature will affect the API] - -#### Describe alternatives you've considered - -[this should provide a description of any alternative solutions or features you've considered] - -#### Additional context - -[add any other context, code examples, or references to existing implementations about the feature request here] - -```python -# Your code here, if applicable - -``` diff --git a/.github/ISSUE_TEMPLATE/feature_request.yaml b/.github/ISSUE_TEMPLATE/feature_request.yaml new file mode 100644 index 0000000000000..f837eb1ca5bb7 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature_request.yaml @@ -0,0 +1,72 @@ +name: Feature Request +description: Suggest an idea for pandas +title: "ENH: " +labels: [Enhancement, Needs Triage] + +body: + - type: checkboxes + id: checks + attributes: + label: Feature Type + description: Please check what type of feature request you would like to propose. + options: + - label: > + Adding new functionality to pandas + - label: > + Changing existing functionality in pandas + - label: > + Removing existing functionality in pandas + - type: textarea + id: description + attributes: + label: Problem Description + description: > + Please describe what problem the feature would solve, e.g. "I wish I could use pandas to ..." + placeholder: > + I wish I could use pandas to return a Series from a DataFrame when possible. + validations: + required: true + - type: textarea + id: feature + attributes: + label: Feature Description + description: > + Please describe how the new feature would be implemented, using pseudocode if relevant. + placeholder: > + Add a new parameter to DataFrame, to_series, to return a Series if possible. + + def __init__(self, ..., to_series: bool=False): + """ + Parameters + ---------- + ... + + to_series : bool, default False + Return a Series if possible + """ + if to_series: + return Series(data) + validations: + required: true + - type: textarea + id: alternative + attributes: + label: Alternative Solutions + description: > + Please describe any alternative solution (existing functionality, 3rd party package, etc.) + that would satisfy the feature request. + placeholder: > + Write a custom function to return a Series when possible. + + def to_series(...) + result = pd.DataFrame(...) + ...
+ validations: + required: true + - type: textarea + id: context + attributes: + label: Additional Context + description: > + Please provide any relevant Github issues, code examples or references that help describe and support + the feature request. diff --git a/.github/SECURITY.md b/.github/SECURITY.md deleted file mode 100644 index f3b059a5d4f13..0000000000000 --- a/.github/SECURITY.md +++ /dev/null @@ -1 +0,0 @@ -To report a security vulnerability to pandas, please go to https://tidelift.com/security and see the instructions there. diff --git a/.github/actions/build_pandas/action.yml b/.github/actions/build_pandas/action.yml index 5e5a3bdf0f024..23bb988ef4d73 100644 --- a/.github/actions/build_pandas/action.yml +++ b/.github/actions/build_pandas/action.yml @@ -6,8 +6,8 @@ runs: - name: Environment Detail run: | - conda info - conda list + micromamba info + micromamba list shell: bash -el {0} - name: Build Pandas @@ -17,4 +17,6 @@ runs: shell: bash -el {0} env: # Cannot use parallel compilation on Windows, see https://github.com/pandas-dev/pandas/issues/30873 - N_JOBS: ${{ runner.os == 'Windows' && 1 || 2 }} + # GH 47305: Parallel build causes flaky ImportError: /home/runner/work/pandas/pandas/pandas/_libs/tslibs/timestamps.cpython-38-x86_64-linux-gnu.so: undefined symbol: pandas_datetime_to_datetimestruct + N_JOBS: 1 + #N_JOBS: ${{ runner.os == 'Windows' && 1 || 2 }} diff --git a/.github/actions/run-tests/action.yml b/.github/actions/run-tests/action.yml new file mode 100644 index 0000000000000..2a7601f196ec4 --- /dev/null +++ b/.github/actions/run-tests/action.yml @@ -0,0 +1,27 @@ +name: Run tests and report results +runs: + using: composite + steps: + - name: Test + run: ci/run_tests.sh + shell: bash -el {0} + + - name: Publish test results + uses: actions/upload-artifact@v2 + with: + name: Test results + path: test-data.xml + if: failure() + + - name: Report Coverage + run: coverage report -m + shell: bash -el {0} + if: failure() + + - name: Upload coverage to Codecov + uses: codecov/codecov-action@v2 + with: + flags: unittests + name: codecov-pandas + fail_ci_if_error: false + if: failure() diff --git a/.github/actions/setup-conda/action.yml b/.github/actions/setup-conda/action.yml index 87a0bd2ed1715..002d0020c2df1 100644 --- a/.github/actions/setup-conda/action.yml +++ b/.github/actions/setup-conda/action.yml @@ -6,8 +6,8 @@ inputs: environment-name: description: Name to use for the Conda environment default: test - python-version: - description: Python version to install + extra-specs: + description: Extra packages to install required: false pyarrow-version: description: If set, overrides the PyArrow version in the Conda environment to the given string. 
@@ -24,14 +24,13 @@ runs: if: ${{ inputs.pyarrow-version }} - name: Install ${{ inputs.environment-file }} - uses: conda-incubator/setup-miniconda@v2.1.1 + uses: mamba-org/provision-with-micromamba@v12 with: environment-file: ${{ inputs.environment-file }} - activate-environment: ${{ inputs.environment-name }} - python-version: ${{ inputs.python-version }} - channel-priority: ${{ runner.os == 'macOS' && 'flexible' || 'strict' }} + environment-name: ${{ inputs.environment-name }} + extra-specs: ${{ inputs.extra-specs }} channels: conda-forge - mamba-version: "0.24" - use-mamba: true - use-only-tar-bz2: true + channel-priority: ${{ runner.os == 'macOS' && 'flexible' || 'strict' }} condarc-file: ci/condarc.yml + cache-env: true + cache-downloads: true diff --git a/.github/workflows/32-bit-linux.yml b/.github/workflows/32-bit-linux.yml index be894e6a5a63e..e091160c952f8 100644 --- a/.github/workflows/32-bit-linux.yml +++ b/.github/workflows/32-bit-linux.yml @@ -12,6 +12,9 @@ on: paths-ignore: - "doc/**" +permissions: + contents: read + jobs: pytest: runs-on: ubuntu-latest diff --git a/.github/workflows/assign.yml b/.github/workflows/assign.yml index a1812843b1a8f..b7bb8db549f86 100644 --- a/.github/workflows/assign.yml +++ b/.github/workflows/assign.yml @@ -3,8 +3,14 @@ on: issue_comment: types: created +permissions: + contents: read + jobs: issue_assign: + permissions: + issues: write + pull-requests: write runs-on: ubuntu-latest steps: - if: github.event.comment.body == 'take' diff --git a/.github/workflows/asv-bot.yml b/.github/workflows/asv-bot.yml index 022c12cf6ff6c..abb19a95315b6 100644 --- a/.github/workflows/asv-bot.yml +++ b/.github/workflows/asv-bot.yml @@ -9,8 +9,15 @@ env: ENV_FILE: environment.yml COMMENT: ${{github.event.comment.body}} +permissions: + contents: read + jobs: autotune: + permissions: + contents: read + issues: write + pull-requests: write name: "Run benchmarks" # TODO: Support more benchmarking options later, against different branches, against self, etc if: startsWith(github.event.comment.body, '@github-actions benchmark') @@ -33,12 +40,6 @@ jobs: with: fetch-depth: 0 - - name: Cache conda - uses: actions/cache@v3 - with: - path: ~/conda_pkgs_dir - key: ${{ runner.os }}-conda-${{ hashFiles('${{ env.ENV_FILE }}') }} - # Although asv sets up its own env, deps are still needed # during discovery process - name: Set up Conda diff --git a/.github/workflows/autoupdate-pre-commit-config.yml b/.github/workflows/autoupdate-pre-commit-config.yml index d2eac234ca361..9a41871c26062 100644 --- a/.github/workflows/autoupdate-pre-commit-config.yml +++ b/.github/workflows/autoupdate-pre-commit-config.yml @@ -5,8 +5,14 @@ on: - cron: "0 7 1 * *" # At 07:00 on 1st of every month. 
workflow_dispatch: +permissions: + contents: read + jobs: update-pre-commit: + permissions: + contents: write # for technote-space/create-pr-action to push code + pull-requests: write # for technote-space/create-pr-action to create a PR if: github.repository_owner == 'pandas-dev' name: Autoupdate pre-commit config runs-on: ubuntu-latest diff --git a/.github/workflows/code-checks.yml b/.github/workflows/code-checks.yml index 96088547634c5..09c603f347d4c 100644 --- a/.github/workflows/code-checks.yml +++ b/.github/workflows/code-checks.yml @@ -14,6 +14,9 @@ env: ENV_FILE: environment.yml PANDAS_CI: 1 +permissions: + contents: read + jobs: pre_commit: name: pre-commit @@ -52,12 +55,6 @@ jobs: with: fetch-depth: 0 - - name: Cache conda - uses: actions/cache@v3 - with: - path: ~/conda_pkgs_dir - key: ${{ runner.os }}-conda-${{ hashFiles('${{ env.ENV_FILE }}') }} - - name: Set up Conda uses: ./.github/actions/setup-conda @@ -65,37 +62,39 @@ jobs: id: build uses: ./.github/actions/build_pandas + # The following checks are independent of each other and should still be run if one fails - name: Check for no warnings when building single-page docs run: ci/code_checks.sh single-docs - if: ${{ steps.build.outcome == 'success' }} + if: ${{ steps.build.outcome == 'success' && always() }} - name: Run checks on imported code run: ci/code_checks.sh code - if: ${{ steps.build.outcome == 'success' }} + if: ${{ steps.build.outcome == 'success' && always() }} - name: Run doctests run: ci/code_checks.sh doctests - if: ${{ steps.build.outcome == 'success' }} + if: ${{ steps.build.outcome == 'success' && always() }} - name: Run docstring validation run: ci/code_checks.sh docstrings - if: ${{ steps.build.outcome == 'success' }} + if: ${{ steps.build.outcome == 'success' && always() }} - name: Use existing environment for type checking run: | echo $PATH >> $GITHUB_PATH echo "PYTHONHOME=$PYTHONHOME" >> $GITHUB_ENV echo "PYTHONPATH=$PYTHONPATH" >> $GITHUB_ENV + if: ${{ steps.build.outcome == 'success' && always() }} - name: Typing uses: pre-commit/action@v2.0.3 with: extra_args: --hook-stage manual --all-files - if: ${{ steps.build.outcome == 'success' }} + if: ${{ steps.build.outcome == 'success' && always() }} - name: Run docstring validation script tests run: pytest scripts - if: ${{ steps.build.outcome == 'success' }} + if: ${{ steps.build.outcome == 'success' && always() }} asv-benchmarks: name: ASV Benchmarks @@ -115,12 +114,6 @@ jobs: with: fetch-depth: 0 - - name: Cache conda - uses: actions/cache@v3 - with: - path: ~/conda_pkgs_dir - key: ${{ runner.os }}-conda-${{ hashFiles('${{ env.ENV_FILE }}') }} - - name: Set up Conda uses: ./.github/actions/setup-conda @@ -157,3 +150,32 @@ jobs: - name: Build image run: docker build --pull --no-cache --tag pandas-dev-env . 
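A note on the `always()` guards added to the code-checks job above: by default a workflow step is skipped as soon as an earlier step fails, so a chain of independent checks would stop at the first failure. Combining `always()` with a test on the build step's outcome keeps the remaining checks running after one of them fails, while still skipping them all when the build itself broke. A minimal standalone sketch of the same pattern follows; the workflow, job, and script names in it are made up for illustration and are not part of this diff.

```yaml
# Sketch: run every check even if an earlier check failed, but skip them all
# when the build step itself failed.
name: independent-checks
on: push

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - name: Build
        id: build
        run: ./build.sh                 # hypothetical build script

      - name: Check A
        run: ./check_a.sh               # hypothetical check
        # always() overrides the default "skip after a failure" behavior;
        # the outcome test still skips this step when the build failed.
        if: ${{ steps.build.outcome == 'success' && always() }}

      - name: Check B
        run: ./check_b.sh               # still runs even if "Check A" failed
        if: ${{ steps.build.outcome == 'success' && always() }}
```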
+ + requirements-dev-text-installable: + name: Test install requirements-dev.txt + runs-on: ubuntu-latest + + concurrency: + # https://github.community/t/concurrecy-not-work-for-push/183068/7 + group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-requirements-dev-text-installable + cancel-in-progress: true + + steps: + - name: Checkout + uses: actions/checkout@v3 + with: + fetch-depth: 0 + + - name: Setup Python + id: setup_python + uses: actions/setup-python@v3 + with: + python-version: '3.8' + cache: 'pip' + cache-dependency-path: 'requirements-dev.txt' + + - name: Install requirements-dev.txt + run: pip install -r requirements-dev.txt + + - name: Check Pip Cache Hit + run: echo ${{ steps.setup_python.outputs.cache-hit }} diff --git a/.github/workflows/comment_bot.yml b/.github/workflows/comment_bot.yml deleted file mode 100644 index 3824e015e8336..0000000000000 --- a/.github/workflows/comment_bot.yml +++ /dev/null @@ -1,40 +0,0 @@ -name: Comment-bot - -on: - issue_comment: - types: - - created - - edited - -jobs: - autotune: - name: "Fixup pre-commit formatting" - if: startsWith(github.event.comment.body, '@github-actions pre-commit') - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v3 - - uses: r-lib/actions/pr-fetch@v2 - with: - repo-token: ${{ secrets.GITHUB_TOKEN }} - - name: Cache multiple paths - uses: actions/cache@v3 - with: - path: | - ~/.cache/pre-commit - ~/.cache/pip - key: pre-commit-dispatched-${{ runner.os }}-build - - uses: actions/setup-python@v3 - with: - python-version: 3.8 - - name: Install-pre-commit - run: python -m pip install --upgrade pre-commit - - name: Run pre-commit - run: pre-commit run --from-ref=origin/main --to-ref=HEAD --all-files || (exit 0) - - name: Commit results - run: | - git config user.name "$(git log -1 --pretty=format:%an)" - git config user.email "$(git log -1 --pretty=format:%ae)" - git commit -a -m 'Fixes from pre-commit [automated commit]' || echo "No changes to commit" - - uses: r-lib/actions/pr-push@v2 - with: - repo-token: ${{ secrets.GITHUB_TOKEN }} diff --git a/.github/workflows/docbuild-and-upload.yml b/.github/workflows/docbuild-and-upload.yml index 5ffd4135802bd..626bf7828e032 100644 --- a/.github/workflows/docbuild-and-upload.yml +++ b/.github/workflows/docbuild-and-upload.yml @@ -14,6 +14,9 @@ env: ENV_FILE: environment.yml PANDAS_CI: 1 +permissions: + contents: read + jobs: web_and_docs: name: Doc Build and Upload @@ -46,6 +49,11 @@ jobs: - name: Build documentation run: doc/make.py --warnings-are-errors + - name: Build the interactive terminal + run: | + cd web/interactive_terminal + jupyter lite build + - name: Install ssh key run: | mkdir -m 700 -p ~/.ssh diff --git a/.github/workflows/macos-windows.yml b/.github/workflows/macos-windows.yml index 26e6c8699ca64..e9503a2486560 100644 --- a/.github/workflows/macos-windows.yml +++ b/.github/workflows/macos-windows.yml @@ -15,16 +15,18 @@ on: env: PANDAS_CI: 1 PYTEST_TARGET: pandas - PYTEST_WORKERS: auto PATTERN: "not slow and not db and not network and not single_cpu" +permissions: + contents: read + jobs: pytest: defaults: run: shell: bash -el {0} - timeout-minutes: 90 + timeout-minutes: 120 strategy: matrix: os: [macos-latest, windows-latest] @@ -36,6 +38,9 @@ jobs: # https://github.community/t/concurrecy-not-work-for-push/183068/7 group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.env_file }}-${{ matrix.os }} cancel-in-progress: true + env: + # GH 47443: PYTEST_WORKERS > 1 crashes Windows builds with 
memory related errors + PYTEST_WORKERS: ${{ matrix.os == 'macos-latest' && 'auto' || '1' }} steps: - name: Checkout @@ -53,18 +58,4 @@ jobs: uses: ./.github/actions/build_pandas - name: Test - run: ci/run_tests.sh - - - name: Publish test results - uses: actions/upload-artifact@v3 - with: - name: Test results - path: test-data.xml - if: failure() - - - name: Upload coverage to Codecov - uses: codecov/codecov-action@v2 - with: - flags: unittests - name: codecov-pandas - fail_ci_if_error: false + uses: ./.github/actions/run-tests diff --git a/.github/workflows/python-dev.yml b/.github/workflows/python-dev.yml index 753e288f5e391..d93b92a9662ec 100644 --- a/.github/workflows/python-dev.yml +++ b/.github/workflows/python-dev.yml @@ -27,6 +27,9 @@ env: COVERAGE: true PYTEST_TARGET: pandas +permissions: + contents: read + jobs: build: if: false # Comment this line out to "unfreeze" @@ -57,40 +60,20 @@ jobs: - name: Install dependencies shell: bash -el {0} run: | - python -m pip install --upgrade pip setuptools wheel - pip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy - pip install git+https://github.com/nedbat/coveragepy.git - pip install cython python-dateutil pytz hypothesis pytest>=6.2.5 pytest-xdist pytest-cov - pip list + python3 -m pip install --upgrade pip setuptools wheel + python3 -m pip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy + python3 -m pip install git+https://github.com/nedbat/coveragepy.git + python3 -m pip install cython python-dateutil pytz hypothesis pytest>=6.2.5 pytest-xdist pytest-cov pytest-asyncio>=0.17 + python3 -m pip list - name: Build Pandas run: | - python setup.py build_ext -q -j2 - python -m pip install -e . --no-build-isolation --no-use-pep517 + python3 setup.py build_ext -q -j2 + python3 -m pip install -e . 
--no-build-isolation --no-use-pep517 - name: Build Version run: | - python -c "import pandas; pandas.show_versions();" + python3 -c "import pandas; pandas.show_versions();" - - name: Test with pytest - shell: bash -el {0} - run: | - ci/run_tests.sh - - - name: Publish test results - uses: actions/upload-artifact@v3 - with: - name: Test results - path: test-data.xml - if: failure() - - - name: Report Coverage - run: | - coverage report -m - - - name: Upload coverage to Codecov - uses: codecov/codecov-action@v2 - with: - flags: unittests - name: codecov-pandas - fail_ci_if_error: true + - name: Test + uses: ./.github/actions/run-tests diff --git a/.github/workflows/sdist.yml b/.github/workflows/sdist.yml index 5ae2280c5069f..2e1ffe6d0d17e 100644 --- a/.github/workflows/sdist.yml +++ b/.github/workflows/sdist.yml @@ -13,6 +13,9 @@ on: paths-ignore: - "doc/**" +permissions: + contents: read + jobs: build: if: ${{ github.event.label.name == 'Build' || contains(github.event.pull_request.labels.*.name, 'Build') || github.event_name == 'push'}} @@ -62,9 +65,10 @@ jobs: - name: Set up Conda uses: ./.github/actions/setup-conda with: - environment-file: "" + environment-file: false environment-name: pandas-sdist - python-version: ${{ matrix.python-version }} + extra-specs: | + python =${{ matrix.python-version }} - name: Install pandas from sdist run: | diff --git a/.github/workflows/stale-pr.yml b/.github/workflows/stale-pr.yml index b97b60717a2b8..69656be18a8b1 100644 --- a/.github/workflows/stale-pr.yml +++ b/.github/workflows/stale-pr.yml @@ -4,8 +4,13 @@ on: # * is a special character in YAML so you have to quote this string - cron: "0 0 * * *" +permissions: + contents: read + jobs: stale: + permissions: + pull-requests: write runs-on: ubuntu-latest steps: - uses: actions/stale@v4 diff --git a/.github/workflows/posix.yml b/.github/workflows/ubuntu.yml similarity index 86% rename from .github/workflows/posix.yml rename to .github/workflows/ubuntu.yml index 061b2b361ca62..a759280c74521 100644 --- a/.github/workflows/posix.yml +++ b/.github/workflows/ubuntu.yml @@ -1,4 +1,4 @@ -name: Posix +name: Ubuntu on: push: @@ -15,6 +15,9 @@ on: env: PANDAS_CI: 1 +permissions: + contents: read + jobs: pytest: runs-on: ubuntu-latest @@ -134,18 +137,9 @@ jobs: with: fetch-depth: 0 - - name: Cache conda - uses: actions/cache@v3 - env: - CACHE_NUMBER: 0 - with: - path: ~/conda_pkgs_dir - key: ${{ runner.os }}-conda-${{ env.CACHE_NUMBER }}-${{ - hashFiles('${{ env.ENV_FILE }}') }} - - name: Extra installs # xsel for clipboard tests - run: sudo apt-get update && sudo apt-get install -y libc6-dev-i386 xsel ${{ env.EXTRA_APT }} + run: sudo apt-get update && sudo apt-get install -y xsel ${{ env.EXTRA_APT }} - name: Set up Conda uses: ./.github/actions/setup-conda @@ -157,23 +151,6 @@ jobs: uses: ./.github/actions/build_pandas - name: Test - run: ci/run_tests.sh + uses: ./.github/actions/run-tests # TODO: Don't continue on error for PyPy continue-on-error: ${{ env.IS_PYPY == 'true' }} - - - name: Build Version - run: conda list - - - name: Publish test results - uses: actions/upload-artifact@v3 - with: - name: Test results - path: test-data.xml - if: failure() - - - name: Upload coverage to Codecov - uses: codecov/codecov-action@v2 - with: - flags: unittests - name: codecov-pandas - fail_ci_if_error: false diff --git a/.gitignore b/.gitignore index 87224f1d6060f..07b1f056d511b 100644 --- a/.gitignore +++ b/.gitignore @@ -122,3 +122,7 @@ doc/build/html/index.html doc/tmp.sv env/ doc/source/savefig/ + +# Interactive 
terminal generated files # +######################################## +.jupyterlite.doit.db diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 13e0ecd33359f..0c18a5bba50e8 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -18,7 +18,7 @@ repos: pass_filenames: true require_serial: false - repo: https://github.com/python/black - rev: 22.3.0 + rev: 22.6.0 hooks: - id: black - repo: https://github.com/codespell-project/codespell @@ -27,7 +27,7 @@ repos: - id: codespell types_or: [python, rst, markdown] - repo: https://github.com/pre-commit/pre-commit-hooks - rev: v4.2.0 + rev: v4.3.0 hooks: - id: debug-statements - id: end-of-file-fixer @@ -43,7 +43,7 @@ repos: # from Cython files nor do we want to lint C files that we didn't modify for # this particular codebase (e.g. src/headers, src/klib). However, # we can lint all header files since they aren't "generated" like C files are. - exclude: ^pandas/_libs/src/(klib|headers)/ + exclude: ^pandas/((_libs/src/(klib|headers)/)|(io/sas/portable_endian.h$)) args: [--quiet, '--extensions=c,h', '--headers=h', --recursive, '--filter=-readability/casting,-runtime/int,-build/include_subdir'] - repo: https://github.com/PyCQA/flake8 rev: 4.0.1 @@ -59,7 +59,7 @@ repos: hooks: - id: isort - repo: https://github.com/asottile/pyupgrade - rev: v2.32.1 + rev: v2.34.0 hooks: - id: pyupgrade args: [--py38-plus] @@ -74,7 +74,7 @@ repos: types: [text] # overwrite types: [rst] types_or: [python, rst] - repo: https://github.com/sphinx-contrib/sphinx-lint - rev: v0.6 + rev: v0.6.1 hooks: - id: sphinx-lint - repo: https://github.com/asottile/yesqa @@ -93,9 +93,7 @@ repos: types: [python] stages: [manual] additional_dependencies: &pyright_dependencies - - pyright@1.1.253 -- repo: local - hooks: + - pyright@1.1.258 - id: pyright_reportGeneralTypeIssues name: pyright reportGeneralTypeIssues entry: pyright --skipunannotated -p pyright_reportGeneralTypeIssues.json @@ -105,8 +103,6 @@ repos: types: [python] stages: [manual] additional_dependencies: *pyright_dependencies -- repo: local - hooks: - id: mypy name: mypy entry: mypy @@ -115,8 +111,6 @@ repos: pass_filenames: false types: [python] stages: [manual] -- repo: local - hooks: - id: flake8-rst name: flake8-rst description: Run flake8 on code snippets in docstrings or RST files @@ -229,3 +223,23 @@ repos: entry: python scripts/validate_min_versions_in_sync.py language: python files: ^(ci/deps/actions-.*-minimum_versions\.yaml|pandas/compat/_optional\.py)$ + - id: flake8-pyi + name: flake8-pyi + entry: flake8 --extend-ignore=E301,E302,E305,E701,E704 + types: [pyi] + language: python + additional_dependencies: + - flake8==4.0.1 + - flake8-pyi==22.5.1 + - id: future-annotations + name: import annotations from __future__ + entry: 'from __future__ import annotations' + language: pygrep + args: [--negate] + files: ^pandas/ + types: [python] + exclude: | + (?x) + /(__init__\.py)|(api\.py)|(_version\.py)|(testing\.py)|(conftest\.py)$ + |/tests/ + |/_testing/ diff --git a/CITATION.cff b/CITATION.cff new file mode 100644 index 0000000000000..0161dfa92fdef --- /dev/null +++ b/CITATION.cff @@ -0,0 +1,10 @@ +cff-version: 1.2.0 +title: 'pandas-dev/pandas: Pandas' +message: 'If you use this software, please cite it as below.' 
+authors: + - name: "The pandas development team" +license: BSD-3-Clause +license-url: "https://github.com/pandas-dev/pandas/blob/main/LICENSE" +repository-code: "https://github.com/pandas-dev/pandas" +type: software +url: "https://github.com/pandas-dev/pandas" diff --git a/README.md b/README.md index fc3f988dc6809..aaf63ead9c416 100644 --- a/README.md +++ b/README.md @@ -169,4 +169,4 @@ Or maybe through using pandas you have an idea of your own or are looking for so Feel free to ask questions on the [mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata) or on [Gitter](https://gitter.im/pydata/pandas). -As contributors and maintainers to this project, you are expected to abide by pandas' code of conduct. More information can be found at: [Contributor Code of Conduct](https://github.com/pandas-dev/pandas/blob/main/.github/CODE_OF_CONDUCT.md) +As contributors and maintainers to this project, you are expected to abide by pandas' code of conduct. More information can be found at: [Contributor Code of Conduct](https://github.com/pandas-dev/.github/blob/master/CODE_OF_CONDUCT.md) diff --git a/RELEASE.md b/RELEASE.md index 42cb82dfcf020..344a097a3e81e 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,6 +1,6 @@ Release Notes ============= -The list of changes to Pandas between each release can be found +The list of changes to pandas between each release can be found [here](https://pandas.pydata.org/pandas-docs/stable/whatsnew/index.html). For full details, see the commit logs at https://github.com/pandas-dev/pandas. diff --git a/asv_bench/benchmarks/indexing.py b/asv_bench/benchmarks/indexing.py index 2c0e2e6ca442a..69e3d166943a8 100644 --- a/asv_bench/benchmarks/indexing.py +++ b/asv_bench/benchmarks/indexing.py @@ -157,25 +157,39 @@ def time_boolean_rows_boolean(self): class DataFrameNumericIndexing: - def setup(self): + + params = [ + (Int64Index, UInt64Index, Float64Index), + ("unique_monotonic_inc", "nonunique_monotonic_inc"), + ] + param_names = ["index_dtype", "index_structure"] + + def setup(self, index, index_structure): + N = 10**5 + indices = { + "unique_monotonic_inc": index(range(N)), + "nonunique_monotonic_inc": index( + list(range(55)) + [54] + list(range(55, N - 1)) + ), + } self.idx_dupe = np.array(range(30)) * 99 - self.df = DataFrame(np.random.randn(100000, 5)) + self.df = DataFrame(np.random.randn(N, 5), index=indices[index_structure]) self.df_dup = concat([self.df, 2 * self.df, 3 * self.df]) - self.bool_indexer = [True] * 50000 + [False] * 50000 + self.bool_indexer = [True] * (N // 2) + [False] * (N - N // 2) - def time_iloc_dups(self): + def time_iloc_dups(self, index, index_structure): self.df_dup.iloc[self.idx_dupe] - def time_loc_dups(self): + def time_loc_dups(self, index, index_structure): self.df_dup.loc[self.idx_dupe] - def time_iloc(self): + def time_iloc(self, index, index_structure): self.df.iloc[:100, 0] - def time_loc(self): + def time_loc(self, index, index_structure): self.df.loc[:100, 0] - def time_bool_indexer(self): + def time_bool_indexer(self, index, index_structure): self.df[self.bool_indexer] diff --git a/asv_bench/benchmarks/io/excel.py b/asv_bench/benchmarks/io/excel.py index a2d989e787e0f..5bef28988fc40 100644 --- a/asv_bench/benchmarks/io/excel.py +++ b/asv_bench/benchmarks/io/excel.py @@ -1,13 +1,19 @@ from io import BytesIO import numpy as np -from odf.opendocument import OpenDocumentSpreadsheet -from odf.table 
import ( - Table, - TableCell, - TableRow, -) -from odf.text import P + +try: + from odf.opendocument import OpenDocumentSpreadsheet + from odf.table import ( + Table, + TableCell, + TableRow, + ) + from odf.text import P + + have_odf = True +except ModuleNotFoundError: + have_odf = False from pandas import ( DataFrame, @@ -47,6 +53,25 @@ def time_write_excel(self, engine): writer.save() +class WriteExcelStyled: + params = ["openpyxl", "xlsxwriter"] + param_names = ["engine"] + + def setup(self, engine): + self.df = _generate_dataframe() + + def time_write_excel_style(self, engine): + bio = BytesIO() + bio.seek(0) + writer = ExcelWriter(bio, engine=engine) + df_style = self.df.style + df_style.applymap(lambda x: "border: red 1px solid;") + df_style.applymap(lambda x: "color: blue") + df_style.applymap(lambda x: "border-color: green black", subset=["float1"]) + df_style.to_excel(writer, sheet_name="Sheet1") + writer.save() + + class ReadExcel: params = ["xlrd", "openpyxl", "odf"] diff --git a/asv_bench/benchmarks/io/sql.py b/asv_bench/benchmarks/io/sql.py index 3cfa28de78c90..fb8b7dafa0ade 100644 --- a/asv_bench/benchmarks/io/sql.py +++ b/asv_bench/benchmarks/io/sql.py @@ -39,6 +39,8 @@ def setup(self, connection): index=tm.makeStringIndex(N), ) self.df.loc[1000:3000, "float_with_nan"] = np.nan + self.df["date"] = self.df["datetime"].dt.date + self.df["time"] = self.df["datetime"].dt.time self.df["datetime_string"] = self.df["datetime"].astype(str) self.df.to_sql(self.table_name, self.con, if_exists="replace") @@ -53,7 +55,16 @@ class WriteSQLDtypes: params = ( ["sqlalchemy", "sqlite"], - ["float", "float_with_nan", "string", "bool", "int", "datetime"], + [ + "float", + "float_with_nan", + "string", + "bool", + "int", + "date", + "time", + "datetime", + ], ) param_names = ["connection", "dtype"] @@ -78,6 +89,8 @@ def setup(self, connection, dtype): index=tm.makeStringIndex(N), ) self.df.loc[1000:3000, "float_with_nan"] = np.nan + self.df["date"] = self.df["datetime"].dt.date + self.df["time"] = self.df["datetime"].dt.time self.df["datetime_string"] = self.df["datetime"].astype(str) self.df.to_sql(self.table_name, self.con, if_exists="replace") @@ -105,6 +118,8 @@ def setup(self): index=tm.makeStringIndex(N), ) self.df.loc[1000:3000, "float_with_nan"] = np.nan + self.df["date"] = self.df["datetime"].dt.date + self.df["time"] = self.df["datetime"].dt.time self.df["datetime_string"] = self.df["datetime"].astype(str) self.df.to_sql(self.table_name, self.con, if_exists="replace") @@ -122,7 +137,16 @@ def time_read_sql_table_parse_dates(self): class ReadSQLTableDtypes: - params = ["float", "float_with_nan", "string", "bool", "int", "datetime"] + params = [ + "float", + "float_with_nan", + "string", + "bool", + "int", + "date", + "time", + "datetime", + ] param_names = ["dtype"] def setup(self, dtype): @@ -141,6 +165,8 @@ def setup(self, dtype): index=tm.makeStringIndex(N), ) self.df.loc[1000:3000, "float_with_nan"] = np.nan + self.df["date"] = self.df["datetime"].dt.date + self.df["time"] = self.df["datetime"].dt.time self.df["datetime_string"] = self.df["datetime"].astype(str) self.df.to_sql(self.table_name, self.con, if_exists="replace") diff --git a/asv_bench/benchmarks/strftime.py b/asv_bench/benchmarks/strftime.py new file mode 100644 index 0000000000000..ac1b7f65d2d90 --- /dev/null +++ b/asv_bench/benchmarks/strftime.py @@ -0,0 +1,64 @@ +import numpy as np + +import pandas as pd +from pandas import offsets + + +class DatetimeStrftime: + timeout = 1500 + params = [1000, 10000] + param_names 
= ["obs"] + + def setup(self, obs): + d = "2018-11-29" + dt = "2018-11-26 11:18:27.0" + self.data = pd.DataFrame( + { + "dt": [np.datetime64(dt)] * obs, + "d": [np.datetime64(d)] * obs, + "r": [np.random.uniform()] * obs, + } + ) + + def time_frame_date_to_str(self, obs): + self.data["d"].astype(str) + + def time_frame_date_formatting_default(self, obs): + self.data["d"].dt.strftime(date_format="%Y-%m-%d") + + def time_frame_date_formatting_custom(self, obs): + self.data["d"].dt.strftime(date_format="%Y---%m---%d") + + def time_frame_datetime_to_str(self, obs): + self.data["dt"].astype(str) + + def time_frame_datetime_formatting_default_date_only(self, obs): + self.data["dt"].dt.strftime(date_format="%Y-%m-%d") + + def time_frame_datetime_formatting_default(self, obs): + self.data["dt"].dt.strftime(date_format="%Y-%m-%d %H:%M:%S") + + def time_frame_datetime_formatting_default_with_float(self, obs): + self.data["dt"].dt.strftime(date_format="%Y-%m-%d %H:%M:%S.%f") + + def time_frame_datetime_formatting_custom(self, obs): + self.data["dt"].dt.strftime(date_format="%Y-%m-%d --- %H:%M:%S") + + +class BusinessHourStrftime: + timeout = 1500 + params = [1000, 10000] + param_names = ["obs"] + + def setup(self, obs): + self.data = pd.DataFrame( + { + "off": [offsets.BusinessHour()] * obs, + } + ) + + def time_frame_offset_str(self, obs): + self.data["off"].apply(str) + + def time_frame_offset_repr(self, obs): + self.data["off"].apply(repr) diff --git a/ci/deps/actions-310.yaml b/ci/deps/actions-310.yaml index d17d29ef38e7f..73700c0da0d47 100644 --- a/ci/deps/actions-310.yaml +++ b/ci/deps/actions-310.yaml @@ -31,8 +31,7 @@ dependencies: - jinja2 - lxml - matplotlib - # TODO: uncomment after numba supports py310 - #- numba + - numba - numexpr - openpyxl - odfpy diff --git a/doc/redirects.csv b/doc/redirects.csv index 173e670e30f0e..90ddf6c4dc582 100644 --- a/doc/redirects.csv +++ b/doc/redirects.csv @@ -761,6 +761,7 @@ generated/pandas.IntervalIndex.mid,../reference/api/pandas.IntervalIndex.mid generated/pandas.IntervalIndex.overlaps,../reference/api/pandas.IntervalIndex.overlaps generated/pandas.IntervalIndex.right,../reference/api/pandas.IntervalIndex.right generated/pandas.IntervalIndex.set_closed,../reference/api/pandas.IntervalIndex.set_closed +generated/pandas.IntervalIndex.set_inclusive,../reference/api/pandas.IntervalIndex.set_inclusive generated/pandas.IntervalIndex.to_tuples,../reference/api/pandas.IntervalIndex.to_tuples generated/pandas.IntervalIndex.values,../reference/api/pandas.IntervalIndex.values generated/pandas.Interval.left,../reference/api/pandas.Interval.left diff --git a/doc/source/conf.py b/doc/source/conf.py index 49025288f0449..2a6ec8947c8d7 100644 --- a/doc/source/conf.py +++ b/doc/source/conf.py @@ -447,7 +447,6 @@ "py": ("https://pylib.readthedocs.io/en/latest/", None), "python": ("https://docs.python.org/3/", None), "scipy": ("https://docs.scipy.org/doc/scipy/", None), - "statsmodels": ("https://www.statsmodels.org/devel/", None), "pyarrow": ("https://arrow.apache.org/docs/", None), } diff --git a/doc/source/development/contributing.rst b/doc/source/development/contributing.rst index 1d745d21dacae..e76197e302ca4 100644 --- a/doc/source/development/contributing.rst +++ b/doc/source/development/contributing.rst @@ -326,13 +326,7 @@ Autofixing formatting errors ---------------------------- We use several styling checks (e.g. ``black``, ``flake8``, ``isort``) which are run after -you make a pull request. 
If there is a scenario where any of these checks fail then you -can comment:: - - @github-actions pre-commit - -on that pull request. This will trigger a workflow which will autofix formatting -errors. +you make a pull request. To automatically fix formatting errors on each commit you make, you can set up pre-commit yourself. First, create a Python :ref:`environment diff --git a/doc/source/development/contributing_codebase.rst b/doc/source/development/contributing_codebase.rst index 81cd69aa384a4..c74c44fb1d5f0 100644 --- a/doc/source/development/contributing_codebase.rst +++ b/doc/source/development/contributing_codebase.rst @@ -324,8 +324,169 @@ Writing tests All tests should go into the ``tests`` subdirectory of the specific package. This folder contains many current examples of tests, and we suggest looking to these for -inspiration. Please reference our :ref:`testing location guide ` if you are unsure -where to place a new unit test. +inspiration. Ideally, there should be one, and only one, obvious place for a test to reside. +Until we reach that ideal, these are some rules of thumb for where a test should +be located. + +1. Does your test depend only on code in ``pd._libs.tslibs``? + This test likely belongs in one of: + + - tests.tslibs + + .. note:: + + No file in ``tests.tslibs`` should import from any pandas modules + outside of ``pd._libs.tslibs`` + + - tests.scalar + - tests.tseries.offsets + +2. Does your test depend only on code in pd._libs? + This test likely belongs in one of: + + - tests.libs + - tests.groupby.test_libgroupby + +3. Is your test for an arithmetic or comparison method? + This test likely belongs in one of: + + - tests.arithmetic + + .. note:: + + These are intended for tests that can be shared to test the behavior + of DataFrame/Series/Index/ExtensionArray using the ``box_with_array`` + fixture. + + - tests.frame.test_arithmetic + - tests.series.test_arithmetic + +4. Is your test for a reduction method (min, max, sum, prod, ...)? + This test likely belongs in one of: + + - tests.reductions + + .. note:: + + These are intended for tests that can be shared to test the behavior + of DataFrame/Series/Index/ExtensionArray. + + - tests.frame.test_reductions + - tests.series.test_reductions + - tests.test_nanops + +5. Is your test for an indexing method? + This is the most difficult case for deciding where a test belongs, because + there are many of these tests, and many of them test more than one method + (e.g. both ``Series.__getitem__`` and ``Series.loc.__getitem__``) + + A) Is the test specifically testing an Index method (e.g. ``Index.get_loc``, + ``Index.get_indexer``)? + This test likely belongs in one of: + + - tests.indexes.test_indexing + - tests.indexes.fooindex.test_indexing + + Within that files there should be a method-specific test class e.g. + ``TestGetLoc``. + + In most cases, neither ``Series`` nor ``DataFrame`` objects should be + needed in these tests. + + B) Is the test for a Series or DataFrame indexing method *other* than + ``__getitem__`` or ``__setitem__``, e.g. ``xs``, ``where``, ``take``, + ``mask``, ``lookup``, or ``insert``? + This test likely belongs in one of: + + - tests.frame.indexing.test_methodname + - tests.series.indexing.test_methodname + + C) Is the test for any of ``loc``, ``iloc``, ``at``, or ``iat``? 
+ This test likely belongs in one of: + + - tests.indexing.test_loc + - tests.indexing.test_iloc + - tests.indexing.test_at + - tests.indexing.test_iat + + Within the appropriate file, test classes correspond to either types of + indexers (e.g. ``TestLocBooleanMask``) or major use cases + (e.g. ``TestLocSetitemWithExpansion``). + + See the note in section D) about tests that test multiple indexing methods. + + D) Is the test for ``Series.__getitem__``, ``Series.__setitem__``, + ``DataFrame.__getitem__``, or ``DataFrame.__setitem__``? + This test likely belongs in one of: + + - tests.series.test_getitem + - tests.series.test_setitem + - tests.frame.test_getitem + - tests.frame.test_setitem + + If many cases such a test may test multiple similar methods, e.g. + + .. code-block:: python + + import pandas as pd + import pandas._testing as tm + + def test_getitem_listlike_of_ints(): + ser = pd.Series(range(5)) + + result = ser[[3, 4]] + expected = pd.Series([2, 3]) + tm.assert_series_equal(result, expected) + + result = ser.loc[[3, 4]] + tm.assert_series_equal(result, expected) + + In cases like this, the test location should be based on the *underlying* + method being tested. Or in the case of a test for a bugfix, the location + of the actual bug. So in this example, we know that ``Series.__getitem__`` + calls ``Series.loc.__getitem__``, so this is *really* a test for + ``loc.__getitem__``. So this test belongs in ``tests.indexing.test_loc``. + +6. Is your test for a DataFrame or Series method? + + A) Is the method a plotting method? + This test likely belongs in one of: + + - tests.plotting + + B) Is the method an IO method? + This test likely belongs in one of: + + - tests.io + + C) Otherwise + This test likely belongs in one of: + + - tests.series.methods.test_mymethod + - tests.frame.methods.test_mymethod + + .. note:: + + If a test can be shared between DataFrame/Series using the + ``frame_or_series`` fixture, by convention it goes in the + ``tests.frame`` file. + +7. Is your test for an Index method, not depending on Series/DataFrame? + This test likely belongs in one of: + + - tests.indexes + +8) Is your test for one of the pandas-provided ExtensionArrays (``Categorical``, + ``DatetimeArray``, ``TimedeltaArray``, ``PeriodArray``, ``IntervalArray``, + ``PandasArray``, ``FloatArray``, ``BoolArray``, ``StringArray``)? + This test likely belongs in one of: + + - tests.arrays + +9) Is your test for *all* ExtensionArray subclasses (the "EA Interface")? + This test likely belongs in one of: + + - tests.extension Using ``pytest`` ~~~~~~~~~~~~~~~~ @@ -388,6 +549,8 @@ xfail is not to be used for tests involving failure due to invalid user argument For these tests, we need to verify the correct exception type and error message is being raised, using ``pytest.raises`` instead. +.. _contributing.warnings: + Testing a warning ^^^^^^^^^^^^^^^^^ @@ -405,6 +568,27 @@ If a warning should specifically not happen in a block of code, pass ``False`` i with tm.assert_produces_warning(False): pd.no_warning_function() +If you have a test that would emit a warning, but you aren't actually testing the +warning itself (say because it's going to be removed in the future, or because we're +matching a 3rd-party library's behavior), then use ``pytest.mark.filterwarnings`` to +ignore the error. + +.. 
code-block:: python + + @pytest.mark.filterwarnings("ignore:msg:category") + def test_thing(self): + pass + +If you need finer-grained control, you can use Python's +`warnings module `__ +to control whether a warning is ignored or raised at different places within +a single test. + +.. code-block:: python + + with warnings.catch_warnings(): + warnings.simplefilter("ignore", FutureWarning) + Testing an exception ^^^^^^^^^^^^^^^^^^^^ @@ -570,59 +754,6 @@ preferred if the inputs or logic are simple, with Hypothesis tests reserved for cases with complex logic or where there are too many combinations of options or subtle interactions to test (or think of!) all of them. -.. _contributing.warnings: - -Testing warnings -~~~~~~~~~~~~~~~~ - -By default, the :ref:`Continuous Integration ` will fail if any unhandled warnings are emitted. - -If your change involves checking that a warning is actually emitted, use -``tm.assert_produces_warning(ExpectedWarning)``. - - -.. code-block:: python - - import pandas._testing as tm - - - df = pd.DataFrame() - with tm.assert_produces_warning(FutureWarning): - df.some_operation() - -We prefer this to the ``pytest.warns`` context manager because ours checks that the warning's -stacklevel is set correctly. The stacklevel is what ensure the *user's* file name and line number -is printed in the warning, rather than something internal to pandas. It represents the number of -function calls from user code (e.g. ``df.some_operation()``) to the function that actually emits -the warning. Our linter will fail the build if you use ``pytest.warns`` in a test. - -If you have a test that would emit a warning, but you aren't actually testing the -warning itself (say because it's going to be removed in the future, or because we're -matching a 3rd-party library's behavior), then use ``pytest.mark.filterwarnings`` to -ignore the error. - -.. code-block:: python - - @pytest.mark.filterwarnings("ignore:msg:category") - def test_thing(self): - ... - -If the test generates a warning of class ``category`` whose message starts -with ``msg``, the warning will be ignored and the test will pass. - -If you need finer-grained control, you can use Python's usual -`warnings module `__ -to control whether a warning is ignored / raised at different places within -a single test. - -.. code-block:: python - - with warnings.catch_warnings(): - warnings.simplefilter("ignore", FutureWarning) - # Or use warnings.filterwarnings(...) - -Alternatively, consider breaking up the unit test. - Running the test suite ---------------------- diff --git a/doc/source/development/index.rst b/doc/source/development/index.rst index 01509705bb92c..1dbe162cd1a6b 100644 --- a/doc/source/development/index.rst +++ b/doc/source/development/index.rst @@ -18,7 +18,6 @@ Development contributing_codebase maintaining internals - test_writing debugging_extensions extending developer diff --git a/doc/source/development/test_writing.rst b/doc/source/development/test_writing.rst deleted file mode 100644 index 76eae505471b7..0000000000000 --- a/doc/source/development/test_writing.rst +++ /dev/null @@ -1,167 +0,0 @@ -.. _test_organization: - -Test organization -================= -Ideally, there should be one, and only one, obvious place for a test to reside. -Until we reach that ideal, these are some rules of thumb for where a test should -be located. - -1. Does your test depend only on code in ``pd._libs.tslibs``? - This test likely belongs in one of: - - - tests.tslibs - - .. 
note:: - - No file in ``tests.tslibs`` should import from any pandas modules - outside of ``pd._libs.tslibs`` - - - tests.scalar - - tests.tseries.offsets - -2. Does your test depend only on code in pd._libs? - This test likely belongs in one of: - - - tests.libs - - tests.groupby.test_libgroupby - -3. Is your test for an arithmetic or comparison method? - This test likely belongs in one of: - - - tests.arithmetic - - .. note:: - - These are intended for tests that can be shared to test the behavior - of DataFrame/Series/Index/ExtensionArray using the ``box_with_array`` - fixture. - - - tests.frame.test_arithmetic - - tests.series.test_arithmetic - -4. Is your test for a reduction method (min, max, sum, prod, ...)? - This test likely belongs in one of: - - - tests.reductions - - .. note:: - - These are intended for tests that can be shared to test the behavior - of DataFrame/Series/Index/ExtensionArray. - - - tests.frame.test_reductions - - tests.series.test_reductions - - tests.test_nanops - -5. Is your test for an indexing method? - This is the most difficult case for deciding where a test belongs, because - there are many of these tests, and many of them test more than one method - (e.g. both ``Series.__getitem__`` and ``Series.loc.__getitem__``) - - A) Is the test specifically testing an Index method (e.g. ``Index.get_loc``, - ``Index.get_indexer``)? - This test likely belongs in one of: - - - tests.indexes.test_indexing - - tests.indexes.fooindex.test_indexing - - Within that files there should be a method-specific test class e.g. - ``TestGetLoc``. - - In most cases, neither ``Series`` nor ``DataFrame`` objects should be - needed in these tests. - - B) Is the test for a Series or DataFrame indexing method *other* than - ``__getitem__`` or ``__setitem__``, e.g. ``xs``, ``where``, ``take``, - ``mask``, ``lookup``, or ``insert``? - This test likely belongs in one of: - - - tests.frame.indexing.test_methodname - - tests.series.indexing.test_methodname - - C) Is the test for any of ``loc``, ``iloc``, ``at``, or ``iat``? - This test likely belongs in one of: - - - tests.indexing.test_loc - - tests.indexing.test_iloc - - tests.indexing.test_at - - tests.indexing.test_iat - - Within the appropriate file, test classes correspond to either types of - indexers (e.g. ``TestLocBooleanMask``) or major use cases - (e.g. ``TestLocSetitemWithExpansion``). - - See the note in section D) about tests that test multiple indexing methods. - - D) Is the test for ``Series.__getitem__``, ``Series.__setitem__``, - ``DataFrame.__getitem__``, or ``DataFrame.__setitem__``? - This test likely belongs in one of: - - - tests.series.test_getitem - - tests.series.test_setitem - - tests.frame.test_getitem - - tests.frame.test_setitem - - If many cases such a test may test multiple similar methods, e.g. - - .. code-block:: python - - import pandas as pd - import pandas._testing as tm - - def test_getitem_listlike_of_ints(): - ser = pd.Series(range(5)) - - result = ser[[3, 4]] - expected = pd.Series([2, 3]) - tm.assert_series_equal(result, expected) - - result = ser.loc[[3, 4]] - tm.assert_series_equal(result, expected) - - In cases like this, the test location should be based on the *underlying* - method being tested. Or in the case of a test for a bugfix, the location - of the actual bug. So in this example, we know that ``Series.__getitem__`` - calls ``Series.loc.__getitem__``, so this is *really* a test for - ``loc.__getitem__``. So this test belongs in ``tests.indexing.test_loc``. - -6. 
Is your test for a DataFrame or Series method? - - A) Is the method a plotting method? - This test likely belongs in one of: - - - tests.plotting - - B) Is the method an IO method? - This test likely belongs in one of: - - - tests.io - - C) Otherwise - This test likely belongs in one of: - - - tests.series.methods.test_mymethod - - tests.frame.methods.test_mymethod - - .. note:: - - If a test can be shared between DataFrame/Series using the - ``frame_or_series`` fixture, by convention it goes in the - ``tests.frame`` file. - -7. Is your test for an Index method, not depending on Series/DataFrame? - This test likely belongs in one of: - - - tests.indexes - -8) Is your test for one of the pandas-provided ExtensionArrays (``Categorical``, - ``DatetimeArray``, ``TimedeltaArray``, ``PeriodArray``, ``IntervalArray``, - ``PandasArray``, ``FloatArray``, ``BoolArray``, ``StringArray``)? - This test likely belongs in one of: - - - tests.arrays - -9) Is your test for *all* ExtensionArray subclasses (the "EA Interface")? - This test likely belongs in one of: - - - tests.extension diff --git a/doc/source/getting_started/install.rst b/doc/source/getting_started/install.rst index 39c9db2c883b8..5d9bfd97030b5 100644 --- a/doc/source/getting_started/install.rst +++ b/doc/source/getting_started/install.rst @@ -199,7 +199,7 @@ the code base as of this writing. To run it on your machine to verify that everything is working (and that you have all of the dependencies, soft and hard, installed), make sure you have `pytest `__ >= 6.0 and `Hypothesis -`__ >= 3.58, then run: +`__ >= 6.13.0, then run: :: @@ -247,11 +247,11 @@ Recommended dependencies * `numexpr `__: for accelerating certain numerical operations. ``numexpr`` uses multiple cores as well as smart chunking and caching to achieve large speedups. - If installed, must be Version 2.7.1 or higher. + If installed, must be Version 2.7.3 or higher. * `bottleneck `__: for accelerating certain types of ``nan`` evaluations. ``bottleneck`` uses specialized cython routines to achieve large speedups. If installed, - must be Version 1.3.1 or higher. + must be Version 1.3.2 or higher. .. 
note:: @@ -277,8 +277,8 @@ Visualization Dependency Minimum Version Notes ========================= ================== ============================================================= matplotlib 3.3.2 Plotting library -Jinja2 2.11 Conditional formatting with DataFrame.style -tabulate 0.8.7 Printing in Markdown-friendly format (see `tabulate`_) +Jinja2 3.0.0 Conditional formatting with DataFrame.style +tabulate 0.8.9 Printing in Markdown-friendly format (see `tabulate`_) ========================= ================== ============================================================= Computation @@ -287,10 +287,10 @@ Computation ========================= ================== ============================================================= Dependency Minimum Version Notes ========================= ================== ============================================================= -SciPy 1.4.1 Miscellaneous statistical functions -numba 0.50.1 Alternative execution engine for rolling operations +SciPy 1.7.1 Miscellaneous statistical functions +numba 0.53.1 Alternative execution engine for rolling operations (see :ref:`Enhancing Performance `) -xarray 0.15.1 pandas-like API for N-dimensional data +xarray 0.19.0 pandas-like API for N-dimensional data ========================= ================== ============================================================= Excel files @@ -301,9 +301,9 @@ Dependency Minimum Version Notes ========================= ================== ============================================================= xlrd 2.0.1 Reading Excel xlwt 1.3.0 Writing Excel -xlsxwriter 1.2.2 Writing Excel -openpyxl 3.0.3 Reading / writing for xlsx files -pyxlsb 1.0.6 Reading for xlsb files +xlsxwriter 1.4.3 Writing Excel +openpyxl 3.0.7 Reading / writing for xlsx files +pyxlsb 1.0.8 Reading for xlsb files ========================= ================== ============================================================= HTML @@ -312,9 +312,9 @@ HTML ========================= ================== ============================================================= Dependency Minimum Version Notes ========================= ================== ============================================================= -BeautifulSoup4 4.8.2 HTML parser for read_html +BeautifulSoup4 4.9.3 HTML parser for read_html html5lib 1.1 HTML parser for read_html -lxml 4.5.0 HTML parser for read_html +lxml 4.6.3 HTML parser for read_html ========================= ================== ============================================================= One of the following combinations of libraries is needed to use the @@ -356,9 +356,9 @@ SQL databases ========================= ================== ============================================================= Dependency Minimum Version Notes ========================= ================== ============================================================= -SQLAlchemy 1.4.0 SQL support for databases other than sqlite -psycopg2 2.8.4 PostgreSQL engine for sqlalchemy -pymysql 0.10.1 MySQL engine for sqlalchemy +SQLAlchemy 1.4.16 SQL support for databases other than sqlite +psycopg2 2.8.6 PostgreSQL engine for sqlalchemy +pymysql 1.0.2 MySQL engine for sqlalchemy ========================= ================== ============================================================= Other data sources @@ -368,11 +368,11 @@ Other data sources Dependency Minimum Version Notes ========================= ================== ============================================================= PyTables 3.6.1 HDF5-based reading / writing -blosc 1.20.1 Compression for 
HDF5 +blosc 1.21.0 Compression for HDF5 zlib Compression for HDF5 fastparquet 0.4.0 Parquet reading / writing pyarrow 1.0.1 Parquet, ORC, and feather reading / writing -pyreadstat 1.1.0 SPSS files (.sav) reading +pyreadstat 1.1.2 SPSS files (.sav) reading ========================= ================== ============================================================= .. _install.warn_orc: @@ -396,10 +396,10 @@ Access data in the cloud ========================= ================== ============================================================= Dependency Minimum Version Notes ========================= ================== ============================================================= -fsspec 0.7.4 Handling files aside from simple local and HTTP -gcsfs 0.6.0 Google Cloud Storage access -pandas-gbq 0.14.0 Google Big Query access -s3fs 0.4.0 Amazon S3 access +fsspec 2021.5.0 Handling files aside from simple local and HTTP +gcsfs 2021.5.0 Google Cloud Storage access +pandas-gbq 0.15.0 Google Big Query access +s3fs 2021.05.0 Amazon S3 access ========================= ================== ============================================================= Clipboard diff --git a/doc/source/getting_started/tutorials.rst b/doc/source/getting_started/tutorials.rst index 8febc3adb9666..bff50bb1e4c2d 100644 --- a/doc/source/getting_started/tutorials.rst +++ b/doc/source/getting_started/tutorials.rst @@ -118,3 +118,4 @@ Various tutorials * `Pandas and Python: Top 10, by Manish Amde `_ * `Pandas DataFrames Tutorial, by Karlijn Willems `_ * `A concise tutorial with real life examples `_ +* `430+ Searchable Pandas recipes by Isshin Inada `_ diff --git a/doc/source/reference/arrays.rst b/doc/source/reference/arrays.rst index fed0d2c5f7827..cd0ce581519a8 100644 --- a/doc/source/reference/arrays.rst +++ b/doc/source/reference/arrays.rst @@ -304,6 +304,7 @@ Properties :toctree: api/ Interval.inclusive + Interval.closed Interval.closed_left Interval.closed_right Interval.is_empty @@ -351,6 +352,7 @@ A collection of intervals may be stored in an :class:`arrays.IntervalArray`. 
arrays.IntervalArray.contains arrays.IntervalArray.overlaps arrays.IntervalArray.set_closed + arrays.IntervalArray.set_inclusive arrays.IntervalArray.to_tuples diff --git a/doc/source/reference/frame.rst b/doc/source/reference/frame.rst index ea27d1efbb235..e71ee80767d29 100644 --- a/doc/source/reference/frame.rst +++ b/doc/source/reference/frame.rst @@ -373,6 +373,7 @@ Serialization / IO / conversion DataFrame.from_dict DataFrame.from_records + DataFrame.to_orc DataFrame.to_parquet DataFrame.to_pickle DataFrame.to_csv diff --git a/doc/source/reference/general_functions.rst b/doc/source/reference/general_functions.rst index a42d54b7e50ef..f82d9c9a6482c 100644 --- a/doc/source/reference/general_functions.rst +++ b/doc/source/reference/general_functions.rst @@ -23,6 +23,7 @@ Data manipulations merge_asof concat get_dummies + from_dummies factorize unique wide_to_long diff --git a/doc/source/reference/indexing.rst b/doc/source/reference/indexing.rst index 89a9a0a92ef08..589a339a1ca60 100644 --- a/doc/source/reference/indexing.rst +++ b/doc/source/reference/indexing.rst @@ -251,6 +251,7 @@ IntervalIndex components IntervalIndex.get_loc IntervalIndex.get_indexer IntervalIndex.set_closed + IntervalIndex.set_inclusive IntervalIndex.contains IntervalIndex.overlaps IntervalIndex.to_tuples diff --git a/doc/source/reference/io.rst b/doc/source/reference/io.rst index 70fd381bffd2c..425b5f81be966 100644 --- a/doc/source/reference/io.rst +++ b/doc/source/reference/io.rst @@ -159,6 +159,7 @@ ORC :toctree: api/ read_orc + DataFrame.to_orc SAS ~~~ diff --git a/doc/source/reference/testing.rst b/doc/source/reference/testing.rst index 68e0555afc916..338dd87aa8c62 100644 --- a/doc/source/reference/testing.rst +++ b/doc/source/reference/testing.rst @@ -26,10 +26,16 @@ Exceptions and warnings errors.AbstractMethodError errors.AccessorRegistrationWarning + errors.AttributeConflictWarning + errors.ClosedFileError + errors.CSSWarning + errors.DatabaseError errors.DataError errors.DtypeWarning errors.DuplicateLabelError errors.EmptyDataError + errors.IncompatibilityWarning + errors.IndexingError errors.InvalidIndexError errors.IntCastingNaNError errors.MergeError @@ -42,9 +48,13 @@ Exceptions and warnings errors.ParserError errors.ParserWarning errors.PerformanceWarning + errors.PossibleDataLossError + errors.PyperclipException + errors.PyperclipWindowsException errors.SettingWithCopyError errors.SettingWithCopyWarning errors.SpecificationError + errors.UndefinedVariableError errors.UnsortedIndexError errors.UnsupportedFunctionCall diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst index ba3fb17cc8764..34244a8edcbfa 100644 --- a/doc/source/user_guide/groupby.rst +++ b/doc/source/user_guide/groupby.rst @@ -345,6 +345,17 @@ Index level names may be supplied as keys. More on the ``sum`` function and aggregation later. +When using ``.groupby()`` on a DataFrame with a MultiIndex, do not specify both ``by`` and ``level``. +Instead, pass the name of the specific index level to ``by`` along with any column names; ``.groupby()`` will handle the argument validation. + +.. ipython:: python + + df = pd.DataFrame({"col1": ["a", "b", "c"]}) + df.index = pd.MultiIndex.from_arrays([["a", "a", "b"], + [1, 2, 1]], + names=["x", "y"]) + df.groupby(["col1", "x"]) + Grouping DataFrame with Index levels and columns ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A DataFrame may be grouped by a combination of columns and index levels by @@ -839,10 +850,10 @@ Alternatively, the built-in methods could be used to produce the same outputs. .. 
ipython:: python - max = ts.groupby(lambda x: x.year).transform("max") - min = ts.groupby(lambda x: x.year).transform("min") + max_ts = ts.groupby(lambda x: x.year).transform("max") + min_ts = ts.groupby(lambda x: x.year).transform("min") - max - min + max_ts - min_ts Another common data transform is to replace missing data with the group mean. diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst index 4e19deb84487f..7d1aa76613d33 100644 --- a/doc/source/user_guide/io.rst +++ b/doc/source/user_guide/io.rst @@ -30,7 +30,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like binary;`HDF5 Format `__;:ref:`read_hdf`;:ref:`to_hdf` binary;`Feather Format `__;:ref:`read_feather`;:ref:`to_feather` binary;`Parquet Format `__;:ref:`read_parquet`;:ref:`to_parquet` - binary;`ORC Format `__;:ref:`read_orc`; + binary;`ORC Format `__;:ref:`read_orc`;:ref:`to_orc` binary;`Stata `__;:ref:`read_stata`;:ref:`to_stata` binary;`SAS `__;:ref:`read_sas`; binary;`SPSS `__;:ref:`read_spss`; @@ -2559,16 +2559,29 @@ Let's look at a few examples. Read a URL with no options: -.. ipython:: python +.. code-block:: ipython - url = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list" - dfs = pd.read_html(url) - dfs + In [320]: "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list" + In [321]: pd.read_html(url) + Out[321]: + [ Bank NameBank CityCity StateSt ... Acquiring InstitutionAI Closing DateClosing FundFund + 0 Almena State Bank Almena KS ... Equity Bank October 23, 2020 10538 + 1 First City Bank of Florida Fort Walton Beach FL ... United Fidelity Bank, fsb October 16, 2020 10537 + 2 The First State Bank Barboursville WV ... MVB Bank, Inc. April 3, 2020 10536 + 3 Ericson State Bank Ericson NE ... Farmers and Merchants Bank February 14, 2020 10535 + 4 City National Bank of New Jersey Newark NJ ... Industrial Bank November 1, 2019 10534 + .. ... ... ... ... ... ... ... + 558 Superior Bank, FSB Hinsdale IL ... Superior Federal, FSB July 27, 2001 6004 + 559 Malta National Bank Malta OH ... North Valley Bank May 3, 2001 4648 + 560 First Alliance Bank & Trust Co. Manchester NH ... Southern New Hampshire Bank & Trust February 2, 2001 4647 + 561 National State Bank of Metropolis Metropolis IL ... Banterra Bank of Marion December 14, 2000 4646 + 562 Bank of Honolulu Honolulu HI ... Bank of the Orient October 13, 2000 4645 + + [563 rows x 7 columns]] .. note:: - The data from the above URL changes every Monday so the resulting data above - and the data below may be slightly different. + The data from the above URL changes every Monday so the resulting data above may be slightly different. Read in the content of the file from the above URL and pass it to ``read_html`` as a string: @@ -5562,13 +5575,64 @@ ORC .. versionadded:: 1.0.0 Similar to the :ref:`parquet ` format, the `ORC Format `__ is a binary columnar serialization -for data frames. It is designed to make reading data frames efficient. pandas provides *only* a reader for the -ORC format, :func:`~pandas.read_orc`. This requires the `pyarrow `__ library. +for data frames. It is designed to make reading data frames efficient. pandas provides both the reader and the writer for the +ORC format, :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc`. This requires the `pyarrow `__ library. .. warning:: * It is *highly recommended* to install pyarrow using conda due to some issues occurred by pyarrow. 
- * :func:`~pandas.read_orc` is not supported on Windows yet, you can find valid environments on :ref:`install optional dependencies `. + * :func:`~pandas.DataFrame.to_orc` requires pyarrow>=7.0.0. + * :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc` are not supported on Windows yet, you can find valid environments on :ref:`install optional dependencies `. + * For supported dtypes please refer to `supported ORC features in Arrow `__. + * Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files. + +.. ipython:: python + + df = pd.DataFrame( + { + "a": list("abc"), + "b": list(range(1, 4)), + "c": np.arange(4.0, 7.0, dtype="float64"), + "d": [True, False, True], + "e": pd.date_range("20130101", periods=3), + } + ) + + df + df.dtypes + +Write to an orc file. + +.. ipython:: python + :okwarning: + + df.to_orc("example_pa.orc", engine="pyarrow") + +Read from an orc file. + +.. ipython:: python + :okwarning: + + result = pd.read_orc("example_pa.orc") + + result.dtypes + +Read only certain columns of an orc file. + +.. ipython:: python + + result = pd.read_orc( + "example_pa.orc", + columns=["a", "b"], + ) + result.dtypes + + +.. ipython:: python + :suppress: + + os.remove("example_pa.orc") + .. _io.sql: diff --git a/doc/source/user_guide/reshaping.rst b/doc/source/user_guide/reshaping.rst index b24890564d1bf..adca9de6c130a 100644 --- a/doc/source/user_guide/reshaping.rst +++ b/doc/source/user_guide/reshaping.rst @@ -706,6 +706,30 @@ To choose another dtype, use the ``dtype`` argument: pd.get_dummies(df, dtype=bool).dtypes +.. versionadded:: 1.5.0 + +To convert a "dummy" or "indicator" ``DataFrame``, into a categorical ``DataFrame``, +for example ``k`` columns of a ``DataFrame`` containing 1s and 0s can derive a +``DataFrame`` which has ``k`` distinct values using +:func:`~pandas.from_dummies`: + +.. ipython:: python + + df = pd.DataFrame({"prefix_a": [0, 1, 0], "prefix_b": [1, 0, 1]}) + df + + pd.from_dummies(df, sep="_") + +Dummy coded data only requires ``k - 1`` categories to be included, in this case +the ``k`` th category is the default category, implied by not being assigned any of +the other ``k - 1`` categories, can be passed via ``default_category``. + +.. ipython:: python + + df = pd.DataFrame({"prefix_a": [0, 1, 0]}) + df + + pd.from_dummies(df, sep="_", default_category="b") .. _reshaping.factorize: diff --git a/doc/source/user_guide/sparse.rst b/doc/source/user_guide/sparse.rst index ef2cb8909b59d..bc4eec1c23a35 100644 --- a/doc/source/user_guide/sparse.rst +++ b/doc/source/user_guide/sparse.rst @@ -266,8 +266,8 @@ have no replacement. .. _sparse.scipysparse: -Interaction with scipy.sparse ------------------------------ +Interaction with *scipy.sparse* +------------------------------- Use :meth:`DataFrame.sparse.from_spmatrix` to create a :class:`DataFrame` with sparse values from a sparse matrix. diff --git a/doc/source/user_guide/style.ipynb b/doc/source/user_guide/style.ipynb index 58187b3052819..43021fcbc13fb 100644 --- a/doc/source/user_guide/style.ipynb +++ b/doc/source/user_guide/style.ipynb @@ -151,7 +151,7 @@ "\n", "### Formatting Values\n", "\n", - "Before adding styles it is useful to show that the [Styler][styler] can distinguish the *display* value from the *actual* value, in both datavlaues and index or columns headers. 
To control the display value, the text is printed in each cell as string, and we can use the [.format()][formatfunc] and [.format_index()][formatfuncindex] methods to manipulate this according to a [format spec string][format] or a callable that takes a single value and returns a string. It is possible to define this for the whole table, or index, or for individual columns, or MultiIndex levels. \n", + "Before adding styles it is useful to show that the [Styler][styler] can distinguish the *display* value from the *actual* value, in both datavalues and index or columns headers. To control the display value, the text is printed in each cell as string, and we can use the [.format()][formatfunc] and [.format_index()][formatfuncindex] methods to manipulate this according to a [format spec string][format] or a callable that takes a single value and returns a string. It is possible to define this for the whole table, or index, or for individual columns, or MultiIndex levels. \n", "\n", "Additionally, the format function has a **precision** argument to specifically help formatting floats, as well as **decimal** and **thousands** separators to support other locales, an **na_rep** argument to display missing data, and an **escape** argument to help displaying safe-HTML or safe-LaTeX. The default formatter is configured to adopt pandas' `styler.format.precision` option, controllable using `with pd.option_context('format.precision', 2):` \n", "\n", diff --git a/doc/source/user_guide/timeseries.rst b/doc/source/user_guide/timeseries.rst index c67d028b65b3e..ed7688f229ca8 100644 --- a/doc/source/user_guide/timeseries.rst +++ b/doc/source/user_guide/timeseries.rst @@ -388,7 +388,7 @@ We subtract the epoch (midnight at January 1, 1970 UTC) and then floor divide by .. _timeseries.origin: -Using the ``origin`` Parameter +Using the ``origin`` parameter ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Using the ``origin`` parameter, one can specify an alternative starting point for creation @@ -1523,7 +1523,7 @@ or calendars with additional rules. .. _timeseries.advanced_datetime: -Time series-related instance methods +Time Series-related instance methods ------------------------------------ Shifting / lagging @@ -2601,7 +2601,7 @@ Transform nonexistent times to ``NaT`` or shift the times. .. _timeseries.timezone_series: -Time zone series operations +Time zone Series operations ~~~~~~~~~~~~~~~~~~~~~~~~~~~ A :class:`Series` with time zone **naive** values is diff --git a/doc/source/user_guide/visualization.rst b/doc/source/user_guide/visualization.rst index 72600289dcf75..d6426fe8bed2d 100644 --- a/doc/source/user_guide/visualization.rst +++ b/doc/source/user_guide/visualization.rst @@ -3,7 +3,7 @@ {{ header }} ******************* -Chart Visualization +Chart visualization ******************* This section demonstrates visualization through charting. 
For information on @@ -1746,7 +1746,7 @@ Andrews curves charts: plt.close("all") -Plotting directly with matplotlib +Plotting directly with Matplotlib --------------------------------- In some situations it may still be preferable or necessary to prepare plots diff --git a/doc/source/user_guide/window.rst b/doc/source/user_guide/window.rst index 2407fd3113830..e08fa81c5fa09 100644 --- a/doc/source/user_guide/window.rst +++ b/doc/source/user_guide/window.rst @@ -3,7 +3,7 @@ {{ header }} ******************** -Windowing Operations +Windowing operations ******************** pandas contains a compact set of APIs for performing windowing operations - an operation that performs @@ -490,7 +490,7 @@ For all supported aggregation functions, see :ref:`api.functions_expanding`. .. _window.exponentially_weighted: -Exponentially Weighted window +Exponentially weighted window ----------------------------- An exponentially weighted window is similar to an expanding window but with each prior point diff --git a/doc/source/whatsnew/index.rst b/doc/source/whatsnew/index.rst index ccec4f90183bc..926b73d0f3fd9 100644 --- a/doc/source/whatsnew/index.rst +++ b/doc/source/whatsnew/index.rst @@ -24,6 +24,7 @@ Version 1.4 .. toctree:: :maxdepth: 2 + v1.4.4 v1.4.3 v1.4.2 v1.4.1 diff --git a/doc/source/whatsnew/v1.4.0.rst b/doc/source/whatsnew/v1.4.0.rst index 52aa9312d4c14..697070e50a40a 100644 --- a/doc/source/whatsnew/v1.4.0.rst +++ b/doc/source/whatsnew/v1.4.0.rst @@ -271,6 +271,9 @@ the given ``dayfirst`` value when the value is a delimited date string (e.g. Ignoring dtypes in concat with empty or all-NA columns ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +.. note:: + This behaviour change has been reverted in pandas 1.4.3. + When using :func:`concat` to concatenate two or more :class:`DataFrame` objects, if one of the DataFrames was empty or had all-NA values, its dtype was *sometimes* ignored when finding the concatenated dtype. These are now @@ -301,9 +304,15 @@ object, the ``np.nan`` is retained. *New behavior*: -.. ipython:: python +.. code-block:: ipython + + In [4]: res + Out[4]: + bar + 0 2013-01-01 00:00:00 + 1 NaN + - res .. _whatsnew_140.notable_bug_fixes.value_counts_and_mode_do_not_coerce_to_nan: diff --git a/doc/source/whatsnew/v1.4.3.rst b/doc/source/whatsnew/v1.4.3.rst index ca8b8ca15ec47..70b451a231453 100644 --- a/doc/source/whatsnew/v1.4.3.rst +++ b/doc/source/whatsnew/v1.4.3.rst @@ -1,7 +1,7 @@ .. _whatsnew_143: -What's new in 1.4.3 (April ??, 2022) ------------------------------------- +What's new in 1.4.3 (June 23, 2022) +----------------------------------- These are the changes in pandas 1.4.3. See :ref:`release` for a full changelog including other versions of pandas. @@ -10,22 +10,39 @@ including other versions of pandas. .. --------------------------------------------------------------------------- +.. _whatsnew_143.concat: + +Behavior of ``concat`` with empty or all-NA DataFrame columns +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The behavior change in version 1.4.0 to stop ignoring the data type +of empty or all-NA columns with float or object dtype in :func:`concat` +(:ref:`whatsnew_140.notable_bug_fixes.concat_with_empty_or_all_na`) has been +reverted (:issue:`45637`). + + .. 
_whatsnew_143.regressions: Fixed regressions ~~~~~~~~~~~~~~~~~ - Fixed regression in :meth:`DataFrame.replace` when the replacement value was explicitly ``None`` when passed in a dictionary to ``to_replace`` also casting other columns to object dtype even when there were no values to replace (:issue:`46634`) +- Fixed regression in :meth:`DataFrame.to_csv` raising error when :class:`DataFrame` contains extension dtype categorical column (:issue:`46297`, :issue:`46812`) +- Fixed regression in representation of ``dtypes`` attribute of :class:`MultiIndex` (:issue:`46900`) - Fixed regression when setting values with :meth:`DataFrame.loc` updating :class:`RangeIndex` when index was set as new column and column was updated afterwards (:issue:`47128`) -- Fixed regression in :meth:`DataFrame.nsmallest` led to wrong results when ``np.nan`` in the sorting column (:issue:`46589`) +- Fixed regression in :meth:`DataFrame.fillna` and :meth:`DataFrame.update` creating a copy when updating inplace (:issue:`47188`) +- Fixed regression in :meth:`DataFrame.nsmallest` that led to wrong results when the sorting column has ``np.nan`` values (:issue:`46589`) - Fixed regression in :func:`read_fwf` raising ``ValueError`` when ``widths`` was specified with ``usecols`` (:issue:`46580`) - Fixed regression in :func:`concat` not sorting columns for mixed column names (:issue:`47127`) - Fixed regression in :meth:`.Groupby.transform` and :meth:`.Groupby.agg` failing with ``engine="numba"`` when the index was a :class:`MultiIndex` (:issue:`46867`) +- Fixed regression in ``NaN`` comparison for :class:`Index` operations where the same object was compared (:issue:`47105`) - Fixed regression in :meth:`.Styler.to_latex` and :meth:`.Styler.to_html` where ``buf`` failed in combination with ``encoding`` (:issue:`47053`) - Fixed regression in :func:`read_csv` with ``index_col=False`` identifying first row as index names when ``header=None`` (:issue:`46955`) - Fixed regression in :meth:`.DataFrameGroupBy.agg` when used with list-likes or dict-likes and ``axis=1`` that would give incorrect results; now raises ``NotImplementedError`` (:issue:`46995`) - Fixed regression in :meth:`DataFrame.resample` and :meth:`DataFrame.rolling` when used with list-likes or dict-likes and ``axis=1`` that would raise an unintuitive error message; now raises ``NotImplementedError`` (:issue:`46904`) +- Fixed regression in :func:`testing.assert_index_equal` when ``check_order=False`` and :class:`Index` has extension or object dtype (:issue:`47207`) - Fixed regression in :func:`read_excel` returning ints as floats on certain input sheets (:issue:`46988`) - Fixed regression in :meth:`DataFrame.shift` when ``axis`` is ``columns`` and ``fill_value`` is absent, ``freq`` is ignored (:issue:`47039`) +- Fixed regression in :meth:`DataFrame.to_json` causing a segmentation violation when :class:`DataFrame` is created with an ``index`` parameter of the type :class:`PeriodIndex` (:issue:`46683`) .. 
--------------------------------------------------------------------------- @@ -33,9 +50,9 @@ Fixed regressions Bug fixes ~~~~~~~~~ -- Bug in :meth:`pd.eval`, :meth:`DataFrame.eval` and :meth:`DataFrame.query` where passing empty ``local_dict`` or ``global_dict`` was treated as passing ``None`` (:issue:`47084`) -- Most I/O methods do no longer suppress ``OSError`` and ``ValueError`` when closing file handles (:issue:`47136`) -- +- Bug in :func:`pandas.eval`, :meth:`DataFrame.eval` and :meth:`DataFrame.query` where passing empty ``local_dict`` or ``global_dict`` was treated as passing ``None`` (:issue:`47084`) +- Most I/O methods no longer suppress ``OSError`` and ``ValueError`` when closing file handles (:issue:`47136`) +- Improving error message raised by :meth:`DataFrame.from_dict` when passing an invalid ``orient`` parameter (:issue:`47450`) .. --------------------------------------------------------------------------- @@ -44,7 +61,6 @@ Bug fixes Other ~~~~~ - The minimum version of Cython needed to compile pandas is now ``0.29.30`` (:issue:`41935`) -- .. --------------------------------------------------------------------------- @@ -53,4 +69,4 @@ Other Contributors ~~~~~~~~~~~~ -.. contributors:: v1.4.2..v1.4.3|HEAD +.. contributors:: v1.4.2..v1.4.3 diff --git a/doc/source/whatsnew/v1.4.4.rst b/doc/source/whatsnew/v1.4.4.rst new file mode 100644 index 0000000000000..6bd7378e05404 --- /dev/null +++ b/doc/source/whatsnew/v1.4.4.rst @@ -0,0 +1,46 @@ +.. _whatsnew_144: + +What's new in 1.4.4 (July ??, 2022) +----------------------------------- + +These are the changes in pandas 1.4.4. See :ref:`release` for a full changelog +including other versions of pandas. + +{{ header }} + +.. --------------------------------------------------------------------------- + +.. _whatsnew_144.regressions: + +Fixed regressions +~~~~~~~~~~~~~~~~~ +- Fixed regression in :func:`concat` materializing :class:`Index` during sorting even if :class:`Index` was already sorted (:issue:`47501`) +- Fixed regression in setting ``None`` or non-string value into a ``string``-dtype Series using a mask (:issue:`47628`) +- + +.. --------------------------------------------------------------------------- + +.. _whatsnew_144.bug_fixes: + +Bug fixes +~~~~~~~~~ +- The :class:`errors.FutureWarning` raised when passing arguments (other than ``filepath_or_buffer``) as positional in :func:`read_csv` is now raised at the correct stacklevel (:issue:`47385`) +- Bug in :meth:`DataFrame.to_sql` when ``method`` was a ``callable`` that did not return an ``int`` and would raise a ``TypeError`` (:issue:`46891`) + +.. --------------------------------------------------------------------------- + +.. _whatsnew_144.other: + +Other +~~~~~ +- +- + +.. --------------------------------------------------------------------------- + +.. _whatsnew_144.contributors: + +Contributors +~~~~~~~~~~~~ + +.. contributors:: v1.4.3..v1.4.4|HEAD diff --git a/doc/source/whatsnew/v1.5.0.rst b/doc/source/whatsnew/v1.5.0.rst index 55bfb044fb31d..7f07187e34c78 100644 --- a/doc/source/whatsnew/v1.5.0.rst +++ b/doc/source/whatsnew/v1.5.0.rst @@ -100,6 +100,47 @@ as seen in the following example. 1 2021-01-02 08:00:00 4 2 2021-01-02 16:00:00 5 +.. _whatsnew_150.enhancements.from_dummies: + +from_dummies +^^^^^^^^^^^^ + +Added new function :func:`~pandas.from_dummies` to convert a dummy coded :class:`DataFrame` into a categorical :class:`DataFrame`. + +Example:: + +.. 
ipython:: python + + import pandas as pd + + df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0], + "col2_a": [0, 1, 0], "col2_b": [1, 0, 0], + "col2_c": [0, 0, 1]}) + + pd.from_dummies(df, sep="_") + +.. _whatsnew_150.enhancements.orc: + +Writing to ORC files +^^^^^^^^^^^^^^^^^^^^ + +The new method :meth:`DataFrame.to_orc` allows writing to ORC files (:issue:`43864`). + +This functionality depends the `pyarrow `__ library. For more details, see :ref:`the IO docs on ORC `. + +.. warning:: + + * It is *highly recommended* to install pyarrow using conda due to some issues occurred by pyarrow. + * :func:`~pandas.DataFrame.to_orc` requires pyarrow>=7.0.0. + * :func:`~pandas.DataFrame.to_orc` is not supported on Windows yet, you can find valid environments on :ref:`install optional dependencies `. + * For supported dtypes please refer to `supported ORC features in Arrow `__. + * Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files. + +.. code-block:: python + + df = pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]}) + df.to_orc("./out.orc") + .. _whatsnew_150.enhancements.tar: Reading directly from TAR archives @@ -125,13 +166,92 @@ If the compression method cannot be inferred, use the ``compression`` argument: (``mode`` being one of ``tarfile.open``'s modes: https://docs.python.org/3/library/tarfile.html#tarfile.open) +.. _whatsnew_150.enhancements.read_xml_dtypes: + +read_xml now supports ``dtype``, ``converters``, and ``parse_dates`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Similar to other IO methods, :func:`pandas.read_xml` now supports assigning specific dtypes to columns, +apply converter methods, and parse dates (:issue:`43567`). + +.. ipython:: python + + xml_dates = """ + + + square + 00360 + 4.0 + 2020-01-01 + + + circle + 00360 + + 2021-01-01 + + + triangle + 00180 + 3.0 + 2022-01-01 + + """ + + df = pd.read_xml( + xml_dates, + dtype={'sides': 'Int64'}, + converters={'degrees': str}, + parse_dates=['date'] + ) + df + df.dtypes + + +.. _whatsnew_150.enhancements.read_xml_iterparse: + +read_xml now supports large XML using ``iterparse`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml` +now supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_ +which are memory-efficient methods to iterate through XML trees and extract specific elements +and attributes without holding entire tree in memory (:issue:`45442`). + +.. code-block:: ipython + + In [1]: df = pd.read_xml( + ... "/path/to/downloaded/enwikisource-latest-pages-articles.xml", + ... iterparse = {"page": ["title", "ns", "id"]}) + ... ) + df + Out[2]: + title ns id + 0 Gettysburg Address 0 21450 + 1 Main Page 0 42950 + 2 Declaration by United Nations 0 8435 + 3 Constitution of the United States of America 0 8435 + 4 Declaration of Independence (Israel) 0 17858 + ... ... ... ... + 3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649 + 3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649 + 3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649 + 3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291 + 3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450 + + [3578765 rows x 3 columns] + + +.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk +.. 
_`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse + .. _whatsnew_150.enhancements.other: Other enhancements ^^^^^^^^^^^^^^^^^^ - :meth:`Series.map` now raises when ``arg`` is dict but ``na_action`` is not either ``None`` or ``'ignore'`` (:issue:`46588`) - :meth:`MultiIndex.to_frame` now supports the argument ``allow_duplicates`` and raises on duplicate labels if it is missing or False (:issue:`45245`) -- :class:`StringArray` now accepts array-likes containing nan-likes (``None``, ``np.nan``) for the ``values`` parameter in its constructor in addition to strings and :attr:`pandas.NA`. (:issue:`40839`) +- :class:`.StringArray` now accepts array-likes containing nan-likes (``None``, ``np.nan``) for the ``values`` parameter in its constructor in addition to strings and :attr:`pandas.NA`. (:issue:`40839`) - Improved the rendering of ``categories`` in :class:`CategoricalIndex` (:issue:`45218`) - :meth:`DataFrame.plot` will now allow the ``subplots`` parameter to be a list of iterables specifying column groups, so that columns may be grouped together in the same subplot (:issue:`29688`). - :meth:`to_numeric` now preserves float64 arrays when downcasting would generate values not representable in float32 (:issue:`43693`) @@ -142,18 +262,23 @@ Other enhancements - Implemented a ``bool``-dtype :class:`Index`, passing a bool-dtype array-like to ``pd.Index`` will now retain ``bool`` dtype instead of casting to ``object`` (:issue:`45061`) - Implemented a complex-dtype :class:`Index`, passing a complex-dtype array-like to ``pd.Index`` will now retain complex dtype instead of casting to ``object`` (:issue:`45845`) - :class:`Series` and :class:`DataFrame` with ``IntegerDtype`` now supports bitwise operations (:issue:`34463`) -- Add ``milliseconds`` field support for :class:`~pandas.DateOffset` (:issue:`43371`) +- Add ``milliseconds`` field support for :class:`.DateOffset` (:issue:`43371`) - :meth:`DataFrame.reset_index` now accepts a ``names`` argument which renames the index names (:issue:`6878`) -- :meth:`pd.concat` now raises when ``levels`` is given but ``keys`` is None (:issue:`46653`) -- :meth:`pd.concat` now raises when ``levels`` contains duplicate values (:issue:`46653`) +- :func:`concat` now raises when ``levels`` is given but ``keys`` is None (:issue:`46653`) +- :func:`concat` now raises when ``levels`` contains duplicate values (:issue:`46653`) - Added ``numeric_only`` argument to :meth:`DataFrame.corr`, :meth:`DataFrame.corrwith`, :meth:`DataFrame.cov`, :meth:`DataFrame.idxmin`, :meth:`DataFrame.idxmax`, :meth:`.DataFrameGroupBy.idxmin`, :meth:`.DataFrameGroupBy.idxmax`, :meth:`.GroupBy.var`, :meth:`.GroupBy.std`, :meth:`.GroupBy.sem`, and :meth:`.DataFrameGroupBy.quantile` (:issue:`46560`) - A :class:`errors.PerformanceWarning` is now thrown when using ``string[pyarrow]`` dtype with methods that don't dispatch to ``pyarrow.compute`` methods (:issue:`42613`, :issue:`46725`) - Added ``validate`` argument to :meth:`DataFrame.join` (:issue:`46622`) - A :class:`errors.PerformanceWarning` is now thrown when using ``string[pyarrow]`` dtype with methods that don't dispatch to ``pyarrow.compute`` methods (:issue:`42613`) - Added ``numeric_only`` argument to :meth:`Resampler.sum`, :meth:`Resampler.prod`, :meth:`Resampler.min`, :meth:`Resampler.max`, :meth:`Resampler.first`, and :meth:`Resampler.last` (:issue:`46442`) - ``times`` argument in :class:`.ExponentialMovingWindow` now accepts ``np.timedelta64`` (:issue:`47003`) -- 
:class:`DataError`, :class:`SpecificationError`, :class:`SettingWithCopyError`, :class:`SettingWithCopyWarning`, and :class:`NumExprClobberingError` are now exposed in ``pandas.errors`` (:issue:`27656`) +- :class:`.DataError`, :class:`.SpecificationError`, :class:`.SettingWithCopyError`, :class:`.SettingWithCopyWarning`, :class:`.NumExprClobberingError`, :class:`.UndefinedVariableError`, and :class:`.IndexingError` are now exposed in ``pandas.errors`` (:issue:`27656`) - Added ``check_like`` argument to :func:`testing.assert_series_equal` (:issue:`47247`) +- Allow reading compressed SAS files with :func:`read_sas` (e.g., ``.sas7bdat.gz`` files) +- :meth:`DatetimeIndex.astype` now supports casting timezone-naive indexes to ``datetime64[s]``, ``datetime64[ms]``, and ``datetime64[us]``, and timezone-aware indexes to the corresponding ``datetime64[unit, tzname]`` dtypes (:issue:`47579`) +- :class:`Series` reducers (e.g. ``min``, ``max``, ``sum``, ``mean``) will now successfully operate when the dtype is numeric and ``numeric_only=True`` is provided; previously this would raise a ``NotImplementedError`` (:issue:`47500`) +- :meth:`RangeIndex.union` now can return a :class:`RangeIndex` instead of a :class:`Int64Index` if the resulting values are equally spaced (:issue:`47557`, :issue:`43885`) +- :meth:`DataFrame.compare` now accepts an argument ``result_names`` to allow the user to specify the result's names of both left and right DataFrame which are being compared. This is by default ``'self'`` and ``'other'`` (:issue:`44354`) .. --------------------------------------------------------------------------- .. _whatsnew_150.notable_bug_fixes: @@ -271,83 +396,10 @@ upon serialization. (Related issue :issue:`12997`) Backwards incompatible API changes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. _whatsnew_150.api_breaking.read_xml_dtypes: - -read_xml now supports ``dtype``, ``converters``, and ``parse_dates`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Similar to other IO methods, :func:`pandas.read_xml` now supports assigning specific dtypes to columns, -apply converter methods, and parse dates (:issue:`43567`). - -.. ipython:: python - - xml_dates = """ - - - square - 00360 - 4.0 - 2020-01-01 - - - circle - 00360 - - 2021-01-01 - - - triangle - 00180 - 3.0 - 2022-01-01 - - """ - - df = pd.read_xml( - xml_dates, - dtype={'sides': 'Int64'}, - converters={'degrees': str}, - parse_dates=['date'] - ) - df - df.dtypes - -.. _whatsnew_150.read_xml_iterparse: - -read_xml now supports large XML using ``iterparse`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml` -now supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_ -which are memory-efficient methods to iterate through XML trees and extract specific elements -and attributes without holding entire tree in memory (:issue:`#45442`). - -.. code-block:: ipython - - In [1]: df = pd.read_xml( - ... "/path/to/downloaded/enwikisource-latest-pages-articles.xml", - ... iterparse = {"page": ["title", "ns", "id"]}) - ... ) - df - Out[2]: - title ns id - 0 Gettysburg Address 0 21450 - 1 Main Page 0 42950 - 2 Declaration by United Nations 0 8435 - 3 Constitution of the United States of America 0 8435 - 4 Declaration of Independence (Israel) 0 17858 - ... ... ... ... 
- 3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649 - 3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649 - 3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649 - 3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291 - 3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450 - - [3578765 rows x 3 columns] +.. _whatsnew_150.api_breaking.api_breaking1: - -.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk -.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse +api_breaking_change1 +^^^^^^^^^^^^^^^^^^^^ .. _whatsnew_150.api_breaking.api_breaking2: @@ -438,6 +490,7 @@ Other API changes October 2022. (:issue:`46312`) - :func:`read_json` now raises ``FileNotFoundError`` (previously ``ValueError``) when input is a string ending in ``.json``, ``.json.gz``, ``.json.bz2``, etc. but no such file exists. (:issue:`29102`) - Operations with :class:`Timestamp` or :class:`Timedelta` that would previously raise ``OverflowError`` instead raise ``OutOfBoundsDatetime`` or ``OutOfBoundsTimedelta`` where appropriate (:issue:`47268`) +- When :func:`read_sas` previously returned ``None``, it now returns an empty :class:`DataFrame` (:issue:`47410`) - .. --------------------------------------------------------------------------- @@ -448,6 +501,9 @@ Deprecations .. _whatsnew_150.deprecations.int_slicing_series: +Label-based integer slicing on a Series with an Int64Index or RangeIndex +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + In a future version, integer slicing on a :class:`Series` with a :class:`Int64Index` or :class:`RangeIndex` will be treated as *label-based*, not positional. This will make the behavior consistent with other :meth:`Series.__getitem__` and :meth:`Series.__setitem__` behaviors (:issue:`45162`). For example: @@ -539,31 +595,37 @@ As ``group_keys=True`` is the default value of :meth:`DataFrame.groupby` and raise a ``FutureWarning``. This can be silenced and the previous behavior retained by specifying ``group_keys=False``. -.. _whatsnew_150.notable_bug_fixes.setitem_column_try_inplace: +.. _whatsnew_150.deprecations.setitem_column_try_inplace: _ see also _whatsnew_130.notable_bug_fixes.setitem_column_try_inplace -Try operating inplace when setting values with ``loc`` and ``iloc`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Inplace operation when setting values with ``loc`` and ``iloc`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Most of the time setting values with ``frame.iloc`` attempts to set values -in-place, only falling back to inserting a new array if necessary. In the past, -setting entire columns has been an exception to this rule: +inplace, only falling back to inserting a new array if necessary. There are +some cases where this rule is not followed, for example when setting an entire +column from an array with different dtype: .. ipython:: python - values = np.arange(4).reshape(2, 2) - df = pd.DataFrame(values) - ser = df[0] + df = pd.DataFrame({'price': [11.1, 12.2]}, index=['book1', 'book2']) + original_prices = df['price'] + new_prices = np.array([98, 99]) *Old behavior*: .. 
code-block:: ipython - In [3]: df.iloc[:, 0] = np.array([10, 11]) - In [4]: ser + In [3]: df.iloc[:, 0] = new_prices + In [4]: df.iloc[:, 0] Out[4]: - 0 0 - 1 2 - Name: 0, dtype: int64 + book1 98 + book2 99 + Name: price, dtype: int64 + In [5]: original_prices + Out[5]: + book1 11.1 + book2 12.2 + Name: price, dtype: float64 This behavior is deprecated. In a future version, setting an entire column with iloc will attempt to operate inplace. @@ -572,39 +634,52 @@ iloc will attempt to operate inplace. .. code-block:: ipython - In [3]: df.iloc[:, 0] = np.array([10, 11]) - In [4]: ser + In [3]: df.iloc[:, 0] = new_prices + In [4]: df.iloc[:, 0] Out[4]: - 0 10 - 1 11 - Name: 0, dtype: int64 + book1 98.0 + book2 99.0 + Name: price, dtype: float64 + In [5]: original_prices + Out[5]: + book1 98.0 + book2 99.0 + Name: price, dtype: float64 To get the old behavior, use :meth:`DataFrame.__setitem__` directly: -*Future behavior*: - .. code-block:: ipython - In [5]: df[0] = np.array([21, 31]) - In [4]: ser - Out[4]: - 0 10 - 1 11 - Name: 0, dtype: int64 - -In the case where ``df.columns`` is not unique, use :meth:`DataFrame.isetitem`: - -*Future behavior*: + In [3]: df[df.columns[0]] = new_prices + In [4]: df.iloc[:, 0] + Out[4]: + book1 98 + book2 99 + Name: price, dtype: int64 + In [5]: original_prices + Out[5]: + book1 11.1 + book2 12.2 + Name: price, dtype: float64 + +To get the old behavior when ``df.columns`` is not unique and you want to +change a single column by index, you can use :meth:`DataFrame.isetitem`, which +has been added in pandas 1.5: .. code-block:: ipython - In [5]: df.columns = ["A", "A"] - In [5]: df.isetitem(0, np.array([21, 31])) - In [4]: ser + In [3]: df_with_duplicated_cols = pd.concat([df, df], axis='columns') + In [3]: df_with_duplicated_cols.isetitem(0, new_prices) + In [4]: df_with_duplicated_cols.iloc[:, 0] Out[4]: - 0 10 - 1 11 - Name: 0, dtype: int64 + book1 98 + book2 99 + Name: price, dtype: int64 + In [5]: original_prices + Out[5]: + book1 11.1 + book2 12.2 + Name: price, dtype: float64 .. _whatsnew_150.deprecations.numeric_only_default: @@ -673,7 +748,7 @@ Other Deprecations - Deprecated treating float-dtype data as wall-times when passed with a timezone to :class:`Series` or :class:`DatetimeIndex` (:issue:`45573`) - Deprecated the behavior of :meth:`Series.fillna` and :meth:`DataFrame.fillna` with ``timedelta64[ns]`` dtype and incompatible fill value; in a future version this will cast to a common dtype (usually object) instead of raising, matching the behavior of other dtypes (:issue:`45746`) - Deprecated the ``warn`` parameter in :func:`infer_freq` (:issue:`45947`) -- Deprecated allowing non-keyword arguments in :meth:`ExtensionArray.argsort` (:issue:`46134`) +- Deprecated allowing non-keyword arguments in :meth:`.ExtensionArray.argsort` (:issue:`46134`) - Deprecated treating all-bool ``object``-dtype columns as bool-like in :meth:`DataFrame.any` and :meth:`DataFrame.all` with ``bool_only=True``, explicitly cast to bool instead (:issue:`46188`) - Deprecated behavior of method :meth:`DataFrame.quantile`, attribute ``numeric_only`` will default False. Including datetime/timedelta columns in the result (:issue:`7308`). 
- Deprecated :attr:`Timedelta.freq` and :attr:`Timedelta.is_populated` (:issue:`46430`) @@ -682,17 +757,22 @@ Other Deprecations - Deprecated the ``closed`` argument in :meth:`interval_range` in favor of ``inclusive`` argument; In a future version passing ``closed`` will raise (:issue:`40245`) - Deprecated the methods :meth:`DataFrame.mad`, :meth:`Series.mad`, and the corresponding groupby methods (:issue:`11787`) - Deprecated positional arguments to :meth:`Index.join` except for ``other``, use keyword-only arguments instead of positional arguments (:issue:`46518`) +- Deprecated positional arguments to :meth:`StringMethods.rsplit` and :meth:`StringMethods.split` except for ``pat``, use keyword-only arguments instead of positional arguments (:issue:`47423`) - Deprecated indexing on a timezone-naive :class:`DatetimeIndex` using a string representing a timezone-aware datetime (:issue:`46903`, :issue:`36148`) - Deprecated the ``closed`` argument in :class:`Interval` in favor of ``inclusive`` argument; In a future version passing ``closed`` will raise (:issue:`40245`) - Deprecated the ``closed`` argument in :class:`IntervalIndex` in favor of ``inclusive`` argument; In a future version passing ``closed`` will raise (:issue:`40245`) - Deprecated the ``closed`` argument in :class:`IntervalDtype` in favor of ``inclusive`` argument; In a future version passing ``closed`` will raise (:issue:`40245`) -- Deprecated the ``closed`` argument in :class:`IntervalArray` in favor of ``inclusive`` argument; In a future version passing ``closed`` will raise (:issue:`40245`) -- Deprecated the ``closed`` argument in :class:`intervaltree` in favor of ``inclusive`` argument; In a future version passing ``closed`` will raise (:issue:`40245`) +- Deprecated the ``closed`` argument in :class:`.IntervalArray` in favor of ``inclusive`` argument; In a future version passing ``closed`` will raise (:issue:`40245`) +- Deprecated :meth:`.IntervalArray.set_closed` and :meth:`.IntervalIndex.set_closed` in favor of ``set_inclusive``; In a future version ``set_closed`` will get removed (:issue:`40245`) - Deprecated the ``closed`` argument in :class:`ArrowInterval` in favor of ``inclusive`` argument; In a future version passing ``closed`` will raise (:issue:`40245`) - Deprecated allowing ``unit="M"`` or ``unit="Y"`` in :class:`Timestamp` constructor with a non-round float value (:issue:`47267`) - Deprecated the ``display.column_space`` global configuration option (:issue:`7576`) +- Deprecated the argument ``na_sentinel`` in :func:`factorize`, :meth:`Index.factorize`, and :meth:`.ExtensionArray.factorize`; pass ``use_na_sentinel=True`` instead to use the sentinel ``-1`` for NaN values and ``use_na_sentinel=False`` instead of ``na_sentinel=None`` to encode NaN values (:issue:`46910`) - Deprecated :meth:`DataFrameGroupBy.transform` not aligning the result when the UDF returned DataFrame (:issue:`45648`) -- +- Clarified warning from :func:`to_datetime` when delimited dates can't be parsed in accordance to specified ``dayfirst`` argument (:issue:`46210`) +- Deprecated :class:`Series` and :class:`Resampler` reducers (e.g. 
``min``, ``max``, ``sum``, ``mean``) raising a ``NotImplementedError`` when the dtype is non-numeric and ``numeric_only=True`` is provided; this will raise a ``TypeError`` in a future version (:issue:`47500`) +- Deprecated :meth:`Series.rank` returning an empty result when the dtype is non-numeric and ``numeric_only=True`` is provided; this will raise a ``TypeError`` in a future version (:issue:`47500`) +- Deprecated argument ``errors`` for :meth:`Series.mask`, :meth:`Series.where`, :meth:`DataFrame.mask`, and :meth:`DataFrame.where` as ``errors`` had no effect on these methods (:issue:`47728`) .. --------------------------------------------------------------------------- .. _whatsnew_150.performance: @@ -716,6 +796,13 @@ Performance improvements - Performance improvement in :func:`factorize` (:issue:`46109`) - Performance improvement in :class:`DataFrame` and :class:`Series` constructors for extension dtype scalars (:issue:`45854`) - Performance improvement in :func:`read_excel` when ``nrows`` argument provided (:issue:`32727`) +- Performance improvement in :meth:`.Styler.to_excel` when applying repeated CSS formats (:issue:`47371`) +- Performance improvement in :meth:`MultiIndex.is_monotonic_increasing` (:issue:`47458`) +- Performance improvement in :class:`BusinessHour` ``str`` and ``repr`` (:issue:`44764`) +- Performance improvement in datetime arrays string formatting when one of the default strftime formats ``"%Y-%m-%d %H:%M:%S"`` or ``"%Y-%m-%d %H:%M:%S.%f"`` is used. (:issue:`44764`) +- Performance improvement in :meth:`Series.to_sql` and :meth:`DataFrame.to_sql` (:class:`SQLiteTable`) when processing time arrays. (:issue:`44764`) +- Performance improvements to :func:`read_sas` (:issue:`47403`, :issue:`47404`, :issue:`47405`) +- .. --------------------------------------------------------------------------- .. 
_whatsnew_150.bug_fixes: @@ -725,8 +812,8 @@ Bug fixes Categorical ^^^^^^^^^^^ -- Bug in :meth:`Categorical.view` not accepting integer dtypes (:issue:`25464`) -- Bug in :meth:`CategoricalIndex.union` when the index's categories are integer-dtype and the index contains ``NaN`` values incorrectly raising instead of casting to ``float64`` (:issue:`45362`) +- Bug in :meth:`.Categorical.view` not accepting integer dtypes (:issue:`25464`) +- Bug in :meth:`.CategoricalIndex.union` when the index's categories are integer-dtype and the index contains ``NaN`` values incorrectly raising instead of casting to ``float64`` (:issue:`45362`) - Datetimelike @@ -739,7 +826,7 @@ Datetimelike - Bug in :meth:`DatetimeIndex.tz_localize` localizing to UTC failing to make a copy of the underlying data (:issue:`46460`) - Bug in :meth:`DatetimeIndex.resolution` incorrectly returning "day" instead of "nanosecond" for nanosecond-resolution indexes (:issue:`46903`) - Bug in :class:`Timestamp` with an integer or float value and ``unit="Y"`` or ``unit="M"`` giving slightly-wrong results (:issue:`47266`) -- Bug in :class:`DatetimeArray` construction when passed another :class:`DatetimeArray` and ``freq=None`` incorrectly inferring the freq from the given array (:issue:`47296`) +- Bug in :class:`.DatetimeArray` construction when passed another :class:`.DatetimeArray` and ``freq=None`` incorrectly inferring the freq from the given array (:issue:`47296`) - Timedelta @@ -759,7 +846,7 @@ Numeric - Bug in operations with array-likes with ``dtype="boolean"`` and :attr:`NA` incorrectly altering the array in-place (:issue:`45421`) - Bug in division, ``pow`` and ``mod`` operations on array-likes with ``dtype="boolean"`` not being like their ``np.bool_`` counterparts (:issue:`46063`) - Bug in multiplying a :class:`Series` with ``IntegerDtype`` or ``FloatingDtype`` by an array-like with ``timedelta64[ns]`` dtype incorrectly raising (:issue:`45622`) -- +- Bug in :meth:`mean` where the optional dependency ``bottleneck`` causes precision loss linear in the length of the array. ``bottleneck`` has been disabled for :meth:`mean` improving the loss to log-linear but may result in a performance decrease. (:issue:`42878`) Conversion ^^^^^^^^^^ @@ -772,10 +859,13 @@ Conversion - Bug in metaclass of generic abstract dtypes causing :meth:`DataFrame.apply` and :meth:`Series.apply` to raise for the built-in function ``type`` (:issue:`46684`) - Bug in :meth:`DataFrame.to_records` returning inconsistent numpy types if the index was a :class:`MultiIndex` (:issue:`47263`) - Bug in :meth:`DataFrame.to_dict` for ``orient="list"`` or ``orient="index"`` was not returning native types (:issue:`46751`) +- Bug in :meth:`DataFrame.apply` that returns a :class:`DataFrame` instead of a :class:`Series` when applied to an empty :class:`DataFrame` and ``axis=1`` (:issue:`39111`) +- Bug when inferring the dtype from an iterable that is *not* a NumPy ``ndarray`` consisting of all NumPy unsigned integer scalars did not result in an unsigned integer dtype (:issue:`47294`) Strings ^^^^^^^ - Bug in :meth:`str.startswith` and :meth:`str.endswith` when using other series as parameter _pat_. 
Now raises ``TypeError`` (:issue:`3485`) +- Bug in :meth:`Series.str.zfill` when strings contain leading signs, padding '0' before the sign character rather than after as ``str.zfill`` from standard library (:issue:`20868`) - Interval @@ -787,20 +877,26 @@ Indexing ^^^^^^^^ - Bug in :meth:`loc.__getitem__` with a list of keys causing an internal inconsistency that could lead to a disconnect between ``frame.at[x, y]`` vs ``frame[y].loc[x]`` (:issue:`22372`) - Bug in :meth:`DataFrame.iloc` where indexing a single row on a :class:`DataFrame` with a single ExtensionDtype column gave a copy instead of a view on the underlying data (:issue:`45241`) +- Bug in :meth:`DataFrame.__getitem__` returning copy when :class:`DataFrame` has duplicated columns even if a unique column is selected (:issue:`45316`, :issue:`41062`) - Bug in :meth:`Series.align` does not create :class:`MultiIndex` with union of levels when both MultiIndexes intersections are identical (:issue:`45224`) - Bug in setting a NA value (``None`` or ``np.nan``) into a :class:`Series` with int-based :class:`IntervalDtype` incorrectly casting to object dtype instead of a float-based :class:`IntervalDtype` (:issue:`45568`) - Bug in indexing setting values into an ``ExtensionDtype`` column with ``df.iloc[:, i] = values`` with ``values`` having the same dtype as ``df.iloc[:, i]`` incorrectly inserting a new array instead of setting in-place (:issue:`33457`) - Bug in :meth:`Series.__setitem__` with a non-integer :class:`Index` when using an integer key to set a value that cannot be set inplace where a ``ValueError`` was raised instead of casting to a common dtype (:issue:`45070`) - Bug in :meth:`Series.__setitem__` when setting incompatible values into a ``PeriodDtype`` or ``IntervalDtype`` :class:`Series` raising when indexing with a boolean mask but coercing when indexing with otherwise-equivalent indexers; these now consistently coerce, along with :meth:`Series.mask` and :meth:`Series.where` (:issue:`45768`) - Bug in :meth:`DataFrame.where` with multiple columns with datetime-like dtypes failing to downcast results consistent with other dtypes (:issue:`45837`) +- Bug in :func:`isin` upcasting to ``float64`` with unsigned integer dtype and list-like argument without a dtype (:issue:`46485`) - Bug in :meth:`Series.loc.__setitem__` and :meth:`Series.loc.__getitem__` not raising when using multiple keys without using a :class:`MultiIndex` (:issue:`13831`) - Bug in :meth:`Index.reindex` raising ``AssertionError`` when ``level`` was specified but no :class:`MultiIndex` was given; level is ignored now (:issue:`35132`) - Bug when setting a value too large for a :class:`Series` dtype failing to coerce to a common type (:issue:`26049`, :issue:`32878`) - Bug in :meth:`loc.__setitem__` treating ``range`` keys as positional instead of label-based (:issue:`45479`) - Bug in :meth:`Series.__setitem__` when setting ``boolean`` dtype values containing ``NA`` incorrectly raising instead of casting to ``boolean`` dtype (:issue:`45462`) -- Bug in :meth:`Series.__setitem__` where setting :attr:`NA` into a numeric-dtpye :class:`Series` would incorrectly upcast to object-dtype rather than treating the value as ``np.nan`` (:issue:`44199`) +- Bug in :meth:`Series.loc` raising with boolean indexer containing ``NA`` when :class:`Index` did not match (:issue:`46551`) +- Bug in :meth:`Series.__setitem__` where setting :attr:`NA` into a numeric-dtype :class:`Series` would incorrectly upcast to object-dtype rather than treating the value as ``np.nan`` (:issue:`44199`) +- 
Bug in :meth:`DataFrame.loc` when setting values to a column and right hand side is a dictionary (:issue:`47216`) +- Bug in :meth:`DataFrame.loc` when setting a :class:`DataFrame` not aligning index in some cases (:issue:`47578`) - Bug in :meth:`Series.__setitem__` with ``datetime64[ns]`` dtype, an all-``False`` boolean mask, and an incompatible value incorrectly casting to ``object`` instead of retaining ``datetime64[ns]`` dtype (:issue:`45967`) - Bug in :meth:`Index.__getitem__` raising ``ValueError`` when indexer is from boolean dtype with ``NA`` (:issue:`45806`) +- Bug in :meth:`Series.__setitem__` losing precision when enlarging :class:`Series` with scalar (:issue:`32346`) - Bug in :meth:`Series.mask` with ``inplace=True`` or setting values with a boolean mask with small integer dtypes incorrectly raising (:issue:`45750`) - Bug in :meth:`DataFrame.mask` with ``inplace=True`` and ``ExtensionDtype`` columns incorrectly raising (:issue:`45577`) - Bug in getting a column from a DataFrame with an object-dtype row index with datetime-like values: the resulting Series now preserves the exact object-dtype Index from the parent DataFrame (:issue:`42950`) @@ -818,8 +914,10 @@ Missing ^^^^^^^ - Bug in :meth:`Series.fillna` and :meth:`DataFrame.fillna` with ``downcast`` keyword not being respected in some cases where there are no NA values present (:issue:`45423`) - Bug in :meth:`Series.fillna` and :meth:`DataFrame.fillna` with :class:`IntervalDtype` and incompatible value raising instead of casting to a common (usually object) dtype (:issue:`45796`) +- Bug in :meth:`Series.map` not respecting ``na_action`` argument if mapper is a ``dict`` or :class:`Series` (:issue:`47527`) - Bug in :meth:`DataFrame.interpolate` with object-dtype column not returning a copy with ``inplace=False`` (:issue:`45791`) - Bug in :meth:`DataFrame.dropna` allows to set both ``how`` and ``thresh`` incompatible arguments (:issue:`46575`) +- Bug in :meth:`DataFrame.fillna` ignored ``axis`` when :class:`DataFrame` is single block (:issue:`47713`) MultiIndex ^^^^^^^^^^ @@ -838,8 +936,11 @@ I/O - Bug in :func:`read_csv` not recognizing line break for ``on_bad_lines="warn"`` for ``engine="c"`` (:issue:`41710`) - Bug in :meth:`DataFrame.to_csv` not respecting ``float_format`` for ``Float64`` dtype (:issue:`45991`) - Bug in :func:`read_csv` not respecting a specified converter to index columns in all cases (:issue:`40589`) +- Bug in :func:`read_csv` interpreting second row as :class:`Index` names even when ``index_col=False`` (:issue:`46569`) - Bug in :func:`read_parquet` when ``engine="pyarrow"`` which caused partial write to disk when column of unsupported datatype was passed (:issue:`44914`) - Bug in :func:`DataFrame.to_excel` and :class:`ExcelWriter` would raise when writing an empty DataFrame to a ``.ods`` file (:issue:`45793`) +- Bug in :func:`read_csv` ignoring non-existing header row for ``engine="python"`` (:issue:`47400`) +- Bug in :func:`read_excel` raising uncontrolled ``IndexError`` when ``header`` references non-existing rows (:issue:`43143`) - Bug in :func:`read_html` where elements surrounding ``
`` were joined without a space between them (:issue:`29528`) - Bug in :func:`read_csv` when data is longer than header leading to issues with callables in ``usecols`` expecting strings (:issue:`46997`) - Bug in Parquet roundtrip for Interval dtype with ``datetime64[ns]`` subtype (:issue:`45881`) @@ -847,18 +948,26 @@ I/O - Bug in :func:`read_parquet` when ``engine="fastparquet"`` where the file was not closed on error (:issue:`46555`) - :meth:`to_html` now excludes the ``border`` attribute from ```` elements when ``border`` keyword is set to ``False``. - Bug in :func:`read_sas` with certain types of compressed SAS7BDAT files (:issue:`35545`) +- Bug in :func:`read_excel` not forward filling :class:`MultiIndex` when no names were given (:issue:`47487`) - Bug in :func:`read_sas` returned ``None`` rather than an empty DataFrame for SAS7BDAT files with zero rows (:issue:`18198`) - Bug in :class:`StataWriter` where value labels were always written with default encoding (:issue:`46750`) - Bug in :class:`StataWriterUTF8` where some valid characters were removed from variable names (:issue:`47276`) +- Bug in :meth:`DataFrame.to_excel` when writing an empty dataframe with :class:`MultiIndex` (:issue:`19543`) +- Bug in :func:`read_sas` with RLE-compressed SAS7BDAT files that contain 0x40 control bytes (:issue:`31243`) +- Bug in :func:`read_sas` that scrambled column names (:issue:`31243`) +- Bug in :func:`read_sas` with RLE-compressed SAS7BDAT files that contain 0x00 control bytes (:issue:`47099`) +- Bug in :func:`read_parquet` with ``use_nullable_dtypes=True`` where ``float64`` dtype was returned instead of nullable ``Float64`` dtype (:issue:`45694`) +- Bug in :meth:`DataFrame.to_json` where ``PeriodDtype`` would not make the serialization roundtrip when read back with :meth:`read_json` (:issue:`44720`) Period ^^^^^^ -- Bug in subtraction of :class:`Period` from :class:`PeriodArray` returning wrong results (:issue:`45999`) +- Bug in subtraction of :class:`Period` from :class:`.PeriodArray` returning wrong results (:issue:`45999`) - Bug in :meth:`Period.strftime` and :meth:`PeriodIndex.strftime`, directives ``%l`` and ``%u`` were giving wrong results (:issue:`46252`) - Bug in inferring an incorrect ``freq`` when passing a string to :class:`Period` microseconds that are a multiple of 1000 (:issue:`46811`) - Bug in constructing a :class:`Period` from a :class:`Timestamp` or ``np.datetime64`` object with non-zero nanoseconds and ``freq="ns"`` incorrectly truncating the nanoseconds (:issue:`46811`) - Bug in adding ``np.timedelta64("NaT", "ns")`` to a :class:`Period` with a timedelta-like freq incorrectly raising ``IncompatibleFrequency`` instead of returning ``NaT`` (:issue:`47196`) - Bug in adding an array of integers to an array with :class:`PeriodDtype` giving incorrect results when ``dtype.freq.n > 1`` (:issue:`47209`) +- Bug in subtracting a :class:`Period` from an array with :class:`PeriodDtype` returning incorrect results instead of raising ``OverflowError`` when the operation overflows (:issue:`47538`) - Plotting @@ -870,6 +979,8 @@ Plotting - Bug in :meth:`DataFrame.plot.scatter` that prevented specifying ``norm`` (:issue:`45809`) - The function :meth:`DataFrame.plot.scatter` now accepts ``color`` as an alias for ``c`` and ``size`` as an alias for ``s`` for consistency to other plotting functions (:issue:`44670`) - Fix showing "None" as ylabel in :meth:`Series.plot` when not setting ylabel (:issue:`46129`) +- Bug in :meth:`DataFrame.plot` that led to xticks and vertical grids being improperly 
placed when plotting a quarterly series (:issue:`47602`) +- Bug in :meth:`DataFrame.plot` that prevented setting y-axis label, limits and ticks for a secondary y-axis (:issue:`47753`) Groupby/resample/rolling ^^^^^^^^^^^^^^^^^^^^^^^^ @@ -878,38 +989,45 @@ Groupby/resample/rolling - Bug in :meth:`.DataFrameGroupBy.size` and :meth:`.DataFrameGroupBy.transform` with ``func="size"`` produced incorrect results when ``axis=1`` (:issue:`45715`) - Bug in :meth:`.ExponentialMovingWindow.mean` with ``axis=1`` and ``engine='numba'`` when the :class:`DataFrame` has more columns than rows (:issue:`46086`) - Bug when using ``engine="numba"`` would return the same jitted function when modifying ``engine_kwargs`` (:issue:`46086`) -- Bug in :meth:`.DataFrameGroupby.transform` fails when ``axis=1`` and ``func`` is ``"first"`` or ``"last"`` (:issue:`45986`) -- Bug in :meth:`DataFrameGroupby.cumsum` with ``skipna=False`` giving incorrect results (:issue:`46216`) +- Bug in :meth:`.DataFrameGroupBy.transform` fails when ``axis=1`` and ``func`` is ``"first"`` or ``"last"`` (:issue:`45986`) +- Bug in :meth:`DataFrameGroupBy.cumsum` with ``skipna=False`` giving incorrect results (:issue:`46216`) - Bug in :meth:`.GroupBy.cumsum` with ``timedelta64[ns]`` dtype failing to recognize ``NaT`` as a null value (:issue:`46216`) -- Bug in :meth:`GroupBy.cummin` and :meth:`GroupBy.cummax` with nullable dtypes incorrectly altering the original data in place (:issue:`46220`) -- Bug in :meth:`GroupBy.cummax` with ``int64`` dtype with leading value being the smallest possible int64 (:issue:`46382`) -- Bug in :meth:`GroupBy.max` with empty groups and ``uint64`` dtype incorrectly raising ``RuntimeError`` (:issue:`46408`) +- Bug in :meth:`.GroupBy.cummin` and :meth:`.GroupBy.cummax` with nullable dtypes incorrectly altering the original data in place (:issue:`46220`) +- Bug in :meth:`DataFrame.groupby` raising error when ``None`` is in first level of :class:`MultiIndex` (:issue:`47348`) +- Bug in :meth:`.GroupBy.cummax` with ``int64`` dtype with leading value being the smallest possible int64 (:issue:`46382`) +- Bug in :meth:`.GroupBy.max` with empty groups and ``uint64`` dtype incorrectly raising ``RuntimeError`` (:issue:`46408`) - Bug in :meth:`.GroupBy.apply` would fail when ``func`` was a string and args or kwargs were supplied (:issue:`46479`) - Bug in :meth:`SeriesGroupBy.apply` would incorrectly name its result when there was a unique group (:issue:`46369`) -- Bug in :meth:`Rolling.sum` and :meth:`Rolling.mean` would give incorrect result with window of same values (:issue:`42064`, :issue:`46431`) -- Bug in :meth:`Rolling.var` and :meth:`Rolling.std` would give non-zero result with window of same values (:issue:`42064`) -- Bug in :meth:`Rolling.skew` and :meth:`Rolling.kurt` would give NaN with window of same values (:issue:`30993`) +- Bug in :meth:`.Rolling.sum` and :meth:`.Rolling.mean` would give incorrect result with window of same values (:issue:`42064`, :issue:`46431`) +- Bug in :meth:`.Rolling.var` and :meth:`.Rolling.std` would give non-zero result with window of same values (:issue:`42064`) +- Bug in :meth:`.Rolling.skew` and :meth:`.Rolling.kurt` would give NaN with window of same values (:issue:`30993`) - Bug in :meth:`.Rolling.var` would segfault calculating weighted variance when window size was larger than data size (:issue:`46760`) - Bug in :meth:`Grouper.__repr__` where ``dropna`` was not included. 
Now it is (:issue:`46754`) - Bug in :meth:`DataFrame.rolling` gives ValueError when center=True, axis=1 and win_type is specified (:issue:`46135`) - Bug in :meth:`.DataFrameGroupBy.describe` and :meth:`.SeriesGroupBy.describe` produces inconsistent results for empty datasets (:issue:`41575`) - Bug in :meth:`DataFrame.resample` reduction methods when used with ``on`` would attempt to aggregate the provided column (:issue:`47079`) - Bug in :meth:`DataFrame.groupby` and :meth:`Series.groupby` would not respect ``dropna=False`` when the input DataFrame/Series had a NaN values in a :class:`MultiIndex` (:issue:`46783`) +- Bug in :meth:`DataFrameGroupBy.resample` raises ``KeyError`` when getting the result from a key list which misses the resample key (:issue:`47362`) +- Reshaping ^^^^^^^^^ - Bug in :func:`concat` between a :class:`Series` with integer dtype and another with :class:`CategoricalDtype` with integer categories and containing ``NaN`` values casting to object dtype instead of ``float64`` (:issue:`45359`) - Bug in :func:`get_dummies` that selected object and categorical dtypes but not string (:issue:`44965`) - Bug in :meth:`DataFrame.align` when aligning a :class:`MultiIndex` to a :class:`Series` with another :class:`MultiIndex` (:issue:`46001`) -- Bug in concanenation with ``IntegerDtype``, or ``FloatingDtype`` arrays where the resulting dtype did not mirror the behavior of the non-nullable dtypes (:issue:`46379`) +- Bug in concatenation with ``IntegerDtype``, or ``FloatingDtype`` arrays where the resulting dtype did not mirror the behavior of the non-nullable dtypes (:issue:`46379`) +- Bug in :func:`concat` losing dtype of columns when ``join="outer"`` and ``sort=True`` (:issue:`47329`) +- Bug in :func:`concat` not sorting the column names when ``None`` is included (:issue:`47331`) - Bug in :func:`concat` with identical key leads to error when indexing :class:`MultiIndex` (:issue:`46519`) - Bug in :meth:`DataFrame.join` with a list when using suffixes to join DataFrames with duplicate column names (:issue:`46396`) - Bug in :meth:`DataFrame.pivot_table` with ``sort=False`` results in sorted index (:issue:`17041`) -- +- Bug in :meth:`concat` when ``axis=1`` and ``sort=False`` where the resulting Index was a :class:`Int64Index` instead of a :class:`RangeIndex` (:issue:`46675`) +- Bug in :meth:`wide_to_long` raises when ``stubnames`` is missing in columns and ``i`` contains string dtype column (:issue:`46044`) Sparse ^^^^^^ - Bug in :meth:`Series.where` and :meth:`DataFrame.where` with ``SparseDtype`` failing to retain the array's ``fill_value`` (:issue:`45691`) +- Bug in :meth:`SparseArray.unique` fails to keep original elements order (:issue:`47809`) - ExtensionArray @@ -923,6 +1041,7 @@ Styler - Bug in :class:`CSSToExcelConverter` leading to ``TypeError`` when border color provided without border style for ``xlsxwriter`` engine (:issue:`42276`) - Bug in :meth:`Styler.set_sticky` leading to white text on white background in dark mode (:issue:`46984`) - Bug in :meth:`Styler.to_latex` causing ``UnboundLocalError`` when ``clines="all;data"`` and the ``DataFrame`` has no rows. (:issue:`47203`) +- Bug in :meth:`Styler.to_excel` when using ``vertical-align: middle;`` with ``xlsxwriter`` engine (:issue:`30107`) Metadata ^^^^^^^^ @@ -935,7 +1054,7 @@ Other .. ***DO NOT USE THIS SECTION*** -- +- Bug in :func:`.assert_index_equal` with ``names=True`` and ``check_order=False`` not checking names (:issue:`47328`) - .. 
--------------------------------------------------------------------------- diff --git a/environment.yml b/environment.yml index 1f1583354339c..eb4d53e116927 100644 --- a/environment.yml +++ b/environment.yml @@ -1,21 +1,85 @@ +# Local development dependencies including docs building, website upload, ASV benchmark name: pandas-dev channels: - conda-forge dependencies: - # required - - numpy>=1.19.5 - python=3.8 - - python-dateutil>=2.8.1 + + # test dependencies + - cython=0.29.30 + - pytest>=6.0 + - pytest-cov + - pytest-xdist>=1.31 + - psutil + - pytest-asyncio>=0.17 + - boto3 + + # required dependencies + - python-dateutil + - numpy - pytz + # optional dependencies + - beautifulsoup4 + - blosc + - brotlipy + - bottleneck + - fastparquet + - fsspec + - html5lib + - hypothesis + - gcsfs + - jinja2 + - lxml + - matplotlib + - numba>=0.53.1 + - numexpr>=2.8.0 # pin for "Run checks on imported code" job + - openpyxl + - odfpy + - pandas-gbq + - psycopg2 + - pyarrow + - pymysql + - pyreadstat + - pytables + - python-snappy + - pyxlsb + - s3fs + - scipy + - sqlalchemy + - tabulate + - xarray + - xlrd + - xlsxwriter + - xlwt + - zstandard + + # downstream packages + - aiobotocore<2.0.0 # GH#44311 pinned to fix docbuild + - botocore + - cftime + - dask + - ipython + - geopandas-base + - seaborn + - scikit-learn + - statsmodels + - coverage + - pandas-datareader + - pyyaml + - py + - pytorch + + # local testing dependencies + - moto + - flask + # benchmarks - asv - # building # The compiler packages are meta-packages and install the correct compiler (activation) packages on the respective platforms. - c-compiler - cxx-compiler - - cython>=0.29.30 # code checks - black=22.3.0 @@ -32,10 +96,11 @@ dependencies: # documentation - gitpython # obtain contributors from git for whatsnew - gitdb + - natsort # DataFrame.sort_values doctest - numpydoc - pandas-dev-flaker=0.5.0 - pydata-sphinx-theme=0.8.0 - - pytest-cython + - pytest-cython # doctest - sphinx - sphinx-panels - types-python-dateutil @@ -47,77 +112,19 @@ dependencies: - nbconvert>=6.4.5 - nbsphinx - pandoc - - # Dask and its dependencies (that dont install with dask) - - dask-core - - toolz>=0.7.3 - - partd>=0.3.10 - - cloudpickle>=0.2.1 - - # web (jinja2 is also needed, but it's also an optional pandas dependency) - - markdown - - feedparser - - pyyaml - - requests - - # testing - - boto3 - - botocore>=1.11 - - hypothesis>=5.5.3 - - moto # mock S3 - - flask - - pytest>=6.0 - - pytest-cov - - pytest-xdist>=1.31 - - pytest-asyncio>=0.17 - - pytest-instafail - - # downstream tests - - seaborn - - statsmodels - - # unused (required indirectly may be?) 
- ipywidgets - nbformat - notebook>=6.0.3 - - # optional - - blosc - - bottleneck>=1.3.1 - ipykernel - - ipython>=7.11.1 - - jinja2 # pandas.Styler - - matplotlib>=3.3.2 # pandas.plotting, Series.plot, DataFrame.plot - - numexpr>=2.7.1 - - scipy>=1.4.1 - - numba>=0.50.1 - # optional for io - # --------------- - # pd.read_html - - beautifulsoup4>=4.8.2 - - html5lib - - lxml - - # pd.read_excel, DataFrame.to_excel, pd.ExcelWriter, pd.ExcelFile - - openpyxl - - xlrd - - xlsxwriter - - xlwt - - odfpy - - - fastparquet>=0.4.0 # pandas.read_parquet, DataFrame.to_parquet - - pyarrow>2.0.1 # pandas.read_parquet, DataFrame.to_parquet, pandas.read_feather, DataFrame.to_feather - - python-snappy # required by pyarrow + # web + - jinja2 # in optional dependencies, but documented here as needed + - markdown + - feedparser + - pyyaml + - requests - - pytables>=3.6.1 # pandas.read_hdf, DataFrame.to_hdf - - s3fs>=0.4.0 # file IO when using 's3://...' path - - aiobotocore<2.0.0 # GH#44311 pinned to fix docbuild - - fsspec>=0.7.4 # for generic remote file operations - - gcsfs>=0.6.0 # file IO when using 'gcs://...' path - - sqlalchemy # pandas.read_sql, DataFrame.to_sql - - xarray # DataFrame.to_xarray - - cftime # Needed for downstream xarray.CFTimeIndex test - - pyreadstat # pandas.read_spss - - tabulate>=0.8.3 # DataFrame.to_markdown - - natsort # DataFrame.sort_values + # build the interactive terminal + - jupyterlab >=3.4,<4 + - pip: + - jupyterlite==0.1.0b10 diff --git a/pandas/__init__.py b/pandas/__init__.py index 3645e8744d8af..eb5ce71141f46 100644 --- a/pandas/__init__.py +++ b/pandas/__init__.py @@ -1,4 +1,4 @@ -# flake8: noqa +from __future__ import annotations __docformat__ = "restructuredtext" @@ -19,7 +19,7 @@ del _hard_dependencies, _dependency, _missing_dependencies # numpy compat -from pandas.compat import is_numpy_dev as _is_numpy_dev +from pandas.compat import is_numpy_dev as _is_numpy_dev # pyright: ignore # noqa:F401 try: from pandas._libs import hashtable as _hashtable, lib as _lib, tslib as _tslib @@ -43,7 +43,7 @@ ) # let init-time option registration happen -import pandas.core.config_init +import pandas.core.config_init # pyright: ignore # noqa:F401 from pandas.core.api import ( # dtype @@ -128,11 +128,13 @@ pivot, pivot_table, get_dummies, + from_dummies, cut, qcut, ) -from pandas import api, arrays, errors, io, plotting, testing, tseries +from pandas import api, arrays, errors, io, plotting, tseries +from pandas import testing # noqa:PDF015 from pandas.util._print_versions import show_versions from pandas.io.api import ( @@ -184,7 +186,7 @@ __deprecated_num_index_names = ["Float64Index", "Int64Index", "UInt64Index"] -def __dir__(): +def __dir__() -> list[str]: # GH43028 # Int64Index etc. are deprecated, but we still want them to be available in the dir. # Remove in Pandas 2.0, when we remove Int64Index etc. from the code base. 
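The two hunks around this point expose the new ``from_dummies`` at the top level of the ``pandas`` namespace (imported from ``pandas.core.api`` above, added to ``__all__`` below). A minimal sketch of the intended round-trip with ``get_dummies``, assuming a ``from_dummies(data, sep=None, default_category=None)`` style signature that is not itself shown in this excerpt::

    import pandas as pd

    df = pd.DataFrame({"animal": ["cat", "dog", "cat"]})

    # get_dummies produces one indicator column per category: animal_cat, animal_dog
    dummies = pd.get_dummies(df)

    # from_dummies inverts that encoding; sep tells it how prefix and category were joined
    roundtrip = pd.from_dummies(dummies, sep="_")
    # roundtrip is a DataFrame with a single "animal" column: cat, dog, cat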
@@ -361,6 +363,7 @@ def __getattr__(name): "eval", "factorize", "get_dummies", + "from_dummies", "get_option", "infer_freq", "interval_range", diff --git a/pandas/_config/__init__.py b/pandas/_config/__init__.py index 65936a9fcdbf3..929f8a5af6b3f 100644 --- a/pandas/_config/__init__.py +++ b/pandas/_config/__init__.py @@ -16,7 +16,7 @@ "options", ] from pandas._config import config -from pandas._config import dates # noqa:F401 +from pandas._config import dates # pyright: ignore # noqa:F401 from pandas._config.config import ( describe_option, get_option, diff --git a/pandas/_config/config.py b/pandas/_config/config.py index 756c3b2d4b2b6..eacbf1b016432 100644 --- a/pandas/_config/config.py +++ b/pandas/_config/config.py @@ -60,6 +60,7 @@ Callable, Generic, Iterable, + Iterator, NamedTuple, cast, ) @@ -435,13 +436,13 @@ def __init__(self, *args) -> None: self.ops = list(zip(args[::2], args[1::2])) - def __enter__(self): + def __enter__(self) -> None: self.undo = [(pat, _get_option(pat, silent=True)) for pat, val in self.ops] for pat, val in self.ops: _set_option(pat, val, silent=True) - def __exit__(self, *args): + def __exit__(self, *args) -> None: if self.undo: for pat, val in self.undo: _set_option(pat, val, silent=True) @@ -733,7 +734,7 @@ def pp(name: str, ks: Iterable[str]) -> list[str]: @contextmanager -def config_prefix(prefix): +def config_prefix(prefix) -> Iterator[None]: """ contextmanager for multiple invocations of API with a common prefix diff --git a/pandas/_config/dates.py b/pandas/_config/dates.py index 5bf2b49ce5904..b37831f96eb73 100644 --- a/pandas/_config/dates.py +++ b/pandas/_config/dates.py @@ -1,6 +1,8 @@ """ config for datetime formatting """ +from __future__ import annotations + from pandas._config import config as cf pc_date_dayfirst_doc = """ diff --git a/pandas/_config/localization.py b/pandas/_config/localization.py index fa5503029fd4b..c4355e954c67c 100644 --- a/pandas/_config/localization.py +++ b/pandas/_config/localization.py @@ -39,7 +39,8 @@ def set_locale( particular locale, without globally setting the locale. This probably isn't thread-safe. """ - current_locale = locale.getlocale() + # getlocale is not always compliant with setlocale, use setlocale. GH#46595 + current_locale = locale.setlocale(lc_var) try: locale.setlocale(lc_var, new_locale) diff --git a/pandas/_libs/algos.pyi b/pandas/_libs/algos.pyi index 0cc9209fbdfc5..f55ff0ae8b574 100644 --- a/pandas/_libs/algos.pyi +++ b/pandas/_libs/algos.pyi @@ -42,7 +42,7 @@ def groupsort_indexer( np.ndarray, # ndarray[int64_t, ndim=1] ]: ... def kth_smallest( - a: np.ndarray, # numeric[:] + arr: np.ndarray, # numeric[:] k: int, ) -> Any: ... # numeric @@ -129,18 +129,11 @@ def diff_2d( ) -> None: ... def ensure_platform_int(arr: object) -> npt.NDArray[np.intp]: ... def ensure_object(arr: object) -> npt.NDArray[np.object_]: ... -def ensure_complex64(arr: object, copy=...) -> npt.NDArray[np.complex64]: ... -def ensure_complex128(arr: object, copy=...) -> npt.NDArray[np.complex128]: ... def ensure_float64(arr: object, copy=...) -> npt.NDArray[np.float64]: ... -def ensure_float32(arr: object, copy=...) -> npt.NDArray[np.float32]: ... def ensure_int8(arr: object, copy=...) -> npt.NDArray[np.int8]: ... def ensure_int16(arr: object, copy=...) -> npt.NDArray[np.int16]: ... def ensure_int32(arr: object, copy=...) -> npt.NDArray[np.int32]: ... def ensure_int64(arr: object, copy=...) -> npt.NDArray[np.int64]: ... -def ensure_uint8(arr: object, copy=...) -> npt.NDArray[np.uint8]: ... 
-def ensure_uint16(arr: object, copy=...) -> npt.NDArray[np.uint16]: ... -def ensure_uint32(arr: object, copy=...) -> npt.NDArray[np.uint32]: ... -def ensure_uint64(arr: object, copy=...) -> npt.NDArray[np.uint64]: ... def take_1d_int8_int8( values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=... ) -> None: ... diff --git a/pandas/_libs/algos.pyx b/pandas/_libs/algos.pyx index d33eba06988e9..c05d6a300ccf0 100644 --- a/pandas/_libs/algos.pyx +++ b/pandas/_libs/algos.pyx @@ -180,6 +180,8 @@ def is_lexsorted(list_of_arrays: list) -> bint: else: result = False break + if not result: + break free(vecs) return result @@ -322,6 +324,7 @@ def kth_smallest(numeric_t[::1] arr, Py_ssize_t k) -> numeric_t: @cython.boundscheck(False) @cython.wraparound(False) +@cython.cdivision(True) def nancorr(const float64_t[:, :] mat, bint cov=False, minp=None): cdef: Py_ssize_t i, j, xi, yi, N, K @@ -354,8 +357,8 @@ def nancorr(const float64_t[:, :] mat, bint cov=False, minp=None): nobs += 1 dx = vx - meanx dy = vy - meany - meanx += 1 / nobs * dx - meany += 1 / nobs * dy + meanx += 1. / nobs * dx + meany += 1. / nobs * dy ssqdmx += (vx - meanx) * dx ssqdmy += (vy - meany) * dy covxy += (vx - meanx) * dy diff --git a/pandas/_libs/arrays.pyx b/pandas/_libs/arrays.pyx index 8895a2bcfca89..f63d16e819c92 100644 --- a/pandas/_libs/arrays.pyx +++ b/pandas/_libs/arrays.pyx @@ -157,7 +157,7 @@ cdef class NDArrayBacked: return self._from_backing_data(res_values) # TODO: pass NPY_MAXDIMS equiv to axis=None? - def repeat(self, repeats, axis: int = 0): + def repeat(self, repeats, axis: int | np.integer = 0): if axis is None: axis = 0 res_values = cnp.PyArray_Repeat(self._ndarray, repeats, axis) diff --git a/pandas/_libs/groupby.pyi b/pandas/_libs/groupby.pyi index 2f0c3980c0c02..c7cb9705d7cb9 100644 --- a/pandas/_libs/groupby.pyi +++ b/pandas/_libs/groupby.pyi @@ -105,8 +105,9 @@ def group_last( values: np.ndarray, # ndarray[rank_t, ndim=2] labels: np.ndarray, # const int64_t[:] mask: npt.NDArray[np.bool_] | None, - result_mask: npt.NDArray[np.bool_] | None, + result_mask: npt.NDArray[np.bool_] | None = ..., min_count: int = ..., # Py_ssize_t + is_datetimelike: bool = ..., ) -> None: ... def group_nth( out: np.ndarray, # rank_t[:, ::1] @@ -114,9 +115,10 @@ def group_nth( values: np.ndarray, # ndarray[rank_t, ndim=2] labels: np.ndarray, # const int64_t[:] mask: npt.NDArray[np.bool_] | None, - result_mask: npt.NDArray[np.bool_] | None, + result_mask: npt.NDArray[np.bool_] | None = ..., min_count: int = ..., # int64_t rank: int = ..., # int64_t + is_datetimelike: bool = ..., ) -> None: ... def group_rank( out: np.ndarray, # float64_t[:, ::1] @@ -124,7 +126,7 @@ def group_rank( labels: np.ndarray, # const int64_t[:] ngroups: int, is_datetimelike: bool, - ties_method: Literal["aveage", "min", "max", "first", "dense"] = ..., + ties_method: Literal["average", "min", "max", "first", "dense"] = ..., ascending: bool = ..., pct: bool = ..., na_option: Literal["keep", "top", "bottom"] = ..., @@ -136,6 +138,7 @@ def group_max( values: np.ndarray, # ndarray[groupby_t, ndim=2] labels: np.ndarray, # const int64_t[:] min_count: int = ..., + is_datetimelike: bool = ..., mask: np.ndarray | None = ..., result_mask: np.ndarray | None = ..., ) -> None: ... @@ -145,6 +148,7 @@ def group_min( values: np.ndarray, # ndarray[groupby_t, ndim=2] labels: np.ndarray, # const int64_t[:] min_count: int = ..., + is_datetimelike: bool = ..., mask: np.ndarray | None = ..., result_mask: np.ndarray | None = ..., ) -> None: ... 
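The ``groupby.pyi`` stubs above thread ``mask``/``result_mask`` (and ``is_datetimelike``) through the grouped reduction kernels; at the user level these arguments are what let nullable (masked) dtypes pass through a reduction without an intermediate float cast. A small, hedged illustration of that behaviour::

    import pandas as pd

    df = pd.DataFrame(
        {
            "key": ["a", "a", "b", "b"],
            "val": pd.array([1, None, 3, 4], dtype="Int64"),  # masked (nullable) integers
        }
    )

    # The masked kernel path keeps the nullable dtype and treats <NA> as missing
    # instead of first casting the column to float64.
    result = df.groupby("key")["val"].max()
    # result.dtype is Int64; group "a" reduces over its single non-NA value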
@@ -154,6 +158,9 @@ def group_cummin( labels: np.ndarray, # const int64_t[:] ngroups: int, is_datetimelike: bool, + mask: np.ndarray | None = ..., + result_mask: np.ndarray | None = ..., + skipna: bool = ..., ) -> None: ... def group_cummax( out: np.ndarray, # groupby_t[:, ::1] @@ -161,4 +168,7 @@ def group_cummax( labels: np.ndarray, # const int64_t[:] ngroups: int, is_datetimelike: bool, + mask: np.ndarray | None = ..., + result_mask: np.ndarray | None = ..., + skipna: bool = ..., ) -> None: ... diff --git a/pandas/_libs/index.pyi b/pandas/_libs/index.pyi index 68ecf201285c7..575f83847b1b6 100644 --- a/pandas/_libs/index.pyi +++ b/pandas/_libs/index.pyi @@ -69,7 +69,7 @@ class BaseMultiIndexCodesEngine: ) -> npt.NDArray[np.intp]: ... class ExtensionEngine: - def __init__(self, values: "ExtensionArray"): ... + def __init__(self, values: ExtensionArray): ... def __contains__(self, val: object) -> bool: ... def get_loc(self, val: object) -> int | slice | np.ndarray: ... def get_indexer(self, values: np.ndarray) -> npt.NDArray[np.intp]: ... diff --git a/pandas/_libs/internals.pyi b/pandas/_libs/internals.pyi index 6a90fbc729580..201c7b7b565cc 100644 --- a/pandas/_libs/internals.pyi +++ b/pandas/_libs/internals.pyi @@ -32,7 +32,7 @@ def update_blklocs_and_blknos( loc: int, nblocks: int, ) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]: ... - +@final class BlockPlacement: def __init__(self, val: int | slice | np.ndarray): ... @property diff --git a/pandas/_libs/interval.pyi b/pandas/_libs/interval.pyi index d177e597478d9..bad0f2bab93d8 100644 --- a/pandas/_libs/interval.pyi +++ b/pandas/_libs/interval.pyi @@ -4,7 +4,6 @@ from typing import ( Any, Generic, TypeVar, - Union, overload, ) @@ -13,12 +12,12 @@ import numpy.typing as npt from pandas._libs import lib from pandas._typing import ( - IntervalClosedType, + IntervalInclusiveType, Timedelta, Timestamp, ) -VALID_CLOSED: frozenset[str] +VALID_INCLUSIVE: frozenset[str] _OrderableScalarT = TypeVar("_OrderableScalarT", int, float) _OrderableTimesT = TypeVar("_OrderableTimesT", Timestamp, Timedelta) @@ -53,11 +52,13 @@ class IntervalMixin: def open_right(self) -> bool: ... @property def is_empty(self) -> bool: ... - def _check_closed_matches(self, other: IntervalMixin, name: str = ...) -> None: ... + def _check_inclusive_matches( + self, other: IntervalMixin, name: str = ... + ) -> None: ... def _warning_interval( inclusive, closed -) -> tuple[IntervalClosedType, lib.NoDefault]: ... +) -> tuple[IntervalInclusiveType, lib.NoDefault]: ... class Interval(IntervalMixin, Generic[_OrderableT]): @property @@ -65,15 +66,17 @@ class Interval(IntervalMixin, Generic[_OrderableT]): @property def right(self: Interval[_OrderableT]) -> _OrderableT: ... @property - def inclusive(self) -> IntervalClosedType: ... + def inclusive(self) -> IntervalInclusiveType: ... + @property + def closed(self) -> IntervalInclusiveType: ... mid: _MidDescriptor length: _LengthDescriptor def __init__( self, left: _OrderableT, right: _OrderableT, - inclusive: IntervalClosedType = ..., - closed: IntervalClosedType = ..., + inclusive: IntervalInclusiveType = ..., + closed: IntervalInclusiveType = ..., ) -> None: ... def __hash__(self) -> int: ... @overload @@ -81,11 +84,7 @@ class Interval(IntervalMixin, Generic[_OrderableT]): self: Interval[_OrderableTimesT], key: _OrderableTimesT ) -> bool: ... @overload - def __contains__( - self: Interval[_OrderableScalarT], key: Union[int, float] - ) -> bool: ... - def __repr__(self) -> str: ... - def __str__(self) -> str: ... 
+ def __contains__(self: Interval[_OrderableScalarT], key: int | float) -> bool: ... @overload def __add__( self: Interval[_OrderableTimesT], y: Timedelta @@ -95,7 +94,7 @@ class Interval(IntervalMixin, Generic[_OrderableT]): self: Interval[int], y: _OrderableScalarT ) -> Interval[_OrderableScalarT]: ... @overload - def __add__(self: Interval[float], y: Union[int, float]) -> Interval[float]: ... + def __add__(self: Interval[float], y: int | float) -> Interval[float]: ... @overload def __radd__( self: Interval[_OrderableTimesT], y: Timedelta @@ -105,7 +104,7 @@ class Interval(IntervalMixin, Generic[_OrderableT]): self: Interval[int], y: _OrderableScalarT ) -> Interval[_OrderableScalarT]: ... @overload - def __radd__(self: Interval[float], y: Union[int, float]) -> Interval[float]: ... + def __radd__(self: Interval[float], y: int | float) -> Interval[float]: ... @overload def __sub__( self: Interval[_OrderableTimesT], y: Timedelta @@ -115,7 +114,7 @@ class Interval(IntervalMixin, Generic[_OrderableT]): self: Interval[int], y: _OrderableScalarT ) -> Interval[_OrderableScalarT]: ... @overload - def __sub__(self: Interval[float], y: Union[int, float]) -> Interval[float]: ... + def __sub__(self: Interval[float], y: int | float) -> Interval[float]: ... @overload def __rsub__( self: Interval[_OrderableTimesT], y: Timedelta @@ -125,45 +124,43 @@ class Interval(IntervalMixin, Generic[_OrderableT]): self: Interval[int], y: _OrderableScalarT ) -> Interval[_OrderableScalarT]: ... @overload - def __rsub__(self: Interval[float], y: Union[int, float]) -> Interval[float]: ... + def __rsub__(self: Interval[float], y: int | float) -> Interval[float]: ... @overload def __mul__( self: Interval[int], y: _OrderableScalarT ) -> Interval[_OrderableScalarT]: ... @overload - def __mul__(self: Interval[float], y: Union[int, float]) -> Interval[float]: ... + def __mul__(self: Interval[float], y: int | float) -> Interval[float]: ... @overload def __rmul__( self: Interval[int], y: _OrderableScalarT ) -> Interval[_OrderableScalarT]: ... @overload - def __rmul__(self: Interval[float], y: Union[int, float]) -> Interval[float]: ... + def __rmul__(self: Interval[float], y: int | float) -> Interval[float]: ... @overload def __truediv__( self: Interval[int], y: _OrderableScalarT ) -> Interval[_OrderableScalarT]: ... @overload - def __truediv__(self: Interval[float], y: Union[int, float]) -> Interval[float]: ... + def __truediv__(self: Interval[float], y: int | float) -> Interval[float]: ... @overload def __floordiv__( self: Interval[int], y: _OrderableScalarT ) -> Interval[_OrderableScalarT]: ... @overload - def __floordiv__( - self: Interval[float], y: Union[int, float] - ) -> Interval[float]: ... + def __floordiv__(self: Interval[float], y: int | float) -> Interval[float]: ... def overlaps(self: Interval[_OrderableT], other: Interval[_OrderableT]) -> bool: ... def intervals_to_interval_bounds( - intervals: np.ndarray, validate_closed: bool = ... -) -> tuple[np.ndarray, np.ndarray, str]: ... + intervals: np.ndarray, validate_inclusive: bool = ... +) -> tuple[np.ndarray, np.ndarray, IntervalInclusiveType]: ... class IntervalTree(IntervalMixin): def __init__( self, left: np.ndarray, right: np.ndarray, - inclusive: IntervalClosedType = ..., + inclusive: IntervalInclusiveType = ..., leaf_size: int = ..., ) -> None: ... 
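The Interval-related stubs above, together with the ``interval.pyx`` changes that follow, rename the ``closed`` concept to ``inclusive`` and keep ``closed`` only as a deprecated property. A short sketch of the behaviour those hunks implement, with names and defaults taken from the diff itself rather than from a released API::

    import warnings
    import pandas as pd

    # New spelling: 'inclusive' describes which endpoints belong to the interval.
    iv = pd.Interval(0, 5, inclusive="right")   # the half-open interval (0, 5]

    iv.inclusive    # "right"
    2 in iv         # True; 5 in iv is True, 0 in iv is False

    # The old attribute survives as a deprecated alias that warns before returning.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", FutureWarning)
        assert iv.closed == iv.inclusive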
@property diff --git a/pandas/_libs/interval.pyx b/pandas/_libs/interval.pyx index 178836ff1548b..bc0a63c5c5a33 100644 --- a/pandas/_libs/interval.pyx +++ b/pandas/_libs/interval.pyx @@ -9,6 +9,8 @@ from cpython.datetime cimport ( import_datetime, ) +from pandas.util._exceptions import find_stack_level + import_datetime() cimport cython @@ -54,7 +56,7 @@ from pandas._libs.tslibs.util cimport ( is_timedelta64_object, ) -VALID_CLOSED = frozenset(['both', 'neither', 'left', 'right']) +VALID_INCLUSIVE = frozenset(['both', 'neither', 'left', 'right']) cdef class IntervalMixin: @@ -83,7 +85,7 @@ cdef class IntervalMixin: Returns ------- bool - True if the Interval is closed on the left-side. + True if the Interval is closed on the right-side. """ return self.inclusive in ('right', 'both') @@ -97,7 +99,7 @@ cdef class IntervalMixin: Returns ------- bool - True if the Interval is closed on the left-side. + True if the Interval is not closed on the left-side. """ return not self.closed_left @@ -111,7 +113,7 @@ cdef class IntervalMixin: Returns ------- bool - True if the Interval is closed on the left-side. + True if the Interval is not closed on the right-side. """ return not self.closed_right @@ -186,7 +188,7 @@ cdef class IntervalMixin: """ return (self.right == self.left) & (self.inclusive != 'both') - def _check_closed_matches(self, other, name='other'): + def _check_inclusive_matches(self, other, name='other'): """ Check if the inclusive attribute of `other` matches. @@ -201,7 +203,7 @@ cdef class IntervalMixin: Raises ------ ValueError - When `other` is not closed exactly the same as self. + When `other` is not inclusive exactly the same as self. """ if self.inclusive != other.inclusive: raise ValueError(f"'{name}.inclusive' is {repr(other.inclusive)}, " @@ -229,7 +231,7 @@ def _warning_interval(inclusive: str | None = None, closed: None | lib.NoDefault stacklevel=2, ) if closed is None: - inclusive = "both" + inclusive = "right" elif closed in ("both", "neither", "left", "right"): inclusive = closed else: @@ -257,14 +259,14 @@ cdef class Interval(IntervalMixin): .. deprecated:: 1.5.0 inclusive : {'both', 'neither', 'left', 'right'}, default 'both' - Whether the interval is closed on the left-side, right-side, both or + Whether the interval is inclusive on the left-side, right-side, both or neither. See the Notes for more detailed explanation. .. versionadded:: 1.5.0 See Also -------- - IntervalIndex : An Index of Interval objects that are all closed on the + IntervalIndex : An Index of Interval objects that are all inclusive on the same side. cut : Convert continuous data into discrete bins (Categorical of Interval objects). @@ -277,13 +279,13 @@ cdef class Interval(IntervalMixin): The parameters `left` and `right` must be from the same type, you must be able to compare them and they must satisfy ``left <= right``. - A closed interval (in mathematics denoted by square brackets) contains - its endpoints, i.e. the closed interval ``[0, 5]`` is characterized by the + A inclusive interval (in mathematics denoted by square brackets) contains + its endpoints, i.e. the inclusive interval ``[0, 5]`` is characterized by the conditions ``0 <= x <= 5``. This is what ``inclusive='both'`` stands for. An open interval (in mathematics denoted by parentheses) does not contain its endpoints, i.e. the open interval ``(0, 5)`` is characterized by the conditions ``0 < x < 5``. This is what ``inclusive='neither'`` stands for. - Intervals can also be half-open or half-closed, i.e. 
``[0, 5)`` is + Intervals can also be half-open or half-inclusive, i.e. ``[0, 5)`` is described by ``0 <= x < 5`` (``inclusive='left'``) and ``(0, 5]`` is described by ``0 < x <= 5`` (``inclusive='right'``). @@ -350,7 +352,7 @@ cdef class Interval(IntervalMixin): cdef readonly str inclusive """ - Whether the interval is closed on the left-side, right-side, both or + Whether the interval is inclusive on the left-side, right-side, both or neither. """ @@ -364,9 +366,9 @@ cdef class Interval(IntervalMixin): inclusive, closed = _warning_interval(inclusive, closed) if inclusive is None: - inclusive = "both" + inclusive = "right" - if inclusive not in VALID_CLOSED: + if inclusive not in VALID_INCLUSIVE: raise ValueError(f"invalid option for 'inclusive': {inclusive}") if not left <= right: raise ValueError("left side of interval must be <= right side") @@ -379,6 +381,21 @@ cdef class Interval(IntervalMixin): self.right = right self.inclusive = inclusive + @property + def closed(self): + """ + Whether the interval is closed on the left-side, right-side, both or + neither. + + .. deprecated:: 1.5.0 + """ + warnings.warn( + "Attribute `closed` is deprecated in favor of `inclusive`.", + FutureWarning, + stacklevel=find_stack_level(), + ) + return self.inclusive + def _validate_endpoint(self, endpoint): # GH 23013 if not (is_integer_object(endpoint) or is_float_object(endpoint) or @@ -505,7 +522,7 @@ cdef class Interval(IntervalMixin): """ Check whether two Interval objects overlap. - Two intervals overlap if they share a common point, including closed + Two intervals overlap if they share a common point, including inclusive endpoints. Intervals that only have an open endpoint in common do not overlap. @@ -534,7 +551,7 @@ cdef class Interval(IntervalMixin): >>> i1.overlaps(i3) False - Intervals that share closed endpoints overlap: + Intervals that share inclusive endpoints overlap: >>> i4 = pd.Interval(0, 1, inclusive='both') >>> i5 = pd.Interval(1, 2, inclusive='both') @@ -551,7 +568,7 @@ cdef class Interval(IntervalMixin): raise TypeError("`other` must be an Interval, " f"got {type(other).__name__}") - # equality is okay if both endpoints are closed (overlap at a point) + # equality is okay if both endpoints are inclusive (overlap at a point) op1 = le if (self.closed_left and other.closed_right) else lt op2 = le if (other.closed_left and self.closed_right) else lt @@ -563,16 +580,16 @@ cdef class Interval(IntervalMixin): @cython.wraparound(False) @cython.boundscheck(False) -def intervals_to_interval_bounds(ndarray intervals, bint validate_closed=True): +def intervals_to_interval_bounds(ndarray intervals, bint validate_inclusive=True): """ Parameters ---------- intervals : ndarray Object array of Intervals / nulls. - validate_closed: bool, default True - Boolean indicating if all intervals must be closed on the same side. - Mismatching closed will raise if True, else return None for closed. + validate_inclusive: bool, default True + Boolean indicating if all intervals must be inclusive on the same side. + Mismatching inclusive will raise if True, else return None for inclusive. 
Returns ------- @@ -585,7 +602,7 @@ def intervals_to_interval_bounds(ndarray intervals, bint validate_closed=True): object inclusive = None, interval Py_ssize_t i, n = len(intervals) ndarray left, right - bint seen_closed = False + bint seen_inclusive = False left = np.empty(n, dtype=intervals.dtype) right = np.empty(n, dtype=intervals.dtype) @@ -603,13 +620,13 @@ def intervals_to_interval_bounds(ndarray intervals, bint validate_closed=True): left[i] = interval.left right[i] = interval.right - if not seen_closed: - seen_closed = True + if not seen_inclusive: + seen_inclusive = True inclusive = interval.inclusive elif inclusive != interval.inclusive: inclusive = None - if validate_closed: - raise ValueError("intervals must all be closed on the same side") + if validate_inclusive: + raise ValueError("intervals must all be inclusive on the same side") return left, right, inclusive diff --git a/pandas/_libs/intervaltree.pxi.in b/pandas/_libs/intervaltree.pxi.in index 9842332bae7ef..8bf1a53d56dfb 100644 --- a/pandas/_libs/intervaltree.pxi.in +++ b/pandas/_libs/intervaltree.pxi.in @@ -8,8 +8,6 @@ import warnings from pandas._libs import lib from pandas._libs.algos import is_monotonic -from pandas._libs.interval import _warning_interval - ctypedef fused int_scalar_t: int64_t float64_t @@ -42,18 +40,13 @@ cdef class IntervalTree(IntervalMixin): object _is_overlapping, _left_sorter, _right_sorter Py_ssize_t _na_count - def __init__(self, left, right, inclusive: str | None = None, closed: None | lib.NoDefault = lib.no_default, leaf_size=100): + def __init__(self, left, right, inclusive: str | None = None, leaf_size=100): """ Parameters ---------- left, right : np.ndarray[ndim=1] Left and right bounds for each interval. Assumed to contain no NaNs. - closed : {'left', 'right', 'both', 'neither'}, optional - Whether the intervals are closed on the left-side, right-side, both - or neither. Defaults to 'right'. - - .. deprecated:: 1.5.0 inclusive : {"both", "neither", "left", "right"}, optional Whether the intervals are closed on the left-side, right-side, both @@ -66,10 +59,8 @@ cdef class IntervalTree(IntervalMixin): to brute-force search. Tune this parameter to optimize query performance. """ - inclusive, closed = _warning_interval(inclusive, closed) - if inclusive is None: - inclusive = "both" + inclusive = "right" if inclusive not in ['left', 'right', 'both', 'neither']: raise ValueError("invalid option for 'inclusive': %s" % inclusive) @@ -119,7 +110,7 @@ cdef class IntervalTree(IntervalMixin): if self._is_overlapping is not None: return self._is_overlapping - # <= when both sides closed since endpoints can overlap + # <= when inclusive on both sides since endpoints can overlap op = le if self.inclusive == 'both' else lt # overlap if start of current interval < end of previous interval @@ -263,7 +254,7 @@ cdef class IntervalNode: # we need specialized nodes and leaves to optimize for different dtype and -# closed values +# inclusive values {{py: diff --git a/pandas/_libs/join.pyi b/pandas/_libs/join.pyi index a5e91e2ce83eb..8d02f8f57dee1 100644 --- a/pandas/_libs/join.pyi +++ b/pandas/_libs/join.pyi @@ -56,6 +56,7 @@ def asof_join_backward_on_X_by_Y( right_by_values: np.ndarray, # by_t[:] allow_exact_matches: bool = ..., tolerance: np.number | int | float | None = ..., + use_hashtable: bool = ..., ) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]: ... 
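The ``asof_join_*_on_X_by_Y`` signatures above (now taking a ``use_hashtable`` switch) are the kernels behind ``pd.merge_asof`` when a ``by`` key is supplied. A user-level example of the operation they implement::

    import pandas as pd

    trades = pd.DataFrame(
        {
            "time": pd.to_datetime(
                ["2022-01-03 09:30:00.023", "2022-01-03 09:30:00.038"]
            ),
            "ticker": ["MSFT", "AAPL"],
            "price": [51.95, 98.00],
        }
    )
    quotes = pd.DataFrame(
        {
            "time": pd.to_datetime(
                ["2022-01-03 09:30:00.015", "2022-01-03 09:30:00.030"]
            ),
            "ticker": ["MSFT", "AAPL"],
            "bid": [51.93, 97.99],
        }
    )

    # Backward as-of join: for every trade, take the most recent quote at or before
    # its timestamp, matching rows within each ticker group.
    merged = pd.merge_asof(trades, quotes, on="time", by="ticker", direction="backward")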
def asof_join_forward_on_X_by_Y( left_values: np.ndarray, # asof_t[:] @@ -64,6 +65,7 @@ def asof_join_forward_on_X_by_Y( right_by_values: np.ndarray, # by_t[:] allow_exact_matches: bool = ..., tolerance: np.number | int | float | None = ..., + use_hashtable: bool = ..., ) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]: ... def asof_join_nearest_on_X_by_Y( left_values: np.ndarray, # asof_t[:] @@ -72,22 +74,5 @@ def asof_join_nearest_on_X_by_Y( right_by_values: np.ndarray, # by_t[:] allow_exact_matches: bool = ..., tolerance: np.number | int | float | None = ..., -) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]: ... -def asof_join_backward( - left_values: np.ndarray, # asof_t[:] - right_values: np.ndarray, # asof_t[:] - allow_exact_matches: bool = ..., - tolerance: np.number | int | float | None = ..., -) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]: ... -def asof_join_forward( - left_values: np.ndarray, # asof_t[:] - right_values: np.ndarray, # asof_t[:] - allow_exact_matches: bool = ..., - tolerance: np.number | int | float | None = ..., -) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]: ... -def asof_join_nearest( - left_values: np.ndarray, # asof_t[:] - right_values: np.ndarray, # asof_t[:] - allow_exact_matches: bool = ..., - tolerance: np.number | int | float | None = ..., + use_hashtable: bool = ..., ) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]: ... diff --git a/pandas/_libs/lib.pyi b/pandas/_libs/lib.pyi index 02f021128cbed..77d3cbe92bef9 100644 --- a/pandas/_libs/lib.pyi +++ b/pandas/_libs/lib.pyi @@ -213,7 +213,7 @@ def count_level_2d( def get_level_sorter( label: np.ndarray, # const int64_t[:] starts: np.ndarray, # const intp_t[:] -) -> np.ndarray: ... # np.ndarray[np.intp, ndim=1] +) -> np.ndarray: ... # np.ndarray[np.intp, ndim=1] def generate_bins_dt64( values: npt.NDArray[np.int64], binner: np.ndarray, # const int64_t[:] diff --git a/pandas/_libs/lib.pyx b/pandas/_libs/lib.pyx index 2136c410ef4a0..e353d224708b7 100644 --- a/pandas/_libs/lib.pyx +++ b/pandas/_libs/lib.pyx @@ -17,6 +17,7 @@ from cpython.number cimport PyNumber_Check from cpython.object cimport ( Py_EQ, PyObject_RichCompareBool, + PyTypeObject, ) from cpython.ref cimport Py_INCREF from cpython.sequence cimport PySequence_Check @@ -54,6 +55,11 @@ from numpy cimport ( cnp.import_array() +cdef extern from "Python.h": + # Note: importing extern-style allows us to declare these as nogil + # functions, whereas `from cpython cimport` does not. + bint PyObject_TypeCheck(object obj, PyTypeObject* type) nogil + cdef extern from "numpy/arrayobject.h": # cython's numpy.dtype specification is incorrect, which leads to # errors in issubclass(self.dtype.type, np.bool_), so we directly @@ -71,6 +77,9 @@ cdef extern from "numpy/arrayobject.h": object fields tuple names + PyTypeObject PySignedIntegerArrType_Type + PyTypeObject PyUnsignedIntegerArrType_Type + cdef extern from "numpy/ndarrayobject.h": bint PyArray_CheckScalar(obj) nogil @@ -1283,9 +1292,9 @@ cdef class Seen: In addition to setting a flag that an integer was seen, we also set two flags depending on the type of integer seen: - 1) sint_ : a negative (signed) number in the + 1) sint_ : a signed numpy integer type or a negative (signed) number in the range of [-2**63, 0) was encountered - 2) uint_ : a positive number in the range of + 2) uint_ : an unsigned numpy integer type or a positive number in the range of [2**63, 2**64) was encountered Parameters @@ -1294,8 +1303,18 @@ cdef class Seen: Value with which to set the flags. 
""" self.int_ = True - self.sint_ = self.sint_ or (oINT64_MIN <= val < 0) - self.uint_ = self.uint_ or (oINT64_MAX < val <= oUINT64_MAX) + self.sint_ = ( + self.sint_ + or (oINT64_MIN <= val < 0) + # Cython equivalent of `isinstance(val, np.signedinteger)` + or PyObject_TypeCheck(val, &PySignedIntegerArrType_Type) + ) + self.uint_ = ( + self.uint_ + or (oINT64_MAX < val <= oUINT64_MAX) + # Cython equivalent of `isinstance(val, np.unsignedinteger)` + or PyObject_TypeCheck(val, &PyUnsignedIntegerArrType_Type) + ) @property def numeric_(self): @@ -2542,7 +2561,6 @@ def maybe_convert_objects(ndarray[object] objects, floats[i] = val complexes[i] = val if not seen.null_: - val = int(val) seen.saw_int(val) if ((seen.uint_ and seen.sint_) or diff --git a/pandas/_libs/missing.pyi b/pandas/_libs/missing.pyi index 3a4cc9def07bd..27f227558dee5 100644 --- a/pandas/_libs/missing.pyi +++ b/pandas/_libs/missing.pyi @@ -1,7 +1,8 @@ import numpy as np from numpy import typing as npt -class NAType: ... +class NAType: + def __new__(cls, *args, **kwargs): ... NA: NAType diff --git a/pandas/_libs/parsers.pyi b/pandas/_libs/parsers.pyi index 01f5d5802ccd5..6b0bbf183f07e 100644 --- a/pandas/_libs/parsers.pyi +++ b/pandas/_libs/parsers.pyi @@ -63,7 +63,6 @@ class TextReader: skip_blank_lines: bool = ..., encoding_errors: bytes | str = ..., ): ... - def set_error_bad_lines(self, status: int) -> None: ... def set_noconvert(self, i: int) -> None: ... def remove_noconvert(self, i: int) -> None: ... def close(self) -> None: ... diff --git a/pandas/_libs/sparse.pyi b/pandas/_libs/sparse.pyi index aa5388025f6f2..be5d251b2aea6 100644 --- a/pandas/_libs/sparse.pyi +++ b/pandas/_libs/sparse.pyi @@ -7,7 +7,7 @@ import numpy as np from pandas._typing import npt -SparseIndexT = TypeVar("SparseIndexT", bound="SparseIndex") +_SparseIndexT = TypeVar("_SparseIndexT", bound=SparseIndex) class SparseIndex: length: int @@ -24,8 +24,8 @@ class SparseIndex: def lookup_array(self, indexer: npt.NDArray[np.int32]) -> npt.NDArray[np.int32]: ... def to_int_index(self) -> IntIndex: ... def to_block_index(self) -> BlockIndex: ... - def intersect(self: SparseIndexT, y_: SparseIndex) -> SparseIndexT: ... - def make_union(self: SparseIndexT, y_: SparseIndex) -> SparseIndexT: ... + def intersect(self: _SparseIndexT, y_: SparseIndex) -> _SparseIndexT: ... + def make_union(self: _SparseIndexT, y_: SparseIndex) -> _SparseIndexT: ... 
class IntIndex(SparseIndex): indices: npt.NDArray[np.int32] diff --git a/pandas/_libs/src/ujson/python/objToJSON.c b/pandas/_libs/src/ujson/python/objToJSON.c index 73d2a1f786f8b..260f1ffb6165f 100644 --- a/pandas/_libs/src/ujson/python/objToJSON.c +++ b/pandas/_libs/src/ujson/python/objToJSON.c @@ -238,8 +238,10 @@ static PyObject *get_values(PyObject *obj) { PyErr_Clear(); } else if (PyObject_HasAttrString(values, "__array__")) { // We may have gotten a Categorical or Sparse array so call np.array + PyObject *array_values = PyObject_CallMethod(values, "__array__", + NULL); Py_DECREF(values); - values = PyObject_CallMethod(values, "__array__", NULL); + values = array_values; } else if (!PyArray_CheckExact(values)) { // Didn't get a numpy array, so keep trying Py_DECREF(values); diff --git a/pandas/_libs/src/ujson/python/ujson.c b/pandas/_libs/src/ujson/python/ujson.c index def06cdf2db84..5d4a5693c0ff6 100644 --- a/pandas/_libs/src/ujson/python/ujson.c +++ b/pandas/_libs/src/ujson/python/ujson.c @@ -43,7 +43,7 @@ Numeric decoder derived from TCL library /* objToJSON */ PyObject *objToJSON(PyObject *self, PyObject *args, PyObject *kwargs); -void initObjToJSON(void); +void *initObjToJSON(void); /* JSONToObj */ PyObject *JSONToObj(PyObject *self, PyObject *args, PyObject *kwargs); diff --git a/pandas/_libs/tslib.pyi b/pandas/_libs/tslib.pyi index 4b02235ac9925..2212f8db8ea1e 100644 --- a/pandas/_libs/tslib.pyi +++ b/pandas/_libs/tslib.pyi @@ -9,6 +9,7 @@ def format_array_from_datetime( tz: tzinfo | None = ..., format: str | None = ..., na_rep: object = ..., + reso: int = ..., # NPY_DATETIMEUNIT ) -> npt.NDArray[np.object_]: ... def array_with_unit_to_datetime( values: np.ndarray, diff --git a/pandas/_libs/tslib.pyx b/pandas/_libs/tslib.pyx index e6bbf52ab1272..9c7f35d240f96 100644 --- a/pandas/_libs/tslib.pyx +++ b/pandas/_libs/tslib.pyx @@ -28,11 +28,12 @@ import pytz from pandas._libs.tslibs.np_datetime cimport ( NPY_DATETIMEUNIT, + NPY_FR_ns, check_dts_bounds, - dt64_to_dtstruct, dtstruct_to_dt64, get_datetime64_value, npy_datetimestruct, + pandas_datetime_to_datetimestruct, pydate_to_dt64, pydatetime_to_dt64, string_to_dts, @@ -104,10 +105,11 @@ def _test_parse_iso8601(ts: str): @cython.wraparound(False) @cython.boundscheck(False) def format_array_from_datetime( - ndarray[int64_t] values, + ndarray values, tzinfo tz=None, str format=None, - object na_rep=None + object na_rep=None, + NPY_DATETIMEUNIT reso=NPY_FR_ns, ) -> np.ndarray: """ return a np object array of the string formatted values @@ -120,40 +122,71 @@ def format_array_from_datetime( a strftime capable string na_rep : optional, default is None a nat format + reso : NPY_DATETIMEUNIT, default NPY_FR_ns Returns ------- np.ndarray[object] """ cdef: - int64_t val, ns, N = len(values) + int64_t val, ns, N = values.size bint show_ms = False, show_us = False, show_ns = False - bint basic_format = False - ndarray[object] result = cnp.PyArray_EMPTY(values.ndim, values.shape, cnp.NPY_OBJECT, 0) + bint basic_format = False, basic_format_day = False _Timestamp ts - str res + object res npy_datetimestruct dts + # Note that `result` (and thus `result_flat`) is C-order and + # `it` iterates C-order as well, so the iteration matches + # See discussion at + # github.com/pandas-dev/pandas/pull/46886#discussion_r860261305 + ndarray result = cnp.PyArray_EMPTY(values.ndim, values.shape, cnp.NPY_OBJECT, 0) + object[::1] res_flat = result.ravel() # should NOT be a copy + cnp.flatiter it = cnp.PyArray_IterNew(values) + if na_rep is 
None: na_rep = 'NaT' - # if we don't have a format nor tz, then choose - # a format based on precision - basic_format = format is None and tz is None - if basic_format: - reso_obj = get_resolution(values) - show_ns = reso_obj == Resolution.RESO_NS - show_us = reso_obj == Resolution.RESO_US - show_ms = reso_obj == Resolution.RESO_MS + if tz is None: + # if we don't have a format nor tz, then choose + # a format based on precision + basic_format = format is None + if basic_format: + reso_obj = get_resolution(values, tz=tz, reso=reso) + show_ns = reso_obj == Resolution.RESO_NS + show_us = reso_obj == Resolution.RESO_US + show_ms = reso_obj == Resolution.RESO_MS + + elif format == "%Y-%m-%d %H:%M:%S": + # Same format as default, but with hardcoded precision (s) + basic_format = True + show_ns = show_us = show_ms = False + + elif format == "%Y-%m-%d %H:%M:%S.%f": + # Same format as default, but with hardcoded precision (us) + basic_format = show_us = True + show_ns = show_ms = False + + elif format == "%Y-%m-%d": + # Default format for dates + basic_format_day = True + + assert not (basic_format_day and basic_format) for i in range(N): - val = values[i] + # Analogous to: utc_val = values[i] + val = (cnp.PyArray_ITER_DATA(it))[0] if val == NPY_NAT: - result[i] = na_rep + res = na_rep + elif basic_format_day: + + pandas_datetime_to_datetimestruct(val, reso, &dts) + res = f'{dts.year}-{dts.month:02d}-{dts.day:02d}' + elif basic_format: - dt64_to_dtstruct(val, &dts) + pandas_datetime_to_datetimestruct(val, reso, &dts) res = (f'{dts.year}-{dts.month:02d}-{dts.day:02d} ' f'{dts.hour:02d}:{dts.min:02d}:{dts.sec:02d}') @@ -165,22 +198,32 @@ def format_array_from_datetime( elif show_ms: res += f'.{dts.us // 1000:03d}' - result[i] = res - else: - ts = Timestamp(val, tz=tz) + ts = Timestamp._from_value_and_reso(val, reso=reso, tz=tz) if format is None: - result[i] = str(ts) + # Use datetime.str, that returns ts.isoformat(sep=' ') + res = str(ts) else: # invalid format string # requires dates > 1900 try: # Note: dispatches to pydatetime - result[i] = ts.strftime(format) + res = ts.strftime(format) except ValueError: - result[i] = str(ts) + # Use datetime.str, that returns ts.isoformat(sep=' ') + res = str(ts) + + # Note: we can index result directly instead of using PyArray_MultiIter_DATA + # like we do for the other functions because result is known C-contiguous + # and is the first argument to PyArray_MultiIterNew2. The usual pattern + # does not seem to work with object dtype. 
+ # See discussion at + # github.com/pandas-dev/pandas/pull/46886#discussion_r860261305 + res_flat[i] = res + + cnp.PyArray_ITER_NEXT(it) return result diff --git a/pandas/_libs/tslibs/__init__.py b/pandas/_libs/tslibs/__init__.py index 68452ce011f9d..47143b32d6dbe 100644 --- a/pandas/_libs/tslibs/__init__.py +++ b/pandas/_libs/tslibs/__init__.py @@ -29,13 +29,17 @@ "astype_overflowsafe", "get_unit_from_dtype", "periods_per_day", + "periods_per_second", + "is_supported_unit", ] from pandas._libs.tslibs import dtypes from pandas._libs.tslibs.conversion import localize_pydatetime from pandas._libs.tslibs.dtypes import ( Resolution, + is_supported_unit, periods_per_day, + periods_per_second, ) from pandas._libs.tslibs.nattype import ( NaT, diff --git a/pandas/_libs/tslibs/ccalendar.pxd b/pandas/_libs/tslibs/ccalendar.pxd index 511c9f94a47d8..341f2176f5eb4 100644 --- a/pandas/_libs/tslibs/ccalendar.pxd +++ b/pandas/_libs/tslibs/ccalendar.pxd @@ -15,8 +15,6 @@ cpdef int32_t get_day_of_year(int year, int month, int day) nogil cpdef int get_lastbday(int year, int month) nogil cpdef int get_firstbday(int year, int month) nogil -cdef int64_t DAY_NANOS -cdef int64_t HOUR_NANOS cdef dict c_MONTH_NUMBERS cdef int32_t* month_offset diff --git a/pandas/_libs/tslibs/ccalendar.pyi b/pandas/_libs/tslibs/ccalendar.pyi index 5d5b935ffa54b..993f18a61d74a 100644 --- a/pandas/_libs/tslibs/ccalendar.pyi +++ b/pandas/_libs/tslibs/ccalendar.pyi @@ -8,7 +8,5 @@ def get_firstbday(year: int, month: int) -> int: ... def get_lastbday(year: int, month: int) -> int: ... def get_day_of_year(year: int, month: int, day: int) -> int: ... def get_iso_calendar(year: int, month: int, day: int) -> tuple[int, int, int]: ... -def is_leapyear(year: int) -> bool: ... def get_week_of_year(year: int, month: int, day: int) -> int: ... def get_days_in_month(year: int, month: int) -> int: ... -def dayofweek(y: int, m: int, d: int) -> int: ... diff --git a/pandas/_libs/tslibs/ccalendar.pyx b/pandas/_libs/tslibs/ccalendar.pyx index ff6f1721ca6c9..00ee15b73f551 100644 --- a/pandas/_libs/tslibs/ccalendar.pyx +++ b/pandas/_libs/tslibs/ccalendar.pyx @@ -47,11 +47,6 @@ DAYS_FULL = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', int_to_weekday = {num: name for num, name in enumerate(DAYS)} weekday_to_int = {int_to_weekday[key]: key for key in int_to_weekday} -DAY_SECONDS = 86400 -HOUR_SECONDS = 3600 - -cdef const int64_t DAY_NANOS = DAY_SECONDS * 1_000_000_000 -cdef const int64_t HOUR_NANOS = HOUR_SECONDS * 1_000_000_000 # ---------------------------------------------------------------------- diff --git a/pandas/_libs/tslibs/conversion.pxd b/pandas/_libs/tslibs/conversion.pxd index fb0c7d71ad58f..637a84998751f 100644 --- a/pandas/_libs/tslibs/conversion.pxd +++ b/pandas/_libs/tslibs/conversion.pxd @@ -27,7 +27,8 @@ cdef _TSObject convert_to_tsobject(object ts, tzinfo tz, str unit, int32_t nanos=*) cdef _TSObject convert_datetime_to_tsobject(datetime ts, tzinfo tz, - int32_t nanos=*) + int32_t nanos=*, + NPY_DATETIMEUNIT reso=*) cdef int64_t get_datetime64_nanos(object val) except? 
-1 diff --git a/pandas/_libs/tslibs/conversion.pyi b/pandas/_libs/tslibs/conversion.pyi index 7e2ebb1b3bad3..d564d767f7f05 100644 --- a/pandas/_libs/tslibs/conversion.pyi +++ b/pandas/_libs/tslibs/conversion.pyi @@ -5,8 +5,6 @@ from datetime import ( import numpy as np -from pandas._typing import npt - DT64NS_DTYPE: np.dtype TD64NS_DTYPE: np.dtype diff --git a/pandas/_libs/tslibs/conversion.pyx b/pandas/_libs/tslibs/conversion.pyx index 4e1fcbbcdcc61..0dfb859a3444f 100644 --- a/pandas/_libs/tslibs/conversion.pyx +++ b/pandas/_libs/tslibs/conversion.pyx @@ -31,20 +31,26 @@ from cpython.datetime cimport ( import_datetime() from pandas._libs.tslibs.base cimport ABCTimestamp +from pandas._libs.tslibs.dtypes cimport ( + abbrev_to_npy_unit, + periods_per_second, +) from pandas._libs.tslibs.np_datetime cimport ( NPY_DATETIMEUNIT, NPY_FR_ns, astype_overflowsafe, check_dts_bounds, - dt64_to_dtstruct, dtstruct_to_dt64, get_datetime64_unit, get_datetime64_value, + get_implementation_bounds, get_unit_from_dtype, npy_datetime, npy_datetimestruct, + npy_datetimestruct_to_datetime, pandas_datetime_to_datetimestruct, pydatetime_to_dt64, + pydatetime_to_dtstruct, string_to_dts, ) @@ -135,35 +141,40 @@ cpdef inline (int64_t, int) precision_from_unit(str unit): cdef: int64_t m int p + NPY_DATETIMEUNIT reso = abbrev_to_npy_unit(unit) - if unit == "Y": + if reso == NPY_DATETIMEUNIT.NPY_FR_Y: + # each 400 years we have 97 leap years, for an average of 97/400=.2425 + # extra days each year. We get 31556952 by writing + # 3600*24*365.2425=31556952 m = 1_000_000_000 * 31556952 p = 9 - elif unit == "M": + elif reso == NPY_DATETIMEUNIT.NPY_FR_M: + # 2629746 comes from dividing the "Y" case by 12. m = 1_000_000_000 * 2629746 p = 9 - elif unit == "W": + elif reso == NPY_DATETIMEUNIT.NPY_FR_W: m = 1_000_000_000 * 3600 * 24 * 7 p = 9 - elif unit == "D" or unit == "d": + elif reso == NPY_DATETIMEUNIT.NPY_FR_D: m = 1_000_000_000 * 3600 * 24 p = 9 - elif unit == "h": + elif reso == NPY_DATETIMEUNIT.NPY_FR_h: m = 1_000_000_000 * 3600 p = 9 - elif unit == "m": + elif reso == NPY_DATETIMEUNIT.NPY_FR_m: m = 1_000_000_000 * 60 p = 9 - elif unit == "s": + elif reso == NPY_DATETIMEUNIT.NPY_FR_s: m = 1_000_000_000 p = 9 - elif unit == "ms": + elif reso == NPY_DATETIMEUNIT.NPY_FR_ms: m = 1_000_000 p = 6 - elif unit == "us": + elif reso == NPY_DATETIMEUNIT.NPY_FR_us: m = 1000 p = 3 - elif unit == "ns" or unit is None: + elif reso == NPY_DATETIMEUNIT.NPY_FR_ns or reso == NPY_DATETIMEUNIT.NPY_FR_GENERIC: m = 1 p = 0 else: @@ -240,7 +251,7 @@ cdef _TSObject convert_to_tsobject(object ts, tzinfo tz, str unit, elif is_datetime64_object(ts): obj.value = get_datetime64_nanos(ts) if obj.value != NPY_NAT: - dt64_to_dtstruct(obj.value, &obj.dts) + pandas_datetime_to_datetimestruct(obj.value, NPY_FR_ns, &obj.dts) elif is_integer_object(ts): try: ts = ts @@ -258,7 +269,7 @@ cdef _TSObject convert_to_tsobject(object ts, tzinfo tz, str unit, ts = ts * cast_from_unit(None, unit) obj.value = ts - dt64_to_dtstruct(ts, &obj.dts) + pandas_datetime_to_datetimestruct(ts, NPY_FR_ns, &obj.dts) elif is_float_object(ts): if ts != ts or ts == NPY_NAT: obj.value = NPY_NAT @@ -281,7 +292,7 @@ cdef _TSObject convert_to_tsobject(object ts, tzinfo tz, str unit, ts = cast_from_unit(ts, unit) obj.value = ts - dt64_to_dtstruct(ts, &obj.dts) + pandas_datetime_to_datetimestruct(ts, NPY_FR_ns, &obj.dts) elif PyDateTime_Check(ts): return convert_datetime_to_tsobject(ts, tz, nanos) elif PyDate_Check(ts): @@ -307,11 +318,15 @@ cdef maybe_localize_tso(_TSObject obj, tzinfo 
tz, NPY_DATETIMEUNIT reso): if obj.value != NPY_NAT: # check_overflows needs to run after _localize_tso check_dts_bounds(&obj.dts, reso) - check_overflows(obj) + check_overflows(obj, reso) -cdef _TSObject convert_datetime_to_tsobject(datetime ts, tzinfo tz, - int32_t nanos=0): +cdef _TSObject convert_datetime_to_tsobject( + datetime ts, + tzinfo tz, + int32_t nanos=0, + NPY_DATETIMEUNIT reso=NPY_FR_ns, +): """ Convert a datetime (or Timestamp) input `ts`, along with optional timezone object `tz` to a _TSObject. @@ -327,6 +342,7 @@ cdef _TSObject convert_datetime_to_tsobject(datetime ts, tzinfo tz, timezone for the timezone-aware output nanos : int32_t, default is 0 nanoseconds supplement the precision of the datetime input ts + reso : NPY_DATETIMEUNIT, default NPY_FR_ns Returns ------- @@ -334,6 +350,7 @@ cdef _TSObject convert_datetime_to_tsobject(datetime ts, tzinfo tz, """ cdef: _TSObject obj = _TSObject() + int64_t pps obj.fold = ts.fold if tz is not None: @@ -342,34 +359,35 @@ cdef _TSObject convert_datetime_to_tsobject(datetime ts, tzinfo tz, if ts.tzinfo is not None: # Convert the current timezone to the passed timezone ts = ts.astimezone(tz) - obj.value = pydatetime_to_dt64(ts, &obj.dts) + pydatetime_to_dtstruct(ts, &obj.dts) obj.tzinfo = ts.tzinfo elif not is_utc(tz): ts = _localize_pydatetime(ts, tz) - obj.value = pydatetime_to_dt64(ts, &obj.dts) + pydatetime_to_dtstruct(ts, &obj.dts) obj.tzinfo = ts.tzinfo else: # UTC - obj.value = pydatetime_to_dt64(ts, &obj.dts) + pydatetime_to_dtstruct(ts, &obj.dts) obj.tzinfo = tz else: - obj.value = pydatetime_to_dt64(ts, &obj.dts) + pydatetime_to_dtstruct(ts, &obj.dts) obj.tzinfo = ts.tzinfo - if obj.tzinfo is not None and not is_utc(obj.tzinfo): - offset = get_utcoffset(obj.tzinfo, ts) - obj.value -= int(offset.total_seconds() * 1e9) - if isinstance(ts, ABCTimestamp): - obj.value += ts.nanosecond obj.dts.ps = ts.nanosecond * 1000 if nanos: - obj.value += nanos obj.dts.ps = nanos * 1000 - check_dts_bounds(&obj.dts) - check_overflows(obj) + obj.value = npy_datetimestruct_to_datetime(reso, &obj.dts) + + if obj.tzinfo is not None and not is_utc(obj.tzinfo): + offset = get_utcoffset(obj.tzinfo, ts) + pps = periods_per_second(reso) + obj.value -= int(offset.total_seconds() * pps) + + check_dts_bounds(&obj.dts, reso) + check_overflows(obj, reso) return obj @@ -401,7 +419,7 @@ cdef _TSObject _create_tsobject_tz_using_offset(npy_datetimestruct dts, obj.tzinfo = pytz.FixedOffset(tzoffset) obj.value = tz_localize_to_utc_single(value, obj.tzinfo) if tz is None: - check_overflows(obj) + check_overflows(obj, NPY_FR_ns) return obj cdef: @@ -515,13 +533,14 @@ cdef _TSObject _convert_str_to_tsobject(object ts, tzinfo tz, str unit, return convert_datetime_to_tsobject(dt, tz) -cdef inline check_overflows(_TSObject obj): +cdef inline check_overflows(_TSObject obj, NPY_DATETIMEUNIT reso=NPY_FR_ns): """ Check that we haven't silently overflowed in timezone conversion Parameters ---------- obj : _TSObject + reso : NPY_DATETIMEUNIT, default NPY_FR_ns Returns ------- @@ -532,7 +551,12 @@ cdef inline check_overflows(_TSObject obj): OutOfBoundsDatetime """ # GH#12677 - if obj.dts.year == 1677: + cdef: + npy_datetimestruct lb, ub + + get_implementation_bounds(reso, &lb, &ub) + + if obj.dts.year == lb.year: if not (obj.value < 0): from pandas._libs.tslibs.timestamps import Timestamp fmt = (f"{obj.dts.year}-{obj.dts.month:02d}-{obj.dts.day:02d} " @@ -540,7 +564,7 @@ cdef inline check_overflows(_TSObject obj): raise OutOfBoundsDatetime( f"Converting {fmt} underflows 
past {Timestamp.min}" ) - elif obj.dts.year == 2262: + elif obj.dts.year == ub.year: if not (obj.value > 0): from pandas._libs.tslibs.timestamps import Timestamp fmt = (f"{obj.dts.year}-{obj.dts.month:02d}-{obj.dts.day:02d} " diff --git a/pandas/_libs/tslibs/dtypes.pxd b/pandas/_libs/tslibs/dtypes.pxd index e16a389bc5459..352680143113d 100644 --- a/pandas/_libs/tslibs/dtypes.pxd +++ b/pandas/_libs/tslibs/dtypes.pxd @@ -3,11 +3,11 @@ from numpy cimport int64_t from pandas._libs.tslibs.np_datetime cimport NPY_DATETIMEUNIT -cdef str npy_unit_to_abbrev(NPY_DATETIMEUNIT unit) +cpdef str npy_unit_to_abbrev(NPY_DATETIMEUNIT unit) +cdef NPY_DATETIMEUNIT abbrev_to_npy_unit(str abbrev) cdef NPY_DATETIMEUNIT freq_group_code_to_npy_unit(int freq) nogil cpdef int64_t periods_per_day(NPY_DATETIMEUNIT reso=*) except? -1 -cdef int64_t periods_per_second(NPY_DATETIMEUNIT reso) except? -1 -cdef int64_t get_conversion_factor(NPY_DATETIMEUNIT from_unit, NPY_DATETIMEUNIT to_unit) except? -1 +cpdef int64_t periods_per_second(NPY_DATETIMEUNIT reso) except? -1 cdef dict attrname_to_abbrevs diff --git a/pandas/_libs/tslibs/dtypes.pyi b/pandas/_libs/tslibs/dtypes.pyi index 5c343f89f38ea..82f62e16c4205 100644 --- a/pandas/_libs/tslibs/dtypes.pyi +++ b/pandas/_libs/tslibs/dtypes.pyi @@ -6,16 +6,21 @@ _attrname_to_abbrevs: dict[str, str] _period_code_map: dict[str, int] def periods_per_day(reso: int) -> int: ... +def periods_per_second(reso: int) -> int: ... +def is_supported_unit(reso: int) -> bool: ... +def npy_unit_to_abbrev(reso: int) -> str: ... class PeriodDtypeBase: _dtype_code: int # PeriodDtypeCode # actually __cinit__ def __new__(cls, code: int): ... + @property def _freq_group_code(self) -> int: ... @property def _resolution_obj(self) -> Resolution: ... def _get_to_timestamp_base(self) -> int: ... + @property def _freqstr(self) -> str: ... 
class FreqGroup(Enum): diff --git a/pandas/_libs/tslibs/dtypes.pyx b/pandas/_libs/tslibs/dtypes.pyx index f843f6ccdfc58..c09ac2a686d5c 100644 --- a/pandas/_libs/tslibs/dtypes.pyx +++ b/pandas/_libs/tslibs/dtypes.pyx @@ -4,7 +4,10 @@ cimport cython from enum import Enum -from pandas._libs.tslibs.np_datetime cimport NPY_DATETIMEUNIT +from pandas._libs.tslibs.np_datetime cimport ( + NPY_DATETIMEUNIT, + get_conversion_factor, +) cdef class PeriodDtypeBase: @@ -277,7 +280,16 @@ class NpyDatetimeUnit(Enum): NPY_FR_GENERIC = NPY_DATETIMEUNIT.NPY_FR_GENERIC -cdef str npy_unit_to_abbrev(NPY_DATETIMEUNIT unit): +def is_supported_unit(NPY_DATETIMEUNIT reso): + return ( + reso == NPY_DATETIMEUNIT.NPY_FR_ns + or reso == NPY_DATETIMEUNIT.NPY_FR_us + or reso == NPY_DATETIMEUNIT.NPY_FR_ms + or reso == NPY_DATETIMEUNIT.NPY_FR_s + ) + + +cpdef str npy_unit_to_abbrev(NPY_DATETIMEUNIT unit): if unit == NPY_DATETIMEUNIT.NPY_FR_ns or unit == NPY_DATETIMEUNIT.NPY_FR_GENERIC: # generic -> default to nanoseconds return "ns" @@ -313,6 +325,39 @@ cdef str npy_unit_to_abbrev(NPY_DATETIMEUNIT unit): raise NotImplementedError(unit) +cdef NPY_DATETIMEUNIT abbrev_to_npy_unit(str abbrev): + if abbrev == "Y": + return NPY_DATETIMEUNIT.NPY_FR_Y + elif abbrev == "M": + return NPY_DATETIMEUNIT.NPY_FR_M + elif abbrev == "W": + return NPY_DATETIMEUNIT.NPY_FR_W + elif abbrev == "D" or abbrev == "d": + return NPY_DATETIMEUNIT.NPY_FR_D + elif abbrev == "h": + return NPY_DATETIMEUNIT.NPY_FR_h + elif abbrev == "m": + return NPY_DATETIMEUNIT.NPY_FR_m + elif abbrev == "s": + return NPY_DATETIMEUNIT.NPY_FR_s + elif abbrev == "ms": + return NPY_DATETIMEUNIT.NPY_FR_ms + elif abbrev == "us": + return NPY_DATETIMEUNIT.NPY_FR_us + elif abbrev == "ns": + return NPY_DATETIMEUNIT.NPY_FR_ns + elif abbrev == "ps": + return NPY_DATETIMEUNIT.NPY_FR_ps + elif abbrev == "fs": + return NPY_DATETIMEUNIT.NPY_FR_fs + elif abbrev == "as": + return NPY_DATETIMEUNIT.NPY_FR_as + elif abbrev is None: + return NPY_DATETIMEUNIT.NPY_FR_GENERIC + else: + raise ValueError(f"Unrecognized unit {abbrev}") + + cdef NPY_DATETIMEUNIT freq_group_code_to_npy_unit(int freq) nogil: """ Convert the freq to the corresponding NPY_DATETIMEUNIT to pass @@ -344,83 +389,11 @@ cpdef int64_t periods_per_day(NPY_DATETIMEUNIT reso=NPY_DATETIMEUNIT.NPY_FR_ns) """ How many of the given time units fit into a single day? """ - cdef: - int64_t day_units - - if reso == NPY_DATETIMEUNIT.NPY_FR_ps: - # pico is the smallest unit for which we don't overflow, so - # we exclude femto and atto - day_units = 24 * 3600 * 1_000_000_000_000 - elif reso == NPY_DATETIMEUNIT.NPY_FR_ns: - day_units = 24 * 3600 * 1_000_000_000 - elif reso == NPY_DATETIMEUNIT.NPY_FR_us: - day_units = 24 * 3600 * 1_000_000 - elif reso == NPY_DATETIMEUNIT.NPY_FR_ms: - day_units = 24 * 3600 * 1_000 - elif reso == NPY_DATETIMEUNIT.NPY_FR_s: - day_units = 24 * 3600 - elif reso == NPY_DATETIMEUNIT.NPY_FR_m: - day_units = 24 * 60 - elif reso == NPY_DATETIMEUNIT.NPY_FR_h: - day_units = 24 - elif reso == NPY_DATETIMEUNIT.NPY_FR_D: - day_units = 1 - else: - raise NotImplementedError(reso) - return day_units - - -cdef int64_t periods_per_second(NPY_DATETIMEUNIT reso) except? 
-1: - if reso == NPY_DATETIMEUNIT.NPY_FR_ns: - return 1_000_000_000 - elif reso == NPY_DATETIMEUNIT.NPY_FR_us: - return 1_000_000 - elif reso == NPY_DATETIMEUNIT.NPY_FR_ms: - return 1_000 - elif reso == NPY_DATETIMEUNIT.NPY_FR_s: - return 1 - else: - raise NotImplementedError(reso) + return get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_D, reso) -@cython.overflowcheck(True) -cdef int64_t get_conversion_factor(NPY_DATETIMEUNIT from_unit, NPY_DATETIMEUNIT to_unit) except? -1: - """ - Find the factor by which we need to multiply to convert from from_unit to to_unit. - """ - if ( - from_unit == NPY_DATETIMEUNIT.NPY_FR_GENERIC - or to_unit == NPY_DATETIMEUNIT.NPY_FR_GENERIC - ): - raise ValueError("unit-less resolutions are not supported") - if from_unit > to_unit: - raise ValueError - - if from_unit == to_unit: - return 1 - - if from_unit == NPY_DATETIMEUNIT.NPY_FR_W: - return 7 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_D, to_unit) - elif from_unit == NPY_DATETIMEUNIT.NPY_FR_D: - return 24 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_h, to_unit) - elif from_unit == NPY_DATETIMEUNIT.NPY_FR_h: - return 60 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_m, to_unit) - elif from_unit == NPY_DATETIMEUNIT.NPY_FR_m: - return 60 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_s, to_unit) - elif from_unit == NPY_DATETIMEUNIT.NPY_FR_s: - return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_ms, to_unit) - elif from_unit == NPY_DATETIMEUNIT.NPY_FR_ms: - return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_us, to_unit) - elif from_unit == NPY_DATETIMEUNIT.NPY_FR_us: - return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_ns, to_unit) - elif from_unit == NPY_DATETIMEUNIT.NPY_FR_ns: - return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_ps, to_unit) - elif from_unit == NPY_DATETIMEUNIT.NPY_FR_ps: - return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_fs, to_unit) - elif from_unit == NPY_DATETIMEUNIT.NPY_FR_fs: - return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_as, to_unit) - else: - raise ValueError(from_unit, to_unit) +cpdef int64_t periods_per_second(NPY_DATETIMEUNIT reso) except? -1: + return get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_s, reso) cdef dict _reso_str_map = { diff --git a/pandas/_libs/tslibs/nattype.pyi b/pandas/_libs/tslibs/nattype.pyi index efadd8f0220b3..0aa80330b15bc 100644 --- a/pandas/_libs/tslibs/nattype.pyi +++ b/pandas/_libs/tslibs/nattype.pyi @@ -3,10 +3,6 @@ from datetime import ( timedelta, tzinfo as _tzinfo, ) -from typing import ( - Any, - Union, -) import numpy as np @@ -16,15 +12,14 @@ NaT: NaTType iNaT: int nat_strings: set[str] -def is_null_datetimelike(val: object, inat_is_null: bool = ...) -> bool: ... - -_NaTComparisonTypes = Union[datetime, timedelta, Period, np.datetime64, np.timedelta64] +_NaTComparisonTypes = datetime | timedelta | Period | np.datetime64 | np.timedelta64 class _NatComparison: def __call__(self, other: _NaTComparisonTypes) -> bool: ... class NaTType: value: np.int64 + @property def asm8(self) -> np.datetime64: ... def to_datetime64(self) -> np.datetime64: ... def to_numpy( @@ -117,8 +112,8 @@ class NaTType: # inject Period properties @property def qyear(self) -> float: ... - def __eq__(self, other: Any) -> bool: ... - def __ne__(self, other: Any) -> bool: ... + def __eq__(self, other: object) -> bool: ... + def __ne__(self, other: object) -> bool: ... 
__lt__: _NatComparison __le__: _NatComparison __gt__: _NatComparison diff --git a/pandas/_libs/tslibs/np_datetime.pxd b/pandas/_libs/tslibs/np_datetime.pxd index d4dbcbe2acd6e..c1936e34cf8d0 100644 --- a/pandas/_libs/tslibs/np_datetime.pxd +++ b/pandas/_libs/tslibs/np_datetime.pxd @@ -76,9 +76,9 @@ cdef bint cmp_scalar(int64_t lhs, int64_t rhs, int op) except -1 cdef check_dts_bounds(npy_datetimestruct *dts, NPY_DATETIMEUNIT unit=?) cdef int64_t dtstruct_to_dt64(npy_datetimestruct* dts) nogil -cdef void dt64_to_dtstruct(int64_t dt64, npy_datetimestruct* out) nogil cdef int64_t pydatetime_to_dt64(datetime val, npy_datetimestruct *dts) +cdef void pydatetime_to_dtstruct(datetime dt, npy_datetimestruct *dts) cdef int64_t pydate_to_dt64(date val, npy_datetimestruct *dts) cdef void pydate_to_dtstruct(date val, npy_datetimestruct *dts) @@ -101,6 +101,18 @@ cpdef cnp.ndarray astype_overflowsafe( cnp.ndarray values, # ndarray[datetime64[anyunit]] cnp.dtype dtype, # ndarray[datetime64[anyunit]] bint copy=*, + bint round_ok=*, ) +cdef int64_t get_conversion_factor(NPY_DATETIMEUNIT from_unit, NPY_DATETIMEUNIT to_unit) except? -1 cdef bint cmp_dtstructs(npy_datetimestruct* left, npy_datetimestruct* right, int op) +cdef get_implementation_bounds( + NPY_DATETIMEUNIT reso, npy_datetimestruct *lower, npy_datetimestruct *upper +) + +cdef int64_t convert_reso( + int64_t value, + NPY_DATETIMEUNIT from_reso, + NPY_DATETIMEUNIT to_reso, + bint round_ok, +) except? -1 diff --git a/pandas/_libs/tslibs/np_datetime.pyi b/pandas/_libs/tslibs/np_datetime.pyi index 27871a78f8aaf..d80d26375412b 100644 --- a/pandas/_libs/tslibs/np_datetime.pyi +++ b/pandas/_libs/tslibs/np_datetime.pyi @@ -1,5 +1,7 @@ import numpy as np +from pandas._typing import npt + class OutOfBoundsDatetime(ValueError): ... class OutOfBoundsTimedelta(ValueError): ... @@ -7,6 +9,12 @@ class OutOfBoundsTimedelta(ValueError): ... def py_get_unit_from_dtype(dtype: np.dtype): ... def py_td64_to_tdstruct(td64: int, unit: int) -> dict: ... def astype_overflowsafe( - arr: np.ndarray, dtype: np.dtype, copy: bool = ... + arr: np.ndarray, + dtype: np.dtype, + copy: bool = ..., + round_ok: bool = ..., ) -> np.ndarray: ... def is_unitless(dtype: np.dtype) -> bool: ... +def compare_mismatched_resolutions( + left: np.ndarray, right: np.ndarray, op +) -> npt.NDArray[np.bool_]: ... 
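For illustration, a minimal pure-Python sketch of the lossless-downcast rule implied by the new round_ok keyword on astype_overflowsafe declared above: divide by the unit conversion factor and refuse the cast when any remainder is left. The helper name below is illustrative only and is not part of pandas.

    import numpy as np

    def downcast_exact(values_ns, factor=1_000):
        # ns -> us uses a factor of 1000; refuse to truncate silently
        quotient, remainder = np.divmod(values_ns, factor)
        if remainder.any():
            raise ValueError("Cannot losslessly cast to the coarser unit")
        return quotient

    downcast_exact(np.array([1_000, 2_000]))  # array([1, 2]) -- exact
    # downcast_exact(np.array([1_500]))       # raises: 1500 ns is not a whole number of us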
diff --git a/pandas/_libs/tslibs/np_datetime.pyx b/pandas/_libs/tslibs/np_datetime.pyx index cf967509a84c0..8ab0ba24f9151 100644 --- a/pandas/_libs/tslibs/np_datetime.pyx +++ b/pandas/_libs/tslibs/np_datetime.pyx @@ -1,4 +1,6 @@ +cimport cython from cpython.datetime cimport ( + PyDateTime_CheckExact, PyDateTime_DATE_GET_HOUR, PyDateTime_DATE_GET_MICROSECOND, PyDateTime_DATE_GET_MINUTE, @@ -20,12 +22,14 @@ from cpython.object cimport ( import_datetime() import numpy as np + cimport numpy as cnp cnp.import_array() from numpy cimport ( int64_t, ndarray, + uint8_t, ) from pandas._libs.tslibs.util cimport get_c_string_buf_and_size @@ -167,6 +171,26 @@ class OutOfBoundsTimedelta(ValueError): pass +cdef get_implementation_bounds(NPY_DATETIMEUNIT reso, npy_datetimestruct *lower, npy_datetimestruct *upper): + if reso == NPY_FR_ns: + upper[0] = _NS_MAX_DTS + lower[0] = _NS_MIN_DTS + elif reso == NPY_FR_us: + upper[0] = _US_MAX_DTS + lower[0] = _US_MIN_DTS + elif reso == NPY_FR_ms: + upper[0] = _MS_MAX_DTS + lower[0] = _MS_MIN_DTS + elif reso == NPY_FR_s: + upper[0] = _S_MAX_DTS + lower[0] = _S_MIN_DTS + elif reso == NPY_FR_m: + upper[0] = _M_MAX_DTS + lower[0] = _M_MIN_DTS + else: + raise NotImplementedError(reso) + + cdef check_dts_bounds(npy_datetimestruct *dts, NPY_DATETIMEUNIT unit=NPY_FR_ns): """Raises OutOfBoundsDatetime if the given date is outside the range that can be represented by nanosecond-resolution 64-bit integers.""" @@ -174,23 +198,7 @@ cdef check_dts_bounds(npy_datetimestruct *dts, NPY_DATETIMEUNIT unit=NPY_FR_ns): bint error = False npy_datetimestruct cmp_upper, cmp_lower - if unit == NPY_FR_ns: - cmp_upper = _NS_MAX_DTS - cmp_lower = _NS_MIN_DTS - elif unit == NPY_FR_us: - cmp_upper = _US_MAX_DTS - cmp_lower = _US_MIN_DTS - elif unit == NPY_FR_ms: - cmp_upper = _MS_MAX_DTS - cmp_lower = _MS_MIN_DTS - elif unit == NPY_FR_s: - cmp_upper = _S_MAX_DTS - cmp_lower = _S_MIN_DTS - elif unit == NPY_FR_m: - cmp_upper = _M_MAX_DTS - cmp_lower = _M_MIN_DTS - else: - raise NotImplementedError(unit) + get_implementation_bounds(unit, &cmp_lower, &cmp_upper) if cmp_npy_datetimestruct(dts, &cmp_lower) == -1: error = True @@ -213,14 +221,6 @@ cdef inline int64_t dtstruct_to_dt64(npy_datetimestruct* dts) nogil: return npy_datetimestruct_to_datetime(NPY_FR_ns, dts) -cdef inline void dt64_to_dtstruct(int64_t dt64, - npy_datetimestruct* out) nogil: - """Convenience function to call pandas_datetime_to_datetimestruct - with the by-far-most-common frequency NPY_FR_ns""" - pandas_datetime_to_datetimestruct(dt64, NPY_FR_ns, out) - return - - # just exposed for testing at the moment def py_td64_to_tdstruct(int64_t td64, NPY_DATETIMEUNIT unit): cdef: @@ -229,19 +229,29 @@ def py_td64_to_tdstruct(int64_t td64, NPY_DATETIMEUNIT unit): return tds # <- returned as a dict to python +cdef inline void pydatetime_to_dtstruct(datetime dt, npy_datetimestruct *dts): + if PyDateTime_CheckExact(dt): + dts.year = PyDateTime_GET_YEAR(dt) + else: + # We use dt.year instead of PyDateTime_GET_YEAR because with Timestamp + # we override year such that PyDateTime_GET_YEAR is incorrect. + dts.year = dt.year + + dts.month = PyDateTime_GET_MONTH(dt) + dts.day = PyDateTime_GET_DAY(dt) + dts.hour = PyDateTime_DATE_GET_HOUR(dt) + dts.min = PyDateTime_DATE_GET_MINUTE(dt) + dts.sec = PyDateTime_DATE_GET_SECOND(dt) + dts.us = PyDateTime_DATE_GET_MICROSECOND(dt) + dts.ps = dts.as = 0 + + cdef inline int64_t pydatetime_to_dt64(datetime val, npy_datetimestruct *dts): """ Note we are assuming that the datetime object is timezone-naive. 
""" - dts.year = PyDateTime_GET_YEAR(val) - dts.month = PyDateTime_GET_MONTH(val) - dts.day = PyDateTime_GET_DAY(val) - dts.hour = PyDateTime_DATE_GET_HOUR(val) - dts.min = PyDateTime_DATE_GET_MINUTE(val) - dts.sec = PyDateTime_DATE_GET_SECOND(val) - dts.us = PyDateTime_DATE_GET_MICROSECOND(val) - dts.ps = dts.as = 0 + pydatetime_to_dtstruct(val, dts) return dtstruct_to_dt64(dts) @@ -279,6 +289,7 @@ cpdef ndarray astype_overflowsafe( ndarray values, cnp.dtype dtype, bint copy=True, + bint round_ok=True, ): """ Convert an ndarray with datetime64[X] to datetime64[Y] @@ -311,10 +322,6 @@ cpdef ndarray astype_overflowsafe( "datetime64/timedelta64 values and dtype must have a unit specified" ) - if (values).dtype.byteorder == ">": - # GH#29684 we incorrectly get OutOfBoundsDatetime if we dont swap - values = values.astype(values.dtype.newbyteorder("<")) - if from_unit == to_unit: # Check this before allocating result for perf, might save some memory if copy: @@ -322,9 +329,17 @@ cpdef ndarray astype_overflowsafe( return values elif from_unit > to_unit: - # e.g. ns -> us, so there is no risk of overflow, so we can use - # numpy's astype safely. Note there _is_ risk of truncation. - return values.astype(dtype) + if round_ok: + # e.g. ns -> us, so there is no risk of overflow, so we can use + # numpy's astype safely. Note there _is_ risk of truncation. + return values.astype(dtype) + else: + iresult2 = astype_round_check(values.view("i8"), from_unit, to_unit) + return iresult2.view(dtype) + + if (values).dtype.byteorder == ">": + # GH#29684 we incorrectly get OutOfBoundsDatetime if we dont swap + values = values.astype(values.dtype.newbyteorder("<")) cdef: ndarray i8values = values.view("i8") @@ -353,10 +368,11 @@ cpdef ndarray astype_overflowsafe( check_dts_bounds(&dts, to_unit) except OutOfBoundsDatetime as err: if is_td: - tdval = np.timedelta64(value).view(values.dtype) + from_abbrev = np.datetime_data(values.dtype)[0] + np_val = np.timedelta64(value, from_abbrev) msg = ( - "Cannot convert {tdval} to {dtype} without overflow" - .format(tdval=str(tdval), dtype=str(dtype)) + "Cannot convert {np_val} to {dtype} without overflow" + .format(np_val=str(np_val), dtype=str(dtype)) ) raise OutOfBoundsTimedelta(msg) from err else: @@ -370,3 +386,221 @@ cpdef ndarray astype_overflowsafe( cnp.PyArray_MultiIter_NEXT(mi) return iresult.view(dtype) + + +# TODO: try to upstream this fix to numpy +def compare_mismatched_resolutions(ndarray left, ndarray right, op): + """ + Overflow-safe comparison of timedelta64/datetime64 with mismatched resolutions. + + >>> left = np.array([500], dtype="M8[Y]") + >>> right = np.array([0], dtype="M8[ns]") + >>> left < right # <- wrong! 
+ array([ True]) + """ + + if left.dtype.kind != right.dtype.kind or left.dtype.kind not in ["m", "M"]: + raise ValueError("left and right must both be timedelta64 or both datetime64") + + cdef: + int op_code = op_to_op_code(op) + NPY_DATETIMEUNIT left_unit = get_unit_from_dtype(left.dtype) + NPY_DATETIMEUNIT right_unit = get_unit_from_dtype(right.dtype) + + # equiv: result = np.empty((left).shape, dtype="bool") + ndarray result = cnp.PyArray_EMPTY( + left.ndim, left.shape, cnp.NPY_BOOL, 0 + ) + + ndarray lvalues = left.view("i8") + ndarray rvalues = right.view("i8") + + cnp.broadcast mi = cnp.PyArray_MultiIterNew3(result, lvalues, rvalues) + int64_t lval, rval + bint res_value + + Py_ssize_t i, N = left.size + npy_datetimestruct ldts, rdts + + + for i in range(N): + # Analogous to: lval = lvalues[i] + lval = (cnp.PyArray_MultiIter_DATA(mi, 1))[0] + + # Analogous to: rval = rvalues[i] + rval = (cnp.PyArray_MultiIter_DATA(mi, 2))[0] + + if lval == NPY_DATETIME_NAT or rval == NPY_DATETIME_NAT: + res_value = op_code == Py_NE + + else: + pandas_datetime_to_datetimestruct(lval, left_unit, &ldts) + pandas_datetime_to_datetimestruct(rval, right_unit, &rdts) + + res_value = cmp_dtstructs(&ldts, &rdts, op_code) + + # Analogous to: result[i] = res_value + (cnp.PyArray_MultiIter_DATA(mi, 0))[0] = res_value + + cnp.PyArray_MultiIter_NEXT(mi) + + return result + + +import operator + + +cdef int op_to_op_code(op): + # TODO: should exist somewhere? + if op is operator.eq: + return Py_EQ + if op is operator.ne: + return Py_NE + if op is operator.le: + return Py_LE + if op is operator.lt: + return Py_LT + if op is operator.ge: + return Py_GE + if op is operator.gt: + return Py_GT + + +cdef ndarray astype_round_check( + ndarray i8values, + NPY_DATETIMEUNIT from_unit, + NPY_DATETIMEUNIT to_unit +): + # cases with from_unit > to_unit, e.g. ns->us, raise if the conversion + # involves truncation, e.g. 1500ns->1us + cdef: + Py_ssize_t i, N = i8values.size + + # equiv: iresult = np.empty((i8values).shape, dtype="i8") + ndarray iresult = cnp.PyArray_EMPTY( + i8values.ndim, i8values.shape, cnp.NPY_INT64, 0 + ) + cnp.broadcast mi = cnp.PyArray_MultiIterNew2(iresult, i8values) + + # Note the arguments to_unit, from unit are swapped vs how they + # are passed when going to a higher-frequency reso. + int64_t mult = get_conversion_factor(to_unit, from_unit) + int64_t value, mod + + for i in range(N): + # Analogous to: item = i8values[i] + value = (cnp.PyArray_MultiIter_DATA(mi, 1))[0] + + if value == NPY_DATETIME_NAT: + new_value = NPY_DATETIME_NAT + else: + new_value, mod = divmod(value, mult) + if mod != 0: + # TODO: avoid runtime import + from pandas._libs.tslibs.dtypes import npy_unit_to_abbrev + from_abbrev = npy_unit_to_abbrev(from_unit) + to_abbrev = npy_unit_to_abbrev(to_unit) + raise ValueError( + f"Cannot losslessly cast '{value} {from_abbrev}' to {to_abbrev}" + ) + + # Analogous to: iresult[i] = new_value + (cnp.PyArray_MultiIter_DATA(mi, 0))[0] = new_value + + cnp.PyArray_MultiIter_NEXT(mi) + + return iresult + + +@cython.overflowcheck(True) +cdef int64_t get_conversion_factor(NPY_DATETIMEUNIT from_unit, NPY_DATETIMEUNIT to_unit) except? -1: + """ + Find the factor by which we need to multiply to convert from from_unit to to_unit. 
+ """ + if ( + from_unit == NPY_DATETIMEUNIT.NPY_FR_GENERIC + or to_unit == NPY_DATETIMEUNIT.NPY_FR_GENERIC + ): + raise ValueError("unit-less resolutions are not supported") + if from_unit > to_unit: + raise ValueError + + if from_unit == to_unit: + return 1 + + if from_unit == NPY_DATETIMEUNIT.NPY_FR_W: + return 7 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_D, to_unit) + elif from_unit == NPY_DATETIMEUNIT.NPY_FR_D: + return 24 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_h, to_unit) + elif from_unit == NPY_DATETIMEUNIT.NPY_FR_h: + return 60 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_m, to_unit) + elif from_unit == NPY_DATETIMEUNIT.NPY_FR_m: + return 60 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_s, to_unit) + elif from_unit == NPY_DATETIMEUNIT.NPY_FR_s: + return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_ms, to_unit) + elif from_unit == NPY_DATETIMEUNIT.NPY_FR_ms: + return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_us, to_unit) + elif from_unit == NPY_DATETIMEUNIT.NPY_FR_us: + return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_ns, to_unit) + elif from_unit == NPY_DATETIMEUNIT.NPY_FR_ns: + return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_ps, to_unit) + elif from_unit == NPY_DATETIMEUNIT.NPY_FR_ps: + return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_fs, to_unit) + elif from_unit == NPY_DATETIMEUNIT.NPY_FR_fs: + return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_as, to_unit) + + +cdef int64_t convert_reso( + int64_t value, + NPY_DATETIMEUNIT from_reso, + NPY_DATETIMEUNIT to_reso, + bint round_ok, +) except? -1: + cdef: + int64_t res_value, mult, div, mod + + if from_reso == to_reso: + return value + + elif to_reso < from_reso: + # e.g. ns -> us, no risk of overflow, but can be lossy rounding + mult = get_conversion_factor(to_reso, from_reso) + div, mod = divmod(value, mult) + if mod > 0 and not round_ok: + raise ValueError("Cannot losslessly convert units") + + # Note that when mod > 0, we follow np.timedelta64 in always + # rounding down. + res_value = div + + elif ( + from_reso == NPY_FR_Y + or from_reso == NPY_FR_M + or to_reso == NPY_FR_Y + or to_reso == NPY_FR_M + ): + # Converting by multiplying isn't _quite_ right bc the number of + # seconds in a month/year isn't fixed. + res_value = _convert_reso_with_dtstruct(value, from_reso, to_reso) + + else: + # e.g. ns -> us, risk of overflow, but no risk of lossy rounding + mult = get_conversion_factor(from_reso, to_reso) + with cython.overflowcheck(True): + # Note: caller is responsible for re-raising as OutOfBoundsTimedelta + res_value = value * mult + + return res_value + + +cdef int64_t _convert_reso_with_dtstruct( + int64_t value, + NPY_DATETIMEUNIT from_unit, + NPY_DATETIMEUNIT to_unit, +) except? 
-1: + cdef: + npy_datetimestruct dts + + pandas_datetime_to_datetimestruct(value, from_unit, &dts) + check_dts_bounds(&dts, to_unit) + return npy_datetimestruct_to_datetime(to_unit, &dts) diff --git a/pandas/_libs/tslibs/offsets.pyi b/pandas/_libs/tslibs/offsets.pyi index 058bbcdd346e0..c885b869f983a 100644 --- a/pandas/_libs/tslibs/offsets.pyi +++ b/pandas/_libs/tslibs/offsets.pyi @@ -21,7 +21,7 @@ from .timedeltas import Timedelta if TYPE_CHECKING: from pandas.core.indexes.datetimes import DatetimeIndex -_BaseOffsetT = TypeVar("_BaseOffsetT", bound="BaseOffset") +_BaseOffsetT = TypeVar("_BaseOffsetT", bound=BaseOffset) _DatetimeT = TypeVar("_DatetimeT", bound=datetime) _TimedeltaT = TypeVar("_TimedeltaT", bound=timedelta) @@ -76,13 +76,13 @@ class BaseOffset: def __rmul__(self: _BaseOffsetT, other: int) -> _BaseOffsetT: ... def __neg__(self: _BaseOffsetT) -> _BaseOffsetT: ... def copy(self: _BaseOffsetT) -> _BaseOffsetT: ... - def __repr__(self) -> str: ... @property def name(self) -> str: ... @property def rule_code(self) -> str: ... + @property def freqstr(self) -> str: ... - def apply_index(self, dtindex: "DatetimeIndex") -> "DatetimeIndex": ... + def apply_index(self, dtindex: DatetimeIndex) -> DatetimeIndex: ... def _apply_array(self, dtarr) -> None: ... def rollback(self, dt: datetime) -> datetime: ... def rollforward(self, dt: datetime) -> datetime: ... @@ -105,10 +105,14 @@ class SingleConstructorOffset(BaseOffset): @overload def to_offset(freq: None) -> None: ... @overload -def to_offset(freq: timedelta | BaseOffset | str) -> BaseOffset: ... +def to_offset(freq: _BaseOffsetT) -> _BaseOffsetT: ... +@overload +def to_offset(freq: timedelta | str) -> BaseOffset: ... class Tick(SingleConstructorOffset): _reso: int + _prefix: str + _td64_unit: str def __init__(self, n: int = ..., normalize: bool = ...) -> None: ... @property def delta(self) -> Timedelta: ... 
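As a usage note for the to_offset overloads declared above (the overloads only refine the typing of existing runtime behaviour), something along these lines holds:

    from pandas.tseries.frequencies import to_offset

    hourly = to_offset("2H")              # str overload -> <2 * Hours>
    assert to_offset(hourly) == hourly    # a BaseOffset input passes through
    assert to_offset(None) is None        # the None overload returns None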
diff --git a/pandas/_libs/tslibs/offsets.pyx b/pandas/_libs/tslibs/offsets.pyx index d37c287be4cfd..5f4f6b998a60a 100644 --- a/pandas/_libs/tslibs/offsets.pyx +++ b/pandas/_libs/tslibs/offsets.pyx @@ -70,6 +70,7 @@ from pandas._libs.tslibs.np_datetime cimport ( from .dtypes cimport PeriodDtypeCode from .timedeltas cimport ( + _Timedelta, delta_to_nanoseconds, is_any_td_scalar, ) @@ -795,6 +796,7 @@ cdef class SingleConstructorOffset(BaseOffset): cdef class Tick(SingleConstructorOffset): _adjust_dst = False _prefix = "undefined" + _td64_unit = "undefined" _attributes = tuple(["n", "normalize"]) def __init__(self, n=1, normalize=False): @@ -967,6 +969,7 @@ cdef class Tick(SingleConstructorOffset): cdef class Day(Tick): _nanos_inc = 24 * 3600 * 1_000_000_000 _prefix = "D" + _td64_unit = "D" _period_dtype_code = PeriodDtypeCode.D _reso = NPY_DATETIMEUNIT.NPY_FR_D @@ -974,6 +977,7 @@ cdef class Day(Tick): cdef class Hour(Tick): _nanos_inc = 3600 * 1_000_000_000 _prefix = "H" + _td64_unit = "h" _period_dtype_code = PeriodDtypeCode.H _reso = NPY_DATETIMEUNIT.NPY_FR_h @@ -981,6 +985,7 @@ cdef class Hour(Tick): cdef class Minute(Tick): _nanos_inc = 60 * 1_000_000_000 _prefix = "T" + _td64_unit = "m" _period_dtype_code = PeriodDtypeCode.T _reso = NPY_DATETIMEUNIT.NPY_FR_m @@ -988,6 +993,7 @@ cdef class Minute(Tick): cdef class Second(Tick): _nanos_inc = 1_000_000_000 _prefix = "S" + _td64_unit = "s" _period_dtype_code = PeriodDtypeCode.S _reso = NPY_DATETIMEUNIT.NPY_FR_s @@ -995,6 +1001,7 @@ cdef class Second(Tick): cdef class Milli(Tick): _nanos_inc = 1_000_000 _prefix = "L" + _td64_unit = "ms" _period_dtype_code = PeriodDtypeCode.L _reso = NPY_DATETIMEUNIT.NPY_FR_ms @@ -1002,6 +1009,7 @@ cdef class Milli(Tick): cdef class Micro(Tick): _nanos_inc = 1000 _prefix = "U" + _td64_unit = "us" _period_dtype_code = PeriodDtypeCode.U _reso = NPY_DATETIMEUNIT.NPY_FR_us @@ -1009,6 +1017,7 @@ cdef class Micro(Tick): cdef class Nano(Tick): _nanos_inc = 1 _prefix = "N" + _td64_unit = "ns" _period_dtype_code = PeriodDtypeCode.N _reso = NPY_DATETIMEUNIT.NPY_FR_ns @@ -1140,7 +1149,9 @@ cdef class RelativeDeltaOffset(BaseOffset): weeks = kwds.get("weeks", 0) * self.n if weeks: - dt64other = dt64other + Timedelta(days=7 * weeks) + delta = Timedelta(days=7 * weeks) + td = (<_Timedelta>delta)._as_reso(reso) + dt64other = dt64other + td timedelta_kwds = { k: v @@ -1149,13 +1160,14 @@ cdef class RelativeDeltaOffset(BaseOffset): } if timedelta_kwds: delta = Timedelta(**timedelta_kwds) - dt64other = dt64other + (self.n * delta) - # FIXME: fails to preserve non-nano + td = (<_Timedelta>delta)._as_reso(reso) + dt64other = dt64other + (self.n * td) return dt64other elif not self._use_relativedelta and hasattr(self, "_offset"): # timedelta - # FIXME: fails to preserve non-nano - return dt64other + Timedelta(self._offset * self.n) + delta = Timedelta(self._offset * self.n) + td = (<_Timedelta>delta)._as_reso(reso) + return dt64other + td else: # relativedelta with other keywords kwd = set(kwds) - relativedelta_fast @@ -1562,8 +1574,9 @@ cdef class BusinessHour(BusinessMixin): def _repr_attrs(self) -> str: out = super()._repr_attrs() + # Use python string formatting to be faster than strftime hours = ",".join( - f'{st.strftime("%H:%M")}-{en.strftime("%H:%M")}' + f'{st.hour:02d}:{st.minute:02d}-{en.hour:02d}:{en.minute:02d}' for st, en in zip(self.start, self.end) ) attrs = [f"{self._prefix}={hours}"] @@ -3111,7 +3124,7 @@ cdef class FY5253Quarter(FY5253Mixin): for qlen in qtr_lens: if qlen * 7 <= tdelta.days: num_qtrs += 1 - 
tdelta -= Timedelta(days=qlen * 7) + tdelta -= (<_Timedelta>Timedelta(days=qlen * 7))._as_reso(norm._reso) else: break else: diff --git a/pandas/_libs/tslibs/parsing.pyx b/pandas/_libs/tslibs/parsing.pyx index 8b42ed195957b..5cb11436f6f45 100644 --- a/pandas/_libs/tslibs/parsing.pyx +++ b/pandas/_libs/tslibs/parsing.pyx @@ -85,8 +85,9 @@ _DEFAULT_DATETIME = datetime(1, 1, 1).replace(hour=0, minute=0, second=0, microsecond=0) PARSING_WARNING_MSG = ( - "Parsing '{date_string}' in {format} format. Provide format " - "or specify infer_datetime_format=True for consistent parsing." + "Parsing dates in {format} format when dayfirst={dayfirst} was specified. " + "This may lead to inconsistently parsed dates! Specify a format " + "to ensure consistent parsing." ) cdef: @@ -185,16 +186,16 @@ cdef inline object _parse_delimited_date(str date_string, bint dayfirst): if dayfirst and not swapped_day_and_month: warnings.warn( PARSING_WARNING_MSG.format( - date_string=date_string, - format='MM/DD/YYYY' + format='MM/DD/YYYY', + dayfirst='True', ), stacklevel=4, ) elif not dayfirst and swapped_day_and_month: warnings.warn( PARSING_WARNING_MSG.format( - date_string=date_string, - format='DD/MM/YYYY' + format='DD/MM/YYYY', + dayfirst='False (the default)', ), stacklevel=4, ) diff --git a/pandas/_libs/tslibs/period.pyx b/pandas/_libs/tslibs/period.pyx index 0c05037097839..3332628627739 100644 --- a/pandas/_libs/tslibs/period.pyx +++ b/pandas/_libs/tslibs/period.pyx @@ -49,8 +49,6 @@ from pandas._libs.tslibs.np_datetime cimport ( NPY_FR_us, astype_overflowsafe, check_dts_bounds, - dt64_to_dtstruct, - dtstruct_to_dt64, get_timedelta64_value, npy_datetimestruct, npy_datetimestruct_to_datetime, @@ -60,7 +58,6 @@ from pandas._libs.tslibs.np_datetime cimport ( from pandas._libs.tslibs.timestamps import Timestamp from pandas._libs.tslibs.ccalendar cimport ( - c_MONTH_NUMBERS, dayofweek, get_day_of_year, get_days_in_month, @@ -814,7 +811,7 @@ cdef void get_date_info(int64_t ordinal, int freq, npy_datetimestruct *dts) nogi pandas_datetime_to_datetimestruct(unix_date, NPY_FR_D, dts) - dt64_to_dtstruct(nanos, &dts2) + pandas_datetime_to_datetimestruct(nanos, NPY_DATETIMEUNIT.NPY_FR_ns, &dts2) dts.hour = dts2.hour dts.min = dts2.min dts.sec = dts2.sec @@ -1150,7 +1147,7 @@ cdef int64_t period_ordinal_to_dt64(int64_t ordinal, int freq) except? -1: get_date_info(ordinal, freq, &dts) check_dts_bounds(&dts) - return dtstruct_to_dt64(&dts) + return npy_datetimestruct_to_datetime(NPY_DATETIMEUNIT.NPY_FR_ns, &dts) cdef str period_format(int64_t value, int freq, object fmt=None): @@ -1646,7 +1643,7 @@ cdef class _Period(PeriodMixin): return freq @classmethod - def _from_ordinal(cls, ordinal: int, freq) -> "Period": + def _from_ordinal(cls, ordinal: int64_t, freq) -> "Period": """ Fast creation from an ordinal and freq that are already validated! """ @@ -2483,9 +2480,11 @@ class Period(_Period): Parameters ---------- value : Period or str, default None - The time period represented (e.g., '4Q2005'). + The time period represented (e.g., '4Q2005'). This represents neither + the start nor the end of the period, but rather the entire period itself. freq : str, default None - One of pandas period strings or corresponding objects. + One of pandas period strings or corresponding objects. Accepted + strings are listed in the :ref:`offset alias section ` in the user docs. ordinal : int, default None The period offset from the proleptic Gregorian epoch.
year : int, default None @@ -2502,6 +2501,12 @@ class Period(_Period): Minute value of the period. second : int, default 0 Second value of the period. + + Examples + -------- + >>> period = pd.Period('2012-1-1', freq='D') + >>> period + Period('2012-01-01', 'D') """ def __new__(cls, value=None, freq=None, ordinal=None, diff --git a/pandas/_libs/tslibs/timedeltas.pyi b/pandas/_libs/tslibs/timedeltas.pyi index cc649e5a62660..1921329da9e24 100644 --- a/pandas/_libs/tslibs/timedeltas.pyi +++ b/pandas/_libs/tslibs/timedeltas.pyi @@ -2,7 +2,6 @@ from datetime import timedelta from typing import ( ClassVar, Literal, - Type, TypeVar, overload, ) @@ -84,7 +83,7 @@ class Timedelta(timedelta): resolution: ClassVar[Timedelta] value: int # np.int64 def __new__( - cls: Type[_S], + cls: type[_S], value=..., unit: str = ..., **kwargs: int | float | np.integer | np.floating, diff --git a/pandas/_libs/tslibs/timedeltas.pyx b/pandas/_libs/tslibs/timedeltas.pyx index 028371633a2c1..39458c10ad35b 100644 --- a/pandas/_libs/tslibs/timedeltas.pyx +++ b/pandas/_libs/tslibs/timedeltas.pyx @@ -35,10 +35,7 @@ from pandas._libs.tslibs.conversion cimport ( cast_from_unit, precision_from_unit, ) -from pandas._libs.tslibs.dtypes cimport ( - get_conversion_factor, - npy_unit_to_abbrev, -) +from pandas._libs.tslibs.dtypes cimport npy_unit_to_abbrev from pandas._libs.tslibs.nattype cimport ( NPY_NAT, c_NaT as NaT, @@ -50,6 +47,8 @@ from pandas._libs.tslibs.np_datetime cimport ( NPY_FR_ns, cmp_dtstructs, cmp_scalar, + convert_reso, + get_conversion_factor, get_datetime64_unit, get_timedelta64_value, get_unit_from_dtype, @@ -59,7 +58,10 @@ from pandas._libs.tslibs.np_datetime cimport ( pandas_timedeltastruct, ) -from pandas._libs.tslibs.np_datetime import OutOfBoundsTimedelta +from pandas._libs.tslibs.np_datetime import ( + OutOfBoundsDatetime, + OutOfBoundsTimedelta, +) from pandas._libs.tslibs.offsets cimport is_tick_object from pandas._libs.tslibs.util cimport ( @@ -242,6 +244,11 @@ cpdef int64_t delta_to_nanoseconds( elif is_timedelta64_object(delta): in_reso = get_datetime64_unit(delta) + if in_reso == NPY_DATETIMEUNIT.NPY_FR_Y or in_reso == NPY_DATETIMEUNIT.NPY_FR_M: + raise ValueError( + "delta_to_nanoseconds does not support Y or M units, " + "as their duration in nanoseconds is ambiguous." + ) n = get_timedelta64_value(delta) elif PyDelta_Check(delta): @@ -258,26 +265,15 @@ cpdef int64_t delta_to_nanoseconds( else: raise TypeError(type(delta)) - if reso < in_reso: - # e.g. ns -> us - factor = get_conversion_factor(reso, in_reso) - div, mod = divmod(n, factor) - if mod > 0 and not round_ok: - raise ValueError("Cannot losslessly convert units") - - # Note that when mod > 0, we follow np.timedelta64 in always - # rounding down. - value = div - else: - factor = get_conversion_factor(in_reso, reso) - try: - with cython.overflowcheck(True): - value = n * factor - except OverflowError as err: - unit_str = npy_unit_to_abbrev(reso) - raise OutOfBoundsTimedelta( - f"Cannot cast {str(delta)} to unit={unit_str} without overflow." - ) from err + try: + return convert_reso(n, in_reso, reso, round_ok=round_ok) + except (OutOfBoundsDatetime, OverflowError) as err: + # Catch OutOfBoundsDatetime bc convert_reso can call check_dts_bounds + # for Y/M-resolution cases + unit_str = npy_unit_to_abbrev(reso) + raise OutOfBoundsTimedelta( + f"Cannot cast {str(delta)} to unit={unit_str} without overflow." + ) from err return value @@ -765,8 +761,12 @@ def _binary_op_method_timedeltalike(op, name): # defined by Timestamp methods. 
elif is_array(other): - # nd-array like - if other.dtype.kind in ['m', 'M']: + if other.ndim == 0: + # see also: item_from_zerodim + item = cnp.PyArray_ToScalar(cnp.PyArray_DATA(other), other) + return f(self, item) + + elif other.dtype.kind in ['m', 'M']: return op(self.to_timedelta64(), other) elif other.dtype.kind == 'O': return np.array([op(self, x) for x in other]) @@ -787,8 +787,19 @@ def _binary_op_method_timedeltalike(op, name): # e.g. if original other was timedelta64('NaT') return NaT - if self._reso != other._reso: - raise NotImplementedError + # We allow silent casting to the lower resolution if and only + # if it is lossless. + try: + if self._reso < other._reso: + other = (<_Timedelta>other)._as_reso(self._reso, round_ok=False) + elif self._reso > other._reso: + self = (<_Timedelta>self)._as_reso(other._reso, round_ok=False) + except ValueError as err: + raise ValueError( + "Timedelta addition/subtraction with mismatched resolutions is not " + "allowed when casting to the lower resolution would require " + "lossy rounding." + ) from err res = op(self.value, other.value) if res == NPY_NAT: @@ -935,22 +946,30 @@ cdef _timedelta_from_value_and_reso(int64_t value, NPY_DATETIMEUNIT reso): cdef: _Timedelta td_base + # For millisecond and second resos, we cannot actually pass int(value) because + # many cases would fall outside of the pytimedelta implementation bounds. + # We pass 0 instead, and override seconds, microseconds, days. + # In principle we could pass 0 for ns and us too. if reso == NPY_FR_ns: td_base = _Timedelta.__new__(Timedelta, microseconds=int(value) // 1000) elif reso == NPY_DATETIMEUNIT.NPY_FR_us: td_base = _Timedelta.__new__(Timedelta, microseconds=int(value)) elif reso == NPY_DATETIMEUNIT.NPY_FR_ms: - td_base = _Timedelta.__new__(Timedelta, milliseconds=int(value)) + td_base = _Timedelta.__new__(Timedelta, milliseconds=0) elif reso == NPY_DATETIMEUNIT.NPY_FR_s: - td_base = _Timedelta.__new__(Timedelta, seconds=int(value)) - elif reso == NPY_DATETIMEUNIT.NPY_FR_m: - td_base = _Timedelta.__new__(Timedelta, minutes=int(value)) - elif reso == NPY_DATETIMEUNIT.NPY_FR_h: - td_base = _Timedelta.__new__(Timedelta, hours=int(value)) - elif reso == NPY_DATETIMEUNIT.NPY_FR_D: - td_base = _Timedelta.__new__(Timedelta, days=int(value)) + td_base = _Timedelta.__new__(Timedelta, seconds=0) + # Other resolutions are disabled but could potentially be implemented here: + # elif reso == NPY_DATETIMEUNIT.NPY_FR_m: + # td_base = _Timedelta.__new__(Timedelta, minutes=int(value)) + # elif reso == NPY_DATETIMEUNIT.NPY_FR_h: + # td_base = _Timedelta.__new__(Timedelta, hours=int(value)) + # elif reso == NPY_DATETIMEUNIT.NPY_FR_D: + # td_base = _Timedelta.__new__(Timedelta, days=int(value)) else: - raise NotImplementedError(reso) + raise NotImplementedError( + "Only resolutions 's', 'ms', 'us', 'ns' are supported." + ) + td_base.value = value td_base._is_populated = 0 @@ -958,6 +977,34 @@ cdef _timedelta_from_value_and_reso(int64_t value, NPY_DATETIMEUNIT reso): return td_base +class MinMaxReso: + """ + We need to define min/max/resolution on both the Timedelta _instance_ + and Timedelta class. On an instance, these depend on the object's _reso. + On the class, we default to the values we would get with nanosecond _reso. 
+ """ + def __init__(self, name): + self._name = name + + def __get__(self, obj, type=None): + if self._name == "min": + val = np.iinfo(np.int64).min + 1 + elif self._name == "max": + val = np.iinfo(np.int64).max + else: + assert self._name == "resolution" + val = 1 + + if obj is None: + # i.e. this is on the class, default to nanos + return Timedelta(val) + else: + return Timedelta._from_value_and_reso(val, obj._reso) + + def __set__(self, obj, value): + raise AttributeError(f"{self._name} is not settable.") + + # Similar to Timestamp/datetime, this is a construction requirement for # timedeltas that we need to do object instantiation in python. This will # serve as a C extension type that shadows the Python class, where we do any @@ -971,6 +1018,36 @@ cdef class _Timedelta(timedelta): # higher than np.ndarray and np.matrix __array_priority__ = 100 + min = MinMaxReso("min") + max = MinMaxReso("max") + resolution = MinMaxReso("resolution") + + @property + def days(self) -> int: # TODO(cython3): make cdef property + # NB: using the python C-API PyDateTime_DELTA_GET_DAYS will fail + # (or be incorrect) + self._ensure_components() + return self._d + + @property + def seconds(self) -> int: # TODO(cython3): make cdef property + # NB: using the python C-API PyDateTime_DELTA_GET_SECONDS will fail + # (or be incorrect) + self._ensure_components() + return self._h * 3600 + self._m * 60 + self._s + + @property + def microseconds(self) -> int: # TODO(cython3): make cdef property + # NB: using the python C-API PyDateTime_DELTA_GET_MICROSECONDS will fail + # (or be incorrect) + self._ensure_components() + return self._ms * 1000 + self._us + + def total_seconds(self) -> float: + """Total seconds in the duration.""" + # We need to override bc we overrided days/seconds/microseconds + # TODO: add nanos/1e9? + return self.days * 24 * 3600 + self.seconds + self.microseconds / 1_000_000 @property def freq(self) -> None: @@ -1006,7 +1083,6 @@ cdef class _Timedelta(timedelta): def __richcmp__(_Timedelta self, object other, int op): cdef: _Timedelta ots - int ndim if isinstance(other, _Timedelta): ots = other @@ -1018,7 +1094,6 @@ cdef class _Timedelta(timedelta): return op == Py_NE elif util.is_array(other): - # TODO: watch out for zero-dim if other.dtype.kind == "m": return PyObject_RichCompare(self.asm8, other, op) elif other.dtype.kind == "O": @@ -1461,21 +1536,7 @@ cdef class _Timedelta(timedelta): if reso == self._reso: return self - if reso < self._reso: - # e.g. ns -> us - mult = get_conversion_factor(reso, self._reso) - div, mod = divmod(self.value, mult) - if mod > 0 and not round_ok: - raise ValueError("Cannot losslessly convert units") - - # Note that when mod > 0, we follow np.timedelta64 in always - # rounding down. 
- value = div - else: - mult = get_conversion_factor(self._reso, reso) - with cython.overflowcheck(True): - # Note: caller is responsible for re-raising as OutOfBoundsTimedelta - value = self.value * mult + value = convert_reso(self.value, self._reso, reso, round_ok=round_ok) return type(self)._from_value_and_reso(value, reso=reso) @@ -1653,15 +1714,14 @@ class Timedelta(_Timedelta): int64_t result, unit, remainder ndarray[int64_t] arr - if self._reso != NPY_FR_ns: - raise NotImplementedError - from pandas._libs.tslibs.offsets import to_offset - unit = to_offset(freq).nanos + + to_offset(freq).nanos # raises on non-fixed freq + unit = delta_to_nanoseconds(to_offset(freq), self._reso) arr = np.array([self.value], dtype="i8") result = round_nsint64(arr, mode, unit)[0] - return Timedelta(result, unit="ns") + return Timedelta._from_value_and_reso(result, self._reso) def round(self, freq): """ @@ -1729,7 +1789,10 @@ class Timedelta(_Timedelta): ) elif is_array(other): - # ndarray-like + if other.ndim == 0: + # see also: item_from_zerodim + item = cnp.PyArray_ToScalar(cnp.PyArray_DATA(other), other) + return self.__mul__(item) return other * self.to_timedelta64() return NotImplemented @@ -1737,20 +1800,35 @@ class Timedelta(_Timedelta): __rmul__ = __mul__ def __truediv__(self, other): + cdef: + int64_t new_value + if _should_cast_to_timedelta(other): # We interpret NaT as timedelta64("NaT") other = Timedelta(other) if other is NaT: return np.nan + if other._reso != self._reso: + raise ValueError( + "division between Timedeltas with mismatched resolutions " + "are not supported. Explicitly cast to matching resolutions " + "before dividing." + ) return self.value / float(other.value) elif is_integer_object(other) or is_float_object(other): # integers or floats - if self._reso != NPY_FR_ns: - raise NotImplementedError - return Timedelta(self.value / other, unit='ns') + if util.is_nan(other): + return NaT + return Timedelta._from_value_and_reso( + (self.value / other), self._reso + ) elif is_array(other): + if other.ndim == 0: + # see also: item_from_zerodim + item = cnp.PyArray_ToScalar(cnp.PyArray_DATA(other), other) + return self.__truediv__(item) return self.to_timedelta64() / other return NotImplemented @@ -1761,14 +1839,26 @@ class Timedelta(_Timedelta): other = Timedelta(other) if other is NaT: return np.nan - if self._reso != NPY_FR_ns: - raise NotImplementedError + if self._reso != other._reso: + raise ValueError( + "division between Timedeltas with mismatched resolutions " + "are not supported. Explicitly cast to matching resolutions " + "before dividing." + ) return float(other.value) / self.value elif is_array(other): - if other.dtype.kind == "O": + if other.ndim == 0: + # see also: item_from_zerodim + item = cnp.PyArray_ToScalar(cnp.PyArray_DATA(other), other) + return self.__rtruediv__(item) + elif other.dtype.kind == "O": # GH#31869 return np.array([x / self for x in other]) + + # TODO: if other.dtype.kind == "m" and other.dtype != self.asm8.dtype + # then should disallow for consistency with scalar behavior; requires + # deprecation cycle. (or changing scalar behavior) return other / self.to_timedelta64() return NotImplemented @@ -1781,16 +1871,25 @@ class Timedelta(_Timedelta): other = Timedelta(other) if other is NaT: return np.nan - if self._reso != NPY_FR_ns: - raise NotImplementedError + if self._reso != other._reso: + raise ValueError( + "floordivision between Timedeltas with mismatched resolutions " + "are not supported. 
Explicitly cast to matching resolutions " + "before dividing." + ) return self.value // other.value elif is_integer_object(other) or is_float_object(other): - if self._reso != NPY_FR_ns: - raise NotImplementedError - return Timedelta(self.value // other, unit='ns') + if util.is_nan(other): + return NaT + return type(self)._from_value_and_reso(self.value // other, self._reso) elif is_array(other): + if other.ndim == 0: + # see also: item_from_zerodim + item = cnp.PyArray_ToScalar(cnp.PyArray_DATA(other), other) + return self.__floordiv__(item) + if other.dtype.kind == 'm': # also timedelta-like if self._reso != NPY_FR_ns: @@ -1798,9 +1897,7 @@ class Timedelta(_Timedelta): return _broadcast_floordiv_td64(self.value, other, _floordiv) elif other.dtype.kind in ['i', 'u', 'f']: if other.ndim == 0: - if self._reso != NPY_FR_ns: - raise NotImplementedError - return Timedelta(self.value // other) + return self // other.item() else: return self.to_timedelta64() // other @@ -1816,11 +1913,20 @@ class Timedelta(_Timedelta): other = Timedelta(other) if other is NaT: return np.nan - if self._reso != NPY_FR_ns: - raise NotImplementedError + if self._reso != other._reso: + raise ValueError( + "floordivision between Timedeltas with mismatched resolutions " + "are not supported. Explicitly cast to matching resolutions " + "before dividing." + ) return other.value // self.value elif is_array(other): + if other.ndim == 0: + # see also: item_from_zerodim + item = cnp.PyArray_ToScalar(cnp.PyArray_DATA(other), other) + return self.__rfloordiv__(item) + if other.dtype.kind == 'm': # also timedelta-like if self._reso != NPY_FR_ns: @@ -1906,26 +2012,14 @@ cdef _broadcast_floordiv_td64( result : varies based on `other` """ # assumes other.dtype.kind == 'm', i.e. other is timedelta-like + # assumes other.ndim != 0 # We need to watch out for np.timedelta64('NaT'). mask = other.view('i8') == NPY_NAT - if other.ndim == 0: - if mask: - return np.nan - - return operation(value, other.astype('m8[ns]').astype('i8')) - - else: - res = operation(value, other.astype('m8[ns]').astype('i8')) - - if mask.any(): - res = res.astype('f8') - res[mask] = np.nan - return res - + res = operation(value, other.astype('m8[ns]', copy=False).astype('i8')) -# resolution in ns -Timedelta.min = Timedelta(np.iinfo(np.int64).min + 1) -Timedelta.max = Timedelta(np.iinfo(np.int64).max) -Timedelta.resolution = Timedelta(nanoseconds=1) + if mask.any(): + res = res.astype('f8') + res[mask] = np.nan + return res diff --git a/pandas/_libs/tslibs/timestamps.pxd b/pandas/_libs/tslibs/timestamps.pxd index bde7cf9328712..0ecb26822cf50 100644 --- a/pandas/_libs/tslibs/timestamps.pxd +++ b/pandas/_libs/tslibs/timestamps.pxd @@ -22,7 +22,7 @@ cdef _Timestamp create_timestamp_from_ts(int64_t value, cdef class _Timestamp(ABCTimestamp): cdef readonly: - int64_t value, nanosecond + int64_t value, nanosecond, year BaseOffset _freq NPY_DATETIMEUNIT _reso @@ -37,3 +37,4 @@ cdef class _Timestamp(ABCTimestamp): cpdef void _set_freq(self, freq) cdef _warn_on_field_deprecation(_Timestamp self, freq, str field) cdef bint _compare_mismatched_resos(_Timestamp self, _Timestamp other, int op) + cdef _Timestamp _as_reso(_Timestamp self, NPY_DATETIMEUNIT reso, bint round_ok=*) diff --git a/pandas/_libs/tslibs/timestamps.pyi b/pandas/_libs/tslibs/timestamps.pyi index fd593ae453ef7..082f26cf6f213 100644 --- a/pandas/_libs/tslibs/timestamps.pyi +++ b/pandas/_libs/tslibs/timestamps.pyi @@ -85,10 +85,10 @@ class Timestamp(datetime): def fold(self) -> int: ... 
@classmethod def fromtimestamp( - cls: type[_DatetimeT], t: float, tz: _tzinfo | None = ... + cls: type[_DatetimeT], ts: float, tz: _tzinfo | None = ... ) -> _DatetimeT: ... @classmethod - def utcfromtimestamp(cls: type[_DatetimeT], t: float) -> _DatetimeT: ... + def utcfromtimestamp(cls: type[_DatetimeT], ts: float) -> _DatetimeT: ... @classmethod def today(cls: type[_DatetimeT], tz: _tzinfo | str | None = ...) -> _DatetimeT: ... @classmethod @@ -104,7 +104,9 @@ class Timestamp(datetime): def utcnow(cls: type[_DatetimeT]) -> _DatetimeT: ... # error: Signature of "combine" incompatible with supertype "datetime" @classmethod - def combine(cls, date: _date, time: _time) -> datetime: ... # type: ignore[override] + def combine( # type: ignore[override] + cls, date: _date, time: _time + ) -> datetime: ... @classmethod def fromisoformat(cls: type[_DatetimeT], date_string: str) -> _DatetimeT: ... def strftime(self, format: str) -> str: ... @@ -116,19 +118,25 @@ class Timestamp(datetime): def date(self) -> _date: ... def time(self) -> _time: ... def timetz(self) -> _time: ... - def replace( + # LSP violation: nanosecond is not present in datetime.datetime.replace + # and has positional args following it + def replace( # type: ignore[override] self: _DatetimeT, - year: int = ..., - month: int = ..., - day: int = ..., - hour: int = ..., - minute: int = ..., - second: int = ..., - microsecond: int = ..., - tzinfo: _tzinfo | None = ..., - fold: int = ..., + year: int | None = ..., + month: int | None = ..., + day: int | None = ..., + hour: int | None = ..., + minute: int | None = ..., + second: int | None = ..., + microsecond: int | None = ..., + nanosecond: int | None = ..., + tzinfo: _tzinfo | type[object] | None = ..., + fold: int | None = ..., + ) -> _DatetimeT: ... + # LSP violation: datetime.datetime.astimezone has a default value for tz + def astimezone( # type: ignore[override] + self: _DatetimeT, tz: _tzinfo | None ) -> _DatetimeT: ... - def astimezone(self: _DatetimeT, tz: _tzinfo | None = ...) -> _DatetimeT: ... def ctime(self) -> str: ... def isoformat(self, sep: str = ..., timespec: str = ...) -> str: ... @classmethod @@ -204,8 +212,6 @@ class Timestamp(datetime): @property def dayofweek(self) -> int: ... @property - def day_of_month(self) -> int: ... - @property def day_of_year(self) -> int: ... @property def dayofyear(self) -> int: ... @@ -222,3 +228,4 @@ class Timestamp(datetime): def days_in_month(self) -> int: ... @property def daysinmonth(self) -> int: ... + def _as_unit(self, unit: str, round_ok: bool = ...) -> Timestamp: ... 
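The replace stub above now includes the nanosecond keyword that Timestamp.replace already accepts at runtime; a short usage example of that existing behaviour:

    import pandas as pd

    ts = pd.Timestamp("2022-01-01 00:00:00.000000001")
    ts.nanosecond                 # 1
    ts.replace(nanosecond=500)    # Timestamp('2022-01-01 00:00:00.000000500')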
diff --git a/pandas/_libs/tslibs/timestamps.pyx b/pandas/_libs/tslibs/timestamps.pyx index c6bae70d04a98..dc4da6c9bf4d2 100644 --- a/pandas/_libs/tslibs/timestamps.pyx +++ b/pandas/_libs/tslibs/timestamps.pyx @@ -80,15 +80,17 @@ from pandas._libs.tslibs.nattype cimport ( from pandas._libs.tslibs.np_datetime cimport ( NPY_DATETIMEUNIT, NPY_FR_ns, - check_dts_bounds, cmp_dtstructs, cmp_scalar, - dt64_to_dtstruct, + convert_reso, + get_conversion_factor, get_datetime64_unit, get_datetime64_value, + get_unit_from_dtype, npy_datetimestruct, + npy_datetimestruct_to_datetime, pandas_datetime_to_datetimestruct, - pydatetime_to_dt64, + pydatetime_to_dtstruct, ) from pandas._libs.tslibs.np_datetime import ( @@ -102,6 +104,7 @@ from pandas._libs.tslibs.offsets cimport ( to_offset, ) from pandas._libs.tslibs.timedeltas cimport ( + _Timedelta, delta_to_nanoseconds, ensure_td64ns, is_any_td_scalar, @@ -141,12 +144,27 @@ cdef inline _Timestamp create_timestamp_from_ts( """ convenience routine to construct a Timestamp from its parts """ cdef: _Timestamp ts_base + int64_t pass_year = dts.year + + # We pass year=1970/1972 here and set year below because with non-nanosecond + # resolution we may have datetimes outside of the stdlib pydatetime + # implementation bounds, which would raise. + # NB: this means the C-API macro PyDateTime_GET_YEAR is unreliable. + if 1 <= pass_year <= 9999: + # we are in-bounds for pydatetime + pass + elif ccalendar.is_leapyear(dts.year): + pass_year = 1972 + else: + pass_year = 1970 - ts_base = _Timestamp.__new__(Timestamp, dts.year, dts.month, + ts_base = _Timestamp.__new__(Timestamp, pass_year, dts.month, dts.day, dts.hour, dts.min, dts.sec, dts.us, tz, fold=fold) + ts_base.value = value ts_base._freq = freq + ts_base.year = dts.year ts_base.nanosecond = dts.ps // 1000 ts_base._reso = reso @@ -155,14 +173,7 @@ cdef inline _Timestamp create_timestamp_from_ts( def _unpickle_timestamp(value, freq, tz, reso=NPY_FR_ns): # GH#41949 dont warn on unpickle if we have a freq - if reso == NPY_FR_ns: - ts = Timestamp(value, tz=tz) - else: - if tz is not None: - raise NotImplementedError - abbrev = npy_unit_to_abbrev(reso) - dt64 = np.datetime64(value, abbrev) - ts = Timestamp._from_dt64(dt64) + ts = Timestamp._from_value_and_reso(value, reso, tz) ts._set_freq(freq) return ts @@ -184,6 +195,40 @@ def integer_op_not_supported(obj): return TypeError(int_addsub_msg) +class MinMaxReso: + """ + We need to define min/max/resolution on both the Timestamp _instance_ + and Timestamp class. On an instance, these depend on the object's _reso. + On the class, we default to the values we would get with nanosecond _reso. + + See also: timedeltas.MinMaxReso + """ + def __init__(self, name): + self._name = name + + def __get__(self, obj, type=None): + cls = Timestamp + if self._name == "min": + val = np.iinfo(np.int64).min + 1 + elif self._name == "max": + val = np.iinfo(np.int64).max + else: + assert self._name == "resolution" + val = 1 + cls = Timedelta + + if obj is None: + # i.e. 
this is on the class, default to nanos + return cls(val) + elif self._name == "resolution": + return Timedelta._from_value_and_reso(val, obj._reso) + else: + return Timestamp._from_value_and_reso(val, obj._reso, tz=None) + + def __set__(self, obj, value): + raise AttributeError(f"{self._name} is not settable.") + + # ---------------------------------------------------------------------- cdef class _Timestamp(ABCTimestamp): @@ -193,6 +238,10 @@ cdef class _Timestamp(ABCTimestamp): dayofweek = _Timestamp.day_of_week dayofyear = _Timestamp.day_of_year + min = MinMaxReso("min") + max = MinMaxReso("max") + resolution = MinMaxReso("resolution") # GH#21336, GH#21365 + cpdef void _set_freq(self, freq): # set the ._freq attribute without going through the constructor, # which would issue a warning @@ -220,6 +269,11 @@ cdef class _Timestamp(ABCTimestamp): if value == NPY_NAT: return NaT + if reso < NPY_DATETIMEUNIT.NPY_FR_s or reso > NPY_DATETIMEUNIT.NPY_FR_ns: + raise NotImplementedError( + "Only resolutions 's', 'ms', 'us', 'ns' are supported." + ) + obj.value = value pandas_datetime_to_datetimestruct(value, reso, &obj.dts) maybe_localize_tso(obj, tz, reso) @@ -248,10 +302,12 @@ cdef class _Timestamp(ABCTimestamp): def __hash__(_Timestamp self): if self.nanosecond: return hash(self.value) + if not (1 <= self.year <= 9999): + # out of bounds for pydatetime + return hash(self.value) if self.fold: return datetime.__hash__(self.replace(fold=0)) return datetime.__hash__(self) - # TODO(non-nano): what if we are out of bounds for pydatetime? def __richcmp__(_Timestamp self, object other, int op): cdef: @@ -368,9 +424,6 @@ cdef class _Timestamp(ABCTimestamp): cdef: int64_t nanos = 0 - if isinstance(self, _Timestamp) and self._reso != NPY_FR_ns: - raise NotImplementedError(self._reso) - if is_any_td_scalar(other): if is_timedelta64_object(other): other_reso = get_datetime64_unit(other) @@ -388,16 +441,53 @@ cdef class _Timestamp(ABCTimestamp): # TODO: no tests get here other = ensure_td64ns(other) - # TODO: disallow round_ok - nanos = delta_to_nanoseconds( - other, reso=self._reso, round_ok=True - ) + if isinstance(other, _Timedelta): + # TODO: share this with __sub__, Timedelta.__add__ + # We allow silent casting to the lower resolution if and only + # if it is lossless. See also Timestamp.__sub__ + # and Timedelta.__add__ + try: + if self._reso < other._reso: + other = (<_Timedelta>other)._as_reso(self._reso, round_ok=False) + elif self._reso > other._reso: + self = (<_Timestamp>self)._as_reso(other._reso, round_ok=False) + except ValueError as err: + raise ValueError( + "Timestamp addition with mismatched resolutions is not " + "allowed when casting to the lower resolution would require " + "lossy rounding." + ) from err + try: - result = type(self)(self.value + nanos, tz=self.tzinfo) + nanos = delta_to_nanoseconds( + other, reso=self._reso, round_ok=False + ) + except OutOfBoundsTimedelta: + raise + except ValueError as err: + raise ValueError( + "Addition between Timestamp and Timedelta with mismatched " + "resolutions is not allowed when casting to the lower " + "resolution would require lossy rounding." 
+ ) from err + + try: + new_value = self.value + nanos except OverflowError: # Use Python ints # Hit in test_tdi_add_overflow - result = type(self)(int(self.value) + int(nanos), tz=self.tzinfo) + new_value = int(self.value) + int(nanos) + + try: + result = type(self)._from_value_and_reso( + new_value, reso=self._reso, tz=self.tzinfo + ) + except OverflowError as err: + # TODO: don't hard-code nanosecond here + raise OutOfBoundsDatetime( + f"Out of bounds nanosecond timestamp: {new_value}" + ) from err + if result is not NaT: result._set_freq(self._freq) # avoid warning in constructor return result @@ -429,10 +519,10 @@ cdef class _Timestamp(ABCTimestamp): return NotImplemented def __sub__(self, other): - if isinstance(self, _Timestamp) and self._reso != NPY_FR_ns: - raise NotImplementedError(self._reso) + if other is NaT: + return NaT - if is_any_td_scalar(other) or is_integer_object(other): + elif is_any_td_scalar(other) or is_integer_object(other): neg_other = -other return self + neg_other @@ -448,9 +538,6 @@ cdef class _Timestamp(ABCTimestamp): ) return NotImplemented - if other is NaT: - return NaT - # coerce if necessary if we are a Timestamp-like if (PyDateTime_Check(self) and (PyDateTime_Check(other) or is_datetime64_object(other))): @@ -469,10 +556,25 @@ cdef class _Timestamp(ABCTimestamp): "Cannot subtract tz-naive and tz-aware datetime-like objects." ) + # We allow silent casting to the lower resolution if and only + # if it is lossless. + try: + if self._reso < other._reso: + other = (<_Timestamp>other)._as_reso(self._reso, round_ok=False) + elif self._reso > other._reso: + self = (<_Timestamp>self)._as_reso(other._reso, round_ok=False) + except ValueError as err: + raise ValueError( + "Timestamp subtraction with mismatched resolutions is not " + "allowed when casting to the lower resolution would require " + "lossy rounding." 
+ ) from err + # scalar Timestamp/datetime - Timestamp/datetime -> yields a # Timedelta try: - return Timedelta(self.value - other.value) + res_value = self.value - other.value + return Timedelta._from_value_and_reso(res_value, self._reso) except (OverflowError, OutOfBoundsDatetime, OutOfBoundsTimedelta) as err: if isinstance(other, _Timestamp): if both_timestamps: @@ -493,9 +595,6 @@ cdef class _Timestamp(ABCTimestamp): return NotImplemented def __rsub__(self, other): - if self._reso != NPY_FR_ns: - raise NotImplementedError(self._reso) - if PyDateTime_Check(other): try: return type(self)(other) - self @@ -517,7 +616,8 @@ cdef class _Timestamp(ABCTimestamp): npy_datetimestruct dts if own_tz is not None and not is_utc(own_tz): - val = pydatetime_to_dt64(self, &dts) + self.nanosecond + pydatetime_to_dtstruct(self, &dts) + val = npy_datetimestruct_to_datetime(self._reso, &dts) + self.nanosecond else: val = self.value return val @@ -854,12 +954,11 @@ cdef class _Timestamp(ABCTimestamp): local_val = self._maybe_convert_value_to_local() int64_t normalized int64_t ppd = periods_per_day(self._reso) - - if self._reso != NPY_FR_ns: - raise NotImplementedError(self._reso) + _Timestamp ts normalized = normalize_i8_stamp(local_val, ppd) - return Timestamp(normalized).tz_localize(self.tzinfo) + ts = type(self)._from_value_and_reso(normalized, reso=self._reso, tz=None) + return ts.tz_localize(self.tzinfo) # ----------------------------------------------------------------- # Pickle Methods @@ -925,6 +1024,9 @@ cdef class _Timestamp(ABCTimestamp): """ base_ts = "microseconds" if timespec == "nanoseconds" else timespec base = super(_Timestamp, self).isoformat(sep=sep, timespec=base_ts) + # We need to replace the fake year 1970 with our real year + base = f"{self.year}-" + base.split("-", 1)[1] + if self.nanosecond == 0 and timespec != "nanoseconds": return base @@ -971,7 +1073,7 @@ cdef class _Timestamp(ABCTimestamp): @property def _date_repr(self) -> str: # Ideal here would be self.strftime("%Y-%m-%d"), but - # the datetime strftime() methods require year >= 1900 + # the datetime strftime() methods require year >= 1900 and is slower return f'{self.year}-{self.month:02d}-{self.day:02d}' @property @@ -1000,6 +1102,27 @@ cdef class _Timestamp(ABCTimestamp): # ----------------------------------------------------------------- # Conversion Methods + @cython.cdivision(False) + cdef _Timestamp _as_reso(self, NPY_DATETIMEUNIT reso, bint round_ok=True): + cdef: + int64_t value, mult, div, mod + + if reso == self._reso: + return self + + value = convert_reso(self.value, self._reso, reso, round_ok=round_ok) + return type(self)._from_value_and_reso(value, reso=reso, tz=self.tzinfo) + + def _as_unit(self, str unit, bint round_ok=True): + dtype = np.dtype(f"M8[{unit}]") + reso = get_unit_from_dtype(dtype) + try: + return self._as_reso(reso, round_ok=round_ok) + except OverflowError as err: + raise OutOfBoundsDatetime( + f"Cannot cast {self} to unit='{unit}' without overflow." 
+ ) from err + @property def asm8(self) -> np.datetime64: """ @@ -1596,10 +1719,10 @@ class Timestamp(_Timestamp): def _round(self, freq, mode, ambiguous='raise', nonexistent='raise'): cdef: - int64_t nanos = to_offset(freq).nanos + int64_t nanos - if self._reso != NPY_FR_ns: - raise NotImplementedError(self._reso) + to_offset(freq).nanos # raises on non-fixed freq + nanos = delta_to_nanoseconds(to_offset(freq), self._reso) if self.tz is not None: value = self.tz_localize(None).value @@ -1610,7 +1733,7 @@ class Timestamp(_Timestamp): # Will only ever contain 1 element for timestamp r = round_nsint64(value, mode, nanos)[0] - result = Timestamp(r, unit='ns') + result = Timestamp._from_value_and_reso(r, self._reso, None) if self.tz is not None: result = result.tz_localize( self.tz, ambiguous=ambiguous, nonexistent=nonexistent @@ -1996,9 +2119,6 @@ default 'raise' >>> pd.NaT.tz_localize() NaT """ - if self._reso != NPY_FR_ns: - raise NotImplementedError(self._reso) - if ambiguous == 'infer': raise ValueError('Cannot infer offset with only one time.') @@ -2027,7 +2147,7 @@ default 'raise' "Cannot localize tz-aware Timestamp, use tz_convert for conversions" ) - out = Timestamp(value, tz=tz) + out = type(self)._from_value_and_reso(value, self._reso, tz=tz) if out is not NaT: out._set_freq(self._freq) # avoid warning in constructor return out @@ -2074,9 +2194,6 @@ default 'raise' >>> pd.NaT.tz_convert(tz='Asia/Tokyo') NaT """ - if self._reso != NPY_FR_ns: - raise NotImplementedError(self._reso) - if self.tzinfo is None: # tz naive, use tz_localize raise TypeError( @@ -2084,7 +2201,8 @@ default 'raise' ) else: # Same UTC timestamp, different time zone - out = Timestamp(self.value, tz=tz) + tz = maybe_get_tz(tz) + out = type(self)._from_value_and_reso(self.value, reso=self._reso, tz=tz) if out is not NaT: out._set_freq(self._freq) # avoid warning in constructor return out @@ -2156,9 +2274,6 @@ default 'raise' datetime ts_input tzinfo_type tzobj - if self._reso != NPY_FR_ns: - raise NotImplementedError(self._reso) - # set to naive if needed tzobj = self.tzinfo value = self.value @@ -2171,7 +2286,7 @@ default 'raise' value = tz_convert_from_utc_single(value, tzobj, reso=self._reso) # setup components - dt64_to_dtstruct(value, &dts) + pandas_datetime_to_datetimestruct(value, self._reso, &dts) dts.ps = self.nanosecond * 1000 # replace @@ -2218,12 +2333,12 @@ default 'raise' 'fold': fold} ts_input = datetime(**kwargs) - ts = convert_datetime_to_tsobject(ts_input, tzobj) - value = ts.value + (dts.ps // 1000) - if value != NPY_NAT: - check_dts_bounds(&dts) - - return create_timestamp_from_ts(value, dts, tzobj, self._freq, fold) + ts = convert_datetime_to_tsobject( + ts_input, tzobj, nanos=dts.ps // 1000, reso=self._reso + ) + return create_timestamp_from_ts( + ts.value, dts, tzobj, self._freq, fold, reso=self._reso + ) def to_julian_date(self) -> np.float64: """ @@ -2261,29 +2376,24 @@ default 'raise' Return the day of the week represented by the date. Monday == 1 ... Sunday == 7. """ - return super().isoweekday() + # same as super().isoweekday(), but that breaks because of how + # we have overriden year, see note in create_timestamp_from_ts + return self.weekday() + 1 def weekday(self): """ Return the day of the week represented by the date. Monday == 0 ... Sunday == 6. 
""" - return super().weekday() + # same as super().weekday(), but that breaks because of how + # we have overriden year, see note in create_timestamp_from_ts + return ccalendar.dayofweek(self.year, self.month, self.day) # Aliases Timestamp.weekofyear = Timestamp.week Timestamp.daysinmonth = Timestamp.days_in_month -# Add the min and max fields at the class level -cdef int64_t _NS_UPPER_BOUND = np.iinfo(np.int64).max -cdef int64_t _NS_LOWER_BOUND = NPY_NAT + 1 - -# Resolution is in nanoseconds -Timestamp.min = Timestamp(_NS_LOWER_BOUND) -Timestamp.max = Timestamp(_NS_UPPER_BOUND) -Timestamp.resolution = Timedelta(nanoseconds=1) # GH#21336, GH#21365 - # ---------------------------------------------------------------------- # Scalar analogues to functions in vectorized.pyx diff --git a/pandas/_libs/tslibs/timezones.pyi b/pandas/_libs/tslibs/timezones.pyi index 20c403e93b149..d241a35f21cca 100644 --- a/pandas/_libs/tslibs/timezones.pyi +++ b/pandas/_libs/tslibs/timezones.pyi @@ -6,8 +6,6 @@ from typing import Callable import numpy as np -from pandas._typing import npt - # imported from dateutil.tz dateutil_gettz: Callable[[str], tzinfo] @@ -17,9 +15,6 @@ def infer_tzinfo( start: datetime | None, end: datetime | None, ) -> tzinfo | None: ... -def get_dst_info( - tz: tzinfo, -) -> tuple[npt.NDArray[np.int64], npt.NDArray[np.int64], str]: ... def maybe_get_tz(tz: str | int | np.int64 | tzinfo | None) -> tzinfo | None: ... def get_timezone(tz: tzinfo) -> tzinfo | str: ... def is_utc(tz: tzinfo | None) -> bool: ... diff --git a/pandas/_libs/tslibs/tzconversion.pyx b/pandas/_libs/tslibs/tzconversion.pyx index 7657633c7215a..4487136aa7fb8 100644 --- a/pandas/_libs/tslibs/tzconversion.pyx +++ b/pandas/_libs/tslibs/tzconversion.pyx @@ -27,11 +27,10 @@ from numpy cimport ( cnp.import_array() -from pandas._libs.tslibs.ccalendar cimport ( - DAY_NANOS, - HOUR_NANOS, +from pandas._libs.tslibs.dtypes cimport ( + periods_per_day, + periods_per_second, ) -from pandas._libs.tslibs.dtypes cimport periods_per_second from pandas._libs.tslibs.nattype cimport NPY_NAT from pandas._libs.tslibs.np_datetime cimport ( NPY_DATETIMEUNIT, @@ -153,6 +152,7 @@ cdef int64_t tz_localize_to_utc_single( return val elif is_utc(tz) or tz is None: + # TODO: test with non-nano return val elif is_tzlocal(tz) or is_zoneinfo(tz): @@ -161,6 +161,15 @@ cdef int64_t tz_localize_to_utc_single( elif is_fixed_offset(tz): _, deltas, _ = get_dst_info(tz) delta = deltas[0] + # TODO: de-duplicate with Localizer.__init__ + if reso != NPY_DATETIMEUNIT.NPY_FR_ns: + if reso == NPY_DATETIMEUNIT.NPY_FR_us: + delta = delta // 1000 + elif reso == NPY_DATETIMEUNIT.NPY_FR_ms: + delta = delta // 1_000_000 + elif reso == NPY_DATETIMEUNIT.NPY_FR_s: + delta = delta // 1_000_000_000 + return val - delta else: @@ -229,6 +238,7 @@ timedelta-like} bint fill_nonexist = False str stamp Localizer info = Localizer(tz, reso=reso) + int64_t pph = periods_per_day(reso) // 24 # Vectorized version of DstTzInfo.localize if info.use_utc: @@ -242,7 +252,9 @@ timedelta-like} if v == NPY_NAT: result[i] = NPY_NAT else: - result[i] = v - _tz_localize_using_tzinfo_api(v, tz, to_utc=True, reso=reso) + result[i] = v - _tz_localize_using_tzinfo_api( + v, tz, to_utc=True, reso=reso + ) return result.base # to return underlying ndarray elif info.use_fixed: @@ -283,7 +295,7 @@ timedelta-like} shift_backward = True elif PyDelta_Check(nonexistent): from .timedeltas import delta_to_nanoseconds - shift_delta = delta_to_nanoseconds(nonexistent) + shift_delta = 
delta_to_nanoseconds(nonexistent, reso=reso) elif nonexistent not in ('raise', None): msg = ("nonexistent must be one of {'NaT', 'raise', 'shift_forward', " "shift_backwards} or a timedelta object") @@ -291,12 +303,14 @@ timedelta-like} # Determine whether each date lies left of the DST transition (store in # result_a) or right of the DST transition (store in result_b) - result_a, result_b =_get_utc_bounds(vals, info.tdata, info.ntrans, info.deltas) + result_a, result_b =_get_utc_bounds( + vals, info.tdata, info.ntrans, info.deltas, reso=reso + ) # silence false-positive compiler warning dst_hours = np.empty(0, dtype=np.int64) if infer_dst: - dst_hours = _get_dst_hours(vals, result_a, result_b) + dst_hours = _get_dst_hours(vals, result_a, result_b, reso=reso) # Pre-compute delta_idx_offset that will be used if we go down non-existent # paths. @@ -316,12 +330,15 @@ timedelta-like} left = result_a[i] right = result_b[i] if val == NPY_NAT: + # TODO: test with non-nano result[i] = val elif left != NPY_NAT and right != NPY_NAT: if left == right: + # TODO: test with non-nano result[i] = left else: if infer_dst and dst_hours[i] != NPY_NAT: + # TODO: test with non-nano result[i] = dst_hours[i] elif is_dst: if ambiguous_array[i]: @@ -329,9 +346,10 @@ timedelta-like} else: result[i] = right elif fill: + # TODO: test with non-nano; parametrize test_dt_round_tz_ambiguous result[i] = NPY_NAT else: - stamp = _render_tstamp(val) + stamp = _render_tstamp(val, reso=reso) raise pytz.AmbiguousTimeError( f"Cannot infer dst time from {stamp}, try using the " "'ambiguous' argument" @@ -339,23 +357,24 @@ timedelta-like} elif left != NPY_NAT: result[i] = left elif right != NPY_NAT: + # TODO: test with non-nano result[i] = right else: # Handle nonexistent times if shift_forward or shift_backward or shift_delta != 0: # Shift the nonexistent time to the closest existing time - remaining_mins = val % HOUR_NANOS + remaining_mins = val % pph if shift_delta != 0: # Validate that we don't relocalize on another nonexistent # time - if -1 < shift_delta + remaining_mins < HOUR_NANOS: + if -1 < shift_delta + remaining_mins < pph: raise ValueError( "The provided timedelta will relocalize on a " f"nonexistent time: {nonexistent}" ) new_local = val + shift_delta elif shift_forward: - new_local = val + (HOUR_NANOS - remaining_mins) + new_local = val + (pph - remaining_mins) else: # Subtract 1 since the beginning hour is _inclusive_ of # nonexistent times @@ -368,7 +387,7 @@ timedelta-like} elif fill_nonexist: result[i] = NPY_NAT else: - stamp = _render_tstamp(val) + stamp = _render_tstamp(val, reso=reso) raise pytz.NonExistentTimeError(stamp) return result.base # .base to get underlying ndarray @@ -404,10 +423,11 @@ cdef inline Py_ssize_t bisect_right_i8(int64_t *data, return left -cdef inline str _render_tstamp(int64_t val): +cdef inline str _render_tstamp(int64_t val, NPY_DATETIMEUNIT reso): """ Helper function to render exception messages""" from pandas._libs.tslibs.timestamps import Timestamp - return str(Timestamp(val)) + ts = Timestamp._from_value_and_reso(val, reso, None) + return str(ts) cdef _get_utc_bounds( @@ -415,6 +435,7 @@ cdef _get_utc_bounds( int64_t* tdata, Py_ssize_t ntrans, const int64_t[::1] deltas, + NPY_DATETIMEUNIT reso, ): # Determine whether each date lies left of the DST transition (store in # result_a) or right of the DST transition (store in result_b) @@ -424,6 +445,7 @@ cdef _get_utc_bounds( Py_ssize_t i, n = vals.size int64_t val, v_left, v_right Py_ssize_t isl, isr, pos_left, pos_right + int64_t 
ppd = periods_per_day(reso) result_a = cnp.PyArray_EMPTY(vals.ndim, vals.shape, cnp.NPY_INT64, 0) result_b = cnp.PyArray_EMPTY(vals.ndim, vals.shape, cnp.NPY_INT64, 0) @@ -438,8 +460,8 @@ cdef _get_utc_bounds( if val == NPY_NAT: continue - # TODO: be careful of overflow in val-DAY_NANOS - isl = bisect_right_i8(tdata, val - DAY_NANOS, ntrans) - 1 + # TODO: be careful of overflow in val-ppd + isl = bisect_right_i8(tdata, val - ppd, ntrans) - 1 if isl < 0: isl = 0 @@ -449,8 +471,8 @@ cdef _get_utc_bounds( if v_left + deltas[pos_left] == val: result_a[i] = v_left - # TODO: be careful of overflow in val+DAY_NANOS - isr = bisect_right_i8(tdata, val + DAY_NANOS, ntrans) - 1 + # TODO: be careful of overflow in val+ppd + isr = bisect_right_i8(tdata, val + ppd, ntrans) - 1 if isr < 0: isr = 0 @@ -465,10 +487,11 @@ cdef _get_utc_bounds( @cython.boundscheck(False) cdef ndarray[int64_t] _get_dst_hours( - # vals only needed here to potential render an exception message + # vals, reso only needed here to potential render an exception message const int64_t[:] vals, ndarray[int64_t] result_a, ndarray[int64_t] result_b, + NPY_DATETIMEUNIT reso, ): cdef: Py_ssize_t i, n = vals.shape[0] @@ -496,8 +519,8 @@ cdef ndarray[int64_t] _get_dst_hours( trans_idx = mismatch.nonzero()[0] if trans_idx.size == 1: - # TODO: not reached in tests 2022-05-02; possible? - stamp = _render_tstamp(vals[trans_idx[0]]) + # see test_tz_localize_to_utc_ambiguous_infer + stamp = _render_tstamp(vals[trans_idx[0]], reso=reso) raise pytz.AmbiguousTimeError( f"Cannot infer dst time from {stamp} as there " "are no repeated times" @@ -518,15 +541,15 @@ cdef ndarray[int64_t] _get_dst_hours( delta = np.diff(result_a[grp]) if grp.size == 1 or np.all(delta > 0): - # TODO: not reached in tests 2022-05-02; possible? - stamp = _render_tstamp(vals[grp[0]]) + # see test_tz_localize_to_utc_ambiguous_infer + stamp = _render_tstamp(vals[grp[0]], reso=reso) raise pytz.AmbiguousTimeError(stamp) # Find the index for the switch and pull from a for dst and b # for standard switch_idxs = (delta <= 0).nonzero()[0] if switch_idxs.size > 1: - # TODO: not reached in tests 2022-05-02; possible? + # see test_tz_localize_to_utc_ambiguous_infer raise pytz.AmbiguousTimeError( f"There are {switch_idxs.size} dst switches when " "there should only be 1." @@ -619,9 +642,7 @@ cdef int64_t _tz_localize_using_tzinfo_api( if not to_utc: # tz.utcoffset only makes sense if datetime # is _wall time_, so if val is a UTC timestamp convert to wall time - dt = datetime_new(dts.year, dts.month, dts.day, dts.hour, - dts.min, dts.sec, dts.us, utc_pytz) - dt = dt.astimezone(tz) + dt = _astimezone(dts, tz) if fold is not NULL: # NB: fold is only passed with to_utc=False @@ -635,6 +656,27 @@ cdef int64_t _tz_localize_using_tzinfo_api( return delta +cdef datetime _astimezone(npy_datetimestruct dts, tzinfo tz): + """ + Optimized equivalent to: + + dt = datetime(dts.year, dts.month, dts.day, dts.hour, + dts.min, dts.sec, dts.us, utc_pytz) + dt = dt.astimezone(tz) + + Derived from the datetime.astimezone implementation at + https://github.com/python/cpython/blob/main/Modules/_datetimemodule.c#L6187 + + NB: we are assuming tz is not None. + """ + cdef: + datetime result + + result = datetime_new(dts.year, dts.month, dts.day, dts.hour, + dts.min, dts.sec, dts.us, tz) + return tz.fromutc(result) + + # NB: relies on dateutil internals, subject to change. 
@cython.boundscheck(False) @cython.wraparound(False) diff --git a/pandas/_libs/tslibs/vectorized.pyi b/pandas/_libs/tslibs/vectorized.pyi index 8820a17ce5996..d24541aede8d8 100644 --- a/pandas/_libs/tslibs/vectorized.pyi +++ b/pandas/_libs/tslibs/vectorized.pyi @@ -14,6 +14,7 @@ def dt64arr_to_periodarr( stamps: npt.NDArray[np.int64], freq: int, tz: tzinfo | None, + reso: int = ..., # NPY_DATETIMEUNIT ) -> npt.NDArray[np.int64]: ... def is_date_array_normalized( stamps: npt.NDArray[np.int64], @@ -28,6 +29,7 @@ def normalize_i8_timestamps( def get_resolution( stamps: npt.NDArray[np.int64], tz: tzinfo | None = ..., + reso: int = ..., # NPY_DATETIMEUNIT ) -> Resolution: ... def ints_to_pydatetime( arr: npt.NDArray[np.int64], @@ -35,9 +37,10 @@ def ints_to_pydatetime( freq: BaseOffset | None = ..., fold: bool = ..., box: str = ..., + reso: int = ..., # NPY_DATETIMEUNIT ) -> npt.NDArray[np.object_]: ... def tz_convert_from_utc( stamps: npt.NDArray[np.int64], tz: tzinfo | None, - reso: int = ..., + reso: int = ..., # NPY_DATETIMEUNIT ) -> npt.NDArray[np.int64]: ... diff --git a/pandas/_libs/tslibs/vectorized.pyx b/pandas/_libs/tslibs/vectorized.pyx index 2cab55e607f15..b63b4cf1df66b 100644 --- a/pandas/_libs/tslibs/vectorized.pyx +++ b/pandas/_libs/tslibs/vectorized.pyx @@ -19,7 +19,6 @@ cnp.import_array() from .dtypes import Resolution -from .ccalendar cimport DAY_NANOS from .dtypes cimport ( c_Resolution, periods_per_day, @@ -31,8 +30,8 @@ from .nattype cimport ( from .np_datetime cimport ( NPY_DATETIMEUNIT, NPY_FR_ns, - dt64_to_dtstruct, npy_datetimestruct, + pandas_datetime_to_datetimestruct, ) from .offsets cimport BaseOffset from .period cimport get_period_ordinal @@ -99,7 +98,8 @@ def ints_to_pydatetime( tzinfo tz=None, BaseOffset freq=None, bint fold=False, - str box="datetime" + str box="datetime", + NPY_DATETIMEUNIT reso=NPY_FR_ns, ) -> np.ndarray: # stamps is int64, arbitrary ndim """ @@ -125,12 +125,14 @@ def ints_to_pydatetime( * If time, convert to datetime.time * If Timestamp, convert to pandas.Timestamp + reso : NPY_DATETIMEUNIT, default NPY_FR_ns + Returns ------- ndarray[object] of type specified by box """ cdef: - Localizer info = Localizer(tz, reso=NPY_FR_ns) + Localizer info = Localizer(tz, reso=reso) int64_t utc_val, local_val Py_ssize_t i, n = stamps.size Py_ssize_t pos = -1 # unused, avoid not-initialized warning @@ -178,10 +180,12 @@ def ints_to_pydatetime( # find right representation of dst etc in pytz timezone new_tz = tz._tzinfos[tz._transition_info[pos]] - dt64_to_dtstruct(local_val, &dts) + pandas_datetime_to_datetimestruct(local_val, reso, &dts) if use_ts: - res_val = create_timestamp_from_ts(utc_val, dts, new_tz, freq, fold) + res_val = create_timestamp_from_ts( + utc_val, dts, new_tz, freq, fold, reso=reso + ) elif use_pydt: res_val = datetime( dts.year, dts.month, dts.day, dts.hour, dts.min, dts.sec, dts.us, @@ -226,17 +230,19 @@ cdef inline c_Resolution _reso_stamp(npy_datetimestruct *dts): @cython.wraparound(False) @cython.boundscheck(False) -def get_resolution(ndarray stamps, tzinfo tz=None) -> Resolution: +def get_resolution( + ndarray stamps, tzinfo tz=None, NPY_DATETIMEUNIT reso=NPY_FR_ns +) -> Resolution: # stamps is int64_t, any ndim cdef: - Localizer info = Localizer(tz, reso=NPY_FR_ns) + Localizer info = Localizer(tz, reso=reso) int64_t utc_val, local_val Py_ssize_t i, n = stamps.size Py_ssize_t pos = -1 # unused, avoid not-initialized warning cnp.flatiter it = cnp.PyArray_IterNew(stamps) npy_datetimestruct dts - c_Resolution reso = 
c_Resolution.RESO_DAY, curr_reso + c_Resolution pd_reso = c_Resolution.RESO_DAY, curr_reso for i in range(n): # Analogous to: utc_val = stamps[i] @@ -247,14 +253,14 @@ def get_resolution(ndarray stamps, tzinfo tz=None) -> Resolution: else: local_val = info.utc_val_to_local_val(utc_val, &pos) - dt64_to_dtstruct(local_val, &dts) + pandas_datetime_to_datetimestruct(local_val, reso, &dts) curr_reso = _reso_stamp(&dts) - if curr_reso < reso: - reso = curr_reso + if curr_reso < pd_reso: + pd_reso = curr_reso cnp.PyArray_ITER_NEXT(it) - return Resolution(reso) + return Resolution(pd_reso) # ------------------------------------------------------------------------- @@ -354,10 +360,12 @@ def is_date_array_normalized(ndarray stamps, tzinfo tz, NPY_DATETIMEUNIT reso) - @cython.wraparound(False) @cython.boundscheck(False) -def dt64arr_to_periodarr(ndarray stamps, int freq, tzinfo tz): +def dt64arr_to_periodarr( + ndarray stamps, int freq, tzinfo tz, NPY_DATETIMEUNIT reso=NPY_FR_ns +): # stamps is int64_t, arbitrary ndim cdef: - Localizer info = Localizer(tz, reso=NPY_FR_ns) + Localizer info = Localizer(tz, reso=reso) Py_ssize_t i, n = stamps.size Py_ssize_t pos = -1 # unused, avoid not-initialized warning int64_t utc_val, local_val, res_val @@ -374,7 +382,7 @@ def dt64arr_to_periodarr(ndarray stamps, int freq, tzinfo tz): res_val = NPY_NAT else: local_val = info.utc_val_to_local_val(utc_val, &pos) - dt64_to_dtstruct(local_val, &dts) + pandas_datetime_to_datetimestruct(local_val, reso, &dts) res_val = get_period_ordinal(&dts, freq) # Analogous to: result[i] = res_val diff --git a/pandas/_libs/writers.pyi b/pandas/_libs/writers.pyi index 930322fcbeb77..0d2096eee3573 100644 --- a/pandas/_libs/writers.pyi +++ b/pandas/_libs/writers.pyi @@ -17,7 +17,7 @@ def max_len_string_array( ) -> int: ... def word_len(val: object) -> int: ... def string_array_replace_from_nan_rep( - arr: np.ndarray, # np.ndarray[object, ndim=1] + arr: np.ndarray, # np.ndarray[object, ndim=1] nan_rep: object, replace: object = ..., ) -> None: ... 
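The tslibs hunks above thread non-nanosecond resolutions through Timestamp construction, arithmetic, min/max/resolution, and tz handling. A minimal sketch of the intended behaviour, assuming a pandas development build containing these changes; `_as_unit` and the per-instance `min`/`max` are private helpers introduced here and may change:

import pandas as pd

ts = pd.Timestamp("2022-01-01 00:00:00.000000001")   # nanosecond resolution by default
sec = ts._as_unit("s")                                # cast to second resolution (round_ok=True by default)
print(sec.max)                                        # instance max now reflects the coarser resolution

try:
    ts._as_unit("s", round_ok=False)                  # lossy cast should raise rather than round silently
except ValueError as err:
    print(err)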
diff --git a/pandas/_testing/__init__.py b/pandas/_testing/__init__.py index 9d3e84dc964d6..1035fd08a1a36 100644 --- a/pandas/_testing/__init__.py +++ b/pandas/_testing/__init__.py @@ -19,7 +19,7 @@ import numpy as np -from pandas._config.localization import ( # noqa:F401 +from pandas._config.localization import ( can_set_locale, get_locales, set_locale, @@ -49,7 +49,7 @@ Series, bdate_range, ) -from pandas._testing._io import ( # noqa:F401 +from pandas._testing._io import ( close, network, round_trip_localpath, @@ -57,16 +57,16 @@ round_trip_pickle, write_to_compressed, ) -from pandas._testing._random import ( # noqa:F401 +from pandas._testing._random import ( randbool, rands, rands_array, ) -from pandas._testing._warnings import ( # noqa:F401 +from pandas._testing._warnings import ( assert_produces_warning, maybe_produces_warning, ) -from pandas._testing.asserters import ( # noqa:F401 +from pandas._testing.asserters import ( assert_almost_equal, assert_attr_equal, assert_categorical_equal, @@ -91,11 +91,11 @@ assert_timedelta_array_equal, raise_assert_detail, ) -from pandas._testing.compat import ( # noqa:F401 +from pandas._testing.compat import ( get_dtype, get_obj, ) -from pandas._testing.contexts import ( # noqa:F401 +from pandas._testing.contexts import ( RNGContext, decompress_file, ensure_clean, @@ -238,7 +238,7 @@ _testing_mode_warnings = (DeprecationWarning, ResourceWarning) -def set_testing_mode(): +def set_testing_mode() -> None: # set the testing mode filters testing_mode = os.environ.get("PANDAS_TESTING_MODE", "None") if "deprecate" in testing_mode: @@ -246,7 +246,7 @@ def set_testing_mode(): warnings.simplefilter("always", category) -def reset_testing_mode(): +def reset_testing_mode() -> None: # reset the testing mode filters testing_mode = os.environ.get("PANDAS_TESTING_MODE", "None") if "deprecate" in testing_mode: @@ -257,7 +257,7 @@ def reset_testing_mode(): set_testing_mode() -def reset_display_options(): +def reset_display_options() -> None: """ Reset the display options for printing and representing objects. 
""" @@ -333,16 +333,16 @@ def to_array(obj): # Others -def getCols(k): +def getCols(k) -> str: return string.ascii_uppercase[:k] # make index -def makeStringIndex(k=10, name=None): +def makeStringIndex(k=10, name=None) -> Index: return Index(rands_array(nchars=10, size=k), name=name) -def makeCategoricalIndex(k=10, n=3, name=None, **kwargs): +def makeCategoricalIndex(k=10, n=3, name=None, **kwargs) -> CategoricalIndex: """make a length k index or n categories""" x = rands_array(nchars=4, size=n, replace=False) return CategoricalIndex( @@ -350,13 +350,13 @@ def makeCategoricalIndex(k=10, n=3, name=None, **kwargs): ) -def makeIntervalIndex(k=10, name=None, **kwargs): +def makeIntervalIndex(k=10, name=None, **kwargs) -> IntervalIndex: """make a length k IntervalIndex""" x = np.linspace(0, 100, num=(k + 1)) return IntervalIndex.from_breaks(x, name=name, **kwargs) -def makeBoolIndex(k=10, name=None): +def makeBoolIndex(k=10, name=None) -> Index: if k == 1: return Index([True], name=name) elif k == 2: @@ -364,7 +364,7 @@ def makeBoolIndex(k=10, name=None): return Index([False, True] + [False] * (k - 2), name=name) -def makeNumericIndex(k=10, name=None, *, dtype): +def makeNumericIndex(k=10, name=None, *, dtype) -> NumericIndex: dtype = pandas_dtype(dtype) assert isinstance(dtype, np.dtype) @@ -382,21 +382,21 @@ def makeNumericIndex(k=10, name=None, *, dtype): return NumericIndex(values, dtype=dtype, name=name) -def makeIntIndex(k=10, name=None): +def makeIntIndex(k=10, name=None) -> Int64Index: base_idx = makeNumericIndex(k, name=name, dtype="int64") return Int64Index(base_idx) -def makeUIntIndex(k=10, name=None): +def makeUIntIndex(k=10, name=None) -> UInt64Index: base_idx = makeNumericIndex(k, name=name, dtype="uint64") return UInt64Index(base_idx) -def makeRangeIndex(k=10, name=None, **kwargs): +def makeRangeIndex(k=10, name=None, **kwargs) -> RangeIndex: return RangeIndex(0, k, 1, name=name, **kwargs) -def makeFloatIndex(k=10, name=None): +def makeFloatIndex(k=10, name=None) -> Float64Index: base_idx = makeNumericIndex(k, name=name, dtype="float64") return Float64Index(base_idx) @@ -456,34 +456,34 @@ def all_timeseries_index_generator(k: int = 10) -> Iterable[Index]: # make series -def make_rand_series(name=None, dtype=np.float64): +def make_rand_series(name=None, dtype=np.float64) -> Series: index = makeStringIndex(_N) data = np.random.randn(_N) data = data.astype(dtype, copy=False) return Series(data, index=index, name=name) -def makeFloatSeries(name=None): +def makeFloatSeries(name=None) -> Series: return make_rand_series(name=name) -def makeStringSeries(name=None): +def makeStringSeries(name=None) -> Series: return make_rand_series(name=name) -def makeObjectSeries(name=None): +def makeObjectSeries(name=None) -> Series: data = makeStringIndex(_N) data = Index(data, dtype=object) index = makeStringIndex(_N) return Series(data, index=index, name=name) -def getSeriesData(): +def getSeriesData() -> dict[str, Series]: index = makeStringIndex(_N) return {c: Series(np.random.randn(_N), index=index) for c in getCols(_K)} -def makeTimeSeries(nper=None, freq="B", name=None): +def makeTimeSeries(nper=None, freq="B", name=None) -> Series: if nper is None: nper = _N return Series( @@ -491,22 +491,22 @@ def makeTimeSeries(nper=None, freq="B", name=None): ) -def makePeriodSeries(nper=None, name=None): +def makePeriodSeries(nper=None, name=None) -> Series: if nper is None: nper = _N return Series(np.random.randn(nper), index=makePeriodIndex(nper), name=name) -def getTimeSeriesData(nper=None, freq="B"): 
+def getTimeSeriesData(nper=None, freq="B") -> dict[str, Series]: return {c: makeTimeSeries(nper, freq) for c in getCols(_K)} -def getPeriodData(nper=None): +def getPeriodData(nper=None) -> dict[str, Series]: return {c: makePeriodSeries(nper) for c in getCols(_K)} # make frame -def makeTimeDataFrame(nper=None, freq="B"): +def makeTimeDataFrame(nper=None, freq="B") -> DataFrame: data = getTimeSeriesData(nper, freq) return DataFrame(data) @@ -529,18 +529,23 @@ def getMixedTypeDict(): return index, data -def makeMixedDataFrame(): +def makeMixedDataFrame() -> DataFrame: return DataFrame(getMixedTypeDict()[1]) -def makePeriodFrame(nper=None): +def makePeriodFrame(nper=None) -> DataFrame: data = getPeriodData(nper) return DataFrame(data) def makeCustomIndex( - nentries, nlevels, prefix="#", names=False, ndupe_l=None, idx_type=None -): + nentries, + nlevels, + prefix="#", + names: bool | str | list[str] | None = False, + ndupe_l=None, + idx_type=None, +) -> Index: """ Create an index/multindex with given dimensions, levels, names, etc' @@ -637,7 +642,8 @@ def keyfunc(x): # convert tuples to index if nentries == 1: # we have a single level of tuples, i.e. a regular Index - index = Index(tuples[0], name=names[0]) + name = None if names is None else names[0] + index = Index(tuples[0], name=name) elif nlevels == 1: name = None if names is None else names[0] index = Index((x[0] for x in tuples), name=name) @@ -659,7 +665,7 @@ def makeCustomDataframe( dtype=None, c_idx_type=None, r_idx_type=None, -): +) -> DataFrame: """ Create a DataFrame using supplied parameters. @@ -780,7 +786,7 @@ def _gen_unique_rand(rng, _extra_size): return i.tolist(), j.tolist() -def makeMissingDataframe(density=0.9, random_state=None): +def makeMissingDataframe(density=0.9, random_state=None) -> DataFrame: df = makeDataFrame() i, j = _create_missing_idx(*df.shape, density=density, random_state=random_state) df.values[i, j] = np.nan @@ -854,7 +860,7 @@ def skipna_wrapper(x): return skipna_wrapper -def convert_rows_list_to_csv_str(rows_list: list[str]): +def convert_rows_list_to_csv_str(rows_list: list[str]) -> str: """ Convert list of CSV rows to single CSV-formatted string for current OS. 
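Most of the pandas._testing hunks above just add return annotations to the test-data makers; behaviour is unchanged, but type checkers can now infer what each helper returns. A small usage sketch (pandas._testing is private tooling for the test suite):

import pandas._testing as tm

ser = tm.makeTimeSeries(nper=5)       # inferred as Series
df = tm.makeTimeDataFrame(nper=5)     # inferred as DataFrame
idx = tm.makeIntIndex(k=3)            # inferred as Int64Index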
@@ -1027,3 +1033,128 @@ def shares_memory(left, right) -> bool: return shares_memory(arr, right) raise NotImplementedError(type(left), type(right)) + + +__all__ = [ + "ALL_INT_EA_DTYPES", + "ALL_INT_NUMPY_DTYPES", + "ALL_NUMPY_DTYPES", + "ALL_REAL_NUMPY_DTYPES", + "all_timeseries_index_generator", + "assert_almost_equal", + "assert_attr_equal", + "assert_categorical_equal", + "assert_class_equal", + "assert_contains_all", + "assert_copy", + "assert_datetime_array_equal", + "assert_dict_equal", + "assert_equal", + "assert_extension_array_equal", + "assert_frame_equal", + "assert_index_equal", + "assert_indexing_slices_equivalent", + "assert_interval_array_equal", + "assert_is_sorted", + "assert_is_valid_plot_return_object", + "assert_metadata_equivalent", + "assert_numpy_array_equal", + "assert_period_array_equal", + "assert_produces_warning", + "assert_series_equal", + "assert_sp_array_equal", + "assert_timedelta_array_equal", + "at", + "BOOL_DTYPES", + "box_expected", + "BYTES_DTYPES", + "can_set_locale", + "close", + "COMPLEX_DTYPES", + "convert_rows_list_to_csv_str", + "DATETIME64_DTYPES", + "decompress_file", + "EMPTY_STRING_PATTERN", + "ENDIAN", + "ensure_clean", + "ensure_clean_dir", + "ensure_safe_environment_variables", + "equalContents", + "external_error_raised", + "FLOAT_EA_DTYPES", + "FLOAT_NUMPY_DTYPES", + "getCols", + "get_cython_table_params", + "get_dtype", + "getitem", + "get_locales", + "getMixedTypeDict", + "get_obj", + "get_op_from_name", + "getPeriodData", + "getSeriesData", + "getTimeSeriesData", + "iat", + "iloc", + "index_subclass_makers_generator", + "loc", + "makeBoolIndex", + "makeCategoricalIndex", + "makeCustomDataframe", + "makeCustomIndex", + "makeDataFrame", + "makeDateIndex", + "makeFloatIndex", + "makeFloatSeries", + "makeIntervalIndex", + "makeIntIndex", + "makeMissingDataframe", + "makeMixedDataFrame", + "makeMultiIndex", + "makeNumericIndex", + "makeObjectSeries", + "makePeriodFrame", + "makePeriodIndex", + "makePeriodSeries", + "make_rand_series", + "makeRangeIndex", + "makeStringIndex", + "makeStringSeries", + "makeTimeDataFrame", + "makeTimedeltaIndex", + "makeTimeSeries", + "makeUIntIndex", + "maybe_produces_warning", + "NARROW_NP_DTYPES", + "network", + "NP_NAT_OBJECTS", + "NULL_OBJECTS", + "OBJECT_DTYPES", + "raise_assert_detail", + "randbool", + "rands", + "reset_display_options", + "reset_testing_mode", + "RNGContext", + "round_trip_localpath", + "round_trip_pathlib", + "round_trip_pickle", + "setitem", + "set_locale", + "set_testing_mode", + "set_timezone", + "shares_memory", + "SIGNED_INT_EA_DTYPES", + "SIGNED_INT_NUMPY_DTYPES", + "STRING_DTYPES", + "SubclassedCategorical", + "SubclassedDataFrame", + "SubclassedSeries", + "TIMEDELTA64_DTYPES", + "to_array", + "UNSIGNED_INT_EA_DTYPES", + "UNSIGNED_INT_NUMPY_DTYPES", + "use_numexpr", + "with_csv_dialect", + "write_to_compressed", +] diff --git a/pandas/_testing/_io.py b/pandas/_testing/_io.py index 46f1545a67fab..d1acdff8d2fd7 100644 --- a/pandas/_testing/_io.py +++ b/pandas/_testing/_io.py @@ -250,7 +250,7 @@ def wrapper(*args, **kwargs): return wrapper -def can_connect(url, error_classes=None): +def can_connect(url, error_classes=None) -> bool: """ Try to connect to the given url. 
True if succeeds, False if OSError raised @@ -424,7 +424,7 @@ def write_to_compressed(compression, path, data, dest="test"): # Plotting -def close(fignum=None): +def close(fignum=None) -> None: from matplotlib.pyplot import ( close as _close, get_fignums, diff --git a/pandas/_testing/_random.py b/pandas/_testing/_random.py index cce6bf8da7d3e..880fffea21bd1 100644 --- a/pandas/_testing/_random.py +++ b/pandas/_testing/_random.py @@ -14,7 +14,7 @@ def randbool(size=(), p: float = 0.5): ) -def rands_array(nchars, size, dtype="O", replace=True): +def rands_array(nchars, size, dtype="O", replace=True) -> np.ndarray: """ Generate an array of byte strings. """ @@ -26,7 +26,7 @@ def rands_array(nchars, size, dtype="O", replace=True): return retval.astype(dtype) -def rands(nchars): +def rands(nchars) -> str: """ Generate one random byte string. diff --git a/pandas/_testing/asserters.py b/pandas/_testing/asserters.py index 7170089581f69..c7924dc451752 100644 --- a/pandas/_testing/asserters.py +++ b/pandas/_testing/asserters.py @@ -45,10 +45,7 @@ Series, TimedeltaIndex, ) -from pandas.core.algorithms import ( - safe_sort, - take_nd, -) +from pandas.core.algorithms import take_nd from pandas.core.arrays import ( DatetimeArray, ExtensionArray, @@ -58,6 +55,7 @@ ) from pandas.core.arrays.datetimelike import DatetimeLikeArrayMixin from pandas.core.arrays.string_ import StringDtype +from pandas.core.indexes.api import safe_sort_index from pandas.io.formats.printing import pprint_thing @@ -70,7 +68,7 @@ def assert_almost_equal( rtol: float = 1.0e-5, atol: float = 1.0e-8, **kwargs, -): +) -> None: """ Check that the left and right objects are approximately equal. @@ -241,7 +239,7 @@ def _check_isinstance(left, right, cls): ) -def assert_dict_equal(left, right, compare_keys: bool = True): +def assert_dict_equal(left, right, compare_keys: bool = True) -> None: _check_isinstance(left, right, dict) _testing.assert_dict_equal(left, right, compare_keys=compare_keys) @@ -367,8 +365,8 @@ def _get_ilevel_values(index, level): # If order doesn't matter then sort the index entries if not check_order: - left = Index(safe_sort(left)) - right = Index(safe_sort(right)) + left = safe_sort_index(left) + right = safe_sort_index(right) # MultiIndex special comparison for little-friendly error messages if left.nlevels > 1: @@ -430,7 +428,7 @@ def _get_ilevel_values(index, level): assert_categorical_equal(left._values, right._values, obj=f"{obj} category") -def assert_class_equal(left, right, exact: bool | str = True, obj="Input"): +def assert_class_equal(left, right, exact: bool | str = True, obj="Input") -> None: """ Checks classes are equal. """ @@ -457,7 +455,7 @@ def repr_class(x): raise_assert_detail(obj, msg, repr_class(left), repr_class(right)) -def assert_attr_equal(attr: str, left, right, obj: str = "Attributes"): +def assert_attr_equal(attr: str, left, right, obj: str = "Attributes") -> None: """ Check attributes are equal. Both objects must have attribute. @@ -476,11 +474,9 @@ def assert_attr_equal(attr: str, left, right, obj: str = "Attributes"): left_attr = getattr(left, attr) right_attr = getattr(right, attr) - if left_attr is right_attr: - return True - elif is_matching_na(left_attr, right_attr): + if left_attr is right_attr or is_matching_na(left_attr, right_attr): # e.g. both np.nan, both NaT, both pd.NA, ... 
- return True + return None try: result = left_attr == right_attr @@ -492,14 +488,13 @@ def assert_attr_equal(attr: str, left, right, obj: str = "Attributes"): elif not isinstance(result, bool): result = result.all() - if result: - return True - else: + if not result: msg = f'Attribute "{attr}" are different' raise_assert_detail(obj, msg, left_attr, right_attr) + return None -def assert_is_valid_plot_return_object(objs): +def assert_is_valid_plot_return_object(objs) -> None: import matplotlib.pyplot as plt if isinstance(objs, (Series, np.ndarray)): @@ -518,7 +513,7 @@ def assert_is_valid_plot_return_object(objs): assert isinstance(objs, (plt.Artist, tuple, dict)), msg -def assert_is_sorted(seq): +def assert_is_sorted(seq) -> None: """Assert that the sequence is sorted.""" if isinstance(seq, (Index, Series)): seq = seq.values @@ -528,7 +523,7 @@ def assert_is_sorted(seq): def assert_categorical_equal( left, right, check_dtype=True, check_category_order=True, obj="Categorical" -): +) -> None: """ Test that Categoricals are equivalent. @@ -583,7 +578,9 @@ def assert_categorical_equal( assert_attr_equal("ordered", left, right, obj=obj) -def assert_interval_array_equal(left, right, exact="equiv", obj="IntervalArray"): +def assert_interval_array_equal( + left, right, exact="equiv", obj="IntervalArray" +) -> None: """ Test that two IntervalArrays are equivalent. @@ -612,14 +609,16 @@ def assert_interval_array_equal(left, right, exact="equiv", obj="IntervalArray") assert_attr_equal("inclusive", left, right, obj=obj) -def assert_period_array_equal(left, right, obj="PeriodArray"): +def assert_period_array_equal(left, right, obj="PeriodArray") -> None: _check_isinstance(left, right, PeriodArray) assert_numpy_array_equal(left._data, right._data, obj=f"{obj}._data") assert_attr_equal("freq", left, right, obj=obj) -def assert_datetime_array_equal(left, right, obj="DatetimeArray", check_freq=True): +def assert_datetime_array_equal( + left, right, obj="DatetimeArray", check_freq=True +) -> None: __tracebackhide__ = True _check_isinstance(left, right, DatetimeArray) @@ -629,7 +628,9 @@ def assert_datetime_array_equal(left, right, obj="DatetimeArray", check_freq=Tru assert_attr_equal("tz", left, right, obj=obj) -def assert_timedelta_array_equal(left, right, obj="TimedeltaArray", check_freq=True): +def assert_timedelta_array_equal( + left, right, obj="TimedeltaArray", check_freq=True +) -> None: __tracebackhide__ = True _check_isinstance(left, right, TimedeltaArray) assert_numpy_array_equal(left._data, right._data, obj=f"{obj}._data") @@ -684,7 +685,7 @@ def assert_numpy_array_equal( check_same=None, obj="numpy array", index_values=None, -): +) -> None: """ Check that 'np.ndarray' is equivalent. @@ -764,7 +765,7 @@ def assert_extension_array_equal( check_exact=False, rtol: float = 1.0e-5, atol: float = 1.0e-8, -): +) -> None: """ Check that left and right ExtensionArrays are equal. @@ -866,7 +867,7 @@ def assert_series_equal( check_dtype: bool | Literal["equiv"] = True, check_index_type="equiv", check_series_type=True, - check_less_precise=no_default, + check_less_precise: bool | int | NoDefault = no_default, check_names=True, check_exact=False, check_datetimelike_compat=False, @@ -880,7 +881,7 @@ def assert_series_equal( *, check_index=True, check_like=False, -): +) -> None: """ Check that left and right Series are equal. @@ -1147,7 +1148,7 @@ def assert_frame_equal( rtol=1.0e-5, atol=1.0e-8, obj="DataFrame", -): +) -> None: """ Check that left and right DataFrame are equal. 
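The asserter hunks above pin the return type to None and drop the old `return True` paths, so these helpers communicate purely by raising. A short sketch of the intended usage:

import pandas as pd
import pandas._testing as tm

left = pd.Series([1, 2, 3])
right = pd.Series([1, 2, 3])

# Asserters raise AssertionError on mismatch and otherwise return None, so the
# result should not be treated as a truthy success flag.
assert tm.assert_series_equal(left, right) is None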
@@ -1354,7 +1355,7 @@ def assert_frame_equal( ) -def assert_equal(left, right, **kwargs): +def assert_equal(left, right, **kwargs) -> None: """ Wrapper for tm.assert_*_equal to dispatch to the appropriate test function. @@ -1395,7 +1396,7 @@ def assert_equal(left, right, **kwargs): assert_almost_equal(left, right) -def assert_sp_array_equal(left, right): +def assert_sp_array_equal(left, right) -> None: """ Check that the left and right SparseArray are equal. @@ -1428,12 +1429,12 @@ def assert_sp_array_equal(left, right): assert_numpy_array_equal(left.to_dense(), right.to_dense()) -def assert_contains_all(iterable, dic): +def assert_contains_all(iterable, dic) -> None: for k in iterable: assert k in dic, f"Did not contain item: {repr(k)}" -def assert_copy(iter1, iter2, **eql_kwargs): +def assert_copy(iter1, iter2, **eql_kwargs) -> None: """ iter1, iter2: iterables that produce elements comparable with assert_almost_equal @@ -1465,7 +1466,7 @@ def is_extension_array_dtype_and_needs_i8_conversion(left_dtype, right_dtype) -> return is_extension_array_dtype(left_dtype) and needs_i8_conversion(right_dtype) -def assert_indexing_slices_equivalent(ser: Series, l_slc: slice, i_slc: slice): +def assert_indexing_slices_equivalent(ser: Series, l_slc: slice, i_slc: slice) -> None: """ Check that ser.iloc[i_slc] matches ser.loc[l_slc] and, if applicable, ser[l_slc]. @@ -1479,7 +1480,7 @@ def assert_indexing_slices_equivalent(ser: Series, l_slc: slice, i_slc: slice): assert_series_equal(ser[l_slc], expected) -def assert_metadata_equivalent(left, right): +def assert_metadata_equivalent(left, right) -> None: """ Check that ._metadata attributes are equivalent. """ diff --git a/pandas/_testing/contexts.py b/pandas/_testing/contexts.py index 7df9afd68b432..e64adb06bea7a 100644 --- a/pandas/_testing/contexts.py +++ b/pandas/_testing/contexts.py @@ -8,6 +8,7 @@ from typing import ( IO, Any, + Iterator, ) import uuid @@ -19,7 +20,7 @@ @contextmanager -def decompress_file(path, compression): +def decompress_file(path, compression) -> Iterator[IO[bytes]]: """ Open a compressed file and return a file object. @@ -40,7 +41,7 @@ def decompress_file(path, compression): @contextmanager -def set_timezone(tz: str): +def set_timezone(tz: str) -> Iterator[None]: """ Context manager for temporarily setting a timezone. @@ -126,7 +127,7 @@ def ensure_clean(filename=None, return_filelike: bool = False, **kwargs: Any): @contextmanager -def ensure_clean_dir(): +def ensure_clean_dir() -> Iterator[str]: """ Get a temporary directory path and agrees to remove on close. @@ -145,7 +146,7 @@ def ensure_clean_dir(): @contextmanager -def ensure_safe_environment_variables(): +def ensure_safe_environment_variables() -> Iterator[None]: """ Get a context manager to safely set environment variables @@ -161,7 +162,7 @@ def ensure_safe_environment_variables(): @contextmanager -def with_csv_dialect(name, **kwargs): +def with_csv_dialect(name, **kwargs) -> Iterator[None]: """ Context manager to temporarily register a CSV dialect for parsing CSV. 
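Annotating the @contextmanager helpers above as returning Iterator[...] lets type checkers see what each `with` block yields. A rough sketch, again using the private pandas._testing namespace:

import pandas._testing as tm

with tm.ensure_clean_dir() as tmp_dir:    # tmp_dir is typed as str
    print(tmp_dir)

with tm.set_timezone("UTC"):              # yields None; the previous TZ is restored on exit
    ...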
@@ -195,7 +196,7 @@ def with_csv_dialect(name, **kwargs): @contextmanager -def use_numexpr(use, min_elements=None): +def use_numexpr(use, min_elements=None) -> Iterator[None]: from pandas.core.computation import expressions as expr if min_elements is None: @@ -231,11 +232,11 @@ class RNGContext: def __init__(self, seed) -> None: self.seed = seed - def __enter__(self): + def __enter__(self) -> None: self.start_state = np.random.get_state() np.random.seed(self.seed) - def __exit__(self, exc_type, exc_value, traceback): + def __exit__(self, exc_type, exc_value, traceback) -> None: np.random.set_state(self.start_state) diff --git a/pandas/_typing.py b/pandas/_typing.py index a85820a403fde..4bc5f75400455 100644 --- a/pandas/_typing.py +++ b/pandas/_typing.py @@ -109,7 +109,7 @@ Axis = Union[str, int] IndexLabel = Union[Hashable, Sequence[Hashable]] -Level = Union[Hashable, int] +Level = Hashable Shape = Tuple[int, ...] Suffixes = Tuple[Optional[str], Optional[str]] Ordered = Optional[bool] @@ -314,7 +314,7 @@ def closed(self) -> bool: # Interval closed type IntervalLeftRight = Literal["left", "right"] -IntervalClosedType = Union[IntervalLeftRight, Literal["both", "neither"]] +IntervalInclusiveType = Union[IntervalLeftRight, Literal["both", "neither"]] # datetime and NaTType DatetimeNaTType = Union[datetime, "NaTType"] @@ -323,3 +323,9 @@ def closed(self) -> bool: # sort_index SortKind = Literal["quicksort", "mergesort", "heapsort", "stable"] NaPosition = Literal["first", "last"] + +# quantile interpolation +QuantileInterpolation = Literal["linear", "lower", "higher", "midpoint", "nearest"] + +# plotting +PlottingOrientation = Literal["horizontal", "vertical"] diff --git a/pandas/api/__init__.py b/pandas/api/__init__.py index 67fd722c9198b..22a09ed61d694 100644 --- a/pandas/api/__init__.py +++ b/pandas/api/__init__.py @@ -1,7 +1,14 @@ """ public toolkit API """ -from pandas.api import ( # noqa:F401 +from pandas.api import ( exchange, extensions, indexers, types, ) + +__all__ = [ + "exchange", + "extensions", + "indexers", + "types", +] diff --git a/pandas/compat/__init__.py b/pandas/compat/__init__.py index bb4787f07b2f0..5db859897b663 100644 --- a/pandas/compat/__init__.py +++ b/pandas/compat/__init__.py @@ -7,9 +7,12 @@ Other items: * platform checker """ +from __future__ import annotations + import os import platform import sys +from typing import TYPE_CHECKING from pandas._typing import F from pandas.compat.numpy import ( @@ -24,8 +27,12 @@ pa_version_under5p0, pa_version_under6p0, pa_version_under7p0, + pa_version_under8p0, ) +if TYPE_CHECKING: + import lzma + PY39 = sys.version_info >= (3, 9) PY310 = sys.version_info >= (3, 10) PYPY = platform.python_implementation() == "PyPy" @@ -117,7 +124,7 @@ def is_ci_environment() -> bool: return os.environ.get("PANDAS_CI", "0") == "1" -def get_lzma_file(): +def get_lzma_file() -> type[lzma.LZMAFile]: """ Importing the `LZMAFile` class from the `lzma` module. 
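`get_lzma_file` returns the `lzma.LZMAFile` class itself (it raises if Python was built without lzma support), which the new `type[lzma.LZMAFile]` annotation makes explicit. A usage sketch with a placeholder file name:

from pandas.compat import get_lzma_file

LZMAFile = get_lzma_file()                  # the class, not an instance
with LZMAFile("example.xz", "rb") as fh:    # "example.xz" is a hypothetical path
    payload = fh.read()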
@@ -152,4 +159,5 @@ def get_lzma_file(): "pa_version_under5p0", "pa_version_under6p0", "pa_version_under7p0", + "pa_version_under8p0", ] diff --git a/pandas/compat/chainmap.py b/pandas/compat/chainmap.py index 9af7962fe4ad0..5bec8e5fa1913 100644 --- a/pandas/compat/chainmap.py +++ b/pandas/compat/chainmap.py @@ -1,3 +1,5 @@ +from __future__ import annotations + from typing import ( ChainMap, TypeVar, diff --git a/pandas/compat/numpy/function.py b/pandas/compat/numpy/function.py index e3aa5bb52f2ba..140d41782e6d3 100644 --- a/pandas/compat/numpy/function.py +++ b/pandas/compat/numpy/function.py @@ -17,7 +17,11 @@ """ from __future__ import annotations -from typing import Any +from typing import ( + Any, + TypeVar, + overload, +) from numpy import ndarray @@ -25,6 +29,7 @@ is_bool, is_integer, ) +from pandas._typing import Axis from pandas.errors import UnsupportedFunctionCall from pandas.util._validators import ( validate_args, @@ -32,6 +37,8 @@ validate_kwargs, ) +AxisNoneT = TypeVar("AxisNoneT", Axis, None) + class CompatValidator: def __init__( @@ -84,7 +91,7 @@ def __call__( ) -def process_skipna(skipna, args): +def process_skipna(skipna: bool | ndarray | None, args) -> tuple[bool, Any]: if isinstance(skipna, ndarray) or skipna is None: args = (skipna,) + args skipna = True @@ -92,7 +99,7 @@ def process_skipna(skipna, args): return skipna, args -def validate_argmin_with_skipna(skipna, args, kwargs): +def validate_argmin_with_skipna(skipna: bool | ndarray | None, args, kwargs) -> bool: """ If 'Series.argmin' is called via the 'numpy' library, the third parameter in its signature is 'out', which takes either an ndarray or 'None', so @@ -104,7 +111,7 @@ def validate_argmin_with_skipna(skipna, args, kwargs): return skipna -def validate_argmax_with_skipna(skipna, args, kwargs): +def validate_argmax_with_skipna(skipna: bool | ndarray | None, args, kwargs) -> bool: """ If 'Series.argmax' is called via the 'numpy' library, the third parameter in its signature is 'out', which takes either an ndarray or 'None', so @@ -137,7 +144,7 @@ def validate_argmax_with_skipna(skipna, args, kwargs): ) -def validate_argsort_with_ascending(ascending, args, kwargs): +def validate_argsort_with_ascending(ascending: bool | int | None, args, kwargs) -> bool: """ If 'Categorical.argsort' is called via the 'numpy' library, the first parameter in its signature is 'axis', which takes either an integer or @@ -149,7 +156,8 @@ def validate_argsort_with_ascending(ascending, args, kwargs): ascending = True validate_argsort_kind(args, kwargs, max_fname_arg_count=3) - return ascending + # error: Incompatible return value type (got "int", expected "bool") + return ascending # type: ignore[return-value] CLIP_DEFAULTS: dict[str, Any] = {"out": None} @@ -158,7 +166,19 @@ def validate_argsort_with_ascending(ascending, args, kwargs): ) -def validate_clip_with_axis(axis, args, kwargs): +@overload +def validate_clip_with_axis(axis: ndarray, args, kwargs) -> None: + ... + + +@overload +def validate_clip_with_axis(axis: AxisNoneT, args, kwargs) -> AxisNoneT: + ... 
+ + +def validate_clip_with_axis( + axis: ndarray | AxisNoneT, args, kwargs +) -> AxisNoneT | None: """ If 'NDFrame.clip' is called via the numpy library, the third parameter in its signature is 'out', which can takes an ndarray, so check if the 'axis' @@ -167,10 +187,14 @@ def validate_clip_with_axis(axis, args, kwargs): """ if isinstance(axis, ndarray): args = (axis,) + args - axis = None + # error: Incompatible types in assignment (expression has type "None", + # variable has type "Union[ndarray[Any, Any], str, int]") + axis = None # type: ignore[assignment] validate_clip(args, kwargs) - return axis + # error: Incompatible return value type (got "Union[ndarray[Any, Any], + # str, int]", expected "Union[str, int, None]") + return axis # type: ignore[return-value] CUM_FUNC_DEFAULTS: dict[str, Any] = {} @@ -184,7 +208,7 @@ def validate_clip_with_axis(axis, args, kwargs): ) -def validate_cum_func_with_skipna(skipna, args, kwargs, name): +def validate_cum_func_with_skipna(skipna, args, kwargs, name) -> bool: """ If this function is called via the 'numpy' library, the third parameter in its signature is 'dtype', which takes either a 'numpy' dtype or 'None', so @@ -288,7 +312,7 @@ def validate_cum_func_with_skipna(skipna, args, kwargs, name): validate_take = CompatValidator(TAKE_DEFAULTS, fname="take", method="kwargs") -def validate_take_with_convert(convert, args, kwargs): +def validate_take_with_convert(convert: ndarray | bool | None, args, kwargs) -> bool: """ If this function is called via the 'numpy' library, the third parameter in its signature is 'axis', which takes either an ndarray or 'None', so check diff --git a/pandas/compat/pickle_compat.py b/pandas/compat/pickle_compat.py index 2333324a7e22d..c8db82500d0d6 100644 --- a/pandas/compat/pickle_compat.py +++ b/pandas/compat/pickle_compat.py @@ -7,7 +7,10 @@ import copy import io import pickle as pkl -from typing import TYPE_CHECKING +from typing import ( + TYPE_CHECKING, + Iterator, +) import warnings import numpy as np @@ -291,7 +294,7 @@ def loads( @contextlib.contextmanager -def patch_pickle(): +def patch_pickle() -> Iterator[None]: """ Temporarily patch pickle to use our unpickler. """ diff --git a/pandas/compat/pyarrow.py b/pandas/compat/pyarrow.py index eef2bb6639c36..833cda20368a2 100644 --- a/pandas/compat/pyarrow.py +++ b/pandas/compat/pyarrow.py @@ -1,5 +1,7 @@ """ support pyarrow compatibility across versions """ +from __future__ import annotations + from pandas.util.version import Version try: diff --git a/pandas/conftest.py b/pandas/conftest.py index dfe8c5f1778d3..e176707d8a8f1 100644 --- a/pandas/conftest.py +++ b/pandas/conftest.py @@ -29,6 +29,10 @@ from decimal import Decimal import operator import os +from typing import ( + Callable, + Literal, +) from dateutil.tz import ( tzlocal, @@ -90,7 +94,7 @@ # pytest -def pytest_addoption(parser): +def pytest_addoption(parser) -> None: parser.addoption("--skip-slow", action="store_true", help="skip slow tests") parser.addoption("--skip-network", action="store_true", help="skip network tests") parser.addoption("--skip-db", action="store_true", help="skip db tests") @@ -231,7 +235,7 @@ def pytest_collection_modifyitems(items, config): @pytest.fixture -def add_doctest_imports(doctest_namespace): +def add_doctest_imports(doctest_namespace) -> None: """ Make `np` and `pd` names available for doctests. 
""" @@ -243,7 +247,7 @@ def add_doctest_imports(doctest_namespace): # Autouse fixtures # ---------------------------------------------------------------- @pytest.fixture(autouse=True) -def configure_tests(): +def configure_tests() -> None: """ Configure settings for all tests and test modules. """ @@ -530,7 +534,7 @@ def multiindex_year_month_day_dataframe_random_data(): @pytest.fixture -def lexsorted_two_level_string_multiindex(): +def lexsorted_two_level_string_multiindex() -> MultiIndex: """ 2-level MultiIndex, lexsorted, with string names. """ @@ -542,7 +546,9 @@ def lexsorted_two_level_string_multiindex(): @pytest.fixture -def multiindex_dataframe_random_data(lexsorted_two_level_string_multiindex): +def multiindex_dataframe_random_data( + lexsorted_two_level_string_multiindex, +) -> DataFrame: """DataFrame with 2 level MultiIndex with random data""" index = lexsorted_two_level_string_multiindex return DataFrame( @@ -715,7 +721,7 @@ def index_with_missing(request): # Series' # ---------------------------------------------------------------- @pytest.fixture -def string_series(): +def string_series() -> Series: """ Fixture for Series of floats with Index of unique strings """ @@ -725,7 +731,7 @@ def string_series(): @pytest.fixture -def object_series(): +def object_series() -> Series: """ Fixture for Series of dtype object with Index of unique strings """ @@ -735,7 +741,7 @@ def object_series(): @pytest.fixture -def datetime_series(): +def datetime_series() -> Series: """ Fixture for Series of floats with DatetimeIndex """ @@ -758,7 +764,7 @@ def _create_series(index): @pytest.fixture -def series_with_simple_index(index): +def series_with_simple_index(index) -> Series: """ Fixture for tests on series with changing types of indices. """ @@ -766,7 +772,7 @@ def series_with_simple_index(index): @pytest.fixture -def series_with_multilevel_index(): +def series_with_multilevel_index() -> Series: """ Fixture with a Series with a 2-level MultiIndex. """ @@ -804,7 +810,7 @@ def index_or_series_obj(request): # DataFrames # ---------------------------------------------------------------- @pytest.fixture -def int_frame(): +def int_frame() -> DataFrame: """ Fixture for DataFrame of ints with index of unique strings @@ -833,7 +839,7 @@ def int_frame(): @pytest.fixture -def datetime_frame(): +def datetime_frame() -> DataFrame: """ Fixture for DataFrame of floats with DatetimeIndex @@ -862,7 +868,7 @@ def datetime_frame(): @pytest.fixture -def float_frame(): +def float_frame() -> DataFrame: """ Fixture for DataFrame of floats with index of unique strings @@ -891,7 +897,7 @@ def float_frame(): @pytest.fixture -def mixed_type_frame(): +def mixed_type_frame() -> DataFrame: """ Fixture for DataFrame of float/int/string columns with RangeIndex Columns are ['a', 'b', 'c', 'float32', 'int32']. @@ -909,7 +915,7 @@ def mixed_type_frame(): @pytest.fixture -def rand_series_with_duplicate_datetimeindex(): +def rand_series_with_duplicate_datetimeindex() -> Series: """ Fixture for Series with a DatetimeIndex that has duplicates. """ @@ -1151,7 +1157,7 @@ def strict_data_files(pytestconfig): @pytest.fixture -def datapath(strict_data_files): +def datapath(strict_data_files: str) -> Callable[..., str]: """ Get the path to a data file. @@ -1186,7 +1192,7 @@ def deco(*args): @pytest.fixture -def iris(datapath): +def iris(datapath) -> DataFrame: """ The iris dataset as a DataFrame. 
""" @@ -1376,7 +1382,7 @@ def timedelta64_dtype(request): @pytest.fixture -def fixed_now_ts(): +def fixed_now_ts() -> Timestamp: """ Fixture emits fixed Timestamp.now() """ @@ -1728,7 +1734,7 @@ def spmatrix(request): params=[ getattr(pd.offsets, o) for o in pd.offsets.__all__ - if issubclass(getattr(pd.offsets, o), pd.offsets.Tick) + if issubclass(getattr(pd.offsets, o), pd.offsets.Tick) and o != "Tick" ] ) def tick_classes(request): @@ -1845,7 +1851,7 @@ def using_array_manager(): @pytest.fixture -def using_copy_on_write(): +def using_copy_on_write() -> Literal[False]: """ Fixture to check if Copy-on-Write is enabled. """ diff --git a/pandas/core/_numba/kernels/shared.py b/pandas/core/_numba/kernels/shared.py index ec25e78a8d897..6e6bcef590d06 100644 --- a/pandas/core/_numba/kernels/shared.py +++ b/pandas/core/_numba/kernels/shared.py @@ -1,3 +1,5 @@ +from __future__ import annotations + import numba import numpy as np diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py index 888e943488953..159c0bb2e72c0 100644 --- a/pandas/core/algorithms.py +++ b/pandas/core/algorithms.py @@ -4,6 +4,7 @@ """ from __future__ import annotations +import inspect import operator from textwrap import dedent from typing import ( @@ -14,7 +15,7 @@ cast, final, ) -from warnings import warn +import warnings import numpy as np @@ -57,6 +58,7 @@ is_numeric_dtype, is_object_dtype, is_scalar, + is_signed_integer_dtype, is_timedelta64_dtype, needs_i8_conversion, ) @@ -446,7 +448,12 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> npt.NDArray[np.bool_]: ) if not isinstance(values, (ABCIndex, ABCSeries, ABCExtensionArray, np.ndarray)): - values = _ensure_arraylike(list(values)) + if not is_signed_integer_dtype(comps): + # GH#46485 Use object to avoid upcast to float64 later + # TODO: Share with _find_common_type_compat + values = construct_1d_object_array_from_listlike(list(values)) + else: + values = _ensure_arraylike(list(values)) elif isinstance(values, ABCMultiIndex): # Avoid raising in extract_array values = np.array(values) @@ -580,7 +587,8 @@ def factorize_array( def factorize( values, sort: bool = False, - na_sentinel: int | None = -1, + na_sentinel: int | None | lib.NoDefault = lib.no_default, + use_na_sentinel: bool | lib.NoDefault = lib.no_default, size_hint: int | None = None, ) -> tuple[np.ndarray, np.ndarray | Index]: """ @@ -598,7 +606,19 @@ def factorize( Value to mark "not found". If None, will not drop the NaN from the uniques of the values. + .. deprecated:: 1.5.0 + The na_sentinel argument is deprecated and + will be removed in a future version of pandas. Specify use_na_sentinel as + either True or False. + .. versionchanged:: 1.1.2 + + use_na_sentinel : bool, default True + If True, the sentinel -1 will be used for NaN values. If False, + NaN values will be encoded as non-negative integers and will not drop the + NaN from the uniques of the values. + + .. versionadded:: 1.5.0 {size_hint}\ Returns @@ -646,8 +666,8 @@ def factorize( >>> uniques array(['a', 'b', 'c'], dtype=object) - Missing values are indicated in `codes` with `na_sentinel` - (``-1`` by default). Note that missing values are never + When ``use_na_sentinel=True`` (the default), missing values are indicated in + the `codes` with the sentinel value ``-1`` and missing values are not included in `uniques`. 
>>> codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b']) @@ -682,16 +702,16 @@ def factorize( Index(['a', 'c'], dtype='object') If NaN is in the values, and we want to include NaN in the uniques of the - values, it can be achieved by setting ``na_sentinel=None``. + values, it can be achieved by setting ``use_na_sentinel=False``. >>> values = np.array([1, 2, 1, np.nan]) - >>> codes, uniques = pd.factorize(values) # default: na_sentinel=-1 + >>> codes, uniques = pd.factorize(values) # default: use_na_sentinel=True >>> codes array([ 0, 1, 0, -1]) >>> uniques array([1., 2.]) - >>> codes, uniques = pd.factorize(values, na_sentinel=None) + >>> codes, uniques = pd.factorize(values, use_na_sentinel=False) >>> codes array([0, 1, 0, 2]) >>> uniques @@ -706,6 +726,7 @@ def factorize( # responsible only for factorization. All data coercion, sorting and boxing # should happen here. + na_sentinel = resolve_na_sentinel(na_sentinel, use_na_sentinel) if isinstance(values, ABCRangeIndex): return values.factorize(sort=sort) @@ -730,9 +751,22 @@ def factorize( codes, uniques = values.factorize(sort=sort) return _re_wrap_factorize(original, uniques, codes) - if not isinstance(values.dtype, np.dtype): - # i.e. ExtensionDtype - codes, uniques = values.factorize(na_sentinel=na_sentinel) + elif not isinstance(values.dtype, np.dtype): + if ( + na_sentinel == -1 + and "use_na_sentinel" in inspect.signature(values.factorize).parameters + ): + # Avoid using catch_warnings when possible + # GH#46910 - TimelikeOps has deprecated signature + codes, uniques = values.factorize( # type: ignore[call-arg] + use_na_sentinel=True + ) + else: + with warnings.catch_warnings(): + # We've already warned above + warnings.filterwarnings("ignore", ".*use_na_sentinel.*", FutureWarning) + codes, uniques = values.factorize(na_sentinel=na_sentinel) + else: values = np.asarray(values) # convert DTA/TDA/MultiIndex codes, uniques = factorize_array( @@ -757,6 +791,56 @@ def factorize( return _re_wrap_factorize(original, uniques, codes) +def resolve_na_sentinel( + na_sentinel: int | None | lib.NoDefault, + use_na_sentinel: bool | lib.NoDefault, +) -> int | None: + """ + Determine value of na_sentinel for factorize methods. + + See GH#46910 for details on the deprecation. + + Parameters + ---------- + na_sentinel : int, None, or lib.no_default + Value passed to the method. + use_na_sentinel : bool or lib.no_default + Value passed to the method. + + Returns + ------- + Resolved value of na_sentinel. + """ + if na_sentinel is not lib.no_default and use_na_sentinel is not lib.no_default: + raise ValueError( + "Cannot specify both `na_sentinel` and `use_na_sentinel`; " + f"got `na_sentinel={na_sentinel}` and `use_na_sentinel={use_na_sentinel}`" + ) + if na_sentinel is lib.no_default: + result = -1 if use_na_sentinel is lib.no_default or use_na_sentinel else None + else: + if na_sentinel is None: + msg = ( + "Specifying `na_sentinel=None` is deprecated, specify " + "`use_na_sentinel=False` instead." + ) + elif na_sentinel == -1: + msg = ( + "Specifying `na_sentinel=-1` is deprecated, specify " + "`use_na_sentinel=True` instead." + ) + else: + msg = ( + "Specifying the specific value to use for `na_sentinel` is " + "deprecated and will be removed in a future version of pandas. " + "Specify `use_na_sentinel=True` to use the sentinel value -1, and " + "`use_na_sentinel=False` to encode NaN values."
+ ) + warnings.warn(msg, FutureWarning, stacklevel=find_stack_level()) + result = na_sentinel + return result + + def _re_wrap_factorize(original, uniques, codes: np.ndarray): """ Wrap factorize results in Series or Index depending on original type. @@ -864,7 +948,7 @@ def value_counts( # Called once from SparseArray, otherwise could be private def value_counts_arraylike( values: np.ndarray, dropna: bool, mask: npt.NDArray[np.bool_] | None = None -): +) -> tuple[ArrayLike, npt.NDArray[np.int64]]: """ Parameters ---------- @@ -950,7 +1034,7 @@ def mode( try: npresult = np.sort(npresult) except TypeError as err: - warn(f"Unable to sort modes: {err}") + warnings.warn(f"Unable to sort modes: {err}") result = _reconstruct_data(npresult, original.dtype, original) return result @@ -1017,10 +1101,10 @@ def rank( def checked_add_with_arr( arr: npt.NDArray[np.int64], - b, + b: int | npt.NDArray[np.int64], arr_mask: npt.NDArray[np.bool_] | None = None, b_mask: npt.NDArray[np.bool_] | None = None, -) -> np.ndarray: +) -> npt.NDArray[np.int64]: """ Perform array addition that checks for underflow and overflow. @@ -1064,11 +1148,12 @@ def checked_add_with_arr( elif arr_mask is not None: not_nan = np.logical_not(arr_mask) elif b_mask is not None: - # Argument 1 to "__call__" of "_UFunc_Nin1_Nout1" has incompatible type - # "Optional[ndarray[Any, dtype[bool_]]]"; expected - # "Union[_SupportsArray[dtype[Any]], _NestedSequence[_SupportsArray[dtype[An - # y]]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, - # int, float, complex, str, bytes]]]" [arg-type] + # error: Argument 1 to "__call__" of "_UFunc_Nin1_Nout1" has + # incompatible type "Optional[ndarray[Any, dtype[bool_]]]"; + # expected "Union[_SupportsArray[dtype[Any]], _NestedSequence + # [_SupportsArray[dtype[Any]]], bool, int, float, complex, str + # , bytes, _NestedSequence[Union[bool, int, float, complex, str + # , bytes]]]" not_nan = np.logical_not(b2_mask) # type: ignore[arg-type] else: not_nan = np.empty(arr.shape, dtype=bool) @@ -1098,7 +1183,12 @@ def checked_add_with_arr( if to_raise: raise OverflowError("Overflow in int64 addition") - return arr + b + + result = arr + b + if arr_mask is not None or b2_mask is not None: + np.putmask(result, ~not_nan, iNaT) + + return result # --------------- # @@ -1564,7 +1654,7 @@ def diff(arr, n: int, axis: int = 0): raise ValueError(f"cannot diff {type(arr).__name__} on axis={axis}") return op(arr, arr.shift(n)) else: - warn( + warnings.warn( "dtype lost in 'diff()'. In the future this will raise a " "TypeError. 
Convert to a suitable dtype prior to calling 'diff'.", FutureWarning, @@ -1771,9 +1861,12 @@ def safe_sort( def _sort_mixed(values) -> np.ndarray: """order ints before strings in 1d arrays, safe in py3""" str_pos = np.array([isinstance(x, str) for x in values], dtype=bool) - nums = np.sort(values[~str_pos]) + none_pos = np.array([x is None for x in values], dtype=bool) + nums = np.sort(values[~str_pos & ~none_pos]) strs = np.sort(values[str_pos]) - return np.concatenate([nums, np.asarray(strs, dtype=object)]) + return np.concatenate( + [nums, np.asarray(strs, dtype=object), np.array(values[none_pos])] + ) def _sort_tuples(values: np.ndarray) -> np.ndarray: diff --git a/pandas/core/api.py b/pandas/core/api.py index cf082d2013d3b..c2bedb032d479 100644 --- a/pandas/core/api.py +++ b/pandas/core/api.py @@ -1,5 +1,3 @@ -# flake8: noqa:F401 - from pandas._libs import ( NaT, Period, @@ -84,3 +82,65 @@ # DataFrame needs to be imported after NamedAgg to avoid a circular import from pandas.core.frame import DataFrame # isort:skip + +__all__ = [ + "array", + "bdate_range", + "BooleanDtype", + "Categorical", + "CategoricalDtype", + "CategoricalIndex", + "DataFrame", + "DateOffset", + "date_range", + "DatetimeIndex", + "DatetimeTZDtype", + "factorize", + "Flags", + "Float32Dtype", + "Float64Dtype", + "Float64Index", + "Grouper", + "Index", + "IndexSlice", + "Int16Dtype", + "Int32Dtype", + "Int64Dtype", + "Int64Index", + "Int8Dtype", + "Interval", + "IntervalDtype", + "IntervalIndex", + "interval_range", + "isna", + "isnull", + "MultiIndex", + "NA", + "NamedAgg", + "NaT", + "notna", + "notnull", + "NumericIndex", + "Period", + "PeriodDtype", + "PeriodIndex", + "period_range", + "RangeIndex", + "Series", + "set_eng_float_format", + "StringDtype", + "Timedelta", + "TimedeltaIndex", + "timedelta_range", + "Timestamp", + "to_datetime", + "to_numeric", + "to_timedelta", + "UInt16Dtype", + "UInt32Dtype", + "UInt64Dtype", + "UInt64Index", + "UInt8Dtype", + "unique", + "value_counts", +] diff --git a/pandas/core/apply.py b/pandas/core/apply.py index c0200c7d7c5b7..18a0f9b7aa2ce 100644 --- a/pandas/core/apply.py +++ b/pandas/core/apply.py @@ -790,7 +790,10 @@ def apply_empty_result(self): if not should_reduce: try: - r = self.f(Series([], dtype=np.float64)) + if self.axis == 0: + r = self.f(Series([], dtype=np.float64)) + else: + r = self.f(Series(index=self.columns, dtype=np.float64)) except Exception: pass else: diff --git a/pandas/core/array_algos/quantile.py b/pandas/core/array_algos/quantile.py index 78e12fb3995fd..217fbafce719c 100644 --- a/pandas/core/array_algos/quantile.py +++ b/pandas/core/array_algos/quantile.py @@ -143,9 +143,9 @@ def _nanpercentile_1d( return np.percentile( values, qs, - # error: No overload variant of "percentile" matches argument types - # "ndarray[Any, Any]", "ndarray[Any, dtype[floating[_64Bit]]]", - # "int", "Dict[str, str]" + # error: No overload variant of "percentile" matches argument + # types "ndarray[Any, Any]", "ndarray[Any, dtype[floating[_64Bit]]]" + # , "Dict[str, str]" [call-overload] **{np_percentile_argname: interpolation}, # type: ignore[call-overload] ) @@ -217,6 +217,6 @@ def _nanpercentile( axis=1, # error: No overload variant of "percentile" matches argument types # "ndarray[Any, Any]", "ndarray[Any, dtype[floating[_64Bit]]]", - # "int", "Dict[str, str]" + # "int", "Dict[str, str]" [call-overload] **{np_percentile_argname: interpolation}, # type: ignore[call-overload] ) diff --git a/pandas/core/array_algos/replace.py b/pandas/core/array_algos/replace.py index 
19a44dbfe6f6d..466eeb768f5f9 100644 --- a/pandas/core/array_algos/replace.py +++ b/pandas/core/array_algos/replace.py @@ -119,7 +119,7 @@ def _check_comparison_types( def replace_regex( values: ArrayLike, rx: re.Pattern, value, mask: npt.NDArray[np.bool_] | None -): +) -> None: """ Parameters ---------- diff --git a/pandas/core/array_algos/transforms.py b/pandas/core/array_algos/transforms.py index 27aebb9911e83..93b029c21760e 100644 --- a/pandas/core/array_algos/transforms.py +++ b/pandas/core/array_algos/transforms.py @@ -2,6 +2,8 @@ transforms.py is for shape-preserving functions. """ +from __future__ import annotations + import numpy as np diff --git a/pandas/core/arraylike.py b/pandas/core/arraylike.py index b6e9bf1420b21..280a599de84ed 100644 --- a/pandas/core/arraylike.py +++ b/pandas/core/arraylike.py @@ -4,6 +4,8 @@ Index ExtensionArray """ +from __future__ import annotations + import operator from typing import Any import warnings @@ -265,7 +267,10 @@ def array_ufunc(self, ufunc: np.ufunc, method: str, *inputs: Any, **kwargs: Any) return result # Determine if we should defer. - no_defer = (np.ndarray.__array_ufunc__, cls.__array_ufunc__) + no_defer = ( + np.ndarray.__array_ufunc__, + cls.__array_ufunc__, + ) for item in inputs: higher_priority = ( diff --git a/pandas/core/arrays/_mixins.py b/pandas/core/arrays/_mixins.py index b15e0624963ea..f17d343024915 100644 --- a/pandas/core/arrays/_mixins.py +++ b/pandas/core/arrays/_mixins.py @@ -70,6 +70,8 @@ NumpyValueArrayLike, ) + from pandas import Series + def ravel_compat(meth: F) -> F: """ @@ -259,7 +261,7 @@ def _validate_shift_value(self, fill_value): # we can remove this and use validate_fill_value directly return self._validate_scalar(fill_value) - def __setitem__(self, key, value): + def __setitem__(self, key, value) -> None: key = check_array_indexer(self, key) value = self._validate_setitem_value(value) self._ndarray[key] = value @@ -433,7 +435,7 @@ def insert( # These are not part of the EA API, but we implement them because # pandas assumes they're there. - def value_counts(self, dropna: bool = True): + def value_counts(self, dropna: bool = True) -> Series: """ Return a Series containing counts of unique values. 
diff --git a/pandas/core/arrays/arrow/__init__.py b/pandas/core/arrays/arrow/__init__.py index 6bdf29e38ac62..58b268cbdd221 100644 --- a/pandas/core/arrays/arrow/__init__.py +++ b/pandas/core/arrays/arrow/__init__.py @@ -1,3 +1,3 @@ -# flake8: noqa: F401 - from pandas.core.arrays.arrow.array import ArrowExtensionArray + +__all__ = ["ArrowExtensionArray"] diff --git a/pandas/core/arrays/arrow/_arrow_utils.py b/pandas/core/arrays/arrow/_arrow_utils.py index e0f242e2ced5d..c9666de9f892d 100644 --- a/pandas/core/arrays/arrow/_arrow_utils.py +++ b/pandas/core/arrays/arrow/_arrow_utils.py @@ -6,15 +6,15 @@ import numpy as np import pyarrow -from pandas._libs import lib -from pandas._libs.interval import _warning_interval +from pandas._typing import IntervalInclusiveType from pandas.errors import PerformanceWarning +from pandas.util._decorators import deprecate_kwarg from pandas.util._exceptions import find_stack_level -from pandas.core.arrays.interval import VALID_CLOSED +from pandas.core.arrays.interval import VALID_INCLUSIVE -def fallback_performancewarning(version: str | None = None): +def fallback_performancewarning(version: str | None = None) -> None: """ Raise a PerformanceWarning for falling back to ExtensionArray's non-pyarrow method @@ -25,7 +25,9 @@ def fallback_performancewarning(version: str | None = None): warnings.warn(msg, PerformanceWarning, stacklevel=find_stack_level()) -def pyarrow_array_to_numpy_and_mask(arr, dtype: np.dtype): +def pyarrow_array_to_numpy_and_mask( + arr, dtype: np.dtype +) -> tuple[np.ndarray, np.ndarray]: """ Convert a primitive pyarrow.Array to a numpy array and boolean mask based on the buffers of the Array. @@ -75,12 +77,12 @@ def __init__(self, freq) -> None: def freq(self): return self._freq - def __arrow_ext_serialize__(self): + def __arrow_ext_serialize__(self) -> bytes: metadata = {"freq": self.freq} return json.dumps(metadata).encode() @classmethod - def __arrow_ext_deserialize__(cls, storage_type, serialized): + def __arrow_ext_deserialize__(cls, storage_type, serialized) -> ArrowPeriodType: metadata = json.loads(serialized.decode()) return ArrowPeriodType(metadata["freq"]) @@ -90,7 +92,7 @@ def __eq__(self, other): else: return NotImplemented - def __hash__(self): + def __hash__(self) -> int: return hash((str(self), self.freq)) def to_pandas_dtype(self): @@ -105,17 +107,12 @@ def to_pandas_dtype(self): class ArrowIntervalType(pyarrow.ExtensionType): - def __init__( - self, - subtype, - inclusive: str | None = None, - closed: None | lib.NoDefault = lib.no_default, - ) -> None: + @deprecate_kwarg(old_arg_name="closed", new_arg_name="inclusive") + def __init__(self, subtype, inclusive: IntervalInclusiveType) -> None: # attributes need to be set first before calling # super init (as that calls serialize) - inclusive, closed = _warning_interval(inclusive, closed) - assert inclusive in VALID_CLOSED - self._closed = inclusive + assert inclusive in VALID_INCLUSIVE + self._inclusive: IntervalInclusiveType = inclusive if not isinstance(subtype, pyarrow.DataType): subtype = pyarrow.type_for_alias(str(subtype)) self._subtype = subtype @@ -128,15 +125,24 @@ def subtype(self): return self._subtype @property - def inclusive(self): - return self._closed + def inclusive(self) -> IntervalInclusiveType: + return self._inclusive - def __arrow_ext_serialize__(self): + @property + def closed(self) -> IntervalInclusiveType: + warnings.warn( + "Attribute `closed` is deprecated in favor of `inclusive`.", + FutureWarning, + stacklevel=find_stack_level(), + ) + return 
self._inclusive + + def __arrow_ext_serialize__(self) -> bytes: metadata = {"subtype": str(self.subtype), "inclusive": self.inclusive} return json.dumps(metadata).encode() @classmethod - def __arrow_ext_deserialize__(cls, storage_type, serialized): + def __arrow_ext_deserialize__(cls, storage_type, serialized) -> ArrowIntervalType: metadata = json.loads(serialized.decode()) subtype = pyarrow.type_for_alias(metadata["subtype"]) inclusive = metadata["inclusive"] @@ -152,7 +158,7 @@ def __eq__(self, other): else: return NotImplemented - def __hash__(self): + def __hash__(self) -> int: return hash((str(self), str(self.subtype), self.inclusive)) def to_pandas_dtype(self): diff --git a/pandas/core/arrays/arrow/array.py b/pandas/core/arrays/arrow/array.py index 1f35013075751..b0e4d46564ba4 100644 --- a/pandas/core/arrays/arrow/array.py +++ b/pandas/core/arrays/arrow/array.py @@ -8,6 +8,7 @@ import numpy as np +from pandas._libs import lib from pandas._typing import ( Dtype, PositionalIndexer, @@ -17,6 +18,8 @@ from pandas.compat import ( pa_version_under1p01, pa_version_under2p0, + pa_version_under3p0, + pa_version_under4p0, pa_version_under5p0, pa_version_under6p0, ) @@ -31,6 +34,8 @@ ) from pandas.core.dtypes.missing import isna +from pandas.core.algorithms import resolve_na_sentinel +from pandas.core.arraylike import OpsMixin from pandas.core.arrays.base import ExtensionArray from pandas.core.indexers import ( check_array_indexer, @@ -45,13 +50,110 @@ from pandas.core.arrays.arrow._arrow_utils import fallback_performancewarning from pandas.core.arrays.arrow.dtype import ArrowDtype + ARROW_CMP_FUNCS = { + "eq": pc.equal, + "ne": pc.not_equal, + "lt": pc.less, + "gt": pc.greater, + "le": pc.less_equal, + "ge": pc.greater_equal, + } + + ARROW_LOGICAL_FUNCS = { + "and": NotImplemented if pa_version_under2p0 else pc.and_kleene, + "rand": NotImplemented + if pa_version_under2p0 + else lambda x, y: pc.and_kleene(y, x), + "or": NotImplemented if pa_version_under2p0 else pc.or_kleene, + "ror": NotImplemented + if pa_version_under2p0 + else lambda x, y: pc.or_kleene(y, x), + "xor": NotImplemented if pa_version_under2p0 else pc.xor, + "rxor": NotImplemented if pa_version_under2p0 else lambda x, y: pc.xor(y, x), + } + + def cast_for_truediv( + arrow_array: pa.ChunkedArray, pa_object: pa.Array | pa.Scalar + ) -> pa.ChunkedArray: + # Ensure int / int -> float mirroring Python/Numpy behavior + # as pc.divide_checked(int, int) -> int + if pa.types.is_integer(arrow_array.type) and pa.types.is_integer( + pa_object.type + ): + return arrow_array.cast(pa.float64()) + return arrow_array + + def floordiv_compat( + left: pa.ChunkedArray | pa.Array | pa.Scalar, + right: pa.ChunkedArray | pa.Array | pa.Scalar, + ) -> pa.ChunkedArray: + # Ensure int // int -> int mirroring Python/Numpy behavior + # as pc.floor(pc.divide_checked(int, int)) -> float + result = pc.floor(pc.divide_checked(left, right)) + if pa.types.is_integer(left.type) and pa.types.is_integer(right.type): + result = result.cast(left.type) + return result + + ARROW_ARITHMETIC_FUNCS = { + "add": NotImplemented if pa_version_under2p0 else pc.add_checked, + "radd": NotImplemented + if pa_version_under2p0 + else lambda x, y: pc.add_checked(y, x), + "sub": NotImplemented if pa_version_under2p0 else pc.subtract_checked, + "rsub": NotImplemented + if pa_version_under2p0 + else lambda x, y: pc.subtract_checked(y, x), + "mul": NotImplemented if pa_version_under2p0 else pc.multiply_checked, + "rmul": NotImplemented + if pa_version_under2p0 + else lambda x, y: 
pc.multiply_checked(y, x), + "truediv": NotImplemented + if pa_version_under2p0 + else lambda x, y: pc.divide_checked(cast_for_truediv(x, y), y), + "rtruediv": NotImplemented + if pa_version_under2p0 + else lambda x, y: pc.divide_checked(y, cast_for_truediv(x, y)), + "floordiv": NotImplemented + if pa_version_under2p0 + else lambda x, y: floordiv_compat(x, y), + "rfloordiv": NotImplemented + if pa_version_under2p0 + else lambda x, y: floordiv_compat(y, x), + "mod": NotImplemented, + "rmod": NotImplemented, + "divmod": NotImplemented, + "rdivmod": NotImplemented, + "pow": NotImplemented if pa_version_under4p0 else pc.power_checked, + "rpow": NotImplemented + if pa_version_under4p0 + else lambda x, y: pc.power_checked(y, x), + } + if TYPE_CHECKING: from pandas import Series ArrowExtensionArrayT = TypeVar("ArrowExtensionArrayT", bound="ArrowExtensionArray") -class ArrowExtensionArray(ExtensionArray): +def to_pyarrow_type( + dtype: ArrowDtype | pa.DataType | Dtype | None, +) -> pa.DataType | None: + """ + Convert dtype to a pyarrow type instance. + """ + if isinstance(dtype, ArrowDtype): + pa_dtype = dtype.pyarrow_dtype + elif isinstance(dtype, pa.DataType): + pa_dtype = dtype + elif dtype: + # Accepts python types too + pa_dtype = pa.from_numpy_dtype(dtype) + else: + pa_dtype = None + return pa_dtype + + +class ArrowExtensionArray(OpsMixin, ExtensionArray): """ Base class for ExtensionArray backed by Arrow ChunkedArray. """ @@ -77,13 +179,7 @@ def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy=False): """ Construct a new ExtensionArray from a sequence of scalars. """ - if isinstance(dtype, ArrowDtype): - pa_dtype = dtype.pyarrow_dtype - elif dtype: - pa_dtype = pa.from_numpy_dtype(dtype) - else: - pa_dtype = None - + pa_dtype = to_pyarrow_type(dtype) if isinstance(scalars, cls): data = scalars._data if pa_dtype: @@ -101,7 +197,40 @@ def _from_sequence_of_strings( """ Construct a new ExtensionArray from a sequence of strings. """ - return cls._from_sequence(strings, dtype=dtype, copy=copy) + pa_type = to_pyarrow_type(dtype) + if pa.types.is_timestamp(pa_type): + from pandas.core.tools.datetimes import to_datetime + + scalars = to_datetime(strings, errors="raise") + elif pa.types.is_date(pa_type): + from pandas.core.tools.datetimes import to_datetime + + scalars = to_datetime(strings, errors="raise").date + elif pa.types.is_duration(pa_type): + from pandas.core.tools.timedeltas import to_timedelta + + scalars = to_timedelta(strings, errors="raise") + elif pa.types.is_time(pa_type): + from pandas.core.tools.times import to_time + + # "coerce" to allow "null times" (None) to not raise + scalars = to_time(strings, errors="coerce") + elif pa.types.is_boolean(pa_type): + from pandas.core.arrays import BooleanArray + + scalars = BooleanArray._from_sequence_of_strings(strings).to_numpy() + elif ( + pa.types.is_integer(pa_type) + or pa.types.is_floating(pa_type) + or pa.types.is_decimal(pa_type) + ): + from pandas.core.tools.numeric import to_numeric + + scalars = to_numeric(strings, errors="raise") + else: + # Let pyarrow try to infer or raise + scalars = strings + return cls._from_sequence(scalars, dtype=pa_type, copy=copy) def __getitem__(self, item: PositionalIndexer): """Select a subset of self. 
@@ -179,6 +308,70 @@ def __arrow_array__(self, type=None): """Convert myself to a pyarrow ChunkedArray.""" return self._data + def __invert__(self: ArrowExtensionArrayT) -> ArrowExtensionArrayT: + if pa_version_under2p0: + raise NotImplementedError("__invert__ not implemented for pyarrow < 2.0") + return type(self)(pc.invert(self._data)) + + def __neg__(self: ArrowExtensionArrayT) -> ArrowExtensionArrayT: + return type(self)(pc.negate_checked(self._data)) + + def __pos__(self: ArrowExtensionArrayT) -> ArrowExtensionArrayT: + return type(self)(self._data) + + def __abs__(self: ArrowExtensionArrayT) -> ArrowExtensionArrayT: + return type(self)(pc.abs_checked(self._data)) + + def _cmp_method(self, other, op): + from pandas.arrays import BooleanArray + + pc_func = ARROW_CMP_FUNCS[op.__name__] + if isinstance(other, ArrowExtensionArray): + result = pc_func(self._data, other._data) + elif isinstance(other, (np.ndarray, list)): + result = pc_func(self._data, other) + elif is_scalar(other): + try: + result = pc_func(self._data, pa.scalar(other)) + except (pa.lib.ArrowNotImplementedError, pa.lib.ArrowInvalid): + mask = isna(self) | isna(other) + valid = ~mask + result = np.zeros(len(self), dtype="bool") + result[valid] = op(np.array(self)[valid], other) + return BooleanArray(result, mask) + else: + raise NotImplementedError( + f"{op.__name__} not implemented for {type(other)}" + ) + + if pa_version_under2p0: + result = result.to_pandas().values + else: + result = result.to_numpy() + return BooleanArray._from_sequence(result) + + def _evaluate_op_method(self, other, op, arrow_funcs): + pc_func = arrow_funcs[op.__name__] + if pc_func is NotImplemented: + raise NotImplementedError(f"{op.__name__} not implemented.") + if isinstance(other, ArrowExtensionArray): + result = pc_func(self._data, other._data) + elif isinstance(other, (np.ndarray, list)): + result = pc_func(self._data, pa.array(other, from_pandas=True)) + elif is_scalar(other): + result = pc_func(self._data, pa.scalar(other)) + else: + raise NotImplementedError( + f"{op.__name__} not implemented for {type(other)}" + ) + return type(self)(result) + + def _logical_method(self, other, op): + return self._evaluate_op_method(other, op, ARROW_LOGICAL_FUNCS) + + def _arith_method(self, other, op): + return self._evaluate_op_method(other, op, ARROW_ARITHMETIC_FUNCS) + def equals(self, other) -> bool: if not isinstance(other, ArrowExtensionArray): return False @@ -210,6 +403,10 @@ def __len__(self) -> int: """ return len(self._data) + @property + def _hasna(self) -> bool: + return self._data.null_count > 0 + def isna(self) -> npt.NDArray[np.bool_]: """ Boolean NumPy array indicating if each value is missing. @@ -247,8 +444,60 @@ def dropna(self: ArrowExtensionArrayT) -> ArrowExtensionArrayT: else: return type(self)(pc.drop_null(self._data)) + def isin(self: ArrowExtensionArrayT, values) -> npt.NDArray[np.bool_]: + if pa_version_under2p0: + fallback_performancewarning(version="2") + return super().isin(values) + + # for an empty value_set pyarrow 3.0.0 segfaults and pyarrow 2.0.0 returns True + # for null values, so we short-circuit to return an all-False array.
+ if not len(values): + return np.zeros(len(self), dtype=bool) + + kwargs = {} + if pa_version_under3p0: + # in pyarrow 2.0.0 skip_null is ignored but is a required keyword and raises + # with unexpected keyword argument in pyarrow 3.0.0+ + kwargs["skip_null"] = True + + result = pc.is_in( + self._data, value_set=pa.array(values, from_pandas=True), **kwargs + ) + # pyarrow 2.0.0 returned nulls, so we explicitly specify dtype to convert nulls + # to False + return np.array(result, dtype=np.bool_) + + def _values_for_factorize(self) -> tuple[np.ndarray, Any]: + """ + Return an array and missing value suitable for factorization. + + Returns + ------- + values : ndarray + na_value : pd.NA + + Notes + ----- + The values returned by this method are also used in + :func:`pandas.util.hash_pandas_object`. + """ + if pa_version_under2p0: + values = self._data.to_pandas().values + else: + values = self._data.to_numpy() + return values, self.dtype.na_value + @doc(ExtensionArray.factorize) - def factorize(self, na_sentinel: int = -1) -> tuple[np.ndarray, ExtensionArray]: + def factorize( + self, + na_sentinel: int | lib.NoDefault = lib.no_default, + use_na_sentinel: bool | lib.NoDefault = lib.no_default, + ) -> tuple[np.ndarray, ExtensionArray]: + resolved_na_sentinel = resolve_na_sentinel(na_sentinel, use_na_sentinel) + if resolved_na_sentinel is None: + raise NotImplementedError("Encoding NaN values is not yet implemented") + else: + na_sentinel = resolved_na_sentinel encoded = self._data.dictionary_encode() indices = pa.chunked_array( [c.indices for c in encoded.chunks], type=encoded.type.index_type @@ -275,7 +524,7 @@ def take( indices: TakeIndexer, allow_fill: bool = False, fill_value: Any = None, - ): + ) -> ArrowExtensionArray: """ Take elements from an array. @@ -435,12 +684,73 @@ def _concat_same_type( ------- ArrowExtensionArray """ - import pyarrow as pa - chunks = [array for ea in to_concat for array in ea._data.iterchunks()] arr = pa.chunked_array(chunks) return cls(arr) + def _reduce(self, name: str, *, skipna: bool = True, **kwargs): + """ + Return a scalar result of performing the reduction operation. + + Parameters + ---------- + name : str + Name of the function, supported values are: + { any, all, min, max, sum, mean, median, prod, + std, var, sem, kurt, skew }. + skipna : bool, default True + If True, skip NaN values. + **kwargs + Additional keyword arguments passed to the reduction function. + Currently, `ddof` is the only supported kwarg. 
+ + Returns + ------- + scalar + + Raises + ------ + TypeError : subclass does not define reductions + """ + if name == "sem": + + def pyarrow_meth(data, skipna, **kwargs): + numerator = pc.stddev(data, skip_nulls=skipna, **kwargs) + denominator = pc.sqrt_checked( + pc.subtract_checked( + pc.count(self._data, skip_nulls=skipna), kwargs["ddof"] + ) + ) + return pc.divide_checked(numerator, denominator) + + else: + pyarrow_name = { + "median": "approximate_median", + "prod": "product", + "std": "stddev", + "var": "variance", + }.get(name, name) + # error: Incompatible types in assignment + # (expression has type "Optional[Any]", variable has type + # "Callable[[Any, Any, KwArg(Any)], Any]") + pyarrow_meth = getattr(pc, pyarrow_name, None) # type: ignore[assignment] + if pyarrow_meth is None: + # Let ExtensionArray._reduce raise the TypeError + return super()._reduce(name, skipna=skipna, **kwargs) + try: + result = pyarrow_meth(self._data, skip_nulls=skipna, **kwargs) + except (AttributeError, NotImplementedError, TypeError) as err: + msg = ( + f"'{type(self).__name__}' with dtype {self.dtype} " + f"does not support reduction '{name}' with pyarrow " + f"version {pa.__version__}. '{name}' may be supported by " + f"upgrading pyarrow." + ) + raise TypeError(msg) from err + if pc.is_null(result).as_py(): + return self.dtype.na_value + return result.as_py() + def __setitem__(self, key: int | slice | np.ndarray, value: Any) -> None: """Set one or more values inplace. @@ -496,6 +806,14 @@ def _indexing_key_to_indices( if isinstance(key, slice): indices = np.arange(n)[key] elif is_integer(key): + # error: Invalid index type "List[Union[int, ndarray[Any, Any]]]" + # for "ndarray[Any, dtype[signedinteger[Any]]]"; expected type + # "Union[SupportsIndex, _SupportsArray[dtype[Union[bool_, + # integer[Any]]]], _NestedSequence[_SupportsArray[dtype[Union + # [bool_, integer[Any]]]]], _NestedSequence[Union[bool, int]] + # , Tuple[Union[SupportsIndex, _SupportsArray[dtype[Union[bool_ + # , integer[Any]]]], _NestedSequence[_SupportsArray[dtype[Union + # [bool_, integer[Any]]]]], _NestedSequence[Union[bool, int]]], ...]]" indices = np.arange(n)[[key]] # type: ignore[index] elif is_bool_dtype(key): key = np.asarray(key) @@ -581,7 +899,7 @@ def _replace_with_indices( # fast path for a contiguous set of indices arrays = [ chunk[:start], - pa.array(value, type=chunk.type), + pa.array(value, type=chunk.type, from_pandas=True), chunk[stop + 1 :], ] arrays = [arr for arr in arrays if len(arr)] diff --git a/pandas/core/arrays/arrow/dtype.py b/pandas/core/arrays/arrow/dtype.py index af5b51a39b9c3..4a32663a68ed2 100644 --- a/pandas/core/arrays/arrow/dtype.py +++ b/pandas/core/arrays/arrow/dtype.py @@ -44,7 +44,7 @@ def name(self) -> str: # type: ignore[override] """ A string identifying the data type. """ - return str(self.pyarrow_dtype) + return f"{str(self.pyarrow_dtype)}[{self.storage}]" @cache_readonly def numpy_dtype(self) -> np.dtype: @@ -77,7 +77,7 @@ def construct_array_type(cls): return ArrowExtensionArray @classmethod - def construct_from_string(cls, string: str): + def construct_from_string(cls, string: str) -> ArrowDtype: """ Construct this type from a string. 
@@ -92,10 +92,11 @@ def construct_from_string(cls, string: str): f"'construct_from_string' expects a string, got {type(string)}" ) if not string.endswith("[pyarrow]"): - raise TypeError(f"string {string} must end with '[pyarrow]'") + raise TypeError(f"'{string}' must end with '[pyarrow]'") base_type = string.split("[pyarrow]")[0] - pa_dtype = getattr(pa, base_type, None) - if pa_dtype is None: + try: + pa_dtype = pa.type_for_alias(base_type) + except ValueError as err: has_parameters = re.search(r"\[.*\]", base_type) if has_parameters: raise NotImplementedError( @@ -103,9 +104,9 @@ def construct_from_string(cls, string: str): f"({has_parameters.group()}) in the string is not supported. " "Please construct an ArrowDtype object with a pyarrow_dtype " "instance with specific parameters." - ) - raise TypeError(f"'{base_type}' is not a valid pyarrow data type.") - return cls(pa_dtype()) + ) from err + raise TypeError(f"'{base_type}' is not a valid pyarrow data type.") from err + return cls(pa_dtype) @property def _is_numeric(self) -> bool: diff --git a/pandas/core/arrays/base.py b/pandas/core/arrays/base.py index eb3c6d6d26101..6c9b7adadb7b0 100644 --- a/pandas/core/arrays/base.py +++ b/pandas/core/arrays/base.py @@ -8,11 +8,13 @@ """ from __future__ import annotations +import inspect import operator from typing import ( TYPE_CHECKING, Any, Callable, + ClassVar, Iterator, Literal, Sequence, @@ -20,6 +22,7 @@ cast, overload, ) +import warnings import numpy as np @@ -45,6 +48,7 @@ cache_readonly, deprecate_nonkeyword_arguments, ) +from pandas.util._exceptions import find_stack_level from pandas.util._validators import ( validate_bool_kwarg, validate_fillna_kwargs, @@ -76,6 +80,7 @@ isin, mode, rank, + resolve_na_sentinel, unique, ) from pandas.core.array_algos.quantile import quantile_with_mask @@ -456,6 +461,24 @@ def __ne__(self, other: Any) -> ArrayLike: # type: ignore[override] """ return ~(self == other) + def __init_subclass__(cls, **kwargs) -> None: + factorize = getattr(cls, "factorize") + if ( + "use_na_sentinel" not in inspect.signature(factorize).parameters + # TimelikeOps uses old factorize args to ensure we don't break things + and cls.__name__ not in ("TimelikeOps", "DatetimeArray", "TimedeltaArray") + ): + # See GH#46910 for details on the deprecation + name = cls.__name__ + warnings.warn( + f"The `na_sentinel` argument of `{name}.factorize` is deprecated. " + f"In the future, pandas will use the `use_na_sentinel` argument " + f"instead. Add this argument to `{name}.factorize` to be compatible " + f"with future versions of pandas and silence this warning.", + DeprecationWarning, + stacklevel=find_stack_level(), + ) + def to_numpy( self, dtype: npt.DTypeLike | None = None, @@ -748,11 +771,11 @@ def argmax(self, skipna: bool = True) -> int: return nargminmax(self, "argmax") def fillna( - self, + self: ExtensionArrayT, value: object | ArrayLike | None = None, method: FillnaOptions | None = None, limit: int | None = None, - ): + ) -> ExtensionArrayT: """ Fill NA/NaN values using the specified method. @@ -1002,7 +1025,11 @@ def _values_for_factorize(self) -> tuple[np.ndarray, Any]: """ return self.astype(object), np.nan - def factorize(self, na_sentinel: int = -1) -> tuple[np.ndarray, ExtensionArray]: + def factorize( + self, + na_sentinel: int | lib.NoDefault = lib.no_default, + use_na_sentinel: bool | lib.NoDefault = lib.no_default, + ) -> tuple[np.ndarray, ExtensionArray]: """ Encode the extension array as an enumerated type. 
@@ -1011,6 +1038,18 @@ def factorize(self, na_sentinel: int = -1) -> tuple[np.ndarray, ExtensionArray]: na_sentinel : int, default -1 Value to use in the `codes` array to indicate missing values. + .. deprecated:: 1.5.0 + The na_sentinel argument is deprecated and + will be removed in a future version of pandas. Specify use_na_sentinel + as either True or False. + + use_na_sentinel : bool, default True + If True, the sentinel -1 will be used for NaN values. If False, + NaN values will be encoded as non-negative integers and will not drop the + NaN from the uniques of the values. + + .. versionadded:: 1.5.0 + Returns ------- codes : ndarray @@ -1041,6 +1080,11 @@ def factorize(self, na_sentinel: int = -1) -> tuple[np.ndarray, ExtensionArray]: # original ExtensionArray. # 2. ExtensionArray.factorize. # Complete control over factorization. + resolved_na_sentinel = resolve_na_sentinel(na_sentinel, use_na_sentinel) + if resolved_na_sentinel is None: + raise NotImplementedError("Encoding NaN values is not yet implemented") + else: + na_sentinel = resolved_na_sentinel arr, na_value = self._values_for_factorize() codes, uniques = factorize_array( @@ -1096,7 +1140,9 @@ def factorize(self, na_sentinel: int = -1) -> tuple[np.ndarray, ExtensionArray]: @Substitution(klass="ExtensionArray") @Appender(_extension_array_shared_docs["repeat"]) - def repeat(self, repeats: int | Sequence[int], axis: int | None = None): + def repeat( + self: ExtensionArrayT, repeats: int | Sequence[int], axis: int | None = None + ) -> ExtensionArrayT: nv.validate_repeat((), {"axis": axis}) ind = np.arange(len(self)).repeat(repeats) return self.take(ind) @@ -1397,7 +1443,7 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs): # https://github.com/python/typeshed/issues/2148#issuecomment-520783318 # Incompatible types in assignment (expression has type "None", base class # "object" defined the type as "Callable[[object], int]") - __hash__: None # type: ignore[assignment] + __hash__: ClassVar[None] # type: ignore[assignment] # ------------------------------------------------------------------------ # Non-Optimized Default Methods; in the case of the private methods here, diff --git a/pandas/core/arrays/categorical.py b/pandas/core/arrays/categorical.py index 70699c45e0c36..2c3b7c2f2589d 100644 --- a/pandas/core/arrays/categorical.py +++ b/pandas/core/arrays/categorical.py @@ -7,6 +7,7 @@ from typing import ( TYPE_CHECKING, Hashable, + Literal, Sequence, TypeVar, Union, @@ -29,7 +30,10 @@ lib, ) from pandas._libs.arrays import NDArrayBacked -from pandas._libs.lib import no_default +from pandas._libs.lib import ( + NoDefault, + no_default, +) from pandas._typing import ( ArrayLike, AstypeArg, @@ -114,7 +118,11 @@ from pandas.io.formats import console if TYPE_CHECKING: - from pandas import Index + from pandas import ( + DataFrame, + Index, + Series, + ) CategoricalT = TypeVar("CategoricalT", bound="Categorical") @@ -193,7 +201,7 @@ def func(self, other): return func -def contains(cat, key, container): +def contains(cat, key, container) -> bool: """ Helper for membership check for ``key`` in ``cat``. 
@@ -462,9 +470,7 @@ def __init__( dtype = CategoricalDtype(ordered=False).update_dtype(dtype) arr = coerce_indexer_dtype(codes, dtype.categories) - # error: Argument 1 to "__init__" of "NDArrayBacked" has incompatible - # type "Union[ExtensionArray, ndarray]"; expected "ndarray" - super().__init__(arr, dtype) # type: ignore[arg-type] + super().__init__(arr, dtype) @property def dtype(self) -> CategoricalDtype: @@ -639,7 +645,7 @@ def _from_inferred_categories( @classmethod def from_codes( cls, codes, categories=None, ordered=None, dtype: Dtype | None = None - ): + ) -> Categorical: """ Make a Categorical type from codes and categories or dtype. @@ -707,7 +713,7 @@ def from_codes( # Categories/Codes/Ordered @property - def categories(self): + def categories(self) -> Index: """ The categories of this categorical. @@ -738,7 +744,7 @@ def categories(self): return self.dtype.categories @categories.setter - def categories(self, categories): + def categories(self, categories) -> None: new_dtype = CategoricalDtype(categories, ordered=self.ordered) if self.dtype.categories is not None and len(self.dtype.categories) != len( new_dtype.categories @@ -829,7 +835,20 @@ def _set_dtype(self, dtype: CategoricalDtype) -> Categorical: codes = recode_for_categories(self.codes, self.categories, dtype.categories) return type(self)(codes, dtype=dtype, fastpath=True) - def set_ordered(self, value, inplace=False): + @overload + def set_ordered(self, value, *, inplace: Literal[False] = ...) -> Categorical: + ... + + @overload + def set_ordered(self, value, *, inplace: Literal[True]) -> None: + ... + + @overload + def set_ordered(self, value, *, inplace: bool) -> Categorical | None: + ... + + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "value"]) + def set_ordered(self, value, inplace: bool = False) -> Categorical | None: """ Set the ordered attribute to the boolean value. @@ -847,8 +866,18 @@ def set_ordered(self, value, inplace=False): NDArrayBacked.__init__(cat, cat._ndarray, new_dtype) if not inplace: return cat + return None + + @overload + def as_ordered(self, *, inplace: Literal[False] = ...) -> Categorical: + ... + + @overload + def as_ordered(self, *, inplace: Literal[True]) -> None: + ... - def as_ordered(self, inplace=False): + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + def as_ordered(self, inplace: bool = False) -> Categorical | None: """ Set the Categorical to be ordered. @@ -866,7 +895,16 @@ def as_ordered(self, inplace=False): inplace = validate_bool_kwarg(inplace, "inplace") return self.set_ordered(True, inplace=inplace) - def as_unordered(self, inplace=False): + @overload + def as_unordered(self, *, inplace: Literal[False] = ...) -> Categorical: + ... + + @overload + def as_unordered(self, *, inplace: Literal[True]) -> None: + ... + + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + def as_unordered(self, inplace: bool = False) -> Categorical | None: """ Set the Categorical to be unordered. @@ -973,7 +1011,22 @@ def set_categories( if not inplace: return cat - def rename_categories(self, new_categories, inplace=no_default): + @overload + def rename_categories( + self, new_categories, *, inplace: Literal[False] | NoDefault = ... + ) -> Categorical: + ... + + @overload + def rename_categories(self, new_categories, *, inplace: Literal[True]) -> None: + ... 
+ + @deprecate_nonkeyword_arguments( + version=None, allowed_args=["self", "new_categories"] + ) + def rename_categories( + self, new_categories, inplace: bool | NoDefault = no_default + ) -> Categorical | None: """ Rename categories. @@ -1062,6 +1115,7 @@ def rename_categories(self, new_categories, inplace=no_default): cat.categories = new_categories if not inplace: return cat + return None def reorder_categories(self, new_categories, ordered=None, inplace=no_default): """ @@ -1124,7 +1178,22 @@ def reorder_categories(self, new_categories, ordered=None, inplace=no_default): simplefilter("ignore") return self.set_categories(new_categories, ordered=ordered, inplace=inplace) - def add_categories(self, new_categories, inplace=no_default): + @overload + def add_categories( + self, new_categories, *, inplace: Literal[False] | NoDefault = ... + ) -> Categorical: + ... + + @overload + def add_categories(self, new_categories, *, inplace: Literal[True]) -> None: + ... + + @deprecate_nonkeyword_arguments( + version=None, allowed_args=["self", "new_categories"] + ) + def add_categories( + self, new_categories, inplace: bool | NoDefault = no_default + ) -> Categorical | None: """ Add new categories. @@ -1199,6 +1268,7 @@ def add_categories(self, new_categories, inplace=no_default): NDArrayBacked.__init__(cat, codes, new_dtype) if not inplace: return cat + return None def remove_categories(self, removals, inplace=no_default): """ @@ -1280,7 +1350,20 @@ def remove_categories(self, removals, inplace=no_default): new_categories, ordered=self.ordered, rename=False, inplace=inplace ) - def remove_unused_categories(self, inplace=no_default): + @overload + def remove_unused_categories( + self, *, inplace: Literal[False] | NoDefault = ... + ) -> Categorical: + ... + + @overload + def remove_unused_categories(self, *, inplace: Literal[True]) -> None: + ... + + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + def remove_unused_categories( + self, inplace: bool | NoDefault = no_default + ) -> Categorical | None: """ Remove categories which are not used. @@ -1348,6 +1431,7 @@ def remove_unused_categories(self, inplace=no_default): NDArrayBacked.__init__(cat, new_codes, new_dtype) if not inplace: return cat + return None # ------------------------------------------------------------------ @@ -1531,7 +1615,7 @@ def __array_ufunc__(self, ufunc: np.ufunc, method: str, *inputs, **kwargs): f"the numpy op {ufunc.__name__}" ) - def __setstate__(self, state): + def __setstate__(self, state) -> None: """Necessary for making this object picklable""" if not isinstance(state, dict): return super().__setstate__(state) @@ -1617,7 +1701,7 @@ def notna(self) -> np.ndarray: notnull = notna - def value_counts(self, dropna: bool = True): + def value_counts(self, dropna: bool = True) -> Series: """ Return a Series containing counts of each category. @@ -1700,7 +1784,7 @@ def _internal_get_values(self): return self.categories.astype("object").take(self._codes, fill_value=np.nan) return np.array(self) - def check_for_ordered(self, op): + def check_for_ordered(self, op) -> None: """assert that we are ordered""" if not self.ordered: raise TypeError( @@ -1763,9 +1847,26 @@ def argsort(self, ascending=True, kind="quicksort", **kwargs): """ return super().argsort(ascending=ascending, kind=kind, **kwargs) + @overload + def sort_values( + self, + *, + inplace: Literal[False] = ..., + ascending: bool = ..., + na_position: str = ..., + ) -> Categorical: + ... 
+ + @overload + def sort_values( + self, *, inplace: Literal[True], ascending: bool = ..., na_position: str = ... + ) -> None: + ... + + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) def sort_values( self, inplace: bool = False, ascending: bool = True, na_position: str = "last" - ): + ) -> Categorical | None: """ Sort the Categorical by category value returning a new Categorical by default. @@ -1845,11 +1946,11 @@ def sort_values( sorted_idx = nargsort(self, ascending=ascending, na_position=na_position) - if inplace: - self._codes[:] = self._codes[sorted_idx] - else: + if not inplace: codes = self._codes[sorted_idx] return self._from_backing_data(codes) + self._codes[:] = self._codes[sorted_idx] + return None def _rank( self, @@ -1954,7 +2055,9 @@ def _unbox_scalar(self, key) -> int: # ------------------------------------------------------------------ - def take_nd(self, indexer, allow_fill: bool = False, fill_value=None): + def take_nd( + self, indexer, allow_fill: bool = False, fill_value=None + ) -> Categorical: # GH#27745 deprecate alias that other EAs dont have warn( "Categorical.take_nd is deprecated, use Categorical.take instead", @@ -2402,7 +2505,7 @@ def is_dtype_equal(self, other) -> bool: except (AttributeError, TypeError): return False - def describe(self): + def describe(self) -> DataFrame: """ Describes this Categorical @@ -2476,7 +2579,18 @@ def isin(self, values) -> npt.NDArray[np.bool_]: code_values = code_values[null_mask | (code_values >= 0)] return algorithms.isin(self.codes, code_values) - def replace(self, to_replace, value, inplace: bool = False): + @overload + def replace( + self, to_replace, value, *, inplace: Literal[False] = ... + ) -> Categorical: + ... + + @overload + def replace(self, to_replace, value, *, inplace: Literal[True]) -> None: + ... + + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "value"]) + def replace(self, to_replace, value, inplace: bool = False) -> Categorical | None: """ Replaces all instances of one value with another @@ -2724,7 +2838,7 @@ def _delegate_property_set(self, name, new_values): return setattr(self._parent, name, new_values) @property - def codes(self): + def codes(self) -> Series: """ Return Series of codes as well as the index. 
""" @@ -2823,6 +2937,7 @@ def factorize_from_iterable(values) -> tuple[np.ndarray, Index]: if not is_list_like(values): raise TypeError("Input must be list-like") + categories: Index if is_categorical_dtype(values): values = extract_array(values) # The Categorical we want to build has the same categories diff --git a/pandas/core/arrays/datetimelike.py b/pandas/core/arrays/datetimelike.py index 1dfb070e29c30..c3fbd716ad09d 100644 --- a/pandas/core/arrays/datetimelike.py +++ b/pandas/core/arrays/datetimelike.py @@ -25,6 +25,7 @@ algos, lib, ) +from pandas._libs.arrays import NDArrayBacked from pandas._libs.tslibs import ( BaseOffset, IncompatibleFrequency, @@ -45,6 +46,7 @@ RoundTo, round_nsint64, ) +from pandas._libs.tslibs.np_datetime import compare_mismatched_resolutions from pandas._libs.tslibs.timestamps import integer_op_not_supported from pandas._typing import ( ArrayLike, @@ -72,7 +74,6 @@ from pandas.util._exceptions import find_stack_level from pandas.core.dtypes.common import ( - DT64NS_DTYPE, is_all_strings, is_categorical_dtype, is_datetime64_any_dtype, @@ -94,6 +95,10 @@ DatetimeTZDtype, ExtensionDtype, ) +from pandas.core.dtypes.generic import ( + ABCCategorical, + ABCMultiIndex, +) from pandas.core.dtypes.missing import ( is_valid_na_for_dtype, isna, @@ -114,6 +119,8 @@ NDArrayBackedExtensionArray, ravel_compat, ) +from pandas.core.arrays.base import ExtensionArray +from pandas.core.arrays.integer import IntegerArray import pandas.core.common as com from pandas.core.construction import ( array as pd_array, @@ -424,17 +431,18 @@ def astype(self, dtype, copy: bool = True): if is_object_dtype(dtype): if self.dtype.kind == "M": + self = cast("DatetimeArray", self) # *much* faster than self._box_values # for e.g. test_get_loc_tuple_monotonic_above_size_cutoff - i8data = self.asi8.ravel() + i8data = self.asi8 converted = ints_to_pydatetime( i8data, - # error: "DatetimeLikeArrayMixin" has no attribute "tz" - tz=self.tz, # type: ignore[attr-defined] + tz=self.tz, freq=self.freq, box="timestamp", + reso=self._reso, ) - return converted.reshape(self.shape) + return converted elif self.dtype.kind == "m": return ints_to_pytimedelta(self._ndarray, box=True) @@ -923,7 +931,7 @@ def freq(self): return self._freq @freq.setter - def freq(self, value): + def freq(self, value) -> None: if value is not None: value = to_offset(value) self._validate_frequency(self, value) @@ -1058,6 +1066,24 @@ def _cmp_method(self, other, op): ) return result + if other is NaT: + if op is operator.ne: + result = np.ones(self.shape, dtype=bool) + else: + result = np.zeros(self.shape, dtype=bool) + return result + + if not is_period_dtype(self.dtype): + self = cast(TimelikeOps, self) + if self._reso != other._reso: + if not isinstance(other, type(self)): + # i.e. 
Timedelta/Timestamp, cast to ndarray and let + # compare_mismatched_resolutions handle broadcasting + other_arr = np.array(other.asm8) + else: + other_arr = other._ndarray + return compare_mismatched_resolutions(self._ndarray, other_arr, op) + other_vals = self._unbox(other) # GH#37462 comparison on i8 values is almost 2x faster than M8/m8 result = op(self._ndarray.view("i8"), other_vals.view("i8")) @@ -1086,31 +1112,40 @@ def _cmp_method(self, other, op): __rdivmod__ = make_invalid_op("__rdivmod__") @final - def _add_datetimelike_scalar(self, other): - # Overridden by TimedeltaArray + def _add_datetimelike_scalar(self, other) -> DatetimeArray: if not is_timedelta64_dtype(self.dtype): raise TypeError( f"cannot add {type(self).__name__} and {type(other).__name__}" ) + self = cast("TimedeltaArray", self) + from pandas.core.arrays import DatetimeArray + from pandas.core.arrays.datetimes import tz_to_dtype assert other is not NaT other = Timestamp(other) if other is NaT: # In this case we specifically interpret NaT as a datetime, not # the timedelta interpretation we would get by returning self + NaT - result = self.asi8.view("m8[ms]") + NaT.to_datetime64() - return DatetimeArray(result) + result = self._ndarray + NaT.to_datetime64().astype(f"M8[{self._unit}]") + # Preserve our resolution + return DatetimeArray._simple_new(result, dtype=result.dtype) + + if self._reso != other._reso: + raise NotImplementedError( + "Addition between TimedeltaArray and Timestamp with mis-matched " + "resolutions is not yet supported." + ) i8 = self.asi8 result = checked_add_with_arr(i8, other.value, arr_mask=self._isnan) - result = self._maybe_mask_results(result) - dtype = DatetimeTZDtype(tz=other.tz) if other.tz else DT64NS_DTYPE - return DatetimeArray(result, dtype=dtype, freq=self.freq) + dtype = tz_to_dtype(tz=other.tz, unit=self._unit) + res_values = result.view(f"M8[{self._unit}]") + return DatetimeArray._simple_new(res_values, dtype=dtype, freq=self.freq) @final - def _add_datetime_arraylike(self, other): + def _add_datetime_arraylike(self, other) -> DatetimeArray: if not is_timedelta64_dtype(self.dtype): raise TypeError( f"cannot add {type(self).__name__} and {type(other).__name__}" @@ -1146,7 +1181,6 @@ def _sub_datetimelike_scalar(self, other: datetime | np.datetime64): i8 = self.asi8 result = checked_add_with_arr(i8, -other.value, arr_mask=self._isnan) - result = self._maybe_mask_results(result) return result.view("timedelta64[ns]") @final @@ -1168,23 +1202,23 @@ def _sub_datetime_arraylike(self, other): self_i8 = self.asi8 other_i8 = other.asi8 - arr_mask = self._isnan | other._isnan - new_values = checked_add_with_arr(self_i8, -other_i8, arr_mask=arr_mask) - if self._hasna or other._hasna: - np.putmask(new_values, arr_mask, iNaT) + new_values = checked_add_with_arr( + self_i8, -other_i8, arr_mask=self._isnan, b_mask=other._isnan + ) return new_values.view("timedelta64[ns]") @final - def _sub_period(self, other: Period): + def _sub_period(self, other: Period) -> npt.NDArray[np.object_]: if not is_period_dtype(self.dtype): raise TypeError(f"cannot subtract Period from a {type(self).__name__}") # If the operation is well-defined, we return an object-dtype ndarray # of DateOffsets. 
Null entries are filled with pd.NaT self._check_compatible_with(other) - asi8 = self.asi8 - new_data = asi8 - other.ordinal - new_data = np.array([self.freq.base * x for x in new_data]) + new_i8_data = checked_add_with_arr( + self.asi8, -other.ordinal, arr_mask=self._isnan + ) + new_data = np.array([self.freq.base * x for x in new_i8_data]) if self._hasna: new_data[self._isnan] = NaT @@ -1192,7 +1226,7 @@ def _sub_period(self, other: Period): return new_data @final - def _add_period(self, other: Period): + def _add_period(self, other: Period) -> PeriodArray: if not is_timedelta64_dtype(self.dtype): raise TypeError(f"cannot add Period to a {type(self).__name__}") @@ -1225,8 +1259,6 @@ def _add_timedeltalike_scalar(self, other): inc = delta_to_nanoseconds(other, reso=self._reso) # type: ignore[attr-defined] new_values = checked_add_with_arr(self.asi8, inc, arr_mask=self._isnan) - new_values = new_values.view("i8") - new_values = self._maybe_mask_results(new_values) new_values = new_values.view(self._ndarray.dtype) new_freq = None @@ -1262,10 +1294,6 @@ def _add_timedelta_arraylike( new_values = checked_add_with_arr( self_i8, other_i8, arr_mask=self._isnan, b_mask=other._isnan ) - if self._hasna or other._hasna: - mask = self._isnan | other._isnan - np.putmask(new_values, mask, iNaT) - return type(self)(new_values, dtype=self.dtype) @final @@ -1277,12 +1305,14 @@ def _add_nat(self): raise TypeError( f"Cannot add {type(self).__name__} and {type(NaT).__name__}" ) + self = cast("TimedeltaArray | DatetimeArray", self) # GH#19124 pd.NaT is treated like a timedelta for both timedelta # and datetime dtypes result = np.empty(self.shape, dtype=np.int64) result.fill(iNaT) - return type(self)(result, dtype=self.dtype, freq=None) + result = result.view(self._ndarray.dtype) # preserve reso + return type(self)._simple_new(result, dtype=self.dtype, freq=None) @final def _sub_nat(self): @@ -1309,11 +1339,11 @@ def _sub_period_array(self, other: PeriodArray) -> npt.NDArray[np.object_]: self = cast("PeriodArray", self) self._require_matching_freq(other) - new_values = checked_add_with_arr( + new_i8_values = checked_add_with_arr( self.asi8, -other.asi8, arr_mask=self._isnan, b_mask=other._isnan ) - new_values = np.array([self.freq.base * x for x in new_values]) + new_values = np.array([self.freq.base * x for x in new_i8_values]) if self._hasna or other._hasna: mask = self._isnan | other._isnan new_values[mask] = NaT @@ -1544,7 +1574,7 @@ def __rsub__(self, other): # We get here with e.g. datetime objects return -(self - other) - def __iadd__(self, other): + def __iadd__(self: DatetimeLikeArrayT, other) -> DatetimeLikeArrayT: result = self + other self[:] = result[:] @@ -1553,7 +1583,7 @@ def __iadd__(self, other): self._freq = result.freq return self - def __isub__(self, other): + def __isub__(self: DatetimeLikeArrayT, other) -> DatetimeLikeArrayT: result = self - other self[:] = result[:] @@ -1682,12 +1712,11 @@ def median(self, *, axis: int | None = None, skipna: bool = True, **kwargs): return self._wrap_reduction_result(axis, result) def _mode(self, dropna: bool = True): - values = self + mask = None if dropna: - mask = values.isna() - values = values[~mask] + mask = self.isna() - i8modes = mode(values.view("i8")) + i8modes = mode(self.view("i8"), mask=mask) npmodes = i8modes.view(self._ndarray.dtype) npmodes = cast(np.ndarray, npmodes) return self._from_backing_data(npmodes) @@ -1904,10 +1933,94 @@ class TimelikeOps(DatetimeLikeArrayMixin): Common ops for TimedeltaIndex/DatetimeIndex, but not PeriodIndex. 
""" + _default_dtype: np.dtype + + def __init__(self, values, dtype=None, freq=lib.no_default, copy: bool = False): + values = extract_array(values, extract_numpy=True) + if isinstance(values, IntegerArray): + values = values.to_numpy("int64", na_value=iNaT) + + inferred_freq = getattr(values, "_freq", None) + explicit_none = freq is None + freq = freq if freq is not lib.no_default else None + + if isinstance(values, type(self)): + if explicit_none: + # don't inherit from values + pass + elif freq is None: + freq = values.freq + elif freq and values.freq: + freq = to_offset(freq) + freq, _ = validate_inferred_freq(freq, values.freq, False) + + if dtype is not None: + dtype = pandas_dtype(dtype) + if not is_dtype_equal(dtype, values.dtype): + # TODO: we only have tests for this for DTA, not TDA (2022-07-01) + raise TypeError( + f"dtype={dtype} does not match data dtype {values.dtype}" + ) + + dtype = values.dtype + values = values._ndarray + + elif dtype is None: + dtype = self._default_dtype + + if not isinstance(values, np.ndarray): + raise ValueError( + f"Unexpected type '{type(values).__name__}'. 'values' must be a " + f"{type(self).__name__}, ndarray, or Series or Index " + "containing one of those." + ) + if values.ndim not in [1, 2]: + raise ValueError("Only 1-dimensional input arrays are supported.") + + if values.dtype == "i8": + # for compat with datetime/timedelta/period shared methods, + # we can sometimes get here with int64 values. These represent + # nanosecond UTC (or tz-naive) unix timestamps + values = values.view(self._default_dtype) + + dtype = self._validate_dtype(values, dtype) + + if freq == "infer": + raise ValueError( + f"Frequency inference not allowed in {type(self).__name__}.__init__. " + "Use 'pd.array()' instead." + ) + + if copy: + values = values.copy() + if freq: + freq = to_offset(freq) + + NDArrayBacked.__init__(self, values=values, dtype=dtype) + self._freq = freq + + if inferred_freq is None and freq is not None: + type(self)._validate_frequency(self, freq) + + @classmethod + def _validate_dtype(cls, values, dtype): + raise AbstractMethodError(cls) + + # -------------------------------------------------------------- + @cache_readonly def _reso(self) -> int: return get_unit_from_dtype(self._ndarray.dtype) + @cache_readonly + def _unit(self) -> str: + # e.g. 
"ns", "us", "ms" + # error: Argument 1 to "dtype_to_unit" has incompatible type + # "ExtensionDtype"; expected "Union[DatetimeTZDtype, dtype[Any]]" + return dtype_to_unit(self.dtype) # type: ignore[arg-type] + + # -------------------------------------------------------------- + def __array_ufunc__(self, ufunc: np.ufunc, method: str, *inputs, **kwargs): if ( ufunc in [np.isnan, np.isinf, np.isfinite] @@ -1932,7 +2045,8 @@ def _round(self, freq, mode, ambiguous, nonexistent): values = self.view("i8") values = cast(np.ndarray, values) - nanos = to_offset(freq).nanos + nanos = to_offset(freq).nanos # raises on non-fixed frequencies + nanos = delta_to_nanoseconds(to_offset(freq), self._reso) result_i8 = round_nsint64(values, mode, nanos) result = self._maybe_mask_results(result_i8, fill_value=iNaT) result = result.view(self._ndarray.dtype) @@ -1953,11 +2067,11 @@ def ceil(self, freq, ambiguous="raise", nonexistent="raise"): # -------------------------------------------------------------- # Reductions - def any(self, *, axis: int | None = None, skipna: bool = True): + def any(self, *, axis: int | None = None, skipna: bool = True) -> bool: # GH#34479 discussion of desired behavior long-term return nanops.nanany(self._ndarray, axis=axis, skipna=skipna, mask=self.isna()) - def all(self, *, axis: int | None = None, skipna: bool = True): + def all(self, *, axis: int | None = None, skipna: bool = True) -> bool: # GH#34479 discussion of desired behavior long-term return nanops.nanall(self._ndarray, axis=axis, skipna=skipna, mask=self.isna()) @@ -1998,7 +2112,12 @@ def _with_freq(self, freq): # -------------------------------------------------------------- - def factorize(self, na_sentinel=-1, sort: bool = False): + # GH#46910 - Keep old signature to test we don't break things for EA library authors + def factorize( # type:ignore[override] + self, + na_sentinel: int = -1, + sort: bool = False, + ): if self.freq is not None: # We must be unique, so can short-circuit (and retain freq) codes = np.arange(len(self), dtype=np.intp) @@ -2015,7 +2134,47 @@ def factorize(self, na_sentinel=-1, sort: bool = False): # Shared Constructor Helpers -def validate_periods(periods): +def ensure_arraylike_for_datetimelike(data, copy: bool, cls_name: str): + if not hasattr(data, "dtype"): + # e.g. list, tuple + if np.ndim(data) == 0: + # i.e. generator + data = list(data) + data = np.asarray(data) + copy = False + elif isinstance(data, ABCMultiIndex): + raise TypeError(f"Cannot create a {cls_name} from a MultiIndex.") + else: + data = extract_array(data, extract_numpy=True) + + if isinstance(data, IntegerArray): + data = data.to_numpy("int64", na_value=iNaT) + copy = False + elif not isinstance(data, (np.ndarray, ExtensionArray)): + # GH#24539 e.g. xarray, dask object + data = np.asarray(data) + + elif isinstance(data, ABCCategorical): + # GH#18664 preserve tz in going DTI->Categorical->DTI + # TODO: cases where we need to do another pass through maybe_convert_dtype, + # e.g. the categories are timedelta64s + data = data.categories.take(data.codes, fill_value=NaT)._values + copy = False + + return data, copy + + +@overload +def validate_periods(periods: None) -> None: + ... + + +@overload +def validate_periods(periods: int | float) -> int: + ... + + +def validate_periods(periods: int | float | None) -> int | None: """ If a `periods` argument is passed to the Datetime/Timedelta Array/Index constructor, cast it to an integer. 
@@ -2038,10 +2197,14 @@ def validate_periods(periods): periods = int(periods) elif not lib.is_integer(periods): raise TypeError(f"periods must be a number, got {periods}") - return periods + # error: Incompatible return value type (got "Optional[float]", + # expected "Optional[int]") + return periods # type: ignore[return-value] -def validate_inferred_freq(freq, inferred_freq, freq_infer): +def validate_inferred_freq( + freq, inferred_freq, freq_infer +) -> tuple[BaseOffset | None, bool]: """ If the user passes a freq and another freq is inferred from passed data, require that they match. @@ -2102,3 +2265,21 @@ def maybe_infer_freq(freq): freq_infer = True freq = None return freq, freq_infer + + +def dtype_to_unit(dtype: DatetimeTZDtype | np.dtype) -> str: + """ + Return the unit str corresponding to the dtype's resolution. + + Parameters + ---------- + dtype : DatetimeTZDtype or np.dtype + If np.dtype, we assume it is a datetime64 dtype. + + Returns + ------- + str + """ + if isinstance(dtype, DatetimeTZDtype): + return dtype.unit + return np.datetime_data(dtype)[0] diff --git a/pandas/core/arrays/datetimes.py b/pandas/core/arrays/datetimes.py index db55c165c9974..7a56bba0e58b3 100644 --- a/pandas/core/arrays/datetimes.py +++ b/pandas/core/arrays/datetimes.py @@ -19,7 +19,6 @@ lib, tslib, ) -from pandas._libs.arrays import NDArrayBacked from pandas._libs.tslibs import ( BaseOffset, NaT, @@ -30,9 +29,9 @@ fields, get_resolution, get_unit_from_dtype, - iNaT, ints_to_pydatetime, is_date_array_normalized, + is_supported_unit, is_unitless, normalize_i8_timestamps, timezones, @@ -53,7 +52,6 @@ DT64NS_DTYPE, INT64_DTYPE, is_bool_dtype, - is_categorical_dtype, is_datetime64_any_dtype, is_datetime64_dtype, is_datetime64_ns_dtype, @@ -69,17 +67,11 @@ pandas_dtype, ) from pandas.core.dtypes.dtypes import DatetimeTZDtype -from pandas.core.dtypes.generic import ABCMultiIndex from pandas.core.dtypes.missing import isna -from pandas.core.arrays import ( - ExtensionArray, - datetimelike as dtl, -) +from pandas.core.arrays import datetimelike as dtl from pandas.core.arrays._ranges import generate_regular_range -from pandas.core.arrays.integer import IntegerArray import pandas.core.common as com -from pandas.core.construction import extract_array from pandas.tseries.frequencies import get_period_alias from pandas.tseries.offsets import ( @@ -99,22 +91,23 @@ _midnight = time(0, 0) -def tz_to_dtype(tz): +def tz_to_dtype(tz: tzinfo | None, unit: str = "ns"): """ Return a datetime64[ns] dtype appropriate for the given timezone. 
Parameters ---------- tz : tzinfo or None + unit : str, default "ns" Returns ------- np.dtype or Datetime64TZDType """ if tz is None: - return DT64NS_DTYPE + return np.dtype(f"M8[{unit}]") else: - return DatetimeTZDtype(tz=tz) + return DatetimeTZDtype(tz=tz, unit=unit) def _field_accessor(name: str, field: str, docstring=None): @@ -193,12 +186,15 @@ class DatetimeArray(dtl.TimelikeOps, dtl.DatelikeOps): """ _typ = "datetimearray" - _scalar_type = Timestamp _internal_fill_value = np.datetime64("NaT", "ns") _recognized_scalars = (datetime, np.datetime64) _is_recognized_dtype = is_datetime64_any_dtype _infer_matches = ("datetime", "datetime64", "date") + @property + def _scalar_type(self) -> type[Timestamp]: + return Timestamp + # define my properties & methods for delegation _bool_ops: list[str] = [ "is_month_start", @@ -255,84 +251,23 @@ class DatetimeArray(dtl.TimelikeOps, dtl.DatelikeOps): # Constructors _dtype: np.dtype | DatetimeTZDtype - _freq = None - - def __init__( - self, values, dtype=DT64NS_DTYPE, freq=lib.no_default, copy: bool = False - ) -> None: - values = extract_array(values, extract_numpy=True) - if isinstance(values, IntegerArray): - values = values.to_numpy("int64", na_value=iNaT) - - inferred_freq = getattr(values, "_freq", None) - explicit_none = freq is None - freq = freq if freq is not lib.no_default else None - - if isinstance(values, type(self)): - if explicit_none: - # don't inherit from values - pass - elif freq is None: - freq = values.freq - elif freq and values.freq: - freq = to_offset(freq) - freq, _ = dtl.validate_inferred_freq(freq, values.freq, False) - - # validation - dtz = getattr(dtype, "tz", None) - if dtz and values.tz is None: - dtype = DatetimeTZDtype(tz=dtype.tz) - elif dtz and values.tz: - if not timezones.tz_compare(dtz, values.tz): - msg = ( - "Timezone of the array and 'dtype' do not match. " - f"'{dtz}' != '{values.tz}'" - ) - raise TypeError(msg) - elif values.tz: - dtype = values.dtype - - values = values._ndarray - - if not isinstance(values, np.ndarray): - raise ValueError( - f"Unexpected type '{type(values).__name__}'. 'values' must be a " - f"{type(self).__name__}, ndarray, or Series or Index " - "containing one of those." - ) - if values.ndim not in [1, 2]: - raise ValueError("Only 1-dimensional input arrays are supported.") - - if values.dtype == "i8": - # for compat with datetime/timedelta/period shared methods, - # we can sometimes get here with int64 values. These represent - # nanosecond UTC (or tz-naive) unix timestamps - values = values.view(DT64NS_DTYPE) + _freq: BaseOffset | None = None + _default_dtype = DT64NS_DTYPE # used in TimeLikeOps.__init__ + @classmethod + def _validate_dtype(cls, values, dtype): + # used in TimeLikeOps.__init__ _validate_dt64_dtype(values.dtype) dtype = _validate_dt64_dtype(dtype) - - if freq == "infer": - raise ValueError( - f"Frequency inference not allowed in {type(self).__name__}.__init__. " - "Use 'pd.array()' instead." 
- ) - - if copy: - values = values.copy() - if freq: - freq = to_offset(freq) - - NDArrayBacked.__init__(self, values=values, dtype=dtype) - self._freq = freq - - if inferred_freq is None and freq is not None: - type(self)._validate_frequency(self, freq) + return dtype # error: Signature of "_simple_new" incompatible with supertype "NDArrayBacked" @classmethod def _simple_new( # type: ignore[override] - cls, values: np.ndarray, freq: BaseOffset | None = None, dtype=DT64NS_DTYPE + cls, + values: np.ndarray, + freq: BaseOffset | None = None, + dtype=DT64NS_DTYPE, ) -> DatetimeArray: assert isinstance(values, np.ndarray) assert dtype.kind == "M" @@ -359,7 +294,7 @@ def _from_sequence_not_strict( dtype=None, copy: bool = False, tz=None, - freq=lib.no_default, + freq: str | BaseOffset | lib.NoDefault | None = lib.no_default, dayfirst: bool = False, yearfirst: bool = False, ambiguous="raise", @@ -619,7 +554,7 @@ def is_normalized(self) -> bool: @property # NB: override with cache_readonly in immutable subclasses def _resolution_obj(self) -> Resolution: - return get_resolution(self.asi8, self.tz) + return get_resolution(self.asi8, self.tz, reso=self._reso) # ---------------------------------------------------------------- # Array-Like / EA-Interface Methods @@ -653,7 +588,11 @@ def __iter__(self): start_i = i * chunksize end_i = min((i + 1) * chunksize, length) converted = ints_to_pydatetime( - data[start_i:end_i], tz=self.tz, freq=self.freq, box="timestamp" + data[start_i:end_i], + tz=self.tz, + freq=self.freq, + box="timestamp", + reso=self._reso, ) yield from converted @@ -669,12 +608,26 @@ def astype(self, dtype, copy: bool = True): return self.copy() return self + elif ( + self.tz is None + and is_datetime64_dtype(dtype) + and not is_unitless(dtype) + and is_supported_unit(get_unit_from_dtype(dtype)) + ): + # unit conversion e.g. datetime64[s] + res_values = astype_overflowsafe(self._ndarray, dtype, copy=True) + return type(self)._simple_new(res_values, dtype=res_values.dtype) + # TODO: preserve freq? + elif is_datetime64_ns_dtype(dtype): return astype_dt64_to_dt64tz(self, dtype, copy, via_utc=False) - elif self.tz is None and is_datetime64_dtype(dtype) and dtype != self.dtype: - # unit conversion e.g. datetime64[s] - return self._ndarray.astype(dtype) + elif self.tz is not None and isinstance(dtype, DatetimeTZDtype): + # tzaware unit conversion e.g. datetime64[s, UTC] + np_dtype = np.dtype(dtype.str) + res_values = astype_overflowsafe(self._ndarray, np_dtype, copy=copy) + return type(self)._simple_new(res_values, dtype=dtype) + # TODO: preserve freq? 
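The new `astype` branches route both tz-naive and tz-aware unit conversions through `astype_overflowsafe` rather than a bare ndarray cast. Conceptually the unit change behaves like a datetime64 cast, sketched here with plain numpy (pandas additionally guards against silent overflow):

```python
import numpy as np

vals = np.array(["2022-01-01T00:00:00.123456789"], dtype="M8[ns]")
# casting to a coarser supported unit truncates the sub-second component
print(vals.astype("M8[s]"))  # ['2022-01-01T00:00:00']
```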
elif is_period_dtype(dtype): return self.to_period(freq=dtype.freq) @@ -683,7 +636,6 @@ def astype(self, dtype, copy: bool = True): # ----------------------------------------------------------------- # Rendering Methods - @dtl.ravel_compat def _format_native_types( self, *, na_rep="NaT", date_format=None, **kwargs ) -> npt.NDArray[np.object_]: @@ -692,7 +644,7 @@ def _format_native_types( fmt = get_format_datetime64_from_values(self, date_format) return tslib.format_array_from_datetime( - self.asi8, tz=self.tz, format=fmt, na_rep=na_rep + self.asi8, tz=self.tz, format=fmt, na_rep=na_rep, reso=self._reso ) # ----------------------------------------------------------------- @@ -852,7 +804,7 @@ def tz_convert(self, tz) -> DatetimeArray: ) # No conversion since timestamps are all UTC to begin with - dtype = tz_to_dtype(tz) + dtype = tz_to_dtype(tz, unit=self._unit) return self._simple_new(self._ndarray, dtype=dtype, freq=self.freq) @dtl.ravel_compat @@ -1017,10 +969,14 @@ def tz_localize(self, tz, ambiguous="raise", nonexistent="raise") -> DatetimeArr # Convert to UTC new_dates = tzconversion.tz_localize_to_utc( - self.asi8, tz, ambiguous=ambiguous, nonexistent=nonexistent + self.asi8, + tz, + ambiguous=ambiguous, + nonexistent=nonexistent, + reso=self._reso, ) - new_dates = new_dates.view(DT64NS_DTYPE) - dtype = tz_to_dtype(tz) + new_dates = new_dates.view(f"M8[{self._unit}]") + dtype = tz_to_dtype(tz, unit=self._unit) freq = None if timezones.is_utc(tz) or (len(self) == 1 and not isna(new_dates[0])): @@ -1044,7 +1000,7 @@ def to_pydatetime(self) -> npt.NDArray[np.object_]: ------- datetimes : ndarray[object] """ - return ints_to_pydatetime(self.asi8, tz=self.tz) + return ints_to_pydatetime(self.asi8, tz=self.tz, reso=self._reso) def normalize(self) -> DatetimeArray: """ @@ -1301,7 +1257,7 @@ def time(self) -> npt.NDArray[np.object_]: # keeping their timezone and not using UTC timestamps = self._local_timestamps() - return ints_to_pydatetime(timestamps, box="time") + return ints_to_pydatetime(timestamps, box="time", reso=self._reso) @property def timetz(self) -> npt.NDArray[np.object_]: @@ -1311,7 +1267,7 @@ def timetz(self) -> npt.NDArray[np.object_]: The time part of the Timestamps. """ - return ints_to_pydatetime(self.asi8, self.tz, box="time") + return ints_to_pydatetime(self.asi8, self.tz, box="time", reso=self._reso) @property def date(self) -> npt.NDArray[np.object_]: @@ -1326,7 +1282,7 @@ def date(self) -> npt.NDArray[np.object_]: # keeping their timezone and not using UTC timestamps = self._local_timestamps() - return ints_to_pydatetime(timestamps, box="date") + return ints_to_pydatetime(timestamps, box="date", reso=self._reso) def isocalendar(self) -> DataFrame: """ @@ -2058,23 +2014,9 @@ def _sequence_to_dt64ns( # if dtype has an embedded tz, capture it tz = validate_tz_from_dtype(dtype, tz) - if not hasattr(data, "dtype"): - # e.g. list, tuple - if np.ndim(data) == 0: - # i.e. generator - data = list(data) - data = np.asarray(data) - copy = False - elif isinstance(data, ABCMultiIndex): - raise TypeError("Cannot create a DatetimeArray from a MultiIndex.") - else: - data = extract_array(data, extract_numpy=True) - - if isinstance(data, IntegerArray): - data = data.to_numpy("int64", na_value=iNaT) - elif not isinstance(data, (np.ndarray, ExtensionArray)): - # GH#24539 e.g. 
xarray, dask object - data = np.asarray(data) + data, copy = dtl.ensure_arraylike_for_datetimelike( + data, copy, cls_name="DatetimeArray" + ) if isinstance(data, DatetimeArray): inferred_freq = data.freq @@ -2314,13 +2256,6 @@ def maybe_convert_dtype(data, copy: bool, tz: tzinfo | None = None): "Passing PeriodDtype data is invalid. Use `data.to_timestamp()` instead" ) - elif is_categorical_dtype(data.dtype): - # GH#18664 preserve tz in going DTI->Categorical->DTI - # TODO: cases where we need to do another pass through this func, - # e.g. the categories are timedelta64s - data = data.categories.take(data.codes, fill_value=NaT)._values - copy = False - elif is_extension_array_dtype(data.dtype) and not is_datetime64tz_dtype(data.dtype): # TODO: We have no tests for these data = np.array(data, dtype=np.object_) diff --git a/pandas/core/arrays/interval.py b/pandas/core/arrays/interval.py index eecf1dff4dd48..6469dccf6e2d5 100644 --- a/pandas/core/arrays/interval.py +++ b/pandas/core/arrays/interval.py @@ -8,12 +8,14 @@ import textwrap from typing import ( TYPE_CHECKING, + Literal, Sequence, TypeVar, Union, cast, overload, ) +import warnings import numpy as np @@ -21,17 +23,16 @@ from pandas._libs import lib from pandas._libs.interval import ( - VALID_CLOSED, + VALID_INCLUSIVE, Interval, IntervalMixin, - _warning_interval, intervals_to_interval_bounds, ) from pandas._libs.missing import NA from pandas._typing import ( ArrayLike, Dtype, - IntervalClosedType, + IntervalInclusiveType, NpDtype, PositionalIndexer, ScalarIndexer, @@ -42,8 +43,10 @@ from pandas.errors import IntCastingNaNError from pandas.util._decorators import ( Appender, + deprecate_kwarg, deprecate_nonkeyword_arguments, ) +from pandas.util._exceptions import find_stack_level from pandas.core.dtypes.cast import LossySetitemError from pandas.core.dtypes.common import ( @@ -96,7 +99,10 @@ ) if TYPE_CHECKING: - from pandas import Index + from pandas import ( + Index, + Series, + ) IntervalArrayT = TypeVar("IntervalArrayT", bound="IntervalArray") @@ -124,7 +130,7 @@ Array-like containing Interval objects from which to build the %(klass)s. inclusive : {'left', 'right', 'both', 'neither'}, default 'right' - Whether the intervals are closed on the left-side, right-side, both or + Whether the intervals are inclusive on the left-side, right-side, both or neither. dtype : dtype or None, default None If None, dtype will be inferred. 
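The interval changes below rename the `closed` keyword/attribute to `inclusive` throughout, adding `deprecate_kwarg` shims for the old spelling. Assuming the renamed API exactly as written in this diff (released pandas versions may still spell it `closed`), construction looks like:

```python
import pandas as pd

# hypothetical usage against the renamed keyword from this changeset
arr = pd.arrays.IntervalArray.from_breaks([0, 1, 2, 3], inclusive="left")
print(arr.inclusive)   # 'left' -- intervals are [0, 1), [1, 2), [2, 3)
```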
@@ -153,6 +159,7 @@ contains overlaps set_closed +set_inclusive to_tuples %(extra_methods)s\ @@ -178,7 +185,8 @@ _interval_shared_docs["class"] % { "klass": "IntervalArray", - "summary": "Pandas array for interval data that are closed on the same side.", + "summary": "Pandas array for interval data that are inclusive on the same " + "side.", "versionadded": "0.24.0", "name": "", "extra_attributes": "", @@ -204,10 +212,13 @@ } ) class IntervalArray(IntervalMixin, ExtensionArray): - ndim = 1 can_hold_na = True _na_value = _fill_value = np.nan + @property + def ndim(self) -> Literal[1]: + return 1 + # To make mypy recognize the fields _left: np.ndarray _right: np.ndarray @@ -216,16 +227,15 @@ class IntervalArray(IntervalMixin, ExtensionArray): # --------------------------------------------------------------------- # Constructors + @deprecate_kwarg(old_arg_name="closed", new_arg_name="inclusive") def __new__( cls: type[IntervalArrayT], data, - inclusive: str | None = None, - closed: None | lib.NoDefault = lib.no_default, + inclusive: IntervalInclusiveType | None = None, dtype: Dtype | None = None, copy: bool = False, verify_integrity: bool = True, ): - inclusive, closed = _warning_interval(inclusive, closed) data = extract_array(data, extract_numpy=True) @@ -245,13 +255,13 @@ def __new__( # might need to convert empty or purely na data data = _maybe_convert_platform_interval(data) - left, right, infer_closed = intervals_to_interval_bounds( - data, validate_closed=inclusive is None + left, right, infer_inclusive = intervals_to_interval_bounds( + data, validate_inclusive=inclusive is None ) if left.dtype == object: left = lib.maybe_convert_objects(left) right = lib.maybe_convert_objects(right) - inclusive = inclusive or infer_closed + inclusive = inclusive or infer_inclusive return cls._simple_new( left, @@ -263,24 +273,22 @@ def __new__( ) @classmethod + @deprecate_kwarg(old_arg_name="closed", new_arg_name="inclusive") def _simple_new( cls: type[IntervalArrayT], left, right, - inclusive=None, - closed: None | lib.NoDefault = lib.no_default, + inclusive: IntervalInclusiveType | None = None, copy: bool = False, dtype: Dtype | None = None, verify_integrity: bool = True, ) -> IntervalArrayT: result = IntervalMixin.__new__(cls) - inclusive, closed = _warning_interval(inclusive, closed) - if inclusive is None and isinstance(dtype, IntervalDtype): inclusive = dtype.inclusive - inclusive = inclusive or "both" + inclusive = inclusive or "right" left = ensure_index(left, copy=copy) right = ensure_index(right, copy=copy) @@ -382,7 +390,7 @@ def _from_factorized( breaks : array-like (1-dimensional) Left and right bounds for each interval. inclusive : {'left', 'right', 'both', 'neither'}, default 'right' - Whether the intervals are closed on the left-side, right-side, both + Whether the intervals are inclusive on the left-side, right-side, both or neither. copy : bool, default False Copy the data. @@ -420,13 +428,17 @@ def _from_factorized( ), } ) + @deprecate_kwarg(old_arg_name="closed", new_arg_name="inclusive") def from_breaks( cls: type[IntervalArrayT], breaks, - inclusive="both", + inclusive: IntervalInclusiveType | None = None, copy: bool = False, dtype: Dtype | None = None, ) -> IntervalArrayT: + if inclusive is None: + inclusive = "right" + breaks = _maybe_convert_platform_interval(breaks) return cls.from_arrays( @@ -444,7 +456,7 @@ def from_breaks( right : array-like (1-dimensional) Right bounds for each interval. 
inclusive : {'left', 'right', 'both', 'neither'}, default 'right' - Whether the intervals are closed on the left-side, right-side, both + Whether the intervals are inclusive on the left-side, right-side, both or neither. copy : bool, default False Copy the data. @@ -497,14 +509,19 @@ def from_breaks( ), } ) + @deprecate_kwarg(old_arg_name="closed", new_arg_name="inclusive") def from_arrays( cls: type[IntervalArrayT], left, right, - inclusive="both", + inclusive: IntervalInclusiveType | None = None, copy: bool = False, dtype: Dtype | None = None, ) -> IntervalArrayT: + + if inclusive is None: + inclusive = "right" + left = _maybe_convert_platform_interval(left) right = _maybe_convert_platform_interval(right) @@ -526,7 +543,7 @@ def from_arrays( data : array-like (1-dimensional) Array of tuples. inclusive : {'left', 'right', 'both', 'neither'}, default 'right' - Whether the intervals are closed on the left-side, right-side, both + Whether the intervals are inclusive on the left-side, right-side, both or neither. copy : bool, default False By-default copy the data, this is compat only and ignored. @@ -566,13 +583,17 @@ def from_arrays( ), } ) + @deprecate_kwarg(old_arg_name="closed", new_arg_name="inclusive") def from_tuples( cls: type[IntervalArrayT], data, - inclusive="both", + inclusive: IntervalInclusiveType | None = None, copy: bool = False, dtype: Dtype | None = None, ) -> IntervalArrayT: + if inclusive is None: + inclusive = "right" + if len(data): left, right = [], [] else: @@ -609,7 +630,7 @@ def _validate(self): * left and right have the same missing values * left is always below right """ - if self.inclusive not in VALID_CLOSED: + if self.inclusive not in VALID_INCLUSIVE: msg = f"invalid option for 'inclusive': {self.inclusive}" raise ValueError(msg) if len(self._left) != len(self._right): @@ -692,7 +713,7 @@ def __getitem__( raise ValueError("multi-dimensional indexing not allowed") return self._shallow_copy(left, right) - def __setitem__(self, key, value): + def __setitem__(self, key, value) -> None: value_left, value_right = self._validate_setitem_value(value) key = check_array_indexer(self, key) @@ -725,7 +746,7 @@ def _cmp_method(self, other, op): # for categorical defer to categories for dtype other_dtype = other.categories.dtype - # extract intervals if we have interval categories with matching closed + # extract intervals if we have interval categories with matching inclusive if is_interval_dtype(other_dtype): if self.inclusive != other.categories.inclusive: return invalid_comparison(self, other, op) @@ -734,7 +755,7 @@ def _cmp_method(self, other, op): other.codes, allow_fill=True, fill_value=other.categories._na_value ) - # interval-like -> need same closed and matching endpoints + # interval-like -> need same inclusive and matching endpoints if is_interval_dtype(other_dtype): if self.inclusive != other.inclusive: return invalid_comparison(self, other, op) @@ -821,7 +842,7 @@ def argsort( ascending=ascending, kind=kind, na_position=na_position, **kwargs ) - def min(self, *, axis: int | None = None, skipna: bool = True): + def min(self, *, axis: int | None = None, skipna: bool = True) -> IntervalOrNA: nv.validate_minmax_axis(axis, self.ndim) if not len(self): @@ -838,7 +859,7 @@ def min(self, *, axis: int | None = None, skipna: bool = True): indexer = obj.argsort()[0] return obj[indexer] - def max(self, *, axis: int | None = None, skipna: bool = True): + def max(self, *, axis: int | None = None, skipna: bool = True) -> IntervalOrNA: nv.validate_minmax_axis(axis, self.ndim) 
if not len(self): @@ -974,7 +995,7 @@ def _concat_same_type( """ inclusive_set = {interval.inclusive for interval in to_concat} if len(inclusive_set) != 1: - raise ValueError("Intervals must all be closed on the same side.") + raise ValueError("Intervals must all be inclusive on the same side.") inclusive = inclusive_set.pop() left = np.concatenate([interval.left for interval in to_concat]) @@ -1100,7 +1121,7 @@ def _validate_listlike(self, value): # list-like of intervals try: array = IntervalArray(value) - self._check_closed_matches(array, name="value") + self._check_inclusive_matches(array, name="value") value_left, value_right = array.left, array.right except TypeError as err: # wrong type: not interval or NA @@ -1120,7 +1141,7 @@ def _validate_listlike(self, value): def _validate_scalar(self, value): if isinstance(value, Interval): - self._check_closed_matches(value, name="value") + self._check_inclusive_matches(value, name="value") left, right = value.left, value.right # TODO: check subdtype match like _validate_setitem_value? elif is_valid_na_for_dtype(value, self.left.dtype): @@ -1146,7 +1167,7 @@ def _validate_setitem_value(self, value): elif isinstance(value, Interval): # scalar - self._check_closed_matches(value, name="value") + self._check_inclusive_matches(value, name="value") value_left, value_right = value.left, value.right self.left._validate_fill_value(value_left) self.left._validate_fill_value(value_right) @@ -1156,7 +1177,7 @@ def _validate_setitem_value(self, value): return value_left, value_right - def value_counts(self, dropna: bool = True): + def value_counts(self, dropna: bool = True) -> Series: """ Returns a Series containing counts of each interval. @@ -1332,7 +1353,7 @@ def overlaps(self, other): msg = f"`other` must be Interval-like, got {type(other).__name__}" raise TypeError(msg) - # equality is okay if both endpoints are closed (overlap at a point) + # equality is okay if both endpoints are inclusive (overlap at a point) op1 = le if (self.closed_left and other.closed_right) else lt op2 = le if (other.closed_left and self.closed_right) else lt @@ -1344,11 +1365,24 @@ def overlaps(self, other): # --------------------------------------------------------------------- @property - def inclusive(self) -> IntervalClosedType: + def inclusive(self) -> IntervalInclusiveType: + """ + Whether the intervals are inclusive on the left-side, right-side, both or + neither. + """ + return self.dtype.inclusive + + @property + def closed(self) -> IntervalInclusiveType: """ Whether the intervals are closed on the left-side, right-side, both or neither. """ + warnings.warn( + "Attribute `closed` is deprecated in favor of `inclusive`.", + FutureWarning, + stacklevel=find_stack_level(), + ) return self.dtype.inclusive _interval_shared_docs["set_closed"] = textwrap.dedent( @@ -1356,9 +1390,11 @@ def inclusive(self) -> IntervalClosedType: Return an %(klass)s identical to the current one, but closed on the specified side. + .. deprecated:: 1.5.0 + Parameters ---------- - inclusive : {'left', 'right', 'both', 'neither'} + closed : {'left', 'right', 'both', 'neither'} Whether the intervals are closed on the left-side, right-side, both or neither. @@ -1392,9 +1428,62 @@ def inclusive(self) -> IntervalClosedType: } ) def set_closed( - self: IntervalArrayT, inclusive: IntervalClosedType + self: IntervalArrayT, closed: IntervalInclusiveType + ) -> IntervalArrayT: + warnings.warn( + "set_closed is deprecated and will be removed in a future version. 
" + "Use set_inclusive instead.", + FutureWarning, + stacklevel=find_stack_level(), + ) + return self.set_inclusive(closed) + + _interval_shared_docs["set_inclusive"] = textwrap.dedent( + """ + Return an %(klass)s identical to the current one, but closed on the + specified side. + + .. versionadded:: 1.5 + + Parameters + ---------- + inclusive : {'left', 'right', 'both', 'neither'} + Whether the intervals are closed on the left-side, right-side, both + or neither. + + Returns + ------- + new_index : %(klass)s + + %(examples)s\ + """ + ) + + @Appender( + _interval_shared_docs["set_inclusive"] + % { + "klass": "IntervalArray", + "examples": textwrap.dedent( + """\ + Examples + -------- + >>> index = pd.arrays.IntervalArray.from_breaks(range(4), "right") + >>> index + + [(0, 1], (1, 2], (2, 3]] + Length: 3, dtype: interval[int64, right] + >>> index.set_inclusive('both') + + [[0, 1], [1, 2], [2, 3]] + Length: 3, dtype: interval[int64, both] + """ + ), + } + ) + def set_inclusive( + self: IntervalArrayT, inclusive: IntervalInclusiveType ) -> IntervalArrayT: - if inclusive not in VALID_CLOSED: + if inclusive not in VALID_INCLUSIVE: msg = f"invalid option for 'inclusive': {inclusive}" raise ValueError(msg) @@ -1665,12 +1754,13 @@ def isin(self, values) -> np.ndarray: # complex128 ndarray is much more performant. left = self._combined.view("complex128") right = values._combined.view("complex128") - # Argument 1 to "in1d" has incompatible type "Union[ExtensionArray, - # ndarray[Any, Any], ndarray[Any, dtype[Any]]]"; expected - # "Union[_SupportsArray[dtype[Any]], _NestedSequence[_SupportsArray[ - # dtype[Any]]], bool, int, float, complex, str, bytes, - # _NestedSequence[Union[bool, int, float, complex, str, bytes]]]" - # [arg-type] + # error: Argument 1 to "in1d" has incompatible type + # "Union[ExtensionArray, ndarray[Any, Any], + # ndarray[Any, dtype[Any]]]"; expected + # "Union[_SupportsArray[dtype[Any]], + # _NestedSequence[_SupportsArray[dtype[Any]]], bool, + # int, float, complex, str, bytes, _NestedSequence[ + # Union[bool, int, float, complex, str, bytes]]]" return np.in1d(left, right) # type: ignore[arg-type] elif needs_i8_conversion(self.left.dtype) ^ needs_i8_conversion( diff --git a/pandas/core/arrays/masked.py b/pandas/core/arrays/masked.py index 3616e3512c6fe..128c7e44f5075 100644 --- a/pandas/core/arrays/masked.py +++ b/pandas/core/arrays/masked.py @@ -322,13 +322,13 @@ def round(self, decimals: int = 0, *args, **kwargs): def __invert__(self: BaseMaskedArrayT) -> BaseMaskedArrayT: return type(self)(~self._data, self._mask.copy()) - def __neg__(self): + def __neg__(self: BaseMaskedArrayT) -> BaseMaskedArrayT: return type(self)(-self._data, self._mask.copy()) - def __pos__(self): + def __pos__(self: BaseMaskedArrayT) -> BaseMaskedArrayT: return self.copy() - def __abs__(self): + def __abs__(self: BaseMaskedArrayT) -> BaseMaskedArrayT: return type(self)(abs(self._data), self._mask.copy()) # ------------------------------------------------------------------ @@ -869,7 +869,16 @@ def searchsorted( return self._data.searchsorted(value, side=side, sorter=sorter) @doc(ExtensionArray.factorize) - def factorize(self, na_sentinel: int = -1) -> tuple[np.ndarray, ExtensionArray]: + def factorize( + self, + na_sentinel: int | lib.NoDefault = lib.no_default, + use_na_sentinel: bool | lib.NoDefault = lib.no_default, + ) -> tuple[np.ndarray, ExtensionArray]: + resolved_na_sentinel = algos.resolve_na_sentinel(na_sentinel, use_na_sentinel) + if resolved_na_sentinel is None: + raise 
NotImplementedError("Encoding NaN values is not yet implemented") + else: + na_sentinel = resolved_na_sentinel arr = self._data mask = self._mask @@ -936,9 +945,9 @@ def value_counts(self, dropna: bool = True) -> Series: index = index.astype(self.dtype) mask = np.zeros(len(counts), dtype="bool") - counts = IntegerArray(counts, mask) + counts_array = IntegerArray(counts, mask) - return Series(counts, index=index) + return Series(counts_array, index=index) @doc(ExtensionArray.equals) def equals(self, other) -> bool: @@ -1151,10 +1160,11 @@ def any(self, *, skipna: bool = True, **kwargs): nv.validate_any((), kwargs) values = self._data.copy() - # Argument 3 to "putmask" has incompatible type "object"; expected - # "Union[_SupportsArray[dtype[Any]], _NestedSequence[ - # _SupportsArray[dtype[Any]]], bool, int, float, complex, str, bytes, _Nested - # Sequence[Union[bool, int, float, complex, str, bytes]]]" [arg-type] + # error: Argument 3 to "putmask" has incompatible type "object"; + # expected "Union[_SupportsArray[dtype[Any]], + # _NestedSequence[_SupportsArray[dtype[Any]]], + # bool, int, float, complex, str, bytes, + # _NestedSequence[Union[bool, int, float, complex, str, bytes]]]" np.putmask(values, self._mask, self._falsey_value) # type: ignore[arg-type] result = values.any() if skipna: @@ -1231,10 +1241,11 @@ def all(self, *, skipna: bool = True, **kwargs): nv.validate_all((), kwargs) values = self._data.copy() - # Argument 3 to "putmask" has incompatible type "object"; expected - # "Union[_SupportsArray[dtype[Any]], _NestedSequence[ - # _SupportsArray[dtype[Any]]], bool, int, float, complex, str, bytes, _Neste - # dSequence[Union[bool, int, float, complex, str, bytes]]]" [arg-type] + # error: Argument 3 to "putmask" has incompatible type "object"; + # expected "Union[_SupportsArray[dtype[Any]], + # _NestedSequence[_SupportsArray[dtype[Any]]], + # bool, int, float, complex, str, bytes, + # _NestedSequence[Union[bool, int, float, complex, str, bytes]]]" np.putmask(values, self._mask, self._truthy_value) # type: ignore[arg-type] result = values.all() diff --git a/pandas/core/arrays/numeric.py b/pandas/core/arrays/numeric.py index cdffd57df9a84..b32cbdcba1853 100644 --- a/pandas/core/arrays/numeric.py +++ b/pandas/core/arrays/numeric.py @@ -5,6 +5,7 @@ TYPE_CHECKING, Any, Callable, + Mapping, TypeVar, ) @@ -113,11 +114,11 @@ def __from_arrow__( return array_class._concat_same_type(results) @classmethod - def _str_to_dtype_mapping(cls): + def _str_to_dtype_mapping(cls) -> Mapping[str, NumericDtype]: raise AbstractMethodError(cls) @classmethod - def _standardize_dtype(cls, dtype) -> NumericDtype: + def _standardize_dtype(cls, dtype: NumericDtype | str | np.dtype) -> NumericDtype: """ Convert a string representation or a numpy dtype to NumericDtype. 
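The masked-array `factorize` above now accepts both `na_sentinel` and `use_na_sentinel` and resolves them via `algos.resolve_na_sentinel`; actually encoding NA with its own code is still unimplemented, so NA keeps receiving the sentinel. For example:

```python
import pandas as pd

arr = pd.array([1, 2, None, 1], dtype="Int64")
codes, uniques = arr.factorize()
print(codes)    # [ 0  1 -1  0] -- pd.NA is encoded with the default sentinel -1
print(uniques)  # <IntegerArray> [1, 2]
```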
""" @@ -126,7 +127,7 @@ def _standardize_dtype(cls, dtype) -> NumericDtype: # https://github.com/numpy/numpy/pull/7476 dtype = dtype.lower() - if not issubclass(type(dtype), cls): + if not isinstance(dtype, NumericDtype): mapping = cls._str_to_dtype_mapping() try: dtype = mapping[str(np.dtype(dtype))] diff --git a/pandas/core/arrays/period.py b/pandas/core/arrays/period.py index b6d21cd9dac54..2d676f94c6a64 100644 --- a/pandas/core/arrays/period.py +++ b/pandas/core/arrays/period.py @@ -8,6 +8,8 @@ Callable, Literal, Sequence, + TypeVar, + overload, ) import numpy as np @@ -22,6 +24,7 @@ astype_overflowsafe, delta_to_nanoseconds, dt64arr_to_periodarr as c_dt64arr_to_periodarr, + get_unit_from_dtype, iNaT, parsing, period as libperiod, @@ -69,10 +72,7 @@ ABCSeries, ABCTimedeltaArray, ) -from pandas.core.dtypes.missing import ( - isna, - notna, -) +from pandas.core.dtypes.missing import notna import pandas.core.algorithms as algos from pandas.core.arrays import datetimelike as dtl @@ -91,6 +91,8 @@ TimedeltaArray, ) +BaseOffsetT = TypeVar("BaseOffsetT", bound=BaseOffset) + _shared_doc_kwargs = { "klass": "PeriodArray", @@ -166,12 +168,15 @@ class PeriodArray(dtl.DatelikeOps, libperiod.PeriodMixin): # array priority higher than numpy scalars __array_priority__ = 1000 _typ = "periodarray" # ABCPeriodArray - _scalar_type = Period _internal_fill_value = np.int64(iNaT) _recognized_scalars = (Period,) _is_recognized_dtype = is_period_dtype _infer_matches = ("period",) + @property + def _scalar_type(self) -> type[Period]: + return Period + # Names others delegate to us _other_ops: list[str] = [] _bool_ops: list[str] = ["is_leap_year"] @@ -642,11 +647,14 @@ def _format_native_types( """ values = self.astype(object) + # Create the formatter function if date_format: formatter = lambda per: per.strftime(date_format) else: + # Uses `_Period.str` which in turn uses `format_period` formatter = lambda per: str(per) + # Apply the formatter to all values in the array, possibly with a mask if self._hasna: mask = self._isnan values[mask] = na_rep @@ -733,8 +741,6 @@ def _addsub_int_array_or_scalar( if op is operator.sub: other = -other res_values = algos.checked_add_with_arr(self.asi8, other, arr_mask=self._isnan) - res_values = res_values.view("i8") - np.putmask(res_values, self._isnan, iNaT) return type(self)(res_values, freq=self.freq) def _add_offset(self, other: BaseOffset): @@ -783,20 +789,30 @@ def _add_timedelta_arraylike( ------- result : ndarray[int64] """ - if not isinstance(self.freq, Tick): + freq = self.freq + if not isinstance(freq, Tick): # We cannot add timedelta-like to non-tick PeriodArray raise TypeError( f"Cannot add or subtract timedelta64[ns] dtype from {self.dtype}" ) - if not np.all(isna(other)): - delta = self._check_timedeltalike_freq_compat(other) - else: - # all-NaT TimedeltaIndex is equivalent to a single scalar td64 NaT - return self + np.timedelta64("NaT") + dtype = np.dtype(f"m8[{freq._td64_unit}]") + + try: + delta = astype_overflowsafe( + np.asarray(other), dtype=dtype, copy=False, round_ok=False + ) + except ValueError as err: + # TODO: not actually a great exception message in this case + raise raise_on_incompatible(self, other) from err - ordinals = self._addsub_int_array_or_scalar(delta, operator.add).asi8 - return type(self)(ordinals, dtype=self.dtype) + b_mask = np.isnat(delta) + + res_values = algos.checked_add_with_arr( + self.asi8, delta.view("i8"), arr_mask=self._isnan, b_mask=b_mask + ) + np.putmask(res_values, self._isnan | b_mask, iNaT) + 
return type(self)(res_values, freq=self.freq) def _check_timedeltalike_freq_compat(self, other): """ @@ -971,7 +987,19 @@ def period_array( return PeriodArray._from_sequence(data, dtype=dtype) -def validate_dtype_freq(dtype, freq): +@overload +def validate_dtype_freq(dtype, freq: BaseOffsetT) -> BaseOffsetT: + ... + + +@overload +def validate_dtype_freq(dtype, freq: timedelta | str | None) -> BaseOffset: + ... + + +def validate_dtype_freq( + dtype, freq: BaseOffsetT | timedelta | str | None +) -> BaseOffsetT: """ If both a dtype and a freq are available, ensure they match. If only dtype is available, extract the implied freq. @@ -991,7 +1019,10 @@ def validate_dtype_freq(dtype, freq): IncompatibleFrequency : mismatch between dtype and freq """ if freq is not None: - freq = to_offset(freq) + # error: Incompatible types in assignment (expression has type + # "BaseOffset", variable has type "Union[BaseOffsetT, timedelta, + # str, None]") + freq = to_offset(freq) # type: ignore[assignment] if dtype is not None: dtype = pandas_dtype(dtype) @@ -1001,10 +1032,14 @@ def validate_dtype_freq(dtype, freq): freq = dtype.freq elif freq != dtype.freq: raise IncompatibleFrequency("specified freq and dtype are different") - return freq + # error: Incompatible return value type (got "Union[BaseOffset, Any, None]", + # expected "BaseOffset") + return freq # type: ignore[return-value] -def dt64arr_to_periodarr(data, freq, tz=None): +def dt64arr_to_periodarr( + data, freq, tz=None +) -> tuple[npt.NDArray[np.int64], BaseOffset]: """ Convert an datetime-like array to values Period ordinals. @@ -1024,7 +1059,7 @@ def dt64arr_to_periodarr(data, freq, tz=None): used. """ - if data.dtype != np.dtype("M8[ns]"): + if not isinstance(data.dtype, np.dtype) or data.dtype.kind != "M": raise ValueError(f"Wrong dtype: {data.dtype}") if freq is None: @@ -1036,9 +1071,10 @@ def dt64arr_to_periodarr(data, freq, tz=None): elif isinstance(data, (ABCIndex, ABCSeries)): data = data._values + reso = get_unit_from_dtype(data.dtype) freq = Period._maybe_convert_freq(freq) base = freq._period_dtype_code - return c_dt64arr_to_periodarr(data.view("i8"), base, tz), freq + return c_dt64arr_to_periodarr(data.view("i8"), base, tz, reso=reso), freq def _get_ordinal_range(start, end, periods, freq, mult=1): diff --git a/pandas/core/arrays/sparse/__init__.py b/pandas/core/arrays/sparse/__init__.py index 18294ead0329d..56dbc6df54fc9 100644 --- a/pandas/core/arrays/sparse/__init__.py +++ b/pandas/core/arrays/sparse/__init__.py @@ -1,5 +1,3 @@ -# flake8: noqa: F401 - from pandas.core.arrays.sparse.accessor import ( SparseAccessor, SparseFrameAccessor, @@ -11,3 +9,13 @@ make_sparse_index, ) from pandas.core.arrays.sparse.dtype import SparseDtype + +__all__ = [ + "BlockIndex", + "IntIndex", + "make_sparse_index", + "SparseAccessor", + "SparseArray", + "SparseDtype", + "SparseFrameAccessor", +] diff --git a/pandas/core/arrays/sparse/accessor.py b/pandas/core/arrays/sparse/accessor.py index 41af7d4ccd506..80713a6fca323 100644 --- a/pandas/core/arrays/sparse/accessor.py +++ b/pandas/core/arrays/sparse/accessor.py @@ -1,4 +1,7 @@ """Sparse accessor""" +from __future__ import annotations + +from typing import TYPE_CHECKING import numpy as np @@ -13,6 +16,12 @@ from pandas.core.arrays.sparse.array import SparseArray from pandas.core.arrays.sparse.dtype import SparseDtype +if TYPE_CHECKING: + from pandas import ( + DataFrame, + Series, + ) + class BaseAccessor: _validation_msg = "Can only use the '.sparse' accessor with Sparse data." 
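The rewritten `PeriodArray._add_timedelta_arraylike` above casts the timedelta operand to the freq's own `m8` unit with `astype_overflowsafe(..., round_ok=False)`, so offsets that are not exact multiples of the freq raise `IncompatibleFrequency`, while NaT entries propagate through `b_mask`. Roughly, the observable behavior:

```python
import pandas as pd

pi = pd.period_range("2022-01-01", periods=3, freq="D")
tdi = pd.TimedeltaIndex(["1 days", "2 days", pd.NaT])
print(pi + tdi)   # PeriodIndex(['2022-01-02', '2022-01-04', 'NaT'], dtype='period[D]')
# an offset that is not a whole number of days, e.g. "1 hour", raises IncompatibleFrequency
```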
@@ -49,7 +58,7 @@ def _delegate_method(self, name, *args, **kwargs): raise ValueError @classmethod - def from_coo(cls, A, dense_index=False): + def from_coo(cls, A, dense_index=False) -> Series: """ Create a Series with sparse values from a scipy.sparse.coo_matrix. @@ -180,7 +189,7 @@ def to_coo(self, row_levels=(0,), column_levels=(1,), sort_labels=False): ) return A, rows, columns - def to_dense(self): + def to_dense(self) -> Series: """ Convert a Series from sparse values to dense. @@ -228,7 +237,7 @@ def _validate(self, data): raise AttributeError(self._validation_msg) @classmethod - def from_spmatrix(cls, data, index=None, columns=None): + def from_spmatrix(cls, data, index=None, columns=None) -> DataFrame: """ Create a new DataFrame from a scipy sparse matrix. @@ -284,7 +293,7 @@ def from_spmatrix(cls, data, index=None, columns=None): arrays, columns=columns, index=index, verify_integrity=False ) - def to_dense(self): + def to_dense(self) -> DataFrame: """ Convert a DataFrame with sparse values to dense. diff --git a/pandas/core/arrays/sparse/array.py b/pandas/core/arrays/sparse/array.py index 427bf50ca7424..b547446603853 100644 --- a/pandas/core/arrays/sparse/array.py +++ b/pandas/core/arrays/sparse/array.py @@ -821,7 +821,7 @@ def shift(self: SparseArrayT, periods: int = 1, fill_value=None) -> SparseArrayT def _first_fill_value_loc(self): """ - Get the location of the first missing value. + Get the location of the first fill value. Returns ------- @@ -834,27 +834,43 @@ def _first_fill_value_loc(self): if not len(indices) or indices[0] > 0: return 0 - diff = indices[1:] - indices[:-1] - return np.searchsorted(diff, 2) + 1 + # a number larger than 1 should be appended to + # the last in case of fill value only appears + # in the tail of array + diff = np.r_[np.diff(indices), 2] + return indices[(diff > 1).argmax()] + 1 def unique(self: SparseArrayT) -> SparseArrayT: uniques = algos.unique(self.sp_values) - fill_loc = self._first_fill_value_loc() - if fill_loc >= 0: - uniques = np.insert(uniques, fill_loc, self.fill_value) + if len(self.sp_values) != len(self): + fill_loc = self._first_fill_value_loc() + # Inorder to align the behavior of pd.unique or + # pd.Series.unique, we should keep the original + # order, here we use unique again to find the + # insertion place. Since the length of sp_values + # is not large, maybe minor performance hurt + # is worthwhile to the correctness. + insert_loc = len(algos.unique(self.sp_values[:fill_loc])) + uniques = np.insert(uniques, insert_loc, self.fill_value) return type(self)._from_sequence(uniques, dtype=self.dtype) def _values_for_factorize(self): # Still override this for hash_pandas_object return np.asarray(self), self.fill_value - def factorize(self, na_sentinel: int = -1) -> tuple[np.ndarray, SparseArray]: + def factorize( + self, + na_sentinel: int | lib.NoDefault = lib.no_default, + use_na_sentinel: bool | lib.NoDefault = lib.no_default, + ) -> tuple[np.ndarray, SparseArray]: # Currently, ExtensionArray.factorize -> Tuple[ndarray, EA] # The sparsity on this is backwards from what Sparse would want. Want # ExtensionArray.factorize -> Tuple[EA, EA] # Given that we have to return a dense array of codes, why bother # implementing an efficient factorize? 
- codes, uniques = algos.factorize(np.asarray(self), na_sentinel=na_sentinel) + codes, uniques = algos.factorize( + np.asarray(self), na_sentinel=na_sentinel, use_na_sentinel=use_na_sentinel + ) uniques_sp = SparseArray(uniques, dtype=self.dtype) return codes, uniques_sp @@ -883,12 +899,20 @@ def value_counts(self, dropna: bool = True) -> Series: if mask.any(): counts[mask] += fcounts else: - keys = np.insert(keys, 0, self.fill_value) + # error: Argument 1 to "insert" has incompatible type "Union[ + # ExtensionArray,ndarray[Any, Any]]"; expected "Union[ + # _SupportsArray[dtype[Any]], Sequence[_SupportsArray[dtype + # [Any]]], Sequence[Sequence[_SupportsArray[dtype[Any]]]], + # Sequence[Sequence[Sequence[_SupportsArray[dtype[Any]]]]], Sequence + # [Sequence[Sequence[Sequence[_SupportsArray[dtype[Any]]]]]]]" + keys = np.insert(keys, 0, self.fill_value) # type: ignore[arg-type] counts = np.insert(counts, 0, fcounts) if not isinstance(keys, ABCIndex): - keys = Index(keys) - return Series(counts, index=keys) + index = Index(keys) + else: + index = keys + return Series(counts, index=index) def _quantile(self, qs: npt.NDArray[np.float64], interpolation: str): @@ -944,14 +968,15 @@ def __getitem__( if is_integer(key): return self._get_val_at(key) elif isinstance(key, tuple): - # Invalid index type "Tuple[Union[int, ellipsis], ...]" for - # "ndarray[Any, Any]"; expected type "Union[SupportsIndex, - # _SupportsArray[dtype[Union[bool_, integer[Any]]]], _NestedSequence[_Su - # pportsArray[dtype[Union[bool_, integer[Any]]]]], - # _NestedSequence[Union[bool, int]], Tuple[Union[SupportsIndex, - # _SupportsArray[dtype[Union[bool_, integer[Any]]]], - # _NestedSequence[_SupportsArray[dtype[Union[bool_, integer[Any]]]]], _N - # estedSequence[Union[bool, int]]], ...]]" [index] + # error: Invalid index type "Tuple[Union[int, ellipsis], ...]" + # for "ndarray[Any, Any]"; expected type + # "Union[SupportsIndex, _SupportsArray[dtype[Union[bool_, + # integer[Any]]]], _NestedSequence[_SupportsArray[dtype[ + # Union[bool_, integer[Any]]]]], _NestedSequence[Union[ + # bool, int]], Tuple[Union[SupportsIndex, _SupportsArray[ + # dtype[Union[bool_, integer[Any]]]], _NestedSequence[ + # _SupportsArray[dtype[Union[bool_, integer[Any]]]]], + # _NestedSequence[Union[bool, int]]], ...]]" data_slice = self.to_dense()[key] # type: ignore[index] elif isinstance(key, slice): @@ -1192,8 +1217,9 @@ def _concat_same_type( data = np.concatenate(values) indices_arr = np.concatenate(indices) - # Argument 2 to "IntIndex" has incompatible type "ndarray[Any, - # dtype[signedinteger[_32Bit]]]"; expected "Sequence[int]" + # error: Argument 2 to "IntIndex" has incompatible type + # "ndarray[Any, dtype[signedinteger[_32Bit]]]"; + # expected "Sequence[int]" sp_index = IntIndex(length, indices_arr) # type: ignore[arg-type] else: @@ -1379,12 +1405,12 @@ def _where(self, mask, value): # ------------------------------------------------------------------------ # IO # ------------------------------------------------------------------------ - def __setstate__(self, state): + def __setstate__(self, state) -> None: """Necessary for making this object picklable""" if isinstance(state, tuple): # Compat for pandas < 0.24.0 nd_state, (fill_value, sp_index) = state - # Need type annotation for "sparse_values" [var-annotated] + # error: Need type annotation for "sparse_values" sparse_values = np.array([]) # type: ignore[var-annotated] sparse_values.__setstate__(nd_state) @@ -1394,7 +1420,7 @@ def __setstate__(self, state): else: self.__dict__.update(state) 
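The `_first_fill_value_loc`/`unique` rework above is meant to make `SparseArray.unique` follow `pd.unique`'s order-of-appearance semantics, even when the fill value only occurs in the tail of the array. A hedged illustration of the intended behavior on a build that includes this change:

```python
import pandas as pd
from pandas.arrays import SparseArray

arr = SparseArray([1, 2, 0, 0], fill_value=0)
print(list(arr.unique()))                      # [1, 2, 0] -- fill value at first appearance
print(list(pd.Series([1, 2, 0, 0]).unique()))  # [1, 2, 0] -- matches the dense result
```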
- def nonzero(self): + def nonzero(self) -> tuple[npt.NDArray[np.int32]]: if self.fill_value == 0: return (self.sp_index.indices,) else: diff --git a/pandas/core/arrays/sparse/dtype.py b/pandas/core/arrays/sparse/dtype.py index b6bb5faeebdee..eaed6257736ba 100644 --- a/pandas/core/arrays/sparse/dtype.py +++ b/pandas/core/arrays/sparse/dtype.py @@ -99,7 +99,7 @@ def __init__(self, dtype: Dtype = np.float64, fill_value: Any = None) -> None: self._fill_value = fill_value self._check_fill_value() - def __hash__(self): + def __hash__(self) -> int: # Python3 doesn't inherit __hash__ when a base class overrides # __eq__, so we explicitly do it here. return super().__hash__() @@ -179,7 +179,7 @@ def _is_boolean(self) -> bool: return is_bool_dtype(self.subtype) @property - def kind(self): + def kind(self) -> str: """ The sparse kind. Either 'integer', or 'block'. """ @@ -194,7 +194,7 @@ def subtype(self): return self._dtype @property - def name(self): + def name(self) -> str: return f"Sparse[{self.subtype.name}, {repr(self.fill_value)}]" def __repr__(self) -> str: diff --git a/pandas/core/arrays/string_.py b/pandas/core/arrays/string_.py index 45683d83a1303..c68ffec600c8a 100644 --- a/pandas/core/arrays/string_.py +++ b/pandas/core/arrays/string_.py @@ -14,6 +14,7 @@ from pandas._typing import ( Dtype, Scalar, + npt, type_t, ) from pandas.compat import pa_version_under1p01 @@ -51,6 +52,8 @@ if TYPE_CHECKING: import pyarrow + from pandas import Series + @register_extension_dtype class StringDtype(StorageExtensionDtype): @@ -88,8 +91,11 @@ class StringDtype(StorageExtensionDtype): name = "string" - #: StringDtype.na_value uses pandas.NA - na_value = libmissing.NA + #: StringDtype().na_value uses pandas.NA + @property + def na_value(self) -> libmissing.NAType: + return libmissing.NA + _metadata = ("storage",) def __init__(self, storage=None) -> None: @@ -333,13 +339,11 @@ def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy=False): na_values = scalars._mask result = scalars._data result = lib.ensure_string_array(result, copy=copy, convert_na_value=False) - result[na_values] = StringDtype.na_value + result[na_values] = libmissing.NA else: - # convert non-na-likes to str, and nan-likes to StringDtype.na_value - result = lib.ensure_string_array( - scalars, na_value=StringDtype.na_value, copy=copy - ) + # convert non-na-likes to str, and nan-likes to StringDtype().na_value + result = lib.ensure_string_array(scalars, na_value=libmissing.NA, copy=copy) # Manually creating new array avoids the validation step in the __init__, so is # faster. Refactor need for validation? @@ -394,7 +398,7 @@ def __setitem__(self, key, value): # validate new items if scalar_value: if isna(value): - value = StringDtype.na_value + value = libmissing.NA elif not isinstance(value, str): raise ValueError( f"Cannot set non-string value '{value}' into a StringArray." 
@@ -407,6 +411,12 @@ def __setitem__(self, key, value): super().__setitem__(key, value) + def _putmask(self, mask: npt.NDArray[np.bool_], value) -> None: + # the super() method NDArrayBackedExtensionArray._putmask uses + # np.putmask which doesn't properly handle None/pd.NA, so using the + # base class implementation that uses __setitem__ + ExtensionArray._putmask(self, mask, value) + def astype(self, dtype, copy: bool = True): dtype = pandas_dtype(dtype) @@ -461,7 +471,7 @@ def max(self, axis=None, skipna: bool = True, **kwargs) -> Scalar: ) return self._wrap_reduction_result(axis, result) - def value_counts(self, dropna: bool = True): + def value_counts(self, dropna: bool = True) -> Series: from pandas import value_counts result = value_counts(self._ndarray, dropna=dropna).astype("Int64") @@ -495,7 +505,7 @@ def _cmp_method(self, other, op): if op.__name__ in ops.ARITHMETIC_BINOPS: result = np.empty_like(self._ndarray, dtype="object") - result[mask] = StringDtype.na_value + result[mask] = libmissing.NA result[valid] = op(self._ndarray[valid], other) return StringArray(result) else: @@ -510,7 +520,7 @@ def _cmp_method(self, other, op): # String methods interface # error: Incompatible types in assignment (expression has type "NAType", # base class "PandasArray" defined the type as "float") - _str_na_value = StringDtype.na_value # type: ignore[assignment] + _str_na_value = libmissing.NA # type: ignore[assignment] def _str_map( self, f, na_value=None, dtype: Dtype | None = None, convert: bool = True diff --git a/pandas/core/arrays/string_arrow.py b/pandas/core/arrays/string_arrow.py index a07f748fa0c8c..bb2fefabd6ae5 100644 --- a/pandas/core/arrays/string_arrow.py +++ b/pandas/core/arrays/string_arrow.py @@ -34,7 +34,6 @@ ) from pandas.core.dtypes.missing import isna -from pandas.core.arraylike import OpsMixin from pandas.core.arrays.arrow import ArrowExtensionArray from pandas.core.arrays.boolean import BooleanDtype from pandas.core.arrays.integer import Int64Dtype @@ -51,15 +50,6 @@ from pandas.core.arrays.arrow._arrow_utils import fallback_performancewarning - ARROW_CMP_FUNCS = { - "eq": pc.equal, - "ne": pc.not_equal, - "lt": pc.less, - "gt": pc.greater, - "le": pc.less_equal, - "ge": pc.greater_equal, - } - ArrowStringScalarOrNAT = Union[str, libmissing.NAType] @@ -74,9 +64,7 @@ def _chk_pyarrow_available() -> None: # fallback for the ones that pyarrow doesn't yet support -class ArrowStringArray( - OpsMixin, ArrowExtensionArray, BaseStringArray, ObjectStringArrayMixin -): +class ArrowStringArray(ArrowExtensionArray, BaseStringArray, ObjectStringArrayMixin): """ Extension array for string data in a ``pyarrow.ChunkedArray``. 
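`StringDtype.na_value` becomes an instance-level property and the string array code now refers to `libmissing.NA` directly, but the user-facing missing value is unchanged. As a quick check:

```python
import pandas as pd

print(pd.StringDtype().na_value is pd.NA)   # True

s = pd.array(["a", None], dtype="string")
print(s[1] is pd.NA)                        # True -- missing entries are pd.NA, not np.nan
```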
@@ -190,33 +178,7 @@ def to_numpy( result[mask] = na_value return result - def _cmp_method(self, other, op): - from pandas.arrays import BooleanArray - - pc_func = ARROW_CMP_FUNCS[op.__name__] - if isinstance(other, ArrowStringArray): - result = pc_func(self._data, other._data) - elif isinstance(other, (np.ndarray, list)): - result = pc_func(self._data, other) - elif is_scalar(other): - try: - result = pc_func(self._data, pa.scalar(other)) - except (pa.lib.ArrowNotImplementedError, pa.lib.ArrowInvalid): - mask = isna(self) | isna(other) - valid = ~mask - result = np.zeros(len(self), dtype="bool") - result[valid] = op(np.array(self)[valid], other) - return BooleanArray(result, mask) - else: - return NotImplemented - - if pa_version_under2p0: - result = result.to_pandas().values - else: - result = result.to_numpy() - return BooleanArray._from_sequence(result) - - def insert(self, loc: int, item): + def insert(self, loc: int, item) -> ArrowStringArray: if not isinstance(item, str) and item is not libmissing.NA: raise TypeError("Scalar must be NA or str") return super().insert(loc, item) @@ -280,8 +242,9 @@ def astype(self, dtype, copy: bool = True): # ------------------------------------------------------------------------ # String methods interface - # error: Cannot determine type of 'na_value' - _str_na_value = StringDtype.na_value # type: ignore[has-type] + # error: Incompatible types in assignment (expression has type "NAType", + # base class "ObjectStringArrayMixin" defined the type as "float") + _str_na_value = libmissing.NA # type: ignore[assignment] def _str_map( self, f, na_value=None, dtype: Dtype | None = None, convert: bool = True diff --git a/pandas/core/arrays/timedeltas.py b/pandas/core/arrays/timedeltas.py index e08518a54fe6b..5f227cb45a65b 100644 --- a/pandas/core/arrays/timedeltas.py +++ b/pandas/core/arrays/timedeltas.py @@ -12,7 +12,6 @@ lib, tslibs, ) -from pandas._libs.arrays import NDArrayBacked from pandas._libs.tslibs import ( BaseOffset, NaT, @@ -21,6 +20,7 @@ Timedelta, astype_overflowsafe, iNaT, + periods_per_second, to_offset, ) from pandas._libs.tslibs.conversion import precision_from_unit @@ -50,21 +50,12 @@ is_timedelta64_dtype, pandas_dtype, ) -from pandas.core.dtypes.generic import ( - ABCCategorical, - ABCMultiIndex, -) from pandas.core.dtypes.missing import isna from pandas.core import nanops -from pandas.core.arrays import ( - ExtensionArray, - IntegerArray, - datetimelike as dtl, -) +from pandas.core.arrays import datetimelike as dtl from pandas.core.arrays._ranges import generate_regular_range import pandas.core.common as com -from pandas.core.construction import extract_array from pandas.core.ops.common import unpack_zerodim_and_defer if TYPE_CHECKING: @@ -119,12 +110,15 @@ class TimedeltaArray(dtl.TimelikeOps): """ _typ = "timedeltaarray" - _scalar_type = Timedelta _internal_fill_value = np.timedelta64("NaT", "ns") _recognized_scalars = (timedelta, np.timedelta64, Tick) _is_recognized_dtype = is_timedelta64_dtype _infer_matches = ("timedelta", "timedelta64") + @property + def _scalar_type(self) -> type[Timedelta]: + return Timedelta + __array_priority__ = 1000 # define my properties & methods for delegation _other_ops: list[str] = [] @@ -172,64 +166,14 @@ def dtype(self) -> np.dtype: # type: ignore[override] # Constructors _freq = None + _default_dtype = TD64NS_DTYPE # used in TimeLikeOps.__init__ - def __init__( - self, values, dtype=TD64NS_DTYPE, freq=lib.no_default, copy: bool = False - ) -> None: - values = extract_array(values, extract_numpy=True) 
- if isinstance(values, IntegerArray): - values = values.to_numpy("int64", na_value=tslibs.iNaT) - - inferred_freq = getattr(values, "_freq", None) - explicit_none = freq is None - freq = freq if freq is not lib.no_default else None - - if isinstance(values, type(self)): - if explicit_none: - # don't inherit from values - pass - elif freq is None: - freq = values.freq - elif freq and values.freq: - freq = to_offset(freq) - freq, _ = dtl.validate_inferred_freq(freq, values.freq, False) - - values = values._ndarray - - if not isinstance(values, np.ndarray): - raise ValueError( - f"Unexpected type '{type(values).__name__}'. 'values' must be a " - f"{type(self).__name__}, ndarray, or Series or Index " - "containing one of those." - ) - if values.ndim not in [1, 2]: - raise ValueError("Only 1-dimensional input arrays are supported.") - - if values.dtype == "i8": - # for compat with datetime/timedelta/period shared methods, - # we can sometimes get here with int64 values. These represent - # nanosecond UTC (or tz-naive) unix timestamps - values = values.view(TD64NS_DTYPE) - + @classmethod + def _validate_dtype(cls, values, dtype): + # used in TimeLikeOps.__init__ _validate_td64_dtype(values.dtype) dtype = _validate_td64_dtype(dtype) - - if freq == "infer": - raise ValueError( - f"Frequency inference not allowed in {type(self).__name__}.__init__. " - "Use 'pd.array()' instead." - ) - - if copy: - values = values.copy() - if freq: - freq = to_offset(freq) - - NDArrayBacked.__init__(self, values=values, dtype=dtype) - self._freq = freq - - if inferred_freq is None and freq is not None: - type(self)._validate_frequency(self, freq) + return dtype # error: Signature of "_simple_new" incompatible with supertype "NDArrayBacked" @classmethod @@ -431,6 +375,7 @@ def _format_native_types( ) -> npt.NDArray[np.object_]: from pandas.io.formats.format import get_format_timedelta64 + # Relies on TimeDelta._repr_base formatter = get_format_timedelta64(self._ndarray, na_rep) # equiv: np.array([formatter(x) for x in self._ndarray]) # but independent of dimension @@ -453,7 +398,7 @@ def __mul__(self, other) -> TimedeltaArray: freq = None if self.freq is not None and not isna(other): freq = self.freq * other - return type(self)(result, freq=freq) + return type(self)._simple_new(result, dtype=result.dtype, freq=freq) if not hasattr(other, "dtype"): # list, tuple @@ -467,13 +412,14 @@ def __mul__(self, other) -> TimedeltaArray: # this multiplication will succeed only if all elements of other # are int or float scalars, so we will end up with # timedelta64[ns]-dtyped result - result = [self[n] * other[n] for n in range(len(self))] + arr = self._ndarray + result = [arr[n] * other[n] for n in range(len(self))] result = np.array(result) - return type(self)(result) + return type(self)._simple_new(result, dtype=result.dtype) # numpy will accept float or int dtype, raise TypeError for others result = self._ndarray * other - return type(self)(result) + return type(self)._simple_new(result, dtype=result.dtype) __rmul__ = __mul__ @@ -501,7 +447,8 @@ def __truediv__(self, other): if self.freq is not None: # Tick division is not implemented, so operate on Timedelta freq = self.freq.delta / other - return type(self)(result, freq=freq) + freq = to_offset(freq) + return type(self)._simple_new(result, dtype=result.dtype, freq=freq) if not hasattr(other, "dtype"): # e.g. 
list, tuple @@ -517,6 +464,7 @@ def __truediv__(self, other): elif is_object_dtype(other.dtype): # We operate on raveled arrays to avoid problems in inference # on NaT + # TODO: tests with non-nano srav = self.ravel() orav = other.ravel() result_list = [srav[n] / orav[n] for n in range(len(srav))] @@ -543,7 +491,7 @@ def __truediv__(self, other): else: result = self._ndarray / other - return type(self)(result) + return type(self)._simple_new(result, dtype=result.dtype) @unpack_zerodim_and_defer("__rtruediv__") def __rtruediv__(self, other): @@ -815,10 +763,11 @@ def total_seconds(self) -> npt.NDArray[np.float64]: dtype='timedelta64[ns]', freq=None) >>> idx.total_seconds() - Float64Index([0.0, 86400.0, 172800.0, 259200.00000000003, 345600.0], + Float64Index([0.0, 86400.0, 172800.0, 259200.0, 345600.0], dtype='float64') """ - return self._maybe_mask_results(1e-9 * self.asi8, fill_value=None) + pps = periods_per_second(self._reso) + return self._maybe_mask_results(self.asi8 / pps, fill_value=None) def to_pytimedelta(self) -> npt.NDArray[np.object_]: """ @@ -829,7 +778,7 @@ def to_pytimedelta(self) -> npt.NDArray[np.object_]: ------- timedeltas : ndarray[object] """ - return tslibs.ints_to_pytimedelta(self._ndarray) + return ints_to_pytimedelta(self._ndarray) days = _field_accessor("days", "days", "Number of days for each element.") seconds = _field_accessor( @@ -931,26 +880,9 @@ def sequence_to_td64ns( if unit is not None: unit = parse_timedelta_unit(unit) - # Unwrap whatever we have into a np.ndarray - if not hasattr(data, "dtype"): - # e.g. list, tuple - if np.ndim(data) == 0: - # i.e. generator - data = list(data) - data = np.array(data, copy=False) - elif isinstance(data, ABCMultiIndex): - raise TypeError("Cannot create a TimedeltaArray from a MultiIndex.") - else: - data = extract_array(data, extract_numpy=True) - - if isinstance(data, IntegerArray): - data = data.to_numpy("int64", na_value=iNaT) - elif not isinstance(data, (np.ndarray, ExtensionArray)): - # GH#24539 e.g. xarray, dask object - data = np.asarray(data) - elif isinstance(data, ABCCategorical): - data = data.categories.take(data.codes, fill_value=NaT)._values - copy = False + data, copy = dtl.ensure_arraylike_for_datetimelike( + data, copy, cls_name="TimedeltaArray" + ) if isinstance(data, TimedeltaArray): inferred_freq = data.freq diff --git a/pandas/core/base.py b/pandas/core/base.py index b4c2c81ee666f..2fa3f57f950b5 100644 --- a/pandas/core/base.py +++ b/pandas/core/base.py @@ -81,7 +81,10 @@ NumpyValueArrayLike, ) - from pandas import Categorical + from pandas import ( + Categorical, + Series, + ) _shared_docs: dict[str, str] = {} @@ -161,7 +164,7 @@ def _freeze(self): object.__setattr__(self, "__frozen", True) # prevent adding any attribute via s.xxx.new_attribute = ... - def __setattr__(self, key: str, value): + def __setattr__(self, key: str, value) -> None: # _cache is used by a decorator # We need to check both 1.) cls.__dict__ and 2.) getattr(self, key) # because @@ -318,7 +321,7 @@ def __len__(self) -> int: raise AbstractMethodError(self) @property - def ndim(self) -> int: + def ndim(self) -> Literal[1]: """ Number of dimensions of the underlying data, by definition 1. 
""" @@ -765,7 +768,7 @@ def hasnans(self) -> bool: # has no attribute "any" return bool(isna(self).any()) # type: ignore[union-attr] - def isna(self): + def isna(self) -> npt.NDArray[np.bool_]: return isna(self._values) def _reduce( @@ -840,6 +843,10 @@ def _map_values(self, mapper, na_action=None): f"{na_action} was passed" ) raise ValueError(msg) + + if na_action == "ignore": + mapper = mapper[mapper.index.notna()] + # Since values were input this means we came from either # a dict or a series and mapper should be an index if is_categorical_dtype(self.dtype): @@ -890,7 +897,7 @@ def value_counts( ascending: bool = False, bins=None, dropna: bool = True, - ): + ) -> Series: """ Return a Series containing counts of unique values. @@ -983,10 +990,12 @@ def unique(self): if not isinstance(values, np.ndarray): result: ArrayLike = values.unique() - if self.dtype.kind in ["m", "M"] and isinstance(self, ABCSeries): - # GH#31182 Series._values returns EA, unpack for backward-compat - if getattr(self.dtype, "tz", None) is None: - result = np.asarray(result) + if ( + isinstance(self.dtype, np.dtype) and self.dtype.kind in ["m", "M"] + ) and isinstance(self, ABCSeries): + # GH#31182 Series._values returns EA + # unpack numpy datetime for backward-compat + result = np.asarray(result) else: result = unique1d(values) @@ -1136,8 +1145,15 @@ def _memory_usage(self, deep: bool = False) -> int: """ ), ) - def factorize(self, sort: bool = False, na_sentinel: int | None = -1): - return algorithms.factorize(self, sort=sort, na_sentinel=na_sentinel) + def factorize( + self, + sort: bool = False, + na_sentinel: int | lib.NoDefault = lib.no_default, + use_na_sentinel: bool | lib.NoDefault = lib.no_default, + ): + return algorithms.factorize( + self, sort=sort, na_sentinel=na_sentinel, use_na_sentinel=use_na_sentinel + ) _shared_docs[ "searchsorted" diff --git a/pandas/core/common.py b/pandas/core/common.py index 7225b26a910dd..980e7a79414ba 100644 --- a/pandas/core/common.py +++ b/pandas/core/common.py @@ -125,11 +125,11 @@ def is_bool_indexer(key: Any) -> bool: is_array_like(key) and is_extension_array_dtype(key.dtype) ): if key.dtype == np.object_: - key = np.asarray(key) + key_array = np.asarray(key) - if not lib.is_bool_array(key): + if not lib.is_bool_array(key_array): na_msg = "Cannot mask with non-boolean array containing NA / NaN values" - if lib.infer_dtype(key) == "boolean" and isna(key).any(): + if lib.infer_dtype(key_array) == "boolean" and isna(key_array).any(): # Don't raise on e.g. ["A", "B", np.nan], see # test_loc_getitem_list_of_labels_categoricalindex_with_na raise ValueError(na_msg) @@ -508,18 +508,14 @@ def get_rename_function(mapper): Returns a function that will map names/labels, dependent if mapper is a dict, Series or just a function. """ - if isinstance(mapper, (abc.Mapping, ABCSeries)): - def f(x): - if x in mapper: - return mapper[x] - else: - return x - - else: - f = mapper + def f(x): + if x in mapper: + return mapper[x] + else: + return x - return f + return f if isinstance(mapper, (abc.Mapping, ABCSeries)) else mapper def convert_to_list_like( @@ -557,7 +553,7 @@ def temp_setattr(obj, attr: str, value) -> Iterator[None]: setattr(obj, attr, old_value) -def require_length_match(data, index: Index): +def require_length_match(data, index: Index) -> None: """ Check the length of data matches the length of the index. 
""" @@ -669,7 +665,9 @@ def resolve_numeric_only(numeric_only: bool | None | lib.NoDefault) -> bool: return result -def deprecate_numeric_only_default(cls: type, name: str, deprecate_none: bool = False): +def deprecate_numeric_only_default( + cls: type, name: str, deprecate_none: bool = False +) -> None: """Emit FutureWarning message for deprecation of numeric_only. See GH#46560 for details on the deprecation. diff --git a/pandas/core/computation/check.py b/pandas/core/computation/check.py index 7be617de63a40..3221b158241f5 100644 --- a/pandas/core/computation/check.py +++ b/pandas/core/computation/check.py @@ -1,3 +1,5 @@ +from __future__ import annotations + from pandas.compat._optional import import_optional_dependency ne = import_optional_dependency("numexpr", errors="warn") diff --git a/pandas/core/computation/common.py b/pandas/core/computation/common.py index 8a9583c465f50..a1ac3dfa06ee0 100644 --- a/pandas/core/computation/common.py +++ b/pandas/core/computation/common.py @@ -1,3 +1,5 @@ +from __future__ import annotations + from functools import reduce import numpy as np @@ -5,7 +7,7 @@ from pandas._config import get_option -def ensure_decoded(s): +def ensure_decoded(s) -> str: """ If we have bytes, decode them to unicode. """ diff --git a/pandas/core/computation/expr.py b/pandas/core/computation/expr.py index ae55e61ab01a6..90824ce8d856f 100644 --- a/pandas/core/computation/expr.py +++ b/pandas/core/computation/expr.py @@ -18,6 +18,7 @@ import numpy as np from pandas.compat import PY39 +from pandas.errors import UndefinedVariableError import pandas.core.common as com from pandas.core.computation.ops import ( @@ -35,7 +36,6 @@ Op, Term, UnaryOp, - UndefinedVariableError, is_term, ) from pandas.core.computation.parsing import ( @@ -548,13 +548,13 @@ def visit_UnaryOp(self, node, **kwargs): def visit_Name(self, node, **kwargs): return self.term_type(node.id, self.env, **kwargs) - def visit_NameConstant(self, node, **kwargs): + def visit_NameConstant(self, node, **kwargs) -> Term: return self.const_type(node.value, self.env) - def visit_Num(self, node, **kwargs): + def visit_Num(self, node, **kwargs) -> Term: return self.const_type(node.n, self.env) - def visit_Constant(self, node, **kwargs): + def visit_Constant(self, node, **kwargs) -> Term: return self.const_type(node.n, self.env) def visit_Str(self, node, **kwargs): diff --git a/pandas/core/computation/expressions.py b/pandas/core/computation/expressions.py index 9e180f11c4211..e82bec47c6ac5 100644 --- a/pandas/core/computation/expressions.py +++ b/pandas/core/computation/expressions.py @@ -38,7 +38,7 @@ _MIN_ELEMENTS = 1_000_000 -def set_use_numexpr(v=True): +def set_use_numexpr(v=True) -> None: # set/unset to use numexpr global USE_NUMEXPR if NUMEXPR_INSTALLED: @@ -51,7 +51,7 @@ def set_use_numexpr(v=True): _where = _where_numexpr if USE_NUMEXPR else _where_standard -def set_numexpr_threads(n=None): +def set_numexpr_threads(n=None) -> None: # if we are using numexpr, set the threads to n # otherwise reset if NUMEXPR_INSTALLED and USE_NUMEXPR: diff --git a/pandas/core/computation/ops.py b/pandas/core/computation/ops.py index 9c54065de0353..db5f28e2ae6c1 100644 --- a/pandas/core/computation/ops.py +++ b/pandas/core/computation/ops.py @@ -65,20 +65,6 @@ LOCAL_TAG = "__pd_eval_local_" -class UndefinedVariableError(NameError): - """ - NameError subclass for local variables. 
- """ - - def __init__(self, name: str, is_local: bool | None = None) -> None: - base_msg = f"{repr(name)} is not defined" - if is_local: - msg = f"local variable {base_msg}" - else: - msg = f"name {base_msg}" - super().__init__(msg) - - class Term: def __new__(cls, name, env, side=None, encoding=None): klass = Constant if not isinstance(name, str) else cls @@ -108,7 +94,7 @@ def __repr__(self) -> str: def __call__(self, *args, **kwargs): return self.value - def evaluate(self, *args, **kwargs): + def evaluate(self, *args, **kwargs) -> Term: return self def _resolve_name(self): @@ -121,7 +107,7 @@ def _resolve_name(self): ) return res - def update(self, value): + def update(self, value) -> None: """ search order for local (i.e., @variable) variables: @@ -461,7 +447,7 @@ def evaluate(self, env, engine: str, parser, term_type, eval_in_python): name = env.add_tmp(res) return term_type(name, env=env) - def convert_values(self): + def convert_values(self) -> None: """ Convert datetimes to a comparable value in an expression. """ @@ -578,7 +564,7 @@ def __init__(self, op: str, operand) -> None: f"valid operators are {UNARY_OPS_SYMS}" ) from err - def __call__(self, env): + def __call__(self, env) -> MathCall: operand = self.operand(env) # error: Cannot call function of unknown type return self.func(operand) # type: ignore[operator] diff --git a/pandas/core/computation/pytables.py b/pandas/core/computation/pytables.py index 91a8505fad8c5..29af322ba0b42 100644 --- a/pandas/core/computation/pytables.py +++ b/pandas/core/computation/pytables.py @@ -13,6 +13,7 @@ ) from pandas._typing import npt from pandas.compat.chainmap import DeepChainMap +from pandas.errors import UndefinedVariableError from pandas.core.dtypes.common import is_list_like @@ -24,10 +25,7 @@ ) from pandas.core.computation.common import ensure_decoded from pandas.core.computation.expr import BaseExprVisitor -from pandas.core.computation.ops import ( - UndefinedVariableError, - is_term, -) +from pandas.core.computation.ops import is_term from pandas.core.construction import extract_array from pandas.core.indexes.base import Index diff --git a/pandas/core/computation/scope.py b/pandas/core/computation/scope.py index 52169b034603d..5188b44618b4d 100644 --- a/pandas/core/computation/scope.py +++ b/pandas/core/computation/scope.py @@ -15,6 +15,7 @@ from pandas._libs.tslibs import Timestamp from pandas.compat.chainmap import DeepChainMap +from pandas.errors import UndefinedVariableError def ensure_scope( @@ -207,9 +208,6 @@ def resolve(self, key: str, is_local: bool): # e.g., df[df > 0] return self.temps[key] except KeyError as err: - # runtime import because ops imports from scope - from pandas.core.computation.ops import UndefinedVariableError - raise UndefinedVariableError(key, is_local) from err def swapkey(self, old_key: str, new_key: str, new_value=None) -> None: diff --git a/pandas/core/config_init.py b/pandas/core/config_init.py index 47cf64ba24022..8c1a3fece255e 100644 --- a/pandas/core/config_init.py +++ b/pandas/core/config_init.py @@ -9,6 +9,8 @@ module is imported, register them here rather than in the module. 
""" +from __future__ import annotations + import os from typing import Callable import warnings @@ -37,7 +39,7 @@ """ -def use_bottleneck_cb(key): +def use_bottleneck_cb(key) -> None: from pandas.core import nanops nanops.set_use_bottleneck(cf.get_option(key)) @@ -51,7 +53,7 @@ def use_bottleneck_cb(key): """ -def use_numexpr_cb(key): +def use_numexpr_cb(key) -> None: from pandas.core.computation import expressions expressions.set_use_numexpr(cf.get_option(key)) @@ -65,7 +67,7 @@ def use_numexpr_cb(key): """ -def use_numba_cb(key): +def use_numba_cb(key) -> None: from pandas.core.util import numba_ numba_.set_use_numba(cf.get_option(key)) @@ -329,7 +331,7 @@ def use_numba_cb(key): """ -def table_schema_cb(key): +def table_schema_cb(key) -> None: from pandas.io.formats.printing import enable_data_resource_formatter enable_data_resource_formatter(cf.get_option(key)) @@ -500,7 +502,7 @@ def _deprecate_negative_int_max_colwidth(key): # or we'll hit circular deps. -def use_inf_as_na_cb(key): +def use_inf_as_na_cb(key) -> None: from pandas.core.dtypes.missing import _use_inf_as_na _use_inf_as_na(key) @@ -720,7 +722,7 @@ def use_inf_as_na_cb(key): """ -def register_plotting_backend_cb(key): +def register_plotting_backend_cb(key) -> None: if key == "matplotlib": # We defer matplotlib validation, since it's the default return @@ -746,7 +748,7 @@ def register_plotting_backend_cb(key): """ -def register_converter_cb(key): +def register_converter_cb(key) -> None: from pandas.plotting import ( deregister_matplotlib_converters, register_matplotlib_converters, diff --git a/pandas/core/construction.py b/pandas/core/construction.py index 8d26284a5ce45..4b63d492ec1dd 100644 --- a/pandas/core/construction.py +++ b/pandas/core/construction.py @@ -556,7 +556,10 @@ def sanitize_array( if dtype is not None and is_float_dtype(data.dtype) and is_integer_dtype(dtype): # possibility of nan -> garbage try: - subarr = _try_cast(data, dtype, copy, True) + # GH 47391 numpy > 1.24 will raise a RuntimeError for nan -> int + # casting aligning with IntCastingNaNError below + with np.errstate(invalid="ignore"): + subarr = _try_cast(data, dtype, copy, True) except IntCastingNaNError: warnings.warn( "In a future version, passing float-dtype values containing NaN " diff --git a/pandas/core/dtypes/api.py b/pandas/core/dtypes/api.py index bb6bfda183802..e6a59bf12d7cc 100644 --- a/pandas/core/dtypes/api.py +++ b/pandas/core/dtypes/api.py @@ -1,5 +1,3 @@ -# flake8: noqa:F401 - from pandas.core.dtypes.common import ( is_array_like, is_bool, @@ -43,3 +41,47 @@ is_unsigned_integer_dtype, pandas_dtype, ) + +__all__ = [ + "is_array_like", + "is_bool", + "is_bool_dtype", + "is_categorical", + "is_categorical_dtype", + "is_complex", + "is_complex_dtype", + "is_datetime64_any_dtype", + "is_datetime64_dtype", + "is_datetime64_ns_dtype", + "is_datetime64tz_dtype", + "is_dict_like", + "is_dtype_equal", + "is_extension_array_dtype", + "is_extension_type", + "is_file_like", + "is_float", + "is_float_dtype", + "is_hashable", + "is_int64_dtype", + "is_integer", + "is_integer_dtype", + "is_interval", + "is_interval_dtype", + "is_iterator", + "is_list_like", + "is_named_tuple", + "is_number", + "is_numeric_dtype", + "is_object_dtype", + "is_period_dtype", + "is_re", + "is_re_compilable", + "is_scalar", + "is_signed_integer_dtype", + "is_sparse", + "is_string_dtype", + "is_timedelta64_dtype", + "is_timedelta64_ns_dtype", + "is_unsigned_integer_dtype", + "pandas_dtype", +] diff --git a/pandas/core/dtypes/astype.py b/pandas/core/dtypes/astype.py 
index 8d1427976276c..7fb58468746a8 100644 --- a/pandas/core/dtypes/astype.py +++ b/pandas/core/dtypes/astype.py @@ -15,6 +15,7 @@ import numpy as np from pandas._libs import lib +from pandas._libs.tslibs import is_unitless from pandas._libs.tslibs.timedeltas import array_to_timedelta64 from pandas._typing import ( ArrayLike, @@ -280,6 +281,20 @@ def astype_array_safe( # Ensure we don't end up with a PandasArray dtype = dtype.numpy_dtype + if ( + is_datetime64_dtype(values.dtype) + # need to do np.dtype check instead of is_datetime64_dtype + # otherwise pyright complains + and isinstance(dtype, np.dtype) + and dtype.kind == "M" + and not is_unitless(dtype) + and not is_dtype_equal(dtype, values.dtype) + ): + # unit conversion, we would re-cast to nanosecond, so this is + # effectively just a copy (regardless of copy kwd) + # TODO(2.0): remove special-case + return values.copy() + try: new_values = astype_array(values, dtype, copy=copy) except (ValueError, TypeError): diff --git a/pandas/core/dtypes/base.py b/pandas/core/dtypes/base.py index f96a9ab4cfb43..5ec2aaab98ba1 100644 --- a/pandas/core/dtypes/base.py +++ b/pandas/core/dtypes/base.py @@ -400,7 +400,7 @@ class StorageExtensionDtype(ExtensionDtype): def __init__(self, storage=None) -> None: self.storage = storage - def __repr__(self): + def __repr__(self) -> str: return f"{self.name}[{self.storage}]" def __str__(self): diff --git a/pandas/core/dtypes/cast.py b/pandas/core/dtypes/cast.py index ed3f9ee525c9e..769656d1c4755 100644 --- a/pandas/core/dtypes/cast.py +++ b/pandas/core/dtypes/cast.py @@ -305,7 +305,7 @@ def maybe_downcast_to_dtype(result: ArrayLike, dtype: str | np.dtype) -> ArrayLi result = cast(np.ndarray, result) result = array_to_timedelta64(result) - elif dtype == "M8[ns]" and result.dtype == _dtype_obj: + elif dtype == np.dtype("M8[ns]") and result.dtype == _dtype_obj: return np.asarray(maybe_cast_to_datetime(result, dtype=dtype)) return result @@ -978,7 +978,7 @@ def maybe_upcast( return upcast_values, fill_value # type: ignore[return-value] -def invalidate_string_dtypes(dtype_set: set[DtypeObj]): +def invalidate_string_dtypes(dtype_set: set[DtypeObj]) -> None: """ Change string like dtypes to object for ``DataFrame.select_dtypes()``. 
@@ -995,7 +995,7 @@ def invalidate_string_dtypes(dtype_set: set[DtypeObj]): raise TypeError("string dtypes are not allowed, use 'object' instead") -def coerce_indexer_dtype(indexer, categories): +def coerce_indexer_dtype(indexer, categories) -> np.ndarray: """coerce the indexer input array to the smallest dtype possible""" length = len(categories) if length < _int8_max: @@ -1709,7 +1709,9 @@ def construct_1d_arraylike_from_scalar( value = _maybe_unbox_datetimelike_tz_deprecation(value, dtype) subarr = np.empty(length, dtype=dtype) - subarr.fill(value) + if length: + # GH 47391: numpy > 1.24 will raise filling np.nan into int dtypes + subarr.fill(value) return subarr diff --git a/pandas/core/dtypes/common.py b/pandas/core/dtypes/common.py index a192337daf59b..c10461b2fc7f8 100644 --- a/pandas/core/dtypes/common.py +++ b/pandas/core/dtypes/common.py @@ -36,7 +36,7 @@ ABCCategorical, ABCIndex, ) -from pandas.core.dtypes.inference import ( # noqa:F401 +from pandas.core.dtypes.inference import ( is_array_like, is_bool, is_complex, @@ -966,7 +966,9 @@ def is_datetime64_ns_dtype(arr_or_dtype) -> bool: tipo = get_dtype(arr_or_dtype.dtype) else: return False - return tipo == DT64NS_DTYPE or getattr(tipo, "base", None) == DT64NS_DTYPE + return tipo == DT64NS_DTYPE or ( + isinstance(tipo, DatetimeTZDtype) and tipo._unit == "ns" + ) def is_timedelta64_ns_dtype(arr_or_dtype) -> bool: @@ -1039,7 +1041,7 @@ def is_datetime_or_timedelta_dtype(arr_or_dtype) -> bool: # This exists to silence numpy deprecation warnings, see GH#29553 -def is_numeric_v_string_like(a: ArrayLike, b): +def is_numeric_v_string_like(a: ArrayLike, b) -> bool: """ Check if we are comparing a string-like object to a numeric ndarray. NumPy doesn't like to compare such objects, especially numeric arrays @@ -1088,7 +1090,7 @@ def is_numeric_v_string_like(a: ArrayLike, b): # This exists to silence numpy deprecation warnings, see GH#29553 -def is_datetimelike_v_numeric(a, b): +def is_datetimelike_v_numeric(a, b) -> bool: """ Check if we are comparing a datetime-like object to a numeric object. By "numeric," we mean an object that is either of an int or float dtype. 
@@ -1812,3 +1814,70 @@ def is_all_strings(value: ArrayLike) -> bool: elif isinstance(dtype, CategoricalDtype): return dtype.categories.inferred_type == "string" return dtype == "string" + + +__all__ = [ + "classes", + "classes_and_not_datetimelike", + "DT64NS_DTYPE", + "ensure_float", + "ensure_float64", + "ensure_python_int", + "ensure_str", + "get_dtype", + "infer_dtype_from_object", + "INT64_DTYPE", + "is_1d_only_ea_dtype", + "is_1d_only_ea_obj", + "is_all_strings", + "is_any_int_dtype", + "is_array_like", + "is_bool", + "is_bool_dtype", + "is_categorical", + "is_categorical_dtype", + "is_complex", + "is_complex_dtype", + "is_dataclass", + "is_datetime64_any_dtype", + "is_datetime64_dtype", + "is_datetime64_ns_dtype", + "is_datetime64tz_dtype", + "is_datetimelike_v_numeric", + "is_datetime_or_timedelta_dtype", + "is_decimal", + "is_dict_like", + "is_dtype_equal", + "is_ea_or_datetimelike_dtype", + "is_extension_array_dtype", + "is_extension_type", + "is_file_like", + "is_float_dtype", + "is_int64_dtype", + "is_integer_dtype", + "is_interval", + "is_interval_dtype", + "is_iterator", + "is_named_tuple", + "is_nested_list_like", + "is_number", + "is_numeric_dtype", + "is_numeric_v_string_like", + "is_object_dtype", + "is_period_dtype", + "is_re", + "is_re_compilable", + "is_scipy_sparse", + "is_sequence", + "is_signed_integer_dtype", + "is_sparse", + "is_string_dtype", + "is_string_or_object_np_dtype", + "is_timedelta64_dtype", + "is_timedelta64_ns_dtype", + "is_unsigned_integer_dtype", + "needs_i8_conversion", + "pandas_dtype", + "TD64NS_DTYPE", + "validate_all_hashable", +] diff --git a/pandas/core/dtypes/concat.py b/pandas/core/dtypes/concat.py index c61e9aaa59362..059df4009e2f6 100644 --- a/pandas/core/dtypes/concat.py +++ b/pandas/core/dtypes/concat.py @@ -1,6 +1,8 @@ """ Utility functions related to concat. """ +from __future__ import annotations + from typing import ( TYPE_CHECKING, cast, @@ -32,6 +34,7 @@ ) if TYPE_CHECKING: + from pandas.core.arrays import Categorical from pandas.core.arrays.sparse import SparseArray @@ -156,7 +159,7 @@ def is_nonempty(x) -> bool: def union_categoricals( to_union, sort_categories: bool = False, ignore_order: bool = False -): +) -> Categorical: """ Combine list-like of Categorical-like, unioning categories. diff --git a/pandas/core/dtypes/dtypes.py b/pandas/core/dtypes/dtypes.py index 32594854f49ae..99b2082d409a9 100644 --- a/pandas/core/dtypes/dtypes.py +++ b/pandas/core/dtypes/dtypes.py @@ -10,6 +10,7 @@ MutableMapping, cast, ) +import warnings import numpy as np import pytz @@ -26,6 +27,7 @@ from pandas._libs.tslibs import ( BaseOffset, NaT, + NaTType, Period, Timestamp, dtypes, @@ -36,10 +38,12 @@ from pandas._typing import ( Dtype, DtypeObj, + IntervalInclusiveType, Ordered, npt, type_t, ) +from pandas.util._exceptions import find_stack_level from pandas.core.dtypes.base import ( ExtensionDtype, @@ -672,11 +676,14 @@ class DatetimeTZDtype(PandasExtensionDtype): kind: str_type = "M" num = 101 base = np.dtype("M8[ns]") # TODO: depend on reso? 
- na_value = NaT _metadata = ("unit", "tz") _match = re.compile(r"(datetime64|M8)\[(?P<unit>.+), (?P<tz>.+)\]") _cache_dtypes: dict[str_type, PandasExtensionDtype] = {} + @property + def na_value(self) -> NaTType: + return NaT + @cache_readonly def str(self): return f"|M8[{self._unit}]" @@ -943,7 +950,7 @@ def name(self) -> str_type: return f"period[{self.freq.freqstr}]" @property - def na_value(self): + def na_value(self) -> NaTType: return NaT def __hash__(self) -> int: @@ -970,7 +977,7 @@ def __eq__(self, other: Any) -> bool: def __ne__(self, other: Any) -> bool: return not self.__eq__(other) - def __setstate__(self, state): + def __setstate__(self, state) -> None: # for pickle compat. __getstate__ is defined in the # PandasExtensionDtype superclass and uses the public properties to # pickle -> need to set the settable private ones here (see GH26067) @@ -1032,7 +1039,9 @@ def __from_arrow__( for arr in chunks: data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=np.dtype(np.int64)) parr = PeriodArray(data.copy(), freq=self.freq, copy=False) - parr[~mask] = NaT + # error: Invalid index type "ndarray[Any, dtype[bool_]]" for "PeriodArray"; + # expected type "Union[int, Sequence[int], Sequence[bool], slice]" + parr[~mask] = NaT # type: ignore[index] results.append(parr) if not results: @@ -1086,7 +1095,7 @@ class IntervalDtype(PandasExtensionDtype): def __new__( cls, subtype=None, - inclusive: str_type | None = None, + inclusive: IntervalInclusiveType | None = None, closed: None | lib.NoDefault = lib.no_default, ): from pandas.core.dtypes.common import ( @@ -1118,7 +1127,7 @@ def __new__( # generally for pickle compat u = object.__new__(cls) u._subtype = None - u._closed = inclusive + u._inclusive = inclusive return u elif isinstance(subtype, str) and subtype.lower() == "interval": subtype = None @@ -1135,7 +1144,11 @@ def __new__( "'inclusive' keyword does not match value " "specified in dtype string" ) - inclusive = gd["inclusive"] + # Incompatible types in assignment (expression has type + # "Union[str, Any]", variable has type + # "Optional[Union[Literal['left', 'right'], + # Literal['both', 'neither']]]") + inclusive = gd["inclusive"] # type: ignore[assignment] try: subtype = pandas_dtype(subtype) @@ -1156,7 +1169,7 @@ def __new__( except KeyError: u = object.__new__(cls) u._subtype = subtype - u._closed = inclusive + u._inclusive = inclusive cls._cache_dtypes[key] = u return u @@ -1174,7 +1187,16 @@ def _can_hold_na(self) -> bool: @property def inclusive(self): - return self._closed + return self._inclusive + + @property + def closed(self): + warnings.warn( + "Attribute `closed` is deprecated in favor of `inclusive`.", + FutureWarning, + stacklevel=find_stack_level(), + ) + return self._inclusive @property def subtype(self): @@ -1219,7 +1241,7 @@ def construct_from_string(cls, string: str_type) -> IntervalDtype: raise TypeError(msg) @property - def type(self): + def type(self) -> type[Interval]: return Interval def __str__(self) -> str_type: @@ -1249,13 +1271,13 @@ def __eq__(self, other: Any) -> bool: return is_dtype_equal(self.subtype, other.subtype) - def __setstate__(self, state): + def __setstate__(self, state) -> None: # for pickle compat.
__get_state__ is defined in the # PandasExtensionDtype superclass and uses the public properties to # pickle -> need to set the settable private ones here (see GH26067) self._subtype = state["subtype"] # backward-compat older pickles won't have "inclusive" key - self._closed = state.pop("inclusive", None) + self._inclusive = state.pop("inclusive", None) @classmethod def is_dtype(cls, dtype: object) -> bool: @@ -1431,7 +1453,9 @@ class BaseMaskedDtype(ExtensionDtype): base = None type: type - na_value = libmissing.NA + @property + def na_value(self) -> libmissing.NAType: + return libmissing.NA @cache_readonly def numpy_dtype(self) -> np.dtype: diff --git a/pandas/core/dtypes/inference.py b/pandas/core/dtypes/inference.py index f47aeb16e19f1..893e4a9be58ef 100644 --- a/pandas/core/dtypes/inference.py +++ b/pandas/core/dtypes/inference.py @@ -1,5 +1,7 @@ """ basic inference routines """ +from __future__ import annotations + from collections import abc from numbers import Number import re diff --git a/pandas/core/dtypes/missing.py b/pandas/core/dtypes/missing.py index 4316109da1cbb..e809e761ebbb2 100644 --- a/pandas/core/dtypes/missing.py +++ b/pandas/core/dtypes/missing.py @@ -18,6 +18,7 @@ import pandas._libs.missing as libmissing from pandas._libs.tslibs import ( NaT, + Period, iNaT, ) @@ -40,6 +41,7 @@ ) from pandas.core.dtypes.dtypes import ( CategoricalDtype, + DatetimeTZDtype, ExtensionDtype, IntervalDtype, PeriodDtype, @@ -739,3 +741,44 @@ def is_valid_na_for_dtype(obj, dtype: DtypeObj) -> bool: # fallback, default to allowing NaN, None, NA, NaT return not isinstance(obj, (np.datetime64, np.timedelta64, Decimal)) + + +def isna_all(arr: ArrayLike) -> bool: + """ + Optimized equivalent to isna(arr).all() + """ + total_len = len(arr) + + # Usually it's enough to check but a small fraction of values to see if + # a block is NOT null, chunks should help in such cases. 
+ # parameters 1000 and 40 were chosen arbitrarily + chunk_len = max(total_len // 40, 1000) + + dtype = arr.dtype + if dtype.kind == "f" and isinstance(dtype, np.dtype): + checker = nan_checker + + elif ( + (isinstance(dtype, np.dtype) and dtype.kind in ["m", "M"]) + or isinstance(dtype, DatetimeTZDtype) + or dtype.type is Period + ): + # error: Incompatible types in assignment (expression has type + # "Callable[[Any], Any]", variable has type "ufunc") + checker = lambda x: np.asarray(x.view("i8")) == iNaT # type: ignore[assignment] + + else: + # error: Incompatible types in assignment (expression has type "Callable[[Any], + # Any]", variable has type "ufunc") + checker = lambda x: _isna_array( # type: ignore[assignment] + x, inf_as_na=INF_AS_NA + ) + + return all( + # error: Argument 1 to "__call__" of "ufunc" has incompatible type + # "Union[ExtensionArray, Any]"; expected "Union[Union[int, float, complex, str, + # bytes, generic], Sequence[Union[int, float, complex, str, bytes, generic]], + # Sequence[Sequence[Any]], _SupportsArray]" + checker(arr[i : i + chunk_len]).all() # type: ignore[arg-type] + for i in range(0, total_len, chunk_len) + ) diff --git a/pandas/core/exchange/buffer.py b/pandas/core/exchange/buffer.py index 098c596bff4cd..a3b05a0c5d24a 100644 --- a/pandas/core/exchange/buffer.py +++ b/pandas/core/exchange/buffer.py @@ -1,7 +1,4 @@ -from typing import ( - Optional, - Tuple, -) +from __future__ import annotations import numpy as np from packaging import version @@ -60,7 +57,7 @@ def __dlpack__(self): return self._x.__dlpack__() raise NotImplementedError("__dlpack__") - def __dlpack_device__(self) -> Tuple[DlpackDeviceType, Optional[int]]: + def __dlpack_device__(self) -> tuple[DlpackDeviceType, int | None]: """ Device type and device ID for where the data in the buffer resides. """ diff --git a/pandas/core/exchange/column.py b/pandas/core/exchange/column.py index ae24c5d295cc9..c2a1cfe766b22 100644 --- a/pandas/core/exchange/column.py +++ b/pandas/core/exchange/column.py @@ -1,7 +1,6 @@ -from typing import ( - Any, - Tuple, -) +from __future__ import annotations + +from typing import Any import numpy as np @@ -97,7 +96,7 @@ def offset(self) -> int: return 0 @cache_readonly - def dtype(self): + def dtype(self) -> tuple[DtypeKind, int, str, str]: dtype = self._col.dtype if is_categorical_dtype(dtype): @@ -126,7 +125,7 @@ def dtype(self): else: return self._dtype_from_pandasdtype(dtype) - def _dtype_from_pandasdtype(self, dtype) -> Tuple[DtypeKind, int, str, str]: + def _dtype_from_pandasdtype(self, dtype) -> tuple[DtypeKind, int, str, str]: """ See `self.dtype` for details. """ @@ -139,7 +138,7 @@ def _dtype_from_pandasdtype(self, dtype) -> Tuple[DtypeKind, int, str, str]: # Not a NumPy dtype. Check if it's a categorical maybe raise ValueError(f"Data type {dtype} not supported by exchange protocol") - return (kind, dtype.itemsize * 8, dtype_to_arrow_c_fmt(dtype), dtype.byteorder) + return kind, dtype.itemsize * 8, dtype_to_arrow_c_fmt(dtype), dtype.byteorder @property def describe_categorical(self): @@ -182,10 +181,10 @@ def null_count(self) -> int: """ Number of null elements. Should always be known. """ - return self._col.isna().sum() + return self._col.isna().sum().item() @property - def metadata(self): + def metadata(self) -> dict[str, pd.Index]: """ Store specific metadata of the column. 
""" @@ -197,7 +196,7 @@ def num_chunks(self) -> int: """ return 1 - def get_chunks(self, n_chunks=None): + def get_chunks(self, n_chunks: int | None = None): """ Return an iterator yielding the chunks. See `DataFrame.get_chunks` for details on ``n_chunks``. @@ -214,7 +213,7 @@ def get_chunks(self, n_chunks=None): else: yield self - def get_buffers(self): + def get_buffers(self) -> ColumnBuffers: """ Return a dictionary containing the underlying buffers. The returned dictionary has the following contents: @@ -253,7 +252,7 @@ def get_buffers(self): def _get_data_buffer( self, - ) -> Tuple[PandasBuffer, Any]: # Any is for self.dtype tuple + ) -> tuple[PandasBuffer, Any]: # Any is for self.dtype tuple """ Return the buffer containing the data and the buffer's associated dtype. """ @@ -296,7 +295,7 @@ def _get_data_buffer( return buffer, dtype - def _get_validity_buffer(self) -> Tuple[PandasBuffer, Any]: + def _get_validity_buffer(self) -> tuple[PandasBuffer, Any]: """ Return the buffer containing the mask values indicating missing data and the buffer's associated dtype. @@ -334,7 +333,7 @@ def _get_validity_buffer(self) -> Tuple[PandasBuffer, Any]: raise NoBufferPresent(msg) - def _get_offsets_buffer(self) -> Tuple[PandasBuffer, Any]: + def _get_offsets_buffer(self) -> tuple[PandasBuffer, Any]: """ Return the buffer containing the offset values for variable-size binary data (e.g., variable-length strings) and the buffer's associated dtype. diff --git a/pandas/core/exchange/dataframe.py b/pandas/core/exchange/dataframe.py index c8a89184b34c6..e5bb3811afed0 100644 --- a/pandas/core/exchange/dataframe.py +++ b/pandas/core/exchange/dataframe.py @@ -1,9 +1,15 @@ +from __future__ import annotations + from collections import abc +from typing import TYPE_CHECKING import pandas as pd from pandas.core.exchange.column import PandasColumn from pandas.core.exchange.dataframe_protocol import DataFrame as DataFrameXchg +if TYPE_CHECKING: + from pandas import Index + class PandasDataFrameXchg(DataFrameXchg): """ @@ -29,11 +35,13 @@ def __init__( self._nan_as_null = nan_as_null self._allow_copy = allow_copy - def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True): + def __dataframe__( + self, nan_as_null: bool = False, allow_copy: bool = True + ) -> PandasDataFrameXchg: return PandasDataFrameXchg(self._df, nan_as_null, allow_copy) @property - def metadata(self): + def metadata(self) -> dict[str, Index]: # `index` isn't a regular column, and the protocol doesn't support row # labels - so we export it as Pandas-specific metadata here. 
return {"pandas.index": self._df.index} @@ -47,7 +55,7 @@ def num_rows(self) -> int: def num_chunks(self) -> int: return 1 - def column_names(self): + def column_names(self) -> Index: return self._df.columns def get_column(self, i: int) -> PandasColumn: @@ -56,13 +64,13 @@ def get_column(self, i: int) -> PandasColumn: def get_column_by_name(self, name: str) -> PandasColumn: return PandasColumn(self._df[name], allow_copy=self._allow_copy) - def get_columns(self): + def get_columns(self) -> list[PandasColumn]: return [ PandasColumn(self._df[name], allow_copy=self._allow_copy) for name in self._df.columns ] - def select_columns(self, indices): + def select_columns(self, indices) -> PandasDataFrameXchg: if not isinstance(indices, abc.Sequence): raise ValueError("`indices` is not a sequence") if not isinstance(indices, list): @@ -72,7 +80,7 @@ def select_columns(self, indices): self._df.iloc[:, indices], self._nan_as_null, self._allow_copy ) - def select_columns_by_name(self, names): + def select_columns_by_name(self, names) -> PandasDataFrameXchg: if not isinstance(names, abc.Sequence): raise ValueError("`names` is not a sequence") if not isinstance(names, list): diff --git a/pandas/core/exchange/dataframe_protocol.py b/pandas/core/exchange/dataframe_protocol.py index ee2ae609e73f9..367b906332741 100644 --- a/pandas/core/exchange/dataframe_protocol.py +++ b/pandas/core/exchange/dataframe_protocol.py @@ -2,6 +2,8 @@ A verbatim copy (vendored) of the spec from https://github.com/data-apis/dataframe-api """ +from __future__ import annotations + from abc import ( ABC, abstractmethod, @@ -9,11 +11,8 @@ import enum from typing import ( Any, - Dict, Iterable, - Optional, Sequence, - Tuple, TypedDict, ) @@ -90,18 +89,18 @@ class ColumnNullType(enum.IntEnum): class ColumnBuffers(TypedDict): # first element is a buffer containing the column data; # second element is the data buffer's associated dtype - data: Tuple["Buffer", Any] + data: tuple[Buffer, Any] # first element is a buffer containing mask values indicating missing data; # second element is the mask value buffer's associated dtype. # None if the null representation is not a bit or byte mask - validity: Optional[Tuple["Buffer", Any]] + validity: tuple[Buffer, Any] | None # first element is a buffer containing the offset values for # variable-size binary data (e.g., variable-length strings); # second element is the offsets buffer's associated dtype. # None if the data buffer does not have an associated offsets buffer - offsets: Optional[Tuple["Buffer", Any]] + offsets: tuple[Buffer, Any] | None class CategoricalDescription(TypedDict): @@ -111,7 +110,7 @@ class CategoricalDescription(TypedDict): is_dictionary: bool # Python-level only (e.g. ``{int: str}``). # None if not a dictionary-style categorical. - mapping: Optional[dict] + mapping: dict | None class Buffer(ABC): @@ -161,7 +160,7 @@ def __dlpack__(self): raise NotImplementedError("__dlpack__") @abstractmethod - def __dlpack_device__(self) -> Tuple[DlpackDeviceType, Optional[int]]: + def __dlpack_device__(self) -> tuple[DlpackDeviceType, int | None]: """ Device type and device ID for where the data in the buffer resides. Uses device type codes matching DLPack. @@ -239,7 +238,7 @@ def offset(self) -> int: @property @abstractmethod - def dtype(self) -> Tuple[DtypeKind, int, str, str]: + def dtype(self) -> tuple[DtypeKind, int, str, str]: """ Dtype description as a tuple ``(kind, bit-width, format string, endianness)``. 
@@ -293,7 +292,7 @@ def describe_categorical(self) -> CategoricalDescription: @property @abstractmethod - def describe_null(self) -> Tuple[ColumnNullType, Any]: + def describe_null(self) -> tuple[ColumnNullType, Any]: """ Return the missing value (or "null") representation the column dtype uses, as a tuple ``(kind, value)``. @@ -306,7 +305,7 @@ def describe_null(self) -> Tuple[ColumnNullType, Any]: @property @abstractmethod - def null_count(self) -> Optional[int]: + def null_count(self) -> int | None: """ Number of null elements, if known. @@ -316,7 +315,7 @@ def null_count(self) -> Optional[int]: @property @abstractmethod - def metadata(self) -> Dict[str, Any]: + def metadata(self) -> dict[str, Any]: """ The metadata for the column. See `DataFrame.metadata` for more details. """ @@ -330,7 +329,7 @@ def num_chunks(self) -> int: pass @abstractmethod - def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable["Column"]: + def get_chunks(self, n_chunks: int | None = None) -> Iterable[Column]: """ Return an iterator yielding the chunks. @@ -395,7 +394,7 @@ def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True): @property @abstractmethod - def metadata(self) -> Dict[str, Any]: + def metadata(self) -> dict[str, Any]: """ The metadata for the data frame, as a dictionary with string keys. The contents of `metadata` may be anything, they are meant for a library @@ -415,7 +414,7 @@ def num_columns(self) -> int: pass @abstractmethod - def num_rows(self) -> Optional[int]: + def num_rows(self) -> int | None: # TODO: not happy with Optional, but need to flag it may be expensive # why include it if it may be None - what do we expect consumers # to do here? @@ -460,21 +459,21 @@ def get_columns(self) -> Iterable[Column]: pass @abstractmethod - def select_columns(self, indices: Sequence[int]) -> "DataFrame": + def select_columns(self, indices: Sequence[int]) -> DataFrame: """ Create a new DataFrame by selecting a subset of columns by index. """ pass @abstractmethod - def select_columns_by_name(self, names: Sequence[str]) -> "DataFrame": + def select_columns_by_name(self, names: Sequence[str]) -> DataFrame: """ Create a new DataFrame by selecting a subset of columns by name. """ pass @abstractmethod - def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable["DataFrame"]: + def get_chunks(self, n_chunks: int | None = None) -> Iterable[DataFrame]: """ Return an iterator yielding the chunks. diff --git a/pandas/core/exchange/from_dataframe.py b/pandas/core/exchange/from_dataframe.py index 805e63ac67f16..a33e47ba3b68e 100644 --- a/pandas/core/exchange/from_dataframe.py +++ b/pandas/core/exchange/from_dataframe.py @@ -1,13 +1,8 @@ +from __future__ import annotations + import ctypes import re -from typing import ( - Any, - Dict, - List, - Optional, - Tuple, - Union, -) +from typing import Any import numpy as np @@ -24,7 +19,7 @@ Endianness, ) -_NP_DTYPES: Dict[DtypeKind, Dict[int, Any]] = { +_NP_DTYPES: dict[DtypeKind, dict[int, Any]] = { DtypeKind.INT: {8: np.int8, 16: np.int16, 32: np.int32, 64: np.int64}, DtypeKind.UINT: {8: np.uint8, 16: np.uint16, 32: np.uint32, 64: np.uint64}, DtypeKind.FLOAT: {32: np.float32, 64: np.float64}, @@ -32,7 +27,7 @@ } -def from_dataframe(df, allow_copy=True): +def from_dataframe(df, allow_copy=True) -> pd.DataFrame: """ Build a ``pd.DataFrame`` from any DataFrame supporting the interchange protocol. 
@@ -108,7 +103,7 @@ def protocol_df_chunk_to_pandas(df: DataFrameXchg) -> pd.DataFrame: """ # We need a dict of columns here, with each column being a NumPy array (at # least for now, deal with non-NumPy dtypes later). - columns: Dict[str, Any] = {} + columns: dict[str, Any] = {} buffers = [] # hold on to buffers, keeps memory alive for name in df.column_names(): if not isinstance(name, str): @@ -140,7 +135,7 @@ def protocol_df_chunk_to_pandas(df: DataFrameXchg) -> pd.DataFrame: return pandas_df -def primitive_column_to_ndarray(col: Column) -> Tuple[np.ndarray, Any]: +def primitive_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]: """ Convert a column holding one of the primitive dtypes to a NumPy array. @@ -165,7 +160,7 @@ def primitive_column_to_ndarray(col: Column) -> Tuple[np.ndarray, Any]: return data, buffers -def categorical_column_to_series(col: Column) -> Tuple[pd.Series, Any]: +def categorical_column_to_series(col: Column) -> tuple[pd.Series, Any]: """ Convert a column holding categorical data to a pandas Series. @@ -205,7 +200,7 @@ def categorical_column_to_series(col: Column) -> Tuple[pd.Series, Any]: return data, buffers -def string_column_to_ndarray(col: Column) -> Tuple[np.ndarray, Any]: +def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]: """ Convert a column holding string data to a NumPy array. @@ -268,7 +263,7 @@ def string_column_to_ndarray(col: Column) -> Tuple[np.ndarray, Any]: null_pos = ~null_pos # Assemble the strings from the code units - str_list: List[Union[None, float, str]] = [None] * col.size + str_list: list[None | float | str] = [None] * col.size for i in range(col.size): # Check for missing values if null_pos is not None and null_pos[i]: @@ -324,7 +319,7 @@ def parse_datetime_format_str(format_str, data): raise NotImplementedError(f"DateTime kind is not supported: {format_str}") -def datetime_column_to_ndarray(col: Column) -> Tuple[np.ndarray, Any]: +def datetime_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]: """ Convert a column holding DateTime data to a NumPy array. @@ -362,9 +357,9 @@ def datetime_column_to_ndarray(col: Column) -> Tuple[np.ndarray, Any]: def buffer_to_ndarray( buffer: Buffer, - dtype: Tuple[DtypeKind, int, str, str], + dtype: tuple[DtypeKind, int, str, str], offset: int = 0, - length: Optional[int] = None, + length: int | None = None, ) -> np.ndarray: """ Build a NumPy array from the passed buffer. @@ -470,9 +465,9 @@ def bitmask_to_bool_ndarray( def set_nulls( - data: Union[np.ndarray, pd.Series], + data: np.ndarray | pd.Series, col: Column, - validity: Optional[Tuple[Buffer, Tuple[DtypeKind, int, str, str]]], + validity: tuple[Buffer, tuple[DtypeKind, int, str, str]] | None, allow_modify_inplace: bool = True, ): """ diff --git a/pandas/core/exchange/utils.py b/pandas/core/exchange/utils.py index 0c746113babee..2cc5126591718 100644 --- a/pandas/core/exchange/utils.py +++ b/pandas/core/exchange/utils.py @@ -2,6 +2,8 @@ Utility functions and objects for implementing the exchange API. 
""" +from __future__ import annotations + import re import typing diff --git a/pandas/core/flags.py b/pandas/core/flags.py index 001cd3d41177a..f07c6917d91e5 100644 --- a/pandas/core/flags.py +++ b/pandas/core/flags.py @@ -1,3 +1,5 @@ +from __future__ import annotations + import weakref @@ -81,7 +83,7 @@ def allows_duplicate_labels(self) -> bool: return self._allows_duplicate_labels @allows_duplicate_labels.setter - def allows_duplicate_labels(self, value: bool): + def allows_duplicate_labels(self, value: bool) -> None: value = bool(value) obj = self._obj() if obj is None: @@ -99,12 +101,12 @@ def __getitem__(self, key): return getattr(self, key) - def __setitem__(self, key, value): + def __setitem__(self, key, value) -> None: if key not in self._keys: raise ValueError(f"Unknown flag {key}. Must be one of {self._keys}") setattr(self, key, value) - def __repr__(self): + def __repr__(self) -> str: return f"" def __eq__(self, other): diff --git a/pandas/core/frame.py b/pandas/core/frame.py index 4376c784bc847..e62f9fa8076d8 100644 --- a/pandas/core/frame.py +++ b/pandas/core/frame.py @@ -1008,7 +1008,7 @@ def _repr_fits_horizontal_(self, ignore_width: bool = False) -> bool: # used by repr_html under IPython notebook or scripts ignore terminal # dims - if ignore_width or not console.in_interactive_session(): + if ignore_width or width is None or not console.in_interactive_session(): return True if get_option("display.width") is not None or console.in_ipython_frontend(): @@ -1601,7 +1601,7 @@ def __matmul__(self, other: AnyArrayLike | DataFrame) -> DataFrame | Series: """ return self.dot(other) - def __rmatmul__(self, other): + def __rmatmul__(self, other) -> DataFrame: """ Matrix multiplication using binary `@` operator in Python>=3.5. """ @@ -1720,7 +1720,10 @@ def from_dict( if columns is not None: raise ValueError(f"cannot use columns parameter with orient='{orient}'") else: # pragma: no cover - raise ValueError("only recognize index or columns for orient") + raise ValueError( + f"Expected 'index', 'columns' or 'tight' for orient parameter. " + f"Got '{orient}' instead" + ) if orient != "tight": return cls(data, index=index, columns=columns, dtype=dtype) @@ -1743,7 +1746,7 @@ def to_numpy( self, dtype: npt.DTypeLike | None = None, copy: bool = False, - na_value=lib.no_default, + na_value: object = lib.no_default, ) -> np.ndarray: """ Convert the DataFrame to a NumPy array. @@ -1817,7 +1820,7 @@ def to_dict(self, orient: str = "dict", into=dict): Parameters ---------- - orient : str {'dict', 'list', 'series', 'split', 'records', 'index'} + orient : str {'dict', 'list', 'series', 'split', 'tight', 'records', 'index'} Determines the type of the values of the dictionary. 
- 'dict' (default) : dict like {column -> {index -> value}} @@ -2481,8 +2484,8 @@ def to_records( if dtype_mapping is None: formats.append(v.dtype) elif isinstance(dtype_mapping, (type, np.dtype, str)): - # Argument 1 to "append" of "list" has incompatible type - # "Union[type, dtype[Any], str]"; expected "dtype[_SCT]" [arg-type] + # error: Argument 1 to "append" of "list" has incompatible + # type "Union[type, dtype[Any], str]"; expected "dtype[Any]" formats.append(dtype_mapping) # type: ignore[arg-type] else: element = "row" if i < index_len else "column" @@ -2788,6 +2791,32 @@ def to_markdown( handles.handle.write(result) return None + @overload + def to_parquet( + self, + path: None = ..., + engine: str = ..., + compression: str | None = ..., + index: bool | None = ..., + partition_cols: list[str] | None = ..., + storage_options: StorageOptions = ..., + **kwargs, + ) -> bytes: + ... + + @overload + def to_parquet( + self, + path: FilePath | WriteBuffer[bytes], + engine: str = ..., + compression: str | None = ..., + index: bool | None = ..., + partition_cols: list[str] | None = ..., + storage_options: StorageOptions = ..., + **kwargs, + ) -> None: + ... + @doc(storage_options=_shared_docs["storage_options"]) @deprecate_kwarg(old_arg_name="fname", new_arg_name="path") def to_parquet( @@ -2854,6 +2883,7 @@ def to_parquet( See Also -------- read_parquet : Read a parquet file. + DataFrame.to_orc : Write an orc file. DataFrame.to_csv : Write a csv file. DataFrame.to_sql : Write to a sql table. DataFrame.to_hdf : Write to hdf. @@ -2897,6 +2927,151 @@ def to_parquet( **kwargs, ) + def to_orc( + self, + path: FilePath | WriteBuffer[bytes] | None = None, + *, + engine: Literal["pyarrow"] = "pyarrow", + index: bool | None = None, + engine_kwargs: dict[str, Any] | None = None, + ) -> bytes | None: + """ + Write a DataFrame to the ORC format. + + .. versionadded:: 1.5.0 + + Parameters + ---------- + path : str, file-like object or None, default None + If a string, it will be used as Root Directory path + when writing a partitioned dataset. By file-like object, + we refer to objects with a write() method, such as a file handle + (e.g. via builtin open function). If path is None, + a bytes object is returned. + engine : str, default 'pyarrow' + ORC library to use. Pyarrow must be >= 7.0.0. + index : bool, optional + If ``True``, include the dataframe's index(es) in the file output. + If ``False``, they will not be written to the file. + If ``None``, similar to ``infer`` the dataframe's index(es) + will be saved. However, instead of being saved as values, + the RangeIndex will be stored as a range in the metadata so it + doesn't require much space and is faster. Other indexes will + be included as columns in the file output. + engine_kwargs : dict[str, Any] or None, default None + Additional keyword arguments passed to :func:`pyarrow.orc.write_table`. + + Returns + ------- + bytes if no path argument is provided else None + + Raises + ------ + NotImplementedError + Dtype of one or more columns is category, unsigned integers, interval, + period or sparse. + ValueError + engine is not pyarrow. + + See Also + -------- + read_orc : Read a ORC file. + DataFrame.to_parquet : Write a parquet file. + DataFrame.to_csv : Write a csv file. + DataFrame.to_sql : Write to a sql table. + DataFrame.to_hdf : Write to hdf. + + Notes + ----- + * Before using this function you should read the :ref:`user guide about + ORC ` and :ref:`install optional dependencies `. + * This function requires `pyarrow `_ + library. 
+ * For supported dtypes please refer to `supported ORC features in Arrow + `__. + * Currently timezones in datetime columns are not preserved when a + dataframe is converted into ORC files. + + Examples + -------- + >>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [4, 3]}) + >>> df.to_orc('df.orc') # doctest: +SKIP + >>> pd.read_orc('df.orc') # doctest: +SKIP + col1 col2 + 0 1 4 + 1 2 3 + + If you want to get a buffer to the orc content you can write it to io.BytesIO + >>> import io + >>> b = io.BytesIO(df.to_orc()) # doctest: +SKIP + >>> b.seek(0) # doctest: +SKIP + 0 + >>> content = b.read() # doctest: +SKIP + """ + from pandas.io.orc import to_orc + + return to_orc( + self, path, engine=engine, index=index, engine_kwargs=engine_kwargs + ) + + @overload + def to_html( + self, + buf: FilePath | WriteBuffer[str], + columns: Sequence[Level] | None = ..., + col_space: ColspaceArgType | None = ..., + header: bool | Sequence[str] = ..., + index: bool = ..., + na_rep: str = ..., + formatters: FormattersType | None = ..., + float_format: FloatFormatType | None = ..., + sparsify: bool | None = ..., + index_names: bool = ..., + justify: str | None = ..., + max_rows: int | None = ..., + max_cols: int | None = ..., + show_dimensions: bool | str = ..., + decimal: str = ..., + bold_rows: bool = ..., + classes: str | list | tuple | None = ..., + escape: bool = ..., + notebook: bool = ..., + border: int | bool | None = ..., + table_id: str | None = ..., + render_links: bool = ..., + encoding: str | None = ..., + ) -> None: + ... + + @overload + def to_html( + self, + buf: None = ..., + columns: Sequence[Level] | None = ..., + col_space: ColspaceArgType | None = ..., + header: bool | Sequence[str] = ..., + index: bool = ..., + na_rep: str = ..., + formatters: FormattersType | None = ..., + float_format: FloatFormatType | None = ..., + sparsify: bool | None = ..., + index_names: bool = ..., + justify: str | None = ..., + max_rows: int | None = ..., + max_cols: int | None = ..., + show_dimensions: bool | str = ..., + decimal: str = ..., + bold_rows: bool = ..., + classes: str | list | tuple | None = ..., + escape: bool = ..., + notebook: bool = ..., + border: int | bool | None = ..., + table_id: str | None = ..., + render_links: bool = ..., + encoding: str | None = ..., + ) -> str: + ... + @Substitution( header_type="bool", header="Whether to print column labels, default True", @@ -2910,7 +3085,7 @@ def to_parquet( def to_html( self, buf: FilePath | WriteBuffer[str] | None = None, - columns: Sequence[str] | None = None, + columns: Sequence[Level] | None = None, col_space: ColspaceArgType | None = None, header: bool | Sequence[str] = True, index: bool = True, @@ -2932,7 +3107,7 @@ def to_html( table_id: str | None = None, render_links: bool = False, encoding: str | None = None, - ): + ) -> str | None: """ Render a DataFrame as an HTML table. %(shared_params)s @@ -3467,16 +3642,16 @@ def T(self) -> DataFrame: # ---------------------------------------------------------------------- # Indexing Methods - def _ixs(self, i: int, axis: int = 0): + def _ixs(self, i: int, axis: int = 0) -> Series: """ Parameters ---------- i : int axis : int - Notes - ----- - If slice passed, the resulting data will be a view. + Returns + ------- + Series """ # irow if axis == 0: @@ -3524,11 +3699,18 @@ def __getitem__(self, key): if is_hashable(key) and not is_iterator(key): # is_iterator to exclude generator e.g. 
test_getitem_listlike # shortcut if the key is in columns - if self.columns.is_unique and key in self.columns: - if isinstance(self.columns, MultiIndex): - return self._getitem_multilevel(key) + is_mi = isinstance(self.columns, MultiIndex) + # GH#45316 Return view if key is not duplicated + # Only use drop_duplicates with duplicates for performance + if not is_mi and ( + self.columns.is_unique + and key in self.columns + or key in self.columns.drop_duplicates(keep=False) + ): return self._get_item_cache(key) + elif is_mi and self.columns.is_unique and key in self.columns: + return self._getitem_multilevel(key) # Do we have a slicer (on rows)? indexer = convert_to_index_sliceable(self, key) if indexer is not None: @@ -4038,7 +4220,20 @@ def _maybe_cache_changed(self, item, value: Series, inplace: bool) -> None: # ---------------------------------------------------------------------- # Unsorted - def query(self, expr: str, inplace: bool = False, **kwargs): + @overload + def query(self, expr: str, *, inplace: Literal[False] = ..., **kwargs) -> DataFrame: + ... + + @overload + def query(self, expr: str, *, inplace: Literal[True], **kwargs) -> None: + ... + + @overload + def query(self, expr: str, *, inplace: bool = ..., **kwargs) -> DataFrame | None: + ... + + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "expr"]) + def query(self, expr: str, inplace: bool = False, **kwargs) -> DataFrame | None: """ Query the columns of a DataFrame with a boolean expression. @@ -4185,7 +4380,7 @@ def query(self, expr: str, inplace: bool = False, **kwargs): if not isinstance(expr, str): msg = f"expr must be a string to be evaluated, {type(expr)} given" raise ValueError(msg) - kwargs["level"] = kwargs.pop("level", 0) + 1 + kwargs["level"] = kwargs.pop("level", 0) + 2 kwargs["target"] = None res = self.eval(expr, **kwargs) @@ -4202,7 +4397,16 @@ def query(self, expr: str, inplace: bool = False, **kwargs): else: return result - def eval(self, expr: str, inplace: bool = False, **kwargs): + @overload + def eval(self, expr: str, *, inplace: Literal[False] = ..., **kwargs) -> Any: + ... + + @overload + def eval(self, expr: str, *, inplace: Literal[True], **kwargs) -> None: + ... + + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "expr"]) + def eval(self, expr: str, inplace: bool = False, **kwargs) -> Any | None: """ Evaluate a string describing operations on DataFrame columns. @@ -4308,7 +4512,7 @@ def eval(self, expr: str, inplace: bool = False, **kwargs): from pandas.core.computation.eval import eval as _eval inplace = validate_bool_kwarg(inplace, "inplace") - kwargs["level"] = kwargs.pop("level", 0) + 1 + kwargs["level"] = kwargs.pop("level", 0) + 2 index_resolvers = self._get_index_resolvers() column_resolvers = self._get_cleaned_column_resolvers() resolvers = column_resolvers, index_resolvers @@ -4608,9 +4812,12 @@ def _sanitize_column(self, value) -> ArrayLike: """ self._ensure_valid_index(value) - # We should never get here with DataFrame value - if isinstance(value, Series): + # We can get there through isetitem with a DataFrame + # or through loc single_block_path + if isinstance(value, DataFrame): return _reindex_for_setitem(value, self.index) + elif is_dict_like(value): + return _reindex_for_setitem(Series(value), self.index) if is_list_like(value): com.require_length_match(value, self.index) @@ -4802,21 +5009,17 @@ def align( @overload def set_axis( - self, labels, axis: Axis = ..., inplace: Literal[False] = ... 
+ self, labels, *, axis: Axis = ..., inplace: Literal[False] = ... ) -> DataFrame: ... @overload - def set_axis(self, labels, axis: Axis, inplace: Literal[True]) -> None: - ... - - @overload - def set_axis(self, labels, *, inplace: Literal[True]) -> None: + def set_axis(self, labels, *, axis: Axis = ..., inplace: Literal[True]) -> None: ... @overload def set_axis( - self, labels, axis: Axis = ..., inplace: bool = ... + self, labels, *, axis: Axis = ..., inplace: bool = ... ) -> DataFrame | None: ... @@ -4887,11 +5090,11 @@ def reindex(self, *args, **kwargs) -> DataFrame: @overload def drop( self, - labels: Hashable | list[Hashable] = ..., + labels: IndexLabel = ..., *, axis: Axis = ..., - index: Hashable | list[Hashable] = ..., - columns: Hashable | list[Hashable] = ..., + index: IndexLabel = ..., + columns: IndexLabel = ..., level: Level | None = ..., inplace: Literal[True], errors: IgnoreRaise = ..., @@ -4901,11 +5104,11 @@ def drop( @overload def drop( self, - labels: Hashable | list[Hashable] = ..., + labels: IndexLabel = ..., *, axis: Axis = ..., - index: Hashable | list[Hashable] = ..., - columns: Hashable | list[Hashable] = ..., + index: IndexLabel = ..., + columns: IndexLabel = ..., level: Level | None = ..., inplace: Literal[False] = ..., errors: IgnoreRaise = ..., @@ -4915,11 +5118,11 @@ def drop( @overload def drop( self, - labels: Hashable | list[Hashable] = ..., + labels: IndexLabel = ..., *, axis: Axis = ..., - index: Hashable | list[Hashable] = ..., - columns: Hashable | list[Hashable] = ..., + index: IndexLabel = ..., + columns: IndexLabel = ..., level: Level | None = ..., inplace: bool = ..., errors: IgnoreRaise = ..., @@ -4931,10 +5134,10 @@ def drop( @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "labels"]) def drop( # type: ignore[override] self, - labels: Hashable | list[Hashable] = None, + labels: IndexLabel = None, axis: Axis = 0, - index: Hashable | list[Hashable] = None, - columns: Hashable | list[Hashable] = None, + index: IndexLabel = None, + columns: IndexLabel = None, level: Level | None = None, inplace: bool = False, errors: IgnoreRaise = "raise", @@ -5439,16 +5642,47 @@ def pop(self, item: Hashable) -> Series: """ return super().pop(item=item) - @doc(NDFrame.replace, **_shared_doc_kwargs) + # error: Signature of "replace" incompatible with supertype "NDFrame" + @overload # type: ignore[override] def replace( + self, + to_replace=..., + value=..., + *, + inplace: Literal[False] = ..., + limit: int | None = ..., + regex: bool = ..., + method: Literal["pad", "ffill", "bfill"] | lib.NoDefault = ..., + ) -> DataFrame: + ... + + @overload + def replace( + self, + to_replace=..., + value=..., + *, + inplace: Literal[True], + limit: int | None = ..., + regex: bool = ..., + method: Literal["pad", "ffill", "bfill"] | lib.NoDefault = ..., + ) -> None: + ... 
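# A minimal usage sketch of the drop overloads above, assuming pandas 1.5-dev
# behaviour: IndexLabel covers a single label or a list-like of labels for
# labels/index/columns, and the return type follows ``inplace``.
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
trimmed = df.drop(columns="b")                    # single label -> new DataFrame
trimmed = df.drop(index=[0], columns=["b"])       # list-like labels also accepted
assert df.drop(index=[1], inplace=True) is None   # in-place drop returns None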
+ + # error: Signature of "replace" incompatible with supertype "NDFrame" + @deprecate_nonkeyword_arguments( + version=None, allowed_args=["self", "to_replace", "value"] + ) + @doc(NDFrame.replace, **_shared_doc_kwargs) + def replace( # type: ignore[override] self, to_replace=None, value=lib.no_default, inplace: bool = False, - limit=None, + limit: int | None = None, regex: bool = False, - method: str | lib.NoDefault = lib.no_default, - ): + method: Literal["pad", "ffill", "bfill"] | lib.NoDefault = lib.no_default, + ) -> DataFrame | None: return super().replace( to_replace=to_replace, value=value, @@ -5580,6 +5814,30 @@ def shift( periods=periods, freq=freq, axis=axis, fill_value=fill_value ) + @overload + def set_index( + self, + keys, + *, + drop: bool = ..., + append: bool = ..., + inplace: Literal[False] = ..., + verify_integrity: bool = ..., + ) -> DataFrame: + ... + + @overload + def set_index( + self, + keys, + *, + drop: bool = ..., + append: bool = ..., + inplace: Literal[True], + verify_integrity: bool = ..., + ) -> None: + ... + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "keys"]) def set_index( self, @@ -5588,7 +5846,7 @@ def set_index( append: bool = False, inplace: bool = False, verify_integrity: bool = False, - ): + ) -> DataFrame | None: """ Set the DataFrame index using existing columns. @@ -5781,6 +6039,7 @@ def set_index( if not inplace: return frame + return None @overload def reset_index( @@ -6131,6 +6390,30 @@ def notnull(self) -> DataFrame: """ return ~self.isna() + @overload + def dropna( + self, + *, + axis: Axis = ..., + how: str | NoDefault = ..., + thresh: int | NoDefault = ..., + subset: IndexLabel = ..., + inplace: Literal[False] = ..., + ) -> DataFrame: + ... + + @overload + def dropna( + self, + *, + axis: Axis = ..., + how: str | NoDefault = ..., + thresh: int | NoDefault = ..., + subset: IndexLabel = ..., + inplace: Literal[True], + ) -> None: + ... + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) def dropna( self, @@ -6139,7 +6422,7 @@ def dropna( thresh: int | NoDefault = no_default, subset: IndexLabel = None, inplace: bool = False, - ): + ) -> DataFrame | None: """ Remove missing values. @@ -6288,10 +6571,10 @@ def dropna( else: result = self.loc(axis=axis)[mask] - if inplace: - self._update_inplace(result) - else: + if not inplace: return result + self._update_inplace(result) + return None @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "subset"]) def drop_duplicates( @@ -6536,11 +6819,42 @@ def f(vals) -> tuple[np.ndarray, int]: # ---------------------------------------------------------------------- # Sorting + # error: Signature of "sort_values" incompatible with supertype "NDFrame" + @overload # type: ignore[override] + def sort_values( + self, + by, + *, + axis: Axis = ..., + ascending=..., + inplace: Literal[False] = ..., + kind: str = ..., + na_position: str = ..., + ignore_index: bool = ..., + key: ValueKeyFunc = ..., + ) -> DataFrame: + ... + + @overload + def sort_values( + self, + by, + *, + axis: Axis = ..., + ascending=..., + inplace: Literal[True], + kind: str = ..., + na_position: str = ..., + ignore_index: bool = ..., + key: ValueKeyFunc = ..., + ) -> None: + ... + # TODO: Just move the sort_values doc here. 
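# A minimal sketch of the behaviour the sort_values overloads above describe,
# assuming pandas 1.5-dev: inplace=False (the default) returns a DataFrame,
# inplace=True mutates the frame and returns None, so type checkers can narrow
# the result type.
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})
out = df.sort_values("a")                          # new, sorted DataFrame
assert out is not df
assert df.sort_values("a", inplace=True) is None   # sorts df itself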
+ # error: Signature of "sort_values" incompatible with supertype "NDFrame" @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "by"]) @Substitution(**_shared_doc_kwargs) @Appender(NDFrame.sort_values.__doc__) - # error: Signature of "sort_values" incompatible with supertype "NDFrame" def sort_values( # type: ignore[override] self, by, @@ -6551,7 +6865,7 @@ def sort_values( # type: ignore[override] na_position: str = "last", ignore_index: bool = False, key: ValueKeyFunc = None, - ): + ) -> DataFrame | None: inplace = validate_bool_kwarg(inplace, "inplace") axis = self._get_axis_number(axis) ascending = validate_ascending(ascending) @@ -6783,7 +7097,7 @@ def value_counts( sort: bool = True, ascending: bool = False, dropna: bool = True, - ): + ) -> Series: """ Return a Series containing counts of unique rows in the DataFrame. @@ -7462,6 +7776,14 @@ def __rdivmod__(self, other) -> tuple[DataFrame, DataFrame]: 0 a c NaN NaN 2 NaN NaN 3.0 4.0 +Assign result_names + +>>> df.compare(df2, result_names=("left", "right")) + col1 col3 + left right left right +0 a c NaN NaN +2 NaN NaN 3.0 4.0 + Stack the differences on rows >>> df.compare(df2, align_axis=0) @@ -7509,12 +7831,14 @@ def compare( align_axis: Axis = 1, keep_shape: bool = False, keep_equal: bool = False, + result_names: Suffixes = ("self", "other"), ) -> DataFrame: return super().compare( other=other, align_axis=align_axis, keep_shape=keep_shape, keep_equal=keep_equal, + result_names=result_names, ) def combine( @@ -7912,7 +8236,7 @@ def update( if mask.all(): continue - self[col] = expressions.where(mask, this, that) + self.loc[:, col] = expressions.where(mask, this, that) # ---------------------------------------------------------------------- # Data reshaping @@ -8025,7 +8349,7 @@ def groupby( self, by=None, axis: Axis = 0, - level: Level | None = None, + level: IndexLabel | None = None, as_index: bool = True, sort: bool = True, group_keys: bool | lib.NoDefault = no_default, @@ -9423,7 +9747,6 @@ def _append( verify_integrity: bool = False, sort: bool = False, ) -> DataFrame: - combined_columns = None if isinstance(other, (Series, dict)): if isinstance(other, dict): if not ignore_index: @@ -9436,8 +9759,6 @@ def _append( ) index = Index([other.name], name=self.index.name) - idx_diff = other.index.difference(self.columns) - combined_columns = self.columns.append(idx_diff) row_df = other.to_frame().T # infer_objects is needed for # test_append_empty_frame_to_series_with_dateutil_tz @@ -9463,21 +9784,11 @@ def _append( verify_integrity=verify_integrity, sort=sort, ) - if ( - combined_columns is not None - and not sort - and not combined_columns.equals(result.columns) - ): - # TODO: reindexing here is a kludge bc union_indexes does not - # pass sort to index.union, xref #43375 - # combined_columns.equals check is necessary for preserving dtype - # in test_crosstab_normalize - result = result.reindex(combined_columns, axis=1) return result.__finalize__(self, method="append") def join( self, - other: DataFrame | Series, + other: DataFrame | Series | list[DataFrame | Series], on: IndexLabel | None = None, how: str = "left", lsuffix: str = "", @@ -9494,7 +9805,7 @@ def join( Parameters ---------- - other : DataFrame, Series, or list of DataFrame + other : DataFrame, Series, or a list containing any combination of them Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame. 
@@ -9650,7 +9961,7 @@ def join( def _join_compat( self, - other: DataFrame | Series, + other: DataFrame | Series | Iterable[DataFrame | Series], on: IndexLabel | None = None, how: str = "left", lsuffix: str = "", @@ -9699,7 +10010,11 @@ def _join_compat( "Suffixes not supported when joining multiple DataFrames" ) - frames = [self] + list(other) + # Mypy thinks the RHS is a + # "Union[DataFrame, Series, Iterable[Union[DataFrame, Series]]]" whereas + # the LHS is an "Iterable[DataFrame]", but in reality both types are + # "Iterable[Union[DataFrame, Series]]" due to the if statements + frames = [cast("DataFrame | Series", self)] + list(other) can_concat = all(df.index.is_unique for df in frames) @@ -10054,9 +10369,10 @@ def cov( See Also -------- Series.cov : Compute covariance with another Series. - core.window.ExponentialMovingWindow.cov: Exponential weighted sample covariance. - core.window.Expanding.cov : Expanding sample covariance. - core.window.Rolling.cov : Rolling sample covariance. + core.window.ewm.ExponentialMovingWindow.cov : Exponential weighted sample + covariance. + core.window.expanding.Expanding.cov : Expanding sample covariance. + core.window.rolling.Rolling.cov : Rolling sample covariance. Notes ----- @@ -10249,7 +10565,8 @@ def corrwith( else: return this.apply(lambda x: other.corr(x, method=method), axis=axis) - other = other._get_numeric_data() + if numeric_only_bool: + other = other._get_numeric_data() left, right = this.align(other, join="inner", copy=False) if axis == 1: @@ -10262,11 +10579,15 @@ def corrwith( right = right + left * 0 # demeaned data - ldem = left - left.mean() - rdem = right - right.mean() + ldem = left - left.mean(numeric_only=numeric_only_bool) + rdem = right - right.mean(numeric_only=numeric_only_bool) num = (ldem * rdem).sum() - dom = (left.count() - 1) * left.std() * right.std() + dom = ( + (left.count() - 1) + * left.std(numeric_only=numeric_only_bool) + * right.std(numeric_only=numeric_only_bool) + ) correl = num / dom @@ -10866,7 +11187,7 @@ def quantile( See Also -------- - core.window.Rolling.quantile: Rolling quantile. + core.window.rolling.Rolling.quantile: Rolling quantile. numpy.percentile: Numpy function to compute the percentile. Examples @@ -11319,25 +11640,93 @@ def values(self) -> np.ndarray: self._consolidate_inplace() return self._mgr.as_array() - @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + @overload def ffill( - self: DataFrame, + self, + *, + axis: None | Axis = ..., + inplace: Literal[False] = ..., + limit: None | int = ..., + downcast=..., + ) -> DataFrame: + ... + + @overload + def ffill( + self, + *, + axis: None | Axis = ..., + inplace: Literal[True], + limit: None | int = ..., + downcast=..., + ) -> None: + ... + + @overload + def ffill( + self, + *, + axis: None | Axis = ..., + inplace: bool = ..., + limit: None | int = ..., + downcast=..., + ) -> DataFrame | None: + ... 
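# A minimal sketch of the ffill overloads above, assuming pandas 1.5-dev: the
# deprecate_nonkeyword_arguments decorator means everything but self should be
# passed by keyword, and the return type again follows ``inplace``.
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan]})
filled = df.ffill(limit=1)                 # fills at most one consecutive NaN
assert df.ffill(inplace=True) is None      # fills df in place, returns None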
+ + # error: Signature of "ffill" incompatible with supertype "NDFrame" + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + def ffill( # type: ignore[override] + self, axis: None | Axis = None, inplace: bool = False, limit: None | int = None, downcast=None, ) -> DataFrame | None: - return super().ffill(axis, inplace, limit, downcast) + return super().ffill(axis=axis, inplace=inplace, limit=limit, downcast=downcast) - @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + @overload def bfill( - self: DataFrame, + self, + *, + axis: None | Axis = ..., + inplace: Literal[False] = ..., + limit: None | int = ..., + downcast=..., + ) -> DataFrame: + ... + + @overload + def bfill( + self, + *, + axis: None | Axis = ..., + inplace: Literal[True], + limit: None | int = ..., + downcast=..., + ) -> None: + ... + + @overload + def bfill( + self, + *, + axis: None | Axis = ..., + inplace: bool = ..., + limit: None | int = ..., + downcast=..., + ) -> DataFrame | None: + ... + + # error: Signature of "bfill" incompatible with supertype "NDFrame" + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + def bfill( # type: ignore[override] + self, axis: None | Axis = None, inplace: bool = False, limit: None | int = None, downcast=None, ) -> DataFrame | None: - return super().bfill(axis, inplace, limit, downcast) + return super().bfill(axis=axis, inplace=inplace, limit=limit, downcast=downcast) @deprecate_nonkeyword_arguments( version=None, allowed_args=["self", "lower", "upper"] @@ -11376,35 +11765,137 @@ def interpolate( **kwargs, ) + @overload + def where( + self, + cond, + other=..., + *, + inplace: Literal[False] = ..., + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> DataFrame: + ... + + @overload + def where( + self, + cond, + other=..., + *, + inplace: Literal[True], + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> None: + ... + + @overload + def where( + self, + cond, + other=..., + *, + inplace: bool = ..., + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> DataFrame | None: + ... + + # error: Signature of "where" incompatible with supertype "NDFrame" + @deprecate_kwarg(old_arg_name="errors", new_arg_name=None) @deprecate_nonkeyword_arguments( version=None, allowed_args=["self", "cond", "other"] ) - def where( + def where( # type: ignore[override] self, cond, other=lib.no_default, - inplace=False, + inplace: bool = False, axis=None, level=None, - errors: IgnoreRaise = "raise", + errors: IgnoreRaise | lib.NoDefault = "raise", try_cast=lib.no_default, - ): - return super().where(cond, other, inplace, axis, level, errors, try_cast) + ) -> DataFrame | None: + return super().where( + cond, + other, + inplace=inplace, + axis=axis, + level=level, + try_cast=try_cast, + ) + @overload + def mask( + self, + cond, + other=..., + *, + inplace: Literal[False] = ..., + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> DataFrame: + ... + + @overload + def mask( + self, + cond, + other=..., + *, + inplace: Literal[True], + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> None: + ... + + @overload + def mask( + self, + cond, + other=..., + *, + inplace: bool = ..., + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> DataFrame | None: + ... 
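# A minimal sketch of where/mask with the overloads above, assuming pandas
# 1.5-dev. Note that ``errors`` is still accepted but no longer forwarded to
# the internal _where; passing it is expected to trigger a FutureWarning via
# deprecate_kwarg.
import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3]})
kept = df.where(df > 0)             # non-positive entries become NaN
zeroed = df.mask(df > 0, other=0)   # positive entries are replaced by 0
# df.where(df > 0, errors="ignore")  # still runs, but should warn that errors is deprecated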
+ + # error: Signature of "mask" incompatible with supertype "NDFrame" + @deprecate_kwarg(old_arg_name="errors", new_arg_name=None) @deprecate_nonkeyword_arguments( version=None, allowed_args=["self", "cond", "other"] ) - def mask( + def mask( # type: ignore[override] self, cond, other=np.nan, - inplace=False, + inplace: bool = False, axis=None, level=None, - errors: IgnoreRaise = "raise", + errors: IgnoreRaise | lib.NoDefault = "raise", try_cast=lib.no_default, - ): - return super().mask(cond, other, inplace, axis, level, errors, try_cast) + ) -> DataFrame | None: + return super().mask( + cond, + other, + inplace=inplace, + axis=axis, + level=level, + try_cast=try_cast, + ) DataFrame._add_numeric_operations() diff --git a/pandas/core/generic.py b/pandas/core/generic.py index 673228a758aca..0a439faed0896 100644 --- a/pandas/core/generic.py +++ b/pandas/core/generic.py @@ -13,6 +13,7 @@ TYPE_CHECKING, Any, Callable, + ClassVar, Hashable, Literal, Mapping, @@ -48,7 +49,7 @@ IgnoreRaise, IndexKeyFunc, IndexLabel, - IntervalClosedType, + IntervalInclusiveType, JSONSerializable, Level, Manager, @@ -58,6 +59,7 @@ Renamer, SortKind, StorageOptions, + Suffixes, T, TimedeltaConvertibleTypes, TimestampConvertibleTypes, @@ -699,25 +701,24 @@ def size(self) -> int: @overload def set_axis( - self: NDFrameT, labels, axis: Axis = ..., inplace: Literal[False] = ... + self: NDFrameT, labels, *, axis: Axis = ..., inplace: Literal[False] = ... ) -> NDFrameT: ... @overload - def set_axis(self, labels, axis: Axis, inplace: Literal[True]) -> None: - ... - - @overload - def set_axis(self, labels, *, inplace: Literal[True]) -> None: + def set_axis(self, labels, *, axis: Axis = ..., inplace: Literal[True]) -> None: ... @overload def set_axis( - self: NDFrameT, labels, axis: Axis = ..., inplace: bool_t = ... + self: NDFrameT, labels, *, axis: Axis = ..., inplace: bool_t = ... ) -> NDFrameT | None: ... - def set_axis(self, labels, axis: Axis = 0, inplace: bool_t = False): + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "labels"]) + def set_axis( + self: NDFrameT, labels, axis: Axis = 0, inplace: bool_t = False + ) -> NDFrameT | None: """ Assign desired index to given axis. @@ -1049,8 +1050,44 @@ def _rename( else: return result.__finalize__(self, method="rename") - @rewrite_axis_style_signature("mapper", [("copy", True), ("inplace", False)]) - def rename_axis(self, mapper=lib.no_default, **kwargs): + @overload + def rename_axis( + self: NDFrameT, + mapper: IndexLabel | lib.NoDefault = ..., + *, + inplace: Literal[False] = ..., + **kwargs, + ) -> NDFrameT: + ... + + @overload + def rename_axis( + self, + mapper: IndexLabel | lib.NoDefault = ..., + *, + inplace: Literal[True], + **kwargs, + ) -> None: + ... + + @overload + def rename_axis( + self: NDFrameT, + mapper: IndexLabel | lib.NoDefault = ..., + *, + inplace: bool_t = ..., + **kwargs, + ) -> NDFrameT | None: + ... + + @rewrite_axis_style_signature("mapper", [("copy", True)]) + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "mapper"]) + def rename_axis( + self: NDFrameT, + mapper: IndexLabel | lib.NoDefault = lib.no_default, + inplace: bool_t = False, + **kwargs, + ) -> NDFrameT | None: """ Set the name of the axis for the index or columns. 
@@ -1174,6 +1211,7 @@ class name cat 4 0 monkey 2 2 """ + kwargs["inplace"] = inplace axes, kwargs = self._construct_axes_from_arguments( (), kwargs, sentinel=lib.no_default ) @@ -1219,6 +1257,7 @@ class name result._set_axis_name(newnames, axis=axis, inplace=True) if not inplace: return result + return None @final def _set_axis_name(self, name, axis=0, inplace=False): @@ -1377,7 +1416,7 @@ def equals(self, other: object) -> bool_t: # Unary Methods @final - def __neg__(self): + def __neg__(self: NDFrameT) -> NDFrameT: def blk_func(values: ArrayLike): if is_bool_dtype(values.dtype): # error: Argument 1 to "inv" has incompatible type "Union @@ -1395,7 +1434,7 @@ def blk_func(values: ArrayLike): return res.__finalize__(self, method="__neg__") @final - def __pos__(self): + def __pos__(self: NDFrameT) -> NDFrameT: def blk_func(values: ArrayLike): if is_bool_dtype(values.dtype): return values.copy() @@ -1410,7 +1449,7 @@ def blk_func(values: ArrayLike): return res.__finalize__(self, method="__pos__") @final - def __invert__(self): + def __invert__(self: NDFrameT) -> NDFrameT: if not self.size: # inv fails with 0 len return self @@ -1428,7 +1467,7 @@ def __nonzero__(self): __bool__ = __nonzero__ @final - def bool(self): + def bool(self) -> bool_t: """ Return the bool of a single element Series or DataFrame. @@ -1471,6 +1510,8 @@ def bool(self): ) self.__nonzero__() + # for mypy (__nonzero__ raises) + return True @final def abs(self: NDFrameT) -> NDFrameT: @@ -1731,7 +1772,14 @@ def _get_label_or_level_values(self, key: str, axis: int = 0) -> np.ndarray: self._check_label_or_level_ambiguity(key, axis=axis) values = self.xs(key, axis=other_axes[0])._values elif self._is_level_reference(key, axis=axis): - values = self.axes[axis].get_level_values(key)._values + # error: Incompatible types in assignment (expression has type "Union[ + # ExtensionArray, ndarray[Any, Any]]", variable has type "ndarray[Any, + # Any]") + values = ( + self.axes[axis] + .get_level_values(key) # type: ignore[assignment] + ._values + ) else: raise KeyError(key) @@ -1836,7 +1884,7 @@ def _drop_labels_or_levels(self, keys, axis: int = 0): # https://github.com/python/typeshed/issues/2148#issuecomment-520783318 # Incompatible types in assignment (expression has type "None", base class # "object" defined the type as "Callable[[object], int]") - __hash__: None # type: ignore[assignment] + __hash__: ClassVar[None] # type: ignore[assignment] def __iter__(self): """ @@ -1850,7 +1898,7 @@ def __iter__(self): return iter(self._info_axis) # can we get a better explanation of this? - def keys(self): + def keys(self) -> Index: """ Get the 'info axis' (see Indexing for more). 
@@ -2014,7 +2062,7 @@ def __getstate__(self) -> dict[str, Any]: } @final - def __setstate__(self, state): + def __setstate__(self, state) -> None: if isinstance(state, BlockManager): self._mgr = state elif isinstance(state, dict): @@ -2047,7 +2095,7 @@ def __setstate__(self, state): elif len(state) == 2: raise NotImplementedError("Pre-0.12 pickles are no longer supported") - self._item_cache = {} + self._item_cache: dict[Hashable, Series] = {} # ---------------------------------------------------------------------- # Rendering Methods @@ -2086,7 +2134,11 @@ def _repr_data_resource_(self): # I/O Methods @final - @doc(klass="object", storage_options=_shared_docs["storage_options"]) + @doc( + klass="object", + storage_options=_shared_docs["storage_options"], + storage_options_versionadded="1.2.0", + ) def to_excel( self, excel_writer, @@ -2172,7 +2224,7 @@ def to_excel( is to be frozen. {storage_options} - .. versionadded:: 1.2.0 + .. versionadded:: {storage_options_versionadded} See Also -------- @@ -2630,6 +2682,7 @@ def to_hdf( See Also -------- read_hdf : Read from HDF file. + DataFrame.to_orc : Write a DataFrame to the binary orc format. DataFrame.to_parquet : Write a DataFrame to the binary parquet format. DataFrame.to_sql : Write to a SQL table. DataFrame.to_feather : Write out feather-format for DataFrames. @@ -2751,7 +2804,7 @@ def to_sql( ------- None or int Number of rows affected by to_sql. None is returned if the callable - passed into ``method`` does not return the number of rows. + passed into ``method`` does not return an integer number of rows. The number of returned rows affected is the sum of the ``rowcount`` attribute of ``sqlite3.Cursor`` or SQLAlchemy connectable which may not @@ -3290,10 +3343,64 @@ def to_latex( position=position, ) + @overload + def to_csv( + self, + path_or_buf: None = ..., + sep: str = ..., + na_rep: str = ..., + float_format: str | Callable | None = ..., + columns: Sequence[Hashable] | None = ..., + header: bool_t | list[str] = ..., + index: bool_t = ..., + index_label: IndexLabel | None = ..., + mode: str = ..., + encoding: str | None = ..., + compression: CompressionOptions = ..., + quoting: int | None = ..., + quotechar: str = ..., + lineterminator: str | None = ..., + chunksize: int | None = ..., + date_format: str | None = ..., + doublequote: bool_t = ..., + escapechar: str | None = ..., + decimal: str = ..., + errors: str = ..., + storage_options: StorageOptions = ..., + ) -> str: + ... + + @overload + def to_csv( + self, + path_or_buf: FilePath | WriteBuffer[bytes] | WriteBuffer[str], + sep: str = ..., + na_rep: str = ..., + float_format: str | Callable | None = ..., + columns: Sequence[Hashable] | None = ..., + header: bool_t | list[str] = ..., + index: bool_t = ..., + index_label: IndexLabel | None = ..., + mode: str = ..., + encoding: str | None = ..., + compression: CompressionOptions = ..., + quoting: int | None = ..., + quotechar: str = ..., + lineterminator: str | None = ..., + chunksize: int | None = ..., + date_format: str | None = ..., + doublequote: bool_t = ..., + escapechar: str | None = ..., + decimal: str = ..., + errors: str = ..., + storage_options: StorageOptions = ..., + ) -> None: + ... 
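# A minimal sketch of the to_csv overloads above, assuming pandas 1.5-dev:
# with no target the CSV text is returned as a string; with a path or buffer
# it is written out and None is returned.
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
assert isinstance(df.to_csv(), str)   # no path_or_buf -> CSV returned as a string
buf = io.StringIO()
assert df.to_csv(buf) is None         # buffer target -> written, returns None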
+ @final @doc( storage_options=_shared_docs["storage_options"], - compression_options=_shared_docs["compression_options"], + compression_options=_shared_docs["compression_options"] % "path_or_buf", ) @deprecate_kwarg(old_arg_name="line_terminator", new_arg_name="lineterminator") def to_csv( @@ -3301,7 +3408,7 @@ def to_csv( path_or_buf: FilePath | WriteBuffer[bytes] | WriteBuffer[str] | None = None, sep: str = ",", na_rep: str = "", - float_format: str | None = None, + float_format: str | Callable | None = None, columns: Sequence[Hashable] | None = None, header: bool_t | list[str] = True, index: bool_t = True, @@ -3340,8 +3447,9 @@ def to_csv( String of length 1. Field delimiter for the output file. na_rep : str, default '' Missing data representation. - float_format : str, default None - Format string for floating point numbers. + float_format : str, Callable, default None + Format string for floating point numbers. If a Callable is given, it takes + precedence over other numeric formatting parameters, like decimal. columns : sequence, optional Columns to write. header : bool or list of str, default True @@ -3662,7 +3770,9 @@ def _take_with_is_copy(self: NDFrameT, indices, axis=0) -> NDFrameT: return result @final - def xs(self, key, axis=0, level=None, drop_level: bool_t = True): + def xs( + self: NDFrameT, key, axis=0, level=None, drop_level: bool_t = True + ) -> NDFrameT: """ Return cross-section from the Series/DataFrame. @@ -4177,11 +4287,11 @@ def reindex_like( @overload def drop( self, - labels: Hashable | list[Hashable] = ..., + labels: IndexLabel = ..., *, axis: Axis = ..., - index: Hashable | list[Hashable] = ..., - columns: Hashable | list[Hashable] = ..., + index: IndexLabel = ..., + columns: IndexLabel = ..., level: Level | None = ..., inplace: Literal[True], errors: IgnoreRaise = ..., @@ -4191,11 +4301,11 @@ def drop( @overload def drop( self: NDFrameT, - labels: Hashable | list[Hashable] = ..., + labels: IndexLabel = ..., *, axis: Axis = ..., - index: Hashable | list[Hashable] = ..., - columns: Hashable | list[Hashable] = ..., + index: IndexLabel = ..., + columns: IndexLabel = ..., level: Level | None = ..., inplace: Literal[False] = ..., errors: IgnoreRaise = ..., @@ -4205,11 +4315,11 @@ def drop( @overload def drop( self: NDFrameT, - labels: Hashable | list[Hashable] = ..., + labels: IndexLabel = ..., *, axis: Axis = ..., - index: Hashable | list[Hashable] = ..., - columns: Hashable | list[Hashable] = ..., + index: IndexLabel = ..., + columns: IndexLabel = ..., level: Level | None = ..., inplace: bool_t = ..., errors: IgnoreRaise = ..., @@ -4219,10 +4329,10 @@ def drop( @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "labels"]) def drop( self: NDFrameT, - labels: Hashable | list[Hashable] = None, + labels: IndexLabel = None, axis: Axis = 0, - index: Hashable | list[Hashable] = None, - columns: Hashable | list[Hashable] = None, + index: IndexLabel = None, + columns: IndexLabel = None, level: Level | None = None, inplace: bool_t = False, errors: IgnoreRaise = "raise", @@ -4485,16 +4595,59 @@ def add_suffix(self: NDFrameT, suffix: str) -> NDFrameT: # "**Dict[str, partial[str]]"; expected "Union[str, int, None]" return self._rename(**mapper) # type: ignore[return-value, arg-type] + @overload + def sort_values( + self: NDFrameT, + *, + axis: Axis = ..., + ascending=..., + inplace: Literal[False] = ..., + kind: str = ..., + na_position: str = ..., + ignore_index: bool_t = ..., + key: ValueKeyFunc = ..., + ) -> NDFrameT: + ... 
+ + @overload def sort_values( self, - axis=0, + *, + axis: Axis = ..., + ascending=..., + inplace: Literal[True], + kind: str = ..., + na_position: str = ..., + ignore_index: bool_t = ..., + key: ValueKeyFunc = ..., + ) -> None: + ... + + @overload + def sort_values( + self: NDFrameT, + *, + axis: Axis = ..., + ascending=..., + inplace: bool_t = ..., + kind: str = ..., + na_position: str = ..., + ignore_index: bool_t = ..., + key: ValueKeyFunc = ..., + ) -> NDFrameT | None: + ... + + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + def sort_values( + self: NDFrameT, + axis: Axis = 0, ascending=True, inplace: bool_t = False, kind: str = "quicksort", na_position: str = "last", ignore_index: bool_t = False, key: ValueKeyFunc = None, - ): + ) -> NDFrameT | None: """ Sort by the values along either axis. @@ -5726,7 +5879,7 @@ def _check_inplace_setting(self, value) -> bool_t: return True @final - def _get_numeric_data(self): + def _get_numeric_data(self: NDFrameT) -> NDFrameT: return self._constructor(self._mgr.get_numeric_data()).__finalize__(self) @final @@ -5737,7 +5890,7 @@ def _get_bool_data(self): # Internal Interface Methods @property - def values(self) -> np.ndarray: + def values(self): raise AbstractMethodError(self) @property @@ -6521,11 +6674,13 @@ def fillna( if k not in result: continue downcast_k = downcast if not is_dict else downcast.get(k) - result[k] = result[k].fillna(v, limit=limit, downcast=downcast_k) + result.loc[:, k] = result[k].fillna( + v, limit=limit, downcast=downcast_k + ) return result if not inplace else None elif not is_list_like(value): - if not self._mgr.is_single_block and axis == 1: + if axis == 1: result = self.T.fillna(value=value, limit=limit).T @@ -6547,6 +6702,40 @@ def fillna( else: return result.__finalize__(self, method="fillna") + @overload + def ffill( + self: NDFrameT, + *, + axis: None | Axis = ..., + inplace: Literal[False] = ..., + limit: None | int = ..., + downcast=..., + ) -> NDFrameT: + ... + + @overload + def ffill( + self, + *, + axis: None | Axis = ..., + inplace: Literal[True], + limit: None | int = ..., + downcast=..., + ) -> None: + ... + + @overload + def ffill( + self: NDFrameT, + *, + axis: None | Axis = ..., + inplace: bool_t = ..., + limit: None | int = ..., + downcast=..., + ) -> NDFrameT | None: + ... + + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) @doc(klass=_shared_doc_kwargs["klass"]) def ffill( self: NDFrameT, @@ -6569,6 +6758,40 @@ def ffill( pad = ffill + @overload + def bfill( + self: NDFrameT, + *, + axis: None | Axis = ..., + inplace: Literal[False] = ..., + limit: None | int = ..., + downcast=..., + ) -> NDFrameT: + ... + + @overload + def bfill( + self, + *, + axis: None | Axis = ..., + inplace: Literal[True], + limit: None | int = ..., + downcast=..., + ) -> None: + ... + + @overload + def bfill( + self: NDFrameT, + *, + axis: None | Axis = ..., + inplace: bool_t = ..., + limit: None | int = ..., + downcast=..., + ) -> NDFrameT | None: + ... + + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) @doc(klass=_shared_doc_kwargs["klass"]) def bfill( self: NDFrameT, @@ -6591,6 +6814,48 @@ def bfill( backfill = bfill + @overload + def replace( + self: NDFrameT, + to_replace=..., + value=..., + *, + inplace: Literal[False] = ..., + limit: int | None = ..., + regex=..., + method: Literal["pad", "ffill", "bfill"] | lib.NoDefault = ..., + ) -> NDFrameT: + ... 
+ + @overload + def replace( + self, + to_replace=..., + value=..., + *, + inplace: Literal[True], + limit: int | None = ..., + regex=..., + method: Literal["pad", "ffill", "bfill"] | lib.NoDefault = ..., + ) -> None: + ... + + @overload + def replace( + self: NDFrameT, + to_replace=..., + value=..., + *, + inplace: bool_t = ..., + limit: int | None = ..., + regex=..., + method: Literal["pad", "ffill", "bfill"] | lib.NoDefault = ..., + ) -> NDFrameT | None: + ... + + @deprecate_nonkeyword_arguments( + version=None, allowed_args=["self", "to_replace", "value"] + ) @doc( _shared_docs["replace"], klass=_shared_doc_kwargs["klass"], @@ -6598,14 +6863,14 @@ def bfill( replace_iloc=_shared_doc_kwargs["replace_iloc"], ) def replace( - self, + self: NDFrameT, to_replace=None, value=lib.no_default, inplace: bool_t = False, limit: int | None = None, regex=False, - method=lib.no_default, - ): + method: Literal["pad", "ffill", "bfill"] | lib.NoDefault = lib.no_default, + ) -> NDFrameT | None: if not ( is_scalar(to_replace) or is_re_compilable(to_replace) @@ -6646,9 +6911,8 @@ def replace( args=(to_replace, method, inplace, limit), ) if inplace: - return + return None return result - self = cast("Series", self) return self._replace_single(to_replace, method, inplace, limit) if not is_dict_like(to_replace): @@ -6697,7 +6961,7 @@ def replace( # need a non-zero len on all axes if not self.size: if inplace: - return + return None return self.copy() if is_dict_like(to_replace): @@ -7808,7 +8072,7 @@ def between_time( end_time, include_start: bool_t | lib.NoDefault = lib.no_default, include_end: bool_t | lib.NoDefault = lib.no_default, - inclusive: IntervalClosedType | None = None, + inclusive: IntervalInclusiveType | None = None, axis=None, ) -> NDFrameT: """ @@ -7914,7 +8178,7 @@ def between_time( left = True if include_start is lib.no_default else include_start right = True if include_end is lib.no_default else include_end - inc_dict: dict[tuple[bool_t, bool_t], IntervalClosedType] = { + inc_dict: dict[tuple[bool_t, bool_t], IntervalInclusiveType] = { (True, True): "both", (True, False): "left", (False, True): "right", @@ -8685,6 +8949,15 @@ def ranker(data): ) if numeric_only: + if self.ndim == 1 and not is_numeric_dtype(self.dtype): + # GH#47500 + warnings.warn( + f"Calling Series.rank with numeric_only={numeric_only} and dtype " + f"{self.dtype} is deprecated and will raise a TypeError in a " + "future version of pandas", + category=FutureWarning, + stacklevel=find_stack_level(), + ) data = self._get_numeric_data() else: data = self @@ -8698,6 +8971,7 @@ def compare( align_axis: Axis = 1, keep_shape: bool_t = False, keep_equal: bool_t = False, + result_names: Suffixes = ("self", "other"), ): from pandas.core.reshape.concat import concat @@ -8708,7 +8982,6 @@ def compare( ) mask = ~((self == other) | (self.isna() & other.isna())) - keys = ["self", "other"] if not keep_equal: self = self.where(mask) @@ -8723,13 +8996,18 @@ def compare( else: self = self[mask] other = other[mask] + if not isinstance(result_names, tuple): + raise TypeError( + f"Passing 'result_names' as a {type(result_names)} is not " + "supported. Provide 'result_names' as a tuple instead." 
+ ) if align_axis in (1, "columns"): # This is needed for Series axis = 1 else: axis = self._get_axis_number(align_axis) - diff = concat([self, other], axis=axis, keys=keys) + diff = concat([self, other], axis=axis, keys=result_names) if axis >= self.ndim: # No need to reorganize data if stacking on new axis @@ -9109,7 +9387,6 @@ def _where( inplace=False, axis=None, level=None, - errors: IgnoreRaise = "raise", ): """ Equivalent to public method `where`, except that `other` is not @@ -9234,6 +9511,52 @@ def _where( result = self._constructor(new_data) return result.__finalize__(self) + @overload + def where( + self: NDFrameT, + cond, + other=..., + *, + inplace: Literal[False] = ..., + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> NDFrameT: + ... + + @overload + def where( + self, + cond, + other=..., + *, + inplace: Literal[True], + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> None: + ... + + @overload + def where( + self: NDFrameT, + cond, + other=..., + *, + inplace: bool_t = ..., + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> NDFrameT | None: + ... + + @deprecate_kwarg(old_arg_name="errors", new_arg_name=None) + @deprecate_nonkeyword_arguments( + version=None, allowed_args=["self", "cond", "other"] + ) @doc( klass=_shared_doc_kwargs["klass"], cond="True", @@ -9242,15 +9565,15 @@ def _where( name_other="mask", ) def where( - self, + self: NDFrameT, cond, other=np.nan, - inplace=False, + inplace: bool_t = False, axis=None, level=None, - errors: IgnoreRaise = "raise", + errors: IgnoreRaise | lib.NoDefault = "raise", try_cast=lib.no_default, - ): + ) -> NDFrameT | None: """ Replace values where the condition is {cond_rev}. @@ -9282,6 +9605,9 @@ def where( - 'raise' : allow exceptions to be raised. - 'ignore' : suppress exceptions. On error return original object. + .. deprecated:: 1.5.0 + This argument had no effect. + try_cast : bool, default None Try to cast the result back to the input type (if possible). @@ -9302,7 +9628,9 @@ def where( The {name} method is an application of the if-then idiom. For each element in the calling DataFrame, if ``cond`` is ``{cond}`` the element is used; otherwise the corresponding element from the DataFrame - ``other`` is used. + ``other`` is used. If the axis of ``other`` does not align with axis of + ``cond`` {klass}, the misaligned index positions will be filled with + {cond_rev}. The signature for :func:`DataFrame.where` differs from :func:`numpy.where`. Roughly ``df1.where(m, df2)`` is equivalent to @@ -9329,6 +9657,23 @@ def where( 4 NaN dtype: float64 + >>> s = pd.Series(range(5)) + >>> t = pd.Series([True, False]) + >>> s.where(t, 99) + 0 0 + 1 99 + 2 99 + 3 99 + 4 99 + dtype: int64 + >>> s.mask(t, 99) + 0 99 + 1 1 + 2 99 + 3 99 + 4 99 + dtype: int64 + >>> s.where(s > 1, 10) 0 10 1 10 @@ -9385,8 +9730,54 @@ def where( stacklevel=find_stack_level(), ) - return self._where(cond, other, inplace, axis, level, errors=errors) + return self._where(cond, other, inplace, axis, level) + @overload + def mask( + self: NDFrameT, + cond, + other=..., + *, + inplace: Literal[False] = ..., + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> NDFrameT: + ... + + @overload + def mask( + self, + cond, + other=..., + *, + inplace: Literal[True], + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> None: + ... 
+ + @overload + def mask( + self: NDFrameT, + cond, + other=..., + *, + inplace: bool_t = ..., + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> NDFrameT | None: + ... + + @deprecate_kwarg(old_arg_name="errors", new_arg_name=None) + @deprecate_nonkeyword_arguments( + version=None, allowed_args=["self", "cond", "other"] + ) @doc( where, klass=_shared_doc_kwargs["klass"], @@ -9396,15 +9787,15 @@ def where( name_other="where", ) def mask( - self, + self: NDFrameT, cond, other=np.nan, - inplace=False, + inplace: bool_t = False, axis=None, level=None, - errors: IgnoreRaise = "raise", + errors: IgnoreRaise | lib.NoDefault = "raise", try_cast=lib.no_default, - ): + ) -> NDFrameT | None: inplace = validate_bool_kwarg(inplace, "inplace") cond = com.apply_if_callable(cond, self) @@ -9427,7 +9818,6 @@ def mask( inplace=inplace, axis=axis, level=level, - errors=errors, ) @doc(klass=_shared_doc_kwargs["klass"]) @@ -10953,7 +11343,8 @@ def mad( data = self._get_numeric_data() if axis == 0: - demeaned = data - data.mean(axis=0) + # error: Unsupported operand types for - ("NDFrame" and "float") + demeaned = data - data.mean(axis=0) # type: ignore[operator] else: demeaned = data.sub(data.mean(axis=1), axis=0) return np.abs(demeaned).mean(axis=axis, skipna=skipna) @@ -10968,7 +11359,6 @@ def _add_numeric_operations(cls): @deprecate_nonkeyword_arguments( version=None, allowed_args=["self"], - stacklevel=find_stack_level() - 1, name="DataFrame.any and Series.any", ) @doc( @@ -11350,7 +11740,7 @@ def rolling( closed: str | None = None, step: int | None = None, method: str = "single", - ): + ) -> Window | Rolling: axis = self._get_axis_number(axis) if win_type is not None: @@ -11462,47 +11852,47 @@ def _inplace_method(self, other, op): ) return self - def __iadd__(self, other): + def __iadd__(self: NDFrameT, other) -> NDFrameT: # error: Unsupported left operand type for + ("Type[NDFrame]") return self._inplace_method(other, type(self).__add__) # type: ignore[operator] - def __isub__(self, other): + def __isub__(self: NDFrameT, other) -> NDFrameT: # error: Unsupported left operand type for - ("Type[NDFrame]") return self._inplace_method(other, type(self).__sub__) # type: ignore[operator] - def __imul__(self, other): + def __imul__(self: NDFrameT, other) -> NDFrameT: # error: Unsupported left operand type for * ("Type[NDFrame]") return self._inplace_method(other, type(self).__mul__) # type: ignore[operator] - def __itruediv__(self, other): + def __itruediv__(self: NDFrameT, other) -> NDFrameT: # error: Unsupported left operand type for / ("Type[NDFrame]") return self._inplace_method( other, type(self).__truediv__ # type: ignore[operator] ) - def __ifloordiv__(self, other): + def __ifloordiv__(self: NDFrameT, other) -> NDFrameT: # error: Unsupported left operand type for // ("Type[NDFrame]") return self._inplace_method( other, type(self).__floordiv__ # type: ignore[operator] ) - def __imod__(self, other): + def __imod__(self: NDFrameT, other) -> NDFrameT: # error: Unsupported left operand type for % ("Type[NDFrame]") return self._inplace_method(other, type(self).__mod__) # type: ignore[operator] - def __ipow__(self, other): + def __ipow__(self: NDFrameT, other) -> NDFrameT: # error: Unsupported left operand type for ** ("Type[NDFrame]") return self._inplace_method(other, type(self).__pow__) # type: ignore[operator] - def __iand__(self, other): + def __iand__(self: NDFrameT, other) -> NDFrameT: # error: Unsupported left operand type for & ("Type[NDFrame]") return 
self._inplace_method(other, type(self).__and__) # type: ignore[operator] - def __ior__(self, other): + def __ior__(self: NDFrameT, other) -> NDFrameT: # error: Unsupported left operand type for | ("Type[NDFrame]") return self._inplace_method(other, type(self).__or__) # type: ignore[operator] - def __ixor__(self, other): + def __ixor__(self: NDFrameT, other) -> NDFrameT: # error: Unsupported left operand type for ^ ("Type[NDFrame]") return self._inplace_method(other, type(self).__xor__) # type: ignore[operator] @@ -11809,7 +12199,7 @@ def _doc_params(cls): See Also -------- -core.window.Expanding.{accum_func_name} : Similar functionality +core.window.expanding.Expanding.{accum_func_name} : Similar functionality but ignores ``NaN`` values. {name2}.{accum_func_name} : Return the {desc} over {name2} axis. diff --git a/pandas/core/groupby/base.py b/pandas/core/groupby/base.py index ec9a2e4a4b5c0..ad1f36e0cddd8 100644 --- a/pandas/core/groupby/base.py +++ b/pandas/core/groupby/base.py @@ -6,7 +6,10 @@ from __future__ import annotations import dataclasses -from typing import Hashable +from typing import ( + Hashable, + Literal, +) @dataclasses.dataclass(order=True, frozen=True) @@ -92,7 +95,7 @@ class OutputKey: # TODO(2.0) Remove after pad/backfill deprecation enforced -def maybe_normalize_deprecated_kernels(kernel): +def maybe_normalize_deprecated_kernels(kernel) -> Literal["bfill", "ffill"]: if kernel == "backfill": kernel = "bfill" elif kernel == "pad": diff --git a/pandas/core/groupby/generic.py b/pandas/core/groupby/generic.py index 38b93c6be60f8..9e26598d85e74 100644 --- a/pandas/core/groupby/generic.py +++ b/pandas/core/groupby/generic.py @@ -603,7 +603,7 @@ def value_counts( ascending: bool = False, bins=None, dropna: bool = True, - ): + ) -> Series: from pandas.core.reshape.merge import get_join_indexers from pandas.core.reshape.tile import cut @@ -747,7 +747,7 @@ def build_codes(lev_codes: np.ndarray) -> np.ndarray: return self.obj._constructor(out, index=mi, name=self.obj.name) @doc(Series.nlargest) - def nlargest(self, n: int = 5, keep: str = "first"): + def nlargest(self, n: int = 5, keep: str = "first") -> Series: f = partial(Series.nlargest, n=n, keep=keep) data = self._obj_with_exclusions # Don't change behavior if result index happens to be the same, i.e. @@ -756,7 +756,7 @@ def nlargest(self, n: int = 5, keep: str = "first"): return result @doc(Series.nsmallest) - def nsmallest(self, n: int = 5, keep: str = "first"): + def nsmallest(self, n: int = 5, keep: str = "first") -> Series: f = partial(Series.nsmallest, n=n, keep=keep) data = self._obj_with_exclusions # Don't change behavior if result index happens to be the same, i.e. @@ -814,6 +814,14 @@ class DataFrameGroupBy(GroupBy[DataFrame]): 1 1 2 2 3 4 + User-defined function for aggregation + + >>> df.groupby('A').agg(lambda x: sum(x) + 2) + B C + A + 1 5 2.590715 + 2 9 2.704907 + Different aggregations per column >>> df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'}) @@ -1137,7 +1145,7 @@ def _cython_transform( ) -> DataFrame: assert axis == 0 # handled by caller # TODO: no tests with self.ndim == 1 for DataFrameGroupBy - numeric_only_bool = self._resolve_numeric_only(numeric_only, axis) + numeric_only_bool = self._resolve_numeric_only(how, numeric_only, axis) # With self.axis == 0, we have multi-block tests # e.g. 
test_rank_min_int, test_cython_transform_frame @@ -1592,7 +1600,7 @@ def idxmax( axis=0, skipna: bool = True, numeric_only: bool | lib.NoDefault = lib.no_default, - ): + ) -> DataFrame: axis = DataFrame._get_axis_number(axis) if numeric_only is lib.no_default: # Cannot use self._resolve_numeric_only; we must pass None to @@ -1602,17 +1610,20 @@ def idxmax( numeric_only_arg = numeric_only def func(df): - res = df._reduce( - nanops.nanargmax, - "argmax", - axis=axis, - skipna=skipna, - numeric_only=numeric_only_arg, - ) - indices = res._values - index = df._get_axis(axis) - result = [index[i] if i >= 0 else np.nan for i in indices] - return df._constructor_sliced(result, index=res.index) + with warnings.catch_warnings(): + # Suppress numeric_only warnings here, will warn below + warnings.filterwarnings("ignore", ".*numeric_only in DataFrame.argmax") + res = df._reduce( + nanops.nanargmax, + "argmax", + axis=axis, + skipna=skipna, + numeric_only=numeric_only_arg, + ) + indices = res._values + index = df._get_axis(axis) + result = [index[i] if i >= 0 else np.nan for i in indices] + return df._constructor_sliced(result, index=res.index) func.__name__ = "idxmax" result = self._python_apply_general(func, self._obj_with_exclusions) @@ -1628,7 +1639,7 @@ def idxmin( axis=0, skipna: bool = True, numeric_only: bool | lib.NoDefault = lib.no_default, - ): + ) -> DataFrame: axis = DataFrame._get_axis_number(axis) if numeric_only is lib.no_default: # Cannot use self._resolve_numeric_only; we must pass None to @@ -1638,17 +1649,20 @@ def idxmin( numeric_only_arg = numeric_only def func(df): - res = df._reduce( - nanops.nanargmin, - "argmin", - axis=axis, - skipna=skipna, - numeric_only=numeric_only_arg, - ) - indices = res._values - index = df._get_axis(axis) - result = [index[i] if i >= 0 else np.nan for i in indices] - return df._constructor_sliced(result, index=res.index) + with warnings.catch_warnings(): + # Suppress numeric_only warnings here, will warn below + warnings.filterwarnings("ignore", ".*numeric_only in DataFrame.argmin") + res = df._reduce( + nanops.nanargmin, + "argmin", + axis=axis, + skipna=skipna, + numeric_only=numeric_only_arg, + ) + indices = res._values + index = df._get_axis(axis) + result = [index[i] if i >= 0 else np.nan for i in indices] + return df._constructor_sliced(result, index=res.index) func.__name__ = "idxmin" result = self._python_apply_general(func, self._obj_with_exclusions) diff --git a/pandas/core/groupby/groupby.py b/pandas/core/groupby/groupby.py index c294082edce71..9b4991d32692b 100644 --- a/pandas/core/groupby/groupby.py +++ b/pandas/core/groupby/groupby.py @@ -18,6 +18,7 @@ class providing the base-class of operations. from textwrap import dedent import types from typing import ( + TYPE_CHECKING, Callable, Hashable, Iterable, @@ -122,6 +123,13 @@ class providing the base-class of operations. maybe_use_numba, ) +if TYPE_CHECKING: + from pandas.core.window import ( + ExpandingGroupby, + ExponentialMovingWindowGroupby, + RollingGroupby, + ) + _common_see_also = """ See Also -------- @@ -663,7 +671,7 @@ def ngroups(self) -> int: @final @property - def indices(self): + def indices(self) -> dict[Hashable, npt.NDArray[np.intp]]: """ Dict {group name -> group indices}. 
""" @@ -1016,7 +1024,7 @@ def curried(x): curried, self._obj_with_exclusions, is_transform=is_transform ) - if self._selected_obj.ndim != 1 and self.axis != 1: + if self._selected_obj.ndim != 1 and self.axis != 1 and result.ndim != 1: missing = self._obj_with_exclusions.columns.difference(result.columns) if len(missing) > 0: warn_dropping_nuisance_columns_deprecated( @@ -1291,7 +1299,7 @@ def _wrap_applied_output( raise AbstractMethodError(self) def _resolve_numeric_only( - self, numeric_only: bool | lib.NoDefault, axis: int + self, how: str, numeric_only: bool | lib.NoDefault, axis: int ) -> bool: """ Determine subclass-specific default value for 'numeric_only'. @@ -1328,6 +1336,20 @@ def _resolve_numeric_only( else: numeric_only = False + if numeric_only and self.obj.ndim == 1 and not is_numeric_dtype(self.obj.dtype): + # GH#47500 + how = "sum" if how == "add" else how + warnings.warn( + f"{type(self).__name__}.{how} called with " + f"numeric_only={numeric_only} and dtype {self.obj.dtype}. This will " + "raise a TypeError in a future version of pandas", + category=FutureWarning, + stacklevel=find_stack_level(), + ) + raise NotImplementedError( + f"{type(self).__name__}.{how} does not implement numeric_only" + ) + return numeric_only def _maybe_warn_numeric_only_depr( @@ -1704,7 +1726,7 @@ def _cython_agg_general( ): # Note: we never get here with how="ohlc" for DataFrameGroupBy; # that goes through SeriesGroupBy - numeric_only_bool = self._resolve_numeric_only(numeric_only, axis=0) + numeric_only_bool = self._resolve_numeric_only(how, numeric_only, axis=0) data = self._get_data_to_aggregate() is_ser = data.ndim == 1 @@ -1716,8 +1738,9 @@ def _cython_agg_general( kwd_name = "numeric_only" if how in ["any", "all"]: kwd_name = "bool_only" + kernel = "sum" if how == "add" else how raise NotImplementedError( - f"{type(self).__name__}.{how} does not implement {kwd_name}." + f"{type(self).__name__}.{kernel} does not implement {kwd_name}." ) elif not is_ser: data = data.get_numeric_data(copy=False) @@ -2099,7 +2122,7 @@ def mean( 2 4.0 Name: B, dtype: float64 """ - numeric_only_bool = self._resolve_numeric_only(numeric_only, axis=0) + numeric_only_bool = self._resolve_numeric_only("mean", numeric_only, axis=0) if maybe_use_numba(engine): from pandas.core._numba.kernels import sliding_mean @@ -2133,7 +2156,7 @@ def median(self, numeric_only: bool | lib.NoDefault = lib.no_default): Series or DataFrame Median of values within each group. 
""" - numeric_only_bool = self._resolve_numeric_only(numeric_only, axis=0) + numeric_only_bool = self._resolve_numeric_only("median", numeric_only, axis=0) result = self._cython_agg_general( "median", @@ -2194,10 +2217,21 @@ def std( return np.sqrt(self._numba_agg_general(sliding_var, engine_kwargs, ddof)) else: + # Resolve numeric_only so that var doesn't warn + numeric_only_bool = self._resolve_numeric_only("std", numeric_only, axis=0) + if ( + numeric_only_bool + and self.obj.ndim == 1 + and not is_numeric_dtype(self.obj.dtype) + ): + raise TypeError( + f"{type(self).__name__}.std called with " + f"numeric_only={numeric_only} and dtype {self.obj.dtype}" + ) result = self._get_cythonized_result( libgroupby.group_var, cython_dtype=np.dtype(np.float64), - numeric_only=numeric_only, + numeric_only=numeric_only_bool, needs_counts=True, post_processing=lambda vals, inference: np.sqrt(vals), ddof=ddof, @@ -2257,7 +2291,7 @@ def var( return self._numba_agg_general(sliding_var, engine_kwargs, ddof) else: - numeric_only_bool = self._resolve_numeric_only(numeric_only, axis=0) + numeric_only_bool = self._resolve_numeric_only("var", numeric_only, axis=0) if ddof == 1: return self._cython_agg_general( "var", @@ -2296,7 +2330,18 @@ def sem(self, ddof: int = 1, numeric_only: bool | lib.NoDefault = lib.no_default Series or DataFrame Standard error of the mean of values within each group. """ - result = self.std(ddof=ddof, numeric_only=numeric_only) + # Reolve numeric_only so that std doesn't warn + numeric_only_bool = self._resolve_numeric_only("sem", numeric_only, axis=0) + if ( + numeric_only_bool + and self.obj.ndim == 1 + and not is_numeric_dtype(self.obj.dtype) + ): + raise TypeError( + f"{type(self).__name__}.sem called with " + f"numeric_only={numeric_only} and dtype {self.obj.dtype}" + ) + result = self.std(ddof=ddof, numeric_only=numeric_only_bool) self._maybe_warn_numeric_only_depr("sem", result, numeric_only) if result.ndim == 1: @@ -2721,7 +2766,7 @@ def resample(self, rule, *args, **kwargs): @final @Substitution(name="groupby") @Appender(_common_see_also) - def rolling(self, *args, **kwargs): + def rolling(self, *args, **kwargs) -> RollingGroupby: """ Return a rolling grouper, providing rolling functionality per group. """ @@ -2738,7 +2783,7 @@ def rolling(self, *args, **kwargs): @final @Substitution(name="groupby") @Appender(_common_see_also) - def expanding(self, *args, **kwargs): + def expanding(self, *args, **kwargs) -> ExpandingGroupby: """ Return an expanding grouper, providing expanding functionality per group. @@ -2755,7 +2800,7 @@ def expanding(self, *args, **kwargs): @final @Substitution(name="groupby") @Appender(_common_see_also) - def ewm(self, *args, **kwargs): + def ewm(self, *args, **kwargs) -> ExponentialMovingWindowGroupby: """ Return an ewm grouper, providing ewm functionality per group. 
""" @@ -3166,7 +3211,16 @@ def quantile( a 2.0 b 3.0 """ - numeric_only_bool = self._resolve_numeric_only(numeric_only, axis=0) + numeric_only_bool = self._resolve_numeric_only("quantile", numeric_only, axis=0) + if ( + numeric_only_bool + and self.obj.ndim == 1 + and not is_numeric_dtype(self.obj.dtype) + ): + raise TypeError( + f"{type(self).__name__}.quantile called with " + f"numeric_only={numeric_only} and dtype {self.obj.dtype}" + ) def pre_processor(vals: ArrayLike) -> tuple[np.ndarray, np.dtype | None]: if is_object_dtype(vals): @@ -3438,7 +3492,7 @@ def rank( na_option: str = "keep", pct: bool = False, axis: int = 0, - ): + ) -> NDFrameT: """ Provide the rank of values within each group. @@ -3529,7 +3583,7 @@ def rank( @final @Substitution(name="groupby") @Appender(_common_see_also) - def cumprod(self, axis=0, *args, **kwargs): + def cumprod(self, axis=0, *args, **kwargs) -> NDFrameT: """ Cumulative product for each group. @@ -3547,7 +3601,7 @@ def cumprod(self, axis=0, *args, **kwargs): @final @Substitution(name="groupby") @Appender(_common_see_also) - def cumsum(self, axis=0, *args, **kwargs): + def cumsum(self, axis=0, *args, **kwargs) -> NDFrameT: """ Cumulative sum for each group. @@ -3565,7 +3619,7 @@ def cumsum(self, axis=0, *args, **kwargs): @final @Substitution(name="groupby") @Appender(_common_see_also) - def cummin(self, axis=0, numeric_only=False, **kwargs): + def cummin(self, axis=0, numeric_only=False, **kwargs) -> NDFrameT: """ Cumulative min for each group. @@ -3576,7 +3630,11 @@ def cummin(self, axis=0, numeric_only=False, **kwargs): skipna = kwargs.get("skipna", True) if axis != 0: f = lambda x: np.minimum.accumulate(x, axis) - return self._python_apply_general(f, self._selected_obj, is_transform=True) + numeric_only_bool = self._resolve_numeric_only("cummax", numeric_only, axis) + obj = self._selected_obj + if numeric_only_bool: + obj = obj._get_numeric_data() + return self._python_apply_general(f, obj, is_transform=True) return self._cython_transform( "cummin", numeric_only=numeric_only, skipna=skipna @@ -3585,7 +3643,7 @@ def cummin(self, axis=0, numeric_only=False, **kwargs): @final @Substitution(name="groupby") @Appender(_common_see_also) - def cummax(self, axis=0, numeric_only=False, **kwargs): + def cummax(self, axis=0, numeric_only=False, **kwargs) -> NDFrameT: """ Cumulative max for each group. 
@@ -3596,7 +3654,11 @@ def cummax(self, axis=0, numeric_only=False, **kwargs): skipna = kwargs.get("skipna", True) if axis != 0: f = lambda x: np.maximum.accumulate(x, axis) - return self._python_apply_general(f, self._selected_obj, is_transform=True) + numeric_only_bool = self._resolve_numeric_only("cummax", numeric_only, axis) + obj = self._selected_obj + if numeric_only_bool: + obj = obj._get_numeric_data() + return self._python_apply_general(f, obj, is_transform=True) return self._cython_transform( "cummax", numeric_only=numeric_only, skipna=skipna @@ -3654,7 +3716,8 @@ def _get_cythonized_result( ------- `Series` or `DataFrame` with filled values """ - numeric_only_bool = self._resolve_numeric_only(numeric_only, axis=0) + how = base_func.__name__ + numeric_only_bool = self._resolve_numeric_only(how, numeric_only, axis=0) if post_processing and not callable(post_processing): raise ValueError("'post_processing' must be a callable!") @@ -3665,7 +3728,6 @@ def _get_cythonized_result( ids, _, ngroups = grouper.group_info - how = base_func.__name__ base_func = partial(base_func, labels=ids) def blk_func(values: ArrayLike) -> ArrayLike: @@ -3875,7 +3937,7 @@ def pct_change(self, periods=1, fill_method="ffill", limit=None, freq=None, axis @final @Substitution(name="groupby") @Substitution(see_also=_common_see_also) - def head(self, n=5): + def head(self, n: int = 5) -> NDFrameT: """ Return first n rows of each group. @@ -3914,7 +3976,7 @@ def head(self, n=5): @final @Substitution(name="groupby") @Substitution(see_also=_common_see_also) - def tail(self, n=5): + def tail(self, n: int = 5) -> NDFrameT: """ Return last n rows of each group. diff --git a/pandas/core/groupby/grouper.py b/pandas/core/groupby/grouper.py index 05ef155ecbcda..b9f4166b475ca 100644 --- a/pandas/core/groupby/grouper.py +++ b/pandas/core/groupby/grouper.py @@ -679,15 +679,17 @@ def _codes_and_uniques(self) -> tuple[npt.NDArray[np.signedinteger], ArrayLike]: elif isinstance(self.grouping_vector, ops.BaseGrouper): # we have a list of groupers codes = self.grouping_vector.codes_info - uniques = self.grouping_vector.result_index._values + # error: Incompatible types in assignment (expression has type "Union + # [ExtensionArray, ndarray[Any, Any]]", variable has type "Categorical") + uniques = ( + self.grouping_vector.result_index._values # type: ignore[assignment] + ) else: - # GH35667, replace dropna=False with na_sentinel=None - if not self._dropna: - na_sentinel = None - else: - na_sentinel = -1 - codes, uniques = algorithms.factorize( - self.grouping_vector, sort=self._sort, na_sentinel=na_sentinel + # GH35667, replace dropna=False with use_na_sentinel=False + # error: Incompatible types in assignment (expression has type "Union[ + # ndarray[Any, Any], Index]", variable has type "Categorical") + codes, uniques = algorithms.factorize( # type: ignore[assignment] + self.grouping_vector, sort=self._sort, use_na_sentinel=self._dropna ) return codes, uniques @@ -835,7 +837,11 @@ def get_grouper( # if the actual grouper should be obj[key] def is_in_axis(key) -> bool: + if not _is_label_like(key): + if obj.ndim == 1: + return False + # items -> .columns for DataFrame, .index for Series items = obj.axes[-1] try: diff --git a/pandas/core/groupby/ops.py b/pandas/core/groupby/ops.py index 7f74c60c8e534..6dc4ccfa8e1ee 100644 --- a/pandas/core/groupby/ops.py +++ b/pandas/core/groupby/ops.py @@ -171,7 +171,7 @@ def _get_cython_function( f = getattr(libgroupby, ftype) if is_numeric: return f - elif dtype == object: + elif dtype == 
np.dtype(object): if how in ["median", "cumprod"]: # no fused types -> no __signatures__ raise NotImplementedError( @@ -735,7 +735,7 @@ def groupings(self) -> list[grouper.Grouping]: def shape(self) -> Shape: return tuple(ping.ngroups for ping in self.groupings) - def __iter__(self): + def __iter__(self) -> Iterator[Hashable]: return iter(self.indices) @property diff --git a/pandas/core/index.py b/pandas/core/index.py index 00ca6f9048a40..19e9c6b27e4e7 100644 --- a/pandas/core/index.py +++ b/pandas/core/index.py @@ -1,3 +1,6 @@ +# pyright: reportUnusedImport = false +from __future__ import annotations + import warnings from pandas.util._exceptions import find_stack_level @@ -30,3 +33,5 @@ FutureWarning, stacklevel=find_stack_level(), ) + +__all__: list[str] = [] diff --git a/pandas/core/indexers/utils.py b/pandas/core/indexers/utils.py index f098066d1c7d7..0f3cdc4195c85 100644 --- a/pandas/core/indexers/utils.py +++ b/pandas/core/indexers/utils.py @@ -240,7 +240,7 @@ def validate_indices(indices: np.ndarray, n: int) -> None: # Indexer Conversion -def maybe_convert_indices(indices, n: int, verify: bool = True): +def maybe_convert_indices(indices, n: int, verify: bool = True) -> np.ndarray: """ Attempt to convert indices into valid, positive indices. diff --git a/pandas/core/indexes/accessors.py b/pandas/core/indexes/accessors.py index 8694ad94dae26..46959aa5cd3e2 100644 --- a/pandas/core/indexes/accessors.py +++ b/pandas/core/indexes/accessors.py @@ -38,7 +38,10 @@ from pandas.core.indexes.timedeltas import TimedeltaIndex if TYPE_CHECKING: - from pandas import Series + from pandas import ( + DataFrame, + Series, + ) class Properties(PandasDelegate, PandasObject, NoNewAttributesMixin): @@ -241,7 +244,7 @@ def to_pydatetime(self) -> np.ndarray: def freq(self): return self._get_values().inferred_freq - def isocalendar(self): + def isocalendar(self) -> DataFrame: """ Calculate year, week, and day according to the ISO 8601 standard. diff --git a/pandas/core/indexes/api.py b/pandas/core/indexes/api.py index 1e740132e3464..b4f47f70c5a84 100644 --- a/pandas/core/indexes/api.py +++ b/pandas/core/indexes/api.py @@ -11,6 +11,7 @@ ) from pandas.errors import InvalidIndexError +from pandas.core.dtypes.cast import find_common_type from pandas.core.dtypes.common import is_dtype_equal from pandas.core.algorithms import safe_sort @@ -70,6 +71,7 @@ "get_unanimous_names", "all_indexes_same", "default_index", + "safe_sort_index", ] @@ -157,16 +159,7 @@ def _get_combined_index( index = ensure_index(index) if sort: - try: - array_sorted = safe_sort(index) - array_sorted = cast(np.ndarray, array_sorted) - if isinstance(index, MultiIndex): - index = MultiIndex.from_tuples(array_sorted, names=index.names) - else: - index = Index(array_sorted, name=index.name) - except TypeError: - pass - + index = safe_sort_index(index) # GH 29879 if copy: index = index.copy() @@ -174,6 +167,37 @@ def _get_combined_index( return index +def safe_sort_index(index: Index) -> Index: + """ + Returns the sorted index + + We keep the dtypes and the name attributes. 
+ + Parameters + ---------- + index : an Index + + Returns + ------- + Index + """ + if index.is_monotonic_increasing: + return index + + try: + array_sorted = safe_sort(index) + except TypeError: + pass + else: + array_sorted = cast(np.ndarray, array_sorted) + if isinstance(index, MultiIndex): + index = MultiIndex.from_tuples(array_sorted, names=index.names) + else: + index = Index(array_sorted, name=index.name, dtype=index.dtype) + + return index + + def union_indexes(indexes, sort: bool | None = True) -> Index: """ Return the union of indexes. @@ -200,7 +224,7 @@ def union_indexes(indexes, sort: bool | None = True) -> Index: indexes, kind = _sanitize_and_check(indexes) - def _unique_indices(inds) -> Index: + def _unique_indices(inds, dtype) -> Index: """ Convert indexes to lists and concatenate them, removing duplicates. @@ -209,6 +233,7 @@ def _unique_indices(inds) -> Index: Parameters ---------- inds : list of Index or list objects + dtype : dtype to set for the resulting Index Returns ------- @@ -220,7 +245,30 @@ def conv(i): i = i.tolist() return i - return Index(lib.fast_unique_multiple_list([conv(i) for i in inds], sort=sort)) + return Index( + lib.fast_unique_multiple_list([conv(i) for i in inds], sort=sort), + dtype=dtype, + ) + + def _find_common_index_dtype(inds): + """ + Finds a common type for the indexes to pass through to resulting index. + + Parameters + ---------- + inds: list of Index or list objects + + Returns + ------- + The common type or None if no indexes were given + """ + dtypes = [idx.dtype for idx in indexes if isinstance(idx, Index)] + if dtypes: + dtype = find_common_type(dtypes) + else: + dtype = None + + return dtype if kind == "special": result = indexes[0] @@ -260,16 +308,18 @@ def conv(i): return result elif kind == "array": + dtype = _find_common_index_dtype(indexes) index = indexes[0] if not all(index.equals(other) for other in indexes[1:]): - index = _unique_indices(indexes) + index = _unique_indices(indexes, dtype) name = get_unanimous_names(*indexes)[0] if name != index.name: index = index.rename(name) return index else: # kind='list' - return _unique_indices(indexes) + dtype = _find_common_index_dtype(indexes) + return _unique_indices(indexes, dtype) def _sanitize_and_check(indexes): diff --git a/pandas/core/indexes/base.py b/pandas/core/indexes/base.py index cb01cfc981739..a212da050e1f1 100644 --- a/pandas/core/indexes/base.py +++ b/pandas/core/indexes/base.py @@ -8,6 +8,7 @@ TYPE_CHECKING, Any, Callable, + ClassVar, Hashable, Iterable, Literal, @@ -403,9 +404,12 @@ def _outer_indexer( # associated code in pandas 2.0. _is_backward_compat_public_numeric_index: bool = False - _engine_type: type[libindex.IndexEngine] | type[ - libindex.ExtensionEngine - ] = libindex.ObjectEngine + @property + def _engine_type( + self, + ) -> type[libindex.IndexEngine] | type[libindex.ExtensionEngine]: + return libindex.ObjectEngine + # whether we support partial string indexing. 
Overridden # in DatetimeIndex and PeriodIndex _supports_partial_string_indexing = False @@ -1064,16 +1068,6 @@ def astype(self, dtype, copy: bool = True): # Ensure that self.astype(self.dtype) is self return self.copy() if copy else self - if ( - self.dtype == np.dtype("M8[ns]") - and isinstance(dtype, np.dtype) - and dtype.kind == "M" - and dtype != np.dtype("M8[ns]") - ): - # For now DatetimeArray supports this by unwrapping ndarray, - # but DatetimeIndex doesn't - raise TypeError(f"Cannot cast {type(self).__name__} to dtype") - values = self._data if isinstance(values, ExtensionArray): with rewrite_exception(type(values).__name__, type(self).__name__): @@ -1562,7 +1556,7 @@ def _summary(self, name=None) -> str_t: # -------------------------------------------------------------------- # Conversion Methods - def to_flat_index(self): + def to_flat_index(self: _IndexT) -> _IndexT: """ Identity method. @@ -1719,14 +1713,14 @@ def to_frame( # Name-Centric Methods @property - def name(self): + def name(self) -> Hashable: """ Return Index or MultiIndex name. """ return self._name @name.setter - def name(self, value: Hashable): + def name(self, value: Hashable) -> None: if self._no_setting_name: # Used in MultiIndex.levels to avoid silently ignoring name updates. raise RuntimeError( @@ -4834,11 +4828,12 @@ def _join_non_unique( right = other._values.take(right_idx) if isinstance(join_array, np.ndarray): - # Argument 3 to "putmask" has incompatible type "Union[ExtensionArray, - # ndarray[Any, Any]]"; expected "Union[_SupportsArray[dtype[Any]], - # _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, f - # loat, complex, str, bytes, _NestedSequence[Union[bool, int, float, - # complex, str, bytes]]]" [arg-type] + # error: Argument 3 to "putmask" has incompatible type + # "Union[ExtensionArray, ndarray[Any, Any]]"; expected + # "Union[_SupportsArray[dtype[Any]], _NestedSequence[ + # _SupportsArray[dtype[Any]]], bool, int, float, complex, + # str, bytes, _NestedSequence[Union[bool, int, float, + # complex, str, bytes]]]" np.putmask(join_array, mask, right) # type: ignore[arg-type] else: join_array._putmask(mask, right) @@ -5305,7 +5300,7 @@ def __contains__(self, key: Any) -> bool: # https://github.com/python/typeshed/issues/2148#issuecomment-520783318 # Incompatible types in assignment (expression has type "None", base class # "object" defined the type as "Callable[[object], int]") - __hash__: None # type: ignore[assignment] + __hash__: ClassVar[None] # type: ignore[assignment] @final def __setitem__(self, key, value): @@ -5351,10 +5346,11 @@ def __getitem__(self, key): if result.ndim > 1: deprecate_ndim_indexing(result) if hasattr(result, "_ndarray"): - # error: Item "ndarray[Any, Any]" of "Union[ExtensionArray, - # ndarray[Any, Any]]" has no attribute "_ndarray" [union-attr] # i.e. NDArrayBackedExtensionArray # Unpack to ndarray for MPL compat + # error: Item "ndarray[Any, Any]" of + # "Union[ExtensionArray, ndarray[Any, Any]]" + # has no attribute "_ndarray" return result._ndarray # type: ignore[union-attr] return result @@ -5955,7 +5951,7 @@ def _get_values_for_loc(self, series: Series, loc, key): return series.iloc[loc] @final - def set_value(self, arr, key, value): + def set_value(self, arr, key, value) -> None: """ Fast lookup of value from 1-dimensional ndarray. 
@@ -6893,14 +6889,15 @@ def insert(self, loc: int, item) -> Index: new_values = np.insert(arr, loc, casted) else: - # No overload variant of "insert" matches argument types - # "ndarray[Any, Any]", "int", "None" [call-overload] + # error: No overload variant of "insert" matches argument types + # "ndarray[Any, Any]", "int", "None" new_values = np.insert(arr, loc, None) # type: ignore[call-overload] loc = loc if loc >= 0 else loc - 1 new_values[loc] = item # Use self._constructor instead of Index to retain NumericIndex GH#43921 # TODO(2.0) can use Index instead of self._constructor + # Check if doing so fixes GH#47071 return self._constructor._with_infer(new_values, name=self.name) def drop( @@ -6955,7 +6952,7 @@ def _cmp_method(self, other, op): # TODO: should set MultiIndex._can_hold_na = False? arr[self.isna()] = False return arr - elif op in {operator.ne, operator.lt, operator.gt}: + elif op is operator.ne: arr = np.zeros(len(self), dtype=bool) if self._can_hold_na and not isinstance(self, ABCMultiIndex): arr[self.isna()] = True @@ -7016,16 +7013,16 @@ def _unary_method(self, op): result = op(self._values) return Index(result, name=self.name) - def __abs__(self): + def __abs__(self) -> Index: return self._unary_method(operator.abs) - def __neg__(self): + def __neg__(self) -> Index: return self._unary_method(operator.neg) - def __pos__(self): + def __pos__(self) -> Index: return self._unary_method(operator.pos) - def __invert__(self): + def __invert__(self) -> Index: # GH#8875 return self._unary_method(operator.inv) @@ -7139,7 +7136,7 @@ def _maybe_disable_logical_methods(self, opname: str_t) -> None: make_invalid_op(opname)(self) @Appender(IndexOpsMixin.argmin.__doc__) - def argmin(self, axis=None, skipna=True, *args, **kwargs): + def argmin(self, axis=None, skipna=True, *args, **kwargs) -> int: nv.validate_argmin(args, kwargs) nv.validate_minmax_axis(axis) @@ -7151,7 +7148,7 @@ def argmin(self, axis=None, skipna=True, *args, **kwargs): return super().argmin(skipna=skipna) @Appender(IndexOpsMixin.argmax.__doc__) - def argmax(self, axis=None, skipna=True, *args, **kwargs): + def argmax(self, axis=None, skipna=True, *args, **kwargs) -> int: nv.validate_argmax(args, kwargs) nv.validate_minmax_axis(axis) diff --git a/pandas/core/indexes/category.py b/pandas/core/indexes/category.py index c2bcd90ff10fb..c1ae3cb1b16ea 100644 --- a/pandas/core/indexes/category.py +++ b/pandas/core/indexes/category.py @@ -192,7 +192,7 @@ def _should_fallback_to_positional(self) -> bool: _values: Categorical @property - def _engine_type(self): + def _engine_type(self) -> type[libindex.IndexEngine]: # self.codes can have dtype int8, int16, int32 or int64, so we need # to return the corresponding engine type (libindex.Int8Engine, etc.). return { @@ -422,6 +422,7 @@ def reindex( stacklevel=find_stack_level(), ) + new_target: Index if len(self) and indexer is not None: new_target = self.take(indexer) else: @@ -434,8 +435,8 @@ def reindex( if not isinstance(target, CategoricalIndex) or (cats == -1).any(): new_target, indexer, _ = super()._reindex_non_unique(target) else: - - codes = new_target.codes.copy() + # error: "Index" has no attribute "codes" + codes = new_target.codes.copy() # type: ignore[attr-defined] codes[indexer == -1] = cats[missing] cat = self._data._from_backing_data(codes) new_target = type(self)._simple_new(cat, name=self.name) @@ -450,8 +451,8 @@ def reindex( new_target = type(self)._simple_new(cat, name=self.name) else: # e.g. 
test_reindex_with_categoricalindex, test_reindex_duplicate_target - new_target = np.asarray(new_target) - new_target = Index._with_infer(new_target, name=self.name) + new_target_array = np.asarray(new_target) + new_target = Index._with_infer(new_target_array, name=self.name) return new_target, indexer @@ -488,7 +489,7 @@ def _maybe_cast_listlike_indexer(self, values) -> CategoricalIndex: def _is_comparable_dtype(self, dtype: DtypeObj) -> bool: return self.categories._is_comparable_dtype(dtype) - def take_nd(self, *args, **kwargs): + def take_nd(self, *args, **kwargs) -> CategoricalIndex: """Alias for `take`""" warnings.warn( "CategoricalIndex.take_nd is deprecated, use CategoricalIndex.take " diff --git a/pandas/core/indexes/datetimelike.py b/pandas/core/indexes/datetimelike.py index 811dc72e9b908..8014d010afc1b 100644 --- a/pandas/core/indexes/datetimelike.py +++ b/pandas/core/indexes/datetimelike.py @@ -672,7 +672,7 @@ def _get_insert_freq(self, loc: int, item): return freq @doc(NDArrayBackedExtensionIndex.delete) - def delete(self, loc): + def delete(self, loc) -> DatetimeTimedeltaMixin: result = super().delete(loc) result._data._freq = self._get_delete_freq(loc) return result diff --git a/pandas/core/indexes/datetimes.py b/pandas/core/indexes/datetimes.py index e7b810dacdf57..3a7adb19f1c01 100644 --- a/pandas/core/indexes/datetimes.py +++ b/pandas/core/indexes/datetimes.py @@ -25,15 +25,18 @@ lib, ) from pandas._libs.tslibs import ( + BaseOffset, Resolution, + periods_per_day, timezones, to_offset, ) +from pandas._libs.tslibs.dtypes import NpyDatetimeUnit from pandas._libs.tslibs.offsets import prefix_mapping from pandas._typing import ( Dtype, DtypeObj, - IntervalClosedType, + IntervalInclusiveType, IntervalLeftRight, npt, ) @@ -44,9 +47,9 @@ from pandas.util._exceptions import find_stack_level from pandas.core.dtypes.common import ( - DT64NS_DTYPE, is_datetime64_dtype, is_datetime64tz_dtype, + is_dtype_equal, is_scalar, ) from pandas.core.dtypes.missing import is_valid_na_for_dtype @@ -250,9 +253,12 @@ class DatetimeIndex(DatetimeTimedeltaMixin): _typ = "datetimeindex" _data_cls = DatetimeArray - _engine_type = libindex.DatetimeEngine _supports_partial_string_indexing = True + @property + def _engine_type(self) -> type[libindex.DatetimeEngine]: + return libindex.DatetimeEngine + _data: DatetimeArray inferred_freq: str | None tz: tzinfo | None @@ -307,7 +313,7 @@ def isocalendar(self) -> DataFrame: def __new__( cls, data=None, - freq=lib.no_default, + freq: str | BaseOffset | lib.NoDefault = lib.no_default, tz=None, normalize: bool = False, closed=None, @@ -326,6 +332,30 @@ def __new__( name = maybe_extract_name(name, data, cls) + if ( + isinstance(data, DatetimeArray) + and freq is lib.no_default + and tz is None + and dtype is None + ): + # fastpath, similar logic in TimedeltaIndex.__new__; + # Note in this particular case we retain non-nano. 
+ if copy: + data = data.copy() + return cls._simple_new(data, name=name) + elif ( + isinstance(data, DatetimeArray) + and freq is lib.no_default + and tz is None + and is_dtype_equal(data.dtype, dtype) + ): + # Reached via Index.__new__ when we call .astype + # TODO(2.0): special casing can be removed once _from_sequence_not_strict + # no longer chokes on non-nano + if copy: + data = data.copy() + return cls._simple_new(data, name=name) + dtarr = DatetimeArray._from_sequence_not_strict( data, dtype=dtype, @@ -436,7 +466,7 @@ def _maybe_utc_convert(self, other: Index) -> tuple[DatetimeIndex, Index]: # -------------------------------------------------------------------- - def _get_time_micros(self) -> np.ndarray: + def _get_time_micros(self) -> npt.NDArray[np.int64]: """ Return the number of microseconds since midnight. @@ -446,8 +476,20 @@ def _get_time_micros(self) -> np.ndarray: """ values = self._data._local_timestamps() - nanos = values % (24 * 3600 * 1_000_000_000) - micros = nanos // 1000 + reso = self._data._reso + ppd = periods_per_day(reso) + + frac = values % ppd + if reso == NpyDatetimeUnit.NPY_FR_ns.value: + micros = frac // 1000 + elif reso == NpyDatetimeUnit.NPY_FR_us.value: + micros = frac + elif reso == NpyDatetimeUnit.NPY_FR_ms.value: + micros = frac * 1000 + elif reso == NpyDatetimeUnit.NPY_FR_s.value: + micros = frac * 1_000_000 + else: # pragma: no cover + raise NotImplementedError(reso) micros[self._isnan] = -1 return micros @@ -540,7 +582,7 @@ def snap(self, freq="S") -> DatetimeIndex: # Superdumb, punting on any optimizing freq = to_offset(freq) - snapped = np.empty(len(self), dtype=DT64NS_DTYPE) + dta = self._data.copy() for i, v in enumerate(self): s = v @@ -551,9 +593,8 @@ def snap(self, freq="S") -> DatetimeIndex: s = t0 else: s = t1 - snapped[i] = s + dta[i] = s - dta = DatetimeArray(snapped, dtype=self.dtype) return DatetimeIndex._simple_new(dta, name=self.name) # -------------------------------------------------------------------- @@ -883,7 +924,7 @@ def date_range( normalize: bool = False, name: Hashable = None, closed: Literal["left", "right"] | None | lib.NoDefault = lib.no_default, - inclusive: IntervalClosedType | None = None, + inclusive: IntervalInclusiveType | None = None, **kwargs, ) -> DatetimeIndex: """ @@ -1089,7 +1130,7 @@ def bdate_range( weekmask=None, holidays=None, closed: IntervalLeftRight | lib.NoDefault | None = lib.no_default, - inclusive: IntervalClosedType | None = None, + inclusive: IntervalInclusiveType | None = None, **kwargs, ) -> DatetimeIndex: """ diff --git a/pandas/core/indexes/frozen.py b/pandas/core/indexes/frozen.py index ed5cf047ab59f..90713e846fbd1 100644 --- a/pandas/core/indexes/frozen.py +++ b/pandas/core/indexes/frozen.py @@ -18,7 +18,7 @@ class FrozenList(PandasObject, list): """ Container that doesn't allow setting item *but* - because it's technically non-hashable, will be used + because it's technically hashable, will be used for lookups, appropriately, etc. 
""" @@ -89,7 +89,8 @@ def __mul__(self, other): def __reduce__(self): return type(self), (list(self),) - def __hash__(self): + # error: Signature of "__hash__" incompatible with supertype "list" + def __hash__(self) -> int: # type: ignore[override] return hash(tuple(self)) def _disabled(self, *args, **kwargs): diff --git a/pandas/core/indexes/interval.py b/pandas/core/indexes/interval.py index 11e2da47c5738..23f2e724e208c 100644 --- a/pandas/core/indexes/interval.py +++ b/pandas/core/indexes/interval.py @@ -11,6 +11,7 @@ Hashable, Literal, ) +import warnings import numpy as np @@ -19,7 +20,6 @@ Interval, IntervalMixin, IntervalTree, - _warning_interval, ) from pandas._libs.tslibs import ( BaseOffset, @@ -30,15 +30,19 @@ from pandas._typing import ( Dtype, DtypeObj, - IntervalClosedType, + IntervalInclusiveType, npt, ) from pandas.errors import InvalidIndexError from pandas.util._decorators import ( Appender, cache_readonly, + deprecate_kwarg, +) +from pandas.util._exceptions import ( + find_stack_level, + rewrite_exception, ) -from pandas.util._exceptions import rewrite_exception from pandas.core.dtypes.cast import ( find_common_type, @@ -149,7 +153,7 @@ def _new_IntervalIndex(cls, d): _interval_shared_docs["class"] % { "klass": "IntervalIndex", - "summary": "Immutable index of intervals that are closed on the same side.", + "summary": "Immutable index of intervals that are inclusive on the same side.", "name": _index_doc_kwargs["name"], "versionadded": "0.20.0", "extra_attributes": "is_overlapping\nvalues\n", @@ -175,7 +179,7 @@ def _new_IntervalIndex(cls, d): ), } ) -@inherit_names(["set_closed", "to_tuples"], IntervalArray, wrap=True) +@inherit_names(["set_closed", "set_inclusive", "to_tuples"], IntervalArray, wrap=True) @inherit_names( [ "__array__", @@ -194,7 +198,7 @@ class IntervalIndex(ExtensionIndex): _typ = "intervalindex" # annotate properties pinned via inherit_names - inclusive: IntervalClosedType + inclusive: IntervalInclusiveType is_non_overlapping_monotonic: bool closed_left: bool closed_right: bool @@ -209,19 +213,17 @@ class IntervalIndex(ExtensionIndex): # -------------------------------------------------------------------- # Constructors + @deprecate_kwarg(old_arg_name="closed", new_arg_name="inclusive") def __new__( cls, data, - inclusive=None, - closed: None | lib.NoDefault = lib.no_default, + inclusive: IntervalInclusiveType | None = None, dtype: Dtype | None = None, copy: bool = False, name: Hashable = None, verify_integrity: bool = True, ) -> IntervalIndex: - inclusive, closed = _warning_interval(inclusive, closed) - name = maybe_extract_name(name, data, cls) with rewrite_exception("IntervalArray", cls.__name__): @@ -235,6 +237,15 @@ def __new__( return cls._simple_new(array, name) + @property + def closed(self): + warnings.warn( + "Attribute `closed` is deprecated in favor of `inclusive`.", + FutureWarning, + stacklevel=find_stack_level(), + ) + return self.inclusive + @classmethod @Appender( _interval_shared_docs["from_breaks"] @@ -251,19 +262,18 @@ def __new__( ), } ) + @deprecate_kwarg(old_arg_name="closed", new_arg_name="inclusive") def from_breaks( cls, breaks, - inclusive=None, - closed: None | lib.NoDefault = lib.no_default, + inclusive: IntervalInclusiveType | None = None, name: Hashable = None, copy: bool = False, dtype: Dtype | None = None, ) -> IntervalIndex: - inclusive, closed = _warning_interval(inclusive, closed) if inclusive is None: - inclusive = "both" + inclusive = "right" with rewrite_exception("IntervalArray", cls.__name__): array = 
IntervalArray.from_breaks( @@ -287,20 +297,19 @@ def from_breaks( ), } ) + @deprecate_kwarg(old_arg_name="closed", new_arg_name="inclusive") def from_arrays( cls, left, right, - inclusive=None, - closed: None | lib.NoDefault = lib.no_default, + inclusive: IntervalInclusiveType | None = None, name: Hashable = None, copy: bool = False, dtype: Dtype | None = None, ) -> IntervalIndex: - inclusive, closed = _warning_interval(inclusive, closed) if inclusive is None: - inclusive = "both" + inclusive = "right" with rewrite_exception("IntervalArray", cls.__name__): array = IntervalArray.from_arrays( @@ -324,19 +333,18 @@ def from_arrays( ), } ) + @deprecate_kwarg(old_arg_name="closed", new_arg_name="inclusive") def from_tuples( cls, data, - inclusive=None, - closed: None | lib.NoDefault = lib.no_default, + inclusive: IntervalInclusiveType | None = None, name: Hashable = None, copy: bool = False, dtype: Dtype | None = None, ) -> IntervalIndex: - inclusive, closed = _warning_interval(inclusive, closed) if inclusive is None: - inclusive = "both" + inclusive = "right" with rewrite_exception("IntervalArray", cls.__name__): arr = IntervalArray.from_tuples( @@ -465,7 +473,7 @@ def is_overlapping(self) -> bool: >>> index.is_overlapping True - Intervals that share closed endpoints overlap: + Intervals that share inclusive endpoints overlap: >>> index = pd.interval_range(0, 3, inclusive='both') >>> index @@ -974,14 +982,14 @@ def _is_type_compatible(a, b) -> bool: ) +@deprecate_kwarg(old_arg_name="closed", new_arg_name="inclusive") def interval_range( start=None, end=None, periods=None, freq=None, name: Hashable = None, - closed: IntervalClosedType | lib.NoDefault = lib.no_default, - inclusive: IntervalClosedType | None = None, + inclusive: IntervalInclusiveType | None = None, ) -> IntervalIndex: """ Return a fixed frequency IntervalIndex. @@ -1000,6 +1008,10 @@ def interval_range( for numeric and 'D' for datetime-like. name : str, default None Name of the resulting IntervalIndex. + inclusive : {"both", "neither", "left", "right"}, default "both" + Include boundaries; Whether to set each bound as inclusive or not. + + .. versionadded:: 1.5.0 closed : {'left', 'right', 'both', 'neither'}, default 'right' Whether the intervals are closed on the left-side, right-side, both or neither. @@ -1007,10 +1019,6 @@ def interval_range( .. deprecated:: 1.5.0 Argument `closed` has been deprecated to standardize boundary inputs. Use `inclusive` instead, to set each bound as closed or open. - inclusive : {"both", "neither", "left", "right"}, default "both" - Include boundaries; Whether to set each bound as closed or open. - - .. versionadded:: 1.5.0 Returns ------- @@ -1018,7 +1026,7 @@ def interval_range( See Also -------- - IntervalIndex : An Index of intervals that are all closed on the same side. + IntervalIndex : An Index of intervals that are all inclusive on the same side. Notes ----- @@ -1071,15 +1079,14 @@ def interval_range( dtype='interval[float64, right]') The ``inclusive`` parameter specifies which endpoints of the individual - intervals within the ``IntervalIndex`` are closed. + intervals within the ``IntervalIndex`` are inclusive. 
>>> pd.interval_range(end=5, periods=4, inclusive='both') IntervalIndex([[1, 2], [2, 3], [3, 4], [4, 5]], dtype='interval[int64, both]') """ - inclusive, closed = _warning_interval(inclusive, closed) if inclusive is None: - inclusive = "both" + inclusive = "right" start = maybe_box_datetimelike(start) end = maybe_box_datetimelike(end) diff --git a/pandas/core/indexes/multi.py b/pandas/core/indexes/multi.py index 351cae6816ace..fd6b6ba63d7e0 100644 --- a/pandas/core/indexes/multi.py +++ b/pandas/core/indexes/multi.py @@ -308,7 +308,7 @@ def __new__( copy=False, name=None, verify_integrity: bool = True, - ): + ) -> MultiIndex: # compat with Index if name is not None: @@ -363,8 +363,9 @@ def _validate_codes(self, level: list, code: list): """ null_mask = isna(level) if np.any(null_mask): - # Incompatible types in assignment (expression has type - # "ndarray[Any, dtype[Any]]", variable has type "List[Any]") + # error: Incompatible types in assignment + # (expression has type "ndarray[Any, dtype[Any]]", + # variable has type "List[Any]") code = np.where(null_mask[code], -1, code) # type: ignore[assignment] return code @@ -502,7 +503,7 @@ def from_tuples( cls, tuples: Iterable[tuple[Hashable, ...]], sortorder: int | None = None, - names: Sequence[Hashable] | None = None, + names: Sequence[Hashable] | Hashable | None = None, ) -> MultiIndex: """ Convert list of tuples to MultiIndex. @@ -561,7 +562,9 @@ def from_tuples( if len(tuples) == 0: if names is None: raise TypeError("Cannot infer number of levels from empty list") - arrays = [[]] * len(names) + # error: Argument 1 to "len" has incompatible type "Hashable"; + # expected "Sized" + arrays = [[]] * len(names) # type: ignore[arg-type] elif isinstance(tuples, (np.ndarray, Index)): if isinstance(tuples, Index): tuples = np.asarray(tuples._values) @@ -577,7 +580,10 @@ def from_tuples( @classmethod def from_product( - cls, iterables, sortorder=None, names=lib.no_default + cls, + iterables: Sequence[Iterable[Hashable]], + sortorder: int | None = None, + names: Sequence[Hashable] | lib.NoDefault = lib.no_default, ) -> MultiIndex: """ Make a MultiIndex from the cartesian product of multiple iterables. 
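For orientation, a short sketch (not part of the diff) of how the `closed` -> `inclusive` rename in the interval hunks above surfaces to users. It assumes a build with this patch applied; released pandas versions may behave differently, and the exact warning text is whatever `deprecate_kwarg` produces.

```python
import warnings
import pandas as pd

# New keyword: spell out which bounds are inclusive.
idx = pd.interval_range(start=0, end=4, inclusive="right")
print(idx)

# Old keyword: `closed` is rerouted through deprecate_kwarg and warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pd.interval_range(start=0, end=4, closed="right")
print(caught[-1].category.__name__)  # FutureWarning
```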
@@ -763,7 +769,7 @@ def dtypes(self) -> Series: from pandas import Series names = com.fill_missing_names([level.name for level in self.levels]) - return Series([level.dtype for level in self.levels], index=names) + return Series([level.dtype for level in self.levels], index=Index(names)) def __len__(self) -> int: return len(self.codes[0]) @@ -1517,8 +1523,7 @@ def _get_grouper_for_level( return grouper, None, None values = self.get_level_values(level) - na_sentinel = -1 if dropna else None - codes, uniques = algos.factorize(values, sort=True, na_sentinel=na_sentinel) + codes, uniques = algos.factorize(values, sort=True, use_na_sentinel=dropna) assert isinstance(uniques, Index) if self.levels[level]._can_hold_na: @@ -1579,11 +1584,12 @@ def is_monotonic_increasing(self) -> bool: self._get_level_values(i)._values for i in reversed(range(len(self.levels))) ] try: - # Argument 1 to "lexsort" has incompatible type "List[Union[ExtensionArray, - # ndarray[Any, Any]]]"; expected "Union[_SupportsArray[dtype[Any]], + # error: Argument 1 to "lexsort" has incompatible type + # "List[Union[ExtensionArray, ndarray[Any, Any]]]"; + # expected "Union[_SupportsArray[dtype[Any]], # _NestedSequence[_SupportsArray[dtype[Any]]], bool, - # int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, - # complex, str, bytes]]]" [arg-type] + # int, float, complex, str, bytes, _NestedSequence[Union + # [bool, int, float, complex, str, bytes]]]" sort_order = np.lexsort(values) # type: ignore[arg-type] return Index(sort_order).is_monotonic_increasing except TypeError: @@ -1822,7 +1828,9 @@ def to_frame( result.index = self return result - def to_flat_index(self) -> Index: + # error: Return type "Index" of "to_flat_index" incompatible with return type + # "MultiIndex" in supertype "Index" + def to_flat_index(self) -> Index: # type: ignore[override] """ Convert a MultiIndex to an Index of Tuples containing the level values. @@ -2642,7 +2650,10 @@ def _get_indexer_level_0(self, target) -> npt.NDArray[np.intp]: return ci.get_indexer_for(target) def get_slice_bound( - self, label: Hashable | Sequence[Hashable], side: str, kind=lib.no_default + self, + label: Hashable | Sequence[Hashable], + side: Literal["left", "right"], + kind=lib.no_default, ) -> int: """ For an ordered MultiIndex, compute slice bound @@ -2757,7 +2768,7 @@ def slice_locs( # happens in get_slice_bound method), but it adds meaningful doc. 
return super().slice_locs(start, end, step) - def _partial_tup_index(self, tup: tuple, side="left"): + def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"): if len(tup) > self._lexsort_depth: raise UnsortedIndexError( f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth " diff --git a/pandas/core/indexes/numeric.py b/pandas/core/indexes/numeric.py index c1cb5ad315298..5731d476cef10 100644 --- a/pandas/core/indexes/numeric.py +++ b/pandas/core/indexes/numeric.py @@ -106,7 +106,7 @@ class NumericIndex(Index): } @property - def _engine_type(self): + def _engine_type(self) -> type[libindex.IndexEngine]: # error: Invalid index type "Union[dtype[Any], ExtensionDtype]" for # "Dict[dtype[Any], Type[IndexEngine]]"; expected type "dtype[Any]" return self._engine_types[self.dtype] # type: ignore[index] @@ -120,7 +120,9 @@ def inferred_type(self) -> str: "c": "complex", }[self.dtype.kind] - def __new__(cls, data=None, dtype: Dtype | None = None, copy=False, name=None): + def __new__( + cls, data=None, dtype: Dtype | None = None, copy=False, name=None + ) -> NumericIndex: name = maybe_extract_name(name, data, cls) subarr = cls._ensure_array(data, dtype, copy) @@ -371,10 +373,13 @@ class Int64Index(IntegerIndex): __doc__ = _num_index_shared_docs["class_descr"] % _index_descr_args _typ = "int64index" - _engine_type = libindex.Int64Engine _default_dtype = np.dtype(np.int64) _dtype_validation_metadata = (is_signed_integer_dtype, "signed integer") + @property + def _engine_type(self) -> type[libindex.Int64Engine]: + return libindex.Int64Engine + class UInt64Index(IntegerIndex): _index_descr_args = { @@ -386,10 +391,13 @@ class UInt64Index(IntegerIndex): __doc__ = _num_index_shared_docs["class_descr"] % _index_descr_args _typ = "uint64index" - _engine_type = libindex.UInt64Engine _default_dtype = np.dtype(np.uint64) _dtype_validation_metadata = (is_unsigned_integer_dtype, "unsigned integer") + @property + def _engine_type(self) -> type[libindex.UInt64Engine]: + return libindex.UInt64Engine + class Float64Index(NumericIndex): _index_descr_args = { @@ -401,7 +409,10 @@ class Float64Index(NumericIndex): __doc__ = _num_index_shared_docs["class_descr"] % _index_descr_args _typ = "float64index" - _engine_type = libindex.Float64Engine _default_dtype = np.dtype(np.float64) _dtype_validation_metadata = (is_float_dtype, "float") _is_backward_compat_public_numeric_index: bool = False + + @property + def _engine_type(self) -> type[libindex.Float64Engine]: + return libindex.Float64Engine diff --git a/pandas/core/indexes/period.py b/pandas/core/indexes/period.py index e3ab5e8624585..c034d9416eae7 100644 --- a/pandas/core/indexes/period.py +++ b/pandas/core/indexes/period.py @@ -159,9 +159,12 @@ class PeriodIndex(DatetimeIndexOpsMixin): dtype: PeriodDtype _data_cls = PeriodArray - _engine_type = libindex.PeriodEngine _supports_partial_string_indexing = True + @property + def _engine_type(self) -> type[libindex.PeriodEngine]: + return libindex.PeriodEngine + @cache_readonly # Signature of "_resolution_obj" incompatible with supertype "DatetimeIndexOpsMixin" def _resolution_obj(self) -> Resolution: # type: ignore[override] diff --git a/pandas/core/indexes/range.py b/pandas/core/indexes/range.py index fdb1ee754a7e6..376c98b6e176f 100644 --- a/pandas/core/indexes/range.py +++ b/pandas/core/indexes/range.py @@ -8,6 +8,7 @@ Any, Callable, Hashable, + Iterator, List, cast, ) @@ -19,6 +20,7 @@ index as libindex, lib, ) +from pandas._libs.algos import unique_deltas from pandas._libs.lib 
import no_default from pandas._typing import ( Dtype, @@ -43,6 +45,7 @@ from pandas.core.dtypes.generic import ABCTimedeltaIndex from pandas.core import ops +from pandas.core.algorithms import resolve_na_sentinel import pandas.core.common as com from pandas.core.construction import extract_array import pandas.core.indexes.base as ibase @@ -101,11 +104,14 @@ class RangeIndex(NumericIndex): """ _typ = "rangeindex" - _engine_type = libindex.Int64Engine _dtype_validation_metadata = (is_signed_integer_dtype, "signed integer") _range: range _is_backward_compat_public_numeric_index: bool = False + @property + def _engine_type(self) -> type[libindex.Int64Engine]: + return libindex.Int64Engine + # -------------------------------------------------------------------- # Constructors @@ -425,7 +431,7 @@ def tolist(self) -> list[int]: return list(self._range) @doc(Int64Index.__iter__) - def __iter__(self): + def __iter__(self) -> Iterator[int]: yield from self._range @doc(Int64Index._shallow_copy) @@ -434,7 +440,15 @@ def _shallow_copy(self, values, name: Hashable = no_default): if values.dtype.kind == "f": return Float64Index(values, name=name) - return Int64Index._simple_new(values, name=name) + # GH 46675 & 43885: If values is equally spaced, return a + # more memory-compact RangeIndex instead of Int64Index + unique_diffs = unique_deltas(values) + if len(unique_diffs) == 1 and unique_diffs[0] != 0: + diff = unique_diffs[0] + new_range = range(values[0], values[-1] + diff, diff) + return type(self)._simple_new(new_range, name=name) + else: + return Int64Index._simple_new(values, name=name) def _view(self: RangeIndex) -> RangeIndex: result = type(self)._simple_new(self._range, name=self._name) @@ -510,8 +524,13 @@ def argsort(self, *args, **kwargs) -> npt.NDArray[np.intp]: return result def factorize( - self, sort: bool = False, na_sentinel: int | None = -1 + self, + sort: bool = False, + na_sentinel: int | lib.NoDefault = lib.no_default, + use_na_sentinel: bool | lib.NoDefault = lib.no_default, ) -> tuple[npt.NDArray[np.intp], RangeIndex]: + # resolve to emit warning if appropriate + resolve_na_sentinel(na_sentinel, use_na_sentinel) codes = np.arange(len(self), dtype=np.intp) uniques = self if sort and self.step < 0: @@ -631,6 +650,17 @@ def _extended_gcd(self, a: int, b: int) -> tuple[int, int, int]: old_t, t = t, old_t - quotient * t return old_r, old_s, old_t + def _range_in_self(self, other: range) -> bool: + """Check if other range is contained in self""" + # https://stackoverflow.com/a/32481015 + if not other: + return True + if not self._range: + return False + if len(other) > 1 and other.step % self._range.step: + return False + return other.start in self._range and other[-1] in self._range + def _union(self, other: Index, sort): """ Form the union of two Index objects and sorts if possible @@ -640,10 +670,12 @@ def _union(self, other: Index, sort): other : Index or array-like sort : False or None, default None - Whether to sort resulting index. ``sort=None`` returns a - monotonically increasing ``RangeIndex`` if possible or a sorted - ``Int64Index`` if not. ``sort=False`` always returns an - unsorted ``Int64Index`` + Whether to sort (monotonically increasing) the resulting index. + ``sort=None`` returns a ``RangeIndex`` if possible or a sorted + ``Int64Index`` if not. + ``sort=False`` can return a ``RangeIndex`` if self is monotonically + increasing and other is fully contained in self. Otherwise, returns + an unsorted ``Int64Index`` .. 
versionadded:: 0.25.0 @@ -651,53 +683,58 @@ def _union(self, other: Index, sort): ------- union : Index """ - if isinstance(other, RangeIndex) and sort is None: - start_s, step_s = self.start, self.step - end_s = self.start + self.step * (len(self) - 1) - start_o, step_o = other.start, other.step - end_o = other.start + other.step * (len(other) - 1) - if self.step < 0: - start_s, step_s, end_s = end_s, -step_s, start_s - if other.step < 0: - start_o, step_o, end_o = end_o, -step_o, start_o - if len(self) == 1 and len(other) == 1: - step_s = step_o = abs(self.start - other.start) - elif len(self) == 1: - step_s = step_o - elif len(other) == 1: - step_o = step_s - start_r = min(start_s, start_o) - end_r = max(end_s, end_o) - if step_o == step_s: - if ( - (start_s - start_o) % step_s == 0 - and (start_s - end_o) <= step_s - and (start_o - end_s) <= step_s - ): - return type(self)(start_r, end_r + step_s, step_s) - if ( - (step_s % 2 == 0) - and (abs(start_s - start_o) == step_s / 2) - and (abs(end_s - end_o) == step_s / 2) - ): - # e.g. range(0, 10, 2) and range(1, 11, 2) - # but not range(0, 20, 4) and range(1, 21, 4) GH#44019 - return type(self)(start_r, end_r + step_s / 2, step_s / 2) - - elif step_o % step_s == 0: - if ( - (start_o - start_s) % step_s == 0 - and (start_o + step_s >= start_s) - and (end_o - step_s <= end_s) - ): - return type(self)(start_r, end_r + step_s, step_s) - elif step_s % step_o == 0: - if ( - (start_s - start_o) % step_o == 0 - and (start_s + step_o >= start_o) - and (end_s - step_o <= end_o) - ): - return type(self)(start_r, end_r + step_o, step_o) + if isinstance(other, RangeIndex): + if sort is None or ( + sort is False and self.step > 0 and self._range_in_self(other._range) + ): + # GH 47557: Can still return a RangeIndex + # if other range in self and sort=False + start_s, step_s = self.start, self.step + end_s = self.start + self.step * (len(self) - 1) + start_o, step_o = other.start, other.step + end_o = other.start + other.step * (len(other) - 1) + if self.step < 0: + start_s, step_s, end_s = end_s, -step_s, start_s + if other.step < 0: + start_o, step_o, end_o = end_o, -step_o, start_o + if len(self) == 1 and len(other) == 1: + step_s = step_o = abs(self.start - other.start) + elif len(self) == 1: + step_s = step_o + elif len(other) == 1: + step_o = step_s + start_r = min(start_s, start_o) + end_r = max(end_s, end_o) + if step_o == step_s: + if ( + (start_s - start_o) % step_s == 0 + and (start_s - end_o) <= step_s + and (start_o - end_s) <= step_s + ): + return type(self)(start_r, end_r + step_s, step_s) + if ( + (step_s % 2 == 0) + and (abs(start_s - start_o) == step_s / 2) + and (abs(end_s - end_o) == step_s / 2) + ): + # e.g. 
range(0, 10, 2) and range(1, 11, 2) + # but not range(0, 20, 4) and range(1, 21, 4) GH#44019 + return type(self)(start_r, end_r + step_s / 2, step_s / 2) + + elif step_o % step_s == 0: + if ( + (start_o - start_s) % step_s == 0 + and (start_o + step_s >= start_s) + and (end_o - step_s <= end_s) + ): + return type(self)(start_r, end_r + step_s, step_s) + elif step_s % step_o == 0: + if ( + (start_s - start_o) % step_o == 0 + and (start_s + step_o >= start_o) + and (end_s - step_o <= end_o) + ): + return type(self)(start_r, end_r + step_o, step_o) return super()._union(other, sort=sort) diff --git a/pandas/core/indexes/timedeltas.py b/pandas/core/indexes/timedeltas.py index 0249bf51f71b7..095c5d1b1ba03 100644 --- a/pandas/core/indexes/timedeltas.py +++ b/pandas/core/indexes/timedeltas.py @@ -101,7 +101,10 @@ class TimedeltaIndex(DatetimeTimedeltaMixin): _typ = "timedeltaindex" _data_cls = TimedeltaArray - _engine_type = libindex.TimedeltaEngine + + @property + def _engine_type(self) -> type[libindex.TimedeltaEngine]: + return libindex.TimedeltaEngine _data: TimedeltaArray @@ -132,6 +135,7 @@ def __new__( "represent unambiguous timedelta values durations." ) + # FIXME: need to check for dtype/data match if isinstance(data, TimedeltaArray) and freq is lib.no_default: if copy: data = data.copy() diff --git a/pandas/core/indexing.py b/pandas/core/indexing.py index 20ac0fedc28d1..fa1ad7ce3c874 100644 --- a/pandas/core/indexing.py +++ b/pandas/core/indexing.py @@ -5,6 +5,7 @@ TYPE_CHECKING, Hashable, Sequence, + TypeVar, cast, final, ) @@ -16,15 +17,20 @@ from pandas._libs.lib import item_from_zerodim from pandas.errors import ( AbstractMethodError, + IndexingError, InvalidIndexError, ) from pandas.util._decorators import doc from pandas.util._exceptions import find_stack_level -from pandas.core.dtypes.cast import can_hold_element +from pandas.core.dtypes.cast import ( + can_hold_element, + maybe_promote, +) from pandas.core.dtypes.common import ( is_array_like, is_bool_dtype, + is_extension_array_dtype, is_hashable, is_integer, is_iterator, @@ -41,7 +47,9 @@ ) from pandas.core.dtypes.missing import ( infer_fill_value, + is_valid_na_for_dtype, isna, + na_value_for_dtype, ) from pandas.core import algorithms as algos @@ -68,6 +76,8 @@ Series, ) +_LocationIndexerT = TypeVar("_LocationIndexerT", bound="_LocationIndexer") + # "null slice" _NS = slice(None, None) _one_ellipsis_message = "indexer may only contain one '...' entry" @@ -121,10 +131,6 @@ def __getitem__(self, arg): IndexSlice = _IndexSlice() -class IndexingError(Exception): - pass - - class IndexingMixin: """ Mixin for adding .loc/.iloc/.at/.iat to Dataframes and Series. @@ -524,6 +530,9 @@ def loc(self) -> _LocIndexer: sidewinder mark i 10 20 mark ii 1 4 viper mark ii 7 1 + + Please see the :ref:`user guide` + for more details and explanations of advanced indexing. 
""" return _LocIndexer("loc", self) @@ -649,7 +658,7 @@ class _LocationIndexer(NDFrameIndexerBase): _takeable: bool @final - def __call__(self, axis=None): + def __call__(self: _LocationIndexerT, axis=None) -> _LocationIndexerT: # we need to return a copy of ourselves new_self = type(self)(self.name, self.obj) @@ -793,7 +802,7 @@ def _ensure_listlike_indexer(self, key, axis=None, value=None): self.obj._mgr = self.obj._mgr.reindex_axis(keys, axis=0, only_slice=True) @final - def __setitem__(self, key, value): + def __setitem__(self, key, value) -> None: check_deprecated_indexers(key) if isinstance(key, tuple): key = tuple(list(x) if is_iterator(x) else x for x in key) @@ -1980,7 +1989,11 @@ def _setitem_single_column(self, loc: int, value, plane_indexer): # We will not operate in-place, but will attempt to in the future. # To determine whether we need to issue a FutureWarning, see if the # setting in-place would work, i.e. behavior will change. - warn = can_hold_element(orig_values, value) + if isinstance(value, ABCSeries): + warn = can_hold_element(orig_values, value._values) + else: + warn = can_hold_element(orig_values, value) + # Don't issue the warning yet, as we can still trim a few cases where # behavior will not change. @@ -1992,11 +2005,16 @@ def _setitem_single_column(self, loc: int, value, plane_indexer): if ( isinstance(new_values, np.ndarray) and isinstance(orig_values, np.ndarray) - and np.shares_memory(new_values, orig_values) + and ( + np.shares_memory(new_values, orig_values) + or new_values.shape != orig_values.shape + ) ): # TODO: get something like tm.shares_memory working? # The values were set inplace after all, no need to warn, # e.g. test_rename_nocopy + # In case of enlarging we can not set inplace, so need to + # warn either pass else: warnings.warn( @@ -2083,8 +2101,23 @@ def _setitem_with_indexer_missing(self, indexer, value): # We get only here with loc, so can hard code return self._setitem_with_indexer(new_indexer, value, "loc") - # this preserves dtype of the value - new_values = Series([value])._values + # this preserves dtype of the value and of the object + if is_valid_na_for_dtype(value, self.obj.dtype): + value = na_value_for_dtype(self.obj.dtype, compat=False) + new_dtype = maybe_promote(self.obj.dtype, value)[0] + elif isna(value): + new_dtype = None + elif not self.obj.empty and not is_object_dtype(self.obj.dtype): + # We should not cast, if we have object dtype because we can + # set timedeltas into object series + curr_dtype = self.obj.dtype + curr_dtype = getattr(curr_dtype, "numpy_dtype", curr_dtype) + new_dtype = maybe_promote(curr_dtype, value)[0] + else: + new_dtype = None + + new_values = Series([value], dtype=new_dtype)._values + if len(self.obj._values): # GH#22717 handle casting compatibility that np.concatenate # does incorrectly @@ -2338,7 +2371,7 @@ def __getitem__(self, key): key = self._convert_key(key) return self.obj._get_value(*key, takeable=self._takeable) - def __setitem__(self, key, value): + def __setitem__(self, key, value) -> None: if isinstance(key, tuple): key = tuple(com.apply_if_callable(x, self.obj) for x in key) else: @@ -2504,15 +2537,20 @@ def check_bool_indexer(index: Index, key) -> np.ndarray: """ result = key if isinstance(key, ABCSeries) and not key.index.equals(index): - result = result.reindex(index) - mask = isna(result._values) - if mask.any(): + indexer = result.index.get_indexer_for(index) + if -1 in indexer: raise IndexingError( "Unalignable boolean Series provided as " "indexer (index of the boolean Series 
and of " "the indexed object do not match)." ) - return result.astype(bool)._values + + result = result.take(indexer) + + # fall through for boolean + if not is_extension_array_dtype(result.dtype): + return result.astype(bool)._values + if is_object_dtype(key): # key might be object-dtype bool, check_array_indexer needs bool array result = np.asarray(result, dtype=bool) diff --git a/pandas/core/internals/__init__.py b/pandas/core/internals/__init__.py index 75715bdc90003..ea69b567611e4 100644 --- a/pandas/core/internals/__init__.py +++ b/pandas/core/internals/__init__.py @@ -23,7 +23,6 @@ __all__ = [ "Block", - "CategoricalBlock", "NumericBlock", "DatetimeTZBlock", "ExtensionBlock", diff --git a/pandas/core/internals/array_manager.py b/pandas/core/internals/array_manager.py index ee6c183898079..88f81064b826f 100644 --- a/pandas/core/internals/array_manager.py +++ b/pandas/core/internals/array_manager.py @@ -8,6 +8,7 @@ Any, Callable, Hashable, + Literal, TypeVar, ) @@ -170,13 +171,13 @@ def set_axis(self, axis: int, new_labels: Index) -> None: axis = self._normalize_axis(axis) self._axes[axis] = new_labels - def get_dtypes(self): + def get_dtypes(self) -> np.ndarray: return np.array([arr.dtype for arr in self.arrays], dtype="object") def __getstate__(self): return self.arrays, self._axes - def __setstate__(self, state): + def __setstate__(self, state) -> None: self.arrays = state[0] self._axes = state[1] @@ -347,7 +348,7 @@ def where(self: T, other, cond, align: bool) -> T: def setitem(self: T, indexer, value) -> T: return self.apply_with_block("setitem", indexer=indexer, value=value) - def putmask(self, mask, new, align: bool = True): + def putmask(self: T, mask, new, align: bool = True) -> T: if align: align_keys = ["new", "mask"] else: @@ -450,7 +451,7 @@ def replace_list( regex=regex, ) - def to_native_types(self, **kwargs): + def to_native_types(self: T, **kwargs) -> T: return self.apply(to_native_types, **kwargs) @property @@ -704,7 +705,9 @@ def _equal_values(self, other) -> bool: class ArrayManager(BaseArrayManager): - ndim = 2 + @property + def ndim(self) -> Literal[2]: + return 2 def __init__( self, @@ -812,7 +815,7 @@ def column_arrays(self) -> list[ArrayLike]: def iset( self, loc: int | slice | np.ndarray, value: ArrayLike, inplace: bool = False - ): + ) -> None: """ Set new column(s). @@ -920,7 +923,7 @@ def insert(self, loc: int, item: Hashable, value: ArrayLike) -> None: self.arrays = arrays self._axes[1] = new_axis - def idelete(self, indexer): + def idelete(self, indexer) -> ArrayManager: """ Delete selected locations in-place (new block and array, same BlockManager) """ @@ -1129,7 +1132,7 @@ def as_array( self, dtype=None, copy: bool = False, - na_value=lib.no_default, + na_value: object = lib.no_default, ) -> np.ndarray: """ Convert the blockmanager data into an numpy array. 
@@ -1191,7 +1194,9 @@ class SingleArrayManager(BaseArrayManager, SingleDataManager): arrays: list[np.ndarray | ExtensionArray] _axes: list[Index] - ndim = 1 + @property + def ndim(self) -> Literal[1]: + return 1 def __init__( self, @@ -1235,7 +1240,7 @@ def make_empty(self, axes=None) -> SingleArrayManager: return type(self)([array], axes) @classmethod - def from_array(cls, array, index): + def from_array(cls, array, index) -> SingleArrayManager: return cls([array], [index]) @property @@ -1300,7 +1305,7 @@ def apply(self, func, **kwargs): new_array = getattr(self.array, func)(**kwargs) return type(self)([new_array], self._axes) - def setitem(self, indexer, value): + def setitem(self, indexer, value) -> SingleArrayManager: """ Set values with indexer. @@ -1331,7 +1336,7 @@ def _get_data_subset(self, predicate: Callable) -> SingleArrayManager: else: return self.make_empty() - def set_values(self, values: ArrayLike): + def set_values(self, values: ArrayLike) -> None: """ Set (replace) the values of the SingleArrayManager in place. @@ -1367,7 +1372,7 @@ def __init__(self, n: int) -> None: self.n = n @property - def shape(self): + def shape(self) -> tuple[int]: return (self.n,) def to_array(self, dtype: DtypeObj) -> ArrayLike: diff --git a/pandas/core/internals/base.py b/pandas/core/internals/base.py index d8d1b6a34526c..ddc4495318568 100644 --- a/pandas/core/internals/base.py +++ b/pandas/core/internals/base.py @@ -5,6 +5,7 @@ from __future__ import annotations from typing import ( + Literal, TypeVar, final, ) @@ -155,7 +156,9 @@ def _consolidate_inplace(self) -> None: class SingleDataManager(DataManager): - ndim = 1 + @property + def ndim(self) -> Literal[1]: + return 1 @final @property diff --git a/pandas/core/internals/blocks.py b/pandas/core/internals/blocks.py index 421fac4ea767b..df327716970f1 100644 --- a/pandas/core/internals/blocks.py +++ b/pandas/core/internals/blocks.py @@ -215,7 +215,7 @@ def mgr_locs(self) -> BlockPlacement: return self._mgr_locs @mgr_locs.setter - def mgr_locs(self, new_mgr_locs: BlockPlacement): + def mgr_locs(self, new_mgr_locs: BlockPlacement) -> None: self._mgr_locs = new_mgr_locs @final @@ -504,7 +504,7 @@ def dtype(self) -> DtypeObj: @final def astype( self, dtype: DtypeObj, copy: bool = False, errors: IgnoreRaise = "raise" - ): + ) -> Block: """ Coerce to the new dtype. 
@@ -536,13 +536,13 @@ def astype( return newb @final - def to_native_types(self, na_rep="nan", quoting=None, **kwargs): + def to_native_types(self, na_rep="nan", quoting=None, **kwargs) -> Block: """convert to our native types format""" result = to_native_types(self.values, na_rep=na_rep, quoting=quoting, **kwargs) return self.make_block(result) @final - def copy(self, deep: bool = True): + def copy(self, deep: bool = True) -> Block: """copy constructor""" values = self.values if deep: @@ -575,7 +575,11 @@ def replace( if isinstance(values, Categorical): # TODO: avoid special-casing blk = self if inplace else self.copy() - blk.values._replace(to_replace=to_replace, value=value, inplace=True) + # error: Item "ExtensionArray" of "Union[ndarray[Any, Any], + # ExtensionArray]" has no attribute "_replace" + blk.values._replace( # type: ignore[union-attr] + to_replace=to_replace, value=value, inplace=True + ) return [blk] if not self._can_hold_element(to_replace): @@ -725,10 +729,13 @@ def replace_list( assert not isinstance(mib, bool) m = mib[blk_num : blk_num + 1] + # error: Argument "mask" to "_replace_coerce" of "Block" has + # incompatible type "Union[ExtensionArray, ndarray[Any, Any], bool]"; + # expected "ndarray[Any, dtype[bool_]]" result = blk._replace_coerce( to_replace=src, value=dest, - mask=m, + mask=m, # type: ignore[arg-type] inplace=inplace, regex=regex, ) @@ -815,7 +822,7 @@ def _unwrap_setitem_indexer(self, indexer): def shape(self) -> Shape: return self.values.shape - def iget(self, i: int | tuple[int, int] | tuple[slice, int]): + def iget(self, i: int | tuple[int, int] | tuple[slice, int]) -> np.ndarray: # In the case where we have a tuple[slice, int], the slice will always # be slice(None) # Note: only reached with self.ndim == 2 @@ -924,7 +931,7 @@ def _unstack( # --------------------------------------------------------------------- - def setitem(self, indexer, value): + def setitem(self, indexer, value) -> Block: """ Attempt self.values[indexer] = value, possibly creating a new array. @@ -2156,7 +2163,7 @@ def new_block(values, placement, *, ndim: int) -> Block: return klass(values, ndim=ndim, placement=placement) -def check_ndim(values, placement: BlockPlacement, ndim: int): +def check_ndim(values, placement: BlockPlacement, ndim: int) -> None: """ ndim inference and validation. 
@@ -2262,7 +2269,7 @@ def to_native_types( **kwargs, ) -> np.ndarray: """convert to our native types format""" - if isinstance(values, Categorical): + if isinstance(values, Categorical) and values.categories.dtype.kind in "Mm": # GH#40754 Convert categorical datetimes to datetime array values = algos.take_nd( values.categories._values, diff --git a/pandas/core/internals/concat.py b/pandas/core/internals/concat.py index 228d57fe196a4..77197dac3363b 100644 --- a/pandas/core/internals/concat.py +++ b/pandas/core/internals/concat.py @@ -1,5 +1,6 @@ from __future__ import annotations +import copy import itertools from typing import ( TYPE_CHECKING, @@ -13,6 +14,7 @@ NaT, internals as libinternals, ) +from pandas._libs.missing import NA from pandas._typing import ( ArrayLike, DtypeObj, @@ -29,17 +31,26 @@ is_1d_only_ea_dtype, is_datetime64tz_dtype, is_dtype_equal, + is_scalar, + needs_i8_conversion, ) from pandas.core.dtypes.concat import ( cast_to_common_type, concat_compat, ) from pandas.core.dtypes.dtypes import ExtensionDtype +from pandas.core.dtypes.missing import ( + is_valid_na_for_dtype, + isna, + isna_all, +) +import pandas.core.algorithms as algos from pandas.core.arrays import ( DatetimeArray, ExtensionArray, ) +from pandas.core.arrays.sparse import SparseDtype from pandas.core.construction import ensure_wrapped_if_datetimelike from pandas.core.internals.array_manager import ( ArrayManager, @@ -191,31 +202,19 @@ def concatenate_managers( if isinstance(mgrs_indexers[0][0], ArrayManager): return _concatenate_array_managers(mgrs_indexers, axes, concat_axis, copy) - # Assertions disabled for performance - # for tup in mgrs_indexers: - # # caller is responsible for ensuring this - # indexers = tup[1] - # assert concat_axis not in indexers - - if concat_axis == 0: - return _concat_managers_axis0(mgrs_indexers, axes, copy) - mgrs_indexers = _maybe_reindex_columns_na_proxy(axes, mgrs_indexers) - # Assertion disabled for performance - # assert all(not x[1] for x in mgrs_indexers) - - concat_plans = [_get_mgr_concatenation_plan(mgr) for mgr, _ in mgrs_indexers] - concat_plan = _combine_concat_plans(concat_plans) + concat_plans = [ + _get_mgr_concatenation_plan(mgr, indexers) for mgr, indexers in mgrs_indexers + ] + concat_plan = _combine_concat_plans(concat_plans, concat_axis) blocks = [] for placement, join_units in concat_plan: unit = join_units[0] blk = unit.block - # Assertion disabled for performance - # assert len(join_units) == len(mgrs_indexers) - if len(join_units) == 1: + if len(join_units) == 1 and not join_units[0].indexers: values = blk.values if copy: values = values.copy() @@ -239,7 +238,7 @@ def concatenate_managers( fastpath = blk.values.dtype == values.dtype else: - values = _concatenate_join_units(join_units, copy=copy) + values = _concatenate_join_units(join_units, concat_axis, copy=copy) fastpath = False if fastpath: @@ -252,42 +251,6 @@ def concatenate_managers( return BlockManager(tuple(blocks), axes) -def _concat_managers_axis0( - mgrs_indexers, axes: list[Index], copy: bool -) -> BlockManager: - """ - concat_managers specialized to concat_axis=0, with reindexing already - having been done in _maybe_reindex_columns_na_proxy. 
- """ - had_reindexers = { - i: len(mgrs_indexers[i][1]) > 0 for i in range(len(mgrs_indexers)) - } - mgrs_indexers = _maybe_reindex_columns_na_proxy(axes, mgrs_indexers) - - mgrs = [x[0] for x in mgrs_indexers] - - offset = 0 - blocks = [] - for i, mgr in enumerate(mgrs): - # If we already reindexed, then we definitely don't need another copy - made_copy = had_reindexers[i] - - for blk in mgr.blocks: - if made_copy: - nb = blk.copy(deep=False) - elif copy: - nb = blk.copy() - else: - # by slicing instead of copy(deep=False), we get a new array - # object, see test_concat_copy - nb = blk.getitem_block(slice(None)) - nb._mgr_locs = nb._mgr_locs.add(offset) - blocks.append(nb) - - offset += len(mgr.items) - return BlockManager(tuple(blocks), axes) - - def _maybe_reindex_columns_na_proxy( axes: list[Index], mgrs_indexers: list[tuple[BlockManager, dict[int, np.ndarray]]] ) -> list[tuple[BlockManager, dict[int, np.ndarray]]]: @@ -298,43 +261,54 @@ def _maybe_reindex_columns_na_proxy( Columns added in this reindexing have dtype=np.void, indicating they should be ignored when choosing a column's final dtype. """ - new_mgrs_indexers: list[tuple[BlockManager, dict[int, np.ndarray]]] = [] - + new_mgrs_indexers = [] for mgr, indexers in mgrs_indexers: - # For axis=0 (i.e. columns) we use_na_proxy and only_slice, so this - # is a cheap reindexing. - for i, indexer in indexers.items(): - mgr = mgr.reindex_indexer( - axes[i], - indexers[i], - axis=i, + # We only reindex for axis=0 (i.e. columns), as this can be done cheaply + if 0 in indexers: + new_mgr = mgr.reindex_indexer( + axes[0], + indexers[0], + axis=0, copy=False, - only_slice=True, # only relevant for i==0 + only_slice=True, allow_dups=True, - use_na_proxy=True, # only relevant for i==0 + use_na_proxy=True, ) - new_mgrs_indexers.append((mgr, {})) + new_indexers = indexers.copy() + del new_indexers[0] + new_mgrs_indexers.append((new_mgr, new_indexers)) + else: + new_mgrs_indexers.append((mgr, indexers)) return new_mgrs_indexers -def _get_mgr_concatenation_plan(mgr: BlockManager): +def _get_mgr_concatenation_plan(mgr: BlockManager, indexers: dict[int, np.ndarray]): """ - Construct concatenation plan for given block manager. + Construct concatenation plan for given block manager and indexers. Parameters ---------- mgr : BlockManager + indexers : dict of {axis: indexer} Returns ------- plan : list of (BlockPlacement, JoinUnit) tuples """ + # Calculate post-reindex shape , save for item axis which will be separate + # for each block anyway. 
+ mgr_shape_list = list(mgr.shape) + for ax, indexer in indexers.items(): + mgr_shape_list[ax] = len(indexer) + mgr_shape = tuple(mgr_shape_list) + + assert 0 not in indexers if mgr.is_single_block: blk = mgr.blocks[0] - return [(blk.mgr_locs, JoinUnit(blk))] + return [(blk.mgr_locs, JoinUnit(blk, mgr_shape, indexers))] blknos = mgr.blknos blklocs = mgr.blklocs @@ -342,9 +316,14 @@ def _get_mgr_concatenation_plan(mgr: BlockManager): plan = [] for blkno, placements in libinternals.get_blkno_placements(blknos, group=False): - # Assertions disabled for performance; these should always hold - # assert placements.is_slice_like - # assert blkno != -1 + assert placements.is_slice_like + assert blkno != -1 + + join_unit_indexers = indexers.copy() + + shape_list = list(mgr_shape) + shape_list[0] = len(placements) + shape = tuple(shape_list) blk = mgr.blocks[blkno] ax0_blk_indexer = blklocs[placements.indexer] @@ -366,15 +345,13 @@ def _get_mgr_concatenation_plan(mgr: BlockManager): ) ) - if not unit_no_ax0_reindexing: - # create block from subset of columns - # Note: Blocks with only 1 column will always have unit_no_ax0_reindexing, - # so we will never get here with ExtensionBlock. - blk = blk.getitem_block(ax0_blk_indexer) + # Omit indexer if no item reindexing is required. + if unit_no_ax0_reindexing: + join_unit_indexers.pop(0, None) + else: + join_unit_indexers[0] = ax0_blk_indexer - # Assertions disabled for performance - # assert blk._mgr_locs.as_slice == placements.as_slice - unit = JoinUnit(blk) + unit = JoinUnit(blk, shape, join_unit_indexers) plan.append((placements, unit)) @@ -382,69 +359,192 @@ def _get_mgr_concatenation_plan(mgr: BlockManager): class JoinUnit: - def __init__(self, block: Block) -> None: + def __init__(self, block: Block, shape: Shape, indexers=None): + # Passing shape explicitly is required for cases when block is None. + # Note: block is None implies indexers is None, but not vice-versa + if indexers is None: + indexers = {} self.block = block + self.indexers = indexers + self.shape = shape def __repr__(self) -> str: - return f"{type(self).__name__}({repr(self.block)})" + return f"{type(self).__name__}({repr(self.block)}, {self.indexers})" + + @cache_readonly + def needs_filling(self) -> bool: + for indexer in self.indexers.values(): + # FIXME: cache results of indexer == -1 checks. + if (indexer == -1).any(): + return True + + return False + + @cache_readonly + def dtype(self) -> DtypeObj: + blk = self.block + if blk.values.dtype.kind == "V": + raise AssertionError("Block is None, no dtype") + + if not self.needs_filling: + return blk.dtype + return ensure_dtype_can_hold_na(blk.dtype) + + def _is_valid_na_for(self, dtype: DtypeObj) -> bool: + """ + Check that we are all-NA of a type/dtype that is compatible with this dtype. + Augments `self.is_na` with an additional check of the type of NA values. + """ + if not self.is_na: + return False + if self.block.dtype.kind == "V": + return True + + if self.dtype == object: + values = self.block.values + return all(is_valid_na_for_dtype(x, dtype) for x in values.ravel(order="K")) + + na_value = self.block.fill_value + if na_value is NaT and not is_dtype_equal(self.dtype, dtype): + # e.g. we are dt64 and other is td64 + # fill_values match but we should not cast self.block.values to dtype + # TODO: this will need updating if we ever have non-nano dt64/td64 + return False + + if na_value is NA and needs_i8_conversion(dtype): + # FIXME: kludge; test_append_empty_frame_with_timedelta64ns_nat + # e.g. 
self.dtype == "Int64" and dtype is td64, we dont want + # to consider these as matching + return False + + # TODO: better to use can_hold_element? + return is_valid_na_for_dtype(na_value, dtype) @cache_readonly def is_na(self) -> bool: blk = self.block if blk.dtype.kind == "V": return True - return False - def get_reindexed_values(self, empty_dtype: DtypeObj) -> ArrayLike: - if self.is_na: - return make_na_array(empty_dtype, self.block.shape) + if not blk._can_hold_na: + return False + values = blk.values + if values.size == 0: + return True + if isinstance(values.dtype, SparseDtype): + return False + + if values.ndim == 1: + # TODO(EA2D): no need for special case with 2D EAs + val = values[0] + if not is_scalar(val) or not isna(val): + # ideally isna_all would do this short-circuiting + return False + return isna_all(values) + else: + val = values[0][0] + if not is_scalar(val) or not isna(val): + # ideally isna_all would do this short-circuiting + return False + return all(isna_all(row) for row in values) + + def get_reindexed_values(self, empty_dtype: DtypeObj, upcasted_na) -> ArrayLike: + values: ArrayLike + + if upcasted_na is None and self.block.dtype.kind != "V": + # No upcasting is necessary + fill_value = self.block.fill_value + values = self.block.get_values() else: - return self.block.values + fill_value = upcasted_na + + if self._is_valid_na_for(empty_dtype): + # note: always holds when self.block.dtype.kind == "V" + blk_dtype = self.block.dtype + + if blk_dtype == np.dtype("object"): + # we want to avoid filling with np.nan if we are + # using None; we already know that we are all + # nulls + values = self.block.values.ravel(order="K") + if len(values) and values[0] is None: + fill_value = None + + if is_datetime64tz_dtype(empty_dtype): + i8values = np.full(self.shape, fill_value.value) + return DatetimeArray(i8values, dtype=empty_dtype) + + elif is_1d_only_ea_dtype(empty_dtype): + empty_dtype = cast(ExtensionDtype, empty_dtype) + cls = empty_dtype.construct_array_type() + + missing_arr = cls._from_sequence([], dtype=empty_dtype) + ncols, nrows = self.shape + assert ncols == 1, ncols + empty_arr = -1 * np.ones((nrows,), dtype=np.intp) + return missing_arr.take( + empty_arr, allow_fill=True, fill_value=fill_value + ) + elif isinstance(empty_dtype, ExtensionDtype): + # TODO: no tests get here, a handful would if we disabled + # the dt64tz special-case above (which is faster) + cls = empty_dtype.construct_array_type() + missing_arr = cls._empty(shape=self.shape, dtype=empty_dtype) + missing_arr[:] = fill_value + return missing_arr + else: + # NB: we should never get here with empty_dtype integer or bool; + # if we did, the missing_arr.fill would cast to gibberish + missing_arr = np.empty(self.shape, dtype=empty_dtype) + missing_arr.fill(fill_value) + return missing_arr + + if (not self.indexers) and (not self.block._can_consolidate): + # preserve these for validation in concat_compat + return self.block.values + + if self.block.is_bool: + # External code requested filling/upcasting, bool values must + # be upcasted to object to avoid being upcasted to numeric. + values = self.block.astype(np.dtype("object")).values + else: + # No dtype upcasting is done here, it will be performed during + # concatenation itself. + values = self.block.values + if not self.indexers: + # If there's no indexing to be done, we want to signal outside + # code that this array must be copied explicitly. This is done + # by returning a view and checking `retval.base`. 
+ values = values.view() -def make_na_array(dtype: DtypeObj, shape: Shape) -> ArrayLike: - """ - Construct an np.ndarray or ExtensionArray of the given dtype and shape - holding all-NA values. - """ - if is_datetime64tz_dtype(dtype): - # NaT here is analogous to dtype.na_value below - i8values = np.full(shape, NaT.value) - return DatetimeArray(i8values, dtype=dtype) - - elif is_1d_only_ea_dtype(dtype): - dtype = cast(ExtensionDtype, dtype) - cls = dtype.construct_array_type() - - missing_arr = cls._from_sequence([], dtype=dtype) - nrows = shape[-1] - taker = -1 * np.ones((nrows,), dtype=np.intp) - return missing_arr.take(taker, allow_fill=True, fill_value=dtype.na_value) - elif isinstance(dtype, ExtensionDtype): - # TODO: no tests get here, a handful would if we disabled - # the dt64tz special-case above (which is faster) - cls = dtype.construct_array_type() - missing_arr = cls._empty(shape=shape, dtype=dtype) - missing_arr[:] = dtype.na_value - return missing_arr - else: - # NB: we should never get here with dtype integer or bool; - # if we did, the missing_arr.fill would cast to gibberish - missing_arr = np.empty(shape, dtype=dtype) - fill_value = _dtype_to_na_value(dtype) - missing_arr.fill(fill_value) - return missing_arr + else: + for ax, indexer in self.indexers.items(): + values = algos.take_nd(values, indexer, axis=ax) + + return values -def _concatenate_join_units(join_units: list[JoinUnit], copy: bool) -> ArrayLike: +def _concatenate_join_units( + join_units: list[JoinUnit], concat_axis: int, copy: bool +) -> ArrayLike: """ - Concatenate values from several join units along axis=1. + Concatenate values from several join units along selected axis. """ + if concat_axis == 0 and len(join_units) > 1: + # Concatenating join units along ax0 is handled in _merge_blocks. + raise AssertionError("Concatenating join units along axis0") empty_dtype = _get_empty_dtype(join_units) - to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype) for ju in join_units] + has_none_blocks = any(unit.block.dtype.kind == "V" for unit in join_units) + upcasted_na = _dtype_to_na_value(empty_dtype, has_none_blocks) + + to_concat = [ + ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na) + for ju in join_units + ] if len(to_concat) == 1: # Only one block, nothing to concatenate. @@ -476,12 +576,12 @@ def _concatenate_join_units(join_units: list[JoinUnit], copy: bool) -> ArrayLike concat_values = ensure_block_shape(concat_values, 2) else: - concat_values = concat_compat(to_concat, axis=1) + concat_values = concat_compat(to_concat, axis=concat_axis) return concat_values -def _dtype_to_na_value(dtype: DtypeObj): +def _dtype_to_na_value(dtype: DtypeObj, has_none_blocks: bool): """ Find the NA value to go with this dtype. 
""" @@ -495,6 +595,9 @@ def _dtype_to_na_value(dtype: DtypeObj): # different from missing.na_value_for_dtype return None elif dtype.kind in ["i", "u"]: + if not has_none_blocks: + # different from missing.na_value_for_dtype + return None return np.nan elif dtype.kind == "O": return np.nan @@ -519,12 +622,14 @@ def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj: empty_dtype = join_units[0].block.dtype return empty_dtype - needs_can_hold_na = any(unit.is_na for unit in join_units) + has_none_blocks = any(unit.block.dtype.kind == "V" for unit in join_units) - dtypes = [unit.block.dtype for unit in join_units if not unit.is_na] + dtypes = [unit.dtype for unit in join_units if not unit.is_na] + if not len(dtypes): + dtypes = [unit.dtype for unit in join_units if unit.block.dtype.kind != "V"] dtype = find_common_type(dtypes) - if needs_can_hold_na: + if has_none_blocks: dtype = ensure_dtype_can_hold_na(dtype) return dtype @@ -539,9 +644,6 @@ def _is_uniform_join_units(join_units: list[JoinUnit]) -> bool: first = join_units[0].block if first.dtype.kind == "V": return False - elif len(join_units) == 1: - # only use this path when there is something to concatenate - return False return ( # exclude cases where a) ju.block is None or b) we have e.g. Int64+int64 all(type(ju.block) is type(first) for ju in join_units) @@ -554,8 +656,16 @@ def _is_uniform_join_units(join_units: list[JoinUnit]) -> bool: or ju.block.dtype.kind in ["b", "i", "u"] for ju in join_units ) - # this also precludes any blocks with dtype.kind == "V", since - # we excluded that case for `first` above. + and + # no blocks that would get missing values (can lead to type upcasts) + # unless we're an extension dtype. + all(not ju.is_na or ju.block.is_extension for ju in join_units) + and + # no blocks with indexers (as then the dimensions do not fit) + all(not ju.indexers for ju in join_units) + and + # only use this path when there is something to concatenate + len(join_units) > 1 ) @@ -573,14 +683,28 @@ def _trim_join_unit(join_unit: JoinUnit, length: int) -> JoinUnit: Extra items that didn't fit are returned as a separate block. """ + if 0 not in join_unit.indexers: + extra_indexers = join_unit.indexers - extra_block = join_unit.block.getitem_block(slice(length, None)) - join_unit.block = join_unit.block.getitem_block(slice(length)) + if join_unit.block is None: + extra_block = None + else: + extra_block = join_unit.block.getitem_block(slice(length, None)) + join_unit.block = join_unit.block.getitem_block(slice(length)) + else: + extra_block = join_unit.block + + extra_indexers = copy.copy(join_unit.indexers) + extra_indexers[0] = extra_indexers[0][length:] + join_unit.indexers[0] = join_unit.indexers[0][:length] + + extra_shape = (join_unit.shape[0] - length,) + join_unit.shape[1:] + join_unit.shape = (length,) + join_unit.shape[1:] - return JoinUnit(block=extra_block) + return JoinUnit(block=extra_block, indexers=extra_indexers, shape=extra_shape) -def _combine_concat_plans(plans): +def _combine_concat_plans(plans, concat_axis: int): """ Combine multiple concatenation plans into one. 
@@ -590,6 +714,18 @@ def _combine_concat_plans(plans): for p in plans[0]: yield p[0], [p[1]] + elif concat_axis == 0: + offset = 0 + for plan in plans: + last_plc = None + + for plc, unit in plan: + yield plc.add(offset), [unit] + last_plc = plc + + if last_plc is not None: + offset += last_plc.as_slice.stop + else: # singleton list so we can modify it as a side-effect within _next_or_none num_ended = [0] diff --git a/pandas/core/internals/construction.py b/pandas/core/internals/construction.py index 7a5db56cb48fe..c1d0ab730fe7e 100644 --- a/pandas/core/internals/construction.py +++ b/pandas/core/internals/construction.py @@ -167,7 +167,7 @@ def rec_array_to_mgr( dtype: DtypeObj | None, copy: bool, typ: str, -): +) -> Manager: """ Extract from a masked rec array and create the manager. """ @@ -326,7 +326,7 @@ def ndarray_to_mgr( else: # by definition an array here # the dtypes will be coerced to a single dtype - values = _prep_ndarray(values, copy=copy_on_sanitize) + values = _prep_ndarraylike(values, copy=copy_on_sanitize) if dtype is not None and not is_dtype_equal(values.dtype, dtype): # GH#40110 see similar check inside sanitize_array @@ -341,7 +341,7 @@ def ndarray_to_mgr( allow_2d=True, ) - # _prep_ndarray ensures that values.ndim == 2 at this point + # _prep_ndarraylike ensures that values.ndim == 2 at this point index, columns = _get_axes( values.shape[0], values.shape[1], index=index, columns=columns ) @@ -537,15 +537,16 @@ def treat_as_nested(data) -> bool: # --------------------------------------------------------------------- -def _prep_ndarray(values, copy: bool = True) -> np.ndarray: +def _prep_ndarraylike( + values, copy: bool = True +) -> np.ndarray | DatetimeArray | TimedeltaArray: if isinstance(values, TimedeltaArray) or ( isinstance(values, DatetimeArray) and values.tz is None ): - # On older numpy, np.asarray below apparently does not call __array__, - # so nanoseconds get dropped. - values = values._ndarray + # By retaining DTA/TDA instead of unpacking, we end up retaining non-nano + pass - if not isinstance(values, (np.ndarray, ABCSeries, Index)): + elif not isinstance(values, (np.ndarray, ABCSeries, Index)): if len(values) == 0: return np.empty((0, 0), dtype=object) elif isinstance(values, range): diff --git a/pandas/core/internals/managers.py b/pandas/core/internals/managers.py index 7cccc9833de6b..435992f7d5cff 100644 --- a/pandas/core/internals/managers.py +++ b/pandas/core/internals/managers.py @@ -5,6 +5,7 @@ Any, Callable, Hashable, + Literal, Sequence, TypeVar, cast, @@ -142,7 +143,10 @@ class BaseBlockManager(DataManager): blocks: tuple[Block, ...] 
axes: list[Index] - ndim: int + @property + def ndim(self) -> int: + raise NotImplementedError + _known_consolidated: bool _is_consolidated: bool @@ -1678,7 +1682,10 @@ def _consolidate_inplace(self) -> None: class SingleBlockManager(BaseBlockManager, SingleDataManager): """manage a single block with""" - ndim = 1 + @property + def ndim(self) -> Literal[1]: + return 1 + _is_consolidated = True _known_consolidated = True __slots__ = () diff --git a/pandas/core/missing.py b/pandas/core/missing.py index b0bfbf13fbb2c..6005e11efbac4 100644 --- a/pandas/core/missing.py +++ b/pandas/core/missing.py @@ -104,7 +104,7 @@ def mask_missing(arr: ArrayLike, values_to_mask) -> npt.NDArray[np.bool_]: return mask -def clean_fill_method(method, allow_nearest: bool = False): +def clean_fill_method(method: str | None, allow_nearest: bool = False): # asfreq is compat for resampling if method in [None, "asfreq"]: return None @@ -333,14 +333,12 @@ def func(yvalues: np.ndarray) -> None: **kwargs, ) - # Argument 1 to "apply_along_axis" has incompatible type - # "Callable[[ndarray[Any, Any]], None]"; expected - # "Callable[..., Union[_SupportsArray[dtype[]], - # Sequence[_SupportsArray[dtype[ - # ]]], Sequence[Sequence[_SupportsArray[dtype[]]]], + # error: Argument 1 to "apply_along_axis" has incompatible type + # "Callable[[ndarray[Any, Any]], None]"; expected "Callable[..., + # Union[_SupportsArray[dtype[]], Sequence[_SupportsArray + # [dtype[]]], Sequence[Sequence[_SupportsArray[dtype[]]]], # Sequence[Sequence[Sequence[_SupportsArray[dtype[]]]]], # Sequence[Sequence[Sequence[Sequence[_SupportsArray[dtype[]]]]]]]]" - # interp each column independently np.apply_along_axis(func, axis, data) # type: ignore[arg-type] return @@ -779,22 +777,23 @@ def interpolate_2d( Modifies values in-place. 
""" if limit_area is not None: - # Argument 1 to "apply_along_axis" has incompatible type "partial[None]"; - # expected "Callable[..., Union[_SupportsArray[dtype[]], - # Sequence[_SupportsArray[dtype[]]], Sequence[Sequence - # [_SupportsArray[dtype[]]]], - # Sequence[Sequence[Sequence[_SupportsArray[dtype[]]]]], - # Sequence[Sequence[Sequence[Sequence[_SupportsArray[dtype[]]]]]]]]" - - # Argument 2 to "apply_along_axis" has incompatible type "Union[str, int]"; - # expected "SupportsIndex" [arg-type] np.apply_along_axis( - partial( + # error: Argument 1 to "apply_along_axis" has incompatible type + # "partial[None]"; expected + # "Callable[..., Union[_SupportsArray[dtype[]], + # Sequence[_SupportsArray[dtype[]]], + # Sequence[Sequence[_SupportsArray[dtype[]]]], + # Sequence[Sequence[Sequence[_SupportsArray[dtype[]]]]], + # Sequence[Sequence[Sequence[Sequence[_ + # SupportsArray[dtype[]]]]]]]]" + partial( # type: ignore[arg-type] _interpolate_with_limit_area, method=method, limit=limit, limit_area=limit_area, - ), # type: ignore[arg-type] + ), + # error: Argument 2 to "apply_along_axis" has incompatible type + # "Union[str, int]"; expected "SupportsIndex" axis, # type: ignore[arg-type] values, ) @@ -908,7 +907,7 @@ def get_fill_func(method, ndim: int = 1): return {"pad": _pad_2d, "backfill": _backfill_2d}[method] -def clean_reindex_fill_method(method): +def clean_reindex_fill_method(method) -> str | None: return clean_fill_method(method, allow_nearest=True) diff --git a/pandas/core/nanops.py b/pandas/core/nanops.py index a96fb9c8129dd..81766dc91f271 100644 --- a/pandas/core/nanops.py +++ b/pandas/core/nanops.py @@ -5,6 +5,7 @@ import operator from typing import ( Any, + Callable, cast, ) import warnings @@ -161,6 +162,10 @@ def f( def _bn_ok_dtype(dtype: DtypeObj, name: str) -> bool: # Bottleneck chokes on datetime64, PeriodDtype (or and EA) if not is_object_dtype(dtype) and not needs_i8_conversion(dtype): + # GH 42878 + # Bottleneck uses naive summation leading to O(n) loss of precision + # unlike numpy which implements pairwise summation, which has O(log(n)) loss + # crossref: https://github.com/pydata/bottleneck/issues/379 # GH 15507 # bottleneck does not properly upcast during the sum @@ -170,7 +175,7 @@ def _bn_ok_dtype(dtype: DtypeObj, name: str) -> bool: # further we also want to preserve NaN when all elements # are NaN, unlike bottleneck/numpy which consider this # to be 0 - return name not in ["nansum", "nanprod"] + return name not in ["nansum", "nanprod", "nanmean"] return False @@ -1527,7 +1532,7 @@ def _zero_out_fperr(arg): @disallow("M8", "m8") def nancorr( a: np.ndarray, b: np.ndarray, *, method="pearson", min_periods: int | None = None -): +) -> float: """ a, b: ndarrays """ @@ -1549,7 +1554,7 @@ def nancorr( return f(a, b) -def get_corr_func(method): +def get_corr_func(method) -> Callable[[np.ndarray, np.ndarray], float]: if method == "kendall": from scipy.stats import kendalltau @@ -1586,7 +1591,7 @@ def nancov( *, min_periods: int | None = None, ddof: int | None = 1, -): +) -> float: if len(a) != len(b): raise AssertionError("Operands to nancov must have same size") diff --git a/pandas/core/ops/__init__.py b/pandas/core/ops/__init__.py index 540a557f7c7cc..e9fefd9268870 100644 --- a/pandas/core/ops/__init__.py +++ b/pandas/core/ops/__init__.py @@ -11,7 +11,7 @@ import numpy as np -from pandas._libs.ops_dispatch import maybe_dispatch_ufunc_to_dunder_op # noqa:F401 +from pandas._libs.ops_dispatch import maybe_dispatch_ufunc_to_dunder_op from pandas._typing 
import Level from pandas.util._decorators import Appender from pandas.util._exceptions import find_stack_level @@ -30,7 +30,7 @@ algorithms, roperator, ) -from pandas.core.ops.array_ops import ( # noqa:F401 +from pandas.core.ops.array_ops import ( arithmetic_op, comp_method_OBJECT_ARRAY, comparison_op, @@ -38,7 +38,7 @@ logical_op, maybe_prepare_scalar_for_op, ) -from pandas.core.ops.common import ( # noqa:F401 +from pandas.core.ops.common import ( get_op_result_name, unpack_zerodim_and_defer, ) @@ -47,14 +47,14 @@ _op_descriptions, make_flex_doc, ) -from pandas.core.ops.invalid import invalid_comparison # noqa:F401 -from pandas.core.ops.mask_ops import ( # noqa: F401 +from pandas.core.ops.invalid import invalid_comparison +from pandas.core.ops.mask_ops import ( kleene_and, kleene_or, kleene_xor, ) -from pandas.core.ops.methods import add_flex_arithmetic_methods # noqa:F401 -from pandas.core.roperator import ( # noqa:F401 +from pandas.core.ops.methods import add_flex_arithmetic_methods +from pandas.core.roperator import ( radd, rand_, rdiv, @@ -473,3 +473,40 @@ def f(self, other, axis=default_axis, level=None): f.__name__ = op_name return f + + +__all__ = [ + "add_flex_arithmetic_methods", + "align_method_FRAME", + "align_method_SERIES", + "ARITHMETIC_BINOPS", + "arithmetic_op", + "COMPARISON_BINOPS", + "comparison_op", + "comp_method_OBJECT_ARRAY", + "fill_binop", + "flex_arith_method_FRAME", + "flex_comp_method_FRAME", + "flex_method_SERIES", + "frame_arith_method_with_reindex", + "invalid_comparison", + "kleene_and", + "kleene_or", + "kleene_xor", + "logical_op", + "maybe_dispatch_ufunc_to_dunder_op", + "radd", + "rand_", + "rdiv", + "rdivmod", + "rfloordiv", + "rmod", + "rmul", + "ror_", + "rpow", + "rsub", + "rtruediv", + "rxor", + "should_reindex_frame_op", + "unpack_zerodim_and_defer", +] diff --git a/pandas/core/ops/array_ops.py b/pandas/core/ops/array_ops.py index 2caaadbc05cff..6a1c586d90b6e 100644 --- a/pandas/core/ops/array_ops.py +++ b/pandas/core/ops/array_ops.py @@ -2,6 +2,8 @@ Functions for arithmetic and comparison operations on NumPy arrays and ExtensionArrays. """ +from __future__ import annotations + import datetime from functools import partial import operator diff --git a/pandas/core/ops/common.py b/pandas/core/ops/common.py index b883fe7751daa..f0e6aa3750cee 100644 --- a/pandas/core/ops/common.py +++ b/pandas/core/ops/common.py @@ -1,6 +1,8 @@ """ Boilerplate functions used in defining binary operations. """ +from __future__ import annotations + from functools import wraps from typing import Callable diff --git a/pandas/core/ops/dispatch.py b/pandas/core/ops/dispatch.py index bfd4afe0de86f..2f500703ccfb3 100644 --- a/pandas/core/ops/dispatch.py +++ b/pandas/core/ops/dispatch.py @@ -1,6 +1,8 @@ """ Functions for defining unary operations. """ +from __future__ import annotations + from typing import Any from pandas._typing import ArrayLike diff --git a/pandas/core/ops/invalid.py b/pandas/core/ops/invalid.py index cc4a1f11edd2b..eb27cf7450119 100644 --- a/pandas/core/ops/invalid.py +++ b/pandas/core/ops/invalid.py @@ -1,12 +1,14 @@ """ Templates for invalid operations. 
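Stepping back to the nanops change above (GH 42878): bottleneck accumulates left to right, so its rounding error grows roughly linearly with n, while numpy's pairwise summation keeps it near O(log n); that is why nanmean now stays on the numpy path. An illustrative sketch of the effect, not part of the patch:

import numpy as np

x = np.full(1_000_000, 0.1, dtype=np.float32)

pairwise = x.sum(dtype=np.float32)   # numpy uses pairwise summation
naive = np.float32(0.0)
for v in x:                          # naive left-to-right accumulation in float32
    naive += v

# The naive total typically drifts much further from the exact 100000
# than the pairwise result does.
print(pairwise, naive)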
""" +from __future__ import annotations + import operator import numpy as np -def invalid_comparison(left, right, op): +def invalid_comparison(left, right, op) -> np.ndarray: """ If a comparison has mismatched types and is not necessarily meaningful, follow python3 conventions by: diff --git a/pandas/core/ops/mask_ops.py b/pandas/core/ops/mask_ops.py index 57bacba0d4bee..adc1f63c568bf 100644 --- a/pandas/core/ops/mask_ops.py +++ b/pandas/core/ops/mask_ops.py @@ -184,6 +184,6 @@ def kleene_and( return result, mask -def raise_for_nan(value, method: str): +def raise_for_nan(value, method: str) -> None: if lib.is_float(value) and np.isnan(value): raise ValueError(f"Cannot perform logical '{method}' with floating NaN") diff --git a/pandas/core/ops/methods.py b/pandas/core/ops/methods.py index df22919ed19f1..e8a930083a778 100644 --- a/pandas/core/ops/methods.py +++ b/pandas/core/ops/methods.py @@ -1,6 +1,8 @@ """ Functions to generate methods and pin them to the appropriate classes. """ +from __future__ import annotations + import operator from pandas.core.dtypes.generic import ( @@ -43,7 +45,7 @@ def _get_method_wrappers(cls): return arith_flex, comp_flex -def add_flex_arithmetic_methods(cls): +def add_flex_arithmetic_methods(cls) -> None: """ Adds the full suite of flex arithmetic methods (``pow``, ``mul``, ``add``) to the class. diff --git a/pandas/core/ops/missing.py b/pandas/core/ops/missing.py index 8d5f7fb8de758..850ca44e996c4 100644 --- a/pandas/core/ops/missing.py +++ b/pandas/core/ops/missing.py @@ -21,6 +21,8 @@ 3) divmod behavior consistent with 1) and 2). """ +from __future__ import annotations + import operator import numpy as np diff --git a/pandas/core/resample.py b/pandas/core/resample.py index 0a62861cdaba7..917382544199a 100644 --- a/pandas/core/resample.py +++ b/pandas/core/resample.py @@ -205,7 +205,7 @@ def __getattr__(self, attr: str): # error: Signature of "obj" incompatible with supertype "BaseGroupBy" @property - def obj(self) -> NDFrameT: # type: ignore[override] + def obj(self) -> NDFrame: # type: ignore[override] # error: Incompatible return value type (got "Optional[Any]", # expected "NDFrameT") return self.groupby.obj # type: ignore[return-value] @@ -502,11 +502,11 @@ def _apply_loffset(self, result): self.loffset = None return result - def _get_resampler_for_grouping(self, groupby): + def _get_resampler_for_grouping(self, groupby, key=None): """ Return the correct class for resampling with groupby. """ - return self._resampler_for_grouping(self, groupby=groupby) + return self._resampler_for_grouping(self, groupby=groupby, key=key) def _wrap_result(self, result): """ @@ -937,7 +937,13 @@ def asfreq(self, fill_value=None): """ return self._upsample("asfreq", fill_value=fill_value) - def std(self, ddof=1, numeric_only: bool = False, *args, **kwargs): + def std( + self, + ddof=1, + numeric_only: bool | lib.NoDefault = lib.no_default, + *args, + **kwargs, + ): """ Compute standard deviation of groups, excluding missing values. @@ -958,7 +964,13 @@ def std(self, ddof=1, numeric_only: bool = False, *args, **kwargs): nv.validate_resampler_func("std", args, kwargs) return self._downsample("std", ddof=ddof, numeric_only=numeric_only) - def var(self, ddof=1, numeric_only: bool = False, *args, **kwargs): + def var( + self, + ddof=1, + numeric_only: bool | lib.NoDefault = lib.no_default, + *args, + **kwargs, + ): """ Compute variance of groups, excluding missing values. 
@@ -1132,7 +1144,7 @@ class _GroupByMixin(PandasObject): _attributes: list[str] # in practice the same as Resampler._attributes _selection: IndexLabel | None = None - def __init__(self, obj, parent=None, groupby=None, **kwargs) -> None: + def __init__(self, obj, parent=None, groupby=None, key=None, **kwargs) -> None: # reached via ._gotitem and _get_resampler_for_grouping if parent is None: @@ -1145,6 +1157,7 @@ def __init__(self, obj, parent=None, groupby=None, **kwargs) -> None: self._selection = kwargs.get("selection") self.binner = parent.binner + self.key = key self._groupby = groupby self._groupby.mutated = True @@ -1197,6 +1210,8 @@ def _gotitem(self, key, ndim, subset=None): # Try to select from a DataFrame, falling back to a Series try: + if isinstance(key, list) and self.key not in key: + key.append(self.key) groupby = self._groupby[key] except IndexError: groupby = self._groupby @@ -1491,7 +1506,9 @@ def _constructor(self): return TimedeltaIndexResampler -def get_resampler(obj, kind=None, **kwds): +def get_resampler( + obj, kind=None, **kwds +) -> DatetimeIndexResampler | PeriodIndexResampler | TimedeltaIndexResampler: """ Create a TimeGrouper and return our resampler. """ @@ -1511,7 +1528,7 @@ def get_resampler_for_grouping( # .resample uses 'on' similar to how .groupby uses 'key' tg = TimeGrouper(freq=rule, key=on, **kwargs) resampler = tg._get_resampler(groupby.obj, kind=kind) - return resampler._get_resampler_for_grouping(groupby=groupby) + return resampler._get_resampler_for_grouping(groupby=groupby, key=tg.key) class TimeGrouper(Grouper): diff --git a/pandas/core/reshape/api.py b/pandas/core/reshape/api.py index 7226c57cc27d8..b1884c497f0ad 100644 --- a/pandas/core/reshape/api.py +++ b/pandas/core/reshape/api.py @@ -1,7 +1,8 @@ -# flake8: noqa:F401 - from pandas.core.reshape.concat import concat -from pandas.core.reshape.encoding import get_dummies +from pandas.core.reshape.encoding import ( + from_dummies, + get_dummies, +) from pandas.core.reshape.melt import ( lreshape, melt, @@ -21,3 +22,20 @@ cut, qcut, ) + +__all__ = [ + "concat", + "crosstab", + "cut", + "from_dummies", + "get_dummies", + "lreshape", + "melt", + "merge", + "merge_asof", + "merge_ordered", + "pivot", + "pivot_table", + "qcut", + "wide_to_long", +] diff --git a/pandas/core/reshape/encoding.py b/pandas/core/reshape/encoding.py index f0500ec142955..fc908a5648885 100644 --- a/pandas/core/reshape/encoding.py +++ b/pandas/core/reshape/encoding.py @@ -1,6 +1,8 @@ from __future__ import annotations +from collections import defaultdict import itertools +from typing import Hashable import numpy as np @@ -68,6 +70,7 @@ def get_dummies( See Also -------- Series.str.get_dummies : Convert Series to dummy codes. + :func:`~pandas.from_dummies` : Convert dummy codes to categorical ``DataFrame``. Notes ----- @@ -316,3 +319,202 @@ def get_empty_frame(data) -> DataFrame: dummy_mat = dummy_mat[:, 1:] dummy_cols = dummy_cols[1:] return DataFrame(dummy_mat, index=index, columns=dummy_cols) + + +def from_dummies( + data: DataFrame, + sep: None | str = None, + default_category: None | Hashable | dict[str, Hashable] = None, +) -> DataFrame: + """ + Create a categorical ``DataFrame`` from a ``DataFrame`` of dummy variables. + + Inverts the operation performed by :func:`~pandas.get_dummies`. + + .. versionadded:: 1.5.0 + + Parameters + ---------- + data : DataFrame + Data which contains dummy-coded variables in form of integer columns of + 1's and 0's. 
+ sep : str, default None + Separator used in the column names of the dummy categories they are + character indicating the separation of the categorical names from the prefixes. + For example, if your column names are 'prefix_A' and 'prefix_B', + you can strip the underscore by specifying sep='_'. + default_category : None, Hashable or dict of Hashables, default None + The default category is the implied category when a value has none of the + listed categories specified with a one, i.e. if all dummies in a row are + zero. Can be a single value for all variables or a dict directly mapping + the default categories to a prefix of a variable. + + Returns + ------- + DataFrame + Categorical data decoded from the dummy input-data. + + Raises + ------ + ValueError + * When the input ``DataFrame`` ``data`` contains NA values. + * When the input ``DataFrame`` ``data`` contains column names with separators + that do not match the separator specified with ``sep``. + * When a ``dict`` passed to ``default_category`` does not include an implied + category for each prefix. + * When a value in ``data`` has more than one category assigned to it. + * When ``default_category=None`` and a value in ``data`` has no category + assigned to it. + TypeError + * When the input ``data`` is not of type ``DataFrame``. + * When the input ``DataFrame`` ``data`` contains non-dummy data. + * When the passed ``sep`` is of a wrong data type. + * When the passed ``default_category`` is of a wrong data type. + + See Also + -------- + :func:`~pandas.get_dummies` : Convert ``Series`` or ``DataFrame`` to dummy codes. + :class:`~pandas.Categorical` : Represent a categorical variable in classic. + + Notes + ----- + The columns of the passed dummy data should only include 1's and 0's, + or boolean values. + + Examples + -------- + >>> df = pd.DataFrame({"a": [1, 0, 0, 1], "b": [0, 1, 0, 0], + ... "c": [0, 0, 1, 0]}) + + >>> df + a b c + 0 1 0 0 + 1 0 1 0 + 2 0 0 1 + 3 1 0 0 + + >>> pd.from_dummies(df) + 0 a + 1 b + 2 c + 3 a + + >>> df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0], + ... "col2_a": [0, 1, 0], "col2_b": [1, 0, 0], + ... "col2_c": [0, 0, 1]}) + + >>> df + col1_a col1_b col2_a col2_b col2_c + 0 1 0 0 1 0 + 1 0 1 1 0 0 + 2 1 0 0 0 1 + + >>> pd.from_dummies(df, sep="_") + col1 col2 + 0 a b + 1 b a + 2 a c + + >>> df = pd.DataFrame({"col1_a": [1, 0, 0], "col1_b": [0, 1, 0], + ... "col2_a": [0, 1, 0], "col2_b": [1, 0, 0], + ... 
"col2_c": [0, 0, 0]}) + + >>> df + col1_a col1_b col2_a col2_b col2_c + 0 1 0 0 1 0 + 1 0 1 1 0 0 + 2 0 0 0 0 0 + + >>> pd.from_dummies(df, sep="_", default_category={"col1": "d", "col2": "e"}) + col1 col2 + 0 a b + 1 b a + 2 d e + """ + from pandas.core.reshape.concat import concat + + if not isinstance(data, DataFrame): + raise TypeError( + "Expected 'data' to be a 'DataFrame'; " + f"Received 'data' of type: {type(data).__name__}" + ) + + if data.isna().any().any(): + raise ValueError( + "Dummy DataFrame contains NA value in column: " + f"'{data.isna().any().idxmax()}'" + ) + + # index data with a list of all columns that are dummies + try: + data_to_decode = data.astype("boolean", copy=False) + except TypeError: + raise TypeError("Passed DataFrame contains non-dummy data") + + # collect prefixes and get lists to slice data for each prefix + variables_slice = defaultdict(list) + if sep is None: + variables_slice[""] = list(data.columns) + elif isinstance(sep, str): + for col in data_to_decode.columns: + prefix = col.split(sep)[0] + if len(prefix) == len(col): + raise ValueError(f"Separator not specified for column: {col}") + variables_slice[prefix].append(col) + else: + raise TypeError( + "Expected 'sep' to be of type 'str' or 'None'; " + f"Received 'sep' of type: {type(sep).__name__}" + ) + + if default_category is not None: + if isinstance(default_category, dict): + if not len(default_category) == len(variables_slice): + len_msg = ( + f"Length of 'default_category' ({len(default_category)}) " + f"did not match the length of the columns being encoded " + f"({len(variables_slice)})" + ) + raise ValueError(len_msg) + elif isinstance(default_category, Hashable): + default_category = dict( + zip(variables_slice, [default_category] * len(variables_slice)) + ) + else: + raise TypeError( + "Expected 'default_category' to be of type " + "'None', 'Hashable', or 'dict'; " + "Received 'default_category' of type: " + f"{type(default_category).__name__}" + ) + + cat_data = {} + for prefix, prefix_slice in variables_slice.items(): + if sep is None: + cats = prefix_slice.copy() + else: + cats = [col[len(prefix + sep) :] for col in prefix_slice] + assigned = data_to_decode.loc[:, prefix_slice].sum(axis=1) + if any(assigned > 1): + raise ValueError( + "Dummy DataFrame contains multi-assignment(s); " + f"First instance in row: {assigned.idxmax()}" + ) + elif any(assigned == 0): + if isinstance(default_category, dict): + cats.append(default_category[prefix]) + else: + raise ValueError( + "Dummy DataFrame contains unassigned value(s); " + f"First instance in row: {assigned.idxmin()}" + ) + data_slice = concat( + (data_to_decode.loc[:, prefix_slice], assigned == 0), axis=1 + ) + else: + data_slice = data_to_decode.loc[:, prefix_slice] + cats_array = np.array(cats, dtype="object") + # get indices of True entries along axis=1 + cat_data[prefix] = cats_array[data_slice.to_numpy().nonzero()[1]] + + return DataFrame(cat_data) diff --git a/pandas/core/reshape/melt.py b/pandas/core/reshape/melt.py index 262cd9774f694..5de9c8e2f4108 100644 --- a/pandas/core/reshape/melt.py +++ b/pandas/core/reshape/melt.py @@ -131,10 +131,14 @@ def melt( for col in id_vars: id_data = frame.pop(col) if is_extension_array_dtype(id_data): - id_data = concat([id_data] * K, ignore_index=True) + if K > 0: + id_data = concat([id_data] * K, ignore_index=True) + else: + # We can't concat empty list. 
(GH 46044) + id_data = type(id_data)([], name=id_data.name, dtype=id_data.dtype) else: - # Incompatible types in assignment (expression has type - # "ndarray[Any, dtype[Any]]", variable has type "Series") [assignment] + # error: Incompatible types in assignment (expression has type + # "ndarray[Any, dtype[Any]]", variable has type "Series") id_data = np.tile(id_data._values, K) # type: ignore[assignment] mdata[col] = id_data diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py index 4227d43c459d0..6ce5ffac9de52 100644 --- a/pandas/core/reshape/merge.py +++ b/pandas/core/reshape/merge.py @@ -1200,23 +1200,27 @@ def _maybe_coerce_merge_keys(self) -> None: # check whether ints and floats elif is_integer_dtype(rk.dtype) and is_float_dtype(lk.dtype): - if not (lk == lk.astype(rk.dtype))[~np.isnan(lk)].all(): - warnings.warn( - "You are merging on int and float " - "columns where the float values " - "are not equal to their int representation.", - UserWarning, - ) + # GH 47391 numpy > 1.24 will raise a RuntimeError for nan -> int + with np.errstate(invalid="ignore"): + if not (lk == lk.astype(rk.dtype))[~np.isnan(lk)].all(): + warnings.warn( + "You are merging on int and float " + "columns where the float values " + "are not equal to their int representation.", + UserWarning, + ) continue elif is_float_dtype(rk.dtype) and is_integer_dtype(lk.dtype): - if not (rk == rk.astype(lk.dtype))[~np.isnan(rk)].all(): - warnings.warn( - "You are merging on int and float " - "columns where the float values " - "are not equal to their int representation.", - UserWarning, - ) + # GH 47391 numpy > 1.24 will raise a RuntimeError for nan -> int + with np.errstate(invalid="ignore"): + if not (rk == rk.astype(lk.dtype))[~np.isnan(rk)].all(): + warnings.warn( + "You are merging on int and float " + "columns where the float values " + "are not equal to their int representation.", + UserWarning, + ) continue # let's infer and see if we are ok diff --git a/pandas/core/reshape/pivot.py b/pandas/core/reshape/pivot.py index 8c861c199169b..03aad0ef64dec 100644 --- a/pandas/core/reshape/pivot.py +++ b/pandas/core/reshape/pivot.py @@ -481,6 +481,7 @@ def pivot( columns_listlike = com.convert_to_list_like(columns) + indexed: DataFrame | Series if values is None: if index is not None: cols = com.convert_to_list_like(index) @@ -517,7 +518,10 @@ def pivot( ) else: indexed = data._constructor_sliced(data[values]._values, index=multiindex) - return indexed.unstack(columns_listlike) + # error: Argument 1 to "unstack" of "DataFrame" has incompatible type "Union + # [List[Any], ExtensionArray, ndarray[Any, Any], Index, Series]"; expected + # "Hashable" + return indexed.unstack(columns_listlike) # type: ignore[arg-type] def crosstab( diff --git a/pandas/core/reshape/reshape.py b/pandas/core/reshape/reshape.py index b4e944861f1bc..d4f4057af7bfd 100644 --- a/pandas/core/reshape/reshape.py +++ b/pandas/core/reshape/reshape.py @@ -152,7 +152,7 @@ def _indexer_and_to_sort( return indexer, to_sort @cache_readonly - def sorted_labels(self): + def sorted_labels(self) -> list[np.ndarray]: indexer, to_sort = self._indexer_and_to_sort return [line.take(indexer) for line in to_sort] @@ -199,7 +199,7 @@ def arange_result(self) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.bool_]]: return new_values, mask.any(0) # TODO: in all tests we have mask.any(0).all(); can we rely on that? 
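The merge hunks above keep the existing int/float mismatch warning but wrap the comparison in np.errstate so that numpy >= 1.24 does not additionally raise on the NaN-to-int cast (GH 47391). The warning itself is unchanged; an illustrative sketch with made-up data, not part of the patch:

import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": [1.0, 2.5, 3.0], "rval": [4, 5, 6]})

# Merging an int64 key against a float64 key whose values are not exact
# integers emits the "merging on int and float columns" UserWarning.
merged = pd.merge(left, right, on="key")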
- def get_result(self, values, value_columns, fill_value): + def get_result(self, values, value_columns, fill_value) -> DataFrame: if values.ndim == 1: values = values[:, np.newaxis] @@ -346,7 +346,7 @@ def _repeater(self) -> np.ndarray: return repeater @cache_readonly - def new_index(self): + def new_index(self) -> MultiIndex: # Does not depend on values or value_columns result_codes = [lab.take(self.compressor) for lab in self.sorted_labels[:-1]] diff --git a/pandas/core/reshape/util.py b/pandas/core/reshape/util.py index 9f9143f4aaa60..459928acc0da3 100644 --- a/pandas/core/reshape/util.py +++ b/pandas/core/reshape/util.py @@ -1,3 +1,5 @@ +from __future__ import annotations + import numpy as np from pandas._typing import NumpyIndexT @@ -5,7 +7,7 @@ from pandas.core.dtypes.common import is_list_like -def cartesian_product(X): +def cartesian_product(X) -> list[np.ndarray]: """ Numpy version of itertools.product. Sometimes faster (for large inputs)... diff --git a/pandas/core/roperator.py b/pandas/core/roperator.py index 15b16b6fa976a..2f320f4e9c6b9 100644 --- a/pandas/core/roperator.py +++ b/pandas/core/roperator.py @@ -2,6 +2,8 @@ Reversed Operations not available in the stdlib operator module. Defining these instead of using lambdas allows us to reference them by name. """ +from __future__ import annotations + import operator diff --git a/pandas/core/series.py b/pandas/core/series.py index d8ee7365120f7..67cdb5d8d72ab 100644 --- a/pandas/core/series.py +++ b/pandas/core/series.py @@ -38,11 +38,14 @@ Axis, Dtype, DtypeObj, + FilePath, FillnaOptions, IgnoreRaise, IndexKeyFunc, + IndexLabel, Level, NaPosition, + QuantileInterpolation, Renamer, SingleManager, SortKind, @@ -50,6 +53,7 @@ TimedeltaConvertibleTypes, TimestampConvertibleTypes, ValueKeyFunc, + WriteBuffer, npt, ) from pandas.compat.numpy import function as nv @@ -57,6 +61,7 @@ from pandas.util._decorators import ( Appender, Substitution, + deprecate_kwarg, deprecate_nonkeyword_arguments, doc, ) @@ -79,6 +84,7 @@ is_integer, is_iterator, is_list_like, + is_numeric_dtype, is_object_dtype, is_scalar, pandas_dtype, @@ -159,6 +165,7 @@ from pandas._typing import ( NumpySorter, NumpyValueArrayLike, + Suffixes, ) from pandas.core.frame import DataFrame @@ -931,7 +938,7 @@ def _take_with_is_copy(self, indices, axis=0) -> Series: """ return self.take(indices=indices, axis=axis) - def _ixs(self, i: int, axis: int = 0): + def _ixs(self, i: int, axis: int = 0) -> Any: """ Return the i-th value or values in the Series by location. @@ -1367,15 +1374,39 @@ def repeat(self, repeats, axis=None) -> Series: self, method="repeat" ) + @overload + def reset_index( + self, + level: Level = ..., + *, + drop: bool = ..., + name: Level = ..., + inplace: Literal[False] = ..., + allow_duplicates: bool = ..., + ) -> Series: + ... + + @overload + def reset_index( + self, + level: Level = ..., + *, + drop: bool = ..., + name: Level = ..., + inplace: Literal[True], + allow_duplicates: bool = ..., + ) -> None: + ... + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "level"]) def reset_index( self, - level=None, - drop=False, - name=lib.no_default, - inplace=False, + level: Level = None, + drop: bool = False, + name: Level = lib.no_default, + inplace: bool = False, allow_duplicates: bool = False, - ): + ) -> Series | None: """ Generate a new DataFrame or Series with the index reset. 
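The reset_index overloads added above encode the usual convention that inplace=True returns None while the default returns a new object; a small illustrative sketch (made-up data, not part of the patch):

import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

out = s.reset_index(drop=True)                 # returns a new Series
ret = s.reset_index(drop=True, inplace=True)   # mutates s, returns None
assert isinstance(out, pd.Series) and ret is None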
@@ -1491,11 +1522,14 @@ def reset_index( if drop: new_index = default_index(len(self)) if level is not None: + level_list: Sequence[Hashable] if not isinstance(level, (tuple, list)): - level = [level] - level = [self.index._get_level_number(lev) for lev in level] - if len(level) < self.index.nlevels: - new_index = self.index.droplevel(level) + level_list = [level] + else: + level_list = level + level_list = [self.index._get_level_number(lev) for lev in level_list] + if len(level_list) < self.index.nlevels: + new_index = self.index.droplevel(level_list) if inplace: self.index = new_index @@ -1517,9 +1551,12 @@ def reset_index( name = self.name df = self.to_frame(name) - return df.reset_index( + # error: Incompatible return value type (got "DataFrame", expected + # "Optional[Series]") + return df.reset_index( # type: ignore[return-value] level=level, drop=drop, allow_duplicates=allow_duplicates ) + return None # ---------------------------------------------------------------------- # Rendering Methods @@ -1531,19 +1568,51 @@ def __repr__(self) -> str: repr_params = fmt.get_series_repr_params() return self.to_string(**repr_params) + @overload def to_string( self, - buf=None, - na_rep="NaN", - float_format=None, - header=True, - index=True, + buf: None = ..., + na_rep: str = ..., + float_format: str | None = ..., + header: bool = ..., + index: bool = ..., + length=..., + dtype=..., + name=..., + max_rows: int | None = ..., + min_rows: int | None = ..., + ) -> str: + ... + + @overload + def to_string( + self, + buf: FilePath | WriteBuffer[str], + na_rep: str = ..., + float_format: str | None = ..., + header: bool = ..., + index: bool = ..., + length=..., + dtype=..., + name=..., + max_rows: int | None = ..., + min_rows: int | None = ..., + ) -> None: + ... + + def to_string( + self, + buf: FilePath | WriteBuffer[str] | None = None, + na_rep: str = "NaN", + float_format: str | None = None, + header: bool = True, + index: bool = True, length=False, dtype=False, name=False, - max_rows=None, - min_rows=None, - ): + max_rows: int | None = None, + min_rows: int | None = None, + ) -> str | None: """ Render a string representation of the Series. 
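Likewise, the to_string overloads distinguish the buf=None case, which returns the rendered string, from the case where a path or writable buffer is supplied, which writes and returns None. Illustrative sketch, not part of the patch:

import io
import pandas as pd

s = pd.Series([1.0, 2.0], name="x")

text = s.to_string()         # buf=None -> the rendered string is returned
buf = io.StringIO()
ret = s.to_string(buf=buf)   # writable buffer -> written to, returns None
assert isinstance(text, str) and ret is None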
@@ -1602,11 +1671,17 @@ def to_string( if buf is None: return result else: - try: - buf.write(result) - except AttributeError: - with open(buf, "w") as f: + if hasattr(buf, "write"): + # error: Item "str" of "Union[str, PathLike[str], WriteBuffer + # [str]]" has no attribute "write" + buf.write(result) # type: ignore[union-attr] + else: + # error: Argument 1 to "open" has incompatible type "Union[str, + # PathLike[str], WriteBuffer[str]]"; expected "Union[Union[str, + # bytes, PathLike[str], PathLike[bytes]], int]" + with open(buf, "w") as f: # type: ignore[arg-type] f.write(result) + return None @doc( klass=_shared_doc_kwargs["klass"], @@ -2023,8 +2098,8 @@ def count(self, level=None): lev = lev.insert(cnt, lev._na_value) obs = level_codes[notna(self._values)] - # Argument "minlength" to "bincount" has incompatible type "Optional[int]"; - # expected "SupportsIndex" [arg-type] + # error: Argument "minlength" to "bincount" has incompatible type + # "Optional[int]"; expected "SupportsIndex" out = np.bincount(obs, minlength=len(lev) or None) # type: ignore[arg-type] return self._constructor(out, index=lev, dtype="int64").__finalize__( self, method="count" @@ -2479,7 +2554,33 @@ def round(self, decimals=0, *args, **kwargs) -> Series: return result - def quantile(self, q=0.5, interpolation="linear"): + @overload + def quantile( + self, q: float = ..., interpolation: QuantileInterpolation = ... + ) -> float: + ... + + @overload + def quantile( + self, + q: Sequence[float] | AnyArrayLike, + interpolation: QuantileInterpolation = ..., + ) -> Series: + ... + + @overload + def quantile( + self, + q: float | Sequence[float] | AnyArrayLike = ..., + interpolation: QuantileInterpolation = ..., + ) -> float | Series: + ... + + def quantile( + self, + q: float | Sequence[float] | AnyArrayLike = 0.5, + interpolation: QuantileInterpolation = "linear", + ) -> float | Series: """ Return value at the given quantile. @@ -2894,7 +2995,7 @@ def searchsorted( # type: ignore[override] def append( self, to_append, ignore_index: bool = False, verify_integrity: bool = False - ): + ) -> Series: """ Concatenate two or more Series. @@ -3137,12 +3238,14 @@ def compare( align_axis: Axis = 1, keep_shape: bool = False, keep_equal: bool = False, + result_names: Suffixes = ("self", "other"), ) -> DataFrame | Series: return super().compare( other=other, align_axis=align_axis, keep_shape=keep_shape, keep_equal=keep_equal, + result_names=result_names, ) def combine(self, other, func, fill_value=None) -> Series: @@ -3372,17 +3475,47 @@ def update(self, other) -> None: # ---------------------------------------------------------------------- # Reindexing, sorting - @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + # error: Signature of "sort_values" incompatible with supertype "NDFrame" + @overload # type: ignore[override] def sort_values( self, - axis=0, - ascending: bool | int | Sequence[bool | int] = True, + *, + axis: Axis = ..., + ascending: bool | int | Sequence[bool] | Sequence[int] = ..., + inplace: Literal[False] = ..., + kind: str = ..., + na_position: str = ..., + ignore_index: bool = ..., + key: ValueKeyFunc = ..., + ) -> Series: + ... + + @overload + def sort_values( + self, + *, + axis: Axis = ..., + ascending: bool | int | Sequence[bool] | Sequence[int] = ..., + inplace: Literal[True], + kind: str = ..., + na_position: str = ..., + ignore_index: bool = ..., + key: ValueKeyFunc = ..., + ) -> None: + ... 
+ + # error: Signature of "sort_values" incompatible with supertype "NDFrame" + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + def sort_values( # type: ignore[override] + self, + axis: Axis = 0, + ascending: bool | int | Sequence[bool] | Sequence[int] = True, inplace: bool = False, kind: str = "quicksort", na_position: str = "last", ignore_index: bool = False, key: ValueKeyFunc = None, - ): + ) -> Series | None: """ Sort by the values. @@ -3576,10 +3709,10 @@ def sort_values( if ignore_index: result.index = default_index(len(sorted_index)) - if inplace: - self._update_inplace(result) - else: + if not inplace: return result.__finalize__(self, method="sort_values") + self._update_inplace(result) + return None @overload def sort_index( @@ -4589,10 +4722,17 @@ def _reduce( else: # dispatch to numpy arrays - if numeric_only: + if numeric_only and not is_numeric_dtype(self.dtype): kwd_name = "numeric_only" if name in ["any", "all"]: kwd_name = "bool_only" + # GH#47500 - change to TypeError to match other methods + warnings.warn( + f"Calling Series.{name} with {kwd_name}={numeric_only} and " + f"dtype {self.dtype} will raise a TypeError in the future", + FutureWarning, + stacklevel=find_stack_level(), + ) raise NotImplementedError( f"Series.{name} does not implement {kwd_name}." ) @@ -4784,22 +4924,21 @@ def rename( @overload def set_axis( - self, labels, axis: Axis = ..., inplace: Literal[False] = ... + self, labels, *, axis: Axis = ..., inplace: Literal[False] = ... ) -> Series: ... @overload - def set_axis(self, labels, axis: Axis, inplace: Literal[True]) -> None: - ... - - @overload - def set_axis(self, labels, *, inplace: Literal[True]) -> None: + def set_axis(self, labels, *, axis: Axis = ..., inplace: Literal[True]) -> None: ... @overload - def set_axis(self, labels, axis: Axis = ..., inplace: bool = ...) -> Series | None: + def set_axis( + self, labels, *, axis: Axis = ..., inplace: bool = ... + ) -> Series | None: ... + # error: Signature of "set_axis" incompatible with supertype "NDFrame" @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "labels"]) @Appender( """ @@ -4826,7 +4965,9 @@ def set_axis(self, labels, axis: Axis = ..., inplace: bool = ...) 
-> Series | No see_also_sub="", ) @Appender(NDFrame.set_axis.__doc__) - def set_axis(self, labels, axis: Axis = 0, inplace: bool = False): + def set_axis( # type: ignore[override] + self, labels, axis: Axis = 0, inplace: bool = False + ) -> Series | None: return super().set_axis(labels, axis=axis, inplace=inplace) # error: Cannot determine type of 'reindex' @@ -4852,11 +4993,11 @@ def reindex(self, *args, **kwargs) -> Series: @overload def drop( self, - labels: Hashable | list[Hashable] = ..., + labels: IndexLabel = ..., *, axis: Axis = ..., - index: Hashable | list[Hashable] = ..., - columns: Hashable | list[Hashable] = ..., + index: IndexLabel = ..., + columns: IndexLabel = ..., level: Level | None = ..., inplace: Literal[True], errors: IgnoreRaise = ..., @@ -4866,11 +5007,11 @@ def drop( @overload def drop( self, - labels: Hashable | list[Hashable] = ..., + labels: IndexLabel = ..., *, axis: Axis = ..., - index: Hashable | list[Hashable] = ..., - columns: Hashable | list[Hashable] = ..., + index: IndexLabel = ..., + columns: IndexLabel = ..., level: Level | None = ..., inplace: Literal[False] = ..., errors: IgnoreRaise = ..., @@ -4880,11 +5021,11 @@ def drop( @overload def drop( self, - labels: Hashable | list[Hashable] = ..., + labels: IndexLabel = ..., *, axis: Axis = ..., - index: Hashable | list[Hashable] = ..., - columns: Hashable | list[Hashable] = ..., + index: IndexLabel = ..., + columns: IndexLabel = ..., level: Level | None = ..., inplace: bool = ..., errors: IgnoreRaise = ..., @@ -4896,10 +5037,10 @@ def drop( @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "labels"]) def drop( # type: ignore[override] self, - labels: Hashable | list[Hashable] = None, + labels: IndexLabel = None, axis: Axis = 0, - index: Hashable | list[Hashable] = None, - columns: Hashable | list[Hashable] = None, + index: IndexLabel = None, + columns: IndexLabel = None, level: Level | None = None, inplace: bool = False, errors: IgnoreRaise = "raise", @@ -5163,22 +5304,52 @@ def pop(self, item: Hashable) -> Any: """ return super().pop(item=item) - # error: Cannot determine type of 'replace' + # error: Signature of "replace" incompatible with supertype "NDFrame" + @overload # type: ignore[override] + def replace( + self, + to_replace=..., + value=..., + *, + inplace: Literal[False] = ..., + limit: int | None = ..., + regex=..., + method: Literal["pad", "ffill", "bfill"] | lib.NoDefault = ..., + ) -> Series: + ... + + @overload + def replace( + self, + to_replace=..., + value=..., + *, + inplace: Literal[True], + limit: int | None = ..., + regex=..., + method: Literal["pad", "ffill", "bfill"] | lib.NoDefault = ..., + ) -> None: + ... 
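The widened `IndexLabel` annotations and the `Literal`-typed `inplace` overloads above describe calls like the following sketch, where the return type depends on `inplace`:

import pandas as pd

s = pd.Series([0, 1, 2], index=["a", "b", "c"])

# inplace=False (the default): a new Series is returned.
trimmed = s.drop(labels=["a", "b"])
assert list(trimmed.index) == ["c"]

# inplace=True: s is modified in place and None is returned.
assert s.drop(index="c", inplace=True) is None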
+ + # error: Signature of "replace" incompatible with supertype "NDFrame" + @deprecate_nonkeyword_arguments( + version=None, allowed_args=["self", "to_replace", "value"] + ) @doc( - NDFrame.replace, # type: ignore[has-type] + NDFrame.replace, klass=_shared_doc_kwargs["klass"], inplace=_shared_doc_kwargs["inplace"], replace_iloc=_shared_doc_kwargs["replace_iloc"], ) - def replace( + def replace( # type: ignore[override] self, to_replace=None, value=lib.no_default, - inplace=False, - limit=None, + inplace: bool = False, + limit: int | None = None, regex=False, - method: str | lib.NoDefault = lib.no_default, - ): + method: Literal["pad", "ffill", "bfill"] | lib.NoDefault = lib.no_default, + ) -> Series | None: return super().replace( to_replace=to_replace, value=value, @@ -5495,8 +5666,10 @@ def _convert_dtypes( return result # error: Cannot determine type of 'isna' + # error: Return type "Series" of "isna" incompatible with return type "ndarray + # [Any, dtype[bool_]]" in supertype "IndexOpsMixin" @doc(NDFrame.isna, klass=_shared_doc_kwargs["klass"]) # type: ignore[has-type] - def isna(self) -> Series: + def isna(self) -> Series: # type: ignore[override] return NDFrame.isna(self) # error: Cannot determine type of 'isna' @@ -5520,8 +5693,22 @@ def notnull(self) -> Series: """ return super().notnull() + @overload + def dropna( + self, *, axis: Axis = ..., inplace: Literal[False] = ..., how: str | None = ... + ) -> Series: + ... + + @overload + def dropna( + self, *, axis: Axis = ..., inplace: Literal[True], how: str | None = ... + ) -> None: + ... + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) - def dropna(self, axis=0, inplace=False, how=None): + def dropna( + self, axis: Axis = 0, inplace: bool = False, how: str | None = None + ) -> Series | None: """ Return a new Series with missing values removed. @@ -5603,11 +5790,9 @@ def dropna(self, axis=0, inplace=False, how=None): else: return result else: - if inplace: - # do nothing - pass - else: + if not inplace: return self.copy() + return None # ---------------------------------------------------------------------- # Time series-oriented methods @@ -5720,25 +5905,93 @@ def to_period(self, freq=None, copy=True) -> Series: self, method="to_period" ) - @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + @overload def ffill( - self: Series, + self, + *, + axis: None | Axis = ..., + inplace: Literal[False] = ..., + limit: None | int = ..., + downcast=..., + ) -> Series: + ... + + @overload + def ffill( + self, + *, + axis: None | Axis = ..., + inplace: Literal[True], + limit: None | int = ..., + downcast=..., + ) -> None: + ... + + @overload + def ffill( + self, + *, + axis: None | Axis = ..., + inplace: bool = ..., + limit: None | int = ..., + downcast=..., + ) -> Series | None: + ... + + # error: Signature of "ffill" incompatible with supertype "NDFrame" + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + def ffill( # type: ignore[override] + self, axis: None | Axis = None, inplace: bool = False, limit: None | int = None, downcast=None, ) -> Series | None: - return super().ffill(axis, inplace, limit, downcast) + return super().ffill(axis=axis, inplace=inplace, limit=limit, downcast=downcast) - @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + @overload def bfill( - self: Series, + self, + *, + axis: None | Axis = ..., + inplace: Literal[False] = ..., + limit: None | int = ..., + downcast=..., + ) -> Series: + ... 
+ + @overload + def bfill( + self, + *, + axis: None | Axis = ..., + inplace: Literal[True], + limit: None | int = ..., + downcast=..., + ) -> None: + ... + + @overload + def bfill( + self, + *, + axis: None | Axis = ..., + inplace: bool = ..., + limit: None | int = ..., + downcast=..., + ) -> Series | None: + ... + + # error: Signature of "bfill" incompatible with supertype "NDFrame" + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self"]) + def bfill( # type: ignore[override] + self, axis: None | Axis = None, inplace: bool = False, limit: None | int = None, downcast=None, ) -> Series | None: - return super().bfill(axis, inplace, limit, downcast) + return super().bfill(axis=axis, inplace=inplace, limit=limit, downcast=downcast) @deprecate_nonkeyword_arguments( version=None, allowed_args=["self", "lower", "upper"] @@ -5777,35 +6030,137 @@ def interpolate( **kwargs, ) + @overload + def where( + self, + cond, + other=..., + *, + inplace: Literal[False] = ..., + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> Series: + ... + + @overload + def where( + self, + cond, + other=..., + *, + inplace: Literal[True], + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> None: + ... + + @overload + def where( + self, + cond, + other=..., + *, + inplace: bool = ..., + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> Series | None: + ... + + # error: Signature of "where" incompatible with supertype "NDFrame" + @deprecate_kwarg(old_arg_name="errors", new_arg_name=None) @deprecate_nonkeyword_arguments( version=None, allowed_args=["self", "cond", "other"] ) - def where( + def where( # type: ignore[override] self, cond, other=lib.no_default, - inplace=False, + inplace: bool = False, axis=None, level=None, - errors=lib.no_default, + errors: IgnoreRaise | lib.NoDefault = lib.no_default, try_cast=lib.no_default, - ): - return super().where(cond, other, inplace, axis, level, errors, try_cast) + ) -> Series | None: + return super().where( + cond, + other, + inplace=inplace, + axis=axis, + level=level, + try_cast=try_cast, + ) + + @overload + def mask( + self, + cond, + other=..., + *, + inplace: Literal[False] = ..., + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> Series: + ... + @overload + def mask( + self, + cond, + other=..., + *, + inplace: Literal[True], + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> None: + ... + + @overload + def mask( + self, + cond, + other=..., + *, + inplace: bool = ..., + axis=..., + level=..., + errors: IgnoreRaise | lib.NoDefault = ..., + try_cast=..., + ) -> Series | None: + ... 
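As with `ffill` and `bfill`, the `where` overloads above tie the return type to `inplace`, while the `errors` argument is routed through `deprecate_kwarg` and no longer forwarded. A sketch of the two call styles:

import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Values where the condition is False are replaced; a new Series is returned.
capped = s.where(s <= 2, other=0)
assert list(capped) == [1, 2, 0, 0]

# inplace=True writes the result back into s and returns None.
assert s.where(s <= 2, other=0, inplace=True) is None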
+ + # error: Signature of "mask" incompatible with supertype "NDFrame" + @deprecate_kwarg(old_arg_name="errors", new_arg_name=None) @deprecate_nonkeyword_arguments( version=None, allowed_args=["self", "cond", "other"] ) - def mask( + def mask( # type: ignore[override] self, cond, other=np.nan, - inplace=False, + inplace: bool = False, axis=None, level=None, - errors=lib.no_default, + errors: IgnoreRaise | lib.NoDefault = lib.no_default, try_cast=lib.no_default, - ): - return super().mask(cond, other, inplace, axis, level, errors, try_cast) + ) -> Series | None: + return super().mask( + cond, + other, + inplace=inplace, + axis=axis, + level=level, + try_cast=try_cast, + ) # ---------------------------------------------------------------------- # Add index diff --git a/pandas/core/shared_docs.py b/pandas/core/shared_docs.py index 3a8a95865d10e..b7b75d6464da3 100644 --- a/pandas/core/shared_docs.py +++ b/pandas/core/shared_docs.py @@ -75,6 +75,11 @@ keep_equal : bool, default False If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs. + +result_names : tuple, default ('self', 'other') + Set the dataframes names in the comparison. + + .. versionadded:: 1.5.0 """ _shared_docs[ @@ -423,7 +428,7 @@ _shared_docs[ "compression_options" ] = """compression : str or dict, default 'infer' - For on-the-fly compression of the output data. If 'infer' and '%s' + For on-the-fly compression of the output data. If 'infer' and '%s' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). @@ -432,7 +437,7 @@ to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``, ``'tar'``} and other key-value pairs are forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, - ``bz2.BZ2File``, ``zstandard.ZstdDecompressor`` or + ``bz2.BZ2File``, ``zstandard.ZstdCompressor`` or ``tarfile.TarFile``, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: @@ -541,7 +546,7 @@ string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case `to_replace` must be ``None``. - method : {{'pad', 'ffill', 'bfill', `None`}} + method : {{'pad', 'ffill', 'bfill'}} The method to use when for replacement, when `to_replace` is a scalar, list or tuple and `value` is ``None``. diff --git a/pandas/core/strings/accessor.py b/pandas/core/strings/accessor.py index abd380299ba02..73d5c04ecd652 100644 --- a/pandas/core/strings/accessor.py +++ b/pandas/core/strings/accessor.py @@ -18,7 +18,10 @@ DtypeObj, F, ) -from pandas.util._decorators import Appender +from pandas.util._decorators import ( + Appender, + deprecate_nonkeyword_arguments, +) from pandas.util._exceptions import find_stack_level from pandas.core.dtypes.common import ( @@ -843,6 +846,7 @@ def cat( """, } ) + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "pat"]) @forbid_nonstring_types(["bytes"]) def split( self, @@ -874,6 +878,7 @@ def split( "regex_examples": "", } ) + @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "pat"]) @forbid_nonstring_types(["bytes"]) def rsplit(self, pat=None, n=-1, expand=False): result = self._data.array._str_rsplit(pat, n=n) @@ -1683,19 +1688,23 @@ def zfill(self, width): Note that ``10`` and ``NaN`` are not strings, therefore they are converted to ``NaN``. 
The minus sign in ``'-1'`` is treated as a - regular character and the zero is added to the left of it + special character and the zero is added to the right of it (:meth:`str.zfill` would have moved it to the left). ``1000`` remains unchanged as it is longer than `width`. >>> s.str.zfill(3) - 0 0-1 + 0 -01 1 001 2 1000 3 NaN 4 NaN dtype: object """ - result = self.pad(width, side="left", fillchar="0") + if not is_integer(width): + msg = f"width must be of integer type, not {type(width).__name__}" + raise TypeError(msg) + f = lambda x: x.zfill(width) + result = self._data.array._str_map(f) return self._wrap_result(result) def slice(self, start=None, stop=None, step=None): diff --git a/pandas/core/strings/object_array.py b/pandas/core/strings/object_array.py index 7421645baa463..f884264e9ab75 100644 --- a/pandas/core/strings/object_array.py +++ b/pandas/core/strings/object_array.py @@ -360,7 +360,7 @@ def _str_get_dummies(self, sep="|"): arr = Series(self).fillna("") try: arr = sep + arr + sep - except TypeError: + except (TypeError, NotImplementedError): arr = sep + arr.astype(str) + sep tags: set[str] = set() diff --git a/pandas/core/tools/datetimes.py b/pandas/core/tools/datetimes.py index d4d61df915acb..1ec0e6ca83d8f 100644 --- a/pandas/core/tools/datetimes.py +++ b/pandas/core/tools/datetimes.py @@ -29,7 +29,7 @@ parsing, timezones, ) -from pandas._libs.tslibs.parsing import ( # noqa:F401 +from pandas._libs.tslibs.parsing import ( DateParseError, format_is_iso, guess_datetime_format, @@ -266,7 +266,7 @@ def _box_as_indexlike( def _convert_and_box_cache( arg: DatetimeScalarOrArrayConvertible, cache_array: Series, - name: str | None = None, + name: Hashable | None = None, ) -> Index: """ Convert array of dates with a cache and wrap the result in an Index. 
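The reworked `zfill` above now applies Python's `str.zfill` element-wise and validates `width` up front; a brief sketch of the behaviour the patched implementation produces:

import pandas as pd

s = pd.Series(["-1", "1", "1000"])

# Each element is padded the way str.zfill pads it, so "-1" becomes "-01".
print(s.str.zfill(3))

# A non-integer width raises a TypeError.
try:
    s.str.zfill("3")
except TypeError as err:
    print(err)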
@@ -1289,3 +1289,11 @@ def to_time(arg, format=None, infer_time_format=False, errors="raise"): from pandas.core.tools.times import to_time return to_time(arg, format, infer_time_format, errors) + + +__all__ = [ + "DateParseError", + "should_cache", + "to_datetime", + "to_time", +] diff --git a/pandas/core/window/__init__.py b/pandas/core/window/__init__.py index 8f42cd782c67f..857e12e5467a6 100644 --- a/pandas/core/window/__init__.py +++ b/pandas/core/window/__init__.py @@ -1,13 +1,23 @@ -from pandas.core.window.ewm import ( # noqa:F401 +from pandas.core.window.ewm import ( ExponentialMovingWindow, ExponentialMovingWindowGroupby, ) -from pandas.core.window.expanding import ( # noqa:F401 +from pandas.core.window.expanding import ( Expanding, ExpandingGroupby, ) -from pandas.core.window.rolling import ( # noqa:F401 +from pandas.core.window.rolling import ( Rolling, RollingGroupby, Window, ) + +__all__ = [ + "Expanding", + "ExpandingGroupby", + "ExponentialMovingWindow", + "ExponentialMovingWindowGroupby", + "Rolling", + "RollingGroupby", + "Window", +] diff --git a/pandas/core/window/common.py b/pandas/core/window/common.py index 15144116fa924..ed2a4002f5ce7 100644 --- a/pandas/core/window/common.py +++ b/pandas/core/window/common.py @@ -1,4 +1,6 @@ """Common utility functions for rolling operations""" +from __future__ import annotations + from collections import defaultdict from typing import cast diff --git a/pandas/core/window/doc.py b/pandas/core/window/doc.py index 61cfa29ffc481..4fe08e2fa20b3 100644 --- a/pandas/core/window/doc.py +++ b/pandas/core/window/doc.py @@ -1,4 +1,6 @@ """Any shareable docstring components for rolling/expanding/ewm""" +from __future__ import annotations + from textwrap import dedent from pandas.core.shared_docs import _shared_docs diff --git a/pandas/core/window/ewm.py b/pandas/core/window/ewm.py index a153761f377b3..3a42a4b1a1663 100644 --- a/pandas/core/window/ewm.py +++ b/pandas/core/window/ewm.py @@ -442,7 +442,9 @@ def _get_window_indexer(self) -> BaseIndexer: """ return ExponentialMovingWindowIndexer() - def online(self, engine="numba", engine_kwargs=None): + def online( + self, engine="numba", engine_kwargs=None + ) -> OnlineExponentialMovingWindow: """ Return an ``OnlineExponentialMovingWindow`` object to calculate exponentially moving window aggregations in an online method. @@ -948,7 +950,7 @@ def __init__( else: raise ValueError("'numba' is the only supported engine") - def reset(self): + def reset(self) -> None: """ Reset the state captured by `update` calls. 
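`online()` above returns an `OnlineExponentialMovingWindow`, whose `reset()` now has an explicit `None` return; a sketch of the intended flow, assuming the optional numba dependency is installed:

import pandas as pd

df = pd.DataFrame({"a": range(5)}, dtype="float64")

# "numba" is currently the only supported engine for online EWM aggregations.
online_ewm = df.ewm(span=3).online()
print(online_ewm.mean())               # first batch
print(online_ewm.mean(update=df + 1))  # fold in new observations
online_ewm.reset()                     # discard the accumulated state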
""" diff --git a/pandas/core/window/expanding.py b/pandas/core/window/expanding.py index d1a8b70b34462..dcdcbc0483d59 100644 --- a/pandas/core/window/expanding.py +++ b/pandas/core/window/expanding.py @@ -9,6 +9,7 @@ from pandas._typing import ( Axis, + QuantileInterpolation, WindowingRankType, ) @@ -651,7 +652,7 @@ def kurt(self, numeric_only: bool = False, **kwargs): def quantile( self, quantile: float, - interpolation: str = "linear", + interpolation: QuantileInterpolation = "linear", numeric_only: bool = False, **kwargs, ): diff --git a/pandas/core/window/online.py b/pandas/core/window/online.py index 2ef06732f9800..2e25bdd12d3e0 100644 --- a/pandas/core/window/online.py +++ b/pandas/core/window/online.py @@ -1,3 +1,5 @@ +from __future__ import annotations + from typing import TYPE_CHECKING import numpy as np @@ -112,6 +114,6 @@ def run_ewm(self, weighted_avg, deltas, min_periods, ewm_func): self.last_ewm = result[-1] return result - def reset(self): + def reset(self) -> None: self.old_wt = np.ones(self.shape[self.axis - 1]) self.last_ewm = None diff --git a/pandas/core/window/rolling.py b/pandas/core/window/rolling.py index b45f43adbe952..93f07c5d75625 100644 --- a/pandas/core/window/rolling.py +++ b/pandas/core/window/rolling.py @@ -29,6 +29,7 @@ ArrayLike, Axis, NDFrameT, + QuantileInterpolation, WindowingRankType, ) from pandas.compat._optional import import_optional_dependency @@ -107,7 +108,6 @@ ) from pandas.core.generic import NDFrame from pandas.core.groupby.ops import BaseGrouper - from pandas.core.internals import Block # noqa:F401 class BaseWindow(SelectionMixin): @@ -1658,7 +1658,7 @@ def kurt(self, numeric_only: bool = False, **kwargs): def quantile( self, quantile: float, - interpolation: str = "linear", + interpolation: QuantileInterpolation = "linear", numeric_only: bool = False, **kwargs, ): @@ -2553,7 +2553,7 @@ def kurt(self, numeric_only: bool = False, **kwargs): def quantile( self, quantile: float, - interpolation: str = "linear", + interpolation: QuantileInterpolation = "linear", numeric_only: bool = False, **kwargs, ): diff --git a/pandas/errors/__init__.py b/pandas/errors/__init__.py index 1918065d855a5..08ee5650e97a6 100644 --- a/pandas/errors/__init__.py +++ b/pandas/errors/__init__.py @@ -1,10 +1,13 @@ """ Expose public exceptions & warnings """ +from __future__ import annotations -from pandas._config.config import OptionError # noqa:F401 +import ctypes -from pandas._libs.tslibs import ( # noqa:F401 +from pandas._config.config import OptionError + +from pandas._libs.tslibs import ( OutOfBoundsDatetime, OutOfBoundsTimedelta, ) @@ -326,3 +329,178 @@ class NumExprClobberingError(NameError): >>> pd.eval("sin + a", engine='numexpr') # doctest: +SKIP ... # NumExprClobberingError: Variables in expression "(sin) + (a)" overlap... """ + + +class UndefinedVariableError(NameError): + """ + Exception is raised when trying to use an undefined variable name in a method + like query or eval. It will also specific whether the undefined variable is + local or not. + + Examples + -------- + >>> df = pd.DataFrame({'A': [1, 1, 1]}) + >>> df.query("A > x") # doctest: +SKIP + ... # UndefinedVariableError: name 'x' is not defined + >>> df.query("A > @y") # doctest: +SKIP + ... # UndefinedVariableError: local variable 'y' is not defined + >>> pd.eval('x + 1') # doctest: +SKIP + ... 
# UndefinedVariableError: name 'x' is not defined + """ + + def __init__(self, name: str, is_local: bool | None = None) -> None: + base_msg = f"{repr(name)} is not defined" + if is_local: + msg = f"local variable {base_msg}" + else: + msg = f"name {base_msg}" + super().__init__(msg) + + +class IndexingError(Exception): + """ + Exception is raised when trying to index and there is a mismatch in dimensions. + + Examples + -------- + >>> df = pd.DataFrame({'A': [1, 1, 1]}) + >>> df.loc[..., ..., 'A'] # doctest: +SKIP + ... # IndexingError: indexer may only contain one '...' entry + >>> df = pd.DataFrame({'A': [1, 1, 1]}) + >>> df.loc[1, ..., ...] # doctest: +SKIP + ... # IndexingError: Too many indexers + >>> df[pd.Series([True], dtype=bool)] # doctest: +SKIP + ... # IndexingError: Unalignable boolean Series provided as indexer... + >>> s = pd.Series(range(2), + ... index = pd.MultiIndex.from_product([["a", "b"], ["c"]])) + >>> s.loc["a", "c", "d"] # doctest: +SKIP + ... # IndexingError: Too many indexers + """ + + +class PyperclipException(RuntimeError): + """ + Exception is raised when trying to use methods like to_clipboard() and + read_clipboard() on an unsupported OS/platform. + """ + + +class PyperclipWindowsException(PyperclipException): + """ + Exception is raised when pandas is unable to get access to the clipboard handle + because some other window process is accessing it. + """ + + def __init__(self, message: str) -> None: + # attr only exists on Windows, so typing fails on other platforms + message += f" ({ctypes.WinError()})" # type: ignore[attr-defined] + super().__init__(message) + + +class CSSWarning(UserWarning): + """ + Warning is raised when converting css styling fails. + This can be due to the styling not having an equivalent value or because the + styling isn't properly formatted. + + Examples + -------- + >>> df = pd.DataFrame({'A': [1, 1, 1]}) + >>> df.style.applymap(lambda x: 'background-color: blueGreenRed;') + ... .to_excel('styled.xlsx') # doctest: +SKIP + ... # CSSWarning: Unhandled color format: 'blueGreenRed' + >>> df.style.applymap(lambda x: 'border: 1px solid red red;') + ... .to_excel('styled.xlsx') # doctest: +SKIP + ... # CSSWarning: Too many tokens provided to "border" (expected 1-3) + """ + + +class PossibleDataLossError(Exception): + """ + Exception is raised when trying to open an HDFStore file while the file is already + open. + + Examples + -------- + >>> store = pd.HDFStore('my-store', 'a') # doctest: +SKIP + >>> store.open("w") # doctest: +SKIP + ... # PossibleDataLossError: Re-opening the file [my-store] with mode [a]... + """ + + +class ClosedFileError(Exception): + """ + Exception is raised when trying to perform an operation on a closed HDFStore file. + + Examples + -------- + >>> store = pd.HDFStore('my-store', 'a') # doctest: +SKIP + >>> store.close() # doctest: +SKIP + >>> store.keys() # doctest: +SKIP + ... # ClosedFileError: my-store file is not open! + """ + + +class IncompatibilityWarning(Warning): + """ + Warning is raised when trying to use where criteria on an incompatible + HDF5 file. + """ + + +class AttributeConflictWarning(Warning): + """ + Warning is raised when attempting to append an index with a different + name than the existing index on an HDFStore or attempting to append an index with a + different frequency than the existing index on an HDFStore. + """ + + +class DatabaseError(OSError): + """ + Error is raised when executing sql with bad syntax or sql that throws an error.
+ + Examples + -------- + >>> from sqlite3 import connect + >>> conn = connect(':memory:') + >>> pd.read_sql('select * test', conn) # doctest: +SKIP + ... # DatabaseError: Execution failed on sql 'test': near "test": syntax error + """ + + +__all__ = [ + "AbstractMethodError", + "AccessorRegistrationWarning", + "AttributeConflictWarning", + "ClosedFileError", + "CSSWarning", + "DatabaseError", + "DataError", + "DtypeWarning", + "DuplicateLabelError", + "EmptyDataError", + "IncompatibilityWarning", + "IntCastingNaNError", + "InvalidIndexError", + "IndexingError", + "MergeError", + "NullFrequencyError", + "NumbaUtilError", + "NumExprClobberingError", + "OptionError", + "OutOfBoundsDatetime", + "OutOfBoundsTimedelta", + "ParserError", + "ParserWarning", + "PerformanceWarning", + "PossibleDataLossError", + "PyperclipException", + "PyperclipWindowsException", + "SettingWithCopyError", + "SettingWithCopyWarning", + "SpecificationError", + "UndefinedVariableError", + "UnsortedIndexError", + "UnsupportedFunctionCall", +] diff --git a/pandas/io/api.py b/pandas/io/api.py index 5926f2166ee9d..4e8b34a61dfc6 100644 --- a/pandas/io/api.py +++ b/pandas/io/api.py @@ -2,8 +2,6 @@ Data IO api """ -# flake8: noqa - from pandas.io.clipboards import read_clipboard from pandas.io.excel import ( ExcelFile, @@ -38,3 +36,30 @@ ) from pandas.io.stata import read_stata from pandas.io.xml import read_xml + +__all__ = [ + "ExcelFile", + "ExcelWriter", + "HDFStore", + "read_clipboard", + "read_csv", + "read_excel", + "read_feather", + "read_fwf", + "read_gbq", + "read_hdf", + "read_html", + "read_json", + "read_orc", + "read_parquet", + "read_pickle", + "read_sas", + "read_spss", + "read_sql", + "read_sql_query", + "read_sql_table", + "read_stata", + "read_table", + "read_xml", + "to_pickle", +] diff --git a/pandas/io/clipboard/__init__.py b/pandas/io/clipboard/__init__.py index 6a39b20869497..27fb06dfb6023 100644 --- a/pandas/io/clipboard/__init__.py +++ b/pandas/io/clipboard/__init__.py @@ -58,6 +58,11 @@ import time import warnings +from pandas.errors import ( + PyperclipException, + PyperclipWindowsException, +) + # `import PyQt4` sys.exit()s if DISPLAY is not in the environment. # Thus, we need to detect the presence of $DISPLAY manually # and not load PyQt4 if it is absent. @@ -87,18 +92,6 @@ def _executable_exists(name): ) -# Exceptions -class PyperclipException(RuntimeError): - pass - - -class PyperclipWindowsException(PyperclipException): - def __init__(self, message) -> None: - # attr only exists on Windows, so typing fails on other platforms - message += f" ({ctypes.WinError()})" # type: ignore[attr-defined] - super().__init__(message) - - def _stringifyText(text) -> str: acceptedTypes = (str, int, float, bool) if not isinstance(text, acceptedTypes): diff --git a/pandas/io/common.py b/pandas/io/common.py index 5aecc55bb363a..d911499aa848e 100644 --- a/pandas/io/common.py +++ b/pandas/io/common.py @@ -626,6 +626,21 @@ def get_handle( ... +@overload +def get_handle( + path_or_buf: FilePath | BaseBuffer, + mode: str, + *, + encoding: str | None = ..., + compression: CompressionOptions = ..., + memory_map: bool = ..., + is_text: bool = ..., + errors: str | None = ..., + storage_options: StorageOptions = ..., +) -> IOHandles[str] | IOHandles[bytes]: + ... 
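With the block above, these exceptions are importable from the public `pandas.errors` namespace rather than from internal modules (assuming a pandas build that includes this change); for example:

import pandas as pd
from pandas.errors import UndefinedVariableError

df = pd.DataFrame({"A": [1, 2, 3]})

# Referencing an undefined name inside query() raises the now-public error.
try:
    df.query("A > some_undefined_name")
except UndefinedVariableError as err:
    print(err)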
+ + @doc(compression_options=_shared_docs["compression_options"] % "path_or_buf") def get_handle( path_or_buf: FilePath | BaseBuffer, @@ -806,7 +821,10 @@ def get_handle( # XZ Compression elif compression == "xz": - handle = get_lzma_file()(handle, ioargs.mode) + # error: Argument 1 to "LZMAFile" has incompatible type "Union[str, + # BaseBuffer]"; expected "Optional[Union[Union[str, bytes, PathLike[str], + # PathLike[bytes]], IO[bytes]]]" + handle = get_lzma_file()(handle, ioargs.mode) # type: ignore[arg-type] # Zstd Compression elif compression == "zstd": @@ -910,7 +928,7 @@ class _BufferedWriter(BytesIO, ABC): # type: ignore[misc] """ @abstractmethod - def write_to_buffer(self): + def write_to_buffer(self) -> None: ... def close(self) -> None: diff --git a/pandas/io/date_converters.py b/pandas/io/date_converters.py index 077524fbee465..85e92da8c2a54 100644 --- a/pandas/io/date_converters.py +++ b/pandas/io/date_converters.py @@ -1,13 +1,16 @@ """This module is designed for community supported date conversion functions""" +from __future__ import annotations + import warnings import numpy as np from pandas._libs.tslibs import parsing +from pandas._typing import npt from pandas.util._exceptions import find_stack_level -def parse_date_time(date_col, time_col): +def parse_date_time(date_col, time_col) -> npt.NDArray[np.object_]: """ Parse columns with dates and times into a single datetime column. @@ -26,7 +29,7 @@ def parse_date_time(date_col, time_col): return parsing.try_parse_date_and_time(date_col, time_col) -def parse_date_fields(year_col, month_col, day_col): +def parse_date_fields(year_col, month_col, day_col) -> npt.NDArray[np.object_]: """ Parse columns with years, months and days into a single date column. @@ -48,7 +51,9 @@ def parse_date_fields(year_col, month_col, day_col): return parsing.try_parse_year_month_day(year_col, month_col, day_col) -def parse_all_fields(year_col, month_col, day_col, hour_col, minute_col, second_col): +def parse_all_fields( + year_col, month_col, day_col, hour_col, minute_col, second_col +) -> npt.NDArray[np.object_]: """ Parse columns with datetime information into a single datetime column. @@ -78,7 +83,7 @@ def parse_all_fields(year_col, month_col, day_col, hour_col, minute_col, second_ ) -def generic_parser(parse_func, *cols): +def generic_parser(parse_func, *cols) -> np.ndarray: """ Use dateparser to parse columns with data information into a single datetime column. diff --git a/pandas/io/excel/_base.py b/pandas/io/excel/_base.py index d20f347e54d6b..a0abddc82e6c8 100644 --- a/pandas/io/excel/_base.py +++ b/pandas/io/excel/_base.py @@ -774,6 +774,12 @@ def parse( assert isinstance(skiprows, int) row += skiprows + if row > len(data) - 1: + raise ValueError( + f"header index {row} exceeds maximum index " + f"{len(data) - 1} of data.", + ) + data[row], control_row = fill_mi_header(data[row], control_row) if index_col is not None: @@ -782,9 +788,27 @@ def parse( # If there is a MultiIndex header and an index then there is also # a row containing just the index name(s) - has_index_names = ( - is_list_header and not is_len_one_list_header and index_col is not None - ) + has_index_names = False + if is_list_header and not is_len_one_list_header and index_col is not None: + + index_col_list: Sequence[int] + if isinstance(index_col, int): + index_col_list = [index_col] + else: + assert isinstance(index_col, Sequence) + index_col_list = index_col + + # We have to handle mi without names. 
If any of the entries in the data + # columns are not empty, this is a regular row + assert isinstance(header, Sequence) + if len(header) < len(data): + potential_index_names = data[len(header)] + potential_data = [ + x + for i, x in enumerate(potential_index_names) + if not control_row[i] and i not in index_col_list + ] + has_index_names = all(x == "" or x is None for x in potential_data) if is_list_like(index_col): # Forward fill values for MultiIndex index. @@ -865,11 +889,14 @@ class ExcelWriter(metaclass=abc.ABCMeta): """ Class for writing DataFrame objects into excel sheets. - Default is to use : - * xlwt for xls - * xlsxwriter for xlsx if xlsxwriter is installed otherwise openpyxl - * odf for ods. - See DataFrame.to_excel for typical usage. + Default is to use: + + * `xlwt `__ for xls files + * `xlsxwriter `__ for xlsx files if xlsxwriter + is installed otherwise `openpyxl `__ + * `odswriter `__ for ods files + + See ``DataFrame.to_excel`` for typical usage. The writer should be used as a context manager. Otherwise, call `close()` to save and close any opened file handles. @@ -1063,7 +1090,7 @@ class ExcelWriter(metaclass=abc.ABCMeta): _supported_extensions: tuple[str, ...] def __new__( - cls, + cls: type[ExcelWriter], path: FilePath | WriteExcelBuffer | ExcelWriter, engine: str | None = None, date_format: str | None = None, @@ -1073,7 +1100,7 @@ def __new__( if_sheet_exists: Literal["error", "new", "replace", "overlay"] | None = None, engine_kwargs: dict | None = None, **kwargs, - ): + ) -> ExcelWriter: if kwargs: if engine_kwargs is not None: raise ValueError("Cannot use both engine_kwargs and **kwargs") @@ -1319,7 +1346,7 @@ def cur_sheet(self): return self._cur_sheet @property - def handles(self): + def handles(self) -> IOHandles[bytes]: """ Handles to Excel sheets. 
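A minimal sketch of the context-manager usage the revised `ExcelWriter` docstring above recommends (the file name is illustrative, and an engine such as openpyxl must be installed):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Using the writer as a context manager saves and closes the file on exit.
with pd.ExcelWriter("report.xlsx") as writer:
    df.to_excel(writer, sheet_name="data", index=False)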
@@ -1338,7 +1365,7 @@ def path(self): self._deprecate("path") return self._path - def __fspath__(self): + def __fspath__(self) -> str: return getattr(self._handles.handle, "name", "") def _get_sheet_name(self, sheet_name: str | None) -> str: @@ -1396,10 +1423,10 @@ def check_extension(cls, ext: str) -> Literal[True]: return True # Allow use as a contextmanager - def __enter__(self): + def __enter__(self) -> ExcelWriter: return self - def __exit__(self, exc_type, exc_value, traceback): + def __exit__(self, exc_type, exc_value, traceback) -> None: self.close() def close(self) -> None: @@ -1693,13 +1720,13 @@ def close(self) -> None: """close io if necessary""" self._reader.close() - def __enter__(self): + def __enter__(self) -> ExcelFile: return self - def __exit__(self, exc_type, exc_value, traceback): + def __exit__(self, exc_type, exc_value, traceback) -> None: self.close() - def __del__(self): + def __del__(self) -> None: # Ensure we don't leak file descriptors, but put in try/except in case # attributes are already deleted try: diff --git a/pandas/io/excel/_odswriter.py b/pandas/io/excel/_odswriter.py index f5367df6f228d..a6e125f4b9f33 100644 --- a/pandas/io/excel/_odswriter.py +++ b/pandas/io/excel/_odswriter.py @@ -189,14 +189,18 @@ def _make_table_cell(self, cell) -> tuple[object, Any]: value = str(val).lower() pvalue = str(val).upper() if isinstance(val, datetime.datetime): + # Fast formatting value = val.isoformat() + # Slow but locale-dependent pvalue = val.strftime("%c") return ( pvalue, TableCell(valuetype="date", datevalue=value, attributes=attributes), ) elif isinstance(val, datetime.date): - value = val.strftime("%Y-%m-%d") + # Fast formatting + value = f"{val.year}-{val.month:02d}-{val.day:02d}" + # Slow but locale-dependent pvalue = val.strftime("%x") return ( pvalue, diff --git a/pandas/io/excel/_openpyxl.py b/pandas/io/excel/_openpyxl.py index 87cc07d3fd21d..c3cd3fbe9e853 100644 --- a/pandas/io/excel/_openpyxl.py +++ b/pandas/io/excel/_openpyxl.py @@ -33,6 +33,7 @@ if TYPE_CHECKING: from openpyxl.descriptors.serialisable import Serialisable + from openpyxl.workbook import Workbook class OpenpyxlWriter(ExcelWriter): @@ -79,7 +80,7 @@ def __init__( self.book.remove(self.book.worksheets[0]) @property - def book(self): + def book(self) -> Workbook: """ Book instance of class openpyxl.workbook.Workbook. diff --git a/pandas/io/excel/_xlsxwriter.py b/pandas/io/excel/_xlsxwriter.py index 302d0281019f5..a3edccd3a5779 100644 --- a/pandas/io/excel/_xlsxwriter.py +++ b/pandas/io/excel/_xlsxwriter.py @@ -165,6 +165,10 @@ def convert(cls, style_dict, num_format_str=None): "doubleAccounting": 34, }[props["underline"]] + # GH 30107 - xlsxwriter uses different name + if props.get("valign") == "center": + props["valign"] = "vcenter" + return props diff --git a/pandas/io/feather_format.py b/pandas/io/feather_format.py index 9813b91419060..4ecd5b7604088 100644 --- a/pandas/io/feather_format.py +++ b/pandas/io/feather_format.py @@ -31,7 +31,7 @@ def to_feather( path: FilePath | WriteBuffer[bytes], storage_options: StorageOptions = None, **kwargs, -): +) -> None: """ Write a DataFrame to the binary Feather format. diff --git a/pandas/io/formats/_color_data.py b/pandas/io/formats/_color_data.py index e5b72b2befa4f..2e7cb7f29646e 100644 --- a/pandas/io/formats/_color_data.py +++ b/pandas/io/formats/_color_data.py @@ -3,6 +3,8 @@ # This data has been copied here, instead of being imported from matplotlib, # not to have ``to_excel`` methods require matplotlib. 
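The odswriter change above swaps `strftime("%Y-%m-%d")` for plain string formatting of the date value; a tiny check of the equivalence it relies on:

import datetime

d = datetime.date(2022, 3, 7)

# f-string formatting yields the same ISO-style value without strftime overhead.
fast = f"{d.year}-{d.month:02d}-{d.day:02d}"
assert fast == d.strftime("%Y-%m-%d") == "2022-03-07"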
# source: matplotlib._color_data (3.3.3) +from __future__ import annotations + CSS4_COLORS = { "aliceblue": "F0F8FF", "antiquewhite": "FAEBD7", diff --git a/pandas/io/formats/console.py b/pandas/io/formats/console.py index bdd2b3d6e4c6a..2a6cbe0762903 100644 --- a/pandas/io/formats/console.py +++ b/pandas/io/formats/console.py @@ -1,11 +1,12 @@ """ Internal module for console introspection """ +from __future__ import annotations from shutil import get_terminal_size -def get_console_size(): +def get_console_size() -> tuple[int | None, int | None]: """ Return console size as tuple = (width, height). @@ -43,14 +44,14 @@ def get_console_size(): # Note if the User sets width/Height to None (auto-detection) # and we're in a script (non-inter), this will return (None,None) # caller needs to deal. - return (display_width or terminal_width, display_height or terminal_height) + return display_width or terminal_width, display_height or terminal_height # ---------------------------------------------------------------------- # Detect our environment -def in_interactive_session(): +def in_interactive_session() -> bool: """ Check if we're running in an interactive shell. @@ -75,7 +76,7 @@ def check_main(): return check_main() -def in_ipython_frontend(): +def in_ipython_frontend() -> bool: """ Check if we're inside an IPython zmq frontend. diff --git a/pandas/io/formats/css.py b/pandas/io/formats/css.py index 5335887785881..778df087d28d8 100644 --- a/pandas/io/formats/css.py +++ b/pandas/io/formats/css.py @@ -7,14 +7,12 @@ from typing import ( Callable, Generator, + Iterable, + Iterator, ) import warnings - -class CSSWarning(UserWarning): - """ - This CSS syntax cannot currently be parsed. - """ +from pandas.errors import CSSWarning def _side_expander(prop_fmt: str) -> Callable: @@ -187,9 +185,24 @@ class CSSResolver: SIDES = ("top", "right", "bottom", "left") + CSS_EXPANSIONS = { + **{ + "-".join(["border", prop] if prop else ["border"]): _border_expander(prop) + for prop in ["", "top", "right", "bottom", "left"] + }, + **{ + "-".join(["border", prop]): _side_expander("border-{:s}-" + prop) + for prop in ["color", "style", "width"] + }, + **{ + "margin": _side_expander("margin-{:s}"), + "padding": _side_expander("padding-{:s}"), + }, + } + def __call__( self, - declarations_str: str, + declarations: str | Iterable[tuple[str, str]], inherited: dict[str, str] | None = None, ) -> dict[str, str]: """ @@ -197,8 +210,10 @@ def __call__( Parameters ---------- - declarations_str : str - A list of CSS declarations + declarations_str : str | Iterable[tuple[str, str]] + A CSS string or set of CSS declaration tuples + e.g. "font-weight: bold; background: blue" or + {("font-weight", "bold"), ("background", "blue")} inherited : dict, optional Atomic properties indicating the inherited style context in which declarations_str is to be resolved. 
``inherited`` should already @@ -229,7 +244,9 @@ def __call__( ('font-size', '24pt'), ('font-weight', 'bold')] """ - props = dict(self.atomize(self.parse(declarations_str))) + if isinstance(declarations, str): + declarations = self.parse(declarations) + props = dict(self.atomize(declarations)) if inherited is None: inherited = {} @@ -346,30 +363,17 @@ def _error(): size_fmt = f"{val:f}pt" return size_fmt - def atomize(self, declarations) -> Generator[tuple[str, str], None, None]: + def atomize(self, declarations: Iterable) -> Generator[tuple[str, str], None, None]: for prop, value in declarations: - attr = "expand_" + prop.replace("-", "_") - try: - expand = getattr(self, attr) - except AttributeError: - yield prop, value + prop = prop.lower() + value = value.lower() + if prop in self.CSS_EXPANSIONS: + expand = self.CSS_EXPANSIONS[prop] + yield from expand(self, prop, value) else: - for prop, value in expand(prop, value): - yield prop, value - - expand_border = _border_expander() - expand_border_top = _border_expander("top") - expand_border_right = _border_expander("right") - expand_border_bottom = _border_expander("bottom") - expand_border_left = _border_expander("left") - - expand_border_color = _side_expander("border-{:s}-color") - expand_border_style = _side_expander("border-{:s}-style") - expand_border_width = _side_expander("border-{:s}-width") - expand_margin = _side_expander("margin-{:s}") - expand_padding = _side_expander("padding-{:s}") - - def parse(self, declarations_str: str): + yield prop, value + + def parse(self, declarations_str: str) -> Iterator[tuple[str, str]]: """ Generates (prop, value) pairs from declarations. diff --git a/pandas/io/formats/csvs.py b/pandas/io/formats/csvs.py index c577acfaeba8e..6ab57b0cce2a4 100644 --- a/pandas/io/formats/csvs.py +++ b/pandas/io/formats/csvs.py @@ -118,16 +118,16 @@ def _initialize_index_label(self, index_label: IndexLabel | None) -> IndexLabel: return [index_label] return index_label - def _get_index_label_from_obj(self) -> list[str]: + def _get_index_label_from_obj(self) -> Sequence[Hashable]: if isinstance(self.obj.index, ABCMultiIndex): return self._get_index_label_multiindex() else: return self._get_index_label_flat() - def _get_index_label_multiindex(self) -> list[str]: + def _get_index_label_multiindex(self) -> Sequence[Hashable]: return [name or "" for name in self.obj.index.names] - def _get_index_label_flat(self) -> list[str]: + def _get_index_label_flat(self) -> Sequence[Hashable]: index_label = self.obj.index.name return [""] if index_label is None else [index_label] diff --git a/pandas/io/formats/excel.py b/pandas/io/formats/excel.py index d0fea32cafe26..811b079c3c693 100644 --- a/pandas/io/formats/excel.py +++ b/pandas/io/formats/excel.py @@ -3,7 +3,10 @@ """ from __future__ import annotations -from functools import reduce +from functools import ( + lru_cache, + reduce, +) import itertools import re from typing import ( @@ -85,10 +88,13 @@ def __init__( **kwargs, ) -> None: if css_styles and css_converter: - css = ";".join( - [a + ":" + str(v) for (a, v) in css_styles[css_row, css_col]] - ) - style = css_converter(css) + # Use dict to get only one (case-insensitive) declaration per property + declaration_dict = { + prop.lower(): val for prop, val in css_styles[css_row, css_col] + } + # Convert to frozenset for order-invariant caching + unique_declarations = frozenset(declaration_dict.items()) + style = css_converter(unique_declarations) return super().__init__(row=row, col=col, val=val, style=style, **kwargs) @@ 
-166,15 +172,19 @@ def __init__(self, inherited: str | None = None) -> None: compute_css = CSSResolver() - def __call__(self, declarations_str: str) -> dict[str, dict[str, str]]: + @lru_cache(maxsize=None) + def __call__( + self, declarations: str | frozenset[tuple[str, str]] + ) -> dict[str, dict[str, str]]: """ Convert CSS declarations to ExcelWriter style. Parameters ---------- - declarations_str : str - List of CSS declarations. - e.g. "font-weight: bold; background: blue" + declarations : str | frozenset[tuple[str, str]] + CSS string or set of CSS declaration tuples. + e.g. "font-weight: bold; background: blue" or + {("font-weight", "bold"), ("background", "blue")} Returns ------- @@ -182,8 +192,7 @@ def __call__(self, declarations_str: str) -> dict[str, dict[str, str]]: A style as interpreted by ExcelWriter when found in ExcelCell.style. """ - # TODO: memoize? - properties = self.compute_css(declarations_str, self.inherited) + properties = self.compute_css(declarations, self.inherited) return self.build_xlstyle(properties) def build_xlstyle(self, props: Mapping[str, str]) -> dict[str, dict[str, str]]: @@ -197,7 +206,7 @@ def build_xlstyle(self, props: Mapping[str, str]) -> dict[str, dict[str, str]]: # TODO: handle cell width and height: needs support in pandas.io.excel - def remove_none(d: dict[str, str]) -> None: + def remove_none(d: dict[str, str | None]) -> None: """Remove key where value is None, through nested dicts""" for k, v in list(d.items()): if v is None: @@ -528,7 +537,7 @@ def __init__( self.inf_rep = inf_rep @property - def header_style(self): + def header_style(self) -> dict[str, dict[str, str | bool]]: return { "font": {"bold": True}, "borders": { @@ -634,7 +643,7 @@ def _format_header_regular(self) -> Iterable[ExcelCell]: if self.index: coloffset = 1 if isinstance(self.df.index, MultiIndex): - coloffset = len(self.df.index[0]) + coloffset = len(self.df.index.names) colnames = self.columns if self._has_aliases: @@ -850,7 +859,7 @@ def write( freeze_panes=None, engine=None, storage_options: StorageOptions = None, - ): + ) -> None: """ writer : path-like, file-like, or ExcelWriter object File path or existing ExcelWriter diff --git a/pandas/io/formats/format.py b/pandas/io/formats/format.py index 045e74c1b6083..6554b4c1f1afd 100644 --- a/pandas/io/formats/format.py +++ b/pandas/io/formats/format.py @@ -22,6 +22,7 @@ Callable, Hashable, Iterable, + Iterator, List, Mapping, Sequence, @@ -42,7 +43,9 @@ NaT, Timedelta, Timestamp, + get_unit_from_dtype, iNaT, + periods_per_day, ) from pandas._libs.tslibs.nattype import NaTType from pandas._typing import ( @@ -561,7 +564,7 @@ class DataFrameFormatter: def __init__( self, frame: DataFrame, - columns: Sequence[str] | None = None, + columns: Sequence[Hashable] | None = None, col_space: ColspaceArgType | None = None, header: bool | Sequence[str] = True, index: bool = True, @@ -683,7 +686,7 @@ def _initialize_justify(self, justify: str | None) -> str: else: return justify - def _initialize_columns(self, columns: Sequence[str] | None) -> Index: + def _initialize_columns(self, columns: Sequence[Hashable] | None) -> Index: if columns is not None: cols = ensure_index(columns) self.frame = self.frame[cols] @@ -990,8 +993,8 @@ def _get_formatted_index(self, frame: DataFrame) -> list[str]: else: return adjoined - def _get_column_name_list(self) -> list[str]: - names: list[str] = [] + def _get_column_name_list(self) -> list[Hashable]: + names: list[Hashable] = [] columns = self.frame.columns if isinstance(columns, MultiIndex): 
names.extend("" if name is None else name for name in columns.names) @@ -1201,12 +1204,15 @@ def save_to_buffer( with get_buffer(buf, encoding=encoding) as f: f.write(string) if buf is None: - return f.getvalue() + # error: "WriteBuffer[str]" has no attribute "getvalue" + return f.getvalue() # type: ignore[attr-defined] return None @contextmanager -def get_buffer(buf: FilePath | WriteBuffer[str] | None, encoding: str | None = None): +def get_buffer( + buf: FilePath | WriteBuffer[str] | None, encoding: str | None = None +) -> Iterator[WriteBuffer[str]] | Iterator[StringIO]: """ Context manager to open, yield and close buffer for filenames or Path-like objects, otherwise yield buf unchanged. @@ -1738,16 +1744,21 @@ def is_dates_only(values: np.ndarray | DatetimeArray | Index | DatetimeIndex) -> if not isinstance(values, Index): values = values.ravel() - values = DatetimeIndex(values) + if not isinstance(values, (DatetimeArray, DatetimeIndex)): + values = DatetimeIndex(values) + if values.tz is not None: return False values_int = values.asi8 consider_values = values_int != iNaT - one_day_nanos = 86400 * 10**9 - even_days = ( - np.logical_and(consider_values, values_int % int(one_day_nanos) != 0).sum() == 0 - ) + # error: Argument 1 to "py_get_unit_from_dtype" has incompatible type + # "Union[dtype[Any], ExtensionDtype]"; expected "dtype[Any]" + reso = get_unit_from_dtype(values.dtype) # type: ignore[arg-type] + ppd = periods_per_day(reso) + + # TODO: can we reuse is_date_array_normalized? would need a skipna kwd + even_days = np.logical_and(consider_values, values_int % ppd != 0).sum() == 0 if even_days: return True return False @@ -1757,6 +1768,8 @@ def _format_datetime64(x: NaTType | Timestamp, nat_rep: str = "NaT") -> str: if x is NaT: return nat_rep + # Timestamp.__str__ falls back to datetime.datetime.__str__ = isoformat(sep=' ') + # so it already uses string formatting rather than strftime (faster). return str(x) @@ -1771,12 +1784,15 @@ def _format_datetime64_dateonly( if date_format: return x.strftime(date_format) else: + # Timestamp._date_repr relies on string formatting (faster than strftime) return x._date_repr def get_format_datetime64( is_dates_only: bool, nat_rep: str = "NaT", date_format: str | None = None ) -> Callable: + """Return a formatter callable taking a datetime64 as input and providing + a string as output""" if is_dates_only: return lambda x: _format_datetime64_dateonly( @@ -1797,6 +1813,7 @@ def get_format_datetime64_from_values( ido = is_dates_only(values) if ido: + # Only dates and no timezone: provide a default format return date_format or "%Y-%m-%d" return date_format @@ -1870,6 +1887,8 @@ def _formatter(x): if not isinstance(x, Timedelta): x = Timedelta(x) + + # Timedelta._repr_base uses string formatting (faster than strftime) result = x._repr_base(format=format) if box: result = f"'{result}'" diff --git a/pandas/io/formats/html.py b/pandas/io/formats/html.py index dfd95b96c68e8..163e7dc7bde5e 100644 --- a/pandas/io/formats/html.py +++ b/pandas/io/formats/html.py @@ -6,6 +6,7 @@ from textwrap import dedent from typing import ( Any, + Hashable, Iterable, Mapping, cast, @@ -89,7 +90,7 @@ def render(self) -> list[str]: return self.elements @property - def should_show_dimensions(self): + def should_show_dimensions(self) -> bool: return self.fmt.should_show_dimensions @property @@ -258,6 +259,7 @@ def _write_table(self, indent: int = 0) -> None: self.write("
", indent) def _write_col_header(self, indent: int) -> None: + row: list[Hashable] is_truncated_horizontally = self.fmt.is_truncated_horizontally if isinstance(self.columns, MultiIndex): template = 'colspan="{span:d}" halign="left"' diff --git a/pandas/io/formats/info.py b/pandas/io/formats/info.py index c0bdf37e5273a..07ec50a2cd6a8 100644 --- a/pandas/io/formats/info.py +++ b/pandas/io/formats/info.py @@ -566,7 +566,7 @@ def dtypes(self) -> Iterable[Dtype]: return [self.data.dtypes] @property - def dtype_counts(self): + def dtype_counts(self) -> Mapping[str, int]: from pandas.core.frame import DataFrame return _get_dataframe_dtype_counts(DataFrame(self.data)) @@ -1087,7 +1087,7 @@ def _fill_non_empty_info(self) -> None: if self.display_memory_usage: self.add_memory_usage_line() - def add_series_name_line(self): + def add_series_name_line(self) -> None: self._lines.append(f"Series name: {self.data.name}") @property diff --git a/pandas/io/formats/style.py b/pandas/io/formats/style.py index 24646da9162b0..fbee64771cd9a 100644 --- a/pandas/io/formats/style.py +++ b/pandas/io/formats/style.py @@ -12,6 +12,7 @@ Callable, Hashable, Sequence, + overload, ) import warnings @@ -24,8 +25,11 @@ Axis, FilePath, IndexLabel, + IntervalInclusiveType, Level, + QuantileInterpolation, Scalar, + StorageOptions, WriteBuffer, ) from pandas.compat._optional import import_optional_dependency @@ -547,6 +551,7 @@ def set_tooltips( NDFrame.to_excel, klass="Styler", storage_options=_shared_docs["storage_options"], + storage_options_versionadded="1.5.0", ) def to_excel( self, @@ -566,6 +571,7 @@ def to_excel( inf_rep: str = "inf", verbose: bool = True, freeze_panes: tuple[int, int] | None = None, + storage_options: StorageOptions = None, ) -> None: from pandas.io.formats.excel import ExcelFormatter @@ -588,8 +594,55 @@ def to_excel( startcol=startcol, freeze_panes=freeze_panes, engine=engine, + storage_options=storage_options, ) + @overload + def to_latex( + self, + buf: FilePath | WriteBuffer[str], + *, + column_format: str | None = ..., + position: str | None = ..., + position_float: str | None = ..., + hrules: bool | None = ..., + clines: str | None = ..., + label: str | None = ..., + caption: str | tuple | None = ..., + sparse_index: bool | None = ..., + sparse_columns: bool | None = ..., + multirow_align: str | None = ..., + multicol_align: str | None = ..., + siunitx: bool = ..., + environment: str | None = ..., + encoding: str | None = ..., + convert_css: bool = ..., + ) -> None: + ... + + @overload + def to_latex( + self, + buf: None = ..., + *, + column_format: str | None = ..., + position: str | None = ..., + position_float: str | None = ..., + hrules: bool | None = ..., + clines: str | None = ..., + label: str | None = ..., + caption: str | tuple | None = ..., + sparse_index: bool | None = ..., + sparse_columns: bool | None = ..., + multirow_align: str | None = ..., + multicol_align: str | None = ..., + siunitx: bool = ..., + environment: str | None = ..., + encoding: str | None = ..., + convert_css: bool = ..., + ) -> str: + ... + def to_latex( self, buf: FilePath | WriteBuffer[str] | None = None, @@ -609,7 +662,7 @@ def to_latex( environment: str | None = None, encoding: str | None = None, convert_css: bool = False, - ): + ) -> str | None: r""" Write Styler to a file, buffer or string in LaTeX format. 
@@ -1160,6 +1213,46 @@ def to_latex( ) return save_to_buffer(latex, buf=buf, encoding=encoding) + @overload + def to_html( + self, + buf: FilePath | WriteBuffer[str], + *, + table_uuid: str | None = ..., + table_attributes: str | None = ..., + sparse_index: bool | None = ..., + sparse_columns: bool | None = ..., + bold_headers: bool = ..., + caption: str | None = ..., + max_rows: int | None = ..., + max_columns: int | None = ..., + encoding: str | None = ..., + doctype_html: bool = ..., + exclude_styles: bool = ..., + **kwargs, + ) -> None: + ... + + @overload + def to_html( + self, + buf: None = ..., + *, + table_uuid: str | None = ..., + table_attributes: str | None = ..., + sparse_index: bool | None = ..., + sparse_columns: bool | None = ..., + bold_headers: bool = ..., + caption: str | None = ..., + max_rows: int | None = ..., + max_columns: int | None = ..., + encoding: str | None = ..., + doctype_html: bool = ..., + exclude_styles: bool = ..., + **kwargs, + ) -> str: + ... + @Substitution(buf=buf, encoding=encoding) def to_html( self, @@ -1177,7 +1270,7 @@ def to_html( doctype_html: bool = False, exclude_styles: bool = False, **kwargs, - ): + ) -> str | None: """ Write Styler to a file, buffer or string in HTML-CSS format. @@ -1291,10 +1384,38 @@ def to_html( html, buf=buf, encoding=(encoding if buf is not None else None) ) + @overload + def to_string( + self, + buf: FilePath | WriteBuffer[str], + *, + encoding=..., + sparse_index: bool | None = ..., + sparse_columns: bool | None = ..., + max_rows: int | None = ..., + max_columns: int | None = ..., + delimiter: str = ..., + ) -> None: + ... + + @overload + def to_string( + self, + buf: None = ..., + *, + encoding=..., + sparse_index: bool | None = ..., + sparse_columns: bool | None = ..., + max_rows: int | None = ..., + max_columns: int | None = ..., + delimiter: str = ..., + ) -> str: + ... + @Substitution(buf=buf, encoding=encoding) def to_string( self, - buf=None, + buf: FilePath | WriteBuffer[str] | None = None, *, encoding=None, sparse_index: bool | None = None, @@ -1302,7 +1423,7 @@ def to_string( max_rows: int | None = None, max_columns: int | None = None, delimiter: str = " ", - ): + ) -> str | None: """ Write Styler to a file, buffer or string in text format. @@ -3363,7 +3484,7 @@ def highlight_between( axis: Axis | None = 0, left: Scalar | Sequence | None = None, right: Scalar | Sequence | None = None, - inclusive: str = "both", + inclusive: IntervalInclusiveType = "both", props: str | None = None, ) -> Styler: """ @@ -3467,8 +3588,8 @@ def highlight_quantile( axis: Axis | None = 0, q_left: float = 0.0, q_right: float = 1.0, - interpolation: str = "linear", - inclusive: str = "both", + interpolation: QuantileInterpolation = "linear", + inclusive: IntervalInclusiveType = "both", props: str | None = None, ) -> Styler: """ @@ -3539,13 +3660,17 @@ def highlight_quantile( # after quantile is found along axis, e.g. along rows, # applying the calculated quantile to alternate axis, e.g. 
to each column - kwargs = {"q": [q_left, q_right], "interpolation": interpolation} + quantiles = [q_left, q_right] if axis is None: - q = Series(data.to_numpy().ravel()).quantile(**kwargs) + q = Series(data.to_numpy().ravel()).quantile( + q=quantiles, interpolation=interpolation + ) axis_apply: int | None = None else: axis = self.data._get_axis_number(axis) - q = data.quantile(axis=axis, numeric_only=False, **kwargs) + q = data.quantile( + axis=axis, numeric_only=False, q=quantiles, interpolation=interpolation + ) axis_apply = 1 - axis if props is None: @@ -3849,7 +3974,7 @@ def _highlight_between( props: str, left: Scalar | Sequence | np.ndarray | NDFrame | None = None, right: Scalar | Sequence | np.ndarray | NDFrame | None = None, - inclusive: bool | str = True, + inclusive: bool | IntervalInclusiveType = True, ) -> np.ndarray: """ Return an array of css props based on condition of data values within given range. diff --git a/pandas/io/json/_json.py b/pandas/io/json/_json.py index fbea7a71202eb..c617828c91bd4 100644 --- a/pandas/io/json/_json.py +++ b/pandas/io/json/_json.py @@ -11,7 +11,11 @@ from typing import ( Any, Callable, + Generic, + Literal, Mapping, + TypeVar, + overload, ) import numpy as np @@ -21,9 +25,12 @@ from pandas._typing import ( CompressionOptions, DtypeArg, + FilePath, IndexLabel, JSONSerializable, + ReadBuffer, StorageOptions, + WriteBuffer, ) from pandas.errors import AbstractMethodError from pandas.util._decorators import ( @@ -66,13 +73,53 @@ ) from pandas.io.parsers.readers import validate_integer +FrameSeriesStrT = TypeVar("FrameSeriesStrT", bound=Literal["frame", "series"]) + loads = json.loads dumps = json.dumps # interface to/from +@overload +def to_json( + path_or_buf: FilePath | WriteBuffer[str] | WriteBuffer[bytes], + obj: NDFrame, + orient: str | None = ..., + date_format: str = ..., + double_precision: int = ..., + force_ascii: bool = ..., + date_unit: str = ..., + default_handler: Callable[[Any], JSONSerializable] | None = ..., + lines: bool = ..., + compression: CompressionOptions = ..., + index: bool = ..., + indent: int = ..., + storage_options: StorageOptions = ..., +) -> None: + ... + + +@overload def to_json( - path_or_buf, + path_or_buf: None, + obj: NDFrame, + orient: str | None = ..., + date_format: str = ..., + double_precision: int = ..., + force_ascii: bool = ..., + date_unit: str = ..., + default_handler: Callable[[Any], JSONSerializable] | None = ..., + lines: bool = ..., + compression: CompressionOptions = ..., + index: bool = ..., + indent: int = ..., + storage_options: StorageOptions = ..., +) -> str: + ... 
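The same @overload pattern recurs across Styler.to_latex, to_html, to_string and the to_json overloads above: when a path or buffer is supplied the writer returns None, and when buf/path_or_buf is None it returns the rendered string. A minimal standalone sketch of that pattern follows; render() is an illustrative stand-in, not a pandas function.

from __future__ import annotations

from typing import IO, overload


@overload
def render(buf: str | IO[str]) -> None:
    ...


@overload
def render(buf: None = ...) -> str:
    ...


def render(buf: str | IO[str] | None = None) -> str | None:
    # Build the output once, then either write it to the target or return it.
    text = "rendered output"
    if buf is None:
        return text
    if isinstance(buf, str):
        with open(buf, "w") as handle:
            handle.write(text)
    else:
        buf.write(text)
    return None


assert render() == "rendered output"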
+ + +def to_json( + path_or_buf: FilePath | WriteBuffer[str] | WriteBuffer[bytes] | None, obj: NDFrame, orient: str | None = None, date_format: str = "epoch", @@ -85,7 +132,7 @@ def to_json( index: bool = True, indent: int = 0, storage_options: StorageOptions = None, -): +) -> str | None: if not index and orient not in ["split", "table"]: raise ValueError( @@ -131,6 +178,7 @@ def to_json( handles.handle.write(s) else: return s + return None class Writer(ABC): @@ -168,7 +216,7 @@ def __init__( def _format_axes(self): raise AbstractMethodError(self) - def write(self): + def write(self) -> str: iso_dates = self.date_format == "iso" return dumps( self.obj_to_write, @@ -313,18 +361,111 @@ def obj_to_write(self) -> NDFrame | Mapping[IndexLabel, Any]: return {"schema": self.schema, "data": self.obj} +@overload +def read_json( + path_or_buf: FilePath | ReadBuffer[str] | ReadBuffer[bytes], + *, + orient=..., + typ: Literal["frame"] = ..., + dtype: DtypeArg | None = ..., + convert_axes=..., + convert_dates=..., + keep_default_dates: bool = ..., + numpy: bool = ..., + precise_float: bool = ..., + date_unit=..., + encoding=..., + encoding_errors: str | None = ..., + lines: bool = ..., + chunksize: int, + compression: CompressionOptions = ..., + nrows: int | None = ..., + storage_options: StorageOptions = ..., +) -> JsonReader[Literal["frame"]]: + ... + + +@overload +def read_json( + path_or_buf: FilePath | ReadBuffer[str] | ReadBuffer[bytes], + *, + orient=..., + typ: Literal["series"], + dtype: DtypeArg | None = ..., + convert_axes=..., + convert_dates=..., + keep_default_dates: bool = ..., + numpy: bool = ..., + precise_float: bool = ..., + date_unit=..., + encoding=..., + encoding_errors: str | None = ..., + lines: bool = ..., + chunksize: int, + compression: CompressionOptions = ..., + nrows: int | None = ..., + storage_options: StorageOptions = ..., +) -> JsonReader[Literal["series"]]: + ... + + +@overload +def read_json( + path_or_buf: FilePath | ReadBuffer[str] | ReadBuffer[bytes], + *, + orient=..., + typ: Literal["series"], + dtype: DtypeArg | None = ..., + convert_axes=..., + convert_dates=..., + keep_default_dates: bool = ..., + numpy: bool = ..., + precise_float: bool = ..., + date_unit=..., + encoding=..., + encoding_errors: str | None = ..., + lines: bool = ..., + chunksize: None = ..., + compression: CompressionOptions = ..., + nrows: int | None = ..., + storage_options: StorageOptions = ..., +) -> Series: + ... + + +@overload +def read_json( + path_or_buf: FilePath | ReadBuffer[str] | ReadBuffer[bytes], + orient=..., + typ: Literal["frame"] = ..., + dtype: DtypeArg | None = ..., + convert_axes=..., + convert_dates=..., + keep_default_dates: bool = ..., + numpy: bool = ..., + precise_float: bool = ..., + date_unit=..., + encoding=..., + encoding_errors: str | None = ..., + lines: bool = ..., + chunksize: None = ..., + compression: CompressionOptions = ..., + nrows: int | None = ..., + storage_options: StorageOptions = ..., +) -> DataFrame: + ... 
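The overloads above hand back JsonReader[Literal["frame"]] or JsonReader[Literal["series"]] when chunksize is used, and the class itself is made Generic over FrameSeriesStrT (a TypeVar bound to Literal["frame", "series"]) further down, so that read() and __next__ can narrow to DataFrame or Series. A minimal sketch of that generics trick with made-up stand-in classes; Frame, Srs and Reader are illustrative, not pandas types.

from __future__ import annotations

from typing import Generic, Literal, TypeVar, overload


class Frame:
    ...


class Srs:
    ...


KindT = TypeVar("KindT", bound=Literal["frame", "series"])


class Reader(Generic[KindT]):
    # Toy stand-in for JsonReader: the type parameter records which kind
    # of object read() should produce.
    def __init__(self, typ: KindT) -> None:
        self.typ = typ

    @overload
    def read(self: Reader[Literal["frame"]]) -> Frame:
        ...

    @overload
    def read(self: Reader[Literal["series"]]) -> Srs:
        ...

    def read(self) -> Frame | Srs:
        return Frame() if self.typ == "frame" else Srs()


frame_reader: Reader[Literal["frame"]] = Reader("frame")
result = frame_reader.read()  # a type checker infers Frame here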
+ + @doc( storage_options=_shared_docs["storage_options"], decompression_options=_shared_docs["decompression_options"] % "path_or_buf", ) @deprecate_kwarg(old_arg_name="numpy", new_arg_name=None) -@deprecate_nonkeyword_arguments( - version="2.0", allowed_args=["path_or_buf"], stacklevel=3 -) +@deprecate_nonkeyword_arguments(version="2.0", allowed_args=["path_or_buf"]) def read_json( - path_or_buf=None, + path_or_buf: FilePath | ReadBuffer[str] | ReadBuffer[bytes], orient=None, - typ="frame", + typ: Literal["frame", "series"] = "frame", dtype: DtypeArg | None = None, convert_axes=None, convert_dates=True, @@ -339,7 +480,7 @@ def read_json( compression: CompressionOptions = "infer", nrows: int | None = None, storage_options: StorageOptions = None, -): +) -> DataFrame | Series | JsonReader: """ Convert a JSON string to pandas object. @@ -613,7 +754,7 @@ def read_json( return json_reader.read() -class JsonReader(abc.Iterator): +class JsonReader(abc.Iterator, Generic[FrameSeriesStrT]): """ JsonReader provides an interface for reading in a JSON file. @@ -626,7 +767,7 @@ def __init__( self, filepath_or_buffer, orient, - typ, + typ: FrameSeriesStrT, dtype, convert_axes, convert_dates, @@ -739,10 +880,23 @@ def _combine_lines(self, lines) -> str: f'[{",".join([line for line in (line.strip() for line in lines) if line])}]' ) - def read(self): + @overload + def read(self: JsonReader[Literal["frame"]]) -> DataFrame: + ... + + @overload + def read(self: JsonReader[Literal["series"]]) -> Series: + ... + + @overload + def read(self: JsonReader[Literal["frame", "series"]]) -> DataFrame | Series: + ... + + def read(self) -> DataFrame | Series: """ Read the whole JSON input into a pandas object. """ + obj: DataFrame | Series if self.lines: if self.chunksize: obj = concat(self) @@ -759,7 +913,7 @@ def read(self): self.close() return obj - def _get_object_parser(self, json): + def _get_object_parser(self, json) -> DataFrame | Series: """ Parses a json document into a pandas object. """ @@ -786,7 +940,7 @@ def _get_object_parser(self, json): return obj - def close(self): + def close(self) -> None: """ If we opened a stream earlier, in _get_data_from_filepath, we should close it. @@ -796,7 +950,22 @@ def close(self): if self.handles is not None: self.handles.close() - def __next__(self): + def __iter__(self: JsonReader[FrameSeriesStrT]) -> JsonReader[FrameSeriesStrT]: + return self + + @overload + def __next__(self: JsonReader[Literal["frame"]]) -> DataFrame: + ... + + @overload + def __next__(self: JsonReader[Literal["series"]]) -> Series: + ... + + @overload + def __next__(self: JsonReader[Literal["frame", "series"]]) -> DataFrame | Series: + ... + + def __next__(self) -> DataFrame | Series: if self.nrows: if self.nrows_seen >= self.nrows: self.close() @@ -816,10 +985,10 @@ def __next__(self): self.close() raise StopIteration - def __enter__(self): + def __enter__(self) -> JsonReader[FrameSeriesStrT]: return self - def __exit__(self, exc_type, exc_value, traceback): + def __exit__(self, exc_type, exc_value, traceback) -> None: self.close() @@ -875,7 +1044,7 @@ def __init__( self.keep_default_dates = keep_default_dates self.obj: DataFrame | Series | None = None - def check_keys_split(self, decoded): + def check_keys_split(self, decoded) -> None: """ Checks that dict has only the appropriate keys for orient='split'. 
""" diff --git a/pandas/io/json/_table_schema.py b/pandas/io/json/_table_schema.py index c630f0d7613e0..b7a8b5cc82f7a 100644 --- a/pandas/io/json/_table_schema.py +++ b/pandas/io/json/_table_schema.py @@ -115,8 +115,9 @@ def set_default_names(data): return data -def convert_pandas_type_to_json_field(arr): +def convert_pandas_type_to_json_field(arr) -> dict[str, JSONSerializable]: dtype = arr.dtype + name: JSONSerializable if arr.name is None: name = "values" else: @@ -141,7 +142,7 @@ def convert_pandas_type_to_json_field(arr): return field -def convert_json_field_to_pandas_type(field): +def convert_json_field_to_pandas_type(field) -> str | CategoricalDtype: """ Converts a JSON field descriptor into its corresponding NumPy / pandas type @@ -196,6 +197,9 @@ def convert_json_field_to_pandas_type(field): elif typ == "datetime": if field.get("tz"): return f"datetime64[ns, {field['tz']}]" + elif field.get("freq"): + # GH#47747 using datetime over period to minimize the change surface + return f"period[{field['freq']}]" else: return "datetime64[ns]" elif typ == "any": diff --git a/pandas/io/orc.py b/pandas/io/orc.py index b02660c089382..40754a56bbe8b 100644 --- a/pandas/io/orc.py +++ b/pandas/io/orc.py @@ -1,14 +1,28 @@ """ orc compat """ from __future__ import annotations -from typing import TYPE_CHECKING +import io +from types import ModuleType +from typing import ( + TYPE_CHECKING, + Any, + Literal, +) from pandas._typing import ( FilePath, ReadBuffer, + WriteBuffer, ) from pandas.compat._optional import import_optional_dependency +from pandas.core.dtypes.common import ( + is_categorical_dtype, + is_interval_dtype, + is_period_dtype, + is_unsigned_integer_dtype, +) + from pandas.io.common import get_handle if TYPE_CHECKING: @@ -52,3 +66,111 @@ def read_orc( with get_handle(path, "rb", is_text=False) as handles: orc_file = orc.ORCFile(handles.handle) return orc_file.read(columns=columns, **kwargs).to_pandas() + + +def to_orc( + df: DataFrame, + path: FilePath | WriteBuffer[bytes] | None = None, + *, + engine: Literal["pyarrow"] = "pyarrow", + index: bool | None = None, + engine_kwargs: dict[str, Any] | None = None, +) -> bytes | None: + """ + Write a DataFrame to the ORC format. + + .. versionadded:: 1.5.0 + + Parameters + ---------- + df : DataFrame + The dataframe to be written to ORC. Raises NotImplementedError + if dtype of one or more columns is category, unsigned integers, + intervals, periods or sparse. + path : str, file-like object or None, default None + If a string, it will be used as Root Directory path + when writing a partitioned dataset. By file-like object, + we refer to objects with a write() method, such as a file handle + (e.g. via builtin open function). If path is None, + a bytes object is returned. + engine : str, default 'pyarrow' + ORC library to use. Pyarrow must be >= 7.0.0. + index : bool, optional + If ``True``, include the dataframe's index(es) in the file output. If + ``False``, they will not be written to the file. + If ``None``, similar to ``infer`` the dataframe's index(es) + will be saved. However, instead of being saved as values, + the RangeIndex will be stored as a range in the metadata so it + doesn't require much space and is faster. Other indexes will + be included as columns in the file output. + engine_kwargs : dict[str, Any] or None, default None + Additional keyword arguments passed to :func:`pyarrow.orc.write_table`. 
+ + Returns + ------- + bytes if no path argument is provided else None + + Raises + ------ + NotImplementedError + Dtype of one or more columns is category, unsigned integers, interval, + period or sparse. + ValueError + engine is not pyarrow. + + Notes + ----- + * Before using this function you should read the + :ref:`user guide about ORC ` and + :ref:`install optional dependencies `. + * This function requires `pyarrow `_ + library. + * For supported dtypes please refer to `supported ORC features in Arrow + `__. + * Currently timezones in datetime columns are not preserved when a + dataframe is converted into ORC files. + """ + if index is None: + index = df.index.names[0] is not None + if engine_kwargs is None: + engine_kwargs = {} + + # If unsupported dtypes are found raise NotImplementedError + # In Pyarrow 9.0.0 this check will no longer be needed + for dtype in df.dtypes: + if ( + is_categorical_dtype(dtype) + or is_interval_dtype(dtype) + or is_period_dtype(dtype) + or is_unsigned_integer_dtype(dtype) + ): + raise NotImplementedError( + "The dtype of one or more columns is not supported yet." + ) + + if engine != "pyarrow": + raise ValueError("engine must be 'pyarrow'") + engine = import_optional_dependency(engine, min_version="7.0.0") + orc = import_optional_dependency("pyarrow.orc") + + was_none = path is None + if was_none: + path = io.BytesIO() + assert path is not None # For mypy + with get_handle(path, "wb", is_text=False) as handles: + assert isinstance(engine, ModuleType) # For mypy + try: + orc.write_table( + engine.Table.from_pandas(df, preserve_index=index), + handles.handle, + **engine_kwargs, + ) + except TypeError as e: + raise NotImplementedError( + "The dtype of one or more columns is not supported yet." + ) from e + + if was_none: + assert isinstance(path, io.BytesIO) # For mypy + return path.getvalue() + return None diff --git a/pandas/io/parquet.py b/pandas/io/parquet.py index cbf3bcc9278d5..ed0e0a99ec43b 100644 --- a/pandas/io/parquet.py +++ b/pandas/io/parquet.py @@ -151,7 +151,7 @@ def __init__(self) -> None: import pyarrow.parquet # import utils to register the pyarrow extension types - import pandas.core.arrays.arrow._arrow_utils # noqa:F401 + import pandas.core.arrays.arrow._arrow_utils # pyright: ignore # noqa:F401 self.api = pyarrow @@ -231,6 +231,8 @@ def read( self.api.uint64(): pd.UInt64Dtype(), self.api.bool_(): pd.BooleanDtype(), self.api.string(): pd.StringDtype(), + self.api.float32(): pd.Float32Dtype(), + self.api.float64(): pd.Float64Dtype(), } to_pandas_kwargs["types_mapper"] = mapping.get manager = get_option("mode.data_manager") diff --git a/pandas/io/parsers/base_parser.py b/pandas/io/parsers/base_parser.py index 185ef8b59b587..531fa5400f466 100644 --- a/pandas/io/parsers/base_parser.py +++ b/pandas/io/parsers/base_parser.py @@ -32,6 +32,7 @@ from pandas._typing import ( ArrayLike, DtypeArg, + Scalar, ) from pandas.errors import ( ParserError, @@ -89,7 +90,7 @@ def __init__(self, kwds) -> None: self.index_col = kwds.get("index_col", None) self.unnamed_cols: set = set() - self.index_names: list | None = None + self.index_names: Sequence[Hashable] | None = None self.col_names = None self.parse_dates = _validate_parse_dates_arg(kwds.pop("parse_dates", False)) @@ -219,7 +220,7 @@ def _validate_parse_dates_presence(self, columns: Sequence[Hashable]) -> Iterabl for col in cols_needed ] - def close(self): + def close(self) -> None: pass @final @@ -365,7 +366,7 @@ def _maybe_make_multi_index_columns( @final def _make_index( - self, data, alldata, 
columns, indexnamerow=False + self, data, alldata, columns, indexnamerow: list[Scalar] | None = None ) -> tuple[Index | None, Sequence[Hashable] | MultiIndex]: index: Index | None if not is_index_col(self.index_col) or not self.index_col: diff --git a/pandas/io/parsers/c_parser_wrapper.py b/pandas/io/parsers/c_parser_wrapper.py index 91c37a0e43505..711d0857a5a1c 100644 --- a/pandas/io/parsers/c_parser_wrapper.py +++ b/pandas/io/parsers/c_parser_wrapper.py @@ -367,7 +367,7 @@ def _concatenate_chunks(chunks: list[dict[int, ArrayLike]]) -> dict: names = list(chunks[0].keys()) warning_columns = [] - result = {} + result: dict = {} for name in names: arrs = [chunk.pop(name) for chunk in chunks] # Check each arr for consistent types. @@ -383,7 +383,7 @@ def _concatenate_chunks(chunks: list[dict[int, ArrayLike]]) -> dict: numpy_dtypes, # type: ignore[arg-type] [], ) - if common_type == object: + if common_type == np.dtype(object): warning_columns.append(str(name)) dtype = dtypes.pop() @@ -400,13 +400,14 @@ def _concatenate_chunks(chunks: list[dict[int, ArrayLike]]) -> dict: arrs # type: ignore[arg-type] ) else: - # Argument 1 to "concatenate" has incompatible type - # "List[Union[ExtensionArray, ndarray[Any, Any]]]"; expected - # "Union[_SupportsArray[dtype[Any]], + # error: Argument 1 to "concatenate" has incompatible + # type "List[Union[ExtensionArray, ndarray[Any, Any]]]" + # ; expected "Union[_SupportsArray[dtype[Any]], # Sequence[_SupportsArray[dtype[Any]]], # Sequence[Sequence[_SupportsArray[dtype[Any]]]], - # Sequence[Sequence[Sequence[_SupportsArray[dtype[Any]]]]], - # Sequence[Sequence[Sequence[Sequence[_SupportsArray[dtype[Any]]]]]]]" + # Sequence[Sequence[Sequence[_SupportsArray[dtype[Any]]]]] + # , Sequence[Sequence[Sequence[Sequence[ + # _SupportsArray[dtype[Any]]]]]]]" result[name] = np.concatenate(arrs) # type: ignore[arg-type] if warning_columns: diff --git a/pandas/io/parsers/python_parser.py b/pandas/io/parsers/python_parser.py index 37b2ce4c4148b..3e897f9b1334e 100644 --- a/pandas/io/parsers/python_parser.py +++ b/pandas/io/parsers/python_parser.py @@ -308,7 +308,11 @@ def _exclude_implicit_index( }, names # legacy - def get_chunk(self, size=None): + def get_chunk( + self, size: int | None = None + ) -> tuple[ + Index | None, Sequence[Hashable] | MultiIndex, Mapping[Hashable, ArrayLike] + ]: if size is None: # error: "PythonParser" has no attribute "chunksize" size = self.chunksize # type: ignore[attr-defined] @@ -379,10 +383,16 @@ def _infer_columns( line = self._next_line() except StopIteration as err: - if self.line_pos < hr: + if 0 < self.line_pos <= hr and ( + not have_mi_columns or hr != header[-1] + ): + # If no rows we want to raise a different message and if + # we have mi columns, the last line is not part of the header + joi = list(map(str, header[:-1] if have_mi_columns else header)) + msg = f"[{','.join(joi)}], len of {len(joi)}, " raise ValueError( - f"Passed header={hr} but only {self.line_pos + 1} lines in " - "file" + f"Passed header={msg}" + f"but only {self.line_pos} lines in file" ) from err # We have an empty file, so check @@ -933,7 +943,11 @@ def _get_index_name( implicit_first_cols = len(line) - self.num_original_columns # Case 0 - if next_line is not None and self.header is not None: + if ( + next_line is not None + and self.header is not None + and index_col is not False + ): if len(next_line) == len(line) + self.num_original_columns: # column and index names on diff rows self.index_col = list(range(len(line))) diff --git 
a/pandas/io/parsers/readers.py b/pandas/io/parsers/readers.py index 56df5493027c5..4858d56d71c42 100644 --- a/pandas/io/parsers/readers.py +++ b/pandas/io/parsers/readers.py @@ -381,6 +381,8 @@ .. versionadded:: 1.3.0 + .. versionadded:: 1.4.0 + - callable, function with signature ``(bad_line: list[str]) -> list[str] | None`` that will process a single bad line. ``bad_line`` is a list of strings split by the ``sep``. @@ -389,8 +391,6 @@ expected, a ``ParserWarning`` will be emitted while dropping extra elements. Only supported when ``engine="python"`` - .. versionadded:: 1.4.0 - delim_whitespace : bool, default False Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be used as the sep. Equivalent to setting ``sep='\\s+'``. If this option @@ -493,7 +493,22 @@ class _DeprecationConfig(NamedTuple): } -def validate_integer(name, val, min_val=0): +@overload +def validate_integer(name, val: None, min_val=...) -> None: + ... + + +@overload +def validate_integer(name, val: int | float, min_val=...) -> int: + ... + + +@overload +def validate_integer(name, val: int | None, min_val=...) -> int | None: + ... + + +def validate_integer(name, val: int | float | None, min_val=0) -> int | None: """ Checks whether the 'name' parameter for parsing is either an integer OR float that can SAFELY be cast to an integer @@ -509,17 +524,18 @@ def validate_integer(name, val, min_val=0): min_val : int Minimum allowed value (val < min_val will result in a ValueError) """ - msg = f"'{name:s}' must be an integer >={min_val:d}" + if val is None: + return val - if val is not None: - if is_float(val): - if int(val) != val: - raise ValueError(msg) - val = int(val) - elif not (is_integer(val) and val >= min_val): + msg = f"'{name:s}' must be an integer >={min_val:d}" + if is_float(val): + if int(val) != val: raise ValueError(msg) + val = int(val) + elif not (is_integer(val) and val >= min_val): + raise ValueError(msg) - return val + return int(val) def _validate_names(names: Sequence[Hashable] | None) -> None: @@ -636,7 +652,7 @@ def read_csv( comment: str | None = ..., encoding: str | None = ..., encoding_errors: str | None = ..., - dialect=..., + dialect: str | csv.Dialect | None = ..., error_bad_lines: bool | None = ..., warn_bad_lines: bool | None = ..., on_bad_lines=..., @@ -696,7 +712,7 @@ def read_csv( comment: str | None = ..., encoding: str | None = ..., encoding_errors: str | None = ..., - dialect=..., + dialect: str | csv.Dialect | None = ..., error_bad_lines: bool | None = ..., warn_bad_lines: bool | None = ..., on_bad_lines=..., @@ -756,7 +772,7 @@ def read_csv( comment: str | None = ..., encoding: str | None = ..., encoding_errors: str | None = ..., - dialect=..., + dialect: str | csv.Dialect | None = ..., error_bad_lines: bool | None = ..., warn_bad_lines: bool | None = ..., on_bad_lines=..., @@ -816,7 +832,7 @@ def read_csv( comment: str | None = ..., encoding: str | None = ..., encoding_errors: str | None = ..., - dialect=..., + dialect: str | csv.Dialect | None = ..., error_bad_lines: bool | None = ..., warn_bad_lines: bool | None = ..., on_bad_lines=..., @@ -829,16 +845,15 @@ def read_csv( ... 
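The validate_integer rewrite near the top of this readers.py hunk keeps the existing runtime checks while letting the new overloads state the contract explicitly: None passes through, an int or a float that casts losslessly comes back as an int, and anything else (or anything below min_val) raises. A standalone sketch of the same checks, assuming nothing beyond the logic shown above; check_positive_int is an illustrative name, not the pandas helper.

from __future__ import annotations


def check_positive_int(name: str, val: float | None, min_val: int = 0) -> int | None:
    # None passes straight through, mirroring the "None in, None out" overload.
    if val is None:
        return None
    msg = f"'{name}' must be an integer >={min_val}"
    if isinstance(val, float):
        if int(val) != val:  # e.g. 3.5 cannot be cast without losing data
            raise ValueError(msg)
        val = int(val)
    if not isinstance(val, int) or val < min_val:
        raise ValueError(msg)
    return val


assert check_positive_int("nrows", None) is None
assert check_positive_int("nrows", 3.0) == 3
assert check_positive_int("chunksize", 5) == 5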
-@deprecate_nonkeyword_arguments( - version=None, allowed_args=["filepath_or_buffer"], stacklevel=3 -) +@deprecate_nonkeyword_arguments(version=None, allowed_args=["filepath_or_buffer"]) @Appender( _doc_read_csv_and_table.format( func_name="read_csv", summary="Read a comma-separated values (csv) file into DataFrame.", _default_sep="','", storage_options=_shared_docs["storage_options"], - decompression_options=_shared_docs["decompression_options"], + decompression_options=_shared_docs["decompression_options"] + % "filepath_or_buffer", ) ) def read_csv( @@ -891,7 +906,7 @@ def read_csv( comment: str | None = None, encoding: str | None = None, encoding_errors: str | None = "strict", - dialect=None, + dialect: str | csv.Dialect | None = None, # Error Handling error_bad_lines: bool | None = None, warn_bad_lines: bool | None = None, @@ -975,7 +990,7 @@ def read_table( comment: str | None = ..., encoding: str | None = ..., encoding_errors: str | None = ..., - dialect=..., + dialect: str | csv.Dialect | None = ..., error_bad_lines: bool | None = ..., warn_bad_lines: bool | None = ..., on_bad_lines=..., @@ -1035,7 +1050,7 @@ def read_table( comment: str | None = ..., encoding: str | None = ..., encoding_errors: str | None = ..., - dialect=..., + dialect: str | csv.Dialect | None = ..., error_bad_lines: bool | None = ..., warn_bad_lines: bool | None = ..., on_bad_lines=..., @@ -1095,7 +1110,7 @@ def read_table( comment: str | None = ..., encoding: str | None = ..., encoding_errors: str | None = ..., - dialect=..., + dialect: str | csv.Dialect | None = ..., error_bad_lines: bool | None = ..., warn_bad_lines: bool | None = ..., on_bad_lines=..., @@ -1155,7 +1170,7 @@ def read_table( comment: str | None = ..., encoding: str | None = ..., encoding_errors: str | None = ..., - dialect=..., + dialect: str | csv.Dialect | None = ..., error_bad_lines: bool | None = ..., warn_bad_lines: bool | None = ..., on_bad_lines=..., @@ -1168,16 +1183,15 @@ def read_table( ... 
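Tightening dialect to str | csv.Dialect | None in the overloads above matches what the readers accept at runtime: either the name of a dialect registered with the csv module or a csv.Dialect instance. A small usage example; the dialect class and sample data are illustrative.

import csv
import io

import pandas as pd


class PipeDialect(csv.Dialect):
    # A '|'-separated flavour of CSV.
    delimiter = "|"
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = "\r\n"
    quoting = csv.QUOTE_MINIMAL


data = io.StringIO("a|b\n1|2\n3|4\n")
df = pd.read_csv(data, dialect=PipeDialect())
print(df)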
-@deprecate_nonkeyword_arguments( - version=None, allowed_args=["filepath_or_buffer"], stacklevel=3 -) +@deprecate_nonkeyword_arguments(version=None, allowed_args=["filepath_or_buffer"]) @Appender( _doc_read_csv_and_table.format( func_name="read_table", summary="Read general delimited file into DataFrame.", _default_sep=r"'\\t' (tab-stop)", storage_options=_shared_docs["storage_options"], - decompression_options=_shared_docs["decompression_options"], + decompression_options=_shared_docs["decompression_options"] + % "filepath_or_buffer", ) ) def read_table( @@ -1230,7 +1244,7 @@ def read_table( comment: str | None = None, encoding: str | None = None, encoding_errors: str | None = "strict", - dialect=None, + dialect: str | csv.Dialect | None = None, # Error Handling error_bad_lines: bool | None = None, warn_bad_lines: bool | None = None, @@ -1267,9 +1281,7 @@ def read_table( return _read(filepath_or_buffer, kwds) -@deprecate_nonkeyword_arguments( - version=None, allowed_args=["filepath_or_buffer"], stacklevel=2 -) +@deprecate_nonkeyword_arguments(version=None, allowed_args=["filepath_or_buffer"]) def read_fwf( filepath_or_buffer: FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str], colspecs: Sequence[tuple[int, int]] | str | None = "infer", @@ -1702,10 +1714,7 @@ def _make_engine( if engine == "pyarrow": is_text = False mode = "rb" - # error: No overload variant of "get_handle" matches argument types - # "Union[str, PathLike[str], ReadCsvBuffer[bytes], ReadCsvBuffer[str]]" - # , "str", "bool", "Any", "Any", "Any", "Any", "Any" - self.handles = get_handle( # type: ignore[call-overload] + self.handles = get_handle( f, mode, encoding=self.options.get("encoding", None), @@ -1789,7 +1798,7 @@ def __exit__(self, exc_type, exc_value, traceback) -> None: self.close() -def TextParser(*args, **kwds): +def TextParser(*args, **kwds) -> TextFileReader: """ Converts lists of lists/tuples into DataFrames with proper type inference and optional (e.g. string to datetime) conversion. 
Also enables iterating @@ -1925,7 +1934,7 @@ def _stringify_na_values(na_values): def _refine_defaults_read( - dialect: str | csv.Dialect, + dialect: str | csv.Dialect | None, delimiter: str | None | lib.NoDefault, delim_whitespace: bool, engine: CSVEngine | None, diff --git a/pandas/io/pytables.py b/pandas/io/pytables.py index c20ce0c847b61..52a2883e70f93 100644 --- a/pandas/io/pytables.py +++ b/pandas/io/pytables.py @@ -19,9 +19,11 @@ Any, Callable, Hashable, + Iterator, Literal, Sequence, cast, + overload, ) import warnings @@ -46,7 +48,13 @@ ) from pandas.compat._optional import import_optional_dependency from pandas.compat.pickle_compat import patch_pickle -from pandas.errors import PerformanceWarning +from pandas.errors import ( + AttributeConflictWarning, + ClosedFileError, + IncompatibilityWarning, + PerformanceWarning, + PossibleDataLossError, +) from pandas.util._decorators import cache_readonly from pandas.util._exceptions import find_stack_level @@ -167,43 +175,17 @@ def _ensure_term(where, scope_level: int): return where if where is None or len(where) else None -class PossibleDataLossError(Exception): - pass - - -class ClosedFileError(Exception): - pass - - -class IncompatibilityWarning(Warning): - pass - - incompatibility_doc = """ where criteria is being ignored as this version [%s] is too old (or not-defined), read the file in and write it out to a new file to upgrade (with the copy_to method) """ - -class AttributeConflictWarning(Warning): - pass - - attribute_conflict_doc = """ the [%s] attribute of the existing index is [%s] which conflicts with the new [%s], resetting the attribute to None """ - -class DuplicateWarning(Warning): - pass - - -duplicate_doc = """ -duplicate entries in table, taking most recently appended -""" - performance_doc = """ your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->%s,key->%s] [items->%s] @@ -592,7 +574,7 @@ def __init__( self._filters = None self.open(mode=mode, **kwargs) - def __fspath__(self): + def __fspath__(self) -> str: return self._path @property @@ -603,16 +585,16 @@ def root(self): return self._handle.root @property - def filename(self): + def filename(self) -> str: return self._path def __getitem__(self, key: str): return self.get(key) - def __setitem__(self, key: str, value): + def __setitem__(self, key: str, value) -> None: self.put(key, value) - def __delitem__(self, key: str): + def __delitem__(self, key: str) -> None: return self.remove(key) def __getattr__(self, name: str): @@ -644,10 +626,10 @@ def __repr__(self) -> str: pstr = pprint_thing(self._path) return f"{type(self)}\nFile path: {pstr}\n" - def __enter__(self): + def __enter__(self) -> HDFStore: return self - def __exit__(self, exc_type, exc_value, traceback): + def __exit__(self, exc_type, exc_value, traceback) -> None: self.close() def keys(self, include: str = "pandas") -> list[str]: @@ -684,7 +666,7 @@ def keys(self, include: str = "pandas") -> list[str]: f"`include` should be either 'pandas' or 'native' but is '{include}'" ) - def __iter__(self): + def __iter__(self) -> Iterator[str]: return iter(self.keys()) def items(self): @@ -706,7 +688,7 @@ def iteritems(self): ) yield from self.items() - def open(self, mode: str = "a", **kwargs): + def open(self, mode: str = "a", **kwargs) -> None: """ Open the file in the specified mode @@ -751,7 +733,7 @@ def open(self, mode: str = "a", **kwargs): self._handle = tables.open_file(self._path, self._mode, **kwargs) - def close(self): + def close(self) 
-> None: """ Close the PyTables file handle """ @@ -768,7 +750,7 @@ def is_open(self) -> bool: return False return bool(self._handle.isopen) - def flush(self, fsync: bool = False): + def flush(self, fsync: bool = False) -> None: """ Force all buffered modifications to be written to disk. @@ -1096,7 +1078,7 @@ def put( errors: str = "strict", track_times: bool = True, dropna: bool = False, - ): + ) -> None: """ Store object in HDFStore. @@ -1152,7 +1134,7 @@ def put( dropna=dropna, ) - def remove(self, key: str, where=None, start=None, stop=None): + def remove(self, key: str, where=None, start=None, stop=None) -> None: """ Remove pandas object partially by specifying the where condition @@ -1228,7 +1210,7 @@ def append( data_columns: Literal[True] | list[str] | None = None, encoding=None, errors: str = "strict", - ): + ) -> None: """ Append to Table in file. Node must already exist and be Table format. @@ -1305,7 +1287,7 @@ def append_to_multiple( axes=None, dropna=False, **kwargs, - ): + ) -> None: """ Append to multiple tables @@ -1399,7 +1381,7 @@ def create_table_index( columns=None, optlevel: int | None = None, kind: str | None = None, - ): + ) -> None: """ Create a pytables index on the table. @@ -1545,7 +1527,7 @@ def copy( complevel: int | None = None, fletcher32: bool = False, overwrite=True, - ): + ) -> HDFStore: """ Copy the existing store to a new file, updating in place. @@ -1933,7 +1915,7 @@ def __iter__(self): self.close() - def close(self): + def close(self) -> None: if self.auto_close: self.store.close() @@ -2037,7 +2019,7 @@ def itemsize(self) -> int: def kind_attr(self) -> str: return f"{self.name}_kind" - def set_pos(self, pos: int): + def set_pos(self, pos: int) -> None: """set the position of this column in the Table""" self.pos = pos if pos is not None and self.typ is not None: @@ -2072,7 +2054,9 @@ def is_indexed(self) -> bool: return False return getattr(self.table.cols, self.cname).is_indexed - def convert(self, values: np.ndarray, nan_rep, encoding: str, errors: str): + def convert( + self, values: np.ndarray, nan_rep, encoding: str, errors: str + ) -> tuple[np.ndarray, np.ndarray] | tuple[DatetimeIndex, DatetimeIndex]: """ Convert the data from this selection to the appropriate pandas type. 
""" @@ -2140,7 +2124,7 @@ def cvalues(self): def __iter__(self): return iter(self.values) - def maybe_set_size(self, min_itemsize=None): + def maybe_set_size(self, min_itemsize=None) -> None: """ maybe set a string col itemsize: min_itemsize can be an integer or a dict with this columns name @@ -2153,10 +2137,10 @@ def maybe_set_size(self, min_itemsize=None): if min_itemsize is not None and self.typ.itemsize < min_itemsize: self.typ = _tables().StringCol(itemsize=min_itemsize, pos=self.pos) - def validate_names(self): + def validate_names(self) -> None: pass - def validate_and_set(self, handler: AppendableTable, append: bool): + def validate_and_set(self, handler: AppendableTable, append: bool) -> None: self.table = handler.table self.validate_col() self.validate_attr(append) @@ -2183,7 +2167,7 @@ def validate_col(self, itemsize=None): return None - def validate_attr(self, append: bool): + def validate_attr(self, append: bool) -> None: # check for backwards incompatibility if append: existing_kind = getattr(self.attrs, self.kind_attr, None) @@ -2192,7 +2176,7 @@ def validate_attr(self, append: bool): f"incompatible kind in col [{existing_kind} - {self.kind}]" ) - def update_info(self, info): + def update_info(self, info) -> None: """ set/update the info for this indexable with the key/value if there is a conflict raise/warn as needed @@ -2225,17 +2209,17 @@ def update_info(self, info): if value is not None or existing_value is not None: idx[key] = value - def set_info(self, info): + def set_info(self, info) -> None: """set my state from the passed info""" idx = info.get(self.name) if idx is not None: self.__dict__.update(idx) - def set_attr(self): + def set_attr(self) -> None: """set the kind for this column""" setattr(self.attrs, self.kind_attr, self.kind) - def validate_metadata(self, handler: AppendableTable): + def validate_metadata(self, handler: AppendableTable) -> None: """validate that kind=category does not change the categories""" if self.meta == "category": new_metadata = self.metadata @@ -2250,7 +2234,7 @@ def validate_metadata(self, handler: AppendableTable): "different categories to the existing" ) - def write_metadata(self, handler: AppendableTable): + def write_metadata(self, handler: AppendableTable) -> None: """set the meta data""" if self.metadata is not None: handler.write_metadata(self.cname, self.metadata) @@ -2263,7 +2247,13 @@ class GenericIndexCol(IndexCol): def is_indexed(self) -> bool: return False - def convert(self, values: np.ndarray, nan_rep, encoding: str, errors: str): + # error: Return type "Tuple[Int64Index, Int64Index]" of "convert" + # incompatible with return type "Union[Tuple[ndarray[Any, Any], + # ndarray[Any, Any]], Tuple[DatetimeIndex, DatetimeIndex]]" in + # supertype "IndexCol" + def convert( # type: ignore[override] + self, values: np.ndarray, nan_rep, encoding: str, errors: str + ) -> tuple[Int64Index, Int64Index]: """ Convert the data from this selection to the appropriate pandas type. 
@@ -2276,12 +2266,10 @@ def convert(self, values: np.ndarray, nan_rep, encoding: str, errors: str): """ assert isinstance(values, np.ndarray), type(values) - # error: Incompatible types in assignment (expression has type - # "Int64Index", variable has type "ndarray") - values = Int64Index(np.arange(len(values))) # type: ignore[assignment] - return values, values + index = Int64Index(np.arange(len(values))) + return index, index - def set_attr(self): + def set_attr(self) -> None: pass @@ -2362,7 +2350,7 @@ def __eq__(self, other: Any) -> bool: for a in ["name", "cname", "dtype", "pos"] ) - def set_data(self, data: ArrayLike): + def set_data(self, data: ArrayLike) -> None: assert data is not None assert self.dtype is None @@ -2448,7 +2436,7 @@ def cvalues(self): """return my cython values""" return self.data - def validate_attr(self, append): + def validate_attr(self, append) -> None: """validate that we have the same order as the existing & same dtype""" if append: existing_fields = getattr(self.attrs, self.kind_attr, None) @@ -2562,7 +2550,7 @@ def convert(self, values: np.ndarray, nan_rep, encoding: str, errors: str): return self.values, converted - def set_attr(self): + def set_attr(self) -> None: """set the data for this column""" setattr(self.attrs, self.kind_attr, self.values) setattr(self.attrs, self.meta_attr, self.meta) @@ -2575,7 +2563,7 @@ class DataIndexableCol(DataCol): is_data_indexable = True - def validate_names(self): + def validate_names(self) -> None: if not Index(self.values).is_object(): # TODO: should the message here be more specifically non-str? raise ValueError("cannot have non-object label DataIndexableCol") @@ -2672,12 +2660,12 @@ def __repr__(self) -> str: return f"{self.pandas_type:12.12} (shape->{s})" return self.pandas_type - def set_object_info(self): + def set_object_info(self) -> None: """set my pandas type & version""" self.attrs.pandas_type = str(self.pandas_kind) self.attrs.pandas_version = str(_version) - def copy(self): + def copy(self) -> Fixed: new_self = copy.copy(self) return new_self @@ -2709,11 +2697,11 @@ def _fletcher32(self) -> bool: def attrs(self): return self.group._v_attrs - def set_attrs(self): + def set_attrs(self) -> None: """set our object attributes""" pass - def get_attrs(self): + def get_attrs(self) -> None: """get our object attributes""" pass @@ -2730,17 +2718,17 @@ def is_exists(self) -> bool: def nrows(self): return getattr(self.storable, "nrows", None) - def validate(self, other): + def validate(self, other) -> Literal[True] | None: """validate against an existing storable""" if other is None: - return + return None return True - def validate_version(self, where=None): + def validate_version(self, where=None) -> None: """are we trying to operate on an old version?""" - return True + pass - def infer_axes(self): + def infer_axes(self) -> bool: """ infer the axes of my storer return a boolean indicating if we have a valid storer or not @@ -2767,7 +2755,9 @@ def write(self, **kwargs): "cannot write on an abstract storer: subclasses should implement" ) - def delete(self, where=None, start: int | None = None, stop: int | None = None): + def delete( + self, where=None, start: int | None = None, stop: int | None = None + ) -> None: """ support fully deleting the node in its entirety (only) - where specification must be None @@ -2842,7 +2832,7 @@ def f(values, freq=None, tz=None): return factory, kwargs - def validate_read(self, columns, where): + def validate_read(self, columns, where) -> None: """ raise if any keywords are passed 
which are not-None """ @@ -2861,12 +2851,12 @@ def validate_read(self, columns, where): def is_exists(self) -> bool: return True - def set_attrs(self): + def set_attrs(self) -> None: """set our object attributes""" self.attrs.encoding = self.encoding self.attrs.errors = self.errors - def get_attrs(self): + def get_attrs(self) -> None: """retrieve our attributes""" self.encoding = _ensure_encoding(getattr(self.attrs, "encoding", None)) self.errors = _ensure_decoded(getattr(self.attrs, "errors", "strict")) @@ -2924,7 +2914,7 @@ def read_index( else: # pragma: no cover raise TypeError(f"unrecognized index variety: {variety}") - def write_index(self, key: str, index: Index): + def write_index(self, key: str, index: Index) -> None: if isinstance(index, MultiIndex): setattr(self.attrs, f"{key}_variety", "multi") self.write_multi_index(key, index) @@ -2947,7 +2937,7 @@ def write_index(self, key: str, index: Index): if isinstance(index, DatetimeIndex) and index.tz is not None: node._v_attrs.tz = _get_tz(index.tz) - def write_multi_index(self, key: str, index: MultiIndex): + def write_multi_index(self, key: str, index: MultiIndex) -> None: setattr(self.attrs, f"{key}_nlevels", index.nlevels) for i, (lev, level_codes, name) in enumerate( @@ -3033,7 +3023,7 @@ def read_index_node( return index - def write_array_empty(self, key: str, value: ArrayLike): + def write_array_empty(self, key: str, value: ArrayLike) -> None: """write a 0-len array""" # ugly hack for length 0 axes arr = np.empty((1,) * value.ndim) @@ -3152,7 +3142,7 @@ def read( columns=None, start: int | None = None, stop: int | None = None, - ): + ) -> Series: self.validate_read(columns, where) index = self.read_index("index", start=start, stop=stop) values = self.read_array("values", start=start, stop=stop) @@ -3203,7 +3193,7 @@ def read( columns=None, start: int | None = None, stop: int | None = None, - ): + ) -> DataFrame: # start, stop applied to rows, so 0th axis only self.validate_read(columns, where) select_axis = self.obj_type()._get_block_manager_axis(0) @@ -3352,7 +3342,7 @@ def __getitem__(self, c: str): return a return None - def validate(self, other): + def validate(self, other) -> None: """validate against an existing table""" if other is None: return @@ -3449,7 +3439,7 @@ def is_transposed(self) -> bool: return False @property - def data_orientation(self): + def data_orientation(self) -> tuple[int, ...]: """return a tuple of my permutated axes, non_indexable at the front""" return tuple( itertools.chain( @@ -3488,7 +3478,7 @@ def _get_metadata_path(self, key: str) -> str: group = self.group._v_pathname return f"{group}/meta/{key}/meta" - def write_metadata(self, key: str, values: np.ndarray): + def write_metadata(self, key: str, values: np.ndarray) -> None: """ Write out a metadata array to the key as a fixed-format Series. 
@@ -3512,7 +3502,7 @@ def read_metadata(self, key: str): return self.parent.select(self._get_metadata_path(key)) return None - def set_attrs(self): + def set_attrs(self) -> None: """set our table type & indexables""" self.attrs.table_type = str(self.table_type) self.attrs.index_cols = self.index_cols() @@ -3525,7 +3515,7 @@ def set_attrs(self): self.attrs.levels = self.levels self.attrs.info = self.info - def get_attrs(self): + def get_attrs(self) -> None: """retrieve our attributes""" self.non_index_axes = getattr(self.attrs, "non_index_axes", None) or [] self.data_columns = getattr(self.attrs, "data_columns", None) or [] @@ -3537,14 +3527,14 @@ def get_attrs(self): self.index_axes = [a for a in self.indexables if a.is_an_indexable] self.values_axes = [a for a in self.indexables if not a.is_an_indexable] - def validate_version(self, where=None): + def validate_version(self, where=None) -> None: """are we trying to operate on an old version?""" if where is not None: - if self.version[0] <= 0 and self.version[1] <= 10 and self.version[2] < 1: + if self.is_old_version: ws = incompatibility_doc % ".".join([str(x) for x in self.version]) warnings.warn(ws, IncompatibilityWarning) - def validate_min_itemsize(self, min_itemsize): + def validate_min_itemsize(self, min_itemsize) -> None: """ validate the min_itemsize doesn't contain items that are not in the axes this needs data_columns to be defined @@ -3642,7 +3632,9 @@ def f(i, c): return _indexables - def create_index(self, columns=None, optlevel=None, kind: str | None = None): + def create_index( + self, columns=None, optlevel=None, kind: str | None = None + ) -> None: """ Create a pytables index on the specified columns. @@ -4100,7 +4092,7 @@ def get_blk_items(mgr): return blocks, blk_items - def process_axes(self, obj, selection: Selection, columns=None): + def process_axes(self, obj, selection: Selection, columns=None) -> DataFrame: """process axes filters""" # make a copy to avoid side effects if columns is not None: @@ -4354,7 +4346,7 @@ def write( # add the rows table.write_data(chunksize, dropna=dropna) - def write_data(self, chunksize: int | None, dropna: bool = False): + def write_data(self, chunksize: int | None, dropna: bool = False) -> None: """ we form the data into a 2-d including indexes,values,mask write chunk-by-chunk """ @@ -4419,7 +4411,7 @@ def write_data_chunk( indexes: list[np.ndarray], mask: npt.NDArray[np.bool_] | None, values: list[np.ndarray], - ): + ) -> None: """ Parameters ---------- @@ -4701,7 +4693,7 @@ def pandas_type(self) -> str: def storable(self): return getattr(self.group, "table", None) or self.group - def get_attrs(self): + def get_attrs(self) -> None: """retrieve our attributes""" self.non_index_axes = [] self.nan_rep = None @@ -4823,10 +4815,20 @@ def _get_tz(tz: tzinfo) -> str | tzinfo: return zone +@overload +def _set_tz( + values: np.ndarray | Index, tz: str | tzinfo, coerce: bool = False +) -> DatetimeIndex: + ... + + +@overload +def _set_tz(values: np.ndarray | Index, tz: None, coerce: bool = False) -> np.ndarray: + ... 
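The import block at the top of this pytables.py diff moves ClosedFileError, PossibleDataLossError, AttributeConflictWarning and IncompatibilityWarning into pandas.errors, so user code can catch them without importing pandas.io.pytables. A small usage sketch under that assumption; it needs the optional PyTables dependency, and the file name is illustrative.

import pandas as pd
from pandas.errors import ClosedFileError

# HDFStore is a context manager; __exit__ closes the underlying file handle.
with pd.HDFStore("example.h5", mode="w") as store:
    store.put("df", pd.DataFrame({"a": [1, 2, 3]}))

try:
    store.get("df")  # the handle was closed when the with-block exited
except ClosedFileError as err:
    print(f"store is closed: {err}")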
+ + def _set_tz( - values: np.ndarray | Index, - tz: str | tzinfo | None, - coerce: bool = False, + values: np.ndarray | Index, tz: str | tzinfo | None, coerce: bool = False ) -> np.ndarray | DatetimeIndex: """ coerce the values to a DatetimeIndex if tz is set diff --git a/pandas/io/sas/__init__.py b/pandas/io/sas/__init__.py index 71027fd064f3d..317730745b6e3 100644 --- a/pandas/io/sas/__init__.py +++ b/pandas/io/sas/__init__.py @@ -1 +1,3 @@ -from pandas.io.sas.sasreader import read_sas # noqa:F401 +from pandas.io.sas.sasreader import read_sas + +__all__ = ["read_sas"] diff --git a/pandas/io/sas/sas.pyx b/pandas/io/sas/sas.pyx index 2df3e1f7243da..29e014abf9b71 100644 --- a/pandas/io/sas/sas.pyx +++ b/pandas/io/sas/sas.pyx @@ -1,133 +1,688 @@ +# cython: language_level=3 # cython: profile=False # cython: boundscheck=False, initializedcheck=False -from cython cimport Py_ssize_t +from libc.stdint cimport ( + uint8_t, + uint16_t, + uint32_t, + uint64_t, +) +from libc.stdlib cimport ( + calloc, + free, + malloc, +) +from libc.string cimport ( + memcmp, + memcpy, + memset, +) + import numpy as np import pandas.io.sas.sas_constants as const -ctypedef signed long long int64_t -ctypedef unsigned char uint8_t -ctypedef unsigned short uint16_t -# rle_decompress decompresses data using a Run Length Encoding +cdef object _np_nan = np.nan +# Buffer for decompressing short rows. +cdef uint8_t *_process_byte_array_with_data_buf = malloc(1024 * sizeof(uint8_t)) + +# Typed const aliases for quick access. +assert len(const.page_meta_types) == 2 +cdef: + int page_meta_types_0 = const.page_meta_types[0] + int page_meta_types_1 = const.page_meta_types[1] + int page_mix_type = const.page_mix_type + + int subheader_pointers_offset = const.subheader_pointers_offset + int truncated_subheader_id = const.truncated_subheader_id + int compressed_subheader_id = const.compressed_subheader_id + int compressed_subheader_type = const.compressed_subheader_type + + int data_subheader_index = const.SASIndex.data_subheader_index + int row_size_index = const.SASIndex.row_size_index + int column_size_index = const.SASIndex.column_size_index + int column_text_index = const.SASIndex.column_text_index + int column_name_index = const.SASIndex.column_name_index + int column_attributes_index = const.SASIndex.column_attributes_index + int format_and_label_index = const.SASIndex.format_and_label_index + int column_list_index = const.SASIndex.column_list_index + int subheader_counts_index = const.SASIndex.subheader_counts_index + +# Typed const aliases: subheader_signature_to_index. +# Flatten the const.subheader_signature_to_index dictionary to lists of raw keys and values. +# Since the dictionary is small it is much faster to have an O(n) loop through the raw keys +# rather than use a Python dictionary lookup. 
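The comment above describes the lookup strategy used later in _get_subheader_index: the small signature-to-index dict is flattened into one contiguous buffer of fixed-width raw keys plus a parallel array of values, so the hot loop can do a linear memcmp-style scan instead of hashing a Python bytes key. A pure-Python sketch of the same idea; the signatures shown are made up for illustration.

from __future__ import annotations

# A tiny stand-in table: 4-byte signatures mapped to subheader indices.
signature_to_index = {
    b"\x00\xfc\xff\xff": 0,
    b"\xfd\xff\xff\xff": 1,
    b"\xff\xff\xff\xff": 2,
}

KEY_WIDTH = 4
flat_keys = b"".join(signature_to_index)          # contiguous raw keys
flat_values = list(signature_to_index.values())   # parallel values


def lookup(signature: bytes) -> int | None:
    # Linear scan over fixed-width keys; with only a handful of entries this
    # is what the Cython code does with memcmp on raw pointers.
    for i in range(len(flat_values)):
        start = i * KEY_WIDTH
        if flat_keys[start:start + KEY_WIDTH] == signature:
            return flat_values[i]
    return None


assert lookup(b"\xfd\xff\xff\xff") == 1
assert lookup(b"\x01\x02\x03\x04") is None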
+assert all(len(k) in (4, 8) for k in const.subheader_signature_to_index) +_sigs32 = {k: v for k, v in const.subheader_signature_to_index.items() if len(k) == 4} +_sigs64 = {k: v for k, v in const.subheader_signature_to_index.items() if len(k) == 8} +cdef: + _subheader_signature_to_index_keys32 = b"".join(_sigs32.keys()) + const uint32_t *subheader_signature_to_index_keys32 = ( + _subheader_signature_to_index_keys32 + ) + Py_ssize_t[:] subheader_signature_to_index_values32 = ( + np.asarray(list(_sigs32.values())) + ) + + _subheader_signature_to_index_keys64 = b"".join(_sigs64.keys()) + const uint64_t *subheader_signature_to_index_keys64 = ( + _subheader_signature_to_index_keys64 + ) + Py_ssize_t[:] subheader_signature_to_index_values64 = ( + np.asarray(list(_sigs64.values())) + ) + + +cdef class _SubheaderPointer: + cdef: + Py_ssize_t offset + Py_ssize_t length + + def __init__(self, Py_ssize_t offset, Py_ssize_t length): + self.offset = offset + self.length = length + + +cdef class BasePage: + """A page (= bunch of bytes) with unknown endianness. + + Supports reading raw bytes.""" + cdef: + object sas7bdatreader + readonly bytes data + const uint8_t *data_raw + Py_ssize_t data_len + + def __init__(self, sas7bdatreader, data): + self.sas7bdatreader = sas7bdatreader + self.data = data + self.data_raw = self.data + self.data_len = len(data) + + def __len__(self): + return self.data_len + + def read_bytes(self, Py_ssize_t offset, Py_ssize_t width): + self.check_read(offset, width) + return self.data_raw[offset:offset+width] + + cpdef bint check_read(self, Py_ssize_t offset, Py_ssize_t width) except -1: + if offset + width > self.data_len: + self.sas7bdatreader.close() + raise ValueError("The cached page is too small.") + return True + + +cdef class Page(BasePage): + """A page with known endianness.
+ + Supports reading raw bytes, integers and floats.""" + cdef bint file_is_little_endian, need_byteswap + + def __init__(self, sas7bdatreader, data, file_is_little_endian): + super().__init__(sas7bdatreader, data) + self.file_is_little_endian = file_is_little_endian + self.need_byteswap = file_is_little_endian != _machine_is_little_endian() + + def process_page_metadata(self): + cdef: + Py_ssize_t int_length = self.sas7bdatreader._int_length + Py_ssize_t total_offset + Py_ssize_t subheader_offset + Py_ssize_t subheader_length + Py_ssize_t subheader_compression + Py_ssize_t subheader_type + Py_ssize_t page_bit_offset = self.sas7bdatreader._page_bit_offset + Py_ssize_t current_page_subheaders_count = ( + self.sas7bdatreader._current_page_subheaders_count + ) + Py_ssize_t subheader_pointer_length = ( + self.sas7bdatreader._subheader_pointer_length + ) + list current_page_data_subheader_pointers = ( + self.sas7bdatreader._current_page_data_subheader_pointers + ) + Py_ssize_t i + + for i in range(current_page_subheaders_count): + total_offset = subheader_pointers_offset + page_bit_offset + subheader_pointer_length * i + + subheader_offset = self.read_int(total_offset, int_length) + total_offset += int_length + + subheader_length = self.read_int(total_offset, int_length) + total_offset += int_length + + subheader_compression = self.read_int(total_offset, 1) + total_offset += 1 + + subheader_type = self.read_int(total_offset, 1) + + if subheader_length == 0 or subheader_compression == truncated_subheader_id: + continue + + subheader_index = self._get_subheader_index( + subheader_offset, + int_length, + subheader_compression, + subheader_type, + ) + processor = self._get_subheader_processor(subheader_index) + if processor is None: + current_page_data_subheader_pointers.append( + _SubheaderPointer(subheader_offset, subheader_length) + ) + else: + processor(subheader_offset, subheader_length) + + cdef int _get_subheader_index( + self, + Py_ssize_t signature_offset, + Py_ssize_t signature_length, + Py_ssize_t compression, + Py_ssize_t ptype, + ) except -1: + cdef Py_ssize_t i + + self.check_read(signature_offset, signature_length) + + if signature_length == 4: + for i in range(len(subheader_signature_to_index_values32)): + if not memcmp( + &subheader_signature_to_index_keys32[i], + &self.data_raw[signature_offset], + 4, + ): + return subheader_signature_to_index_values32[i] + else: + for i in range(len(subheader_signature_to_index_values64)): + if not memcmp( + &subheader_signature_to_index_keys64[i], + &self.data_raw[signature_offset], + 8, + ): + return subheader_signature_to_index_values64[i] + + if ( + self.sas7bdatreader.compression + and (compression in (compressed_subheader_id, 0)) + and ptype == compressed_subheader_type + ): + return data_subheader_index + else: + self.sas7bdatreader.close() + raise ValueError( + f"Unknown subheader signature {self.data_raw[signature_offset:signature_offset+signature_length]}" + ) + + cdef _get_subheader_processor(self, Py_ssize_t index): + if index == data_subheader_index: + return None + elif index == row_size_index: + return self.sas7bdatreader._process_rowsize_subheader + elif index == column_size_index: + return self.sas7bdatreader._process_columnsize_subheader + elif index == column_text_index: + return self.sas7bdatreader._process_columntext_subheader + elif index == column_name_index: + return self.sas7bdatreader._process_columnname_subheader + elif index == column_attributes_index: + return self.sas7bdatreader._process_columnattributes_subheader + 
elif index == format_and_label_index: + return self.sas7bdatreader._process_format_subheader + elif index == column_list_index: + return self.sas7bdatreader._process_columnlist_subheader + elif index == subheader_counts_index: + return self.sas7bdatreader._process_subheader_counts + else: + raise ValueError(f"unknown subheader index {index}") + + cpdef double read_float(self, Py_ssize_t offset, Py_ssize_t width) except? 1337: + self.check_read(offset, width) + cdef const uint8_t *d = &self.data_raw[offset] + if width == 4: + return _read_float_with_byteswap(d, self.need_byteswap) + else: + return _read_double_with_byteswap(d, self.need_byteswap) + + cpdef uint64_t read_int(self, Py_ssize_t offset, Py_ssize_t width) except? 1337: + self.check_read(offset, width) + cdef const uint8_t *d = &self.data_raw[offset] + if width == 1: + return d[0] + elif width == 2: + return _read_uint16_with_byteswap(d, self.need_byteswap) + elif width == 4: + return _read_uint32_with_byteswap(d, self.need_byteswap) + else: + return _read_uint64_with_byteswap(d, self.need_byteswap) + + +cdef class SAS7BDATCythonReader: + """Fast extensions to SAS7BDATCythonReader.""" + cdef: + # Static + object sas7bdatreader + uint8_t[:, :] byte_chunk + object[:, :] string_chunk + int row_length + int page_bit_offset + int subheader_pointer_length + int row_count + int mix_page_row_count + bint blank_missing + bytes encoding + # Synced Cython <-> Python, see _update_{c,p}ython_row_indices() + public int current_row_in_chunk_index + public int current_row_in_file_index + # Synced Python -> Cython, see _update_cython_page_info() + public int current_row_on_page_index + public int current_page_type + public int current_page_block_count + public list current_page_data_subheader_pointers + public int current_page_subheaders_count + public Page cached_page + + Py_ssize_t (*decompress)( + const uint8_t *inbuff, + Py_ssize_t input_length, + uint8_t *outbuff, + Py_ssize_t row_length, + ) except -1 + + Py_ssize_t[:] column_data_offsets, column_data_lengths + char[:] column_types + + def __init__( + self, + sas7bdatreader, + byte_chunk, + string_chunk, + row_length, + page_bit_offset, + subheader_pointer_length, + row_count, + mix_page_row_count, + blank_missing, + encoding, + column_data_offsets, + column_data_lengths, + column_types, + compression, + ): + self.sas7bdatreader = sas7bdatreader + self.byte_chunk = byte_chunk + self.string_chunk = string_chunk + self.row_length = row_length + self.page_bit_offset = page_bit_offset + self.subheader_pointer_length = subheader_pointer_length + self.row_count = row_count + self.mix_page_row_count = mix_page_row_count + self.blank_missing = blank_missing + self.encoding = None if encoding is None else encoding.encode("ascii") + self.column_data_offsets = column_data_offsets + self.column_data_lengths = column_data_lengths + self.column_types = column_types + + # Compression + if compression == const.rle_compression: + self.decompress = _rle_decompress + elif compression == const.rdc_compression: + self.decompress = _rdc_decompress + else: + self.decompress = NULL + + def read(self, Py_ssize_t nrows): + cdef bint done + + for _ in range(nrows): + done = self._readline() + if done: + break + + cdef bint _readline(self) except -1: + # Loop until a data row is read + while self.current_page_type in ( + page_meta_types_0, + page_meta_types_1, + ) and self.current_row_on_page_index >= len( + self.current_page_data_subheader_pointers + ): + if self.sas7bdatreader._read_next_page(): + return True + + if 
self.current_page_type in (page_meta_types_0, page_meta_types_1): + return self._readline_meta_page() + elif self.current_page_type == page_mix_type: + return self._readline_mix_page() + else: + return self._readline_data_page() + + cdef bint _readline_meta_page(self) except -1: + cdef _SubheaderPointer current_subheader_pointer = ( + self.current_page_data_subheader_pointers[self.current_row_on_page_index] + ) + self.process_byte_array_with_data( + current_subheader_pointer.offset, current_subheader_pointer.length + ) + return False + + cdef bint _readline_mix_page(self) except -1: + cdef Py_ssize_t align_correction, offset + align_correction = ( + self.page_bit_offset + + subheader_pointers_offset + + self.current_page_subheaders_count * self.subheader_pointer_length + ) + align_correction = align_correction % 8 + offset = self.page_bit_offset + align_correction + offset += subheader_pointers_offset + offset += self.current_page_subheaders_count * self.subheader_pointer_length + offset += self.current_row_on_page_index * self.row_length + self.process_byte_array_with_data(offset, self.row_length) + if self.current_row_on_page_index == min(self.row_count, self.mix_page_row_count): + return self.sas7bdatreader._read_next_page() + else: + return False + + cdef bint _readline_data_page(self) except -1: + self.process_byte_array_with_data( + self.page_bit_offset + + subheader_pointers_offset + + self.current_row_on_page_index * self.row_length, + self.row_length, + ) + if self.current_row_on_page_index == self.current_page_block_count: + return self.sas7bdatreader._read_next_page() + else: + return False + + cpdef bint process_byte_array_with_data(self, int offset, int length) except -1: + cdef: + char column_type + Py_ssize_t data_length, data_offset + const uint8_t *source + Py_ssize_t j, rpos, m, jb = 0, js = 0 + uint8_t *decompress_dynamic_buf = NULL + object tmp + + assert offset + length <= self.cached_page.data_len + source = &self.cached_page.data_raw[offset] + if self.decompress != NULL and length < self.row_length: + if self.row_length <= 1024: + memset(_process_byte_array_with_data_buf, 0, self.row_length) + rpos = self.decompress( + source, length, _process_byte_array_with_data_buf, self.row_length + ) + source = _process_byte_array_with_data_buf + else: + decompress_dynamic_buf = calloc(self.row_length, sizeof(uint8_t)) + if decompress_dynamic_buf == NULL: + nbytes = self.row_length * sizeof(uint8_t) + raise MemoryError(f"Failed to allocate {nbytes} bytes") + rpos = self.decompress(source, length, decompress_dynamic_buf, self.row_length) + source = decompress_dynamic_buf + if rpos != self.row_length: + raise ValueError( + f"Expected decompressed line of length {self.row_length} bytes but decompressed {rpos} bytes" + ) + + for j in range(len(self.column_data_offsets)): + column_type = self.column_types[j] + data_length = self.column_data_lengths[j] + data_offset = self.column_data_offsets[j] + if data_length == 0: + break + assert data_offset + data_length <= self.row_length + if column_type == b"d": + # decimal + # TODO optimize this, can only be 8 or 4 + m = 8 * self.current_row_in_chunk_index + if self.cached_page.file_is_little_endian: + m += 8 - data_length + memcpy(&self.byte_chunk[jb, m], &source[data_offset], data_length) + jb += 1 + elif column_type == b"s": + # string + # Quickly skip 8-byte blocks of trailing whitespace. 
+ while ( + data_length > 8 + and source[data_offset+data_length-8] in b"\x00 " + and source[data_offset+data_length-7] in b"\x00 " + and source[data_offset+data_length-6] in b"\x00 " + and source[data_offset+data_length-5] in b"\x00 " + and source[data_offset+data_length-4] in b"\x00 " + and source[data_offset+data_length-3] in b"\x00 " + and source[data_offset+data_length-2] in b"\x00 " + and source[data_offset+data_length-1] in b"\x00 " + ): + data_length -= 8 + # Skip the rest of the trailing whitespace. + while data_length > 0 and source[data_offset+data_length-1] in b"\x00 ": + data_length -= 1 + if self.blank_missing and not data_length: + self.string_chunk[js, self.current_row_in_chunk_index] = _np_nan + else: + self.string_chunk[js, self.current_row_in_chunk_index] = ( + source[data_offset:data_offset+data_length] + if self.encoding is None else + source[data_offset:data_offset+data_length].decode(self.encoding) + ) + js += 1 + else: + raise ValueError(f"unknown column type {column_type!r}") + + self.current_row_in_chunk_index += 1 + self.current_row_in_file_index += 1 + self.current_row_on_page_index += 1 + + if decompress_dynamic_buf != NULL: + free(decompress_dynamic_buf) + + return True + + +cdef inline float _read_float_with_byteswap(const uint8_t *data, bint byteswap): + cdef float res = (data)[0] + if byteswap: + res = _byteswap_float(res) + return res + + +cdef inline double _read_double_with_byteswap(const uint8_t *data, bint byteswap): + cdef double res = (data)[0] + if byteswap: + res = _byteswap_double(res) + return res + + +cdef inline uint16_t _read_uint16_with_byteswap(const uint8_t *data, bint byteswap): + cdef uint16_t res = (data)[0] + if byteswap: + res = _byteswap2(res) + return res + + +cdef inline uint32_t _read_uint32_with_byteswap(const uint8_t *data, bint byteswap): + cdef uint32_t res = (data)[0] + if byteswap: + res = _byteswap4(res) + return res + + +cdef inline uint64_t _read_uint64_with_byteswap(const uint8_t *data, bint byteswap): + cdef uint64_t res = (data)[0] + if byteswap: + res = _byteswap8(res) + return res + + +# Byteswapping +# From https://github.com/WizardMac/ReadStat/blob/master/src/readstat_bits. +# Copyright (c) 2013-2016 Evan Miller, Apache 2 License + +cdef inline bint _machine_is_little_endian(): + cdef int test_byte_order = 1; + return (&test_byte_order)[0] + + +cdef inline uint16_t _byteswap2(uint16_t num): + return ((num & 0xFF00) >> 8) | ((num & 0x00FF) << 8) + + +cdef inline uint32_t _byteswap4(uint32_t num): + num = ((num & 0xFFFF0000) >> 16) | ((num & 0x0000FFFF) << 16) + return ((num & 0xFF00FF00) >> 8) | ((num & 0x00FF00FF) << 8) + + +cdef inline uint64_t _byteswap8(uint64_t num): + num = ((num & 0xFFFFFFFF00000000) >> 32) | ((num & 0x00000000FFFFFFFF) << 32) + num = ((num & 0xFFFF0000FFFF0000) >> 16) | ((num & 0x0000FFFF0000FFFF) << 16) + return ((num & 0xFF00FF00FF00FF00) >> 8) | ((num & 0x00FF00FF00FF00FF) << 8) + + +cdef inline float _byteswap_float(float num): + cdef uint32_t answer = 0 + memcpy(&answer, &num, 4) + answer = _byteswap4(answer) + memcpy(&num, &answer, 4) + return num + + +cdef inline double _byteswap_double(double num): + cdef uint64_t answer = 0 + memcpy(&answer, &num, 8) + answer = _byteswap8(answer) + memcpy(&num, &answer, 8) + return num + + +# Decompression + +# _rle_decompress decompresses data using a Run Length Encoding # algorithm. 
It is partially documented here: # # https://cran.r-project.org/package=sas7bdat/vignettes/sas7bdat.pdf -cdef const uint8_t[:] rle_decompress(int result_length, const uint8_t[:] inbuff) except *: - +cdef Py_ssize_t _rle_decompress( + const uint8_t *inbuff, + Py_ssize_t input_length, + uint8_t *outbuff, + Py_ssize_t row_length, +) except -1: cdef: - uint8_t control_byte, x - uint8_t[:] result = np.zeros(result_length, np.uint8) - int rpos = 0 - int i, nbytes, end_of_first_byte - Py_ssize_t ipos = 0, length = len(inbuff) + Py_ssize_t rpos = 0, ipos = 0, nbytes, control_byte, end_of_first_byte + + while ipos < input_length: + if rpos >= row_length: + raise ValueError(f"Invalid RLE out of bounds write at position {rpos} of {row_length}") - while ipos < length: control_byte = inbuff[ipos] & 0xF0 - end_of_first_byte = (inbuff[ipos] & 0x0F) + end_of_first_byte = inbuff[ipos] & 0x0F ipos += 1 + if control_byte not in (0xD0, 0xE0, 0xF0): + if ipos >= input_length: + raise ValueError(f"Invalid RLE out of bounds read at position {ipos} of {input_length}") if control_byte == 0x00: - if end_of_first_byte != 0: - raise ValueError("Unexpected non-zero end_of_first_byte") - nbytes = (inbuff[ipos]) + 64 + nbytes = inbuff[ipos] + 64 + end_of_first_byte * 256 ipos += 1 - for _ in range(nbytes): - result[rpos] = inbuff[ipos] - rpos += 1 - ipos += 1 + assert rpos + nbytes <= row_length + assert ipos + nbytes <= input_length + memcpy(&outbuff[rpos], &inbuff[ipos], nbytes) + ipos += nbytes + rpos += nbytes elif control_byte == 0x40: # not documented - nbytes = end_of_first_byte * 16 - nbytes += (inbuff[ipos]) + nbytes = inbuff[ipos] + 18 + end_of_first_byte * 256 ipos += 1 - for _ in range(nbytes): - result[rpos] = inbuff[ipos] - rpos += 1 + assert rpos + nbytes <= row_length + memset(&outbuff[rpos], inbuff[ipos], nbytes) + rpos += nbytes ipos += 1 elif control_byte == 0x60: - nbytes = end_of_first_byte * 256 + (inbuff[ipos]) + 17 + nbytes = inbuff[ipos] + 17 + end_of_first_byte * 256 ipos += 1 - for _ in range(nbytes): - result[rpos] = 0x20 - rpos += 1 + assert rpos + nbytes <= row_length + memset(&outbuff[rpos], 0x20, nbytes) + rpos += nbytes elif control_byte == 0x70: - nbytes = end_of_first_byte * 256 + (inbuff[ipos]) + 17 + nbytes = inbuff[ipos] + 17 + end_of_first_byte * 256 ipos += 1 - for _ in range(nbytes): - result[rpos] = 0x00 - rpos += 1 + assert rpos + nbytes <= row_length + memset(&outbuff[rpos], 0x00, nbytes) + rpos += nbytes elif control_byte == 0x80: nbytes = end_of_first_byte + 1 - for i in range(nbytes): - result[rpos] = inbuff[ipos + i] - rpos += 1 + assert rpos + nbytes <= row_length + assert ipos + nbytes <= input_length + memcpy(&outbuff[rpos], &inbuff[ipos], nbytes) + rpos += nbytes ipos += nbytes elif control_byte == 0x90: nbytes = end_of_first_byte + 17 - for i in range(nbytes): - result[rpos] = inbuff[ipos + i] - rpos += 1 + assert rpos + nbytes <= row_length + assert ipos + nbytes <= input_length + memcpy(&outbuff[rpos], &inbuff[ipos], nbytes) + rpos += nbytes ipos += nbytes elif control_byte == 0xA0: nbytes = end_of_first_byte + 33 - for i in range(nbytes): - result[rpos] = inbuff[ipos + i] - rpos += 1 + assert rpos + nbytes <= row_length + assert ipos + nbytes <= input_length + memcpy(&outbuff[rpos], &inbuff[ipos], nbytes) + rpos += nbytes ipos += nbytes elif control_byte == 0xB0: nbytes = end_of_first_byte + 49 - for i in range(nbytes): - result[rpos] = inbuff[ipos + i] - rpos += 1 + assert rpos + nbytes <= row_length + assert ipos + nbytes <= input_length + memcpy(&outbuff[rpos], 
&inbuff[ipos], nbytes) + rpos += nbytes ipos += nbytes elif control_byte == 0xC0: nbytes = end_of_first_byte + 3 - x = inbuff[ipos] + assert rpos + nbytes <= row_length + memset(&outbuff[rpos], inbuff[ipos], nbytes) ipos += 1 - for _ in range(nbytes): - result[rpos] = x - rpos += 1 + rpos += nbytes elif control_byte == 0xD0: nbytes = end_of_first_byte + 2 - for _ in range(nbytes): - result[rpos] = 0x40 - rpos += 1 + assert rpos + nbytes <= row_length + memset(&outbuff[rpos], 0x40, nbytes) + rpos += nbytes elif control_byte == 0xE0: nbytes = end_of_first_byte + 2 - for _ in range(nbytes): - result[rpos] = 0x20 - rpos += 1 + assert rpos + nbytes <= row_length + memset(&outbuff[rpos], 0x20, nbytes) + rpos += nbytes elif control_byte == 0xF0: nbytes = end_of_first_byte + 2 - for _ in range(nbytes): - result[rpos] = 0x00 - rpos += 1 + assert rpos + nbytes <= row_length + memset(&outbuff[rpos], 0x00, nbytes) + rpos += nbytes else: raise ValueError(f"unknown control byte: {control_byte}") - # In py37 cython/clang sees `len(outbuff)` as size_t and not Py_ssize_t - if len(result) != result_length: - raise ValueError(f"RLE: {len(result)} != {result_length}") - - return np.asarray(result) + return rpos - -# rdc_decompress decompresses data using the Ross Data Compression algorithm: +# _rdc_decompress decompresses data using the Ross Data Compression algorithm: # # http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/CUJ/1992/9210/ross/ross.htm -cdef const uint8_t[:] rdc_decompress(int result_length, const uint8_t[:] inbuff) except *: - +cdef Py_ssize_t _rdc_decompress( + const uint8_t *inbuff, + Py_ssize_t input_length, + uint8_t *outbuff, + Py_ssize_t row_length, +) except -1: cdef: uint8_t cmd uint16_t ctrl_bits = 0, ctrl_mask = 0, ofs, cnt - int rpos = 0, k - uint8_t[:] outbuff = np.zeros(result_length, dtype=np.uint8) - Py_ssize_t ipos = 0, length = len(inbuff) - - ii = -1 + Py_ssize_t rpos = 0, ipos = 0, ii = -1 - while ipos < length: + while ipos < input_length: + if rpos >= row_length: + raise ValueError(f"Invalid RDC out of bounds write at position {rpos} of {row_length}") ii += 1 ctrl_mask = ctrl_mask >> 1 if ctrl_mask == 0: @@ -149,8 +704,7 @@ cdef const uint8_t[:] rdc_decompress(int result_length, const uint8_t[:] inbuff) # short RLE if cmd == 0: cnt += 3 - for k in range(cnt): - outbuff[rpos + k] = inbuff[ipos] + memset(&outbuff[rpos], inbuff[ipos], cnt) rpos += cnt ipos += 1 @@ -159,8 +713,7 @@ cdef const uint8_t[:] rdc_decompress(int result_length, const uint8_t[:] inbuff) cnt += inbuff[ipos] << 4 cnt += 19 ipos += 1 - for k in range(cnt): - outbuff[rpos + k] = inbuff[ipos] + memset(&outbuff[rpos], inbuff[ipos], cnt) rpos += cnt ipos += 1 @@ -172,8 +725,7 @@ cdef const uint8_t[:] rdc_decompress(int result_length, const uint8_t[:] inbuff) cnt = inbuff[ipos] ipos += 1 cnt += 16 - for k in range(cnt): - outbuff[rpos + k] = outbuff[rpos - ofs + k] + memcpy(&outbuff[rpos], &outbuff[rpos - ofs], cnt) rpos += cnt # short pattern @@ -181,256 +733,7 @@ cdef const uint8_t[:] rdc_decompress(int result_length, const uint8_t[:] inbuff) ofs = cnt + 3 ofs += inbuff[ipos] << 4 ipos += 1 - for k in range(cmd): - outbuff[rpos + k] = outbuff[rpos - ofs + k] + memcpy(&outbuff[rpos], &outbuff[rpos - ofs], cmd) rpos += cmd - # In py37 cython/clang sees `len(outbuff)` as size_t and not Py_ssize_t - if len(outbuff) != result_length: - raise ValueError(f"RDC: {len(outbuff)} != {result_length}\n") - - return np.asarray(outbuff) - - -cdef enum ColumnTypes: - column_type_decimal = 1 - 
column_type_string = 2 - - -# type the page_data types -assert len(const.page_meta_types) == 2 -cdef: - int page_meta_types_0 = const.page_meta_types[0] - int page_meta_types_1 = const.page_meta_types[1] - int page_mix_type = const.page_mix_type - int page_data_type = const.page_data_type - int subheader_pointers_offset = const.subheader_pointers_offset - - -cdef class Parser: - - cdef: - int column_count - int64_t[:] lengths - int64_t[:] offsets - int64_t[:] column_types - uint8_t[:, :] byte_chunk - object[:, :] string_chunk - char *cached_page - int current_row_on_page_index - int current_page_block_count - int current_page_data_subheader_pointers_len - int current_page_subheaders_count - int current_row_in_chunk_index - int current_row_in_file_index - int header_length - int row_length - int bit_offset - int subheader_pointer_length - int current_page_type - bint is_little_endian - const uint8_t[:] (*decompress)(int result_length, const uint8_t[:] inbuff) except * - object parser - - def __init__(self, object parser): - cdef: - int j - char[:] column_types - - self.parser = parser - self.header_length = self.parser.header_length - self.column_count = parser.column_count - self.lengths = parser.column_data_lengths() - self.offsets = parser.column_data_offsets() - self.byte_chunk = parser._byte_chunk - self.string_chunk = parser._string_chunk - self.row_length = parser.row_length - self.bit_offset = self.parser._page_bit_offset - self.subheader_pointer_length = self.parser._subheader_pointer_length - self.is_little_endian = parser.byte_order == "<" - self.column_types = np.empty(self.column_count, dtype='int64') - - # page indicators - self.update_next_page() - - column_types = parser.column_types() - - # map column types - for j in range(self.column_count): - if column_types[j] == b'd': - self.column_types[j] = column_type_decimal - elif column_types[j] == b's': - self.column_types[j] = column_type_string - else: - raise ValueError(f"unknown column type: {self.parser.columns[j].ctype}") - - # compression - if parser.compression == const.rle_compression: - self.decompress = rle_decompress - elif parser.compression == const.rdc_compression: - self.decompress = rdc_decompress - else: - self.decompress = NULL - - # update to current state of the parser - self.current_row_in_chunk_index = parser._current_row_in_chunk_index - self.current_row_in_file_index = parser._current_row_in_file_index - self.current_row_on_page_index = parser._current_row_on_page_index - - def read(self, int nrows): - cdef: - bint done - int i - - for _ in range(nrows): - done = self.readline() - if done: - break - - # update the parser - self.parser._current_row_on_page_index = self.current_row_on_page_index - self.parser._current_row_in_chunk_index = self.current_row_in_chunk_index - self.parser._current_row_in_file_index = self.current_row_in_file_index - - cdef bint read_next_page(self) except? 
True: - cdef bint done - - done = self.parser._read_next_page() - if done: - self.cached_page = NULL - else: - self.update_next_page() - return done - - cdef update_next_page(self): - # update data for the current page - - self.cached_page = self.parser._cached_page - self.current_row_on_page_index = 0 - self.current_page_type = self.parser._current_page_type - self.current_page_block_count = self.parser._current_page_block_count - self.current_page_data_subheader_pointers_len = len( - self.parser._current_page_data_subheader_pointers - ) - self.current_page_subheaders_count = self.parser._current_page_subheaders_count - - cdef bint readline(self) except? True: - - cdef: - int offset, bit_offset, align_correction - int subheader_pointer_length, mn - bint done, flag - - bit_offset = self.bit_offset - subheader_pointer_length = self.subheader_pointer_length - - # If there is no page, go to the end of the header and read a page. - if self.cached_page == NULL: - self.parser._path_or_buf.seek(self.header_length) - done = self.read_next_page() - if done: - return True - - # Loop until a data row is read - while True: - if self.current_page_type in (page_meta_types_0, page_meta_types_1): - flag = self.current_row_on_page_index >=\ - self.current_page_data_subheader_pointers_len - if flag: - done = self.read_next_page() - if done: - return True - continue - current_subheader_pointer = ( - self.parser._current_page_data_subheader_pointers[ - self.current_row_on_page_index]) - self.process_byte_array_with_data( - current_subheader_pointer.offset, - current_subheader_pointer.length) - return False - elif self.current_page_type == page_mix_type: - align_correction = ( - bit_offset - + subheader_pointers_offset - + self.current_page_subheaders_count * subheader_pointer_length - ) - align_correction = align_correction % 8 - offset = bit_offset + align_correction - offset += subheader_pointers_offset - offset += self.current_page_subheaders_count * subheader_pointer_length - offset += self.current_row_on_page_index * self.row_length - self.process_byte_array_with_data(offset, self.row_length) - mn = min(self.parser.row_count, self.parser._mix_page_row_count) - if self.current_row_on_page_index == mn: - done = self.read_next_page() - if done: - return True - return False - elif self.current_page_type == page_data_type: - self.process_byte_array_with_data( - bit_offset - + subheader_pointers_offset - + self.current_row_on_page_index * self.row_length, - self.row_length, - ) - flag = self.current_row_on_page_index == self.current_page_block_count - if flag: - done = self.read_next_page() - if done: - return True - return False - else: - raise ValueError(f"unknown page type: {self.current_page_type}") - - cdef void process_byte_array_with_data(self, int offset, int length) except *: - - cdef: - Py_ssize_t j - int s, k, m, jb, js, current_row - int64_t lngt, start, ct - const uint8_t[:] source - int64_t[:] column_types - int64_t[:] lengths - int64_t[:] offsets - uint8_t[:, :] byte_chunk - object[:, :] string_chunk - - source = np.frombuffer( - self.cached_page[offset:offset + length], dtype=np.uint8) - - if self.decompress != NULL and (length < self.row_length): - source = self.decompress(self.row_length, source) - - current_row = self.current_row_in_chunk_index - column_types = self.column_types - lengths = self.lengths - offsets = self.offsets - byte_chunk = self.byte_chunk - string_chunk = self.string_chunk - s = 8 * self.current_row_in_chunk_index - js = 0 - jb = 0 - for j in range(self.column_count): - 
lngt = lengths[j] - if lngt == 0: - break - start = offsets[j] - ct = column_types[j] - if ct == column_type_decimal: - # decimal - if self.is_little_endian: - m = s + 8 - lngt - else: - m = s - for k in range(lngt): - byte_chunk[jb, m + k] = source[start + k] - jb += 1 - elif column_types[j] == column_type_string: - # string - string_chunk[js, current_row] = np.array(source[start:( - start + lngt)]).tobytes().rstrip(b"\x00 ") - js += 1 - - self.current_row_on_page_index += 1 - self.current_row_in_chunk_index += 1 - self.current_row_in_file_index += 1 + return rpos diff --git a/pandas/io/sas/sas7bdat.py b/pandas/io/sas/sas7bdat.py index 0ed853d619d4e..f0ae94fe23a2c 100644 --- a/pandas/io/sas/sas7bdat.py +++ b/pandas/io/sas/sas7bdat.py @@ -20,12 +20,12 @@ datetime, timedelta, ) -import struct from typing import cast import numpy as np from pandas._typing import ( + CompressionOptions, FilePath, ReadBuffer, ) @@ -41,7 +41,12 @@ ) from pandas.io.common import get_handle -from pandas.io.sas._sas import Parser +from pandas.io.sas._sas import ( + BasePage, + Page, + SAS7BDATCythonReader, + _SubheaderPointer, +) import pandas.io.sas.sas_constants as const from pandas.io.sas.sasreader import ReaderBase @@ -86,19 +91,6 @@ def _convert_datetimes(sas_datetimes: pd.Series, unit: str) -> pd.Series: return s_series -class _SubheaderPointer: - offset: int - length: int - compression: int - ptype: int - - def __init__(self, offset: int, length: int, compression: int, ptype: int) -> None: - self.offset = offset - self.length = length - self.compression = compression - self.ptype = ptype - - class _Column: col_id: int name: str | bytes @@ -156,18 +148,20 @@ class SAS7BDATReader(ReaderBase, abc.Iterator): """ _int_length: int - _cached_page: bytes | None + _cached_page: BasePage | None + _cython_reader: SAS7BDATCythonReader | None def __init__( self, path_or_buf: FilePath | ReadBuffer[bytes], index=None, - convert_dates=True, - blank_missing=True, - chunksize=None, - encoding=None, - convert_text=True, - convert_header_text=True, + convert_dates: bool = True, + blank_missing: bool = True, + chunksize: int | None = None, + encoding: str | None = None, + convert_text: bool = True, + convert_header_text: bool = True, + compression: CompressionOptions = "infer", ) -> None: self.index = index @@ -180,22 +174,25 @@ def __init__( self.default_encoding = "latin-1" self.compression = b"" - self.column_names_strings: list[str] = [] - self.column_names: list[str] = [] - self.column_formats: list[str] = [] + self.column_names_raw: list[bytes] = [] + self.column_names: list[str | bytes] = [] + self.column_formats: list[str | bytes] = [] self.columns: list[_Column] = [] self._current_page_data_subheader_pointers: list[_SubheaderPointer] = [] self._cached_page = None + self._cython_reader = None self._column_data_lengths: list[int] = [] self._column_data_offsets: list[int] = [] self._column_types: list[bytes] = [] + self._current_row_in_chunk_index = 0 self._current_row_in_file_index = 0 self._current_row_on_page_index = 0 - self._current_row_in_file_index = 0 - self.handles = get_handle(path_or_buf, "rb", is_text=False) + self.handles = get_handle( + path_or_buf, "rb", is_text=False, compression=compression + ) self._path_or_buf = self.handles.handle @@ -221,6 +218,12 @@ def column_types(self) -> np.ndarray: """ return np.asarray(self._column_types, dtype=np.dtype("S1")) + def file_is_little_endian(self): + byte_order = getattr(self, "byte_order", None) + if byte_order not in "<>": + raise ValueError(f"byte_order invalid 
or not set: {byte_order}") + return byte_order == "<" + def close(self) -> None: self.handles.close() @@ -228,13 +231,13 @@ def _get_properties(self) -> None: # Check magic number self._path_or_buf.seek(0) - self._cached_page = self._path_or_buf.read(288) - if self._cached_page[0 : len(const.magic)] != const.magic: + self._cached_page = BasePage(self, self._path_or_buf.read(288)) + if self._cached_page.read_bytes(0, len(const.magic)) != const.magic: raise ValueError("magic number mismatch (not a SAS file?)") # Get alignment information align1, align2 = 0, 0 - buf = self._read_bytes(const.align_1_offset, const.align_1_length) + buf = self._cached_page.read_bytes(const.align_1_offset, const.align_1_length) if buf == const.u64_byte_checker_value: align2 = const.align_2_value self.U64 = True @@ -246,27 +249,35 @@ def _get_properties(self) -> None: self._page_bit_offset = const.page_bit_offset_x86 self._subheader_pointer_length = const.subheader_pointer_length_x86 self._int_length = 4 - buf = self._read_bytes(const.align_2_offset, const.align_2_length) + buf = self._cached_page.read_bytes(const.align_2_offset, const.align_2_length) if buf == const.align_1_checker_value: align1 = const.align_2_value total_align = align1 + align2 # Get endianness information - buf = self._read_bytes(const.endianness_offset, const.endianness_length) + buf = self._cached_page.read_bytes( + const.endianness_offset, const.endianness_length + ) if buf == b"\x01": self.byte_order = "<" else: self.byte_order = ">" + self._cached_page = Page( + self, self._cached_page.data, self.file_is_little_endian() + ) + # Get encoding information - buf = self._read_bytes(const.encoding_offset, const.encoding_length)[0] + buf = self._cached_page.read_bytes( + const.encoding_offset, const.encoding_length + )[0] if buf in const.encoding_names: self.file_encoding = const.encoding_names[buf] else: self.file_encoding = f"unknown (code={buf})" # Get platform information - buf = self._read_bytes(const.platform_offset, const.platform_length) + buf = self._cached_page.read_bytes(const.platform_offset, const.platform_length) if buf == b"1": self.platform = "unix" elif buf == b"2": @@ -274,134 +285,81 @@ def _get_properties(self) -> None: else: self.platform = "unknown" - buf = self._read_bytes(const.dataset_offset, const.dataset_length) - self.name = buf.rstrip(b"\x00 ") - if self.convert_header_text: - self.name = self.name.decode(self.encoding or self.default_encoding) + self.name = self._read_and_convert_header_text( + const.dataset_offset, const.dataset_length + ) - buf = self._read_bytes(const.file_type_offset, const.file_type_length) - self.file_type = buf.rstrip(b"\x00 ") - if self.convert_header_text: - self.file_type = self.file_type.decode( - self.encoding or self.default_encoding - ) + self.file_type = self._read_and_convert_header_text( + const.file_type_offset, const.file_type_length + ) # Timestamp is epoch 01/01/1960 epoch = datetime(1960, 1, 1) - x = self._read_float( + x = self._cached_page.read_float( const.date_created_offset + align1, const.date_created_length ) self.date_created = epoch + pd.to_timedelta(x, unit="s") - x = self._read_float( + x = self._cached_page.read_float( const.date_modified_offset + align1, const.date_modified_length ) self.date_modified = epoch + pd.to_timedelta(x, unit="s") - self.header_length = self._read_int( + self.header_length = self._cached_page.read_int( const.header_size_offset + align1, const.header_size_length ) # Read the rest of the header into cached_page. 
buf = self._path_or_buf.read(self.header_length - 288) - self._cached_page += buf + self._cached_page = Page( + self, self._cached_page.data + buf, self.file_is_little_endian() + ) # error: Argument 1 to "len" has incompatible type "Optional[bytes]"; # expected "Sized" if len(self._cached_page) != self.header_length: # type: ignore[arg-type] raise ValueError("The SAS7BDAT file appears to be truncated.") - self._page_length = self._read_int( + self._page_length = self._cached_page.read_int( const.page_size_offset + align1, const.page_size_length ) - self._page_count = self._read_int( + self._page_count = self._cached_page.read_int( const.page_count_offset + align1, const.page_count_length ) - buf = self._read_bytes( + self.sas_release_offset = self._read_and_convert_header_text( const.sas_release_offset + total_align, const.sas_release_length ) - self.sas_release = buf.rstrip(b"\x00 ") - if self.convert_header_text: - self.sas_release = self.sas_release.decode( - self.encoding or self.default_encoding - ) - buf = self._read_bytes( + self.server_type = self._read_and_convert_header_text( const.sas_server_type_offset + total_align, const.sas_server_type_length ) - self.server_type = buf.rstrip(b"\x00 ") - if self.convert_header_text: - self.server_type = self.server_type.decode( - self.encoding or self.default_encoding - ) - buf = self._read_bytes( + self.os_version = self._read_and_convert_header_text( const.os_version_number_offset + total_align, const.os_version_number_length ) - self.os_version = buf.rstrip(b"\x00 ") - if self.convert_header_text: - self.os_version = self.os_version.decode( - self.encoding or self.default_encoding - ) - buf = self._read_bytes(const.os_name_offset + total_align, const.os_name_length) - buf = buf.rstrip(b"\x00 ") - if len(buf) > 0: - self.os_name = buf.decode(self.encoding or self.default_encoding) - else: - buf = self._read_bytes( + self.os_name = self._read_and_convert_header_text( + const.os_name_offset + total_align, const.os_name_length + ) + if not self.os_name: + self.os_name = self._read_and_convert_header_text( const.os_maker_offset + total_align, const.os_maker_length ) - self.os_name = buf.rstrip(b"\x00 ") - if self.convert_header_text: - self.os_name = self.os_name.decode( - self.encoding or self.default_encoding - ) - def __next__(self): + def __next__(self) -> DataFrame: da = self.read(nrows=self.chunksize or 1) - if da is None: + if da.empty: self.close() raise StopIteration return da - # Read a single float of the given width (4 or 8). - def _read_float(self, offset: int, width: int): - if width not in (4, 8): - self.close() - raise ValueError("invalid float width") - buf = self._read_bytes(offset, width) - fd = "f" if width == 4 else "d" - return struct.unpack(self.byte_order + fd, buf)[0] - - # Read a single signed integer of the given width (1, 2, 4 or 8). - def _read_int(self, offset: int, width: int) -> int: - if width not in (1, 2, 4, 8): - self.close() - raise ValueError("invalid int width") - buf = self._read_bytes(offset, width) - it = {1: "b", 2: "h", 4: "l", 8: "q"}[width] - iv = struct.unpack(self.byte_order + it, buf)[0] - return iv - - def _read_bytes(self, offset: int, length: int): - if self._cached_page is None: - self._path_or_buf.seek(offset) - buf = self._path_or_buf.read(length) - if len(buf) < length: - self.close() - msg = f"Unable to read {length:d} bytes from file position {offset:d}." 
- raise ValueError(msg) - return buf - else: - if offset + length > len(self._cached_page): - self.close() - raise ValueError("The cached page is too small.") - return self._cached_page[offset : offset + length] - def _parse_metadata(self) -> None: done = False while not done: - self._cached_page = self._path_or_buf.read(self._page_length) + self._cached_page = Page( + self, + self._path_or_buf.read(self._page_length), + self.file_is_little_endian(), + ) if len(self._cached_page) <= 0: break if len(self._cached_page) != self._page_length: @@ -412,7 +370,7 @@ def _process_page_meta(self) -> bool: self._read_page_header() pt = const.page_meta_types + [const.page_amd_type, const.page_mix_type] if self._current_page_type in pt: - self._process_page_metadata() + self._cached_page.process_page_metadata() is_data_page = self._current_page_type == const.page_data_type is_mix_page = self._current_page_type == const.page_mix_type return bool( @@ -425,103 +383,18 @@ def _read_page_header(self): bit_offset = self._page_bit_offset tx = const.page_type_offset + bit_offset self._current_page_type = ( - self._read_int(tx, const.page_type_length) & const.page_type_mask2 + self._cached_page.read_int(tx, const.page_type_length) + & const.page_type_mask2 ) tx = const.block_count_offset + bit_offset - self._current_page_block_count = self._read_int(tx, const.block_count_length) + self._current_page_block_count = self._cached_page.read_int( + tx, const.block_count_length + ) tx = const.subheader_count_offset + bit_offset - self._current_page_subheaders_count = self._read_int( + self._current_page_subheaders_count = self._cached_page.read_int( tx, const.subheader_count_length ) - def _process_page_metadata(self) -> None: - bit_offset = self._page_bit_offset - - for i in range(self._current_page_subheaders_count): - pointer = self._process_subheader_pointers( - const.subheader_pointers_offset + bit_offset, i - ) - if pointer.length == 0: - continue - if pointer.compression == const.truncated_subheader_id: - continue - subheader_signature = self._read_subheader_signature(pointer.offset) - subheader_index = self._get_subheader_index( - subheader_signature, pointer.compression, pointer.ptype - ) - self._process_subheader(subheader_index, pointer) - - def _get_subheader_index(self, signature: bytes, compression, ptype) -> int: - # TODO: return here could be made an enum - index = const.subheader_signature_to_index.get(signature) - if index is None: - f1 = (compression == const.compressed_subheader_id) or (compression == 0) - f2 = ptype == const.compressed_subheader_type - if (self.compression != b"") and f1 and f2: - index = const.SASIndex.data_subheader_index - else: - self.close() - raise ValueError("Unknown subheader signature") - return index - - def _process_subheader_pointers( - self, offset: int, subheader_pointer_index: int - ) -> _SubheaderPointer: - - subheader_pointer_length = self._subheader_pointer_length - total_offset = offset + subheader_pointer_length * subheader_pointer_index - - subheader_offset = self._read_int(total_offset, self._int_length) - total_offset += self._int_length - - subheader_length = self._read_int(total_offset, self._int_length) - total_offset += self._int_length - - subheader_compression = self._read_int(total_offset, 1) - total_offset += 1 - - subheader_type = self._read_int(total_offset, 1) - - x = _SubheaderPointer( - subheader_offset, subheader_length, subheader_compression, subheader_type - ) - - return x - - def _read_subheader_signature(self, offset: int) -> bytes: - 
subheader_signature = self._read_bytes(offset, self._int_length) - return subheader_signature - - def _process_subheader( - self, subheader_index: int, pointer: _SubheaderPointer - ) -> None: - offset = pointer.offset - length = pointer.length - - if subheader_index == const.SASIndex.row_size_index: - processor = self._process_rowsize_subheader - elif subheader_index == const.SASIndex.column_size_index: - processor = self._process_columnsize_subheader - elif subheader_index == const.SASIndex.column_text_index: - processor = self._process_columntext_subheader - elif subheader_index == const.SASIndex.column_name_index: - processor = self._process_columnname_subheader - elif subheader_index == const.SASIndex.column_attributes_index: - processor = self._process_columnattributes_subheader - elif subheader_index == const.SASIndex.format_and_label_index: - processor = self._process_format_subheader - elif subheader_index == const.SASIndex.column_list_index: - processor = self._process_columnlist_subheader - elif subheader_index == const.SASIndex.subheader_counts_index: - processor = self._process_subheader_counts - elif subheader_index == const.SASIndex.data_subheader_index: - self._current_page_data_subheader_pointers.append(pointer) - return - else: - raise ValueError("unknown subheader index") - - processor(offset, length) - def _process_rowsize_subheader(self, offset: int, length: int) -> None: int_len = self._int_length @@ -534,27 +407,27 @@ def _process_rowsize_subheader(self, offset: int, length: int) -> None: lcs_offset += 354 lcp_offset += 378 - self.row_length = self._read_int( + self.row_length = self._cached_page.read_int( offset + const.row_length_offset_multiplier * int_len, int_len ) - self.row_count = self._read_int( + self.row_count = self._cached_page.read_int( offset + const.row_count_offset_multiplier * int_len, int_len ) - self.col_count_p1 = self._read_int( + self.col_count_p1 = self._cached_page.read_int( offset + const.col_count_p1_multiplier * int_len, int_len ) - self.col_count_p2 = self._read_int( + self.col_count_p2 = self._cached_page.read_int( offset + const.col_count_p2_multiplier * int_len, int_len ) mx = const.row_count_on_mix_page_offset_multiplier * int_len - self._mix_page_row_count = self._read_int(offset + mx, int_len) - self._lcs = self._read_int(lcs_offset, 2) - self._lcp = self._read_int(lcp_offset, 2) + self._mix_page_row_count = self._cached_page.read_int(offset + mx, int_len) + self._lcs = self._cached_page.read_int(lcs_offset, 2) + self._lcp = self._cached_page.read_int(lcp_offset, 2) def _process_columnsize_subheader(self, offset: int, length: int) -> None: int_len = self._int_length offset += int_len - self.column_count = self._read_int(offset, int_len) + self.column_count = self._cached_page.read_int(offset, int_len) if self.col_count_p1 + self.col_count_p2 != self.column_count: print( f"Warning: column count mismatch ({self.col_count_p1} + " @@ -568,16 +441,15 @@ def _process_subheader_counts(self, offset: int, length: int) -> None: def _process_columntext_subheader(self, offset: int, length: int) -> None: offset += self._int_length - text_block_size = self._read_int(offset, const.text_block_size_length) + text_block_size = self._cached_page.read_int( + offset, const.text_block_size_length + ) - buf = self._read_bytes(offset, text_block_size) + buf = self._cached_page.read_bytes(offset, text_block_size) cname_raw = buf[0:text_block_size].rstrip(b"\x00 ") - cname = cname_raw - if self.convert_header_text: - cname = cname.decode(self.encoding or 
self.default_encoding) - self.column_names_strings.append(cname) + self.column_names_raw.append(cname_raw) - if len(self.column_names_strings) == 1: + if len(self.column_names_raw) == 1: compression_literal = b"" for cl in const.compression_literals: if cl in cname_raw: @@ -589,33 +461,30 @@ def _process_columntext_subheader(self, offset: int, length: int) -> None: if self.U64: offset1 += 4 - buf = self._read_bytes(offset1, self._lcp) + buf = self._cached_page.read_bytes(offset1, self._lcp) compression_literal = buf.rstrip(b"\x00") if compression_literal == b"": self._lcs = 0 offset1 = offset + 32 if self.U64: offset1 += 4 - buf = self._read_bytes(offset1, self._lcp) + buf = self._cached_page.read_bytes(offset1, self._lcp) self.creator_proc = buf[0 : self._lcp] elif compression_literal == const.rle_compression: offset1 = offset + 40 if self.U64: offset1 += 4 - buf = self._read_bytes(offset1, self._lcp) + buf = self._cached_page.read_bytes(offset1, self._lcp) self.creator_proc = buf[0 : self._lcp] elif self._lcs > 0: self._lcp = 0 offset1 = offset + 16 if self.U64: offset1 += 4 - buf = self._read_bytes(offset1, self._lcs) + buf = self._cached_page.read_bytes(offset1, self._lcs) self.creator_proc = buf[0 : self._lcp] - if self.convert_header_text: - if hasattr(self, "creator_proc"): - self.creator_proc = self.creator_proc.decode( - self.encoding or self.default_encoding - ) + if hasattr(self, "creator_proc"): + self.creator_proc = self._convert_header_text(self.creator_proc) def _process_columnname_subheader(self, offset: int, length: int) -> None: int_len = self._int_length @@ -638,16 +507,19 @@ def _process_columnname_subheader(self, offset: int, length: int) -> None: + const.column_name_length_offset ) - idx = self._read_int( + idx = self._cached_page.read_int( text_subheader, const.column_name_text_subheader_length ) - col_offset = self._read_int( + col_offset = self._cached_page.read_int( col_name_offset, const.column_name_offset_length ) - col_len = self._read_int(col_name_length, const.column_name_length_length) + col_len = self._cached_page.read_int( + col_name_length, const.column_name_length_length + ) - name_str = self.column_names_strings[idx] - self.column_names.append(name_str[col_offset : col_offset + col_len]) + name_raw = self.column_names_raw[idx] + cname = name_raw[col_offset : col_offset + col_len] + self.column_names.append(self._convert_header_text(cname)) def _process_columnattributes_subheader(self, offset: int, length: int) -> None: int_len = self._int_length @@ -666,13 +538,15 @@ def _process_columnattributes_subheader(self, offset: int, length: int) -> None: offset + 2 * int_len + const.column_type_offset + i * (int_len + 8) ) - x = self._read_int(col_data_offset, int_len) + x = self._cached_page.read_int(col_data_offset, int_len) self._column_data_offsets.append(x) - x = self._read_int(col_data_len, const.column_data_length_length) + x = self._cached_page.read_int( + col_data_len, const.column_data_length_length + ) self._column_data_lengths.append(x) - x = self._read_int(col_types, const.column_type_length) + x = self._cached_page.read_int(col_types, const.column_type_length) self._column_types.append(b"d" if x == 1 else b"s") def _process_columnlist_subheader(self, offset: int, length: int) -> None: @@ -692,28 +566,38 @@ def _process_format_subheader(self, offset: int, length: int) -> None: col_label_offset = offset + const.column_label_offset_offset + 3 * int_len col_label_len = offset + const.column_label_length_offset + 3 * int_len - x = self._read_int( + x = 
self._cached_page.read_int( text_subheader_format, const.column_format_text_subheader_index_length ) - format_idx = min(x, len(self.column_names_strings) - 1) + format_idx = min(x, len(self.column_names_raw) - 1) - format_start = self._read_int( + format_start = self._cached_page.read_int( col_format_offset, const.column_format_offset_length ) - format_len = self._read_int(col_format_len, const.column_format_length_length) + format_len = self._cached_page.read_int( + col_format_len, const.column_format_length_length + ) - label_idx = self._read_int( + label_idx = self._cached_page.read_int( text_subheader_label, const.column_label_text_subheader_index_length ) - label_idx = min(label_idx, len(self.column_names_strings) - 1) + label_idx = min(label_idx, len(self.column_names_raw) - 1) - label_start = self._read_int(col_label_offset, const.column_label_offset_length) - label_len = self._read_int(col_label_len, const.column_label_length_length) + label_start = self._cached_page.read_int( + col_label_offset, const.column_label_offset_length + ) + label_len = self._cached_page.read_int( + col_label_len, const.column_label_length_length + ) - label_names = self.column_names_strings[label_idx] - column_label = label_names[label_start : label_start + label_len] - format_names = self.column_names_strings[format_idx] - column_format = format_names[format_start : format_start + format_len] + label_names = self.column_names_raw[label_idx] + column_label = self._convert_header_text( + label_names[label_start : label_start + label_len] + ) + format_names = self.column_names_raw[format_idx] + column_format = self._convert_header_text( + format_names[format_start : format_start + format_len] + ) current_column_number = len(self.columns) col = _Column( @@ -728,7 +612,7 @@ def _process_format_subheader(self, offset: int, length: int) -> None: self.column_formats.append(column_format) self.columns.append(col) - def read(self, nrows: int | None = None) -> DataFrame | None: + def read(self, nrows: int | None = None) -> DataFrame: if (nrows is None) and (self.chunksize is not None): nrows = self.chunksize @@ -740,7 +624,7 @@ def read(self, nrows: int | None = None) -> DataFrame | None: raise EmptyDataError("No columns to parse from file") if nrows > 0 and self._current_row_in_file_index >= self.row_count: - return None + return DataFrame() m = self.row_count - self._current_row_in_file_index if nrows > m: @@ -753,8 +637,30 @@ def read(self, nrows: int | None = None) -> DataFrame | None: self._byte_chunk = np.zeros((nd, 8 * nrows), dtype=np.uint8) self._current_row_in_chunk_index = 0 - p = Parser(self) - p.read(nrows) + + self._cython_reader = SAS7BDATCythonReader( + self, + self._byte_chunk, + self._string_chunk, + self.row_length, + self._page_bit_offset, + self._subheader_pointer_length, + self.row_count, + self._mix_page_row_count, + self.blank_missing, + (self.encoding or self.default_encoding) + if self.convert_text and self.encoding is not None + else None, + self.column_data_offsets(), + self.column_data_lengths(), + self.column_types(), + self.compression, + ) + self._update_cython_row_indices() + self._update_cython_page_info() + self._cython_reader.read(nrows) + self._update_python_row_indices() + self._cython_reader = None rslt = self._chunk_to_dataframe() if self.index is not None: @@ -762,9 +668,40 @@ def read(self, nrows: int | None = None) -> DataFrame | None: return rslt - def _read_next_page(self): + def _update_python_row_indices(self) -> None: + self._current_row_in_file_index = 
self._cython_reader.current_row_in_file_index + self._current_row_in_chunk_index = ( + self._cython_reader.current_row_in_chunk_index + ) + self._current_row_on_page_index = self._cython_reader.current_row_on_page_index + + def _update_cython_row_indices(self) -> None: + self._cython_reader.current_row_in_file_index = self._current_row_in_file_index + self._cython_reader.current_row_in_chunk_index = ( + self._current_row_in_chunk_index + ) + self._cython_reader.current_row_on_page_index = self._current_row_on_page_index + + def _update_cython_page_info(self) -> None: + self._cython_reader.current_row_on_page_index = self._current_row_on_page_index + self._cython_reader.current_page_type = self._current_page_type + self._cython_reader.current_page_block_count = self._current_page_block_count + self._cython_reader.current_page_data_subheader_pointers = ( + self._current_page_data_subheader_pointers + ) + self._cython_reader.current_page_subheaders_count = ( + self._current_page_subheaders_count + ) + self._cython_reader.cached_page = self._cached_page + + def _read_next_page(self) -> bool: self._current_page_data_subheader_pointers = [] - self._cached_page = self._path_or_buf.read(self._page_length) + self._current_row_on_page_index = 0 + self._cached_page = Page( + self, + self._path_or_buf.read(self._page_length), + self.file_is_little_endian(), + ) if len(self._cached_page) <= 0: return True elif len(self._cached_page) != self._page_length: @@ -777,7 +714,7 @@ def _read_next_page(self): self._read_page_header() if self._current_page_type in const.page_meta_types: - self._process_page_metadata() + self._cached_page.process_page_metadata() if self._current_page_type not in const.page_meta_types + [ const.page_data_type, @@ -785,13 +722,14 @@ def _read_next_page(self): ]: return self._read_next_page() + self._update_cython_page_info() return False def _chunk_to_dataframe(self) -> DataFrame: n = self._current_row_in_chunk_index m = self._current_row_in_file_index - ix = range(m - n, m) + ix = pd.RangeIndex(m - n, m) rslt = {} js, jb = 0, 0 @@ -810,13 +748,6 @@ def _chunk_to_dataframe(self) -> DataFrame: jb += 1 elif self._column_types[j] == b"s": rslt[name] = pd.Series(self._string_chunk[js, :], index=ix) - if self.convert_text and (self.encoding is not None): - rslt[name] = rslt[name].str.decode( - self.encoding or self.default_encoding - ) - if self.blank_missing: - ii = rslt[name].str.len() == 0 - rslt[name][ii] = np.nan js += 1 else: self.close() @@ -824,3 +755,17 @@ def _chunk_to_dataframe(self) -> DataFrame: df = DataFrame(rslt, columns=self.column_names, index=ix, copy=False) return df + + def _decode_string(self, b): + return b.decode(self.encoding or self.default_encoding) + + def _read_and_convert_header_text(self, offset: int, length: int) -> str | bytes: + return self._convert_header_text( + self._cached_page.read_bytes(offset, length).rstrip(b"\x00 ") + ) + + def _convert_header_text(self, b: bytes) -> str | bytes: + if self.convert_header_text: + return self._decode_string(b) + else: + return b diff --git a/pandas/io/sas/sas_constants.py b/pandas/io/sas/sas_constants.py index 979b2cacbf706..366e6924a1e16 100644 --- a/pandas/io/sas/sas_constants.py +++ b/pandas/io/sas/sas_constants.py @@ -1,3 +1,5 @@ +from __future__ import annotations + magic = ( b"\x00\x00\x00\x00\x00\x00\x00\x00" + b"\x00\x00\x00\x00\xc2\xea\x81\x60" diff --git a/pandas/io/sas/sas_xport.py b/pandas/io/sas/sas_xport.py index a64ade2b3c77c..a2e217767d1d4 100644 --- a/pandas/io/sas/sas_xport.py +++ 
b/pandas/io/sas/sas_xport.py @@ -17,6 +17,7 @@ import numpy as np from pandas._typing import ( + CompressionOptions, DatetimeNaTType, FilePath, ReadBuffer, @@ -256,6 +257,7 @@ def __init__( index=None, encoding: str | None = "ISO-8859-1", chunksize=None, + compression: CompressionOptions = "infer", ) -> None: self._encoding = encoding @@ -264,7 +266,11 @@ def __init__( self._chunksize = chunksize self.handles = get_handle( - filepath_or_buffer, "rb", encoding=encoding, is_text=False + filepath_or_buffer, + "rb", + encoding=encoding, + is_text=False, + compression=compression, ) self.filepath_or_buffer = self.handles.handle @@ -274,7 +280,7 @@ def __init__( self.close() raise - def close(self): + def close(self) -> None: self.handles.close() def _get_row(self): @@ -390,7 +396,7 @@ def _read_header(self): dtype = np.dtype(dtypel) self._dtype = dtype - def __next__(self): + def __next__(self) -> pd.DataFrame: return self.read(nrows=self._chunksize or 1) def _record_count(self) -> int: @@ -428,7 +434,7 @@ def _record_count(self) -> int: return (total_records_length - tail_pad) // self.record_length - def get_chunk(self, size=None): + def get_chunk(self, size=None) -> pd.DataFrame: """ Reads lines from Xport file and returns as dataframe @@ -457,7 +463,7 @@ def _missing_double(self, vec): return miss @Appender(_read_method_doc) - def read(self, nrows=None): + def read(self, nrows: int | None = None) -> pd.DataFrame: if nrows is None: nrows = self.nobs diff --git a/pandas/io/sas/sasreader.py b/pandas/io/sas/sasreader.py index f50fc777f55e9..359174166f980 100644 --- a/pandas/io/sas/sasreader.py +++ b/pandas/io/sas/sasreader.py @@ -14,9 +14,16 @@ ) from pandas._typing import ( + CompressionOptions, FilePath, ReadBuffer, ) +from pandas.util._decorators import ( + deprecate_nonkeyword_arguments, + doc, +) + +from pandas.core.shared_docs import _shared_docs from pandas.io.common import stringify_path @@ -31,17 +38,17 @@ class ReaderBase(metaclass=ABCMeta): """ @abstractmethod - def read(self, nrows=None): + def read(self, nrows: int | None = None) -> DataFrame: pass @abstractmethod - def close(self): + def close(self) -> None: pass - def __enter__(self): + def __enter__(self) -> ReaderBase: return self - def __exit__(self, exc_type, exc_value, traceback): + def __exit__(self, exc_type, exc_value, traceback) -> None: self.close() @@ -53,6 +60,7 @@ def read_sas( encoding: str | None = ..., chunksize: int = ..., iterator: bool = ..., + compression: CompressionOptions = ..., ) -> ReaderBase: ... @@ -65,10 +73,13 @@ def read_sas( encoding: str | None = ..., chunksize: None = ..., iterator: bool = ..., + compression: CompressionOptions = ..., ) -> DataFrame | ReaderBase: ... +@deprecate_nonkeyword_arguments(version=None, allowed_args=["filepath_or_buffer"]) +@doc(decompression_options=_shared_docs["decompression_options"] % "filepath_or_buffer") def read_sas( filepath_or_buffer: FilePath | ReadBuffer[bytes], format: str | None = None, @@ -76,6 +87,7 @@ def read_sas( encoding: str | None = None, chunksize: int | None = None, iterator: bool = False, + compression: CompressionOptions = "infer", ) -> DataFrame | ReaderBase: """ Read SAS files stored as either XPORT or SAS7BDAT format files. @@ -88,7 +100,7 @@ def read_sas( Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: ``file://localhost/path/to/table.sas``. 
- format : str {'xport', 'sas7bdat'} or None + format : str {{'xport', 'sas7bdat'}} or None If None, file format is inferred from file extension. If 'xport' or 'sas7bdat', uses the corresponding format. index : identifier of index column, defaults to None @@ -107,6 +119,7 @@ def read_sas( .. versionchanged:: 1.2 ``TextFileReader`` is a context manager. + {decompression_options} Returns ------- @@ -122,12 +135,14 @@ def read_sas( if not isinstance(filepath_or_buffer, str): raise ValueError(buffer_error_msg) fname = filepath_or_buffer.lower() - if fname.endswith(".xpt"): + if ".xpt" in fname: format = "xport" - elif fname.endswith(".sas7bdat"): + elif ".sas7bdat" in fname: format = "sas7bdat" else: - raise ValueError("unable to infer format of SAS file") + raise ValueError( + f"unable to infer format of SAS file from filename: {repr(fname)}" + ) reader: ReaderBase if format.lower() == "xport": @@ -138,6 +153,7 @@ def read_sas( index=index, encoding=encoding, chunksize=chunksize, + compression=compression, ) elif format.lower() == "sas7bdat": from pandas.io.sas.sas7bdat import SAS7BDATReader @@ -147,6 +163,7 @@ def read_sas( index=index, encoding=encoding, chunksize=chunksize, + compression=compression, ) else: raise ValueError("unknown SAS format") diff --git a/pandas/io/sql.py b/pandas/io/sql.py index 701642ad2cfe2..f591e7b8676f6 100644 --- a/pandas/io/sql.py +++ b/pandas/io/sql.py @@ -14,6 +14,7 @@ from functools import partial import re from typing import ( + TYPE_CHECKING, Any, Iterator, Sequence, @@ -30,12 +31,16 @@ DtypeArg, ) from pandas.compat._optional import import_optional_dependency -from pandas.errors import AbstractMethodError +from pandas.errors import ( + AbstractMethodError, + DatabaseError, +) from pandas.util._exceptions import find_stack_level from pandas.core.dtypes.common import ( is_datetime64tz_dtype, is_dict_like, + is_integer, is_list_like, ) from pandas.core.dtypes.dtypes import DatetimeTZDtype @@ -50,9 +55,8 @@ import pandas.core.common as com from pandas.core.tools.datetimes import to_datetime - -class DatabaseError(OSError): - pass +if TYPE_CHECKING: + from sqlalchemy import Table # ----------------------------------------------------------------------------- @@ -277,7 +281,9 @@ def read_sql_table( if not pandas_sql.has_table(table_name): raise ValueError(f"Table {table_name} not found") - table = pandas_sql.read_table( + # error: Item "SQLiteDatabase" of "Union[SQLDatabase, SQLiteDatabase]" + # has no attribute "read_table" + table = pandas_sql.read_table( # type: ignore[union-attr] table_name, index_col=index_col, coerce_float=coerce_float, @@ -662,7 +668,7 @@ def to_sql( ------- None or int Number of rows affected by to_sql. None is returned if the callable - passed into ``method`` does not return the number of rows. + passed into ``method`` does not return an integer number of rows. .. versionadded:: 1.4.0 @@ -701,7 +707,7 @@ def to_sql( ) -def has_table(table_name: str, con, schema: str | None = None): +def has_table(table_name: str, con, schema: str | None = None) -> bool: """ Check if DataBase has named table. @@ -728,7 +734,7 @@ def has_table(table_name: str, con, schema: str | None = None): table_exists = has_table -def pandasSQL_builder(con, schema: str | None = None): +def pandasSQL_builder(con, schema: str | None = None) -> SQLDatabase | SQLiteDatabase: """ Convenience function to return the correct PandasSQL subclass based on the provided parameters. 
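Stepping back to the `read_sas` hunks just above: format inference now matches `.xpt`/`.sas7bdat` anywhere in the filename rather than only as a suffix, and the new `compression` keyword is threaded through to `XportReader`/`SAS7BDATReader` via `get_handle`. A minimal usage sketch, assuming compressed SAS files exist at the (made-up) paths below; this is illustrative, not a test from the PR:

```python
import pandas as pd

# ".sas7bdat" is found inside the name, so the trailing ".gz" no longer
# defeats format inference; compression may be "infer" (the default) or
# an explicit codec.
df = pd.read_sas("data.sas7bdat.gz", compression="gzip")

# Chunked reading still returns a context-manager reader.
with pd.read_sas("data.xpt.bz2", chunksize=1000, compression="infer") as reader:
    for chunk in reader:
        print(chunk.shape)
```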
@@ -806,7 +812,7 @@ def __init__( def exists(self): return self.pd_sql.has_table(self.name, self.schema) - def sql_schema(self): + def sql_schema(self) -> str: from sqlalchemy.schema import CreateTable return str(CreateTable(self.table).compile(self.pd_sql.connectable)) @@ -816,7 +822,7 @@ def _execute_create(self): self.table = self.table.to_metadata(self.pd_sql.meta) self.table.create(bind=self.pd_sql.connectable) - def create(self): + def create(self) -> None: if self.exists(): if self.if_exists == "fail": raise ValueError(f"Table '{self.name}' already exists.") @@ -862,7 +868,7 @@ def _execute_insert_multi(self, conn, keys: list[str], data_iter) -> int: result = conn.execute(stmt) return result.rowcount - def insert_data(self): + def insert_data(self) -> tuple[list[str], list[np.ndarray]]: if self.index is not None: temp = self.frame.copy() temp.index.names = self.index @@ -875,7 +881,9 @@ def insert_data(self): column_names = list(map(str, temp.columns)) ncols = len(column_names) - data_list = [None] * ncols + # this just pre-allocates the list: None's will be replaced with ndarrays + # error: List item 0 has incompatible type "None"; expected "ndarray" + data_list: list[np.ndarray] = [None] * ncols # type: ignore[list-item] for i, (_, ser) in enumerate(temp.items()): vals = ser._values @@ -894,9 +902,7 @@ def insert_data(self): mask = isna(d) d[mask] = None - # error: No overload variant of "__setitem__" of "list" matches - # argument types "int", "ndarray" - data_list[i] = d # type: ignore[call-overload] + data_list[i] = d return column_names, data_list @@ -927,7 +933,7 @@ def insert( raise ValueError("chunksize argument should be non-zero") chunks = (nrows // chunksize) + 1 - total_inserted = 0 + total_inserted = None with self.pd_sql.run_transaction() as conn: for i in range(chunks): start_i = i * chunksize @@ -937,10 +943,12 @@ def insert( chunk_iter = zip(*(arr[start_i:end_i] for arr in data_list)) num_inserted = exec_insert(conn, keys, chunk_iter) - if num_inserted is None: - total_inserted = None - else: - total_inserted += num_inserted + # GH 46891 + if is_integer(num_inserted): + if total_inserted is None: + total_inserted = num_inserted + else: + total_inserted += num_inserted return total_inserted def _query_iterator( @@ -974,7 +982,13 @@ def _query_iterator( yield self.frame - def read(self, coerce_float=True, parse_dates=None, columns=None, chunksize=None): + def read( + self, + coerce_float=True, + parse_dates=None, + columns=None, + chunksize=None, + ) -> DataFrame | Iterator[DataFrame]: from sqlalchemy import select if columns is not None and len(columns) > 0: @@ -1398,7 +1412,7 @@ def read_table( columns=None, schema: str | None = None, chunksize: int | None = None, - ): + ) -> DataFrame | Iterator[DataFrame]: """ Read SQL database table into a DataFrame. @@ -1487,13 +1501,13 @@ def _query_iterator( def read_query( self, sql: str, - index_col: str | None = None, + index_col: str | Sequence[str] | None = None, coerce_float: bool = True, parse_dates=None, params=None, chunksize: int | None = None, dtype: DtypeArg | None = None, - ): + ) -> DataFrame | Iterator[DataFrame]: """ Read SQL query into a DataFrame. @@ -1620,7 +1634,7 @@ def check_case_sensitive( self, name, schema, - ): + ) -> None: """ Checks table name for issues with case-sensitivity. Method is called after data is inserted. 
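The `SQLTable.insert` hunk above (GH 46891) changes rowcount aggregation: only integer returns from the per-chunk insert method are summed, and the total stays `None` if no chunk ever reports one. A standalone sketch of just that accumulation rule, using the public `pandas.api.types.is_integer` in place of the internal import (the helper name and sample inputs are illustrative):

```python
from pandas.api.types import is_integer

def accumulate_rowcounts(chunk_results):
    # Mirror the GH 46891 rule: sum only integer returns; ignore anything
    # else (e.g. None from a custom ``method``), and keep the total as
    # None when no integer was ever reported.
    total_inserted = None
    for num_inserted in chunk_results:
        if is_integer(num_inserted):
            if total_inserted is None:
                total_inserted = num_inserted
            else:
                total_inserted += num_inserted
    return total_inserted

assert accumulate_rowcounts([3, None, 5]) == 8     # integers are summed
assert accumulate_rowcounts([None, None]) is None  # no integer ever returned
```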
@@ -1741,7 +1755,7 @@ def has_table(self, name: str, schema: str | None = None): insp = inspect(self.connectable) return insp.has_table(name, schema or self.meta.schema) - def get_table(self, table_name: str, schema: str | None = None): + def get_table(self, table_name: str, schema: str | None = None) -> Table: from sqlalchemy import ( Numeric, Table, @@ -1756,7 +1770,7 @@ def get_table(self, table_name: str, schema: str | None = None): column.type.asdecimal = False return tbl - def drop_table(self, table_name: str, schema: str | None = None): + def drop_table(self, table_name: str, schema: str | None = None) -> None: schema = schema or self.meta.schema if self.has_table(table_name, schema): self.meta.reflect(bind=self.connectable, only=[table_name], schema=schema) @@ -1836,10 +1850,14 @@ def __init__(self, *args, **kwargs) -> None: # this will transform time(12,34,56,789) into '12:34:56.000789' # (this is what sqlalchemy does) - sqlite3.register_adapter(time, lambda _: _.strftime("%H:%M:%S.%f")) + def _adapt_time(t): + # This is faster than strftime + return f"{t.hour:02d}:{t.minute:02d}:{t.second:02d}.{t.microsecond:06d}" + + sqlite3.register_adapter(time, _adapt_time) super().__init__(*args, **kwargs) - def sql_schema(self): + def sql_schema(self) -> str: return str(";\n".join(self.table)) def _execute_create(self): @@ -1847,7 +1865,7 @@ def _execute_create(self): for stmt in self.table: conn.execute(stmt) - def insert_statement(self, *, num_rows: int): + def insert_statement(self, *, num_rows: int) -> str: names = list(map(str, self.frame.columns)) wld = "?" # wildcard char escape = _get_valid_sqlite_name @@ -2049,7 +2067,7 @@ def read_query( parse_dates=None, chunksize: int | None = None, dtype: DtypeArg | None = None, - ): + ) -> DataFrame | Iterator[DataFrame]: args = _convert_params(sql, params) cursor = self.execute(*args) @@ -2164,17 +2182,17 @@ def to_sql( table.create() return table.insert(chunksize, method) - def has_table(self, name: str, schema: str | None = None): + def has_table(self, name: str, schema: str | None = None) -> bool: wld = "?" query = f"SELECT name FROM sqlite_master WHERE type='table' AND name={wld};" return len(self.execute(query, [name]).fetchall()) > 0 - def get_table(self, table_name: str, schema: str | None = None): + def get_table(self, table_name: str, schema: str | None = None) -> None: return None # not supported in fallback mode - def drop_table(self, name: str, schema: str | None = None): + def drop_table(self, name: str, schema: str | None = None) -> None: drop_sql = f"DROP TABLE {_get_valid_sqlite_name(name)}" self.execute(drop_sql) @@ -2205,7 +2223,7 @@ def get_schema( con=None, dtype: DtypeArg | None = None, schema: str | None = None, -): +) -> str: """ Get the SQL db table schema for the given frame. 
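One small change on the sqlite fallback path above replaces `strftime` with manual f-string formatting when adapting `datetime.time` values for sqlite3. A quick, self-contained check of the equivalence noted in the diff's comment (the adapter name here is illustrative):

```python
import sqlite3
from datetime import time

def adapt_time(t: time) -> str:
    # Same output as t.strftime("%H:%M:%S.%f"), but without strftime's
    # format-string parsing overhead.
    return f"{t.hour:02d}:{t.minute:02d}:{t.second:02d}.{t.microsecond:06d}"

t = time(12, 34, 56, 789)
assert adapt_time(t) == "12:34:56.000789"
assert adapt_time(t) == t.strftime("%H:%M:%S.%f")

# Registered the same way the reader does internally.
sqlite3.register_adapter(time, adapt_time)
```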
diff --git a/pandas/io/stata.py b/pandas/io/stata.py index 1a230f4ae4164..226a19e1f7599 100644 --- a/pandas/io/stata.py +++ b/pandas/io/stata.py @@ -140,7 +140,7 @@ {_statafile_processing_params2} {_chunksize_params} {_iterator_params} -{_shared_docs["decompression_options"]} +{_shared_docs["decompression_options"] % "filepath_or_buffer"} {_shared_docs["storage_options"]} Returns @@ -1930,7 +1930,9 @@ def _do_convert_categoricals( categories = list(vl.values()) try: # Try to catch duplicate categories - cat_data.categories = categories + # error: Incompatible types in assignment (expression has + # type "List[str]", variable has type "Index") + cat_data.categories = categories # type: ignore[assignment] except ValueError as err: vc = Series(categories).value_counts() repeated_cats = list(vc.index[vc > 1]) diff --git a/pandas/io/xml.py b/pandas/io/xml.py index 181b0fe115f4c..9b6eb31dafc07 100644 --- a/pandas/io/xml.py +++ b/pandas/io/xml.py @@ -7,6 +7,7 @@ import io from typing import ( Any, + Callable, Sequence, ) @@ -177,7 +178,7 @@ def parse_data(self) -> list[dict[str, str | None]]: raise AbstractMethodError(self) - def _parse_nodes(self) -> list[dict[str, str | None]]: + def _parse_nodes(self, elems: list[Any]) -> list[dict[str, str | None]]: """ Parse xml nodes. @@ -197,102 +198,6 @@ def _parse_nodes(self) -> list[dict[str, str | None]]: will have optional keys filled with None values. """ - raise AbstractMethodError(self) - - def _iterparse_nodes(self) -> list[dict[str, str | None]]: - """ - Iterparse xml nodes. - - This method will read in local disk, decompressed XML files for elements - and underlying descendants using iterparse, a method to iterate through - an XML tree without holding entire XML tree in memory. - - Raises - ------ - TypeError - * If `iterparse` is not a dict or its dict value is not list-like. - ParserError - * If `path_or_buffer` is not a physical, decompressed file on disk. - * If no data is returned from selected items in `iterparse`. - - Notes - ----- - Namespace URIs will be removed from return node values. Also, - elements with missing children or attributes in submitted list - will have optional keys filled with None values. - """ - - raise AbstractMethodError(self) - - def _validate_path(self) -> None: - """ - Validate xpath. - - This method checks for syntax, evaluation, or empty nodes return. - - Raises - ------ - SyntaxError - * If xpah is not supported or issues with namespaces. - - ValueError - * If xpah does not return any nodes. - """ - - raise AbstractMethodError(self) - - def _validate_names(self) -> None: - """ - Validate names. - - This method will check if names is a list-like and aligns - with length of parse nodes. - - Raises - ------ - ValueError - * If value is not a list and less then length of nodes. - """ - raise AbstractMethodError(self) - - def _parse_doc(self, raw_doc) -> bytes: - """ - Build tree from path_or_buffer. - - This method will parse XML object into tree - either from string/bytes or file location. - """ - raise AbstractMethodError(self) - - -class _EtreeFrameParser(_XMLFrameParser): - """ - Internal class to parse XML into DataFrames with the Python - standard library XML module: `xml.etree.ElementTree`. - """ - - def parse_data(self) -> list[dict[str, str | None]]: - from xml.etree.ElementTree import XML - - if self.stylesheet is not None: - raise ValueError( - "To use stylesheet, you need lxml installed and selected as parser." 
- ) - - if self.iterparse is None: - self.xml_doc = XML(self._parse_doc(self.path_or_buffer)) - self._validate_path() - - self._validate_names() - - xml_dicts: list[dict[str, str | None]] = ( - self._parse_nodes() if self.iterparse is None else self._iterparse_nodes() - ) - - return xml_dicts - - def _parse_nodes(self) -> list[dict[str, str | None]]: - elems = self.xml_doc.findall(self.xpath, namespaces=self.namespaces) dicts: list[dict[str, str | None]] if self.elems_only and self.attrs_only: @@ -375,8 +280,28 @@ def _parse_nodes(self) -> list[dict[str, str | None]]: return dicts - def _iterparse_nodes(self) -> list[dict[str, str | None]]: - from xml.etree.ElementTree import iterparse + def _iterparse_nodes(self, iterparse: Callable) -> list[dict[str, str | None]]: + """ + Iterparse xml nodes. + + This method will read in local disk, decompressed XML files for elements + and underlying descendants using iterparse, a method to iterate through + an XML tree without holding entire XML tree in memory. + + Raises + ------ + TypeError + * If `iterparse` is not a dict or its dict value is not list-like. + ParserError + * If `path_or_buffer` is not a physical, decompressed file on disk. + * If no data is returned from selected items in `iterparse`. + + Notes + ----- + Namespace URIs will be removed from return node values. Also, + elements with missing children or attributes in submitted list + will have optional keys filled with None values. + """ dicts: list[dict[str, str | None]] = [] row: dict[str, str | None] | None = None @@ -413,17 +338,33 @@ def _iterparse_nodes(self) -> list[dict[str, str | None]]: row = {} if row is not None: - for col in self.iterparse[row_node]: - if curr_elem == col: - row[col] = elem.text.strip() if elem.text else None - if col in elem.attrib: - row[col] = elem.attrib[col] + if self.names: + for col, nm in zip(self.iterparse[row_node], self.names): + if curr_elem == col: + elem_val = elem.text.strip() if elem.text else None + if row.get(nm) != elem_val and nm not in row: + row[nm] = elem_val + if col in elem.attrib: + if elem.attrib[col] not in row.values() and nm not in row: + row[nm] = elem.attrib[col] + else: + for col in self.iterparse[row_node]: + if curr_elem == col: + row[col] = elem.text.strip() if elem.text else None + if col in elem.attrib: + row[col] = elem.attrib[col] if event == "end": if curr_elem == row_node and row is not None: dicts.append(row) row = None + elem.clear() + if hasattr(elem, "getprevious"): + while ( + elem.getprevious() is not None and elem.getparent() is not None + ): + del elem.getparent()[0] if dicts == []: raise ParserError("No result from selected items in iterparse.") @@ -436,6 +377,81 @@ def _iterparse_nodes(self) -> list[dict[str, str | None]]: return dicts + def _validate_path(self) -> None: + """ + Validate xpath. + + This method checks for syntax, evaluation, or empty nodes return. + + Raises + ------ + SyntaxError + * If xpah is not supported or issues with namespaces. + + ValueError + * If xpah does not return any nodes. + """ + + raise AbstractMethodError(self) + + def _validate_names(self) -> None: + """ + Validate names. + + This method will check if names is a list-like and aligns + with length of parse nodes. + + Raises + ------ + ValueError + * If value is not a list and less then length of nodes. + """ + raise AbstractMethodError(self) + + def _parse_doc( + self, raw_doc: FilePath | ReadBuffer[bytes] | ReadBuffer[str] + ) -> bytes: + """ + Build tree from path_or_buffer. 
+ + This method will parse XML object into tree + either from string/bytes or file location. + """ + raise AbstractMethodError(self) + + +class _EtreeFrameParser(_XMLFrameParser): + """ + Internal class to parse XML into DataFrames with the Python + standard library XML module: `xml.etree.ElementTree`. + """ + + def parse_data(self) -> list[dict[str, str | None]]: + from xml.etree.ElementTree import ( + XML, + iterparse, + ) + + if self.stylesheet is not None: + raise ValueError( + "To use stylesheet, you need lxml installed and selected as parser." + ) + + if self.iterparse is None: + self.xml_doc = XML(self._parse_doc(self.path_or_buffer)) + self._validate_path() + elems = self.xml_doc.findall(self.xpath, namespaces=self.namespaces) + + self._validate_names() + + xml_dicts: list[dict[str, str | None]] = ( + self._parse_nodes(elems) + if self.iterparse is None + else self._iterparse_nodes(iterparse) + ) + + return xml_dicts + def _validate_path(self) -> None: """ Notes @@ -485,7 +501,9 @@ def _validate_names(self) -> None: f"{type(self.names).__name__} is not a valid type for names" ) - def _parse_doc(self, raw_doc) -> bytes: + def _parse_doc( + self, raw_doc: FilePath | ReadBuffer[bytes] | ReadBuffer[str] + ) -> bytes: from xml.etree.ElementTree import ( XMLParser, parse, @@ -521,7 +539,10 @@ def parse_data(self) -> list[dict[str, str | None]]: validate xpath, names, optionally parse and run XSLT, and parse original or transformed XML and return specific nodes. """ - from lxml.etree import XML + from lxml.etree import ( + XML, + iterparse, + ) if self.iterparse is None: self.xml_doc = XML(self._parse_doc(self.path_or_buffer)) @@ -531,162 +552,18 @@ def parse_data(self) -> list[dict[str, str | None]]: self.xml_doc = XML(self._transform_doc()) self._validate_path() + elems = self.xml_doc.xpath(self.xpath, namespaces=self.namespaces) self._validate_names() xml_dicts: list[dict[str, str | None]] = ( - self._parse_nodes() if self.iterparse is None else self._iterparse_nodes() + self._parse_nodes(elems) + if self.iterparse is None + else self._iterparse_nodes(iterparse) ) return xml_dicts - def _parse_nodes(self) -> list[dict[str, str | None]]: - elems = self.xml_doc.xpath(self.xpath, namespaces=self.namespaces) - dicts: list[dict[str, str | None]] - - if self.elems_only and self.attrs_only: - raise ValueError("Either element or attributes can be parsed not both.") - - elif self.elems_only: - if self.names: - dicts = [ - { - **( - {el.tag: el.text.strip()} - if el.text and not el.text.isspace() - else {} - ), - **{ - nm: ch.text.strip() if ch.text else None - for nm, ch in zip(self.names, el.xpath("*")) - }, - } - for el in elems - ] - else: - dicts = [ - { - ch.tag: ch.text.strip() if ch.text else None - for ch in el.xpath("*") - } - for el in elems - ] - - elif self.attrs_only: - dicts = [el.attrib for el in elems] - - else: - if self.names: - dicts = [ - { - **el.attrib, - **( - {el.tag: el.text.strip()} - if el.text and not el.text.isspace() - else {} - ), - **{ - nm: ch.text.strip() if ch.text else None - for nm, ch in zip(self.names, el.xpath("*")) - }, - } - for el in elems - ] - else: - dicts = [ - { - **el.attrib, - **( - {el.tag: el.text.strip()} - if el.text and not el.text.isspace() - else {} - ), - **{ - ch.tag: ch.text.strip() if ch.text else None - for ch in el.xpath("*") - }, - } - for el in elems - ] - - if self.namespaces or "}" in list(dicts[0].keys())[0]: - dicts = [ - {k.split("}")[1] if "}" in k else k: v for k, v in d.items()} - for d in dicts - ] - - keys = 
list(dict.fromkeys([k for d in dicts for k in d.keys()])) - dicts = [{k: d[k] if k in d.keys() else None for k in keys} for d in dicts] - - if self.names: - dicts = [{nm: v for nm, v in zip(self.names, d.values())} for d in dicts] - - return dicts - - def _iterparse_nodes(self) -> list[dict[str, str | None]]: - from lxml.etree import iterparse - - dicts: list[dict[str, str | None]] = [] - row: dict[str, str | None] | None = None - - if not isinstance(self.iterparse, dict): - raise TypeError( - f"{type(self.iterparse).__name__} is not a valid type for iterparse" - ) - - row_node = next(iter(self.iterparse.keys())) if self.iterparse else "" - if not is_list_like(self.iterparse[row_node]): - raise TypeError( - f"{type(self.iterparse[row_node])} is not a valid type " - "for value in iterparse" - ) - - if ( - not isinstance(self.path_or_buffer, str) - or is_url(self.path_or_buffer) - or is_fsspec_url(self.path_or_buffer) - or self.path_or_buffer.startswith((" None: msg = ( @@ -728,7 +605,9 @@ def _validate_names(self) -> None: f"{type(self.names).__name__} is not a valid type for names" ) - def _parse_doc(self, raw_doc) -> bytes: + def _parse_doc( + self, raw_doc: FilePath | ReadBuffer[bytes] | ReadBuffer[str] + ) -> bytes: from lxml.etree import ( XMLParser, fromstring, @@ -954,9 +833,7 @@ def _parse( ) -@deprecate_nonkeyword_arguments( - version=None, allowed_args=["path_or_buffer"], stacklevel=2 -) +@deprecate_nonkeyword_arguments(version=None, allowed_args=["path_or_buffer"]) @doc( storage_options=_shared_docs["storage_options"], decompression_options=_shared_docs["decompression_options"] % "path_or_buffer", @@ -1020,7 +897,8 @@ def read_xml( names : list-like, optional Column names for DataFrame of parsed XML data. Use this parameter to - rename original element names and distinguish same named elements. + rename original element names and distinguish same named elements and + attributes. dtype : Type name or dict of column -> type, optional Data type for data or columns. E.g. {{'a': np.float64, 'b': np.int32, diff --git a/pandas/plotting/_core.py b/pandas/plotting/_core.py index 929ddb52aea6d..bc39d1f619f49 100644 --- a/pandas/plotting/_core.py +++ b/pandas/plotting/_core.py @@ -1879,11 +1879,11 @@ def _get_plot_backend(backend: str | None = None): ----- Modifies `_backends` with imported backend as a side effect. 
""" - backend = backend or get_option("plotting.backend") + backend_str: str = backend or get_option("plotting.backend") - if backend in _backends: - return _backends[backend] + if backend_str in _backends: + return _backends[backend_str] - module = _load_backend(backend) - _backends[backend] = module + module = _load_backend(backend_str) + _backends[backend_str] = module return module diff --git a/pandas/plotting/_matplotlib/boxplot.py b/pandas/plotting/_matplotlib/boxplot.py index f82889a304dd2..a49b035b1aaf1 100644 --- a/pandas/plotting/_matplotlib/boxplot.py +++ b/pandas/plotting/_matplotlib/boxplot.py @@ -2,6 +2,7 @@ from typing import ( TYPE_CHECKING, + Literal, NamedTuple, ) import warnings @@ -34,7 +35,10 @@ class BoxPlot(LinePlot): - _kind = "box" + @property + def _kind(self) -> Literal["box"]: + return "box" + _layout_type = "horizontal" _valid_return_types = (None, "axes", "dict", "both") diff --git a/pandas/plotting/_matplotlib/compat.py b/pandas/plotting/_matplotlib/compat.py index c731c40f10a05..6015662999a7d 100644 --- a/pandas/plotting/_matplotlib/compat.py +++ b/pandas/plotting/_matplotlib/compat.py @@ -1,4 +1,6 @@ # being a bit too dynamic +from __future__ import annotations + import operator from pandas.util.version import Version diff --git a/pandas/plotting/_matplotlib/converter.py b/pandas/plotting/_matplotlib/converter.py index 19f726009f646..873084393371c 100644 --- a/pandas/plotting/_matplotlib/converter.py +++ b/pandas/plotting/_matplotlib/converter.py @@ -574,6 +574,8 @@ def _daily_finder(vmin, vmax, freq: BaseOffset): Period(ordinal=int(vmin), freq=freq), Period(ordinal=int(vmax), freq=freq), ) + assert isinstance(vmin, Period) + assert isinstance(vmax, Period) span = vmax.ordinal - vmin.ordinal + 1 dates_ = period_range(start=vmin, end=vmax, freq=freq) # Initialize the output @@ -865,7 +867,8 @@ def _quarterly_finder(vmin, vmax, freq): info_fmt[year_start] = "%F" else: - years = dates_[year_start] // 4 + 1 + # https://github.com/pandas-dev/pandas/pull/47602 + years = dates_[year_start] // 4 + 1970 nyears = span / periodsperyear (min_anndef, maj_anndef) = _get_default_annual_spacing(nyears) major_idx = year_start[(years % maj_anndef == 0)] @@ -1073,7 +1076,9 @@ def __call__(self, x, pos=0) -> str: fmt = self.formatdict.pop(x, "") if isinstance(fmt, np.bytes_): fmt = fmt.decode("utf-8") - return Period(ordinal=int(x), freq=self.freq).strftime(fmt) + period = Period(ordinal=int(x), freq=self.freq) + assert isinstance(period, Period) + return period.strftime(fmt) class TimeSeries_TimedeltaFormatter(Formatter): diff --git a/pandas/plotting/_matplotlib/core.py b/pandas/plotting/_matplotlib/core.py index 5fceb14b9d1cc..301474edc6a8e 100644 --- a/pandas/plotting/_matplotlib/core.py +++ b/pandas/plotting/_matplotlib/core.py @@ -1,9 +1,14 @@ from __future__ import annotations +from abc import ( + ABC, + abstractmethod, +) from typing import ( TYPE_CHECKING, Hashable, Iterable, + Literal, Sequence, ) import warnings @@ -11,7 +16,10 @@ from matplotlib.artist import Artist import numpy as np -from pandas._typing import IndexLabel +from pandas._typing import ( + IndexLabel, + PlottingOrientation, +) from pandas.errors import AbstractMethodError from pandas.util._decorators import cache_readonly @@ -78,7 +86,7 @@ def _color_in_style(style: str) -> bool: return not set(BASE_COLORS).isdisjoint(style) -class MPLPlot: +class MPLPlot(ABC): """ Base class for assembling a pandas plot using matplotlib @@ -89,13 +97,17 @@ class MPLPlot: """ @property - def 
_kind(self): + @abstractmethod + def _kind(self) -> str: """Specify kind str. Must be overridden in child class""" raise NotImplementedError _layout_type = "vertical" _default_rot = 0 - orientation: str | None = None + + @property + def orientation(self) -> str | None: + return None axes: np.ndarray # of Axes objects @@ -667,6 +679,7 @@ def _adorn_subplots(self): ) for ax in self.axes: + ax = getattr(ax, "right_ax", ax) if self.yticks is not None: ax.set_yticks(self.yticks) @@ -843,7 +856,9 @@ def _get_xticks(self, convert_period: bool = False): @classmethod @register_pandas_matplotlib_converters - def _plot(cls, ax: Axes, x, y, style=None, is_errorbar: bool = False, **kwds): + def _plot( + cls, ax: Axes, x, y: np.ndarray, style=None, is_errorbar: bool = False, **kwds + ): mask = isna(y) if mask.any(): y = np.ma.array(y) @@ -1101,7 +1116,7 @@ def _get_axes_layout(self) -> tuple[int, int]: return (len(y_set), len(x_set)) -class PlanePlot(MPLPlot): +class PlanePlot(MPLPlot, ABC): """ Abstract class for plotting on plane, currently scatter and hexbin. """ @@ -1159,7 +1174,9 @@ def _plot_colorbar(self, ax: Axes, **kwds): class ScatterPlot(PlanePlot): - _kind = "scatter" + @property + def _kind(self) -> Literal["scatter"]: + return "scatter" def __init__(self, data, x, y, s=None, c=None, **kwargs) -> None: if s is None: @@ -1247,7 +1264,9 @@ def _make_plot(self): class HexBinPlot(PlanePlot): - _kind = "hexbin" + @property + def _kind(self) -> Literal["hexbin"]: + return "hexbin" def __init__(self, data, x, y, C=None, **kwargs) -> None: super().__init__(data, x, y, **kwargs) @@ -1277,9 +1296,15 @@ def _make_legend(self): class LinePlot(MPLPlot): - _kind = "line" _default_rot = 0 - orientation = "vertical" + + @property + def orientation(self) -> PlottingOrientation: + return "vertical" + + @property + def _kind(self) -> Literal["line", "area", "hist", "kde", "box"]: + return "line" def __init__(self, data, **kwargs) -> None: from pandas.plotting import plot_params @@ -1363,8 +1388,7 @@ def _plot( # type: ignore[override] cls._update_stacker(ax, stacking_id, y) return lines - @classmethod - def _ts_plot(cls, ax: Axes, x, data, style=None, **kwds): + def _ts_plot(self, ax: Axes, x, data, style=None, **kwds): # accept x to be consistent with normal plot func, # x is not passed to tsplot as it uses data.index as x coordinate # column_num must be in kwds for stacking purpose @@ -1377,9 +1401,9 @@ def _ts_plot(cls, ax: Axes, x, data, style=None, **kwds): decorate_axes(ax.left_ax, freq, kwds) if hasattr(ax, "right_ax"): decorate_axes(ax.right_ax, freq, kwds) - ax._plot_data.append((data, cls._kind, kwds)) + ax._plot_data.append((data, self._kind, kwds)) - lines = cls._plot(ax, data.index, data.values, style=style, **kwds) + lines = self._plot(ax, data.index, data.values, style=style, **kwds) # set date formatter, locators and rescale limits format_dateaxis(ax, ax.freq, data.index) return lines @@ -1471,7 +1495,9 @@ def get_label(i): class AreaPlot(LinePlot): - _kind = "area" + @property + def _kind(self) -> Literal["area"]: + return "area" def __init__(self, data, **kwargs) -> None: kwargs.setdefault("stacked", True) @@ -1544,9 +1570,15 @@ def _post_plot_logic(self, ax: Axes, data): class BarPlot(MPLPlot): - _kind = "bar" + @property + def _kind(self) -> Literal["bar", "barh"]: + return "bar" + _default_rot = 90 - orientation = "vertical" + + @property + def orientation(self) -> PlottingOrientation: + return "vertical" def __init__(self, data, **kwargs) -> None: # we have to treat a series differently 
than a @@ -1698,9 +1730,15 @@ def _decorate_ticks(self, ax: Axes, name, ticklabels, start_edge, end_edge): class BarhPlot(BarPlot): - _kind = "barh" + @property + def _kind(self) -> Literal["barh"]: + return "barh" + _default_rot = 0 - orientation = "horizontal" + + @property + def orientation(self) -> Literal["horizontal"]: + return "horizontal" @property def _start_base(self): @@ -1727,7 +1765,10 @@ def _decorate_ticks(self, ax: Axes, name, ticklabels, start_edge, end_edge): class PiePlot(MPLPlot): - _kind = "pie" + @property + def _kind(self) -> Literal["pie"]: + return "pie" + _layout_type = "horizontal" def __init__(self, data, kind=None, **kwargs) -> None: diff --git a/pandas/plotting/_matplotlib/groupby.py b/pandas/plotting/_matplotlib/groupby.py index 1b16eefb360ae..4f1cd3f38343a 100644 --- a/pandas/plotting/_matplotlib/groupby.py +++ b/pandas/plotting/_matplotlib/groupby.py @@ -112,7 +112,9 @@ def reconstruct_data_with_by( data_list = [] for key, group in grouped: - columns = MultiIndex.from_product([[key], cols]) + # error: List item 1 has incompatible type "Union[Hashable, + # Sequence[Hashable]]"; expected "Iterable[Hashable]" + columns = MultiIndex.from_product([[key], cols]) # type: ignore[list-item] sub_group = group[cols] sub_group.columns = columns data_list.append(sub_group) diff --git a/pandas/plotting/_matplotlib/hist.py b/pandas/plotting/_matplotlib/hist.py index 3be168fe159cf..77496cf049f3d 100644 --- a/pandas/plotting/_matplotlib/hist.py +++ b/pandas/plotting/_matplotlib/hist.py @@ -1,9 +1,14 @@ from __future__ import annotations -from typing import TYPE_CHECKING +from typing import ( + TYPE_CHECKING, + Literal, +) import numpy as np +from pandas._typing import PlottingOrientation + from pandas.core.dtypes.common import ( is_integer, is_list_like, @@ -40,7 +45,9 @@ class HistPlot(LinePlot): - _kind = "hist" + @property + def _kind(self) -> Literal["hist", "kde"]: + return "hist" def __init__(self, data, bins=10, bottom=0, **kwargs) -> None: self.bins = bins # use mpl default @@ -64,8 +71,8 @@ def _args_adjust(self): def _calculate_bins(self, data: DataFrame) -> np.ndarray: """Calculate bins given data""" - values = data._convert(datetime=True)._get_numeric_data() - values = np.ravel(values) + nd_values = data._convert(datetime=True)._get_numeric_data() + values = np.ravel(nd_values) values = values[~isna(values)] hist, bins = np.histogram( @@ -159,7 +166,7 @@ def _post_plot_logic(self, ax: Axes, data): ax.set_ylabel("Frequency") @property - def orientation(self): + def orientation(self) -> PlottingOrientation: if self.kwds.get("orientation", None) == "horizontal": return "horizontal" else: @@ -167,8 +174,13 @@ def orientation(self): class KdePlot(HistPlot): - _kind = "kde" - orientation = "vertical" + @property + def _kind(self) -> Literal["kde"]: + return "kde" + + @property + def orientation(self) -> Literal["vertical"]: + return "vertical" def __init__(self, data, bw_method=None, ind=None, **kwargs) -> None: MPLPlot.__init__(self, data, **kwargs) diff --git a/pandas/plotting/_matplotlib/style.py b/pandas/plotting/_matplotlib/style.py index 597c0dafa8cab..9e459b82fec97 100644 --- a/pandas/plotting/_matplotlib/style.py +++ b/pandas/plotting/_matplotlib/style.py @@ -143,8 +143,8 @@ def _get_colors_from_colormap( num_colors: int, ) -> list[Color]: """Get colors from colormap.""" - colormap = _get_cmap_instance(colormap) - return [colormap(num) for num in np.linspace(0, 1, num=num_colors)] + cmap = _get_cmap_instance(colormap) + return [cmap(num) for num in 
np.linspace(0, 1, num=num_colors)] def _get_cmap_instance(colormap: str | Colormap) -> Colormap: diff --git a/pandas/plotting/_matplotlib/timeseries.py b/pandas/plotting/_matplotlib/timeseries.py index 303266ae410de..ca6cccb0f98eb 100644 --- a/pandas/plotting/_matplotlib/timeseries.py +++ b/pandas/plotting/_matplotlib/timeseries.py @@ -2,6 +2,7 @@ from __future__ import annotations +from datetime import timedelta import functools from typing import ( TYPE_CHECKING, @@ -185,11 +186,10 @@ def _get_ax_freq(ax: Axes): return ax_freq -def _get_period_alias(freq) -> str | None: +def _get_period_alias(freq: timedelta | BaseOffset | str) -> str | None: freqstr = to_offset(freq).rule_code - freq = get_period_alias(freqstr) - return freq + return get_period_alias(freqstr) def _get_freq(ax: Axes, series: Series): @@ -235,7 +235,9 @@ def use_dynamic_x(ax: Axes, data: DataFrame | Series) -> bool: x = data.index if base <= FreqGroup.FR_DAY.value: return x[:1].is_normalized - return Period(x[0], freq_str).to_timestamp().tz_localize(x.tz) == x[0] + period = Period(x[0], freq_str) + assert isinstance(period, Period) + return period.to_timestamp().tz_localize(x.tz) == x[0] return True diff --git a/pandas/plotting/_matplotlib/tools.py b/pandas/plotting/_matplotlib/tools.py index bfbf77e85afd3..94357e5002ffd 100644 --- a/pandas/plotting/_matplotlib/tools.py +++ b/pandas/plotting/_matplotlib/tools.py @@ -83,7 +83,11 @@ def table( return table -def _get_layout(nplots: int, layout=None, layout_type: str = "box") -> tuple[int, int]: +def _get_layout( + nplots: int, + layout: tuple[int, int] | None = None, + layout_type: str = "box", +) -> tuple[int, int]: if layout is not None: if not isinstance(layout, (tuple, list)) or len(layout) != 2: raise ValueError("Layout must be a tuple of (rows, columns)") diff --git a/pandas/plotting/_misc.py b/pandas/plotting/_misc.py index 6e7b53e4c5ae4..0e82a0fc924fb 100644 --- a/pandas/plotting/_misc.py +++ b/pandas/plotting/_misc.py @@ -1,7 +1,18 @@ +from __future__ import annotations + from contextlib import contextmanager +from typing import ( + TYPE_CHECKING, + Iterator, +) from pandas.plotting._core import _get_plot_backend +if TYPE_CHECKING: + from matplotlib.axes import Axes + from matplotlib.figure import Figure + import numpy as np + def table(ax, data, rowLabels=None, colLabels=None, **kwargs): """ @@ -27,7 +38,7 @@ def table(ax, data, rowLabels=None, colLabels=None, **kwargs): ) -def register(): +def register() -> None: """ Register pandas formatters and converters with matplotlib. @@ -49,7 +60,7 @@ def register(): plot_backend.register() -def deregister(): +def deregister() -> None: """ Remove pandas formatters and converters. @@ -81,7 +92,7 @@ def scatter_matrix( hist_kwds=None, range_padding=0.05, **kwargs, -): +) -> np.ndarray: """ Draw a matrix of scatter plots. @@ -156,7 +167,7 @@ def scatter_matrix( ) -def radviz(frame, class_column, ax=None, color=None, colormap=None, **kwds): +def radviz(frame, class_column, ax=None, color=None, colormap=None, **kwds) -> Axes: """ Plot a multidimensional dataset in 2D. @@ -239,7 +250,7 @@ def radviz(frame, class_column, ax=None, color=None, colormap=None, **kwds): def andrews_curves( frame, class_column, ax=None, samples=200, color=None, colormap=None, **kwargs -): +) -> Axes: """ Generate a matplotlib plot of Andrews curves, for visualising clusters of multivariate data. 
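The return annotations added in `pandas/plotting/_misc.py` make the helpers' outputs explicit: `scatter_matrix` hands back a NumPy array of matplotlib `Axes`, while `radviz`, `andrews_curves`, `lag_plot`, and `autocorrelation_plot` each return a single `Axes`. A quick sanity check of the `scatter_matrix` shape, assuming matplotlib is installed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.default_rng(0).normal(size=(50, 3)), columns=list("abc")
)
axes = pd.plotting.scatter_matrix(df)

# One Axes per column pair, returned as a 2-D ndarray matching the annotation.
assert isinstance(axes, np.ndarray)
assert axes.shape == (3, 3)
```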
@@ -297,7 +308,7 @@ def andrews_curves( ) -def bootstrap_plot(series, fig=None, size=50, samples=500, **kwds): +def bootstrap_plot(series, fig=None, size=50, samples=500, **kwds) -> Figure: """ Bootstrap plot on mean, median and mid-range statistics. @@ -364,7 +375,7 @@ def parallel_coordinates( axvlines_kwds=None, sort_labels=False, **kwargs, -): +) -> Axes: """ Parallel coordinates plotting. @@ -430,7 +441,7 @@ def parallel_coordinates( ) -def lag_plot(series, lag=1, ax=None, **kwds): +def lag_plot(series, lag=1, ax=None, **kwds) -> Axes: """ Lag plot for time series. @@ -474,7 +485,7 @@ def lag_plot(series, lag=1, ax=None, **kwds): return plot_backend.lag_plot(series=series, lag=lag, ax=ax, **kwds) -def autocorrelation_plot(series, ax=None, **kwargs): +def autocorrelation_plot(series, ax=None, **kwargs) -> Axes: """ Autocorrelation plot for time series. @@ -531,21 +542,21 @@ def __getitem__(self, key): raise ValueError(f"{key} is not a valid pandas plotting option") return super().__getitem__(key) - def __setitem__(self, key, value): + def __setitem__(self, key, value) -> None: key = self._get_canonical_key(key) - return super().__setitem__(key, value) + super().__setitem__(key, value) - def __delitem__(self, key): + def __delitem__(self, key) -> None: key = self._get_canonical_key(key) if key in self._DEFAULT_KEYS: raise ValueError(f"Cannot remove default parameter {key}") - return super().__delitem__(key) + super().__delitem__(key) def __contains__(self, key) -> bool: key = self._get_canonical_key(key) return super().__contains__(key) - def reset(self): + def reset(self) -> None: """ Reset the option store to its initial state @@ -560,7 +571,7 @@ def _get_canonical_key(self, key): return self._ALIASES.get(key, key) @contextmanager - def use(self, key, value): + def use(self, key, value) -> Iterator[_Options]: """ Temporarily set a parameter value using the with statement. Aliasing allowed. 
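The `use` method typed above is what backs `pd.plotting.plot_params.use`: it swaps an option value in, yields, and restores the previous value when the block exits. A minimal usage sketch, relying on the documented `x_compat` alias for `xaxis.compat`:

```python
import pandas as pd

params = pd.plotting.plot_params
assert params["x_compat"] is False  # default; "x_compat" aliases "xaxis.compat"

with params.use("x_compat", True):
    # Plots created inside this block see the temporary setting.
    assert params["x_compat"] is True

assert params["x_compat"] is False  # restored on exit by the context manager
```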
diff --git a/pandas/tests/api/test_api.py b/pandas/tests/api/test_api.py index 1bc2cf5085f1a..6350f402ac0e5 100644 --- a/pandas/tests/api/test_api.py +++ b/pandas/tests/api/test_api.py @@ -16,7 +16,9 @@ def check(self, namespace, expected, ignored=None): # ignored ones # compare vs the expected - result = sorted(f for f in dir(namespace) if not f.startswith("__")) + result = sorted( + f for f in dir(namespace) if not f.startswith("__") and f != "annotations" + ) if ignored is not None: result = sorted(set(result) - set(ignored)) @@ -116,6 +118,7 @@ class TestPDApi(Base): "eval", "factorize", "get_dummies", + "from_dummies", "infer_freq", "isna", "isnull", diff --git a/pandas/tests/apply/test_frame_apply.py b/pandas/tests/apply/test_frame_apply.py index ef7ab4a469865..72a9d8723d34c 100644 --- a/pandas/tests/apply/test_frame_apply.py +++ b/pandas/tests/apply/test_frame_apply.py @@ -1577,3 +1577,11 @@ def test_apply_type(): result = df.apply(type, axis=1) expected = Series({"a": Series, "b": Series, "c": Series}) tm.assert_series_equal(result, expected) + + +def test_apply_on_empty_dataframe(): + # GH 39111 + df = DataFrame({"a": [1, 2], "b": [3, 0]}) + result = df.head(0).apply(lambda x: max(x["a"], x["b"]), axis=1) + expected = Series([]) + tm.assert_series_equal(result, expected) diff --git a/pandas/tests/apply/test_series_apply.py b/pandas/tests/apply/test_series_apply.py index 69f7bebb63986..8900aa0060559 100644 --- a/pandas/tests/apply/test_series_apply.py +++ b/pandas/tests/apply/test_series_apply.py @@ -598,6 +598,34 @@ def test_map_dict_na_key(): tm.assert_series_equal(result, expected) +@pytest.mark.parametrize("arg_func", [dict, Series]) +def test_map_dict_ignore_na(arg_func): + # GH#47527 + mapping = arg_func({1: 10, np.nan: 42}) + ser = Series([1, np.nan, 2]) + result = ser.map(mapping, na_action="ignore") + expected = Series([10, np.nan, np.nan]) + tm.assert_series_equal(result, expected) + + +def test_map_defaultdict_ignore_na(): + # GH#47527 + mapping = defaultdict(int, {1: 10, np.nan: 42}) + ser = Series([1, np.nan, 2]) + result = ser.map(mapping) + expected = Series([10, 0, 0]) + tm.assert_series_equal(result, expected) + + +def test_map_categorical_na_ignore(): + # GH#47527 + values = pd.Categorical([1, np.nan, 2], categories=[10, 1]) + ser = Series(values) + result = ser.map({1: 10, np.nan: 42}) + expected = Series([10, np.nan, np.nan]) + tm.assert_series_equal(result, expected) + + def test_map_dict_subclass_with_missing(): """ Test Series.map with a dictionary subclass that defines __missing__, diff --git a/pandas/tests/arithmetic/test_period.py b/pandas/tests/arithmetic/test_period.py index 7adc407fd5de1..50f5ab8aee9dd 100644 --- a/pandas/tests/arithmetic/test_period.py +++ b/pandas/tests/arithmetic/test_period.py @@ -1243,6 +1243,21 @@ def test_parr_add_sub_tdt64_nat_array(self, box_with_array, other): with pytest.raises(TypeError, match=msg): other - obj + # some but not *all* NaT + other = other.copy() + other[0] = np.timedelta64(0, "ns") + expected = PeriodIndex([pi[0]] + ["NaT"] * 8, freq="19D") + expected = tm.box_expected(expected, box_with_array) + + result = obj + other + tm.assert_equal(result, expected) + result = other + obj + tm.assert_equal(result, expected) + result = obj - other + tm.assert_equal(result, expected) + with pytest.raises(TypeError, match=msg): + other - obj + # --------------------------------------------------------------- # Unsorted diff --git a/pandas/tests/arrays/datetimes/test_constructors.py 
b/pandas/tests/arrays/datetimes/test_constructors.py index 684b478d1de08..cb2d8f31f0f9c 100644 --- a/pandas/tests/arrays/datetimes/test_constructors.py +++ b/pandas/tests/arrays/datetimes/test_constructors.py @@ -77,9 +77,16 @@ def test_mismatched_timezone_raises(self): dtype=DatetimeTZDtype(tz="US/Central"), ) dtype = DatetimeTZDtype(tz="US/Eastern") - with pytest.raises(TypeError, match="Timezone of the array"): + msg = r"dtype=datetime64\[ns.*\] does not match data dtype datetime64\[ns.*\]" + with pytest.raises(TypeError, match=msg): DatetimeArray(arr, dtype=dtype) + # also with mismatched tzawareness + with pytest.raises(TypeError, match=msg): + DatetimeArray(arr, dtype=np.dtype("M8[ns]")) + with pytest.raises(TypeError, match=msg): + DatetimeArray(arr.tz_localize(None), dtype=arr.dtype) + def test_non_array_raises(self): with pytest.raises(ValueError, match="list"): DatetimeArray([1, 2, 3]) diff --git a/pandas/tests/arrays/interval/test_interval.py b/pandas/tests/arrays/interval/test_interval.py index 44eafc72b1f5f..073e6b6119b14 100644 --- a/pandas/tests/arrays/interval/test_interval.py +++ b/pandas/tests/arrays/interval/test_interval.py @@ -60,12 +60,12 @@ def test_is_empty(self, constructor, left, right, closed): class TestMethods: - @pytest.mark.parametrize("new_closed", ["left", "right", "both", "neither"]) - def test_set_closed(self, closed, new_closed): + @pytest.mark.parametrize("new_inclusive", ["left", "right", "both", "neither"]) + def test_set_inclusive(self, closed, new_inclusive): # GH 21670 array = IntervalArray.from_breaks(range(10), inclusive=closed) - result = array.set_closed(new_closed) - expected = IntervalArray.from_breaks(range(10), inclusive=new_closed) + result = array.set_inclusive(new_inclusive) + expected = IntervalArray.from_breaks(range(10), inclusive=new_inclusive) tm.assert_extension_array_equal(result, expected) @pytest.mark.parametrize( @@ -134,10 +134,10 @@ def test_set_na(self, left_right_dtypes): tm.assert_extension_array_equal(result, expected) - def test_setitem_mismatched_closed(self): + def test_setitem_mismatched_inclusive(self): arr = IntervalArray.from_breaks(range(4), "right") orig = arr.copy() - other = arr.set_closed("both") + other = arr.set_inclusive("both") msg = "'value.inclusive' is 'both', expected 'right'" with pytest.raises(ValueError, match=msg): @@ -414,15 +414,15 @@ def test_interval_error_and_warning(): def test_interval_array_error_and_warning(): # GH 40245 - msg = ( - "Deprecated argument `closed` cannot " - "be passed if argument `inclusive` is not None" - ) - with pytest.raises(ValueError, match=msg): - IntervalArray([Interval(0, 1), Interval(1, 5)], closed="both", inclusive="both") - - msg = "Argument `closed` is deprecated in favor of `inclusive`" - with tm.assert_produces_warning(FutureWarning, match=msg, check_stacklevel=False): + msg = "Can only specify 'closed' or 'inclusive', not both." + with pytest.raises(TypeError, match=msg): + with tm.assert_produces_warning(FutureWarning): + IntervalArray( + [Interval(0, 1), Interval(1, 5)], closed="both", inclusive="both" + ) + + msg = "the 'closed'' keyword is deprecated, use 'inclusive' instead." 
+ with tm.assert_produces_warning(FutureWarning, match=msg): IntervalArray([Interval(0, 1), Interval(1, 5)], closed="both") @@ -433,15 +433,13 @@ def test_arrow_interval_type_error_and_warning(): from pandas.core.arrays.arrow._arrow_utils import ArrowIntervalType - msg = ( - "Deprecated argument `closed` cannot " - "be passed if argument `inclusive` is not None" - ) - with pytest.raises(ValueError, match=msg): - ArrowIntervalType(pa.int64(), closed="both", inclusive="both") + msg = "Can only specify 'closed' or 'inclusive', not both." + with pytest.raises(TypeError, match=msg): + with tm.assert_produces_warning(FutureWarning): + ArrowIntervalType(pa.int64(), closed="both", inclusive="both") - msg = "Argument `closed` is deprecated in favor of `inclusive`" - with tm.assert_produces_warning(FutureWarning, match=msg, check_stacklevel=False): + msg = "the 'closed'' keyword is deprecated, use 'inclusive' instead." + with tm.assert_produces_warning(FutureWarning, match=msg): ArrowIntervalType(pa.int64(), closed="both") @@ -460,3 +458,38 @@ def test_interval_index_subtype(timezone, inclusive_endpoints_fixture): dates[:-1], dates[1:], inclusive=inclusive_endpoints_fixture ) tm.assert_index_equal(result, expected) + + +def test_from_tuples_deprecation(): + # GH#40245 + with tm.assert_produces_warning(FutureWarning): + IntervalArray.from_tuples([(0, 1), (1, 2)], closed="right") + + +def test_from_tuples_deprecation_error(): + # GH#40245 + msg = "Can only specify 'closed' or 'inclusive', not both." + with pytest.raises(TypeError, match=msg): + with tm.assert_produces_warning(FutureWarning): + IntervalArray.from_tuples( + [(0, 1), (1, 2)], closed="right", inclusive="right" + ) + + +def test_from_breaks_deprecation(): + # GH#40245 + with tm.assert_produces_warning(FutureWarning): + IntervalArray.from_breaks([0, 1, 2, 3], closed="right") + + +def test_from_arrays_deprecation(): + # GH#40245 + with tm.assert_produces_warning(FutureWarning): + IntervalArray.from_arrays([0, 1, 2], [1, 2, 3], closed="right") + + +def test_set_closed_deprecated(): + # GH#40245 + array = IntervalArray.from_breaks(range(10)) + with tm.assert_produces_warning(FutureWarning): + array.set_closed(closed="both") diff --git a/pandas/tests/arrays/numpy_/test_indexing.py b/pandas/tests/arrays/numpy_/test_indexing.py index f92411efe774c..225d64ad7d258 100644 --- a/pandas/tests/arrays/numpy_/test_indexing.py +++ b/pandas/tests/arrays/numpy_/test_indexing.py @@ -7,6 +7,17 @@ class TestSearchsorted: + def test_searchsorted_string(self, string_dtype): + arr = pd.array(["a", "b", "c"], dtype=string_dtype) + + result = arr.searchsorted("a", side="left") + assert is_scalar(result) + assert result == 0 + + result = arr.searchsorted("a", side="right") + assert is_scalar(result) + assert result == 1 + def test_searchsorted_numeric_dtypes_scalar(self, any_real_numpy_dtype): arr = pd.array([1, 3, 90], dtype=any_real_numpy_dtype) result = arr.searchsorted(30) diff --git a/pandas/tests/arrays/sparse/test_array.py b/pandas/tests/arrays/sparse/test_array.py index 492427b2be213..9b78eb345e188 100644 --- a/pandas/tests/arrays/sparse/test_array.py +++ b/pandas/tests/arrays/sparse/test_array.py @@ -391,23 +391,36 @@ def test_setting_fill_value_updates(): @pytest.mark.parametrize( - "arr, loc", + "arr,fill_value,loc", [ - ([None, 1, 2], 0), - ([0, None, 2], 1), - ([0, 1, None], 2), - ([0, 1, 1, None, None], 3), - ([1, 1, 1, 2], -1), - ([], -1), + ([None, 1, 2], None, 0), + ([0, None, 2], None, 1), + ([0, 1, None], None, 2), + ([0, 1, 1, None, None], None, 
3), + ([1, 1, 1, 2], None, -1), + ([], None, -1), + ([None, 1, 0, 0, None, 2], None, 0), + ([None, 1, 0, 0, None, 2], 1, 1), + ([None, 1, 0, 0, None, 2], 2, 5), + ([None, 1, 0, 0, None, 2], 3, -1), + ([None, 0, 0, 1, 2, 1], 0, 1), + ([None, 0, 0, 1, 2, 1], 1, 3), ], ) -def test_first_fill_value_loc(arr, loc): - result = SparseArray(arr)._first_fill_value_loc() +def test_first_fill_value_loc(arr, fill_value, loc): + result = SparseArray(arr, fill_value=fill_value)._first_fill_value_loc() assert result == loc @pytest.mark.parametrize( - "arr", [[1, 2, np.nan, np.nan], [1, np.nan, 2, np.nan], [1, 2, np.nan]] + "arr", + [ + [1, 2, np.nan, np.nan], + [1, np.nan, 2, np.nan], + [1, 2, np.nan], + [np.nan, 1, 0, 0, np.nan, 2], + [np.nan, 0, 0, 1, 2, 1], + ], ) @pytest.mark.parametrize("fill_value", [np.nan, 0, 1]) def test_unique_na_fill(arr, fill_value): diff --git a/pandas/tests/arrays/string_/test_indexing.py b/pandas/tests/arrays/string_/test_indexing.py deleted file mode 100644 index 41466c43288c3..0000000000000 --- a/pandas/tests/arrays/string_/test_indexing.py +++ /dev/null @@ -1,16 +0,0 @@ -from pandas.core.dtypes.common import is_scalar - -import pandas as pd - - -class TestSearchsorted: - def test_searchsorted(self, string_dtype): - arr = pd.array(["a", "b", "c"], dtype=string_dtype) - - result = arr.searchsorted("a", side="left") - assert is_scalar(result) - assert result == 0 - - result = arr.searchsorted("a", side="right") - assert is_scalar(result) - assert result == 1 diff --git a/pandas/tests/arrays/string_/test_string.py b/pandas/tests/arrays/string_/test_string.py index b563f84207b22..6a17a56a47cbc 100644 --- a/pandas/tests/arrays/string_/test_string.py +++ b/pandas/tests/arrays/string_/test_string.py @@ -5,7 +5,10 @@ import numpy as np import pytest -from pandas.compat import pa_version_under2p0 +from pandas.compat import ( + pa_version_under2p0, + pa_version_under6p0, +) from pandas.errors import PerformanceWarning import pandas.util._test_decorators as td @@ -101,7 +104,7 @@ def test_add(dtype, request): "unsupported operand type(s) for +: 'ArrowStringArray' and " "'ArrowStringArray'" ) - mark = pytest.mark.xfail(raises=TypeError, reason=reason) + mark = pytest.mark.xfail(raises=NotImplementedError, reason=reason) request.node.add_marker(mark) a = pd.Series(["a", "b", "c", None, None], dtype=dtype) @@ -142,7 +145,7 @@ def test_add_2d(dtype, request): def test_add_sequence(dtype, request): if dtype.storage == "pyarrow": reason = "unsupported operand type(s) for +: 'ArrowStringArray' and 'list'" - mark = pytest.mark.xfail(raises=TypeError, reason=reason) + mark = pytest.mark.xfail(raises=NotImplementedError, reason=reason) request.node.add_marker(mark) a = pd.array(["a", "b", None, None], dtype=dtype) @@ -160,7 +163,7 @@ def test_add_sequence(dtype, request): def test_mul(dtype, request): if dtype.storage == "pyarrow": reason = "unsupported operand type(s) for *: 'ArrowStringArray' and 'int'" - mark = pytest.mark.xfail(raises=TypeError, reason=reason) + mark = pytest.mark.xfail(raises=NotImplementedError, reason=reason) request.node.add_marker(mark) a = pd.array(["a", "b", None], dtype=dtype) @@ -375,7 +378,7 @@ def test_reduce_missing(skipna, dtype): @pytest.mark.parametrize("method", ["min", "max"]) @pytest.mark.parametrize("skipna", [True, False]) def test_min_max(method, skipna, dtype, request): - if dtype.storage == "pyarrow": + if dtype.storage == "pyarrow" and pa_version_under6p0: reason = "'ArrowStringArray' object has no attribute 'max'" mark = 
pytest.mark.xfail(raises=TypeError, reason=reason) request.node.add_marker(mark) @@ -392,7 +395,7 @@ def test_min_max(method, skipna, dtype, request): @pytest.mark.parametrize("method", ["min", "max"]) @pytest.mark.parametrize("box", [pd.Series, pd.array]) def test_min_max_numpy(method, box, dtype, request): - if dtype.storage == "pyarrow": + if dtype.storage == "pyarrow" and (pa_version_under6p0 or box is pd.array): if box is pd.array: reason = "'<=' not supported between instances of 'str' and 'NoneType'" else: @@ -588,3 +591,23 @@ def test_isin(dtype, fixed_now_ts): result = s.isin(["a", fixed_now_ts]) expected = pd.Series([True, False, False]) tm.assert_series_equal(result, expected) + + +def test_setitem_scalar_with_mask_validation(dtype): + # https://github.com/pandas-dev/pandas/issues/47628 + # setting None with a boolean mask (through _putmaks) should still result + # in pd.NA values in the underlying array + ser = pd.Series(["a", "b", "c"], dtype=dtype) + mask = np.array([False, True, False]) + + ser[mask] = None + assert ser.array[1] is pd.NA + + # for other non-string we should also raise an error + ser = pd.Series(["a", "b", "c"], dtype=dtype) + if type(ser.array) is pd.arrays.StringArray: + msg = "Cannot set non-string value" + else: + msg = "Scalar must be NA or str" + with pytest.raises(ValueError, match=msg): + ser[mask] = 1 diff --git a/pandas/tests/arrays/test_array.py b/pandas/tests/arrays/test_array.py index f7f015cbe4a23..79e73fec706f1 100644 --- a/pandas/tests/arrays/test_array.py +++ b/pandas/tests/arrays/test_array.py @@ -298,7 +298,7 @@ def test_array_inference(data, expected): [ # mix of frequencies [pd.Period("2000", "D"), pd.Period("2001", "A")], - # mix of closed + # mix of inclusive [pd.Interval(0, 1, "left"), pd.Interval(1, 2, "right")], # Mix of timezones [pd.Timestamp("2000", tz="CET"), pd.Timestamp("2000", tz="UTC")], diff --git a/pandas/tests/arrays/test_datetimelike.py b/pandas/tests/arrays/test_datetimelike.py index 10881495c27b3..ea895e5656ccb 100644 --- a/pandas/tests/arrays/test_datetimelike.py +++ b/pandas/tests/arrays/test_datetimelike.py @@ -564,7 +564,7 @@ def test_shift_fill_int_deprecated(self): expected = arr.copy() if self.array_cls is PeriodArray: - fill_val = PeriodArray._scalar_type._from_ordinal(1, freq=arr.freq) + fill_val = arr._scalar_type._from_ordinal(1, freq=arr.freq) else: fill_val = arr._scalar_type(1) expected[0] = fill_val diff --git a/pandas/tests/arrays/test_datetimes.py b/pandas/tests/arrays/test_datetimes.py index f3d471ca96614..af1a292a2975a 100644 --- a/pandas/tests/arrays/test_datetimes.py +++ b/pandas/tests/arrays/test_datetimes.py @@ -1,6 +1,8 @@ """ Tests for DatetimeArray """ +import operator + import numpy as np import pytest @@ -37,6 +39,26 @@ def dtype(self, unit, tz_naive_fixture): else: return DatetimeTZDtype(unit=unit, tz=tz) + @pytest.fixture + def dta_dti(self, unit, dtype): + tz = getattr(dtype, "tz", None) + + dti = pd.date_range("2016-01-01", periods=55, freq="D", tz=tz) + if tz is None: + arr = np.asarray(dti).astype(f"M8[{unit}]") + else: + arr = np.asarray(dti.tz_convert("UTC").tz_localize(None)).astype( + f"M8[{unit}]" + ) + + dta = DatetimeArray._simple_new(arr, dtype=dtype) + return dta, dti + + @pytest.fixture + def dta(self, dta_dti): + dta, dti = dta_dti + return dta + def test_non_nano(self, unit, reso, dtype): arr = np.arange(5, dtype=np.int64).view(f"M8[{unit}]") dta = DatetimeArray._simple_new(arr, dtype=dtype) @@ -52,17 +74,8 @@ def test_non_nano(self, unit, reso, dtype): 
@pytest.mark.parametrize( "field", DatetimeArray._field_ops + DatetimeArray._bool_ops ) - def test_fields(self, unit, reso, field, dtype): - tz = getattr(dtype, "tz", None) - dti = pd.date_range("2016-01-01", periods=55, freq="D", tz=tz) - if tz is None: - arr = np.asarray(dti).astype(f"M8[{unit}]") - else: - arr = np.asarray(dti.tz_convert("UTC").tz_localize(None)).astype( - f"M8[{unit}]" - ) - - dta = DatetimeArray._simple_new(arr, dtype=dtype) + def test_fields(self, unit, reso, field, dtype, dta_dti): + dta, dti = dta_dti # FIXME: assert (dti == dta).all() @@ -107,6 +120,93 @@ def test_std_non_nano(self, unit): assert res._reso == dta._reso assert res == dti.std().floor(unit) + @pytest.mark.filterwarnings("ignore:Converting to PeriodArray.*:UserWarning") + def test_to_period(self, dta_dti): + dta, dti = dta_dti + result = dta.to_period("D") + expected = dti._data.to_period("D") + + tm.assert_extension_array_equal(result, expected) + + def test_iter(self, dta): + res = next(iter(dta)) + expected = dta[0] + + assert type(res) is pd.Timestamp + assert res.value == expected.value + assert res._reso == expected._reso + assert res == expected + + def test_astype_object(self, dta): + result = dta.astype(object) + assert all(x._reso == dta._reso for x in result) + assert all(x == y for x, y in zip(result, dta)) + + def test_to_pydatetime(self, dta_dti): + dta, dti = dta_dti + + result = dta.to_pydatetime() + expected = dti.to_pydatetime() + tm.assert_numpy_array_equal(result, expected) + + @pytest.mark.parametrize("meth", ["time", "timetz", "date"]) + def test_time_date(self, dta_dti, meth): + dta, dti = dta_dti + + result = getattr(dta, meth) + expected = getattr(dti, meth) + tm.assert_numpy_array_equal(result, expected) + + def test_format_native_types(self, unit, reso, dtype, dta_dti): + # In this case we should get the same formatted values with our nano + # version dti._data as we do with the non-nano dta + dta, dti = dta_dti + + res = dta._format_native_types() + exp = dti._data._format_native_types() + tm.assert_numpy_array_equal(res, exp) + + def test_repr(self, dta_dti, unit): + dta, dti = dta_dti + + assert repr(dta) == repr(dti._data).replace("[ns", f"[{unit}") + + # TODO: tests with td64 + def test_compare_mismatched_resolutions(self, comparison_op): + # comparison that numpy gets wrong bc of silent overflows + op = comparison_op + + iinfo = np.iinfo(np.int64) + vals = np.array([iinfo.min, iinfo.min + 1, iinfo.max], dtype=np.int64) + + # Construct so that arr2[1] < arr[1] < arr[2] < arr2[2] + arr = np.array(vals).view("M8[ns]") + arr2 = arr.view("M8[s]") + + left = DatetimeArray._simple_new(arr, dtype=arr.dtype) + right = DatetimeArray._simple_new(arr2, dtype=arr2.dtype) + + if comparison_op is operator.eq: + expected = np.array([False, False, False]) + elif comparison_op is operator.ne: + expected = np.array([True, True, True]) + elif comparison_op in [operator.lt, operator.le]: + expected = np.array([False, False, True]) + else: + expected = np.array([False, True, False]) + + result = op(left, right) + tm.assert_numpy_array_equal(result, expected) + + result = op(left[1], right) + tm.assert_numpy_array_equal(result, expected) + + if op not in [operator.eq, operator.ne]: + # check that numpy still gets this wrong; if it is fixed we may be + # able to remove compare_mismatched_resolutions + np_res = op(left._ndarray, right._ndarray) + tm.assert_numpy_array_equal(np_res[1:], ~expected[1:]) + class TestDatetimeArrayComparisons: # TODO: merge this into tests/arithmetic/test_datetime64 
once it is @@ -145,6 +245,36 @@ def test_cmp_dt64_arraylike_tznaive(self, comparison_op): class TestDatetimeArray: + def test_astype_non_nano_tznaive(self): + dti = pd.date_range("2016-01-01", periods=3) + + res = dti.astype("M8[s]") + assert res.dtype == "M8[s]" + + dta = dti._data + res = dta.astype("M8[s]") + assert res.dtype == "M8[s]" + assert isinstance(res, pd.core.arrays.DatetimeArray) # used to be ndarray + + def test_astype_non_nano_tzaware(self): + dti = pd.date_range("2016-01-01", periods=3, tz="UTC") + + res = dti.astype("M8[s, US/Pacific]") + assert res.dtype == "M8[s, US/Pacific]" + + dta = dti._data + res = dta.astype("M8[s, US/Pacific]") + assert res.dtype == "M8[s, US/Pacific]" + + # from non-nano to non-nano, preserving reso + res2 = res.astype("M8[s, UTC]") + assert res2.dtype == "M8[s, UTC]" + assert not tm.shares_memory(res2, res) + + res3 = res.astype("M8[s, UTC]", copy=False) + assert res2.dtype == "M8[s, UTC]" + assert tm.shares_memory(res3, res) + def test_astype_to_same(self): arr = DatetimeArray._from_sequence( ["2000"], dtype=DatetimeTZDtype(tz="US/Central") diff --git a/pandas/tests/arrays/test_period.py b/pandas/tests/arrays/test_period.py index de0e766e4a2aa..a4b442ff526e9 100644 --- a/pandas/tests/arrays/test_period.py +++ b/pandas/tests/arrays/test_period.py @@ -115,6 +115,20 @@ def test_sub_period(): arr - other +def test_sub_period_overflow(): + # GH#47538 + dti = pd.date_range("1677-09-22", periods=2, freq="D") + pi = dti.to_period("ns") + + per = pd.Period._from_ordinal(10**14, pi.freq) + + with pytest.raises(OverflowError, match="Overflow in int64 addition"): + pi - per + + with pytest.raises(OverflowError, match="Overflow in int64 addition"): + per - pi + + # ---------------------------------------------------------------------------- # Methods diff --git a/pandas/tests/arrays/test_timedeltas.py b/pandas/tests/arrays/test_timedeltas.py index c8b850d35035a..b3b79bd988ad8 100644 --- a/pandas/tests/arrays/test_timedeltas.py +++ b/pandas/tests/arrays/test_timedeltas.py @@ -1,3 +1,5 @@ +from datetime import timedelta + import numpy as np import pytest @@ -6,7 +8,10 @@ import pandas as pd from pandas import Timedelta import pandas._testing as tm -from pandas.core.arrays import TimedeltaArray +from pandas.core.arrays import ( + DatetimeArray, + TimedeltaArray, +) class TestNonNano: @@ -25,6 +30,11 @@ def reso(self, unit): else: raise NotImplementedError(unit) + @pytest.fixture + def tda(self, unit): + arr = np.arange(5, dtype=np.int64).view(f"m8[{unit}]") + return TimedeltaArray._simple_new(arr, dtype=arr.dtype) + def test_non_nano(self, unit, reso): arr = np.arange(5, dtype=np.int64).view(f"m8[{unit}]") tda = TimedeltaArray._simple_new(arr, dtype=arr.dtype) @@ -33,17 +43,130 @@ def test_non_nano(self, unit, reso): assert tda[0]._reso == reso @pytest.mark.parametrize("field", TimedeltaArray._field_ops) - def test_fields(self, unit, reso, field): - arr = np.arange(5, dtype=np.int64).view(f"m8[{unit}]") - tda = TimedeltaArray._simple_new(arr, dtype=arr.dtype) - - as_nano = arr.astype("m8[ns]") + def test_fields(self, tda, field): + as_nano = tda._ndarray.astype("m8[ns]") tda_nano = TimedeltaArray._simple_new(as_nano, dtype=as_nano.dtype) result = getattr(tda, field) expected = getattr(tda_nano, field) tm.assert_numpy_array_equal(result, expected) + def test_to_pytimedelta(self, tda): + as_nano = tda._ndarray.astype("m8[ns]") + tda_nano = TimedeltaArray._simple_new(as_nano, dtype=as_nano.dtype) + + result = tda.to_pytimedelta() + expected = 
tda_nano.to_pytimedelta() + tm.assert_numpy_array_equal(result, expected) + + def test_total_seconds(self, unit, tda): + as_nano = tda._ndarray.astype("m8[ns]") + tda_nano = TimedeltaArray._simple_new(as_nano, dtype=as_nano.dtype) + + result = tda.total_seconds() + expected = tda_nano.total_seconds() + tm.assert_numpy_array_equal(result, expected) + + @pytest.mark.parametrize( + "nat", [np.datetime64("NaT", "ns"), np.datetime64("NaT", "us")] + ) + def test_add_nat_datetimelike_scalar(self, nat, tda): + result = tda + nat + assert isinstance(result, DatetimeArray) + assert result._reso == tda._reso + assert result.isna().all() + + result = nat + tda + assert isinstance(result, DatetimeArray) + assert result._reso == tda._reso + assert result.isna().all() + + def test_add_pdnat(self, tda): + result = tda + pd.NaT + assert isinstance(result, TimedeltaArray) + assert result._reso == tda._reso + assert result.isna().all() + + result = pd.NaT + tda + assert isinstance(result, TimedeltaArray) + assert result._reso == tda._reso + assert result.isna().all() + + # TODO: 2022-07-11 this is the only test that gets to DTA.tz_convert + # or tz_localize with non-nano; implement tests specific to that. + def test_add_datetimelike_scalar(self, tda, tz_naive_fixture): + ts = pd.Timestamp("2016-01-01", tz=tz_naive_fixture) + + msg = "with mis-matched resolutions" + with pytest.raises(NotImplementedError, match=msg): + # mismatched reso -> check that we don't give an incorrect result + tda + ts + with pytest.raises(NotImplementedError, match=msg): + # mismatched reso -> check that we don't give an incorrect result + ts + tda + + ts = ts._as_unit(tda._unit) + + exp_values = tda._ndarray + ts.asm8 + expected = ( + DatetimeArray._simple_new(exp_values, dtype=exp_values.dtype) + .tz_localize("UTC") + .tz_convert(ts.tz) + ) + + result = tda + ts + tm.assert_extension_array_equal(result, expected) + + result = ts + tda + tm.assert_extension_array_equal(result, expected) + + def test_mul_scalar(self, tda): + other = 2 + result = tda * other + expected = TimedeltaArray._simple_new(tda._ndarray * other, dtype=tda.dtype) + tm.assert_extension_array_equal(result, expected) + assert result._reso == tda._reso + + def test_mul_listlike(self, tda): + other = np.arange(len(tda)) + result = tda * other + expected = TimedeltaArray._simple_new(tda._ndarray * other, dtype=tda.dtype) + tm.assert_extension_array_equal(result, expected) + assert result._reso == tda._reso + + def test_mul_listlike_object(self, tda): + other = np.arange(len(tda)) + result = tda * other.astype(object) + expected = TimedeltaArray._simple_new(tda._ndarray * other, dtype=tda.dtype) + tm.assert_extension_array_equal(result, expected) + assert result._reso == tda._reso + + def test_div_numeric_scalar(self, tda): + other = 2 + result = tda / other + expected = TimedeltaArray._simple_new(tda._ndarray / other, dtype=tda.dtype) + tm.assert_extension_array_equal(result, expected) + assert result._reso == tda._reso + + def test_div_td_scalar(self, tda): + other = timedelta(seconds=1) + result = tda / other + expected = tda._ndarray / np.timedelta64(1, "s") + tm.assert_numpy_array_equal(result, expected) + + def test_div_numeric_array(self, tda): + other = np.arange(len(tda)) + result = tda / other + expected = TimedeltaArray._simple_new(tda._ndarray / other, dtype=tda.dtype) + tm.assert_extension_array_equal(result, expected) + assert result._reso == tda._reso + + def test_div_td_array(self, tda): + other = tda._ndarray + tda._ndarray[-1] + result = tda / other 
+ expected = tda._ndarray / other + tm.assert_numpy_array_equal(result, expected) + class TestTimedeltaArray: @pytest.mark.parametrize("dtype", [int, np.int32, np.int64, "uint32", "uint64"]) diff --git a/pandas/tests/computation/test_eval.py b/pandas/tests/computation/test_eval.py index e70d493d23515..b0ad2f69a75b9 100644 --- a/pandas/tests/computation/test_eval.py +++ b/pandas/tests/computation/test_eval.py @@ -12,6 +12,7 @@ from pandas.errors import ( NumExprClobberingError, PerformanceWarning, + UndefinedVariableError, ) import pandas.util._test_decorators as td @@ -44,7 +45,6 @@ from pandas.core.computation.ops import ( ARITH_OPS_SYMS, SPECIAL_CASE_ARITH_OPS_SYMS, - UndefinedVariableError, _binary_math_ops, _binary_ops_dict, _unary_math_ops, diff --git a/pandas/tests/config/test_localization.py b/pandas/tests/config/test_localization.py index 21b1b7ed6ee65..f972a9ee3b497 100644 --- a/pandas/tests/config/test_localization.py +++ b/pandas/tests/config/test_localization.py @@ -10,31 +10,67 @@ set_locale, ) -from pandas.compat import is_platform_windows - import pandas as pd _all_locales = get_locales() or [] -_current_locale = locale.getlocale() +_current_locale = locale.setlocale(locale.LC_ALL) # getlocale() is wrong, see GH#46595 -# Don't run any of these tests if we are on Windows or have no locales. -pytestmark = pytest.mark.skipif( - is_platform_windows() or not _all_locales, reason="Need non-Windows and locales" -) +# Don't run any of these tests if we have no locales. +pytestmark = pytest.mark.skipif(not _all_locales, reason="Need locales") _skip_if_only_one_locale = pytest.mark.skipif( len(_all_locales) <= 1, reason="Need multiple locales for meaningful test" ) -def test_can_set_locale_valid_set(): +def _get_current_locale(lc_var: int = locale.LC_ALL) -> str: + # getlocale is not always compliant with setlocale, use setlocale. GH#46595 + return locale.setlocale(lc_var) + + +@pytest.mark.parametrize("lc_var", (locale.LC_ALL, locale.LC_CTYPE, locale.LC_TIME)) +def test_can_set_current_locale(lc_var): + # Can set the current locale + before_locale = _get_current_locale(lc_var) + assert can_set_locale(before_locale, lc_var=lc_var) + after_locale = _get_current_locale(lc_var) + assert before_locale == after_locale + + +@pytest.mark.parametrize("lc_var", (locale.LC_ALL, locale.LC_CTYPE, locale.LC_TIME)) +def test_can_set_locale_valid_set(lc_var): # Can set the default locale. - assert can_set_locale("") + before_locale = _get_current_locale(lc_var) + assert can_set_locale("", lc_var=lc_var) + after_locale = _get_current_locale(lc_var) + assert before_locale == after_locale -def test_can_set_locale_invalid_set(): +@pytest.mark.parametrize("lc_var", (locale.LC_ALL, locale.LC_CTYPE, locale.LC_TIME)) +def test_can_set_locale_invalid_set(lc_var): # Cannot set an invalid locale. - assert not can_set_locale("non-existent_locale") + before_locale = _get_current_locale(lc_var) + assert not can_set_locale("non-existent_locale", lc_var=lc_var) + after_locale = _get_current_locale(lc_var) + assert before_locale == after_locale + + +@pytest.mark.parametrize( + "lang,enc", + [ + ("it_CH", "UTF-8"), + ("en_US", "ascii"), + ("zh_CN", "GB2312"), + ("it_IT", "ISO-8859-1"), + ], +) +@pytest.mark.parametrize("lc_var", (locale.LC_ALL, locale.LC_CTYPE, locale.LC_TIME)) +def test_can_set_locale_no_leak(lang, enc, lc_var): + # Test that can_set_locale does not leak even when returning False. 
See GH#46595 + before_locale = _get_current_locale(lc_var) + can_set_locale((lang, enc), locale.LC_ALL) + after_locale = _get_current_locale(lc_var) + assert before_locale == after_locale def test_can_set_locale_invalid_get(monkeypatch): @@ -72,10 +108,7 @@ def test_get_locales_prefix(): ], ) def test_set_locale(lang, enc): - if all(x is None for x in _current_locale): - # Not sure why, but on some Travis runs with pytest, - # getlocale() returned (None, None). - pytest.skip("Current locale is not set.") + before_locale = _get_current_locale() enc = codecs.lookup(enc).name new_locale = lang, enc @@ -95,8 +128,8 @@ def test_set_locale(lang, enc): assert normalized_locale == new_locale # Once we exit the "with" statement, locale should be back to what it was. - current_locale = locale.getlocale() - assert current_locale == _current_locale + after_locale = _get_current_locale() + assert before_locale == after_locale def test_encoding_detected(): diff --git a/pandas/tests/dtypes/test_common.py b/pandas/tests/dtypes/test_common.py index c5d0567b6dfc0..92b99ba6d1fe2 100644 --- a/pandas/tests/dtypes/test_common.py +++ b/pandas/tests/dtypes/test_common.py @@ -474,6 +474,9 @@ def test_is_datetime64_ns_dtype(): pd.DatetimeIndex([1, 2, 3], dtype=np.dtype("datetime64[ns]")) ) + # non-nano dt64tz + assert not com.is_datetime64_ns_dtype(DatetimeTZDtype("us", "US/Eastern")) + def test_is_timedelta64_ns_dtype(): assert not com.is_timedelta64_ns_dtype(np.dtype("m8[ps]")) diff --git a/pandas/tests/dtypes/test_dtypes.py b/pandas/tests/dtypes/test_dtypes.py index aef61045179ef..64849c4223486 100644 --- a/pandas/tests/dtypes/test_dtypes.py +++ b/pandas/tests/dtypes/test_dtypes.py @@ -593,13 +593,13 @@ def test_construction_string_regex(self, subtype): @pytest.mark.parametrize( "subtype", ["interval[int64]", "Interval[int64]", "int64", np.dtype("int64")] ) - def test_construction_allows_closed_none(self, subtype): + def test_construction_allows_inclusive_none(self, subtype): # GH#38394 dtype = IntervalDtype(subtype) assert dtype.inclusive is None - def test_closed_mismatch(self): + def test_inclusive_mismatch(self): msg = "'inclusive' keyword does not match value specified in dtype string" with pytest.raises(ValueError, match=msg): IntervalDtype("interval[int64, left]", "right") @@ -638,7 +638,7 @@ def test_construction_errors(self, subtype): with pytest.raises(TypeError, match=msg): IntervalDtype(subtype) - def test_closed_must_match(self): + def test_inclusive_must_match(self): # GH#37933 dtype = IntervalDtype(np.float64, "left") @@ -646,7 +646,7 @@ def test_closed_must_match(self): with pytest.raises(ValueError, match=msg): IntervalDtype(dtype, inclusive="both") - def test_closed_invalid(self): + def test_inclusive_invalid(self): with pytest.raises(ValueError, match="inclusive must be one of"): IntervalDtype(np.float64, "foo") @@ -822,11 +822,11 @@ def test_not_string(self): # GH30568: though IntervalDtype has object kind, it cannot be string assert not is_string_dtype(IntervalDtype()) - def test_unpickling_without_closed(self): + def test_unpickling_without_inclusive(self): # GH#38394 dtype = IntervalDtype("interval") - assert dtype._closed is None + assert dtype._inclusive is None tm.round_trip_pickle(dtype) @@ -1140,3 +1140,15 @@ def test_compare_complex_dtypes(): with pytest.raises(TypeError, match=msg): df.lt(df.astype(object)) + + +def test_multi_column_dtype_assignment(): + # GH #27583 + df = pd.DataFrame({"a": [0.0], "b": 0.0}) + expected = pd.DataFrame({"a": [0], "b": 0}) + + df[["a", "b"]] = 0 + 
tm.assert_frame_equal(df, expected) + + df["b"] = 0 + tm.assert_frame_equal(df, expected) diff --git a/pandas/tests/dtypes/test_inference.py b/pandas/tests/dtypes/test_inference.py index b12476deccbfc..8fe6abd3b0ed5 100644 --- a/pandas/tests/dtypes/test_inference.py +++ b/pandas/tests/dtypes/test_inference.py @@ -700,25 +700,32 @@ def test_convert_int_overflow(self, value): result = lib.maybe_convert_objects(arr) tm.assert_numpy_array_equal(arr, result) - def test_maybe_convert_objects_uint64(self): - # see gh-4471 - arr = np.array([2**63], dtype=object) - exp = np.array([2**63], dtype=np.uint64) - tm.assert_numpy_array_equal(lib.maybe_convert_objects(arr), exp) - - # NumPy bug: can't compare uint64 to int64, as that - # results in both casting to float64, so we should - # make sure that this function is robust against it - arr = np.array([np.uint64(2**63)], dtype=object) - exp = np.array([2**63], dtype=np.uint64) - tm.assert_numpy_array_equal(lib.maybe_convert_objects(arr), exp) - - arr = np.array([2, -1], dtype=object) - exp = np.array([2, -1], dtype=np.int64) - tm.assert_numpy_array_equal(lib.maybe_convert_objects(arr), exp) - - arr = np.array([2**63, -1], dtype=object) - exp = np.array([2**63, -1], dtype=object) + @pytest.mark.parametrize( + "value, expected_dtype", + [ + # see gh-4471 + ([2**63], np.uint64), + # NumPy bug: can't compare uint64 to int64, as that + # results in both casting to float64, so we should + # make sure that this function is robust against it + ([np.uint64(2**63)], np.uint64), + ([2, -1], np.int64), + ([2**63, -1], object), + # GH#47294 + ([np.uint8(1)], np.uint8), + ([np.uint16(1)], np.uint16), + ([np.uint32(1)], np.uint32), + ([np.uint64(1)], np.uint64), + ([np.uint8(2), np.uint16(1)], np.uint16), + ([np.uint32(2), np.uint16(1)], np.uint32), + ([np.uint32(2), -1], object), + ([np.uint32(2), 1], np.uint64), + ([np.uint32(2), np.int32(1)], object), + ], + ) + def test_maybe_convert_objects_uint(self, value, expected_dtype): + arr = np.array(value, dtype=object) + exp = np.array(value, dtype=expected_dtype) tm.assert_numpy_array_equal(lib.maybe_convert_objects(arr), exp) def test_maybe_convert_objects_datetime(self): diff --git a/pandas/tests/exchange/test_spec_conformance.py b/pandas/tests/exchange/test_spec_conformance.py index f5b8bb569f35e..392402871a5fd 100644 --- a/pandas/tests/exchange/test_spec_conformance.py +++ b/pandas/tests/exchange/test_spec_conformance.py @@ -24,7 +24,9 @@ def test_only_one_dtype(test_data, df_from_dict): column_size = len(test_data[columns[0]]) for column in columns: - assert dfX.get_column_by_name(column).null_count == 0 + null_count = dfX.get_column_by_name(column).null_count + assert null_count == 0 + assert isinstance(null_count, int) assert dfX.get_column_by_name(column).size == column_size assert dfX.get_column_by_name(column).offset == 0 @@ -49,6 +51,7 @@ def test_mixed_dtypes(df_from_dict): for column, kind in columns.items(): colX = dfX.get_column_by_name(column) assert colX.null_count == 0 + assert isinstance(colX.null_count, int) assert colX.size == 3 assert colX.offset == 0 @@ -62,6 +65,7 @@ def test_na_float(df_from_dict): dfX = df.__dataframe__() colX = dfX.get_column_by_name("a") assert colX.null_count == 1 + assert isinstance(colX.null_count, int) def test_noncategorical(df_from_dict): diff --git a/pandas/tests/extension/arrow/arrays.py b/pandas/tests/extension/arrow/arrays.py index 22595c4e461d7..26b94ebe5a8da 100644 --- a/pandas/tests/extension/arrow/arrays.py +++ b/pandas/tests/extension/arrow/arrays.py @@ -23,7 
+23,6 @@ take, ) from pandas.api.types import is_scalar -from pandas.core.arraylike import OpsMixin from pandas.core.arrays.arrow import ArrowExtensionArray as _ArrowExtensionArray from pandas.core.construction import extract_array @@ -72,7 +71,7 @@ def construct_array_type(cls) -> type_t[ArrowStringArray]: return ArrowStringArray -class ArrowExtensionArray(OpsMixin, _ArrowExtensionArray): +class ArrowExtensionArray(_ArrowExtensionArray): _data: pa.ChunkedArray @classmethod diff --git a/pandas/tests/extension/base/methods.py b/pandas/tests/extension/base/methods.py index b829b017d5fb1..838c9f5b8a35f 100644 --- a/pandas/tests/extension/base/methods.py +++ b/pandas/tests/extension/base/methods.py @@ -5,6 +5,7 @@ import pytest from pandas.core.dtypes.common import is_bool_dtype +from pandas.core.dtypes.missing import na_value_for_dtype import pandas as pd import pandas._testing as tm @@ -49,8 +50,7 @@ def test_value_counts_with_normalize(self, data): else: expected = pd.Series(0.0, index=result.index) expected[result > 0] = 1 / len(values) - - if isinstance(data.dtype, pd.core.dtypes.dtypes.BaseMaskedDtype): + if na_value_for_dtype(data.dtype) is pd.NA: # TODO(GH#44692): avoid special-casing expected = expected.astype("Float64") @@ -213,7 +213,12 @@ def test_unique(self, data, box, method): @pytest.mark.parametrize("na_sentinel", [-1, -2]) def test_factorize(self, data_for_grouping, na_sentinel): - codes, uniques = pd.factorize(data_for_grouping, na_sentinel=na_sentinel) + if na_sentinel == -1: + msg = "Specifying `na_sentinel=-1` is deprecated" + else: + msg = "Specifying the specific value to use for `na_sentinel` is deprecated" + with tm.assert_produces_warning(FutureWarning, match=msg): + codes, uniques = pd.factorize(data_for_grouping, na_sentinel=na_sentinel) expected_codes = np.array( [0, 0, na_sentinel, na_sentinel, 1, 1, 0, 2], dtype=np.intp ) @@ -224,8 +229,15 @@ def test_factorize(self, data_for_grouping, na_sentinel): @pytest.mark.parametrize("na_sentinel", [-1, -2]) def test_factorize_equivalence(self, data_for_grouping, na_sentinel): - codes_1, uniques_1 = pd.factorize(data_for_grouping, na_sentinel=na_sentinel) - codes_2, uniques_2 = data_for_grouping.factorize(na_sentinel=na_sentinel) + if na_sentinel == -1: + msg = "Specifying `na_sentinel=-1` is deprecated" + else: + msg = "Specifying the specific value to use for `na_sentinel` is deprecated" + with tm.assert_produces_warning(FutureWarning, match=msg): + codes_1, uniques_1 = pd.factorize( + data_for_grouping, na_sentinel=na_sentinel + ) + codes_2, uniques_2 = data_for_grouping.factorize(na_sentinel=na_sentinel) tm.assert_numpy_array_equal(codes_1, codes_2) self.assert_extension_array_equal(uniques_1, uniques_2) diff --git a/pandas/tests/extension/base/ops.py b/pandas/tests/extension/base/ops.py index a1d232b737da7..569782e55fd72 100644 --- a/pandas/tests/extension/base/ops.py +++ b/pandas/tests/extension/base/ops.py @@ -67,10 +67,10 @@ class BaseArithmeticOpsTests(BaseOpsUtil): * divmod_exc = TypeError """ - series_scalar_exc: type[TypeError] | None = TypeError - frame_scalar_exc: type[TypeError] | None = TypeError - series_array_exc: type[TypeError] | None = TypeError - divmod_exc: type[TypeError] | None = TypeError + series_scalar_exc: type[Exception] | None = TypeError + frame_scalar_exc: type[Exception] | None = TypeError + series_array_exc: type[Exception] | None = TypeError + divmod_exc: type[Exception] | None = TypeError def test_arith_series_with_scalar(self, data, all_arithmetic_operators): # series & scalar diff 
--git a/pandas/tests/extension/base/setitem.py b/pandas/tests/extension/base/setitem.py index 9e016e0101ef6..04fa3c11a6c40 100644 --- a/pandas/tests/extension/base/setitem.py +++ b/pandas/tests/extension/base/setitem.py @@ -357,6 +357,20 @@ def test_setitem_with_expansion_dataframe_column(self, data, full_indexer): self.assert_frame_equal(result, expected) + def test_setitem_with_expansion_row(self, data, na_value): + df = pd.DataFrame({"data": data[:1]}) + + df.loc[1, "data"] = data[1] + expected = pd.DataFrame({"data": data[:2]}) + self.assert_frame_equal(df, expected) + + # https://github.com/pandas-dev/pandas/issues/47284 + df.loc[2, "data"] = na_value + expected = pd.DataFrame( + {"data": pd.Series([data[0], data[1], na_value], dtype=data.dtype)} + ) + self.assert_frame_equal(df, expected) + def test_setitem_series(self, data, full_indexer): # https://github.com/pandas-dev/pandas/issues/32395 ser = pd.Series(data, name="data") diff --git a/pandas/tests/extension/date/array.py b/pandas/tests/extension/date/array.py index b14b9921be3d3..eca935cdc9128 100644 --- a/pandas/tests/extension/date/array.py +++ b/pandas/tests/extension/date/array.py @@ -109,10 +109,9 @@ def __init__( self._month = np.zeros(ldates, dtype=np.uint8) # 255 (1, 31) self._day = np.zeros(ldates, dtype=np.uint8) # 255 (1, 12) - # "object_" object is not iterable [misc] - for (i,), (y, m, d) in np.ndenumerate( # type: ignore[misc] - np.char.split(dates, sep="-") - ): + # error: "object_" object is not iterable + obj = np.char.split(dates, sep="-") + for (i,), (y, m, d) in np.ndenumerate(obj): # type: ignore[misc] self._year[i] = int(y) self._month[i] = int(m) self._day[i] = int(d) diff --git a/pandas/tests/extension/test_arrow.py b/pandas/tests/extension/test_arrow.py index 03616267c3f86..a2a96da02b2a6 100644 --- a/pandas/tests/extension/test_arrow.py +++ b/pandas/tests/extension/test_arrow.py @@ -18,11 +18,14 @@ timedelta, ) +import numpy as np import pytest from pandas.compat import ( pa_version_under2p0, pa_version_under3p0, + pa_version_under6p0, + pa_version_under8p0, ) import pandas as pd @@ -34,7 +37,7 @@ from pandas.core.arrays.arrow.dtype import ArrowDtype # isort:skip -@pytest.fixture(params=tm.ALL_PYARROW_DTYPES) +@pytest.fixture(params=tm.ALL_PYARROW_DTYPES, ids=str) def dtype(request): return ArrowDtype(pyarrow_dtype=request.param) @@ -93,6 +96,101 @@ def data_missing(data): return type(data)._from_sequence([None, data[0]]) +@pytest.fixture(params=["data", "data_missing"]) +def all_data(request, data, data_missing): + """Parametrized fixture returning 'data' or 'data_missing' integer arrays. + + Used to test dtype conversion with and without missing values. + """ + if request.param == "data": + return data + elif request.param == "data_missing": + return data_missing + + +@pytest.fixture +def data_for_grouping(dtype): + """ + Data for factorization, grouping, and unique tests. 
+ + Expected to be like [B, B, NA, NA, A, A, B, C] + + Where A < B < C and NA is missing + """ + pa_dtype = dtype.pyarrow_dtype + if pa.types.is_boolean(pa_dtype): + A = False + B = True + C = True + elif pa.types.is_floating(pa_dtype): + A = -1.1 + B = 0.0 + C = 1.1 + elif pa.types.is_signed_integer(pa_dtype): + A = -1 + B = 0 + C = 1 + elif pa.types.is_unsigned_integer(pa_dtype): + A = 0 + B = 1 + C = 10 + elif pa.types.is_date(pa_dtype): + A = date(1999, 12, 31) + B = date(2010, 1, 1) + C = date(2022, 1, 1) + elif pa.types.is_timestamp(pa_dtype): + A = datetime(1999, 1, 1, 1, 1, 1, 1) + B = datetime(2020, 1, 1) + C = datetime(2020, 1, 1, 1) + elif pa.types.is_duration(pa_dtype): + A = timedelta(-1) + B = timedelta(0) + C = timedelta(1, 4) + elif pa.types.is_time(pa_dtype): + A = time(0, 0) + B = time(0, 12) + C = time(12, 12) + else: + raise NotImplementedError + return pd.array([B, B, None, None, A, A, B, C], dtype=dtype) + + +@pytest.fixture +def data_for_sorting(data_for_grouping): + """ + Length-3 array with a known sort order. + + This should be three items [B, C, A] with + A < B < C + """ + return type(data_for_grouping)._from_sequence( + [data_for_grouping[0], data_for_grouping[7], data_for_grouping[4]] + ) + + +@pytest.fixture +def data_missing_for_sorting(data_for_grouping): + """ + Length-3 array with a known sort order. + + This should be three items [B, NA, A] with + A < B and NA missing. + """ + return type(data_for_grouping)._from_sequence( + [data_for_grouping[0], data_for_grouping[2], data_for_grouping[4]] + ) + + +@pytest.fixture +def data_for_twos(data): + """Length-100 array in which all the elements are two.""" + pa_dtype = data.dtype.pyarrow_dtype + if pa.types.is_integer(pa_dtype) or pa.types.is_floating(pa_dtype): + return pd.array([2] * 100, dtype=data.dtype) + # tests will be xfailed where 2 is not a valid scalar for pa_dtype + return data + + @pytest.fixture def na_value(): """The scalar missing value for this type. Default 'None'""" @@ -104,14 +202,23 @@ class TestBaseCasting(base.BaseCastingTests): class TestConstructors(base.BaseConstructorsTests): - @pytest.mark.xfail( - reason=( - "str(dtype) constructs " - "e.g. 
in64[pyarrow] like int64 (numpy) " - "due to StorageExtensionDtype.__str__" - ) - ) - def test_from_dtype(self, data): + def test_from_dtype(self, data, request): + pa_dtype = data.dtype.pyarrow_dtype + if pa.types.is_timestamp(pa_dtype) and pa_dtype.tz: + if pa_version_under2p0: + request.node.add_marker( + pytest.mark.xfail( + reason=f"timestamp data with tz={pa_dtype.tz} " + "converted to integer when pyarrow < 2.0", + ) + ) + else: + request.node.add_marker( + pytest.mark.xfail( + raises=NotImplementedError, + reason=f"pyarrow.type_for_alias cannot infer {pa_dtype}", + ) + ) super().test_from_dtype(data) @@ -197,10 +304,1645 @@ def test_loc_iloc_frame_single_dtype(self, request, using_array_manager, data): super().test_loc_iloc_frame_single_dtype(data) +class TestBaseNumericReduce(base.BaseNumericReduceTests): + def check_reduce(self, ser, op_name, skipna): + pa_dtype = ser.dtype.pyarrow_dtype + result = getattr(ser, op_name)(skipna=skipna) + if pa.types.is_boolean(pa_dtype): + # Can't convert if ser contains NA + pytest.skip( + "pandas boolean data with NA does not fully support all reductions" + ) + elif pa.types.is_integer(pa_dtype) or pa.types.is_floating(pa_dtype): + ser = ser.astype("Float64") + expected = getattr(ser, op_name)(skipna=skipna) + tm.assert_almost_equal(result, expected) + + @pytest.mark.parametrize("skipna", [True, False]) + def test_reduce_series(self, data, all_numeric_reductions, skipna, request): + pa_dtype = data.dtype.pyarrow_dtype + xfail_mark = pytest.mark.xfail( + raises=TypeError, + reason=( + f"{all_numeric_reductions} is not implemented in " + f"pyarrow={pa.__version__} for {pa_dtype}" + ), + ) + if all_numeric_reductions in {"skew", "kurt"}: + request.node.add_marker(xfail_mark) + elif ( + all_numeric_reductions in {"median", "var", "std", "prod", "max", "min"} + and pa_version_under6p0 + ): + request.node.add_marker(xfail_mark) + elif all_numeric_reductions in {"sum", "mean"} and pa_version_under2p0: + request.node.add_marker(xfail_mark) + elif ( + all_numeric_reductions in {"sum", "mean"} + and skipna is False + and pa_version_under6p0 + and (pa.types.is_integer(pa_dtype) or pa.types.is_floating(pa_dtype)) + ): + request.node.add_marker( + pytest.mark.xfail( + raises=AssertionError, + reason=( + f"{all_numeric_reductions} with skip_nulls={skipna} did not " + f"return NA for {pa_dtype} with pyarrow={pa.__version__}" + ), + ) + ) + elif not ( + pa.types.is_integer(pa_dtype) + or pa.types.is_floating(pa_dtype) + or pa.types.is_boolean(pa_dtype) + ) and not ( + all_numeric_reductions in {"min", "max"} + and (pa.types.is_temporal(pa_dtype) and not pa.types.is_duration(pa_dtype)) + ): + request.node.add_marker(xfail_mark) + elif pa.types.is_boolean(pa_dtype) and all_numeric_reductions in { + "std", + "var", + "median", + }: + request.node.add_marker(xfail_mark) + super().test_reduce_series(data, all_numeric_reductions, skipna) + + +class TestBaseBooleanReduce(base.BaseBooleanReduceTests): + @pytest.mark.parametrize("skipna", [True, False]) + def test_reduce_series( + self, data, all_boolean_reductions, skipna, na_value, request + ): + pa_dtype = data.dtype.pyarrow_dtype + xfail_mark = pytest.mark.xfail( + raises=TypeError, + reason=( + f"{all_boolean_reductions} is not implemented in " + f"pyarrow={pa.__version__} for {pa_dtype}" + ), + ) + if not pa.types.is_boolean(pa_dtype): + request.node.add_marker(xfail_mark) + elif pa_version_under3p0: + request.node.add_marker(xfail_mark) + op_name = all_boolean_reductions + s = pd.Series(data) + result = 
getattr(s, op_name)(skipna=skipna) + assert result is (op_name == "any") + + +class TestBaseGroupby(base.BaseGroupbyTests): + def test_groupby_agg_extension(self, data_for_grouping, request): + tz = getattr(data_for_grouping.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}." + ) + ) + super().test_groupby_agg_extension(data_for_grouping) + + def test_groupby_extension_no_sort(self, data_for_grouping, request): + pa_dtype = data_for_grouping.dtype.pyarrow_dtype + if pa.types.is_boolean(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason=f"{pa_dtype} only has 2 unique possible values", + ) + ) + elif pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=f"pyarrow doesn't support factorizing {pa_dtype}", + ) + ) + elif pa.types.is_date(pa_dtype) or ( + pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is None + ): + request.node.add_marker( + pytest.mark.xfail( + raises=AttributeError, + reason="GH 34986", + ) + ) + super().test_groupby_extension_no_sort(data_for_grouping) + + def test_groupby_extension_transform(self, data_for_grouping, request): + pa_dtype = data_for_grouping.dtype.pyarrow_dtype + if pa.types.is_boolean(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason=f"{pa_dtype} only has 2 unique possible values", + ) + ) + elif pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=f"pyarrow doesn't support factorizing {pa_dtype}", + ) + ) + super().test_groupby_extension_transform(data_for_grouping) + + def test_groupby_extension_apply( + self, data_for_grouping, groupby_apply_op, request + ): + pa_dtype = data_for_grouping.dtype.pyarrow_dtype + # Is there a better way to get the "series" ID for groupby_apply_op? 
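Editorial aside, not part of the patch: the question in the comment above, how to recover the parametrize id without substring-matching `request.node.nodeid`, can in principle be answered through the `callspec` object that pytest attaches to parametrized items. This is a hedged sketch only; `callspec` and its `id`/`params` attributes are pytest internals, and the helper name is hypothetical:

```python
# Hypothetical helper (not part of pandas): read the parametrize callspec that
# pytest stores on parametrized test items instead of scanning the node id.
def _callspec_info(request):
    callspec = getattr(request.node, "callspec", None)
    if callspec is None:  # the test is not parametrized at all
        return "", {}
    # callspec.id is the joined "[...]" id fragment of the node id;
    # callspec.params maps fixture/parameter names to the chosen values.
    return callspec.id, callspec.params
```

Under that assumption, the `is_series`/`is_object` flags just below could check `"series" in _callspec_info(request)[0]` rather than searching the full node id.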
+ is_series = "series" in request.node.nodeid + is_object = "object" in request.node.nodeid + if pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=f"pyarrow doesn't support factorizing {pa_dtype}", + ) + ) + elif pa.types.is_date(pa_dtype) or ( + pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is None + ): + if is_object: + request.node.add_marker( + pytest.mark.xfail( + raises=TypeError, + reason="GH 47514: _concat_datetime expects axis arg.", + ) + ) + elif not is_series: + request.node.add_marker( + pytest.mark.xfail( + raises=AttributeError, + reason="GH 34986", + ) + ) + super().test_groupby_extension_apply(data_for_grouping, groupby_apply_op) + + def test_in_numeric_groupby(self, data_for_grouping, request): + pa_dtype = data_for_grouping.dtype.pyarrow_dtype + if pa.types.is_integer(pa_dtype) or pa.types.is_floating(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="ArrowExtensionArray doesn't support .sum() yet.", + ) + ) + super().test_in_numeric_groupby(data_for_grouping) + + @pytest.mark.parametrize("as_index", [True, False]) + def test_groupby_extension_agg(self, as_index, data_for_grouping, request): + pa_dtype = data_for_grouping.dtype.pyarrow_dtype + if pa.types.is_boolean(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=ValueError, + reason=f"{pa_dtype} only has 2 unique possible values", + ) + ) + elif pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=f"pyarrow doesn't support factorizing {pa_dtype}", + ) + ) + elif as_index is True and ( + pa.types.is_date(pa_dtype) + or (pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is None) + ): + request.node.add_marker( + pytest.mark.xfail( + raises=AttributeError, + reason="GH 34986", + ) + ) + super().test_groupby_extension_agg(as_index, data_for_grouping) + + +class TestBaseDtype(base.BaseDtypeTests): + def test_construct_from_string_own_name(self, dtype, request): + pa_dtype = dtype.pyarrow_dtype + if pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is not None: + request.node.add_marker( + pytest.mark.xfail( + raises=NotImplementedError, + reason=f"pyarrow.type_for_alias cannot infer {pa_dtype}", + ) + ) + super().test_construct_from_string_own_name(dtype) + + def test_is_dtype_from_name(self, dtype, request): + pa_dtype = dtype.pyarrow_dtype + if pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is not None: + request.node.add_marker( + pytest.mark.xfail( + raises=NotImplementedError, + reason=f"pyarrow.type_for_alias cannot infer {pa_dtype}", + ) + ) + super().test_is_dtype_from_name(dtype) + + def test_construct_from_string(self, dtype, request): + pa_dtype = dtype.pyarrow_dtype + if pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is not None: + request.node.add_marker( + pytest.mark.xfail( + raises=NotImplementedError, + reason=f"pyarrow.type_for_alias cannot infer {pa_dtype}", + ) + ) + super().test_construct_from_string(dtype) + + def test_construct_from_string_another_type_raises(self, dtype): + msg = r"'another_type' must end with '\[pyarrow\]'" + with pytest.raises(TypeError, match=msg): + type(dtype).construct_from_string("another_type") + + def test_get_common_dtype(self, dtype, request): + pa_dtype = dtype.pyarrow_dtype + if ( + pa.types.is_date(pa_dtype) + or pa.types.is_time(pa_dtype) + or ( + pa.types.is_timestamp(pa_dtype) + and (pa_dtype.unit != "ns" or pa_dtype.tz is not None) + ) + or (pa.types.is_duration(pa_dtype) and 
pa_dtype.unit != "ns") + ): + request.node.add_marker( + pytest.mark.xfail( + reason=( + f"{pa_dtype} does not have associated numpy " + f"dtype findable by find_common_type" + ) + ) + ) + super().test_get_common_dtype(dtype) + + class TestBaseIndex(base.BaseIndexTests): pass -def test_arrowdtype_construct_from_string_type_with_parameters(): +class TestBaseInterface(base.BaseInterfaceTests): + def test_contains(self, data, data_missing, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + unit = getattr(data.dtype.pyarrow_dtype, "unit", None) + if pa_version_under2p0 and tz not in (None, "UTC") and unit == "us": + request.node.add_marker( + pytest.mark.xfail( + reason=( + f"Not supported by pyarrow < 2.0 " + f"with timestamp type {tz} and {unit}" + ) + ) + ) + super().test_contains(data, data_missing) + + @pytest.mark.xfail(reason="pyarrow.ChunkedArray does not support views.") + def test_view(self, data): + super().test_view(data) + + +class TestBaseMissing(base.BaseMissingTests): + def test_fillna_limit_pad(self, data_missing, using_array_manager, request): + if using_array_manager and pa.types.is_duration( + data_missing.dtype.pyarrow_dtype + ): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_fillna_limit_pad(data_missing) + + def test_fillna_limit_backfill(self, data_missing, using_array_manager, request): + if using_array_manager and pa.types.is_duration( + data_missing.dtype.pyarrow_dtype + ): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_fillna_limit_backfill(data_missing) + + def test_fillna_series(self, data_missing, using_array_manager, request): + if using_array_manager and pa.types.is_duration( + data_missing.dtype.pyarrow_dtype + ): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_fillna_series(data_missing) + + def test_fillna_series_method( + self, data_missing, fillna_method, using_array_manager, request + ): + if using_array_manager and pa.types.is_duration( + data_missing.dtype.pyarrow_dtype + ): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_fillna_series_method(data_missing, fillna_method) + + def test_fillna_frame(self, data_missing, using_array_manager, request): + if using_array_manager and pa.types.is_duration( + data_missing.dtype.pyarrow_dtype + ): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_fillna_frame(data_missing) + + +class TestBasePrinting(base.BasePrintingTests): + def test_series_repr(self, data, request): + pa_dtype = data.dtype.pyarrow_dtype + if ( + pa.types.is_date(pa_dtype) + or pa.types.is_duration(pa_dtype) + or (pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is None) + ): + request.node.add_marker( + pytest.mark.xfail( + raises=TypeError, + reason="GH 47514: _concat_datetime expects axis arg.", + ) + ) + super().test_series_repr(data) + + def test_dataframe_repr(self, data, request): + pa_dtype = data.dtype.pyarrow_dtype + if ( + pa.types.is_date(pa_dtype) + or pa.types.is_duration(pa_dtype) + or (pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is None) + ): + request.node.add_marker( + pytest.mark.xfail( + raises=TypeError, + reason="GH 47514: _concat_datetime expects axis arg.", + 
) + ) + super().test_dataframe_repr(data) + + +class TestBaseReshaping(base.BaseReshapingTests): + @pytest.mark.parametrize("in_frame", [True, False]) + def test_concat(self, data, in_frame, request): + pa_dtype = data.dtype.pyarrow_dtype + if ( + pa.types.is_date(pa_dtype) + or pa.types.is_duration(pa_dtype) + or (pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is None) + ): + request.node.add_marker( + pytest.mark.xfail( + raises=TypeError, + reason="GH 47514: _concat_datetime expects axis arg.", + ) + ) + super().test_concat(data, in_frame) + + @pytest.mark.parametrize("in_frame", [True, False]) + def test_concat_all_na_block(self, data_missing, in_frame, request): + pa_dtype = data_missing.dtype.pyarrow_dtype + if ( + pa.types.is_date(pa_dtype) + or pa.types.is_duration(pa_dtype) + or (pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is None) + ): + request.node.add_marker( + pytest.mark.xfail( + raises=TypeError, + reason="GH 47514: _concat_datetime expects axis arg.", + ) + ) + super().test_concat_all_na_block(data_missing, in_frame) + + def test_concat_columns(self, data, na_value, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + super().test_concat_columns(data, na_value) + + def test_concat_extension_arrays_copy_false(self, data, na_value, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + super().test_concat_extension_arrays_copy_false(data, na_value) + + def test_concat_with_reindex(self, data, request, using_array_manager): + pa_dtype = data.dtype.pyarrow_dtype + if pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=TypeError, + reason="GH 47514: _concat_datetime expects axis arg.", + ) + ) + elif pa.types.is_date(pa_dtype) or ( + pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is None + ): + request.node.add_marker( + pytest.mark.xfail( + raises=AttributeError if not using_array_manager else TypeError, + reason="GH 34986", + ) + ) + super().test_concat_with_reindex(data) + + def test_align(self, data, na_value, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + super().test_align(data, na_value) + + def test_align_frame(self, data, na_value, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + super().test_align_frame(data, na_value) + + def test_align_series_frame(self, data, na_value, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + super().test_align_series_frame(data, na_value) + + def test_merge(self, data, na_value, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + 
reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + super().test_merge(data, na_value) + + def test_merge_on_extension_array(self, data, request): + pa_dtype = data.dtype.pyarrow_dtype + if pa.types.is_date(pa_dtype) or ( + pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is None + ): + request.node.add_marker( + pytest.mark.xfail( + raises=AttributeError, + reason="GH 34986", + ) + ) + super().test_merge_on_extension_array(data) + + def test_merge_on_extension_array_duplicates(self, data, request): + pa_dtype = data.dtype.pyarrow_dtype + if pa.types.is_date(pa_dtype) or ( + pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is None + ): + request.node.add_marker( + pytest.mark.xfail( + raises=AttributeError, + reason="GH 34986", + ) + ) + super().test_merge_on_extension_array_duplicates(data) + + def test_ravel(self, data, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + super().test_ravel(data) + + @pytest.mark.xfail(reason="GH 45419: pyarrow.ChunkedArray does not support views") + def test_transpose(self, data): + super().test_transpose(data) + + def test_transpose_frame(self, data, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + super().test_transpose_frame(data) + + +class TestBaseSetitem(base.BaseSetitemTests): + def test_setitem_scalar_series(self, data, box_in_series, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + super().test_setitem_scalar_series(data, box_in_series) + + def test_setitem_sequence(self, data, box_in_series, using_array_manager, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif ( + using_array_manager + and pa.types.is_duration(data.dtype.pyarrow_dtype) + and box_in_series + ): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_sequence(data, box_in_series) + + def test_setitem_sequence_mismatched_length_raises( + self, data, as_array, using_array_manager, request + ): + if using_array_manager and pa.types.is_duration(data.dtype.pyarrow_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_sequence_mismatched_length_raises(data, as_array) + + def test_setitem_empty_indexer( + self, data, box_in_series, using_array_manager, request + ): + if ( + using_array_manager + and pa.types.is_duration(data.dtype.pyarrow_dtype) + and box_in_series + ): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_empty_indexer(data, box_in_series) + + def test_setitem_sequence_broadcasts( + self, data, box_in_series, using_array_manager, request + ): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if 
pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif ( + using_array_manager + and pa.types.is_duration(data.dtype.pyarrow_dtype) + and box_in_series + ): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_sequence_broadcasts(data, box_in_series) + + @pytest.mark.parametrize("setter", ["loc", "iloc"]) + def test_setitem_scalar(self, data, setter, using_array_manager, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif using_array_manager and pa.types.is_duration(data.dtype.pyarrow_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_scalar(data, setter) + + def test_setitem_loc_scalar_mixed(self, data, using_array_manager, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif using_array_manager and pa.types.is_duration(data.dtype.pyarrow_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_loc_scalar_mixed(data) + + def test_setitem_loc_scalar_single(self, data, using_array_manager, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + elif using_array_manager and pa.types.is_duration(data.dtype.pyarrow_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_loc_scalar_single(data) + + def test_setitem_loc_scalar_multiple_homogoneous( + self, data, using_array_manager, request + ): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif using_array_manager and pa.types.is_duration(data.dtype.pyarrow_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_loc_scalar_multiple_homogoneous(data) + + def test_setitem_iloc_scalar_mixed(self, data, using_array_manager, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif using_array_manager and pa.types.is_duration(data.dtype.pyarrow_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_iloc_scalar_mixed(data) + + def test_setitem_iloc_scalar_single(self, data, using_array_manager, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in 
(None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif using_array_manager and pa.types.is_duration(data.dtype.pyarrow_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_iloc_scalar_single(data) + + def test_setitem_iloc_scalar_multiple_homogoneous( + self, data, using_array_manager, request + ): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif using_array_manager and pa.types.is_duration(data.dtype.pyarrow_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_iloc_scalar_multiple_homogoneous(data) + + @pytest.mark.parametrize( + "mask", + [ + np.array([True, True, True, False, False]), + pd.array([True, True, True, False, False], dtype="boolean"), + pd.array([True, True, True, pd.NA, pd.NA], dtype="boolean"), + ], + ids=["numpy-array", "boolean-array", "boolean-array-na"], + ) + def test_setitem_mask( + self, data, mask, box_in_series, using_array_manager, request + ): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif ( + using_array_manager + and pa.types.is_duration(data.dtype.pyarrow_dtype) + and box_in_series + ): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_mask(data, mask, box_in_series) + + def test_setitem_mask_boolean_array_with_na( + self, data, box_in_series, using_array_manager, request + ): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + unit = getattr(data.dtype.pyarrow_dtype, "unit", None) + if pa_version_under2p0 and tz not in (None, "UTC") and unit == "us": + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif ( + using_array_manager + and pa.types.is_duration(data.dtype.pyarrow_dtype) + and box_in_series + ): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_mask_boolean_array_with_na(data, box_in_series) + + @pytest.mark.parametrize( + "idx", + [[0, 1, 2], pd.array([0, 1, 2], dtype="Int64"), np.array([0, 1, 2])], + ids=["list", "integer-array", "numpy-array"], + ) + def test_setitem_integer_array( + self, data, idx, box_in_series, using_array_manager, request + ): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif ( + using_array_manager + and pa.types.is_duration(data.dtype.pyarrow_dtype) + and box_in_series + ): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_integer_array(data, idx, box_in_series) + + @pytest.mark.parametrize("as_callable", [True, False]) + @pytest.mark.parametrize("setter", ["loc", None]) + def test_setitem_mask_aligned( 
+ self, data, as_callable, setter, using_array_manager, request + ): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif using_array_manager and pa.types.is_duration(data.dtype.pyarrow_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_mask_aligned(data, as_callable, setter) + + @pytest.mark.parametrize("setter", ["loc", None]) + def test_setitem_mask_broadcast(self, data, setter, using_array_manager, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif using_array_manager and pa.types.is_duration(data.dtype.pyarrow_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_mask_broadcast(data, setter) + + def test_setitem_tuple_index(self, data, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + super().test_setitem_tuple_index(data) + + def test_setitem_slice(self, data, box_in_series, using_array_manager, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif ( + using_array_manager + and pa.types.is_duration(data.dtype.pyarrow_dtype) + and box_in_series + ): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_slice(data, box_in_series) + + def test_setitem_loc_iloc_slice(self, data, using_array_manager, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + elif using_array_manager and pa.types.is_duration(data.dtype.pyarrow_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_loc_iloc_slice(data) + + def test_setitem_slice_array(self, data, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + super().test_setitem_slice_array(data) + + def test_setitem_with_expansion_dataframe_column( + self, data, full_indexer, using_array_manager, request + ): + # Is there a better way to get the full_indexer id "null_slice"? 
+ is_null_slice = "null_slice" in request.node.nodeid + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC") and not is_null_slice: + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + elif ( + using_array_manager + and pa.types.is_duration(data.dtype.pyarrow_dtype) + and not is_null_slice + ): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_with_expansion_dataframe_column(data, full_indexer) + + def test_setitem_with_expansion_row( + self, data, na_value, using_array_manager, request + ): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=(f"Not supported by pyarrow < 2.0 with timestamp type {tz}") + ) + ) + elif using_array_manager and pa.types.is_duration(data.dtype.pyarrow_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_with_expansion_row(data, na_value) + + def test_setitem_frame_2d_values(self, data, using_array_manager, request): + tz = getattr(data.dtype.pyarrow_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + elif using_array_manager and pa.types.is_duration(data.dtype.pyarrow_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason="Checking ndim when using arraymanager with duration type" + ) + ) + super().test_setitem_frame_2d_values(data) + + @pytest.mark.xfail(reason="GH 45419: pyarrow.ChunkedArray does not support views") + def test_setitem_preserves_views(self, data): + super().test_setitem_preserves_views(data) + + +class TestBaseParsing(base.BaseParsingTests): + @pytest.mark.parametrize("engine", ["c", "python"]) + def test_EA_types(self, engine, data, request): + pa_dtype = data.dtype.pyarrow_dtype + if pa.types.is_boolean(pa_dtype): + request.node.add_marker( + pytest.mark.xfail(raises=TypeError, reason="GH 47534") + ) + elif pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is not None: + request.node.add_marker( + pytest.mark.xfail( + raises=NotImplementedError, + reason=f"Parameterized types with tz={pa_dtype.tz} not supported.", + ) + ) + super().test_EA_types(engine, data) + + +class TestBaseUnaryOps(base.BaseUnaryOpsTests): + @pytest.mark.xfail( + pa_version_under2p0, + raises=NotImplementedError, + reason="pyarrow.compute.invert not supported in pyarrow<2.0", + ) + def test_invert(self, data, request): + pa_dtype = data.dtype.pyarrow_dtype + if not pa.types.is_boolean(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=f"pyarrow.compute.invert does support {pa_dtype}", + ) + ) + super().test_invert(data) + + +class TestBaseMethods(base.BaseMethodsTests): + @pytest.mark.parametrize("periods", [1, -2]) + def test_diff(self, data, periods, request): + pa_dtype = data.dtype.pyarrow_dtype + if pa.types.is_unsigned_integer(pa_dtype) and periods == 1: + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowInvalid, + reason=( + f"diff with {pa_dtype} and periods={periods} will overflow" + ), + ) + ) + super().test_diff(data, periods) + + @pytest.mark.parametrize("dropna", [True, False]) + def test_value_counts(self, 
all_data, dropna, request): + pa_dtype = all_data.dtype.pyarrow_dtype + if pa.types.is_date(pa_dtype) or ( + pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is None + ): + request.node.add_marker( + pytest.mark.xfail( + raises=AttributeError, + reason="GH 34986", + ) + ) + elif pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=f"value_count has no kernel for {pa_dtype}", + ) + ) + super().test_value_counts(all_data, dropna) + + def test_value_counts_with_normalize(self, data, request): + pa_dtype = data.dtype.pyarrow_dtype + if pa.types.is_date(pa_dtype) or ( + pa.types.is_timestamp(pa_dtype) and pa_dtype.tz is None + ): + request.node.add_marker( + pytest.mark.xfail( + raises=AttributeError, + reason="GH 34986", + ) + ) + elif pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=f"value_count has no pyarrow kernel for {pa_dtype}", + ) + ) + super().test_value_counts_with_normalize(data) + + def test_argmin_argmax( + self, data_for_sorting, data_missing_for_sorting, na_value, request + ): + pa_dtype = data_for_sorting.dtype.pyarrow_dtype + if pa.types.is_boolean(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason=f"{pa_dtype} only has 2 unique possible values", + ) + ) + super().test_argmin_argmax(data_for_sorting, data_missing_for_sorting, na_value) + + @pytest.mark.parametrize("ascending", [True, False]) + def test_sort_values(self, data_for_sorting, ascending, sort_by_key, request): + pa_dtype = data_for_sorting.dtype.pyarrow_dtype + if pa.types.is_duration(pa_dtype) and not ascending and not pa_version_under2p0: + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=( + f"unique has no pyarrow kernel " + f"for {pa_dtype} when ascending={ascending}" + ), + ) + ) + super().test_sort_values(data_for_sorting, ascending, sort_by_key) + + @pytest.mark.parametrize("ascending", [True, False]) + def test_sort_values_frame(self, data_for_sorting, ascending, request): + pa_dtype = data_for_sorting.dtype.pyarrow_dtype + if pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=( + f"dictionary_encode has no pyarrow kernel " + f"for {pa_dtype} when ascending={ascending}" + ), + ) + ) + super().test_sort_values_frame(data_for_sorting, ascending) + + @pytest.mark.parametrize("box", [pd.Series, lambda x: x]) + @pytest.mark.parametrize("method", [lambda x: x.unique(), pd.unique]) + def test_unique(self, data, box, method, request): + pa_dtype = data.dtype.pyarrow_dtype + if pa.types.is_duration(pa_dtype) and not pa_version_under2p0: + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=f"unique has no pyarrow kernel for {pa_dtype}.", + ) + ) + super().test_unique(data, box, method) + + @pytest.mark.parametrize("na_sentinel", [-1, -2]) + def test_factorize(self, data_for_grouping, na_sentinel, request): + pa_dtype = data_for_grouping.dtype.pyarrow_dtype + if pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=f"dictionary_encode has no pyarrow kernel for {pa_dtype}", + ) + ) + elif pa.types.is_boolean(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason=f"{pa_dtype} only has 2 unique possible values", + ) + ) + super().test_factorize(data_for_grouping, na_sentinel) + + 
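Editorial aside, not part of the patch: both the `FutureWarning` assertions added to the base `test_factorize` variants earlier in this diff and the pyarrow overrides above revolve around the sentinel code that `pd.factorize` assigns to missing values. A minimal sketch of the default behaviour (the sentinel is `-1` unless overridden via the now-deprecated `na_sentinel` argument):

```python
import numpy as np
import pandas as pd

# Missing values are excluded from `uniques` and encoded with the sentinel code.
codes, uniques = pd.factorize(np.array(["b", "b", None, "a"], dtype=object))
print(codes)    # [ 0  0 -1  1]
print(uniques)  # ['b' 'a']
```

The parametrized tests above exercise `-1` and `-2` as explicit sentinels and, after this change, expect the deprecation warning whenever either value is passed.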
@pytest.mark.parametrize("na_sentinel", [-1, -2]) + def test_factorize_equivalence(self, data_for_grouping, na_sentinel, request): + pa_dtype = data_for_grouping.dtype.pyarrow_dtype + if pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=f"dictionary_encode has no pyarrow kernel for {pa_dtype}", + ) + ) + super().test_factorize_equivalence(data_for_grouping, na_sentinel) + + def test_factorize_empty(self, data, request): + pa_dtype = data.dtype.pyarrow_dtype + if pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=f"dictionary_encode has no pyarrow kernel for {pa_dtype}", + ) + ) + super().test_factorize_empty(data) + + def test_fillna_copy_frame(self, data_missing, request, using_array_manager): + pa_dtype = data_missing.dtype.pyarrow_dtype + if using_array_manager and pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Checking ndim when using arraymanager with {pa_dtype}" + ) + ) + super().test_fillna_copy_frame(data_missing) + + def test_fillna_copy_series(self, data_missing, request, using_array_manager): + pa_dtype = data_missing.dtype.pyarrow_dtype + if using_array_manager and pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Checking ndim when using arraymanager with {pa_dtype}" + ) + ) + super().test_fillna_copy_series(data_missing) + + def test_shift_fill_value(self, data, request): + pa_dtype = data.dtype.pyarrow_dtype + tz = getattr(pa_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + super().test_shift_fill_value(data) + + @pytest.mark.parametrize("repeats", [0, 1, 2, [1, 2, 3]]) + def test_repeat(self, data, repeats, as_series, use_numpy, request): + pa_dtype = data.dtype.pyarrow_dtype + tz = getattr(pa_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC") and repeats != 0: + request.node.add_marker( + pytest.mark.xfail( + reason=( + f"Not supported by pyarrow < 2.0 with " + f"timestamp type {tz} when repeats={repeats}" + ) + ) + ) + super().test_repeat(data, repeats, as_series, use_numpy) + + def test_insert(self, data, request): + pa_dtype = data.dtype.pyarrow_dtype + tz = getattr(pa_dtype, "tz", None) + if pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + super().test_insert(data) + + def test_combine_first(self, data, request, using_array_manager): + pa_dtype = data.dtype.pyarrow_dtype + tz = getattr(pa_dtype, "tz", None) + if using_array_manager and pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Checking ndim when using arraymanager with {pa_dtype}" + ) + ) + elif pa_version_under2p0 and tz not in (None, "UTC"): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Not supported by pyarrow < 2.0 with timestamp type {tz}" + ) + ) + super().test_combine_first(data) + + @pytest.mark.parametrize("frame", [True, False]) + @pytest.mark.parametrize( + "periods, indices", + [(-2, [2, 3, 4, -1, -1]), (0, [0, 1, 2, 3, 4]), (2, [-1, -1, 0, 1, 2])], + ) + def test_container_shift( + self, data, frame, periods, indices, request, using_array_manager + ): + pa_dtype = data.dtype.pyarrow_dtype + if ( + using_array_manager + 
and pa.types.is_duration(pa_dtype) + and periods in (-2, 2) + ): + request.node.add_marker( + pytest.mark.xfail( + reason=( + f"Checking ndim when using arraymanager with " + f"{pa_dtype} and periods={periods}" + ) + ) + ) + super().test_container_shift(data, frame, periods, indices) + + @pytest.mark.xfail( + reason="result dtype pyarrow[bool] better than expected dtype object" + ) + def test_combine_le(self, data_repeated): + super().test_combine_le(data_repeated) + + def test_combine_add(self, data_repeated, request): + pa_dtype = next(data_repeated(1)).dtype.pyarrow_dtype + if pa.types.is_temporal(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=TypeError, + reason=f"{pa_dtype} cannot be added to {pa_dtype}", + ) + ) + super().test_combine_add(data_repeated) + + def test_searchsorted(self, data_for_sorting, as_series, request): + pa_dtype = data_for_sorting.dtype.pyarrow_dtype + if pa.types.is_boolean(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason=f"{pa_dtype} only has 2 unique possible values", + ) + ) + super().test_searchsorted(data_for_sorting, as_series) + + def test_where_series(self, data, na_value, as_frame, request, using_array_manager): + pa_dtype = data.dtype.pyarrow_dtype + if using_array_manager and pa.types.is_duration(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + reason=f"Checking ndim when using arraymanager with {pa_dtype}" + ) + ) + elif pa.types.is_temporal(pa_dtype): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowNotImplementedError, + reason=f"Unsupported cast from double to {pa_dtype}", + ) + ) + super().test_where_series(data, na_value, as_frame) + + +class TestBaseArithmeticOps(base.BaseArithmeticOpsTests): + + divmod_exc = NotImplementedError + + def _patch_combine(self, obj, other, op): + # BaseOpsUtil._combine can upcast expected dtype + # (because it generates expected on python scalars) + # while ArrowExtensionArray maintains original type + expected = base.BaseArithmeticOpsTests._combine(self, obj, other, op) + was_frame = False + if isinstance(expected, pd.DataFrame): + was_frame = True + expected_data = expected.iloc[:, 0] + original_dtype = obj.iloc[:, 0].dtype + else: + expected_data = expected + original_dtype = obj.dtype + pa_array = pa.array(expected_data._values).cast(original_dtype.pyarrow_dtype) + pd_array = type(expected_data._values)(pa_array) + if was_frame: + expected = pd.DataFrame( + pd_array, index=expected.index, columns=expected.columns + ) + else: + expected = pd.Series(pd_array) + return expected + + def test_arith_series_with_scalar( + self, data, all_arithmetic_operators, request, monkeypatch + ): + pa_dtype = data.dtype.pyarrow_dtype + + arrow_temporal_supported = not pa_version_under8p0 and ( + all_arithmetic_operators in ("__add__", "__radd__") + and pa.types.is_duration(pa_dtype) + or all_arithmetic_operators in ("__sub__", "__rsub__") + and pa.types.is_temporal(pa_dtype) + ) + if ( + all_arithmetic_operators + in { + "__mod__", + "__rmod__", + } + or pa_version_under2p0 + ): + self.series_scalar_exc = NotImplementedError + elif arrow_temporal_supported: + self.series_scalar_exc = None + elif not ( + pa.types.is_floating(pa_dtype) + or pa.types.is_integer(pa_dtype) + or arrow_temporal_supported + ): + self.series_scalar_exc = pa.ArrowNotImplementedError + else: + self.series_scalar_exc = None + if ( + all_arithmetic_operators == "__rpow__" + and (pa.types.is_floating(pa_dtype) or pa.types.is_integer(pa_dtype)) + and not pa_version_under2p0 + ): + 
request.node.add_marker( + pytest.mark.xfail( + reason=( + f"GH 29997: 1**pandas.NA == 1 while 1**pyarrow.NA == NULL " + f"for {pa_dtype}" + ) + ) + ) + elif arrow_temporal_supported: + request.node.add_marker( + pytest.mark.xfail( + raises=TypeError, + reason=( + f"{all_arithmetic_operators} not supported between" + f"pd.NA and {pa_dtype} Python scalar" + ), + ) + ) + elif ( + all_arithmetic_operators in {"__rtruediv__", "__rfloordiv__"} + and (pa.types.is_floating(pa_dtype) or pa.types.is_integer(pa_dtype)) + and not pa_version_under2p0 + ): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowInvalid, + reason="divide by 0", + ) + ) + if all_arithmetic_operators == "__floordiv__" and pa.types.is_integer(pa_dtype): + # BaseOpsUtil._combine always returns int64, while ArrowExtensionArray does + # not upcast + monkeypatch.setattr(TestBaseArithmeticOps, "_combine", self._patch_combine) + super().test_arith_series_with_scalar(data, all_arithmetic_operators) + + def test_arith_frame_with_scalar( + self, data, all_arithmetic_operators, request, monkeypatch + ): + pa_dtype = data.dtype.pyarrow_dtype + + arrow_temporal_supported = not pa_version_under8p0 and ( + all_arithmetic_operators in ("__add__", "__radd__") + and pa.types.is_duration(pa_dtype) + or all_arithmetic_operators in ("__sub__", "__rsub__") + and pa.types.is_temporal(pa_dtype) + ) + if ( + all_arithmetic_operators + in { + "__mod__", + "__rmod__", + } + or pa_version_under2p0 + ): + self.frame_scalar_exc = NotImplementedError + elif arrow_temporal_supported: + self.frame_scalar_exc = None + elif not (pa.types.is_floating(pa_dtype) or pa.types.is_integer(pa_dtype)): + self.frame_scalar_exc = pa.ArrowNotImplementedError + else: + self.frame_scalar_exc = None + if ( + all_arithmetic_operators == "__rpow__" + and (pa.types.is_floating(pa_dtype) or pa.types.is_integer(pa_dtype)) + and not pa_version_under2p0 + ): + request.node.add_marker( + pytest.mark.xfail( + reason=( + f"GH 29997: 1**pandas.NA == 1 while 1**pyarrow.NA == NULL " + f"for {pa_dtype}" + ) + ) + ) + elif arrow_temporal_supported: + request.node.add_marker( + pytest.mark.xfail( + raises=TypeError, + reason=( + f"{all_arithmetic_operators} not supported between" + f"pd.NA and {pa_dtype} Python scalar" + ), + ) + ) + elif ( + all_arithmetic_operators in {"__rtruediv__", "__rfloordiv__"} + and (pa.types.is_floating(pa_dtype) or pa.types.is_integer(pa_dtype)) + and not pa_version_under2p0 + ): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowInvalid, + reason="divide by 0", + ) + ) + if all_arithmetic_operators == "__floordiv__" and pa.types.is_integer(pa_dtype): + # BaseOpsUtil._combine always returns int64, while ArrowExtensionArray does + # not upcast + monkeypatch.setattr(TestBaseArithmeticOps, "_combine", self._patch_combine) + super().test_arith_frame_with_scalar(data, all_arithmetic_operators) + + def test_arith_series_with_array( + self, data, all_arithmetic_operators, request, monkeypatch + ): + pa_dtype = data.dtype.pyarrow_dtype + + arrow_temporal_supported = not pa_version_under8p0 and ( + all_arithmetic_operators in ("__add__", "__radd__") + and pa.types.is_duration(pa_dtype) + or all_arithmetic_operators in ("__sub__", "__rsub__") + and pa.types.is_temporal(pa_dtype) + ) + if ( + all_arithmetic_operators + in { + "__mod__", + "__rmod__", + } + or pa_version_under2p0 + ): + self.series_array_exc = NotImplementedError + elif arrow_temporal_supported: + self.series_array_exc = None + elif not (pa.types.is_floating(pa_dtype) or 
pa.types.is_integer(pa_dtype)): + self.series_array_exc = pa.ArrowNotImplementedError + else: + self.series_array_exc = None + if ( + all_arithmetic_operators == "__rpow__" + and (pa.types.is_floating(pa_dtype) or pa.types.is_integer(pa_dtype)) + and not pa_version_under2p0 + ): + request.node.add_marker( + pytest.mark.xfail( + reason=( + f"GH 29997: 1**pandas.NA == 1 while 1**pyarrow.NA == NULL " + f"for {pa_dtype}" + ) + ) + ) + elif ( + all_arithmetic_operators + in ( + "__sub__", + "__rsub__", + ) + and pa.types.is_unsigned_integer(pa_dtype) + and not pa_version_under2p0 + ): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowInvalid, + reason=( + f"Implemented pyarrow.compute.subtract_checked " + f"which raises on overflow for {pa_dtype}" + ), + ) + ) + elif arrow_temporal_supported: + request.node.add_marker( + pytest.mark.xfail( + raises=TypeError, + reason=( + f"{all_arithmetic_operators} not supported between" + f"pd.NA and {pa_dtype} Python scalar" + ), + ) + ) + elif ( + all_arithmetic_operators in {"__rtruediv__", "__rfloordiv__"} + and (pa.types.is_floating(pa_dtype) or pa.types.is_integer(pa_dtype)) + and not pa_version_under2p0 + ): + request.node.add_marker( + pytest.mark.xfail( + raises=pa.ArrowInvalid, + reason="divide by 0", + ) + ) + op_name = all_arithmetic_operators + ser = pd.Series(data) + # pd.Series([ser.iloc[0]] * len(ser)) may not return ArrowExtensionArray + # since ser.iloc[0] is a python scalar + other = pd.Series(pd.array([ser.iloc[0]] * len(ser), dtype=data.dtype)) + if pa.types.is_floating(pa_dtype) or ( + pa.types.is_integer(pa_dtype) and all_arithmetic_operators != "__truediv__" + ): + monkeypatch.setattr(TestBaseArithmeticOps, "_combine", self._patch_combine) + self.check_opname(ser, op_name, other, exc=self.series_array_exc) + + def test_add_series_with_extension_array(self, data, request): + pa_dtype = data.dtype.pyarrow_dtype + if ( + not ( + pa.types.is_integer(pa_dtype) + or pa.types.is_floating(pa_dtype) + or (not pa_version_under8p0 and pa.types.is_duration(pa_dtype)) + ) + or pa_version_under2p0 + ): + request.node.add_marker( + pytest.mark.xfail( + raises=NotImplementedError, + reason=f"add_checked not implemented for {pa_dtype}", + ) + ) + super().test_add_series_with_extension_array(data) + + +class TestBaseComparisonOps(base.BaseComparisonOpsTests): + def assert_series_equal(self, left, right, *args, **kwargs): + # Series.combine for "expected" retains bool[pyarrow] dtype + # While "result" return "boolean" dtype + right = pd.Series(right._values.to_numpy(), dtype="boolean") + super().assert_series_equal(left, right, *args, **kwargs) + + def test_compare_array(self, data, comparison_op, na_value, request): + pa_dtype = data.dtype.pyarrow_dtype + ser = pd.Series(data) + # pd.Series([ser.iloc[0]] * len(ser)) may not return ArrowExtensionArray + # since ser.iloc[0] is a python scalar + other = pd.Series(pd.array([ser.iloc[0]] * len(ser), dtype=data.dtype)) + if comparison_op.__name__ in ["eq", "ne"]: + # comparison should match point-wise comparisons + result = comparison_op(ser, other) + # Series.combine does not calculate the NA mask correctly + # when comparing over an array + assert result[8] is na_value + assert result[97] is na_value + expected = ser.combine(other, comparison_op) + expected[8] = na_value + expected[97] = na_value + self.assert_series_equal(result, expected) + + else: + exc = None + try: + result = comparison_op(ser, other) + except Exception as err: + exc = err + + if exc is None: + # Didn't error, then 
should match point-wise behavior + if pa.types.is_temporal(pa_dtype): + # point-wise comparison with pd.NA raises TypeError + assert result[8] is na_value + assert result[97] is na_value + result = result.drop([8, 97]).reset_index(drop=True) + ser = ser.drop([8, 97]) + other = other.drop([8, 97]) + expected = ser.combine(other, comparison_op) + self.assert_series_equal(result, expected) + else: + with pytest.raises(type(exc)): + ser.combine(other, comparison_op) + + +def test_arrowdtype_construct_from_string_type_with_unsupported_parameters(): with pytest.raises(NotImplementedError, match="Passing pyarrow type"): - ArrowDtype.construct_from_string("timestamp[s][pyarrow]") + ArrowDtype.construct_from_string("timestamp[s, tz=UTC][pyarrow]") diff --git a/pandas/tests/extension/test_boolean.py b/pandas/tests/extension/test_boolean.py index e45bffba944c0..dd067102aba6c 100644 --- a/pandas/tests/extension/test_boolean.py +++ b/pandas/tests/extension/test_boolean.py @@ -177,7 +177,12 @@ class TestMethods(base.BaseMethodsTests): @pytest.mark.parametrize("na_sentinel", [-1, -2]) def test_factorize(self, data_for_grouping, na_sentinel): # override because we only have 2 unique values - labels, uniques = pd.factorize(data_for_grouping, na_sentinel=na_sentinel) + if na_sentinel == -1: + msg = "Specifying `na_sentinel=-1` is deprecated" + else: + msg = "Specifying the specific value to use for `na_sentinel` is deprecated" + with tm.assert_produces_warning(FutureWarning, match=msg): + labels, uniques = pd.factorize(data_for_grouping, na_sentinel=na_sentinel) expected_labels = np.array( [0, 0, na_sentinel, na_sentinel, 1, 1, 0], dtype=np.intp ) diff --git a/pandas/tests/extension/test_extension.py b/pandas/tests/extension/test_extension.py index 1ed626cd51080..a4b1a4b43ef2b 100644 --- a/pandas/tests/extension/test_extension.py +++ b/pandas/tests/extension/test_extension.py @@ -4,6 +4,7 @@ import numpy as np import pytest +import pandas._testing as tm from pandas.core.arrays import ExtensionArray @@ -24,3 +25,16 @@ def test_errors(self, data, all_arithmetic_operators): op_name = all_arithmetic_operators with pytest.raises(AttributeError): getattr(data, op_name) + + +def test_depr_na_sentinel(): + # GH#46910 + msg = "The `na_sentinel` argument of `MyEA.factorize` is deprecated" + with tm.assert_produces_warning(DeprecationWarning, match=msg): + + class MyEA(ExtensionArray): + def factorize(self, na_sentinel=-1): + pass + + with tm.assert_produces_warning(None): + MyEA() diff --git a/pandas/tests/extension/test_string.py b/pandas/tests/extension/test_string.py index 8a8bdee90e467..6cea21b6672d8 100644 --- a/pandas/tests/extension/test_string.py +++ b/pandas/tests/extension/test_string.py @@ -167,9 +167,7 @@ def test_reduce_series_numeric(self, data, all_numeric_reductions, skipna): class TestMethods(base.BaseMethodsTests): - @pytest.mark.xfail(reason="returns nullable: GH 44692") - def test_value_counts_with_normalize(self, data): - super().test_value_counts_with_normalize(data) + pass class TestCasting(base.BaseCastingTests): diff --git a/pandas/tests/frame/constructors/test_from_dict.py b/pandas/tests/frame/constructors/test_from_dict.py index 72107d849f598..7c2b009673bb7 100644 --- a/pandas/tests/frame/constructors/test_from_dict.py +++ b/pandas/tests/frame/constructors/test_from_dict.py @@ -17,11 +17,6 @@ class TestFromDict: # Note: these tests are specific to the from_dict method, not for # passing dictionaries to DataFrame.__init__ - def test_from_dict_scalars_requires_index(self): - msg = "If using 
all scalar values, you must pass an index" - with pytest.raises(ValueError, match=msg): - DataFrame.from_dict(OrderedDict([("b", 8), ("a", 5), ("a", 6)])) - def test_constructor_list_of_odicts(self): data = [ OrderedDict([["a", 1.5], ["b", 3], ["c", 4], ["d", 6]]), @@ -189,3 +184,16 @@ def test_frame_dict_constructor_empty_series(self): # it works! DataFrame({"foo": s1, "bar": s2, "baz": s3}) DataFrame.from_dict({"foo": s1, "baz": s3, "bar": s2}) + + def test_from_dict_scalars_requires_index(self): + msg = "If using all scalar values, you must pass an index" + with pytest.raises(ValueError, match=msg): + DataFrame.from_dict(OrderedDict([("b", 8), ("a", 5), ("a", 6)])) + + def test_from_dict_orient_invalid(self): + msg = ( + "Expected 'index', 'columns' or 'tight' for orient parameter. " + "Got 'abc' instead" + ) + with pytest.raises(ValueError, match=msg): + DataFrame.from_dict({"foo": 1, "baz": 3, "bar": 2}, orient="abc") diff --git a/pandas/tests/frame/indexing/test_getitem.py b/pandas/tests/frame/indexing/test_getitem.py index 8cc8b487ff44f..7994c56f8d68b 100644 --- a/pandas/tests/frame/indexing/test_getitem.py +++ b/pandas/tests/frame/indexing/test_getitem.py @@ -357,6 +357,21 @@ def test_getitem_empty_frame_with_boolean(self): df2 = df[df > 0] tm.assert_frame_equal(df, df2) + def test_getitem_returns_view_when_column_is_unique_in_df(self): + # GH#45316 + df = DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "a", "b"]) + view = df["b"] + view.loc[:] = 100 + expected = DataFrame([[1, 2, 100], [4, 5, 100]], columns=["a", "a", "b"]) + tm.assert_frame_equal(df, expected) + + def test_getitem_frozenset_unique_in_column(self): + # GH#41062 + df = DataFrame([[1, 2, 3, 4]], columns=[frozenset(["KEY"]), "B", "C", "C"]) + result = df[frozenset(["KEY"])] + expected = Series([1], name=frozenset(["KEY"])) + tm.assert_series_equal(result, expected) + class TestGetitemSlice: def test_getitem_slice_float64(self, frame_or_series): diff --git a/pandas/tests/frame/indexing/test_indexing.py b/pandas/tests/frame/indexing/test_indexing.py index 0fbf375e441ac..edcd577dd948d 100644 --- a/pandas/tests/frame/indexing/test_indexing.py +++ b/pandas/tests/frame/indexing/test_indexing.py @@ -1298,6 +1298,25 @@ def test_loc_expand_empty_frame_keep_midx_names(self): ) tm.assert_frame_equal(df, expected) + @pytest.mark.parametrize("val", ["x", 1]) + @pytest.mark.parametrize("idxr", ["a", ["a"]]) + def test_loc_setitem_rhs_frame(self, idxr, val): + # GH#47578 + df = DataFrame({"a": [1, 2]}) + df.loc[:, "a"] = DataFrame({"a": [val, 11]}, index=[1, 2]) + expected = DataFrame({"a": [np.nan, val]}) + tm.assert_frame_equal(df, expected) + + @td.skip_array_manager_invalid_test + def test_iloc_setitem_enlarge_no_warning(self): + # GH#47381 + df = DataFrame(columns=["a", "b"]) + expected = df.copy() + view = df[:] + with tm.assert_produces_warning(None): + df.iloc[:, 0] = np.array([1, 2], dtype=np.float64) + tm.assert_frame_equal(view, expected) + class TestDataFrameIndexingUInt64: def test_setitem(self, uint64_frame): diff --git a/pandas/tests/frame/indexing/test_setitem.py b/pandas/tests/frame/indexing/test_setitem.py index cf6d351aa78a0..cd547819dbe94 100644 --- a/pandas/tests/frame/indexing/test_setitem.py +++ b/pandas/tests/frame/indexing/test_setitem.py @@ -57,7 +57,9 @@ class mystring(str): expected = DataFrame({"a": [1], "b": [2], mystring("c"): [3]}, index=index) tm.assert_equal(df, expected) - @pytest.mark.parametrize("dtype", ["int32", "int64", "float32", "float64"]) + @pytest.mark.parametrize( + "dtype", ["int32", 
"int64", "uint32", "uint64", "float32", "float64"] + ) def test_setitem_dtype(self, dtype, float_frame): arr = np.random.randn(len(float_frame)) @@ -210,6 +212,7 @@ def test_setitem_dict_preserves_dtypes(self): "a": Series([0, 1, 2], dtype="int64"), "b": Series([1, 2, 3], dtype=float), "c": Series([1, 2, 3], dtype=float), + "d": Series([1, 2, 3], dtype="uint32"), } ) df = DataFrame( @@ -217,10 +220,16 @@ def test_setitem_dict_preserves_dtypes(self): "a": Series([], dtype="int64"), "b": Series([], dtype=float), "c": Series([], dtype=float), + "d": Series([], dtype="uint32"), } ) for idx, b in enumerate([1, 2, 3]): - df.loc[df.shape[0]] = {"a": int(idx), "b": float(b), "c": float(b)} + df.loc[df.shape[0]] = { + "a": int(idx), + "b": float(b), + "c": float(b), + "d": np.uint32(b), + } tm.assert_frame_equal(df, expected) @pytest.mark.parametrize( @@ -683,6 +692,13 @@ def test_boolean_mask_nullable_int64(self): ) tm.assert_frame_equal(result, expected) + def test_setitem_ea_dtype_rhs_series(self): + # GH#47425 + df = DataFrame({"a": [1, 2]}) + df["a"] = Series([1, 2], dtype="Int64") + expected = DataFrame({"a": [1, 2]}, dtype="Int64") + tm.assert_frame_equal(df, expected) + # TODO(ArrayManager) set column with 2d column array, see #44788 @td.skip_array_manager_not_yet_implemented def test_setitem_npmatrix_2d(self): @@ -702,6 +718,29 @@ def test_setitem_npmatrix_2d(self): tm.assert_frame_equal(df, expected) + @pytest.mark.parametrize("vals", [{}, {"d": "a"}]) + def test_setitem_aligning_dict_with_index(self, vals): + # GH#47216 + df = DataFrame({"a": [1, 2], "b": [3, 4], **vals}) + df.loc[:, "a"] = {1: 100, 0: 200} + df.loc[:, "c"] = {0: 5, 1: 6} + df.loc[:, "e"] = {1: 5} + expected = DataFrame( + {"a": [200, 100], "b": [3, 4], **vals, "c": [5, 6], "e": [np.nan, 5]} + ) + tm.assert_frame_equal(df, expected) + + def test_setitem_rhs_dataframe(self): + # GH#47578 + df = DataFrame({"a": [1, 2]}) + df["a"] = DataFrame({"a": [10, 11]}, index=[1, 2]) + expected = DataFrame({"a": [np.nan, 10]}) + tm.assert_frame_equal(df, expected) + + df = DataFrame({"a": [1, 2]}) + df.isetitem(0, DataFrame({"a": [10, 11]}, index=[1, 2])) + tm.assert_frame_equal(df, expected) + class TestSetitemTZAwareValues: @pytest.fixture @@ -776,9 +815,7 @@ def test_setitem_string_column_numpy_dtype_raising(self): def test_setitem_empty_df_duplicate_columns(self): # GH#38521 df = DataFrame(columns=["a", "b", "b"], dtype="float64") - msg = "will attempt to set the values inplace instead" - with tm.assert_produces_warning(FutureWarning, match=msg): - df.loc[:, "a"] = list(range(2)) + df.loc[:, "a"] = list(range(2)) expected = DataFrame( [[0, np.nan, np.nan], [1, np.nan, np.nan]], columns=["a", "b", "b"] ) diff --git a/pandas/tests/frame/indexing/test_where.py b/pandas/tests/frame/indexing/test_where.py index 9d004613116b8..5b9883f3866e7 100644 --- a/pandas/tests/frame/indexing/test_where.py +++ b/pandas/tests/frame/indexing/test_where.py @@ -1035,3 +1035,17 @@ def test_where_dt64_2d(): mask[:] = True expected = df _check_where_equivalences(df, mask, other, expected) + + +def test_where_mask_deprecated(frame_or_series): + # GH 47728 + obj = DataFrame(np.random.randn(4, 3)) + obj = tm.get_obj(obj, frame_or_series) + + mask = obj > 0 + + with tm.assert_produces_warning(FutureWarning): + obj.where(mask, -1, errors="raise") + + with tm.assert_produces_warning(FutureWarning): + obj.mask(mask, -1, errors="raise") diff --git a/pandas/tests/frame/methods/test_append.py b/pandas/tests/frame/methods/test_append.py index 
d1c9c379759b5..f07ffee20a55f 100644 --- a/pandas/tests/frame/methods/test_append.py +++ b/pandas/tests/frame/methods/test_append.py @@ -159,7 +159,7 @@ def test_append_empty_dataframe(self): expected = df1.copy() tm.assert_frame_equal(result, expected) - def test_append_dtypes(self): + def test_append_dtypes(self, using_array_manager): # GH 5754 # row appends of different dtypes (so need to do by-item) @@ -183,7 +183,10 @@ def test_append_dtypes(self): expected = DataFrame( {"bar": Series([Timestamp("20130101"), np.nan], dtype="M8[ns]")} ) - expected = expected.astype(object) + if using_array_manager: + # TODO(ArrayManager) decide on exact casting rules in concat + # With ArrayManager, all-NaN float is not ignored + expected = expected.astype(object) tm.assert_frame_equal(result, expected) df1 = DataFrame({"bar": Timestamp("20130101")}, index=range(1)) @@ -192,7 +195,9 @@ def test_append_dtypes(self): expected = DataFrame( {"bar": Series([Timestamp("20130101"), np.nan], dtype="M8[ns]")} ) - expected = expected.astype(object) + if using_array_manager: + # With ArrayManager, all-NaN float is not ignored + expected = expected.astype(object) tm.assert_frame_equal(result, expected) df1 = DataFrame({"bar": np.nan}, index=range(1)) @@ -201,7 +206,9 @@ def test_append_dtypes(self): expected = DataFrame( {"bar": Series([np.nan, Timestamp("20130101")], dtype="M8[ns]")} ) - expected = expected.astype(object) + if using_array_manager: + # With ArrayManager, all-NaN float is not ignored + expected = expected.astype(object) tm.assert_frame_equal(result, expected) df1 = DataFrame({"bar": Timestamp("20130101")}, index=range(1)) diff --git a/pandas/tests/frame/methods/test_compare.py b/pandas/tests/frame/methods/test_compare.py index 468811eba0d39..609242db453ba 100644 --- a/pandas/tests/frame/methods/test_compare.py +++ b/pandas/tests/frame/methods/test_compare.py @@ -180,3 +180,59 @@ def test_compare_unaligned_objects(): df1 = pd.DataFrame(np.ones((3, 3))) df2 = pd.DataFrame(np.zeros((2, 1))) df1.compare(df2) + + +def test_compare_result_names(): + # GH 44354 + df1 = pd.DataFrame( + {"col1": ["a", "b", "c"], "col2": [1.0, 2.0, np.nan], "col3": [1.0, 2.0, 3.0]}, + ) + df2 = pd.DataFrame( + { + "col1": ["c", "b", "c"], + "col2": [1.0, 2.0, np.nan], + "col3": [1.0, 2.0, np.nan], + }, + ) + result = df1.compare(df2, result_names=("left", "right")) + expected = pd.DataFrame( + { + ("col1", "left"): {0: "a", 2: np.nan}, + ("col1", "right"): {0: "c", 2: np.nan}, + ("col3", "left"): {0: np.nan, 2: 3.0}, + ("col3", "right"): {0: np.nan, 2: np.nan}, + } + ) + tm.assert_frame_equal(result, expected) + + +@pytest.mark.parametrize( + "result_names", + [ + [1, 2], + "HK", + {"2": 2, "3": 3}, + 3, + 3.0, + ], +) +def test_invalid_input_result_names(result_names): + # GH 44354 + df1 = pd.DataFrame( + {"col1": ["a", "b", "c"], "col2": [1.0, 2.0, np.nan], "col3": [1.0, 2.0, 3.0]}, + ) + df2 = pd.DataFrame( + { + "col1": ["c", "b", "c"], + "col2": [1.0, 2.0, np.nan], + "col3": [1.0, 2.0, np.nan], + }, + ) + with pytest.raises( + TypeError, + match=( + f"Passing 'result_names' as a {type(result_names)} is not " + "supported. Provide 'result_names' as a tuple instead." 
+ ), + ): + df1.compare(df2, result_names=result_names) diff --git a/pandas/tests/frame/methods/test_dtypes.py b/pandas/tests/frame/methods/test_dtypes.py index 31592f987f04d..87e6ed5b1b135 100644 --- a/pandas/tests/frame/methods/test_dtypes.py +++ b/pandas/tests/frame/methods/test_dtypes.py @@ -1,6 +1,7 @@ from datetime import timedelta import numpy as np +import pytest from pandas.core.dtypes.dtypes import DatetimeTZDtype @@ -79,6 +80,20 @@ def test_dtypes_are_correct_after_column_slice(self): Series({"a": np.float_, "b": np.float_, "c": np.float_}), ) + @pytest.mark.parametrize( + "data", + [pd.NA, True], + ) + def test_dtypes_are_correct_after_groupby_last(self, data): + # GH46409 + df = DataFrame( + {"id": [1, 2, 3, 4], "test": [True, pd.NA, data, False]} + ).convert_dtypes() + result = df.groupby("id").last().test + expected = df.set_index("id").test + assert result.dtype == pd.BooleanDtype() + tm.assert_series_equal(expected, result) + def test_dtypes_gh8722(self, float_string_frame): float_string_frame["bool"] = float_string_frame["A"] > 0 result = float_string_frame.dtypes diff --git a/pandas/tests/frame/methods/test_fillna.py b/pandas/tests/frame/methods/test_fillna.py index 5008e64dd0e99..d86c1b2aedcac 100644 --- a/pandas/tests/frame/methods/test_fillna.py +++ b/pandas/tests/frame/methods/test_fillna.py @@ -284,6 +284,7 @@ def test_fillna_downcast_noop(self, frame_or_series): res3 = obj2.fillna("foo", downcast=np.dtype(np.int32)) tm.assert_equal(res3, expected) + @td.skip_array_manager_not_yet_implemented @pytest.mark.parametrize("columns", [["A", "A", "B"], ["A", "A"]]) def test_fillna_dictlike_value_duplicate_colnames(self, columns): # GH#43476 @@ -673,6 +674,40 @@ def test_fillna_inplace_with_columns_limit_and_value(self): df.fillna(axis=1, value=100, limit=1, inplace=True) tm.assert_frame_equal(df, expected) + @td.skip_array_manager_invalid_test + @pytest.mark.parametrize("val", [-1, {"x": -1, "y": -1}]) + def test_inplace_dict_update_view(self, val): + # GH#47188 + df = DataFrame({"x": [np.nan, 2], "y": [np.nan, 2]}) + result_view = df[:] + df.fillna(val, inplace=True) + expected = DataFrame({"x": [-1, 2.0], "y": [-1.0, 2]}) + tm.assert_frame_equal(df, expected) + tm.assert_frame_equal(result_view, expected) + + def test_single_block_df_with_horizontal_axis(self): + # GH 47713 + df = DataFrame( + { + "col1": [5, 0, np.nan, 10, np.nan], + "col2": [7, np.nan, np.nan, 5, 3], + "col3": [12, np.nan, 1, 2, 0], + "col4": [np.nan, 1, 1, np.nan, 18], + } + ) + result = df.fillna(50, limit=1, axis=1) + expected = DataFrame( + [ + [5.0, 7.0, 12.0, 50.0], + [0.0, 50.0, np.nan, 1.0], + [50.0, np.nan, 1.0, 1.0], + [10.0, 5.0, 2.0, 50.0], + [50.0, 3.0, 0.0, 18.0], + ], + columns=["col1", "col2", "col3", "col4"], + ) + tm.assert_frame_equal(result, expected) + def test_fillna_nonconsolidated_frame(): # https://github.com/pandas-dev/pandas/issues/36495 diff --git a/pandas/tests/frame/methods/test_join.py b/pandas/tests/frame/methods/test_join.py index 0935856fb223a..7db26f7eb570b 100644 --- a/pandas/tests/frame/methods/test_join.py +++ b/pandas/tests/frame/methods/test_join.py @@ -367,6 +367,15 @@ def test_join_left_sequence_non_unique_index(): tm.assert_frame_equal(joined, expected) +def test_join_list_series(float_frame): + # GH#46850 + # Join a DataFrame with a list containing both a Series and a DataFrame + left = float_frame.A.to_frame() + right = [float_frame.B, float_frame[["C", "D"]]] + result = left.join(right) + tm.assert_frame_equal(result, float_frame) + + 
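[Editor's note, not part of the patch: a minimal usage sketch of the behavior that test_join_list_series above (GH#46850) exercises, namely DataFrame.join accepting a list that mixes a named Series and a DataFrame. Names and values here are illustrative only.]

import pandas as pd

left = pd.DataFrame({"A": [1.0, 2.0]})
# The list may mix a named Series and a DataFrame; all objects share the same index.
right = [pd.Series([3.0, 4.0], name="B"), pd.DataFrame({"C": [5.0, 6.0], "D": [7.0, 8.0]})]
result = left.join(right)
# result has columns A, B, C and D, aligned on the common index.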
@pytest.mark.parametrize("sort_kw", [True, False]) def test_suppress_future_warning_with_sort_kw(sort_kw): a = DataFrame({"col1": [1, 2]}, index=["c", "a"]) diff --git a/pandas/tests/frame/methods/test_to_csv.py b/pandas/tests/frame/methods/test_to_csv.py index 01009d6df3920..df7bc04202e39 100644 --- a/pandas/tests/frame/methods/test_to_csv.py +++ b/pandas/tests/frame/methods/test_to_csv.py @@ -833,6 +833,18 @@ def test_to_csv_float_format(self): ) tm.assert_frame_equal(rs, xp) + def test_to_csv_float_format_over_decimal(self): + # GH#47436 + df = DataFrame({"a": [0.5, 1.0]}) + result = df.to_csv( + decimal=",", + float_format=lambda x: np.format_float_positional(x, trim="-"), + index=False, + ) + expected_rows = ["a", "0.5", "1"] + expected = tm.convert_rows_list_to_csv_str(expected_rows) + assert result == expected + def test_to_csv_unicodewriter_quoting(self): df = DataFrame({"A": [1, 2, 3], "B": ["foo", "bar", "baz"]}) @@ -1285,3 +1297,32 @@ def test_to_csv_na_quoting(self): ) expected = '""\n""\n' assert result == expected + + def test_to_csv_categorical_and_ea(self): + # GH#46812 + df = DataFrame({"a": "x", "b": [1, pd.NA]}) + df["b"] = df["b"].astype("Int16") + df["b"] = df["b"].astype("category") + result = df.to_csv() + expected_rows = [",a,b", "0,x,1", "1,x,"] + expected = tm.convert_rows_list_to_csv_str(expected_rows) + assert result == expected + + def test_to_csv_categorical_and_interval(self): + # GH#46297 + df = DataFrame( + { + "a": [ + pd.Interval( + Timestamp("2020-01-01"), + Timestamp("2020-01-02"), + inclusive="both", + ) + ] + } + ) + df["a"] = df["a"].astype("category") + result = df.to_csv() + expected_rows = [",a", '0,"[2020-01-01, 2020-01-02]"'] + expected = tm.convert_rows_list_to_csv_str(expected_rows) + assert result == expected diff --git a/pandas/tests/frame/methods/test_update.py b/pandas/tests/frame/methods/test_update.py index 408113e9bc417..d3257ac09a0ab 100644 --- a/pandas/tests/frame/methods/test_update.py +++ b/pandas/tests/frame/methods/test_update.py @@ -1,6 +1,8 @@ import numpy as np import pytest +import pandas.util._test_decorators as td + import pandas as pd from pandas import ( DataFrame, @@ -146,3 +148,14 @@ def test_update_with_different_dtype(self): expected = DataFrame({"a": [1, 3], "b": [np.nan, 2], "c": ["foo", np.nan]}) tm.assert_frame_equal(df, expected) + + @td.skip_array_manager_invalid_test + def test_update_modify_view(self): + # GH#47188 + df = DataFrame({"A": ["1", np.nan], "B": ["100", np.nan]}) + df2 = DataFrame({"A": ["a", "x"], "B": ["100", "200"]}) + result_view = df2[:] + df2.update(df) + expected = DataFrame({"A": ["1", "x"], "B": ["100", "200"]}) + tm.assert_frame_equal(df2, expected) + tm.assert_frame_equal(result_view, expected) diff --git a/pandas/tests/frame/test_arithmetic.py b/pandas/tests/frame/test_arithmetic.py index 0864032b741c9..25257a2c102fd 100644 --- a/pandas/tests/frame/test_arithmetic.py +++ b/pandas/tests/frame/test_arithmetic.py @@ -1,5 +1,6 @@ from collections import deque from datetime import datetime +from enum import Enum import functools import operator import re @@ -2050,3 +2051,15 @@ def _constructor_sliced(self): result = sdf + sdf tm.assert_frame_equal(result, expected) + + +def test_enum_column_equality(): + Cols = Enum("Cols", "col1 col2") + + q1 = DataFrame({Cols.col1: [1, 2, 3]}) + q2 = DataFrame({Cols.col1: [1, 2, 3]}) + + result = q1[Cols.col1] == q2[Cols.col1] + expected = Series([True, True, True], name=Cols.col1) + + tm.assert_series_equal(result, expected) diff --git 
a/pandas/tests/frame/test_constructors.py b/pandas/tests/frame/test_constructors.py index 0a67061016566..d00cf198b3296 100644 --- a/pandas/tests/frame/test_constructors.py +++ b/pandas/tests/frame/test_constructors.py @@ -52,6 +52,7 @@ IntervalArray, PeriodArray, SparseArray, + TimedeltaArray, ) from pandas.core.api import Int64Index @@ -433,6 +434,25 @@ def test_constructor_int_overflow(self, values): assert result[0].dtype == object assert result[0][0] == value + @pytest.mark.parametrize( + "values", + [ + np.array([1], dtype=np.uint16), + np.array([1], dtype=np.uint32), + np.array([1], dtype=np.uint64), + [np.uint16(1)], + [np.uint32(1)], + [np.uint64(1)], + ], + ) + def test_constructor_numpy_uints(self, values): + # GH#47294 + value = values[0] + result = DataFrame(values) + + assert result[0].dtype == value.dtype + assert result[0][0] == value + def test_constructor_ordereddict(self): import random @@ -2665,6 +2685,12 @@ def test_from_dict_with_missing_copy_false(self): ) tm.assert_frame_equal(df, expected) + def test_construction_empty_array_multi_column_raises(self): + # GH#46822 + msg = "Empty data passed with indices specified." + with pytest.raises(ValueError, match=msg): + DataFrame(data=np.array([]), columns=["a", "b"]) + class TestDataFrameConstructorIndexInference: def test_frame_from_dict_of_series_overlapping_monthly_period_indexes(self): @@ -3086,8 +3112,50 @@ def test_tzaware_data_tznaive_dtype(self, constructor): assert np.all(result.dtypes == "M8[ns]") assert np.all(result == ts_naive) - def test_construction_empty_array_multi_column_raises(self): - # GH#46822 - msg = "Empty data passed with indices specified." - with pytest.raises(ValueError, match=msg): - DataFrame(data=np.array([]), columns=["a", "b"]) + +# TODO: better location for this test? 
+class TestAllowNonNano: + # Until 2.0, we do not preserve non-nano dt64/td64 when passed as ndarray, + # but do preserve it when passed as DTA/TDA + + @pytest.fixture(params=[True, False]) + def as_td(self, request): + return request.param + + @pytest.fixture + def arr(self, as_td): + values = np.arange(5).astype(np.int64).view("M8[s]") + if as_td: + values = values - values[0] + return TimedeltaArray._simple_new(values, dtype=values.dtype) + else: + return DatetimeArray._simple_new(values, dtype=values.dtype) + + def test_index_allow_non_nano(self, arr): + idx = Index(arr) + assert idx.dtype == arr.dtype + + def test_dti_tdi_allow_non_nano(self, arr, as_td): + if as_td: + idx = pd.TimedeltaIndex(arr) + else: + idx = DatetimeIndex(arr) + assert idx.dtype == arr.dtype + + def test_series_allow_non_nano(self, arr): + ser = Series(arr) + assert ser.dtype == arr.dtype + + def test_frame_allow_non_nano(self, arr): + df = DataFrame(arr) + assert df.dtypes[0] == arr.dtype + + @pytest.mark.xfail( + # TODO(2.0): xfail should become unnecessary + strict=False, + reason="stack_arrays converts TDA to ndarray, then goes " + "through ensure_wrapped_if_datetimelike", + ) + def test_frame_from_dict_allow_non_nano(self, arr): + df = DataFrame({0: arr}) + assert df.dtypes[0] == arr.dtype diff --git a/pandas/tests/frame/test_query_eval.py b/pandas/tests/frame/test_query_eval.py index fe3b04e8e27e6..35335c54cd41e 100644 --- a/pandas/tests/frame/test_query_eval.py +++ b/pandas/tests/frame/test_query_eval.py @@ -3,6 +3,7 @@ import numpy as np import pytest +from pandas.errors import UndefinedVariableError import pandas.util._test_decorators as td import pandas as pd @@ -495,8 +496,6 @@ def test_query_syntax_error(self): df.query("i - +", engine=engine, parser=parser) def test_query_scope(self): - from pandas.core.computation.ops import UndefinedVariableError - engine, parser = self.engine, self.parser skip_if_no_pandas_parser(parser) @@ -522,8 +521,6 @@ def test_query_scope(self): df.query("@a > b > c", engine=engine, parser=parser) def test_query_doesnt_pickup_local(self): - from pandas.core.computation.ops import UndefinedVariableError - engine, parser = self.engine, self.parser n = m = 10 df = DataFrame(np.random.randint(m, size=(n, 3)), columns=list("abc")) @@ -618,8 +615,6 @@ def test_nested_scope(self): tm.assert_frame_equal(result, expected) def test_nested_raises_on_local_self_reference(self): - from pandas.core.computation.ops import UndefinedVariableError - df = DataFrame(np.random.randn(5, 3)) # can't reference ourself b/c we're a local so @ is necessary @@ -678,8 +673,6 @@ def test_at_inside_string(self): tm.assert_frame_equal(result, expected) def test_query_undefined_local(self): - from pandas.core.computation.ops import UndefinedVariableError - engine, parser = self.engine, self.parser skip_if_no_pandas_parser(parser) @@ -838,8 +831,6 @@ def test_date_index_query_with_NaT_duplicates(self): df.query("index < 20130101 < dates3", engine=engine, parser=parser) def test_nested_scope(self): - from pandas.core.computation.ops import UndefinedVariableError - engine = self.engine parser = self.parser # smoke test diff --git a/pandas/tests/groupby/aggregate/test_aggregate.py b/pandas/tests/groupby/aggregate/test_aggregate.py index d52b6ceaf8990..54ee32502bbc9 100644 --- a/pandas/tests/groupby/aggregate/test_aggregate.py +++ b/pandas/tests/groupby/aggregate/test_aggregate.py @@ -1413,3 +1413,28 @@ def test_multi_axis_1_raises(func): gb = df.groupby("a", axis=1) with pytest.raises(NotImplementedError, 
match="axis other than 0 is not supported"): gb.agg(func) + + +@pytest.mark.parametrize( + "test, constant", + [ + ([[20, "A"], [20, "B"], [10, "C"]], {0: [10, 20], 1: ["C", ["A", "B"]]}), + ([[20, "A"], [20, "B"], [30, "C"]], {0: [20, 30], 1: [["A", "B"], "C"]}), + ([["a", 1], ["a", 1], ["b", 2], ["b", 3]], {0: ["a", "b"], 1: [1, [2, 3]]}), + pytest.param( + [["a", 1], ["a", 2], ["b", 3], ["b", 3]], + {0: ["a", "b"], 1: [[1, 2], 3]}, + marks=pytest.mark.xfail, + ), + ], +) +def test_agg_of_mode_list(test, constant): + # GH#25581 + df1 = DataFrame(test) + result = df1.groupby(0).agg(Series.mode) + # Mode usually only returns 1 value, but can return a list in the case of a tie. + + expected = DataFrame(constant) + expected = expected.set_index(0) + + tm.assert_frame_equal(result, expected) diff --git a/pandas/tests/groupby/aggregate/test_cython.py b/pandas/tests/groupby/aggregate/test_cython.py index 9631de7833cf4..869ed31b6a2d9 100644 --- a/pandas/tests/groupby/aggregate/test_cython.py +++ b/pandas/tests/groupby/aggregate/test_cython.py @@ -92,8 +92,9 @@ def test_cython_agg_boolean(): def test_cython_agg_nothing_to_agg(): frame = DataFrame({"a": np.random.randint(0, 5, 50), "b": ["foo", "bar"] * 25}) - with pytest.raises(NotImplementedError, match="does not implement"): - frame.groupby("a")["b"].mean(numeric_only=True) + with tm.assert_produces_warning(FutureWarning, match="This will raise a TypeError"): + with pytest.raises(NotImplementedError, match="does not implement"): + frame.groupby("a")["b"].mean(numeric_only=True) with pytest.raises(TypeError, match="Could not convert (foo|bar)*"): frame.groupby("a")["b"].mean() @@ -114,8 +115,9 @@ def test_cython_agg_nothing_to_agg_with_dates(): "dates": pd.date_range("now", periods=50, freq="T"), } ) - with pytest.raises(NotImplementedError, match="does not implement"): - frame.groupby("b").dates.mean(numeric_only=True) + with tm.assert_produces_warning(FutureWarning, match="This will raise a TypeError"): + with pytest.raises(NotImplementedError, match="does not implement"): + frame.groupby("b").dates.mean(numeric_only=True) def test_cython_agg_frame_columns(): diff --git a/pandas/tests/groupby/test_function.py b/pandas/tests/groupby/test_function.py index 25266d1e6ab80..dda583e3a1962 100644 --- a/pandas/tests/groupby/test_function.py +++ b/pandas/tests/groupby/test_function.py @@ -306,7 +306,7 @@ def test_idxmax(self, gb): # non-cython calls should not include the grouper expected = DataFrame([[0.0], [np.nan]], columns=["B"], index=[1, 3]) expected.index.name = "A" - msg = "The default value of numeric_only" + msg = "The default value of numeric_only in DataFrameGroupBy.idxmax" with tm.assert_produces_warning(FutureWarning, match=msg): result = gb.idxmax() tm.assert_frame_equal(result, expected) @@ -317,7 +317,7 @@ def test_idxmin(self, gb): # non-cython calls should not include the grouper expected = DataFrame([[0.0], [np.nan]], columns=["B"], index=[1, 3]) expected.index.name = "A" - msg = "The default value of numeric_only" + msg = "The default value of numeric_only in DataFrameGroupBy.idxmin" with tm.assert_produces_warning(FutureWarning, match=msg): result = gb.idxmin() tm.assert_frame_equal(result, expected) @@ -555,6 +555,81 @@ def test_idxmin_idxmax_axis1(): gb2.idxmax(axis=1) +@pytest.mark.parametrize("numeric_only", [True, False, None]) +def test_axis1_numeric_only(request, groupby_func, numeric_only): + if groupby_func in ("idxmax", "idxmin"): + pytest.skip("idxmax and idx_min tested in test_idxmin_idxmax_axis1") + if groupby_func in 
("mad", "tshift"): + pytest.skip("mad and tshift are deprecated") + if groupby_func in ("corrwith", "skew"): + msg = "GH#47723 groupby.corrwith and skew do not correctly implement axis=1" + request.node.add_marker(pytest.mark.xfail(reason=msg)) + + df = DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"]) + df["E"] = "x" + groups = [1, 2, 3, 1, 2, 3, 1, 2, 3, 4] + gb = df.groupby(groups) + method = getattr(gb, groupby_func) + args = (0,) if groupby_func == "fillna" else () + kwargs = {"axis": 1} + if numeric_only is not None: + # when numeric_only is None we don't pass any argument + kwargs["numeric_only"] = numeric_only + + # Functions without numeric_only and axis args + no_args = ("cumprod", "cumsum", "diff", "fillna", "pct_change", "rank", "shift") + # Functions with axis args + has_axis = ( + "cumprod", + "cumsum", + "diff", + "pct_change", + "rank", + "shift", + "cummax", + "cummin", + "idxmin", + "idxmax", + "fillna", + ) + if numeric_only is not None and groupby_func in no_args: + msg = "got an unexpected keyword argument 'numeric_only'" + with pytest.raises(TypeError, match=msg): + method(*args, **kwargs) + elif groupby_func not in has_axis: + msg = "got an unexpected keyword argument 'axis'" + warn = FutureWarning if groupby_func == "skew" and not numeric_only else None + with tm.assert_produces_warning(warn, match="Dropping of nuisance columns"): + with pytest.raises(TypeError, match=msg): + method(*args, **kwargs) + # fillna and shift are successful even on object dtypes + elif (numeric_only is None or not numeric_only) and groupby_func not in ( + "fillna", + "shift", + ): + msgs = ( + # cummax, cummin, rank + "not supported between instances of", + # cumprod + "can't multiply sequence by non-int of type 'float'", + # cumsum, diff, pct_change + "unsupported operand type", + ) + with pytest.raises(TypeError, match=f"({'|'.join(msgs)})"): + method(*args, **kwargs) + else: + result = method(*args, **kwargs) + + df_expected = df.drop(columns="E").T if numeric_only else df.T + expected = getattr(df_expected, groupby_func)(*args).T + if groupby_func == "shift" and not numeric_only: + # shift with axis=1 leaves the leftmost column as numeric + # but transposing for expected gives us object dtype + expected = expected.astype(float) + + tm.assert_equal(result, expected) + + def test_groupby_cumprod(): # GH 4095 df = DataFrame({"key": ["b"] * 10, "value": 2}) @@ -1321,7 +1396,7 @@ def test_deprecate_numeric_only( assert "b" not in result.columns elif ( # kernels that work on any dtype and have numeric_only arg - kernel in ("first", "last", "corrwith") + kernel in ("first", "last") or ( # kernels that work on any dtype and don't have numeric_only arg kernel in ("any", "all", "bfill", "ffill", "fillna", "nth", "nunique") @@ -1339,7 +1414,8 @@ def test_deprecate_numeric_only( "(not allowed for this dtype" "|must be a string or a number" "|cannot be performed against 'object' dtypes" - "|must be a string or a real number)" + "|must be a string or a real number" + "|unsupported operand type)" ) with pytest.raises(TypeError, match=msg): method(*args, **kwargs) @@ -1356,6 +1432,132 @@ def test_deprecate_numeric_only( method(*args, **kwargs) +@pytest.mark.parametrize("dtype", [bool, int, float, object]) +def test_deprecate_numeric_only_series(dtype, groupby_func, request): + # GH#46560 + if groupby_func in ("backfill", "mad", "pad", "tshift"): + pytest.skip("method is deprecated") + elif groupby_func == "corrwith": + msg = "corrwith is not implemented on SeriesGroupBy" + 
request.node.add_marker(pytest.mark.xfail(reason=msg)) + + grouper = [0, 0, 1] + + ser = Series([1, 0, 0], dtype=dtype) + gb = ser.groupby(grouper) + method = getattr(gb, groupby_func) + + expected_ser = Series([1, 0, 0]) + expected_gb = expected_ser.groupby(grouper) + expected_method = getattr(expected_gb, groupby_func) + + if groupby_func == "corrwith": + args = (ser,) + elif groupby_func == "corr": + args = (ser,) + elif groupby_func == "cov": + args = (ser,) + elif groupby_func == "nth": + args = (0,) + elif groupby_func == "fillna": + args = (True,) + elif groupby_func == "take": + args = ([0],) + elif groupby_func == "quantile": + args = (0.5,) + else: + args = () + + fails_on_numeric_object = ( + "corr", + "cov", + "cummax", + "cummin", + "cumprod", + "cumsum", + "idxmax", + "idxmin", + "quantile", + ) + # ops that give an object result on object input + obj_result = ( + "first", + "last", + "nth", + "bfill", + "ffill", + "shift", + "sum", + "diff", + "pct_change", + ) + + # Test default behavior; kernels that fail may be enabled in the future but kernels + # that succeed should not be allowed to fail (without deprecation, at least) + if groupby_func in fails_on_numeric_object and dtype is object: + if groupby_func in ("idxmax", "idxmin"): + msg = "not allowed for this dtype" + elif groupby_func == "quantile": + msg = "cannot be performed against 'object' dtypes" + else: + msg = "is not supported for object dtype" + with pytest.raises(TypeError, match=msg): + method(*args) + elif dtype is object: + result = method(*args) + expected = expected_method(*args) + if groupby_func in obj_result: + expected = expected.astype(object) + tm.assert_series_equal(result, expected) + + has_numeric_only = ( + "first", + "last", + "max", + "mean", + "median", + "min", + "prod", + "quantile", + "sem", + "skew", + "std", + "sum", + "var", + "cummax", + "cummin", + "cumprod", + "cumsum", + ) + if groupby_func not in has_numeric_only: + msg = "got an unexpected keyword argument 'numeric_only'" + with pytest.raises(TypeError, match=msg): + method(*args, numeric_only=True) + elif dtype is object: + err_category = NotImplementedError + err_msg = f"{groupby_func} does not implement numeric_only" + if groupby_func.startswith("cum"): + # cum ops already exhibit future behavior + warn_category = None + warn_msg = "" + err_category = TypeError + err_msg = f"{groupby_func} is not supported for object dtype" + elif groupby_func == "skew": + warn_category = FutureWarning + warn_msg = "will raise a TypeError in the future" + else: + warn_category = FutureWarning + warn_msg = "This will raise a TypeError" + + with tm.assert_produces_warning(warn_category, match=warn_msg): + with pytest.raises(err_category, match=err_msg): + method(*args, numeric_only=True) + else: + result = method(*args, numeric_only=True) + expected = method(*args, numeric_only=False) + tm.assert_series_equal(result, expected) + + @pytest.mark.parametrize("dtype", [int, float, object]) @pytest.mark.parametrize( "kwargs", @@ -1379,3 +1581,15 @@ def test_groupby_empty_dataset(dtype, kwargs): expected = df.groupby("A").B.describe(**kwargs).reset_index(drop=True).iloc[:0] expected.index = Index([]) tm.assert_frame_equal(result, expected) + + +def test_corrwith_with_1_axis(): + # GH 47723 + df = DataFrame({"a": [1, 1, 2], "b": [3, 7, 4]}) + result = df.groupby("a").corrwith(df, axis=1) + index = Index( + data=[(1, 0), (1, 1), (1, 2), (2, 2), (2, 0), (2, 1)], + name=("a", None), + ) + expected = Series([np.nan] * 6, index=index) + 
tm.assert_series_equal(result, expected) diff --git a/pandas/tests/groupby/test_groupby.py b/pandas/tests/groupby/test_groupby.py index 97e616ef14cef..920b869ef799b 100644 --- a/pandas/tests/groupby/test_groupby.py +++ b/pandas/tests/groupby/test_groupby.py @@ -2776,3 +2776,22 @@ def test_by_column_values_with_same_starting_value(): ).set_index("Name") tm.assert_frame_equal(result, expected_result) + + +def test_groupby_none_in_first_mi_level(): + # GH#47348 + arr = [[None, 1, 0, 1], [2, 3, 2, 3]] + ser = Series(1, index=MultiIndex.from_arrays(arr, names=["a", "b"])) + result = ser.groupby(level=[0, 1]).sum() + expected = Series( + [1, 2], MultiIndex.from_tuples([(0.0, 2), (1.0, 3)], names=["a", "b"]) + ) + tm.assert_series_equal(result, expected) + + +def test_groupby_none_column_name(): + # GH#47348 + df = DataFrame({None: [1, 1, 2, 2], "b": [1, 1, 2, 3], "c": [4, 5, 6, 7]}) + result = df.groupby(by=[None]).sum() + expected = DataFrame({"b": [2, 5], "c": [9, 13]}, index=Index([1, 2], name=None)) + tm.assert_frame_equal(result, expected) diff --git a/pandas/tests/groupby/test_groupby_dropna.py b/pandas/tests/groupby/test_groupby_dropna.py index ca55263146db3..515c96780e731 100644 --- a/pandas/tests/groupby/test_groupby_dropna.py +++ b/pandas/tests/groupby/test_groupby_dropna.py @@ -378,3 +378,12 @@ def test_groupby_nan_included(): tm.assert_numpy_array_equal(result_values, expected_values) assert np.isnan(list(result.keys())[2]) assert list(result.keys())[0:2] == ["g1", "g2"] + + +def test_groupby_drop_nan_with_multi_index(): + # GH 39895 + df = pd.DataFrame([[np.nan, 0, 1]], columns=["a", "b", "c"]) + df = df.set_index(["a", "b"]) + result = df.groupby(["a", "b"], dropna=False).first() + expected = df + tm.assert_frame_equal(result, expected) diff --git a/pandas/tests/groupby/test_quantile.py b/pandas/tests/groupby/test_quantile.py index 20328426a69b2..2b7e71d9619a4 100644 --- a/pandas/tests/groupby/test_quantile.py +++ b/pandas/tests/groupby/test_quantile.py @@ -343,3 +343,38 @@ def test_columns_groupby_quantile(): ) tm.assert_frame_equal(result, expected) + + +def test_timestamp_groupby_quantile(): + # GH 33168 + df = DataFrame( + { + "timestamp": pd.date_range( + start="2020-04-19 00:00:00", freq="1T", periods=100, tz="UTC" + ).floor("1H"), + "category": list(range(1, 101)), + "value": list(range(101, 201)), + } + ) + + result = df.groupby("timestamp").quantile([0.2, 0.8]) + + expected = DataFrame( + [ + {"category": 12.8, "value": 112.8}, + {"category": 48.2, "value": 148.2}, + {"category": 68.8, "value": 168.8}, + {"category": 92.2, "value": 192.2}, + ], + index=pd.MultiIndex.from_tuples( + [ + (pd.Timestamp("2020-04-19 00:00:00+00:00"), 0.2), + (pd.Timestamp("2020-04-19 00:00:00+00:00"), 0.8), + (pd.Timestamp("2020-04-19 01:00:00+00:00"), 0.2), + (pd.Timestamp("2020-04-19 01:00:00+00:00"), 0.8), + ], + names=("timestamp", None), + ), + ) + + tm.assert_frame_equal(result, expected) diff --git a/pandas/tests/indexes/datetimes/methods/test_snap.py b/pandas/tests/indexes/datetimes/methods/test_snap.py index e591441c4f148..a94d00d919082 100644 --- a/pandas/tests/indexes/datetimes/methods/test_snap.py +++ b/pandas/tests/indexes/datetimes/methods/test_snap.py @@ -7,10 +7,30 @@ import pandas._testing as tm +def astype_non_nano(dti_nano, unit): + # TODO(2.0): remove once DTI/DTA.astype supports non-nano + if unit == "ns": + return dti_nano + + dta_nano = dti_nano._data + arr_nano = dta_nano._ndarray + + arr = arr_nano.astype(f"M8[{unit}]") + if dti_nano.tz is None: + dtype = arr.dtype + 
else: + dtype = type(dti_nano.dtype)(tz=dti_nano.tz, unit=unit) + dta = type(dta_nano)._simple_new(arr, dtype=dtype) + dti = DatetimeIndex(dta, name=dti_nano.name) + assert dti.dtype == dtype + return dti + + @pytest.mark.filterwarnings("ignore::DeprecationWarning") @pytest.mark.parametrize("tz", [None, "Asia/Shanghai", "Europe/Berlin"]) @pytest.mark.parametrize("name", [None, "my_dti"]) -def test_dti_snap(name, tz): +@pytest.mark.parametrize("unit", ["ns", "us", "ms", "s"]) +def test_dti_snap(name, tz, unit): dti = DatetimeIndex( [ "1/1/2002", @@ -25,10 +45,12 @@ def test_dti_snap(name, tz): tz=tz, freq="D", ) + dti = astype_non_nano(dti, unit) result = dti.snap(freq="W-MON") expected = date_range("12/31/2001", "1/7/2002", name=name, tz=tz, freq="w-mon") expected = expected.repeat([3, 4]) + expected = astype_non_nano(expected, unit) tm.assert_index_equal(result, expected) assert result.tz == expected.tz assert result.freq is None @@ -38,6 +60,7 @@ def test_dti_snap(name, tz): expected = date_range("1/1/2002", "1/7/2002", name=name, tz=tz, freq="b") expected = expected.repeat([1, 1, 1, 2, 2]) + expected = astype_non_nano(expected, unit) tm.assert_index_equal(result, expected) assert result.tz == expected.tz assert result.freq is None diff --git a/pandas/tests/indexes/datetimes/test_indexing.py b/pandas/tests/indexes/datetimes/test_indexing.py index b8f72a8c1f988..a203fee5b3a61 100644 --- a/pandas/tests/indexes/datetimes/test_indexing.py +++ b/pandas/tests/indexes/datetimes/test_indexing.py @@ -777,3 +777,32 @@ def test_indexer_between_time(self): msg = r"Cannot convert arg \[datetime\.datetime\(2010, 1, 2, 1, 0\)\] to a time" with pytest.raises(ValueError, match=msg): rng.indexer_between_time(datetime(2010, 1, 2, 1), datetime(2010, 1, 2, 5)) + + @pytest.mark.parametrize("unit", ["us", "ms", "s"]) + def test_indexer_between_time_non_nano(self, unit): + # For simple cases like this, the non-nano indexer_between_time + # should match the nano result + + rng = date_range("1/1/2000", "1/5/2000", freq="5min") + arr_nano = rng._data._ndarray + + arr = arr_nano.astype(f"M8[{unit}]") + + dta = type(rng._data)._simple_new(arr, dtype=arr.dtype) + dti = DatetimeIndex(dta) + assert dti.dtype == arr.dtype + + tic = time(1, 25) + toc = time(2, 29) + + result = dti.indexer_between_time(tic, toc) + expected = rng.indexer_between_time(tic, toc) + tm.assert_numpy_array_equal(result, expected) + + # case with non-zero micros in arguments + tic = time(1, 25, 0, 45678) + toc = time(2, 29, 0, 1234) + + result = dti.indexer_between_time(tic, toc) + expected = rng.indexer_between_time(tic, toc) + tm.assert_numpy_array_equal(result, expected) diff --git a/pandas/tests/indexes/interval/test_constructors.py b/pandas/tests/indexes/interval/test_constructors.py index b57bcf7abc1e1..a23f66d241cd9 100644 --- a/pandas/tests/indexes/interval/test_constructors.py +++ b/pandas/tests/indexes/interval/test_constructors.py @@ -61,6 +61,16 @@ def test_constructor(self, constructor, breaks, closed, name): tm.assert_index_equal(result.left, Index(breaks[:-1])) tm.assert_index_equal(result.right, Index(breaks[1:])) + def test_constructor_inclusive_default(self, constructor, name): + result_kwargs = self.get_kwargs_from_breaks([3, 14, 15, 92, 653]) + inclusive_in = result_kwargs.pop("inclusive", None) + result = constructor(name=name, **result_kwargs) + + if inclusive_in is not None: + result_kwargs["inclusive"] = "right" + expected = constructor(name=name, **result_kwargs) + tm.assert_index_equal(result, expected) + 
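[Editor's note, not part of the patch: an illustrative sketch of the default that test_constructor_inclusive_default above pins down, i.e. the IntervalIndex constructors fall back to right-inclusive intervals when no inclusive argument is given. The breaks reuse the test's values; the assertions are illustrative.]

import pandas as pd

idx = pd.IntervalIndex.from_breaks([3, 14, 15, 92, 653])
# Equivalent to passing inclusive="right": each interval is open on the left
# and closed on the right, so the first element is (3, 14].
assert idx[0].right == 14 and 14 in idx[0] and 3 not in idx[0]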
@pytest.mark.parametrize( "breaks, subtype", [ @@ -78,7 +88,7 @@ def test_constructor_dtype(self, constructor, breaks, subtype): expected = constructor(**expected_kwargs) result_kwargs = self.get_kwargs_from_breaks(breaks) - iv_dtype = IntervalDtype(subtype, "both") + iv_dtype = IntervalDtype(subtype, "right") for dtype in (iv_dtype, str(iv_dtype)): result = constructor(dtype=dtype, **result_kwargs) tm.assert_index_equal(result, expected) @@ -94,8 +104,8 @@ def test_constructor_dtype(self, constructor, breaks, subtype): timedelta_range("1 day", periods=5), ], ) - def test_constructor_pass_closed(self, constructor, breaks): - # not passing closed to IntervalDtype, but to IntervalArray constructor + def test_constructor_pass_inclusive(self, constructor, breaks): + # not passing inclusive to IntervalDtype, but to IntervalArray constructor warn = None if isinstance(constructor, partial) and constructor.func is Index: # passing kwargs to Index is deprecated @@ -183,7 +193,7 @@ def test_generic_errors(self, constructor): # filler input data to be used when supplying invalid kwargs filler = self.get_kwargs_from_breaks(range(10)) - # invalid closed + # invalid inclusive msg = "inclusive must be one of 'right', 'left', 'both', 'neither'" with pytest.raises(ValueError, match=msg): constructor(inclusive="invalid", **filler) @@ -219,7 +229,7 @@ class TestFromArrays(ConstructorTests): def constructor(self): return IntervalIndex.from_arrays - def get_kwargs_from_breaks(self, breaks, inclusive="both"): + def get_kwargs_from_breaks(self, breaks, inclusive="right"): """ converts intervals in breaks format to a dictionary of kwargs to specific to the format expected by IntervalIndex.from_arrays @@ -268,7 +278,7 @@ class TestFromBreaks(ConstructorTests): def constructor(self): return IntervalIndex.from_breaks - def get_kwargs_from_breaks(self, breaks, inclusive="both"): + def get_kwargs_from_breaks(self, breaks, inclusive="right"): """ converts intervals in breaks format to a dictionary of kwargs to specific to the format expected by IntervalIndex.from_breaks @@ -306,7 +316,7 @@ class TestFromTuples(ConstructorTests): def constructor(self): return IntervalIndex.from_tuples - def get_kwargs_from_breaks(self, breaks, inclusive="both"): + def get_kwargs_from_breaks(self, breaks, inclusive="right"): """ converts intervals in breaks format to a dictionary of kwargs to specific to the format expected by IntervalIndex.from_tuples @@ -356,7 +366,7 @@ class TestClassConstructors(ConstructorTests): def constructor(self, request): return request.param - def get_kwargs_from_breaks(self, breaks, inclusive="both"): + def get_kwargs_from_breaks(self, breaks, inclusive="right"): """ converts intervals in breaks format to a dictionary of kwargs to specific to the format expected by the IntervalIndex/Index constructors @@ -389,9 +399,9 @@ def test_constructor_string(self): pass def test_constructor_errors(self, constructor): - # mismatched closed within intervals with no constructor override + # mismatched inclusive within intervals with no constructor override ivs = [Interval(0, 1, inclusive="right"), Interval(2, 3, inclusive="left")] - msg = "intervals must all be closed on the same side" + msg = "intervals must all be inclusive on the same side" with pytest.raises(ValueError, match=msg): constructor(ivs) @@ -410,7 +420,7 @@ def test_constructor_errors(self, constructor): @pytest.mark.filterwarnings("ignore:Passing keywords other:FutureWarning") @pytest.mark.parametrize( - "data, closed", + "data, inclusive", [ ([], 
"both"), ([np.nan, np.nan], "neither"), @@ -428,14 +438,14 @@ def test_constructor_errors(self, constructor): (IntervalIndex.from_breaks(range(5), inclusive="both"), "right"), ], ) - def test_override_inferred_closed(self, constructor, data, closed): + def test_override_inferred_inclusive(self, constructor, data, inclusive): # GH 19370 if isinstance(data, IntervalIndex): tuples = data.to_tuples() else: tuples = [(iv.left, iv.right) if notna(iv) else iv for iv in data] - expected = IntervalIndex.from_tuples(tuples, inclusive=closed) - result = constructor(data, inclusive=closed) + expected = IntervalIndex.from_tuples(tuples, inclusive=inclusive) + result = constructor(data, inclusive=inclusive) tm.assert_index_equal(result, expected) @pytest.mark.parametrize( @@ -450,7 +460,7 @@ def test_index_object_dtype(self, values_constructor): assert type(result) is Index tm.assert_numpy_array_equal(result.values, np.array(values)) - def test_index_mixed_closed(self): + def test_index_mixed_inclusive(self): # GH27172 intervals = [ Interval(0, 1, inclusive="left"), @@ -463,8 +473,8 @@ def test_index_mixed_closed(self): tm.assert_index_equal(result, expected) -def test_dtype_closed_mismatch(): - # GH#38394 closed specified in both dtype and IntervalIndex constructor +def test_dtype_inclusive_mismatch(): + # GH#38394 dtype = IntervalDtype(np.int64, "left") diff --git a/pandas/tests/indexes/interval/test_indexing.py b/pandas/tests/indexes/interval/test_indexing.py index 4cf754a7e52e0..e05cb73cfe446 100644 --- a/pandas/tests/indexes/interval/test_indexing.py +++ b/pandas/tests/indexes/interval/test_indexing.py @@ -76,12 +76,12 @@ def test_get_loc_length_one_scalar(self, scalar, closed): with pytest.raises(KeyError, match=str(scalar)): index.get_loc(scalar) - @pytest.mark.parametrize("other_closed", ["left", "right", "both", "neither"]) + @pytest.mark.parametrize("other_inclusive", ["left", "right", "both", "neither"]) @pytest.mark.parametrize("left, right", [(0, 5), (-1, 4), (-1, 6), (6, 7)]) - def test_get_loc_length_one_interval(self, left, right, closed, other_closed): + def test_get_loc_length_one_interval(self, left, right, closed, other_inclusive): # GH 20921 index = IntervalIndex.from_tuples([(0, 5)], inclusive=closed) - interval = Interval(left, right, inclusive=other_closed) + interval = Interval(left, right, inclusive=other_inclusive) if interval == index[0]: result = index.get_loc(interval) assert result == 0 @@ -89,7 +89,7 @@ def test_get_loc_length_one_interval(self, left, right, closed, other_closed): with pytest.raises( KeyError, match=re.escape( - f"Interval({left}, {right}, inclusive='{other_closed}')" + f"Interval({left}, {right}, inclusive='{other_inclusive}')" ), ): index.get_loc(interval) diff --git a/pandas/tests/indexes/interval/test_interval.py b/pandas/tests/indexes/interval/test_interval.py index 4e33c3abd3252..5bf29093152d8 100644 --- a/pandas/tests/indexes/interval/test_interval.py +++ b/pandas/tests/indexes/interval/test_interval.py @@ -871,21 +871,21 @@ def test_nbytes(self): expected = 64 # 4 * 8 * 2 assert result == expected - @pytest.mark.parametrize("new_closed", ["left", "right", "both", "neither"]) - def test_set_closed(self, name, closed, new_closed): + @pytest.mark.parametrize("new_inclusive", ["left", "right", "both", "neither"]) + def test_set_inclusive(self, name, closed, new_inclusive): # GH 21670 index = interval_range(0, 5, inclusive=closed, name=name) - result = index.set_closed(new_closed) - expected = interval_range(0, 5, inclusive=new_closed, name=name) + 
result = index.set_inclusive(new_inclusive) + expected = interval_range(0, 5, inclusive=new_inclusive, name=name) tm.assert_index_equal(result, expected) @pytest.mark.parametrize("bad_inclusive", ["foo", 10, "LEFT", True, False]) - def test_set_closed_errors(self, bad_inclusive): + def test_set_inclusive_errors(self, bad_inclusive): # GH 21670 index = interval_range(0, 5) msg = f"invalid option for 'inclusive': {bad_inclusive}" with pytest.raises(ValueError, match=msg): - index.set_closed(bad_inclusive) + index.set_inclusive(bad_inclusive) def test_is_all_dates(self): # GH 23576 @@ -897,35 +897,31 @@ def test_is_all_dates(self): def test_interval_index_error_and_warning(self): # GH 40245 - msg = ( - "Deprecated argument `closed` cannot " - "be passed if argument `inclusive` is not None" - ) - with pytest.raises(ValueError, match=msg): - IntervalIndex.from_breaks(range(11), closed="both", inclusive="both") + msg = "Can only specify 'closed' or 'inclusive', not both." + msg_warn = "the 'closed'' keyword is deprecated, use 'inclusive' instead." + with pytest.raises(TypeError, match=msg): + with tm.assert_produces_warning(FutureWarning, match=msg_warn): + IntervalIndex.from_breaks(range(11), closed="both", inclusive="both") - with pytest.raises(ValueError, match=msg): - IntervalIndex.from_arrays([0, 1], [1, 2], closed="both", inclusive="both") + with pytest.raises(TypeError, match=msg): + with tm.assert_produces_warning(FutureWarning, match=msg_warn): + IntervalIndex.from_arrays( + [0, 1], [1, 2], closed="both", inclusive="both" + ) - with pytest.raises(ValueError, match=msg): - IntervalIndex.from_tuples( - [(0, 1), (0.5, 1.5)], closed="both", inclusive="both" - ) + with pytest.raises(TypeError, match=msg): + with tm.assert_produces_warning(FutureWarning, match=msg_warn): + IntervalIndex.from_tuples( + [(0, 1), (0.5, 1.5)], closed="both", inclusive="both" + ) - msg = "Argument `closed` is deprecated in favor of `inclusive`" - with tm.assert_produces_warning( - FutureWarning, match=msg, check_stacklevel=False - ): + with tm.assert_produces_warning(FutureWarning, match=msg_warn): IntervalIndex.from_breaks(range(11), closed="both") - with tm.assert_produces_warning( - FutureWarning, match=msg, check_stacklevel=False - ): + with tm.assert_produces_warning(FutureWarning, match=msg_warn): IntervalIndex.from_arrays([0, 1], [1, 2], closed="both") - with tm.assert_produces_warning( - FutureWarning, match=msg, check_stacklevel=False - ): + with tm.assert_produces_warning(FutureWarning, match=msg_warn): IntervalIndex.from_tuples([(0, 1), (0.5, 1.5)], closed="both") @@ -955,3 +951,9 @@ def test_searchsorted_invalid_argument(arg): msg = "'<' not supported between instances of 'pandas._libs.interval.Interval' and " with pytest.raises(TypeError, match=msg): values.searchsorted(arg) + + +def test_interval_range_deprecated_closed(): + # GH#40245 + with tm.assert_produces_warning(FutureWarning): + interval_range(start=0, end=5, closed="right") diff --git a/pandas/tests/indexes/interval/test_interval_range.py b/pandas/tests/indexes/interval/test_interval_range.py index 255470cf4683e..3bde2f51178dc 100644 --- a/pandas/tests/indexes/interval/test_interval_range.py +++ b/pandas/tests/indexes/interval/test_interval_range.py @@ -360,13 +360,13 @@ def test_errors(self): def test_interval_range_error_and_warning(self): # GH 40245 - msg = ( - "Deprecated argument `closed` cannot " - "be passed if argument `inclusive` is not None" - ) - with pytest.raises(ValueError, match=msg): - interval_range(end=5, periods=4, 
closed="both", inclusive="both") + msg = "Can only specify 'closed' or 'inclusive', not both." + msg_warn = "the 'closed'' keyword is deprecated, use 'inclusive' instead." + + with pytest.raises(TypeError, match=msg): + with tm.assert_produces_warning(FutureWarning, match=msg_warn): + interval_range(end=5, periods=4, closed="both", inclusive="both") - msg = "Argument `closed` is deprecated in favor of `inclusive`" - with tm.assert_produces_warning(FutureWarning, match=msg): + msg = "the 'closed'' keyword is deprecated, use 'inclusive' instead." + with tm.assert_produces_warning(FutureWarning, match=msg_warn): interval_range(end=5, periods=4, closed="right") diff --git a/pandas/tests/indexes/interval/test_interval_tree.py b/pandas/tests/indexes/interval/test_interval_tree.py index 06c499b9e33f4..6c30d16e61582 100644 --- a/pandas/tests/indexes/interval/test_interval_tree.py +++ b/pandas/tests/indexes/interval/test_interval_tree.py @@ -190,24 +190,6 @@ def test_construction_overflow(self): expected = (50 + np.iinfo(np.int64).max) / 2 assert result == expected - def test_interval_tree_error_and_warning(self): - # GH 40245 - - msg = ( - "Deprecated argument `closed` cannot " - "be passed if argument `inclusive` is not None" - ) - with pytest.raises(ValueError, match=msg): - left, right = np.arange(10), [np.iinfo(np.int64).max] * 10 - IntervalTree(left, right, closed="both", inclusive="both") - - msg = "Argument `closed` is deprecated in favor of `inclusive`" - with tm.assert_produces_warning( - FutureWarning, match=msg, check_stacklevel=False - ): - left, right = np.arange(10), [np.iinfo(np.int64).max] * 10 - IntervalTree(left, right, closed="both") - @pytest.mark.xfail(not IS64, reason="GH 23440") @pytest.mark.parametrize( "left, right, expected", diff --git a/pandas/tests/indexes/interval/test_pickle.py b/pandas/tests/indexes/interval/test_pickle.py index 7f5784b6d76b9..ef6db9c8a0513 100644 --- a/pandas/tests/indexes/interval/test_pickle.py +++ b/pandas/tests/indexes/interval/test_pickle.py @@ -1,13 +1,10 @@ -import pytest - from pandas import IntervalIndex import pandas._testing as tm class TestPickle: - @pytest.mark.parametrize("inclusive", ["left", "right", "both"]) - def test_pickle_round_trip_closed(self, inclusive): + def test_pickle_round_trip_inclusive(self, closed): # https://github.com/pandas-dev/pandas/issues/35658 - idx = IntervalIndex.from_tuples([(1, 2), (2, 3)], inclusive=inclusive) + idx = IntervalIndex.from_tuples([(1, 2), (2, 3)], inclusive=closed) result = tm.round_trip_pickle(idx) tm.assert_index_equal(result, idx) diff --git a/pandas/tests/indexes/interval/test_setops.py b/pandas/tests/indexes/interval/test_setops.py index 5933961cc0f9d..2e1f6f7925374 100644 --- a/pandas/tests/indexes/interval/test_setops.py +++ b/pandas/tests/indexes/interval/test_setops.py @@ -10,22 +10,22 @@ import pandas._testing as tm -def monotonic_index(start, end, dtype="int64", closed="right"): +def monotonic_index(start, end, dtype="int64", inclusive="right"): return IntervalIndex.from_breaks( - np.arange(start, end, dtype=dtype), inclusive=closed + np.arange(start, end, dtype=dtype), inclusive=inclusive ) -def empty_index(dtype="int64", closed="right"): - return IntervalIndex(np.array([], dtype=dtype), inclusive=closed) +def empty_index(dtype="int64", inclusive="right"): + return IntervalIndex(np.array([], dtype=dtype), inclusive=inclusive) class TestIntervalIndex: def test_union(self, closed, sort): - index = monotonic_index(0, 11, closed=closed) - other = 
monotonic_index(5, 13, closed=closed) + index = monotonic_index(0, 11, inclusive=closed) + other = monotonic_index(5, 13, inclusive=closed) - expected = monotonic_index(0, 13, closed=closed) + expected = monotonic_index(0, 13, inclusive=closed) result = index[::-1].union(other, sort=sort) if sort is None: tm.assert_index_equal(result, expected) @@ -41,12 +41,12 @@ def test_union(self, closed, sort): def test_union_empty_result(self, closed, sort): # GH 19101: empty result, same dtype - index = empty_index(dtype="int64", closed=closed) + index = empty_index(dtype="int64", inclusive=closed) result = index.union(index, sort=sort) tm.assert_index_equal(result, index) # GH 19101: empty result, different numeric dtypes -> common dtype is f8 - other = empty_index(dtype="float64", closed=closed) + other = empty_index(dtype="float64", inclusive=closed) result = index.union(other, sort=sort) expected = other tm.assert_index_equal(result, expected) @@ -54,7 +54,7 @@ def test_union_empty_result(self, closed, sort): other = index.union(index, sort=sort) tm.assert_index_equal(result, expected) - other = empty_index(dtype="uint64", closed=closed) + other = empty_index(dtype="uint64", inclusive=closed) result = index.union(other, sort=sort) tm.assert_index_equal(result, expected) @@ -62,10 +62,10 @@ def test_union_empty_result(self, closed, sort): tm.assert_index_equal(result, expected) def test_intersection(self, closed, sort): - index = monotonic_index(0, 11, closed=closed) - other = monotonic_index(5, 13, closed=closed) + index = monotonic_index(0, 11, inclusive=closed) + other = monotonic_index(5, 13, inclusive=closed) - expected = monotonic_index(5, 11, closed=closed) + expected = monotonic_index(5, 11, inclusive=closed) result = index[::-1].intersection(other, sort=sort) if sort is None: tm.assert_index_equal(result, expected) @@ -100,21 +100,21 @@ def test_intersection(self, closed, sort): tm.assert_index_equal(result, expected) def test_intersection_empty_result(self, closed, sort): - index = monotonic_index(0, 11, closed=closed) + index = monotonic_index(0, 11, inclusive=closed) # GH 19101: empty result, same dtype - other = monotonic_index(300, 314, closed=closed) - expected = empty_index(dtype="int64", closed=closed) + other = monotonic_index(300, 314, inclusive=closed) + expected = empty_index(dtype="int64", inclusive=closed) result = index.intersection(other, sort=sort) tm.assert_index_equal(result, expected) # GH 19101: empty result, different numeric dtypes -> common dtype is float64 - other = monotonic_index(300, 314, dtype="float64", closed=closed) + other = monotonic_index(300, 314, dtype="float64", inclusive=closed) result = index.intersection(other, sort=sort) expected = other[:0] tm.assert_index_equal(result, expected) - other = monotonic_index(300, 314, dtype="uint64", closed=closed) + other = monotonic_index(300, 314, dtype="uint64", inclusive=closed) result = index.intersection(other, sort=sort) tm.assert_index_equal(result, expected) @@ -136,7 +136,7 @@ def test_difference(self, closed, sort): # GH 19101: empty result, same dtype result = index.difference(index, sort=sort) - expected = empty_index(dtype="int64", closed=closed) + expected = empty_index(dtype="int64", inclusive=closed) tm.assert_index_equal(result, expected) # GH 19101: empty result, different dtypes @@ -147,7 +147,7 @@ def test_difference(self, closed, sort): tm.assert_index_equal(result, expected) def test_symmetric_difference(self, closed, sort): - index = monotonic_index(0, 11, closed=closed) + index = 
monotonic_index(0, 11, inclusive=closed) result = index[1:].symmetric_difference(index[:-1], sort=sort) expected = IntervalIndex([index[0], index[-1]]) if sort is None: @@ -156,7 +156,7 @@ def test_symmetric_difference(self, closed, sort): # GH 19101: empty result, same dtype result = index.symmetric_difference(index, sort=sort) - expected = empty_index(dtype="int64", closed=closed) + expected = empty_index(dtype="int64", inclusive=closed) if sort is None: tm.assert_index_equal(result, expected) assert tm.equalContents(result, expected) @@ -166,7 +166,7 @@ def test_symmetric_difference(self, closed, sort): index.left.astype("float64"), index.right, inclusive=closed ) result = index.symmetric_difference(other, sort=sort) - expected = empty_index(dtype="float64", closed=closed) + expected = empty_index(dtype="float64", inclusive=closed) tm.assert_index_equal(result, expected) @pytest.mark.filterwarnings("ignore:'<' not supported between:RuntimeWarning") @@ -174,7 +174,7 @@ def test_symmetric_difference(self, closed, sort): "op_name", ["union", "intersection", "difference", "symmetric_difference"] ) def test_set_incompatible_types(self, closed, op_name, sort): - index = monotonic_index(0, 11, closed=closed) + index = monotonic_index(0, 11, inclusive=closed) set_op = getattr(index, op_name) # TODO: standardize return type of non-union setops type(self vs other) @@ -187,8 +187,8 @@ def test_set_incompatible_types(self, closed, op_name, sort): tm.assert_index_equal(result, expected) # mixed closed -> cast to object - for other_closed in {"right", "left", "both", "neither"} - {closed}: - other = monotonic_index(0, 11, closed=other_closed) + for other_inclusive in {"right", "left", "both", "neither"} - {closed}: + other = monotonic_index(0, 11, inclusive=other_inclusive) expected = getattr(index.astype(object), op_name)(other, sort=sort) if op_name == "difference": expected = index diff --git a/pandas/tests/indexes/multi/test_constructors.py b/pandas/tests/indexes/multi/test_constructors.py index 63b0bd235e57c..7fad59fc6654c 100644 --- a/pandas/tests/indexes/multi/test_constructors.py +++ b/pandas/tests/indexes/multi/test_constructors.py @@ -827,3 +827,13 @@ def test_multiindex_inference_consistency(): mi = MultiIndex.from_tuples([(x,) for x in arr]) lev = mi.levels[0] assert lev.dtype == object + + +def test_dtype_representation(): + # GH#46900 + pmidx = MultiIndex.from_arrays([[1], ["a"]], names=[("a", "b"), ("c", "d")]) + result = pmidx.dtypes + expected = Series( + ["int64", "object"], index=MultiIndex.from_tuples([("a", "b"), ("c", "d")]) + ) + tm.assert_series_equal(result, expected) diff --git a/pandas/tests/indexes/multi/test_setops.py b/pandas/tests/indexes/multi/test_setops.py index 5d1efea657426..39b5e0ffc526c 100644 --- a/pandas/tests/indexes/multi/test_setops.py +++ b/pandas/tests/indexes/multi/test_setops.py @@ -540,7 +540,36 @@ def test_union_duplicates(index, request): mi1 = MultiIndex.from_arrays([values, [1] * len(values)]) mi2 = MultiIndex.from_arrays([[values[0]] + values, [1] * (len(values) + 1)]) result = mi1.union(mi2) - tm.assert_index_equal(result, mi2.sort_values()) + expected = mi2.sort_values() + if mi2.levels[0].dtype == np.uint64 and (mi2.get_level_values(0) < 2**63).all(): + # GH#47294 - union uses lib.fast_zip, converting data to Python integers + # and loses type information. Result is then unsigned only when values are + # sufficiently large to require unsigned dtype. 
+ expected = expected.set_levels( + [expected.levels[0].astype(int), expected.levels[1]] + ) + tm.assert_index_equal(result, expected) result = mi2.union(mi1) - tm.assert_index_equal(result, mi2.sort_values()) + tm.assert_index_equal(result, expected) + + +@pytest.mark.parametrize( + "levels1, levels2, codes1, codes2, names", + [ + ( + [["a", "b", "c"], [0, ""]], + [["c", "d", "b"], [""]], + [[0, 1, 2], [1, 1, 1]], + [[0, 1, 2], [0, 0, 0]], + ["name1", "name2"], + ), + ], +) +def test_intersection_lexsort_depth(levels1, levels2, codes1, codes2, names): + # GH#25169 + mi1 = MultiIndex(levels=levels1, codes=codes1, names=names) + mi2 = MultiIndex(levels=levels2, codes=codes2, names=names) + mi_int = mi1.intersection(mi2) + + assert mi_int.lexsort_depth == 0 diff --git a/pandas/tests/indexes/numeric/test_numeric.py b/pandas/tests/indexes/numeric/test_numeric.py index 7d2bcdf20c795..23262cb2eb768 100644 --- a/pandas/tests/indexes/numeric/test_numeric.py +++ b/pandas/tests/indexes/numeric/test_numeric.py @@ -509,6 +509,20 @@ def test_constructor_coercion_signed_to_unsigned( with pytest.raises(OverflowError, match=msg): Index([-1], dtype=any_unsigned_int_numpy_dtype) + def test_constructor_np_signed(self, any_signed_int_numpy_dtype): + # GH#47475 + scalar = np.dtype(any_signed_int_numpy_dtype).type(1) + result = Index([scalar]) + expected = Int64Index([1]) + tm.assert_index_equal(result, expected) + + def test_constructor_np_unsigned(self, any_unsigned_int_numpy_dtype): + # GH#47475 + scalar = np.dtype(any_unsigned_int_numpy_dtype).type(1) + result = Index([scalar]) + expected = UInt64Index([1]) + tm.assert_index_equal(result, expected) + def test_coerce_list(self): # coerce things arr = Index([1, 2, 3, 4]) diff --git a/pandas/tests/indexes/period/test_constructors.py b/pandas/tests/indexes/period/test_constructors.py index fdc286ef7ec1a..5dff5c2ad9c86 100644 --- a/pandas/tests/indexes/period/test_constructors.py +++ b/pandas/tests/indexes/period/test_constructors.py @@ -183,9 +183,10 @@ def test_constructor_datetime64arr(self): vals = np.arange(100000, 100000 + 10000, 100, dtype=np.int64) vals = vals.view(np.dtype("M8[us]")) - msg = r"Wrong dtype: datetime64\[us\]" - with pytest.raises(ValueError, match=msg): - PeriodIndex(vals, freq="D") + pi = PeriodIndex(vals, freq="D") + + expected = PeriodIndex(vals.astype("M8[ns]"), freq="D") + tm.assert_index_equal(pi, expected) @pytest.mark.parametrize("box", [None, "series", "index"]) def test_constructor_datetime64arr_ok(self, box): diff --git a/pandas/tests/indexes/ranges/test_setops.py b/pandas/tests/indexes/ranges/test_setops.py index 2942010af2720..71bd2f5590b8f 100644 --- a/pandas/tests/indexes/ranges/test_setops.py +++ b/pandas/tests/indexes/ranges/test_setops.py @@ -145,8 +145,9 @@ def test_union_noncomparable(self, sort): expected = Index(np.concatenate((other, index))) tm.assert_index_equal(result, expected) - @pytest.fixture( - params=[ + @pytest.mark.parametrize( + "idx1, idx2, expected_sorted, expected_notsorted", + [ ( RangeIndex(0, 10, 1), RangeIndex(0, 10, 1), @@ -157,13 +158,13 @@ def test_union_noncomparable(self, sort): RangeIndex(0, 10, 1), RangeIndex(5, 20, 1), RangeIndex(0, 20, 1), - Int64Index(range(20)), + RangeIndex(0, 20, 1), ), ( RangeIndex(0, 10, 1), RangeIndex(10, 20, 1), RangeIndex(0, 20, 1), - Int64Index(range(20)), + RangeIndex(0, 20, 1), ), ( RangeIndex(0, -10, -1), @@ -175,7 +176,7 @@ def test_union_noncomparable(self, sort): RangeIndex(0, -10, -1), RangeIndex(-10, -20, -1), RangeIndex(-19, 1, 1), - Int64Index(range(0, 
-20, -1)), + RangeIndex(0, -20, -1), ), ( RangeIndex(0, 10, 2), @@ -205,7 +206,7 @@ def test_union_noncomparable(self, sort): RangeIndex(0, 100, 5), RangeIndex(0, 100, 20), RangeIndex(0, 100, 5), - Int64Index(range(0, 100, 5)), + RangeIndex(0, 100, 5), ), ( RangeIndex(0, -100, -5), @@ -230,7 +231,7 @@ def test_union_noncomparable(self, sort): RangeIndex(0, 100, 2), RangeIndex(100, 150, 200), RangeIndex(0, 102, 2), - Int64Index(range(0, 102, 2)), + RangeIndex(0, 102, 2), ), ( RangeIndex(0, -100, -2), @@ -242,13 +243,13 @@ def test_union_noncomparable(self, sort): RangeIndex(0, -100, -1), RangeIndex(0, -50, -3), RangeIndex(-99, 1, 1), - Int64Index(list(range(0, -100, -1))), + RangeIndex(0, -100, -1), ), ( RangeIndex(0, 1, 1), RangeIndex(5, 6, 10), RangeIndex(0, 6, 5), - Int64Index([0, 5]), + RangeIndex(0, 10, 5), ), ( RangeIndex(0, 10, 5), @@ -274,16 +275,17 @@ def test_union_noncomparable(self, sort): Int64Index([1, 5, 6]), Int64Index([1, 5, 6]), ), - ] + # GH 43885 + ( + RangeIndex(0, 10), + RangeIndex(0, 5), + RangeIndex(0, 10), + RangeIndex(0, 10), + ), + ], + ids=lambda x: repr(x) if isinstance(x, RangeIndex) else x, ) - def unions(self, request): - """Inputs and expected outputs for RangeIndex.union tests""" - return request.param - - def test_union_sorted(self, unions): - - idx1, idx2, expected_sorted, expected_notsorted = unions - + def test_union_sorted(self, idx1, idx2, expected_sorted, expected_notsorted): res1 = idx1.union(idx2, sort=None) tm.assert_index_equal(res1, expected_sorted, exact=True) diff --git a/pandas/tests/indexes/test_base.py b/pandas/tests/indexes/test_base.py index 943cc945995a1..5d7fc23feb5a8 100644 --- a/pandas/tests/indexes/test_base.py +++ b/pandas/tests/indexes/test_base.py @@ -2,6 +2,7 @@ from datetime import datetime from io import StringIO import math +import operator import re import numpy as np @@ -1604,3 +1605,16 @@ def test_get_attributes_dict_deprecated(): with tm.assert_produces_warning(DeprecationWarning): attrs = idx._get_attributes_dict() assert attrs == {"name": None} + + +@pytest.mark.parametrize("op", [operator.lt, operator.gt]) +def test_nan_comparison_same_object(op): + # GH#47105 + idx = Index([np.nan]) + expected = np.array([False]) + + result = op(idx, idx) + tm.assert_numpy_array_equal(result, expected) + + result = op(idx, idx.copy()) + tm.assert_numpy_array_equal(result, expected) diff --git a/pandas/tests/indexes/timedeltas/test_indexing.py b/pandas/tests/indexes/timedeltas/test_indexing.py index b618f12e9f6c9..154a6289dfc00 100644 --- a/pandas/tests/indexes/timedeltas/test_indexing.py +++ b/pandas/tests/indexes/timedeltas/test_indexing.py @@ -14,6 +14,7 @@ TimedeltaIndex, Timestamp, notna, + offsets, timedelta_range, to_timedelta, ) @@ -346,3 +347,14 @@ def test_contains_nonunique(self): ): idx = TimedeltaIndex(vals) assert idx[0] in idx + + def test_contains(self): + # Checking for any NaT-like objects + # GH#13603 + td = to_timedelta(range(5), unit="d") + offsets.Hour(1) + for v in [NaT, None, float("nan"), np.nan]: + assert not (v in td) + + td = to_timedelta([NaT]) + for v in [NaT, None, float("nan"), np.nan]: + assert v in td diff --git a/pandas/tests/indexing/multiindex/test_loc.py b/pandas/tests/indexing/multiindex/test_loc.py index 19dfe20a3a68d..d4354766a203b 100644 --- a/pandas/tests/indexing/multiindex/test_loc.py +++ b/pandas/tests/indexing/multiindex/test_loc.py @@ -1,7 +1,10 @@ import numpy as np import pytest -from pandas.errors import PerformanceWarning +from pandas.errors import ( + IndexingError, + 
PerformanceWarning, +) import pandas as pd from pandas import ( @@ -11,7 +14,6 @@ Series, ) import pandas._testing as tm -from pandas.core.indexing import IndexingError @pytest.fixture diff --git a/pandas/tests/indexing/multiindex/test_multiindex.py b/pandas/tests/indexing/multiindex/test_multiindex.py index b88c411636610..08e15545cb998 100644 --- a/pandas/tests/indexing/multiindex/test_multiindex.py +++ b/pandas/tests/indexing/multiindex/test_multiindex.py @@ -213,3 +213,17 @@ def test_subtracting_two_series_with_unordered_index_and_all_nan_index( tm.assert_series_equal(result[0], a_series_expected) tm.assert_series_equal(result[1], b_series_expected) + + def test_nunique_smoke(self): + # GH 34019 + n = DataFrame([[1, 2], [1, 2]]).set_index([0, 1]).index.nunique() + assert n == 1 + + def test_multiindex_repeated_keys(self): + # GH19414 + tm.assert_series_equal( + Series([1, 2], MultiIndex.from_arrays([["a", "b"]])).loc[ + ["a", "a", "b", "b"] + ], + Series([1, 1, 2, 2], MultiIndex.from_arrays([["a", "a", "b", "b"]])), + ) diff --git a/pandas/tests/indexing/test_chaining_and_caching.py b/pandas/tests/indexing/test_chaining_and_caching.py index 47f929d87bd6f..adc001695579c 100644 --- a/pandas/tests/indexing/test_chaining_and_caching.py +++ b/pandas/tests/indexing/test_chaining_and_caching.py @@ -424,19 +424,6 @@ def test_detect_chained_assignment_warnings_errors(self): with pytest.raises(SettingWithCopyError, match=msg): df.loc[0]["A"] = 111 - def test_detect_chained_assignment_warnings_filter_and_dupe_cols(self): - # xref gh-13017. - with option_context("chained_assignment", "warn"): - df = DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, -9]], columns=["a", "a", "c"]) - - with tm.assert_produces_warning(SettingWithCopyWarning): - df.c.loc[df.c > 0] = None - - expected = DataFrame( - [[1, 2, 3], [4, 5, 6], [7, 8, -9]], columns=["a", "a", "c"] - ) - tm.assert_frame_equal(df, expected) - @pytest.mark.parametrize("rhs", [3, DataFrame({0: [1, 2, 3, 4]})]) def test_detect_chained_assignment_warning_stacklevel(self, rhs): # GH#42570 diff --git a/pandas/tests/indexing/test_iloc.py b/pandas/tests/indexing/test_iloc.py index 44cdc320148e1..2a116c992231b 100644 --- a/pandas/tests/indexing/test_iloc.py +++ b/pandas/tests/indexing/test_iloc.py @@ -11,6 +11,7 @@ import pytest from pandas.compat.numpy import is_numpy_min +from pandas.errors import IndexingError import pandas.util._test_decorators as td from pandas import ( @@ -32,7 +33,6 @@ ) import pandas._testing as tm from pandas.api.types import is_scalar -from pandas.core.indexing import IndexingError from pandas.tests.indexing.common import Base # We pass through the error message from numpy diff --git a/pandas/tests/indexing/test_indexing.py b/pandas/tests/indexing/test_indexing.py index dcb06f1cf778d..9d10e487e0cc2 100644 --- a/pandas/tests/indexing/test_indexing.py +++ b/pandas/tests/indexing/test_indexing.py @@ -7,6 +7,8 @@ import numpy as np import pytest +from pandas.errors import IndexingError + from pandas.core.dtypes.common import ( is_float_dtype, is_integer_dtype, @@ -24,7 +26,6 @@ ) import pandas._testing as tm from pandas.core.api import Float64Index -from pandas.core.indexing import IndexingError from pandas.tests.indexing.common import _mklbl from pandas.tests.indexing.test_floats import gen_obj diff --git a/pandas/tests/indexing/test_loc.py b/pandas/tests/indexing/test_loc.py index 1e5e65786a4aa..4c38a2219372d 100644 --- a/pandas/tests/indexing/test_loc.py +++ b/pandas/tests/indexing/test_loc.py @@ -12,6 +12,7 @@ import numpy as np import 
pytest +from pandas.errors import IndexingError import pandas.util._test_decorators as td import pandas as pd @@ -24,6 +25,7 @@ IndexSlice, MultiIndex, Period, + PeriodIndex, Series, SparseDtype, Timedelta, @@ -36,10 +38,7 @@ import pandas._testing as tm from pandas.api.types import is_scalar from pandas.core.api import Float64Index -from pandas.core.indexing import ( - IndexingError, - _one_ellipsis_message, -) +from pandas.core.indexing import _one_ellipsis_message from pandas.tests.indexing.common import Base @@ -2869,6 +2868,31 @@ def test_loc_setitem_using_datetimelike_str_as_index(fill_val, exp_dtype): tm.assert_index_equal(df.index, expected_index, exact=True) +def test_loc_set_int_dtype(): + # GH#23326 + df = DataFrame([list("abc")]) + df.loc[:, "col1"] = 5 + + expected = DataFrame({0: ["a"], 1: ["b"], 2: ["c"], "col1": [5]}) + tm.assert_frame_equal(df, expected) + + +def test_loc_periodindex_3_levels(): + # GH#24091 + p_index = PeriodIndex( + ["20181101 1100", "20181101 1200", "20181102 1300", "20181102 1400"], + name="datetime", + freq="B", + ) + mi_series = DataFrame( + [["A", "B", 1.0], ["A", "C", 2.0], ["Z", "Q", 3.0], ["W", "F", 4.0]], + index=p_index, + columns=["ONE", "TWO", "VALUES"], + ) + mi_series = mi_series.set_index(["ONE", "TWO"], append=True)["VALUES"] + assert mi_series.loc[(p_index[0], "A", "B")] == 1.0 + + class TestLocSeries: @pytest.mark.parametrize("val,expected", [(2**63 - 1, 3), (2**63, 4)]) def test_loc_uint64(self, val, expected): diff --git a/pandas/tests/io/data/excel/df_header_oob.xlsx b/pandas/tests/io/data/excel/df_header_oob.xlsx new file mode 100644 index 0000000000000..1e26091cd2ace Binary files /dev/null and b/pandas/tests/io/data/excel/df_header_oob.xlsx differ diff --git a/pandas/tests/io/data/excel/multiindex_no_index_names.xlsx b/pandas/tests/io/data/excel/multiindex_no_index_names.xlsx new file mode 100755 index 0000000000000..3913ffce5befb Binary files /dev/null and b/pandas/tests/io/data/excel/multiindex_no_index_names.xlsx differ diff --git a/pandas/tests/io/excel/test_openpyxl.py b/pandas/tests/io/excel/test_openpyxl.py index ea9a45cf829f2..3b122c8572751 100644 --- a/pandas/tests/io/excel/test_openpyxl.py +++ b/pandas/tests/io/excel/test_openpyxl.py @@ -396,3 +396,17 @@ def test_ints_spelled_with_decimals(datapath, ext): result = pd.read_excel(path) expected = DataFrame(range(2, 12), columns=[1]) tm.assert_frame_equal(result, expected) + + +def test_read_multiindex_header_no_index_names(datapath, ext): + # GH#47487 + path = datapath("io", "data", "excel", f"multiindex_no_index_names{ext}") + result = pd.read_excel(path, index_col=[0, 1, 2], header=[0, 1, 2]) + expected = DataFrame( + [[np.nan, "x", "x", "x"], ["x", np.nan, np.nan, np.nan]], + columns=pd.MultiIndex.from_tuples( + [("X", "Y", "A1"), ("X", "Y", "A2"), ("XX", "YY", "B1"), ("XX", "YY", "B2")] + ), + index=pd.MultiIndex.from_tuples([("A", "AA", "AAA"), ("A", "BB", "BBB")]), + ) + tm.assert_frame_equal(result, expected) diff --git a/pandas/tests/io/excel/test_readers.py b/pandas/tests/io/excel/test_readers.py index f6c60b11cc8ff..4ca34bec0a7d9 100644 --- a/pandas/tests/io/excel/test_readers.py +++ b/pandas/tests/io/excel/test_readers.py @@ -1556,6 +1556,12 @@ def test_excel_read_binary_via_read_excel(self, read_ext, engine): expected = pd.read_excel("test1" + read_ext, engine=engine) tm.assert_frame_equal(result, expected) + def test_read_excel_header_index_out_of_range(self, engine): + # GH#43143 + with open("df_header_oob.xlsx", "rb") as f: + with pytest.raises(ValueError, 
match="exceeds maximum"): + pd.read_excel(f, header=[0, 1]) + @pytest.mark.parametrize("filename", ["df_empty.xlsx", "df_equals.xlsx"]) def test_header_with_index_col(self, filename): # GH 33476 diff --git a/pandas/tests/io/excel/test_style.py b/pandas/tests/io/excel/test_style.py index c31e8ec022dcd..00f6ccb96a905 100644 --- a/pandas/tests/io/excel/test_style.py +++ b/pandas/tests/io/excel/test_style.py @@ -1,9 +1,15 @@ import contextlib +import time import numpy as np import pytest -from pandas import DataFrame +import pandas.util._test_decorators as td + +from pandas import ( + DataFrame, + read_excel, +) import pandas._testing as tm from pandas.io.excel import ExcelWriter @@ -70,6 +76,7 @@ def test_styler_to_excel_unstyled(engine): ["alignment", "vertical"], {"xlsxwriter": None, "openpyxl": "bottom"}, # xlsxwriter Fails ), + ("vertical-align: middle;", ["alignment", "vertical"], "center"), # Border widths ("border-left: 2pt solid red", ["border", "left", "style"], "medium"), ("border-left: 1pt dotted red", ["border", "left", "style"], "dotted"), @@ -205,3 +212,27 @@ def custom_converter(css): with contextlib.closing(openpyxl.load_workbook(path)) as wb: assert wb["custom"].cell(2, 2).font.color.value == "00111222" + + +@pytest.mark.single_cpu +@td.skip_if_not_us_locale +def test_styler_to_s3(s3_resource, s3so): + # GH#46381 + + mock_bucket_name, target_file = "pandas-test", "test.xlsx" + df = DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]}) + styler = df.style.set_sticky(axis="index") + styler.to_excel(f"s3://{mock_bucket_name}/{target_file}", storage_options=s3so) + timeout = 5 + while True: + if target_file in ( + obj.key for obj in s3_resource.Bucket("pandas-test").objects.all() + ): + break + time.sleep(0.1) + timeout -= 0.1 + assert timeout > 0, "Timed out waiting for file to appear on moto" + result = read_excel( + f"s3://{mock_bucket_name}/{target_file}", index_col=0, storage_options=s3so + ) + tm.assert_frame_equal(result, df) diff --git a/pandas/tests/io/excel/test_writers.py b/pandas/tests/io/excel/test_writers.py index 42483645d9fc3..ba6366b71d854 100644 --- a/pandas/tests/io/excel/test_writers.py +++ b/pandas/tests/io/excel/test_writers.py @@ -838,6 +838,19 @@ def test_to_excel_multiindex_no_write_index(self, path): # Test that it is the same as the initial frame. tm.assert_frame_equal(frame1, frame3) + def test_to_excel_empty_multiindex(self, path): + # GH 19543. 
+ expected = DataFrame([], columns=[0, 1, 2]) + + df = DataFrame([], index=MultiIndex.from_tuples([], names=[0, 1]), columns=[2]) + df.to_excel(path, "test1") + + with ExcelFile(path) as reader: + result = pd.read_excel(reader, sheet_name="test1") + tm.assert_frame_equal( + result, expected, check_index_type=False, check_dtype=False + ) + def test_to_excel_float_format(self, path): df = DataFrame( [[0.123456, 0.234567, 0.567567], [12.32112, 123123.2, 321321.2]], diff --git a/pandas/tests/io/formats/test_css.py b/pandas/tests/io/formats/test_css.py index c93694481ef53..70c91dd02751a 100644 --- a/pandas/tests/io/formats/test_css.py +++ b/pandas/tests/io/formats/test_css.py @@ -1,11 +1,10 @@ import pytest +from pandas.errors import CSSWarning + import pandas._testing as tm -from pandas.io.formats.css import ( - CSSResolver, - CSSWarning, -) +from pandas.io.formats.css import CSSResolver def assert_resolves(css, props, inherited=None): diff --git a/pandas/tests/io/formats/test_to_excel.py b/pandas/tests/io/formats/test_to_excel.py index b95a5b4365f43..7481baaee94f6 100644 --- a/pandas/tests/io/formats/test_to_excel.py +++ b/pandas/tests/io/formats/test_to_excel.py @@ -6,12 +6,15 @@ import pytest +from pandas.errors import CSSWarning import pandas.util._test_decorators as td import pandas._testing as tm -from pandas.io.formats.css import CSSWarning -from pandas.io.formats.excel import CSSToExcelConverter +from pandas.io.formats.excel import ( + CssExcelCell, + CSSToExcelConverter, +) @pytest.mark.parametrize( @@ -340,3 +343,89 @@ def test_css_named_colors_from_mpl_present(): pd_colors = CSSToExcelConverter.NAMED_COLORS for name, color in mpl_colors.items(): assert name in pd_colors and pd_colors[name] == color[1:] + + +@pytest.mark.parametrize( + "styles,expected", + [ + ([("color", "green"), ("color", "red")], "color: red;"), + ([("font-weight", "bold"), ("font-weight", "normal")], "font-weight: normal;"), + ([("text-align", "center"), ("TEXT-ALIGN", "right")], "text-align: right;"), + ], +) +def test_css_excel_cell_precedence(styles, expected): + """It applies favors latter declarations over former declarations""" + # See GH 47371 + converter = CSSToExcelConverter() + converter.__call__.cache_clear() + css_styles = {(0, 0): styles} + cell = CssExcelCell( + row=0, + col=0, + val="", + style=None, + css_styles=css_styles, + css_row=0, + css_col=0, + css_converter=converter, + ) + converter.__call__.cache_clear() + + assert cell.style == converter(expected) + + +@pytest.mark.parametrize( + "styles,cache_hits,cache_misses", + [ + ([[("color", "green"), ("color", "red"), ("color", "green")]], 0, 1), + ( + [ + [("font-weight", "bold")], + [("font-weight", "normal"), ("font-weight", "bold")], + ], + 1, + 1, + ), + ([[("text-align", "center")], [("TEXT-ALIGN", "center")]], 1, 1), + ( + [ + [("font-weight", "bold"), ("text-align", "center")], + [("font-weight", "bold"), ("text-align", "left")], + ], + 0, + 2, + ), + ( + [ + [("font-weight", "bold"), ("text-align", "center")], + [("font-weight", "bold"), ("text-align", "left")], + [("font-weight", "bold"), ("text-align", "center")], + ], + 1, + 2, + ), + ], +) +def test_css_excel_cell_cache(styles, cache_hits, cache_misses): + """It caches unique cell styles""" + # See GH 47371 + converter = CSSToExcelConverter() + converter.__call__.cache_clear() + + css_styles = {(0, i): _style for i, _style in enumerate(styles)} + for css_row, css_col in css_styles: + CssExcelCell( + row=0, + col=0, + val="", + style=None, + css_styles=css_styles, + css_row=css_row, 
+ css_col=css_col, + css_converter=converter, + ) + cache_info = converter.__call__.cache_info() + converter.__call__.cache_clear() + + assert cache_info.hits == cache_hits + assert cache_info.misses == cache_misses diff --git a/pandas/tests/io/json/test_json_table_schema.py b/pandas/tests/io/json/test_json_table_schema.py index c90ac2fb3b813..f4c8b9e764d6d 100644 --- a/pandas/tests/io/json/test_json_table_schema.py +++ b/pandas/tests/io/json/test_json_table_schema.py @@ -708,6 +708,44 @@ def test_read_json_table_orient_raises(self, index_nm, vals, recwarn): with pytest.raises(NotImplementedError, match="can not yet read "): pd.read_json(out, orient="table") + @pytest.mark.parametrize( + "index_nm", + [None, "idx", pytest.param("index", marks=pytest.mark.xfail), "level_0"], + ) + @pytest.mark.parametrize( + "vals", + [ + {"ints": [1, 2, 3, 4]}, + {"objects": ["a", "b", "c", "d"]}, + {"objects": ["1", "2", "3", "4"]}, + {"date_ranges": pd.date_range("2016-01-01", freq="d", periods=4)}, + {"categoricals": pd.Series(pd.Categorical(["a", "b", "c", "c"]))}, + { + "ordered_cats": pd.Series( + pd.Categorical(["a", "b", "c", "c"], ordered=True) + ) + }, + {"floats": [1.0, 2.0, 3.0, 4.0]}, + {"floats": [1.1, 2.2, 3.3, 4.4]}, + {"bools": [True, False, False, True]}, + { + "timezones": pd.date_range( + "2016-01-01", freq="d", periods=4, tz="US/Central" + ) # added in # GH 35973 + }, + ], + ) + def test_read_json_table_period_orient(self, index_nm, vals, recwarn): + df = DataFrame( + vals, + index=pd.Index( + (pd.Period(f"2022Q{q}") for q in range(1, 5)), name=index_nm + ), + ) + out = df.to_json(orient="table") + result = pd.read_json(out, orient="table") + tm.assert_frame_equal(df, result) + @pytest.mark.parametrize( "idx", [ diff --git a/pandas/tests/io/json/test_pandas.py b/pandas/tests/io/json/test_pandas.py index eaffbc60ead32..026c3bc68ce34 100644 --- a/pandas/tests/io/json/test_pandas.py +++ b/pandas/tests/io/json/test_pandas.py @@ -1908,3 +1908,13 @@ def test_complex_data_tojson(self, data, expected): # GH41174 result = data.to_json() assert result == expected + + def test_json_uint64(self): + # GH21073 + expected = ( + '{"columns":["col1"],"index":[0,1],' + '"data":[[13342205958987758245],[12388075603347835679]]}' + ) + df = DataFrame(data={"col1": [13342205958987758245, 12388075603347835679]}) + result = df.to_json(orient="split") + assert result == expected diff --git a/pandas/tests/io/json/test_ujson.py b/pandas/tests/io/json/test_ujson.py index e82a888f47388..ae13d8d5fb180 100644 --- a/pandas/tests/io/json/test_ujson.py +++ b/pandas/tests/io/json/test_ujson.py @@ -23,6 +23,7 @@ DatetimeIndex, Index, NaT, + PeriodIndex, Series, Timedelta, Timestamp, @@ -1240,3 +1241,9 @@ def test_encode_timedelta_iso(self, td): expected = f'"{td.isoformat()}"' assert result == expected + + def test_encode_periodindex(self): + # GH 46683 + p = PeriodIndex(["2022-04-06", "2022-04-07"], freq="D") + df = DataFrame(index=p) + assert df.to_json() == "{}" diff --git a/pandas/tests/io/parser/common/test_common_basic.py b/pandas/tests/io/parser/common/test_common_basic.py index 115a2976ce618..a0da3a7eaadce 100644 --- a/pandas/tests/io/parser/common/test_common_basic.py +++ b/pandas/tests/io/parser/common/test_common_basic.py @@ -806,8 +806,7 @@ def test_read_csv_posargs_deprecation(all_parsers): "In a future version of pandas all arguments of read_csv " "except for the argument 'filepath_or_buffer' will be keyword-only" ) - with tm.assert_produces_warning(FutureWarning, match=msg): - parser.read_csv(f, " ") + 
parser.read_csv_check_warnings(FutureWarning, msg, f, " ") @pytest.mark.parametrize("delimiter", [",", "\t"]) @@ -921,5 +920,4 @@ def test_read_table_posargs_deprecation(all_parsers): "In a future version of pandas all arguments of read_table " "except for the argument 'filepath_or_buffer' will be keyword-only" ) - with tm.assert_produces_warning(FutureWarning, match=msg): - parser.read_table(data, " ") + parser.read_table_check_warnings(FutureWarning, msg, data, " ") diff --git a/pandas/tests/io/parser/conftest.py b/pandas/tests/io/parser/conftest.py index 066f448d97505..0462d1fe6da0b 100644 --- a/pandas/tests/io/parser/conftest.py +++ b/pandas/tests/io/parser/conftest.py @@ -42,6 +42,16 @@ def read_table(self, *args, **kwargs): kwargs = self.update_kwargs(kwargs) return read_table(*args, **kwargs) + def read_table_check_warnings( + self, warn_type: type[Warning], warn_msg: str, *args, **kwargs + ): + # We need to check the stacklevel here instead of in the tests + # since this is where read_table is called and where the warning + # should point to. + kwargs = self.update_kwargs(kwargs) + with tm.assert_produces_warning(warn_type, match=warn_msg): + return read_table(*args, **kwargs) + class CParser(BaseParser): engine = "c" diff --git a/pandas/tests/io/parser/test_header.py b/pandas/tests/io/parser/test_header.py index 3fc23525df89e..4ded70db8bae7 100644 --- a/pandas/tests/io/parser/test_header.py +++ b/pandas/tests/io/parser/test_header.py @@ -666,3 +666,15 @@ def test_header_none_and_on_bad_lines_skip(all_parsers): ) expected = DataFrame({"a": ["x", "z"], "b": [1, 3]}) tm.assert_frame_equal(result, expected) + + +@skip_pyarrow +def test_header_missing_rows(all_parsers): + # GH#47400 + parser = all_parsers + data = """a,b +1,2 +""" + msg = r"Passed header=\[0,1,2\], len of 3, but only 2 lines in file" + with pytest.raises(ValueError, match=msg): + parser.read_csv(StringIO(data), header=[0, 1, 2]) diff --git a/pandas/tests/io/parser/test_parse_dates.py b/pandas/tests/io/parser/test_parse_dates.py index 449d5a954613b..d05961b702c51 100644 --- a/pandas/tests/io/parser/test_parse_dates.py +++ b/pandas/tests/io/parser/test_parse_dates.py @@ -1677,9 +1677,7 @@ def test_parse_delimited_date_swap_with_warning( ): parser = all_parsers expected = DataFrame({0: [expected]}, dtype="datetime64[ns]") - warning_msg = ( - "Provide format or specify infer_datetime_format=True for consistent parsing" - ) + warning_msg = "Specify a format to ensure consistent parsing" with tm.assert_produces_warning(UserWarning, match=warning_msg): result = parser.read_csv( StringIO(date_string), header=None, dayfirst=dayfirst, parse_dates=[0] @@ -1687,6 +1685,17 @@ def test_parse_delimited_date_swap_with_warning( tm.assert_frame_equal(result, expected) +def test_parse_multiple_delimited_dates_with_swap_warnings(): + # GH46210 + warning_msg = "Specify a format to ensure consistent parsing" + with tm.assert_produces_warning(UserWarning, match=warning_msg) as record: + pd.to_datetime(["01/01/2000", "31/05/2000", "31/05/2001", "01/02/2000"]) + assert len({str(warning.message) for warning in record}) == 1 + # Using set(record) as repetitions of the same warning are suppressed + # https://docs.python.org/3/library/warnings.html + # and here we care to check that the warning is only shows once to users. 
+ + def _helper_hypothesis_delimited_date(call, date_string, **kwargs): msg, result = None, None try: @@ -1848,12 +1857,14 @@ def test_parse_dates_and_keep_orgin_column(all_parsers): def test_dayfirst_warnings(): # GH 12585 warning_msg_day_first = ( - "Parsing '31/12/2014' in DD/MM/YYYY format. Provide " - "format or specify infer_datetime_format=True for consistent parsing." + r"Parsing dates in DD/MM/YYYY format when dayfirst=False \(the default\) was " + r"specified. This may lead to inconsistently parsed dates! Specify a format " + r"to ensure consistent parsing." ) warning_msg_month_first = ( - "Parsing '03/30/2011' in MM/DD/YYYY format. Provide " - "format or specify infer_datetime_format=True for consistent parsing." + "Parsing dates in MM/DD/YYYY format when dayfirst=True was " + "specified. This may lead to inconsistently parsed dates! Specify a format " + "to ensure consistent parsing." ) # CASE 1: valid input diff --git a/pandas/tests/io/parser/test_python_parser_only.py b/pandas/tests/io/parser/test_python_parser_only.py index abe6c831dd4e4..0717078a83a46 100644 --- a/pandas/tests/io/parser/test_python_parser_only.py +++ b/pandas/tests/io/parser/test_python_parser_only.py @@ -466,6 +466,17 @@ def test_index_col_false_and_header_none(python_parser_only): 0.5,0.03 0.1,0.2,0.3,2 """ - result = parser.read_csv(StringIO(data), sep=",", header=None, index_col=False) + with tm.assert_produces_warning(ParserWarning, match="Length of header"): + result = parser.read_csv(StringIO(data), sep=",", header=None, index_col=False) expected = DataFrame({0: [0.5, 0.1], 1: [0.03, 0.2]}) tm.assert_frame_equal(result, expected) + + +def test_header_int_do_not_infer_multiindex_names_on_different_line(python_parser_only): + # GH#46569 + parser = python_parser_only + data = StringIO("a\na,b\nc,d,e\nf,g,h") + with tm.assert_produces_warning(ParserWarning, match="Length of header"): + result = parser.read_csv(data, engine="python", index_col=False) + expected = DataFrame({"a": ["a", "c", "f"]}) + tm.assert_frame_equal(result, expected) diff --git a/pandas/tests/io/parser/test_quoting.py b/pandas/tests/io/parser/test_quoting.py index 456dd049d2f4a..a1aba949e74fe 100644 --- a/pandas/tests/io/parser/test_quoting.py +++ b/pandas/tests/io/parser/test_quoting.py @@ -38,7 +38,7 @@ def test_bad_quote_char(all_parsers, kwargs, msg): @pytest.mark.parametrize( "quoting,msg", [ - ("foo", '"quoting" must be an integer'), + ("foo", '"quoting" must be an integer|Argument'), (5, 'bad "quoting" value'), # quoting must be in the range [0, 3] ], ) diff --git a/pandas/tests/io/pytables/test_file_handling.py b/pandas/tests/io/pytables/test_file_handling.py index 9fde65e3a1a43..13b6b94dda8d4 100644 --- a/pandas/tests/io/pytables/test_file_handling.py +++ b/pandas/tests/io/pytables/test_file_handling.py @@ -4,6 +4,10 @@ import pytest from pandas.compat import is_platform_little_endian +from pandas.errors import ( + ClosedFileError, + PossibleDataLossError, +) from pandas import ( DataFrame, @@ -20,11 +24,7 @@ ) from pandas.io import pytables as pytables -from pandas.io.pytables import ( - ClosedFileError, - PossibleDataLossError, - Term, -) +from pandas.io.pytables import Term pytestmark = pytest.mark.single_cpu diff --git a/pandas/tests/io/pytables/test_store.py b/pandas/tests/io/pytables/test_store.py index 8a933f4981ff3..e8f4e7ee92fc3 100644 --- a/pandas/tests/io/pytables/test_store.py +++ b/pandas/tests/io/pytables/test_store.py @@ -589,7 +589,6 @@ def test_store_series_name(setup_path): tm.assert_series_equal(recons, 
series) -@pytest.mark.filterwarnings("ignore:\\nduplicate:pandas.io.pytables.DuplicateWarning") def test_overwrite_node(setup_path): with ensure_clean_store(setup_path) as store: @@ -1019,3 +1018,11 @@ def test_hdfstore_iteritems_deprecated(setup_path): hdf.put("table", df) with tm.assert_produces_warning(FutureWarning): next(hdf.iteritems()) + + +def test_hdfstore_strides(setup_path): + # GH22073 + df = DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}) + with ensure_clean_store(setup_path) as store: + store.put("df", df) + assert df["a"].values.strides == store["df"]["a"].values.strides diff --git a/pandas/tests/io/sas/data/0x00controlbyte.sas7bdat.bz2 b/pandas/tests/io/sas/data/0x00controlbyte.sas7bdat.bz2 new file mode 100644 index 0000000000000..ef980fb907694 Binary files /dev/null and b/pandas/tests/io/sas/data/0x00controlbyte.sas7bdat.bz2 differ diff --git a/pandas/tests/io/sas/data/0x40controlbyte.csv b/pandas/tests/io/sas/data/0x40controlbyte.csv new file mode 100644 index 0000000000000..e81f5cc3904b7 --- /dev/null +++ b/pandas/tests/io/sas/data/0x40controlbyte.csv @@ -0,0 +1,2 @@ +long_string_field1,long_string_field2,long_string_field3 +00000000000000000000000000000000000000000000000000,11111111111111111111111111111111111111111111111111,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa diff --git a/pandas/tests/io/sas/data/0x40controlbyte.sas7bdat b/pandas/tests/io/sas/data/0x40controlbyte.sas7bdat new file mode 100644 index 0000000000000..013542e282e2f Binary files /dev/null and b/pandas/tests/io/sas/data/0x40controlbyte.sas7bdat differ diff --git a/pandas/tests/io/sas/data/airline.sas7bdat.gz b/pandas/tests/io/sas/data/airline.sas7bdat.gz new file mode 100644 index 0000000000000..7b56e492295f4 Binary files /dev/null and b/pandas/tests/io/sas/data/airline.sas7bdat.gz differ diff --git a/pandas/tests/io/sas/test_sas.py b/pandas/tests/io/sas/test_sas.py index 5d2643c20ceb2..1e38baf4fc409 100644 --- a/pandas/tests/io/sas/test_sas.py +++ b/pandas/tests/io/sas/test_sas.py @@ -20,7 +20,15 @@ def test_sas_buffer_format(self): def test_sas_read_no_format_or_extension(self): # see gh-24548 - msg = "unable to infer format of SAS file" + msg = "unable to infer format of SAS file.+" with tm.ensure_clean("test_file_no_extension") as path: with pytest.raises(ValueError, match=msg): read_sas(path) + + +def test_sas_archive(datapath): + fname_uncompressed = datapath("io", "sas", "data", "airline.sas7bdat") + df_uncompressed = read_sas(fname_uncompressed) + fname_compressed = datapath("io", "sas", "data", "airline.sas7bdat.gz") + df_compressed = read_sas(fname_compressed, format="sas7bdat") + tm.assert_frame_equal(df_uncompressed, df_compressed) diff --git a/pandas/tests/io/sas/test_sas7bdat.py b/pandas/tests/io/sas/test_sas7bdat.py index 3f150c1a061ee..41b2e78d093ea 100644 --- a/pandas/tests/io/sas/test_sas7bdat.py +++ b/pandas/tests/io/sas/test_sas7bdat.py @@ -381,3 +381,19 @@ def test_exception_propagation_rle_decompress(tmp_path, datapath): tmp_file.write_bytes(data) with pytest.raises(ValueError, match="unknown control byte"): pd.read_sas(tmp_file) + + +def test_0x40_control_byte(datapath): + # GH 31243 + fname = datapath("io", "sas", "data", "0x40controlbyte.sas7bdat") + df = pd.read_sas(fname, encoding="ascii") + fname = datapath("io", "sas", "data", "0x40controlbyte.csv") + df0 = pd.read_csv(fname, dtype="object") + tm.assert_frame_equal(df, df0) + + +def test_0x00_control_byte(datapath): + # GH 47099 + fname = datapath("io", "sas", "data", "0x00controlbyte.sas7bdat.bz2") + df = 
next(pd.read_sas(fname, chunksize=11_000)) + assert df.shape == (11_000, 20) diff --git a/pandas/tests/io/test_clipboard.py b/pandas/tests/io/test_clipboard.py index 73e563fd2b743..cb34cb6678a67 100644 --- a/pandas/tests/io/test_clipboard.py +++ b/pandas/tests/io/test_clipboard.py @@ -3,6 +3,11 @@ import numpy as np import pytest +from pandas.errors import ( + PyperclipException, + PyperclipWindowsException, +) + from pandas import ( DataFrame, get_option, @@ -11,6 +16,8 @@ import pandas._testing as tm from pandas.io.clipboard import ( + CheckedCall, + _stringifyText, clipboard_get, clipboard_set, ) @@ -110,6 +117,81 @@ def df(request): raise ValueError +@pytest.fixture +def mock_ctypes(monkeypatch): + """ + Mocks WinError to help with testing the clipboard. + """ + + def _mock_win_error(): + return "Window Error" + + # Set raising to False because WinError won't exist on non-windows platforms + with monkeypatch.context() as m: + m.setattr("ctypes.WinError", _mock_win_error, raising=False) + yield + + +@pytest.mark.usefixtures("mock_ctypes") +def test_checked_call_with_bad_call(monkeypatch): + """ + Give CheckCall a function that returns a falsey value and + mock get_errno so it returns false so an exception is raised. + """ + + def _return_false(): + return False + + monkeypatch.setattr("pandas.io.clipboard.get_errno", lambda: True) + msg = f"Error calling {_return_false.__name__} \\(Window Error\\)" + + with pytest.raises(PyperclipWindowsException, match=msg): + CheckedCall(_return_false)() + + +@pytest.mark.usefixtures("mock_ctypes") +def test_checked_call_with_valid_call(monkeypatch): + """ + Give CheckCall a function that returns a truthy value and + mock get_errno so it returns true so an exception is not raised. + The function should return the results from _return_true. + """ + + def _return_true(): + return True + + monkeypatch.setattr("pandas.io.clipboard.get_errno", lambda: False) + + # Give CheckedCall a callable that returns a truthy value s + checked_call = CheckedCall(_return_true) + assert checked_call() is True + + +@pytest.mark.parametrize( + "text", + [ + "String_test", + True, + 1, + 1.0, + 1j, + ], +) +def test_stringify_text(text): + valid_types = (str, int, float, bool) + + if isinstance(text, valid_types): + result = _stringifyText(text) + assert result == str(text) + else: + msg = ( + "only str, int, float, and bool values " + f"can be copied to the clipboard, not {type(text).__name__}" + ) + with pytest.raises(PyperclipException, match=msg): + _stringifyText(text) + + @pytest.fixture def mock_clipboard(monkeypatch, request): """Fixture mocking clipboard IO. diff --git a/pandas/tests/io/test_orc.py b/pandas/tests/io/test_orc.py index f34e9b940317d..0bb320907b813 100644 --- a/pandas/tests/io/test_orc.py +++ b/pandas/tests/io/test_orc.py @@ -1,10 +1,13 @@ """ test orc compat """ import datetime +from io import BytesIO import os import numpy as np import pytest +import pandas.util._test_decorators as td + import pandas as pd from pandas import read_orc import pandas._testing as tm @@ -21,6 +24,27 @@ def dirpath(datapath): return datapath("io", "data", "orc") +# Examples of dataframes with dtypes for which conversion to ORC +# hasn't been implemented yet, that is, Category, unsigned integers, +# interval, period and sparse. 
+orc_writer_dtypes_not_supported = [ + pd.DataFrame({"unimpl": np.array([1, 20], dtype="uint64")}), + pd.DataFrame({"unimpl": pd.Series(["a", "b", "a"], dtype="category")}), + pd.DataFrame( + {"unimpl": [pd.Interval(left=0, right=2), pd.Interval(left=0, right=5)]} + ), + pd.DataFrame( + { + "unimpl": [ + pd.Period("2022-01-03", freq="D"), + pd.Period("2022-01-04", freq="D"), + ] + } + ), + pd.DataFrame({"unimpl": [np.nan] * 50}).astype(pd.SparseDtype("float", np.nan)), +] + + def test_orc_reader_empty(dirpath): columns = [ "boolean1", @@ -224,3 +248,60 @@ def test_orc_reader_snappy_compressed(dirpath): got = read_orc(inputfile).iloc[:10] tm.assert_equal(expected, got) + + +@td.skip_if_no("pyarrow", min_version="7.0.0") +def test_orc_roundtrip_file(dirpath): + # GH44554 + # PyArrow gained ORC write support with the current argument order + data = { + "boolean1": np.array([False, True], dtype="bool"), + "byte1": np.array([1, 100], dtype="int8"), + "short1": np.array([1024, 2048], dtype="int16"), + "int1": np.array([65536, 65536], dtype="int32"), + "long1": np.array([9223372036854775807, 9223372036854775807], dtype="int64"), + "float1": np.array([1.0, 2.0], dtype="float32"), + "double1": np.array([-15.0, -5.0], dtype="float64"), + "bytes1": np.array([b"\x00\x01\x02\x03\x04", b""], dtype="object"), + "string1": np.array(["hi", "bye"], dtype="object"), + } + expected = pd.DataFrame.from_dict(data) + + with tm.ensure_clean() as path: + expected.to_orc(path) + got = read_orc(path) + + tm.assert_equal(expected, got) + + +@td.skip_if_no("pyarrow", min_version="7.0.0") +def test_orc_roundtrip_bytesio(): + # GH44554 + # PyArrow gained ORC write support with the current argument order + data = { + "boolean1": np.array([False, True], dtype="bool"), + "byte1": np.array([1, 100], dtype="int8"), + "short1": np.array([1024, 2048], dtype="int16"), + "int1": np.array([65536, 65536], dtype="int32"), + "long1": np.array([9223372036854775807, 9223372036854775807], dtype="int64"), + "float1": np.array([1.0, 2.0], dtype="float32"), + "double1": np.array([-15.0, -5.0], dtype="float64"), + "bytes1": np.array([b"\x00\x01\x02\x03\x04", b""], dtype="object"), + "string1": np.array(["hi", "bye"], dtype="object"), + } + expected = pd.DataFrame.from_dict(data) + + bytes = expected.to_orc() + got = read_orc(BytesIO(bytes)) + + tm.assert_equal(expected, got) + + +@td.skip_if_no("pyarrow", min_version="7.0.0") +@pytest.mark.parametrize("df_not_supported", orc_writer_dtypes_not_supported) +def test_orc_writer_dtypes_not_supported(df_not_supported): + # GH44554 + # PyArrow gained ORC write support with the current argument order + msg = "The dtype of one or more columns is not supported yet." 
+ with pytest.raises(NotImplementedError, match=msg): + df_not_supported.to_orc() diff --git a/pandas/tests/io/test_parquet.py b/pandas/tests/io/test_parquet.py index 5b899079dfffd..64e4a15a42061 100644 --- a/pandas/tests/io/test_parquet.py +++ b/pandas/tests/io/test_parquet.py @@ -626,6 +626,9 @@ def test_use_nullable_dtypes(self, engine, request): "d": pyarrow.array([True, False, True, None]), # Test that nullable dtypes used even in absence of nulls "e": pyarrow.array([1, 2, 3, 4], "int64"), + # GH 45694 + "f": pyarrow.array([1.0, 2.0, 3.0, None], "float32"), + "g": pyarrow.array([1.0, 2.0, 3.0, None], "float64"), } ) with tm.ensure_clean() as path: @@ -642,6 +645,8 @@ def test_use_nullable_dtypes(self, engine, request): "c": pd.array(["a", "b", "c", None], dtype="string"), "d": pd.array([True, False, True, None], dtype="boolean"), "e": pd.array([1, 2, 3, 4], dtype="Int64"), + "f": pd.array([1.0, 2.0, 3.0, None], dtype="Float32"), + "g": pd.array([1.0, 2.0, 3.0, None], dtype="Float64"), } ) if engine == "fastparquet": @@ -672,7 +677,17 @@ def test_read_empty_array(self, pa, dtype): "value": pd.array([], dtype=dtype), } ) - check_round_trip(df, pa, read_kwargs={"use_nullable_dtypes": True}) + # GH 45694 + expected = None + if dtype == "float": + expected = pd.DataFrame( + { + "value": pd.array([], dtype="Float64"), + } + ) + check_round_trip( + df, pa, read_kwargs={"use_nullable_dtypes": True}, expected=expected + ) @pytest.mark.filterwarnings("ignore:CategoricalBlock is deprecated:DeprecationWarning") diff --git a/pandas/tests/io/test_sql.py b/pandas/tests/io/test_sql.py index e28901fa1a1ed..ee55837324f20 100644 --- a/pandas/tests/io/test_sql.py +++ b/pandas/tests/io/test_sql.py @@ -620,7 +620,8 @@ def test_read_procedure(conn, request): @pytest.mark.db @pytest.mark.parametrize("conn", postgresql_connectable) -def test_copy_from_callable_insertion_method(conn, request): +@pytest.mark.parametrize("expected_count", [2, "Success!"]) +def test_copy_from_callable_insertion_method(conn, expected_count, request): # GH 8953 # Example in io.rst found under _io.sql.method # not available in sqlite, mysql @@ -641,10 +642,18 @@ def psql_insert_copy(table, conn, keys, data_iter): sql_query = f"COPY {table_name} ({columns}) FROM STDIN WITH CSV" cur.copy_expert(sql=sql_query, file=s_buf) + return expected_count conn = request.getfixturevalue(conn) expected = DataFrame({"col1": [1, 2], "col2": [0.1, 0.2], "col3": ["a", "n"]}) - expected.to_sql("test_frame", conn, index=False, method=psql_insert_copy) + result_count = expected.to_sql( + "test_frame", conn, index=False, method=psql_insert_copy + ) + # GH 46891 + if not isinstance(expected_count, int): + assert result_count is None + else: + assert result_count == expected_count result = sql.read_sql_table("test_frame", conn) tm.assert_frame_equal(result, expected) @@ -2595,9 +2604,17 @@ def test_datetime_date(self): elif self.flavor == "mysql": tm.assert_frame_equal(res, df) - def test_datetime_time(self): + @pytest.mark.parametrize("tz_aware", [False, True]) + def test_datetime_time(self, tz_aware): # test support for datetime.time, GH #8341 - df = DataFrame([time(9, 0, 0), time(9, 1, 30)], columns=["a"]) + if not tz_aware: + tz_times = [time(9, 0, 0), time(9, 1, 30)] + else: + tz_dt = date_range("2013-01-01 09:00:00", periods=2, tz="US/Pacific") + tz_times = Series(tz_dt.to_pydatetime()).map(lambda dt: dt.timetz()) + + df = DataFrame(tz_times, columns=["a"]) + assert df.to_sql("test_time", self.conn, index=False) == 2 res = read_sql_query("SELECT * 
FROM test_time", self.conn) if self.flavor == "sqlite": diff --git a/pandas/tests/io/xml/test_xml.py b/pandas/tests/io/xml/test_xml.py index 277b6442a0a8c..410c5f6703dcd 100644 --- a/pandas/tests/io/xml/test_xml.py +++ b/pandas/tests/io/xml/test_xml.py @@ -789,6 +789,81 @@ def test_names_option_output(datapath, parser): tm.assert_frame_equal(df_iter, df_expected) +def test_repeat_names(parser): + xml = """\ + + + circle + curved + + + sphere + curved + +""" + df_xpath = read_xml( + xml, xpath=".//shape", parser=parser, names=["type_dim", "shape", "type_edge"] + ) + + df_iter = read_xml_iterparse( + xml, + parser=parser, + iterparse={"shape": ["type", "name", "type"]}, + names=["type_dim", "shape", "type_edge"], + ) + + df_expected = DataFrame( + { + "type_dim": ["2D", "3D"], + "shape": ["circle", "sphere"], + "type_edge": ["curved", "curved"], + } + ) + + tm.assert_frame_equal(df_xpath, df_expected) + tm.assert_frame_equal(df_iter, df_expected) + + +def test_repeat_values_new_names(parser): + xml = """\ + + + rectangle + rectangle + + + square + rectangle + + + ellipse + ellipse + + + circle + ellipse + +""" + df_xpath = read_xml(xml, xpath=".//shape", parser=parser, names=["name", "group"]) + + df_iter = read_xml_iterparse( + xml, + parser=parser, + iterparse={"shape": ["name", "family"]}, + names=["name", "group"], + ) + + df_expected = DataFrame( + { + "name": ["rectangle", "square", "ellipse", "circle"], + "group": ["rectangle", "rectangle", "ellipse", "ellipse"], + } + ) + + tm.assert_frame_equal(df_xpath, df_expected) + tm.assert_frame_equal(df_iter, df_expected) + + def test_names_option_wrong_length(datapath, parser): filename = datapath("io", "data", "xml", "books.xml") @@ -1236,7 +1311,7 @@ def test_wrong_dict_value(datapath, parser): read_xml(filename, parser=parser, iterparse={"book": "category"}) -def test_bad_xml(datapath, parser): +def test_bad_xml(parser): bad_xml = """\ @@ -1277,6 +1352,113 @@ def test_bad_xml(datapath, parser): ) +def test_comment(parser): + xml = """\ + + + + + circle + 2D + + + sphere + 3D + + + + +""" + + df_xpath = read_xml(xml, xpath=".//shape", parser=parser) + + df_iter = read_xml_iterparse( + xml, parser=parser, iterparse={"shape": ["name", "type"]} + ) + + df_expected = DataFrame( + { + "name": ["circle", "sphere"], + "type": ["2D", "3D"], + } + ) + + tm.assert_frame_equal(df_xpath, df_expected) + tm.assert_frame_equal(df_iter, df_expected) + + +def test_dtd(parser): + xml = """\ + + + + +]> + + + circle + 2D + + + sphere + 3D + +""" + + df_xpath = read_xml(xml, xpath=".//shape", parser=parser) + + df_iter = read_xml_iterparse( + xml, parser=parser, iterparse={"shape": ["name", "type"]} + ) + + df_expected = DataFrame( + { + "name": ["circle", "sphere"], + "type": ["2D", "3D"], + } + ) + + tm.assert_frame_equal(df_xpath, df_expected) + tm.assert_frame_equal(df_iter, df_expected) + + +def test_processing_instruction(parser): + xml = """\ + + + + + +, , ?> + + + circle + 2D + + + sphere + 3D + +""" + + df_xpath = read_xml(xml, xpath=".//shape", parser=parser) + + df_iter = read_xml_iterparse( + xml, parser=parser, iterparse={"shape": ["name", "type"]} + ) + + df_expected = DataFrame( + { + "name": ["circle", "sphere"], + "type": ["2D", "3D"], + } + ) + + tm.assert_frame_equal(df_xpath, df_expected) + tm.assert_frame_equal(df_iter, df_expected) + + def test_no_result(datapath, parser): filename = datapath("io", "data", "xml", "books.xml") with pytest.raises( diff --git a/pandas/tests/io/xml/test_xml_dtypes.py b/pandas/tests/io/xml/test_xml_dtypes.py 
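As a quick illustration of the read_xml behaviour covered just above (repeated element names disambiguated via names, plus the equivalent iterparse spec), here is a hedged sketch; the XML literal is a stand-in written for this sketch, since the original fixtures' markup did not survive extraction, and parser="etree" is chosen only to avoid the lxml dependency:

from pathlib import Path

import pandas as pd

xml = """<shapes>
  <shape type="2D">
    <name>circle</name>
    <type>curved</type>
  </shape>
  <shape type="3D">
    <name>sphere</name>
    <type>curved</type>
  </shape>
</shapes>"""

# xpath-based read: the attribute and the repeated child are renamed via names
df = pd.read_xml(
    xml, xpath=".//shape", parser="etree", names=["type_dim", "shape", "type_edge"]
)

# iterparse only reads from a file on disk, hence the throwaway file here
Path("shapes.xml").write_text(xml)
df_iter = pd.read_xml(
    "shapes.xml",
    parser="etree",
    iterparse={"shape": ["type", "name", "type"]},
    names=["type_dim", "shape", "type_edge"],
)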
index 6aa4ddfac7628..5629830767c3c 100644 --- a/pandas/tests/io/xml/test_xml_dtypes.py +++ b/pandas/tests/io/xml/test_xml_dtypes.py @@ -457,7 +457,7 @@ def test_day_first_parse_dates(parser): ) with tm.assert_produces_warning( - UserWarning, match="Parsing '31/12/2020' in DD/MM/YYYY format" + UserWarning, match="Parsing dates in DD/MM/YYYY format" ): df_result = read_xml(xml, parse_dates=["date"], parser=parser) df_iter = read_xml_iterparse( diff --git a/pandas/tests/plotting/frame/test_frame.py b/pandas/tests/plotting/frame/test_frame.py index 3ec3744e43653..538c9c2fb5059 100644 --- a/pandas/tests/plotting/frame/test_frame.py +++ b/pandas/tests/plotting/frame/test_frame.py @@ -2204,6 +2204,17 @@ def test_xlabel_ylabel_dataframe_plane_plot(self, kind, xlabel, ylabel): assert ax.get_xlabel() == (xcol if xlabel is None else xlabel) assert ax.get_ylabel() == (ycol if ylabel is None else ylabel) + @pytest.mark.parametrize("secondary_y", (False, True)) + def test_secondary_y(self, secondary_y): + ax_df = DataFrame([0]).plot( + secondary_y=secondary_y, ylabel="Y", ylim=(0, 100), yticks=[99] + ) + for ax in ax_df.figure.axes: + if ax.yaxis.get_visible(): + assert ax.get_ylabel() == "Y" + assert ax.get_ylim() == (0, 100) + assert ax.get_yticks()[0] == 99 + def _generate_4_axes_via_gridspec(): import matplotlib as mpl diff --git a/pandas/tests/plotting/test_converter.py b/pandas/tests/plotting/test_converter.py index 656969bfad703..3ec8f4bd71c2b 100644 --- a/pandas/tests/plotting/test_converter.py +++ b/pandas/tests/plotting/test_converter.py @@ -15,8 +15,10 @@ from pandas import ( Index, Period, + PeriodIndex, Series, Timestamp, + arrays, date_range, ) import pandas._testing as tm @@ -375,3 +377,32 @@ def get_view_interval(self): tdc = converter.TimeSeries_TimedeltaFormatter() monkeypatch.setattr(tdc, "axis", mock_axis()) tdc(0.0, 0) + + +@pytest.mark.parametrize("year_span", [11.25, 30, 80, 150, 400, 800, 1500, 2500, 3500]) +# The range is limited to 11.25 at the bottom by if statements in +# the _quarterly_finder() function +def test_quarterly_finder(year_span): + vmin = -1000 + vmax = vmin + year_span * 4 + span = vmax - vmin + 1 + if span < 45: # the quarterly finder is only invoked if the span is >= 45 + return + nyears = span / 4 + (min_anndef, maj_anndef) = converter._get_default_annual_spacing(nyears) + result = converter._quarterly_finder(vmin, vmax, "Q") + quarters = PeriodIndex( + arrays.PeriodArray(np.array([x[0] for x in result]), freq="Q") + ) + majors = np.array([x[1] for x in result]) + minors = np.array([x[2] for x in result]) + major_quarters = quarters[majors] + minor_quarters = quarters[minors] + check_major_years = major_quarters.year % maj_anndef == 0 + check_minor_years = minor_quarters.year % min_anndef == 0 + check_major_quarters = major_quarters.quarter == 1 + check_minor_quarters = minor_quarters.quarter == 1 + assert np.all(check_major_years) + assert np.all(check_minor_years) + assert np.all(check_major_quarters) + assert np.all(check_minor_quarters) diff --git a/pandas/tests/plotting/test_misc.py b/pandas/tests/plotting/test_misc.py index ca82c37b8a8b0..ab8e64be648d4 100644 --- a/pandas/tests/plotting/test_misc.py +++ b/pandas/tests/plotting/test_misc.py @@ -10,6 +10,7 @@ Index, Series, Timestamp, + interval_range, ) import pandas._testing as tm from pandas.tests.plotting.common import ( @@ -597,3 +598,19 @@ def test_plot_bar_axis_units_timestamp_conversion(self): _check_plot_works(df.plot) s = Series({"A": 1.0}) _check_plot_works(s.plot.bar) + + def 
test_bar_plt_xaxis_intervalrange(self): + # GH 38969 + # Ensure IntervalIndex x-axis produces a bar plot as expected + from matplotlib.text import Text + + expected = [Text(0, 0, "([0, 1],)"), Text(1, 0, "([1, 2],)")] + s = Series( + [1, 2], + index=[interval_range(0, 2, inclusive="both")], + ) + _check_plot_works(s.plot.bar) + assert all( + (a.get_text() == b.get_text()) + for a, b in zip(s.plot.bar().get_xticklabels(), expected) + ) diff --git a/pandas/tests/reductions/test_reductions.py b/pandas/tests/reductions/test_reductions.py index 9d33e52709bd2..fa53ed47dbdba 100644 --- a/pandas/tests/reductions/test_reductions.py +++ b/pandas/tests/reductions/test_reductions.py @@ -198,6 +198,18 @@ def test_numpy_reduction_with_tz_aware_dtype(self, tz_aware_fixture, func): result = getattr(np, func)(expected, expected) tm.assert_series_equal(result, expected) + def test_nan_int_timedelta_sum(self): + # GH 27185 + df = DataFrame( + { + "A": Series([1, 2, NaT], dtype="timedelta64[ns]"), + "B": Series([1, 2, np.nan], dtype="Int64"), + } + ) + expected = Series({"A": Timedelta(3), "B": 3}) + result = df.sum() + tm.assert_series_equal(result, expected) + class TestIndexReductions: # Note: the name TestIndexReductions indicates these tests @@ -923,13 +935,9 @@ def test_all_any_params(self): with tm.assert_produces_warning(FutureWarning): s.all(bool_only=True, level=0) - # GH#38810 bool_only is not implemented alone. - msg = "Series.any does not implement bool_only" - with pytest.raises(NotImplementedError, match=msg): - s.any(bool_only=True) - msg = "Series.all does not implement bool_only." - with pytest.raises(NotImplementedError, match=msg): - s.all(bool_only=True) + # GH#47500 - test bool_only works + assert s.any(bool_only=True) + assert not s.all(bool_only=True) @pytest.mark.parametrize("bool_agg_func", ["any", "all"]) @pytest.mark.parametrize("skipna", [True, False]) diff --git a/pandas/tests/reductions/test_stat_reductions.py b/pandas/tests/reductions/test_stat_reductions.py index 0a6c0ccc891bb..be40d7ca631eb 100644 --- a/pandas/tests/reductions/test_stat_reductions.py +++ b/pandas/tests/reductions/test_stat_reductions.py @@ -149,10 +149,9 @@ def _check_stat_op( with pytest.raises(ValueError, match=msg): f(string_series_, axis=1) - # Unimplemented numeric_only parameter. 
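A small sketch of the Series reduction behaviour the edits above now expect: bool_only no longer raises NotImplementedError (GH#47500), and numeric_only=True is accepted for a numeric Series whose index happens to be strings. The sample values are placeholders, not the fixtures used by the tests:

import pandas as pd

s = pd.Series([False, True, True])
assert s.any(bool_only=True)        # previously raised NotImplementedError
assert not s.all(bool_only=True)

ser = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])
ser.sum(numeric_only=True)          # only the index is string; the values are float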
if "numeric_only" in inspect.getfullargspec(f).args: - with pytest.raises(NotImplementedError, match=name): - f(string_series_, numeric_only=True) + # only the index is string; dtype is float + f(string_series_, numeric_only=True) def test_sum(self): string_series = tm.makeStringSeries().rename("series") diff --git a/pandas/tests/resample/test_resample_api.py b/pandas/tests/resample/test_resample_api.py index 5e10b9ee5277c..c5cd777962df3 100644 --- a/pandas/tests/resample/test_resample_api.py +++ b/pandas/tests/resample/test_resample_api.py @@ -859,6 +859,10 @@ def test_frame_downsample_method(method, numeric_only, expected_data): expected_index = date_range("2018-12-31", periods=1, freq="Y") df = DataFrame({"cat": ["cat_1", "cat_2"], "num": [5, 20]}, index=index) resampled = df.resample("Y") + if numeric_only is lib.no_default: + kwargs = {} + else: + kwargs = {"numeric_only": numeric_only} func = getattr(resampled, method) if numeric_only is lib.no_default and method not in ( @@ -882,9 +886,9 @@ def test_frame_downsample_method(method, numeric_only, expected_data): if isinstance(expected_data, str): klass = TypeError if method == "var" else ValueError with pytest.raises(klass, match=expected_data): - _ = func(numeric_only=numeric_only) + _ = func(**kwargs) else: - result = func(numeric_only=numeric_only) + result = func(**kwargs) expected = DataFrame(expected_data, index=expected_index) tm.assert_frame_equal(result, expected) @@ -922,8 +926,11 @@ def test_series_downsample_method(method, numeric_only, expected_data): func = getattr(resampled, method) if numeric_only and numeric_only is not lib.no_default: - with pytest.raises(NotImplementedError, match="not implement numeric_only"): - func(numeric_only=numeric_only) + with tm.assert_produces_warning( + FutureWarning, match="This will raise a TypeError" + ): + with pytest.raises(NotImplementedError, match="not implement numeric_only"): + func(numeric_only=numeric_only) elif method == "prod": with pytest.raises(TypeError, match="can't multiply sequence by non-int"): func(numeric_only=numeric_only) diff --git a/pandas/tests/resample/test_resampler_grouper.py b/pandas/tests/resample/test_resampler_grouper.py index c54d9de009940..8aff217cca5c1 100644 --- a/pandas/tests/resample/test_resampler_grouper.py +++ b/pandas/tests/resample/test_resampler_grouper.py @@ -470,3 +470,30 @@ def test_resample_groupby_agg_object_dtype_all_nan(consolidate): index=idx, ) tm.assert_frame_equal(result, expected) + + +def test_groupby_resample_with_list_of_keys(): + # GH 47362 + df = DataFrame( + data={ + "date": date_range(start="2016-01-01", periods=8), + "group": [0, 0, 0, 0, 1, 1, 1, 1], + "val": [1, 7, 5, 2, 3, 10, 5, 1], + } + ) + result = df.groupby("group").resample("2D", on="date")[["val"]].mean() + expected = DataFrame( + data={ + "val": [4.0, 3.5, 6.5, 3.0], + }, + index=Index( + data=[ + (0, Timestamp("2016-01-01")), + (0, Timestamp("2016-01-03")), + (1, Timestamp("2016-01-05")), + (1, Timestamp("2016-01-07")), + ], + name=("group", "date"), + ), + ) + tm.assert_frame_equal(result, expected) diff --git a/pandas/tests/reshape/concat/test_concat.py b/pandas/tests/reshape/concat/test_concat.py index eb44b4889afb8..4ba231523af14 100644 --- a/pandas/tests/reshape/concat/test_concat.py +++ b/pandas/tests/reshape/concat/test_concat.py @@ -15,6 +15,7 @@ InvalidIndexError, PerformanceWarning, ) +import pandas.util._test_decorators as td import pandas as pd from pandas import ( @@ -469,12 +470,12 @@ def __iter__(self): 
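The new resampler-grouper test above (GH 47362) boils down to the following standalone pattern, reproduced here as a sketch using the same data as the test:

import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.date_range(start="2016-01-01", periods=8),
        "group": [0, 0, 0, 0, 1, 1, 1, 1],
        "val": [1, 7, 5, 2, 3, 10, 5, 1],
    }
)

# selecting with a list of keys keeps the DataFrame shape of the result
result = df.groupby("group").resample("2D", on="date")[["val"]].mean()
# result is indexed by (group, date) pairs with a single "val" column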
tm.assert_frame_equal(concat(CustomIterator2(), ignore_index=True), expected) def test_concat_order(self): - # GH 17344 + # GH 17344, GH#47331 dfs = [DataFrame(index=range(3), columns=["a", 1, None])] - dfs += [DataFrame(index=range(3), columns=[None, 1, "a"]) for i in range(100)] + dfs += [DataFrame(index=range(3), columns=[None, 1, "a"]) for _ in range(100)] result = concat(dfs, sort=True).columns - expected = dfs[0].columns + expected = Index([1, "a", None]) tm.assert_index_equal(result, expected) def test_concat_different_extension_dtypes_upcasts(self): @@ -755,3 +756,50 @@ def test_concat_retain_attrs(data): df2.attrs = {1: 1} df = concat([df1, df2]) assert df.attrs[1] == 1 + + +@td.skip_array_manager_invalid_test +@pytest.mark.parametrize("df_dtype", ["float64", "int64", "datetime64[ns]"]) +@pytest.mark.parametrize("empty_dtype", [None, "float64", "object"]) +def test_concat_ignore_emtpy_object_float(empty_dtype, df_dtype): + # https://github.com/pandas-dev/pandas/issues/45637 + df = DataFrame({"foo": [1, 2], "bar": [1, 2]}, dtype=df_dtype) + empty = DataFrame(columns=["foo", "bar"], dtype=empty_dtype) + result = concat([empty, df]) + expected = df + if df_dtype == "int64": + # TODO what exact behaviour do we want for integer eventually? + if empty_dtype == "float64": + expected = df.astype("float64") + else: + expected = df.astype("object") + tm.assert_frame_equal(result, expected) + + +@td.skip_array_manager_invalid_test +@pytest.mark.parametrize("df_dtype", ["float64", "int64", "datetime64[ns]"]) +@pytest.mark.parametrize("empty_dtype", [None, "float64", "object"]) +def test_concat_ignore_all_na_object_float(empty_dtype, df_dtype): + df = DataFrame({"foo": [1, 2], "bar": [1, 2]}, dtype=df_dtype) + empty = DataFrame({"foo": [np.nan], "bar": [np.nan]}, dtype=empty_dtype) + result = concat([empty, df], ignore_index=True) + + if df_dtype == "int64": + # TODO what exact behaviour do we want for integer eventually? 
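For the concat change exercised above (GH#45637), a short sketch of the float64/object case from the parametrized test: an all-empty object-dtype frame no longer forces an upcast of the typed frame.

import pandas as pd

df = pd.DataFrame({"foo": [1.0, 2.0], "bar": [3.0, 4.0]})
empty = pd.DataFrame(columns=["foo", "bar"], dtype="object")

result = pd.concat([empty, df])
# the empty object frame is ignored for dtype purposes; float64 is preserved
assert (result.dtypes == "float64").all()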
+ if empty_dtype == "object": + df_dtype = "object" + else: + df_dtype = "float64" + expected = DataFrame({"foo": [None, 1, 2], "bar": [None, 1, 2]}, dtype=df_dtype) + tm.assert_frame_equal(result, expected) + + +@td.skip_array_manager_invalid_test +def test_concat_ignore_empty_from_reindex(): + # https://github.com/pandas-dev/pandas/pull/43507#issuecomment-920375856 + df1 = DataFrame({"a": [1], "b": [pd.Timestamp("2012-01-01")]}) + df2 = DataFrame({"a": [2]}) + + result = concat([df1, df2.reindex(columns=df1.columns)], ignore_index=True) + expected = df1 = DataFrame({"a": [1, 2], "b": [pd.Timestamp("2012-01-01"), pd.NaT]}) + tm.assert_frame_equal(result, expected) diff --git a/pandas/tests/reshape/concat/test_index.py b/pandas/tests/reshape/concat/test_index.py index b20e4bcc2256b..9993700fd0737 100644 --- a/pandas/tests/reshape/concat/test_index.py +++ b/pandas/tests/reshape/concat/test_index.py @@ -387,3 +387,70 @@ def test_concat_with_levels_with_none_keys(self, levels): msg = "levels supported only when keys is not None" with pytest.raises(ValueError, match=msg): concat([df1, df2], levels=levels) + + def test_concat_range_index_result(self): + # GH#47501 + df1 = DataFrame({"a": [1, 2]}) + df2 = DataFrame({"b": [1, 2]}) + + result = concat([df1, df2], sort=True, axis=1) + expected = DataFrame({"a": [1, 2], "b": [1, 2]}) + tm.assert_frame_equal(result, expected) + expected_index = pd.RangeIndex(0, 2) + tm.assert_index_equal(result.index, expected_index, exact=True) + + def test_concat_index_keep_dtype(self): + # GH#47329 + df1 = DataFrame([[0, 1, 1]], columns=Index([1, 2, 3], dtype="object")) + df2 = DataFrame([[0, 1]], columns=Index([1, 2], dtype="object")) + result = concat([df1, df2], ignore_index=True, join="outer", sort=True) + expected = DataFrame( + [[0, 1, 1.0], [0, 1, np.nan]], columns=Index([1, 2, 3], dtype="object") + ) + tm.assert_frame_equal(result, expected) + + def test_concat_index_keep_dtype_ea_numeric(self, any_numeric_ea_dtype): + # GH#47329 + df1 = DataFrame( + [[0, 1, 1]], columns=Index([1, 2, 3], dtype=any_numeric_ea_dtype) + ) + df2 = DataFrame([[0, 1]], columns=Index([1, 2], dtype=any_numeric_ea_dtype)) + result = concat([df1, df2], ignore_index=True, join="outer", sort=True) + expected = DataFrame( + [[0, 1, 1.0], [0, 1, np.nan]], + columns=Index([1, 2, 3], dtype=any_numeric_ea_dtype), + ) + tm.assert_frame_equal(result, expected) + + @pytest.mark.parametrize("dtype", ["Int8", "Int16", "Int32"]) + def test_concat_index_find_common(self, dtype): + # GH#47329 + df1 = DataFrame([[0, 1, 1]], columns=Index([1, 2, 3], dtype=dtype)) + df2 = DataFrame([[0, 1]], columns=Index([1, 2], dtype="Int32")) + result = concat([df1, df2], ignore_index=True, join="outer", sort=True) + expected = DataFrame( + [[0, 1, 1.0], [0, 1, np.nan]], columns=Index([1, 2, 3], dtype="Int32") + ) + tm.assert_frame_equal(result, expected) + + def test_concat_axis_1_sort_false_rangeindex(self): + # GH 46675 + s1 = Series(["a", "b", "c"]) + s2 = Series(["a", "b"]) + s3 = Series(["a", "b", "c", "d"]) + s4 = Series([], dtype=object) + result = concat( + [s1, s2, s3, s4], sort=False, join="outer", ignore_index=False, axis=1 + ) + expected = DataFrame( + [ + ["a"] * 3 + [np.nan], + ["b"] * 3 + [np.nan], + ["c", np.nan] * 2, + [np.nan] * 2 + ["d"] + [np.nan], + ], + dtype=object, + ) + tm.assert_frame_equal( + result, expected, check_index_type=True, check_column_type=True + ) diff --git a/pandas/tests/reshape/concat/test_sort.py b/pandas/tests/reshape/concat/test_sort.py index 
a789dc0f8dc83..e83880625f3d6 100644 --- a/pandas/tests/reshape/concat/test_sort.py +++ b/pandas/tests/reshape/concat/test_sort.py @@ -93,6 +93,22 @@ def test_concat_frame_with_sort_false(self): tm.assert_frame_equal(result, expected) + # GH 37937 + df1 = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=[1, 2, 3]) + df2 = DataFrame({"c": [7, 8, 9], "d": [10, 11, 12]}, index=[3, 1, 6]) + result = pd.concat([df2, df1], axis=1, sort=False) + expected = DataFrame( + [ + [7.0, 10.0, 3.0, 6.0], + [8.0, 11.0, 1.0, 4.0], + [9.0, 12.0, np.nan, np.nan], + [np.nan, np.nan, 2.0, 5.0], + ], + index=[3, 1, 6, 2], + columns=["c", "d", "a", "b"], + ) + tm.assert_frame_equal(result, expected) + def test_concat_sort_none_warning(self): # GH#41518 df = DataFrame({1: [1, 2], "a": [3, 4]}) diff --git a/pandas/tests/reshape/merge/test_merge.py b/pandas/tests/reshape/merge/test_merge.py index ccdfc3cd23790..116fb298df61d 100644 --- a/pandas/tests/reshape/merge/test_merge.py +++ b/pandas/tests/reshape/merge/test_merge.py @@ -682,7 +682,7 @@ def _constructor(self): assert isinstance(result, NotADataFrame) - def test_join_append_timedeltas(self): + def test_join_append_timedeltas(self, using_array_manager): # timedelta64 issues with join/merge # GH 5695 @@ -696,9 +696,11 @@ def test_join_append_timedeltas(self): { "d": [datetime(2013, 11, 5, 5, 56), datetime(2013, 11, 5, 5, 56)], "t": [timedelta(0, 22500), timedelta(0, 22500)], - }, - dtype=object, + } ) + if using_array_manager: + # TODO(ArrayManager) decide on exact casting rules in concat + expected = expected.astype(object) tm.assert_frame_equal(result, expected) def test_join_append_timedeltas2(self): diff --git a/pandas/tests/reshape/test_from_dummies.py b/pandas/tests/reshape/test_from_dummies.py new file mode 100644 index 0000000000000..c52331e54f95e --- /dev/null +++ b/pandas/tests/reshape/test_from_dummies.py @@ -0,0 +1,398 @@ +import numpy as np +import pytest + +from pandas import ( + DataFrame, + Series, + from_dummies, + get_dummies, +) +import pandas._testing as tm + + +@pytest.fixture +def dummies_basic(): + return DataFrame( + { + "col1_a": [1, 0, 1], + "col1_b": [0, 1, 0], + "col2_a": [0, 1, 0], + "col2_b": [1, 0, 0], + "col2_c": [0, 0, 1], + }, + ) + + +@pytest.fixture +def dummies_with_unassigned(): + return DataFrame( + { + "col1_a": [1, 0, 0], + "col1_b": [0, 1, 0], + "col2_a": [0, 1, 0], + "col2_b": [0, 0, 0], + "col2_c": [0, 0, 1], + }, + ) + + +def test_error_wrong_data_type(): + dummies = [0, 1, 0] + with pytest.raises( + TypeError, + match=r"Expected 'data' to be a 'DataFrame'; Received 'data' of type: list", + ): + from_dummies(dummies) + + +def test_error_no_prefix_contains_unassigned(): + dummies = DataFrame({"a": [1, 0, 0], "b": [0, 1, 0]}) + with pytest.raises( + ValueError, + match=( + r"Dummy DataFrame contains unassigned value\(s\); " + r"First instance in row: 2" + ), + ): + from_dummies(dummies) + + +def test_error_no_prefix_wrong_default_category_type(): + dummies = DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}) + with pytest.raises( + TypeError, + match=( + r"Expected 'default_category' to be of type 'None', 'Hashable', or 'dict'; " + r"Received 'default_category' of type: list" + ), + ): + from_dummies(dummies, default_category=["c", "d"]) + + +def test_error_no_prefix_multi_assignment(): + dummies = DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}) + with pytest.raises( + ValueError, + match=( + r"Dummy DataFrame contains multi-assignment\(s\); " + r"First instance in row: 2" + ), + ): + from_dummies(dummies) + + +def 
test_error_no_prefix_contains_nan(): + dummies = DataFrame({"a": [1, 0, 0], "b": [0, 1, np.nan]}) + with pytest.raises( + ValueError, match=r"Dummy DataFrame contains NA value in column: 'b'" + ): + from_dummies(dummies) + + +def test_error_contains_non_dummies(): + dummies = DataFrame( + {"a": [1, 6, 3, 1], "b": [0, 1, 0, 2], "c": ["c1", "c2", "c3", "c4"]} + ) + with pytest.raises( + TypeError, + match=r"Passed DataFrame contains non-dummy data", + ): + from_dummies(dummies) + + +def test_error_with_prefix_multiple_seperators(): + dummies = DataFrame( + { + "col1_a": [1, 0, 1], + "col1_b": [0, 1, 0], + "col2-a": [0, 1, 0], + "col2-b": [1, 0, 1], + }, + ) + with pytest.raises( + ValueError, + match=(r"Separator not specified for column: col2-a"), + ): + from_dummies(dummies, sep="_") + + +def test_error_with_prefix_sep_wrong_type(dummies_basic): + + with pytest.raises( + TypeError, + match=( + r"Expected 'sep' to be of type 'str' or 'None'; " + r"Received 'sep' of type: list" + ), + ): + from_dummies(dummies_basic, sep=["_"]) + + +def test_error_with_prefix_contains_unassigned(dummies_with_unassigned): + with pytest.raises( + ValueError, + match=( + r"Dummy DataFrame contains unassigned value\(s\); " + r"First instance in row: 2" + ), + ): + from_dummies(dummies_with_unassigned, sep="_") + + +def test_error_with_prefix_default_category_wrong_type(dummies_with_unassigned): + with pytest.raises( + TypeError, + match=( + r"Expected 'default_category' to be of type 'None', 'Hashable', or 'dict'; " + r"Received 'default_category' of type: list" + ), + ): + from_dummies(dummies_with_unassigned, sep="_", default_category=["x", "y"]) + + +def test_error_with_prefix_default_category_dict_not_complete( + dummies_with_unassigned, +): + with pytest.raises( + ValueError, + match=( + r"Length of 'default_category' \(1\) did not match " + r"the length of the columns being encoded \(2\)" + ), + ): + from_dummies(dummies_with_unassigned, sep="_", default_category={"col1": "x"}) + + +def test_error_with_prefix_contains_nan(dummies_basic): + dummies_basic["col2_c"][2] = np.nan + with pytest.raises( + ValueError, match=r"Dummy DataFrame contains NA value in column: 'col2_c'" + ): + from_dummies(dummies_basic, sep="_") + + +def test_error_with_prefix_contains_non_dummies(dummies_basic): + dummies_basic["col2_c"][2] = "str" + with pytest.raises(TypeError, match=r"Passed DataFrame contains non-dummy data"): + from_dummies(dummies_basic, sep="_") + + +def test_error_with_prefix_double_assignment(): + dummies = DataFrame( + { + "col1_a": [1, 0, 1], + "col1_b": [1, 1, 0], + "col2_a": [0, 1, 0], + "col2_b": [1, 0, 0], + "col2_c": [0, 0, 1], + }, + ) + with pytest.raises( + ValueError, + match=( + r"Dummy DataFrame contains multi-assignment\(s\); " + r"First instance in row: 0" + ), + ): + from_dummies(dummies, sep="_") + + +def test_roundtrip_series_to_dataframe(): + categories = Series(["a", "b", "c", "a"]) + dummies = get_dummies(categories) + result = from_dummies(dummies) + expected = DataFrame({"": ["a", "b", "c", "a"]}) + tm.assert_frame_equal(result, expected) + + +def test_roundtrip_single_column_dataframe(): + categories = DataFrame({"": ["a", "b", "c", "a"]}) + dummies = get_dummies(categories) + result = from_dummies(dummies, sep="_") + expected = categories + tm.assert_frame_equal(result, expected) + + +def test_roundtrip_with_prefixes(): + categories = DataFrame({"col1": ["a", "b", "a"], "col2": ["b", "a", "c"]}) + dummies = get_dummies(categories) + result = from_dummies(dummies, sep="_") + expected = 
categories + tm.assert_frame_equal(result, expected) + + +def test_no_prefix_string_cats_basic(): + dummies = DataFrame({"a": [1, 0, 0, 1], "b": [0, 1, 0, 0], "c": [0, 0, 1, 0]}) + expected = DataFrame({"": ["a", "b", "c", "a"]}) + result = from_dummies(dummies) + tm.assert_frame_equal(result, expected) + + +def test_no_prefix_string_cats_basic_bool_values(): + dummies = DataFrame( + { + "a": [True, False, False, True], + "b": [False, True, False, False], + "c": [False, False, True, False], + } + ) + expected = DataFrame({"": ["a", "b", "c", "a"]}) + result = from_dummies(dummies) + tm.assert_frame_equal(result, expected) + + +def test_no_prefix_string_cats_basic_mixed_bool_values(): + dummies = DataFrame( + {"a": [1, 0, 0, 1], "b": [False, True, False, False], "c": [0, 0, 1, 0]} + ) + expected = DataFrame({"": ["a", "b", "c", "a"]}) + result = from_dummies(dummies) + tm.assert_frame_equal(result, expected) + + +def test_no_prefix_int_cats_basic(): + dummies = DataFrame( + {1: [1, 0, 0, 0], 25: [0, 1, 0, 0], 2: [0, 0, 1, 0], 5: [0, 0, 0, 1]} + ) + expected = DataFrame({"": [1, 25, 2, 5]}, dtype="object") + result = from_dummies(dummies) + tm.assert_frame_equal(result, expected) + + +def test_no_prefix_float_cats_basic(): + dummies = DataFrame( + {1.0: [1, 0, 0, 0], 25.0: [0, 1, 0, 0], 2.5: [0, 0, 1, 0], 5.84: [0, 0, 0, 1]} + ) + expected = DataFrame({"": [1.0, 25.0, 2.5, 5.84]}, dtype="object") + result = from_dummies(dummies) + tm.assert_frame_equal(result, expected) + + +def test_no_prefix_mixed_cats_basic(): + dummies = DataFrame( + { + 1.23: [1, 0, 0, 0, 0], + "c": [0, 1, 0, 0, 0], + 2: [0, 0, 1, 0, 0], + False: [0, 0, 0, 1, 0], + None: [0, 0, 0, 0, 1], + } + ) + expected = DataFrame({"": [1.23, "c", 2, False, None]}, dtype="object") + result = from_dummies(dummies) + tm.assert_frame_equal(result, expected) + + +def test_no_prefix_string_cats_contains_get_dummies_NaN_column(): + dummies = DataFrame({"a": [1, 0, 0], "b": [0, 1, 0], "NaN": [0, 0, 1]}) + expected = DataFrame({"": ["a", "b", "NaN"]}) + result = from_dummies(dummies) + tm.assert_frame_equal(result, expected) + + +@pytest.mark.parametrize( + "default_category, expected", + [ + pytest.param( + "c", + DataFrame({"": ["a", "b", "c"]}), + id="default_category is a str", + ), + pytest.param( + 1, + DataFrame({"": ["a", "b", 1]}), + id="default_category is a int", + ), + pytest.param( + 1.25, + DataFrame({"": ["a", "b", 1.25]}), + id="default_category is a float", + ), + pytest.param( + 0, + DataFrame({"": ["a", "b", 0]}), + id="default_category is a 0", + ), + pytest.param( + False, + DataFrame({"": ["a", "b", False]}), + id="default_category is a bool", + ), + pytest.param( + (1, 2), + DataFrame({"": ["a", "b", (1, 2)]}), + id="default_category is a tuple", + ), + ], +) +def test_no_prefix_string_cats_default_category(default_category, expected): + dummies = DataFrame({"a": [1, 0, 0], "b": [0, 1, 0]}) + result = from_dummies(dummies, default_category=default_category) + tm.assert_frame_equal(result, expected) + + +def test_with_prefix_basic(dummies_basic): + expected = DataFrame({"col1": ["a", "b", "a"], "col2": ["b", "a", "c"]}) + result = from_dummies(dummies_basic, sep="_") + tm.assert_frame_equal(result, expected) + + +def test_with_prefix_contains_get_dummies_NaN_column(): + dummies = DataFrame( + { + "col1_a": [1, 0, 0], + "col1_b": [0, 1, 0], + "col1_NaN": [0, 0, 1], + "col2_a": [0, 1, 0], + "col2_b": [0, 0, 0], + "col2_c": [0, 0, 1], + "col2_NaN": [1, 0, 0], + }, + ) + expected = DataFrame({"col1": ["a", "b", "NaN"], 
"col2": ["NaN", "a", "c"]}) + result = from_dummies(dummies, sep="_") + tm.assert_frame_equal(result, expected) + + +@pytest.mark.parametrize( + "default_category, expected", + [ + pytest.param( + "x", + DataFrame({"col1": ["a", "b", "x"], "col2": ["x", "a", "c"]}), + id="default_category is a str", + ), + pytest.param( + 0, + DataFrame({"col1": ["a", "b", 0], "col2": [0, "a", "c"]}), + id="default_category is a 0", + ), + pytest.param( + False, + DataFrame({"col1": ["a", "b", False], "col2": [False, "a", "c"]}), + id="default_category is a False", + ), + pytest.param( + {"col2": 1, "col1": 2.5}, + DataFrame({"col1": ["a", "b", 2.5], "col2": [1, "a", "c"]}), + id="default_category is a dict with int and float values", + ), + pytest.param( + {"col2": None, "col1": False}, + DataFrame({"col1": ["a", "b", False], "col2": [None, "a", "c"]}), + id="default_category is a dict with bool and None values", + ), + pytest.param( + {"col2": (1, 2), "col1": [1.25, False]}, + DataFrame({"col1": ["a", "b", [1.25, False]], "col2": [(1, 2), "a", "c"]}), + id="default_category is a dict with list and tuple values", + ), + ], +) +def test_with_prefix_default_category( + dummies_with_unassigned, default_category, expected +): + result = from_dummies( + dummies_with_unassigned, sep="_", default_category=default_category + ) + tm.assert_frame_equal(result, expected) diff --git a/pandas/tests/reshape/test_melt.py b/pandas/tests/reshape/test_melt.py index 4fbfee6f829ba..2013b3484ebff 100644 --- a/pandas/tests/reshape/test_melt.py +++ b/pandas/tests/reshape/test_melt.py @@ -1086,3 +1086,27 @@ def test_warn_of_column_name_value(self): with tm.assert_produces_warning(FutureWarning): result = df.melt(id_vars="value") tm.assert_frame_equal(result, expected) + + @pytest.mark.parametrize("dtype", ["O", "string"]) + def test_missing_stubname(self, dtype): + # GH46044 + df = DataFrame({"id": ["1", "2"], "a-1": [100, 200], "a-2": [300, 400]}) + df = df.astype({"id": dtype}) + result = wide_to_long( + df, + stubnames=["a", "b"], + i="id", + j="num", + sep="-", + ) + index = pd.Index( + [("1", 1), ("2", 1), ("1", 2), ("2", 2)], + name=("id", "num"), + ) + expected = DataFrame( + {"a": [100, 200, 300, 400], "b": [np.nan] * 4}, + index=index, + ) + new_level = expected.index.levels[0].astype(dtype) + expected.index = expected.index.set_levels(new_level, level=0) + tm.assert_frame_equal(result, expected) diff --git a/pandas/tests/scalar/timedelta/test_arithmetic.py b/pandas/tests/scalar/timedelta/test_arithmetic.py index 614245ec7a93e..f3b84388b0f70 100644 --- a/pandas/tests/scalar/timedelta/test_arithmetic.py +++ b/pandas/tests/scalar/timedelta/test_arithmetic.py @@ -318,6 +318,26 @@ def test_td_add_sub_dt64_ndarray(self): tm.assert_numpy_array_equal(-td + other, expected) tm.assert_numpy_array_equal(other - td, expected) + def test_td_add_sub_ndarray_0d(self): + td = Timedelta("1 day") + other = np.array(td.asm8) + + result = td + other + assert isinstance(result, Timedelta) + assert result == 2 * td + + result = other + td + assert isinstance(result, Timedelta) + assert result == 2 * td + + result = other - td + assert isinstance(result, Timedelta) + assert result == 0 * td + + result = td - other + assert isinstance(result, Timedelta) + assert result == 0 * td + class TestTimedeltaMultiplicationDivision: """ @@ -395,6 +415,20 @@ def test_td_mul_numeric_ndarray(self): result = other * td tm.assert_numpy_array_equal(result, expected) + def test_td_mul_numeric_ndarray_0d(self): + td = Timedelta("1 day") + other = np.array(2) + 
assert other.ndim == 0 + expected = Timedelta("2 days") + + res = td * other + assert type(res) is Timedelta + assert res == expected + + res = other * td + assert type(res) is Timedelta + assert res == expected + def test_td_mul_td64_ndarray_invalid(self): td = Timedelta("1 day") other = np.array([Timedelta("2 Days").to_timedelta64()]) @@ -484,6 +518,14 @@ def test_td_div_td64_ndarray(self): result = other / td tm.assert_numpy_array_equal(result, expected * 4) + def test_td_div_ndarray_0d(self): + td = Timedelta("1 day") + + other = np.array(1) + res = td / other + assert isinstance(res, Timedelta) + assert res == td + # --------------------------------------------------------------- # Timedelta.__rdiv__ @@ -539,6 +581,13 @@ def test_td_rdiv_ndarray(self): with pytest.raises(TypeError, match=msg): arr / td + def test_td_rdiv_ndarray_0d(self): + td = Timedelta(10, unit="d") + + arr = np.array(td.asm8) + + assert arr / td == 1 + # --------------------------------------------------------------- # Timedelta.__floordiv__ diff --git a/pandas/tests/scalar/timedelta/test_timedelta.py b/pandas/tests/scalar/timedelta/test_timedelta.py index 90c090d816c9d..0dd3a88670ece 100644 --- a/pandas/tests/scalar/timedelta/test_timedelta.py +++ b/pandas/tests/scalar/timedelta/test_timedelta.py @@ -13,6 +13,8 @@ NaT, iNaT, ) +from pandas._libs.tslibs.dtypes import NpyDatetimeUnit +from pandas.compat import IS64 from pandas.errors import OutOfBoundsTimedelta import pandas as pd @@ -33,7 +35,7 @@ def test_as_unit(self): res = td._as_unit("us") assert res.value == td.value // 1000 - assert res._reso == td._reso - 1 + assert res._reso == NpyDatetimeUnit.NPY_FR_us.value rt = res._as_unit("ns") assert rt.value == td.value @@ -41,7 +43,7 @@ def test_as_unit(self): res = td._as_unit("ms") assert res.value == td.value // 1_000_000 - assert res._reso == td._reso - 2 + assert res._reso == NpyDatetimeUnit.NPY_FR_ms.value rt = res._as_unit("ns") assert rt.value == td.value @@ -49,7 +51,7 @@ def test_as_unit(self): res = td._as_unit("s") assert res.value == td.value // 1_000_000_000 - assert res._reso == td._reso - 3 + assert res._reso == NpyDatetimeUnit.NPY_FR_s.value rt = res._as_unit("ns") assert rt.value == td.value @@ -58,7 +60,7 @@ def test_as_unit(self): def test_as_unit_overflows(self): # microsecond that would be just out of bounds for nano us = 9223372800000000 - td = Timedelta._from_value_and_reso(us, 9) + td = Timedelta._from_value_and_reso(us, NpyDatetimeUnit.NPY_FR_us.value) msg = "Cannot cast 106752 days 00:00:00 to unit='ns' without overflow" with pytest.raises(OutOfBoundsTimedelta, match=msg): @@ -66,7 +68,7 @@ def test_as_unit_overflows(self): res = td._as_unit("ms") assert res.value == us // 1000 - assert res._reso == 8 + assert res._reso == NpyDatetimeUnit.NPY_FR_ms.value def test_as_unit_rounding(self): td = Timedelta(microseconds=1500) @@ -75,7 +77,7 @@ def test_as_unit_rounding(self): expected = Timedelta(milliseconds=1) assert res == expected - assert res._reso == 8 + assert res._reso == NpyDatetimeUnit.NPY_FR_ms.value assert res.value == 1 with pytest.raises(ValueError, match="Cannot losslessly convert units"): @@ -83,15 +85,15 @@ def test_as_unit_rounding(self): def test_as_unit_non_nano(self): # case where we are going neither to nor from nano - td = Timedelta(days=1)._as_unit("D") + td = Timedelta(days=1)._as_unit("ms") assert td.days == 1 - assert td.value == 1 + assert td.value == 86_400_000 assert td.components.days == 1 assert td._d == 1 assert td.total_seconds() == 86400 - res = 
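Condensing the behaviour these new 0-dimensional-array tests pin down into one sketch: a 0-d numpy array on either side of Timedelta arithmetic behaves like the scalar it wraps.

import numpy as np
import pandas as pd

td = pd.Timedelta("1 day")

assert td + np.array(td.asm8) == pd.Timedelta("2 days")   # addition/subtraction
assert td - np.array(td.asm8) == pd.Timedelta(0)
assert td * np.array(2) == pd.Timedelta("2 days")         # multiplication
assert td / np.array(1) == td                              # division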
td._as_unit("h") - assert res.value == 24 + res = td._as_unit("us") + assert res.value == 86_400_000_000 assert res.components.days == 1 assert res.components.hours == 0 assert res._d == 1 @@ -100,18 +102,23 @@ def test_as_unit_non_nano(self): class TestNonNano: - @pytest.fixture(params=[7, 8, 9]) - def unit(self, request): - # 7, 8, 9 correspond to second, millisecond, and microsecond, respectively + @pytest.fixture(params=["s", "ms", "us"]) + def unit_str(self, request): return request.param + @pytest.fixture + def unit(self, unit_str): + # 7, 8, 9 correspond to second, millisecond, and microsecond, respectively + attr = f"NPY_FR_{unit_str}" + return getattr(NpyDatetimeUnit, attr).value + @pytest.fixture def val(self, unit): # microsecond that would be just out of bounds for nano us = 9223372800000000 - if unit == 9: + if unit == NpyDatetimeUnit.NPY_FR_us.value: value = us - elif unit == 8: + elif unit == NpyDatetimeUnit.NPY_FR_ms.value: value = us // 1000 else: value = us // 1_000_000 @@ -165,13 +172,135 @@ def test_to_timedelta64(self, td, unit): assert isinstance(res, np.timedelta64) assert res.view("i8") == td.value - if unit == 7: + if unit == NpyDatetimeUnit.NPY_FR_s.value: assert res.dtype == "m8[s]" - elif unit == 8: + elif unit == NpyDatetimeUnit.NPY_FR_ms.value: assert res.dtype == "m8[ms]" - elif unit == 9: + elif unit == NpyDatetimeUnit.NPY_FR_us.value: assert res.dtype == "m8[us]" + def test_truediv_timedeltalike(self, td): + assert td / td == 1 + assert (2.5 * td) / td == 2.5 + + other = Timedelta(td.value) + msg = "with mismatched resolutions are not supported" + with pytest.raises(ValueError, match=msg): + td / other + + with pytest.raises(ValueError, match=msg): + # __rtruediv__ + other.to_pytimedelta() / td + + def test_truediv_numeric(self, td): + assert td / np.nan is NaT + + res = td / 2 + assert res.value == td.value / 2 + assert res._reso == td._reso + + res = td / 2.0 + assert res.value == td.value / 2 + assert res._reso == td._reso + + def test_floordiv_timedeltalike(self, td): + assert td // td == 1 + assert (2.5 * td) // td == 2 + + other = Timedelta(td.value) + msg = "with mismatched resolutions are not supported" + with pytest.raises(ValueError, match=msg): + td // other + + with pytest.raises(ValueError, match=msg): + # __rfloordiv__ + other.to_pytimedelta() // td + + def test_floordiv_numeric(self, td): + assert td // np.nan is NaT + + res = td // 2 + assert res.value == td.value // 2 + assert res._reso == td._reso + + res = td // 2.0 + assert res.value == td.value // 2 + assert res._reso == td._reso + + assert td // np.array(np.nan) is NaT + + res = td // np.array(2) + assert res.value == td.value // 2 + assert res._reso == td._reso + + res = td // np.array(2.0) + assert res.value == td.value // 2 + assert res._reso == td._reso + + def test_addsub_mismatched_reso(self, td): + other = Timedelta(days=1) # can losslessly convert to other resos + + result = td + other + assert result._reso == td._reso + assert result.days == td.days + 1 + + result = other + td + assert result._reso == td._reso + assert result.days == td.days + 1 + + result = td - other + assert result._reso == td._reso + assert result.days == td.days - 1 + + result = other - td + assert result._reso == td._reso + assert result.days == 1 - td.days + + other2 = Timedelta(500) # can't cast losslessly + + msg = ( + "Timedelta addition/subtraction with mismatched resolutions is " + "not allowed when casting to the lower resolution would require " + "lossy rounding" + ) + with 
pytest.raises(ValueError, match=msg): + td + other2 + with pytest.raises(ValueError, match=msg): + other2 + td + with pytest.raises(ValueError, match=msg): + td - other2 + with pytest.raises(ValueError, match=msg): + other2 - td + + def test_min(self, td): + assert td.min <= td + assert td.min._reso == td._reso + assert td.min.value == NaT.value + 1 + + def test_max(self, td): + assert td.max >= td + assert td.max._reso == td._reso + assert td.max.value == np.iinfo(np.int64).max + + def test_resolution(self, td): + expected = Timedelta._from_value_and_reso(1, td._reso) + result = td.resolution + assert result == expected + assert result._reso == expected._reso + + +def test_timedelta_class_min_max_resolution(): + # when accessed on the class (as opposed to an instance), we default + # to nanoseconds + assert Timedelta.min == Timedelta(NaT.value + 1) + assert Timedelta.min._reso == NpyDatetimeUnit.NPY_FR_ns.value + + assert Timedelta.max == Timedelta(np.iinfo(np.int64).max) + assert Timedelta.max._reso == NpyDatetimeUnit.NPY_FR_ns.value + + assert Timedelta.resolution == Timedelta(1) + assert Timedelta.resolution._reso == NpyDatetimeUnit.NPY_FR_ns.value + class TestTimedeltaUnaryOps: def test_invert(self): @@ -562,6 +691,7 @@ def test_round_implementation_bounds(self): with pytest.raises(OverflowError, match=msg): Timedelta.max.ceil("s") + @pytest.mark.xfail(not IS64, reason="Failing on 32 bit build", strict=False) @given(val=st.integers(min_value=iNaT + 1, max_value=lib.i8max)) @pytest.mark.parametrize( "method", [Timedelta.round, Timedelta.floor, Timedelta.ceil] @@ -602,16 +732,21 @@ def test_round_sanity(self, val, method): assert np.abs((res - td).value) < nanos assert res.value % nanos == 0 - def test_contains(self): - # Checking for any NaT-like objects - # GH 13603 - td = to_timedelta(range(5), unit="d") + offsets.Hour(1) - for v in [NaT, None, float("nan"), np.nan]: - assert not (v in td) + @pytest.mark.parametrize("unit", ["ns", "us", "ms", "s"]) + def test_round_non_nano(self, unit): + td = Timedelta("1 days 02:34:57")._as_unit(unit) + + res = td.round("min") + assert res == Timedelta("1 days 02:35:00") + assert res._reso == td._reso + + res = td.floor("min") + assert res == Timedelta("1 days 02:34:00") + assert res._reso == td._reso - td = to_timedelta([NaT]) - for v in [NaT, None, float("nan"), np.nan]: - assert v in td + res = td.ceil("min") + assert res == Timedelta("1 days 02:35:00") + assert res._reso == td._reso def test_identity(self): diff --git a/pandas/tests/scalar/timestamp/test_timestamp.py b/pandas/tests/scalar/timestamp/test_timestamp.py index 89e5ce2241e42..67ad152dcab30 100644 --- a/pandas/tests/scalar/timestamp/test_timestamp.py +++ b/pandas/tests/scalar/timestamp/test_timestamp.py @@ -18,10 +18,14 @@ utc, ) +from pandas._libs.tslibs.dtypes import NpyDatetimeUnit from pandas._libs.tslibs.timezones import ( dateutil_gettz as gettz, get_timezone, + maybe_get_tz, + tz_compare, ) +from pandas.errors import OutOfBoundsDatetime import pandas.util._test_decorators as td from pandas import ( @@ -709,15 +713,20 @@ def dt64(self, reso): def ts(self, dt64): return Timestamp._from_dt64(dt64) + @pytest.fixture + def ts_tz(self, ts, tz_aware_fixture): + tz = maybe_get_tz(tz_aware_fixture) + return Timestamp._from_value_and_reso(ts.value, ts._reso, tz) + def test_non_nano_construction(self, dt64, ts, reso): assert ts.value == dt64.view("i8") if reso == "s": - assert ts._reso == 7 + assert ts._reso == NpyDatetimeUnit.NPY_FR_s.value elif reso == "ms": - assert ts._reso == 8 + 
assert ts._reso == NpyDatetimeUnit.NPY_FR_ms.value elif reso == "us": - assert ts._reso == 9 + assert ts._reso == NpyDatetimeUnit.NPY_FR_us.value def test_non_nano_fields(self, dt64, ts): alt = Timestamp(dt64) @@ -761,6 +770,16 @@ def test_month_name(self, dt64, ts): alt = Timestamp(dt64) assert ts.month_name() == alt.month_name() + def test_tz_convert(self, ts): + ts = Timestamp._from_value_and_reso(ts.value, ts._reso, utc) + + tz = pytz.timezone("US/Pacific") + result = ts.tz_convert(tz) + + assert isinstance(result, Timestamp) + assert result._reso == ts._reso + assert tz_compare(result.tz, tz) + def test_repr(self, dt64, ts): alt = Timestamp(dt64) @@ -823,11 +842,20 @@ def test_cmp_cross_reso_reversed_dt64(self): assert other.asm8 < ts - def test_pickle(self, ts): + def test_pickle(self, ts, tz_aware_fixture): + tz = tz_aware_fixture + tz = maybe_get_tz(tz) + ts = Timestamp._from_value_and_reso(ts.value, ts._reso, tz) rt = tm.round_trip_pickle(ts) assert rt._reso == ts._reso assert rt == ts + def test_normalize(self, dt64, ts): + alt = Timestamp(dt64) + result = ts.normalize() + assert result._reso == ts._reso + assert result == alt.normalize() + def test_asm8(self, dt64, ts): rt = ts.asm8 assert rt == dt64 @@ -850,3 +878,243 @@ def test_timestamp(self, dt64, ts): def test_to_period(self, dt64, ts): alt = Timestamp(dt64) assert ts.to_period("D") == alt.to_period("D") + + @pytest.mark.parametrize( + "td", [timedelta(days=4), Timedelta(days=4), np.timedelta64(4, "D")] + ) + def test_addsub_timedeltalike_non_nano(self, dt64, ts, td): + + result = ts - td + expected = Timestamp(dt64) - td + assert isinstance(result, Timestamp) + assert result._reso == ts._reso + assert result == expected + + result = ts + td + expected = Timestamp(dt64) + td + assert isinstance(result, Timestamp) + assert result._reso == ts._reso + assert result == expected + + result = td + ts + expected = td + Timestamp(dt64) + assert isinstance(result, Timestamp) + assert result._reso == ts._reso + assert result == expected + + @pytest.mark.xfail(reason="tz_localize not yet implemented for non-nano") + def test_addsub_offset(self, ts_tz): + # specifically non-Tick offset + off = offsets.YearBegin(1) + result = ts_tz + off + + assert isinstance(result, Timestamp) + assert result._reso == ts_tz._reso + # If ts_tz is ever on the last day of the year, the year would be + # incremented by one + assert result.year == ts_tz.year + assert result.day == 31 + assert result.month == 12 + assert tz_compare(result.tz, ts_tz.tz) + + result = ts_tz - off + + assert isinstance(result, Timestamp) + assert result._reso == ts_tz._reso + assert result.year == ts_tz.year - 1 + assert result.day == 31 + assert result.month == 12 + assert tz_compare(result.tz, ts_tz.tz) + + def test_sub_datetimelike_mismatched_reso(self, ts_tz): + # case with non-lossy rounding + ts = ts_tz + + # choose a unit for `other` that doesn't match ts_tz's; + # this construction ensures we get cases with other._reso < ts._reso + # and cases with other._reso > ts._reso + unit = { + NpyDatetimeUnit.NPY_FR_us.value: "ms", + NpyDatetimeUnit.NPY_FR_ms.value: "s", + NpyDatetimeUnit.NPY_FR_s.value: "us", + }[ts._reso] + other = ts._as_unit(unit) + assert other._reso != ts._reso + + result = ts - other + assert isinstance(result, Timedelta) + assert result.value == 0 + assert result._reso == min(ts._reso, other._reso) + + result = other - ts + assert isinstance(result, Timedelta) + assert result.value == 0 + assert result._reso == min(ts._reso, other._reso) + + msg = 
"Timestamp subtraction with mismatched resolutions" + if ts._reso < other._reso: + # Case where rounding is lossy + other2 = other + Timedelta._from_value_and_reso(1, other._reso) + with pytest.raises(ValueError, match=msg): + ts - other2 + with pytest.raises(ValueError, match=msg): + other2 - ts + else: + ts2 = ts + Timedelta._from_value_and_reso(1, ts._reso) + with pytest.raises(ValueError, match=msg): + ts2 - other + with pytest.raises(ValueError, match=msg): + other - ts2 + + def test_sub_timedeltalike_mismatched_reso(self, ts_tz): + # case with non-lossy rounding + ts = ts_tz + + # choose a unit for `other` that doesn't match ts_tz's; + # this construction ensures we get cases with other._reso < ts._reso + # and cases with other._reso > ts._reso + unit = { + NpyDatetimeUnit.NPY_FR_us.value: "ms", + NpyDatetimeUnit.NPY_FR_ms.value: "s", + NpyDatetimeUnit.NPY_FR_s.value: "us", + }[ts._reso] + other = Timedelta(0)._as_unit(unit) + assert other._reso != ts._reso + + result = ts + other + assert isinstance(result, Timestamp) + assert result == ts + assert result._reso == min(ts._reso, other._reso) + + result = other + ts + assert isinstance(result, Timestamp) + assert result == ts + assert result._reso == min(ts._reso, other._reso) + + msg = "Timestamp addition with mismatched resolutions" + if ts._reso < other._reso: + # Case where rounding is lossy + other2 = other + Timedelta._from_value_and_reso(1, other._reso) + with pytest.raises(ValueError, match=msg): + ts + other2 + with pytest.raises(ValueError, match=msg): + other2 + ts + else: + ts2 = ts + Timedelta._from_value_and_reso(1, ts._reso) + with pytest.raises(ValueError, match=msg): + ts2 + other + with pytest.raises(ValueError, match=msg): + other + ts2 + + msg = "Addition between Timestamp and Timedelta with mismatched resolutions" + with pytest.raises(ValueError, match=msg): + # With a mismatched td64 as opposed to Timedelta + ts + np.timedelta64(1, "ns") + + def test_min(self, ts): + assert ts.min <= ts + assert ts.min._reso == ts._reso + assert ts.min.value == NaT.value + 1 + + def test_max(self, ts): + assert ts.max >= ts + assert ts.max._reso == ts._reso + assert ts.max.value == np.iinfo(np.int64).max + + def test_resolution(self, ts): + expected = Timedelta._from_value_and_reso(1, ts._reso) + result = ts.resolution + assert result == expected + assert result._reso == expected._reso + + +def test_timestamp_class_min_max_resolution(): + # when accessed on the class (as opposed to an instance), we default + # to nanoseconds + assert Timestamp.min == Timestamp(NaT.value + 1) + assert Timestamp.min._reso == NpyDatetimeUnit.NPY_FR_ns.value + + assert Timestamp.max == Timestamp(np.iinfo(np.int64).max) + assert Timestamp.max._reso == NpyDatetimeUnit.NPY_FR_ns.value + + assert Timestamp.resolution == Timedelta(1) + assert Timestamp.resolution._reso == NpyDatetimeUnit.NPY_FR_ns.value + + +class TestAsUnit: + def test_as_unit(self): + ts = Timestamp("1970-01-01") + + assert ts._as_unit("ns") is ts + + res = ts._as_unit("us") + assert res.value == ts.value // 1000 + assert res._reso == NpyDatetimeUnit.NPY_FR_us.value + + rt = res._as_unit("ns") + assert rt.value == ts.value + assert rt._reso == ts._reso + + res = ts._as_unit("ms") + assert res.value == ts.value // 1_000_000 + assert res._reso == NpyDatetimeUnit.NPY_FR_ms.value + + rt = res._as_unit("ns") + assert rt.value == ts.value + assert rt._reso == ts._reso + + res = ts._as_unit("s") + assert res.value == ts.value // 1_000_000_000 + assert res._reso == 
NpyDatetimeUnit.NPY_FR_s.value + + rt = res._as_unit("ns") + assert rt.value == ts.value + assert rt._reso == ts._reso + + def test_as_unit_overflows(self): + # microsecond that would be just out of bounds for nano + us = 9223372800000000 + ts = Timestamp._from_value_and_reso(us, NpyDatetimeUnit.NPY_FR_us.value, None) + + msg = "Cannot cast 2262-04-12 00:00:00 to unit='ns' without overflow" + with pytest.raises(OutOfBoundsDatetime, match=msg): + ts._as_unit("ns") + + res = ts._as_unit("ms") + assert res.value == us // 1000 + assert res._reso == NpyDatetimeUnit.NPY_FR_ms.value + + def test_as_unit_rounding(self): + ts = Timestamp(1_500_000) # i.e. 1500 microseconds + res = ts._as_unit("ms") + + expected = Timestamp(1_000_000) # i.e. 1 millisecond + assert res == expected + + assert res._reso == NpyDatetimeUnit.NPY_FR_ms.value + assert res.value == 1 + + with pytest.raises(ValueError, match="Cannot losslessly convert units"): + ts._as_unit("ms", round_ok=False) + + def test_as_unit_non_nano(self): + # case where we are going neither to nor from nano + ts = Timestamp("1970-01-02")._as_unit("ms") + assert ts.year == 1970 + assert ts.month == 1 + assert ts.day == 2 + assert ts.hour == ts.minute == ts.second == ts.microsecond == ts.nanosecond == 0 + + res = ts._as_unit("s") + assert res.value == 24 * 3600 + assert res.year == 1970 + assert res.month == 1 + assert res.day == 2 + assert ( + res.hour + == res.minute + == res.second + == res.microsecond + == res.nanosecond + == 0 + ) diff --git a/pandas/tests/scalar/timestamp/test_timezones.py b/pandas/tests/scalar/timestamp/test_timezones.py index a7f7393fb3263..874575fa9ad4c 100644 --- a/pandas/tests/scalar/timestamp/test_timezones.py +++ b/pandas/tests/scalar/timestamp/test_timezones.py @@ -20,6 +20,7 @@ ) from pandas._libs.tslibs import timezones +from pandas._libs.tslibs.dtypes import NpyDatetimeUnit from pandas.errors import OutOfBoundsDatetime import pandas.util._test_decorators as td @@ -57,10 +58,11 @@ def test_tz_localize_pushes_out_of_bounds(self): with pytest.raises(OutOfBoundsDatetime, match=msg): Timestamp.max.tz_localize("US/Pacific") - def test_tz_localize_ambiguous_bool(self): + @pytest.mark.parametrize("unit", ["ns", "us", "ms", "s"]) + def test_tz_localize_ambiguous_bool(self, unit): # make sure that we are correctly accepting bool values as ambiguous # GH#14402 - ts = Timestamp("2015-11-01 01:00:03") + ts = Timestamp("2015-11-01 01:00:03")._as_unit(unit) expected0 = Timestamp("2015-11-01 01:00:03-0500", tz="US/Central") expected1 = Timestamp("2015-11-01 01:00:03-0600", tz="US/Central") @@ -70,9 +72,11 @@ def test_tz_localize_ambiguous_bool(self): result = ts.tz_localize("US/Central", ambiguous=True) assert result == expected0 + assert result._reso == getattr(NpyDatetimeUnit, f"NPY_FR_{unit}").value result = ts.tz_localize("US/Central", ambiguous=False) assert result == expected1 + assert result._reso == getattr(NpyDatetimeUnit, f"NPY_FR_{unit}").value def test_tz_localize_ambiguous(self): ts = Timestamp("2014-11-02 01:00") @@ -245,17 +249,28 @@ def test_timestamp_tz_localize(self, tz): ], ) @pytest.mark.parametrize("tz_type", ["", "dateutil/"]) + @pytest.mark.parametrize("unit", ["ns", "us", "ms", "s"]) def test_timestamp_tz_localize_nonexistent_shift( - self, start_ts, tz, end_ts, shift, tz_type + self, start_ts, tz, end_ts, shift, tz_type, unit ): # GH 8917, 24466 tz = tz_type + tz if isinstance(shift, str): shift = "shift_" + shift - ts = Timestamp(start_ts) + ts = Timestamp(start_ts)._as_unit(unit) result = ts.tz_localize(tz, 
nonexistent=shift) expected = Timestamp(end_ts).tz_localize(tz) - assert result == expected + + if unit == "us": + assert result == expected.replace(nanosecond=0) + elif unit == "ms": + micros = expected.microsecond - expected.microsecond % 1000 + assert result == expected.replace(microsecond=micros, nanosecond=0) + elif unit == "s": + assert result == expected.replace(microsecond=0, nanosecond=0) + else: + assert result == expected + assert result._reso == getattr(NpyDatetimeUnit, f"NPY_FR_{unit}").value @pytest.mark.parametrize("offset", [-1, 1]) @pytest.mark.parametrize("tz_type", ["", "dateutil/"]) @@ -268,16 +283,18 @@ def test_timestamp_tz_localize_nonexistent_shift_invalid(self, offset, tz_type): ts.tz_localize(tz, nonexistent=timedelta(seconds=offset)) @pytest.mark.parametrize("tz", ["Europe/Warsaw", "dateutil/Europe/Warsaw"]) - def test_timestamp_tz_localize_nonexistent_NaT(self, tz): + @pytest.mark.parametrize("unit", ["ns", "us", "ms", "s"]) + def test_timestamp_tz_localize_nonexistent_NaT(self, tz, unit): # GH 8917 - ts = Timestamp("2015-03-29 02:20:00") + ts = Timestamp("2015-03-29 02:20:00")._as_unit(unit) result = ts.tz_localize(tz, nonexistent="NaT") assert result is NaT @pytest.mark.parametrize("tz", ["Europe/Warsaw", "dateutil/Europe/Warsaw"]) - def test_timestamp_tz_localize_nonexistent_raise(self, tz): + @pytest.mark.parametrize("unit", ["ns", "us", "ms", "s"]) + def test_timestamp_tz_localize_nonexistent_raise(self, tz, unit): # GH 8917 - ts = Timestamp("2015-03-29 02:20:00") + ts = Timestamp("2015-03-29 02:20:00")._as_unit(unit) msg = "2015-03-29 02:20:00" with pytest.raises(pytz.NonExistentTimeError, match=msg): ts.tz_localize(tz, nonexistent="raise") diff --git a/pandas/tests/scalar/timestamp/test_unary_ops.py b/pandas/tests/scalar/timestamp/test_unary_ops.py index 5f07cabd51ca1..cc11037660ad2 100644 --- a/pandas/tests/scalar/timestamp/test_unary_ops.py +++ b/pandas/tests/scalar/timestamp/test_unary_ops.py @@ -19,7 +19,9 @@ iNaT, to_offset, ) +from pandas._libs.tslibs.dtypes import NpyDatetimeUnit from pandas._libs.tslibs.period import INVALID_FREQ_ERR_MSG +from pandas.compat import IS64 import pandas.util._test_decorators as td import pandas._testing as tm @@ -147,31 +149,42 @@ def test_round_minute_freq(self, test_input, freq, expected, rounder): result = func(freq) assert result == expected - def test_ceil(self): - dt = Timestamp("20130101 09:10:11") + @pytest.mark.parametrize("unit", ["ns", "us", "ms", "s"]) + def test_ceil(self, unit): + dt = Timestamp("20130101 09:10:11")._as_unit(unit) result = dt.ceil("D") expected = Timestamp("20130102") assert result == expected + assert result._reso == dt._reso - def test_floor(self): - dt = Timestamp("20130101 09:10:11") + @pytest.mark.parametrize("unit", ["ns", "us", "ms", "s"]) + def test_floor(self, unit): + dt = Timestamp("20130101 09:10:11")._as_unit(unit) result = dt.floor("D") expected = Timestamp("20130101") assert result == expected + assert result._reso == dt._reso @pytest.mark.parametrize("method", ["ceil", "round", "floor"]) - def test_round_dst_border_ambiguous(self, method): + @pytest.mark.parametrize( + "unit", + ["ns", "us", "ms", "s"], + ) + def test_round_dst_border_ambiguous(self, method, unit): # GH 18946 round near "fall back" DST ts = Timestamp("2017-10-29 00:00:00", tz="UTC").tz_convert("Europe/Madrid") + ts = ts._as_unit(unit) # result = getattr(ts, method)("H", ambiguous=True) assert result == ts + assert result._reso == getattr(NpyDatetimeUnit, f"NPY_FR_{unit}").value result = getattr(ts, 
method)("H", ambiguous=False) expected = Timestamp("2017-10-29 01:00:00", tz="UTC").tz_convert( "Europe/Madrid" ) assert result == expected + assert result._reso == getattr(NpyDatetimeUnit, f"NPY_FR_{unit}").value result = getattr(ts, method)("H", ambiguous="NaT") assert result is NaT @@ -188,12 +201,17 @@ def test_round_dst_border_ambiguous(self, method): ["floor", "2018-03-11 03:01:00-0500", "2H"], ], ) - def test_round_dst_border_nonexistent(self, method, ts_str, freq): + @pytest.mark.parametrize( + "unit", + ["ns", "us", "ms", "s"], + ) + def test_round_dst_border_nonexistent(self, method, ts_str, freq, unit): # GH 23324 round near "spring forward" DST - ts = Timestamp(ts_str, tz="America/Chicago") + ts = Timestamp(ts_str, tz="America/Chicago")._as_unit(unit) result = getattr(ts, method)(freq, nonexistent="shift_forward") expected = Timestamp("2018-03-11 03:00:00", tz="America/Chicago") assert result == expected + assert result._reso == getattr(NpyDatetimeUnit, f"NPY_FR_{unit}").value result = getattr(ts, method)(freq, nonexistent="NaT") assert result is NaT @@ -280,6 +298,7 @@ def test_round_implementation_bounds(self): with pytest.raises(OverflowError, match=msg): Timestamp.max.ceil("s") + @pytest.mark.xfail(not IS64, reason="Failing on 32 bit build", strict=False) @given(val=st.integers(iNaT + 1, lib.i8max)) @pytest.mark.parametrize( "method", [Timestamp.round, Timestamp.floor, Timestamp.ceil] @@ -338,6 +357,16 @@ def checker(res, ts, nanos): # -------------------------------------------------------------- # Timestamp.replace + def test_replace_non_nano(self): + ts = Timestamp._from_value_and_reso( + 91514880000000000, NpyDatetimeUnit.NPY_FR_us.value, None + ) + assert ts.to_pydatetime() == datetime(4869, 12, 28) + + result = ts.replace(year=4900) + assert result._reso == ts._reso + assert result.to_pydatetime() == datetime(4900, 12, 28) + def test_replace_naive(self): # GH#14621, GH#7825 ts = Timestamp("2016-01-01 09:00:00") @@ -455,35 +484,41 @@ def test_replace_across_dst(self, tz, normalize): ts2b = normalize(ts2) assert ts2 == ts2b - def test_replace_dst_border(self): + @pytest.mark.parametrize("unit", ["ns", "us", "ms", "s"]) + def test_replace_dst_border(self, unit): # Gh 7825 - t = Timestamp("2013-11-3", tz="America/Chicago") + t = Timestamp("2013-11-3", tz="America/Chicago")._as_unit(unit) result = t.replace(hour=3) expected = Timestamp("2013-11-3 03:00:00", tz="America/Chicago") assert result == expected + assert result._reso == getattr(NpyDatetimeUnit, f"NPY_FR_{unit}").value @pytest.mark.parametrize("fold", [0, 1]) @pytest.mark.parametrize("tz", ["dateutil/Europe/London", "Europe/London"]) - def test_replace_dst_fold(self, fold, tz): + @pytest.mark.parametrize("unit", ["ns", "us", "ms", "s"]) + def test_replace_dst_fold(self, fold, tz, unit): # GH 25017 d = datetime(2019, 10, 27, 2, 30) - ts = Timestamp(d, tz=tz) + ts = Timestamp(d, tz=tz)._as_unit(unit) result = ts.replace(hour=1, fold=fold) expected = Timestamp(datetime(2019, 10, 27, 1, 30)).tz_localize( tz, ambiguous=not fold ) assert result == expected + assert result._reso == getattr(NpyDatetimeUnit, f"NPY_FR_{unit}").value # -------------------------------------------------------------- # Timestamp.normalize @pytest.mark.parametrize("arg", ["2013-11-30", "2013-11-30 12:00:00"]) - def test_normalize(self, tz_naive_fixture, arg): + @pytest.mark.parametrize("unit", ["ns", "us", "ms", "s"]) + def test_normalize(self, tz_naive_fixture, arg, unit): tz = tz_naive_fixture - ts = Timestamp(arg, tz=tz) + ts = Timestamp(arg, 
tz=tz)._as_unit(unit) result = ts.normalize() expected = Timestamp("2013-11-30", tz=tz) assert result == expected + assert result._reso == getattr(NpyDatetimeUnit, f"NPY_FR_{unit}").value def test_normalize_pre_epoch_dates(self): # GH: 36294 diff --git a/pandas/tests/series/indexing/test_indexing.py b/pandas/tests/series/indexing/test_indexing.py index 3a8e14576a55d..2f4fffe57593f 100644 --- a/pandas/tests/series/indexing/test_indexing.py +++ b/pandas/tests/series/indexing/test_indexing.py @@ -5,7 +5,10 @@ import numpy as np import pytest +from pandas.errors import IndexingError + from pandas import ( + NA, DataFrame, IndexSlice, MultiIndex, @@ -330,6 +333,22 @@ def test_loc_setitem_all_false_indexer(): tm.assert_series_equal(ser, expected) +def test_loc_boolean_indexer_non_matching_index(): + # GH#46551 + ser = Series([1]) + result = ser.loc[Series([NA, False], dtype="boolean")] + expected = Series([], dtype="int64") + tm.assert_series_equal(result, expected) + + +def test_loc_boolean_indexer_miss_matching_index(): + # GH#46551 + ser = Series([1]) + indexer = Series([NA, False], dtype="boolean", index=[1, 2]) + with pytest.raises(IndexingError, match="Unalignable"): + ser.loc[indexer] + + class TestDeprecatedIndexers: @pytest.mark.parametrize("key", [{1}, {1: 1}]) def test_getitem_dict_and_set_deprecated(self, key): diff --git a/pandas/tests/series/indexing/test_setitem.py b/pandas/tests/series/indexing/test_setitem.py index e2a5517066ad9..244ad9884b82a 100644 --- a/pandas/tests/series/indexing/test_setitem.py +++ b/pandas/tests/series/indexing/test_setitem.py @@ -6,6 +6,8 @@ import numpy as np import pytest +from pandas.errors import IndexingError + from pandas.core.dtypes.common import is_list_like from pandas import ( @@ -30,7 +32,6 @@ timedelta_range, ) import pandas._testing as tm -from pandas.core.indexing import IndexingError from pandas.tseries.offsets import BDay @@ -534,6 +535,33 @@ def test_setitem_not_contained(self, string_series): expected = concat([string_series, app]) tm.assert_series_equal(ser, expected) + def test_setitem_keep_precision(self, any_numeric_ea_dtype): + # GH#32346 + ser = Series([1, 2], dtype=any_numeric_ea_dtype) + ser[2] = 10 + expected = Series([1, 2, 10], dtype=any_numeric_ea_dtype) + tm.assert_series_equal(ser, expected) + + @pytest.mark.parametrize("indexer", [1, 2]) + @pytest.mark.parametrize( + "na, target_na, dtype, target_dtype", + [ + (NA, NA, "Int64", "Int64"), + (NA, np.nan, "int64", "float64"), + (NaT, NaT, "int64", "object"), + (np.nan, NA, "Int64", "Int64"), + (np.nan, NA, "Float64", "Float64"), + (np.nan, np.nan, "int64", "float64"), + ], + ) + def test_setitem_enlarge_with_na(self, na, target_na, dtype, target_dtype, indexer): + # GH#32346 + ser = Series([1, 2], dtype=dtype) + ser[indexer] = na + expected_values = [1, target_na] if indexer == 1 else [1, 2, target_na] + expected = Series(expected_values, dtype=target_dtype) + tm.assert_series_equal(ser, expected) + def test_setitem_scalar_into_readonly_backing_data(): # GH#14359: test that you cannot mutate a read only buffer diff --git a/pandas/tests/series/test_api.py b/pandas/tests/series/test_api.py index 9a0c3fd5e9fed..0aab381d6e076 100644 --- a/pandas/tests/series/test_api.py +++ b/pandas/tests/series/test_api.py @@ -209,3 +209,97 @@ def test_series_iteritems_deprecated(self): ser = Series([1]) with tm.assert_produces_warning(FutureWarning): next(ser.iteritems()) + + @pytest.mark.parametrize( + "kernel, has_numeric_only", + [ + ("skew", True), + ("var", True), + ("all", False), + 
("prod", True), + ("any", False), + ("idxmin", False), + ("quantile", False), + ("idxmax", False), + ("min", True), + ("sem", True), + ("mean", True), + ("nunique", False), + ("max", True), + ("sum", True), + ("count", False), + ("median", True), + ("std", True), + ("backfill", False), + ("rank", True), + ("pct_change", False), + ("cummax", False), + ("shift", False), + ("diff", False), + ("cumsum", False), + ("cummin", False), + ("cumprod", False), + ("fillna", False), + ("ffill", False), + ("pad", False), + ("bfill", False), + ("sample", False), + ("tail", False), + ("take", False), + ("head", False), + ("cov", False), + ("corr", False), + ], + ) + @pytest.mark.parametrize("dtype", [bool, int, float, object]) + def test_numeric_only(self, kernel, has_numeric_only, dtype): + # GH#47500 + ser = Series([0, 1, 1], dtype=dtype) + if kernel == "corrwith": + args = (ser,) + elif kernel == "corr": + args = (ser,) + elif kernel == "cov": + args = (ser,) + elif kernel == "nth": + args = (0,) + elif kernel == "fillna": + args = (True,) + elif kernel == "fillna": + args = ("ffill",) + elif kernel == "take": + args = ([0],) + elif kernel == "quantile": + args = (0.5,) + else: + args = () + method = getattr(ser, kernel) + if not has_numeric_only: + msg = ( + "(got an unexpected keyword argument 'numeric_only'" + "|too many arguments passed in)" + ) + with pytest.raises(TypeError, match=msg): + method(*args, numeric_only=True) + elif dtype is object: + if kernel == "rank": + msg = "Calling Series.rank with numeric_only=True and dtype object" + with tm.assert_produces_warning(FutureWarning, match=msg): + method(*args, numeric_only=True) + else: + warn_msg = ( + f"Calling Series.{kernel} with numeric_only=True and dtype object" + ) + err_msg = f"Series.{kernel} does not implement numeric_only" + with tm.assert_produces_warning(FutureWarning, match=warn_msg): + with pytest.raises(NotImplementedError, match=err_msg): + method(*args, numeric_only=True) + else: + result = method(*args, numeric_only=True) + expected = method(*args, numeric_only=False) + if isinstance(expected, Series): + # transformer + tm.assert_series_equal(result, expected) + else: + # reducer + assert result == expected diff --git a/pandas/tests/series/test_constructors.py b/pandas/tests/series/test_constructors.py index 3dce22a06c1b2..4e4ee4fd12d5f 100644 --- a/pandas/tests/series/test_constructors.py +++ b/pandas/tests/series/test_constructors.py @@ -745,6 +745,25 @@ def test_constructor_signed_int_overflow_deprecation(self): expected = Series([1, 200, 50], dtype="uint8") tm.assert_series_equal(ser, expected) + @pytest.mark.parametrize( + "values", + [ + np.array([1], dtype=np.uint16), + np.array([1], dtype=np.uint32), + np.array([1], dtype=np.uint64), + [np.uint16(1)], + [np.uint32(1)], + [np.uint64(1)], + ], + ) + def test_constructor_numpy_uints(self, values): + # GH#47294 + value = values[0] + result = Series(values) + + assert result[0].dtype == value.dtype + assert result[0] == value + def test_constructor_unsigned_dtype_overflow(self, any_unsigned_int_numpy_dtype): # see gh-15832 msg = "Trying to coerce negative values to unsigned integers" @@ -1181,8 +1200,8 @@ def test_constructor_infer_interval(self, data_constructor): @pytest.mark.parametrize( "data_constructor", [list, np.array], ids=["list", "ndarray[object]"] ) - def test_constructor_interval_mixed_closed(self, data_constructor): - # GH 23563: mixed closed results in object dtype (not interval dtype) + def test_constructor_interval_mixed_inclusive(self, data_constructor): 
+ # GH 23563: mixed inclusive results in object dtype (not interval dtype) data = [Interval(0, 1, inclusive="both"), Interval(0, 2, inclusive="neither")] result = Series(data_constructor(data)) assert result.dtype == object diff --git a/pandas/tests/series/test_ufunc.py b/pandas/tests/series/test_ufunc.py index 13af94feaf744..624496ea26a81 100644 --- a/pandas/tests/series/test_ufunc.py +++ b/pandas/tests/series/test_ufunc.py @@ -439,3 +439,14 @@ def test_outer(): with pytest.raises(NotImplementedError, match=tm.EMPTY_STRING_PATTERN): np.subtract.outer(s, o) + + +def test_np_matmul(): + # GH26650 + df1 = pd.DataFrame(data=[[-1, 1, 10]]) + df2 = pd.DataFrame(data=[-1, 1, 10]) + expected_result = pd.DataFrame(data=[102]) + tm.assert_frame_equal( + expected_result, + np.matmul(df1, df2), + ) diff --git a/pandas/tests/strings/test_api.py b/pandas/tests/strings/test_api.py index 974ecc152f17b..d76ed65be9e1b 100644 --- a/pandas/tests/strings/test_api.py +++ b/pandas/tests/strings/test_api.py @@ -132,7 +132,7 @@ def test_api_for_categorical(any_string_method, any_string_dtype, request): any_string_dtype == "string" and get_option("string_storage") == "pyarrow" ): # unsupported operand type(s) for +: 'ArrowStringArray' and 'str' - mark = pytest.mark.xfail(raises=TypeError, reason="Not Implemented") + mark = pytest.mark.xfail(raises=NotImplementedError, reason="Not Implemented") request.node.add_marker(mark) s = Series(list("aabb"), dtype=any_string_dtype) diff --git a/pandas/tests/strings/test_cat.py b/pandas/tests/strings/test_cat.py index 8abbc59343e78..4decdff8063a8 100644 --- a/pandas/tests/strings/test_cat.py +++ b/pandas/tests/strings/test_cat.py @@ -376,3 +376,22 @@ def test_cat_different_classes(klass): result = s.str.cat(klass(["x", "y", "z"])) expected = Series(["ax", "by", "cz"]) tm.assert_series_equal(result, expected) + + +def test_cat_on_series_dot_str(): + # GH 28277 + # Test future warning of `Series.str.__iter__` + ps = Series(["AbC", "de", "FGHI", "j", "kLLLm"]) + with tm.assert_produces_warning(FutureWarning): + ps.str.cat(others=ps.str) + # TODO(2.0): The following code can be uncommented + # when `Series.str.__iter__` is removed. + + # message = re.escape( + # "others must be Series, Index, DataFrame, np.ndarray " + # "or list-like (either containing only strings or " + # "containing only objects of type Series/Index/" + # "np.ndarray[1-dim])" + # ) + # with pytest.raises(TypeError, match=message): + # ps.str.cat(others=ps.str) diff --git a/pandas/tests/strings/test_split_partition.py b/pandas/tests/strings/test_split_partition.py index 74458c13e8df7..7d73414a672c8 100644 --- a/pandas/tests/strings/test_split_partition.py +++ b/pandas/tests/strings/test_split_partition.py @@ -130,6 +130,23 @@ def test_rsplit_max_number(any_string_dtype): tm.assert_series_equal(result, exp) +@pytest.mark.parametrize("method", ["split", "rsplit"]) +def test_posargs_deprecation(method): + # GH 47423; Deprecate passing n as positional. 
+ s = Series(["foo,bar,lorep"]) + + msg = ( + f"In a future version of pandas all arguments of StringMethods.{method} " + "except for the argument 'pat' will be keyword-only" + ) + + with tm.assert_produces_warning(FutureWarning, match=msg): + result = getattr(s.str, method)(",", 3) + + expected = Series([["foo", "bar", "lorep"]]) + tm.assert_series_equal(result, expected) + + def test_split_blank_string(any_string_dtype): # expand blank split GH 20067 values = Series([""], name="test", dtype=any_string_dtype) diff --git a/pandas/tests/strings/test_strings.py b/pandas/tests/strings/test_strings.py index aa31a5505b866..0e55676699c21 100644 --- a/pandas/tests/strings/test_strings.py +++ b/pandas/tests/strings/test_strings.py @@ -803,3 +803,28 @@ def test_str_accessor_in_apply_func(): expected = Series(["A/D", "B/E", "C/F"]) result = df.apply(lambda f: "/".join(f.str.upper()), axis=1) tm.assert_series_equal(result, expected) + + +def test_zfill(): + # https://github.com/pandas-dev/pandas/issues/20868 + value = Series(["-1", "1", "1000", 10, np.nan]) + expected = Series(["-01", "001", "1000", np.nan, np.nan]) + tm.assert_series_equal(value.str.zfill(3), expected) + + value = Series(["-2", "+5"]) + expected = Series(["-0002", "+0005"]) + tm.assert_series_equal(value.str.zfill(5), expected) + + +def test_zfill_with_non_integer_argument(): + value = Series(["-2", "+5"]) + wid = "a" + msg = f"width must be of integer type, not {type(wid).__name__}" + with pytest.raises(TypeError, match=msg): + value.str.zfill(wid) + + +def test_zfill_with_leading_sign(): + value = Series(["-cat", "-1", "+dog"]) + expected = Series(["-0cat", "-0001", "+0dog"]) + tm.assert_series_equal(value.str.zfill(5), expected) diff --git a/pandas/tests/test_algos.py b/pandas/tests/test_algos.py index 85a240a3e825d..def63c552e059 100644 --- a/pandas/tests/test_algos.py +++ b/pandas/tests/test_algos.py @@ -75,11 +75,11 @@ def test_factorize(self, index_or_series_obj, sort): tm.assert_numpy_array_equal(result_codes, expected_codes) tm.assert_index_equal(result_uniques, expected_uniques, exact=True) - def test_series_factorize_na_sentinel_none(self): + def test_series_factorize_use_na_sentinel_false(self): # GH#35667 values = np.array([1, 2, 1, np.nan]) ser = Series(values) - codes, uniques = ser.factorize(na_sentinel=None) + codes, uniques = ser.factorize(use_na_sentinel=False) expected_codes = np.array([0, 1, 0, 2], dtype=np.intp) expected_uniques = Index([1.0, 2.0, np.nan]) @@ -87,6 +87,20 @@ def test_series_factorize_na_sentinel_none(self): tm.assert_numpy_array_equal(codes, expected_codes) tm.assert_index_equal(uniques, expected_uniques) + @pytest.mark.parametrize("na_sentinel", [None, -1, -10]) + def test_depr_na_sentinel(self, na_sentinel, index_or_series_obj): + # GH#46910 + if na_sentinel is None: + msg = "Specifying `na_sentinel=None` is deprecated" + elif na_sentinel == -1: + msg = "Specifying `na_sentinel=-1` is deprecated" + else: + msg = "Specifying the specific value to use for `na_sentinel` is deprecated" + with tm.assert_produces_warning(FutureWarning, match=msg): + pd.factorize(index_or_series_obj, na_sentinel=na_sentinel) + with tm.assert_produces_warning(FutureWarning, match=msg): + index_or_series_obj.factorize(na_sentinel=na_sentinel) + def test_basic(self): codes, uniques = algos.factorize(["a", "b", "b", "a", "a", "c", "c", "c"]) @@ -418,7 +432,12 @@ def test_parametrized_factorize_na_value(self, data, na_value): ids=["numpy_array", "extension_array"], ) def 
test_factorize_na_sentinel(self, sort, na_sentinel, data, uniques): - codes, uniques = algos.factorize(data, sort=sort, na_sentinel=na_sentinel) + if na_sentinel == -1: + msg = "Specifying `na_sentinel=-1` is deprecated" + else: + msg = "the specific value to use for `na_sentinel` is deprecated" + with tm.assert_produces_warning(FutureWarning, match=msg): + codes, uniques = algos.factorize(data, sort=sort, na_sentinel=na_sentinel) if sort: expected_codes = np.array([1, 0, na_sentinel, 1], dtype=np.intp) expected_uniques = algos.safe_sort(uniques) @@ -446,10 +465,10 @@ def test_factorize_na_sentinel(self, sort, na_sentinel, data, uniques): ), ], ) - def test_object_factorize_na_sentinel_none( + def test_object_factorize_use_na_sentinel_false( self, data, expected_codes, expected_uniques ): - codes, uniques = algos.factorize(data, na_sentinel=None) + codes, uniques = algos.factorize(data, use_na_sentinel=False) tm.assert_numpy_array_equal(uniques, expected_uniques) tm.assert_numpy_array_equal(codes, expected_codes) @@ -469,10 +488,10 @@ def test_object_factorize_na_sentinel_none( ), ], ) - def test_int_factorize_na_sentinel_none( + def test_int_factorize_use_na_sentinel_false( self, data, expected_codes, expected_uniques ): - codes, uniques = algos.factorize(data, na_sentinel=None) + codes, uniques = algos.factorize(data, use_na_sentinel=False) tm.assert_numpy_array_equal(uniques, expected_uniques) tm.assert_numpy_array_equal(codes, expected_codes) @@ -1089,6 +1108,13 @@ def test_isin_float_df_string_search(self): expected_false = DataFrame({"values": [False, False]}) tm.assert_frame_equal(result, expected_false) + def test_isin_unsigned_dtype(self): + # GH#46485 + ser = Series([1378774140726870442], dtype=np.uint64) + result = ser.isin([1378774140726870528]) + expected = Series(False) + tm.assert_series_equal(result, expected) + class TestValueCounts: def test_value_counts(self): diff --git a/pandas/tests/test_errors.py b/pandas/tests/test_errors.py index 827c5767c514f..187d5399f5985 100644 --- a/pandas/tests/test_errors.py +++ b/pandas/tests/test_errors.py @@ -1,6 +1,9 @@ import pytest -from pandas.errors import AbstractMethodError +from pandas.errors import ( + AbstractMethodError, + UndefinedVariableError, +) import pandas as pd @@ -24,6 +27,14 @@ "SettingWithCopyError", "SettingWithCopyWarning", "NumExprClobberingError", + "IndexingError", + "PyperclipException", + "CSSWarning", + "ClosedFileError", + "PossibleDataLossError", + "IncompatibilityWarning", + "AttributeConflictWarning", + "DatabaseError", ], ) def test_exception_importable(exc): @@ -48,6 +59,24 @@ def test_catch_oob(): pd.Timestamp("15000101") +@pytest.mark.parametrize( + "is_local", + [ + True, + False, + ], +) +def test_catch_undefined_variable_error(is_local): + variable_name = "x" + if is_local: + msg = f"local variable '{variable_name}' is not defined" + else: + msg = f"name '{variable_name}' is not defined" + + with pytest.raises(UndefinedVariableError, match=msg): + raise UndefinedVariableError(variable_name, is_local) + + class Foo: @classmethod def classmethod(cls): diff --git a/pandas/tests/test_nanops.py b/pandas/tests/test_nanops.py index 005f7b088271f..f46d5c8e2590e 100644 --- a/pandas/tests/test_nanops.py +++ b/pandas/tests/test_nanops.py @@ -1120,3 +1120,25 @@ def test_check_below_min_count__large_shape(min_count, expected_result): shape = (2244367, 1253) result = nanops.check_below_min_count(shape, mask=None, min_count=min_count) assert result == expected_result + + +@pytest.mark.parametrize("func", 
["nanmean", "nansum"]) +@pytest.mark.parametrize( + "dtype", + [ + np.uint8, + np.uint16, + np.uint32, + np.uint64, + np.int8, + np.int16, + np.int32, + np.int64, + np.float16, + np.float32, + np.float64, + ], +) +def test_check_bottleneck_disallow(dtype, func): + # GH 42878 bottleneck sometimes produces unreliable results for mean and sum + assert not nanops._bn_ok_dtype(dtype, func) diff --git a/pandas/tests/tools/test_to_datetime.py b/pandas/tests/tools/test_to_datetime.py index 4c34b0c0aec0a..afa06bf1a79af 100644 --- a/pandas/tests/tools/test_to_datetime.py +++ b/pandas/tests/tools/test_to_datetime.py @@ -284,6 +284,8 @@ def test_to_datetime_format_microsecond(self, cache): "%m/%d/%Y %H:%M:%S", Timestamp("2010-01-10 13:56:01"), ], + # The 3 tests below are locale-dependent. + # They pass, except when the machine locale is zh_CN or it_IT . pytest.param( "01/10/2010 08:14 PM", "%m/%d/%Y %I:%M %p", @@ -291,6 +293,7 @@ def test_to_datetime_format_microsecond(self, cache): marks=pytest.mark.xfail( locale.getlocale()[0] in ("zh_CN", "it_IT"), reason="fail on a CI build with LC_ALL=zh_CN.utf8/it_IT.utf8", + strict=False, ), ), pytest.param( @@ -300,6 +303,7 @@ def test_to_datetime_format_microsecond(self, cache): marks=pytest.mark.xfail( locale.getlocale()[0] in ("zh_CN", "it_IT"), reason="fail on a CI build with LC_ALL=zh_CN.utf8/it_IT.utf8", + strict=False, ), ), pytest.param( @@ -309,6 +313,7 @@ def test_to_datetime_format_microsecond(self, cache): marks=pytest.mark.xfail( locale.getlocale()[0] in ("zh_CN", "it_IT"), reason="fail on a CI build with LC_ALL=zh_CN.utf8/it_IT.utf8", + strict=False, ), ), ], @@ -1964,8 +1969,9 @@ def test_dayfirst(self, cache): def test_dayfirst_warnings_valid_input(self): # GH 12585 warning_msg_day_first = ( - "Parsing '31/12/2014' in DD/MM/YYYY format. Provide " - "format or specify infer_datetime_format=True for consistent parsing." + r"Parsing dates in DD/MM/YYYY format when dayfirst=False \(the default\) " + "was specified. This may lead to inconsistently parsed dates! Specify a " + "format to ensure consistent parsing." ) # CASE 1: valid input @@ -2001,12 +2007,14 @@ def test_dayfirst_warnings_invalid_input(self): # cannot consistently process with single format # warnings *always* raised warning_msg_day_first = ( - "Parsing '31/12/2014' in DD/MM/YYYY format. Provide " - "format or specify infer_datetime_format=True for consistent parsing." + r"Parsing dates in DD/MM/YYYY format when dayfirst=False \(the default\) " + "was specified. This may lead to inconsistently parsed dates! Specify a " + "format to ensure consistent parsing." ) warning_msg_month_first = ( - "Parsing '03/30/2011' in MM/DD/YYYY format. Provide " - "format or specify infer_datetime_format=True for consistent parsing." + r"Parsing dates in MM/DD/YYYY format when dayfirst=True " + "was specified. This may lead to inconsistently parsed dates! Specify a " + "format to ensure consistent parsing." ) arr = ["31/12/2014", "03/30/2011"] diff --git a/pandas/tests/tools/test_to_time.py b/pandas/tests/tools/test_to_time.py index 7983944d4384d..a8316e0f3970c 100644 --- a/pandas/tests/tools/test_to_time.py +++ b/pandas/tests/tools/test_to_time.py @@ -9,9 +9,12 @@ from pandas.core.tools.datetimes import to_time as to_time_alias from pandas.core.tools.times import to_time +# The tests marked with this are locale-dependent. +# They pass, except when the machine locale is zh_CN or it_IT. 
fails_on_non_english = pytest.mark.xfail( locale.getlocale()[0] in ("zh_CN", "it_IT"), reason="fail on a CI build with LC_ALL=zh_CN.utf8/it_IT.utf8", + strict=False, ) diff --git a/pandas/tests/tseries/offsets/conftest.py b/pandas/tests/tseries/offsets/conftest.py index df68c98dca43f..72f5c4a519a3a 100644 --- a/pandas/tests/tseries/offsets/conftest.py +++ b/pandas/tests/tseries/offsets/conftest.py @@ -5,7 +5,11 @@ import pandas.tseries.offsets as offsets -@pytest.fixture(params=[getattr(offsets, o) for o in offsets.__all__]) +@pytest.fixture( + params=[ + getattr(offsets, o) for o in offsets.__all__ if o not in ("Tick", "BaseOffset") + ] +) def offset_types(request): """ Fixture for all the datetime offsets available for a time series. diff --git a/pandas/tests/tseries/offsets/test_offsets.py b/pandas/tests/tseries/offsets/test_offsets.py index cf5cbe6e2af66..49661fe1ec8ce 100644 --- a/pandas/tests/tseries/offsets/test_offsets.py +++ b/pandas/tests/tseries/offsets/test_offsets.py @@ -556,10 +556,6 @@ def test_add_dt64_ndarray_non_nano(self, offset_types, unit, request): # check that the result with non-nano matches nano off = self._get_offset(offset_types) - if type(off) is DateOffset: - mark = pytest.mark.xfail(reason="non-nano not implemented") - request.node.add_marker(mark) - dti = date_range("2016-01-01", periods=35, freq="D") arr = dti._data._ndarray.astype(f"M8[{unit}]") diff --git a/pandas/tests/tslibs/test_api.py b/pandas/tests/tslibs/test_api.py index d61a2fca33f56..2d195fad83644 100644 --- a/pandas/tests/tslibs/test_api.py +++ b/pandas/tests/tslibs/test_api.py @@ -54,6 +54,8 @@ def test_namespace(): "astype_overflowsafe", "get_unit_from_dtype", "periods_per_day", + "periods_per_second", + "is_supported_unit", ] expected = set(submodules + api) diff --git a/pandas/tests/tslibs/test_np_datetime.py b/pandas/tests/tslibs/test_np_datetime.py index cc09f0fc77039..02edf1a093877 100644 --- a/pandas/tests/tslibs/test_np_datetime.py +++ b/pandas/tests/tslibs/test_np_datetime.py @@ -208,3 +208,15 @@ def test_astype_overflowsafe_td64(self): result = astype_overflowsafe(arr, dtype2) expected = arr.astype(dtype2) tm.assert_numpy_array_equal(result, expected) + + def test_astype_overflowsafe_disallow_rounding(self): + arr = np.array([-1500, 1500], dtype="M8[ns]") + dtype = np.dtype("M8[us]") + + msg = "Cannot losslessly cast '-1500 ns' to us" + with pytest.raises(ValueError, match=msg): + astype_overflowsafe(arr, dtype, round_ok=False) + + result = astype_overflowsafe(arr, dtype, round_ok=True) + expected = arr.astype(dtype) + tm.assert_numpy_array_equal(result, expected) diff --git a/pandas/tests/tslibs/test_resolution.py b/pandas/tests/tslibs/test_resolution.py index 15f4a9d032e5c..7b2268f16a85f 100644 --- a/pandas/tests/tslibs/test_resolution.py +++ b/pandas/tests/tslibs/test_resolution.py @@ -1,9 +1,11 @@ import numpy as np +import pytz from pandas._libs.tslibs import ( Resolution, get_resolution, ) +from pandas._libs.tslibs.dtypes import NpyDatetimeUnit def test_get_resolution_nano(): @@ -11,3 +13,12 @@ def test_get_resolution_nano(): arr = np.array([1], dtype=np.int64) res = get_resolution(arr) assert res == Resolution.RESO_NS + + +def test_get_resolution_non_nano_data(): + arr = np.array([1], dtype=np.int64) + res = get_resolution(arr, None, NpyDatetimeUnit.NPY_FR_us.value) + assert res == Resolution.RESO_US + + res = get_resolution(arr, pytz.UTC, NpyDatetimeUnit.NPY_FR_us.value) + assert res == Resolution.RESO_US diff --git a/pandas/tests/tslibs/test_timedeltas.py 
b/pandas/tests/tslibs/test_timedeltas.py index 661bb113e9549..36ca02d32dbbd 100644 --- a/pandas/tests/tslibs/test_timedeltas.py +++ b/pandas/tests/tslibs/test_timedeltas.py @@ -56,14 +56,19 @@ def test_delta_to_nanoseconds_error(): def test_delta_to_nanoseconds_td64_MY_raises(): + msg = ( + "delta_to_nanoseconds does not support Y or M units, " + "as their duration in nanoseconds is ambiguous" + ) + td = np.timedelta64(1234, "Y") - with pytest.raises(ValueError, match="0, 10"): + with pytest.raises(ValueError, match=msg): delta_to_nanoseconds(td) td = np.timedelta64(1234, "M") - with pytest.raises(ValueError, match="1, 10"): + with pytest.raises(ValueError, match=msg): delta_to_nanoseconds(td) @@ -127,5 +132,6 @@ def test_ints_to_pytimedelta_unsupported(unit): with pytest.raises(NotImplementedError, match=r"\d{1,2}"): ints_to_pytimedelta(arr, box=False) - with pytest.raises(NotImplementedError, match=r"\d{1,2}"): + msg = "Only resolutions 's', 'ms', 'us', 'ns' are supported" + with pytest.raises(NotImplementedError, match=msg): ints_to_pytimedelta(arr, box=True) diff --git a/pandas/tests/tslibs/test_tzconversion.py b/pandas/tests/tslibs/test_tzconversion.py new file mode 100644 index 0000000000000..c1a56ffb71b02 --- /dev/null +++ b/pandas/tests/tslibs/test_tzconversion.py @@ -0,0 +1,23 @@ +import numpy as np +import pytest +import pytz + +from pandas._libs.tslibs.tzconversion import tz_localize_to_utc + + +class TestTZLocalizeToUTC: + def test_tz_localize_to_utc_ambiguous_infer(self): + # val is a timestamp that is ambiguous when localized to US/Eastern + val = 1_320_541_200_000_000_000 + vals = np.array([val, val - 1, val], dtype=np.int64) + + with pytest.raises(pytz.AmbiguousTimeError, match="2011-11-06 01:00:00"): + tz_localize_to_utc(vals, pytz.timezone("US/Eastern"), ambiguous="infer") + + with pytest.raises(pytz.AmbiguousTimeError, match="are no repeated times"): + tz_localize_to_utc(vals[:1], pytz.timezone("US/Eastern"), ambiguous="infer") + + vals[1] += 1 + msg = "There are 2 dst switches when there should only be 1" + with pytest.raises(pytz.AmbiguousTimeError, match=msg): + tz_localize_to_utc(vals, pytz.timezone("US/Eastern"), ambiguous="infer") diff --git a/pandas/tests/util/test_assert_attr_equal.py b/pandas/tests/util/test_assert_attr_equal.py index 115ef58e085cc..bbbb0bf2172b1 100644 --- a/pandas/tests/util/test_assert_attr_equal.py +++ b/pandas/tests/util/test_assert_attr_equal.py @@ -10,7 +10,7 @@ def test_assert_attr_equal(nulls_fixture): obj = SimpleNamespace() obj.na_value = nulls_fixture - assert tm.assert_attr_equal("na_value", obj, obj) + tm.assert_attr_equal("na_value", obj, obj) def test_assert_attr_equal_different_nulls(nulls_fixture, nulls_fixture2): @@ -21,13 +21,13 @@ def test_assert_attr_equal_different_nulls(nulls_fixture, nulls_fixture2): obj2.na_value = nulls_fixture2 if nulls_fixture is nulls_fixture2: - assert tm.assert_attr_equal("na_value", obj, obj2) + tm.assert_attr_equal("na_value", obj, obj2) elif is_float(nulls_fixture) and is_float(nulls_fixture2): # we consider float("nan") and np.float64("nan") to be equivalent - assert tm.assert_attr_equal("na_value", obj, obj2) + tm.assert_attr_equal("na_value", obj, obj2) elif type(nulls_fixture) is type(nulls_fixture2): # e.g. 
Decimal("NaN") - assert tm.assert_attr_equal("na_value", obj, obj2) + tm.assert_attr_equal("na_value", obj, obj2) else: with pytest.raises(AssertionError, match='"na_value" are different'): tm.assert_attr_equal("na_value", obj, obj2) diff --git a/pandas/tests/util/test_assert_index_equal.py b/pandas/tests/util/test_assert_index_equal.py index 8211b52fed650..1fa7b979070a7 100644 --- a/pandas/tests/util/test_assert_index_equal.py +++ b/pandas/tests/util/test_assert_index_equal.py @@ -238,7 +238,29 @@ def test_index_equal_range_categories(check_categorical, exact): ) +def test_assert_index_equal_different_names_check_order_false(): + # GH#47328 + idx1 = Index([1, 3], name="a") + idx2 = Index([3, 1], name="b") + with pytest.raises(AssertionError, match='"names" are different'): + tm.assert_index_equal(idx1, idx2, check_order=False, check_names=True) + + def test_assert_index_equal_mixed_dtype(): # GH#39168 idx = Index(["foo", "bar", 42]) tm.assert_index_equal(idx, idx, check_order=False) + + +def test_assert_index_equal_ea_dtype_order_false(any_numeric_ea_dtype): + # GH#47207 + idx1 = Index([1, 3], dtype=any_numeric_ea_dtype) + idx2 = Index([3, 1], dtype=any_numeric_ea_dtype) + tm.assert_index_equal(idx1, idx2, check_order=False) + + +def test_assert_index_equal_object_ints_order_false(): + # GH#47207 + idx1 = Index([1, 3], dtype="object") + idx2 = Index([3, 1], dtype="object") + tm.assert_index_equal(idx1, idx2, check_order=False) diff --git a/pandas/tests/util/test_assert_series_equal.py b/pandas/tests/util/test_assert_series_equal.py index 2209bed67325c..963af81bcb6a5 100644 --- a/pandas/tests/util/test_assert_series_equal.py +++ b/pandas/tests/util/test_assert_series_equal.py @@ -1,3 +1,4 @@ +import numpy as np import pytest from pandas.core.dtypes.common import is_extension_array_dtype @@ -382,3 +383,29 @@ def test_assert_series_equal_identical_na(nulls_fixture): # while we're here do Index too idx = pd.Index(ser) tm.assert_index_equal(idx, idx.copy(deep=True)) + + +def test_identical_nested_series_is_equal(): + # GH#22400 + x = Series( + [ + 0, + 0.0131142231938, + 1.77774652865e-05, + np.array([0.4722720840328748, 0.4216929783681722]), + ] + ) + y = Series( + [ + 0, + 0.0131142231938, + 1.77774652865e-05, + np.array([0.4722720840328748, 0.4216929783681722]), + ] + ) + # These two arrays should be equal, nesting could cause issue + + tm.assert_series_equal(x, x) + tm.assert_series_equal(x, x, check_exact=True) + tm.assert_series_equal(x, y) + tm.assert_series_equal(x, y, check_exact=True) diff --git a/pandas/tseries/api.py b/pandas/tseries/api.py index 59666fa0048dd..e274838d45b27 100644 --- a/pandas/tseries/api.py +++ b/pandas/tseries/api.py @@ -2,7 +2,7 @@ Timeseries API """ -# flake8: noqa:F401 - from pandas.tseries.frequencies import infer_freq import pandas.tseries.offsets as offsets + +__all__ = ["infer_freq", "offsets"] diff --git a/pandas/tseries/frequencies.py b/pandas/tseries/frequencies.py index c541003f1160c..b2fbc022b2708 100644 --- a/pandas/tseries/frequencies.py +++ b/pandas/tseries/frequencies.py @@ -22,7 +22,7 @@ build_field_sarray, month_position_check, ) -from pandas._libs.tslibs.offsets import ( # noqa:F401 +from pandas._libs.tslibs.offsets import ( BaseOffset, DateOffset, Day, @@ -314,12 +314,12 @@ def get_freq(self) -> str | None: return _maybe_add_count("N", delta) @cache_readonly - def day_deltas(self): + def day_deltas(self) -> list[int]: ppd = periods_per_day(self._reso) return [x / ppd for x in self.deltas] @cache_readonly - def hour_deltas(self): + def 
hour_deltas(self) -> list[int]: pph = periods_per_day(self._reso) // 24 return [x / pph for x in self.deltas] @@ -328,10 +328,10 @@ def fields(self) -> np.ndarray: # structured array of fields return build_field_sarray(self.i8values, reso=self._reso) @cache_readonly - def rep_stamp(self): + def rep_stamp(self) -> Timestamp: return Timestamp(self.i8values[0]) - def month_position_check(self): + def month_position_check(self) -> str | None: return month_position_check(self.fields, self.index.dayofweek) @cache_readonly @@ -394,7 +394,11 @@ def _get_annual_rule(self) -> str | None: return None pos_check = self.month_position_check() - return {"cs": "AS", "bs": "BAS", "ce": "A", "be": "BA"}.get(pos_check) + + if pos_check is None: + return None + else: + return {"cs": "AS", "bs": "BAS", "ce": "A", "be": "BA"}.get(pos_check) def _get_quarterly_rule(self) -> str | None: if len(self.mdiffs) > 1: @@ -404,13 +408,21 @@ def _get_quarterly_rule(self) -> str | None: return None pos_check = self.month_position_check() - return {"cs": "QS", "bs": "BQS", "ce": "Q", "be": "BQ"}.get(pos_check) + + if pos_check is None: + return None + else: + return {"cs": "QS", "bs": "BQS", "ce": "Q", "be": "BQ"}.get(pos_check) def _get_monthly_rule(self) -> str | None: if len(self.mdiffs) > 1: return None pos_check = self.month_position_check() - return {"cs": "MS", "bs": "BMS", "ce": "M", "be": "BM"}.get(pos_check) + + if pos_check is None: + return None + else: + return {"cs": "MS", "bs": "BMS", "ce": "M", "be": "BM"}.get(pos_check) def _is_business_daily(self) -> bool: # quick check: cannot be business daily @@ -635,3 +647,14 @@ def _is_monthly(rule: str) -> bool: def _is_weekly(rule: str) -> bool: rule = rule.upper() return rule == "W" or rule.startswith("W-") + + +__all__ = [ + "Day", + "get_offset", + "get_period_alias", + "infer_freq", + "is_subperiod", + "is_superperiod", + "to_offset", +] diff --git a/pandas/tseries/holiday.py b/pandas/tseries/holiday.py index 6fd49e2340e30..6426dbcd54489 100644 --- a/pandas/tseries/holiday.py +++ b/pandas/tseries/holiday.py @@ -6,7 +6,7 @@ ) import warnings -from dateutil.relativedelta import ( # noqa:F401 +from dateutil.relativedelta import ( FR, MO, SA, @@ -582,3 +582,27 @@ def HolidayCalendarFactory(name, base, other, base_class=AbstractHolidayCalendar rules = AbstractHolidayCalendar.merge_class(base, other) calendar_class = type(name, (base_class,), {"rules": rules, "name": name}) return calendar_class + + +__all__ = [ + "after_nearest_workday", + "before_nearest_workday", + "FR", + "get_calendar", + "HolidayCalendarFactory", + "MO", + "nearest_workday", + "next_monday", + "next_monday_or_tuesday", + "next_workday", + "previous_friday", + "previous_workday", + "register", + "SA", + "SU", + "sunday_to_monday", + "TH", + "TU", + "WE", + "weekend_to_monday", +] diff --git a/pandas/tseries/offsets.py b/pandas/tseries/offsets.py index cee99d23f8d90..169c9cc18a7fd 100644 --- a/pandas/tseries/offsets.py +++ b/pandas/tseries/offsets.py @@ -1,4 +1,6 @@ -from pandas._libs.tslibs.offsets import ( # noqa:F401 +from __future__ import annotations + +from pandas._libs.tslibs.offsets import ( FY5253, BaseOffset, BDay, @@ -45,9 +47,14 @@ __all__ = [ "Day", + "BaseOffset", "BusinessDay", + "BusinessMonthBegin", + "BusinessMonthEnd", "BDay", "CustomBusinessDay", + "CustomBusinessMonthBegin", + "CustomBusinessMonthEnd", "CDay", "CBMonthEnd", "CBMonthBegin", @@ -73,6 +80,7 @@ "Week", "WeekOfMonth", "Easter", + "Tick", "Hour", "Minute", "Second", diff --git a/pandas/util/__init__.py 
b/pandas/util/__init__.py index 7adfca73c2f1e..6e6006dd28165 100644 --- a/pandas/util/__init__.py +++ b/pandas/util/__init__.py @@ -1,3 +1,4 @@ +# pyright: reportUnusedImport = false from pandas.util._decorators import ( # noqa:F401 Appender, Substitution, diff --git a/pandas/util/_decorators.py b/pandas/util/_decorators.py index 0f15511e491cc..f8359edaa8d44 100644 --- a/pandas/util/_decorators.py +++ b/pandas/util/_decorators.py @@ -11,8 +11,12 @@ ) import warnings -from pandas._libs.properties import cache_readonly # noqa:F401 -from pandas._typing import F +from pandas._libs.properties import cache_readonly +from pandas._typing import ( + F, + T, +) +from pandas.util._exceptions import find_stack_level def deprecate( @@ -260,7 +264,6 @@ def future_version_msg(version: str | None) -> str: def deprecate_nonkeyword_arguments( version: str | None, allowed_args: list[str] | None = None, - stacklevel: int = 2, name: str | None = None, ) -> Callable[[F], F]: """ @@ -280,9 +283,6 @@ def deprecate_nonkeyword_arguments( defaults to list of all arguments not having the default value. - stacklevel : int, default=2 - The stack level for warnings.warn - name : str, optional The specific name of the function to show in the warning message. If None, then the Qualified name of the function @@ -312,7 +312,7 @@ def wrapper(*args, **kwargs): warnings.warn( msg.format(arguments=arguments), FutureWarning, - stacklevel=stacklevel, + stacklevel=find_stack_level(), ) return func(*args, **kwargs) @@ -488,7 +488,7 @@ def __init__(self, addendum: str | None, join: str = "", indents: int = 0) -> No self.addendum = addendum self.join = join - def __call__(self, func: F) -> F: + def __call__(self, func: T) -> T: func.__doc__ = func.__doc__ if func.__doc__ else "" self.addendum = self.addendum if self.addendum else "" docitems = [func.__doc__, self.addendum] @@ -501,3 +501,16 @@ def indent(text: str | None, indents: int = 1) -> str: return "" jointext = "".join(["\n"] + [" "] * indents) return jointext.join(text.split("\n")) + + +__all__ = [ + "Appender", + "cache_readonly", + "deprecate", + "deprecate_kwarg", + "deprecate_nonkeyword_arguments", + "doc", + "future_version_msg", + "rewrite_axis_style_signature", + "Substitution", +] diff --git a/pandas/util/_exceptions.py b/pandas/util/_exceptions.py index ef467f096e963..c718451fbf621 100644 --- a/pandas/util/_exceptions.py +++ b/pandas/util/_exceptions.py @@ -3,10 +3,11 @@ import contextlib import inspect import os +from typing import Iterator @contextlib.contextmanager -def rewrite_exception(old_name: str, new_name: str): +def rewrite_exception(old_name: str, new_name: str) -> Iterator[None]: """ Rewrite the message of an exception. """ diff --git a/pandas/util/_test_decorators.py b/pandas/util/_test_decorators.py index bbcf984e68b4b..4a4f27f6c7906 100644 --- a/pandas/util/_test_decorators.py +++ b/pandas/util/_test_decorators.py @@ -27,7 +27,10 @@ def test_foo(): from contextlib import contextmanager import locale -from typing import Callable +from typing import ( + Callable, + Iterator, +) import warnings import numpy as np @@ -253,7 +256,7 @@ def check_file_leaks(func) -> Callable: @contextmanager -def file_leak_context(): +def file_leak_context() -> Iterator[None]: """ ContextManager analogue to check_file_leaks. 
""" @@ -290,7 +293,7 @@ def async_mark(): return async_mark -def mark_array_manager_not_yet_implemented(request): +def mark_array_manager_not_yet_implemented(request) -> None: mark = pytest.mark.xfail(reason="Not yet implemented for ArrayManager") request.node.add_marker(mark) diff --git a/pandas/util/_validators.py b/pandas/util/_validators.py index 8e3de9404fbee..fc3439a57a002 100644 --- a/pandas/util/_validators.py +++ b/pandas/util/_validators.py @@ -5,13 +5,17 @@ from __future__ import annotations from typing import ( + Any, Iterable, Sequence, + TypeVar, + overload, ) import warnings import numpy as np +from pandas._typing import IntervalInclusiveType from pandas.util._exceptions import find_stack_level from pandas.core.dtypes.common import ( @@ -19,6 +23,9 @@ is_integer, ) +BoolishT = TypeVar("BoolishT", bool, int) +BoolishNoneT = TypeVar("BoolishNoneT", bool, int, None) + def _check_arg_length(fname, args, max_fname_arg_count, compat_args): """ @@ -78,7 +85,7 @@ def _check_for_default_values(fname, arg_val_dict, compat_args): ) -def validate_args(fname, args, max_fname_arg_count, compat_args): +def validate_args(fname, args, max_fname_arg_count, compat_args) -> None: """ Checks whether the length of the `*args` argument passed into a function has at most `len(compat_args)` arguments and whether or not all of these @@ -132,7 +139,7 @@ def _check_for_invalid_keys(fname, kwargs, compat_args): raise TypeError(f"{fname}() got an unexpected keyword argument '{bad_arg}'") -def validate_kwargs(fname, kwargs, compat_args): +def validate_kwargs(fname, kwargs, compat_args) -> None: """ Checks whether parameters passed to the **kwargs argument in a function `fname` are valid parameters as specified in `*compat_args` @@ -159,7 +166,9 @@ def validate_kwargs(fname, kwargs, compat_args): _check_for_default_values(fname, kwds, compat_args) -def validate_args_and_kwargs(fname, args, kwargs, max_fname_arg_count, compat_args): +def validate_args_and_kwargs( + fname, args, kwargs, max_fname_arg_count, compat_args +) -> None: """ Checks whether parameters passed to the *args and **kwargs argument in a function `fname` are valid parameters as specified in `*compat_args` @@ -215,7 +224,9 @@ def validate_args_and_kwargs(fname, args, kwargs, max_fname_arg_count, compat_ar validate_kwargs(fname, kwargs, compat_args) -def validate_bool_kwarg(value, arg_name, none_allowed=True, int_allowed=False): +def validate_bool_kwarg( + value: BoolishNoneT, arg_name, none_allowed=True, int_allowed=False +) -> BoolishNoneT: """ Ensure that argument passed in arg_name can be interpreted as boolean. @@ -255,7 +266,9 @@ def validate_bool_kwarg(value, arg_name, none_allowed=True, int_allowed=False): return value -def validate_axis_style_args(data, args, kwargs, arg_name, method_name): +def validate_axis_style_args( + data, args, kwargs, arg_name, method_name +) -> dict[str, Any]: """ Argument handler for mixed index, columns / axis functions @@ -424,12 +437,22 @@ def validate_percentile(q: float | Iterable[float]) -> np.ndarray: return q_arr +@overload +def validate_ascending(ascending: BoolishT) -> BoolishT: + ... + + +@overload +def validate_ascending(ascending: Sequence[BoolishT]) -> list[BoolishT]: + ... 
+ + def validate_ascending( - ascending: bool | int | Sequence[bool | int] = True, -): + ascending: bool | int | Sequence[BoolishT], +) -> bool | int | list[BoolishT]: """Validate ``ascending`` kwargs for ``sort_index`` method.""" kwargs = {"none_allowed": False, "int_allowed": True} - if not isinstance(ascending, (list, tuple)): + if not isinstance(ascending, Sequence): return validate_bool_kwarg(ascending, "ascending", **kwargs) return [validate_bool_kwarg(item, "ascending", **kwargs) for item in ascending] @@ -468,7 +491,7 @@ def validate_endpoints(closed: str | None) -> tuple[bool, bool]: return left_closed, right_closed -def validate_inclusive(inclusive: str | None) -> tuple[bool, bool]: +def validate_inclusive(inclusive: IntervalInclusiveType | None) -> tuple[bool, bool]: """ Check that the `inclusive` argument is among {"both", "neither", "left", "right"}. diff --git a/pyproject.toml b/pyproject.toml index 0e2e41fba461c..6ca37581b03f0 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -166,6 +166,7 @@ reportPropertyTypeMismatch = true reportUntypedClassDecorator = true reportUntypedFunctionDecorator = true reportUntypedNamedTuple = true +reportUnusedImport = true # disable subset of "basic" reportGeneralTypeIssues = false reportMissingModuleSource = false @@ -176,4 +177,3 @@ reportOptionalOperand = false reportOptionalSubscript = false reportPrivateImportUsage = false reportUnboundVariable = false -reportUnsupportedDunderAll = false diff --git a/pyright_reportGeneralTypeIssues.json b/pyright_reportGeneralTypeIssues.json index b43c6a109adbf..c482aa32600fb 100644 --- a/pyright_reportGeneralTypeIssues.json +++ b/pyright_reportGeneralTypeIssues.json @@ -15,7 +15,7 @@ "pandas/io/clipboard", "pandas/util/version", # and all files that currently don't pass - "pandas/_config/config.py", + "pandas/_testing/__init__.py", "pandas/core/algorithms.py", "pandas/core/apply.py", "pandas/core/array_algos/take.py", @@ -28,17 +28,14 @@ "pandas/core/arrays/datetimes.py", "pandas/core/arrays/interval.py", "pandas/core/arrays/masked.py", - "pandas/core/arrays/numeric.py", "pandas/core/arrays/period.py", "pandas/core/arrays/sparse/array.py", "pandas/core/arrays/sparse/dtype.py", "pandas/core/arrays/string_.py", "pandas/core/arrays/string_arrow.py", "pandas/core/arrays/timedeltas.py", - "pandas/core/common.py", "pandas/core/computation/align.py", "pandas/core/construction.py", - "pandas/core/describe.py", "pandas/core/dtypes/cast.py", "pandas/core/dtypes/common.py", "pandas/core/dtypes/concat.py", @@ -60,11 +57,9 @@ "pandas/core/indexes/multi.py", "pandas/core/indexes/numeric.py", "pandas/core/indexes/period.py", - "pandas/core/indexes/range.py", "pandas/core/indexing.py", "pandas/core/internals/api.py", "pandas/core/internals/array_manager.py", - "pandas/core/internals/base.py", "pandas/core/internals/blocks.py", "pandas/core/internals/concat.py", "pandas/core/internals/construction.py", @@ -83,7 +78,6 @@ "pandas/core/tools/datetimes.py", "pandas/core/tools/timedeltas.py", "pandas/core/util/hashing.py", - "pandas/core/util/numba_.py", "pandas/core/window/ewm.py", "pandas/core/window/rolling.py", "pandas/io/common.py", @@ -106,22 +100,11 @@ "pandas/io/parsers/arrow_parser_wrapper.py", "pandas/io/parsers/base_parser.py", "pandas/io/parsers/c_parser_wrapper.py", - "pandas/io/parsers/python_parser.py", - "pandas/io/parsers/readers.py", "pandas/io/pytables.py", - "pandas/io/sas/sas7bdat.py", - "pandas/io/sas/sasreader.py", + "pandas/io/sas/sas_xport.py", "pandas/io/sql.py", "pandas/io/stata.py", 
"pandas/io/xml.py", - "pandas/plotting/_core.py", - "pandas/plotting/_matplotlib/converter.py", - "pandas/plotting/_matplotlib/core.py", - "pandas/plotting/_matplotlib/hist.py", - "pandas/plotting/_matplotlib/misc.py", - "pandas/plotting/_matplotlib/style.py", - "pandas/plotting/_matplotlib/timeseries.py", - "pandas/plotting/_matplotlib/tools.py", "pandas/tseries/frequencies.py", ], } diff --git a/requirements-dev.txt b/requirements-dev.txt index 7640587b85a1c..ff410c59b43dd 100644 --- a/requirements-dev.txt +++ b/requirements-dev.txt @@ -1,11 +1,66 @@ # This file is auto-generated from environment.yml, do not modify. # See that file for comments about the need/usage of each dependency. -numpy>=1.19.5 -python-dateutil>=2.8.1 +cython==0.29.30 +pytest>=6.0 +pytest-cov +pytest-xdist>=1.31 +psutil +pytest-asyncio>=0.17 +boto3 +python-dateutil +numpy pytz +beautifulsoup4 +blosc +brotlipy +bottleneck +fastparquet +fsspec +html5lib +hypothesis +gcsfs +jinja2 +lxml +matplotlib +numba>=0.53.1 +numexpr>=2.8.0 +openpyxl +odfpy +pandas-gbq +psycopg2 +pyarrow +pymysql +pyreadstat +tables +python-snappy +pyxlsb +s3fs +scipy +sqlalchemy +tabulate +xarray +xlrd +xlsxwriter +xlwt +zstandard +aiobotocore<2.0.0 +botocore +cftime +dask +ipython +geopandas +seaborn +scikit-learn +statsmodels +coverage +pandas-datareader +pyyaml +py +torch +moto +flask asv -cython>=0.29.30 black==22.3.0 cpplint flake8==4.0.1 @@ -18,6 +73,7 @@ pycodestyle pyupgrade gitpython gitdb +natsort numpydoc pandas-dev-flaker==0.5.0 pydata-sphinx-theme==0.8.0 @@ -31,58 +87,15 @@ types-setuptools nbconvert>=6.4.5 nbsphinx pandoc -dask -toolz>=0.7.3 -partd>=0.3.10 -cloudpickle>=0.2.1 -markdown -feedparser -pyyaml -requests -boto3 -botocore>=1.11 -hypothesis>=5.5.3 -moto -flask -pytest>=6.0 -pytest-cov -pytest-xdist>=1.31 -pytest-asyncio>=0.17 -pytest-instafail -seaborn -statsmodels ipywidgets nbformat notebook>=6.0.3 -blosc -bottleneck>=1.3.1 ipykernel -ipython>=7.11.1 jinja2 -matplotlib>=3.3.2 -numexpr>=2.7.1 -scipy>=1.4.1 -numba>=0.50.1 -beautifulsoup4>=4.8.2 -html5lib -lxml -openpyxl -xlrd -xlsxwriter -xlwt -odfpy -fastparquet>=0.4.0 -pyarrow>2.0.1 -python-snappy -tables>=3.6.1 -s3fs>=0.4.0 -aiobotocore<2.0.0 -fsspec>=0.7.4 -gcsfs>=0.6.0 -sqlalchemy -xarray -cftime -pyreadstat -tabulate>=0.8.3 -natsort +markdown +feedparser +pyyaml +requests +jupyterlab >=3.4,<4 +jupyterlite==0.1.0b10 setuptools>=51.0.0 diff --git a/scripts/generate_pip_deps_from_conda.py b/scripts/generate_pip_deps_from_conda.py index 2ea50fa3ac8d4..8cb539d3b02c8 100755 --- a/scripts/generate_pip_deps_from_conda.py +++ b/scripts/generate_pip_deps_from_conda.py @@ -21,7 +21,7 @@ import yaml EXCLUDE = {"python", "c-compiler", "cxx-compiler"} -RENAME = {"pytables": "tables", "dask-core": "dask"} +RENAME = {"pytables": "tables", "geopandas-base": "geopandas", "pytorch": "torch"} def conda_package_to_pip(package: str): diff --git a/scripts/pandas_errors_documented.py b/scripts/pandas_errors_documented.py index 3e5bf34db4fe8..18db5fa10a8f9 100644 --- a/scripts/pandas_errors_documented.py +++ b/scripts/pandas_errors_documented.py @@ -22,7 +22,7 @@ def get_defined_errors(content: str) -> set[str]: for node in ast.walk(ast.parse(content)): if isinstance(node, ast.ClassDef): errors.add(node.name) - elif isinstance(node, ast.ImportFrom): + elif isinstance(node, ast.ImportFrom) and node.module != "__future__": for alias in node.names: errors.add(alias.name) return errors @@ -41,7 +41,7 @@ def main(argv: Sequence[str] | None = None) -> None: missing = 
file_errors.difference(doc_errors) if missing: sys.stdout.write( - f"The follow exceptions and/or warnings are not documented " + f"The following exceptions and/or warnings are not documented " f"in {API_PATH}: {missing}" ) sys.exit(1) diff --git a/setup.cfg b/setup.cfg index d3c4fe0cb35ce..b191930acf4c5 100644 --- a/setup.cfg +++ b/setup.cfg @@ -7,7 +7,7 @@ url = https://pandas.pydata.org author = The Pandas Development Team author_email = pandas-dev@python.org license = BSD-3-Clause -license_file = LICENSE +license_files = LICENSE platforms = any classifiers = Development Status :: 5 - Production/Stable @@ -45,6 +45,11 @@ zip_safe = False pandas_plotting_backends = matplotlib = pandas:plotting._matplotlib +[options.exclude_package_data] +* = + *.c + *.h + [options.extras_require] test = hypothesis>=5.5.3 @@ -102,9 +107,19 @@ ignore = # false positives B301, # single-letter variables - PDF023 + PDF023, # "use 'pandas._testing' instead" in non-test code - PDF025 + PDF025, + # If test must be a simple comparison against sys.platform or sys.version_info + Y002, + # Use "_typeshed.Self" instead of class-bound TypeVar + Y019, + # Docstrings should not be included in stubs + Y021, + # Use typing_extensions.TypeAlias for type aliases + Y026, + # Use "collections.abc.*" instead of "typing.*" (PEP 585 syntax) + Y027 exclude = doc/sphinxext/*.py, doc/build/*.py, diff --git a/setup.py b/setup.py index cb713e6d74392..70adbd3c083af 100755 --- a/setup.py +++ b/setup.py @@ -492,7 +492,11 @@ def srcpath(name=None, suffix=".pyx", subdir="src"): "_libs.properties": {"pyxfile": "_libs/properties"}, "_libs.reshape": {"pyxfile": "_libs/reshape", "depends": []}, "_libs.sparse": {"pyxfile": "_libs/sparse", "depends": _pxi_dep["sparse"]}, - "_libs.tslib": {"pyxfile": "_libs/tslib", "depends": tseries_depends}, + "_libs.tslib": { + "pyxfile": "_libs/tslib", + "depends": tseries_depends, + "sources": ["pandas/_libs/tslibs/src/datetime/np_datetime.c"], + }, "_libs.tslibs.base": {"pyxfile": "_libs/tslibs/base"}, "_libs.tslibs.ccalendar": {"pyxfile": "_libs/tslibs/ccalendar"}, "_libs.tslibs.dtypes": {"pyxfile": "_libs/tslibs/dtypes"}, @@ -551,7 +555,11 @@ def srcpath(name=None, suffix=".pyx", subdir="src"): "depends": tseries_depends, "sources": ["pandas/_libs/tslibs/src/datetime/np_datetime.c"], }, - "_libs.tslibs.vectorized": {"pyxfile": "_libs/tslibs/vectorized"}, + "_libs.tslibs.vectorized": { + "pyxfile": "_libs/tslibs/vectorized", + "depends": tseries_depends, + "sources": ["pandas/_libs/tslibs/src/datetime/np_datetime.c"], + }, "_libs.testing": {"pyxfile": "_libs/testing"}, "_libs.window.aggregations": { "pyxfile": "_libs/window/aggregations", diff --git a/web/interactive_terminal/README.md b/web/interactive_terminal/README.md new file mode 100644 index 0000000000000..865cf282676c9 --- /dev/null +++ b/web/interactive_terminal/README.md @@ -0,0 +1,35 @@ +# The interactive `pandas` terminal + +An interactive terminal to easily try `pandas` in the browser, powered by JupyterLite. + +![image](https://user-images.githubusercontent.com/591645/175000291-e8c69f6f-5f2c-48d7-817c-cff05ab2cde9.png) + +## Build + +The interactive terminal is built with the `jupyterlite` CLI. 
+
+First make sure `jupyterlite` is installed:
+
+```bash
+python -m pip install jupyterlite
+```
+
+Then in `web/interactive_terminal`, run the following command:
+
+```bash
+jupyter lite build
+```
+
+## Configuration
+
+This folder contains configuration files for the interactive terminal powered by JupyterLite:
+
+- `jupyter_lite_config.json`: build time configuration, used when building the assets with the `jupyter lite build` command
+- `jupyter-lite.json`: run time configuration applied when launching the application in the browser
+
+The interactive `pandas` terminal application enables a couple of optimizations to only include the `repl` app in the generated static assets.
+To learn more about it, check out the JupyterLite documentation:
+
+- Optimizations: https://jupyterlite.readthedocs.io/en/latest/howto/configure/advanced/optimizations.html
+- JupyterLite schema: https://jupyterlite.readthedocs.io/en/latest/reference/schema-v0.html
+- CLI reference: https://jupyterlite.readthedocs.io/en/latest/reference/cli.html
diff --git a/web/interactive_terminal/jupyter-lite.json b/web/interactive_terminal/jupyter-lite.json
new file mode 100644
index 0000000000000..473fb5a3dcc1a
--- /dev/null
+++ b/web/interactive_terminal/jupyter-lite.json
@@ -0,0 +1,13 @@
+{
+  "jupyter-lite-schema-version": 0,
+  "jupyter-config-data": {
+    "appName": "Pandas REPL",
+    "appUrl": "./repl",
+    "disabledExtensions": [
+      "@jupyter-widgets/jupyterlab-manager"
+    ],
+    "enableMemoryStorage": true,
+    "settingsStorageDrivers": ["memoryStorageDriver"],
+    "contentsStorageDrivers": ["memoryStorageDriver"]
+  }
+}
diff --git a/web/interactive_terminal/jupyter_lite_config.json b/web/interactive_terminal/jupyter_lite_config.json
new file mode 100644
index 0000000000000..8a8c4eb1ae051
--- /dev/null
+++ b/web/interactive_terminal/jupyter_lite_config.json
@@ -0,0 +1,7 @@
+{
+  "LiteBuildConfig": {
+    "apps": ["repl"],
+    "no_unused_shared_packages": true,
+    "output_dir": "../build/lite"
+  }
+}
diff --git a/web/pandas/_templates/layout.html b/web/pandas/_templates/layout.html
index 52e06a9bec55b..67876d88a2d1a 100644
--- a/web/pandas/_templates/layout.html
+++ b/web/pandas/_templates/layout.html
@@ -69,6 +69,11 @@