
Conversation

albertoandreottiATgmail
Contributor

Description

This change adds a new capability to the CoNLL reader, allowing it to read multiple CoNLL files at once into a single DataFrame. For example:

df = CoNLL().readDataset(spark, './path/to/conlls/*', partitions=12)

The difference from the previous behavior is that the path ends in an *. The original single-path mechanism continues to work as before.
Two additional parameters apply only to the multi-file case (see the usage sketch below):
partitions : minimum number of partitions used to create the DataFrame. Defaults to 8.
storage_level : pyspark.StorageLevel for the DataFrame. Defaults to pyspark.StorageLevel.DISK_ONLY.
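
A minimal usage sketch, assuming Spark NLP 3.3.3 or later where readDataset exposes these new parameters; the directory path and the partition/storage values below are illustrative only:

import pyspark
import sparknlp
from sparknlp.training import CoNLL

spark = sparknlp.start()

# Trailing '*' switches readDataset into the new multi-file mode,
# reading every CoNLL file under the directory into one DataFrame.
training_data = CoNLL().readDataset(
    spark,
    './path/to/conlls/*',
    partitions=12,                                        # minimum partitions for the DataFrame (default 8)
    storage_level=pyspark.StorageLevel.MEMORY_AND_DISK    # persistence level (default DISK_ONLY)
)

training_data.show(5)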

Motivation and Context

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@maziyarpanahi maziyarpanahi merged commit 5d2ecc6 into release/333-release-candidate Nov 19, 2021
@KshitizGIT KshitizGIT deleted the conll_reader branch March 2, 2023 10:08
Labels
enhancement, new-feature (Introducing a new feature)