
Conversation

albertoandreottiATgmail
Contributor

Description

This change adds a new capability to the CoNLL reader, allowing it to read multiple CoNLL files at once into a single DataFrame. For example:

df = CoNLL().readDataset(spark, './path/to/conlls/*', partitions=12)

The difference from the previous behavior is that the path ends in an *. The original single-path mechanism continues to work as before.
Two additional parameters apply only to the multi-file case (see the usage sketch below):
partitions : minimum number of partitions used to create the DataFrame. Defaults to 8.
storage_level : pyspark.StorageLevel for the DataFrame. Defaults to pyspark.StorageLevel.DISK_ONLY.
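
A minimal usage sketch, assuming Spark NLP 3.3.3 or later where readDataset exposes these new parameters; the directory path and the partition/storage values below are illustrative only:

import pyspark
import sparknlp
from sparknlp.training import CoNLL

spark = sparknlp.start()

# Trailing '*' switches readDataset into the new multi-file mode,
# reading every CoNLL file under the directory into one DataFrame.
training_data = CoNLL().readDataset(
    spark,
    './path/to/conlls/*',
    partitions=12,                                        # minimum partitions for the DataFrame (default 8)
    storage_level=pyspark.StorageLevel.MEMORY_AND_DISK    # persistence level (default DISK_ONLY)
)

training_data.show(5)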

Motivation and Context

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@maziyarpanahi maziyarpanahi merged commit 5d2ecc6 into release/333-release-candidate Nov 19, 2021
@KshitizGIT KshitizGIT deleted the conll_reader branch March 2, 2023 10:08
Labels
enhancement, new-feature (Introducing a new feature)