-
Notifications
You must be signed in to change notification settings - Fork 11
feat: add skipEmptyLines option #644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis pull request introduces a new option, Changes
Possibly related PRs
Suggested reviewers
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (11)
plugins/delimiter-extractor/ref/csv.txt (2)
7-9
: Empty rows at the end serve the PR objective, but consider a minor adjustment.The empty rows at the end (lines 7-9) align with the PR objective of testing the automatic removal of empty rows after the last data row. However, for consistency, consider removing the last completely empty line (line 9).
,, ,, -
This change ensures all rows in the file have a consistent format, even if they will be removed during processing.
1-9
: File structure is suitable for testing, but consider renaming the file.The overall structure of the file is well-suited for testing the
skipEmptyLines
option, containing a mix of data rows and empty rows both between data and at the end of the file. This aligns perfectly with the PR objectives.Consider renaming the file from
csv.txt
tocsv.csv
to accurately reflect its content type. This change would make the file's purpose more immediately clear to other developers.🧰 Tools
🪛 LanguageTool
[uncategorized] ~1-~1: A comma might be missing here.
Context: header1,header2,header3 column1,column2,column3 ,, ,, column4,c...(AI_EN_LECTOR_MISSING_PUNCTUATION_COMMA)
.changeset/flat-gifts-change.md (1)
1-6
: Approved with suggestion for improvement.The changeset correctly identifies the affected packages and uses an appropriate "minor" version bump for introducing a new feature. However, the description could be more informative.
Consider expanding the description to provide more context and usage information. Here's a suggested improvement:
-This release provides a new option `skipEmptyLines` to skip empty lines. +This release introduces a new option `skipEmptyLines` for both the delimiter and XLSX extractors. When enabled, this option allows the plugins to skip processing of empty lines in the input data, providing more flexibility in handling various data formats. Users can now configure this option to improve the efficiency of data extraction, especially when dealing with files that may contain irrelevant empty lines between data entries.This expanded description offers more context about which plugins are affected, what the option does, and why it might be useful to users.
plugins/xlsx-extractor/src/index.ts (1)
21-21
: Approved addition ofskipEmptyLines
property, with suggestions for improvement.The new
skipEmptyLines
property is a good addition that aligns with the PR objectives. However, consider the following improvements:
- Add a JSDoc comment for the new property to improve clarity for users of this interface.
- Consider organizing the properties in the interface in a logical order (e.g., grouping related properties or arranging them alphabetically).
Here's a suggested implementation with these improvements:
readonly rawNumbers?: boolean readonly dateNF?: string readonly headerDetectionOptions?: GetHeadersOptions + /** If true, empty lines in the data will be skipped during extraction. */ readonly skipEmptyLines?: boolean readonly chunkSize?: number readonly parallel?: number readonly debug?: boolean
plugins/xlsx-extractor/src/parser.spec.ts (4)
15-39
: Improved test coverage for 'Excel to WorkbookCapture'.The updated test case now includes more detailed expectations, which aligns well with the PR objectives. It properly checks for null values and spaces in the data.
Consider adding a comment explaining the significance of the
' '
(six spaces) value in the 'Code' field of the second data entry. This might help other developers understand if this is an expected edge case or a specific test scenario.
42-63
: Great addition of granular tests!The new tests for sheet names, header names, and values provide excellent coverage for the
parseBuffer
function. They ensure that the basic structure of the parsed data is correct.In the 'finds values' test (lines 60-62), consider asserting the actual values of the data, not just the length. This would provide even more robust verification of the parsing process.
64-93
: Excellent test for non-unique header handling!This test thoroughly checks the handling of non-unique headers, ensuring they are properly differentiated. This is crucial for maintaining data integrity during extraction.
Consider parameterizing this test to make it more maintainable. You could create an array of expected headers and use it in the assertion, like this:
const expectedHeaders = [ 'Name', 'Group', 'Rebates', 'Purchases', ...Array.from({length: 11}, (_, i) => [`Rebates_${i+1}`, `Purchases_${i+1}`]).flat() ]; expect(capture['Rebates-Purchases'].headers).toEqual(expectedHeaders);This approach would make the test more concise and easier to update if the number of duplicate headers changes in the future.
96-147
: Excellent implementation of tests for the new skipEmptyLines option!These tests thoroughly verify the behavior of the new
skipEmptyLines
option, covering both true and false cases. They align perfectly with the PR objectives and ensure the new feature works as expected.To enhance these tests further, consider the following suggestions:
- Add a test case for a file with no empty lines to ensure the option doesn't affect files without empty lines.
- Consider using a
beforeEach
to read the buffer, reducing duplication between the two tests.- Use
test.each
to parameterize the tests, making them more concise and easier to extend. For example:test.each([ [true, 3, [ { header1: { value: 'column1' }, header2: { value: 'column2' }, header3: { value: 'column3' } }, { header1: { value: 'column4' }, header2: { value: 'column5' }, header3: { value: 'column6' } }, { header1: { value: 'column7' }, header2: { value: 'column8' }, header3: { value: 'column9' } } ]], [false, 6, [ { header1: { value: 'column1' }, header2: { value: 'column2' }, header3: { value: 'column3' } }, {}, { header1: { value: 'column4' }, header2: { value: 'column5' }, header3: { value: 'column6' } }, {}, {}, { header1: { value: 'column7' }, header2: { value: 'column8' }, header3: { value: 'column9' } } ]] ])('parseBuffer with skipEmptyLines: %s', async (skipEmptyLines, expectedLength, expectedData) => { const capture = await parseBuffer(buffer, { skipEmptyLines }); const data = capture['Sheet1'].data; expect(data.length).toEqual(expectedLength); expect(data).toEqual(expectedData); });This approach would make the tests more maintainable and easier to extend with additional cases in the future.
plugins/delimiter-extractor/src/parser.spec.ts (2)
142-165
: LGTM: Well-structured test case for skipping empty lines.This test case effectively verifies the
skipEmptyLines
functionality. It's clear, concise, and aligns with the PR objectives.Consider adding a brief comment explaining the structure of the test data file (e.g., "Test file contains 3 data rows with 2 empty lines in between"). This would enhance readability and make the test's purpose more evident.
166-199
: LGTM: Well-structured test case for preserving empty lines.This test case effectively verifies the behavior when
skipEmptyLines
is set to false. It's clear, concise, and aligns with the PR objectives.There's a minor typo in the test name. Consider changing:
- test("don'tskip empty lines", async () => { + test("don't skip empty lines", async () => {This small change improves readability.
plugins/delimiter-extractor/src/parser.ts (1)
Line range hint
115-116
: DefineisNullOrWhitespace
before using itThe function
isNullOrWhitespace
is used in your code but is defined at the end of the file. For better readability and to avoid potential reference errors, it's advisable to define helper functions before they are used.Move the
isNullOrWhitespace
function definition above its first usage:+ const isNullOrWhitespace = (value: any) => + value === null || (typeof value === 'string' && value.trim() === '') ... // Existing code that uses isNullOrWhitespaceThis ensures the function is defined before it's called.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (2)
package-lock.json
is excluded by!**/package-lock.json
,!**/*.json
plugins/xlsx-extractor/ref/test-empty-rows.xlsx
is excluded by!**/*.xlsx
,!**/*.xlsx
📒 Files selected for processing (8)
- .changeset/flat-gifts-change.md (1 hunks)
- flatfilers/sandbox/src/index.ts (1 hunks)
- plugins/delimiter-extractor/ref/csv.txt (1 hunks)
- plugins/delimiter-extractor/src/parser.spec.ts (2 hunks)
- plugins/delimiter-extractor/src/parser.ts (3 hunks)
- plugins/xlsx-extractor/src/index.ts (1 hunks)
- plugins/xlsx-extractor/src/parser.spec.ts (1 hunks)
- plugins/xlsx-extractor/src/parser.ts (5 hunks)
🧰 Additional context used
🪛 LanguageTool
plugins/delimiter-extractor/ref/csv.txt
[uncategorized] ~1-~1: A comma might be missing here.
Context: header1,header2,header3 column1,column2,column3 ,, ,, column4,c...(AI_EN_LECTOR_MISSING_PUNCTUATION_COMMA)
🔇 Additional comments (15)
plugins/delimiter-extractor/ref/csv.txt (2)
1-1
: LGTM: Header row is correctly formatted.The header row is properly structured for a CSV file, with three distinct column headers separated by commas.
🧰 Tools
🪛 LanguageTool
[uncategorized] ~1-~1: A comma might be missing here.
Context: header1,header2,header3 column1,column2,column3 ,, ,, column4,c...(AI_EN_LECTOR_MISSING_PUNCTUATION_COMMA)
2-6
: Data rows are well-structured and serve the PR objective.The data rows (lines 2, 5, and 6) are correctly formatted and consistent with the header. The empty rows (lines 3 and 4) between data entries are intentional, aligning with the PR objective of testing the
skipEmptyLines
option.flatfilers/sandbox/src/index.ts (4)
8-10
: LGTM: ExcelExtractor configuration aligns with PR objectives.The addition of
skipEmptyLines: true
to the ExcelExtractor configuration correctly implements the new feature as described in the PR objectives. This change will allow users to skip empty rows in Excel files, enhancing the data extraction process.
13-13
: LGTM: DelimiterExtractor configuration is consistent and aligned with objectives.The addition of
skipEmptyLines: true
to the DelimiterExtractor configuration is consistent with the ExcelExtractor and correctly implements the new feature as described in the PR objectives. This change will allow users to skip empty rows in delimiter-separated files (like the mentioned 'contacts.txt'), further enhancing the data extraction process across different file types.
18-19
: LGTM: Improved code style and consistency in automap function.The reformatting of the
automap
function parameters and the switch to single quotes for string literals improve code consistency and readability. These changes align with modern JavaScript/TypeScript style guides without affecting the functionality.Also applies to: 26-26
1-29
: Summary: Changes successfully implement the skipEmptyLines feature.The modifications in this file successfully implement the
skipEmptyLines
option for both the ExcelExtractor and DelimiterExtractor, aligning perfectly with the PR objectives. This feature will allow users to skip empty rows in various file types, enhancing the data extraction process. The code style improvements also contribute to better readability and maintainability of the codebase.plugins/xlsx-extractor/src/index.ts (1)
21-21
: Verify implementation ofskipEmptyLines
in related files.While the addition of the
skipEmptyLines
property to theExcelExtractorOptions
interface is correct, it's important to ensure that this new option is properly implemented throughout the codebase:
- Verify that the
parseBuffer
function (likely in another file) handles this new option correctly.- Ensure that the
Extractor
function from '@flatfile/util-extractor' correctly passes this option to theparseBuffer
function.- Check other parts of the codebase that use
ExcelExtractorOptions
to see if they need updates to handle this new option.To assist in this verification, you can run the following script:
This script will help identify the relevant parts of the codebase that might need to be updated or verified for the new
skipEmptyLines
option.✅ Verification successful
skipEmptyLines
is correctly implemented across all relevant files.The addition of the
skipEmptyLines
property in theExcelExtractorOptions
interface has been appropriately handled in theparseBuffer
functions and other related areas of the codebase.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation and usage of skipEmptyLines option # Check parseBuffer implementation echo "Checking parseBuffer implementation:" rg --type typescript -A 10 "function parseBuffer" src # Check Extractor usage echo "\nChecking Extractor usage:" rg --type typescript -A 5 "Extractor\(" src # Check ExcelExtractorOptions usage echo "\nChecking ExcelExtractorOptions usage:" rg --type typescript -A 5 "ExcelExtractorOptions" srcLength of output: 497
Script:
#!/bin/bash # Description: Verify the implementation and usage of skipEmptyLines option with corrected rg commands # Check parseBuffer implementation echo "Checking parseBuffer implementation:" rg --glob '*.ts' --glob '*.tsx' -A 10 "function parseBuffer" src # Check Extractor usage echo "\nChecking Extractor usage:" rg --glob '*.ts' --glob '*.tsx' -A 5 "Extractor\(" src # Check ExcelExtractorOptions usage echo "\nChecking ExcelExtractorOptions usage:" rg --glob '*.ts' --glob '*.tsx' -A 5 "ExcelExtractorOptions" srcLength of output: 557
Script:
#!/bin/bash # Description: Verify the implementation and usage of skipEmptyLines option without assuming 'src' directory # Check parseBuffer implementation echo "Checking parseBuffer implementation:" rg --glob '*.ts' --glob '*.tsx' -A 10 "function parseBuffer" . # Check Extractor usage echo "\nChecking Extractor usage:" rg --glob '*.ts' --glob '*.tsx' -A 5 "Extractor\(" . # Check ExcelExtractorOptions usage echo "\nChecking ExcelExtractorOptions usage:" rg --glob '*.ts' --glob '*.tsx' -A 5 "ExcelExtractorOptions" .Length of output: 7488
plugins/xlsx-extractor/src/parser.spec.ts (2)
7-14
: Excellent restructuring of the 'test-basic.xlsx' tests!The new describe block and the use of
beforeAll
improve the organization and efficiency of the tests. This structure allows for better isolation of test cases and reduces redundant code.
Line range hint
1-147
: Overall, excellent changes that significantly improve test coverage!The restructuring and addition of new tests, especially for the
skipEmptyLines
option, greatly enhance the robustness of theparseBuffer
function. The changes align perfectly with the PR objectives and provide comprehensive coverage for the new feature.A few minor suggestions have been made to further improve code maintainability and test coverage:
- Add explanatory comments for specific test cases.
- Consider parameterizing some tests for better maintainability.
- Explore additional edge cases, such as files with no empty lines.
Great job on improving the quality and reliability of the xlsx-extractor plugin!
plugins/delimiter-extractor/src/parser.spec.ts (2)
14-17
: LGTM: Proper buffer initialization for new test cases.The
csvBuffer
is correctly initialized usingfs.readFileSync
with a properly constructed file path. This setup is appropriate for the new test cases.
Line range hint
14-199
: Summary: Effective test coverage for newskipEmptyLines
functionality.The new test cases provide comprehensive coverage for the
skipEmptyLines
option:
- A new CSV buffer is correctly initialized for testing.
- Two test cases verify the behavior with
skipEmptyLines
set to true and false.- The tests use appropriate assertions to check the expected output in each case.
These additions effectively validate the new functionality without disrupting existing tests, ensuring backwards compatibility.
plugins/xlsx-extractor/src/parser.ts (4)
69-69
: Appropriate addition of the 'skipEmptyLines' optionThe
skipEmptyLines
option is correctly added to the options passed toconvertSheet
, enhancing the control over empty line handling.
93-93
: Properly extends 'ConvertSheetArgs' with 'skipEmptyLines'The
ConvertSheetArgs
type now includes theskipEmptyLines
boolean, correctly extending the function's arguments.
109-109
: Correctly passes 'skipEmptyLines' to 'convertSheet' functionThe
skipEmptyLines
parameter is properly included in theconvertSheet
function call, ensuring the option is utilized within the function.
117-117
: Logical expression for 'blankrows' option appears correctThe
blankrows
option is set based onheaderSelectionEnabled
andskipEmptyLines
, aligning with the intended functionality:
- When
headerSelectionEnabled
istrue
,blankrows
istrue
, retaining blank rows for header selection.- When
headerSelectionEnabled
isfalse
,blankrows
is set to the inverse ofskipEmptyLines
, effectively skipping empty lines whenskipEmptyLines
istrue
.
const data: Flatfile.RecordData[] = rows.map((row) => { | ||
const mappedRow = mapKeys(row, (key) => headers[key]) | ||
return mapValues(mappedRow, (value) => ({ | ||
value: transform(value), | ||
})) as Flatfile.RecordData | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle potential header and data mismatches
When mapping the headers
to each row
, there is a risk of mismatches if the number of headers does not match the number of columns in a row, leading to undefined
keys.
Ensure that the length of headers
matches the number of columns in each row
. You can adjust the mapping logic to handle discrepancies:
const data: Flatfile.RecordData[] = rows.map((row) => {
const mappedRow = mapKeys(row, (key, index) => headers[index] || `Column${index + 1}`)
return mapValues(mappedRow, (value) => ({
value: transform(value),
})) as Flatfile.RecordData
})
This ensures that every column has a corresponding header, assigning a default name if necessary.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const data: Flatfile.RecordData[] = rows.map((row) => { | |
const mappedRow = mapKeys(row, (key) => headers[key]) | |
return mapValues(mappedRow, (value) => ({ | |
value: transform(value), | |
})) as Flatfile.RecordData | |
}) | |
const data: Flatfile.RecordData[] = rows.map((row) => { | |
const mappedRow = mapKeys(row, (key, index) => headers[index] || `Column${index + 1}`) | |
return mapValues(mappedRow, (value) => ({ | |
value: transform(value), | |
})) as Flatfile.RecordData | |
}) |
Please explain how to summarize this PR for the Changelog:
This PR provides a new option
skipEmptyLines
to skip empty lines.Tell code reviewer how and what to test:
With this release all empty rows after the last row of data will always be removed. The
skipEmptyLines
will control whether empty rows between data rows are removed.Example files with empty lines:
Book1.xlsx
contacts.txt
Example
listener.ts
: