[PERF] Performance issue in pandas 1.3.4's read_csv Function with Large Column CSV Files #3365

**Issue Description:**

Hello.
I have found a performance regression in the `read_csv` function of pandas 1.3.4 when loading CSV files with a very large number of columns. Loading time rises from a few seconds in the previous version 1.2.5 to several minutes, a slowdown of almost 60x. The problem has been discussed on GitHub in [#44106](https://github.com/pandas-dev/pandas/issues/44106) and [#44192](https://github.com/pandas-dev/pandas/pull/44192).
In your CI environment, the `requirements-wheel.txt` file pins `pandas==1.3.4` for Python `3.10`. Tests that call `pd.read_csv` frequently may therefore take more time and resources than necessary.
It might be worth updating the pandas version or adjusting the dependencies to speed up testing and reduce resource usage.


**Steps to Reproduce:**

I have created a small reproducible example to better illustrate this issue. 

```python
# v1.3.4
import os
import pandas
import numpy
import timeit

def generate_sample():
    if not os.path.exists("test_small.csv.gz"):
        nb_col = 100000
        nb_row = 5
        feature_list = {'sample': ['s_' + str(i+1) for i in range(nb_row)]}
        for i in range(nb_col):
            feature_list.update({'feature_' + str(i+1): list(numpy.random.uniform(low=0, high=10, size=nb_row))})
        df = pandas.DataFrame(feature_list)
        df.to_csv("test_small.csv.gz", index=False, float_format="%.6f")

def load_csv_file():
    col_names = pandas.read_csv("test_small.csv.gz", low_memory=False, nrows=1).columns
    types_dict = {col: numpy.float32 for col in col_names}
    types_dict.update({'sample': str})
    feature_df = pandas.read_csv("test_small.csv.gz", index_col="sample", na_filter=False, dtype=types_dict, low_memory=False)
    print("loaded dataframe shape:", feature_df.shape)

generate_sample()
# Print the elapsed time so it also shows up when run as a script.
print(timeit.timeit(load_csv_file, number=1))

# results
# loaded dataframe shape: (5, 100000)
# 120.37690759263933
```

```python
# v1.3.5
import os
import pandas
import numpy
import timeit

def generate_sample():
    if not os.path.exists("test_small.csv.gz"):
        nb_col = 100000
        nb_row = 5
        feature_list = {'sample': ['s_' + str(i+1) for i in range(nb_row)]}
        for i in range(nb_col):
            feature_list.update({'feature_' + str(i+1): list(numpy.random.uniform(low=0, high=10, size=nb_row))})
        df = pandas.DataFrame(feature_list)
        df.to_csv("test_small.csv.gz", index=False, float_format="%.6f")

def load_csv_file():
    col_names = pandas.read_csv("test_small.csv.gz", low_memory=False, nrows=1).columns
    types_dict = {col: numpy.float32 for col in col_names}
    types_dict.update({'sample': str})
    feature_df = pandas.read_csv("test_small.csv.gz", index_col="sample", na_filter=False, dtype=types_dict, low_memory=False)
    print("loaded dataframe shape:", feature_df.shape)


generate_sample()
# Print the elapsed time so it also shows up when run as a script.
print(timeit.timeit(load_csv_file, number=1))

# results
# loaded dataframe shape: (5, 100000)
# 2.8567268839105964
```
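For reference, the two single-run timings above work out to roughly a 42x difference between 1.3.4 and 1.3.5 on this file. The short sketch below only does that arithmetic; the helper name `slowdown_vs_fastest` is made up for illustration, and the numbers are copied from the runs above.

```python
# Timings copied from the two runs above (seconds, one run each).
timings = {"1.3.4": 120.37690759263933, "1.3.5": 2.8567268839105964}


def slowdown_vs_fastest(timings):
    """Express each timing as a multiple of the fastest one."""
    fastest = min(timings.values())
    return {version: seconds / fastest for version, seconds in timings.items()}


ratios = slowdown_vs_fastest(timings)
for version, ratio in ratios.items():
    print(f"pandas {version}: {ratio:.1f}x the fastest run")
```

Since each timing comes from a single run (`number=1`), the ratio should be read as an order of magnitude rather than a precise benchmark.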

**Suggestion:**

I would recommend upgrading to pandas >= 1.3.5, or exploring other ways to speed up loading of CSV files.
Any other workarounds or solutions would be greatly appreciated.
Thank you!
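As a possible stopgap until the pin changes, a test setup could at least warn when the slow release is installed. The following is only a minimal sketch: the names `KNOWN_SLOW`, `parse_release`, `is_known_slow` and `warn_if_slow_pandas` are hypothetical, and I only flag 1.3.4 because it is the one release I measured as slow above (other 1.3.x releases may or may not be affected).

```python
import re
import warnings
from importlib import metadata

# Assumption: only 1.3.4 is listed, since it is the only release
# measured as slow in this report.
KNOWN_SLOW = {(1, 3, 4)}


def parse_release(version):
    """Return the leading numeric part of a version, e.g. '1.3.4' -> (1, 3, 4)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_known_slow(version):
    """True if this version is in the hand-maintained slow list above."""
    return parse_release(version) in KNOWN_SLOW


def warn_if_slow_pandas():
    """Emit a warning if the installed pandas is a known-slow release."""
    try:
        installed = metadata.version("pandas")
    except metadata.PackageNotFoundError:
        return  # pandas is not installed, nothing to check
    if is_known_slow(installed):
        warnings.warn(
            f"pandas {installed} has a slow read_csv on very wide CSV files; "
            "consider pandas >= 1.3.5"
        )


warn_if_slow_pandas()
```

The version check is kept separate from the lookup of the installed package, so the list of flagged releases can be adjusted without touching the rest.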
