BUG: quoting=csv.QUOTE_NONNUMERIC adds extra decimal places #60699

Gabriel-p · 2025-01-12T13:47:32Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
import csv
from io import StringIO

df = pd.DataFrame({"col": np.array([8.57], dtype="float32")})

buff1 = StringIO()
df.to_csv(buff1)
print(buff1.getvalue())

buff2 = StringIO()
df.to_csv(buff2, quoting=csv.QUOTE_NONNUMERIC)
print(buff2.getvalue())

Issue Description

Extra decimal places are added when csv.QUOTE_NONNUMERIC is used in the to_csv() method.

Expected Behavior

Output file should store 0,8.57, not 0,8.569999694824219

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.12.8
python-bits : 64
OS : Linux
OS-release : 6.8.0-50-generic
Version : #51~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 21 12:03:03 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : es_ES.UTF-8
LOCALE : es_ES.UTF-8

pandas : 2.2.3
numpy : 2.2.1
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.2
Cython : None
sphinx : 8.1.3
IPython : 8.31.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.11.0
fsspec : 2024.12.0
html5lib : 1.1
hypothesis : None
gcsfs : None
jinja2 : 3.1.5
lxml.etree : None
matplotlib : 3.10.0
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.15.0
sqlalchemy : 2.0.37
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None


wjandrea · 2025-01-12T20:06:41Z

Beside the point, but you can simplify the example code:

import pandas as pd
import csv

df = pd.DataFrame({"col": [8.57]}, dtype="float32")

print(df.to_csv())

print(df.to_csv(quoting=csv.QUOTE_NONNUMERIC))

Outputs for reference:

,col
0,8.57

"","col"
0,8.569999694824219

rhshadrach · 2025-01-12T22:07:22Z

Thanks for the report! Internally, pandas is doing:

pandas/pandas/core/indexes/base.py

Line 7761 in 57d2489

values = np.array(values, dtype="object")

values = np.array([8.57], dtype="float32")
print(values)
# [8.57]
print(np.array(values, dtype="object"))
# [8.569999694824219]

I believe the cast to object dtype converts each float32 value to Python's 64-bit float. If we can avoid the cast to object, we may be able to avoid the extra digits. Further investigations welcome!
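For context, the standard-library csv writer decides what to quote from each field's Python type, which is likely why the quoting case keeps the values as numbers rather than strings. A minimal sketch using only the csv module (it shows the stdlib behavior, not pandas internals):

```python
import csv
from io import StringIO

buf = StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC)

# A str field is quoted; float fields are written unquoted using their repr.
writer.writerow(["8.57", 8.57, 8.569999694824219])

line = buf.getvalue()
print(line)  # "8.57",8.57,8.569999694824219
```

So a float that was already widened from float32 is written with all of its float64 digits, while a pre-formatted string would end up quoted.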

akj2018 · 2025-01-13T10:48:09Z

In addition to float32, I tested this issue with the other dtype.kind == "f" options, namely:

float16

df = pd.DataFrame({"col": [8.57]}, dtype="float16")
print(df.to_csv())
# ,col
# 0,8.57

print(df.to_csv(quoting=csv.QUOTE_NONNUMERIC))
# "","col"
# 0,8.5703125

float32

df = pd.DataFrame({"col": [8.57]}, dtype="float32")
print(df.to_csv())
# ,col
# 0,8.57

print(df.to_csv(quoting=csv.QUOTE_NONNUMERIC))
# "","col"
# 0,8.569999694824219

float64 or float (Python's built-in, equivalent to float64 in NumPy)

df = pd.DataFrame({"col": [8.57]}, dtype="float64")
print(df.to_csv())
# ,col
# 0,8.57

print(df.to_csv(quoting=csv.QUOTE_NONNUMERIC))
# "","col"
# 0,8.57

Why different behavior for `quoting=csv.QUOTE_NONNUMERIC`

If quoting is not specified:
- to_csv uses the default csv.QUOTE_MINIMAL, which is the integer constant 0
- get_values_for_csv stringifies the numeric values and stores them as Unicode strings (dtype=<U32)
- Finally, it calls values.astype(object, copy=False), which keeps those strings and so preserves the exact decimal formatting

pandas/pandas/core/indexes/base.py

Lines 7758 to 7759 in 57d2489

    
if not quoting:
    values = values.astype(str)
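The default path can be reproduced outside pandas. This is a rough sketch that assumes only NumPy and the csv module; the variable names are illustrative and not taken from pandas:

```python
import csv
import numpy as np

# QUOTE_MINIMAL is 0, so `not quoting` is True when the default is used.
print(csv.QUOTE_MINIMAL, csv.QUOTE_ALL, csv.QUOTE_NONNUMERIC, csv.QUOTE_NONE)

values = np.array([8.57], dtype="float32")

# NumPy formats each float32 at float32 precision, giving the short string.
as_str = values.astype(str)
print(as_str[0])  # 8.57

# Casting the strings to object keeps them as str, so the digits are unchanged.
as_obj = as_str.astype(object)
print(type(as_obj[0]).__name__, as_obj[0])
```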

Otherwise, when quoting is specified:
- Here we specified csv.QUOTE_NONNUMERIC, which is the integer constant 2, but the same behavior is observed for all three non-default options (QUOTE_ALL [1], QUOTE_NONNUMERIC [2], QUOTE_NONE [3])
- get_values_for_csv casts float16 / float32 / float64 to object dtype, so each element becomes a Python float (64-bit), as mentioned by @rhshadrach
- The float dtypes (float16, float32, float64) differ in how closely their binary representation can approximate a decimal value.
- For 32-bit, 8.57 is stored as the nearest float32 value, approximately 8.569999694824219
- For 16-bit, 8.57 is stored as the nearest float16 value, 8.5703125
- When printing, Python uses a “shortest round-trip” rule: it finds the shortest decimal string that parses back to the same 64-bit float. For a float that originated as a float32 approximation, that string shows “extra” digits, e.g. 8.569999694824219.
- For 64-bit, it shows 8.57 directly, with no extra digits, since 8.57 is the shortest decimal string that round-trips

pandas/pandas/core/indexes/base.py

Lines 7760 to 7765 in 57d2489

    
else:
    values = np.array(values, dtype="object")
values[mask] = na_rep
values = values.astype(object, copy=False)
return values
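The widening described in the list above can be checked without NumPy or pandas. The sketch below uses the standard-library struct module to round 8.57 to each IEEE 754 width and read it back as a Python float; the `widen` helper is a hypothetical name used only for this illustration:

```python
import struct

def widen(value, fmt):
    """Round `value` to the narrower format `fmt`, then widen it back to a Python float."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

from_f16 = widen(8.57, "<e")  # IEEE 754 binary16
from_f32 = widen(8.57, "<f")  # IEEE 754 binary32
from_f64 = widen(8.57, "<d")  # IEEE 754 binary64, same as a Python float

# repr() gives the shortest string that round-trips as a 64-bit float.
print(repr(from_f16))  # 8.5703125
print(repr(from_f32))  # 8.569999694824219
print(repr(from_f64))  # 8.57
```

The printed values match the to_csv outputs reported above for float16, float32 and float64.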

Proposed Solution

First convert float16 / float32 to strings to preserve their decimal representation.
Then convert each decimal string back to a Python float, which keeps the same short decimal format when printed.

if not quoting:
    values = values.astype(str)
else:
    values = np.array(values, dtype="str")   # Convert float16 -> string
    values = values.astype(float, copy=False)   # Parse string -> Python float64

values = values.astype(object, copy=False)
values[mask] = na_rep
return values

Problems with this solution

Does not preserve the exact float16 / float32 binary value (important for numeric operations).
The decimal string may not perfectly reconstruct the original float16 / float32 approximation, leading to small numeric differences.
Stringifying and re-parsing has a performance and memory cost.
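To make the first two problems concrete, here is a small standalone sketch. `roundtrip_via_str` is a hypothetical helper that mirrors the proposed else branch; it is not pandas code and only assumes NumPy:

```python
import numpy as np

def roundtrip_via_str(values):
    # Hypothetical helper: stringify at the original precision, then parse as float64.
    as_str = np.array(values, dtype="str")
    return as_str.astype(float).astype(object)

values = np.array([8.57], dtype="float32")

current = np.array(values, dtype="object")  # what the quoting branch does today
proposed = roundtrip_via_str(values)

print(current[0], proposed[0])
# The float64 value written out is no longer the widened float32 value.
print(current[0] == proposed[0])
# For this input, casting back to float32 still recovers the original value.
print(np.float32(proposed[0]) == values[0])
```

For 8.57 the written value changes as a float64, but a reader that parses the file back into float32 gets the same number. Whether that holds for every input would need testing.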

Any feedback is welcome. Let me know if I need to test this solution for a specific test case and improve further. Once approved via discussion, I will perform testing and generate asv metrics.

Thank you

Gabriel-p added the Bug and Needs Triage labels on Jan 12, 2025

rhshadrach added the IO CSV label and removed the Needs Triage label on Jan 12, 2025
