Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: MultiIndex union/difference not commutative #60642

Open
2 of 3 tasks
ssche opened this issue Jan 2, 2025 · 4 comments
Open
2 of 3 tasks

BUG: MultiIndex union/difference not commutative #60642

ssche opened this issue Jan 2, 2025 · 4 comments
Labels
Bug Index Related to the Index class or subclasses Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Needs Triage Issue that has not been reviewed by a pandas team member setops union, intersection, difference, symmetric_difference

Comments

@ssche
Copy link
Contributor

ssche commented Jan 2, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

I wasn't able to extract the data for this example to set up the test case more programmatically, but I managed to reduce the data significantly without compromising the behaviour. I hope the below is somewhat portable (please let me know if it isn't). I used numpy==2.2.1 and pandas==2.2.3

>>> import pickle

>>> ix1_pickled = [redacted]
>>> ix2_pickled = [redacted]
ix1 = pickle.loads(ix1_pickled)
ix2 = pickle.loads(ix2_pickled)

>>> ix1
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])

>>> ix2
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>>
>>> # expected - both indices are the same - should yield same result
>>> ix2.union(ix1)
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> 
>>> # it seems each row is considered different
>>> ix1.union(ix2)
MultiIndex([(nan, '2018-06-01'),
            (nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> 
>>> # expected
>>> ix1.difference(ix2)
MultiIndex([], names=['dim1', 'dim2'])
>>> 
>>> # not expected
>>> ix2.difference(ix1)
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])

>>> # some diagnostics to show the values (the dates other than `2018-06-01` could come from my attempt to minimise the example as those values were previously contained)
>>> ix1.levels
FrozenList([[nan], [2018-06-01 00:00:00, 2018-07-01 00:00:00, 2018-08-01 00:00:00, 2018-09-01 00:00:00, 2018-10-01 00:00:00, 2018-11-01 00:00:00, 2018-12-01 00:00:00]])
>>> ix2.levels
FrozenList([[nan], [2018-06-01 00:00:00, 2018-07-01 00:00:00, 2018-08-01 00:00:00, 2018-09-01 00:00:00, 2018-10-01 00:00:00, 2018-11-01 00:00:00, 2018-12-01 00:00:00]])
>>> ix1.levels[1] == ix2.levels[1]
array([ True,  True,  True,  True,  True,  True,  True])

Issue Description

Creating the union of two indices with a nan level causes the union result to depend on the order of the call (index1.union(index2) vs. index2.union(index1)). With other words, one of the calls yields the wrong result as the call deems every row to be distinct. I'm fairly certain that is is due to nan value in dim1, but if I recreate the example programmatically, the behaviour is as expected.

>>> ix3 = pd.MultiIndex.from_product([[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]])
>>> ix3
MultiIndex([(nan, '2018-06-01')],
           )
>>> ix4 = pd.MultiIndex.from_product([[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]])
>>> 
>>> ix4.dtypes
level_0           float64
level_1    datetime64[ns]
dtype: object
>>> ix3.dtypes
level_0           float64
level_1    datetime64[ns]
>>> ix3.union(ix4)
MultiIndex([(nan, '2018-06-01')],
           )
>>> ix4.union(ix3)
MultiIndex([(nan, '2018-06-01')],
           )

However, in test cases for a rather large application, I arrive at the state from the pickle example. I'm not sure what's different to the working example

Expected Behavior

I would expect the difference of the two indices from the pickled example to be empty and the union to be the same as the two indices.

I am also at a loss as to why I can't reproduce the wrong behaviour programmatically.

Installed Versions


INSTALLED VERSIONS
------------------
commit                : 0691c5cf90477d3503834d983f69350f250a6ff7
python                : 3.13.1
python-bits           : 64
OS                    : Linux
OS-release            : 6.12.6-200.fc41.x86_64
Version               : #1 SMP PREEMPT_DYNAMIC Thu Dec 19 21:06:34 UTC 2024
machine               : x86_64
processor             : 
byteorder             : little
LC_ALL                : None
LANG                  : en_AU.UTF-8
LOCALE                : en_AU.UTF-8

pandas                : 2.2.3
numpy                 : 2.2.1
pytz                  : 2020.4
dateutil              : 2.9.0.post0
pip                   : 24.3.1
Cython                : 3.0.11
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : 1.4.2
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : 2.10.2
odfpy                 : None
openpyxl              : 3.1.2
pandas_gbq            : None
psycopg2              : 2.9.10
pymysql               : None
pyarrow               : 18.1.0
pyreadstat            : None
pytest                : 8.3.4
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.14.1
sqlalchemy            : None
tables                : 3.10.1
tabulate              : None
xarray                : None
xlrd                  : 2.0.1
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None
@ssche ssche added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Index Related to the Index class or subclasses Needs Triage Issue that has not been reviewed by a pandas team member setops union, intersection, difference, symmetric_difference labels Jan 2, 2025
@rhshadrach
Copy link
Member

rhshadrach commented Jan 4, 2025

pickle can perform arbitrary code execution and thus presents security issues. Can you post your example without using pickle.

ix1 = pd.MultiIndex(levels=ix1.levels, codes=ix1.codes, names=ix1.names, verify_integrity=False)

@rhshadrach rhshadrach added the Needs Info Clarification about behavior needed to assess issue label Jan 4, 2025
@ssche
Copy link
Contributor Author

ssche commented Jan 6, 2025

Thanks @rhshadrach - that's really helpful for boiling down the essential parameters of the multiindex to reproduce the issue without using pickle.

When I extracted the level, codes, names from ix1 and ix2, I noticed a difference in ix2.codes. ix2 doesn't seem to contain a -1 code at a nan level value. ix2 originates from an aggregation function (first in this case, but I don't think this matters) for which groupby does not assign -1 codes for nan values (not sure if that's expected; I suppose that's not correct). Pandas 2.1.1 was doing the same thing, but a subsequent ix1.union(ix2) for which ix2 had 0 codes for nan levels still worked. That seems to be regression to 2.2.3.

Compare the following two snippets:

>>> pd.__version__
'2.1.1'
>>> np.__version__
'1.24.4'
>>> ix1 = pd.MultiIndex.from_product([[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'])
>>> ix1
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> ix1.codes
FrozenList([[-1], [0]])
>>> 
>>> df = pd.DataFrame({'foo': [42]}, index=ix1)
>>> df = df.reset_index(drop=False)
>>> ix2 = df.groupby(['dim1', 'dim2'], dropna=False, observed=True).first().index
>>> 
>>> ix2
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> ix2.codes
FrozenList([[0], [0]])
             ^
             supposedly wrong code (should be -1?)
>>> 
>>> ix1.union(ix2)
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])

2.1.1 has a 0-code for a nan level after the groupby, but the union is still correct. In 2.2.3

>>> pd.__version__
'2.2.3'
>>> np.__version__
'2.2.1'
>>> ix1 = pd.MultiIndex.from_product([[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'])
>>> ix1
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> ix1.codes
FrozenList([[-1], [0]])
>>> 
>>> df = pd.DataFrame({'foo': [42]}, index=ix1)
>>> df = df.reset_index(drop=False)
>>> ix2 = df.groupby(['dim1', 'dim2'], dropna=False, observed=True).first().index
>>> 
>>> ix2
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> ix2.codes
FrozenList([[0], [0]])
>>> 
>>> ix1.union(ix2)
MultiIndex([(nan, '2018-06-01'),
            (nan, '2018-06-01')],  # <- wrong - double up of rows (should be only 1)
           names=['dim1', 'dim2'])

>>> 

# let's try to fix `ix2` (`from_frame` seems to be "fixing" the index)
>>> ix2_fixed = pd.MultiIndex.from_frame(pd.DataFrame(index=ix2).reset_index())
>>> ix2_fixed.codes
FrozenList([[-1], [0]])
            # ^
            # expected for nan
>>> 
>>> ix1.union(ix2_fixed)
MultiIndex([(nan, '2018-06-01')],  # <- correct result, 1 row only
           names=['dim1', 'dim2'])

The question is - is groupby already wrong in assigning 0 values for nan levels or should union (and potentially others) not rely on index.codes?

@ssche ssche removed the Needs Info Clarification about behavior needed to assess issue label Jan 6, 2025
@ssche
Copy link
Contributor Author

ssche commented Jan 6, 2025

I went down the avenue of assuming that groupby should have -1 codes for nan values (so other nan level indices are compatible). This lead me to Grouping which calculates the codes.

class Grouping:
        ...
    def _codes_and_uniques(self):
       ...
            uniques = self._uniques
        else:
            # GH35667, replace dropna=False with use_na_sentinel=False
            # error: Incompatible types in assignment (expression has type "Union[
            # ndarray[Any, Any], Index]", variable has type "Categorical")
            codes, uniques = algorithms.factorize(  # type: ignore[assignment]
                self.grouping_vector, sort=self._sort, use_na_sentinel=self._dropna
            )
        return codes, uniques

Since use_na_sentinel is tied to _dropna (which is set to False because the nan rows should be kept), algorithms.factorize will calculate non-negative integer code values for nan values which differs to the codes in ix1.

However, I don't think that matters much as the codes may not be consistent across indices (i.e. the code-level relationship can be different for different indices). Ideally, .union() (which uses .difference()) should account for different indices, in my opinion.

Strangely enough, one cannot set up the above example via the default constructor of MultiIndex.

ix3 = pd.MultiIndex(levels=[[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'], codes=[[-1], [0]])
ix4 = pd.MultiIndex(levels=[[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'], codes=[[0], [0]])
ix3.union(ix4)

works as expected as ix4's codes are adjusted to [[-1], [0]] after the constructor is called.

I'm still puzzled as to whether there's an implicit assumption that nan values should be denoted as -1 codes (groupby doesn't adhere to this assumption, but union() seems to assume that and the fact that the constructor corrects the nan code supports that further).

@rhshadrach
Copy link
Member

Strangely enough, one cannot set up the above example via the default constructor of MultiIndex.

You need to pass verify_integrity=False, otherwise the validation process will modify your data so that the codes are -1.

The Cython groupby implementation is heavily tied to allowing nonnegative codes for NA values to implement dropna=False. These codes are used to determine the the label, which defines the group. E.g.

for i in range(N):
lab = labels[i]
if lab < 0:
continue
for j in range(K):
val = values[i, j]

I agree this appears to be a bug in .union. It should be able to handle codes as -1 or codes as nonnegative values whose corresponding level value is NA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Index Related to the Index class or subclasses Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Needs Triage Issue that has not been reviewed by a pandas team member setops union, intersection, difference, symmetric_difference
Projects
None yet
Development

No branches or pull requests

2 participants