Validate string datatype in the column #807

dineshkumar-23 · 2022-03-29T11:48:56Z

Hello,

I would like to validate a dataframe that has string datatype in one column.
For example,

import pandas as pd
import pandera as pa
from pandera.typing import Series

df = pd.DataFrame({
    "device": range(10),
    "type": map(float, reversed(range(10))),
    "message": ["Hello"] * 10,
})
	device	type	message
0	0	9.0	Hello
1	1	8.0	Hello
2	2	7.0	Hello
3	3	6.0	Hello
4	4	5.0	Hello
5	5	4.0	Hello
6	6	3.0	Hello
7	7	2.0	Hello
8	8	1.0	Hello
9	9	0.0	Hello

This is my dataframe. I tried to validate the string column "message".

class Schema(pa.SchemaModel):
    device: Series[int]
    type: Series[float]
    message: Series[str]

try:
    Schema(df, lazy=True)
    print(f"Valid_df: {df}")
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)
--------
Output
--------
Valid_df:    device  type       message
0       0   9.0  Hello
1       1   8.0  Hello
2       2   7.0  Hello
3       3   6.0  Hello
4       4   5.0  Hello
5       5   4.0  Hello
6       6   3.0  Hello
7       7   2.0  Hello
8       8   1.0  Hello
9       9   0.0  Hello

Now I changed the values in the "message" column to an integer and did the validation.

corrupt_df = df.copy()
corrupt_df.loc[:5, "message"] = 92
corrupt_df
--------
Output
--------
device | type | message
-- | -- | --
0 | 9.0 | 92
1 | 8.0 | 92
2 | 7.0 | 92
3 | 6.0 | 92
4 | 5.0 | 92
5 | 4.0 | 92
6 | 3.0 | Hello
7 | 2.0 | Hello
8 | 1.0 | Hello
9 | 0.0 | Hello

try:
    Schema(corrupt_df, lazy=True)
    print(f"Valid_df: {corrupt_df}")
except pa.errors.SchemaErrors as e:
    fail_index = e.failure_cases
    print(fail_index)
-------
Output
-------
Valid_df:    device  type        message
0       0   9.0               92
1       1   8.0               92
2       2   7.0               92
3       3   6.0               92
4       4   5.0               92
5       5   4.0               92
6       6   3.0  Hello
7       7   2.0  Hello
8       8   1.0  Hello
9       9   0.0  Hello

Validation does not raise a string validation error. Even with "coerce=True", it coerces the integers to strings and then validates them as strings.

I want the validation to raise an error if there is an integer or float value in a string column. Kindly let me know if there is a solution.

Thanks.

Answered by cosmicBboy

cosmicBboy · 2022-03-29T13:05:07Z

Maintainer

Hi @dineshkumar-23, so this is a limitation of the pre-pandas 1.0 string representation... specifying dtype=str actually uses the numpy object datatype instead of a logical string datatype to represent the array.
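To make the dtype point concrete, here is a minimal sketch (not taken from pandera itself) showing that an object column reports the same dtype whether or not every element is a string. The columns are built with dtype=object explicitly, on the assumption that this mirrors what dtype=str resolved to on the pandas versions discussed in this thread; the variable names are illustrative only.

```python
import pandas as pd

# Built with dtype=object explicitly; this is assumed to mirror what
# dtype=str resolved to on the pandas versions discussed in this thread.
clean = pd.Series(["Hello"] * 4, dtype=object)
mixed = pd.Series([92, 92, "Hello", "Hello"], dtype=object)

# Both columns report the same dtype, so a dtype-only check
# cannot tell them apart.
same_dtype = clean.dtype == mixed.dtype

# Only an element-wise scan reveals the non-string values.
element_types = mixed.map(lambda x: type(x).__name__).tolist()

print(same_dtype)
print(element_types)
```

A check that looks only at the dtype therefore sees nothing wrong with the mixed column, which seems to be why the original schema passed.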

Recommendation

I'd highly recommend using pandas.StringDtype in this case (assuming you're using pandas>=1):

class Schema(pa.SchemaModel):
    device: Series[int]
    type: Series[float]
    message: Series[pd.StringDtype]

    class Config:
        coerce = True

edit: add coerce=True to config

In this case, pandera will complain even with your original dataframe:

  File "/Users/nielsbantilan/git/pandera/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected series 'message' to have type string[python], got object

Indeed, if you try to corrupt a StringDtype array pandas doesn't let you:

df = pd.DataFrame({
    "device": range(10),
    "type": map(float, reversed(range(10))),
    "message": ["Hello"] * 10,
}).astype({"message": pd.StringDtype()})

corrupt_df = df.copy()
corrupt_df.loc[:5, "message"] = 92

Output:

ValueError: Cannot set non-string value '92' into a StringArray.

So using StringDtype is type-safe, while str is equivalent to using object.

NOTE: in your use case, coercing is actually dangerous because all other datatypes can be coerced into a string as long as the type implements the __str__ dunder method.
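As a rough illustration of that risk, the sketch below uses plain pandas (no pandera) to show how casting to str hides the original type mismatch. The sample values and variable names are made up for the example.

```python
import pandas as pd

values = pd.Series([92, 3.5, "Hello"], dtype=object)

# Before coercion, an element-wise check can still spot the numbers.
all_strings_before = all(isinstance(x, str) for x in values)

# astype(str) converts every element to its string form.
coerced = values.astype(str)

# After coercion every element is a str, so the mismatch is gone.
all_strings_after = all(isinstance(x, str) for x in coerced)

print(all_strings_before, all_strings_after)
print(coerced.tolist())
```

Once the values have been coerced, no later check can tell that 92 started life as an integer, so coercion is best left off when the goal is to reject non-string input.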

Workaround

Alternatively, you can define a custom check to verify the actual values of the message column:

class Schema(pa.SchemaModel):
    device: Series[int]
    type: Series[float]
    message: Series[str]

    @pa.check("message")
    def check_numpy_string_type(cls, series: Series) -> Series:
        return series.map(lambda x: isinstance(x, str))

Schema(corrupt_df)

output:

pandera.errors.SchemaError: <Schema Column(name=message, type=DataType(str))> failed element-wise validator 0:
<Check check_numpy_string_type>
failure cases:
   index failure_case
0      0           92
1      1           92
2      2           92
3      3           92
4      4           92
5      5           92
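The same element-wise idea can be tried outside pandera. The following is a sketch in plain pandas; `non_string_cases` is a hypothetical helper written for this example, not a pandera or pandas function.

```python
import pandas as pd


def non_string_cases(series: pd.Series) -> pd.Series:
    """Return the elements of `series` that are not Python str instances."""
    # Element-wise isinstance test, mirroring the custom check above.
    is_str = series.map(lambda x: isinstance(x, str)).astype(bool)
    return series[~is_str]


# Same shape as the corrupted "message" column: six ints, four strings.
message = pd.Series([92] * 6 + ["Hello"] * 4, dtype=object)

failures = non_string_cases(message)
print(failures.index.tolist())
print(failures.tolist())
```

Note that this helper also flags missing values such as None or NaN, since they are not str instances; whether that is desirable depends on the column.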

Next Steps

I've opened up an issue here: #808 to special-case the str dtype, since this is a pandas quirk that's worth having an opinion about in pandera... historically we've stuck closely with pandas (and delegate most of the data type semantics to pandas) but I think pandera should start stepping in when it breaks user expectations like this.

9 replies

cosmicBboy Mar 31, 2022
Maintainer

yep! this is a known issue, an unfortunate default legacy setting to do with the n_failure_cases kwarg... this is fixed by #784 and it'll be out in the next release 0.10.0.

In the meantime, you can do:

@pa.check("message", n_failure_cases=None)

dineshkumar-23 Mar 31, 2022
Author

Amazing. Thank you.

NickleDave Apr 9, 2022

Hi @cosmicBboy are there any callouts in the documentation that say something along the lines of "prefer pd.StringDtype over str when possible" (assuming I have re-phrased your answer here correctly)?

I did some searches and didn't find any.

If not, maybe it would be worth adding them somewhere? A "how-to" / FAQ page and/or maybe some concrete examples of common use cases?

I ask because there are snippets that use str which kind of makes it seem like this is an okay thing to do, e.g. on the main index: https://pandera.readthedocs.io/en/stable/index.html#schema-model

I ran into a similar issue as this post (silently coercing a column into object because I said it should be str) and that's when it dawned on me something unexpected was going on. Then I looked at the main Schema Model page and saw this mention of StringDtype and realized that I was probably doing things in a not-great way:
https://pandera.readthedocs.io/en/stable/schema_models.html#dtype-aliases

jeffzi Apr 10, 2022
Collaborator

Nullable strings were introduced in pandas 1.0, and are still considered experimental. Pandera follows pandas behavior:

import pandas as pd
import numpy as np

pd.Series("foo", dtype="string").astype(str).dtype == np.object_  # 'string' is the alias for pd.StringDtype

This can be confusing but Pandera mapping str to StringDtype would be surprising for experienced pandas users.
That said, I do agree the documentation should be clearer and re-explain pandas quirks.

"prefer pd.StringDtype over str when possible"

The recommendation comes from the fact that an object can wrap anything and validate will not guarantee the content is actually a string, since it only checks the dtype. I'm working on an extension of Pandera DataTypes that will fix this, see #808.

Even with the future fix, validation will be slower with str since it will have to scan the content. Generally speaking, StringDtype will guarantee that the content is a string, independently of pandera's validation.
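A small sketch of that last point, assuming pandas>=1.0: when a column is created with the "string" dtype, inspecting the dtype alone is enough to know what the elements are, with no content scan needed. The variable names are illustrative.

```python
import pandas as pd

# "string" is the alias for pd.StringDtype.
typed = pd.Series(["Hello", "World"], dtype="string")

# The dtype itself carries the guarantee.
is_string_dtype = isinstance(typed.dtype, pd.StringDtype)

# Scanning confirms it here, but the scan is redundant for this dtype.
contents_are_str = all(isinstance(x, str) for x in typed)

print(is_string_dtype, contents_are_str)
```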

TL;DR: I think we should open a best practices page in the documentation. One such best practice is "prefer nullable dtypes whenever possible", with supporting explanations for this choice.

NickleDave Apr 14, 2022

Thank you @jeffzi -- I appreciate your taking the time to explain.

Your plan to make a "good practices" page makes sense to me.
