Validate string datatype in the column #807
-
Hello, I would like to validate a dataframe that has string datatype in one column.
This is my dataframe, and I tried to validate the string column "message".
Now I changed the values in the "message" column to integers and ran the validation again.
It does not throw a string validation error. Even with "coerce=True", it coerces the integer to a string and validates it as a string. I want the validation to throw an error if there is an integer or float value in a string column. Kindly let me know if there is a solution. Thanks.
-
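Hi @dineshkumar-23, so this is a limitation of the pre-pandas-1.0 string representation: specifying dtype=str actually uses the numpy object datatype instead of a logical string datatype to represent the array.

Recommendation

I'd highly recommend using pandas.StringDtype in this case (assuming you're using pandas>=1):

    import pandas as pd
    import pandera as pa
    from pandera.typing import Series

    class Schema(pa.SchemaModel):
        device: Series[int]
        type: Series[float]
        message: Series[pd.StringDtype]  # logical string dtype, not numpy object

        class Config:
            coerce = True

edit: add coerce=True to config

In this case, pandera will complain even with your original dataframe: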
Indeed, if you try to corrupt a StringDtype column:

    df = pd.DataFrame({
        "device": range(10),
        "type": list(map(float, reversed(range(10)))),
        "message": ["Hello"] * 10,
    }).astype({"message": pd.StringDtype()})

    corrupt_df = df.copy()
    # try to overwrite some of the string values with an integer
    corrupt_df.loc[:5, "message"] = 92

Output:
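As a rough sketch of the same point using pandas alone (assuming pandas>=1; the two-row series and the variable names are illustrative and not from the thread), an object column accepts an integer without complaint, whereas a StringDtype column is expected to refuse it. The sketch catches both TypeError and ValueError because the exact exception class may depend on the pandas version:

```python
import pandas as pd

# Illustrative toy data, not the dataframe from the thread.
# object dtype is what dtype=str produced in the pandas versions discussed here.
obj_messages = pd.Series(["Hello", "World"], dtype=object)
obj_messages.loc[0] = 92            # accepted: object columns hold any Python value
coerced = obj_messages.astype(str)  # the integer 92 becomes the string "92"

str_messages = pd.Series(["Hello", "World"], dtype=pd.StringDtype())
try:
    str_messages.loc[0] = 92
    rejected = False
except (TypeError, ValueError):
    # The exception class may differ between pandas versions, so catch both.
    rejected = True

print(type(obj_messages.loc[0]).__name__, repr(coerced.loc[0]), rejected)
```

Here `rejected` records whether the StringDtype series refused the integer, while `coerced` shows that casting the object series to str quietly hides the original integer.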
So using StringDtype is type-safe, while the numpy object representation you get from str is not.

NOTE: in your use case, coercing to str is actually dangerous because all other datatypes can be coerced into a string as long as the type implements the __str__ method.

Workaround

Alternatively, you can define a custom check to verify the actual values of the "message" column:

    class Schema(pa.SchemaModel):
        device: Series[int]
        type: Series[float]
        message: Series[str]

        @pa.check("message")
        def check_numpy_string_type(cls, series: Series) -> Series:
            # True only for values that are real Python strings
            return series.map(lambda x: isinstance(x, str))

    Schema(corrupt_df)

Output:
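The reasoning behind the row-wise isinstance test can be seen in plain Python, without pandas or pandera. This is a minimal sketch with made-up sample values: str() converts every value without error, so coercion alone can never flag a non-string, while isinstance on the raw values keeps the distinction:

```python
# Hypothetical sample values; only the first one is a real string.
values = ["Hello", 92, 3.5, None]

# str() succeeds for every value, so coercion cannot flag the non-strings.
coerced = [str(v) for v in values]

# An isinstance test on the raw values keeps the distinction.
is_string = [isinstance(v, str) for v in values]

print(coerced)    # ['Hello', '92', '3.5', 'None']
print(is_string)  # [True, False, False, False]
```

This is why a check of this kind has to look at the values before any coercion to str has been applied.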
Next Steps

I've opened up an issue here: #808 to special-case the ...