# Releases: lshpaner/eda_toolkit

## EDA Toolkit 0.0.15

### Scatter Plot Function Updates

#### Avoid In-Place Modification of `exclude_combinations`

This addresses an issue where the `scatter_fit_plot` function modified the `exclude_combinations` parameter in place, causing errors when the same object was reused in subsequent calls.

**Changes Made**

- Create a local copy of `exclude_combinations` for normalization instead of modifying the input directly:

  ```python
  exclude_combinations_normalized = {
      tuple(sorted(pair)) for pair in exclude_combinations
  }
  ```
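The effect of the fix can be sketched as follows (the helper name is illustrative, not part of the library's API): normalization builds a new set, so the caller's list is left untouched and can be reused.

```python
# Illustrative helper (not the library's actual API): normalize the
# excluded pairs into an order-insensitive set without mutating the input.
def normalize_exclusions(exclude_combinations):
    return {tuple(sorted(pair)) for pair in exclude_combinations}

pairs = [("b", "a"), ("c", "d")]
normalized = normalize_exclusions(pairs)

assert pairs == [("b", "a"), ("c", "d")]  # input list unchanged
assert ("a", "b") in normalized           # lookup is order-insensitive
```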
#### Improve Progress Tracking and Resolve Last-Plot Saving Issue

- **Separate Progress Bars for the Grid Plot:**
  - Added a `tqdm` progress bar to track the rendering of subplots in the grid.
  - Introduced a second `tqdm` progress bar to handle the saving step of the entire grid plot.
- **Fix for Last Plot Saving with `save_plots="all"`:**
  - Ensured individual plots and the grid plot are saved independently, without overlap or interference.
  - Addressed an issue where the last individual plot was incorrectly saved or overwritten.
- **Accurate Updates and Feedback:**
  - Progress bars now provide clear updates for the rendering and saving stages, avoiding hangs or delays.
#### Updated `tqdm` Saving Logic in `scatter_fit_plot`

- Refactored the `tqdm` progress bar in `scatter_fit_plot` to track the overall plot-saving process, covering both individual and grid plots.
- Updated the `tqdm` progress bar description in `scatter_fit_plot` to use universal phrasing: "Saving scatter plot(s)."
- Ensured consistent progress tracking for both singular and multiple plot-saving scenarios.
## EDA Toolkit 0.0.14

### Ensure the Crosstabs Dictionary Is Populated with `return_dict=True`

This resolves an issue where the `stacked_crosstab_plot` function failed to populate and return the crosstabs dictionary (`crosstabs_dict`) when `return_dict=True` and `output="plots_only"`. The fix ensures that crosstabs are always generated when `return_dict=True`, regardless of the `output` parameter.

- **Always Generate Crosstabs with `return_dict=True`:**
  - Added logic to ensure crosstabs are created and populated in `crosstabs_dict` whenever `return_dict=True`, even if the `output` parameter is set to `"plots_only"`.
- **Separation of Crosstab Display from Generation:**
  - The generation of crosstabs is now independent of the `output` parameter.
  - Crosstab display (`print`) occurs only when `output` includes `"both"` or `"crosstabs_only"`.
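A minimal sketch of this decoupling (illustrative function body and parameter handling; not the library's actual implementation):

```python
# Illustrative sketch: crosstab *generation* runs whenever the dict is
# requested, while *display* is gated separately on the `output` value.
import pandas as pd

def stacked_crosstab_sketch(df, col, func_col, output="both", return_dict=False):
    crosstabs_dict = {}
    # Generate whenever the dict is requested, regardless of `output`.
    if return_dict or output in ("both", "crosstabs_only"):
        crosstabs_dict[func_col] = pd.crosstab(df[col], df[func_col])
    # Display only when `output` asks for crosstabs.
    if output in ("both", "crosstabs_only"):
        print(crosstabs_dict[func_col])
    if return_dict:
        return crosstabs_dict

df = pd.DataFrame({"a": ["x", "x", "y"], "b": ["u", "v", "u"]})
result = stacked_crosstab_sketch(df, "a", "b", output="plots_only", return_dict=True)
assert "b" in result  # populated even with output="plots_only"
```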
### Enhancements and Fixes for the `scatter_fit_plot` Function

This addresses critical issues and introduces key enhancements for the `scatter_fit_plot` function. These changes improve the usability, flexibility, and robustness of the function.

**Enhancements and Fixes**

1. **Added `exclude_combinations` Parameter**
   - Feature: Users can now exclude specific variable pairs from being plotted by providing a list of tuples with the combinations to omit.
2. **Added `combinations` Option to `show_plot`**
   - Feature: Users can now show just the list of combinations that are part of the selection process when `all_vars=True`.
3. **Fixed Bug with Single Variable Pair Plotting**
   - Bug: When plotting a single variable pair with `show_plot="both"`, the function threw an `AttributeError`.
   - Fix: Single variable pairs are now handled properly.
4. **Updated Default for the `show_plot` Parameter**
   - Enhancement: Changed the default value of `show_plot` to `"both"` to prevent excessive individual plots when handling large variable sets.
5. **Fixed Unused `legend`, `xlim`, and `ylim` Inputs**
   - These inputs were previously ignored; they are now applied.
### Fix Default Title and Filename Handling in `flex_corr_matrix`

This resolves issues in the `flex_corr_matrix` function where:

- No default title was provided when `title=None`, resulting in missing titles on plots.
- Saved plot filenames were incorrect, leading to issues like `.png.png` when `title` was not provided.

The fix ensures that a default title ("Correlation Matrix") is used for both plot display and file saving when no `title` is explicitly provided. If `title` is explicitly set to `None`, the plot will have no title, but the saved filename will still use `"correlation_matrix"`.

1. **Default Filename and Title Logic:**
   - If no `title` is provided, `"Correlation Matrix"` is used as the default for filenames and displayed titles.
   - If `title=None` is explicitly passed, no title is displayed on the plot.
2. **File-Saving Improvements:**
   - Filenames are generated from the `title`, or default to `"correlation_matrix"` if `title` is not provided.
   - Spaces in the `title` are replaced with underscores, and special characters such as `:` are removed to ensure valid filenames.
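The filename logic can be sketched as below. The space-to-underscore and special-character rules come from the notes; the helper name and the lowercasing are illustrative assumptions, not the library's exact implementation.

```python
# Illustrative helper (hypothetical name): derive a safe filename stem
# from a plot title, falling back to "correlation_matrix".
import re

def title_to_filename(title=None, default="correlation_matrix"):
    if not title:
        return default
    # Replace spaces with underscores, then drop characters like ":".
    stem = title.replace(" ", "_")
    return re.sub(r"[^A-Za-z0-9_\-]", "", stem).lower()

assert title_to_filename() == "correlation_matrix"
assert title_to_filename("Correlation Matrix") == "correlation_matrix"
```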
## EDA Toolkit 0.0.13a

### Description

This release introduces a series of updates and fixes across multiple functions in the `eda_toolkit` library to enhance error handling, improve cross-environment compatibility, streamline usability, and optimize performance. These changes address critical issues, add new features, and ensure consistent behavior in both terminal and notebook environments.
### Add `ValueError` for Insufficient Pool Size in `add_ids` and Enhance ID Deduplication

This update enhances the `add_ids` function by adding explicit error handling and improving the uniqueness guarantee for generated IDs. The following changes have been implemented:

**Key Changes**

- **New `ValueError` for Insufficient Pool Size:**
  - Calculates the pool size ($9 \times 10^{d-1}$ for $d$-digit IDs) and compares it with the number of rows in the DataFrame.
  - Behavior:
    - Throws a `ValueError` if `n_rows > pool_size`.
    - Prints a warning if `n_rows` approaches 90% of the pool size, suggesting an increase in digit length.
- **Improved ID Deduplication:**
  - Introduced a set (`unique_ids`) to track generated IDs.
  - IDs are checked against this set to ensure uniqueness before being added to the DataFrame.
  - Prevents collisions by regenerating IDs only for duplicates, minimizing retries and improving performance.

**Benefits**

- Ensures robust error handling, avoiding silent failures or excessive retries caused by small digit lengths.
- Guarantees unique IDs even for large DataFrames, improving reliability and scalability.
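The pool-size check can be sketched as follows: with $d$-digit IDs whose leading digit is nonzero, only $9 \times 10^{d-1}$ distinct values exist. The helper name and messages below are illustrative, not the library's.

```python
# Illustrative sketch of the pool-size guard described above.
def check_id_pool(n_rows, d):
    pool_size = 9 * 10 ** (d - 1)  # d-digit IDs with a nonzero lead digit
    if n_rows > pool_size:
        raise ValueError(
            f"Cannot generate {n_rows} unique {d}-digit IDs; "
            f"the pool only holds {pool_size}."
        )
    if n_rows > 0.9 * pool_size:
        print("Warning: nearing pool capacity; consider more digits.")
    return pool_size

assert check_id_pool(500, 4) == 9000
```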
### Enhance `strip_trailing_period` to Support Strings and Mixed Data Types

This enhances the `strip_trailing_period` function to handle trailing periods in both numeric and string values. The updated implementation ensures robustness for columns with mixed data types and gracefully handles special cases like `NaN`.

**Key Enhancements**

- **Support for Strings with Trailing Periods:**
  - Removes trailing periods from string values such as `"123."` or `"test."`.
- **Mixed Data Types:**
  - Handles columns containing both numeric and string values seamlessly.
- **Graceful Handling of `NaN`:**
  - Skips processing for `NaN` values, leaving them unchanged.
- **Robust Type Conversion:**
  - Converts numeric strings (e.g., `"123."`) back to `float` where applicable.
  - Retains strings if conversion to `float` is not possible.
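The per-value behavior described above can be sketched like this; the real function operates on a DataFrame column, and the helper name here is illustrative.

```python
# Illustrative per-value sketch of the strip_trailing_period behavior.
import math

def strip_trailing_period_value(value):
    # Leave NaN untouched.
    if isinstance(value, float) and math.isnan(value):
        return value
    text = str(value)
    if text.endswith("."):
        text = text[:-1]
    # Convert numeric strings back to float where possible.
    try:
        return float(text)
    except ValueError:
        return text

assert strip_trailing_period_value("123.") == 123.0
assert strip_trailing_period_value("test.") == "test"
```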
### Changes in `stacked_crosstab_plot`

#### Remove the `IPython` Dependency by Replacing `display` with `print`

This resolves an issue where the `eda_toolkit` library required `IPython` as a dependency due to the use of `display(crosstab_df)` in the `stacked_crosstab_plot` function. The dependency caused import failures in environments without `IPython`, especially in non-Jupyter, terminal-based workflows.

**Changes Made**

- **Replaced `display` with `print`:**
  - The line `display(crosstab_df)` was replaced with `print(crosstab_df)` to eliminate the need for `IPython`. This ensures compatibility across terminal and Jupyter environments without requiring additional dependencies.
- **Removed the `IPython` Import:**
  - The `from IPython.display import display` import statement was removed from the codebase.
- **Updated Function Behavior:**
  - Crosstabs are displayed using `print`, maintaining functionality in all runtime environments.
  - The change ensures no loss in usability or user experience.

**Root Cause and Fix**

The issue arose from the reliance on `IPython.display.display` for rendering crosstab tables in Jupyter notebooks. Since `IPython` is not a core dependency of `eda_toolkit`, environments without `IPython` experienced a `ModuleNotFoundError`.

To address this, the `display(crosstab_df)` statement was replaced with `print(crosstab_df)`, simplifying the function while maintaining compatibility with both Jupyter and terminal environments.

**Testing**

- **Jupyter Notebook:**
  - Crosstabs are displayed as plain text via `print()`, rendered neatly in notebook outputs.
- **Terminal Session:**
  - Crosstabs are printed as expected, ensuring seamless use in terminal-based workflows.
### Add Environment Detection to the `dataframe_columns` Function

This enhances the `dataframe_columns` function to dynamically adjust its output based on the runtime environment (Jupyter Notebook or terminal). It resolves issues where the function's styled output was incompatible with terminal environments.

**Changes Made**

- **Environment Detection:**
  - Added a check to determine whether the function is running in a Jupyter Notebook or a terminal:

    ```python
    is_notebook_env = "ipykernel" in sys.modules
    ```

- **Dynamic Output Behavior:**
  - Terminal environment: returns a plain DataFrame (`result_df`) when running outside of a notebook or when `return_df=True`.
  - Jupyter Notebook: retains the styled-DataFrame functionality when running in a notebook with `return_df=False`.
- **Improved Compatibility:**
  - The function now works seamlessly in both terminal and notebook environments without requiring additional dependencies.
- **Preserved Existing Features:**
  - Maintains sorting behavior via `sort_cols_alpha`.
  - Keeps the background-color styling for specific columns (`unique_values_total`, `max_unique_value`, etc.) in notebook environments.
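A minimal sketch of the detection check quoted above: under Jupyter, the `ipykernel` module is already loaded into `sys.modules`, so membership there distinguishes the two environments.

```python
# Minimal sketch of notebook-vs-terminal detection.
import sys

def is_notebook_env():
    # True when running under a Jupyter kernel, False in a plain script.
    return "ipykernel" in sys.modules

styled = is_notebook_env()  # e.g., choose styled vs plain output on this flag
```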
### Add a `tqdm` Progress Bar to the `dataframe_columns` Function

This enhances the `dataframe_columns` function by incorporating a `tqdm` progress bar to track the processing of each column. This is particularly useful when analyzing large DataFrames, providing real-time feedback on the function's progress.

**Changes Made**

- **Added a `tqdm` Progress Bar:**
  - Wrapped the column-processing loop with a `tqdm` progress bar:

    ```python
    for col in tqdm(df.columns, desc="Processing columns"):
        ...
    ```

  - The progress bar is labeled with the description `"Processing columns"` for clarity.
  - The progress bar is non-intrusive and works seamlessly in both terminal and Jupyter Notebook environments.
### `box_violin_plot`

#### Fix Plot Display for Terminal Applications and Simplify the `save_plot` Parameter

This addresses the following issues:

- **Removes `plt.close(fig)`:**
  - Ensures plots display properly in terminal-based applications and in IDEs outside Jupyter Notebooks.
  - Fixes the incompatibility with non-interactive environments by leaving figures open after rendering.
- **Simplifies the `save_plot` Parameter:**
  - Converts `save_plot` into a boolean for simplicity and better integration with the existing `show_plot` parameter.
  - Automatically saves plots based on the value of `show_plot` (`"individual"`, `"grid"`, or `"both"`) when `save_plot=True`.

These changes improve the usability and flexibility of the plotting function across different environments.

**Changes Made**

- Removed `plt.close(fig)` to allow plots to remain open in non-Jupyter environments.
- Updated the `save_plot` parameter to be a boolean, streamlining the control logic with `show_plot`.
- Adjusted the relevant sections of the code to implement these changes.
- Updated the `ValueError` check based on the new `save_plots` input:

  ```python
  # Check for valid save_plots value
  if not isinstance(save_plots, bool):
      raise ValueError("`save_plots` must be a boolean value (True or False).")
  ```
### `scatter_fit_plot`: Render Plots Before Saving

Updated the `scatter_fit_plot` function to render all plots (`plt.show()`) before saving, improving user experience and output-quality validation.

**Changes**

- Added `plt.show()` to render individual and grid plots before saving.
- Integrated `tqdm` for progress tracking while saving individual and grid plots.
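The render-then-save ordering with a single saving progress bar can be sketched as follows. This is illustrative, not the library's code: it saves to in-memory buffers and pins a headless backend so the sketch runs anywhere.

```python
# Illustrative sketch: render all figures first, then save them
# under one tqdm progress bar labeled like the release notes describe.
import io

import matplotlib
matplotlib.use("Agg")  # headless-safe for this sketch
import matplotlib.pyplot as plt
from tqdm import tqdm

figs = []
for slope in (1, 2):
    fig, ax = plt.subplots()
    ax.scatter([0, 1], [0, slope])
    figs.append(fig)

plt.show()  # render everything before the saving step

buffers = []
for fig in tqdm(figs, desc="Saving scatter plot(s)"):
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    buffers.append(buf)
```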
### Add a `tqdm` Progress Bar to `save_dataframes_to_excel`

This enhances the `save_dataframes_to_excel` function by integrating a `tqdm` progress bar for improved tracking of the DataFrame-saving process. Users can now visually monitor the progress of writing each DataFrame to its respective sheet in the Excel file.

**Changes Made**

- **Added a `tqdm` Progress Bar:**
  - Tracks the progress of saving DataFrames to individual sheets.
  - Ensures that the user sees an incremental update as each DataFrame is written.
- **Updated Functionality:**
  - Incorporated the progress bar into the loop that writes DataFrames to sheets.
  - Retained the existing formatting features (e.g., auto-fitting columns, numeric formatting, and header styles).
### Add Progress Tracking and Enhance Functionality for `summarize_all_combinations`

This enhances the `summarize_all_combinations` function by adding user-friendly progress tracking with `tqdm` and addressing usability concerns. The following changes have been implemented:

- **Progress Tracking with `tqdm`.**
- **Excel File Finalization:**
  - Addressed `UserWarning` messages related to `close()` being called on already-closed files by explicitly managing file closure.
  - Added a final confirmation message when the Excel file is successfully saved.
### Fix Plot Display Logic in `plot_2d_pdp`

This resolves an issue in the `plot_2d_pdp` function where all plots (grid and individual) were displayed unnecessarily when `save_plots="all"`. The function now adheres strictly to the `plot_type` parameter, showing only the intended plots. It also ensures unused plots are closed to prevent memory issues.

**Changes Made:**

- **Grid Plot Logic:**
  - Grid plots are only displayed if `plot_type="grid"` or `plot_type="both"`.
  - If `save_plots="all"` or `save_plots="grid"`, plots are saved without being displayed unless specified by `plot_type`.
- **Individual Plot Logic:**
  - Individual plots are only displayed if `plot_type="individual"` or `plot_type="both"`.
  - If `save_plots="all"` or `save_plots="individual"`, plots are saved but not displayed unless specified by `plot_type`.
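The display gating described above reduces to a small predicate (an illustrative sketch; the parameter values follow the notes, the helper name does not come from the library):

```python
# Illustrative predicate: show a figure kind only when plot_type asks for it.
def should_show(kind, plot_type):
    # kind is "grid" or "individual".
    return plot_type in (kind, "both")

assert should_show("grid", "both")
assert should_show("individual", "individual")
assert not should_show("individual", "grid")  # saved only, then closed
```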
## EDA Toolkit 0.0.13

### Description

This release introduces a series of updates and fixes across multiple functions in the `eda_toolkit` library to enhance error handling, improve cross-environment compatibility, streamline usability, and optimize performance. These changes address critical issues, add new features, and ensure consistent behavior in both terminal and notebook environments.
## EDA Toolkit 0.0.12

**New Features**

- Added the `data_doctor` function: a versatile tool designed to facilitate detailed feature analysis, outlier detection, and data transformation within a DataFrame.

  **Key Capabilities:**

  - **Outlier Detection:**
    - Detects and highlights outliers visually using boxplots, histograms, and other visualization options.
    - Allows cutoffs to be applied directly, offering a configurable approach for handling extreme values.
  - **Data Transformation:**
    - Supports a range of scaling transformations, including absolute, log, square-root, min-max, robust, and Box-Cox transformations, among others.
    - Configurable via the `scale_conversion` and `scale_conversion_kws` parameters to customize transformation approaches based on user needs.
  - **Visualization Options:**
    - Provides flexible visualization choices, including KDE plots, histograms, and box/violin plots.
    - Allows users to specify multiple plot types in a single call (e.g., `plot_type=["hist", "kde"]`), facilitating comprehensive visual exploration of feature distributions.
  - **Customizable Display:**
    - Adds text annotations, such as cutoff values, below plots, and lets users adjust styling parameters like `label_fontsize`, `tick_fontsize`, and `figsize`.
  - **Output Control:**
    - Offers options to save plots directly to PNG or SVG format, with file names reflecting key transformations and cutoff information for easy identification.
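A `scale_conversion` dispatch of the kind described above might look like the following hypothetical sketch; only a few of the listed conversions are shown, and none of this is the library's actual implementation.

```python
# Hypothetical sketch of a scale-conversion dispatch.
import numpy as np

def convert_scale(values, scale_conversion=None, **scale_conversion_kws):
    arr = np.asarray(values, dtype=float)
    if scale_conversion is None:
        return arr
    if scale_conversion == "abs":
        return np.abs(arr)
    if scale_conversion == "log":
        return np.log(arr)
    if scale_conversion == "sqrt":
        return np.sqrt(arr)
    if scale_conversion == "minmax":
        return (arr - arr.min()) / (arr.max() - arr.min())
    raise ValueError(f"Unknown scale_conversion: {scale_conversion}")

assert convert_scale([1, 2, 3], "minmax")[0] == 0.0
```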
## EDA Toolkit 0.0.11a2

### Data Doctor Updates

1. **`new_col_name` logic for `scale_conversion=None` with cutoffs**

   Fixed the logic for the case where `scale_conversion` is `None` but cutoffs are to be applied to a new column, allowing such calls to go through so that the new column is created.

2. **Fix for `apply_as_new_col_to_df` logic**

   Updated the logic for generating the new column name when `apply_as_new_col_to_df=True`. This ensures that the column name is correctly assigned based on the applied transformation or cutoff.

   Original code:

   ```python
   # New column name options when apply_as_new_col_to_df == True
   if apply_as_new_col_to_df == True and scale_conversion == None and apply_cutoff == True:
       new_col_name = feature_name + "_" + 'w_cutoff'
   elif apply_as_new_col_to_df == True and scale_conversion != None:
       new_col_name = feature_name + "_" + scale_conversion
   ```

   **Updated version:**

   ```python
   # Default new column name in case no conditions are met
   new_col_name = feature_name

   # New column name options when apply_as_new_col_to_df == True
   if apply_as_new_col_to_df:
       if scale_conversion is None and apply_cutoff:
           new_col_name = feature_name + "_w_cutoff"
       elif scale_conversion is not None:
           new_col_name = feature_name + "_" + scale_conversion
   ```

3. **Custom `ValueError` for missing conditions**

   Added a custom `ValueError` to handle cases where the user sets `apply_as_new_col_to_df=True` but does not specify either a `scale_conversion` or enable `apply_cutoff`. This provides clearer feedback to users and avoids unexpected behavior.

4. **New error-handling block:**

   ```python
   if apply_as_new_col_to_df:
       if scale_conversion is None and not apply_cutoff:
           raise ValueError(
               "When applying a new column with `apply_as_new_col_to_df=True`, "
               "you must specify either a `scale_conversion` or set `apply_cutoff=True`."
           )
   ```

**Overall Changes**

- Corrected the logic for generating new column names when transformations or cutoffs are applied.
- Added a custom `ValueError` when `apply_as_new_col_to_df=True` but neither a valid `scale_conversion` nor `apply_cutoff=True` is specified.
- Updated the docstring to reflect the new logic and error handling.
## EDA Toolkit 0.0.11a1

### Plotting Changes

Added `histplot()` to the plot grid:

```python
# Histplot
sns.histplot(
    x=feature_,
    ax=axes[1],
    **(hist_kws or {}),
)
axes[1].set_title(f"Histplot: {feature_name} (Scale: {scale_conversion})")
axes[1].set_xlabel(f"{feature_name}")  # Add x-axis label here
```

**Additional changes to plotting:**

Added flexibility for keyword arguments (`kde_kws`, `hist_kws`, and `box_kws`) in the `data_doctor` function to allow users to customize Seaborn plots directly. This enhancement enables users to pass additional parameters for KDE, histogram, and boxplot customization, making the function more adaptable to specific plotting requirements.

Changes:

- Added an x-axis label to `histplot()`.
- Introduced the following dictionary-based keyword-argument inputs:
  - `kde_kws`: allows customization of the KDE plot (e.g., color, fill, etc.).
  - `hist_kws`: allows customization of the histogram plot (e.g., stat, color, etc.).
  - `box_kws`: allows customization of the boxplot (e.g., palette, color, etc.).
- Updated docstrings to reflect these changes and improved the description of the plotting logic.

This should give users more control over the visual output without altering the core functionality of the `data_doctor` function.
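The `**(kws or {})` idiom used in the snippet above generalizes beyond seaborn; this small sketch (with a stand-in function, not a real plotting call) shows how a `None` keyword dictionary expands to no extra arguments while a populated one overrides the defaults:

```python
# Stand-in for a seaborn call such as sns.histplot.
def plot_stub(x, color="blue", stat="count"):
    return {"x": x, "color": color, "stat": stat}

hist_kws = None
assert plot_stub([1, 2], **(hist_kws or {}))["color"] == "blue"  # defaults apply

hist_kws = {"color": "gray", "stat": "density"}
assert plot_stub([1, 2], **hist_kws)["stat"] == "density"        # user overrides
```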
## EDA Toolkit 0.0.11a

**Release Date:** October 2024

We are excited to announce the `0.0.11a` release of `eda_toolkit`, which brings an important new feature: the `data_doctor` function. This release focuses on providing enhanced data-quality checks and improving your exploratory data analysis workflow.

### 🚀 New Feature: `data_doctor` Function

The `data_doctor` function has been added to assist with automated data health checks. It performs a series of diagnostics on your dataset to identify potential issues such as:

- **Missing Data:** Scans for null values across columns and provides a summary.
- **Data Types:** Verifies the consistency of data types across each column.
- **Outliers:** Detects and highlights statistical outliers based on customizable thresholds (e.g., the IQR method).
- **Duplicated Entries:** Identifies duplicate rows in the dataset.
- **Inconsistent Values:** Flags anomalies or inconsistent data entries (such as mixed types in categorical variables).
- **Unique Values:** Reports unique-value counts for each feature, helping spot features with low variance.

This function helps users clean their data efficiently by pointing out key issues that may need attention before proceeding with analysis or model training.

Example usage:

```python
from eda_toolkit import data_doctor

# Run diagnostics on a DataFrame
report = data_doctor(df, outlier_method='iqr', display_full_report=True)
```

### `flex_corr_matrix` Fix

Fix: Set the default input `title` in `flex_corr_matrix()` to `None`, since it was previously set to `"Cervical Cancer Data: Correlation Matrix"`.
## EDA Toolkit 0.0.11

### Fix `TypeError` in `stacked_crosstab_plot` for `save_formats`

**Description:**

Fixes a `TypeError` in the `stacked_crosstab_plot` function when `save_formats` is `None`. The update ensures that `save_formats` defaults to an empty list, preventing iteration over a `NoneType` object.

**Changes:**

- Initializes `save_formats` as an empty list if not provided.
- Adds handling for string and tuple input types for `save_formats`.

**Issue Fixed:**

Resolves the `TypeError` raised when `save_formats` is `None`.
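The normalization described above can be sketched as a small helper (the name is illustrative, not the library's): default to an empty list, and accept a single string or a tuple.

```python
# Illustrative normalization for save_formats: never iterate over None.
def normalize_save_formats(save_formats=None):
    if save_formats is None:
        return []
    if isinstance(save_formats, str):
        return [save_formats]
    return list(save_formats)

assert normalize_save_formats() == []
assert normalize_save_formats("png") == ["png"]
assert normalize_save_formats(("png", "svg")) == ["png", "svg"]
```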
## EDA Toolkit 0.0.10

**Legend Handling:**

- The legend is now displayed only if there are valid legend handles (`len(handles) > 0`) and `show_legend` is set to `True`.
- The `ax.get_legend().remove()` check ensures that unnecessary legends are removed when they are empty or when `show_legend` is set to `False`.

**Error Handling:**

- Error handling in the `except` block has been enhanced to ensure that any exceptions related to legends or labels are managed properly. The legend-handling logic still respects the `show_legend` flag even when exceptions occur.

This update prevents empty legend squares from appearing and maintains the intended default behavior of showing legends only when they contain relevant content.
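A sketch of this gating logic (an illustrative helper, not the library's code; a headless backend is pinned so the example runs anywhere):

```python
# Illustrative sketch: show a legend only when there are real handles,
# and strip any empty/unwanted legend otherwise.
import matplotlib
matplotlib.use("Agg")  # headless-safe
import matplotlib.pyplot as plt

def finalize_legend(ax, show_legend=True):
    handles, labels = ax.get_legend_handles_labels()
    if show_legend and len(handles) > 0:
        ax.legend(handles, labels)
    elif ax.get_legend() is not None:
        ax.get_legend().remove()  # drop empty or unwanted legends

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label="series")
finalize_legend(ax, show_legend=True)   # labeled artist -> legend shown

fig2, ax2 = plt.subplots()
ax2.plot([0, 1], [0, 1])                # no label -> no handles
finalize_legend(ax2, show_legend=True)  # no empty legend square
```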