
Releases: lshpaner/eda_toolkit

EDA Toolkit 0.0.15

28 Dec 20:52

Scatter Plot Function Updates

Avoid In-Place Modification of exclude_combinations

This addresses an issue where the scatter_fit_plot function modified the exclude_combinations parameter in place, causing errors when the same list was reused in subsequent calls.

Changes Made

  • Created a local copy of exclude_combinations for normalization instead of modifying the input directly (a minimal reuse sketch follows the snippet below):
    exclude_combinations_normalized = {tuple(sorted(pair)) for pair in exclude_combinations}
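
A minimal sketch of the pattern (not the library's exact code), showing that the caller's list is left untouched and can be reused; the column names are hypothetical.

```python
# Minimal sketch of the fix: normalize into a new set instead of mutating
# the caller's list. Column names here are hypothetical.
def normalize_exclusions(exclude_combinations):
    # Order-insensitive lookup: ("a", "b") and ("b", "a") collapse to one key.
    return {tuple(sorted(pair)) for pair in exclude_combinations}

exclude_combinations = [("age", "income"), ("height", "weight")]
normalized = normalize_exclusions(exclude_combinations)

# The input list is unchanged, so it can be passed to a second call safely.
assert exclude_combinations == [("age", "income"), ("height", "weight")]
```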

Improve Progress Tracking and Resolve Last Plot Saving Issue

  1. Separate Progress Bars for Grid Plot:

    • Added a tqdm progress bar to track the rendering of subplots in the grid.
    • Introduced a second tqdm progress bar to handle the saving step of the entire grid plot.
  2. Fix for Last Plot Saving with save_plots="all":

    • Ensured individual plots and the grid plot are saved independently without overlap or interference.
    • Addressed an issue where the last individual plot was incorrectly saved or overwritten.
  3. Accurate Updates and Feedback:

    • Progress bars now provide clear updates for rendering and saving stages, avoiding any hanging or delays.

Updated tqdm saving logic in scatter_fit_plot

  • Refactored tqdm progress bar in scatter_fit_plot to track the overall plot-saving process, covering both individual and grid plots.
  • Updated tqdm progress bar description in scatter_fit_plot to use universal phrasing: "Saving scatter plot(s)."
  • Ensured consistent progress tracking for both single-plot and multi-plot saving scenarios; a minimal sketch of this pattern follows.
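
A minimal sketch of this saving pattern, assuming a list of rendered figures; it is not the library's exact code and the filenames are hypothetical.

```python
import matplotlib.pyplot as plt
from tqdm import tqdm

# Stand-in figures; in scatter_fit_plot these would be the rendered plots.
figures = [plt.figure() for _ in range(3)]   # individual plots
grid_fig = plt.figure()                      # combined grid plot

save_jobs = [(fig, f"scatter_{i}.png") for i, fig in enumerate(figures)]
save_jobs.append((grid_fig, "scatter_grid.png"))

# One bar covers both individual and grid saves, so the description
# "Saving scatter plot(s)" reads correctly for one plot or many.
with tqdm(total=len(save_jobs), desc="Saving scatter plot(s)") as pbar:
    for fig, filename in save_jobs:
        fig.savefig(filename)
        pbar.update(1)
```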

EDA Toolkit 0.0.14

27 Dec 19:46

Ensure Crosstabs Dictionary is Populated with return_dict=True

This resolves an issue where the stacked_crosstab_plot function failed to populate and return the crosstabs dictionary (crosstabs_dict) when return_dict=True and output="plots_only". The fix ensures that crosstabs are always generated when return_dict=True, regardless of the output parameter; a minimal sketch of the resulting control flow appears after the list below.

  • Always Generate Crosstabs with return_dict=True:

    • Added logic to ensure crosstabs are created and populated in crosstabs_dict whenever return_dict=True, even if the output parameter is set to "plots_only".

  • Separation of Crosstabs Display from Generation:

    • The generation of crosstabs is now independent of the output parameter.
    • Crosstabs display (print) occurs only when output includes "both" or "crosstabs_only".
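
The sketch below illustrates the decoupling described above with a hypothetical, heavily simplified signature; it is not the library's implementation.

```python
import pandas as pd

def build_crosstabs(df, col, func_col_list, output="both", return_dict=False):
    # Generate crosstabs whenever they are needed for the return value,
    # independent of whether they will be printed.
    crosstabs_dict = {}
    if return_dict or output in ("both", "crosstabs_only"):
        for func_col in func_col_list:
            crosstabs_dict[func_col] = pd.crosstab(df[col], df[func_col])

    # Display only for the output modes that ask for crosstabs.
    if output in ("both", "crosstabs_only"):
        for name, crosstab_df in crosstabs_dict.items():
            print(f"\n{name}\n{crosstab_df}")

    return crosstabs_dict if return_dict else None
```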

Enhancements and Fixes for scatter_fit_plot Function

This addresses critical issues and introduces key enhancements for the scatter_fit_plot function. These changes aim to improve the function's usability, flexibility, and robustness.


Enhancements and Fixes

1. Added exclude_combinations Parameter

  • Feature: Users can now exclude specific variable pairs from being plotted by providing a list of tuples with the combinations to omit (see the usage sketch after this list).

2. Added combinations Parameter to show_plot

  • Feature: Users can now display only the list of variable combinations considered during the selection process when all_vars=True.

3. Fixed Bug with Single Variable Pair Plotting

  • Bug: When plotting a single variable pair with show_plot="both", the function threw an AttributeError.
  • Fix: Single-variable pairs are now properly handled.

4. Updated Default for show_plot Parameter

  • Enhancement: Changed the default value of show_plot to "both" to prevent excessive individual plots when handling large variable sets.

5. Fixed Unused legend, xlim, and ylim Inputs

  • Bug: The legend, xlim, and ylim inputs were accepted but not applied to the plots.
  • Fix: These inputs are now used when rendering plots.
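
A hedged usage sketch of the new parameters with hypothetical data; argument names not mentioned in these notes are assumptions about the signature.

```python
import pandas as pd
from eda_toolkit import scatter_fit_plot

df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40_000, 52_000, 88_000, 91_000],
    "tenure": [1, 3, 10, 12],
})

scatter_fit_plot(
    df=df,
    all_vars=True,                              # consider every variable pair
    exclude_combinations=[("age", "tenure")],   # pairs to omit from plotting
    show_plot="combinations",                   # list the surviving pairs only
)
```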


Fix Default Title and Filename Handling in flex_corr_matrix

This resolves issues in the flex_corr_matrix function where:

  1. No default title was provided when title=None, resulting in missing titles on plots.
  2. Saved plot filenames were incorrect, leading to issues like .png.png when title was not provided.

The fix ensures that a default title ("Correlation Matrix") is used for both plot display and file saving when no title is explicitly provided. If title is explicitly set to None, the plot will have no title, but the saved filename will still use "correlation_matrix".

1. Default Filename and Title Logic:

  • If no title is provided, "Correlation Matrix" is used as the default for filenames and displayed titles.
  • If title=None is explicitly passed, no title is displayed on the plot.

2. File Saving Improvements:

  • File names are generated based on the title or default to "correlation_matrix" if title is not provided.
  • Spaces in the title are replaced with underscores, and special characters such as ":" are removed to ensure valid filenames (a minimal sanitization sketch follows).
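
A minimal sketch of the title-to-filename logic described above; the helper name is hypothetical and lowercasing is an assumption, not necessarily the library's behavior.

```python
import re

def title_to_filename(title):
    # Hypothetical helper illustrating the described sanitization.
    if title is None:
        return "correlation_matrix"
    cleaned = re.sub(r"[^\w\s-]", "", title)       # drop characters like ":"
    cleaned = cleaned.strip().replace(" ", "_")    # spaces -> underscores
    return cleaned.lower() or "correlation_matrix"

print(title_to_filename(None))                       # correlation_matrix
print(title_to_filename("Correlation Matrix"))       # correlation_matrix
print(title_to_filename("Sales Data: Correlation"))  # sales_data_correlation
```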

EDA Toolkit 0.0.13a

25 Dec 00:43

Description

This release introduces a series of updates and fixes across multiple functions in the library to enhance error handling, improve cross-environment compatibility, streamline usability, and optimize performance. These changes address critical issues, add new features, and ensure consistent behavior in both terminal and notebook environments.

Add ValueError for Insufficient Pool Size in add_ids and Enhance ID Deduplication

This update enhances the add_ids function by adding explicit error handling and improving the uniqueness guarantee for
generated IDs. The following changes have been implemented:

Key Changes

  • New ValueError for Insufficient Pool Size:

    • Calculates the pool size ($9 \times 10^{d-1}$ for d-digit IDs) and compares it with the number of rows in the DataFrame (a minimal sketch follows this list).
    • Behavior:

      • Throws a ValueError if n_rows > pool_size.
      • Prints a warning if n_rows approaches 90% of the pool size, suggesting an increase in digit length.
  • Improved ID Deduplication:

    • Introduced a set (unique_ids) to track generated IDs.
    • IDs are checked against this set to ensure uniqueness before being added to the DataFrame.
    • Prevents collisions by regenerating IDs only for duplicates, minimizing retries and improving performance.
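
A minimal sketch of the pool-size check and deduplication loop, assuming d-digit IDs with a non-zero leading digit; not the library's exact implementation.

```python
import random

def generate_unique_ids(n_rows, d=9, seed=None):
    # Pool of d-digit IDs with a non-zero leading digit: 9 * 10**(d - 1).
    pool_size = 9 * 10 ** (d - 1)
    if n_rows > pool_size:
        raise ValueError(
            f"Cannot generate {n_rows} unique {d}-digit IDs; pool size is {pool_size}."
        )
    if n_rows > 0.9 * pool_size:
        print("Warning: request uses over 90% of the ID pool; consider more digits.")

    rng = random.Random(seed)
    unique_ids = set()
    while len(unique_ids) < n_rows:
        # Collisions are simply dropped by the set, so only duplicates retry.
        unique_ids.add(rng.randint(10 ** (d - 1), 10 ** d - 1))
    return list(unique_ids)
```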

Benefits

  • Ensures robust error handling, avoiding silent failures or excessive retries caused by small digit lengths.
  • Guarantees unique IDs even for large DataFrames, improving reliability and scalability.

Enhance strip_trailing_period to Support Strings and Mixed Data Types

This enhances the strip_trailing_period function to handle trailing periods in both numeric and string values. The updated implementation ensures robustness for columns with mixed data types and gracefully handles special cases like NaN; a minimal sketch follows the list below.

Key Enhancements

  • Support for Strings with Trailing Periods:
    • Removes trailing periods from string values, such as "123." or "test.".
  • Mixed Data Types:
    • Handles columns containing both numeric and string values seamlessly.
  • Graceful Handling of NaN:
    • Skips processing for NaN values, leaving them unchanged.
  • Robust Type Conversion:
    • Converts numeric strings (e.g., "123.") back to float where applicable.
    • Retains strings if conversion to float is not possible.
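
A minimal per-value sketch of the behavior described above (hypothetical helper, not the library's implementation):

```python
import pandas as pd

def strip_period(value):
    if pd.isna(value):
        return value                      # leave NaN untouched
    text = str(value)
    if text.endswith("."):
        text = text[:-1]                  # drop the trailing period
    try:
        return float(text)                # "123." -> 123.0
    except ValueError:
        return text                       # keep non-numeric strings as strings

print(strip_period("123."))        # 123.0
print(strip_period("test."))       # test
print(strip_period(float("nan")))  # nan
```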

Changes in stacked_crosstab_plot

Remove IPython Dependency by Replacing display with print

This resolves an issue where the eda_toolkit library required IPython as a dependency due to the use of display(crosstab_df) in the stacked_crosstab_plot function. The dependency caused import failures in environments without IPython, especially in non-Jupyter terminal-based workflows.

Changes Made

  1. Replaced display with print:

    • The line display(crosstab_df) was replaced with print(crosstab_df) to eliminate the need for IPython. This ensures compatibility across terminal and Jupyter environments without requiring additional dependencies.
  2. Removed IPython Import:

    • The from IPython.display import display statement was removed from the codebase.
Updated Function Behavior:

  • Crosstabs are displayed using print, maintaining functionality in all runtime environments.
  • The change ensures no loss in usability or user experience.

Root Cause and Fix

The issue arose from the reliance on IPython.display.display for rendering crosstab tables in Jupyter notebooks. Since IPython is not a core dependency of eda_toolkit, environments without IPython experienced a ModuleNotFoundError.

To address this, the display(crosstab_df) statement was replaced with print(crosstab_df), simplifying the function while maintaining compatibility with both Jupyter and terminal environments.

Testing

  • Jupyter Notebook:

    • Crosstabs are displayed as plain text via print(), rendered neatly in notebook outputs.
  • Terminal Session:

    • Crosstabs are printed as expected, ensuring seamless use in terminal-based workflows.

Add Environment Detection to dataframe_columns Function

This enhances the dataframe_columns function to dynamically adjust its output based on the runtime environment (Jupyter Notebook or terminal). It resolves issues where the function's styled output was incompatible with terminal environments.

Changes Made

  1. Environment Detection:

    • Added a check to determine if the function is running in a Jupyter Notebook or terminal (a fuller branching sketch follows this list):
    is_notebook_env = "ipykernel" in sys.modules
  2. Dynamic Output Behavior:

    • Terminal Environment:
      • Returns a plain DataFrame (result_df) when running outside of a notebook or when return_df=True.
    • Jupyter Notebook:
      • Retains the styled DataFrame functionality when running in a notebook and return_df=False.
  3. Improved Compatibility:

    • The function now works seamlessly in both terminal and notebook environments without requiring additional dependencies.
  4. Preserved Existing Features:

    • Maintains sorting behavior via sort_cols_alpha.
    • Keeps the background color styling for specific columns (unique_values_total, max_unique_value, etc.) in notebook environments.
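
A hedged sketch of the detection-and-branch pattern described above, with the rest of the function body heavily simplified and a stand-in name for the summary logic.

```python
import sys

import pandas as pd

def summarize_columns(df, return_df=False):
    # Simplified stand-in for dataframe_columns' per-column summary.
    result_df = pd.DataFrame(
        {"column": df.columns, "dtype": df.dtypes.astype(str).values}
    )

    # Same detection used in the release: ipykernel is loaded inside notebooks.
    is_notebook_env = "ipykernel" in sys.modules
    if not is_notebook_env or return_df:
        return result_df                  # plain DataFrame for terminals
    # In a notebook, return a styled DataFrame (background color on a column).
    return result_df.style.set_properties(
        subset=["dtype"], **{"background-color": "#FFFFCC"}
    )
```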

Add tqdm Progress Bar to dataframe_columns Function

This enhances the dataframe_columns function by incorporating a tqdm progress bar to track the processing of each column. This is particularly useful for analyzing large DataFrames, providing real-time feedback on the function's progress.

Changes Made

  1. Added tqdm Progress Bar:

    • Wrapped the column processing loop with a tqdm progress bar:

      for col in tqdm(df.columns, desc="Processing columns"):
          ...
  2. The progress bar is labeled with the description "Processing columns" for clarity.
  3. The progress bar is non-intrusive and works seamlessly in both terminal and Jupyter Notebook environments.

box_violin_plot: Fix Plot Display for Terminal Applications and Simplify save_plot Functionality

This addresses the following issues:

  1. Removes plt.close(fig)
    • Ensures plots display properly in terminal-based applications and IDEs outside Jupyter Notebooks.
    • Fixes the incompatibility with non-interactive environments by leaving figures open after rendering.
  2. Simplifies save_plot Parameter
    • Converts save_plot into a boolean for simplicity and better integration with the existing show_plot parameter.
    • Automatically saves plots based on the value of show_plot ("individual," "grid," or "both") when save_plot=True.

These changes improve the usability and flexibility of the plotting function across different environments.

Changes Made

  • Removed plt.close(fig) to allow plots to remain open in non-Jupyter environments.

  • Updated the save_plot parameter to be a boolean, streamlining the control logic with show_plot.

  • Adjusted the relevant sections of the code to implement these changes.

  • Updated ValueError check based on the new save_plots input:

    # Check for valid save_plots value
    if not isinstance(save_plots, bool):
        raise ValueError("`save_plots` must be a boolean value (True or False).")

scatter_fit_plot: Render Plots Before Saving

  • Update the scatter_fit_plot function to render all plots (plt.show()) before saving, improving user experience and output quality validation.

Changes

  • Added plt.show() to render individual and grid plots before saving.
  • Integrated tqdm for progress tracking while saving individual and grid plots.

Add tqdm Progress Bar to save_dataframes_to_excel

This enhances the save_dataframes_to_excel function by integrating a tqdm progress bar for improved tracking of the DataFrame saving process. Users can now visually monitor the progress of writing each DataFrame to its respective sheet in the Excel file.

Changes Made

  • Added a tqdm Progress Bar:

    • Tracks the progress of saving DataFrames to individual sheets.
    • Ensures that the user sees an incremental update as each DataFrame is written.
  • Updated Functionality:

    • Incorporated the progress bar into the loop that writes DataFrames to sheets (a minimal writing-loop sketch follows this list).
    • Retained the existing formatting features (e.g., auto-fitting columns, numeric formatting, and header styles).
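
A minimal sketch of the writing loop with a tqdm bar; file and sheet names are hypothetical, and the library's formatting features (column widths, header styles) are omitted.

```python
import pandas as pd
from tqdm import tqdm

frames = {
    "summary": pd.DataFrame({"metric": ["rows", "cols"], "value": [100, 12]}),
    "detail": pd.DataFrame({"a": [1, 2, 3]}),
}

with pd.ExcelWriter("report.xlsx") as writer:
    # One progress-bar tick per DataFrame written to its sheet.
    for sheet_name, frame in tqdm(frames.items(), desc="Saving DataFrames"):
        frame.to_excel(writer, sheet_name=sheet_name, index=False)
```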

Add Progress Tracking and Enhance Functionality for summarize_all_combinations

This enhances the summarize_all_combinations function by adding user-friendly progress tracking using tqdm and addressing usability concerns. The following changes have been implemented:

  1. Progress Tracking with tqdm
  2. Excel File Finalization:
  • Addressed UserWarning messages related to close() being called on already closed files by explicitly managing file closure.
  • Added a final confirmation message when the Excel file is successfully saved.

Fix Plot Display Logic in plot_2d_pdp

This resolves an issue in the plot_2d_pdp function where all plots (grid and individual) were being displayed unnecessarily when save_plots="all". The function now adheres strictly to the plot_type parameter, showing only the intended plots. It also ensures unused plots are closed to prevent memory issues.

Changes Made:

  1. Grid Plot Logic:

    • Grid plots are only displayed if plot_type="grid" or plot_type="both".
    • If save_plots="all" or save_plots="grid", plots are saved without being displayed unless specified by plot_type.
  2. Individual Plot Logic:

    • Individual plots are only displayed if plot_type="individual" or plot_type="both".
    • If save_plots="all" or save_plots="individual", plots are saved but not displayed unless specified by `plo...

EDA Toolkit 0.0.13

25 Dec 00:50

Description

This release introduces a series of updates and fixes across multiple functions in the library to enhance error handling, improve cross-environment compatibility, streamline usability, and optimize performance. These changes address critical issues, add new features, and ensure consistent behavior in both terminal and notebook environments.

Add ValueError for Insufficient Pool Size in add_ids and Enhance ID Deduplication

This update enhances the add_ids function by adding explicit error handling and improving the uniqueness guarantee for
generated IDs. The following changes have been implemented:

Key Changes

  • New ValueError for Insufficient Pool Size:

    • Calculates the pool size ($9 \times 10^{d-1}$ for d-digit IDs) and compares it with the number of rows in the DataFrame.
    • Behavior:

      • Throws a ValueError if n_rows > pool_size.
      • Prints a warning if n_rows approaches 90% of the pool size, suggesting an increase in digit length.
  • Improved ID Deduplication:

    • Introduced a set (unique_ids) to track generated IDs.
    • IDs are checked against this set to ensure uniqueness before being added to the DataFrame.
    • Prevents collisions by regenerating IDs only for duplicates, minimizing retries and improving performance.

Benefits

  • Ensures robust error handling, avoiding silent failures or excessive retries caused by small digit lengths.
  • Guarantees unique IDs even for large DataFrames, improving reliability and scalability.

Enhance strip_trailing_period to Support Strings and Mixed Data Types

This enhances the strip_trailing_period function to handle trailing periods in both numeric and string values. The updated implementation ensures robustness for columns with mixed data types and gracefully handles special cases like NaN.

Key Enhancements

  • Support for Strings with Trailing Periods:
    • Removes trailing periods from string values, such as "123." or "test.".
  • Mixed Data Types:
    • Handles columns containing both numeric and string values seamlessly.
  • Graceful Handling of NaN:
    • Skips processing for NaN values, leaving them unchanged.
  • Robust Type Conversion:
    • Converts numeric strings (e.g., "123.") back to float where applicable.
    • Retains strings if conversion to float is not possible.

Changes in stacked_crosstab_plot

Remove IPython Dependency by Replacing display with print

This resolves an issue where the eda_toolkit library required IPython as a dependency due to the use of display(crosstab_df) in the stacked_crosstab_plot function. The dependency caused import failures in environments without IPython, especially in non-Jupyter terminal-based workflows.

Changes Made

  1. Replaced display with print:

    • The line display(crosstab_df) was replaced with print(crosstab_df) to eliminate the need for IPython. This ensures compatibility across terminal and Jupyter environments without requiring additional dependencies.
  2. Removed IPython Import:

    • The from IPython.display import display statement was removed from the codebase.

Updated Function Behavior:

  • Crosstabs are displayed using print, maintaining functionality in all runtime environments.
  • The change ensures no loss in usability or user experience.

Root Cause and Fix

The issue arose from the reliance on IPython.display.display for rendering crosstab tables in Jupyter notebooks. Since IPython is not a core dependency of eda_toolkit, environments without IPython experienced a ModuleNotFoundError.

To address this, the display(crosstab_df) statement was replaced with print(crosstab_df), simplifying the function while maintaining compatibility with both Jupyter and terminal environments.

Testing

  • Jupyter Notebook:

    • Crosstabs are displayed as plain text via print(), rendered neatly in notebook outputs.
  • Terminal Session:

    • Crosstabs are printed as expected, ensuring seamless use in terminal-based workflows.

Add Environment Detection to dataframe_columns Function

This enhances the dataframe_columns function to dynamically adjust its output based on the runtime environment (Jupyter Notebook or terminal). It resolves issues where the function's styled output was incompatible with terminal environments.

Changes Made

  1. Environment Detection:

    • Added a check to determine if the function is running in a Jupyter Notebook or terminal:
    is_notebook_env = "ipykernel" in sys.modules
  2. Dynamic Output Behavior:

    • Terminal Environment:
      • Returns a plain DataFrame (result_df) when running outside of a notebook or when return_df=True.
    • Jupyter Notebook:
      • Retains the styled DataFrame functionality when running in a notebook and return_df=False.
  3. Improved Compatibility:

    • The function now works seamlessly in both terminal and notebook environments without requiring additional dependencies.
  4. Preserved Existing Features:

    • Maintains sorting behavior via sort_cols_alpha.
    • Keeps the background color styling for specific columns (unique_values_total, max_unique_value, etc.) in notebook environments.

Add tqdm Progress Bar to dataframe_columns Function

This enhances the dataframe_columns function by incorporating a tqdm progress bar to track the processing of each column. This is particularly useful for analyzing large DataFrames, providing real-time feedback on the function's progress.

Changes Made

  1. Added tqdm Progress Bar:

    • Wrapped the column processing loop with a tqdm progress bar:

      for col in tqdm(df.columns, desc="Processing columns"):
          ...
  2. The progress bar is labeled with the description "Processing columns" for clarity.
  3. The progress bar is non-intrusive and works seamlessly in both terminal and Jupyter Notebook environments.

box_violin_plot: Fix Plot Display for Terminal Applications and Simplify save_plot Functionality

This addresses the following issues:

  1. Removes plt.close(fig)
    • Ensures plots display properly in terminal-based applications and IDEs outside Jupyter Notebooks.
    • Fixes the incompatibility with non-interactive environments by leaving figures open after rendering.
  2. Simplifies save_plot Parameter
    • Converts save_plot into a boolean for simplicity and better integration with the existing show_plot parameter.
    • Automatically saves plots based on the value of show_plot ("individual," "grid," or "both") when save_plot=True.

These changes improve the usability and flexibility of the plotting function across different environments.

Changes Made

  • Removed plt.close(fig) to allow plots to remain open in non-Jupyter environments.

  • Updated the save_plot parameter to be a boolean, streamlining the control logic with show_plot.

  • Adjusted the relevant sections of the code to implement these changes.

  • Updated ValueError check based on the new save_plots input:

    # Check for valid save_plots value
    if not isinstance(save_plots, bool):
        raise ValueError("`save_plots` must be a boolean value (True or False).")

scatter_fit_plot: Render Plots Before Saving

  • Update the scatter_fit_plot function to render all plots (plt.show()) before saving, improving user experience and output quality validation.

Changes

  • Added plt.show() to render individual and grid plots before saving.
  • Integrated tqdm for progress tracking while saving individual and grid plots.

Add tqdm Progress Bar to save_dataframes_to_excel

This enhances the save_dataframes_to_excel function by integrating a tqdm progress bar for improved tracking of the DataFrame saving process. Users can now visually monitor the progress of writing each DataFrame to its respective sheet in the Excel file.

Changes Made

  • Added a tqdm Progress Bar:

    • Tracks the progress of saving DataFrames to individual sheets.
    • Ensures that the user sees an incremental update as each DataFrame is written.
  • Updated Functionality:

    • Incorporated the progress bar into the loop that writes DataFrames to sheets.
    • Retained the existing formatting features (e.g., auto-fitting columns, numeric formatting, and header styles).

Add Progress Tracking and Enhance Functionality for summarize_all_combinations

This enhances the summarize_all_combinations function by adding user-friendly progress tracking using tqdm and addressing usability concerns. The following changes have been implemented:

  1. Progress Tracking with tqdm
  2. Excel File Finalization:
  • Addressed UserWarning messages related to close() being called on already closed files by explicitly managing file closure.
  • Added a final confirmation message when the Excel file is successfully saved.

Fix Plot Display Logic in plot_2d_pdp

This resolves an issue in the plot_2d_pdp function where all plots (grid and individual) were being displayed unnecessarily when save_plots="all". The function now adheres strictly to the plot_type parameter, showing only the intended plots. It also ensures unused plots are closed to prevent memory issues.

Changes Made:

  1. Grid Plot Logic:

    • Grid plots are only displayed if plot_type="grid" or plot_type="both".
    • If save_plots="all" or save_plots="grid", plots are saved without being displayed unless specified by plot_type.
  2. Individual Plot Logic:

    • Individual plots are only displayed if plot_type="individual" or plot_type="both".
    • If save_plots="all" or save_plots="individual", plots are saved but not displayed unless specified by `plo...

EDA Toolkit 0.0.12

30 Oct 19:12

New Features

  • Added data_doctor function:

    A versatile tool designed to facilitate detailed feature analysis, outlier detection, and data transformation within a DataFrame; a hedged usage sketch appears after the capability list below.

    Key Capabilities:

    • Outlier Detection:

      • Detects and highlights outliers visually using boxplots, histograms, and other visualization options.
      • Allows cutoffs to be applied directly, offering a configurable approach for handling extreme values.
    • Data Transformation:

      • Supports a range of scaling transformations, including absolute, log, square root, min-max, robust, and Box-Cox transformations, among others.
      • Configurable via scale_conversion and scale_conversion_kws parameters to customize transformation approaches based on user needs.
    • Visualization Options:

      • Provides flexible visualization choices, including KDE plots, histograms, and box/violin plots.
      • Allows users to specify multiple plot types in a single call (e.g., plot_type=["hist", "kde"]), facilitating comprehensive visual exploration of feature distributions.
    • Customizable Display:

      • Adds text annotations, such as cutoff values, below plots, and enables users to adjust various styling parameters like label_fontsize, tick_fontsize, and figsize.
    • Output Control:

      • Offers options to save plots directly to PNG or SVG formats, with file names reflecting key transformations and cutoff information for easy identification.
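
A hedged usage sketch with hypothetical data; arguments not named in these notes are assumptions about the exact signature.

```python
import pandas as pd
from eda_toolkit import data_doctor

df = pd.DataFrame({"income": [40_000, 52_000, 88_000, 91_000, 1_250_000]})

data_doctor(
    df=df,
    feature_name="income",
    scale_conversion="log",         # one of the supported transformations
    apply_cutoff=True,              # clip extreme values at the chosen cutoffs
    plot_type=["hist", "kde"],      # multiple plot types in a single call
    label_fontsize=12,
    tick_fontsize=10,
)
```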

EDA Toolkit 0.0.11a2

21 Oct 22:51

Data Doctor Updates

1. Fixed the new_col_name logic for the case where scale_conversion is None but cutoffs are applied to a new column, so that the new column is still created in this situation.

2. Fix for apply_as_new_col_to_df logic

Updated the logic for generating the new column name when apply_as_new_col_to_df=True. This ensures that the column name is correctly assigned based on the applied transformation or cutoff.

Original code:

# New column name options when apply_as_new_col_to_df == True
if apply_as_new_col_to_df == True and scale_conversion == None and apply_cutoff == True:
    new_col_name = feature_name + "_" + 'w_cutoff'
elif apply_as_new_col_to_df == True and scale_conversion != None:
    new_col_name = feature_name + "_" + scale_conversion
    
**Updated version**:

```python
# Default new column name in case no conditions are met
new_col_name = feature_name

# New column name options when apply_as_new_col_to_df == True
if apply_as_new_col_to_df:
    if scale_conversion is None and apply_cutoff:
        new_col_name = feature_name + "_w_cutoff"
    elif scale_conversion is not None:
        new_col_name = feature_name + "_" + scale_conversion
```

3. Custom ValueError for missing conditions

Added a custom ValueError to handle cases where the user sets apply_as_new_col_to_df=True but does not specify either a scale_conversion or enable apply_cutoff. This provides clearer feedback to users and avoids unexpected behavior.

4. New error-handling block:

if apply_as_new_col_to_df:
    if scale_conversion is None and not apply_cutoff:
        raise ValueError(
            "When applying a new column with `apply_as_new_col_to_df=True`, "
            "you must specify either a `scale_conversion` or set `apply_cutoff=True`."
        )

Overall Changes

  • Corrected the logic for generating new column names when transformations or cutoffs are applied.
  • Added a custom ValueError when apply_as_new_col_to_df=True but neither a valid scale_conversion nor apply_cutoff=True is specified.
  • Updated the docstring to reflect the new logic and error handling.

EDA Toolkit 0.0.11a1

20 Oct 19:16

Plotting Changes

Added histplot() to the plot grid

  # Histplot
  sns.histplot(
      x=feature_,
      ax=axes[1],
      **(hist_kws or {}),
  )
  axes[1].set_title(f"Histplot: {feature_name} (Scale: {scale_conversion})")
  axes[1].set_xlabel(f"{feature_name}")  # Add x-axis label here

Additional changes to plotting:

Added flexibility for keyword arguments (kde_kws, hist_kws, and box_kws) in the data_doctor function to allow users to customize Seaborn plots directly. This enhancement enables users to pass additional parameters for KDE, histogram, and boxplot customization, making the function more adaptable to specific plotting requirements.

Changes:

  • Added an x-axis label to histplot()
  • Introduced the following dictionary-based keyword argument inputs:
    • kde_kws: Allows customization of the KDE plot (e.g., color, fill, etc.).
    • hist_kws: Allows customization of the histogram plot (e.g., stat, color, etc.).
    • box_kws: Allows customization of the boxplot (e.g., palette, color, etc.).
  • Updated docstrings to reflect these changes and improved the description of the plotting logic.

This should provide users with more control over the visual output without altering the core functionality of the data_doctor function.
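
A hedged sketch of the new keyword-argument dictionaries with hypothetical data; arguments not named in these notes are assumptions about the signature.

```python
import pandas as pd
from eda_toolkit import data_doctor

df = pd.DataFrame({"age": [22, 34, 45, 51, 63]})

data_doctor(
    df=df,
    feature_name="age",
    hist_kws={"stat": "density", "color": "steelblue"},
    kde_kws={"fill": True, "color": "darkorange"},
    box_kws={"color": "lightgray"},
)
```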

EDA Toolkit 0.0.11a

19 Oct 04:37

eda_toolkit 0.0.11a - Release Notes

Release Date: October 2024

We are excited to announce the 0.0.11a release of eda_toolkit, which brings an important new feature: the data_doctor function. This release focuses on providing enhanced data quality checks and improving your exploratory data analysis workflow.

🚀 New Features:

data_doctor Function

The data_doctor function has been added to assist with automated data health checks. It performs a series of diagnostics on your dataset to identify potential issues such as:

  • Missing Data: It scans for any null values across columns and provides a summary.
  • Data Types: Verifies the consistency of data types across each column.
  • Outliers: Detects and highlights statistical outliers based on customizable thresholds (e.g., IQR method).
  • Duplicated Entries: Identifies duplicate rows in the dataset.
  • Inconsistent Values: Flags anomalies or inconsistent data entries (such as mixed types in categorical variables).
  • Unique Values: Reports unique value counts for each feature, helping spot features with low variance.

This function helps users clean their data efficiently by pointing out key issues that may need attention before proceeding with analysis or model training.

Example usage:

from eda_toolkit import data_doctor

# Run diagnostics on a DataFrame
report = data_doctor(df, outlier_method='iqr', display_full_report=True)

flex_corr_matrix fix

Fix: Set the default input title in flex_corr_matrix() to None, since it was previously set to "Cervical Cancer Data: Correlation Matrix".

EDA Toolkit 0.0.11

24 Sep 22:29

Fix TypeError in stacked_crosstab_plot for save_formats

Description:

Fixes a TypeError in the stacked_crosstab_plot function when save_formats is None. The update ensures that save_formats defaults to an empty list, preventing iteration over a NoneType object.

Changes:

  • Initializes save_formats as an empty list if not provided.
  • Adds handling for string and tuple input types for save_formats (a minimal normalization sketch follows).
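
A minimal sketch of the normalization described above (hypothetical helper, not the library's exact code):

```python
def normalize_save_formats(save_formats=None):
    # None, str, and tuple inputs all collapse to a list, so iteration is safe.
    if save_formats is None:
        return []                          # nothing to save -> no TypeError
    if isinstance(save_formats, str):
        return [save_formats]              # "png" -> ["png"]
    if isinstance(save_formats, tuple):
        return list(save_formats)          # ("png", "svg") -> ["png", "svg"]
    return save_formats                    # already a list

print(normalize_save_formats(None))            # []
print(normalize_save_formats("png"))           # ['png']
print(normalize_save_formats(("png", "svg")))  # ['png', 'svg']
```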

Issue Fixed:

Resolves TypeError when save_formats is None.

EDA Toolkit 0.0.10

18 Sep 03:19

Version 0.0.10

Legend Handling:

  • The legend is now displayed only if there are valid legend handles (len(handles) > 0) and if show_legend is set to True.
  • A call to ax.get_legend().remove() ensures that unnecessary legends are removed if they are empty or if show_legend is set to False (a minimal sketch of this logic follows).
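
A minimal sketch of this legend logic built around the calls named above; the surrounding plotting code is omitted and the data is hypothetical.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label="group A")   # hypothetical labeled series
show_legend = True

handles, labels = ax.get_legend_handles_labels()
if show_legend and len(handles) > 0:
    ax.legend(handles, labels)             # draw only when there is real content
elif ax.get_legend() is not None:
    ax.get_legend().remove()               # drop empty or unwanted legends
```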

Error Handling:

  • Error handling in the except block has been enhanced to ensure that any exceptions related to legends or labels are managed properly. The legend handling logic still respects the show_legend flag even in cases where exceptions occur.

This update prevents empty legend squares from appearing and maintains the intended default behavior of showing legends only when they contain relevant content.