datpro
======

.. py:module:: datpro


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/datpro/datpro/index


Attributes
----------

.. autoapisummary::

   datpro.__version__


Functions
---------

.. autoapisummary::

   datpro.detect_anomalies
   datpro.plotify
   datpro.summarize_data


Package Contents
----------------

.. py:data:: __version__

.. py:function:: detect_anomalies(df: pandas.DataFrame, anomaly_type: Optional[str] = None) -> Dict[str, Union[Dict[str, Dict[str, Union[int, float]]], str]]

   Detect anomalies in a dataframe, including missing values, outliers, and duplicates.

   :param df: The input dataframe to analyze.
   :type df: pandas.DataFrame
   :param anomaly_type: The anomaly type to check: 'missing_values', 'outliers', or 'duplicates'.
                        If None, all anomaly types are checked.
   :type anomaly_type: str, optional

   :returns: A dictionary of detected anomalies, keyed by anomaly type (for example 'missing_values').
   :rtype: dict

   .. rubric:: Example

   >>> import numpy as np
   >>> import pandas as pd
   >>> from datpro import detect_anomalies
   >>> data = {'A': [1, 2, np.nan, 4], 'B': [100, 200, 300, 400], 'C': [1, 1, 1, 100]}
   >>> df = pd.DataFrame(data)
   >>> detect_anomalies(df, anomaly_type='missing_values')
   {'missing_values': {'A': {'missing_count': 1, 'missing_percentage': 25.0}}}
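
   The ``missing_values`` entry shown above can be reproduced with plain pandas. The following is a minimal sketch, not datpro's actual implementation: ``missing_value_report`` is a hypothetical helper, and rounding the percentage to two decimals is an assumption.

   .. code-block:: python

      import numpy as np
      import pandas as pd


      def missing_value_report(df: pd.DataFrame) -> dict:
          """Hypothetical helper: NaN count and percentage per column."""
          report = {}
          for col in df.columns:
              count = int(df[col].isna().sum())
              if count > 0:  # report only columns that contain missing values
                  report[col] = {
                      "missing_count": count,
                      "missing_percentage": round(100 * count / len(df), 2),
                  }
          return {"missing_values": report}


      df = pd.DataFrame(
          {"A": [1, 2, np.nan, 4], "B": [100, 200, 300, 400], "C": [1, 1, 1, 100]}
      )
      result = missing_value_report(df)

   Columns without missing values are left out of the report, which matches the shape of the documented output.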


.. py:function:: plotify(df: pandas.DataFrame, plot_types: Optional[List[str]] = None, save: bool = False, save_path: str = 'plots', file_prefix: str = 'plot') -> Dict[str, altair.Chart]

   Visualize a DataFrame by generating specified plots based on column datatypes.

   :param df: The DataFrame containing the data to be visualized.
   :type df: pandas.DataFrame
   :param plot_types: A list of plot types to generate. Available options include:

                      - 'histogram' : Plot a histogram for numeric columns.
                      - 'density' : Plot a density plot for numeric columns.
                      - 'bar' : Plot a bar chart for categorical columns.
                      - 'scatter' : Plot scatter plots for pairwise numeric columns.
                      - 'correlation' : Plot a correlation heatmap for numeric columns.
                      - 'box' : Plot box plots for numeric vs categorical columns.
                      - 'stacked_bar' : Plot stacked bar charts for pairwise categorical columns.

                      If None, all plot types are generated by default.
   :type plot_types: list of str, optional
   :param save: If True, saves the plots to the specified path. Default is False.
   :type save: bool, optional
   :param save_path: The directory where plots should be saved. Default is 'plots'.
   :type save_path: str, optional
   :param file_prefix: The prefix for saved plot filenames. Default is 'plot'.
   :type file_prefix: str, optional

   :returns: A dictionary where keys are plot names and values are Altair Chart objects.
   :rtype: dict

   :raises TypeError: If the input is not a pandas DataFrame.
   :raises ValueError: If the input DataFrame is empty.

   .. rubric:: Notes

   - Numeric columns are those with dtype 'int64' or 'float64'.
   - Categorical columns are those with dtype 'object', 'category', or 'bool'.
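
   The dtype rules in the notes above map naturally onto ``DataFrame.select_dtypes``. The following is a minimal sketch that assumes columns are split roughly this way; the variable names are illustrative and the actual implementation may differ.

   .. code-block:: python

      import pandas as pd

      # Explicit dtypes keep the example independent of platform defaults.
      df = pd.DataFrame(
          {
              "A": pd.Series([1, 2, 3, 4], dtype="int64"),
              "B": pd.Series(["x", "y", "x", "y"], dtype="category"),
              "C": pd.Series([True, False, True, False], dtype="bool"),
          }
      )

      # Columns treated as numeric according to the notes.
      numeric_cols = df.select_dtypes(include=["int64", "float64"]).columns.tolist()

      # Columns treated as categorical according to the notes.
      categorical_cols = df.select_dtypes(
          include=["object", "category", "bool"]
      ).columns.tolist()

   Under these rules a column with another dtype, such as 'int32' or 'datetime64[ns]', would fall into neither group.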

   .. rubric:: Examples

   >>> import pandas as pd
   >>> from datpro import plotify
   >>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['x', 'y', 'x', 'y']})
   >>> charts = plotify(df, plot_types=['histogram', 'bar'])
   >>> charts['histogram_A'].show()
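
   For orientation, a histogram such as the one returned under ``'histogram_A'`` might be built along these lines in Altair. This is a sketch only: the binning and encoding choices are assumptions, not plotify's confirmed settings.

   .. code-block:: python

      import altair as alt
      import pandas as pd

      df = pd.DataFrame({"A": [1, 2, 3, 4]})

      # Bin the numeric column on x and count the rows in each bin on y.
      chart = alt.Chart(df).mark_bar().encode(
          x=alt.X("A:Q", bin=True),
          y="count()",
      )

      # The chart is a declarative spec; to_dict() exposes it for inspection.
      spec = chart.to_dict()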


.. py:function:: summarize_data(df: pandas.DataFrame) -> pandas.DataFrame

   Summarize the numeric columns of a DataFrame by computing key statistics.

   The function detects the numeric columns in the provided DataFrame automatically and
   returns a summary DataFrame containing the minimum, 25th percentile (Q1), median (50th percentile),
   75th percentile (Q3), and maximum of each numeric column.

   :param df: The input DataFrame containing data to be summarized.
   :type df: pandas.DataFrame

   :returns: A DataFrame where each row corresponds to a numeric column in the input DataFrame,
             and the columns represent the calculated statistics: min, 25%, 50% (median), 75%, and max.
   :rtype: pandas.DataFrame

   .. rubric:: Example

   >>> import pandas as pd
   >>> import numpy as np
   >>> from datpro import summarize_data
   >>> data = {
   ...     "A": [1, 2, np.nan, 4],
   ...     "B": [100, 200, 300, 400],
   ...     "C": [1, 1, 1, 100]
   ... }
   >>> df = pd.DataFrame(data)
   >>> summarize_data(df)
        min    25%    50%     75%    max
   A    1.0    1.5    2.0    3.00    4.0
   B  100.0  175.0  250.0  325.00  400.0
   C    1.0    1.0    1.0   25.75  100.0
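
   A comparable five-number summary can be obtained with ``DataFrame.describe``. The following is a minimal sketch that assumes pandas' default linear quantile interpolation; it is not necessarily how ``summarize_data`` is implemented.

   .. code-block:: python

      import numpy as np
      import pandas as pd

      df = pd.DataFrame(
          {"A": [1, 2, np.nan, 4], "B": [100, 200, 300, 400], "C": [1, 1, 1, 100]}
      )

      # Keep only numeric columns; NaN values are skipped by describe().
      numeric = df.select_dtypes(include="number")

      # describe() also returns count, mean, and std; keep the five statistics
      # and transpose so that each row corresponds to one input column.
      summary = numeric.describe().T[["min", "25%", "50%", "75%", "max"]]

   Transposing gives one row per numeric column, the same layout as the summary described above.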