metaprivBIDS package

Submodules

metaprivBIDS.metaprivBIDS module

class metaprivBIDS.metaprivBIDS.ComboBoxDelegate(parent=None)

Bases: QStyledItemDelegate

createEditor(self, parent: PySide6.QtWidgets.QWidget, option: PySide6.QtWidgets.QStyleOptionViewItem, index: PySide6.QtCore.QModelIndex | PySide6.QtCore.QPersistentModelIndex) PySide6.QtWidgets.QWidget
setEditorData(self, editor: PySide6.QtWidgets.QWidget, index: PySide6.QtCore.QModelIndex | PySide6.QtCore.QPersistentModelIndex) None
setModelData(self, editor: PySide6.QtWidgets.QWidget, model: PySide6.QtCore.QAbstractItemModel, index: PySide6.QtCore.QModelIndex | PySide6.QtCore.QPersistentModelIndex) None
staticMetaObject = PySide6.QtCore.QMetaObject("ComboBoxDelegate" inherits "QStyledItemDelegate": )
class metaprivBIDS.metaprivBIDS.NumericStandardItem(text)

Bases: QStandardItem

metaprivBIDS.metaprivBIDS.main()
class metaprivBIDS.metaprivBIDS.metaprivBIDS

Bases: QMainWindow

addCigResultDisplay(layout)

Add a QTextBrowser inside the CIG result frame to display CIG results on the privacy info page.

addPrivacyButtons(layout)

Add buttons related to privacy calculations on the privacy info page.

add_buttons(layout)

Add a row of buttons to the specified layout.

This method creates a horizontal layout containing a set of buttons, each associated with a specific functionality of the application. It then adds this horizontal layout to the provided layout. Each button is styled with a unique background color and connected to its corresponding event handler.

Args:

layout (QVBoxLayout): The layout to which the buttons will be added.

Buttons:
  • “Load CSV/TSV File”: Opens a dialog to load a CSV/TSV file.

  • “Privacy Calculation”: Calculates unique rows for privacy analysis.

  • “Variable Optimization”: Finds the lowest unique columns for optimization.

  • “Preview Data”: Shows a preview of the loaded data.

  • “Compute SUDA”: Computes SUDA (Special Uniques Detection Algorithm) scores.

  • “Privacy Information Factor”: Displays privacy information.

add_frames(layout)

Add frames to the provided layout for displaying various sections of the UI.

This method creates two frames (‘load_results_frame’ and ‘variable_optimization_frame’) and adds them to the specified layout. Both frames are styled with a white border. It then calls methods to populate these frames with their respective layouts. Additionally, a ‘QLabel’ (‘result_label’) is created and added to the layout for displaying result messages.

Parameters:

layout : QVBoxLayout

The layout to which the frames and the result label will be added.

Returns:

QWidget

add_load_results_layout()

Add a layout to the ‘load_results_frame’ for managing column selection and configuration.

This method sets up a vertical layout within ‘load_results_frame’. It creates a ‘QTreeView’ to display columns with various attributes such as unique values and types. The tree view uses a ‘QStandardItemModel’ to manage the data, with headers: “Select Columns”, “Unique Value”, “Type”, and “Select Sensitive Attribute”. A custom ‘ComboBoxDelegate’ is set for the “Type” column to allow users to select a data type from a dropdown. The ‘setup_treeview’ method is called to configure the tree view, and the view is added to the layout.

Returns:

QWidget

add_noise(noise_type)

Adds noise to a selected continuous column in the data.

This method allows the user to select a continuous column from which noise will be added based on the specified noise type. The available noise types are Laplacian and Gaussian. The original column data is preserved before applying the noise.

Parameters:

noise_type : str

The type of noise to add. Options are:

  • ‘laplacian’: Adds Laplacian noise.

  • ‘gaussian’: Adds Gaussian noise.

Raises:

ValueError

If an invalid ‘noise_type’ is provided.

Notes:

  • The method first checks if there are any continuous columns available. If not, a warning is shown.

  • The user is prompted to select a column to which the noise will be added.

  • The method preserves the original data of the selected column before modifying it with the specified noise.

  • The ‘show_preview’ function is called after adding the noise to update any relevant displays.

Returns:

Continuous column with noise.
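As a rough illustration of the two noise types, the sketch below perturbs a numeric pandas column with NumPy. The helper name `noisy_column`, the `scale` parameter, and its default are assumptions made for this example; the documentation above does not say how the package chooses the noise scale.

```python
import numpy as np
import pandas as pd


def noisy_column(series, noise_type, scale=1.0, rng=None):
    """Return a copy of `series` with zero-mean noise added (hypothetical helper)."""
    if rng is None:
        rng = np.random.default_rng()
    if noise_type == "laplacian":
        noise = rng.laplace(loc=0.0, scale=scale, size=len(series))
    elif noise_type == "gaussian":
        noise = rng.normal(loc=0.0, scale=scale, size=len(series))
    else:
        # Mirrors the documented behaviour for an invalid noise type.
        raise ValueError(f"invalid noise_type: {noise_type!r}")
    return series + noise


df = pd.DataFrame({"age": [31.0, 45.0, 52.0, 38.0]})
original_age = df["age"].copy()  # preserve the original values before modifying
df["age"] = noisy_column(df["age"], "laplacian", scale=2.0,
                         rng=np.random.default_rng(0))
print(df.assign(original=original_age))
```

Keeping a copy of the original column, as above, is what makes a later revert possible.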

add_preview_page_widgets(layout)

Add preview page widgets to the specified layout.

This method creates and adds various widgets to the given layout, including:

  • A horizontal layout of buttons for rounding values, adding noise, reverting data, and combining values.

  • A ‘Graph Categorical’ button to display graphs for categorical data.

  • A label for metadata display and a dropdown for selecting columns.

  • A ‘Load JSON Metadata’ button for loading and displaying metadata.

Parameters:

layout : QVBoxLayout

The layout to which the widgets will be added.

add_variable_optimization_layout()

Add a layout to the variable_optimization_frame for displaying variable optimization results.

This method sets up a vertical layout within variable_optimization_frame. It creates a QTreeView (results_view) to display the variable optimization results using a QStandardItemModel. The model’s headers are set to “Quasi Identifiers”, “Unique Rows After Removal”, “Difference”, and “Normalized”. The tree view is configured using setup_treeview and added to the layout.

Returns:

QVBoxLayout

The layout containing the QTreeView for variable optimization results.

calculate_k_anonymity(selected_columns)

Calculates the K-Anonymity for the specified columns in the dataset.

Parameters:

selected_columns : list of str

The list of column names used to calculate K-Anonymity.

Returns:

int:

The K-Anonymity value, which is the smallest group size of records with the same values for the selected columns.

Notes:

  • This method groups the dataset by the selected columns and calculates the size of each group.

  • It returns the minimum group size as the K-Anonymity value.

Raises:

ValueError

If the ‘selected_columns’ list is empty or contains invalid column names.
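The grouping described in the notes can be sketched with pandas as follows. The standalone function `k_anonymity` and the sample data are illustrative assumptions, not the package's own code.

```python
import pandas as pd


def k_anonymity(df, selected_columns):
    """Smallest group size when grouping `df` by `selected_columns` (illustrative)."""
    if not selected_columns:
        raise ValueError("selected_columns must not be empty")
    missing = [c for c in selected_columns if c not in df.columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    # Size of every group of records sharing the same quasi-identifier values.
    group_sizes = df.groupby(selected_columns).size()
    return int(group_sizes.min())


df = pd.DataFrame({
    "age": [30, 30, 40, 40, 40],
    "sex": ["F", "F", "M", "M", "F"],
})
print(k_anonymity(df, ["age"]))         # groups of size 2 and 3
print(k_anonymity(df, ["age", "sex"]))  # the (40, "F") record stands alone
```

A value of 1 means at least one record is unique on the selected columns.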

calculate_l_diversity(selected_columns, sensitive_attr)

Calculates the L-Diversity for the specified columns in the dataset with respect to a sensitive attribute.

L-Diversity measures the diversity of the sensitive attribute within each group defined by the selected columns. It is defined as the minimum number of distinct values of the sensitive attribute in any group. Higher L-Diversity indicates better protection of sensitive information.

Parameters:

selected_columns : list of str

The list of column names used to define the grouping in the dataset.

sensitive_attr : str

The name of the column that represents the sensitive attribute whose diversity is to be measured.

Returns:

int:

The L-Diversity value, which is the minimum number of distinct values of the sensitive attribute within any group.

Notes:

  • This method groups the dataset by the selected columns and calculates the number of unique values for the sensitive attribute in each group.

  • It returns the minimum number of distinct values of the sensitive attribute across all groups.

Raises:

ValueError

If the ‘selected_columns’ list is empty, or if ‘sensitive_attr’ is not present in the dataset.
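A minimal pandas sketch of this definition, assuming a standalone helper named `l_diversity` and made-up sample data:

```python
import pandas as pd


def l_diversity(df, selected_columns, sensitive_attr):
    """Minimum number of distinct sensitive values in any group (illustrative)."""
    if not selected_columns:
        raise ValueError("selected_columns must not be empty")
    if sensitive_attr not in df.columns:
        raise ValueError(f"unknown sensitive attribute: {sensitive_attr!r}")
    # Count distinct sensitive values per group, then take the worst case.
    distinct_per_group = df.groupby(selected_columns)[sensitive_attr].nunique()
    return int(distinct_per_group.min())


df = pd.DataFrame({
    "age": [30, 30, 40, 40, 40],
    "diagnosis": ["A", "B", "A", "A", "A"],
})
# The age-40 group contains a single diagnosis, so the result is 1.
print(l_diversity(df, ["age"], "diagnosis"))
```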

calculate_unique_rows()

Calculates and displays various statistics related to the selected columns in the dataset.

This method computes the number of unique rows based on the selected columns, as well as other metrics such as total rows, total columns, the number of selected columns, K-Anonymity, and L-Diversity. It updates a result label with these statistics.

Computes:

  • Unique rows: The number of rows where the combination of values in the selected columns is unique.

  • K-Anonymity: A measure of how well the dataset is anonymized with respect to the selected columns.

  • L-Diversity: A measure of diversity within the sensitive attribute values if a sensitive attribute is selected.

Uses:

  • ‘self.get_selected_columns()’: Retrieves the currently selected columns.

  • ‘self.get_sensitive_attribute()’: Retrieves the currently selected sensitive attribute.

  • ‘self.calculate_k_anonymity(selected_columns)’: Calculates the K-Anonymity for the selected columns.

  • ‘self.calculate_l_diversity(selected_columns, sensitive_attr)’: Calculates the L-Diversity for the selected columns and sensitive attribute.

Updates:

  • ‘self.result_label’: Displays the computed statistics including total rows, total columns, number of selected columns, number of unique rows, K-Anonymity, and L-Diversity.

Raises:

Exception

If an error occurs during the computation of unique rows or other statistics.
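One way to count "unique rows" as defined above (rows whose combination of values in the selected columns occurs exactly once) is sketched here. The helper name `count_unique_rows` is an assumption for illustration.

```python
import pandas as pd


def count_unique_rows(df, selected_columns):
    """Count rows whose value combination appears exactly once (illustrative)."""
    # keep=False flags every member of a repeated combination as a duplicate,
    # so the rows left unflagged are the ones that occur only once.
    repeated = df.duplicated(subset=selected_columns, keep=False)
    return int((~repeated).sum())


df = pd.DataFrame({
    "age": [30, 30, 40, 40, 40],
    "sex": ["F", "F", "M", "M", "F"],
})
print(count_unique_rows(df, ["age", "sex"]))  # only (40, "F") occurs once
```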

close_splash()
combine_selected_values(column_name, selected_items)

Combines selected values in a specified column with a new replacement value.

This method allows the user to combine multiple selected values in a given categorical column into a single replacement value. It performs the following actions:

  1. Validates that at least two values are selected for combination.

  2. Checks if the column’s original values are stored; if not, stores them.

  3. Prompts the user to enter a replacement value for the selected items.

  4. Updates the dataset by replacing the selected values with the new replacement value.

  5. Records the combination in the history for potential future reference.

  6. Refreshes the preview to reflect the changes.

Parameters:

column_name : str

The name of the column where values are to be combined.

selected_items : list of QListWidgetItem

The list of selected items from the dialog containing the values to be combined.

Notes:

  • The method requires that at least two values are selected for combining. If fewer than two values are selected, a warning is shown.

  • The original values of the column are stored to allow for potential future reference or undo operations.

  • The method updates both the ‘combined_values_history’ and the dataset with the new combination.

Raises:

QMessageBox.Warning: If fewer than two values are selected or if the user does not provide a replacement value.

Returns:

QMessageBox
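Leaving the dialogs aside, the data manipulation in the steps above might look like the following sketch. The function `combine_values` and the plain dictionaries standing in for the stored originals and for ‘combined_values_history’ are assumptions for this example.

```python
import pandas as pd


def combine_values(df, column_name, selected_values, replacement,
                   originals, history):
    """Replace several category values with one value, in place (illustrative)."""
    if len(selected_values) < 2:
        raise ValueError("select at least two values to combine")
    # Store the original column once so it can be restored later.
    if column_name not in originals:
        originals[column_name] = df[column_name].copy()
    is_selected = df[column_name].isin(selected_values)
    df[column_name] = df[column_name].mask(is_selected, replacement)
    # Record what was merged into what.
    history.setdefault(column_name, []).append(
        (list(selected_values), replacement)
    )


df = pd.DataFrame({"education": ["BSc", "MSc", "PhD", "BSc"]})
originals, history = {}, {}
combine_values(df, "education", ["MSc", "PhD"], "Postgraduate",
               originals, history)
print(df["education"].tolist())
print(history)
```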

compute_and_display_suda2_results()

This function computes the SUDA2 metrics, converts object-type columns to numeric codes, and updates the frames with the results. The user is asked for the ‘missing’ value and ‘DisFraction’ in pop-up dialogs.

compute_cig()

Computes and displays the Categorical Information Gain (CIG) for selected columns in the DataFrame.

compute_combined_column_contribution()

Computes and evaluates the contribution of various column combinations to unique row counts in the dataset.

This method performs the following steps:

  1. Checks if data is loaded; if not, it displays a warning.

  2. Prompts the user to input minimum and maximum combination sizes for column combinations.

  3. Iterates over all column combinations of sizes ranging from ‘min_size’ to ‘max_size’.

  4. For each combination:

     • Calculates the number of unique rows (rows with unique values) in the dataset for the selected columns.

     • Calculates the number of unique rows excluding the selected columns.

     • Stores the combination, the number of unique rows, and the number of unique rows excluding columns.

  5. Computes a score for each combination based on the total unique rows and the unique rows excluding columns.

  6. Displays the results in a QDialog.

Notes:

  • The results include the combination of columns, the count of unique rows, and a score based on the contribution to unique row counts.

  • The ‘show_results_dialog’ method is used to display the results to the user.

Raises:

QMessageBox: If no data is loaded or no columns are selected.

Returns:

Combined Column Contribution calculation.
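The enumeration in steps 3 and 4 can be sketched with `itertools.combinations`. The helper names are hypothetical, and the score from step 5 is deliberately left out because its formula is not given in this documentation.

```python
from itertools import combinations

import pandas as pd


def count_unique_rows(df, columns):
    """Rows whose value combination occurs exactly once (0 if no columns)."""
    if not columns:
        return 0
    return int((~df.duplicated(subset=list(columns), keep=False)).sum())


def combination_counts(df, columns, min_size, max_size):
    """Unique-row counts with and without each column combination (illustrative)."""
    records = []
    for size in range(min_size, max_size + 1):
        for combo in combinations(columns, size):
            remaining = [c for c in columns if c not in combo]
            records.append({
                "combination": combo,
                "unique_rows": count_unique_rows(df, combo),
                "unique_rows_excluding": count_unique_rows(df, remaining),
            })
    return pd.DataFrame(records)


df = pd.DataFrame({
    "age": [30, 30, 40, 40],
    "sex": ["F", "M", "F", "M"],
    "city": ["X", "X", "X", "X"],
})
print(combination_counts(df, ["age", "sex", "city"], min_size=1, max_size=2))
```

The number of combinations grows quickly with the number of columns, so a small ‘max_size’ keeps the run time manageable.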

create_noise_menu()

Create a noise menu with options to add Laplacian or Gaussian noise.

This method creates a QMenu with actions for adding Laplacian and Gaussian noise to the data. Each action is connected to the add_noise method with the respective noise type as a parameter.

Returns:

QMenu:

The menu containing noise addition actions.

describe_cig()

Generates and displays a statistical description of the CIG (Categorical Information Gain) DataFrame.

display_boxplot()
display_dis_score_boxplot()
find_lowest_unique_columns()

Identify columns with the lowest impact on unique row counts when removed.

This method analyzes the selected columns to determine their contribution to the uniqueness of rows in the dataset. It performs the following steps:

  1. Retrieves selected columns.

  2. Calculates the number of unique rows using the selected columns.

  3. Iteratively removes each column to find the impact on the number of unique rows.

  4. Computes the difference in unique row counts and normalizes this difference based on the number of unique values in the column.

  5. Sorts and displays the results in descending order of normalized difference using update_treeview.

Results are displayed in a tree view and include:

  • Column name

  • Unique rows after removal

  • Difference in unique row counts

  • Normalized difference

If an error occurs, an error message is displayed on ‘result_label’.
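A compact sketch of this leave-one-column-out analysis follows. Two details are assumptions made for illustration, since the text above does not pin them down: the difference is taken as the baseline count minus the count after removal, and normalization divides that difference by the number of unique values in the removed column.

```python
import pandas as pd


def count_unique_rows(df, columns):
    """Rows whose value combination occurs exactly once (0 if no columns)."""
    if not columns:
        return 0
    return int((~df.duplicated(subset=columns, keep=False)).sum())


def column_impact(df, selected_columns):
    """Per-column effect on the unique-row count, highest first (illustrative)."""
    baseline = count_unique_rows(df, selected_columns)
    results = []
    for col in selected_columns:
        remaining = [c for c in selected_columns if c != col]
        after = count_unique_rows(df, remaining)
        difference = baseline - after               # assumed sign convention
        normalized = difference / df[col].nunique()  # assumed normalization
        results.append((col, after, difference, normalized))
    # Descending order of normalized difference.
    results.sort(key=lambda row: row[3], reverse=True)
    return results


df = pd.DataFrame({
    "age": [30, 30, 40, 40],
    "sex": ["F", "M", "F", "M"],
    "city": ["X", "X", "X", "X"],
})
for row in column_impact(df, ["age", "sex", "city"]):
    print(row)
```

In this sample, removing ‘city’ changes nothing because it has a single value, so it ends up last in the ranking.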

generate_heatmap()

Generates and displays a heatmap of the CIG (Categorical Information Gain) DataFrame.

This method performs the following actions:

  1. Checks if the ‘cigs_df’ attribute exists and is not empty.

  2. Excludes the ‘RIG’ column from the DataFrame for heatmap generation.

  3. Creates a heatmap using seaborn with a custom color map and saves it as an image file (‘heatmap.png’).

  4. Displays the generated heatmap image in the user interface.

  5. Ensures that the heatmap label and close button are visible.

If the ‘cigs_df’ attribute is not present or is empty, the method displays a message in ‘cig_result_browser’ prompting the user to compute CIG before generating the heatmap.

Notes:

  • The heatmap is saved as ‘heatmap.png’ in the current working directory.

  • The method checks for and manages the presence of ‘heatmap_label’ to display the heatmap and ensures the ‘close_heatmap_button’ is visible.

Raises:

Missing CIG computation: “Please compute CIG before generating the heatmap.”

Returns:

Matplotlib plot (Heatmap)

get_categorical_columns()

Retrieve a list of categorical columns from the model.

Iterates through the rows of columns_model to find columns marked as “Categorical”. Returns a list of the names of these columns.

Returns:

list of str

A list containing the names of columns identified as categorical.

get_selected_columns()

Retrieve the names of the selected columns in the columns view.

Collects and returns the names of columns currently selected in ‘columns_view’. If no columns are selected, displays a message on ‘result_label’.

Returns:

list of str

The list of selected column names.

get_sensitive_attribute()

Retrieve the name of the selected sensitive attribute.

Iterates through the rows of columns_model to find the column marked as the sensitive attribute (checked in the fourth column). Returns the name of the first checked sensitive attribute or None if none are found.

Returns:

str or None

The name of the sensitive attribute if one is selected, otherwise None.

handle_compute_cig()
handle_descrip_icon()
highlight_highest_mean_value()

Highlights the largest values in the ‘mean’ column of the QTableWidget.

initUI()

Initialize the user interface of the File Analyzer application. Sets up the main window, styles it, and initializes the pages using dedicated methods.

initializePages()

Initialize all pages and add them to the stacked widget.

load_data(file_path)

Loads data from a specified file into the application’s data structure.

This method reads a CSV or TSV file from the provided file path and processes the data. It distinguishes between continuous and categorical columns based on the number of unique values and updates the internal data structures accordingly.

Parameters:

file_path : str

The path to the file to be loaded. The file extension should be either ‘.csv’ or ‘.tsv’.

Raises:

FileNotFoundError

If the specified file does not exist or cannot be found.

pd.errors.EmptyDataError

If the file is empty or cannot be read.

pd.errors.ParserError

If there is a parsing error in the file content.

Notes:

  • Column names are stripped of leading and trailing whitespace.

  • Columns are classified as “Continuous” if they have more than 45 unique values; otherwise, they are classified as “Categorical”.

  • The method stores a copy of the original data and updates the tree view with the column types and options for selecting columns.

Returns:

Loaded Data
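The header clean-up and the 45-unique-value rule from the notes can be illustrated as below. The function `read_and_classify` and the in-memory sample file are assumptions for this sketch; the real method also updates the tree view, which is not shown.

```python
import io

import pandas as pd


def read_and_classify(source, sep=",", threshold=45):
    """Read a delimited file and label each column by its unique-value count."""
    df = pd.read_csv(source, sep=sep)
    # Strip leading and trailing whitespace from the column names.
    df.columns = df.columns.str.strip()
    column_types = {
        col: "Continuous" if df[col].nunique() > threshold else "Categorical"
        for col in df.columns
    }
    return df, column_types


# Small in-memory TSV: 50 distinct ids, two distinct groups.
rows = "\n".join(f"{i}\t{'a' if i % 2 else 'b'}" for i in range(50))
tsv = io.StringIO(" id \tgroup\n" + rows)

df, column_types = read_and_classify(tsv, sep="\t")
original_df = df.copy()  # keep an untouched copy of the loaded data
print(column_types)
```

Because the rule only counts unique values, a numeric column with few distinct values is labeled “Categorical”, and a free-text column with many distinct values is labeled “Continuous”.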

load_file()

Open a file dialog to load a CSV or TSV file.

Displays a file dialog allowing the user to select a CSV, TSV, or other file type. If a file is selected, it calls load_data to load the file’s content.

load_metadata()

Loads metadata from a JSON file and updates the column dropdown.

Opens a file dialog to select a JSON file. Loads the selected file’s content into ‘self.metadata’ and updates the column dropdown. Shows an error message if file loading fails.

plot_tree_graph(column_name)

Plots a tree graph representing the relationships between the unique values of a selected categorical column, including combined values if applicable.

This method constructs a directed graph (DiGraph) using NetworkX to represent the hierarchy of values within the specified column. If there are combined values in the column (based on the combined_values_history), they are included in the graph, showing how individual values are combined into a replacement node. The graph is visualized using Matplotlib.

Workflow:

  1. Creates a directed graph (‘DiGraph’) with the column name as the root node.

  2. Adds nodes and edges for combined values from ‘combined_values_history’.

  3. Adds nodes and edges for unique values in the selected column.

  4. Removes edges related to combined values to prevent overlap.

  5. Uses Graphviz layout for proper positioning of nodes.

  6. Draws and displays the graph using Matplotlib and NetworkX.

Parameters:

column_name : str

The name of the categorical column from the dataset to plot.

Notes:

  • The graph will include nodes for individual values, and if any values were combined into a single entity, the relationship will be reflected in the graph.

  • Uses Graphviz for positioning (prog='dot') to ensure a hierarchical tree layout.

Returns:

nx.drawing.nx_agraph

revert_to_original()

Reverts the values in a selected column to their original state.

Updates the preview to reflect the reverted column. If no original data is available for the selected column, shows a warning message.

Raises:

QMessageBox

If an error occurs while reverting the column or if no original data is available.

round_values()

Rounds values in a selected continuous column to a specified precision or removes decimal numbers.

Rounds the values in the selected column according to the chosen precision or truncates the values. Stores the original data if not already stored and updates the preview.

Raises:

QMessageBox

If no continuous columns are available or if an error occurs during rounding or truncation.
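The difference between rounding to a precision and removing the decimal part can be seen in a short pandas/NumPy sketch. The sample values and variable names are made up for illustration, and using `np.trunc` for the "removes decimal numbers" option is an assumption about what that option does.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.26, 2.74, 3.91])
original = values.copy()        # stored so the column can be reverted later

rounded = values.round(1)       # round to one decimal place
truncated = np.trunc(values)    # drop the fractional part entirely

print(pd.DataFrame({"original": original,
                    "rounded": rounded,
                    "truncated": truncated}))
```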

save_and_display_boxplot_in_frame(k=2.2414)
save_boxplot_rig_values(k=2.2414)
save_cig_to_csv()

Saves the computed CIG DataFrame to a CSV file. Prompts the user to specify a file location and then saves the DataFrame.

save_suda_dataframe_to_csv()

This function saves the SUDA DataFrame (df) to a CSV file.

setupCentralWidget()

Set up the central widget and the main layout.

setupMainPage()

Set up the main page with buttons and frames.

setupMainPageSpecificComponents(layout)

Set up components specific to the main page, such as custom buttons.

setupPreviewPage()

Set up the preview page with a table view and navigation buttons, including a ‘Back to Main’ button.

setupPrivacyInfoPage()

Set up the privacy information page with buttons and static placeholder frames for CIG results and description.

setupSudaInfoPage()

Set up the SUDA information page with smaller buttons at the top and two side-by-side frames.

setup_treeview(view)

Configure the appearance and behavior of the given QTreeView.

Sets the selection mode, background color, font, and sorting behavior for the provided QTreeView. Adjusts settings based on whether the view is the columns view.

Parameters:

view : QTreeView

The tree view to be configured.

show_combine_values_dialog()

Displays a dialog allowing the user to select and combine unique values from a specified categorical column in the dataset.

This method first checks if the dataset is loaded and then retrieves the categorical columns from the dataset. It allows the user to select a column from which to combine unique values. If the selected column has fewer than two unique values, a warning is shown. Otherwise, a dialog is presented where the user can select multiple unique values to combine. The user can then confirm the selection by clicking a button, which triggers the combination of selected values.

Workflow:

  1. Checks if the dataset is loaded.

  2. Retrieves the categorical columns from the dataset.

  3. Opens a dialog for the user to select a categorical column.

  4. Checks if the selected column has at least two unique values.

  5. Opens another dialog to select and combine unique values from the column.

  6. Applies styling to the dialog and its components for better visibility and usability.

Notes:

  • The method uses a ‘QDialog’ to allow the user to select values to combine.

  • If no categorical columns are available or if the dataset is not loaded, appropriate warnings are shown.

Raises:

QMessageBox.Warning: If no data is loaded, no categorical columns are available, or not enough unique values are present.

Returns:

QDialog Window

show_graph_categorical_dialog()

Opens a dialog to select and graph a categorical column from the loaded dataset.

Steps:

  1. Checks if data is loaded.

  2. Retrieves the categorical columns from the dataset.

  3. Opens a dialog for the user to select a categorical column.

  4. Plots a tree graph for the selected column using combined values, if available.

Raises:

QMessageBox.Warning: If no data is loaded or no categorical columns are available.

Parameters:

Column

Returns:

Visual display of graph

show_main_page()

Switches the displayed widget to the main page.

Sets the ‘stacked_widget’ to show the ‘main_page’. This method is used to navigate back to the main interface of the application.

show_metadata_for_column()

Displays metadata for the currently selected column in the dropdown.

Retrieves metadata for the selected column from self.metadata and displays it in self.metadata_display. Shows a warning if no metadata is available for the selected column.

show_preview()

Displays the first 10 rows of the loaded dataset in a preview table.

Raises:

QMessageBox

If no data is loaded.

Example:

Shows the first 10 rows of the dataset with column headers in a table view.

Notes:

  • Table height is set to a maximum of 200 pixels.

  • Dropdown menu for columns is updated accordingly.

show_preview_page()

Switches the displayed widget to the preview page.

Sets the ‘stacked_widget’ to show the ‘preview_page’. This method is used to navigate to the preview interface of the application.

show_privacy_info()
show_results_dialog(results_df, result_type='combined', mad_value=None)

Display a dialog with calculation results in a table format.

Shows the results of Combined Column Contributions or SUDA calculations in a modal dialog. The dialog includes an optional Median Absolute Deviation (MAD) value and allows saving the results as a CSV file.

Parameters:

results_df : pandas.DataFrame

result_type : str, optional

Type of results (“combined” or “suda”) for setting the dialog title and default filename.

mad_value : float, optional

Returns:

QMessageBox / QDialog.

Raises:

FileNotFoundError

If the file path for saving the CSV is invalid.

Exception

For unexpected errors during the save process.

show_splash_screen()
show_suda_info()
staticMetaObject = PySide6.QtCore.QMetaObject("metaprivBIDS" inherits "QMainWindow": )
update_column_dropdown()

Update the column dropdown menu with metadata keys.

Clears the current items in the dropdown and repopulates it with column names from the metadata. Adds a default “Select Column” option at the top.

update_frame_with_dataframe(frame, dataframe)

This function updates a frame with the contents of a DataFrame, displayed as a table. It dynamically applies colors to the ‘variable’ column if it exists, and adjusts the width of the indi_frame based on the size of the column names and the content of each column.

update_treeview(model, data_list, add_checkbox=False)

Populate the given model with data and optionally add checkboxes.

Clears the model and populates it with items from ‘data_list’. Optionally adds a checkable item to each row if ‘add_checkbox’ is True.

Parameters:

model : QStandardItemModel

The model to update with the new data.

data_list : list of list

The data to populate the model, where each sublist represents a row.

add_checkbox : bool, optional

If True, adds a checkable item to each row (default is False).

update_value_list()

Update the column selection dropdown with the current data’s columns.

Clears the existing items in the column combobox and repopulates it with the columns from the loaded dataset. If no data is loaded, only the default “Select Column” option is available.

Module contents