Utils
This part of the documentation is under construction and will be restructured shortly. It lists general-purpose helper functions that are used across the module and can also be used in individual variable definitions.
- corr_vars.utils.helpers.json_list_sql(json_obj)
Convert a JSON list to string for SQL IN clause.
- Parameters:
json_obj (list[str]) – The JSON list.
- Returns:
The SQL list.
- Return type:
str
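The conversion can be illustrated with a minimal sketch. The function name json_list_sql_sketch, the parenthesised output, and the quote escaping are assumptions made for illustration; the actual helper may format its result differently.

```python
def json_list_sql_sketch(json_obj: list[str]) -> str:
    """Hypothetical stand-in: render a list of strings for a SQL IN clause."""
    # Double embedded single quotes, the standard SQL escape in string literals.
    quoted = ["'" + str(item).replace("'", "''") + "'" for item in json_obj]
    return "(" + ", ".join(quoted) + ")"


sql_list = json_list_sql_sketch(["A01", "B02"])
print(f"SELECT * FROM cases WHERE code IN {sql_list}")
```

For values that come from untrusted input, bound query parameters are generally a safer choice than string building.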
- corr_vars.utils.helpers.filter_by_condition(df, condition_func, description='', verbose=True, mode='drop')
Drop or keep rows of a DataFrame based on a condition function.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
condition_func (callable) – A function that takes the DataFrame and returns a boolean Series.
description (str) – Description of the condition.
mode (str) – Whether to drop or keep the rows. Can be “drop” or “keep”. Defaults to “drop”.
- Returns:
The DataFrame with rows dropped or kept based on the condition, depending on mode.
- Return type:
pd.DataFrame
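A minimal sketch of how such a filter might behave is shown below. It assumes that "drop" removes the rows where the condition is True and "keep" retains only those rows, and that verbose prints a short summary. Both points, and the name filter_by_condition_sketch, are assumptions and not confirmed by the reference above.

```python
import pandas as pd


def filter_by_condition_sketch(df, condition_func, description="", verbose=True, mode="drop"):
    """Hypothetical stand-in for filter_by_condition."""
    mask = condition_func(df)  # boolean Series aligned with df
    if mode == "drop":
        result = df[~mask]
    elif mode == "keep":
        result = df[mask]
    else:
        raise ValueError("mode must be 'drop' or 'keep'")
    if verbose:
        # Assumed reporting format; the real helper may log differently.
        print(f"{description}: {len(df) - len(result)} rows removed, {len(result)} remaining")
    return result


cases = pd.DataFrame({"case_id": [1, 2, 3], "age": [17, 45, 80]})
adults = filter_by_condition_sketch(cases, lambda d: d["age"] < 18, description="Exclude minors")
```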
- corr_vars.utils.helpers.merge_consecutive_stays(group)
Merge consecutive ICU stays in a dataframe.
- Parameters:
group (pd.DataFrame) – ICU stays for a single case
- Returns:
The merged dataframe.
- Return type:
pd.DataFrame
- corr_vars.utils.helpers.df_find_closest(group, to_col, tdelta='0', pm_before='52w', pm_after='52w')
Find the closest record to the target time.
- Parameters:
group (pd.DataFrame) – The input dataframe.
to_col (str) – The column containing the target time.
tdelta (str) – The time delta. Defaults to “0”.
pm_before (str) – The time range before the target time. Defaults to “52w”.
pm_after (str) – The time range after the target time. Defaults to “52w”.
- Returns:
The closest record.
- Return type:
pd.Series
- corr_vars.utils.helpers.aggregate_column(group, agg_func, column, params=[])
- Parameters:
group (DataFrameGroupBy)
agg_func (str)
column (str)
params (list[str])
- corr_vars.utils.helpers.parse_select(select_str)
Parse the select syntax to extract function name, parameters and columns.
- Parameters:
select_str (str) – The select string.
- Returns:
(function_name, params, columns)
- Return type:
tuple
Examples
!first value, recordtime
!closest(timestamp, 1d, 2d) value, recordtime
!any value
!last value, recordtime
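The select syntax in the examples above can be parsed with a single regular expression. The following is a sketch only: the name parse_select_sketch, the regular expression, and the choice to return params and columns as lists are assumptions, not the module's implementation.

```python
import re

# "!" + function name, optional "(param, ...)", then a comma-separated column list.
_SELECT_RE = re.compile(r"^!(?P<func>\w+)(?:\((?P<params>[^)]*)\))?\s+(?P<cols>.+)$")


def parse_select_sketch(select_str):
    """Hypothetical stand-in: split a select string into (function_name, params, columns)."""
    match = _SELECT_RE.match(select_str.strip())
    if match is None:
        raise ValueError(f"Not a valid select string: {select_str!r}")
    params_raw = match.group("params") or ""
    params = [p.strip() for p in params_raw.split(",") if p.strip()]
    columns = [c.strip() for c in match.group("cols").split(",") if c.strip()]
    return match.group("func"), params, columns


print(parse_select_sketch("!closest(timestamp, 1d, 2d) value, recordtime"))
```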
- corr_vars.utils.helpers.parse_col_list(cols, var_name)
Parse a list of columns to a select string.
- Parameters:
cols (list[str]) – The columns to parse.
var_name (str) – The variable name.
- Returns:
The select string.
- Return type:
str
Examples
["value", "recordtime"] -> "value as var_name_value, recordtime as var_name_recordtime"
["value"] -> "value AS var_name"
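The two cases in the examples (a single column is aliased to the variable name, several columns are aliased to the variable name plus the column name) can be sketched as follows. The name parse_col_list_sketch is hypothetical, and the sketch writes the keyword AS in upper case throughout, which is an assumption; SQL keywords are case-insensitive.

```python
def parse_col_list_sketch(cols, var_name):
    """Hypothetical stand-in: build the aliased column list of a SELECT clause."""
    if len(cols) == 1:
        # A single column takes the variable name directly.
        return f"{cols[0]} AS {var_name}"
    # Several columns are disambiguated with a "<var_name>_<column>" alias.
    return ", ".join(f"{col} AS {var_name}_{col}" for col in cols)


print(parse_col_list_sketch(["value", "recordtime"], "sofa"))
```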
- corr_vars.utils.helpers.convert_interval_sql(interval_str)
Convert shorthand interval notation to SQL INTERVAL syntax. Supports: y->year, M->month, w->week, d->day, h->hour, m->minute, s->second
- Parameters:
interval_str (str) – The interval string.
- Returns:
The SQL INTERVAL syntax.
- Return type:
str
Examples
"2h" -> "INTERVAL '2' HOUR"
"-2h" -> "INTERVAL '-2' HOUR"
"30m" -> "INTERVAL '30' MINUTE"
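The unit mapping and the examples above suggest a conversion along the following lines. This is a sketch under assumptions: the name convert_interval_sql_sketch, the restriction to whole numbers, and the error raised on invalid input are illustrative and may differ from the actual helper.

```python
import re

# Unit letters as listed above; note that "M" (month) and "m" (minute) differ only in case.
_UNITS = {"y": "YEAR", "M": "MONTH", "w": "WEEK", "d": "DAY", "h": "HOUR", "m": "MINUTE", "s": "SECOND"}


def convert_interval_sql_sketch(interval_str):
    """Hypothetical stand-in: turn shorthand such as "2h" into SQL INTERVAL syntax."""
    match = re.fullmatch(r"\s*(-?\d+)\s*([yMwdhms])\s*", interval_str)
    if match is None:
        raise ValueError(f"Unsupported interval: {interval_str!r}")
    amount, unit = match.groups()
    return f"INTERVAL '{amount}' {_UNITS[unit]}"


print(convert_interval_sql_sketch("-2h"))
```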
- corr_vars.utils.helpers.get_cb_values(df_chunk, primary_key='case_id', tmin_col='tmin', tmax_col='tmax', ttarget_col=None)
Get the case bounds CTE. This is used to apply a time filter to a native dynamic variable.
- Parameters:
df_chunk (pd.DataFrame) – A chunk of the cohort dataframe.
primary_key (str) – The primary key column. Defaults to “case_id”.
tmin_col (str) – The column containing the tmin. Defaults to “tmin”.
tmax_col (str) – The column containing the tmax. Defaults to “tmax”.
ttarget_col (str) – The column containing the target time. Defaults to None. (For !closest)
- Returns:
The case bounds CTE.
- Return type:
str
- corr_vars.utils.helpers.build_base_query(cb, var, database, primary_key, extr_end_date, ttarget_col=None)
Build the base query for a native dynamic variable.
- Parameters:
cb (str) – The case bounds CTE.
var (NativeDynamic) – The native dynamic variable.
database (str) – The database name.
primary_key (str) – The primary key column.
extr_end_date (str) – The extraction end date.
ttarget_col (str) – The target time column. Defaults to None.
- Returns:
Query, table info, and columns to keep.
- Return type:
tuple
- corr_vars.utils.helpers.get_time_series(obs, col_name, tdelta='0')
Return a series with the values of col_name (in datetime format) shifted by tdelta (in pd.Timedelta format).
- Parameters:
obs (pd.DataFrame) – The observation dataframe.
col_name (str) – The column name.
tdelta (str) – The time delta. Defaults to “0”.
- Returns:
A series with the time column + tdelta.
- Return type:
pd.Series
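The time shift can be sketched in pandas as shown below. The name get_time_series_sketch and the example column icu_admission are hypothetical; the sketch assumes tdelta is any string that pd.Timedelta accepts.

```python
import pandas as pd


def get_time_series_sketch(obs, col_name, tdelta="0"):
    """Hypothetical stand-in: return obs[col_name] shifted by tdelta."""
    # pd.to_datetime leaves datetime columns unchanged and parses string columns.
    return pd.to_datetime(obs[col_name]) + pd.Timedelta(tdelta)


obs = pd.DataFrame(
    {"icu_admission": pd.to_datetime(["2024-01-01 12:00", "2024-01-02 08:30"])}
)
six_hours_before = get_time_series_sketch(obs, "icu_admission", "-6h")
print(six_hours_before)
```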
- corr_vars.utils.helpers.parse_time_args(obs, tmin=None, tmax=None)
Parse the time arguments to get the time series.
- Parameters:
obs (pd.DataFrame) – The observation dataframe.
tmin (str | tuple[str, str] | None) – The tmin argument. Defaults to None.
tmax (str | tuple[str, str] | None) – The tmax argument. Defaults to None.
- Returns:
tmin (pd.Series), tmax (pd.Series).
- Return type:
tuple
- corr_vars.utils.helpers.sql_to_pandas(query)
Converts SQL-like expressions to pandas-compatible expressions. Supported operations include IN, NOT IN, LIKE, AND, OR.
- Parameters:
query (str) – The SQL-like expression.
- Returns:
The equivalent pandas-compatible expression.
- Return type:
str
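As a rough illustration of this kind of translation, the sketch below rewrites IN, NOT IN, AND and OR into the syntax accepted by DataFrame.query. It is an assumption that the helper targets that syntax; the name sql_to_pandas_sketch is hypothetical, LIKE is left out, and keywords inside quoted string literals are not protected.

```python
import re


def sql_to_pandas_sketch(query):
    """Hypothetical stand-in: rewrite a small SQL subset for DataFrame.query."""
    expr = query
    # Handle NOT IN first so that the plain IN rule cannot consume it.
    expr = re.sub(r"(\w+)\s+NOT\s+IN\s*\(([^)]*)\)", r"\1 not in [\2]", expr, flags=re.IGNORECASE)
    expr = re.sub(r"(\w+)\s+IN\s*\(([^)]*)\)", r"\1 in [\2]", expr, flags=re.IGNORECASE)
    # Boolean connectives are lower case in pandas expressions.
    expr = re.sub(r"\bAND\b", "and", expr, flags=re.IGNORECASE)
    expr = re.sub(r"\bOR\b", "or", expr, flags=re.IGNORECASE)
    return expr


print(sql_to_pandas_sketch("unit IN ('ICU', 'IMC') AND sex NOT IN ('x')"))
```

The resulting string has the form that DataFrame.query evaluates, for example df.query("unit in ['ICU', 'IMC']").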
- corr_vars.utils.helpers.extract_df_data(df, col_dict=None, filter_dict=None, exact_match=False, remove_prefix=False, drop=False)
Extracts data from a DataFrame.
- Parameters:
df (pandas.DataFrame) – The DataFrame to operate on.
col_dict (dict, optional) – A dictionary mapping column names to new names. Defaults to None.
filter_dict (dict[str, list[str]], optional) – A dictionary where keys are column names and values are lists of values to filter rows by (may include regex pattern for exact_match=False). Defaults to None.
exact_match (bool, optional) – If True, performs exact matching when filtering. Defaults to False.
remove_prefix (bool, optional) – If True, removes prefix from default_key. Defaults to False.
drop (bool, optional) – If True, drop all columns not specified in col_dict.
- Returns:
A DataFrame containing the extracted data from the original DataFrame.
- Return type:
pandas.DataFrame
- corr_vars.utils.helpers.merge_consecutive(data, primary_key, recordtime='recordtime', recordtime_end='recordtime_end', time_threshold=Timedelta('0 days 00:30:00'))
Combine consecutive sessions separated by less than time_threshold (30 minutes by default) into a single session, for example sessions of ecmo_vv_icu.
- Parameters:
data (pd.DataFrame) – The data to merge.
primary_key (str) – The primary key column.
recordtime (str) – The recordtime column. Defaults to “recordtime”.
recordtime_end (str) – The recordtime_end column. Defaults to “recordtime_end”.
time_threshold (pd.Timedelta) – The time threshold. Defaults to 30 minutes.
- Returns:
The merged data.
- Return type:
pd.DataFrame
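One way to merge sessions like this in pandas is sketched below. The name merge_consecutive_sketch and the temporary _session column are hypothetical, and the sketch only compares each session with the one directly before it, so it is an illustration of the idea and not the module's implementation.

```python
import pandas as pd


def merge_consecutive_sketch(data, primary_key, recordtime="recordtime",
                             recordtime_end="recordtime_end",
                             time_threshold=pd.Timedelta(minutes=30)):
    """Hypothetical stand-in: merge sessions whose gap is below time_threshold."""
    data = data.sort_values([primary_key, recordtime]).reset_index(drop=True)
    # End of the previous session within the same case (NaT for the first one).
    prev_end = data.groupby(primary_key)[recordtime_end].shift()
    # A new session starts at the first row of a case or after a long enough gap.
    new_session = prev_end.isna() | ((data[recordtime] - prev_end) >= time_threshold)
    data = data.assign(_session=new_session.cumsum())
    return (
        data.groupby([primary_key, "_session"], as_index=False)
        .agg(**{recordtime: (recordtime, "min"), recordtime_end: (recordtime_end, "max")})
        .drop(columns="_session")
    )


sessions = pd.DataFrame({
    "case_id": [1, 1, 1, 2],
    "recordtime": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 11:10",
                                  "2024-01-01 14:00", "2024-01-01 09:00"]),
    "recordtime_end": pd.to_datetime(["2024-01-01 11:00", "2024-01-01 12:00",
                                      "2024-01-01 15:00", "2024-01-01 09:30"]),
})
merged = merge_consecutive_sketch(sessions, "case_id")
print(merged)
```

In this example the first two sessions of case 1 are 10 minutes apart and collapse into one session, while the third starts two hours later and stays separate.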
- corr_vars.utils.helpers.extract_with_co6_parent(parent_name, suffixes, df, cohort, dbpass=None, table='it_copra6_hierarchy_v2', chunk_size=30000)
Extract variables from the specified table following the Co6 hierarchy schema. Child elements of a parent variable are merged. You may reference this file for available parent and child relations.
- Parameters:
parent_name (str) – The parent name.
suffixes (dict[str, str]) – Dictionary mapping suffixes to column names. Must include a suffix that maps to “recordtime”.
df (pd.DataFrame) – Dataframe with case_id, tmin, tmax columns. [Copy of cohort.obs]
cohort (Cohort) – The cohort.
dbpass (str) – The database password (only to be passed from the Variable.extract() method, do not specify passwords in your code).
table (str) – The table to extract from. Defaults to “it_copra6_hierarchy_v2”.
chunk_size (int) – The chunk size. Defaults to 30000.
- Returns:
The extracted data (columns: case_id, recordtime, plus any additional columns specified)
- Return type:
pd.DataFrame
Examples
>>> df = extract_with_co6_parent(
...     parent_name="Score_SOFA",
...     suffixes={
...         "_Wert": "value",
...         "_Date": "recordtime",
...     },
...     df=cohort.obs,
...     cohort=cohort,
... )