Variables

Base Variable Class

class corr_vars.core.extract.Variable(var_name, native, dynamic, requires=[], tmin=None, tmax=None)[source]

Bases: object

Base class for all variables.

Parameters:
  • var_name (str) – The variable name.

  • native (bool) – True if the variable is native (extracted from the database or simple aggregation of native variables).

  • dynamic (bool) – True if the variable is dynamic (time-series).

  • tmin (Union[str, tuple[str, str], None]) – The tmin argument. Can either be a string (column name) or a tuple of (column name, timedelta).

  • tmax (Union[str, tuple[str, str], None]) – The tmax argument. Can either be a string (column name) or a tuple of (column name, timedelta).

  • requires (list[str]) – List of variables required to calculate the variable.

Note that tmin and tmax can be None when you create a Variable object, but both must be set before extraction. If you add the variable via cohort.add_variable(), they are automatically set to the cohort’s tmin and tmax.
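
The two accepted forms of tmin and tmax can be illustrated with a small standalone sketch. It does not use corr_vars itself: resolve_bound and parse_timedelta are hypothetical helpers, and the offset format ("2h", "-30m") is an assumption modelled on the timedelta strings shown elsewhere in this reference.

```python
from datetime import datetime, timedelta

# Hypothetical unit table; the real library may accept other suffixes.
_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}


def parse_timedelta(text: str) -> timedelta:
    """Turn an offset such as '2h' or '-30m' into a timedelta (assumed format)."""
    text = text.strip()
    unit = _UNITS[text[-1]]
    return timedelta(**{unit: float(text[:-1])})


def resolve_bound(case: dict, bound) -> datetime:
    """Resolve a tmin/tmax specification against one case's timestamp columns."""
    if bound is None:
        raise ValueError("tmin and tmax must be set before extraction")
    if isinstance(bound, str):
        # Plain string: use the named column as is.
        return case[bound]
    # Tuple: (column name, timedelta) shifts the column by the offset.
    column, offset = bound
    return case[column] + parse_timedelta(offset)


case = {
    "hospital_admission": datetime(2024, 1, 1, 8, 0),
    "hospital_discharge": datetime(2024, 1, 5, 12, 0),
}
tmin = resolve_bound(case, ("hospital_admission", "-2h"))
tmax = resolve_bound(case, "hospital_discharge")
print(tmin, tmax)
```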

This base class should not be used directly; use one of the subclasses instead.

classmethod from_json(var_name, var_dict, tmin=None, tmax=None)[source]

Create a Variable object from a variable dictionary.

Parameters:
  • var_name (str) – The variable name.

  • var_dict (dict) – The variable dictionary (from vars.json).

  • tmin (str | tuple[str, str]) – The tmin argument. Can either be a string (column name) or a tuple of (column name, timedelta).

  • tmax (str | tuple[str, str]) – The tmax argument. Can either be a string (column name) or a tuple of (column name, timedelta).

Returns:

Variable object, depending on the variable type.

Return type:

Variable

classmethod from_corr_vars(var_name, cohort=None, tmin=None, tmax=None)[source]

Create a Variable object from a variable name.

Parameters:
  • var_name – The variable name (in vars.json).

  • cohort – Cohort object (to pass custom variable definitions).

  • tmin – The tmin argument. Can either be a string (column name) or a tuple of (column name, timedelta).

  • tmax – The tmax argument. Can either be a string (column name) or a tuple of (column name, timedelta).

Returns:

Variable object, depending on the variable type.

Return type:

Variable

call_var_function(cohort)[source]

Call the variable function if it exists.

Parameters:

cohort (Cohort)

Return type:

bool

Returns:

True if the function was called, False otherwise.

Operations will be applied to Variable.data directly.
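
The "call if it exists" contract can be sketched as follows. This is an illustration only: SketchVariable and custom_functions are hypothetical stand-ins, and looking the function up by variable name is an assumption about how the library finds it.

```python
from types import SimpleNamespace

# Stand-in for a module of custom variable functions (hypothetical).
custom_functions = SimpleNamespace()


def pf_ratio(var, cohort):
    """Example custom function: modifies var.data in place, returns nothing."""
    var.data = [row for row in var.data if row["value"] is not None]


custom_functions.pf_ratio = pf_ratio


class SketchVariable:
    """Minimal stand-in for a variable object with a data attribute."""

    def __init__(self, var_name, data):
        self.var_name = var_name
        self.data = data

    def call_var_function(self, cohort) -> bool:
        # Assumption: the function is looked up by the variable's name.
        func = getattr(custom_functions, self.var_name, None)
        if func is None:
            return False
        func(self, cohort)  # operates on self.data directly
        return True


with_function = SketchVariable("pf_ratio", [{"value": 1.0}, {"value": None}])
without_function = SketchVariable("blood_urea", [{"value": 35.0}])
print(with_function.call_var_function(cohort=None))
print(without_function.call_var_function(cohort=None))
```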

Native Variables

class corr_vars.core.extract.NativeDynamic(var_name, table, where, value_dtype, cleaning, tmin=None, tmax=None)[source]

Bases: NativeVariable

Native dynamic variables are extracted directly from the database and represent time-series data.

The extracted data will be in long format, including columns like recordtime, value, and depending on the table definition, recordtime_end (e.g., for therapy events) and description (e.g., medication names). The resulting dataframe will be available as Cohort.obsm["var_name"] or as Variable.data.

The cleaning term will be applied at the end of the extraction process.

Parameters:
  • var_name (str) – Name of the variable.

  • table (str) – Source table to extract from (e.g., it_copra6_hierachy_v2, it_copra6_therapy). See HDL Hue for a list of available tables.

  • where (str) – SQL statement to filter the source table. Must use column names available in the source table.

  • value_dtype (str) – SQL data type of the value column, e.g., BIGINT, BOOLEAN, DATETIME, DOUBLE, STRING.

  • cleaning (dict) – Dictionary specifying lower and upper bounds for impossible values. Note: Should only include physically impossible values, not unlikely/percentile-based values. Consult a clinician if unsure!

  • tmin – Minimum time for the extraction.

  • tmax – Maximum time for the extraction.
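
To make the shape of the cleaning dictionary concrete, here is a minimal standalone sketch. apply_cleaning is a hypothetical helper, not the library's implementation, and it assumes that bounds are inclusive and that out-of-range rows are dropped; the library may treat such values differently, for example by setting them to null.

```python
def apply_cleaning(rows, cleaning):
    """Keep only rows whose columns lie within the configured bounds.

    Assumptions: bounds are inclusive and violating rows are removed.
    """
    if not cleaning:
        return list(rows)
    kept = []
    for row in rows:
        in_range = True
        for column, bounds in cleaning.items():
            value = row[column]
            low = bounds.get("low")
            high = bounds.get("high")
            if low is not None and value < low:
                in_range = False
            if high is not None and value > high:
                in_range = False
        if in_range:
            kept.append(row)
    return kept


rows = [{"value": 1.0}, {"value": 35.0}, {"value": 900.0}]
cleaned = apply_cleaning(rows, {"value": {"low": 2, "high": 500}})
print(cleaned)
```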

Examples

>>> # Basic lab value extraction
>>> v = NativeDynamic(
...     var_name="blood_urea",
...     table="it_ishmed_labor",
...     where="c_katalog_leistungtext LIKE '%arnstoff%' AND c_wert <> '0'",
...     value_dtype="DOUBLE",
...     cleaning={"value": {"low": 2, "high": 500}},
...     tmin="hospital_admission",
...     tmax="hospital_discharge"
... )
>>> v.extract(cohort)
>>> v.data
>>> # Therapy event with start/end times
>>> # End time will be automatically added as recordtime_end for specified tables
>>> v = NativeDynamic(
...     var_name="ecmo_vva_vav_icu",
...     table="it_copra6_therapy",
...     where="c_apparat_mode IN ('v-v/a ECMO','v-a/v ECMO')",
...     value_dtype="VARCHAR",
...     cleaning=None,
...     tmin="hospital_admission",
...     tmax="hospital_discharge"
... )
>>> v.extract(cohort)
>>> v.data

on_admission(select='!first value')[source]

Create a new NativeStatic variable based on the current variable that extracts the value on admission.

Parameters:

select (str) – Select clause specifying aggregation function and columns. Defaults to !first value.

Return type:

NativeStatic

Returns:

NativeStatic

Examples

>>> # Return the first value
>>> var_adm = variable.on_admission()
>>> cohort.add_variable(var_adm)
>>> # Be more specific with your selection
>>> var_adm = variable.on_admission("!closest(hospital_admission,0,2h) value")
>>> cohort.add_variable(var_adm)
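
The selection performed by a clause such as "!closest(hospital_admission,0,2h) value" can be approximated in plain Python. The closest function below is a hypothetical sketch rather than the library's code; it assumes the window is inclusive and that omitting a bound leaves that side unbounded.

```python
from datetime import datetime, timedelta


def closest(rows, target, offset=timedelta(0), before=None, after=None):
    """Return the row whose recordtime is nearest to target + offset.

    before/after limit the allowed mismatch on each side; None means
    no limit on that side (assumption).
    """
    reference = target + offset
    candidates = []
    for row in rows:
        delta = row["recordtime"] - reference
        if before is not None and delta < -before:
            continue
        if after is not None and delta > after:
            continue
        candidates.append((abs(delta), row))
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[0])[1]


admission = datetime(2024, 1, 1, 8, 0)
rows = [
    {"recordtime": datetime(2024, 1, 1, 4, 30), "value": 140.0},
    {"recordtime": datetime(2024, 1, 1, 9, 30), "value": 138.0},
    {"recordtime": datetime(2024, 1, 1, 12, 0), "value": 135.0},
]
two_hours = timedelta(hours=2)

# Comparable to "!closest(hospital_admission, 0, 2h) value"
on_admission = closest(rows, admission, before=two_hours, after=two_hours)
# Comparable to "!closest(hospital_admission, 6h, 2h) value"
six_hours_later = closest(
    rows, admission, offset=timedelta(hours=6), before=two_hours, after=two_hours
)
print(on_admission["value"], six_hours_later["value"])
```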

class corr_vars.core.extract.NativeStatic(var_name, select, base_var, where=None, tmin=None, tmax=None)[source]

Bases: NativeVariable

NativeStatic variables represent simple aggregations of NativeDynamic variables.

Parameters:
  • var_name (str) – Name of the variable.

  • select (str) – Select clause specifying aggregation function and columns.

  • base_var (str) – Name of the base variable (must be a native_dynamic variable).

  • where (str, optional) – Optional WHERE clause (in format for polars).

  • tmin (str, optional) – Minimum time for the extraction.

  • tmax (str, optional) – Maximum time for the extraction.

The select argument supports several aggregation functions:

  • !first [columns]: Returns the first row within this case
    >>> "!first value"  # Single column
    >>> "!first value, recordtime"  # Multiple columns
    
  • !last [columns]: Returns the last row within this case
    >>> "!last value"
    >>> "!last value, recordtime"
    
  • !any: Returns True if any value exists
    >>> "!any"
    >>> "!any value"
    
  • !closest(to_column, timedelta, plusminus) [columns]: Selects value closest to specified column
    Args:

    to_column: Column to compare “recordtime” against
    timedelta: Time to add to “to_column” for comparison
    plusminus: Allowed time mismatch (can specify different before/after with space)

    >>> "!closest(hospital_admission) value, recordtime"  # Closest to admission
    >>> "!closest(hospital_admission, 0, 2h 3h) value"  # 2h before to 3h after
    >>> "!closest(first_intubation_dtime, 6h, 2h) value"  # 6h after intubation ±2h
    
  • !mean [column]: Calculates mean value
    >>> "!mean value"
    
  • !median [column]: Calculates median value
    >>> "!median value"
    
  • !perc(quantile) [column]: Calculates specified percentile
    >>> "!perc(75) value"  # 75th percentile
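
As a rough guide to what the simpler aggregation functions compute for a single case, the following standalone sketch mirrors them with the Python standard library. The percentile helper uses linear interpolation, which is an assumption; the library may use a different method. Here recordtime is simplified to hours since admission.

```python
from statistics import mean, median


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation (assumed method)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Long-format rows of one case, deliberately out of time order.
rows = [
    {"recordtime": 5, "value": 135.0},
    {"recordtime": 1, "value": 140.0},
    {"recordtime": 9, "value": 137.0},
    {"recordtime": 3, "value": 138.0},
]
ordered_rows = sorted(rows, key=lambda row: row["recordtime"])
values = [row["value"] for row in ordered_rows]

summary = {
    "!first value": values[0],
    "!last value": values[-1],
    "!any": len(values) > 0,
    "!mean value": mean(values),
    "!median value": median(values),
    "!perc(75) value": percentile(values, 75),
}
for clause, result in summary.items():
    print(clause, "->", result)
```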
    

The where argument supports pandas-style boolean expressions, which are evaluated in the context of the base variable by pd.eval(). It also supports magic commands (starting with !) to filter the data. Supported commands are:

  • !isin(column, [values]): Filters rows where the value in column is in values

  • !startswith(column, [values]): Filters rows where the value in column starts with any of the values

  • !endswith(column, [values]): Filters rows where the value in column ends with any of the values
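
The filtering behaviour described for the three magic commands can be sketched in plain Python. These functions are hypothetical equivalents operating on a list of dictionaries, not the library's parser or implementation, and the description values are made up for illustration.

```python
def isin(rows, column, values):
    """Keep rows whose column value is one of the given values."""
    return [row for row in rows if row[column] in values]


def startswith(rows, column, prefixes):
    """Keep rows whose column value starts with any of the prefixes."""
    return [row for row in rows if row[column].startswith(tuple(prefixes))]


def endswith(rows, column, suffixes):
    """Keep rows whose column value ends with any of the suffixes."""
    return [row for row in rows if row[column].endswith(tuple(suffixes))]


# Illustrative rows; the description values are invented.
rows = [
    {"description": "Noradrenalin", "value": 0.1},
    {"description": "Norepinephrine", "value": 0.2},
    {"description": "Propofol 1%", "value": 10.0},
    {"description": "Propofol 2%", "value": 5.0},
]

print(isin(rows, "description", ["Propofol 1%"]))
print(startswith(rows, "description", ["Nor"]))
print(endswith(rows, "description", ["1%", "2%"]))
```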

extract(cohort, use_cache=True)[source]

Extract the variable. You do not need to call this yourself, as it is called internally when you add the variable to a cohort. However, you may call it directly to obtain variable data independently of the cohort. You still need a cohort object for case ids and other metadata.

Parameters:
  • cohort (Cohort) – Cohort object.

  • use_cache (bool) – Whether to use cached base variable. This is highly recommended, as direct SQL-extraction is currently being phased out.

Return type:

pd.DataFrame

Returns:

Extracted variable.

After extraction, you may also access the data as Variable.data.

Examples

>>> var = NativeStatic(
...     var_name="first_sodium_recordtime",
...     select="!first recordtime",
...     base_var="blood_sodium",
...     tmin="hospital_admission"
... )
>>> var.extract(cohort) # With var.extract(), the data will not be added to the cohort.
>>> var.data

Derived Variables

class corr_vars.core.extract.DerivedDynamic(var_name, requires, cleaning=None, tmin=None, tmax=None)[source]

Bases: DerivedVariable

Derived dynamic variables are extracted using a custom function.

Warning

You cannot add these variables manually yet, as they always require a custom function in variables.py. This will be addressed in the future.

Parameters:
  • var_name – Name of the variable.

  • requires (list[str]) – List of required variables.

  • tmin – Minimum time for the extraction.

  • tmax – Maximum time for the extraction.

  • cleaning (Optional[dict]) – Cleaning parameters ({column_name: {low: int, high: int}})

Examples

The following example does not work on its own yet, because a DerivedDynamic variable still needs a matching custom function in variables.py.

>>> DerivedDynamic(
...     var_name="pf_ratio",
...     requires=["blood_pao2_arterial", "vent_fio2"])

extract(cohort)[source]

Extract the variable.

Parameters:

cohort (Cohort) – Cohort object.

Returns:

Extracted variable.

Return type:

pd.DataFrame

class corr_vars.core.extract.DerivedStatic(var_name, requires, expression=None, tmin=None, tmax=None)[source]

Bases: DerivedVariable

Derived static variables are extracted using an expression.

Parameters:
  • var_name – Name of the variable.

  • requires (list[str]) – List of required variables.

  • expression (Optional[str]) – Expression to extract the variable.

  • tmin – Minimum time for the extraction.

  • tmax – Maximum time for the extraction.

Note that DerivedStatic variables are executed on the cohort.obs dataframe and must reference existing columns in cohort.obs.

For DerivedStatic variables, you may either provide an expression or a custom function in variables.py. Use expressions where possible, but custom functions if you require more complex logic.
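
The requirement that an expression may only reference existing columns can be illustrated with a row-wise stand-in. The library evaluates expressions on the cohort.obs dataframe; the hypothetical evaluate_expression helper below uses plain dictionaries and Python's eval instead, purely for illustration and only with trusted expressions.

```python
def evaluate_expression(obs_rows, requires, expression):
    """Evaluate an expression once per case, using only the required columns.

    Illustrative stand-in for dataframe evaluation, not the library's code.
    """
    results = []
    for row in obs_rows:
        missing = [name for name in requires if name not in row]
        if missing:
            raise KeyError(f"columns missing from obs: {missing}")
        namespace = {name: row[name] for name in requires}
        # Builtins are disabled; only the required columns are visible.
        results.append(eval(expression, {"__builtins__": {}}, namespace))
    return results


obs = [
    {"ecmo_va_icu_ops": True, "ecmo_va_icu": False},
    {"ecmo_va_icu_ops": False, "ecmo_va_icu": False},
]
any_va_ecmo_icu = evaluate_expression(
    obs,
    requires=["ecmo_va_icu_ops", "ecmo_va_icu"],
    expression="ecmo_va_icu_ops | ecmo_va_icu",
)
print(any_va_ecmo_icu)
```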

Examples

>>> DerivedStatic(
...     var_name="inhospital_death",
...     requires=["hospital_discharge", "death_timestamp"],
...     expression="hospital_discharge <= death_timestamp"
... )
>>> DerivedStatic(
...     var_name="any_va_ecmo_icu",
...     requires=["ecmo_va_icu_ops", "ecmo_va_icu"],
...     expression="ecmo_va_icu_ops | ecmo_va_icu"
... )

extract(cohort)[source]

Parameters:

cohort (Cohort)

Return type:

pd.DataFrame

class corr_vars.core.extract.ComplexVariable(var_name, dynamic, requires, tmin=None, tmax=None)[source]

Bases: DerivedVariable

A derived variable that is computed by a custom function. Apart from not taking an expression, it behaves like any other DerivedVariable.

Parameters:
  • var_name – Name of the variable.

  • dynamic – Whether the variable is dynamic.

  • requires (list[str]) – List of required variables.

  • tmin – Minimum time for the extraction.

  • tmax – Maximum time for the extraction.

The ComplexVariable requires a custom function in variables.py.

extract(cohort)[source]