CORR-Variables
CORR-Variables is a Python package for extracting and analyzing data from the Charité Outcomes Research Repository (CORR). It functions as a connector on top of the Hadoop-based Health Data Lake (HDL) and preprocesses the data into clinically meaningful, quality-checked variables to streamline research with real-world data at our institution.
Installation
The CORR-Vars package is pre-installed and regularly updated on the IMI server (s-c01-imi-app01.charite.de).
Make sure to activate the correct Python environment before using it:
conda activate /data02/projects/icurepo/.pkg/env10
To install the package on your local machine, run the following command. Note: this only works if you have access to the private GitHub repository.
pip install git+https://github.com/thielem/corr-vars.git
API Reference
- Cohort
- Variables
- Utils
  - json_list_sql()
  - filter_by_condition()
  - merge_consecutive_stays()
  - df_find_closest()
  - aggregate_column()
  - parse_select()
  - parse_col_list()
  - convert_interval_sql()
  - get_cb_values()
  - build_base_query()
  - get_time_series()
  - parse_time_args()
  - sql_to_pandas()
  - get_variables()
  - extract_df_data()
  - merge_consecutive()
  - extract_with_co6_parent()
Quick Start
# Will import all public classes (Cohort, Variable, etc.)
from corr_vars import *
# Initialize a cohort
cohort = Cohort(obs_level="hospital_stay")
cohort.load_default_vars()
# View the first 5 rows
print(cohort.obs.head())
Core Components
Cohort
The main class for handling patient cohorts. Supports different observation levels:
| Observation Level | Primary Key | tmin | tmax |
|---|---|---|---|
| hospital_stay | case_id | hospital_admission | hospital_discharge |
| icu_stay | icu_stay_id | icu_admission | icu_discharge |
| procedure | procedure_id | op_start_dtime_any | op_end_dtime_any |
cohort = Cohort(obs_level="hospital_stay")
# Save cohort to file
cohort.save("my_cohort.corr")
# Load cohort from file
cohort = Cohort.load("my_cohort.corr")
# Export to CSV
cohort.to_csv("output_folder")
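The observation levels listed in the table above can also be written down as a small lookup structure. The following is only an illustrative sketch: `ObsLevel`, `OBS_LEVELS`, and `describe()` are hypothetical names made up for this example and are not part of the corr_vars API. Only the values are taken from the table.

```python
from typing import NamedTuple


class ObsLevel(NamedTuple):
    """Hypothetical container mirroring one row of the observation-level table."""
    primary_key: str
    tmin: str
    tmax: str


# Values copied from the table; the dict itself is illustrative
# and does not exist in corr_vars.
OBS_LEVELS = {
    "hospital_stay": ObsLevel("case_id", "hospital_admission", "hospital_discharge"),
    "icu_stay": ObsLevel("icu_stay_id", "icu_admission", "icu_discharge"),
    "procedure": ObsLevel("procedure_id", "op_start_dtime_any", "op_end_dtime_any"),
}


def describe(obs_level: str) -> str:
    """Summarize the primary key and time bounds for one observation level."""
    level = OBS_LEVELS[obs_level]
    return (
        f"{obs_level}: one row per {level.primary_key}, "
        f"bounded by {level.tmin} and {level.tmax}"
    )


print(describe("icu_stay"))
```

Read this way, choosing an `obs_level` determines which identifier each row of `cohort.obs` represents and which two timestamps delimit it.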
Variables
Different types of variables are supported:
- NativeDynamic: Time-series variables extracted from the database
- NativeStatic: Static variables from the database or simple aggregations based on NativeDynamic variables
- DerivedStatic: Computed static variables
- DerivedDynamic: Computed time-series variables
- Complex: Custom variables. Can be anything defined by the Python function provided by the user.
To view all available variables, we recommend using the Graphical Variable Explorer.
# Initialize cohort
cohort = Cohort(obs_level="icu_stay")
# Add static variables
# These are added to cohort.obs DataFrame
cohort.add_variable('any_proning_icu')
>>> cohort.obs.head()
icu_stay_id any_proning_icu ...
0 12345 True ...
1 12346 False ...
2 12347 True ...
...
# Add dynamic (time-series) variables
# These are added to cohort.obsm dictionary
cohort.add_variable('blood_sodium')
>>> cohort.obsm.keys()
['blood_sodium']
# Access time-series data
>>> cohort.obsm['blood_sodium'].head()
icu_stay_id recordtime value
0 12345 2024-01-01 08:00 140
1 12345 2024-01-01 12:00 138
2 12345 2024-01-01 16:00 142
...
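To illustrate how a static value relates to a dynamic one (NativeStatic variables can be simple aggregations of NativeDynamic variables), here is a minimal sketch in plain Python. It does not use corr_vars and does not show how the package implements aggregation. The helper `aggregate_per_stay()` is hypothetical, and the rows are the sample values from the `blood_sodium` example above.

```python
from collections import defaultdict
from statistics import mean

# Rows shaped like the blood_sodium example: (icu_stay_id, recordtime, value).
records = [
    (12345, "2024-01-01 08:00", 140),
    (12345, "2024-01-01 12:00", 138),
    (12345, "2024-01-01 16:00", 142),
]


def aggregate_per_stay(rows, func):
    """Hypothetical helper: collapse a time series to one value per stay."""
    values = defaultdict(list)
    for stay_id, _recordtime, value in rows:
        values[stay_id].append(value)
    return {stay_id: func(vals) for stay_id, vals in values.items()}


# One value per icu_stay_id, i.e. the shape of a static variable.
mean_sodium = aggregate_per_stay(records, mean)
max_sodium = aggregate_per_stay(records, max)
print(mean_sodium, max_sodium)
```

The result has one entry per `icu_stay_id`, which is the shape a static column in `cohort.obs` would have, whereas the input keeps one row per measurement as in `cohort.obsm`.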
Development
The source code is available on GitHub: https://github.com/thielem/corr-vars
Version
Current version: 0.2.0