Most Accurate Synthetic Healthcare Data: Back to the Future, aka TENET, aka Reverse Model

by Ricky Sahu
2024-08-14

GenHealth can now create the most accurate synthetic patient histories for patients with any combination of conditions, procedures, demographics, medications, etc. We did this by reversing our data and training a new reverse model on it. We run inference on that reverse model for one or more trigger events, then reverse the output to create a synthetic history. Finally, we feed that synthetic history and the trigger event(s) into the forward model to create a synthetic future.

If you followed that, yes, that is exactly what we did! If you didn’t, here’s more detail.

Our Large Medical Model (LMM) predicts the future sequence of events by training on past sequences. To create our reverse model, we simply reversed the series of events in the training data. We then combine the reverse model’s output with that of our normal forward-looking model:

  1. Reverse patient histories
  2. Train a new model on those reversed histories to create a reverse model
  3. Run inference on the reverse model to create a set of events that goes backwards in time
  4. Reverse that set of events to get the temporally forward series of events, thereby creating a synthetic history
  5. Run inference on our forward model with that synthetic history to create additional events after the trigger event
  6. Concatenate the synthetic history + the trigger event + the synthetic future to create the entirely synthetic patient sequence
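
The six steps above can be sketched end to end with a toy stand-in for the models. This is only an illustration under stated assumptions: the bigram "model", the function names, and the event tokens are all hypothetical and are not GenHealth's LMM or its API. The point is the data flow: reverse the training data, generate backwards from the trigger, flip the result, then generate forwards.

```python
from collections import Counter, defaultdict

def train_bigram_model(sequences, max_new_tokens=3):
    """Toy stand-in for model training (hypothetical, not the LMM).

    Learns which token most often follows each token, and returns a
    function that continues a prompt with up to max_new_tokens tokens.
    """
    table = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            table[current][nxt] += 1

    def generate(prompt):
        new_tokens = []
        last = prompt[-1]
        for _ in range(max_new_tokens):
            if last not in table:
                break  # nothing was ever observed after this token
            last = table[last].most_common(1)[0][0]
            new_tokens.append(last)
        return new_tokens  # only the continuation, not the prompt

    return generate

def synthesize_patient(trigger, forward_model, reverse_model):
    # Step 3: the reverse model walks backwards in time from the trigger.
    backwards = reverse_model(list(reversed(trigger)))
    # Step 4: flip the output so the synthetic history reads forwards.
    history = list(reversed(backwards))
    # Step 5: the forward model continues after history + trigger.
    future = forward_model(history + trigger)
    # Step 6: concatenate the three pieces.
    return history + trigger + future

# Made-up training histories, purely for illustration.
histories = [
    ["WELLNESS_VISIT", "MAMMOGRAM", "BIOPSY", "DX_BREAST_CANCER",
     "ONCOLOGY_VISIT", "CHEMOTHERAPY"],
    ["MAMMOGRAM", "BIOPSY", "DX_BREAST_CANCER", "ONCOLOGY_VISIT"],
]

forward_model = train_bigram_model(histories)
# Steps 1 and 2: reverse every history, then train on the reversed data.
reversed_histories = [list(reversed(h)) for h in histories]
reverse_model = train_bigram_model(reversed_histories)

patient = synthesize_patient(["DX_BREAST_CANCER"], forward_model, reverse_model)
print(patient)
```

Because the reverse model is trained only on reversed sequences, it needs no special architecture: generating "the next event" on reversed data is the same as generating the previous event in real time.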

Our approach does not use any pre-defined rules or logic in creating patient histories. Instead, it benefits from our generative AI approach’s ability to attend to sequential relationships among the data.

We generate a history and future on either side of a “trigger sequence.” This trigger sequence could be something that specifies an episode of care, or a patient timeline that details the demographic selection criteria. For example, it could be “32-year-old females with breast cancer.” GenHealth takes that input and creates a history of all the events that led up to this trigger sequence. Then we use that synthetic history together with the trigger sequence to create all of the future events. Putting those three pieces together produces an entire synthetic individual healthcare record, including the relationships and timelines between potentially thousands of events. We can create thousands of patients using this methodology to build an analytical dataset, which can be explored through our G-Mode app.

Here’s an example of a fully synthetic patient sequence produced by our Large Medical Model:

┌─────────────────────┬─────────────────────────┬────────────────────────┐
│ date                ┆ token_type              ┆ token                  │
│ ---                 ┆ ---                     ┆ ---                    │
│ datetime[μs]        ┆ str                     ┆ str                    │
╞═════════════════════╪═════════════════════════╪════════════════════════╡
│ 2019-01-01 00:00:00 ┆ ENROLLMENT_START        ┆                        │
│ 2019-01-01 00:00:00 ┆ YEAR                    ┆ 2019                   │
│ 2019-01-01 00:00:00 ┆ PLAN_TYPE               ┆ Commercial             │
│ 2019-01-01 00:00:00 ┆ GENDER                  ┆ FEMALE                 │
│ 2019-01-01 00:00:00 ┆ RACE                    ┆ UNKNOWN                │
│ 2019-01-01 00:00:00 ┆ AGE                     ┆ 12                     │
│ 2019-01-01 00:00:00 ┆ TIME_GAP                ┆ GAP_51                 │
│ 2019-02-21 00:00:00 ┆ ON_ADMISSION            ┆                        │
│ 2019-02-21 00:00:00 ┆ UNCLEAR_IF_ON_ADMISSION ┆                        │
│ 2019-02-21 00:00:00 ┆ ICD_10_CM               ┆ D51.9                  │
│ 2019-02-21 00:00:00 ┆ ICD_10_CM               ┆ E55.9                  │
...
│ 2019-02-21 00:00:00 ┆ CPT_HCPCS_CODE          ┆ 84443                  │
│ 2019-02-21 00:00:00 ┆ PLACE_OF_SERVICE        ┆ INDEPENDENT_LABORATORY │
│ 2019-02-21 00:00:00 ┆ CPT_HCPCS_CODE          ┆ 85025                  │
│ 2019-02-21 00:00:00 ┆ TOTAL_CLAIM_COST_KEY    ┆ CLAIM_COST             │
│ 2019-02-21 00:00:00 ┆ COST                    ┆ 134.00                 │
│ 2019-02-21 00:00:00 ┆ TIME_GAP                ┆ GAP_57                 │
│ 2019-04-19 00:00:00 ┆ ON_ADMISSION            ┆                        │
│ 2019-04-19 00:00:00 ┆ UNCLEAR_IF_ON_ADMISSION ┆                        │
│ 2019-04-19 00:00:00 ┆ ICD_10_CM               ┆ D51.9                  │
│ 2019-04-19 00:00:00 ┆ ICD_10_CM               ┆ E55.9                  │
...
│ 2019-04-22 00:00:00 ┆ CPT_HCPCS_CODE          ┆ 36415                  │
│ 2019-04-22 00:00:00 ┆ PLACE_OF_SERVICE        ┆ INDEPENDENT_LABORATORY │
│ 2019-04-22 00:00:00 ┆ CPT_HCPCS_CODE          ┆ 36415                  │
│ 2019-04-22 00:00:00 ┆ TOTAL_CLAIM_COST_KEY    ┆ CLAIM_COST             │
│ 2019-04-22 00:00:00 ┆ COST                    ┆ 274.00                 │
│ 2019-04-22 00:00:00 ┆ TOTAL_CLAIM_COST_KEY    ┆ CLAIM_COST             │
│ 2019-04-22 00:00:00 ┆ COST                    ┆ 143.00                 │
│ 2019-04-22 00:00:00 ┆ TIME_GAP                ┆ GAP_49                 │
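
In the sample above, each TIME_GAP token appears to carry the number of days until the next group of events: GAP_51 after 2019-01-01 lands on 2019-02-21, and GAP_57 after that lands on 2019-04-19. The sketch below shows how dates could be assigned to a token stream under that reading. The day-count interpretation is inferred from the sample rather than from a documented format, the function names are hypothetical, and the input is an abbreviated subset of the rows shown.

```python
from datetime import datetime, timedelta

def gap_days(token):
    """Parse a token such as 'GAP_51' into 51 (assumed to be a day count)."""
    prefix, _, days = token.partition("_")
    if prefix != "GAP" or not days.isdigit():
        raise ValueError(f"unexpected TIME_GAP token: {token!r}")
    return int(days)

def assign_dates(start, tokens):
    """Attach a date to each (token_type, token) pair.

    A TIME_GAP row keeps the current date, as in the sample output,
    and moves the clock forward for the rows that follow it.
    """
    current = start
    rows = []
    for token_type, token in tokens:
        rows.append((current, token_type, token))
        if token_type == "TIME_GAP":
            current += timedelta(days=gap_days(token))
    return rows

# Abbreviated subset of the sample sequence above.
tokens = [
    ("ENROLLMENT_START", ""),
    ("AGE", "12"),
    ("TIME_GAP", "GAP_51"),
    ("ICD_10_CM", "D51.9"),
    ("CPT_HCPCS_CODE", "84443"),
    ("TIME_GAP", "GAP_57"),
    ("ICD_10_CM", "D51.9"),
]

rows = assign_dates(datetime(2019, 1, 1), tokens)
for date, token_type, token in rows:
    print(date.date(), token_type, token)
```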

We are already applying our synthetic data to research and analytics through G-Mode, our generative AI healthcare analytics application. We are working with select healthcare providers and research organizations who may not have enough patients for a specific cohort, or who need a more specific population than what may naturally occur.