Why RLHF is Unnecessary for Transformers Trained on Raw Medical Events

by Ricky Sahu

Reinforcement learning from human feedback (RLHF) is a powerful technique that has shown great promise in improving machine learning models. However, for large medical model (LMM) transformers trained on raw medical event data composed of codes such as ICD, CPT, RxNorm, and NPIs, RLHF is not necessary.

LMM transformers are already highly optimized for processing medical data, and there is no need to supplement their training with human feedback when the goal is to accurately replicate the medical record. Typically, RLHF is needed to constrain a large language model's base training to the guidance and design of the creating organization, because language and unstructured text can be manipulated in ways that may result in negative outcomes.

RLHF in an LLM may also be used to counteract potential biases in the data. Specifically, RLHF works by having human trainers rank model responses, fitting a "reward function" to those preferences, and then optimizing the model against it, so the model learns to respond in the preferred manner. With medical codes, however, which are a relatively structured data format, the possibility of bias is reduced as long as the medical events are representative of the real world (that is, as long as there is little or no fraud in the training data). Therefore, while RLHF may be useful in other contexts, it is unnecessary, and could be counterproductive, for LMM transformers trained on medical data, where you do not want to reward model predictions that diverge from the distribution in the data.
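The distortion described above can be illustrated with a toy calculation. This is a minimal sketch, not GenHealth's pipeline: it treats RLHF-style reward optimization as exponential tilting of a next-event distribution (p'(x) ∝ p(x)·exp(reward(x)), the closed form of KL-regularized reward maximization) and shows that any nonzero reward pulls the model away from the empirical data distribution. The codes are illustrative placeholders, not real patient data.

```python
import math
from collections import Counter

# Toy next-event distribution estimated from raw medical event sequences.
# Codes are illustrative ICD-10 / CPT-style strings, not real patient data.
events = ["I10", "I10", "E11.9", "I10", "99213", "E11.9"]
counts = Counter(events)
total = sum(counts.values())
p_data = {code: n / total for code, n in counts.items()}

def rlhf_reweight(p, reward):
    """RLHF-style exponential tilting: p'(x) proportional to p(x) * exp(reward(x)).
    This is the closed-form optimum of reward maximization with a KL penalty."""
    w = {x: px * math.exp(reward(x)) for x, px in p.items()}
    z = sum(w.values())
    return {x: wx / z for x, wx in w.items()}

def kl(p, q):
    """KL divergence D(p || q) in nats: how far q has drifted from p."""
    return sum(px * math.log(px / q[x]) for x, px in p.items())

# A reward that prefers one code distorts the learned distribution...
p_rlhf = rlhf_reweight(p_data, lambda x: 1.0 if x == "I10" else 0.0)
# ...while a zero reward (no RLHF) leaves the data distribution unchanged.
p_plain = rlhf_reweight(p_data, lambda x: 0.0)

print(kl(p_data, p_plain))  # essentially zero: model matches the data
print(kl(p_data, p_rlhf))   # positive: model has drifted off-distribution
```

Under this framing, "no RLHF" is simply the zero-reward case, which is exactly the distribution a model replicating the medical record should learn.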

GenHealth’s transformers are trained in an unsupervised manner on medical data. Our core GPT model (called DOOG-E) is designed to learn from massive amounts of structured data and can extract meaningful information from raw medical event data without the need for human feedback. By training on large amounts of data, LMM transformers learn patterns and relationships that are difficult for humans to discern and that often have not yet been discovered in published research. For example, we’ve discovered novel relationships between drugs and patient conditions, such as hearing loss after taking Losartan, and the high likelihood that an older male who recently had a stroke will develop Parkinson’s.
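To make the unsupervised setup concrete, here is a hypothetical sketch of how raw medical events can be framed as a next-token prediction task, the same objective a base GPT uses for words. The timelines, codes, and helper function are illustrative assumptions; the actual DOOG-E data pipeline is not described in this post.

```python
# Each patient timeline is a chronological list of standard codes
# (ICD-10 diagnoses, CPT procedures, RxNorm drugs), treated exactly
# like words in a language model. Illustrative data, not real patients.
timelines = [
    ["E11.9", "99213", "RxNorm:860975", "I10"],
    ["I10", "RxNorm:979480", "99214"],
]

# Build a vocabulary over codes, reserving index 0 for padding.
vocab = {"<pad>": 0}
for tl in timelines:
    for code in tl:
        vocab.setdefault(code, len(vocab))

def next_event_pairs(timeline, vocab):
    """Yield (context, target) training pairs: predict each event from
    all events that precede it. No human labels or feedback required."""
    ids = [vocab[c] for c in timeline]
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]

pairs = [p for tl in timelines for p in next_event_pairs(tl, vocab)]
```

Every pair is generated mechanically from the record itself, which is why this kind of training needs no human-in-the-loop reward signal: the supervision is the patient's own future events.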

While RLHF is a valuable technique for improving machine learning models, it is not necessary for LMM transformers trained on raw medical event data. Large medical models, like any other base transformer, are highly optimized for predicting future sequences, and, in the case of LMMs specifically, for processing medical data to generate future medical journeys without the need for human feedback.